virtual environment 'sent.analyzer'

    conda install -n sent.analyzer spacy
    python -m spacy download en_core_web_sm

    Successfully built en-core-web-sm

    Installing collected packages: en-core-web-sm

    Successfully installed en-core-web-sm-2.3.1

    [+] Download and installation successful

    You can now load the model via spacy.load('en_core_web_sm')

# Quick Tutorial

### Tokenizing
Tokenization is the process of breaking down chunks of text into smaller pieces. spaCy comes with a default processing pipeline that begins with tokenization, making this process a snap. In spaCy, you can do either sentence tokenization or word tokenization:

    Word tokenization breaks text down into individual words.
    Sentence tokenization breaks text down into individual sentences.
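
To make the distinction concrete, here is a rough illustration that uses only the standard library `re` module. This is not how spaCy tokenizes: the helpers `naive_word_tokenize` and `naive_sentence_tokenize` are hypothetical names, and the regular expressions are simplifying assumptions that will mishandle abbreviations, quotations, and contractions that spaCy's rule-based tokenizer deals with.

```python
import re

sample = "Dave watched the forest burn. The car had been hastily packed."


def naive_word_tokenize(text: str) -> list:
    # A token is a run of word characters, or a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


def naive_sentence_tokenize(text: str) -> list:
    # Split on whitespace that directly follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(naive_word_tokenize(sample))
print(naive_sentence_tokenize(sample))
```

Word tokenization yields one item per word or punctuation mark, while sentence tokenization yields one item per sentence.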

In [4]:
import spacy

In [5]:
text = """
Dave watched as the forest burned up on the hill,
only a few miles from his house. The car had
been hastily packed and Marta was inside trying to round
up the last of the pets. "Where could she be?" he wondered
as he continued to wait for Marta to appear with the pets.
"""

In [7]:
nlp = spacy.load('en_core_web_sm')


In [10]:
doc = nlp(text)

In [11]:
token_list = [token for token in doc]

In [12]:
token_list

[,
 Dave,
 watched,
 as,
 the,
 forest,
 burned,
 up,
 on,
 the,
 hill,
 ,,
 ,
 only,
 a,
 few,
 miles,
 from,
 his,
 house,
 .,
 The,
 car,
 had,
 ,
 been,
 hastily,
 packed,
 and,
 Marta,
 was,
 inside,
 trying,
 to,
 round,
 ,
 up,
 the,
 last,
 of,
 the,
 pets,
 .,
 ",
 Where,
 could,
 she,
 be,
 ?,
 ",
 he,
 wondered,
 ,
 as,
 he,
 continued,
 to,
 wait,
 for,
 Marta,
 to,
 appear,
 with,
 the,
 pets,
 .,
 ]

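Next, remove stop words (very common words such as "the" and "as") by checking each token's is_stop attribute:

In [13]: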
filtered_tokens = [token for token in doc if not token.is_stop]
filtered_tokens

[,
 Dave,
 watched,
 forest,
 burned,
 hill,
 ,,
 ,
 miles,
 house,
 .,
 car,
 ,
 hastily,
 packed,
 Marta,
 inside,
 trying,
 round,
 ,
 pets,
 .,
 ",
 ?,
 ",
 wondered,
 ,
 continued,
 wait,
 Marta,
 appear,
 pets,
 .,
 ]

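Then reduce each remaining token to its lemma, the base form of the word, which spaCy exposes as token.lemma_:

In [14]: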
lemmas = [f'Token: {token}, lemma: {token.lemma_}' for token in filtered_tokens]
lemmas

['Token: \n, lemma: \n',
 'Token: Dave, lemma: Dave',
 'Token: watched, lemma: watch',
 'Token: forest, lemma: forest',
 'Token: burned, lemma: burn',
 'Token: hill, lemma: hill',
 'Token: ,, lemma: ,',
 'Token: \n, lemma: \n',
 'Token: miles, lemma: mile',
 'Token: house, lemma: house',
 'Token: ., lemma: .',
 'Token: car, lemma: car',
 'Token: \n, lemma: \n',
 'Token: hastily, lemma: hastily',
 'Token: packed, lemma: pack',
 'Token: Marta, lemma: Marta',
 'Token: inside, lemma: inside',
 'Token: trying, lemma: try',
 'Token: round, lemma: round',
 'Token: \n, lemma: \n',
 'Token: pets, lemma: pet',
 'Token: ., lemma: .',
 'Token: ", lemma: "',
 'Token: ?, lemma: ?',
 'Token: ", lemma: "',
 'Token: wondered, lemma: wonder',
 'Token: \n, lemma: \n',
 'Token: continued, lemma: continue',
 'Token: wait, lemma: wait',
 'Token: Marta, lemma: Marta',
 'Token: appear, lemma: appear',
 'Token: pets, lemma: pet',
 'Token: ., lemma: .',
 'Token: \n, lemma: \n']

# Building Your Own NLP Sentiment Analyzer
From the previous sections, you’ve probably noticed four major stages of building a sentiment analysis pipeline:

    Loading data
    Preprocessing
    Training the classifier
    Classifying data
    
For building a real-life sentiment analyzer, you’ll work through each of the steps that compose these stages. You’ll use the Large Movie Review Dataset compiled by Andrew Maas to train and test your sentiment analyzer. Once you’re ready, proceed to the next section to load your data.

## Loading and Preprocessing Data

In [2]:
def load_training_data(
    data_directory: str = "aclImdb/train",
    split: float = 0.8,
    limit: int = 0
) -> tuple:
    pass

Next, iterate through every file in the dataset and load the reviews into a list:

In [7]:
import os

def load_training_data(
    data_directory: str = "aclImdb/train",
    split: float = 0.8,
    limit: int = 0
) -> tuple:
    """
    split (float) is the proportion of the data used for training;
    the remainder is used for testing.
    """
    # Load reviews from the pos/ and neg/ subdirectories
    reviews = []
    for label in ["pos", "neg"]:
        labeled_directory = f"{data_directory}/{label}"
        for review in os.listdir(labeled_directory):
            if review.endswith(".txt"):
                with open(f"{labeled_directory}/{review}") as f:
                    text = f.read()
                    # Replace HTML line breaks with blank lines
                    text = text.replace("<br />", "\n\n")
                    if text.strip():
                        # Label format that spaCy's textcat expects
                        spacy_label = {
                            "cats": {
                                "pos": label == "pos",
                                "neg": label == "neg"
                            }
                        }
                        reviews.append((text, spacy_label))

Next, randomly shuffle the reviews to reduce any bias introduced by the loading order, then apply the optional limit and split the data into training and test sets:

In [9]:
import os
import random


def load_training_data(
    data_directory: str = "aclImdb/train",
    split: float = 0.8,
    limit: int = 0
) -> tuple:
    """
    split (float) is the proportion of the data used for training;
    the remainder is used for testing.
    """
    # Load reviews from the pos/ and neg/ subdirectories
    reviews = []
    for label in ["pos", "neg"]:
        labeled_directory = f"{data_directory}/{label}"
        for review in os.listdir(labeled_directory):
            if review.endswith(".txt"):
                with open(f"{labeled_directory}/{review}") as f:
                    text = f.read()
                    # Replace HTML line breaks with blank lines
                    text = text.replace("<br />", "\n\n")
                    if text.strip():
                        # Label format that spaCy's textcat expects
                        spacy_label = {
                            "cats": {
                                "pos": label == "pos",
                                "neg": label == "neg"
                            }
                        }
                        reviews.append((text, spacy_label))
    # Shuffle so the pos reviews are not all ahead of the neg reviews
    random.shuffle(reviews)

    # limit = 0 means "use every review"
    if limit:
        reviews = reviews[:limit]
    # Convert the proportion into the index where the test set begins
    split = int(len(reviews) * split)
    return reviews[:split], reviews[split:]
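
The shuffle, limit, and split steps at the end of load_training_data can be checked on in-memory data, without the dataset on disk. The sketch below is only an illustration: `shuffle_and_split` is a hypothetical helper, the review strings are made up, and the `seed` parameter is an assumption added here so the demo is repeatable (the function above does not take one).

```python
import random


def shuffle_and_split(reviews: list, split: float = 0.8, limit: int = 0, seed: int = 0) -> tuple:
    """Hypothetical helper mirroring the shuffle/limit/split steps above."""
    reviews = list(reviews)  # copy, so the caller's list is not reordered
    random.Random(seed).shuffle(reviews)  # seeded only to make the demo repeatable
    if limit:
        reviews = reviews[:limit]
    cutoff = int(len(reviews) * split)
    return reviews[:cutoff], reviews[cutoff:]


# Ten made-up reviews in the (text, label) format built by load_training_data.
fake_reviews = [
    (f"review {i}", {"cats": {"pos": i % 2 == 0, "neg": i % 2 == 1}})
    for i in range(10)
]

train, test = shuffle_and_split(fake_reviews, split=0.8)
print(len(train), len(test))  # 8 2
```

With split = 0.8, eight of the ten reviews land in the training set and two in the test set; passing limit = 5 would keep only five reviews before splitting.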

# Training Your Classifier

Putting the spaCy pipeline together allows you to rapidly build and train a convolutional neural network (CNN) for classifying text data. Training involves three steps:
    
    1. Modifying the base spaCy pipeline to include the textcat component
    2. Building a training loop to train the textcat component
    3. Evaluating the progress of your model training after a given number of training loops

## Build Pipeline

In [10]:
def train_model(training_data: list, test_data: list, iterations: int = 20) -> None:
    # Build pipeline
    nlp = spacy.load("en_core_web_sm")
    if "textcat" not in nlp.pipe_names:
        textcat = nlp.create_pipe(
            "textcat", config = {"architecture": "simple_cnn"}
        )
        nlp.add_pipe(textcat, last = True)
    else:
        textcat = nlp.get_pipe("textcat")
    
    textcat.add_label("pos")
    textcat.add_label("neg")

## Build Your Training Loop to Train textcat

In [11]:
import os
import random
import spacy
from spacy.util import minibatch, compounding

In [17]:
def train_model(training_data: list, test_data: list, iterations: int= 20) -> None:
    #Build Pipeline
    nlp = spacy.load('en_core_web_sm') # load the english model
    if "textcat" not in nlp.pipe_names:
        textcat = nlp.create_pipe("textcat", config = {"architecture": "simple_cnn"})
        nlp.add_pipe(textcat,last=True)
    else:
        textcat = nlp.get_pipe('textcat')
    
    textcat.add_label("pos")
    textcat.add_label('neg')
    
    # Train only textcat; disable every other pipeline component
    training_excluded_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
    
    with nlp.disable_pipes(training_excluded_pipes):
        optimizer = nlp.begin_training()
        # Training loop
        print("Beginning Training")
        print("Loss\tPrecision\tRecall\tF-score")
        # Infinite generator of batch sizes, growing from 4 toward a cap of 32
        batch_sizes = compounding(4.0, 32.0, 1.001)
        
        for i in range(iterations):
            loss = {}
            random.shuffle(training_data)
            batches = minibatch(training_data,size = batch_sizes)
            for batch in batches:
                text, labels = zip(*batch)
                nlp.update(
                    text,
                    labels,
                    drop = 0.2,
                    sgd = optimizer,
                    losses = loss
                )
            with textcat.model.use_params(optimizer.averages):
                # Score the averaged weights on the held-out test data
                evaluation_results = evaluate_model(
                    tokenizer=nlp.tokenizer,
                    textcat=textcat,
                    test_data=test_data
                )
                print(
                    f"{loss['textcat']}\t{evaluation_results['precision']}"
                    f"\t{evaluation_results['recall']}"
                    f"\t{evaluation_results['f-score']}"
                )
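
The training loop relies on compounding to produce a slowly growing series of batch sizes and on minibatch to cut the training data into batches of those sizes. The sketch below re-creates that behavior in plain Python for illustration only: `compounding_sketch` and `minibatch_sketch` are hypothetical stand-ins, not spaCy's own code, and the sketch assumes the start value is below the stop value, as it is above.

```python
from itertools import islice


def compounding_sketch(start: float, stop: float, compound: float):
    """Yield start, start * compound, start * compound**2, ... capped at stop."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound


def minibatch_sketch(items, sizes):
    """Cut items into consecutive batches, taking each batch size from sizes."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch


sizes = compounding_sketch(4.0, 32.0, 1.001)
first_three = [next(sizes) for _ in range(3)]
print(first_three)

batches = list(minibatch_sketch(range(10), compounding_sketch(4.0, 32.0, 1.001)))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

Because each size is truncated to an integer and the multiplier is only 1.001, early batches all hold four items; the batch size creeps upward over many batches and never exceeds 32.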
            

## Evaluating the Progress of Model Training

In [16]:
def evaluate_model(tokenizer, textcat, test_data: list) -> dict:
    reviews, labels = zip(*test_data)
    reviews = (tokenizer(review) for review in reviews)
    true_positives = 0
    false_positives = 1e-8  # Can't be 0 because it appears in a denominator
    true_negatives = 0
    false_negatives = 1e-8

    for i, review in enumerate(textcat.pipe(reviews)):
        # Each label is stored as {"cats": {"pos": bool, "neg": bool}}
        true_label = labels[i]["cats"]
        for predicted_label, score in review.cats.items():
            # Every cats dictionary includes both labels. You can get all
            # the info you need with just the pos label.
            if predicted_label == "neg":
                continue
            if score >= 0.5 and true_label["pos"]:
                true_positives += 1
            elif score >= 0.5 and true_label["neg"]:
                false_positives += 1
            elif score < 0.5 and true_label["neg"]:
                true_negatives += 1
            elif score < 0.5 and true_label["pos"]:
                false_negatives += 1
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)

    if precision + recall == 0:
        f_score = 0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"precision": precision, "recall":recall,"f-score":f_score}
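
As a worked example of the arithmetic in evaluate_model, the sketch below computes the same three metrics from made-up counts. The helper name `metrics_from_counts` and the counts themselves are hypothetical, chosen only to make the numbers easy to follow.

```python
def metrics_from_counts(true_positives: float, false_positives: float, false_negatives: float) -> dict:
    """Compute precision, recall, and F-score the same way evaluate_model does."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    if precision + recall == 0:
        f_score = 0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"precision": precision, "recall": recall, "f-score": f_score}


# Made-up counts: 40 correct pos predictions, 10 wrong pos predictions,
# and 20 pos reviews that were missed.
metrics = metrics_from_counts(true_positives=40, false_positives=10, false_negatives=20)
print({name: round(value, 3) for name, value in metrics.items()})
# {'precision': 0.8, 'recall': 0.667, 'f-score': 0.727}
```

Precision is 40 / 50, recall is 40 / 60, and the F-score is their harmonic mean, 80 / 110.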

In [18]:
def train_model(training_data: list, test_data: list, iterations: int= 20) -> None:
    #Build Pipeline
    nlp = spacy.load('en_core_web_sm') # load the english model
    if "textcat" not in nlp.pipe_names:
        textcat = nlp.create_pipe("textcat", config = {"architecture": "simple_cnn"})
        nlp.add_pipe(textcat,last=True)
    else:
        textcat = nlp.get_pipe('textcat')
    
    textcat.add_label("pos")
    textcat.add_label('neg')
    
    # Train only textcat; disable every other pipeline component
    training_excluded_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
    
    with nlp.disable_pipes(training_excluded_pipes):
        optimizer = nlp.begin_training()
        # Training loop
        print("Beginning Training")
        print("Loss\tPrecision\tRecall\tF-score")
        # Infinite generator of batch sizes, growing from 4 toward a cap of 32
        batch_sizes = compounding(4.0, 32.0, 1.001)
        
        for i in range(iterations):
            loss = {}
            random.shuffle(training_data)
            batches = minibatch(training_data,size = batch_sizes)
            for batch in batches:
                text, labels = zip(*batch)
                nlp.update(
                    text,
                    labels,
                    drop = 0.2,
                    sgd = optimizer,
                    losses = loss
                )
            with textcat.model.use_params(optimizer.averages):
                # Score the averaged weights on the held-out test data
                evaluation_results = evaluate_model(
                    tokenizer=nlp.tokenizer,
                    textcat=textcat,
                    test_data=test_data
                )
                print(
                    f"{loss['textcat']}\t{evaluation_results['precision']}"
                    f"\t{evaluation_results['recall']}"
                    f"\t{evaluation_results['f-score']}"
                )
    #Save model
    with nlp.use_params(optimizer.averages):
        nlp.to_disk("model_artifacts")
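
The fourth stage listed earlier, classifying data, is not shown above. A minimal sketch follows, assuming a trained model has been saved to "model_artifacts" by train_model. The helper names `label_from_cats` and `classify_review` are hypothetical, and the 0.5 threshold is borrowed from evaluate_model as an assumption, not a tuned value.

```python
def label_from_cats(cats: dict, threshold: float = 0.5) -> tuple:
    """Turn a score dict such as {"pos": 0.9, "neg": 0.1} into (label, pos_score).

    Like evaluate_model, this looks only at the pos score.
    """
    score = cats["pos"]
    label = "pos" if score >= threshold else "neg"
    return label, score


def classify_review(text: str, model_directory: str = "model_artifacts") -> tuple:
    """Load the saved pipeline and classify one review.

    Requires spaCy and a model saved by train_model.
    """
    import spacy  # imported here so label_from_cats works without spaCy installed

    nlp = spacy.load(model_directory)
    doc = nlp(text)
    return label_from_cats(doc.cats)


print(label_from_cats({"pos": 0.91, "neg": 0.09}))  # ('pos', 0.91)
```

Once a model exists on disk, calling classify_review with the text of a review would return the predicted label together with the pos score.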