# **Text Classification**
Text classification is a common NLP task: it is "classification" in the conventional machine learning sense, applied to text. Examples include spam detection, sentiment analysis, and tagging customer queries.

In [1]:
import pandas as pd
import spacy
from spacy.util import minibatch
import random


In [2]:
def load_data(csv_file, split=0.9):
    data = pd.read_csv(csv_file)
    
    # Shuffle data
    train_data = data.sample(frac=1, random_state=7)
    
    texts = train_data.text.values
    labels = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)}
              for y in train_data.sentiment.values]
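    # Index of the first validation example (avoid reusing the `split` argument)
    split_index = int(len(train_data) * split)
    
    # Wrap each label dict in the {"cats": ...} format the TextCategorizer expects
    train_labels = [{"cats": label} for label in labels[:split_index]]
    val_labels = [{"cats": label} for label in labels[split_index:]]
    
    return texts[:split_index], train_labels, texts[split_index:], val_labels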
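
To make the label format concrete, here is a minimal standalone sketch of the same conversion and split on a few made-up rows, without pandas. The `sentiments` values and the `to_cats` helper are hypothetical and exist only for illustration.

```python
# Hypothetical 0/1 sentiment values standing in for the `sentiment` column.
sentiments = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]


def to_cats(y):
    """Map a 0/1 sentiment to the {"cats": {...}} dict format used above."""
    return {"cats": {"POSITIVE": bool(y), "NEGATIVE": not bool(y)}}


labels = [to_cats(y) for y in sentiments]

# Same rule as load_data with split=0.9: the first 90% is training data.
split_index = int(len(labels) * 0.9)
train_part, val_part = labels[:split_index], labels[split_index:]

print(len(train_part), len(val_part))
print(train_part[1])
```

Each text gets exactly one `True` entry, which matches the exclusive-classes setup used later.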

In [3]:
train_texts, train_labels, val_texts, val_labels = load_data('yelp_ratings.csv')

In [4]:
train_labels[:10]

[{'cats': {'NEGATIVE': False, 'POSITIVE': True}},
 {'cats': {'NEGATIVE': False, 'POSITIVE': True}},
 {'cats': {'NEGATIVE': False, 'POSITIVE': True}},
 {'cats': {'NEGATIVE': False, 'POSITIVE': True}},
 {'cats': {'NEGATIVE': True, 'POSITIVE': False}},
 {'cats': {'NEGATIVE': False, 'POSITIVE': True}},
 {'cats': {'NEGATIVE': True, 'POSITIVE': False}},
 {'cats': {'NEGATIVE': True, 'POSITIVE': False}},
 {'cats': {'NEGATIVE': False, 'POSITIVE': True}},
 {'cats': {'NEGATIVE': False, 'POSITIVE': True}}]

In [5]:
train_texts

array(["Some of the best sushi I've ever had....and I come from the East Coast.  Unreal toro, have some of it's available.",
       "One of the best burgers I've ever had and very well priced. I got the tortilla burger and is was delicious especially with there tortilla soup!",
       'Review by a vegetarian family with two young kids. \n\nSeveral reviews have lamented the small number of vegetarian options on the menu and, while it is true that there are far more options for meat eaters and there is unfortunately no vegetarian noodle soup option, once you get over these 2 facts this is an excellent place for vegetarians.',
       ...,
       'Their chicken wings is the bomb. I live in Mississauga but drove all the way up there for the wings. We ordered honey garlic, sweet chilli and mango chipotle. \nHalf price wings on Mondays.',
       'The pizza is really good! Staff does a great job especially with how busy they seem to get they handle it like champs! Manager is amazing she does a

In [6]:
val_texts

array(["This magic show was the best one I've ever seen, very funny and great magic!!! We sat in the third row center and couldn't figure out any of the tricks, how does he do it? My sons (8 & 10) absolutely loved the show. My 10 year old volunteered to help on stage, he was super excited and as an extra bonus got a magic kit as a thank you at the end. We will definetely come back to this show, it's perfect family entertainment, loved every second of it!!!",
       "There are a lot of good food places in Las Vegas, it's just a little bit harder to find them in the North side! It's located inside a gas station, but area seems to look just fine in my opinion! I may say a lot of people are nice in my reviews, but really, the owner is welcoming to his customers and if there are any adjustments that needs to be made he is more than willing to fix it or make up for it somehow. The price here is also great! I have only tried the tacos, but they're all really good with the sauce! You have to g

In [7]:
print('Texts from training data\n------')
print(train_texts[:2])
print('\nLabels from training data\n------')
print(train_labels[:2])

Texts from training data
------
["Some of the best sushi I've ever had....and I come from the East Coast.  Unreal toro, have some of it's available."
 "One of the best burgers I've ever had and very well priced. I got the tortilla burger and is was delicious especially with there tortilla soup!"]

Labels from training data
------
[{'cats': {'POSITIVE': True, 'NEGATIVE': False}}, {'cats': {'POSITIVE': True, 'NEGATIVE': False}}]


In [8]:
# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
# (create_pipe with a config dict is the spaCy v2 API)
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

# Add labels to text classifier
textcat.add_label("NEGATIVE")
textcat.add_label("POSITIVE")

1
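
As a rough illustration of the two config options, the sketch below builds a bag-of-words count for a text and shows how a softmax turns two raw scores into mutually exclusive probabilities that sum to 1. This is a simplified picture and not spaCy's actual implementation; the whitespace tokenization, the `bag_of_words` and `softmax` helpers, and the raw scores are all assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, ignoring word order."""
    return Counter(text.lower().split())


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


features = bag_of_words("good food and good service")
print(features["good"])

# Hypothetical raw scores in the order (NEGATIVE, POSITIVE).
probs = softmax([0.2, 1.4])
print(probs, sum(probs))
```

With exclusive classes, raising the score of one label necessarily lowers the probability of the other.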

In [9]:
textcat

<spacy.pipeline.pipes.TextCategorizer at 0x7f6ca37d5c88>

In [10]:
def train(model, train_data, optimizer):
    losses = {}
    random.seed(1)
    random.shuffle(train_data)
    
    batches = minibatch(train_data, size=8)
    for batch in batches:
        # train_data is a list of tuples [(text0, label0), (text1, label1), ...]
        # Split batch into texts and labels
        texts, labels = zip(*batch)
        
        # Update model with texts and labels
        model.update(texts, labels, sgd=optimizer, losses=losses)
        
    return losses
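
The batching step can be pictured with a small pure-Python stand-in. `simple_minibatch` below is a hypothetical helper, not spaCy's `minibatch`; it only shows how a list of `(text, label)` tuples is cut into fixed-size batches and how `zip(*batch)` separates texts from labels.

```python
def simple_minibatch(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up (text, label) pairs in the same shape as train_data.
toy_data = [
    (f"review {i}", {"cats": {"POSITIVE": i % 2 == 0, "NEGATIVE": i % 2 != 0}})
    for i in range(5)
]

batch_sizes = []
for batch in simple_minibatch(toy_data, size=2):
    texts, labels = zip(*batch)  # one tuple of texts, one tuple of label dicts
    batch_sizes.append(len(texts))

print(batch_sizes)
```

The last batch is smaller whenever the data size is not a multiple of the batch size.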

In [11]:
# Fix seed for reproducibility
spacy.util.fix_random_seed(1)
random.seed(1)

# This may take a while to run!
optimizer = nlp.begin_training()
train_data = list(zip(train_texts, train_labels))
losses = train(nlp, train_data, optimizer)
print(losses['textcat'])

8.704142077092879


In [12]:
text = "This tea cup was full of holes. Do not recommend."
doc = nlp(text)
print(doc.cats)

{'NEGATIVE': 0.7737048864364624, 'POSITIVE': 0.2262951135635376}
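
Since `doc.cats` is a plain dictionary of label scores, the predicted label is the key with the largest value. A minimal sketch, using the scores printed above:

```python
# Scores copied from the doc.cats output above.
cats = {"NEGATIVE": 0.7737048864364624, "POSITIVE": 0.2262951135635376}

# Pick the label whose score is highest.
predicted_label = max(cats, key=cats.get)
print(predicted_label)
```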



# Step 4: Making Predictions

Implement a function `predict` that predicts the sentiment of text examples. 
- First, tokenize the texts using `nlp.tokenizer()`. 
- Then, pass those docs to the TextCategorizer which you can get from `nlp.get_pipe()`. 
- Use the `textcat.predict()` method to get scores for each document, then choose the class with the highest score (probability) as the predicted class.

In [13]:
def predict(nlp, texts): 
    # Use the model's tokenizer to tokenize each input text
    docs = [nlp.tokenizer(text) for text in texts]
    
    # Use textcat to get the scores for each doc
    textcat = nlp.get_pipe('textcat')
    scores, _ = textcat.predict(docs)
    
    # From the scores, find the class with the highest score/probability
    predicted_class = scores.argmax(axis=1)
    
    return predicted_class
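
To see what `argmax(axis=1)` does to the score matrix, here is a pure-Python sketch of the same row-wise selection. The score rows and the `argmax_per_row` helper are hypothetical; the column order assumes labels are indexed in the order they were added (NEGATIVE, then POSITIVE).

```python
def argmax_per_row(score_rows):
    """Return, for each row, the index of its largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_rows]


# Hypothetical scores, one row per document.
# Column 0 = NEGATIVE, column 1 = POSITIVE.
label_names = ("NEGATIVE", "POSITIVE")
scores = [[0.77, 0.23], [0.10, 0.90], [0.45, 0.55]]

predicted = argmax_per_row(scores)
print(predicted)
print([label_names[i] for i in predicted])
```

The result is one integer class index per document, which can then be used to look up a label name.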

In [14]:
texts = val_texts[34:38]
predictions = predict(nlp, texts)

for p, t in zip(predictions, texts):
    print(f"{textcat.labels[p]}: {t} \n")

POSITIVE: Came over and had their "Pick 2" lunch combo and chose their best selling 1/2 chicken sandwich with quinoa.  Both were tasty, the chicken salad is a bit creamy but was perfect with quinoa on the side.  This is a good lunch joint, casual and clean! 

POSITIVE: Went here last night and got oysters, fried okra, fries, and onion rings. I cannot complain. The portions were great and tasty!!! I will definitely be back for more. I cannot wait to try the crawfish boudin and soft shell crab. 

POSITIVE: This restaurant was fantastic! 
The concept of eating without vision was intriguing. The dinner was filled with laughs and good conversation. 

We were lead in a line to our table and each person to their seat. This was not just dark but you could not see something right in front of your face. 

The waiters/waitresses were all blind and allowed us to see how aware you need to be without the vision. 

Taking away one sense is said to increase your other senses so as taste and hearing wh

# Step 5: Evaluate The Model

Implement a function `evaluate` that evaluates a `TextCategorizer` model. It takes a model along with texts and labels, and returns the accuracy of the model: the number of correct predictions divided by the total number of predictions.

First, use the `predict` function you wrote earlier to get the predicted class for each text in `texts`. Then, find where the predicted labels match the true "gold-standard" labels and calculate the accuracy.

In [15]:
def evaluate(model, texts, labels):
    """ Returns the accuracy of a TextCategorizer model. 
    
        Arguments
        ---------
        model: spaCy model with a TextCategorizer
        texts: Text samples, from load_data function
        labels: True labels, from load_data function
    
    """
    # Get predictions from textcat model
    predicted_class = predict(model, texts)

    # From labels, get the true class as a list of integers (POSITIVE -> 1, NEGATIVE -> 0)
    true_class = [int(each['cats']['POSITIVE']) for each in labels]

    # A boolean or int array indicating correct predictions
    correct_predictions = predicted_class == true_class

    # The accuracy, number of correct predictions divided by all predictions
    accuracy = correct_predictions.mean()

    return accuracy
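
The accuracy calculation can be checked by hand on a tiny example. The element-wise `==` in `evaluate` relies on `predicted_class` being a NumPy array; with two plain lists, `==` returns a single boolean, so this sketch compares the pairs explicitly. The predictions and labels here are made up.

```python
# Made-up predictions (class indices) and gold labels.
predicted_class = [1, 0, 1, 1]
labels = [
    {"cats": {"POSITIVE": True, "NEGATIVE": False}},
    {"cats": {"POSITIVE": True, "NEGATIVE": False}},
    {"cats": {"POSITIVE": True, "NEGATIVE": False}},
    {"cats": {"POSITIVE": False, "NEGATIVE": True}},
]

# POSITIVE -> 1, NEGATIVE -> 0, as in evaluate().
true_class = [int(each["cats"]["POSITIVE"]) for each in labels]

# Compare pairwise, then take the fraction of matches.
correct = [p == t for p, t in zip(predicted_class, true_class)]
accuracy = sum(correct) / len(correct)
print(f"Accuracy: {accuracy:.4f}")
```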

In [16]:
accuracy = evaluate(nlp, val_texts, val_labels)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9488


In [17]:
# With the functions implemented, we can train and evaluate in a loop.
n_iters = 5
for i in range(n_iters):
    losses = train(nlp, train_data, optimizer)
    accuracy = evaluate(nlp, val_texts, val_labels)
    print(f"Loss: {losses['textcat']:.3f} \t Accuracy: {accuracy:.3f}")

Loss: 4.496 	 Accuracy: 0.945
Loss: 3.106 	 Accuracy: 0.946
Loss: 2.348 	 Accuracy: 0.945
Loss: 1.926 	 Accuracy: 0.944
Loss: 1.594 	 Accuracy: 0.945
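
In the run above the training loss keeps falling while validation accuracy stays around 0.945, so the extra passes mostly fit the training set more closely. One common way to handle this is to record which pass gave the best validation accuracy. The sketch below does that with the printed numbers; the `history` list and the selection logic are an illustration and not part of the original notebook.

```python
# (loss, accuracy) pairs copied from the output above.
history = [
    (4.496, 0.945),
    (3.106, 0.946),
    (2.348, 0.945),
    (1.926, 0.944),
    (1.594, 0.945),
]

# Select the pass with the highest validation accuracy.
best_epoch, (best_loss, best_accuracy) = max(
    enumerate(history), key=lambda item: item[1][1]
)
print(f"Best pass: {best_epoch + 1} (accuracy {best_accuracy:.3f})")
```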
