<a href="https://colab.research.google.com/github/Viny2030/UNED/blob/main/Text_Classification_using_SpaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
organizations_snap_amazon_fine_food_reviews_path = kagglehub.dataset_download('organizations/snap/amazon-fine-food-reviews')
honnibal_spacyen_vectors_web_lg_path = kagglehub.dataset_download('honnibal/spacyen-vectors-web-lg')
poonaml_reddit_vectors_for_sense2vec_spacy_path = kagglehub.dataset_download('poonaml/reddit-vectors-for-sense2vec-spacy')

print('Data source import complete.')


### Introduction

#### About Dataset:
We will be using the rich Amazon Fine Food Reviews dataset.

#### What are we trying to achieve?
We are going to tackle an interesting natural language processing problem, i.e. sentiment (text) classification.
We will explore textual data using the spaCy library and build a text classification model.

### Here is a breakdown of the concepts I will try to explain.
We will extract linguistic features such as
1. tokenization,
1. part-of-speech tagging,
1. dependency parsing,
1. lemmatization,
1. named entity recognition,
1. sentence boundary detection

for building language models later.

Visualizing Data
1. explacy - explaining how parsing is done
1. displaCy - visualizing named entities

Word vectors and similarity
1. sense2vec - using contextual information for building word embeddings

Text classification model
1. SpaCy TextCategorizer

### Loading data

In [None]:
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
from spacy.util import minibatch, compounding

import matplotlib.pyplot as plt
%matplotlib inline

Let's read in the food reviews data.

In [None]:
food_reviews_df=pd.read_csv('../input/amazon-fine-food-reviews/Reviews.csv')
food_reviews_df.shape

In [None]:
food_reviews_df.head().T

The Text column contains the review given by the customer.

Let's focus on the textual data and ratings for text classification.

In [None]:
food_reviews_df = food_reviews_df[['Text','Score']].dropna()

In [None]:
ax=food_reviews_df.Score.value_counts().plot(kind='bar')
fig = ax.get_figure()
fig.savefig("score.png");

We have a five-star rating system.
Most reviews have a rating of 5, which can lead to unbalanced classes. We will treat ratings 4 and 5 as positive and the rest as negative reviews.

In [None]:
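# use .loc to avoid chained assignment, which may silently fail to modify the frame
food_reviews_df.loc[food_reviews_df.Score <= 3, 'Score'] = 0
food_reviews_df.loc[food_reviews_df.Score >= 4, 'Score'] = 1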
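
The thresholding rule described above (ratings 4 and 5 are positive, ratings 1 to 3 are negative) can be sketched in plain Python on a handful of made-up ratings. The helper name `to_binary_label` and the sample list are assumptions for illustration only and are not part of the dataset.

```python
from collections import Counter

def to_binary_label(score):
    """Map a 1-5 star rating to 1 (positive) or 0 (negative)."""
    return 1 if score >= 4 else 0

# Made-up ratings, only for illustration.
sample_scores = [5, 4, 3, 1, 5, 2, 5]
labels = [to_binary_label(s) for s in sample_scores]
label_counts = Counter(labels)

print(labels)
print(label_counts)
```

Counting the labels after the mapping is a quick way to see how unbalanced the two classes are.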

In [None]:
ax=food_reviews_df.Score.value_counts().plot(kind='bar')
fig = ax.get_figure()
fig.savefig("score_boolean.png");

In [None]:
food_reviews_df.head()

Since the dataset is huge and might be difficult to train on in a kernel, I will reduce it to 100K rows.
To balance the classes, I have selected an equal number of samples from each class.

In [None]:
train_pos_df=food_reviews_df[food_reviews_df.Score==1][:50000]
train_neg_df=food_reviews_df[food_reviews_df.Score==0][:50000]

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
train_df=pd.concat([train_pos_df, train_neg_df])
train_df.shape

In [None]:
val_pos_df=food_reviews_df[food_reviews_df.Score==1][50000:60000]
val_neg_df=food_reviews_df[food_reviews_df.Score==0][50000:60000]
val_df=pd.concat([val_pos_df, val_neg_df])
val_df.shape
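
The slicing scheme used above (the first rows of each class for training, the next rows for validation) can be sketched on toy data as follows. The function name `balanced_split` and the toy rows are hypothetical and only illustrate that the two sets are balanced and do not overlap.

```python
def balanced_split(rows, n_train, n_val):
    """Take the first n_train rows of each class for training
    and the next n_val rows of each class for validation.

    rows is a list of (text, label) pairs with label 0 or 1.
    """
    train, val = [], []
    for label in (1, 0):
        same_class = [row for row in rows if row[1] == label]
        train.extend(same_class[:n_train])
        val.extend(same_class[n_train:n_train + n_val])
    return train, val

# Toy data: 6 positive and 4 negative rows (made up for illustration).
rows = [("pos %d" % i, 1) for i in range(6)] + [("neg %d" % i, 0) for i in range(4)]
train_rows, val_rows = balanced_split(rows, n_train=2, n_val=1)

print(len(train_rows), len(val_rows))
```

Because the rows are taken in file order rather than sampled at random, any ordering in the original file carries over into both sets.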

### Linguistic features

#### Tokenization
The first step in any NLP pipeline is tokenizing text, i.e. breaking paragraphs down into sentences and then sentences into words, punctuation and so on.

We will load the English language model to tokenize our English text.

Every language is different and has different rules. spaCy offers models for several languages (8 at the time of writing).
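
To get a feel for what tokenization involves, here is a deliberately naive regex tokenizer. It is a toy stand-in and not how spaCy tokenizes text; the function name `naive_tokenize` is an assumption made for this sketch.

```python
import re

def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_tokenize("The salad was surprisingly tasty.")
print(tokens)

# Contractions show the limits of the naive rule.
print(naive_tokenize("I don't like it"))
```

The naive rule breaks "don't" into "don", "'" and "t", whereas spaCy's English tokenizer yields "do" and "n't". Language-specific rules like this are the reason a language model is loaded.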

In [None]:
spacy_tok = spacy.load('en_core_web_sm')
sample_review=food_reviews_df.Text[54]
sample_review

In [None]:
parsed_review = spacy_tok(sample_review)
parsed_review

There is not much visible difference between the parsed review and the original one, but we will see ahead what has actually happened.
We can see visually how the parsing has been done with **explacy**.

In [None]:
!wget https://raw.githubusercontent.com/tylerneylon/explacy/master/explacy.py

In [None]:
import explacy
explacy.print_parse_info(spacy_tok, 'The salad was surprisingly tasty.')

In [None]:
explacy.print_parse_info(spacy_tok,food_reviews_df.Text[0])

#### Part-of-speech tagging
After tokenization we can parse the text and tag each token with its part of speech. spaCy uses statistical models in the background to predict which tag applies to each word based on its context.

##### Lemmatization
It is the process of extracting the uninflected (base) form of a word, called the lemma.
For example:

Adjectives: best, better → good
Adverbs: worse, worst → badly
Nouns: ducks, children → duck, child
Verbs: standing, stood → stand
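
The examples above can be captured in a toy lookup table keyed by word and part of speech. This is only a sketch of the idea, not spaCy's lemmatizer; `LEMMA_TABLE` and `toy_lemma` are hypothetical names.

```python
# Toy lookup table built from the examples above. A real lemmatizer uses
# rules and much larger tables, and also depends on the part of speech.
LEMMA_TABLE = {
    ("best", "ADJ"): "good",
    ("better", "ADJ"): "good",
    ("worse", "ADV"): "badly",
    ("worst", "ADV"): "badly",
    ("ducks", "NOUN"): "duck",
    ("children", "NOUN"): "child",
    ("standing", "VERB"): "stand",
    ("stood", "VERB"): "stand",
}

def toy_lemma(word, pos):
    """Return the base form if known, else the lowercased word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemma("Children", "NOUN"))
print(toy_lemma("stood", "VERB"))
print(toy_lemma("tasty", "ADJ"))
```

Keying on the part of speech matters because the same surface form can have different lemmas depending on how it is used.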


In [None]:
tokenized_text = pd.DataFrame()

for i, token in enumerate(parsed_review):
    tokenized_text.loc[i, 'text'] = token.text
    tokenized_text.loc[i, 'lemma'] = token.lemma_  # no trailing comma, which would store a tuple
    tokenized_text.loc[i, 'pos'] = token.pos_
    tokenized_text.loc[i, 'tag'] = token.tag_
    tokenized_text.loc[i, 'dep'] = token.dep_
    tokenized_text.loc[i, 'shape'] = token.shape_
    tokenized_text.loc[i, 'is_alpha'] = token.is_alpha
    tokenized_text.loc[i, 'is_stop'] = token.is_stop
    tokenized_text.loc[i, 'is_punctuation'] = token.is_punct

tokenized_text[:20]

#### Named Entity Recognition (NER)
A named entity is a real-world object such as a person or an organization.

spaCy recognizes the entity types below automatically:

|Type|Description|
|------|------|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including "%".|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|"first", "second", etc.|
|CARDINAL|Numerals that do not fall under another type.|

In [None]:
spacy.displacy.render(parsed_review, style='ent', jupyter=True)

In [None]:
spacy.explain('GPE') # to explain an entity label

#### Dependency parsing
Syntactic parsing, or dependency parsing, is the process of identifying sentences and assigning a syntactic structure to them.
For example, a subject combined with an object makes a sentence.
spaCy provides a parse tree that can be used to generate this structure.

##### Sentence Boundary Detection
Figuring out where a sentence starts and ends is a very important part of NLP.

In [None]:
sentence_spans = list(parsed_review.sents)
sentence_spans
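
For contrast with the parser-based sentence spans above, here is a naive splitter that breaks after sentence-final punctuation. It is a toy baseline and not what spaCy does; `naive_sentences` is a hypothetical helper name.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "The salad was tasty. Would I order it again? Yes!"
sentences = naive_sentences(text)
print(sentences)

# Abbreviations fool the naive rule.
print(naive_sentences("Dr. Smith liked it."))
```

The second call splits after "Dr.", which shows why sentence boundaries are better decided with linguistic context than with punctuation alone.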

In [None]:
displacy.render(parsed_review, style='dep', jupyter=True,options={'distance': 140})

Scroll down if you can't see the output above.
You can also customize the dependency parser's visualization, as below.

In [None]:
options = {'compact': True, 'bg': 'violet','distance': 140,
           'color': 'white', 'font': 'Trebuchet MS'}
displacy.render(parsed_review, jupyter=True, style='dep', options=options)

In [None]:
spacy.explain("ADJ") ,spacy.explain("det") ,spacy.explain("ADP") ,spacy.explain("prep")  # to understand tags

#### Processing Noun chunks

In [None]:
noun_chunks_df = pd.DataFrame()

for i, chunk in enumerate(parsed_review.noun_chunks):
    noun_chunks_df.loc[i, 'text'] = chunk.text
    noun_chunks_df.loc[i, 'root'] = chunk.root  # no trailing comma, which would store a tuple
    noun_chunks_df.loc[i, 'root.text'] = chunk.root.text
    noun_chunks_df.loc[i, 'root.dep_'] = chunk.root.dep_
    noun_chunks_df.loc[i, 'root.head.text'] = chunk.root.head.text

noun_chunks_df[:20]

### Visualizing using Scattertext

In [None]:
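!pip install scattertext
import scattertext as st
# spacy.load takes `disable`, not `disable_pipes`; use the full model name as elsewhere in this notebook
nlp = spacy.load('en_core_web_sm', disable=["tagger","ner"])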

In [None]:
nlp = spacy.load('en_core_web_sm', disable=["tagger","ner"])
train_df['parsed'] = train_df.Text[49500:50500].apply(nlp)
corpus = st.CorpusFromParsedDocuments(train_df[49500:50500],
                             category_col='Score',
                             parsed_col='parsed').build()

In [None]:
html = st.produce_scattertext_explorer(corpus,
          category=1,
          category_name='Positive',
          not_category_name='Negative',
          width_in_pixels=700,
          minimum_term_frequency=15,
          term_significance = st.LogOddsRatioUninformativeDirichletPrior(),
          )

In [None]:
# write the interactive scattertext visualisation to an HTML file and display it inline
from IPython.display import IFrame

filename = "positive-vs-negative.html"
with open(filename, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=filename, width=900, height=900)


### Word vectors and similarity

Ok let's do some modelling and focus on scoring our food!!

### Sense2vec

The idea is to get something better than a plain word2vec model.

The idea behind sense2vec is super simple. If the problem is that duck as in waterfowl and duck as in crouch are different concepts, the straightforward solution is to just have two entries, duckN and duckV. Trask et al. (2015) published a nice set of experiments showing that the idea worked well.

It assigns part-of-speech tags such as verb, noun, or adjective to words, which are then used to make sense of the context.
1. Please book [VERB] my ticket.
2. Read the book [NOUN].

Read more [here](https://explosion.ai/blog/sense2vec-with-spacy) and [here](https://github.com/explosion/sense2vec)

Reddit talks about food a lot so we can get nice similarity vectors for food items.
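
The "one entry per sense" idea can be sketched with a tiny dictionary keyed by `word|SENSE` and a cosine similarity lookup. The vectors below are invented for illustration and do not come from the Reddit vectors; `toy_vectors`, `cosine` and `most_similar` are hypothetical names, not the sense2vec API.

```python
import math

# Hypothetical 3-dimensional vectors keyed by "word|SENSE". The numbers are
# invented for illustration only.
toy_vectors = {
    "duck|NOUN": [0.9, 0.1, 0.0],
    "goose|NOUN": [0.8, 0.2, 0.1],
    "duck|VERB": [0.0, 0.2, 0.9],
    "crouch|VERB": [0.1, 0.1, 0.8],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(key, n=1):
    """Rank all other keys by cosine similarity to the given key."""
    scores = [(other, cosine(toy_vectors[key], vec))
              for other, vec in toy_vectors.items() if other != key]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

print(most_similar("duck|NOUN"))
print(most_similar("duck|VERB"))
```

With separate entries, the noun sense of "duck" lands near "goose" and the verb sense near "crouch", which a single shared "duck" vector could not express.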

In [None]:
!pip install sense2vec==1.0.0a0

In [None]:
import sense2vec
from sense2vec import Sense2VecComponent

s2v = Sense2VecComponent('../input/reddit-vectors-for-sense2vec-spacy/reddit_vectors-1.1.0/reddit_vectors-1.1.0/')
spacy_tok.add_pipe(s2v)
doc = spacy_tok(u"dessert.")
freq = doc[0]._.s2v_freq
vector = doc[0]._.s2v_vec
most_similar = doc[0]._.s2v_most_similar(5)
most_similar,freq

In [None]:
doc = spacy_tok(u"burger")
most_similar = doc[0]._.s2v_most_similar(4)
most_similar

In [None]:
doc = spacy_tok(u"peanut butter")
most_similar = doc[0]._.s2v_most_similar(4)
most_similar

Similarity between entities can be kind of fun.


The following attributes are available via the ._ property – for example token._.in_s2v:

|Name|Attribute Type|Type|Description|
|--------|---------------|-------------|---------------|
|in_s2v|property|bool|Whether a key exists in the vector map.|
|s2v_freq|property|int|The frequency of the given key.|
|s2v_vec|property|ndarray[float32]|The vector of the given key.|
|s2v_most_similar|method|list|Get the n most similar terms. Returns a list of ((word, sense), score) tuples.|



## SpaCy Text Categorizer

We will train a multi-label convolutional neural network text classifier on our food reviews, using spaCy's TextCategorizer component.

spaCy provides a classification model with multiple, non-mutually exclusive labels. You can change the model architecture rather easily, but by default, the TextCategorizer class uses a convolutional neural network to assign position-sensitive vectors to each word in the document. The TextCategorizer uses its own CNN model, to avoid sharing weights with the other pipeline components. The document tensor is then summarized by concatenating max and mean pooling, and a multilayer perceptron is used to predict an output vector of length nr_class, before a logistic activation is applied elementwise. The value of each output neuron is the probability that some class is present.
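
The pooling and logistic steps described above can be sketched in plain Python. This is an illustration of the arithmetic only, not spaCy's implementation: the token vectors, weights and bias are made up, and the hidden layers of the multilayer perceptron are left out.

```python
import math

def mean_max_pool(token_vectors):
    """Concatenate the per-dimension mean and max over all tokens."""
    dims = list(zip(*token_vectors))
    mean_part = [sum(d) / len(d) for d in dims]
    max_part = [max(d) for d in dims]
    return mean_part + max_part

def logistic(x):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up 2-dimensional vectors for a 3-token document.
token_vectors = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0]]
pooled = mean_max_pool(token_vectors)

# Hypothetical weights for a single output label. A real model learns
# these and has hidden layers in between.
weights = [0.5, -0.5, 0.25, 0.25]
bias = -1.0
score = logistic(sum(w * x for w, x in zip(weights, pooled)) + bias)

print(pooled, round(score, 3))
```

Pooling turns a document of any length into a fixed-size vector, and applying the logistic function to each output independently is what allows labels that are not mutually exclusive.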

#### Prepare data
Let's prepare the data in the format spaCy expects.
It accepts a list of tuples of text and labels.

In [None]:
train_df['tuples'] = train_df.apply(
    lambda row: (row['Text'],row['Score']), axis=1)
train = train_df['tuples'].tolist()
train[:1]

In [None]:
train[-2:]

In [None]:
#functions from spacy documentation
def load_data(limit=0, split=0.8):
    train_data = train
    np.random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    cats = [{'POSITIVE': bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])

def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 1e-8  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 1e-8  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}

#("Number of texts to train from","t" , int)
n_texts=30000
#You can increase texts count if you have more computational power.

#("Number of training iterations", "n", int))
n_iter=10
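
As a worked example of the metrics computed in `evaluate` above, the sketch below derives precision, recall and F-score from raw counts. The counts are hypothetical and `prf_from_counts` is a helper name introduced only for this illustration.

```python
def prf_from_counts(tp, fp, fn):
    """Precision, recall and F1 from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical counts: 80 true positives, 20 false positives, 10 false negatives.
precision, recall, f_score = prf_from_counts(tp=80, fp=20, fn=10)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

True negatives do not appear in any of the three formulas, so the scores describe only how well the POSITIVE label is predicted.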

In [None]:
nlp = spacy.load('en_core_web_sm')  # create english Language class

In [None]:
# add the text classifier to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'textcat' not in nlp.pipe_names:
    textcat = nlp.create_pipe('textcat')
    nlp.add_pipe(textcat, last=True)
# otherwise, get it, so we can add labels to it
else:
    textcat = nlp.get_pipe('textcat')

# add label to text classifier
textcat.add_label('POSITIVE')

# load the dataset
print("Loading food reviews data...")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
print("Using {} examples ({} training, {} evaluation)"
      .format(n_texts, len(train_texts), len(dev_texts)))
train_data = list(zip(train_texts,
                      [{'cats': cats} for cats in train_cats]))
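
The shape of one training example built by the cell above can be sketched as follows. The helper `to_textcat_example` and the two sample texts are hypothetical; the structure mirrors the `(text, {'cats': {...}})` pairs that `load_data` and the `zip` call produce.

```python
def to_textcat_example(text, score):
    """Build one (text, annotations) pair in the format used above."""
    return (text, {"cats": {"POSITIVE": bool(score)}})

# Made-up reviews with binary scores, only for illustration.
examples = [
    to_textcat_example("Great tea.", 1),
    to_textcat_example("Stale chips.", 0),
]
print(examples)
```

A negative review is encoded as `POSITIVE: False` rather than with a separate NEGATIVE label, which matches the single label added to the classifier.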

### Training model

In [None]:
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()
    print("Training the model...")
    print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
                       losses=losses)
        with textcat.model.use_params(optimizer.averages):
            # evaluate on the dev data split off in load_data()
            scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
        print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  # print a simple table
              .format(losses['textcat'], scores['textcat_p'],
                      scores['textcat_r'], scores['textcat_f']))
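
The batch size schedule passed to `minibatch` in the loop above grows geometrically from a start value up to a cap. The generator below is a re-implementation sketch of that idea under the assumption that each value is the previous one multiplied by the compound rate; it is not spaCy's source, and `compounding_sizes` is a hypothetical name.

```python
from itertools import islice

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    size = start
    while True:
        yield min(size, stop)
        size *= compound

# A faster growth rate than 1.001 so the cap becomes visible quickly.
sizes = list(islice(compounding_sizes(4.0, 32.0, 2.0), 6))
print(sizes)
```

With the rate of 1.001 used in the training loop the growth is far slower, so early updates use small batches and later ones approach the cap of 32.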


In [None]:
# test the trained model
test_text1 = 'This tea is fun to watch as the flower expands in the water. Very smooth taste and can be used again and again in the same day. If you love tea, you gotta try these "flowering teas"'
test_text2="I bought this product at a local store, not from this seller.  I usually use Wellness canned food, but thought my cat was bored and wanted something new.  So I picked this up, knowing that Evo is a really good brand (like Wellness).<br /><br />It is one of the most disgusting smelling cat foods I've ever had the displeasure of using.  I was gagging while trying to put it into the bowl.  My cat took one taste and walked away, and chose to eat nothing until I replaced it 12 hours later with some dry food.  I would try another flavor of their food - since I know it's high quality - but I wouldn't buy the duck flavor again."
doc = nlp(test_text1)
test_text1, doc.cats

The positive review's score is indeed close to 1.

In [None]:
doc2 = nlp(test_text2)
test_text2, doc2.cats

The negative review's score is close to 0.

In [None]:
output_dir=%pwd
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

In [None]:
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text2)
print(test_text2, doc2.cats)

The model looks pretty good. We can definitely improve it further by feeding it more data and using data augmentation.
Thanks for reading. Hope you learnt something new :)  #TODO Data Augmentation.