<a href="https://colab.research.google.com/github/damzC/nlp/blob/main/Sentiment_Analyzer_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook introduces the problem of Sentiment Analysis, lists some popular datasets and common approaches, and then walks step by step through a solution based on Machine Learning.

**Definition**: Sentiment Analysis is the NLP task of computationally identifying the opinion or sentiment (*positive*, *negative*, or *neutral*) expressed in a text.

**Popular datasets for Sentiment Analysis:**
1. IMDB movie reviews (Kaggle), 50K entries: *longer-text SA*
2. Stanford dataset for sentiment analysis, *5 classes*: Very Positive, Positive, Neutral, Negative, Very Negative
3. Amazon reviews (Kaggle), 4 million entries, pre-trained models available: *short-text SA* (extract the headings only)

**Approaches for Sentiment Analysis:**
1. Lexicon-based: SentiWordNet
2. NLP Tools: TextBlob, spaCy, NLTK
3. **Machine Learning: NB Classifier, SVM, XGB**
4. Deep learning: LSTMs, GRUs, seq2seq
5. Sentiment Embeddings - Embeddings of words based on sentiments
6. Fine-tuning over Large Language Models (like BERT, RoBERTa *etc.*)
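
To make the contrast with the machine learning approach concrete, here is a minimal sketch of the lexicon-based idea (approach 1). The two word sets and the `lexicon_sentiment` helper are hypothetical illustrations, not SentiWordNet itself: a real lexicon is far larger and assigns graded scores per word sense.

```python
# Hypothetical toy lexicon; a real resource such as SentiWordNet is much richer.
POSITIVE = {"good", "great", "excellent", "love", "enjoyable"}
NEGATIVE = {"bad", "dull", "terrible", "hate", "boring"}


def lexicon_sentiment(text):
    """Label a text by counting positive and negative lexicon hits."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("a great and enjoyable film"))
print(lexicon_sentiment("dull plot , terrible acting"))
```

No training data is needed here, but the scorer ignores negation and context. Learning word weights from labelled reviews, as done in the rest of this notebook, is one way to address that.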

**Sentiment Analyser using Machine Learning**

This notebook walks through a sample implementation of a Sentiment Analyzer using a machine learning model (a Naive Bayes classifier in this case). The implementation shows how to:

1. Import movie review data from NLTK
2. Extract data in (X,Y) pairs for training
3. Vectorize text
4. Train Sentiment Analyser using NB Classifier
5. Show results using Confusion Matrix and Classification Report


## **Import Packages**

In [None]:
import os
import pandas as pd
import numpy as np

In [None]:
import nltk
import nltk.corpus

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

## **Import data for Sentiment Analyser**

In [None]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [None]:
print(os.listdir(nltk.data.find("corpora")))

['movie_reviews.zip', 'movie_reviews']


In [None]:
from nltk.corpus import movie_reviews

In [None]:
print(movie_reviews.categories())

['neg', 'pos']


In [None]:
print(len(movie_reviews.fileids('pos')))
print()
print(movie_reviews.fileids('pos'))

1000

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt', 'pos/cv005_29443.txt', 'pos/cv006_15448.txt', 'pos/cv007_4968.txt', 'pos/cv008_29435.txt', 'pos/cv009_29592.txt', 'pos/cv010_29198.txt', 'pos/cv011_12166.txt', 'pos/cv012_29576.txt', 'pos/cv013_10159.txt', 'pos/cv014_13924.txt', 'pos/cv015_29439.txt', 'pos/cv016_4659.txt', 'pos/cv017_22464.txt', 'pos/cv018_20137.txt', 'pos/cv019_14482.txt', 'pos/cv020_8825.txt', 'pos/cv021_15838.txt', 'pos/cv022_12864.txt', 'pos/cv023_12672.txt', 'pos/cv024_6778.txt', 'pos/cv025_3108.txt', 'pos/cv026_29325.txt', 'pos/cv027_25219.txt', 'pos/cv028_26746.txt', 'pos/cv029_18643.txt', 'pos/cv030_21593.txt', 'pos/cv031_18452.txt', 'pos/cv032_22550.txt', 'pos/cv033_24444.txt', 'pos/cv034_29647.txt', 'pos/cv035_3954.txt', 'pos/cv036_16831.txt', 'pos/cv037_18510.txt', 'pos/cv038_9749.txt', 'pos/cv039_6170.txt', 'pos/cv040_8276.txt', 'pos/cv041_21113.txt', 'pos/cv042_10982.txt', 'pos/cv043_15

In [None]:
neg_rev = movie_reviews.fileids('neg')
len(neg_rev)

1000

In [None]:
rev = nltk.corpus.movie_reviews.words('pos/cv000_29590.txt')
rev

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]

In [None]:
len(rev)

862

## **Detokenize the text (join all tokens)**

In [None]:
# nltk.download('perluniprops')
from nltk.tokenize.treebank import TreebankWordDetokenizer

In [None]:
detokenizer = TreebankWordDetokenizer()

In [None]:
detokenizer.detokenize(rev)

'films adapted from comic books have had plenty of success, whether they\' re about superheroes (batman, superman, spawn), or geared toward kids (casper) or the arthouse crowd (ghost world), but there\' s never really been a comic book like from hell before . for starters, it was created by alan moore (and eddie campbell), who brought the medium to a whole new level in the mid\' 80s with a 12 - part series called the watchmen . to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . the book (or " graphic novel, " if you will) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . in other words, don\' t dismiss this film because of its source . if you can get past the whole comic book thing, you might find another stumbling block in from hell\' s directors, albert and allen hughes . getting the hughes brothers to direct this seems almost as ludicrous as ca

## **Extract text for training**

In [None]:
rev_list = []

In [None]:
for rev_neg in neg_rev:
    # Load the tokens of one negative review (do not reassign the loop variable)
    rev_text_neg = nltk.corpus.movie_reviews.words(rev_neg)
    # Join the tokens, then remove the spaces left around punctuation and apostrophes
    review_one_string = " ".join(rev_text_neg)
    review_one_string = review_one_string.replace(' ,' , ',')
    review_one_string = review_one_string.replace(' .' , '.')
    review_one_string = review_one_string.replace("\' " , "'")
    review_one_string = review_one_string.replace(" \'", "'")
    rev_list.append(review_one_string)

In [None]:
len(rev_list)

1000

In [None]:
pos_rev = movie_reviews.fileids('pos')

In [None]:
for rev_pos in pos_rev:
    rev_text_pos = nltk.corpus.movie_reviews.words(rev_pos)
    review_one_string = " ".join(rev_text_pos)
    review_one_string = review_one_string.replace(' ,' , ',')
    review_one_string = review_one_string.replace(' .' , '.')
    review_one_string = review_one_string.replace("\' " , "'")
    review_one_string = review_one_string.replace(" \'", "'")
    rev_list.append(review_one_string)

In [None]:
len(rev_list)

2000

`rev_list` now holds all 2000 reviews: indices 0-999 are negative and indices 1000-1999 are positive.
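
The two loops above repeat the same join-and-clean steps. One way to avoid the duplication is a small helper; `join_tokens` below is a hypothetical name, and the toy token list stands in for the output of `movie_reviews.words(...)`.

```python
def join_tokens(tokens):
    """Join word tokens and tidy the spacing around punctuation and apostrophes."""
    text = " ".join(tokens)
    # Same four replacements, in the same order, as in the loops above
    for old, new in ((" ,", ","), (" .", "."), ("' ", "'"), (" '", "'")):
        text = text.replace(old, new)
    return text


# Toy tokens standing in for one tokenized review
sample = ["don", "'", "t", "dismiss", "this", "film", ",", "please", "."]
print(join_tokens(sample))  # don't dismiss this film, please.
```

With such a helper, each loop body reduces to `rev_list.append(join_tokens(movie_reviews.words(fileid)))`.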


In [None]:
print(rev_list[1000])

films adapted from comic books have had plenty of success, whether they're about superheroes ( batman, superman, spawn ), or geared toward kids ( casper ) or the arthouse crowd ( ghost world ), but there's never really been a comic book like from hell before. for starters, it was created by alan moore ( and eddie campbell ), who brought the medium to a whole new level in the mid'80s with a 12 - part series called the watchmen. to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd. the book ( or " graphic novel, " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes. in other words, don't dismiss this film because of its source. if you can get past the whole comic book thing, you might find another stumbling block in from hell's directors, albert and allen hughes. getting the hughes brothers to direct this seems almost as ludicrous as casting c

## **Set Y-values for training**

In [None]:
neg_targets = np.zeros((1000,), dtype=int)  # np.int was removed in NumPy 1.24; the built-in int is equivalent

In [None]:
type(neg_targets)

numpy.ndarray

In [None]:
len(neg_targets)

1000

In [None]:
pos_targets = np.ones((1000,), dtype=int)  # np.int was removed in NumPy 1.24; the built-in int is equivalent

In [None]:
len(pos_targets)

1000

In [None]:
target_list = []

In [None]:
for neg_tar in neg_targets:
    target_list.append(neg_tar)

In [None]:
for pos_tar in pos_targets:
    target_list.append(pos_tar)

In [None]:
len(target_list)

2000

In [None]:
target_list[999]

0

In [None]:
target_list[1000]

1

In [None]:
y = pd.Series(target_list) # Labels for training

In [None]:
type(y)

pandas.core.series.Series

In [None]:
y.shape

(2000,)

In [None]:
y.head()

0    0
1    0
2    0
3    0
4    0
dtype: int64
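
The label vector can also be built without the intermediate arrays and loops. This is a minimal sketch; `make_labels` is a hypothetical helper, and it assumes the reviews are stored negative first and positive second, as in `rev_list`.

```python
def make_labels(n_neg, n_pos):
    """Return n_neg zeros (negative) followed by n_pos ones (positive)."""
    return [0] * n_neg + [1] * n_pos


target_list = make_labels(1000, 1000)
# The class boundary sits between index 999 and index 1000
boundary = (target_list[999], target_list[1000])
print(len(target_list), boundary)  # 2000 (0, 1)
```

The resulting list can be passed to `pd.Series(...)` exactly as above.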

## **Vectorize the text**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
count_vect = CountVectorizer(lowercase=True, stop_words='english', min_df=2)

In [None]:
X_count_vect = count_vect.fit_transform(rev_list)

In [None]:
X_count_vect.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
type(X_count_vect)

scipy.sparse.csr.csr_matrix

In [None]:
X_count_vect.shape

(2000, 23784)
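
To see what the count matrix contains, here is a simplified pure-Python sketch of bag-of-words counting with a minimum document frequency. It is an illustration of the idea, not scikit-learn's implementation; the helper names, the toy documents and the one-word stop list are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(docs, min_df=2, stop_words=frozenset()):
    """Keep terms that occur in at least min_df documents, sorted alphabetically."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)) - stop_words)
    return sorted(term for term, n in doc_freq.items() if n >= min_df)


def vectorize(docs, vocab):
    """Return one row of term counts per document, one column per vocabulary term."""
    known = set(vocab)
    rows = []
    for doc in docs:
        counts = Counter(t for t in tokenize(doc) if t in known)
        rows.append([counts[term] for term in vocab])
    return rows


docs = ["a great great film", "a dull film", "great acting, dull plot"]
vocab = build_vocabulary(docs, min_df=2, stop_words=frozenset({"a"}))
matrix = vectorize(docs, vocab)
print(vocab)   # ['dull', 'film', 'great']
print(matrix)  # [[0, 1, 2], [1, 1, 0], [1, 0, 1]]
```

Terms that appear in only one document ("acting", "plot") are dropped, which is the effect `min_df=2` has on the review vocabulary.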

In [None]:
X_names = count_vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
len(X_names) # Vocabulary size

23784

In [None]:
X_count_vect = pd.DataFrame(X_count_vect.toarray(), columns=X_names)

In [None]:
type(X_count_vect)

pandas.core.frame.DataFrame

In [None]:
X_count_vect.shape

(2000, 23784)

In [None]:
X_count_vect.head()

Unnamed: 0,00,000,007,05,10,100,1000,100m,101,102,103,105,106,107,108,10th,11,110,113,115,11th,12,126,129,13,130,132,137,13th,14,14th,15,150,1500s,155,15th,16,160,1600,161,...,zeik,zellweger,zemeckis,zen,zenith,zero,zeroing,zest,zeta,zeus,zhang,zhivago,zhou,ziggy,zilch,zimmer,zinger,zingers,zip,zippel,zipper,zippy,zoe,zombie,zombies,zone,zones,zoo,zoolander,zoologist,zoom,zooming,zooms,zoot,zorg,zorro,zucker,zuko,zwick,zwigoff
0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## **Train Sentiment Analyser Model**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
X_train_cv, X_test_cv, y_train_cv, y_test_cv = train_test_split(X_count_vect,y,test_size=0.25,random_state = 5)

In [None]:
X_train_cv.shape

(1500, 23784)

In [None]:
X_test_cv.shape

(500, 23784)

In [None]:
y_test_cv.shape

(500,)
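
Conceptually, the split shuffles the row indices and sets aside a quarter of them for testing. The sketch below illustrates that idea with the standard library only. `shuffled_split` is a hypothetical helper and will not reproduce the exact rows that `train_test_split(..., random_state=5)` selects.

```python
import random


def shuffled_split(n_samples, test_size=0.25, seed=5):
    """Shuffle the row indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    n_test = int(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffled_split(2000)
print(len(train_idx), len(test_idx))  # 1500 500
```

Because the shuffle ignores the labels, the two classes need not be equally represented in the test set. If balanced classes matter, `train_test_split` accepts a `stratify=y` argument.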

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
clf_cv = MultinomialNB()

In [None]:
clf_cv.fit(X_train_cv,y_train_cv)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
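
For intuition about what the classifier learns, here is a from-scratch sketch of multinomial Naive Bayes with additive (Laplace) smoothing, matching the `alpha=1.0` shown in the output above. It is a simplified illustration, not scikit-learn's implementation; the function names and the toy count matrix are assumptions.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log class priors and smoothed log word likelihoods from count rows."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        totals = [sum(col) for col in zip(*rows)]  # per-word counts within class c
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return classes, log_prior, log_likelihood


def predict(model, x):
    """Return the class with the highest log posterior for one count row."""
    classes, log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(n * w for n, w in zip(x, log_likelihood[c]))
        for c in classes
    }
    return max(scores, key=scores.get)


# Toy counts; columns stand for the words: good, bad, film
X = [[2, 0, 1], [1, 0, 0], [0, 2, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
model = fit_multinomial_nb(X, y)
print(predict(model, [1, 0, 1]), predict(model, [0, 2, 0]))  # 1 0
```

The smoothing term keeps a word that never occurs in one class from driving that class's probability to zero.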

In [None]:
y_pred_cv = clf_cv.predict(X_test_cv)

In [None]:
type(y_pred_cv)

numpy.ndarray

In [None]:
print(metrics.accuracy_score(y_test_cv,y_pred_cv))

0.798


In [None]:
score_clf_cv = confusion_matrix(y_test_cv,y_pred_cv)

In [None]:
print(score_clf_cv)

[[213  45]
 [ 56 186]]


In [None]:
class_report_cv = classification_report(y_test_cv,y_pred_cv)

In [None]:
print(class_report_cv)

              precision    recall  f1-score   support

           0       0.79      0.83      0.81       258
           1       0.81      0.77      0.79       242

    accuracy                           0.80       500
   macro avg       0.80      0.80      0.80       500
weighted avg       0.80      0.80      0.80       500
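
The figures in the classification report follow directly from the confusion matrix printed earlier, where rows are true classes and columns are predicted classes. The sketch below recomputes them with plain arithmetic; `per_class_metrics` is a hypothetical helper.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
confusion = [[213, 45],
             [56, 186]]


def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class k."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp  # predicted k, but true class differs
    fn = sum(cm[k]) - tp                 # true k, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total
print(round(accuracy, 3))  # 0.798
for k in range(len(confusion)):
    print(k, [round(m, 2) for m in per_class_metrics(confusion, k)])
# 0 [0.79, 0.83, 0.81]
# 1 [0.81, 0.77, 0.79]
```

For example, precision for class 0 is 213 / (213 + 56): of the 269 reviews predicted negative, 213 really are negative.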



# **Word Embedding**

In [None]:
import spacy

In [None]:
import sys
python = sys.executable
!{python} -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4MB)
     |████████████████████████████████| 96.4MB 1.2MB/s 
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-cp36-none-any.whl size=98051305 sha256=11feb65bfda80ae11ff57abad8aaf088833f221fc262f982f1aaa2d4026ecc1d
  Stored in directory: /tmp/pip-ephem-wheel-cache-l370toss/wheels/df/94/ad/f5cf59224cea6b5686ac4fd1ad19c8a07bc026e13c36502d81
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [None]:
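# `spacy link` exists only in spaCy 2.x; on spaCy 3+ skip it and call spacy.load('en_core_web_md') directly
!{python} -m spacy link en_core_web_md en --force;
nlp = spacy.load('en') # Load the spaCy pipeline with 300-dimensional word vectors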

✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_md -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
doc1 = nlp("happy glad sad")
doc2 = nlp("king queen man woman")

In [None]:
doc1[0].vector

array([ 0.036775 ,  0.40917  , -0.52141  , -0.067184 ,  0.087702 ,
       -0.048564 ,  0.40947  , -0.42818  ,  0.19304  ,  2.3925   ,
       -0.11441  , -0.22952  , -0.16061  ,  0.035533 , -0.53179  ,
        0.19764  , -0.48827  ,  0.57439  , -0.064301 ,  0.47053  ,
       -0.29647  , -0.15927  , -0.052798 ,  0.10121  , -0.054461 ,
        0.036129 , -0.16118  , -0.34139  ,  0.45834  , -0.20144  ,
       -0.29067  , -0.51888  , -0.062106 ,  0.14084  ,  0.016413 ,
        0.050826 ,  0.13243  , -0.033663 , -0.42228  , -0.30086  ,
        0.06202  ,  0.26338  ,  0.077223 ,  0.27307  ,  0.13392  ,
        0.30183  , -0.16546  ,  0.057011 , -0.0034585, -0.071113 ,
       -0.27287  , -0.10297  ,  0.07457  , -0.32104  ,  0.36696  ,
        0.27051  , -0.15776  ,  0.2978   , -0.18988  ,  0.097477 ,
        0.035665 , -0.49749  , -0.52759  , -0.046148 ,  0.021715 ,
       -0.11047  , -0.18007  ,  0.20295  ,  0.15254  , -0.045976 ,
       -0.21846  , -0.066865 , -0.21355  ,  0.017509 ,  0.6647

In [None]:
doc1[0].text, doc1[0].vector.shape # Vector size is 300

('happy', (300,))

In [None]:
print(doc1[0].text, doc1[1].text, doc1[0].similarity(doc1[1]))

happy glad 0.77018654


In [None]:
doc3 = nlp("synonym antonym")

In [None]:
print(doc3[0].text, doc3[1].text, doc3[0].similarity(doc3[1]))

synonym antonym 0.84056866
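
The similarity scores above are cosine similarities between word vectors. Here is a minimal sketch of that computation using short toy vectors in place of the 300-dimensional ones; the `cosine_similarity` helper is an illustration, not spaCy's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

A high score means that two words occur in similar contexts, not that they mean the same thing. This is one plausible reason why "synonym" and "antonym" score about 0.84 above.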
