# Week 10 - Sentiment Analysis

## Text Preprocessing and Normalization - Starting on Page 570

The author references the file `Text Normalization Demo.ipynb`, which doesn't exist in his repo. However, I've created the same output here using the `TextNormalizer` class created in Week 4. My code does not match the author's line-for-line, but it has the same content.

In [15]:
# There are several ways to get folders visible in Python. This way isn't the most elegant
# but it works consistently. Replace my path with yours. The path you append to should be the
# folder where your tokenizer Python class is located.
import sys
sys.path.append(r'C:\Users\neugg\OneDrive\Documents\GitHub\dsc360-instructor\12 Week\week_4\assignment')
from text_normalizer import TextNormalizer
import pandas as pd
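
If you would rather not hard-code an absolute path, here is a minimal sketch that builds the folder path with `pathlib`. It assumes the normalizer sits in a folder relative to the working directory; the folder name `week_4/assignment` and the helper `add_to_sys_path` are hypothetical, so substitute your own layout.

```python
import sys
from pathlib import Path


def add_to_sys_path(folder):
    """Append `folder` to sys.path once and return the resolved path string."""
    resolved = str(Path(folder).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Hypothetical location of text_normalizer.py relative to the working directory.
normalizer_dir = add_to_sys_path(Path.cwd().parent / 'week_4' / 'assignment')
print(normalizer_dir in sys.path)
```

Calling the helper twice with the same folder leaves `sys.path` unchanged, which keeps the list from growing each time the cell is re-run.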

## Get the Data and Extract
The ideas start on page 573. Unlike the author, I normalize the data here.

In [2]:
orig_movie_data = pd.read_csv('data/movie_reviews.csv')
orig_movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


## Run the Normalizer

In [3]:
tn = TextNormalizer()
movie_reviews_normalized = tn.normalize_corpus(corpus=orig_movie_data['review'])

Starting TextNormalizer
Done strip
Done lower
Done stopword
Done char remove
Done contract exp
Done text lemm
Done spec char remove
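
The log above lists the steps the Week 4 normalizer runs. To give a feel for what a few of them do, here is a toy sketch using only the standard library. It is not the Week 4 `TextNormalizer`: the function `simple_normalize` and its tiny stopword list are made up for illustration, and it leaves out contraction expansion and lemmatization entirely.

```python
import re

# Tiny illustrative stopword list; a real normalizer uses a much larger one.
STOPWORDS = {'a', 'an', 'the', 'is', 'it', 'of', 'and', 'this', 'was'}


def simple_normalize(text):
    """Strip HTML tags, lowercase, drop non-letters, and remove stopwords."""
    text = re.sub(r'<[^>]+>', ' ', text)      # strip HTML tags
    text = text.lower()                       # lowercase
    text = re.sub(r'[^a-z\s]', ' ', text)     # remove digits and special characters
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return ' '.join(tokens)


print(simple_normalize('This movie was <br /> AMAZING!!! 10/10'))
```

With this input the sketch prints `movie amazing`.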


## Reassemble the clean data with the sentiments to create a clean DataFrame.

In [16]:
movie_reviews_clean_df = pd.DataFrame({'review': movie_reviews_normalized, 'sentiment': orig_movie_data['sentiment']})
movie_reviews_clean_df.info()
print(movie_reviews_clean_df.head())
# Save the clean data
movie_reviews_clean_df.to_csv('data/movie_reviews_clean.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
                                              review sentiment
0  one reviewers mentioned watching oz episode ho...  positive
1  wonderful little production filming technique ...  positive
2  thought wonderful way spend time hot summer we...  positive
3  basically family little boy jake thinks zombie...  negative
4  petter mattei love time money visually stunnin...  positive


## Unsupervised Lexicon-Based Models - Starting on page 573
You can start here if you want to skip cleaning the data.

**NOTE** The `model_evaluation_utils.py` file is referenced in the book, but it actually lives in the author's GitHub repository for a totally different book (which may be a newer book): https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/notebooks/Ch07_Analyzing_Movie_Reviews_Sentiment/model_evaluation_utils.py. I also had to modify that code, so use the `model_evaluation_utils.py` contained in this GitHub repository, which is the version that works.

In [2]:
import pandas as pd
import numpy as np
import model_evaluation_utils as meu
np.set_printoptions(precision=2, linewidth=80)
dataset = pd.read_csv(r'data/movie_reviews_clean.csv')

print(dataset.head())
reviews = np.array(dataset['review'])
sentiments = np.array(dataset['sentiment'])

# extract data for model evaluation
test_reviews = reviews[35000:]
test_sentiments = sentiments[35000:]
sample_review_ids = [7626, 3533, 13010]

                                              review sentiment
0  one reviewers mentioned watching oz episode ho...  positive
1  wonderful little production filming technique ...  positive
2  thought wonderful way spend time hot summer we...  positive
3  basically family little boy jake thinks zombie...  negative
4  petter mattei love time money visually stunnin...  positive


## Text Blob - Starting on page 576

In [3]:
import textblob
for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    print('Predicted Sentiment polarity:', textblob.TextBlob(review).sentiment.polarity)
    print('-'*60)

REVIEW: comment stupid movie acting average worse screenplay sense skip 
Actual Sentiment: negative
Predicted Sentiment polarity: -0.3375
------------------------------------------------------------
REVIEW:  care people voted movie bad want truth good movie every thing movie have really get one 
Actual Sentiment: positive
Predicted Sentiment polarity: 0.06666666666666671
------------------------------------------------------------
REVIEW: worst horror film ever funniest film ever rolled one got see film cheap unbeliaveble see really p s watch carrot
Actual Sentiment: positive
Predicted Sentiment polarity: -0.13333333333333333
------------------------------------------------------------
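
TextBlob polarity falls between -1 and 1, so turning it into a label needs a cutoff. Here is a small sketch of that step applied to the three polarity values printed above, using the same 0.1 cutoff as the next cell. The helper name `label_from_polarity` is mine, not the book's.

```python
def label_from_polarity(score, threshold=0.1):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    return 'positive' if score >= threshold else 'negative'


# Polarity values copied from the three sample reviews above.
sample_scores = [-0.3375, 0.06666666666666671, -0.13333333333333333]
sample_labels = [label_from_polarity(s) for s in sample_scores]
print(sample_labels)
```

All three come out `negative`. The second review scores slightly above zero but below the cutoff, so it is labeled negative even though its actual sentiment is positive.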


### Checking sentiments - page 577

In [4]:
sentiment_polarity = [textblob.TextBlob(review).sentiment.polarity for review in test_reviews]
predicted_sentiments = ['positive' if score >= 0.1 else 'negative' for score in sentiment_polarity]
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predicted_sentiments, classes=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy: 0.7641
Precision: 0.7641
Recall: 0.7641
F1 Score: 0.7641

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.76      0.76      0.76      7510
    negative       0.76      0.76      0.76      7490

   micro avg       0.76      0.76      0.76     15000
   macro avg       0.76      0.76      0.76     15000
weighted avg       0.76      0.76      0.76     15000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       5734     1776
        negative       1762     5728


## AFINN Lexicon - Page 578

In [6]:
from afinn import Afinn
afn = Afinn(emoticons=True) 
for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    print('Predicted Sentiment polarity:', afn.score(review))
    print('-'*60)

REVIEW: comment stupid movie acting average worse screenplay sense skip 
Actual Sentiment: negative
Predicted Sentiment polarity: -5.0
------------------------------------------------------------
REVIEW:  care people voted movie bad want truth good movie every thing movie have really get one 
Actual Sentiment: positive
Predicted Sentiment polarity: 3.0
------------------------------------------------------------
REVIEW: worst horror film ever funniest film ever rolled one got see film cheap unbeliaveble see really p s watch carrot
Actual Sentiment: positive
Predicted Sentiment polarity: -3.0
------------------------------------------------------------
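
AFINN scores a text by adding up an integer valence for each word it recognizes, which is why the scores above are whole numbers rather than values between -1 and 1. Here is a toy sketch of that idea. The four valences in `TOY_LEXICON` are for illustration only; the real AFINN word list is far larger and its values may differ.

```python
# Illustrative valences only; the real AFINN list is far larger and may differ.
TOY_LEXICON = {'stupid': -2, 'worse': -3, 'good': 3, 'bad': -3}


def toy_lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every token that appears in the lexicon."""
    return float(sum(lexicon.get(token, 0) for token in text.split()))


review = 'comment stupid movie acting average worse screenplay sense skip'
print(toy_lexicon_score(review))
```

Words that are not in the lexicon contribute 0, so a long review full of neutral words can still end up with a small total.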


In [7]:
sentiment_polarity = [afn.score(review) for review in test_reviews]
predicted_sentiments = ['positive' if score >= 0.1 else 'negative' for score in sentiment_polarity]
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predicted_sentiments, classes=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy: 0.7008
Precision: 0.7211
Recall: 0.7008
F1 Score: 0.6937

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.65      0.85      0.74      7510
    negative       0.79      0.55      0.65      7490

   micro avg       0.70      0.70      0.70     15000
   macro avg       0.72      0.70      0.69     15000
weighted avg       0.72      0.70      0.69     15000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       6405     1105
        negative       3383     4107


## SentiWordNet Lexicon - Starting on page 580
My code differs to fix bugs.

In [12]:
from nltk.corpus import sentiwordnet as swn

awesome = list(swn.senti_synsets('awesome', 'a'))[0]
print('Positive Polarity Score:', awesome.pos_score())
print('Negative Polarity Score:', awesome.neg_score())
print('Objective Score:', awesome.obj_score())

Positive Polarity Score: 0.875
Negative Polarity Score: 0.125
Objective Score: 0.0
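
The three scores of a SentiWordNet synset add up to 1, so the objective score is whatever is left after the positive and negative scores. A quick check with the values printed above (the helper `objective_score` is just for illustration):

```python
def objective_score(pos_score, neg_score):
    """Objectivity is the share left over after positive and negative."""
    return 1.0 - (pos_score + neg_score)


# Scores printed above for the first adjective sense of 'awesome'.
awesome_objectivity = objective_score(0.875, 0.125)
print(awesome_objectivity)
```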


In [27]:
import nltk
def analyze_sentiment_sentiwordnet_lexicon(review, verbose):
    # tokenize and POS tag text tokens
    text_tokens = nltk.word_tokenize(review)
    tagged_text = nltk.pos_tag(text_tokens)
    pos_score = neg_score = token_count = obj_score = 0
    # get wordnet synsets based on POS tags
    # get sentiment scores if synsets are found
    for word, tag in tagged_text:
        if verbose:
            print(word, tag)
        ss_set = None
        if 'NN' in tag and swn.senti_synsets(word, 'n'):
            ss_set = swn.senti_synsets(word, 'n')
        elif 'VB' in tag and swn.senti_synsets(word, 'v'):
            ss_set = swn.senti_synsets(word, 'v')
        elif 'JJ' in tag and swn.senti_synsets(word, 'a'):
            ss_set = swn.senti_synsets(word, 'a')
        elif 'RB' in tag and swn.senti_synsets(word, 'r'):
            ss_set = swn.senti_synsets(word, 'r')
        # if senti-synset is found        
        if ss_set:
            for synst in ss_set:
                # add scores for all found synsets
                pos_score += synst.pos_score()
                neg_score += synst.neg_score()
                obj_score += synst.obj_score()
                token_count += 1
    
    # aggregate final scores
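    final_score = pos_score - neg_score
    # guard against ZeroDivisionError when no synsets were found for any token
    token_count = max(token_count, 1)
    norm_final_score = round(float(final_score) / token_count, 2)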
    final_sentiment = 'positive' if norm_final_score >= 0 else 'negative'
    if verbose:
        norm_obj_score = round(float(obj_score) / token_count, 2)
        norm_pos_score = round(float(pos_score) / token_count, 2)
        norm_neg_score = round(float(neg_score) / token_count, 2)
        # to display results in a nice table
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score,
                                         norm_pos_score, norm_neg_score,
                                         norm_final_score]],
                                         columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                                      ['Predicted Sentiment', 'Objectivity',
                                                                       'Positive', 'Negative', 'Overall']], 
                                                              codes=[[0,0,0,0,0],[0,1,2,3,4]]))
        print(sentiment_frame)
        
    return final_sentiment

for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    pred = analyze_sentiment_sentiwordnet_lexicon(review, verbose=True)
    print('-'*60)

REVIEW: comment stupid movie acting average worse screenplay sense skip 
Actual Sentiment: negative
comment NN
stupid JJ
movie NN
acting VBG
average JJ
worse JJ
screenplay NN
sense NN
skip NN
     SENTIMENT STATS:                                      
  Predicted Sentiment Objectivity Positive Negative Overall
0            negative        0.66     0.06     0.28   -0.22
------------------------------------------------------------
REVIEW:  care people voted movie bad want truth good movie every thing movie have really get one 
Actual Sentiment: positive
care NN
people NNS
voted VBD
movie NN
bad JJ
want VBP
truth NN
good JJ
movie NN
every DT
thing NN
movie NN
have VBP
really RB
get VB
one CD
     SENTIMENT STATS:                                      
  Predicted Sentiment Objectivity Positive Negative Overall
0            positive        0.74     0.15     0.11    0.04
------------------------------------------------------------
REVIEW: worst horror film ever funniest film ever rolled one 
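
The `Overall` column in the tables above is simply `Positive` minus `Negative` after each is averaged over the synsets found (for example, 0.06 - 0.28 = -0.22 for the first review). Here is a stripped-down sketch of that aggregation. The function `aggregate_scores` and the score pairs fed to it are hypothetical and are not taken from SentiWordNet.

```python
def aggregate_scores(synset_scores):
    """Average (pos, neg) pairs and label the result by the sign of pos - neg."""
    count = max(len(synset_scores), 1)  # avoid dividing by zero on empty input
    pos = sum(p for p, _ in synset_scores) / count
    neg = sum(n for _, n in synset_scores) / count
    overall = round(pos - neg, 2)
    label = 'positive' if overall >= 0 else 'negative'
    return overall, label


# Hypothetical (pos, neg) pairs, not taken from SentiWordNet.
example = aggregate_scores([(0.0, 0.5), (0.25, 0.25), (0.5, 0.0), (0.0, 0.25)])
empty = aggregate_scores([])
print(example, empty)
```

In this sketch a text with no matched synsets scores 0 and is labeled positive, because the rule is `>= 0`.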

## Predict Sentiment on Test Reviews and Evaluation Performance - Page 583

In [30]:
# note that the corpus is already cleaned, so the first line is skipped (just use test_reviews)
predicted_sentiments = [analyze_sentiment_sentiwordnet_lexicon(review, verbose=False) for review in test_reviews]
meu.display_model_performance_metrics(true_labels=test_sentiments, 
    predicted_labels=predicted_sentiments, classes=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy: 0.5847
Precision: 0.6905
Recall: 0.5847
F1 Score: 0.5172

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.55      0.96      0.70      7510
    negative       0.83      0.21      0.34      7490

   micro avg       0.58      0.58      0.58     15000
   macro avg       0.69      0.58      0.52     15000
weighted avg       0.69      0.58      0.52     15000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       7193      317
        negative       5913     1577


## Vader Lexicon - Starting on Page 585
Note that my code does not clean the text because it's already cleaned.

In [34]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# only need to run the following line of code once
nltk.download('vader_lexicon') 
def analyze_sentiment_vader_lexicon(review, threshold=0.1, verbose=False):
    # analyze the sentiment for review
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    # get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 'positive' if agg_score >= threshold\
                                   else 'negative'
    if verbose:
        # display detailed sentiment statistics
        positive = str(round(scores['pos'], 2)*100)+'%'
        final = round(agg_score, 2)
        negative = str(round(scores['neg'], 2)*100)+'%'
        neutral = str(round(scores['neu'], 2)*100)+'%'
        sentiment_frame = pd.DataFrame([[final_sentiment, final, positive,
                                        negative, neutral]],
                                        columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                                      ['Predicted Sentiment', 'Polarity Score',
                                                                       'Positive', 'Negative',
                                                                       'Neutral']], 
                                                              codes=[[0,0,0,0,0],[0,1,2,3,4]]))
        print(sentiment_frame)
    
    return final_sentiment

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\neugg\AppData\Roaming\nltk_data...


In [35]:
for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    pred = analyze_sentiment_vader_lexicon(review, threshold=0.4, verbose=True)
    print('-'*60)

REVIEW: comment stupid movie acting average worse screenplay sense skip 
Actual Sentiment: negative
     SENTIMENT STATS:                                         
  Predicted Sentiment Polarity Score Positive Negative Neutral
0            negative          -0.76     0.0%    48.0%   52.0%
------------------------------------------------------------
REVIEW:  care people voted movie bad want truth good movie every thing movie have really get one 
Actual Sentiment: positive
     SENTIMENT STATS:                                                     
  Predicted Sentiment Polarity Score Positive             Negative Neutral
0            positive           0.64    40.0%  14.000000000000002%   46.0%
------------------------------------------------------------
REVIEW: worst horror film ever funniest film ever rolled one got see film cheap unbeliaveble see really p s watch carrot
Actual Sentiment: positive
     SENTIMENT STATS:                                      \
  Predicted Sentiment Polarity
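
The odd `14.000000000000002%` in the output above is a floating-point artifact: the function rounds first and then multiplies by 100. Here is a small sketch of the difference, assuming the rounded share was 0.14. Formatting the number instead of converting it with `str` is one possible way around it.

```python
share = 0.14  # the rounded 'neg' share for the second sample review

as_written = str(round(share, 2) * 100) + '%'   # multiplies after rounding
formatted = '{:.1f}%'.format(share * 100)       # rounds while formatting

print(as_written)
print(formatted)
```

The first line prints `14.000000000000002%` and the second prints `14.0%`.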

### Page 587

In [36]:
predicted_sentiments = [analyze_sentiment_vader_lexicon(review, threshold=0.4, verbose=False) for review in test_reviews]
meu.display_model_performance_metrics(true_labels=test_sentiments, 
    predicted_labels=predicted_sentiments, classes=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy: 0.6895
Precision: 0.7081
Recall: 0.6895
F1 Score: 0.6823

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.65      0.84      0.73      7510
    negative       0.77      0.54      0.63      7490

   micro avg       0.69      0.69      0.69     15000
   macro avg       0.71      0.69      0.68     15000
weighted avg       0.71      0.69      0.68     15000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       6306     1204
        negative       3453     4037


## Classifying Sentiment with Supervised Learning - Starting on page 589
Text is already cleaned and is reloaded in case you want to start here.

In [2]:
import pandas as pd 
import numpy as np 
np.set_printoptions(precision=2, linewidth=80)
import model_evaluation_utils as meu
import nltk

In [4]:
dataset = pd.read_csv(r'data/movie_reviews_clean.csv')
print(dataset.head())

reviews = np.array(dataset['review'])
sentiments = np.array(dataset['sentiment'])

# build train and test datasets
# NOTE: this is NOT how to split test and train, but it follows the book
train_reviews = reviews[:35000]
train_sentiments = sentiments[:35000]
test_reviews = reviews[35000:]
test_sentiments = sentiments[35000:]

# skipping normalizing the dataset, already normalized

                                              review sentiment
0  one reviewers mentioned watching oz episode ho...  positive
1  wonderful little production filming technique ...  positive
2  thought wonderful way spend time hot summer we...  positive
3  basically family little boy jake thinks zombie...  negative
4  petter mattei love time money visually stunnin...  positive
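
Slicing the first 35,000 rows only gives a fair split if the file is already in random order. The usual tool is scikit-learn's `train_test_split`, which can shuffle and stratify for you. To show what that does, here is a standard-library sketch of a shuffled, stratified split; the function `stratified_split` and the toy data are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, test_fraction=0.3, seed=42):
    """Shuffle within each label, then split so both sets keep the class balance."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = int(round(len(indices) * test_fraction))
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])

    def pick(idx):
        return [texts[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)


# Toy data standing in for the 50,000 reviews.
toy_reviews = ['review {}'.format(i) for i in range(20)]
toy_labels = ['positive', 'negative'] * 10
(train_x, train_y), (test_x, test_y) = stratified_split(toy_reviews, toy_labels)
print(len(train_x), len(test_x), sorted(test_y))
```

With 10 reviews per class and a 30% test fraction, the test set gets 3 of each class and the training set gets the remaining 14 reviews.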


## Traditional Supervised Machine Learning Models - Starting on page 590

In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# build BOW features on train reviews
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(train_reviews)

# build TFIDF features on train reviews
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1,2), sublinear_tf=True)
tv_train_features = tv.fit_transform(train_reviews)

# transform test reviews into features
cv_test_features = cv.transform(test_reviews)
tv_test_features= tv.transform(test_reviews)

print('BOW model: TRAIN features shape:', cv_train_features.shape)
print('TEST features shape:', cv_test_features.shape, '\n')

print('TFIDF mode: TRAIN features shape:', tv_train_features.shape)
print('TEST features shape:', tv_test_features.shape)


BOW model: TRAIN features shape: (35000, 2487514)
TEST features shape: (15000, 2487514) 

TFIDF mode: TRAIN features shape: (35000, 2487514)
TEST features shape: (15000, 2487514)
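
The feature count is in the millions because `ngram_range=(1,2)` adds every pair of adjacent words as its own column alongside the single words. Here is a toy sketch of that idea with two tiny documents. It splits on whitespace only, so it is a simplification of what `CountVectorizer` does, and the helper `unigrams_and_bigrams` is hypothetical.

```python
from collections import Counter


def unigrams_and_bigrams(text):
    """Return the unigram and bigram terms of a whitespace-tokenized text."""
    tokens = text.split()
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


docs = ['good movie', 'good movie good cast']
vocabulary = sorted({term for doc in docs for term in unigrams_and_bigrams(doc)})
counts = [Counter(unigrams_and_bigrams(doc)) for doc in docs]
matrix = [[c[term] for term in vocabulary] for c in counts]
print(vocabulary)
print(matrix)
```

Three distinct words already produce six columns here, which hints at how 35,000 reviews can produce about 2.5 million.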


In [6]:
from sklearn.linear_model import SGDClassifier, LogisticRegression
lr = LogisticRegression(penalty='l2', max_iter=100, C=1)
svm = SGDClassifier(loss='hinge', max_iter=100)

# Logistic Regression model on BOW features
lr_bow_predictions = meu.train_predict_model(classifier=lr, train_features=cv_train_features, train_labels=train_sentiments,
                                             test_features=cv_test_features, test_labels=test_sentiments)
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_bow_predictions, classes=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy: 0.9017
Precision: 0.9017
Recall: 0.9017
F1 Score: 0.9017

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.90      0.91      0.90      7510
    negative       0.90      0.90      0.90      7490

   micro avg       0.90      0.90      0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       6797      713
        negative        762     6728


In [7]:
# Logistic Regression model on TF-IDF features (page 592)
lr_tfidf_predictions = meu.train_predict_model(classifier=lr, train_features=tv_train_features, train_labels=train_sentiments,
                                               test_features=tv_test_features, test_labels=test_sentiments)
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_tfidf_predictions, classes=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy: 0.8945
Precision: 0.8948
Recall: 0.8945
F1 Score: 0.8945

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.88      0.91      0.90      7510
    negative       0.90      0.88      0.89      7490

   micro avg       0.89      0.89      0.89     15000
   macro avg       0.89      0.89      0.89     15000
weighted avg       0.89      0.89      0.89     15000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       6815      695
        negative        887     6603


## Newer Supervised Deep Learning Models - Starting on page 593

I did not go into the TensorFlow / Keras models. You'll see some of this in DSC 410, and at this point running these models for text analytics can require very intensive CPU power.

## Analyzing Sentiment Causation and Interpreting Predictive Models
Starting on page 615.

**NOTE** the code at this point doesn't appear to exist in the author's repository and the code itself is repeated in an entirely different book (*Practical Machine Learning with Python: A Problem-Solver's Guide to Building Real-World Intelligent Systems*) from the same author.

My code is slightly different because I skip the cleaning step (the data is already cleaned).

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# build BOW features on train reviews
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(train_reviews)
# build Logistic Regression model
lr = LogisticRegression()
lr.fit(cv_train_features, train_sentiments)

# Build Text Classification Pipeline
lr_pipeline = make_pipeline(cv, lr)

# save the list of prediction classes (positive, negative)
classes = list(lr_pipeline.classes_)

In [11]:
# skipped normalizing the new reviews - there's no point since the corpus is so small and can be normalized manually
new_corpus = ['the lord of the rings is an excellent movie',
              'i hated the recent movie on tv, it was so bad']
lr_pipeline.predict(new_corpus)

# array(['positive', 'negative'], dtype=object) - the predict() call above displays nothing because it is not the last expression in the cell

pd.DataFrame(lr_pipeline.predict_proba(new_corpus), columns=classes)

   negative  positive
0  0.188661  0.811339
1  0.813330  0.186670
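
Each row of `predict_proba` adds up to 1, and for a two-class logistic regression the predicted label is the class with the larger probability. A quick check using the numbers from the output above:

```python
classes = ['negative', 'positive']
# Probabilities copied from the output above, one row per new review.
probabilities = [[0.188661, 0.811339],
                 [0.813330, 0.186670]]

predicted = [classes[row.index(max(row))] for row in probabilities]
row_sums = [round(sum(row), 6) for row in probabilities]
print(predicted, row_sums)
```

This prints `['positive', 'negative'] [1.0, 1.0]`, which agrees with the labels in the comment in the cell above.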


### You need to install the `skater` package - `pip install skater`

In [14]:
# page 617
from skater.core.local_interpretation.lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=classes)
# helper function for model interpretation
def interpret_classification_model_prediction(doc_index, norm_corpus, corpus,
                                              prediction_labels, explainer_obj):
    # display model prediction and actual sentiments
    print("Test document index: {index}\nActual sentiment: {actual} \
                                       \nPredicted sentiment: {predicted}"
      .format(index=doc_index, actual=prediction_labels[doc_index],
              predicted=lr_pipeline.predict([norm_corpus[doc_index]])))
    # display actual review content
    print("\nReview:", corpus[doc_index])
    # display prediction probabilities
    print("\nModel Prediction Probabilities:")
    for probs in zip(classes, lr_pipeline.predict_proba([norm_corpus[doc_index]])[0]):
        print(probs)
    # display model prediction interpretation
    exp = explainer_obj.explain_instance(norm_corpus[doc_index],
                                     lr_pipeline.predict_proba, num_features=10,
                                     labels=[1])
    exp.show_in_notebook()

ModuleNotFoundError: No module named 'skater'
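
The error above just means the package isn't installed in the active environment. Here is a small sketch that checks for a package before importing it, using only the standard library. The helper `is_installed` is hypothetical. I also check for `lime`, since the standalone `lime` package provides `lime.lime_text.LimeTextExplainer` and may be an alternative if `skater` will not install; I have not verified that it is a drop-in replacement for the code above.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found without importing it."""
    return importlib.util.find_spec(package_name) is not None


for name in ('skater', 'lime'):
    status = 'available' if is_installed(name) else 'missing (try: pip install {})'.format(name)
    print(name, '->', status)
```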