# **In Class Assignment: Sentiment Analysis**

## Name: KEY
## *IS 5150*

In this in-class assignment we will examine the sentiment of movie reviews using both unsupervised lexicon-based modeling and supervised classification. We will then use interpretation tools to examine the decisions of our sentiment analysis model and determine which words and topics are associated with positive and negative sentiment.

We begin, as always, by importing our dependencies:

In [67]:
import pandas as pd
import numpy as np
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import sklearn
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.decomposition import LatentDirichletAllocation

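!pip install pyLDAvis
import pyLDAvis
# NOTE: newer pyLDAvis releases renamed this module to pyLDAvis.lda_model;
# adjust the import (and the prepare calls below) if this line fails
import pyLDAvis.sklearn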


np.set_printoptions(precision=2, linewidth=80)

import warnings
warnings.filterwarnings("ignore")

## **Load and Normalize the Dataset**

In [20]:
df = pd.read_csv('/content/movie_reviews.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [64]:
reviews = np.array(df['review'])
sentiments = np.array(df['sentiment'])

test_reviews = reviews[45000:]
test_sentiments = sentiments[45000:]
sample_review_ids = [2626, 3533, 4012]

### Normalizer Function

In [89]:
import nltk
nltk.download('stopwords')
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer()
# keep negation and contrast words out of the stopword list, since they can flip sentiment
stop_words = nltk.corpus.stopwords.words('english')
stop_words.remove('no')
stop_words.remove('but')
stop_words.remove('not')
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

from pprint import pprint
import numpy as np
import re
from bs4 import BeautifulSoup

import spacy
nlp = spacy.load('en_core_web_sm')                                                                                            # dependencies

import unicodedata

!pip install contractions
import contractions

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    [s.extract() for s in soup(['iframe', 'script'])]                                                                         # html parsing
    stripped_text = soup.get_text()
    # collapse runs of newlines; '|' inside a character class is a literal pipe, so it is dropped here
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text

def tokenize_text(text):                                                                                                      # text tokenization
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences] 
    return word_tokens

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')                            # accent removal
    return text

def expand_contractions(text):                                                                                                # expand contractions
    # expand each whitespace-separated word, then join once after the loop
    # (joining inside the loop left expanded_text undefined for empty input)
    expanded_words = [contractions.fix(word) for word in text.split()]
    return ' '.join(expanded_words)

def remove_special_characters(text, remove_digits=False):                                                                    # special character removal
    # A-Z (not A-z): the range A-z also matches [ \ ] ^ _ and `
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

def simple_stemmer(text):                                                                                                   # stemmer
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

def lemmatize_text(text):
    text = nlp(text)                                                                                                        # lemmatizer
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

def remove_stopwords(text, is_lower_case=False, stopwords=stop_words):                                                   # stopword removal
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,                                                # define normalize corpus function
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and\or digits    
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
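
To see why the negation words are taken out of the stopword list before normalizing, the sketch below runs a stripped-down version of the same steps (isolate punctuation, drop special characters, collapse whitespace, filter stopwords) on one sentence. It uses only the standard library; `TOY_STOPWORDS` and `toy_normalize` are hypothetical stand-ins for the NLTK stopword list and `normalize_corpus`, not part of the notebook.

```python
import re

# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"this", "is", "a", "the", "was", "it", "not", "no", "but"}
NEGATIONS = {"no", "not", "but"}

def toy_normalize(text, stopwords):
    """Lowercase, isolate and strip punctuation, collapse spaces, drop stopwords."""
    text = text.lower()
    text = re.sub(r"([{.()!}])", r" \1 ", text)   # isolate punctuation
    text = re.sub(r"[^a-zA-Z\s]", "", text)       # drop special chars and digits
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    return " ".join(tok for tok in text.split() if tok not in stopwords)

sentence = "This was not a good movie, but the ending is great!"

with_negations_removed = toy_normalize(sentence, TOY_STOPWORDS)
with_negations_kept = toy_normalize(sentence, TOY_STOPWORDS - NEGATIONS)

print(with_negations_removed)  # negation lost: reads as plainly positive
print(with_negations_kept)     # negation and contrast preserved
```

With the negations filtered out, "not good" collapses to "good", which is exactly the signal a sentiment model needs to keep.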


## **Sentiment Analysis with VADER**
### Build the Model

In [37]:
def analyze_sentiment_vader_lexicon(review, 
                                    threshold=0.1,
                                    verbose=False):
    review = strip_html_tags(review)
    review = remove_accented_chars(review)
    review = expand_contractions(review)

    # analyze the sentiment for review
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)

    # get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 'positive' if agg_score >= threshold\
                                   else 'negative'
    if verbose:
        # display detailed sentiment statistics
        # format as whole percentages; str(round(x, 2)*100) can print float noise such as 28.999999999999996
        positive = f"{scores['pos']:.0%}"
        final = round(agg_score, 2)
        negative = f"{scores['neg']:.0%}"
        neutral = f"{scores['neu']:.0%}"
        sentiment_frame = pd.DataFrame([[final_sentiment, final, positive,
                                        negative, neutral]],
                                        columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                                      ['Predicted Sentiment', 'Polarity Score',
                                                                       'Positive', 'Negative', 'Neutral']], 
                                                              codes=[[0,0,0,0,0],[0,1,2,3,4]]))
        print(sentiment_frame)
    
    return final_sentiment


### Predict Sentiment of Sample Reviews

In [65]:
# use singular loop names so the `reviews` and `sentiments` arrays are not overwritten
for review, sentiment in zip(reviews[sample_review_ids], sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    pred = analyze_sentiment_vader_lexicon(review, threshold=0.4, verbose=True)
    print('-'*60)

REVIEW: I'm a fan of Zhang Yimou and finally found this DVD title from the shelves of a Shenzhen bookstore after a long search at many places.<br /><br />This is a huge departure from previous Zhang Yimou work, esp in terms of style and locale. The director himself has said that this is the first and only time he'll ever attempt to make a black comedy set in contemporary China. You may even say this work is experimental in nature, compared to his other well known big budget and formal pieces.<br /><br />Filmed with a hand-held camera and wide angle lens throughout the duration of the whole film, the quick pace editing and high energy performance & naturalistic tone never let you go once it grips you from the start. It presents a very realistic account of modern Chinese urban sensibilities, which in this case is set in Beijing. If you appreciate and love this kind of black humor, you will love this film totally. Also look out for hilarious cameos by Zhao Benshan (Happy Times)and the dir

**What impact does the threshold value have on the classification of sentiments within reviews?**

**Do you agree with all of the ratings? If not, what words or phrases might be throwing off the sentiment score?**
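
One way to explore the threshold question is to hold the compound scores fixed and sweep the cutoff. The sketch below does this with made-up compound scores (VADER's compound score lies in [-1, 1]); `toy_scores` and `classify` are hypothetical, and the scores are illustrative rather than computed from the reviews above.

```python
# Hypothetical compound scores in [-1, 1]; not taken from the actual reviews.
toy_scores = [-0.85, -0.20, 0.05, 0.15, 0.35, 0.90]

def classify(compound, threshold):
    """Same rule as analyze_sentiment_vader_lexicon: positive iff compound >= threshold."""
    return "positive" if compound >= threshold else "negative"

positives_by_threshold = {}
for threshold in (0.0, 0.1, 0.4):
    labels = [classify(score, threshold) for score in toy_scores]
    positives_by_threshold[threshold] = labels.count("positive")
    print(f"threshold={threshold}: {labels}")
```

Raising the threshold can only move reviews from positive to negative, so the mildly positive reviews are the ones whose labels flip.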

## **Next let's predict sentiment via machine learning, using several classification algorithms...**

### Build a training and test set

In [91]:
train_reviews = reviews[:3500]
train_sentiments = sentiments[:3500]
test_reviews = reviews[3500:4500]
test_sentiments = sentiments[3500:4500]

### Normalize the corpora

In [92]:
norm_train_reviews = normalize_corpus(train_reviews)
norm_test_reviews = normalize_corpus(test_reviews)

In [None]:
norm_train_reviews[0]

### Feature Engineering

In [96]:
# build TFIDF features on train reviews
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1,2),
                     sublinear_tf=True)
tv_train_features = tv.fit_transform(norm_train_reviews)

In [97]:
# transform test reviews into features
tv_test_features = tv.transform(norm_test_reviews)

In [107]:
print(tv_train_features.shape)
print(tv_test_features.shape)

(3500, 322500)
(1000, 322500)
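
The feature count is large relative to the 3,500 training documents because `ngram_range=(1,2)` adds every distinct bigram to the vocabulary alongside the unigrams, and `min_df=0.0` discards nothing. A standard-library sketch of that expansion, together with the `1 + log(tf)` scaling that `sublinear_tf=True` applies to raw term counts, is shown below. `extract_ngrams` and the toy document are hypothetical and only approximate what `TfidfVectorizer` does internally (there is no IDF weighting or normalization here).

```python
import math
from collections import Counter

def extract_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous n-grams for n in the inclusive ngram_range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

toy_doc = "not good not good movie".split()
counts = Counter(extract_ngrams(toy_doc))

# sublinear_tf=True replaces a raw count tf with 1 + ln(tf)
sublinear = {term: 1 + math.log(tf) for term, tf in counts.items()}

for term in sorted(counts):
    print(f"{term!r}: tf={counts[term]}, sublinear_tf={sublinear[term]:.2f}")
```

Three distinct unigrams become six features once bigrams are included, and the bigram "not good" is what lets a linear model separate it from "good".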


### Model Training, Prediction, and Performance Eval

In [101]:
lr = LogisticRegression(penalty='l2', max_iter=100, C=1)
svm = SGDClassifier(loss='hinge', max_iter=100)

#### Logistic Regression with TF-IDF Features

In [110]:
def train_predict_model(classifier, 
                        train_features, train_labels, 
                        test_features, test_labels):
    # build model    
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features) 
    return predictions  

lr_tfidf_predictions = train_predict_model(classifier=lr, 
                                               train_features=tv_train_features, train_labels=train_sentiments,
                                               test_features=tv_test_features, test_labels=test_sentiments)


In [111]:
print(classification_report(test_sentiments, lr_tfidf_predictions))

              precision    recall  f1-score   support

    negative       0.89      0.86      0.87       526
    positive       0.85      0.88      0.86       474

    accuracy                           0.87      1000
   macro avg       0.87      0.87      0.87      1000
weighted avg       0.87      0.87      0.87      1000



**Which metric would be most appropriate to use in this case and why?**

Accuracy is a reasonable metric here because the classes are roughly balanced (526 negative vs. 474 positive test reviews). If neither false positives nor false negatives matter more, then accuracy or the F1 score would suffice.
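
The relationship between these metrics can be made concrete with a small hand computation. The sketch below derives accuracy, precision, recall, and F1 from a toy set of labels; `y_true`, `y_pred`, and `binary_metrics` are hypothetical, and the numbers are unrelated to the report above.

```python
def binary_metrics(y_true, y_pred, positive_label="positive"):
    """Compute accuracy, precision, recall and F1 for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy, balanced labels (5 positive, 5 negative) with one error of each kind
y_true = ["positive"] * 5 + ["negative"] * 5
y_pred = ["positive"] * 4 + ["negative"] + ["negative"] * 4 + ["positive"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With balanced classes and symmetric errors, accuracy and F1 coincide, which is why either is defensible here; the two diverge as the class balance or the error costs become lopsided.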

## **Interpretation Time!**

First let's examine predicted probabilities to see how confident our model was in its predictions...

In [119]:
model = lr.fit(tv_train_features, train_sentiments)

In [138]:
# column 0 corresponds to model.classes_[0], which is 'negative' (labels are sorted alphabetically)
predicted_probas = model.predict_proba(tv_test_features)[:, 0]

In [141]:
predictions_df = pd.DataFrame()
predictions_df['Reviews'] = test_reviews
predictions_df['True Sentiment'] = test_sentiments
predictions_df['Predicted Sentiment'] = lr_tfidf_predictions
predictions_df['Probability_Negative'] = predicted_probas

In [None]:
predictions_df

**Sort predictions_df by predicted probability to find reviews with high, moderate, and low probabilities. Then a) check whether the predictions were correct and b) try to identify which words or phrases led to high or low probabilities of a negative sentiment.**
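
One possible way to approach this is to sort by `Probability_Negative` and inspect the two extremes and the middle. In the notebook, `predictions_df.sort_values('Probability_Negative')` would do the sorting; the sketch below shows the same idea with the standard library on hypothetical rows so that it runs without the dataset. The 0.4 to 0.6 band used for "uncertain" is an arbitrary assumption.

```python
# Hypothetical rows standing in for predictions_df; probabilities are made up.
toy_predictions = [
    {"review": "review A", "true": "negative", "pred": "negative", "p_neg": 0.93},
    {"review": "review B", "true": "positive", "pred": "positive", "p_neg": 0.08},
    {"review": "review C", "true": "positive", "pred": "negative", "p_neg": 0.55},
    {"review": "review D", "true": "negative", "pred": "positive", "p_neg": 0.46},
    {"review": "review E", "true": "positive", "pred": "positive", "p_neg": 0.21},
]

# Sort from most confidently negative to most confidently positive
ranked = sorted(toy_predictions, key=lambda row: row["p_neg"], reverse=True)

# Rows near 0.5 are the ones the model was least sure about (band is an assumption)
uncertain = [row for row in ranked if 0.4 <= row["p_neg"] <= 0.6]
misclassified = [row for row in ranked if row["true"] != row["pred"]]

for row in ranked:
    flag = "WRONG" if row["true"] != row["pred"] else "ok"
    print(f"{row['p_neg']:.2f}  {row['pred']:<8}  {flag}  {row['review']}")
```

In this toy data the errors all fall in the uncertain band; checking whether the same holds for the real predictions is a quick test of how well calibrated the model's confidence is.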

## Topic Modeling for Positive & Negative Sentiments

In [143]:
norm_reviews = norm_train_reviews + norm_test_reviews

### Display Topics for **Positive** Reviews

In [178]:
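# get term-count (bag-of-words) features for only positive reviews
# norm_reviews covers the first 4,500 reviews, so zip pairs them with sentiments[:4500]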
positive_reviews = [review for review, sentiment in zip(norm_reviews, sentiments) if sentiment == 'positive']
ptvf = CountVectorizer(min_df = 0.02, max_df=0.75)
ptvf_features = ptvf.fit_transform(positive_reviews)

In [181]:
lda_tf = LatentDirichletAllocation(n_components=10, random_state=0)
lda_tf.fit(ptvf_features)

LatentDirichletAllocation(random_state=0)

In [182]:
pyLDAvis.sklearn.prepare(lda_tf, ptvf_features, ptvf)

### Display Topics for **Negative** Reviews

In [183]:
# get term-count (bag-of-words) features for only negative reviews
negative_reviews = [review for review, sentiment in zip(norm_reviews, sentiments) if sentiment == 'negative']
ntvf = CountVectorizer(min_df = 0.02, max_df=0.75)
ntvf_features = ntvf.fit_transform(negative_reviews)

In [185]:
lda_tf = LatentDirichletAllocation(n_components=10, random_state=0)
lda_tf.fit(ntvf_features)

LatentDirichletAllocation(random_state=0)

In [186]:
pyLDAvis.sklearn.prepare(lda_tf, ntvf_features, ntvf)

**Examine the 30 most salient terms as well as the topics among the positive and negative reviews. Are there any interesting differences in the most salient terms or topics between positive and negative movie reviews?**