In [None]:
# Initialize Otter Grader
import otter
grader = otter.Notebook()



![data-x](https://raw.githubusercontent.com/afo/data-x-plaksha/master/imgsource/dx_logo.png)


___

#### NAME:

#### STUDENT ID:
___

# NLP for Sentiment Analysis on IMDB Movie Reviews

In this assignment we will explore tools for Natural Language Processing (NLP). Our task is sentiment analysis of movie reviews, and in that context we will touch on several areas:

- Feature engineering
- Bag of words modeling
- Word2Vec modeling

In [1]:
# import Beautiful Soup, NumPy and Pandas, etc
import bs4 as bs
import numpy as np
import pandas as pd
import re
import hashlib
 
# download NLTK resources (tagger, WordNet, stopwords, tokenizer) - these are cached locally on your machine
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

# import NLTK text-processing tools
from nltk.tokenize import sent_tokenize # sentence tokenizer
from nltk.stem import PorterStemmer     # rule-based stemmer
from nltk.tag import pos_tag            # part-of-speech tagging
from nltk.corpus import wordnet         # WordNet lexical database
from nltk.stem import WordNetLemmatizer # WordNet-based lemmatizer
from nltk.corpus import stopwords       # stopword lists
from nltk.util import ngrams            # n-gram iterator

# import word2vec
from gensim.test.utils import datapath
from gensim import utils
from gensim.models import Word2Vec

# import sklearn
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize, FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer

## Part I: Data Loading and Preprocessing

<br>
___

### Data Description
>Data source: https://www.kaggle.com/c/word2vec-nlp-tutorial/data (originally from [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/))<br>
>
>Data Description:<br><br>
>We will be using Kaggle's **Bag of Words Meets Bags of Popcorn** dataset to explore [IMDB](https://www.imdb.com/) movie review data. This dataset is included with the zip file distribution of your homework. The labeled training dataset consists of 25,000 IMDB movie reviews. The sentiment of each review is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1 (no reviews with a rating of 5 or 6 are included in the analysis). No individual movie has more than 30 reviews. The training data set is constructed in a balanced way so that there are an equal number of positive and negative reviews for each movie. There is also an unlabeled test set with 25,000 IMDB movie reviews. We don't use this for testing, but we do use it to improve unsupervised learning.
>
>Data Sets:<br>
>* ```labeledTrainData.tsv``` --> The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id (numerical), sentiment (categorical), and text for each review (textual).<br>
>* ```testData.tsv``` --> The unlabeled test set. 25,000 rows containing an id (numerical), and text for each review (textual). 
>
> Further Reading:<br>
> 
> * [Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf)

In [2]:
# training data
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

In [3]:
# first 5 rows
train.head()

<br>

___


## Preparing data for classification



We have provided the function `review_cleaner` to preprocess reviews. Here is an overview of what it does:

> - Removes HTML tags (using BeautifulSoup)
> - Extracts emoticons (emotion symbols, aka smileys :D )
> - Removes non-letters (using a regular expression)
> - Converts all words to lowercase and tokenizes them (using the `.split()` method on the review string, so that every word in the review is an element in a list)
> - Removes all English stopwords from the list of movie review words
> - Applies either stemming or lemmatization, as indicated by the arguments
> - Joins the words back into one string separated by spaces and appends the emoticons to the end

Note that you do not need to make any changes to `review_cleaner`. We will explore some examples of the cleaning process below.
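
The steps listed above can be sketched with the standard library alone. This is a simplified illustration, not the `review_cleaner` used in this notebook: `toy_cleaner` and `TOY_STOPWORDS` are hypothetical names, a regular expression stands in for BeautifulSoup, the stopword list is a tiny made-up subset, and stemming/lemmatization is left out.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# the notebook uses NLTK's full English stopword list.
TOY_STOPWORDS = {"the", "a", "is", "was", "this", "and", "it"}

def toy_cleaner(review):
    # 1. Strip HTML tags (assumption: a regex stands in for BeautifulSoup)
    text = re.sub(r"<[^>]+>", " ", review)
    # 2. Collect emoticons before punctuation is removed
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # 3. Replace every non-letter character with a space
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # 4. Lowercase and split into words, 5. drop stopwords
    words = [w for w in text.lower().split() if w not in TOY_STOPWORDS]
    # 6. Rejoin, appending the emoticons at the end
    return " ".join(words + emoticons)

print(toy_cleaner("This movie was <b>GREAT</b> :) and the cast is fun!"))
```

Note that the emoticons have to be extracted before step 3, because removing non-letters would otherwise destroy them.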

<br>


In [4]:
ps = PorterStemmer()
wnl = WordNetLemmatizer()
eng_stopwords = set(stopwords.words("english"))

def review_cleaner(review, lemmatize=True, stem=False):
    '''
        Clean and preprocess a review.
            1. Remove HTML tags
            2. Extract emoticons
            3. Use regex to remove all special characters (only keep letters)
            4. Make strings to lower case and tokenize / word split reviews
            5. Remove English stopwords
            6. Lemmatize or stem (depending on the flags)
            7. Rejoin to one string
        
        @review (type:str) is an unprocessed review string
        @lemmatize (type:bool) applies WordNet lemmatization if True
        @stem (type:bool) applies Porter stemming if True (not together with lemmatize)
        @return (type:str) is the preprocessed review string
    '''

    

    if lemmatize == True and stem == True:
        raise RuntimeError("May not pass both lemmatize and stem flags")

    #1. Remove HTML tags
    review = bs.BeautifulSoup(review).text    

    #2. Use regex to find emoticons
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', review)

    #3. Replace every non-letter character with a space
    review = re.sub("[^a-zA-Z]", " ",review)

    #4. Tokenize into words (all lower case)
    review = review.lower().split()

    #5. Remove stopwords, Lemmatize, Stem
    clean_review=[]
    for word in review:
        if word not in eng_stopwords:
            if lemmatize is True:
                word=wnl.lemmatize(word)
            elif stem is True:
                if word == 'oed':
                    continue
                word=ps.stem(word)
            clean_review.append(word)

    #6. Join the review to one sentence
    review_processed = ' '.join(clean_review+emoticons)
    
    return review_processed

# Explore text cleaning

To make things interesting, everyone gets to analyze a different review. Set `seed_value` to your favorite number, your name, or whatever else you'd like.
<!--
BEGIN QUESTION
name: q0_set_seed
manual: false
points: 1
-->

In [5]:
# Your code here
seed_value = ...

In [None]:
grader.check("q0_set_seed")

In [7]:
# Print the raw text of the review selected by hashing seed_value
my_review_id = int(hashlib.md5(str(seed_value).encode("utf-8")).hexdigest()[:8], 16) % len(train.index)
my_review = train.iloc[my_review_id]["review"]
print(my_review)
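
The cell above turns `seed_value` into a row index by hashing it. The sketch below isolates that step so you can see why the same seed always selects the same review; `seed_to_index` is a hypothetical helper name, and 25,000 is the size of the labeled training set described earlier.

```python
import hashlib

def seed_to_index(seed_value, n_rows):
    # Hash the string form of the seed, keep the first 8 hex digits,
    # read them as an integer, and wrap into the valid row range
    digest = hashlib.md5(str(seed_value).encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % n_rows

idx = seed_to_index("my favorite number", 25000)
print(idx)
```

Because the seed is converted with `str()` first, the integer `42` and the string `"42"` select the same review.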

### Question 1 - Find the stopwords

By manual inspection, find the first 5 stopwords in your chosen review. It might seem easier to write the code to do this, but the point of the exercise is to understand what the algorithm is doing.

First review the list of stopwords below:

In [8]:
# See what the stopwords are
print(" ".join(stopwords.words("english")))

Inspect the review and look for the first 5 stopwords. Store them in `first_5_stopwords` in the order in which they appear in the review.

e.g., 
```
first_5_stopwords = ['having', 'the', 'to', 'some', 'of']
```
<!--
BEGIN QUESTION
name: q1_stopwords_type
manual: false
points: 1
-->

In [9]:
# Your code here
first_5_stopwords = ...

In [None]:
grader.check("q1_stopwords_type")

<!--
BEGIN QUESTION
name: q1_stopwords_length
manual: false
points: 0
-->

In [None]:
grader.check("q1_stopwords_length")

<!--
BEGIN QUESTION
name: q1_stopwords_match
manual: false
points: 1
-->

In [None]:
grader.check("q1_stopwords_match")

### Question 2 - Lemmatization

Lemmatization maps the inflected forms of a word to a common dictionary form (the lemma), so that they can be grouped together.

Here are some examples of lemmatization:
* images -> image
* waxworks -> waxwork
* sweets -> sweet
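
To build intuition for examples like these, here is a toy sketch of dictionary-checked suffix rules for plural nouns. It is only loosely in the spirit of a WordNet-style lemmatizer and is not the algorithm behind `wnl.lemmatize`; `TOY_DICTIONARY`, `NOUN_RULES`, and `toy_lemmatize` are hypothetical names made up for this illustration.

```python
# Hypothetical mini-dictionary; WordNet plays this role in the notebook.
TOY_DICTIONARY = {"image", "waxwork", "sweet", "class", "story"}

# (suffix, replacement) rules for plural nouns, tried in order
NOUN_RULES = [("ies", "y"), ("ses", "s"), ("s", "")]

def toy_lemmatize(word):
    # A word that is already a dictionary form is returned unchanged
    if word in TOY_DICTIONARY:
        return word
    for suffix, replacement in NOUN_RULES:
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)] + replacement
            # Accept a candidate only if it is a real dictionary word
            if candidate in TOY_DICTIONARY:
                return candidate
    # No rule produced a known word: leave the input as it is
    return word

for w in ["images", "waxworks", "sweets", "class", "nonsense"]:
    print("{} -> {}".format(w, toy_lemmatize(w)))
```

The dictionary check is what keeps the output a valid word: `class` is not shortened to `clas`, and a word with no matching rule passes through untouched.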

By manual inspection, find the first 3 words in `my_review` that are lemmatized. Store them in `first_3_lemmatized` in the order in which they appear in the review.

E.g.:
```
first_3_lemmatized = ['images', 'waxworks', 'sweets']
```

<!--
BEGIN QUESTION
name: q2_lemmatization_type
manual: false
points: 0
-->

In [13]:
# Your code here
first_3_lemmatized = ...

print("Lemmatization examples:")
for w in first_3_lemmatized:
    print("{} -> {}".format(w, wnl.lemmatize(w)))

In [None]:
grader.check("q2_lemmatization_type")

<!--
BEGIN QUESTION
name: q2_lemmatization_length
manual: false
points: 0
-->

In [None]:
grader.check("q2_lemmatization_length")

<!--
BEGIN QUESTION
name: q2_lemmatization_match
manual: false
points: 1
-->

In [None]:
grader.check("q2_lemmatization_match")

### Question 3 - Stemming

Stemming reduces a word to its stem by stripping suffixes with a fixed set of rules. Unlike lemmatization, the result is not always a dictionary word.

Here are some examples of stemming:
* nonsense -> nonsens
* investigates -> investig
* disappearance -> disappear
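
The following toy sketch shows the general idea of rule-based suffix stripping. It is not the Porter algorithm used by `ps.stem`, which applies a much larger, multi-step rule set; `TOY_SUFFIXES` and `toy_stem` are hypothetical names, and the suffix list was chosen only so that the three examples above come out the same.

```python
# Hypothetical suffix list, longest suffixes first
TOY_SUFFIXES = ["ates", "ance", "ing", "e", "s"]

def toy_stem(word):
    for suffix in TOY_SUFFIXES:
        # Strip the first matching suffix, but keep at least 3 letters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)]
    return word

for w in ["nonsense", "investigates", "disappearance", "the"]:
    print("{} -> {}".format(w, toy_stem(w)))
```

No dictionary is consulted here, which is why a stem such as `nonsens` can be a string that is not an English word.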

By manual inspection, find the first 3 words in `my_review` that are modified by stemming. Store them in `first_3_stemmed` in the order in which they appear in the review.

E.g.:
```
first_3_stemmed = ['nonsense', 'investigates', 'disappearance']
```

<!--
BEGIN QUESTION
name: q3_stemming_type
manual: false
points: 0
-->

In [17]:
# Your code here
first_3_stemmed = ...

print("Stemming examples:")
for w in first_3_stemmed:
    print("{} -> {}".format(w, ps.stem(w)))

In [None]:
grader.check("q3_stemming_type")

<!--
BEGIN QUESTION
name: q3_stemming_length
manual: false
points: 0
-->

In [None]:
grader.check("q3_stemming_length")

<!--
BEGIN QUESTION
name: q3_stemming_match
manual: false
points: 0
-->

In [None]:
grader.check("q3_stemming_match")

<br>

___

## Part II: Train and validate a sentiment analysis model using a Random Forest Classifier

In this section we have written the code to train the classifier for you. Your task will be to explore its performance characteristics with your own movie reviews.
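
Before reading the training code, it may help to see what a bag of words representation is. The sketch below is a minimal standard-library illustration, assuming whitespace tokenization and unigrams only; `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn's `CountVectorizer`, although `max_features` plays a similar role.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    # Count every word across all documents
    counts = Counter(word for doc in docs for word in doc.split())
    # Keep only the most frequent words, in alphabetical column order
    return sorted(word for word, _ in counts.most_common(max_features))

def transform(docs, vocabulary):
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in doc.split():
            # Words outside the vocabulary are silently ignored
            if word in index:
                row[index[word]] += 1
        rows.append(row)
    return rows

docs = ["great movie great cast", "bad movie", "great fun"]
vocab = fit_vocabulary(docs, max_features=2)
bags = transform(docs, vocab)
print(vocab)
print(bags)
```

Each review becomes a fixed-length vector of word counts. Word order is discarded, and any word that did not make it into the vocabulary contributes nothing.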

In [21]:
# We vectorize the text using a bag of words model
def get_vectorizer(ngram, max_features):
    return CountVectorizer(ngram_range=(1, ngram),
                             analyzer = "word",
                             tokenizer = None,
                             preprocessor = review_cleaner,
                             stop_words = None, 
                             max_features = max_features)

# Model training
def train_predict_sentiment(reviews, vectorizer, y=train["sentiment"], ngram=1, max_features=1000, model_random_state=123):
    '''
        This function will:
            1. split data into train and test set.
            2. get n-gram counts from cleaned reviews 
            3. train a random forest model using train n-gram counts and y (labels)
            4. test the model on your test split
            5. print accuracy of sentiment prediction on test and training data
            6. print confusion matrix on test data results

            The n-gram range and the number of features are set when the vectorizer
            is built (see get_vectorizer); the ngram and max_features arguments of
            this function are not used.
            
            @reviews (type:iterable of str) are the raw review strings
            @vectorizer is a transformer that provides fit_transform() and transform()
            @return the fitted RandomForestClassifier
    '''

    print("Creating the bag of words model!\n")
    
    # train / test split
    X_train, X_test, y_train, y_test = train_test_split(reviews, y, random_state=0, test_size=.2)

    # Then we use fit_transform() to fit the model / learn the vocabulary,
    # then transform the data into feature vectors.
    # The input should be a list of strings. .toarray() converts to a numpy array
    
    train_bag = vectorizer.fit_transform(X_train)
    if not isinstance(train_bag, np.ndarray):
        train_bag = train_bag.toarray()
    test_bag = vectorizer.transform(X_test)
    if not isinstance(test_bag, np.ndarray):
        test_bag = test_bag.toarray()

    print("Training the random forest classifier!\n")
    # Initialize a Random Forest classifier with 50 trees
    forest = RandomForestClassifier(n_estimators = 50, random_state = model_random_state) 

    # Fit the forest to the training set, using the bag of words as 
    # features and the sentiment labels as the target variable
    forest = forest.fit(train_bag, y_train)

    # predict
    train_predictions = forest.predict(train_bag)
    test_predictions = forest.predict(test_bag)
    
    # validation
    train_acc = metrics.accuracy_score(y_train, train_predictions)
    valid_acc = metrics.accuracy_score(y_test, test_predictions)
    
    print(" The training accuracy is: ", train_acc, "\n", "The validation accuracy is: ", valid_acc)
    print()
    print('CONFUSION MATRIX:')
    print('         Predicted')
    print('          neg pos')
    print(' Actual')
    c=confusion_matrix(y_test, test_predictions)
    print('     neg  ',c[0])
    print('     pos  ',c[1])

    return forest

# Print out the top features
def top_features(forest, vectorizer, n):
    # Extract feature names and their importance scores
    print('\nTOP {} IMPORTANT FEATURES:'.format(n))
    feature_text = vectorizer.get_feature_names()
    feature_importance = forest.feature_importances_
    
    # Sort feature indices from most to least important
    indices = np.argsort(feature_importance)[::-1]
    
    top_n_ind = indices[:n]
    top_n = [feature_text[ind] for ind in top_n_ind]
    
    print(top_n)

# Print out whether the prediction is accurate
def check_prediction(model, vectorizer, review, expected):
    prediction = model.predict(vectorizer.transform([review]))[0]
    sentiment = "👍" if prediction else "👎"
    correct = "\x1b[92mcorrect\x1b[0m" if prediction == expected else "\x1b[31mincorrect\x1b[0m"
    print("{} ⟶ {} {}".format(review, sentiment, correct))

<br>

## Train Random Forest Classifier Model

In [22]:
# Train RFC model
vectorizer = get_vectorizer(ngram=1, max_features=100)
forest_model = train_predict_sentiment(reviews=train["review"], vectorizer=vectorizer, y=train["sentiment"])
top_features(forest_model, vectorizer, 20)
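
To read the confusion matrix printed by the training cell, recall that rows are the actual labels and columns are the predicted labels, with negative (0) first and positive (1) second. The sketch below rebuilds such a matrix by hand on made-up labels; `confusion_counts` and `accuracy` are hypothetical helpers, not the scikit-learn functions used in the notebook.

```python
def confusion_counts(y_true, y_pred):
    # matrix[actual][predicted], with 0 = negative and 1 = positive
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

def accuracy(matrix):
    # Correct predictions sit on the diagonal
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels for illustration only
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

m = confusion_counts(y_true, y_pred)
print("neg ", m[0])
print("pos ", m[1])
print("accuracy:", accuracy(m))
```

The off-diagonal cells are the two kinds of error: `m[0][1]` counts negative reviews predicted as positive, and `m[1][0]` counts positive reviews predicted as negative.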

### Question 4 - Construct a positive sentiment review

Think of a movie that you like and write a review for it. Store it as a string in `good_review`. If the model doesn't give a positive prediction for your review, iterate on it until it does.

<!--
BEGIN QUESTION
name: q4_positive_review_type
manual: false
points: 0
-->

In [23]:
# Your code here
good_review = ...
check_prediction(forest_model, vectorizer, good_review, 1)

In [None]:
grader.check("q4_positive_review_type")

<!--
BEGIN QUESTION
name: q4_positive_review_prediction
manual: false
points: 1
-->

In [None]:
grader.check("q4_positive_review_prediction")

### Question 5 - Construct a negative sentiment review

Think of a movie that you dislike and write a review for it. Store it as a string in `bad_review`. If the model doesn't give a negative prediction for your review, iterate on it until it does.

<!--
BEGIN QUESTION
name: q5_negative_review_type
manual: false
points: 0
-->

In [26]:
# Your code here
bad_review = ...
check_prediction(forest_model, vectorizer, bad_review, 0)

In [None]:
grader.check("q5_negative_review_type")

<!--
BEGIN QUESTION
name: q5_negative_review_prediction
manual: false
points: 1
-->

In [None]:
grader.check("q5_negative_review_prediction")

### Question 6 - Construct a misclassified negative sentiment review

Now try to write a review that you view as negative but the model views as positive. Iterate and experiment as necessary, and store it as a string in `bad_review_error`.

<!--
BEGIN QUESTION
name: q6_misclassified_review_type
manual: false
points: 0
-->

In [29]:
# Your code here
bad_review_error = ...
check_prediction(forest_model, vectorizer, bad_review_error, 0)

In [None]:
grader.check("q6_misclassified_review_type")

<!--
BEGIN QUESTION
name: q6_misclasified_review_prediction
manual: false
points: 1
-->

In [None]:
grader.check("q6_misclasified_review_prediction")

## Part III: Word2Vec Model

In [32]:
w2v_model = Word2Vec(sentences=[utils.simple_preprocess(review) for review in train['review']], size=100, seed=123, workers=1)

### Question 7 - Explain Word2Vec similarity on display below

In [33]:
w2v_model.wv.most_similar(positive=['actress'])
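
The similarity scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows the ranking idea on hand-made 3-dimensional vectors; `TOY_VECTORS` and `toy_most_similar` are hypothetical, and the numbers are invented for illustration rather than taken from `w2v_model`.

```python
import math

def cosine(u, v):
    # Cosine of the angle between two vectors: 1 = same direction
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy embeddings, not learned from any data
TOY_VECTORS = {
    "actress": [0.9, 0.8, 0.1],
    "actor":   [0.9, -0.2, 0.1],
    "plot":    [-0.2, 0.1, 0.9],
}

def toy_most_similar(word, vectors):
    query = vectors[word]
    # Score every other word against the query and sort, best first
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = toy_most_similar("actress", TOY_VECTORS)
print(ranking)
```

In a trained model, words end up with similar vectors when they occur in similar contexts, which is a useful starting point for explaining the neighbors of 'actress'.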

#### Pick one word similar to 'actress' and explain why it appears

<!--
BEGIN QUESTION
name: q7_word2vec_similar_actress
manual: true
points: 1
-->
<!-- EXPORT TO PDF -->

*Your answer here*

### Question 8 - Explain Word2Vec comparison on display below

In [34]:
w2v_model.wv.most_similar(positive=['actress'], negative=['actor'])
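
With both `positive` and `negative` words, the query is, roughly, the unit vector of each positive word minus the unit vector of each negative word, and candidates are again ranked by cosine similarity to that combined vector. The sketch below illustrates this on invented 3-dimensional vectors; `TOY_VECTORS` and `toy_analogy` are hypothetical and the numbers are not taken from `w2v_model`.

```python
import math

def unit(v):
    # Scale a vector to length 1
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

# Invented toy embeddings, not learned from any data
TOY_VECTORS = {
    "actress": [0.9, 0.8, 0.1],
    "actor":   [0.9, -0.2, 0.1],
    "woman":   [0.1, 0.9, 0.0],
    "man":     [0.1, -0.9, 0.0],
    "plot":    [-0.2, 0.1, 0.9],
}

def toy_analogy(positive, negative, vectors):
    # Combine: add unit vectors of positive words, subtract negative ones
    query = [0.0] * len(next(iter(vectors.values())))
    for word in positive:
        query = [q + a for q, a in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - a for q, a in zip(query, unit(vectors[word]))]
    # Rank all words that were not part of the query itself
    excluded = set(positive) | set(negative)
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in excluded]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = toy_analogy(["actress"], ["actor"], TOY_VECTORS)
print(ranking)
```

Subtracting 'actor' cancels what the two words have in common, so the top results reflect whatever distinguishes 'actress' from 'actor' in the vector space.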

#### Pick one word similar to 'actress' and dissimilar to 'actor', and explain why it appears

<!--
BEGIN QUESTION
name: q8_word2vec_similar_actress_dissimilar_actor
manual: true
points: 1
-->
<!-- EXPORT TO PDF -->

*Your answer here*

## Fit the Word2Vec model

In [35]:
def get_avg_feature_vecs(reviews, model):
    # Given a collection of raw review strings, calculate
    # the average word vector for each one

    
    # index2word is a list that contains the words in
    # the model's vocabulary. Convert it to a set, for speed 
    index2word_set = set(model.wv.index2word)
    
    reviewFeatureVecs = []
    # Loop through the reviews
    for counter, review in enumerate(reviews):
        
        # Print a status message every 5000th review
        if (counter + 1) % 5000 == 0:
            print("Review %d of %d" % (counter + 1, len(reviews)))

        # Collect the vectors of all in-vocabulary words in this review
        featureVec = []

        # Loop over each word in the review and, if it is in the model's
        # vocabulary, add its feature vector to the list
        for word in utils.simple_preprocess(review):
            if word in index2word_set: 
                featureVec.append(model.wv[word])

        # Average the word vectors for the review; fall back to a zero
        # vector if none of its words are in the vocabulary
        if featureVec:
            featureVec = np.mean(featureVec, axis=0).reshape(1,-1)
        else:
            featureVec = np.zeros((1, model.wv.vector_size))

        reviewFeatureVecs.append(featureVec)

    return np.concatenate(reviewFeatureVecs, axis=0)

w2v_vectorizer = FunctionTransformer(lambda x: get_avg_feature_vecs(x, w2v_model))
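
The averaging idea in the cell above can be seen on a tiny example. This is a standard-library sketch with invented 2-dimensional vectors; `TOY_VECTORS` and `average_vector` are hypothetical, and returning a zero vector when no word is known is an assumption of this sketch.

```python
# Invented toy embeddings, not learned from any data
TOY_VECTORS = {
    "great": [1.0, 0.0],
    "movie": [0.0, 1.0],
    "bad":   [-1.0, 0.0],
}

def average_vector(review, vectors, size=2):
    # Keep only the vectors of words that are in the vocabulary
    known = [vectors[w] for w in review.lower().split() if w in vectors]
    if not known:
        # Assumption: represent a review with no known words as zeros
        return [0.0] * size
    # Average each dimension across the known word vectors
    return [sum(column) / len(known) for column in zip(*known)]

print(average_vector("Great great movie unknownword", TOY_VECTORS))
print(average_vector("xyz", TOY_VECTORS))
```

Every review maps to a vector of the same length regardless of how many words it contains, which is what lets the random forest consume it. As with bag of words, word order is lost in the average.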

In [36]:
v2v_forest_model = train_predict_sentiment(reviews=train["review"], vectorizer=w2v_vectorizer, y=train["sentiment"])

### Question 9 - compare Word2Vec to Bag of Words

Comment on how Word2Vec compares with the Bag of Words Model. Please use the template below for your answer

<!--
BEGIN QUESTION
name: q9_word2vec_comparison
manual: true
points: 2
-->
<!-- EXPORT TO PDF -->

*Complete answers below*:

* **Is it an improvement?**

*your answer here*

* **How significant is the difference?**

*your answer here*

* **Is this a fair and meaningful comparison? Why or why not?**

*your answer here*

* **What other experiments might you run to further compare?**

*your answer here*

## Add more training data

You will now try to further improve the performance of the Word2Vec model by enhancing it with unlabeled data. 
<!--
BEGIN QUESTION
name: q10_word2vec_train_more
manual: true
points: 2
-->
<!-- EXPORT TO PDF -->

In [37]:
# Load unlabeled test data set
more_training_data = pd.read_csv("testData.tsv", header=0,
                                 delimiter="\t", quoting=3)

In [38]:
# View first 5 rows
more_training_data.head()

In [39]:
# Update model so that it learns from the reviews in more_training_data.
# This involves two steps: 1/ updating the vocabulary, 2/ training the model.
# Be sure to preprocess the reviews. Your code should modify and update w2v_model

# Your code here
...

In [40]:
w2v_forest_model = train_predict_sentiment(reviews=train["review"], vectorizer=FunctionTransformer(lambda x: get_avg_feature_vecs(x, w2v_model)), y=train["sentiment"])

### Question 10 - Comment on the impact of more data

*Complete answers below*:

* **Did adding more training data improve the model?**

*your answer here*

* **How significant is the difference?**

*your answer here*

* **Why could one expect a model to improve even when provided with unlabeled data?**

*your answer here*

## Word2Vec Prediction Analysis

Check to see how the Word2Vec model works on the reviews that you wrote previously.

In [41]:
check_prediction(w2v_forest_model, w2v_vectorizer, good_review, 1)

In [42]:
check_prediction(v2v_forest_model, w2v_vectorizer, bad_review, 0)

In [43]:
print("With Bag of Words:")
check_prediction(forest_model, vectorizer, bad_review_error, 0)

print("With Word2Vec:")
check_prediction(v2v_forest_model, w2v_vectorizer, bad_review_error, 0)

### Question 11 - how does the Word2Vec model compare to Bag of Words?
<!--
BEGIN QUESTION
name: q11_word2vec_comment
manual: true
points: 1
-->

Just comment, don't change your reviews to achieve a particular outcome.
<!-- EXPORT TO PDF -->

*Complete answers below:*

* **Is your positive review classified correctly by Word2Vec?**

*your answer here*

* **Is your negative review classified correctly by Word2Vec?**

*your answer here*

* **Is your negative review misclassified by Bag of Words now classified correctly by Word2Vec?**

*your answer here*

### Question 12 - create a review where the models disagree
<!--
BEGIN QUESTION
name: q12_word2vec_split_decision_exists
manual: false
points: 0
-->

If your originally misclassified negative review was properly classified by Word2Vec you may use it to answer this question.
If not, construct some other review that is properly classified by one model and improperly classified by the other model. Store that review as a string in `split_prediction`. Store the expected prediction in `split_prediction_expected` as 1 for positive sentiment or 0 for negative sentiment.

In [44]:
# Your code here
split_prediction = ...
split_prediction_expected = ...

print("With Bag of Words:")
check_prediction(forest_model, vectorizer, split_prediction, split_prediction_expected)

print("With Word2Vec:")
check_prediction(v2v_forest_model, w2v_vectorizer, split_prediction, split_prediction_expected)

<!--
BEGIN QUESTION
name: q12_word2vec_split_decision_defined
manual: true
points: 1
-->
<!-- EXPORT TO PDF -->

In [None]:
grader.check("q12_word2vec_split_decision_defined")

<!--
BEGIN QUESTION
name: q12_word2vec_split_decision_predict
manual: false
points: 1
-->

In [None]:
grader.check("q12_word2vec_split_decision_predict")

# Submit
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output.
**Please save before submitting!**

<!-- EXPECT 6 EXPORTED QUESTIONS -->