In [None]:
# Initialize Otter Grader
import otter
grader = otter.Notebook()



![data-x](https://raw.githubusercontent.com/afo/data-x-plaksha/master/imgsource/dx_logo.png)


___

#### NAME:

#### STUDENT ID:
___

# HW3-4: NLP (Text Processing and Feature Engineering & Text Representation)
**(Total 120 points)**

# NLP for Sentiment Analysis on IMDB Movie Reviews

In this assignment we will be exploring tools for Natural Language Processing (NLP). Our task is sentiment analysis for movie reviews and in that context we will touch upon multiple areas:

- Feature engineering
- Bag of words modeling
- Word2Vec modeling

Run the following cell to install the packages you need for this assignment.

In [1]:
!pip install gensim

Run the following cell to load the required modules.

In [2]:
# import Beautiful Soup, NumPy and Pandas, etc
import bs4 as bs
import numpy as np
import pandas as pd
import re
import hashlib
 
# download NLTK resources (tagger model and corpora) - these are cached locally on your machine
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

# import NLTK text-processing tools
from nltk.tokenize import sent_tokenize # splits text into sentences
from nltk.stem import PorterStemmer     # rule-based stemmer
from nltk.tag import pos_tag            # part-of-speech tagging
from nltk.corpus import wordnet         # WordNet lexical database
from nltk.stem import WordNetLemmatizer # WordNet-based lemmatizer
from nltk.corpus import stopwords       # stopword lists
from nltk.util import ngrams            # n-gram iterator

# import word2vec
from gensim.test.utils import datapath
from gensim import utils
from gensim.models import Word2Vec

# import sklearn
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize, FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer

## 1. Data Loading and Preprocessing
**(Total 70 points)**

<br>
___

### Data Description
>Data source: https://www.kaggle.com/c/word2vec-nlp-tutorial/data (originally from [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/))<br>
>
>Data Description:<br><br>
>We will be using Kaggle's **Bag of Words Meets Bags of Popcorn** dataset to explore [IMDB](https://www.imdb.com/) movie review data. The labeled training dataset consists of 25,000 IMDB movie reviews. The sentiment of each review is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1 (no reviews with a rating of 5 or 6 are included in the analysis). No individual movie has more than 30 reviews. The training data set is constructed in a balanced way so that there are an equal number of positive and negative reviews for each movie.
>
>Data Set:<br>
>* ```labeledTrainData.tsv``` --> The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id (numerical), sentiment (categorical), and text for each review (textual).<br>
>
>
> Further Reading:<br>
> 
> * [Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf)

In [3]:
# training data
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

In [4]:
# first 5 rows
train.head()

### 1.a Clean function
**(Total 40 points)**

Finish the function `review_cleaner` to preprocess reviews. Here is an overview of what it does:

> - Removes HTML tags (using beautifulsoup)
> - Extracts emoticons (emotion symbols, aka smileys :D )
> - Removes non-letters (using a regular expression)
> - Converts all words to lowercase letters and tokenizes them (using the .split() method on the review strings, so that every word in the review is an element in a list)
> - Removes all the English stopwords from the list of movie review words
> - Applies either stemming or lemmatization, as indicated by the arguments
> - Joins the words back into one string separated by spaces and appends the emoticons to the end

More details can be found in the introduction [slides](https://datax.berkeley.edu/wp-content/uploads/2020/06/Module-1_-Preprocessing-Slides.pdf).
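
As a rough illustration of the emoticon and non-letter steps listed above, here is a minimal sketch on a toy string. The names `EMOTICON_PATTERN` and `split_text_and_emoticons` are hypothetical and invented for this example, and the pattern only covers a handful of common smileys. It is not the reference solution for `review_cleaner`.

```python
import re

# Hypothetical pattern: eyes (: ; =), optional nose (-), a few mouths ( ) ( D P.
EMOTICON_PATTERN = re.compile(r"[:;=]-?[)(DP]")


def split_text_and_emoticons(text):
    """Return (lowercase word tokens, emoticons) for a toy string."""
    # Collect the emoticons before anything else is stripped.
    emoticons = EMOTICON_PATTERN.findall(text)
    # Blank out the emoticons, then every remaining non-letter character.
    without_emoticons = EMOTICON_PATTERN.sub(" ", text)
    letters_only = re.sub(r"[^a-zA-Z]", " ", without_emoticons)
    tokens = letters_only.lower().split()
    return tokens, emoticons


tokens, emoticons = split_text_and_emoticons("Great movie :) Loved it!! :-D")
print(tokens)     # ['great', 'movie', 'loved', 'it']
print(emoticons)  # [':)', ':-D']
```

Blanking out the emoticons before stripping non-letters keeps the `D` of `:-D` from surviving as a stray token.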




<!--
BEGIN QUESTION
name: q1_a_0
manual: false
points: 10
-->

In [5]:
ps = PorterStemmer()
wnl = WordNetLemmatizer()
eng_stopwords = set(stopwords.words("english"))


def review_cleaner(review, lemmatize=True, stem=False):
    '''
        Clean and preprocess a review.
            1. Remove HTML tags
            2. Extract emoticons
            3. Use regex to remove all special characters (only keep letters)
            4. Convert to lower case and tokenize / word split reviews
            5. Remove English stopwords
            6. Lemmatize or stem, depending on the flags
            7. Rejoin to one string, with the emoticons appended at the end
        
        @review (type:str) is an unprocessed review string
        @lemmatize (type:bool) lemmatize each word if True
        @stem (type:bool) stem each word if True (cannot be combined with lemmatize)
        @return (type:str) is the preprocessed review string
    '''

    

    if lemmatize == True and stem == True:
        raise RuntimeError("May not pass both lemmatize and stem flags")

    #1. Remove HTML tags
    review = ...

    #2. Use regex to find emoticons
    emoticons = ...

    #3. Remove non-letters (punctuation, digits, other special characters)
    review = ...

    #4. Tokenize into words (all lower case)
    review = ...

    #5. Remove stopwords, Lemmatize, Stem
    ### YOUR CODE HERE ###
    ...
    
    #6. Join the review to one sentence
    review_processed = ...
    
    return review_processed

In [None]:
grader.check("q1_a_0")


<!--
BEGIN QUESTION
name: q1_a_1
manual: false
points: 10
-->

In [7]:
# test for HTML Tags (10 points)

In [None]:
grader.check("q1_a_1")


<!--
BEGIN QUESTION
name: q1_a_2
manual: false
points: 10
-->

In [9]:
# test for emoticons (10 points)

In [None]:
grader.check("q1_a_2")


<!--
BEGIN QUESTION
name: q1_a_3
manual: false
points: 10
-->

In [11]:
# test for stopwords, Lemmatize, Stem (10 points)

In [None]:
grader.check("q1_a_3")

### 1.b Set up your review
**(Total 0 points)**

To make things interesting, everyone gets to analyze a different review. Set `seed_value` to your favorite number, your name, or whatever else you'd like.
<!--
BEGIN QUESTION
name: q1_b
manual: false
points: 0
-->

In [13]:
seed_value = ...

In [None]:
grader.check("q1_b")

In [15]:
# Print out the raw (uncleaned) review selected by your seed value
my_review_id = int(hashlib.md5(str(seed_value).encode("utf-8")).hexdigest()[:8], 16) % len(train.index)
my_review = train.iloc[my_review_id]["review"]
print(my_review)
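
To see why the same seed always selects the same review, the sketch below isolates the hashing step used in the cell above. The helper name `seed_to_index` is hypothetical, and the row count of 25,000 is taken from the data description.

```python
import hashlib


def seed_to_index(seed_value, n_rows):
    """Map any seed (number, name, ...) to a row index in [0, n_rows)."""
    # md5 gives a fixed hex digest for a given input string.
    digest = hashlib.md5(str(seed_value).encode("utf-8")).hexdigest()
    # Read the first 8 hex characters as an integer and wrap it into range.
    return int(digest[:8], 16) % n_rows


first = seed_to_index("my favorite number", 25000)
second = seed_to_index("my favorite number", 25000)
print(first == second)       # True: the mapping is deterministic
print(0 <= first < 25000)    # True: the index is always a valid row
```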

### 1.c Find the stopwords

**(Total 10 points)**

Find the first 5 stopwords in your chosen review. 

First review the list of stopwords below:

In [16]:
# See what the stopwords are
print(" ".join(stopwords.words("english")))

For your selected review, find the 5 first stopwords. Store them in the list named `first_5_stopwords` in the order in which they appear in the review.

e.g., 
```
first_5_stopwords = ['having', 'the', 'to', 'some', 'of']
```
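
One way to check an answer like this by hand is sketched below on a made-up sentence. The helper `first_k_matching` and the tiny `toy_stopwords` set are hypothetical; the real list comes from `stopwords.words("english")`. The sketch keeps repeated stopwords, which is an assumption about how "first 5" should be counted.

```python
import re


def first_k_matching(text, vocabulary, k):
    """Return the first k tokens of `text` found in `vocabulary`, in order of appearance."""
    # Keep letters only, lower-case, and split on whitespace.
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return [tok for tok in tokens if tok in vocabulary][:k]


toy_stopwords = {"the", "a", "of", "to", "and", "is"}
sample = "The plot of the film is a tribute to silent cinema."
print(first_k_matching(sample, toy_stopwords, 5))  # ['the', 'of', 'the', 'is', 'a']
```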


<!--
BEGIN QUESTION
name: q1_c
manual: false
points: 10
-->

In [17]:
first_5_stopwords = ...
first_5_stopwords

In [None]:
grader.check("q1_c")

### 1.d Lemmatization

**(Total 10 points)**

Lemmatization reduces a word to its dictionary form (its lemma), so that inflected forms of the same word are grouped together.

Here are some examples of lemmatization:
* images -> image
* waxworks -> waxwork
* sweets -> sweet

Find the first 3 words in `my_review` that are changed by lemmatization. Store them in the list named `first_3_lemmatized` in the order in which they appear in the review.

E.g.:
```
first_3_lemmatized = ['images', 'waxworks', 'sweets']
```
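
The idea of "words that are changed by a transformation" can be sketched without NLTK. Here `toy_lemmatize` is a hypothetical stand-in backed by a three-entry lookup table, not the WordNet lemmatizer, and `first_k_changed` is a helper invented for this example.

```python
def first_k_changed(tokens, transform, k):
    """Return the first k tokens for which transform(token) differs from the token."""
    changed = []
    for tok in tokens:
        if transform(tok) != tok:
            changed.append(tok)
        if len(changed) == k:
            break
    return changed


# Toy lookup table standing in for a real lemmatizer.
toy_lemmas = {"images": "image", "waxworks": "waxwork", "sweets": "sweet"}


def toy_lemmatize(word):
    return toy_lemmas.get(word, word)


tokens = "the images in the waxworks museum sell sweets and tea".split()
print(first_k_changed(tokens, toy_lemmatize, 3))  # ['images', 'waxworks', 'sweets']
```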


<!--
BEGIN QUESTION
name: q1_d
manual: false
points: 10
-->

In [21]:
first_3_lemmatized = ...

print("Lemmatization examples:")
for w in first_3_lemmatized:
    print("{} -> {}".format(w, wnl.lemmatize(w)))

In [None]:
grader.check("q1_d")

### 1.e Stemming

**(Total 10 points)**

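Stemming strips suffixes with heuristic rules so that related forms of a word share a common stem. Unlike a lemma, a stem does not have to be a dictionary word.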

Here are some examples of stemming:
* nonsense -> nonsens
* investigates -> investig
* disappearance -> disappear

Find the first 3 words in `my_review` that are modified by stemming. Store them in the list named `first_3_stemmed` in the order in which they appear in the review.

E.g.:
```
first_3_stemmed = ['nonsense', 'investigates', 'disappearance']
```
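
To make the contrast with lemmatization concrete, here is a deliberately crude suffix stripper. `toy_stem` is a hypothetical function written for this example. It is not the Porter algorithm, and its output will differ from `ps.stem` on many words.

```python
def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ance", "ing", "es", "s", "e"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["nonsense", "investigates", "disappearance", "the"]:
    print("{} -> {}".format(w, toy_stem(w)))
# nonsense -> nonsens
# investigates -> investigat
# disappearance -> disappear
# the -> the
```

Note that `nonsens` and `investigat` are not English words: a stemmer only needs related forms to collapse to the same string.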

<!--
BEGIN QUESTION
name: q1_e
manual: false
points: 10
-->

In [25]:
first_3_stemmed = ...

print("Stemming examples:")
for w in first_3_stemmed:
    print("{} -> {}".format(w, ps.stem(w)))

In [None]:
grader.check("q1_e")

<br>

___

## 2. Train and Validate a Sentiment Analysis Model using a Random Forest Classifier
**(Total 30 points)**

In this section we have written the code to train the classifier for you. Your task will be to explore its performance characteristics with your own movie reviews.
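
Before reading the training code, it may help to see what a bag-of-words vectorizer does on two tiny documents. The sketch below uses only the standard library. The function names are hypothetical, and the tie-breaking used when truncating to `max_features` is an assumption of this sketch that need not match `CountVectorizer`.

```python
from collections import Counter


def ngrams_up_to(tokens, n):
    """All 1-grams up to n-grams of a token list, joined with spaces."""
    grams = []
    for size in range(1, n + 1):
        grams.extend(" ".join(tokens[i:i + size])
                     for i in range(len(tokens) - size + 1))
    return grams


def fit_vocabulary(docs, n, max_features):
    """Keep the max_features most frequent n-grams across all documents."""
    totals = Counter()
    for doc in docs:
        totals.update(ngrams_up_to(doc.split(), n))
    return sorted(gram for gram, _ in totals.most_common(max_features))


def transform(docs, vocabulary, n):
    """One row of counts per document, one column per vocabulary entry."""
    rows = []
    for doc in docs:
        counts = Counter(ngrams_up_to(doc.split(), n))
        rows.append([counts[gram] for gram in vocabulary])
    return rows


docs = ["good movie good cast", "bad movie"]
vocab = fit_vocabulary(docs, n=1, max_features=3)
print(vocab)                        # ['cast', 'good', 'movie']
print(transform(docs, vocab, n=1))  # [[1, 2, 1], [0, 0, 1]]
```

The word `bad` is dropped because only three features are kept, which shows how a small `max_features` can discard informative words.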

In [29]:
# We vectorize the text using a bag of words model
def get_vectorizer(ngram, max_features):
    return CountVectorizer(ngram_range=(1, ngram),
                             analyzer = "word",
                             tokenizer = None,
                             preprocessor = review_cleaner,
                             stop_words = None, 
                             max_features = max_features)

# Model training
def train_predict_sentiment(reviews, vectorizer, y=train["sentiment"], ngram=1, max_features=1000, model_random_state=0):
    '''
        This function will:
            1. split data into train and test set.
            2. turn the reviews into feature vectors using the given vectorizer
            3. train a random forest model using the training features and y (labels)
            4. test the model on your test split
            5. print accuracy of sentiment prediction on test and training data
            6. print confusion matrix on test data results

            The n-gram range and the number of features are properties of the
            vectorizer you pass in (see get_vectorizer); the ngram and
            max_features arguments of this function are not used in its body.
            
            @reviews is a collection of raw review strings
            @vectorizer is an object with fit_transform() and transform() methods
            @return the fitted RandomForestClassifier
    '''

    print("Creating the model!\n")
    
    # train / test split
    X_train, X_test, y_train, y_test = train_test_split(reviews, y, random_state=0, test_size=.2)

    # Then we use fit_transform() to fit the model / learn the vocabulary,
    # then transform the data into feature vectors.
    # The input should be a list of strings. .toarray() converts to a numpy array
    
    train_bag = vectorizer.fit_transform(X_train)
    if not isinstance(train_bag, np.ndarray):
        train_bag = train_bag.toarray()
    test_bag = vectorizer.transform(X_test)
    if not isinstance(test_bag, np.ndarray):
        test_bag = test_bag.toarray()

    print("Training the random forest classifier!\n")
    # Initialize a Random Forest classifier with 50 trees
    forest = RandomForestClassifier(n_estimators = 50, random_state = model_random_state) 

    # Fit the forest to the training set, using the bag of words as 
    # features and the sentiment labels as the target variable
    forest = forest.fit(train_bag, y_train)

    # predict
    train_predictions = forest.predict(train_bag)
    test_predictions = forest.predict(test_bag)
    
    # validation
    train_acc = metrics.accuracy_score(y_train, train_predictions)
    valid_acc = metrics.accuracy_score(y_test, test_predictions)
    
    print(" The training accuracy is: ", train_acc, "\n", "The validation accuracy is: ", valid_acc)
    print()
    print('CONFUSION MATRIX:')
    print('         Predicted')
    print('          neg pos')
    print(' Actual')
    c=confusion_matrix(y_test, test_predictions)
    print('     neg  ',c[0])
    print('     pos  ',c[1])

    return forest

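# Return the n most important features
def top_features(forest, vectorizer, n):
    # Extract feature importance
    print('\nTOP {} IMPORTANT FEATURES:'.format(n))
    # get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0
    if hasattr(vectorizer, "get_feature_names_out"):
        feature_text = vectorizer.get_feature_names_out()
    else:
        feature_text = vectorizer.get_feature_names()
    feature_importance = forest.feature_importances_
    
    # Indices of the features, sorted from most to least important
    indices = np.argsort(feature_importance)[::-1]
    
    top_n_ind = indices[:n]
    top_n = [str(feature_text[ind]) for ind in top_n_ind]
    
    return top_n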

# Print out whether the prediction is accurate
def check_prediction(model, vectorizer, review, expected):
    prediction = model.predict(vectorizer.transform([review]))[0]
    sentiment = "👍" if prediction else "👎"
    correct = "\x1b[92mcorrect\x1b[0m" if prediction == expected else "\x1b[31mincorrect\x1b[0m"
    print("{} ⟶ {} {}".format(review, sentiment, correct))
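
The confusion matrix printed by `train_predict_sentiment` has one row per actual label and one column per predicted label. The sketch below rebuilds that layout from two short, made-up label lists. The helpers `confusion_2x2` and `accuracy` are hypothetical names for this example and stand in for the scikit-learn functions used above.

```python
def confusion_2x2(y_true, y_pred):
    """Rows are actual labels (0, 1); columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def accuracy(matrix):
    """Fraction of predictions on the diagonal (actual == predicted)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
m = confusion_2x2(y_true, y_pred)
print(m)            # [[1, 1], [1, 2]]
print(accuracy(m))  # 0.6
```

Here `m[0][1]` counts negative reviews predicted as positive, which is the kind of error question 2.d asks you to provoke.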

<br>


### 2.a Train Random Forest Classifier Model

**(Total 15 points)**

Use the functions above to train your random forest model. Call `get_vectorizer` with `ngram=1` and `max_features=100`, then call `train_predict_sentiment` to train your model on the training dataset. Finally, use `top_features` to get the top 10 features and print them. This cell may take a few minutes to run.

<!--
BEGIN QUESTION
name: q2a
manual: false
points: 15
-->

In [30]:
# Train RFC model
vectorizer = ...
forest_model = ...
top_10 = ...
print(top_10)

In [None]:
grader.check("q2a")

### 2.b Construct a positive sentiment review

**(Total 5 points)**

Think of a movie that you like and write a review for it. Store it as a string in `good_review`. If the model doesn't give a positive prediction for your review, iterate on it until it does.

<!--
BEGIN QUESTION
name: q2_b
manual: false
points: 5
-->

In [34]:
good_review = ...
check_prediction(forest_model, vectorizer, good_review, 1)

In [None]:
grader.check("q2_b")

### 2.c Construct a negative sentiment review

**(Total 5 points)**

Think of a movie that you dislike and write a review for it. Store it as a string in `bad_review`. If the model doesn't give a negative prediction for your review, iterate on it until it does.

<!--
BEGIN QUESTION
name: q2_c
manual: false
points: 5
-->

In [37]:
bad_review = ...
check_prediction(forest_model, vectorizer, bad_review, 0)

In [None]:
grader.check("q2_c")

### 2.d Construct a misclassified negative sentiment review

**(Total 5 points)**

Now try to write a review that you view as negative but the model views as positive. Iterate and experiment as necessary, and store it as a string in `bad_review_error`.

<!--
BEGIN QUESTION
name: q2_d
manual: false
points: 5
-->

In [40]:
bad_review_error = ...
check_prediction(forest_model, vectorizer, bad_review_error, 0)

In [None]:
grader.check("q2_d")

## 3. Word2Vec Model
**(Total 20 points)**

Run the cell below to train the Word2Vec model on the training dataset.

In [43]:
# gensim >= 4.0 names this argument `vector_size`; on gensim 3.x use `size=100` instead
w2v_model = Word2Vec(sentences=[utils.simple_preprocess(review) for review in train['review']], vector_size=100, workers=1)

### 3.a Word2Vec similarity

**(Total 5 points)**

Use the Word2Vec Model we trained above to find 10 similar words for `'actors'`. Store the result in `sim`
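
Similarity between word vectors is usually measured with cosine similarity. The sketch below ranks neighbours in a tiny hand-made vocabulary. The vectors in `toy_vectors` are invented for illustration and the helper names are hypothetical. With the trained model you would query `w2v_model.wv` rather than use these helpers.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "actors": [1.0, 0.9, 0.0],
    "actresses": [0.9, 1.0, 0.1],
    "cast": [0.8, 0.7, 0.2],
    "plot": [0.0, 0.2, 1.0],
}
neighbours = most_similar("actors", toy_vectors, topn=2)
print([word for word, _ in neighbours])  # ['actresses', 'cast']
```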

<!--
BEGIN QUESTION
name: q3_a
manual: false
points: 5
-->

In [44]:
sim = ...
sim

In [None]:
grader.check("q3_a")

### 3.b Word2Vec doesn't match

**(Total 5 points)**


Use the Word2Vec Model we trained above to find the word that doesn't match the others. We will test the words: 'professor', 'engineer', 'scientist', 'cat'. Store the result in `no_match`
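
One common way to define the "odd one out" is to pick the word whose unit-length vector agrees least with the mean of all the unit-length vectors. The sketch below applies that rule to invented 2-dimensional vectors. `odd_one_out` is a hypothetical helper, and any resemblance to how gensim implements its own query should be treated as an assumption.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word whose unit vector has the smallest dot product with the mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(units[words[0]])
    mean = [sum(units[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Made-up vectors: three occupations point one way, the animal another.
toy_vectors = {
    "professor": [1.0, 0.1],
    "engineer": [0.9, 0.2],
    "scientist": [1.0, 0.2],
    "cat": [0.1, 1.0],
}
words = ["professor", "engineer", "scientist", "cat"]
print(odd_one_out(words, toy_vectors))  # cat
```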

<!--
BEGIN QUESTION
name: q3_b
manual: false
points: 5
-->

In [47]:
no_match = ...
no_match

In [None]:
grader.check("q3_b")

### 3.c Fit the Word2Vec model

**(Total 10 points)**

Vector Averaging to get feature encoding of review:


One challenge with the IMDB dataset is the variable-length reviews. We need to find a way to take individual word vectors and transform them into a feature set that is the same length for every review.

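We can use vector operations to combine the words in each review. The method used here is to simply average the word vectors in a given review. Stop words mostly add noise to such an average; note, however, that the code below does not apply a stop-word filter and averages every in-vocabulary token produced by `utils.simple_preprocess`.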

The following code averages the feature vectors. You don't need to modify the cell.

In [50]:
def get_avg_feature_vecs(reviews, model):
    # Given a collection of reviews (each one a raw string), calculate
    # the average word vector for each one

    # The model's vocabulary as a set, for fast membership tests.
    # gensim >= 4.0 calls the list index_to_key; gensim 3.x calls it index2word.
    vocab = model.wv.index_to_key if hasattr(model.wv, "index_to_key") else model.wv.index2word
    index2word_set = set(vocab)
    
    reviewFeatureVecs = []
    # Loop through the reviews
    for counter, review in enumerate(reviews):
        
        # Print a status message every 5000th review
        if (counter + 1) % 5000 == 0:
            print("Review %d of %d" % (counter + 1, len(reviews)))

        # Collect the vector of every word in the review that is in the
        # model's vocabulary
        featureVec = [model.wv[word]
                      for word in utils.simple_preprocess(review)
                      if word in index2word_set]

        # Average the word vectors. A review with no in-vocabulary words
        # gets a zero vector so that every row has the same length.
        if featureVec:
            featureVec = np.mean(featureVec, axis=0).reshape(1, -1)
        else:
            featureVec = np.zeros((1, model.wv.vector_size))

        reviewFeatureVecs.append(featureVec)

    return np.concatenate(reviewFeatureVecs, axis=0)

w2v_vectorizer = FunctionTransformer(lambda x: get_avg_feature_vecs(x, w2v_model))
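
The averaging step can be checked by hand on a toy vocabulary. The sketch below uses plain lists instead of NumPy. The vectors in `toy_vectors` are invented and `average_vector` is a hypothetical helper. Returning a zero vector when no word is known is a choice made for this sketch.

```python
def average_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that appear in `vectors`."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        # No known words: fall back to a zero vector of the right length.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Made-up 2-dimensional word vectors, for illustration only.
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "bad": [-1.0, 0.0]}

short_review = average_vector("good movie".split(), toy_vectors, 2)
long_review = average_vector("bad bad movie today".split(), toy_vectors, 2)
print(short_review)      # [0.5, 0.5]
print(len(long_review))  # 2
```

Both reviews end up as vectors of length 2 regardless of how many words they contain, which is what lets a classifier treat them as rows of one feature matrix.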

Again, use the function `train_predict_sentiment` to train your random forest model. Remember: we vectorize the text using the Word2Vec model now. Save the trained model in `w2v_forest_model`.

<!--
BEGIN QUESTION
name: q3_c
manual: false
points: 10
-->

In [51]:
w2v_forest_model = ...

In [None]:
grader.check("q3_c")

How does Word2Vec compare with the Bag of Words model? Is it an improvement? How significant is the difference?

### 3.d Word2Vec Prediction Analysis

**(Total 0 points)**

Run the following cells to check to see how the Word2Vec model works on the reviews that you wrote previously.

In [54]:
check_prediction(w2v_forest_model, w2v_vectorizer, good_review, 1)

In [55]:
check_prediction(w2v_forest_model, w2v_vectorizer, bad_review, 0)

In [56]:
print("With Bag of Words:")
check_prediction(forest_model, vectorizer, bad_review_error, 0)

print("With Word2Vec:")
check_prediction(w2v_forest_model, w2v_vectorizer, bad_review_error, 0)

Think about the questions below:

* Is your positive review classified correctly by Word2Vec?

* Is your negative review classified correctly by Word2Vec?

* Is the negative review that Bag of Words misclassified now classified correctly by Word2Vec?


# Submit
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output.
**Please save before submitting!**

In [None]:
# Save your notebook first, then run this cell to create a pdf for your reference.