# <u><center>Project 2 Part 5 Core
- Authored by: Eric N. Valdez
- Date: 4/6/2024

## `Sentiment Analysis and Rating Prediction of Movie Reviews`

### <u>Overview:
- This project is an extension of the movies project from Enrichment. This portion applies Natural Language Processing (NLP) techniques to analyze a database of movie reviews.
- Students will leverage NLP tools such as NLTK, Spacy, WordCloud, and Scikit-Learn to explore, analyze, and model text data. The ultimate goal is to establish a relationship between the textual content of the reviews and their associated ratings and subsequently predict these ratings.

### <u>Dataset: TMDB Movie Reviews
<center> <img src="Data-NLP/Images/IMDB.png">

#### [TMDB Movie Reviews](https://drive.google.com/file/d/1vLUzSYleJXqsjNMsq76yTQ5fmNlSHFJI/view). Ratings Range from 1 to 10</center>
- Gathered through the `tmdbsimple` Python wrapper for the TMDB API. To legally cite TMDB, please follow their attribution requirements, which we have [summarized here](https://docs.google.com/document/d/1LzFQDulDdQjiMuZ8sBYeDbHnN62ZWjFU_xt_4eSwVIw/edit).

# <u>Imports:

In [1]:
import re
import matplotlib.pyplot as plt
import spacy
import pandas as pd
import matplotlib as mpl
import seaborn as sns
import numpy as np
import nltk


from wordcloud import WordCloud
from wordcloud import STOPWORDS
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from nltk import ngrams
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Increase column width (the earlier value of 250 was immediately overridden)
pd.set_option('display.max_colwidth', 300)


# <u> Custom Functions:

In [2]:
def batch_preprocess_texts(
    texts,
    nlp=None,
    remove_stopwords=True,
    remove_punct=True,
    use_lemmas=False,
    disable=["ner"],
    batch_size=50,
    n_process=-1,
):
    """Efficiently preprocess a collection of texts using nlp.pipe()
    Args:
        texts (collection of strings): collection of texts to process (e.g. df['text'])
        nlp (spacy pipe, optional): spaCy nlp pipe. Defaults to None; if None, it creates a default 'en_core_web_sm' pipe.
        remove_stopwords (bool, optional): Controls stopword removal. Defaults to True.
        remove_punct (bool, optional): Controls punctuation removal. Defaults to True.
        use_lemmas (bool, optional): lemmatize tokens. Defaults to False.
        disable (list of strings, optional): named pipeline elements to disable. Defaults to ["ner"]: Used with nlp.pipe(disable=disable)
        batch_size (int, optional): Number of texts to process in a batch. Defaults to 50.
        n_process (int, optional): Number of CPU processors to use. Defaults to -1 (meaning all CPU cores).
    Returns:
        list of tokens
    """
    # from tqdm.notebook import tqdm
    from tqdm import tqdm
    if nlp is None:
        nlp = spacy.load("en_core_web_sm")
    processed_texts = []
    for doc in tqdm(nlp.pipe(texts, disable=disable, batch_size=batch_size, n_process=n_process)):
        tokens = []
        for token in doc:
            # Check if should remove stopwords and if token is stopword
            if (remove_stopwords == True) and (token.is_stop == True):
                # Continue the loop with the next token
                continue
            # Check if should remove punctuation and if token is punctuation
            if (remove_punct == True) and (token.is_punct == True):
                continue
            # Whitespace tokens are dropped together with punctuation
            if (remove_punct == True) and (token.is_space == True):
                continue
            
            ## Determine final form of output list of tokens/lemmas
            if use_lemmas:
                tokens.append(token.lemma_.lower())
            else:
                tokens.append(token.text.lower())
        processed_texts.append(tokens)
    return processed_texts

In [3]:
def classification_metrics(y_true, y_pred, label='',
                           output_dict=False, figsize=(8,4),
                           normalize='true', cmap='Blues',
                           colorbar=False):
  # Get the classification report
  report = classification_report(y_true, y_pred)
  ## Print header and report
  header = "-"*70
  print(header, f" Classification Metrics: {label}", header, sep='\n')
  print(report)
  ## CONFUSION MATRICES SUBPLOTS
  fig, axes = plt.subplots(ncols=2, figsize=figsize)
  # create a confusion matrix  of raw counts
  ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                normalize=None, cmap='gist_gray', colorbar=colorbar,
                ax = axes[0],);
  axes[0].set_title("Raw Counts")
  # create a normalized confusion matrix
  ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                normalize=normalize, cmap=cmap, colorbar=colorbar,
                ax = axes[1]);
  axes[1].set_title("Normalized Confusion Matrix")
  # Adjust layout and show figure
  fig.tight_layout()
  plt.show()
  # Return dictionary of classification_report
  if output_dict==True:
    report_dict = classification_report(y_true, y_pred, output_dict=True)
    return report_dict

def evaluate_classification(model, X_train, y_train, X_test, y_test,
                         figsize=(6,4), normalize='true', output_dict = False,
                            cmap_train='Blues', cmap_test="Reds",colorbar=False):
  # Get predictions for training data
  y_train_pred = model.predict(X_train)
  # Call the helper function to obtain classification metrics for training data
  results_train = classification_metrics(y_train, y_train_pred,
                                         output_dict=True, figsize=figsize,
                                         normalize=normalize,  # pass the argument through
                                         colorbar=colorbar, cmap=cmap_train,
                                         label='Training Data')
  print()
  # Get predictions for test data
  y_test_pred = model.predict(X_test)
  # Call the helper function to obtain classification metrics for test data
  results_test = classification_metrics(y_test, y_test_pred,
                                        output_dict=True, figsize=figsize,
                                        normalize=normalize,  # pass the argument through
                                        colorbar=colorbar, cmap=cmap_test,
                                        label='Test Data')
  if output_dict == True:
    # Return the train and test reports together in one dictionary
    results_dict = {'train':results_train,
                    'test': results_test}
    return results_dict

In [4]:
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np
def classification_metrics(y_true, y_pred, label='',
                           output_dict=False, figsize=(8,4),
                           normalize='true', cmap='Blues',
                           colorbar=False,values_format=".2f"):
    """Modified version of classification metrics function from Intro to Machine Learning.
    Updates:
    - Reversed raw counts confusion matrix cmap  (so darker==more).
    - Added arg for normalized confusion matrix values_format
    """
    # Get the classification report
    report = classification_report(y_true, y_pred)
    
    ## Print header and report
    header = "-"*70
    print(header, f" Classification Metrics: {label}", header, sep='\n')
    print(report)
    
    ## CONFUSION MATRICES SUBPLOTS
    fig, axes = plt.subplots(ncols=2, figsize=figsize)
    
    # Create a confusion matrix  of raw counts (left subplot)
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                                            normalize=None, 
                                            cmap='gist_gray_r',# Updated cmap
                                            values_format="d", 
                                            colorbar=colorbar,
                                            ax = axes[0]);
    axes[0].set_title("Raw Counts")
    
    # Create a confusion matrix using the normalize argument (right subplot)
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                                            normalize=normalize,
                                            cmap=cmap, 
                                            values_format=values_format, #New arg
                                            colorbar=colorbar,
                                            ax = axes[1]);
    axes[1].set_title("Normalized Confusion Matrix")
    
    # Adjust layout and show figure
    fig.tight_layout()
    plt.show()
    
    # Return dictionary of classification_report
    if output_dict==True:
        report_dict = classification_report(y_true, y_pred, output_dict=True)
        return report_dict
    
    
def evaluate_classification(model, X_train, y_train, X_test, y_test,
                         figsize=(6,4), normalize='true', output_dict = False,
                            cmap_train='Blues', cmap_test="Reds",colorbar=False):
  # Get predictions for training data
  y_train_pred = model.predict(X_train)
  # Call the helper function to obtain classification metrics for training data
  results_train = classification_metrics(y_train, y_train_pred,
                                         output_dict=True, figsize=figsize,
                                         normalize=normalize,  # pass the argument through
                                         colorbar=colorbar, cmap=cmap_train,
                                         label='Training Data')
  print()
  # Get predictions for test data
  y_test_pred = model.predict(X_test)
  # Call the helper function to obtain classification metrics for test data
  results_test = classification_metrics(y_test, y_test_pred,
                                        output_dict=True, figsize=figsize,
                                        normalize=normalize,  # pass the argument through
                                        colorbar=colorbar, cmap=cmap_test,
                                        label='Test Data')
  if output_dict == True:
    # Return the train and test reports together in one dictionary
    results_dict = {'train':results_train,
                    'test': results_test}
    return results_dict

In [5]:
# import matplotlib.pyplot as plt
# ax = dist.plot(20, show=False)
# ax.set_title('Number of Occurrences of Top 20 Words')
# ax.grid(False)
# plt.tight_layout()
# plt.savefig('frequency_distribution.png')

In [6]:
def preprocess_doc(doc, remove_stopwords=True, remove_punct=True, use_lemmas=False):
    """Temporary Function - for Educational Purposes (we will make something better below)
    """
    tokens = [ ]
    for token in doc:
        # Check if should remove stopwords and if token is stopword
        if (remove_stopwords == True) and (token.is_stop == True):
            # Continue the loop with the next token
            continue
    
        # Check if should remove punctuation and if token is punctuation
        if (remove_punct == True) and (token.is_punct == True):
            continue
    
        # Whitespace tokens are dropped together with punctuation
        if (remove_punct == True) and (token.is_space == True):
            continue
    
        ## Determine final form of output list of tokens/lemmas
        if use_lemmas:
            tokens.append(token.lemma_.lower())
        else:
            tokens.append(token.text.lower())
    return tokens
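
The effect of the three flags can be checked without loading a spaCy model. The sketch below is only a stand-in: it assumes a hypothetical `FakeToken` dataclass exposing just the attributes the function reads (`text`, `lemma_`, `is_stop`, `is_punct`, `is_space`), and the lemma values are typed in by hand rather than produced by a lemmatizer. `filter_tokens` is likewise an illustrative name that applies the same rules as `preprocess_doc`.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token (illustration only)."""
    text: str
    lemma_: str
    is_stop: bool = False
    is_punct: bool = False
    is_space: bool = False


def filter_tokens(doc, remove_stopwords=True, remove_punct=True, use_lemmas=False):
    """Apply the same filtering rules as preprocess_doc."""
    kept = [
        tok for tok in doc
        if not (remove_stopwords and tok.is_stop)
        and not (remove_punct and (tok.is_punct or tok.is_space))
    ]
    # Choose the lemma or the surface text, then lowercase it
    return [(tok.lemma_ if use_lemmas else tok.text).lower() for tok in kept]


# Hand-built "document" for the sentence: "The chickens escaped!"
doc = [
    FakeToken("The", "the", is_stop=True),
    FakeToken("chickens", "chicken"),
    FakeToken("escaped", "escape"),
    FakeToken("!", "!", is_punct=True),
]

plain_tokens = filter_tokens(doc)
lemmas = filter_tokens(doc, use_lemmas=True)
with_stopwords = filter_tokens(doc, remove_stopwords=False)

print(plain_tokens)    # ['chickens', 'escaped']
print(lemmas)          # ['chicken', 'escape']
print(with_stopwords)  # ['the', 'chickens', 'escaped']
```

Note that tokens and lemmas only differ when `use_lemmas` changes between the two calls.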

# `Tasks`
## 0) <u>Update Your Project 2 Repo
- Create a new `Data-NLP/` folder in your project 2 repository
- Add the downloaded reviews to this new `Data-NLP/` folder.
- Make sure you have an `` folder. If <u>not</u>, create one.

## 1) <u>Data Preprocessing 
- `Load and Inspect the dataset`
    - How many reviews?
    - What does the distribution of ratings look like?
    - Any Null values?
- `Use the rating column to create a new` <u>`target column`</u> `with 2 groups: high-rating and low-rating groups`
    - We recommend defining 'High-rating' reviews as any review with a rating >=9; and 'Low-Rating' reviews as any review with a rating <=4.
    - The middle ratings between 4 and 9 will be excluded from the analysis.
    - You may use an alternative definition for High & Low reviews, but justify your choice in your notebook/readme
- `Utilize NLTK & SpaCy for basic text processing including:`
    - Removing stopwords
    - Tokenization
    - Lemmatization
    - `TIPS:`
        - Be sure to create a custom NLP object and disable the named entity recognizer. Otherwise, processing will take a very long time!
        - **You will want to create several versions of the data: lemmatized, tokenized, lemmatized joined back to one string per review, and tokenized joined back to one string per review.** This will be useful for different analysis and modeling techniques.
- `NOTE:` You may find some artifacts during your EDA, e.g. HTML code like `"href"`. You are allowed to drop rows from the dataset after identifying problematic trends in some of the texts. `(Hint: Remember df[col].str.contains)`
- <u>Save your processed DataFrame as a `joblib` file in the `Data-NLP/` folder for future modeling. 
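
The `str.contains` hint above could be applied roughly as follows. This is a minimal sketch on a made-up three-row frame, not the TMDB data; the column name `review` matches the dataset, but the search string `"href"` and the decision to drop every matching row are assumptions to revisit against what the EDA actually turns up.

```python
import pandas as pd

# Toy reviews (made up for illustration; not rows from the TMDB data)
toy = pd.DataFrame({
    "review": [
        "A cracking little family adventure.",
        'Full review at <a href="https://example.com">my blog</a>',
        "Bored me from start to finish.",
    ]
})

# Flag rows whose text contains an HTML link artifact.
# regex=False treats the pattern as a plain substring; na=False counts
# missing text as "no match" instead of propagating NaN.
has_html = toy["review"].str.contains("href", regex=False, na=False)

# Keep only the rows without the artifact
clean = toy.loc[~has_html].copy()
print(f"Dropped {has_html.sum()} of {len(toy)} rows")
```

Inspecting the flagged rows before dropping them is a cheap way to confirm the pattern is not catching legitimate reviews.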

In [7]:
# Load the Data
mr = 'Data-NLP/movie_reviews_v2.csv'
df = pd.read_csv(mr, index_col='movie_id')
df.head()

movie_id,review_id,imdb_id,original_title,review,rating
843,64ecc16e83901800af821d50,tt0118694,花樣年華,"This is a fine piece of cinema from Wong Kar-Wai that tells us a story of two people whom circumstance throws together - but not in a way you might expect. We start with two couples who move into a new building. One a newspaper man with his wife, the other a business executive and his wife. The ...",7.0
7443,57086ff5c3a3681d29001512,tt0120630,Chicken Run,"A guilty pleasure for me personally, as I love both 'The Great Escape' and most of the works I have seen, over the years, from this rightfully-esteemed British animation company. Highly recommended both for children and for adults who enjoy animation.",9.0
7443,5bb5ac829251410dcb00810c,tt0120630,Chicken Run,"Made my roommate who hates stop-motion animation watched this in 2018 and even he had a good time. It's maybe not as great as I remember thinking it was when I was a little kid, but it still holds up to some degree.\r\n\r\n_Final rating:★★★ - I liked it. Would personally recommend you give it a ...",6.0
7443,5f0c53a013a32000357ec505,tt0120630,Chicken Run,"A very good stop-motion animation!\r\n\r\n<em>'Chicken Run'</em>, which I watched a crap tonne when I was little but not for a vast number of years now, is an impressive production given it came out in 2000. Despite a pretty simple feel to the film, it's a very well developed concept.\r\n\r\nThe...",8.0
7443,64ecc027594c9400ffe77c91,tt0120630,Chicken Run,"Ok, there is an huge temptation to riddle this review with puns - but I'm just going to say it's a cracking little family adventure. It's seemingly based on a whole range of classic movies from the ""Great Escape"", ""Star Trek"" to ""Love Story"" with a score cannibalised from just about any/everythi...",7.0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8650 entries, 843 to 575264
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   review_id       8650 non-null   object 
 1   imdb_id         8650 non-null   object 
 2   original_title  8650 non-null   object 
 3   review          8650 non-null   object 
 4   rating          7454 non-null   float64
dtypes: float64(1), object(4)
memory usage: 405.5+ KB


Cleaning the data 

In [9]:
# Checking for duplicates
df.duplicated().sum()

0

In [10]:
# Checking for null Values
df.isna().sum()

review_id            0
imdb_id              0
original_title       0
review               0
rating            1196
dtype: int64

In [11]:
# Dropping unnecessary columns
movie_reviews = df.drop(df.columns[[0, 1]], axis=1)
movie_reviews.head()

movie_id,original_title,review,rating
843,花樣年華,"This is a fine piece of cinema from Wong Kar-Wai that tells us a story of two people whom circumstance throws together - but not in a way you might expect. We start with two couples who move into a new building. One a newspaper man with his wife, the other a business executive and his wife. The ...",7.0
7443,Chicken Run,"A guilty pleasure for me personally, as I love both 'The Great Escape' and most of the works I have seen, over the years, from this rightfully-esteemed British animation company. Highly recommended both for children and for adults who enjoy animation.",9.0
7443,Chicken Run,"Made my roommate who hates stop-motion animation watched this in 2018 and even he had a good time. It's maybe not as great as I remember thinking it was when I was a little kid, but it still holds up to some degree.\r\n\r\n_Final rating:★★★ - I liked it. Would personally recommend you give it a ...",6.0
7443,Chicken Run,"A very good stop-motion animation!\r\n\r\n<em>'Chicken Run'</em>, which I watched a crap tonne when I was little but not for a vast number of years now, is an impressive production given it came out in 2000. Despite a pretty simple feel to the film, it's a very well developed concept.\r\n\r\nThe...",8.0
7443,Chicken Run,"Ok, there is an huge temptation to riddle this review with puns - but I'm just going to say it's a cracking little family adventure. It's seemingly based on a whole range of classic movies from the ""Great Escape"", ""Star Trek"" to ""Love Story"" with a score cannibalised from just about any/everythi...",7.0


In [12]:
movie_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8650 entries, 843 to 575264
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   original_title  8650 non-null   object 
 1   review          8650 non-null   object 
 2   rating          7454 non-null   float64
dtypes: float64(1), object(2)
memory usage: 270.3+ KB


<u>Creating target column

In [13]:
# Checking what values are in the overall ratings
movie_reviews['rating'].value_counts()

7.0     1576
6.0     1386
8.0     1259
5.0      732
9.0      616
10.0     564
4.0      514
1.0      284
3.0      254
2.0      153
7.5       27
8.5       23
6.5       22
9.5       15
0.5       10
5.5        6
3.5        4
4.5        4
1.5        3
2.5        2
Name: rating, dtype: int64

In [14]:
def create_groups(x):
    if x>=9:
        return "High-rating"
    elif x <=4:
        return "Low-rating"
    else: 
        return None

In [15]:
# Should return high
create_groups(9)

'High-rating'

In [16]:
# Should return low
create_groups(4)

'Low-rating'
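
The two checks above cover the exact thresholds. Since the `value_counts` output also contains half-point ratings and the dataset has missing ratings, a few more cases are worth confirming. The sketch below repeats `create_groups` so it runs on its own; the chosen test values are illustrative.

```python
import math


def create_groups(x):
    # Same thresholds as the function defined above
    if x >= 9:
        return "High-rating"
    elif x <= 4:
        return "Low-rating"
    else:
        return None


# Half-point ratings just inside the excluded middle band
assert create_groups(4.5) is None
assert create_groups(8.5) is None

# Half-point ratings that fall inside a group
assert create_groups(9.5) == "High-rating"
assert create_groups(0.5) == "Low-rating"

# A missing rating (NaN) fails both comparisons, so it is excluded as well
assert create_groups(math.nan) is None

edge_cases = {r: create_groups(r) for r in (0.5, 4, 4.5, 8.5, 9, 9.5)}
print(edge_cases)
```

Because NaN falls through to `None`, reviews without a rating end up in the same unlabeled bucket as the middle ratings.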

In [17]:
# Use the function to create a new "ratings" column with groups
movie_reviews['ratings'] = movie_reviews['rating'].map(create_groups)
movie_reviews['ratings'].value_counts(dropna=False)

None           6231
Low-rating     1224
High-rating    1195
Name: ratings, dtype: int64

In [18]:
# Check class balance of 'ratings'
movie_reviews['ratings'].value_counts(normalize=True)

Low-rating     0.505994
High-rating    0.494006
Name: ratings, dtype: float64
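
The proportions above leave out the unlabeled rows, because `value_counts` drops missing values unless `dropna=False` is passed. As a quick arithmetic check, the sketch below recomputes the balance by hand from the counts printed in the previous cell.

```python
# Counts copied from the value_counts(dropna=False) output above
counts = {"Low-rating": 1224, "High-rating": 1195, None: 6231}

# Only labeled reviews go into the denominator
labeled_total = sum(n for label, n in counts.items() if label is not None)
balance = {
    label: n / labeled_total
    for label, n in counts.items()
    if label is not None
}

print(labeled_total)  # 2419
for label, share in balance.items():
    print(f"{label}: {share:.6f}")
```

The 6,231 unlabeled reviews are the excluded middle ratings plus the 1,196 reviews with no rating at all.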

In [19]:
# Divide documents by sentiment
high = movie_reviews.loc[movie_reviews['ratings'] == 'High-rating']
low = movie_reviews.loc[movie_reviews['ratings'] == 'Low-rating']
print('High Ratings')
display(high.head())
print('Low Ratings')
display(low.head())

High Ratings


movie_id,original_title,review,rating,ratings
7443,Chicken Run,"A guilty pleasure for me personally, as I love both 'The Great Escape' and most of the works I have seen, over the years, from this rightfully-esteemed British animation company. Highly recommended both for children and for adults who enjoy animation.",9.0,High-rating
2621,Return to Me,"Okay, I will admit I can be a bit of an easy grader when it comes to romantic comedies, as long as they are witty with likable characters, don’t insult my intelligence and have suitable happy endings (I guess if they don’t end happily, they aren’t rom-coms).\r\n\r\nI saw this movie many years ag...",9.0,High-rating
2787,Pitch Black,"All you people are so scared of me. Most days I'd take that as a compliment. But it ain't me you gotta worry about now.\r\n\r\nPitch Black is directed by David Twohy and collectively written by Twohy and Ken and Jim Wheat. It stars Vin Diesel, Radha Mitchell, Cole Hauser, Keith David, Lewis Fitz...",9.0,High-rating
2787,Pitch Black,"One of those few movies that most people don't care for, but I personally think is **criminally** underrated.\r\n\r\n_Final rating:★★★★½ - Ridiculously strong appeal. I can’t stop thinking about it._",9.0,High-rating
2787,Pitch Black,"The movie that put Vin Diesel on the map as Riddick, the crooked anti-hero wanted by bounty hunters. This is another movie that benefits from knowing very little before watching the film.",9.0,High-rating


Low Ratings


movie_id,original_title,review,rating,ratings
955,Mission: Impossible II,"The first underwhelmed me, but this one straight-up bored me. Again, of course seeing Hunt climb a mountain without a harness is impressive sure. And I even quite liked the idea behind the villain of the piece (though even that angle was woefully underdeveloped).\r\n\r\nEven setting it in predom...",3.0,Low-rating
955,Mission: Impossible II,After quite entertainign Mission Impossible I the second installment turned out ... terrible. As if the screenwriters didn't know how to fill the 2 hrs with action the overuse of heroic slow motion scenes is horrible. You almost might need a barf bag if you can't stand slo-mo every five seconds....,2.0,Low-rating
4234,Scream 3,"**_Scream 3’s_ lackluster screenplay and unimaginative kills leave a film that is a bore to watch.**\r\n\r\nThe meta-narrative of trilogies throughout the film does not make up for how abysmal the plot was. This film creates so much lore for the past movies seemingly out of the blue, muddling up...",4.0,Low-rating
12211,Highlander: Endgame,"**There should have been only one!**\r\n\r\nIf “Highlander 2” was a complete disgrace and “Highlander 3” somehow tried to give us some compensation, this movie makes it look worse and more worn out. However, a TV series had been made that had little or nothing to do with the original film. What ...",1.0,Low-rating
479,Shaft,_**A black detective in Gotham desperately wants to nail a snooty racist murderer**_ \r\n\r\nThe nephew of the original John Shaft is a detective in New York City (Samuel L. Jackson) where he tries to apprehend an arrogant racist killer (Christian Bale) by finding a key witness (Toni Collette) w...,4.0,Low-rating


<u>Creating new columns:

In [20]:
# StopWords
from wordcloud import STOPWORDS
STOPWORDS

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'all',
 'also',
 'am',
 'an',
 'and',
 'any',
 'are',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 "can't",
 'cannot',
 'com',
 'could',
 "couldn't",
 'did',
 "didn't",
 'do',
 'does',
 "doesn't",
 'doing',
 "don't",
 'down',
 'during',
 'each',
 'else',
 'ever',
 'few',
 'for',
 'from',
 'further',
 'get',
 'had',
 "hadn't",
 'has',
 "hasn't",
 'have',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'hence',
 'her',
 'here',
 "here's",
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 "how's",
 'however',
 'http',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'k',
 "let's",
 'like',
 'me',
 'more',
 'most',
 "mustn't",
 'my',
 'myself',
 'no',
 'nor',
 'not',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'otherwise',
 'ought',
 'our',
 

In [21]:
# Removing Stopwords
import nltk
from nltk.corpus import stopwords

# Download the stop words list
nltk.download('stopwords')

# Create a list of stop words
stop_words = set(stopwords.words('english'))

# Split the text into words
text = "This is a sample text with stop words."
words = text.split()

# Remove stop words from the list of words
filtered_words = [word for word in words if word not in stop_words]

# Join the remaining words back into a string
filtered_text = " ".join(filtered_words)

# Print the filtered text
print(filtered_text)

This sample text stop words.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Valde\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
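
In the printed result above, "This" survives because the NLTK list is all lowercase and the membership test is case-sensitive, and "words." keeps its period because `split()` does not separate punctuation. A minimal sketch of a case-insensitive variant follows. To stay self-contained it uses a small hand-picked set as a stand-in for the NLTK list, which is an assumption made for illustration only.

```python
import string

# Hand-picked stand-in for the NLTK English stop word list (assumption:
# the real list is much longer; these entries are enough for the sample text)
stop_words = {"this", "is", "a", "with"}

text = "This is a sample text with stop words."

filtered_words = []
for word in text.split():
    # Strip surrounding punctuation, then compare in lowercase
    cleaned = word.strip(string.punctuation)
    if cleaned and cleaned.lower() not in stop_words:
        filtered_words.append(cleaned)

filtered_text = " ".join(filtered_words)
print(filtered_text)  # sample text stop words
```

The spaCy-based `batch_preprocess_texts` function handles both issues through `token.is_stop` and `token.is_punct`, so this manual version is mainly useful for understanding what goes wrong with a plain `split()`.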


In [22]:
## Download NLTK stopword list
nltk.download('stopwords')

## Load the English stop words.
stop_words = nltk.corpus.stopwords.words('english')
stop_words[:10]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Valde\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [23]:
nlp_lite = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp_lite

<spacy.lang.en.English at 0x2a8f0ec6d70>

In [24]:
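# NOTE: use_lemmas=True makes this column hold lemmas, so it matches the
# 'lemmatized' column created below; pass use_lemmas=False for plain tokens.
movie_reviews['tokens'] = batch_preprocess_texts(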
    movie_reviews['review'],
    nlp = nlp_lite,
    remove_stopwords=True,
    remove_punct=True,
    use_lemmas=True,
    batch_size=50,
    n_process=-1)

8650it [01:32, 94.01it/s] 


In [25]:
# # Tokenized column was created above instead of separately
# movie_reviews['tokens'] = movie_reviews['review'].map(lambda doc: doc.lower().split())
movie_reviews.head()

movie_id,original_title,review,rating,ratings,tokens
843,花樣年華,"This is a fine piece of cinema from Wong Kar-Wai that tells us a story of two people whom circumstance throws together - but not in a way you might expect. We start with two couples who move into a new building. One a newspaper man with his wife, the other a business executive and his wife. The ...",7.0,,"[fine, piece, cinema, wong, kar, wai, tell, story, people, circumstance, throw, way, expect, start, couple, new, building, newspaper, man, wife, business, executive, wife, businessman, rarely, home, journalist, wife, leave, increasingly, loose, end, long, friendship, develop, usually, noodle, en..."
7443,Chicken Run,"A guilty pleasure for me personally, as I love both 'The Great Escape' and most of the works I have seen, over the years, from this rightfully-esteemed British animation company. Highly recommended both for children and for adults who enjoy animation.",9.0,High-rating,"[guilty, pleasure, personally, love, great, escape, work, see, year, rightfully, esteem, british, animation, company, highly, recommend, child, adult, enjoy, animation]"
7443,Chicken Run,"Made my roommate who hates stop-motion animation watched this in 2018 and even he had a good time. It's maybe not as great as I remember thinking it was when I was a little kid, but it still holds up to some degree.\r\n\r\n_Final rating:★★★ - I liked it. Would personally recommend you give it a ...",6.0,,"[roommate, hate, stop, motion, animation, watch, 2018, good, time, maybe, great, remember, think, little, kid, hold, degree, final, rating, ★, ★, ★, like, personally, recommend]"
7443,Chicken Run,"A very good stop-motion animation!\r\n\r\n<em>'Chicken Run'</em>, which I watched a crap tonne when I was little but not for a vast number of years now, is an impressive production given it came out in 2000. Despite a pretty simple feel to the film, it's a very well developed concept.\r\n\r\nThe...",8.0,,"[good, stop, motion, animation, <, em>'chicken, run'</em, >, watch, crap, tonne, little, vast, number, year, impressive, production, give, come, 2000, despite, pretty, simple, feel, film, develop, concept, admittedly, short, run, time, truly, fly, course, look, relatively, terrific, impress, pac..."
7443,Chicken Run,"Ok, there is an huge temptation to riddle this review with puns - but I'm just going to say it's a cracking little family adventure. It's seemingly based on a whole range of classic movies from the ""Great Escape"", ""Star Trek"" to ""Love Story"" with a score cannibalised from just about any/everythi...",7.0,,"[ok, huge, temptation, riddle, review, pun, go, crack, little, family, adventure, seemingly, base, range, classic, movie, great, escape, star, trek, love, story, score, cannibalise, write, messrs., korngold, williams, bernstein, add, super, stop, motion, animation, ray, harryhausen, proud, flock..."


In [26]:
# Lemmatization
movie_reviews['lemmatized'] = batch_preprocess_texts(movie_reviews['review'], nlp=nlp_lite,use_lemmas=True)
movie_reviews.head()

8650it [01:30, 95.26it/s] 


movie_id,original_title,review,rating,ratings,tokens,lemmatized
843,花樣年華,"This is a fine piece of cinema from Wong Kar-Wai that tells us a story of two people whom circumstance throws together - but not in a way you might expect. We start with two couples who move into a new building. One a newspaper man with his wife, the other a business executive and his wife. The ...",7.0,,"[fine, piece, cinema, wong, kar, wai, tell, story, people, circumstance, throw, way, expect, start, couple, new, building, newspaper, man, wife, business, executive, wife, businessman, rarely, home, journalist, wife, leave, increasingly, loose, end, long, friendship, develop, usually, noodle, en...","[fine, piece, cinema, wong, kar, wai, tell, story, people, circumstance, throw, way, expect, start, couple, new, building, newspaper, man, wife, business, executive, wife, businessman, rarely, home, journalist, wife, leave, increasingly, loose, end, long, friendship, develop, usually, noodle, en..."
7443,Chicken Run,"A guilty pleasure for me personally, as I love both 'The Great Escape' and most of the works I have seen, over the years, from this rightfully-esteemed British animation company. Highly recommended both for children and for adults who enjoy animation.",9.0,High-rating,"[guilty, pleasure, personally, love, great, escape, work, see, year, rightfully, esteem, british, animation, company, highly, recommend, child, adult, enjoy, animation]","[guilty, pleasure, personally, love, great, escape, work, see, year, rightfully, esteem, british, animation, company, highly, recommend, child, adult, enjoy, animation]"
7443,Chicken Run,"Made my roommate who hates stop-motion animation watched this in 2018 and even he had a good time. It's maybe not as great as I remember thinking it was when I was a little kid, but it still holds up to some degree.\r\n\r\n_Final rating:★★★ - I liked it. Would personally recommend you give it a ...",6.0,,"[roommate, hate, stop, motion, animation, watch, 2018, good, time, maybe, great, remember, think, little, kid, hold, degree, final, rating, ★, ★, ★, like, personally, recommend]","[roommate, hate, stop, motion, animation, watch, 2018, good, time, maybe, great, remember, think, little, kid, hold, degree, final, rating, ★, ★, ★, like, personally, recommend]"
7443,Chicken Run,"A very good stop-motion animation!\r\n\r\n<em>'Chicken Run'</em>, which I watched a crap tonne when I was little but not for a vast number of years now, is an impressive production given it came out in 2000. Despite a pretty simple feel to the film, it's a very well developed concept.\r\n\r\nThe...",8.0,,"[good, stop, motion, animation, <, em>'chicken, run'</em, >, watch, crap, tonne, little, vast, number, year, impressive, production, give, come, 2000, despite, pretty, simple, feel, film, develop, concept, admittedly, short, run, time, truly, fly, course, look, relatively, terrific, impress, pac...","[good, stop, motion, animation, <, em>'chicken, run'</em, >, watch, crap, tonne, little, vast, number, year, impressive, production, give, come, 2000, despite, pretty, simple, feel, film, develop, concept, admittedly, short, run, time, truly, fly, course, look, relatively, terrific, impress, pac..."
7443,Chicken Run,"Ok, there is an huge temptation to riddle this review with puns - but I'm just going to say it's a cracking little family adventure. It's seemingly based on a whole range of classic movies from the ""Great Escape"", ""Star Trek"" to ""Love Story"" with a score cannibalised from just about any/everythi...",7.0,,"[ok, huge, temptation, riddle, review, pun, go, crack, little, family, adventure, seemingly, base, range, classic, movie, great, escape, star, trek, love, story, score, cannibalise, write, messrs., korngold, williams, bernstein, add, super, stop, motion, animation, ray, harryhausen, proud, flock...","[ok, huge, temptation, riddle, review, pun, go, crack, little, family, adventure, seemingly, base, range, classic, movie, great, escape, star, trek, love, story, score, cannibalise, write, messrs., korngold, williams, bernstein, add, super, stop, motion, animation, ray, harryhausen, proud, flock..."


In [32]:
# Join list of tokens into a string with spaces between each token
movie_reviews['tokens-joined'] = movie_reviews['tokens'].map(lambda x: " ".join(x))
# Join list of lemmas into a string with spaces between each lemma
movie_reviews['lemmas-joined'] = movie_reviews['lemmatized'].map(lambda x: " ".join(x))
movie_reviews.head(3)

movie_id,original_title,review,rating,ratings,tokens,lemmatized,tokens-joined,lemmas-joined
843,花樣年華,"This is a fine piece of cinema from Wong Kar-Wai that tells us a story of two people whom circumstance throws together - but not in a way you might expect. We start with two couples who move into a new building. One a newspaper man with his wife, the other a business executive and his wife. The ...",7.0,,"[fine, piece, cinema, wong, kar, wai, tell, story, people, circumstance, throw, way, expect, start, couple, new, building, newspaper, man, wife, business, executive, wife, businessman, rarely, home, journalist, wife, leave, increasingly, loose, end, long, friendship, develop, usually, noodle, en...","[fine, piece, cinema, wong, kar, wai, tell, story, people, circumstance, throw, way, expect, start, couple, new, building, newspaper, man, wife, business, executive, wife, businessman, rarely, home, journalist, wife, leave, increasingly, loose, end, long, friendship, develop, usually, noodle, en...",fine piece cinema wong kar wai tell story people circumstance throw way expect start couple new building newspaper man wife business executive wife businessman rarely home journalist wife leave increasingly loose end long friendship develop usually noodle entirely platonic relationship solid tru...,fine piece cinema wong kar wai tell story people circumstance throw way expect start couple new building newspaper man wife business executive wife businessman rarely home journalist wife leave increasingly loose end long friendship develop usually noodle entirely platonic relationship solid tru...
7443,Chicken Run,"A guilty pleasure for me personally, as I love both 'The Great Escape' and most of the works I have seen, over the years, from this rightfully-esteemed British animation company. Highly recommended both for children and for adults who enjoy animation.",9.0,High-rating,"[guilty, pleasure, personally, love, great, escape, work, see, year, rightfully, esteem, british, animation, company, highly, recommend, child, adult, enjoy, animation]","[guilty, pleasure, personally, love, great, escape, work, see, year, rightfully, esteem, british, animation, company, highly, recommend, child, adult, enjoy, animation]",guilty pleasure personally love great escape work see year rightfully esteem british animation company highly recommend child adult enjoy animation,guilty pleasure personally love great escape work see year rightfully esteem british animation company highly recommend child adult enjoy animation
7443,Chicken Run,"Made my roommate who hates stop-motion animation watched this in 2018 and even he had a good time. It's maybe not as great as I remember thinking it was when I was a little kid, but it still holds up to some degree.\r\n\r\n_Final rating:★★★ - I liked it. Would personally recommend you give it a ...",6.0,,"[roommate, hate, stop, motion, animation, watch, 2018, good, time, maybe, great, remember, think, little, kid, hold, degree, final, rating, ★, ★, ★, like, personally, recommend]","[roommate, hate, stop, motion, animation, watch, 2018, good, time, maybe, great, remember, think, little, kid, hold, degree, final, rating, ★, ★, ★, like, personally, recommend]",roommate hate stop motion animation watch 2018 good time maybe great remember think little kid hold degree final rating ★ ★ ★ like personally recommend,roommate hate stop motion animation watch 2018 good time maybe great remember think little kid hold degree final rating ★ ★ ★ like personally recommend


## <u>2) EDA and Visualization:
- `Create Word Clouds to visualize the most frequent and significant words in each group`
    - Remember, you can use this analysis to identify additional custom EDA stop words to use for visualization. (e.g. if the words are common in both groups)
    - <u>Save your WordClouds as .png files in the 'Images/' folder in your repo.</u>
- `Apply NLTK's` <u>`FreqDist`</u> `class to compare the frequency distribution of words in the review groups`
    - Remember, you can use this analysis to identify additional custom EDA stop words to use for visualization. (e.g. if the words are common in both groups)
    - <u>Save your `FreqDist` as .png files in the 'Images/' folder in your repo.</u>
- `Perform n-gram analysis (bigrams and trigrams)`
    - Remember, you can use this analysis to identify additional custom stop words to use for EDA. (e.g., if the words are common in both groups)
    - Focus on bigrams or trigrams, using NLTK's `BigramCollocationFinder` and `BigramAssocMeasures` classes (or the trigram equivalents, `TrigramCollocationFinder` and `TrigramAssocMeasures`) to explore commonly used groups of words for each rating group.
    - Describe any differences. What do these differences tell you?
    - <u>Save your data frame comparison of the top n-grams for each group as a Markdown table.</u>
        - You can use the df.to_markdown() method to create a string version of your data frame that can be copied & pasted into a Markdown cell & your readme.
        - <center><img src='Images/1701388160__copymarkdowndataframe.png'></center>
- `Perform sentiment analysis to create polarity scores according to VADER's sentiment lexicon`
    - Compare the sentiments of high-rating and low-rating texts
    - Compare the compound sentiment scores for high-rating and low-rating reviews.
    - Which review polarity scores don't match the ratings? Why do you think this is?

In [34]:
%conda install -c conda-forge spacy

Retrieving notices: ...working... done
Channels:
 - conda-forge
 - defaults
 - plotly
Platform: win-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\Valde\anaconda3\envs\dojo-env

  added / updated specs:
    - spacy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    spacy-3.7.3                |  py310h4856b71_0         5.8 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         5.8 MB

The following packages will be UPDATED:

  spacy                               3.7.2-py310h4856b71_0 --> 3.7.3-py310h4856b71_0 




In [39]:
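## Join the lemmas for each rating group into one long string
## (assumes the 'ratings' labels are 'High-rating' and 'Low-rating')
high_rating = ' '.join(movie_reviews.loc[movie_reviews['ratings'] == 'High-rating', 'lemmas-joined'])
low_rating = ' '.join(movie_reviews.loc[movie_reviews['ratings'] == 'Low-rating', 'lemmas-joined'])
type(high_rating)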

In [36]:
## Generate the WordCloud images from the joined text of each rating group
pos_cloud = WordCloud(min_word_length=2).generate(high_rating)
neg_cloud = WordCloud(min_word_length=2).generate(low_rating)

## Plot the images side by side
fig, axes = plt.subplots(ncols=2, figsize=(12, 6))
axes[0].imshow(pos_cloud)
axes[0].set_title('High-Rating Review Words')
axes[0].axis('off')

axes[1].imshow(neg_cloud)
axes[1].set_title('Low-Rating Review Words')
axes[1].axis('off')

In [None]:
## Import the ngrams function
from nltk import ngrams

In [None]:
## Isolate a single lemmatized document (the row at position 5)
## .iloc is used because the movie_id index is not unique
lemma_doc = movie_reviews['lemmatized'].iloc[5]
lemma_doc

In [None]:
# Create bigrams
list(ngrams(lemma_doc,2))

In [None]:
# Create trigrams
list(ngrams(lemma_doc,3))

In [None]:
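from nltk.tokenize import word_tokenize
# Assumption: use the first raw review as the sample text
sample_text = movie_reviews['review'].iloc[0]
# NLTK's word tokenization
word_tokens = word_tokenize(sample_text.lower())
print("Original text: \n", sample_text, '\n\n')
print('Word tokens: \n', word_tokens)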

In [None]:
from nltk.tokenize import TweetTokenizer
# NLTK's tweet tokenization (keeps emoticons and similar tokens together)
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(sample_text.lower())
print("Original text: \n", sample_text, '\n\n')
print("Tweet tokens: \n", tweet_tokens, '\n')

In [None]:
## Create a function to create bigrams
def make_bigrams(doc):
    bigrams = ngrams(doc, 2)
    bigrams = list(bigrams)
    return bigrams

In [None]:
# Add bigrams to the data frame with .apply()
movie_reviews['bigrams'] = movie_reviews['lemmatized'].apply(make_bigrams)
movie_reviews.head(10)

In [None]:
# Define bigrams
bigrams = ngrams(tweet_tokens, 2)
# display bigrams
list(bigrams)

In [None]:
# define trigrams
trigrams = ngrams(tweet_tokens, 3)
# display trigrams
list(trigrams)
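
The plotting cell that follows reads from `pos_ngram_scores` and `neg_ngram_scores`, which are not built anywhere above. The sketch below shows one possible way to create them with `BigramCollocationFinder` and `BigramAssocMeasures`. The helper name `get_ngram_scores`, the `raw_freq` column name, and the toy token lists are assumptions made for illustration; in the notebook, the token lists would be the exploded lemmas of each rating group.

```python
import pandas as pd
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder


def get_ngram_scores(tokens, label, min_freq=1):
    """Score bigrams by raw frequency (hypothetical helper, not from the notebook)."""
    measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    # Drop bigrams that appear fewer than min_freq times
    finder.apply_freq_filter(min_freq)
    # score_ngrams returns (bigram, score) pairs, highest score first
    scored = finder.score_ngrams(measures.raw_freq)
    return pd.DataFrame(scored, columns=[f"{label} ngram", "raw_freq"])


# Toy token lists standing in for the lemmas of each rating group
high_tokens = ["stop", "motion", "animation", "great", "stop", "motion", "fun"]
low_tokens = ["waste", "time", "bad", "plot", "waste", "time"]

pos_ngram_scores = get_ngram_scores(high_tokens, "positive")
neg_ngram_scores = get_ngram_scores(low_tokens, "negative")
print(pos_ngram_scores.head())
```

The column names match the `x='positive ngram'` and `x='negative ngram'` arguments used in the plot. For the README table, `pos_ngram_scores.head(20).to_markdown()` returns a Markdown string, provided the optional `tabulate` package is installed.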

In [None]:
# Create a figure and axes
fig, axes = plt.subplots(1,2, figsize=(12,8))

## Plot the ngram frequencies
pos_ngram_scores.head(20).plot(x='positive ngram', kind='bar', title='Positive Ngram Frequency',
                                        ax=axes[0])
neg_ngram_scores.head(20).plot(x='negative ngram', kind='bar', title='Negative Ngram Frequency',
                                        ax=axes[1])

## <u>3) Evaluation and Reporting:
- `Based on your analysis, what should someone do (or not do) if they want to make a highly-rated movie?`
    - List 3 things associated with high-rating reviews
    - List 3 things associated with low-rating reviews 
- `Update your project README with a new Section for 'NLP Analysis of Movie Reviews.'`
    - Include what reviews were used (the source and what the original rating numbers were before they were converted to a categorical target)
    - Include your EDA visualization in your README:
        - One WordCloud comparing both groups
        - 2 FreqDist plots(1 per group)
        - A Markdown table of the Top Ngrams for each group.
    - Your recommendations/conclusions for what to do/not to do to make a highly-rated movie

In [None]:
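## Create a list of all tokens in each rating group
## (assumes the 'ratings' labels are 'High-rating' and 'Low-rating')
pos_words = movie_reviews.loc[movie_reviews['ratings'] == 'High-rating', 'tokens'].explode().to_list()
neg_words = movie_reviews.loc[movie_reviews['ratings'] == 'Low-rating', 'tokens'].explode().to_list()
pos_words[:10]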

In [None]:
## Instantiate a frequency distribution for each rating group
pos_freq_dist = FreqDist(pos_words)
neg_freq_dist = FreqDist(neg_words)

## Plot the 20 most frequent tokens in each group
pos_freq_dist.plot(20, title='High-Rating Review Token Frequency Distribution')

neg_freq_dist.plot(20, title='Low-Rating Review Token Frequency Distribution');

In [None]:
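## Create a list of all lemmas in each rating group
## (assumes the 'ratings' labels are 'High-rating' and 'Low-rating')
pos_words = movie_reviews.loc[movie_reviews['ratings'] == 'High-rating', 'lemmatized'].explode().to_list()
neg_words = movie_reviews.loc[movie_reviews['ratings'] == 'Low-rating', 'lemmatized'].explode().to_list()

## Instantiate a frequency distribution for each rating group
pos_freq_dist = FreqDist(pos_words)
neg_freq_dist = FreqDist(neg_words)

## Plot the 20 most frequent lemmas in each group
pos_freq_dist.plot(20, title='High-Rating Review Lemma Frequency Distribution')

neg_freq_dist.plot(20, title='Low-Rating Review Lemma Frequency Distribution');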
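
The EDA notes suggest treating words that are common in both groups as custom stop words. A minimal sketch of that check follows, assuming the shared words are taken from the top `top_n` entries of each `FreqDist`. The toy word lists and the names `high_words`, `low_words`, `top_n`, and `shared` are illustrative only and do not come from the notebook.

```python
from nltk.probability import FreqDist

# Toy word lists standing in for the exploded tokens of each rating group
high_words = ["movie"] * 4 + ["film"] * 3 + ["great"] * 2 + ["love"]
low_words = ["movie"] * 4 + ["film"] * 3 + ["bad"] * 2 + ["boring"]

high_freq = FreqDist(high_words)
low_freq = FreqDist(low_words)

# Collect the top_n most frequent words in each group
top_n = 3
high_top = {word for word, _ in high_freq.most_common(top_n)}
low_top = {word for word, _ in low_freq.most_common(top_n)}

# Words frequent in both groups carry little group-specific signal
shared = sorted(high_top & low_top)
print(shared)
```

The resulting list can be merged with the `STOPWORDS` set from `wordcloud`, for example `custom_stopwords = {*STOPWORDS, *shared}`, and passed to `WordCloud(stopwords=custom_stopwords)` so the clouds emphasize words that differ between groups.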

## <u>Deliverables:
1. Notebook files for Preprocessing and EDA
2. EDA Images saved in an `"Images"` folder.
3. Update README 