# Natural Language Processing: Marvel's Rotten Tomatoes Reviews

## Setup

For this project, I decided to build an NLP model that can differentiate between good and bad movie reviews.

The idea came from watching the latest Marvel movie, which I enjoyed, only to realise that it had pretty bad reviews on Rotten Tomatoes. So I decided to collect all the critics' reviews for the 26 movies that are considered part of the Marvel Cinematic Universe (MCU). An important note is that Rotten Tomatoes publishes only a summary of each review rather than the full text.

So this project consists of 4 major parts:

1. Scraping Rotten Tomatoes to download all reviews for each of the 26 movies.

2. Quick Exploratory Data Analysis.

3. Creating a Bag of Words model.

4. Choosing the best classification model + tuning the hyperparameters to (hopefully!) improve the performance.

## Importing Libraries

In [1]:
# web scraping
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
# raw string, so the backslashes in the Windows path are not treated as escapes
PATH = r'C:\Program Files (x86)\chromedriver.exe'

# data manipulation
import pandas as pd
import numpy as np

# strings processing
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# machine learning
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from statistics import mean
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\artur\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Web Scraping with BeautifulSoup and Selenium

Web scraping is usually an easy process, but Rotten Tomatoes did not want to make it easy!

Each movie has a page with critics' reviews, usually at a link in the following format: 'https://www.rottentomatoes.com/m/the_avengers/reviews'. The problem is that not all reviews are on the same page. Moreover, the link does not contain a page number (i.e. you can't just change a number in the link to go to another page), so the only way to reach the next page of reviews is to press the 'Next' button.

As a result, for this project I am going to use the Selenium library, which drives Chrome through ChromeDriver, to open each page by clicking the 'Next' button. So the new plan is the following:

1. Use Beautiful Soup to scrape the page that contains the names and links of all 26 MCU movies in chronological order (MCU chronology).

2. For each movie, use Selenium to open all review pages and Beautiful Soup to scrape all reviews and scores.

First, I am going to create a helper function that uses Beautiful Soup to scrape all reviews and scores on a page. It takes the HTML of a page, the name of the movie, and the movies, reviews and scores lists to populate. These lists will be used to create a dataframe later on.

In [2]:
def reviews_on_page_html(html, movie, movies, reviews, scores):
    # creates a soup object
    pageSoup = BeautifulSoup(html, 'html.parser')
    # grabs all the reviews
    all_reviews = pageSoup.find_all('div', class_='row review_table_row')
    # for each review appends the movie name, review itself and the score
    for review in all_reviews:
        movies.append(movie)
        review_text = review.find('div', class_='the_review').text
        review_text = review_text.strip()
        reviews.append(review_text)
        score = review.find('div', class_='col-xs-16 review_container').contents[1]['class'][3]
        scores.append(score)

The following helper function gets all reviews for a particular movie. It takes the URL of the first page of reviews, the movie name, and the movies, reviews, and scores lists to populate.

In [3]:
def all_movie_reviews(html, movie, movies, reviews, scores):
    # special case for the first page
    headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    pageTree = requests.get(html, headers=headers)
    reviews_on_page_html(pageTree.content, movie, movies, reviews, scores)
    # initialize webdriver for Chrome
    ser = Service(PATH)
    op = webdriver.ChromeOptions()
    driver = webdriver.Chrome(service=ser, options=op)
    # boolean that controls the loop
    cont = True
    driver.get(html)
    while cont:
        # try to find the next button
        try:
            # tries to find the next button by XPATH, and clicks it
            WebDriverWait(driver,2).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="content"]/div/div/div/nav[2]/button[2]'))).click()
            # if found sleeps for 2 seconds to let the page load
            time.sleep(2)
            # gets the new html
            html = driver.page_source
            reviews_on_page_html(html, movie, movies, reviews, scores)
        # if no next button, end the loop
        except Exception:
            cont = False
    # close the browser so Chrome windows do not pile up across movies
    driver.quit()

This is the final function, which uses the two helper functions above to get all movies from the website and, for each movie, all of its reviews. It returns three lists: reviews, scores, and movies.

In [4]:
def get_reviews():
   reviews = []
   scores = []
   movies = []
   all_movies_link = 'https://editorial.rottentomatoes.com/guide/marvel-movies-in-order/'
   headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
   pageTree = requests.get(all_movies_link, headers=headers)
   pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
   all_movies = pageSoup.find_all('h2')
   all_movies = all_movies[0:26]
    
   for item in all_movies:
      movie_link = item.find('a')['href']
      movie = item.find('a').text
      reviews_link = movie_link + '/reviews'
      all_movie_reviews(reviews_link, movie, movies, reviews, scores)

   return reviews, scores, movies

Now I can run the function above and create a dataframe. I will also save the results to a CSV file, so I can reuse them without running get_reviews again.

get_reviews took 24 minutes to run because Selenium has to load every page in a browser. If each page of reviews had a unique URL, this process would be significantly faster.

In [5]:
#reviews, scores, movies = get_reviews()
#dataframe = pd.DataFrame({'movie': movies, 'review': reviews, 'score': scores})
#dataframe.to_csv('reviews.csv', index=False)

In [6]:
dataframe = pd.read_csv('reviews.csv')

## Exploratory Data Analysis

Now I will have a quick look at the data. This is not a required step for a Bag of Words model, since the whole point of it is to strip the text of all unnecessary words and keep only the most important ones. However, I am still curious about the data I have collected.

In [7]:
dataframe.head()

| | movie | review | score |
|---|---|---|---|
| 0 | Captain America: The First Avenger | Lack of pop. There's not really much here that... | rotten |
| 1 | Captain America: The First Avenger | A perfect throwback and homage to the classic ... | fresh |
| 2 | Captain America: The First Avenger | I love the 1940s touches throughout and the ac... | fresh |
| 3 | Captain America: The First Avenger | The 40s setting is well realised, the characte... | fresh |
| 4 | Captain America: The First Avenger | Like a drastically more unbelievable Indiana J... | rotten |


In [8]:
dataframe.tail()

| | movie | review | score |
|---|---|---|---|
| 9877 | Eternals | Eternals is beautifully shot and terrifically ... | fresh |
| 9878 | Eternals | Director Chloé Zhao's entry into the superhero ... | fresh |
| 9879 | Eternals | While Eternals has most of the benchmarks of a... | fresh |
| 9880 | Eternals | Unlike anything ever seen in the MCU before. T... | fresh |
| 9881 | Eternals | Setting aside whatever faults I found, Eternal... | fresh |


So there are 9882 reviews for all MCU movies (the index runs from 0 to 9881), starting with Captain America: The First Avenger and ending with the latest release: Eternals.

First, let's check that for each movie we have collected all the reviews and all the scores.

In [9]:
dataframe.groupby('movie').count()

| movie | review | score |
|---|---|---|
| Ant-Man | 336 | 336 |
| Ant-Man and The Wasp | 439 | 439 |
| Avengers: Age of Ultron | 375 | 375 |
| Avengers: Endgame | 547 | 547 |
| Avengers: Infinity War | 485 | 485 |
| Black Panther | 525 | 525 |
| Black Widow | 437 | 437 |
| Captain America: Civil War | 426 | 426 |
| Captain America: The First Avenger | 274 | 274 |
| Captain America: The Winter Soldier | 306 | 306 |


We also need to convert the score from a categorical to a numerical data type so it can be used later for classification.

In [10]:
dataframe = dataframe.replace({'score': {'fresh': 1, 'rotten': 0}})
dataframe.head()

| | movie | review | score |
|---|---|---|---|
| 0 | Captain America: The First Avenger | Lack of pop. There's not really much here that... | 0 |
| 1 | Captain America: The First Avenger | A perfect throwback and homage to the classic ... | 1 |
| 2 | Captain America: The First Avenger | I love the 1940s touches throughout and the ac... | 1 |
| 3 | Captain America: The First Avenger | The 40s setting is well realised, the characte... | 1 |
| 4 | Captain America: The First Avenger | Like a drastically more unbelievable Indiana J... | 0 |


Moreover, by assigning rotten to 0 and fresh to 1, we can easily calculate how many fresh (or rotten) scores each movie has, as well as the fraction of fresh scores.

In [11]:
dataframe.groupby('movie')[['score']].sum().sort_values(by='score', ascending=False)

| movie | score |
|---|---|
| Avengers: Endgame | 513 |
| Black Panther | 506 |
| Captain Marvel | 429 |
| Avengers: Infinity War | 412 |
| Thor: Ragnarok | 407 |
| Spider-Man: Far From Home | 407 |
| Captain America: Civil War | 384 |
| Ant-Man and The Wasp | 383 |
| Spider-Man: Homecoming | 360 |
| Guardians of the Galaxy Vol. 2 | 358 |


In [12]:
dataframe.groupby('movie')[['score']].mean().sort_values(by='score', ascending=False)

| movie | score |
|---|---|
| Black Panther | 0.96381 |
| Avengers: Endgame | 0.937843 |
| Iron Man | 0.935943 |
| Thor: Ragnarok | 0.93135 |
| Guardians of the Galaxy | 0.916168 |
| Shang-Chi and the Legend of the Ten Rings | 0.915094 |
| Spider-Man: Homecoming | 0.913706 |
| Marvel's the Avengers | 0.90884 |
| Spider-Man: Far From Home | 0.902439 |
| Captain America: Civil War | 0.901408 |


It seems like reviewers hated Eternals. I quite liked it!

Finally, it is important to check the distribution of rotten and fresh scores. An unbalanced distribution would be problematic when we train classification models, so if the dataset is unbalanced, oversampling or undersampling should be used.

In [13]:
dataframe.groupby('score').count().drop('movie', axis=1)

| score | review |
|---|---|
| 0 | 1570 |
| 1 | 8312 |


As we can see, Marvel has significantly more fresh reviews. So before training I will need to solve this problem!
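
To get a feel for how strong the imbalance is, here is a small sketch that uses only the two counts from the table above. It computes the accuracy of a hypothetical "classifier" that labels every review as fresh; the variable names are my own and nothing here is part of the modelling code.

```python
# class counts taken from the table above
n_rotten = 1570
n_fresh = 8312
total = n_rotten + n_fresh

# a model that always answers "fresh" is right on every fresh review
# and wrong on every rotten one
always_fresh_accuracy = n_fresh / total

print(f"total reviews: {total}")
print(f"always-fresh accuracy: {always_fresh_accuracy:.3f}")
```

So a model can score around 84% accuracy on this data without learning anything about rotten reviews, which is why the class balance has to be dealt with before training.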

## Data Cleanup

The first thing to do is to clean up the review column. Quite often the reviews contain square brackets [] with words in them that complete the sentence. Let's look at the reviews that have them.

In [14]:
reviews_with_brackets = []
for review in dataframe['review']:
    if review.find('[') != -1:
        reviews_with_brackets.append(review)

print(reviews_with_brackets[:10])

['[E]normously entertaining... deft and witty... [C]heering and poignant at the same time, recalling the promise of a future that ended up going in a different direction...', '[A] clean, unpretentious, brawnily entertaining fantasia.', "[it] generally moves along with moxie, charm, gobs of special effects and stunt work, and fond memories of the USO style shows that sold war bonds and boosted support in the battle against the Nazis. Yes, there's even some singing and dancing.", '[A] so-so experience.', "[The hero's] transformation from commonplace but with personality to bulky but generally nondescript echoes the movie's own shift.", "[Marvel] saved one of their very best movies for last, and I suspect Captain America: The First Avenger will send audiences out of the theater rabid to see what's next.", '[A] hokey, hacky, two-hour-plus exercise in franchise transition/price gouging, complete with utterly unnecessary post-converted 3-D.', 'Larson plays [Captain Marvel] with dignity and h

We can see that quite a lot of reviews were written in another language, so they have a note at the end that starts with '[Full'. We should remove these notes since they are irrelevant.

In [15]:
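cleaned_reviews = []
for review in dataframe['review']:
    last_bracket = review.rfind('[Full')
    if last_bracket != -1:
        # drop the note and any whitespace before it
        review = review[:last_bracket].rstrip()
    cleaned_reviews.append(review)
# write the cleaned text back; reassigning the loop variable alone changes nothing
dataframe['review'] = cleaned_reviews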
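
As a quick illustration of what this step is meant to achieve, here is a minimal sketch on plain strings. The helper name `strip_full_review_note` and the first sample review are made up for the example; the second sample is one of the bracketed reviews printed above and should come through unchanged.

```python
def strip_full_review_note(review: str) -> str:
    """Drop a trailing note that starts with '[Full', if there is one (hypothetical helper)."""
    start = review.rfind('[Full')
    if start == -1:
        return review
    # cut at the bracket and remove the whitespace left in front of it
    return review[:start].rstrip()

samples = [
    'A fun ride from start to finish. [Full review in Spanish]',  # made-up example
    '[A] so-so experience.',                                       # from the output above
]
cleaned = [strip_full_review_note(r) for r in samples]
print(cleaned)
```

Note that the result is collected into a new list. The cleaned strings only take effect if they are stored somewhere, for example by assigning them back to the dataframe column.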

## Bag of Words

Now it is time for the Bag of Words model. First, we load the list of English stopwords.

In [16]:
corpus = []
all_stopwords = stopwords.words('english')
print(all_stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

While the majority of the stopwords are good to go, there are some that would actually damage our predictions, in particular "didn't". If we remove it from a review, the phrase "I wish I didn't go", which clearly has a negative connotation, turns into "I wish I go", which is positive. So we need to remove some stopwords from this list.

In [17]:
not_stopwords = ['between', 'below', 'above', 'most', 'not', 'nor', 'too', 'very', 'don', "don't", 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

for word in not_stopwords:
    try:
        all_stopwords.remove(word)
    except ValueError:
        # the word is not in this version of the stopword list
        pass
print(all_stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'into', 'through', 'during', 'before', 'after', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'other', 'some', 'such', 'no', 'only', 'own', 'same', 'so', 'than', 's', 't', 'can', 'will', 'just', 'should', "should'v
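
The effect of keeping or dropping a negation can be checked on the example phrase from above. This is only a sketch: the two tiny stopword sets are hand-picked for the illustration and are not the NLTK list, and `drop_stopwords` is a made-up helper.

```python
phrase = "i wish i didn't go"

# tiny hand-picked stopword sets, for illustration only
negation_is_stopword = {"i", "didn't"}
negation_is_kept = {"i"}

def drop_stopwords(text, stopword_set):
    """Remove every word that appears in stopword_set."""
    return ' '.join(word for word in text.split() if word not in stopword_set)

without_negation = drop_stopwords(phrase, negation_is_stopword)
with_negation = drop_stopwords(phrase, negation_is_kept)

print(without_negation)  # wish go
print(with_negation)     # wish didn't go
```

With the negation treated as a stopword, the remaining words read as a positive statement, which is exactly the information the classifier would lose.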

Now it is time to clean up the reviews. There are several things we want to do. First, we replace every character that is not a letter (punctuation, digits, and the brackets) with a space; that's why in Data Cleanup we only removed some brackets, since the others are removed here. Second, we lowercase all the words, so 'Hello' and 'hello' are counted as the same word. Third, we split each review into a list of words. Fourth, we remove the stopwords and take the stem of each remaining word, so words like 'party' and 'partying' are counted as the same. Stemming matters for verbs in particular. Finally, we join the words back together with spaces in between and append the result to the corpus list.

In [18]:
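ps = PorterStemmer()
stopword_set = set(all_stopwords)  # build the set once for fast lookups
for review in dataframe['review']:
    # keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', review)
    review = review.lower()
    review = review.split()
    # drop stopwords and stem what remains
    review = [ps.stem(word) for word in review if word not in stopword_set]
    review = ' '.join(review)
    corpus.append(review)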
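
To make the first three steps concrete, here is a small sketch that applies the same regular expression, lowercasing and splitting to two short strings, without stemming or stopword removal. The `tokenize` helper is a name I made up for the example.

```python
import re

def tokenize(review):
    """Replace non-letters with spaces, lowercase, and split into words."""
    letters_only = re.sub('[^a-zA-Z]', ' ', review)
    return letters_only.lower().split()

bracket_tokens = tokenize('[A] so-so experience.')
negation_tokens = tokenize("I wish I didn't go")

print(bracket_tokens)   # ['a', 'so', 'so', 'experience']
print(negation_tokens)  # ['i', 'wish', 'i', 'didn', 't', 'go']
```

The second result shows a side effect worth knowing: the apostrophe is replaced too, so "didn't" arrives at the stopword filter as 'didn' and 't'. That is why bare forms such as 'didn' had to be taken out of the stopword list alongside "didn't".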

The final step is to use CountVectorizer to create a matrix in which each unique word corresponds to a column and each review is a row. Each cell holds the number of times the word appears in the review, or 0 if it does not appear.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = dataframe['score']
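
The shape of that matrix is easier to see on a tiny example. The sketch below builds the same kind of count matrix by hand with the standard library. It only mimics the idea; it is not how scikit-learn implements it, and the two-document corpus is made up. (With its default settings, CountVectorizer also lowercases the text and ignores one-letter tokens.)

```python
from collections import Counter

docs = ['fun fun ride', 'dull ride']  # made-up mini corpus

# one column per unique word, in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# one row per document, holding the count of each vocabulary word
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['dull', 'fun', 'ride']
print(matrix)      # [[0, 2, 1], [1, 0, 1]]
```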

## Classification: Data Preprocessing and Model Selection

Now it's time for data preprocessing. First thing to do is to create train and test sets. 

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0, stratify=y)

Above, I discussed how this dataset is unbalanced, i.e. it has many more fresh scores than rotten ones. We want the test set to be representative of the data, which is why we passed stratify=y in the previous step. Moreover, it is crucial to do the split before adding or removing samples, because the test set needs to be representative of real-world data.
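
A quick sanity check of what stratification gives us, using only numbers that appear in this notebook: the class counts from the EDA section and the per-class test counts from the support column of the classification report further down. The variable names are my own.

```python
import math

# class counts in the full dataset (EDA section)
n_rotten, n_fresh = 1570, 8312
total = n_rotten + n_fresh

# class counts in the test set (support column of the classification report)
test_rotten, test_fresh = 236, 1247
test_total = test_rotten + test_fresh

overall_rotten_share = n_rotten / total
test_rotten_share = test_rotten / test_total

print(test_total, math.ceil(0.15 * total))  # test set size vs. 15% of the data
print(round(overall_rotten_share, 3), round(test_rotten_share, 3))
print(n_rotten - test_rotten)               # rotten reviews left for training
```

The share of rotten reviews is about 16% both overall and in the test set, and 1334 rotten reviews remain for training.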

So now the question arises: should I oversample or undersample? Oversampling is usually easy, but with reviews it is difficult to make up new observations. So we are going to undersample.

In [21]:
sampler = RandomUnderSampler(sampling_strategy='majority')
X_train_sample, y_train_sample = sampler.fit_resample(X_train, y_train)

In [22]:
y_train_sample.value_counts()

0    1334
1    1334
Name: score, dtype: int64

After undersampling, we get a training size of 2668 reviews, which is still pretty good. 
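
For intuition, random undersampling can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions, not the imbalanced-learn implementation: the `undersample_majority` helper is made up, and the label list is rebuilt from the training class counts implied by the numbers above (7065 fresh and 1334 rotten).

```python
import random

def undersample_majority(labels, seed=0):
    """Return sorted indices that keep an equal number of samples per class."""
    rng = random.Random(seed)
    # group sample indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    minority_size = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        # the minority class is kept whole, larger classes are cut down at random
        kept.extend(rng.sample(idxs, minority_size))
    return sorted(kept)

labels = [1] * 7065 + [0] * 1334  # training labels: fresh = 1, rotten = 0
kept = undersample_majority(labels)

print(len(kept))                     # 2668
print(sum(labels[i] for i in kept))  # 1334 fresh reviews kept
```

The price of this approach is visible in the numbers: more than 5,700 fresh training reviews are thrown away to match the 1334 rotten ones.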

Now it is time to choose the best classification model. For this project, I have decided to try Logistic Regression, K-Nearest Neighbors, Random Forest Classifier, and XGBoost Classifier. Since the training dataset is balanced, we can use accuracy as our main evaluation metric.

In [23]:
lr = LogisticRegression(solver='liblinear', class_weight='balanced')
knn = KNN()
rf = RandomForestClassifier()
xgb = XGBClassifier(eval_metric='mlogloss')
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('XGB Classifier', xgb),('Random Forest', rf)]

In [24]:
for clf_name, clf in classifiers:
    # 5-fold cross-validated accuracy on the balanced training set
    cv_scores = cross_val_score(clf, X_train_sample, y_train_sample, scoring='accuracy', cv=5)
    print(f'{clf_name} ')
    print(mean(cv_scores))

Logistic Regression 
0.7106449958190161
K Nearest Neighbours 
0.5554637378698766
XGB Classifier 
0.6817807477988349
Random Forest 
0.6964001377265285


So we can see that Logistic Regression performs best. Now we can try it on the test set.

In [25]:
lr.fit(X_train_sample, y_train_sample)
y_pred = lr.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))
print(classification_report(y_test, y_pred))

0.6945380984490896
[[870 377]
 [ 76 160]]
              precision    recall  f1-score   support

           0       0.30      0.68      0.41       236
           1       0.92      0.70      0.79      1247

    accuracy                           0.69      1483
   macro avg       0.61      0.69      0.60      1483
weighted avg       0.82      0.69      0.73      1483



Wow! While the accuracy is around 70%, we can see that the model actually does not perform very well. It does reasonably well on fresh reviews, but the precision for rotten reviews is too low: out of all reviews that were classified as rotten, only 30% were in fact rotten. The recall for rotten reviews looks decent at 68%, but the main explanation is that there are only 236 rotten reviews in the test set, and since the model labels far too many reviews as rotten, it is bound to catch many of the truly rotten ones as well.

For fresh reviews everything looks much better. The recall is 70% and the precision is 92%, meaning that while the model did not find all the positive (fresh) reviews, 92% of the reviews it labelled fresh really were fresh.
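
The numbers in the report can be reproduced directly from the confusion matrix printed above. The sketch below does the arithmetic by hand; the variable names are my own, and the layout follows the labels=[1, 0] argument, so rows are the true classes (fresh, then rotten) and columns are the predicted classes in the same order.

```python
# confusion matrix from the output above (labels=[1, 0])
fresh_as_fresh, fresh_as_rotten = 870, 377    # true fresh reviews
rotten_as_fresh, rotten_as_rotten = 76, 160   # true rotten reviews

# precision: of everything predicted as a class, how much was correct
precision_rotten = rotten_as_rotten / (rotten_as_rotten + fresh_as_rotten)
precision_fresh = fresh_as_fresh / (fresh_as_fresh + rotten_as_fresh)

# recall: of everything that truly belongs to a class, how much was found
recall_rotten = rotten_as_rotten / (rotten_as_rotten + rotten_as_fresh)
recall_fresh = fresh_as_fresh / (fresh_as_fresh + fresh_as_rotten)

n_test = fresh_as_fresh + fresh_as_rotten + rotten_as_fresh + rotten_as_rotten
accuracy = (fresh_as_fresh + rotten_as_rotten) / n_test

print(f"rotten: precision {precision_rotten:.2f}, recall {recall_rotten:.2f}")
print(f"fresh:  precision {precision_fresh:.2f}, recall {recall_fresh:.2f}")
print(f"accuracy {accuracy:.2f}")
```

The 377 fresh reviews that were labelled rotten are what drags the rotten precision down to 30%: they outnumber the 160 correctly identified rotten reviews by more than two to one.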

## Conclusion

Overall, the model does not perform that well. It really struggles to identify negative reviews and instead classifies far too many positive reviews as negative. There are two possible ways to improve it:

1. Improve the Bag of Words model. Maybe we removed some words that change the meaning from positive to negative.
2. Get more reviews. The model may need more observations to learn which words make negative reviews negative. Undersampling discarded a lot of positive reviews, yet we used every negative review in the training set, and it was still not enough.