<a href="https://colab.research.google.com/github/aryankapoorr/moviesentiment/blob/main/SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Building
NLP sentiment analysis model that classifies the sentiment of movie reviews using a 1D convolutional neural network (CNN)

In [1]:
# install sentiment analysis libraries
! pip install tensorflow scikit-learn pandas numpy pickle5
! pip install datasets
! pip install tensorflow
! pip install selenium
! pip install pandas gspread gspread-dataframe oauth2client

Collecting pickle5
  Downloading pickle5-0.0.11.tar.gz (132 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.1/132.1 kB 1.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pickle5
  Building wheel for pickle5 (setup.py) ... done
  Created wheel for pickle5: filename=pickle5-0.0.11-cp310-cp310-linux_x86_64.whl size=255313 sha256=b3fabcf1012e8aa94ebdf4d933906fdc80bfac970b37af6c531d7f20fb90784e
  Stored in directory: /root/.cache/pip/wheels/7d/14/ef/4aab19d27fa8e58772be5c71c16add0426acf9e1f64353235c
Successfully built pickle5
Installing collected packages: pickle5
Successfully installed pickle5-0.0.11
Collecting datasets
  Downloading datasets-2.19.0-py3-none-any.whl (542 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 542.0/542.0 kB 6.7 MB/s eta 0:00:00
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-

In [2]:
# import required libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
import pickle5 as pickle
import csv
import tensorflow.keras.utils

In [3]:
# download the dataset from huggingface (courtesy of stanfordnlp)
from datasets import load_dataset
dataset = load_dataset("stanfordnlp/imdb")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [4]:
# load dataset
df = pd.DataFrame(dataset['train'])
tmpDf = pd.DataFrame(dataset['test'])
sets = [df, tmpDf]
df = pd.concat(sets)
print(df)

                                                    text  label
0      I rented I AM CURIOUS-YELLOW from my video sto...      0
1      "I Am Curious: Yellow" is a risible and preten...      0
2      If only to avoid making this type of film in t...      0
3      This film was probably inspired by Godard's Ma...      0
4      Oh, brother...after hearing about this ridicul...      0
...                                                  ...    ...
24995  Just got around to seeing Monster Man yesterda...      1
24996  I got this as part of a competition prize. I w...      1
24997  I got Monster Man in a box set of three films ...      1
24998  Five minutes in, i started to feel how naff th...      1
24999  I caught this movie on the Sci-Fi channel rece...      1

[50000 rows x 2 columns]


In [5]:
df = df.sample(frac=1).reset_index(drop=True)

The next few code cells build a tokenizer that caps the model's vocabulary at the 5,000 most frequent words, the size that gave the best performance. The tokenizer then converts each review into a sequence of integer word indices, which is padded or truncated to 100 tokens.

In [6]:
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')

In [7]:
tokenizer.fit_on_texts(df['text'])
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(df['text'])
padded_sequences = pad_sequences(sequences, maxlen=100, truncating='post')
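
To make the tokenize-then-pad step concrete, here is a minimal pure-Python sketch of the same idea. It is a simplification, not the Keras implementation: the helper names (`build_word_index`, `encode`, `pad_or_truncate`) and the toy reviews are hypothetical, and it assumes whitespace splitting, index 1 reserved for the out-of-vocabulary token, zero padding at the front, and truncation at the end.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the out-of-vocabulary token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, num_words):
    """Map words to indices; unknown or too-rare words become the OOV index (1)."""
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word, 1)
        ids.append(idx if idx < num_words else 1)
    return ids

def pad_or_truncate(seq, maxlen):
    """Pad with zeros at the front, or keep only the first maxlen tokens."""
    if len(seq) >= maxlen:
        return seq[:maxlen]
    return [0] * (maxlen - len(seq)) + seq

# Toy data (hypothetical), standing in for df['text']
toy_reviews = ["good movie", "bad movie", "good good film"]
word_index = build_word_index(toy_reviews)
encoded = encode("good film unknown", word_index, num_words=5)
padded = pad_or_truncate(encoded, maxlen=5)
print(word_index)
print(padded)
```

With `num_words=5`, only indices below 5 survive, so both the rare word "film" and the unseen word "unknown" collapse to the OOV index.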

In [8]:
# one-hot encodes the sentiment labels (column 0 = negative, column 1 = positive)
sentiment_labels = pd.get_dummies(df['label']).values
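
As a small illustration of the encoding, the sketch below one-hot encodes a few labels by hand. The `one_hot` helper is hypothetical; it assumes the dataset's convention that label 0 is a negative review and label 1 a positive one, so the negative class lands in column 0.

```python
def one_hot(label, num_classes=2):
    """Return a list with a 1 in the position of the label and 0 elsewhere."""
    row = [0] * num_classes
    row[label] = 1
    return row

labels = [0, 1, 1, 0]            # hypothetical sample of df['label']
encoded_labels = [one_hot(label) for label in labels]
print(encoded_labels)
```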

In [9]:
# randomly splits data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(padded_sequences, sentiment_labels, test_size=0.2)

In [10]:
# @title Model Training
# Based on the described hyperparameters, activation functions, and optimizer,
# creates a 1D convolutional (CNN) sentiment analysis model.
model = Sequential()
model.add(Embedding(5000, 100, input_length=100))
model.add(Conv1D(64, 5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 100)          500000    
                                                                 
 conv1d (Conv1D)             (None, 96, 64)            32064     
                                                                 
 global_max_pooling1d (Glob  (None, 64)                0         
 alMaxPooling1D)                                                 
                                                                 
 dense (Dense)               (None, 32)                2080      
                                                                 
 dropout (Dropout)           (None, 32)                0         
                                                                 
 dense_1 (Dense)             (None, 2)                 66        
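
The parameter counts and the convolution output length in the summary can be reproduced by hand. The arithmetic below is a sketch using only the layer sizes from the model definition; the variable names are illustrative.

```python
vocab_size, embed_dim, seq_len = 5000, 100, 100
filters, kernel_size = 64, 5

# Embedding: one vector of embed_dim weights per vocabulary entry
embedding_params = vocab_size * embed_dim
# Conv1D: each filter has kernel_size * embed_dim weights plus one bias
conv_params = (kernel_size * embed_dim + 1) * filters
# 'valid' convolution shortens the sequence by kernel_size - 1
conv_out_len = seq_len - kernel_size + 1
# Dense layers: (inputs + 1 bias) * units
dense_params = (filters + 1) * 32
output_params = (32 + 1) * 2

total_params = embedding_params + conv_params + dense_params + output_params
print(embedding_params, conv_params, conv_out_len, dense_params, output_params, total_params)
```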
                                                        

In [11]:
# train the neural network for 13 epochs on the training data
model.fit(x_train, y_train, epochs=13, batch_size=560, validation_data=(x_test, y_test))

Epoch 1/13
Epoch 2/13
Epoch 3/13
Epoch 4/13
Epoch 5/13
Epoch 6/13
Epoch 7/13
Epoch 8/13
Epoch 9/13
Epoch 10/13
Epoch 11/13
Epoch 12/13
Epoch 13/13


<keras.src.callbacks.History at 0x7b8614b56020>

# Model Testing and Scoring

In [12]:
# predict the sentiment labels for the test set
y_pred = np.argmax(model.predict(x_test), axis=-1)
print("Accuracy:", accuracy_score(np.argmax(y_test, axis=-1), y_pred))

Accuracy: 0.8327
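
For reference, the accuracy computation reduces to comparing the argmax of each predicted row with the argmax of the matching one-hot label. A minimal sketch with made-up probabilities (the data and helper names are hypothetical):

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(y_true_one_hot, y_prob):
    """Fraction of rows where the predicted class matches the true class."""
    correct = sum(argmax(t) == argmax(p) for t, p in zip(y_true_one_hot, y_prob))
    return correct / len(y_true_one_hot)

y_true = [[1, 0], [0, 1], [0, 1], [1, 0]]
y_prob = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]]
print(accuracy(y_true, y_prob))  # 3 of 4 rows match
```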


In [13]:
# Print out test results to visualize data
predictions = model.predict(x_test)
outputPred = []

for i in range(500):
  # Column 0 is the negative class (label 0); column 1 is the positive class (label 1)
  if np.argmax(predictions[i], axis=-1) == 1:
    sent = "Positive"
  else:
    sent = "Negative"

  outputPred.append((predictions[i][0], predictions[i][1], sent))

columns = ['negSent', 'posSent', 'prediction']
d = pd.DataFrame(outputPred, columns=columns)
d.to_csv('sentiment_test_short.csv', index=False)
print(d)

      negSent       posSent prediction
0    0.000533  9.994670e-01   Positive
1    0.999955  4.524133e-05   Negative
2    0.999996  4.263753e-06   Negative
3    0.999686  3.137217e-04   Negative
4    1.000000  3.436396e-10   Negative
..        ...           ...        ...
495  0.000123  9.998770e-01   Positive
496  0.033838  9.661622e-01   Positive
497  0.000188  9.998116e-01   Positive
498  0.993504  6.496445e-03   Negative
499  0.999995  4.859855e-06   Negative

[500 rows x 3 columns]


In [29]:
# save the model
model.save('sentiment_analysis_model.h5')
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=4)

In [30]:
# Load the saved model and tokenizer
import keras

model = keras.models.load_model('sentiment_analysis_model.h5')
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
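
The save/load round trip for the tokenizer follows the usual pickle pattern. The sketch below shows it with the standard-library `pickle` module (which supports protocol 4 on Python 3.4 and later) and a plain dictionary standing in for the tokenizer; the file name and the dictionary contents are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer (hypothetical contents)
state = {"num_words": 5000, "oov_token": "<OOV>", "word_index": {"<OOV>": 1, "the": 2}}

path = os.path.join(tempfile.mkdtemp(), "tokenizer_state.pickle")

# Serialize with the same protocol used in the notebook
with open(path, "wb") as handle:
    pickle.dump(state, handle, protocol=4)

# Restore and confirm nothing was lost
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == state)
```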

In [31]:
# function to use the model to predict the sentiment of text

def predict_sentiment(text):
    """Return the model's [P(negative), P(positive)] probabilities for text."""
    # Tokenize and pad the input text, truncating the same way as in training
    text_sequence = tokenizer.texts_to_sequences([text])
    text_sequence = pad_sequences(text_sequence, maxlen=100, truncating='post')

    # Make a prediction using the trained model
    predicted_rating = model.predict(text_sequence, verbose=0)[0]

    predicted_probabilities = np.array(predicted_rating)
    #print(predicted_rating)

    pos_threshold = 0.9
    neg_threshold = 0.1
    neutral_threshold = 0.999  # Adjust this threshold as needed

    # Calculate the difference between positive and negative probabilities
    diff = abs(predicted_rating[1] - predicted_rating[0])
    #print("Diff: " + str(diff))

    # Check if the difference is below the neutral threshold
    if diff < neutral_threshold:
        predicted_sentiment = 'NEUTRAL'
    # Check if sentiment is positive
    elif predicted_rating[1] > pos_threshold:
        predicted_sentiment = 'POSITIVE'
    # Check if sentiment is negative
    elif predicted_rating[0] > neg_threshold:
        predicted_sentiment = 'NEGATIVE'
    else:
        predicted_sentiment = 'NEUTRAL'  # Default to neutral if none of the conditions are met

    # The label above is not returned; callers work with the raw probabilities
    return predicted_probabilities
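
The threshold logic can be isolated as a pure function, which makes it easy to check against known probabilities. This is a sketch that reuses the thresholds from the cell above; the function name `label_from_probabilities` is hypothetical, and the probabilities are taken from outputs shown in this notebook.

```python
def label_from_probabilities(neg_prob, pos_prob,
                             pos_threshold=0.9, neg_threshold=0.1,
                             neutral_threshold=0.999):
    """Map a (negative, positive) probability pair to a sentiment label."""
    diff = abs(pos_prob - neg_prob)
    if diff < neutral_threshold:
        return 'NEUTRAL'       # the model is not decisive enough
    elif pos_prob > pos_threshold:
        return 'POSITIVE'
    elif neg_prob > neg_threshold:
        return 'NEGATIVE'
    return 'NEUTRAL'

print(label_from_probabilities(3.264732e-08, 1.0))
print(label_from_probabilities(0.033838, 0.9661622))
print(label_from_probabilities(0.9999, 0.0001))
```

With a neutral threshold of 0.999, only near-certain predictions receive a POSITIVE or NEGATIVE label; everything else is treated as NEUTRAL.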


In [32]:
# predict sentiment of sample text; the output is [P(negative), P(positive)]
text_input = "The movie was awesome and fantastic."
predicted_sentiment = predict_sentiment(text_input)
print(predicted_sentiment)

[3.264732e-08 1.000000e+00]


In [18]:
# @title Plotting function to see score distribution
# Used to tune the parameters of the gaussian weighting function
import numpy as np
import matplotlib.pyplot as plt

def plot_distribution(numbers, title, score):
    """
    Plot the distribution of a list of numbers using a histogram.
    """
    plt.figure(figsize=(8, 6))
    plt.hist(numbers, bins=20, color='skyblue', edgecolor='black', alpha=0.7)
    plt.title('Distribution of ' + str(title) + ", score of " + str(round((score * 100), 2)) + "%")
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()

The imports below support the Selenium sessions that automate scraping review data from IMDb.

In [19]:
import requests
import math
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.common.keys import Keys

The function below scrapes review data from IMDb for the given movie and scores each review with the model. The sentiment values are then combined into a single score for the movie. The weighting is based on the Gaussian function:

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

The helper uses the complement of its unnormalized form, $w(x) = 1 - e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, to place sentiment values along a normal curve so that reviews with overwhelmingly positive or negative sentiment carry more weight than ambiguous ones. Each sentiment value is first rescaled from $[-1, 1]$ to $[0, 1]$, so the weighted average can be read as a percentage.

> Note the helper function *gaussian_weight*, which uses "mu" and "sigma" from the equation above to put the results along a normal curve. The values of "mu" and "sigma" were tuned so that the score properly represents the sentiment of a list of reviews.
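
To see how the weighting behaves, the sketch below reimplements the scoring step with only the standard library. It assumes the same defaults as the notebook's helper (`mu=0.5`, `sigma=0.1`) and the same exponent range (0.1 to 10.1, base 2) for the rank weights; `movie_score` is a hypothetical name and the sample sentiment values are made up.

```python
import math

def gaussian_weight(x, mu=0.5, sigma=0.1):
    """Near 0 for ambiguous values (x close to mu), near 1 for polarized ones."""
    return 1 - math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def movie_score(sentiment_diffs):
    """Weighted average of sentiment values, favoring strongly polarized reviews."""
    # Rescale P(positive) - P(negative) from [-1, 1] to [0, 1]
    scaled = [(d + 1) / 2 for d in sentiment_diffs]
    # Order reviews from least to most polarized
    ranked = sorted(scaled, key=gaussian_weight)
    # Rank weights 2**0.1 ... 2**10.1, evenly spaced in the exponent
    n = len(ranked)
    exponents = [0.1] if n == 1 else [0.1 + 10 * i / (n - 1) for i in range(n)]
    weights = [2 ** e for e in exponents]
    return sum(s * w for s, w in zip(ranked, weights)) / sum(weights)

diffs = [0.9, -0.2, 1.0]   # hypothetical per-review sentiment differences
score = movie_score(diffs)
plain_mean = sum((d + 1) / 2 for d in diffs) / len(diffs)
print(round(score, 4), round(plain_mean, 4))
```

Because the two strongly positive reviews rank as more polarized than the mildly negative one, they dominate the result and the score ends up well above the plain mean.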




In [20]:
from selenium.common.exceptions import TimeoutException

def webscrape(url, title, w = 10):
  """Scrape IMDb reviews from url and return (title, weighted sentiment score in [0, 1])."""
  sentiment = []

  firefox_options = FirefoxOptions()
  firefox_options.add_argument("--headless")

  driver = webdriver.Firefox(options=firefox_options)

  try:
    # Navigate to the webpage
    driver.get(url)

    # Click the "load more" button up to w times to reveal more reviews
    for i in range(w):
      try:
        button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "//button[@id='load-more-trigger']"))
        )
      except TimeoutException:
        # Button not clickable within 10 seconds; assume there is nothing more to load
        break

      button.click()

    div_elements = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, "//div[@class='content']"))
    )

    # Extract text from each div element
    texts = [div.text for div in div_elements]

    # Score each review of at least 200 characters (its last 60 characters are dropped);
    # s is [P(negative), P(positive)], so s[1] - s[0] lies in [-1, 1]
    for t in texts:
      if len(t) >= 200:
        s = predict_sentiment(t[:-60])
        sentiment.append(s[1] - s[0])

    def gaussian_weight(x, mu=0.5, sigma=0.1):
      # Near 0 for ambiguous reviews (x close to mu), near 1 for strongly polarized ones
      return 1 - math.exp(-0.5 * ((x - mu) / sigma) ** 2)

    # Rescale sentiment from [-1, 1] to [0, 1]
    scaled = [(num + 1) / 2 for num in sentiment]

    # Order reviews from least to most polarized, then give the more polarized
    # reviews exponentially larger weights
    gauss = [gaussian_weight(x) for x in scaled]
    combined = [s for s, g in sorted(zip(scaled, gauss), key=lambda x: x[1])]
    weights = np.logspace(.1, 10.1, num=len(scaled), base=2)

    return (title, np.average(combined, weights=weights))

  finally:
      # Close the browser to release resources
      driver.quit()

# Database Building
Now that the model is built and tested and the scoring system for the movies has been determined, the next step is to create a database of movie sentiments so that the results can be shown in the UI.

The Colab notebook is connected to Google Drive so that the results can easily be written to a Google Sheets document.



In [21]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [22]:
from google.colab import auth
from gspread_dataframe import set_with_dataframe
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

The Selenium script below finds the exact URL of the reviews page for any given movie.

In [23]:
def grabURL(name):
  """Return (name, reviews_url) for a movie, or None if the lookup fails."""

  firefox_options = FirefoxOptions()
  firefox_options.add_argument("--headless")

  driver = webdriver.Firefox(options=firefox_options)

  try:
    # Navigate to the webpage
    driver.get("https://www.imdb.com/")

    search_bar = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "suggestion-search"))
    )

    # Clear the search bar and enter movie name
    search_bar.clear()
    search_bar.send_keys(name)
    search_bar.send_keys(Keys.RETURN)

    # Click on the first search result (assumed to be the movie page)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "ipc-metadata-list-summary-item__t"))
    ).click()

    # Click on the "Reviews" link
    reviews_link = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//a[text()='User reviews']"))
    )
    reviews_link.click()

    return (name, driver.current_url)

  except Exception as e:
      print("An error occurred:", e)

  finally:
      # Close the browser
      driver.quit()

Using the function above, a list of movie names can easily be converted into a list of review-page URLs.

In [24]:
def namesToURLs(names):
  movies = []

  for name in names:
    # Each entry is (name, (name, url)), or (name, None) if the lookup failed
    movies.append((name, grabURL(name)))
    print(name)

  return movies
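
The shape of the returned list matters for the later cells: each entry pairs a name with either a `(name, url)` tuple or `None`. The sketch below illustrates that structure without a browser by swapping in a hypothetical stand-in for `grabURL` (`fake_grab_url`, with a made-up URL) and then filtering out failed lookups.

```python
def fake_grab_url(name):
    """Hypothetical stand-in for grabURL: returns (name, url) or None on failure."""
    known = {"Rocky": "https://example.com/rocky/reviews"}
    url = known.get(name)
    return (name, url) if url else None

def names_to_urls(names, lookup):
    """Pair every name with the lookup result, mirroring namesToURLs."""
    return [(name, lookup(name)) for name in names]

looked_up = names_to_urls(["Rocky", "Unknown Film"], fake_grab_url)
# Keep only the entries whose lookup succeeded
usable = [entry for entry in looked_up if entry[1] is not None]
print(looked_up)
print(usable)
```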

In [39]:
names = [
  "The Departed",
  "No Country for Old Men",
  "There Will Be Blood",
  "Million Dollar Baby",
  "The Hurt Locker",
  "The King's Speech",
  "Slumdog Millionaire",
  "Crash",
  "Gladiator",
  "American Beauty",
  "Shakespeare in Love",
  "Schindler's List",
  "Forrest Gump",
  "Unforgiven",
  "Dances with Wolves",
  "Rain Man",
  "Platoon",
  "Chariots of Fire",
  "Kramer vs. Kramer",
  "One Flew Over the Cuckoo's Nest",
  "The Godfather",
  "The French Connection",
  "Rocky",
  "The Sting",
  "The French Connection",
  "Midnight Cowboy",
  "In the Heat of the Night",
  "Oliver!",
  "Tom Jones",
  "Lawrence of Arabia"
]

movies = namesToURLs(names)

The Departed
No Country for Old Men
There Will Be Blood
Million Dollar Baby
The Hurt Locker
The King's Speech
Slumdog Millionaire
Crash
Gladiator
American Beauty
Shakespeare in Love
Schindler's List
Forrest Gump
Unforgiven
Dances with Wolves
Rain Man
Platoon
Chariots of Fire
Kramer vs. Kramer
The Godfather
The French Connection
Rocky
The Sting
The French Connection
Midnight Cowboy
In the Heat of the Night
An error occurred: Message: 
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:193:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:511:5
dom.find/</<@chrome://remote/content/shared/DOM.sys.mjs:136:16

Oliver!
Tom Jones
Lawrence of Arabia


Run the sentiment analysis model on the list of movies, then convert the results into a DataFrame that can easily be added to the Google Sheet. The helper `findLink` looks up a poster image URL for each movie.

In [36]:
from urllib.parse import quote

def findLink(movie):
  """Return the poster image URL for a movie from movieposterdb.com, or '' if none is found."""
  # URL-encode the title (for example, spaces become %20)
  url = "https://www.movieposterdb.com/search?q=" + quote(movie) + "&imdb=0"

  try:
    # Send a GET request to the URL
    response = requests.get(url)

    # Check if request was successful
    if response.status_code != 200:
      print("Failed to fetch the URL:", response.status_code)
      return ''

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the first img tag whose title contains the movie name
    img_tag = soup.find('img', title=lambda value: value and movie.lower() in value.lower())
    if img_tag is None:
      return ''

    # The poster URL is stored in the 'data-src' attribute of the img tag
    return img_tag.get('data-src')
  except Exception as e:
      print("An error occurred:", str(e))
      return ''
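
The search URL can be assembled with the standard library's `urllib.parse.quote`, which percent-encodes spaces as well as characters such as apostrophes. A minimal sketch, where `poster_search_url` is a hypothetical helper name and the URL pattern is the one used in the cell above:

```python
from urllib.parse import quote

def poster_search_url(movie):
    """Build the poster search URL for a movie title."""
    return "https://www.movieposterdb.com/search?q=" + quote(movie) + "&imdb=0"

print(poster_search_url("Rain Man"))
print(poster_search_url("The King's Speech"))
```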

In [37]:
def createDF(movies):
    """Score every movie and append (title, score, poster URL) rows to the Google Sheet."""
    data = []

    for movie in movies:
      # Skip movies whose reviews URL could not be found
      if movie[1] is None: continue

      print(movie)
      name, link = movie[1]
      url = findLink(name)

      val = webscrape(link, name)
      data.append((val[0], val[1], url))

    print(data)

    df = pd.DataFrame(data, columns=['Movie', 'Score', 'Poster_URL'])
    sheet = gc.open("Sentiment Analysis Data").get_worksheet(0)
    # Append below the rows already in the sheet
    existing_rows = sheet.get_all_values()
    num_existing_rows = len(existing_rows)
    set_with_dataframe(sheet, df, row=num_existing_rows+1, include_index=False, include_column_header=False)

In [38]:
createDF(movies)

('Inception', ('Inception', 'https://www.imdb.com/title/tt1375666/reviews/?ref_=tt_ql_2'))
('La La Land', ('La La Land', 'https://www.imdb.com/title/tt3783958/reviews/?ref_=tt_ql_2'))
('Mad Max: Fury Road', ('Mad Max: Fury Road', 'https://www.imdb.com/title/tt1392190/reviews/?ref_=tt_ql_2'))
('Interstellar', ('Interstellar', 'https://www.imdb.com/title/tt0816692/reviews/?ref_=tt_ql_2'))
('The Grand Budapest Hotel', ('The Grand Budapest Hotel', 'https://www.imdb.com/title/tt2278388/reviews/?ref_=tt_ql_2'))
('Whiplash', ('Whiplash', 'https://www.imdb.com/title/tt2582802/reviews/?ref_=tt_ql_2'))
('Birdman', ('Birdman', 'https://www.imdb.com/title/tt2562232/reviews/?ref_=tt_ql_2'))
('Her', ('Her', 'https://www.imdb.com/title/tt1798709/reviews/?ref_=tt_ql_2'))
('The Shape of Water', ('The Shape of Water', 'https://www.imdb.com/title/tt5580390/reviews/?ref_=tt_ql_2'))
('The Social Network', ('The Social Network', 'https://www.imdb.com/title/tt1285016/reviews/?ref_=tt_ql_2'))
('Arrival', ('Ar

Now that the movie titles and scores have been put into the dataset, find the corresponding movie poster URLs for all of the values in the dataset

In [106]:
sheet = gc.open("Sentiment Analysis Data").get_worksheet(0)
existing_rows = sheet.get_all_values()
moviePosters = []

# Skip the first row; keep titles (column A) whose poster column (column C)
# is non-empty, using the values already fetched from the sheet
for row in existing_rows[1:]:
  if len(row) > 2 and row[2] != '':
    moviePosters.append(row[0])

print(moviePosters)
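
The filtering step can be checked offline on a list of rows shaped like the output of `get_all_values()` (a list of lists of strings). The sketch below uses made-up rows and a hypothetical helper name, `titles_with_posters`.

```python
def titles_with_posters(rows):
    """Return titles (first column) of rows whose third column is non-empty."""
    titles = []
    for row in rows[1:]:                 # skip the first row
        if len(row) > 2 and row[2] != '':
            titles.append(row[0])
    return titles

sample_rows = [
    ["Movie", "Score", "Poster_URL"],
    ["Film A", "0.91", "https://example.com/a.jpg"],
    ["Film B", "0.42", ""],
]
print(titles_with_posters(sample_rows))
```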

['The Shawshank Redemption', 'Back to the Future', 'Forrest Gump', 'The Artist', 'Birdman or (The Unexpected Virtue of Ignorance)', 'Toy Story', 'Zootopia', 'The Blues Brothers', 'The Lion King', "Schindler's List", 'The Shape of Water', 'E.T. the Extra-Terrestrial', 'The Dark Knight', 'Green Book', 'Who Framed Roger Rabbit', 'Parasite', 'Home Alone', 'Braveheart', 'The Empire Strikes Back', 'The Fugitive', 'A Few Good Men', 'Stand by Me', 'Argo', 'The Breakfast Club', 'The Silence of the Lambs', 'Captain America: Civil War', 'The Sixth Sense', 'Raging Bull', 'Indiana Jones and the Raiders of the Lost Ark', 'The Green Mile', 'Terminator 2: Judgment Day', 'Mrs. Doubtfire', 'Jurassic Park', 'The Karate Kid', 'The Lord of the Rings: The Return of the King', 'The Shining', '12 Years a Slave', 'Indiana Jones and the Last Crusade', 'Die Hard', 'Harry Potter and the Deathly Hallows – Part 2', 'Ghost', 'Rain Man', 'Iron Man 3', 'Apollo 13', 'Star Wars: Episode VI - Return of the Jedi', 'The Lo