<a href="https://colab.research.google.com/github/dyarparvar/NLP/blob/main/Sentiment_Analysis_of_Movie_Reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis of movie reviews

## Scenario
As part of a market research exercise for a film studio planning a new science-fiction film, you have been tasked with a data science project to research customer feedback on films in a related genre. One question you will be asked to investigate is whether there’s a relationship between the proportion of feedback that is positive and production budgets. Before you compare sentiment scores between films, however, you need to construct a viable preprocessing pipeline and train a model.

## ✅ 0-2. Setup & Data

In [None]:
!pip install datasets

In [None]:
!pip install beautifulsoup4

In [None]:
# For GloVe via spaCy
!pip install spacy
!python -m spacy download en_core_web_lg

In [None]:
# For Word2Vec
!pip install gensim
from gensim.models import Word2Vec

In [None]:
# For NLTK
!pip install nltk
!pip install svgling
import nltk
nltk.download("all")

In [None]:
import pandas as pd
import numpy as np

from scipy.spatial.distance import cosine

from datasets import load_dataset

import spacy

import string
import re

from bs4 import BeautifulSoup # remove HTML tags



from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import CountVectorizer  # BoW
from sklearn.feature_extraction.text import TfidfVectorizer  # Tf-idf


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report

In [None]:
dataset = load_dataset("stanfordNLP/sst2")

In [None]:
dataset

## ✅ 3. Train & Validation split

In [None]:
train_data = dataset["train"]
train_data = train_data.to_pandas()
train_data

Unnamed: 0,idx,sentence,label
0,0,hide new secretions from the parental units,0
1,1,"contains no wit , only labored gags",0
2,2,that loves its characters and communicates som...,1
3,3,remains utterly satisfied to remain the same t...,0
4,4,on the worst revenge-of-the-nerds clichés the ...,0
...,...,...,...
67344,67344,a delightful comedy,1
67345,67345,"anguish , anger and frustration",0
67346,67346,"at achieving the modest , crowd-pleasing goals...",1
67347,67347,a patient viewer,1


In [None]:
train_sentence = train_data["sentence"]
train_label = train_data["label"]

In [None]:
val_data = dataset["validation"]
val_data = val_data.to_pandas()
val_data

Unnamed: 0,idx,sentence,label
0,0,it 's a charming and often affecting journey .,1
1,1,unflinchingly bleak and desperate,0
2,2,allows us to hope that nolan is poised to emba...,1
3,3,"the acting , costumes , music , cinematography...",1
4,4,"it 's slow -- very , very slow .",0
...,...,...,...
867,867,has all the depth of a wading pool .,0
868,868,a movie with a real anarchic flair .,1
869,869,a subject like this should inspire reaction in...,0
870,870,... is an arthritic attempt at directing by ca...,0


In [None]:
val_sentence = val_data["sentence"]
val_label = val_data["label"]

In [None]:
test_data = dataset["test"]
test_data = test_data.to_pandas()
test_data

Unnamed: 0,idx,sentence,label
0,0,uneasy mishmash of styles and genres .,-1
1,1,this film 's relationship to actual tension is...,-1
2,2,"by the end of no such thing the audience , lik...",-1
3,3,director rob marshall went out gunning to make...,-1
4,4,lathan and diggs have considerable personal ch...,-1
...,...,...,...
1816,1816,"it risks seeming slow and pretentious , becaus...",-1
1817,1817,take care of my cat offers a refreshingly diff...,-1
1818,1818,davis has filled out his cast with appealing f...,-1
1819,1819,it represents better-than-average movie-making...,-1


In [None]:
test_sentence = test_data["sentence"]
test_label = test_data["label"]

In [None]:
# Check labels
print(f"Unique train labels: {np.unique(train_label)}")
print(f"Unique validation labels: {np.unique(val_label)}")
print(f"Unique test labels: {np.unique(test_label)}")

Unique train labels: [0 1]
Unique validation labels: [0 1]
Unique test labels: [-1]


*SST-2's test split is intended for benchmark submission, so its labels are placeholders (-1) rather than real labels. We therefore use the validation split as the held-out test set.*

## ✅ 4-7. Similarity

4. Calculate the cosine similarity of the 5th and 100th sentence within the train split.
5. Calculate the cosine similarity of the 5th and 15,000th sentence within the train split.
6. Calculate the cosine similarity of the 5th and 50,000th sentence within the train split.
7. Comment on the cosine similarity scores.
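
For reference, cosine similarity is the cosine of the angle between two vectors: the dot product divided by the product of the two norms. It is 1 for vectors pointing in the same direction and 0 for orthogonal vectors. Below is a minimal pure-Python sketch of that definition; the helper name `cosine_sim` and the toy vectors are made up for illustration and are not part of the notebook, which relies on spaCy and SciPy instead.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration
same_direction = cosine_sim([1, 2, 3], [2, 4, 6])
orthogonal = cosine_sim([1, 0], [0, 1])
print(same_direction, orthogonal)
```

Because the score depends only on direction, a long review and a short one can still score close to 1 if their vectors point the same way.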


### **✅ GloVe via spaCy**

(pre-trained)

*Semantic Similarity*

In [None]:
# Load the model
nlp = spacy.load("en_core_web_lg")

Converting each sentence to a spaCy `Doc` object

In [None]:
# Function to convert a sentence to a spaCy Doc object
def sentence_to_nlp_spacy_glove(sentence, nlp_model):
    doc = nlp_model(sentence)
    # Return the Doc itself; doc.vector holds the average of its token vectors
    return doc

In [None]:
# Convert each sentence to nlp object
nlp_5_glove = sentence_to_nlp_spacy_glove(train_sentence[4], nlp)
nlp_100_glove = sentence_to_nlp_spacy_glove(train_sentence[99], nlp)
nlpc_15000_glove = sentence_to_nlp_spacy_glove(train_sentence[14999], nlp)
nlp_50000_glove = sentence_to_nlp_spacy_glove(train_sentence[49999], nlp)

# Calculate semantic similarity
sim_5_100_glove = nlp_5_glove.similarity(nlp_100_glove)
sim_5_15000_glove = nlp_5_glove.similarity(nlpc_15000_glove)
sim_5_50000_glove = nlp_5_glove.similarity(nlp_50000_glove)

In [None]:
print(f"{train_sentence[4]} \n {sim_5_100_glove} \n {train_sentence[99]}")

on the worst revenge-of-the-nerds clichés the filmmakers could dredge up  
 0.7925307154655457 
 acted and directed , it 's clear that washington most certainly has a new career ahead of him 


In [None]:
print(f"{train_sentence[4]} \n {sim_5_15000_glove} \n {train_sentence[14999]}")

on the worst revenge-of-the-nerds clichés the filmmakers could dredge up  
 0.2320917546749115 
 eloquent 


In [None]:
print(f"{train_sentence[4]} \n {sim_5_50000_glove} \n {train_sentence[49999]}")

on the worst revenge-of-the-nerds clichés the filmmakers could dredge up  
 0.23432505130767822 
 stylish 


**Alternative Approach**

Converting each sentence to a vector using spaCy

In [None]:
# Convert each sentence to vector
vec_5_glove = nlp_5_glove.vector
vec_100_glove = nlp_100_glove.vector
vec_15000_glove = nlpc_15000_glove.vector
vec_50000_glove = nlp_50000_glove.vector


In [None]:
# Function to calculate cosine similarity (SciPy's cosine() returns the distance)
def cosine_similarity(vec1, vec2):
    # Guard against zero vectors, for which the cosine is undefined
    if np.linalg.norm(vec1) == 0 or np.linalg.norm(vec2) == 0:
        return 0
    return 1 - cosine(vec1, vec2)

In [None]:
# Calculate semantic similarity
sim_5_100_glove = cosine_similarity(vec_5_glove, vec_100_glove)
sim_5_15000_glove = cosine_similarity(vec_5_glove, vec_15000_glove)
sim_5_50000_glove = cosine_similarity(vec_5_glove, vec_50000_glove)

In [None]:
print(f"{train_sentence[4]} \n {sim_5_100_glove} \n {train_sentence[99]}")

on the worst revenge-of-the-nerds clichés the filmmakers could dredge up  
 0.7925306558609009 
 acted and directed , it 's clear that washington most certainly has a new career ahead of him 


In [None]:
print(f"{train_sentence[4]} \n {sim_5_15000_glove} \n {train_sentence[14999]}")

on the worst revenge-of-the-nerds clichés the filmmakers could dredge up  
 0.2320917248725891 
 eloquent 


In [None]:
print(f"{train_sentence[4]} \n {sim_5_50000_glove} \n {train_sentence[49999]}")

on the worst revenge-of-the-nerds clichés the filmmakers could dredge up  
 0.23432505130767822 
 stylish 


### **✅ Word2Vec**

(trained on our data)

*Semantic Similarity*

In [None]:
# Tokenise and train on the data
train_tokens_w2v = [word_tokenize(sentence.lower()) for sentence in train_sentence]

w2v = Word2Vec(train_tokens_w2v,
               vector_size=100,
               window=5,
               min_count=2)

In [None]:
train_tokens_w2v[4]

['on',
 'the',
 'worst',
 'revenge-of-the-nerds',
 'clichés',
 'the',
 'filmmakers',
 'could',
 'dredge',
 'up']

In [None]:
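# Define the function
def get_sentence_vector_w2v(sentence, model):
    tokens = word_tokenize(sentence.lower())

    # Skip tokens the model has no vector for (e.g. dropped by min_count)
    vectors = [model.wv[token] for token in tokens if token in model.wv.key_to_index]

    # Fall back to a zero vector when no token is in the vocabulary
    if len(vectors) == 0:
        return np.zeros(model.vector_size)

    return np.mean(vectors, axis=0)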
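
Because `min_count=2` drops words seen only once, a sentence can contain tokens that have no vector, so the averaging step has to cope with missing entries. The sketch below illustrates the idea with a plain dictionary standing in for the trained model; `toy_vectors` and `mean_vector` are hypothetical names and the 3-dimensional values are invented.

```python
# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "film": [3.0, 2.0, 0.0],
}

def mean_vector(tokens, vectors, size=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values dimension by dimension
    return [sum(column) / len(known) for column in zip(*known)]

averaged = mean_vector(["good", "unseenword", "film"], toy_vectors)
all_unknown = mean_vector(["unseenword"], toy_vectors)
print(averaged, all_unknown)
```

Skipping unknown tokens keeps the average meaningful, while the zero-vector fallback lets a downstream cosine function return 0 instead of failing.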

In [None]:
# Calculate similarity
sim_5_100_w2v = cosine_similarity(get_sentence_vector_w2v(train_sentence[4], w2v),
                           get_sentence_vector_w2v(train_sentence[99], w2v))
sim_5_15000_w2v = cosine_similarity(get_sentence_vector_w2v(train_sentence[4], w2v),
                             get_sentence_vector_w2v(train_sentence[14999], w2v))
sim_5_50000_w2v = cosine_similarity(get_sentence_vector_w2v(train_sentence[4], w2v),
                             get_sentence_vector_w2v(train_sentence[49999], w2v))

In [None]:
print(f"{train_sentence[4]} \n {sim_5_100_w2v} \n {train_sentence[99]}")

on the worst revenge-of-the-nerds clichés the filmmakers could dredge up  
 0.6973946690559387 
 acted and directed , it 's clear that washington most certainly has a new career ahead of him 


In [None]:
print(f"{train_sentence[4]} \n {sim_5_15000_w2v} \n {train_sentence[14999]}")

on the worst revenge-of-the-nerds clichés the filmmakers could dredge up  
 0.4082188606262207 
 eloquent 


In [None]:
print(f"{train_sentence[4]} \n {sim_5_50000_w2v} \n {train_sentence[49999]}")

on the worst revenge-of-the-nerds clichés the filmmakers could dredge up  
 0.40622401237487793 
 stylish 


## ✅ 8. Preprocessing

Apply the preprocessing steps described below to the train and validation texts.



- Remove any punctuation and html tags.

In [None]:
# Function to remove punctuation
def remove_punctuation(text):
    # Make a translation table for str.translate() that maps each punctuation character to none.
    translator = str.maketrans("", "", string.punctuation)
    # Translate the text using the translation table.
    return text.translate(translator)

In [None]:
# Uses BeautifulSoup for HTML cleaning
processed_train_sentence = [BeautifulSoup(sentence, "html.parser").get_text() for sentence in train_sentence]
processed_val_sentence = [BeautifulSoup(sentence, "html.parser").get_text() for sentence in val_sentence]

# Uses str.translate() to remove standard ASCII punctuation
processed_train_sentence = [remove_punctuation(sentence) for sentence in processed_train_sentence]
processed_val_sentence = [remove_punctuation(sentence) for sentence in processed_val_sentence]

# Convert to lowercase to maintain consistency
processed_train_sentence = [sentence.lower() for sentence in processed_train_sentence]
processed_val_sentence = [sentence.lower() for sentence in processed_val_sentence]

In [None]:
train_sentence[8]

"a depressed fifteen-year-old 's suicidal poetry "

In [None]:
processed_train_sentence[8]

'a depressed fifteenyearold s suicidal poetry '
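
The output above shows a side effect of deleting punctuation outright: "fifteen-year-old" is fused into a single token. One possible alternative, shown here as an assumption rather than the notebook's method, is to map punctuation to spaces and then collapse repeated whitespace. The names `delete_table` and `space_table` are illustrative.

```python
import string

text = "a depressed fifteen-year-old 's suicidal poetry "

# Deleting punctuation fuses hyphenated words, as in the notebook
delete_table = str.maketrans("", "", string.punctuation)
deleted = text.translate(delete_table)

# Alternative (an assumption, not the notebook's method): map each punctuation
# character to a space, then collapse repeated whitespace
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(text.translate(space_table).split())

print(repr(deleted))
print(repr(spaced))
```

Whether split or fused tokens work better for the classifier is an empirical question; the sketch only makes the trade-off visible.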

### ✅ NLTK

 - Tokenize the text into tokens.


In [None]:
train_tokens = [word_tokenize(sentence) for sentence in processed_train_sentence]
val_tokens = [word_tokenize(sentence) for sentence in processed_val_sentence]
# test_tokens = [word_tokenize(sentence) for sentence in processed_test_sentence]

In [None]:
train_tokens[8]

['a', 'depressed', 'fifteenyearold', 's', 'suicidal', 'poetry']

 - Remove stop words from your text.


In [None]:
# Stopwords
stop_words = set(stopwords.words("english"))

# Filter stopwords while keeping sentence structure
# without flattening the tokens all into one single list
filtered_train_tokens = [[token for token in sentence if token not in stop_words]
                         for sentence in train_tokens]
filtered_val_tokens = [[token for token in sentence if token not in stop_words]
                       for sentence in val_tokens]

In [None]:
filtered_train_tokens[8]

['depressed', 'fifteenyearold', 'suicidal', 'poetry']
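
NLTK's English stop list includes negations such as "no" and "not", so filtering also removes words that can flip sentiment: "contains no wit , only labored gags" loses its "no". The sketch below shows one way to keep negations, using a small stand-in stop list; `toy_stop_words` and `negations` are hypothetical and much shorter than the real list.

```python
# Small stand-in stop list for illustration; the notebook uses NLTK's full English list
toy_stop_words = {"a", "the", "no", "not", "only", "is"}
# Hypothetical set of negations to keep
negations = {"no", "not"}

tokens = ["contains", "no", "wit", "only", "labored", "gags"]

# Standard filtering removes the negation
filtered_all = [t for t in tokens if t not in toy_stop_words]
# Filtering against the reduced list keeps it
filtered_keep_neg = [t for t in tokens if t not in (toy_stop_words - negations)]

print(filtered_all)
print(filtered_keep_neg)
```

Whether keeping negations improves a unigram model would need to be checked on the validation split; with single-word features, "no" is not tied to the word it negates.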

 - Perform lemmatisation and stemming on your text (one at a time).

In [None]:
# Initialise lemmatiser
wnl = WordNetLemmatizer()

In [None]:
lemma_train = [" ".join([wnl.lemmatize(token, pos="v") for token in sentence])
                      for sentence in filtered_train_tokens]
lemma_val = [" ".join([wnl.lemmatize(token, pos="v")  for token in sentence])
                    for sentence in filtered_val_tokens]

In [None]:
lemma_train[3]

'remain utterly satisfy remain throughout'

In [None]:
# Initialise the stemmer
stemmer = PorterStemmer()

In [None]:
stem_train = [" ".join([stemmer.stem(token) for token in sentence])
                     for sentence in filtered_train_tokens]
stem_val = [" ".join([stemmer.stem(token) for token in sentence])
                     for sentence in filtered_val_tokens]

In [None]:
stem_train[3]

'remain utterli satisfi remain throughout'

In [None]:
train_data["lemma_sentence"] = lemma_train

In [None]:
val_data["lemma_sentence"] = lemma_val

In [None]:
train_data["stem_sentence"] = stem_train

In [None]:
val_data["stem_sentence"] = stem_val

In [None]:
train_data.head()

Unnamed: 0,idx,sentence,label,lemma_sentence,stem_sentence
0,0,hide new secretions from the parental units,0,hide new secretions parental units,hide new secret parent unit
1,1,"contains no wit , only labored gags",0,contain wit labor gag,contain wit labor gag
2,2,that loves its characters and communicates som...,1,love character communicate something rather be...,love charact commun someth rather beauti human...
3,3,remains utterly satisfied to remain the same t...,0,remain utterly satisfy remain throughout,remain utterli satisfi remain throughout
4,4,on the worst revenge-of-the-nerds clichés the ...,0,worst revengeofthenerds clichés filmmakers cou...,worst revengeofthenerd cliché filmmak could dredg


In [None]:
val_data.head()

Unnamed: 0,idx,sentence,label,lemma_sentence,stem_sentence
0,0,it 's a charming and often affecting journey .,1,charm often affect journey,charm often affect journey
1,1,unflinchingly bleak and desperate,0,unflinchingly bleak desperate,unflinchingli bleak desper
2,2,allows us to hope that nolan is poised to emba...,1,allow us hope nolan poise embark major career ...,allow us hope nolan pois embark major career c...
3,3,"the acting , costumes , music , cinematography...",1,act costume music cinematography sound astound...,act costum music cinematographi sound astound ...
4,4,"it 's slow -- very , very slow .",0,slow slow,slow slow


## ✅ 9. BoW & Similarity
*Statistical Similarity*

In [None]:
# Create vectorizer
count_vectorizer = CountVectorizer(analyzer = "word",
                                   lowercase=True,
                                   stop_words="english",
                                   max_features=3000)

**lemmatised sentences**

In [None]:
# Calculate BoW - lemmatised sentences
train_bow_lemma = count_vectorizer.fit_transform(lemma_train) # Learn vocabulary from training
val_bow_lemma = count_vectorizer.transform(lemma_val) # Apply to validation

In [None]:
# Get vocabulary and tokens after fitting
vocabulary_bow = count_vectorizer.vocabulary_
tokens_bow = count_vectorizer.get_feature_names_out()

In [None]:
vocabulary_bow

In [None]:
# Calculate statistical similarity
sim_5_100_bow = cosine_similarity(train_bow_lemma[4].toarray()[0],
                             train_bow_lemma[99].toarray()[0])
sim_5_15000_bow = cosine_similarity(train_bow_lemma[4].toarray()[0],
                               train_bow_lemma[14999].toarray()[0])
sim_5_50000_bow = cosine_similarity(train_bow_lemma[4].toarray()[0],
                               train_bow_lemma[49999].toarray()[0])

In [None]:
print(f"{lemma_train[4]} \n {sim_5_100_bow} \n {lemma_train[99]}")

worst revengeofthenerds clichés filmmakers could dredge 
 0.0 
 act direct clear washington certainly new career ahead


In [None]:
print(f"{lemma_train[4]} \n {sim_5_15000_bow} \n {lemma_train[14999]}")

worst revengeofthenerds clichés filmmakers could dredge 
 0.0 
 eloquent


In [None]:
print(f"{lemma_train[4]} \n {sim_5_50000_bow} \n {lemma_train[49999]}")

worst revengeofthenerds clichés filmmakers could dredge 
 0.0 
 stylish


**stemmed sentences**

In [None]:
# Calculate BoW - stemmed sentences
train_bow_stem = count_vectorizer.fit_transform(stem_train) # Learn vocabulary from training
val_bow_stem = count_vectorizer.transform(stem_val) # Apply to validation

In [None]:
# Get vocabulary and tokens after fitting
vocabulary_bow = count_vectorizer.vocabulary_
tokens_bow = count_vectorizer.get_feature_names_out()

In [None]:
vocabulary_bow

In [None]:
# Calculate statistical similarity
sim_5_100_bow = cosine_similarity(train_bow_stem[4].toarray()[0],
                             train_bow_stem[99].toarray()[0])
sim_5_15000_bow = cosine_similarity(train_bow_stem[4].toarray()[0],
                               train_bow_stem[14999].toarray()[0])
sim_5_50000_bow = cosine_similarity(train_bow_stem[4].toarray()[0],
                               train_bow_stem[49999].toarray()[0])

In [None]:
print(f"{stem_train[4]} \n {sim_5_100_bow} \n {stem_train[99]}")

worst revengeofthenerd cliché filmmak could dredg 
 0.0 
 act direct clear washington certainli new career ahead


In [None]:
print(f"{stem_train[4]} \n {sim_5_15000_bow} \n {stem_train[14999]}")

worst revengeofthenerd cliché filmmak could dredg 
 0.0 
 eloqu


In [None]:
print(f"{stem_train[4]} \n {sim_5_50000_bow} \n {stem_train[49999]}")

worst revengeofthenerd cliché filmmak could dredg 
 0.0 
 stylish
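
The zeros above follow directly from how count vectors work: the dot product only picks up words that occur in both sentences. The following pure-Python sketch reproduces that behaviour with `collections.Counter`; the helper `bow_cosine` is hypothetical, and it ignores the `max_features` and `stop_words` settings of the notebook's vectorizer.

```python
import math
from collections import Counter

def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw word counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both sentences contribute to the dot product
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

no_overlap = bow_cosine("worst clichés filmmakers could dredge".split(), ["stylish"])
some_overlap = bow_cosine(["slow", "slow", "film"], ["slow", "plot"])
print(no_overlap, some_overlap)
```

A single shared word is enough to move the score above zero, which is why short, unrelated reviews so often score exactly 0.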


## ✅ 9. TF-IDF & Similarity

*Statistical Similarity*

In [None]:
# Create vectorizer (inputs are already lemmatised or stemmed)
tfidf_vectorizer = TfidfVectorizer(analyzer = "word",
                             lowercase=True,
                             stop_words="english",
                             max_features=3000)

**lemmatised sentences**

In [None]:
# Calculate TF-IDF - lemmatised sentences
train_tfidf_lemma = tfidf_vectorizer.fit_transform(lemma_train) # Learn vocabulary from training
val_tfidf_lemma = tfidf_vectorizer.transform(lemma_val) # Apply to validation

In [None]:
# Get vocabulary and tokens after fitting
vocabulary_tfidf = tfidf_vectorizer.vocabulary_
tokens_tfidf = tfidf_vectorizer.get_feature_names_out()

In [None]:
vocabulary_tfidf

In [None]:
# Calculate statistical similarity
sim_5_100_tfidf = cosine_similarity(train_tfidf_lemma[4].toarray()[0],
                             train_tfidf_lemma[99].toarray()[0])
sim_5_15000_tfidf = cosine_similarity(train_tfidf_lemma[4].toarray()[0],
                               train_tfidf_lemma[14999].toarray()[0])
sim_5_50000_tfidf = cosine_similarity(train_tfidf_lemma[4].toarray()[0],
                               train_tfidf_lemma[49999].toarray()[0])

In [None]:
print(f"{lemma_train[4]} \n {sim_5_100_tfidf} \n {lemma_train[99]}")

worst revengeofthenerds clichés filmmakers could dredge 
 0.0 
 act direct clear washington certainly new career ahead


In [None]:
print(f"{lemma_train[4]} \n {sim_5_15000_tfidf} \n {lemma_train[14999]}")

worst revengeofthenerds clichés filmmakers could dredge 
 0.0 
 eloquent


In [None]:
print(f"{lemma_train[4]} \n {sim_5_50000_tfidf} \n {lemma_train[49999]}")

worst revengeofthenerds clichés filmmakers could dredge 
 0.0 
 stylish


**stemmed sentences**

In [None]:
# Calculate TF-IDF - stemmed sentences
train_tfidf_stem = tfidf_vectorizer.fit_transform(stem_train) # Learn vocabulary from training
val_tfidf_stem = tfidf_vectorizer.transform(stem_val) # Apply to validation

In [None]:
# Get vocabulary and tokens after fitting
vocabulary_tfidf = tfidf_vectorizer.vocabulary_
tokens_tfidf = tfidf_vectorizer.get_feature_names_out()

In [None]:
vocabulary_tfidf

In [None]:
# Calculate statistical similarity
sim_5_100_tfidf = cosine_similarity(train_tfidf_stem[4].toarray()[0],
                             train_tfidf_stem[99].toarray()[0])
sim_5_15000_tfidf = cosine_similarity(train_tfidf_stem[4].toarray()[0],
                               train_tfidf_stem[14999].toarray()[0])
sim_5_50000_tfidf = cosine_similarity(train_tfidf_stem[4].toarray()[0],
                               train_tfidf_stem[49999].toarray()[0])

In [None]:
print(f"{stem_train[4]} \n {sim_5_100_tfidf} \n {stem_train[99]}")

worst revengeofthenerd cliché filmmak could dredg 
 0.0 
 act direct clear washington certainli new career ahead


In [None]:
print(f"{stem_train[4]} \n {sim_5_15000_tfidf} \n {stem_train[14999]}")

worst revengeofthenerd cliché filmmak could dredg 
 0.0 
 eloqu


In [None]:
print(f"{stem_train[4]} \n {sim_5_50000_tfidf} \n {stem_train[49999]}")

worst revengeofthenerd cliché filmmak could dredg 
 0.0 
 stylish
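
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on three toy documents. It follows the formula scikit-learn documents for its default settings (`smooth_idf=True`, `norm="l2"`): idf = ln((1 + n) / (1 + df)) + 1, multiplied by the raw count and then L2-normalised per document. The toy corpus and the helper `tfidf` are invented for illustration.

```python
import math
from collections import Counter

# Toy tokenised corpus, invented for illustration
docs = [["slow", "slow", "film"], ["stylish", "film"], ["eloquent"]]

n_docs = len(docs)
# Document frequency: in how many documents each word appears
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """TF-IDF weights for one tokenised document, L2-normalised."""
    counts = Counter(doc)
    raw = {
        w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
        for w, c in counts.items()
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document "slow" outweighs "film" because it is both repeated and rarer across the corpus. As with raw counts, two documents with no shared words still have a cosine similarity of zero.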


## ✅ 10. Logistic Regression
Train a logistic regression classifier with scikit-learn, first on the Bag-of-Words features and then on the TF-IDF features, and report the performance of each sentiment classifier.


### BoW

**lemmatised sentences**

In [None]:
# For BoW - lemmatised sentences:
clf_bow_lemma = LogisticRegression()
clf_bow_lemma.fit(train_bow_lemma, train_label)

train_score_bow_lemma = clf_bow_lemma.score(train_bow_lemma, train_label)

In [None]:
train_score_bow_lemma

0.8421802847852232

In [None]:
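# Evaluate the model
# Note: classification_report expects (y_true, y_pred). With the arguments in this
# order, the precision and recall columns are exchanged and support counts predictions.
# The later reports in this notebook use the same order.
y_pred_bow_lemma = clf_bow_lemma.predict(val_bow_lemma)
print(classification_report(y_pred_bow_lemma, val_label))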

              precision    recall  f1-score   support

           0       0.70      0.81      0.75       368
           1       0.84      0.74      0.79       504

    accuracy                           0.77       872
   macro avg       0.77      0.78      0.77       872
weighted avg       0.78      0.77      0.77       872
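
When reading this report, keep in mind that scikit-learn's `classification_report` takes `y_true` first and `y_pred` second. Passing them the other way round exchanges precision and recall for each class, and the support column then counts predicted rather than true labels. The pure-Python sketch below shows the effect on toy labels; `precision_recall` is a hypothetical helper and the label lists are invented.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, computed from paired labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted = sum(1 for p in y_pred if p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return tp / predicted, tp / actual

# Toy labels, invented for illustration
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]

correct_order = precision_recall(y_true, y_pred)
swapped_order = precision_recall(y_pred, y_true)
print(correct_order, swapped_order)
```

Accuracy and F1 are symmetric in the two arguments, so those figures are unaffected by the order.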



**stemmed sentences**

In [None]:
# For BoW - stemmed sentences:
clf_bow_stem = LogisticRegression()
clf_bow_stem.fit(train_bow_stem, train_label)

train_score_bow_stem = clf_bow_stem.score(train_bow_stem, train_label)

In [None]:
train_score_bow_stem

0.8507475983310814

In [None]:
# Evaluate the model
y_pred_bow_stem = clf_bow_stem.predict(val_bow_stem)
print(classification_report(y_pred_bow_stem, val_label))

              precision    recall  f1-score   support

           0       0.70      0.82      0.76       363
           1       0.86      0.75      0.80       509

    accuracy                           0.78       872
   macro avg       0.78      0.79      0.78       872
weighted avg       0.79      0.78      0.78       872



### TF-IDF

**lemmatised sentences**

In [None]:
# For TF-IDF - lemmatised sentences:
clf_tfidf_lemma = LogisticRegression()
clf_tfidf_lemma.fit(train_tfidf_lemma, train_label)

train_score_tfidf_lemma = clf_tfidf_lemma.score(train_tfidf_lemma, train_label)

In [None]:
train_score_tfidf_lemma

0.8389582621865209

In [None]:
# Evaluate the model
y_pred_tfidf_lemma = clf_tfidf_lemma.predict(val_tfidf_lemma)
print(classification_report(y_pred_tfidf_lemma, val_label))

              precision    recall  f1-score   support

           0       0.70      0.84      0.76       357
           1       0.87      0.75      0.81       515

    accuracy                           0.79       872
   macro avg       0.78      0.79      0.78       872
weighted avg       0.80      0.79      0.79       872



**stemmed sentences**

In [None]:
# For TF-IDF - stemmed sentences:
clf_tfidf_stem = LogisticRegression()
clf_tfidf_stem.fit(train_tfidf_stem, train_label)

train_score_tfidf_stem = clf_tfidf_stem.score(train_tfidf_stem, train_label)

In [None]:
train_score_tfidf_stem

0.8482085851311824

In [None]:
# Evaluate the model
y_pred_tfidf_stem = clf_tfidf_stem.predict(val_tfidf_stem)
print(classification_report(y_pred_tfidf_stem, val_label))

              precision    recall  f1-score   support

           0       0.70      0.84      0.76       359
           1       0.87      0.75      0.80       513

    accuracy                           0.79       872
   macro avg       0.78      0.79      0.78       872
weighted avg       0.80      0.79      0.79       872



**All models compared to each other**

In [None]:
print(f"BoW lemmatised sentences \n {classification_report(y_pred_bow_lemma, val_label)}")
print(f"BoW stemmed sentences \n {classification_report(y_pred_bow_stem, val_label)}")
print(f"TF-IDF lemmatised sentences \n {classification_report(y_pred_tfidf_lemma, val_label)}")
print(f"TF-IDF stemmed sentences \n {classification_report(y_pred_tfidf_stem, val_label)}")

BoW lemmatised sentences 
               precision    recall  f1-score   support

           0       0.70      0.81      0.75       368
           1       0.84      0.74      0.79       504

    accuracy                           0.77       872
   macro avg       0.77      0.78      0.77       872
weighted avg       0.78      0.77      0.77       872

BoW stemmed sentences 
               precision    recall  f1-score   support

           0       0.70      0.82      0.76       363
           1       0.86      0.75      0.80       509

    accuracy                           0.78       872
   macro avg       0.78      0.79      0.78       872
weighted avg       0.79      0.78      0.78       872

TF-IDF lemmatised sentences 
               precision    recall  f1-score   support

           0       0.70      0.84      0.76       357
           1       0.87      0.75      0.81       515

    accuracy                           0.79       872
   macro avg       0.78      0.79      0.78    

# 💠 Insights on Model Performance

# 💠 Insights on Similarity

**GloVe**

- Pre-trained word embeddings (300 dimensions)
- Understands semantic meaning and word relationships
- Each word is converted to a vector capturing its meaning
- Sentences are averaged word vectors

*Strengths:*
- Captures semantic relationships
- Recognises "good" and "great" as similar
- Produces non-zero similarities between unrelated reviews
- Captures general word meaning learned from a large corpus
- Good for classification tasks

*Weaknesses:*
- Requires pre-trained model (must download)
- Maps all instances of same word to same vector, so context is ignored
- Antonyms such as "good" and "bad" occur in similar contexts and often receive similar vectors
- May miss rare words not in training data
- Slower than BoW and TF-IDF

*Best For:*
- Sentiment analysis
- Semantic similarity
- Text classification
- Understanding meaning
- Paraphrase detection

*Not Good For:*
- Exact duplicate detection
- Spam filtering (need exact patterns)
- Simple word counting tasks



**Word2Vec** (Word to Vector)

- Neural network model that learns word embeddings from your corpus
- Trains on local word contexts (nearby words)
- Each word converted to a vector (typically 100-300 dimensions)
- Sentences are averaged word vectors
- Two approaches: Skip-gram and CBOW

*Strengths:*
- Captures semantic relationships like GloVe
- Trained on your specific corpus (learns domain-specific meanings)
- "good" and "great" recognised as similar
- Can outperform general-purpose vectors on domain-specific text when the corpus is large enough
- Produces non-zero similarities between unrelated reviews
- Can find most similar words to any word in the vocabulary

*Weaknesses:*
- Requires training on your data (slow for large datasets)
- Needs sufficient data to train well
- Different training runs produce different vectors
- Antonyms such as "good" and "bad" often receive similar vectors because they appear in the same contexts
- Can overfit with small datasets
- More complex to implement than TF-IDF or BoW

*Best For:*
- Sentiment analysis on domain-specific text
- Semantic similarity in specialized domains
- Text classification with custom embeddings
- Understanding meaning in your specific corpus
- When you have enough training data

*Not Good For:*
- Small datasets (won't train well)
- Exact duplicate detection
- Spam filtering (need exact patterns)
- When you need pre-trained general knowledge



**BoW**

- Counts word occurrences in each document
- Ignores word order and document structure
- Creates sparse vectors (mostly zeros)
- Each vector entry is the count of one vocabulary word

*Strengths:*
- Very simple and fast
- Easy to implement
- Good for exact word matching
- Easy to interpret
- Low computational cost
- Works with small datasets

*Weaknesses:*
- Carries no information about word importance (all words are weighted equally)
- Only exact word overlap counts
- Similarity is zero for sentences with no words in common
- Loses word order completely
- Treats "good" and "great" as different words
- High sparsity (mostly zeros)

*Best For:*
- Duplicate detection (exact same words)
- Spam filtering (similar document structure)
- Topic matching (documents about exact same topic)
- Simple baseline
- Large-scale document retrieval

*Not Good For:*
- Sentiment analysis (different words for same sentiment)
- Paraphrase detection (same meaning, different words)
- General text similarity
- Semantic tasks
- Synonym handling


**TF-IDF**

- Weights words by their importance (frequency times rarity)
- TF: how often a word appears in a document
- IDF: how rare a word is across all documents
- With scikit-learn's default L2 normalisation, each weight lies between 0 and 1
- Sentences become weighted word vectors

*Strengths:*
- Simple and fast
- Easy to understand and interpret
- Emphasises distinctive and important words
- Better than BoW for text classification
- Handles variable-length documents well
- Good baseline method

*Weaknesses:*
- Relies only on word overlap
- Sentences with no words in common get zero similarity
- Treats "good" and "great" as completely different
- Cannot recognise synonyms
- Creates high-dimensional sparse vectors

*Best For:*
- Text classification when words matter
- Information retrieval
- Baseline comparisons
- Simple similarity tasks
- Documents with repeated keywords

*Not Good For:*
- Sentiment analysis (needs semantic understanding)
- Paraphrase detection (different words = 0)
- General semantic similarity
- Handling synonyms


Given these differences, it is expected that GloVe and Word2Vec give non-zero similarities while BoW and TF-IDF give zero: the compared reviews share no words within the vectorizer vocabulary, so their count-based vectors do not overlap, whereas dense embeddings still do.

Word2Vec is trained on this specific corpus (movie reviews) and therefore learns the patterns specific to our data.
GloVe is pre-trained on billions of tokens of general text, so it captures broader semantic meaning.

**Word2Vec:**

- Gives more moderate scores that cluster closer together (0.70, 0.41 and 0.41 here)
- Less able to separate the similar pair from the dissimilar ones
- Likely reason: trained on a limited corpus (just our data)

**GloVe:**

- Gives a wider range of scores (0.79, 0.23 and 0.23 here)
- Better at distinguishing similar vs dissimilar
- Likely reason: trained on a massive general corpus

