<div class="alert alert-success">
Important: Some of this project is pre-code, and may not fully reflect how my projects would usually look.

# Project Statement

The Film Junky Union, a new edgy community for classic movie enthusiasts, is developing a system for filtering and categorizing movie reviews. The goal is to train a model to automatically detect negative reviews. You'll be using a dataset of IMDB movie reviews with polarity labelling to build a model for classifying positive and negative reviews. It will need to have an F1 score of at least 0.85.
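
For reference, the F1 target above is the harmonic mean of precision and recall. The block below states the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN); the symbols are notation introduced here, not taken from the project statement.

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
```

Because it is a harmonic mean, F1 is only high when precision and recall are both high.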

## Initialization

In [1]:
import math

import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

from tqdm.auto import tqdm

import nltk

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

from nltk.corpus import stopwords

import en_core_web_sm
import re

ModuleNotFoundError: No module named 'en_core_web_sm'

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'png'
# the next line provides graphs of better quality on HiDPI screens
%config InlineBackend.figure_format = 'retina'

# note: matplotlib >= 3.6 renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')

In [None]:
# this is to use progress_apply, read more at https://pypi.org/project/tqdm/#pandas-integration
tqdm.pandas()

## Load Data

In [None]:
df_reviews = pd.read_csv('/datasets/imdb_reviews.tsv', sep='\t', dtype={'votes': 'Int64'})

In [None]:
df_reviews.head()

In [None]:
# Dropping unneeded columns
df_reviews = df_reviews.drop(['title_type', 'primary_title', 'original_title', 'end_year', 'runtime_minutes', 
                              'is_adult', 'genres', 'average_rating', 'votes', 'sp', 'idx'], axis=1)
df_reviews

In [None]:
df_reviews.describe()

## EDA

Let's check the number of movies and reviews over the years.

In [None]:
fig, axs = plt.subplots(2, 1, figsize=(16, 8))

ax = axs[0]

dft1 = df_reviews[['tconst', 'start_year']].drop_duplicates() \
    ['start_year'].value_counts().sort_index()
dft1 = dft1.reindex(index=np.arange(dft1.index.min(), max(dft1.index.max(), 2021))).fillna(0)
dft1.plot(kind='bar', ax=ax)
ax.set_title('Number of Movies Over Years')

ax = axs[1]

dft2 = df_reviews.groupby(['start_year', 'pos'])['pos'].count().unstack()
dft2 = dft2.reindex(index=np.arange(dft2.index.min(), max(dft2.index.max(), 2021))).fillna(0)

dft2.plot(kind='bar', stacked=True, label='#reviews (neg, pos)', ax=ax)

dft2 = df_reviews['start_year'].value_counts().sort_index()
dft2 = dft2.reindex(index=np.arange(dft2.index.min(), max(dft2.index.max(), 2021))).fillna(0)
dft3 = (dft2/dft1).fillna(0)
axt = ax.twinx()
dft3.reset_index(drop=True).rolling(5).mean().plot(color='orange', label='reviews per movie (avg over 5 years)', ax=axt)

lines, labels = axt.get_legend_handles_labels()
ax.legend(lines, labels, loc='upper left')

ax.set_title('Number of Reviews Over Years')

fig.tight_layout()

The number of movies rises steadily from the 1930s until the mid-1990s, and then increases sharply.

After 2006 the number of movies drops rapidly until about 2011, which is likely around the time the data was compiled.

The number of reviews also rises steadily, but is overall much more stable than the number of movies.

The split between negative and positive reviews is fairly consistent, though it shifts toward positive over time, possibly as the industry learns and grows. About half of all reviews are negative in the early years, falling to about one third by the 2000s.

Let's check the distribution of the number of reviews per movie, using both exact counts and a KDE (just to see how the KDE differs from the exact counts).

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(16, 5))

ax = axs[0]
dft = df_reviews.groupby('tconst')['review'].count() \
    .value_counts() \
    .sort_index()
dft.plot.bar(ax=ax)
ax.set_title('Bar Plot of #Reviews Per Movie')

ax = axs[1]
dft = df_reviews.groupby('tconst')['review'].count()
sns.kdeplot(dft, ax=ax)
ax.set_title('KDE Plot of #Reviews Per Movie')

fig.tight_layout()

Most movies have between 1 and 5 reviews, meaning that they are not as popular.

The count rises again specifically at 30 reviews, indicating that about 400 movies were very popular with reviewers.

The large peak from 0 to 5 in the KDE plot confirms that most movies have this many reviews, while only a handful become popular enough to receive more.

These graphs indirectly show how difficult it is for most movies to find an audience.
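
The "exact counting" behind the bar plot is a two-step tally: count reviews per movie, then count how many movies share each review count. Here is a minimal standard-library sketch of that idea. The movie IDs are made-up toy values, not rows from the dataset.

```python
from collections import Counter

# Hypothetical toy data: one movie ID per review (IDs are made up).
review_movie_ids = ["tt01", "tt01", "tt02", "tt03", "tt03", "tt03", "tt04"]

# Step 1: number of reviews for each movie.
reviews_per_movie = Counter(review_movie_ids)

# Step 2: number of movies for each review count (the heights of the bars).
movies_per_review_count = Counter(reviews_per_movie.values())

for n_reviews in sorted(movies_per_review_count):
    n_movies = movies_per_review_count[n_reviews]
    print(f"{n_reviews} review(s): {n_movies} movie(s)")
```

A KDE smooths these integer counts into a continuous curve, which is why it can show density below 1 review even though no movie in the data has fewer than one.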

In [None]:
df_reviews['pos'].value_counts()

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12, 4))

ax = axs[0]
dft = df_reviews.query('ds_part == "train"')['rating'].value_counts().sort_index()
dft = dft.reindex(index=np.arange(min(dft.index.min(), 1), max(dft.index.max(), 11))).fillna(0)
dft.plot.bar(ax=ax)
ax.set_ylim([0, 5000])
ax.set_title('The train set: distribution of ratings')

ax = axs[1]
dft = df_reviews.query('ds_part == "test"')['rating'].value_counts().sort_index()
dft = dft.reindex(index=np.arange(min(dft.index.min(), 1), max(dft.index.max(), 11))).fillna(0)
dft.plot.bar(ax=ax)
ax.set_ylim([0, 5000])
ax.set_title('The test set: distribution of ratings')

fig.tight_layout()

The train and test sets have very similar distributions of ratings, so class imbalance shouldn't be an issue here.

Distribution of negative and positive reviews over the years for the two parts of the dataset:

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(16, 8), gridspec_kw=dict(width_ratios=(2, 1), height_ratios=(1, 1)))

ax = axs[0][0]

dft = df_reviews.query('ds_part == "train"').groupby(['start_year', 'pos'])['pos'].count().unstack()
dft.index = dft.index.astype('int')
dft = dft.reindex(index=np.arange(dft.index.min(), max(dft.index.max(), 2020))).fillna(0)
dft.plot(kind='bar', stacked=True, ax=ax)
ax.set_title('The train set: number of reviews of different polarities per year')

ax = axs[0][1]

dft = df_reviews.query('ds_part == "train"').groupby(['tconst', 'pos'])['pos'].count().unstack()
sns.kdeplot(dft[0], color='blue', label='negative', kernel='epa', ax=ax)
sns.kdeplot(dft[1], color='green', label='positive', kernel='epa', ax=ax)
ax.legend()
ax.set_title('The train set: distribution of different polarities per movie')

ax = axs[1][0]

dft = df_reviews.query('ds_part == "test"').groupby(['start_year', 'pos'])['pos'].count().unstack()
dft.index = dft.index.astype('int')
dft = dft.reindex(index=np.arange(dft.index.min(), max(dft.index.max(), 2020))).fillna(0)
dft.plot(kind='bar', stacked=True, ax=ax)
ax.set_title('The test set: number of reviews of different polarities per year')

ax = axs[1][1]

dft = df_reviews.query('ds_part == "test"').groupby(['tconst', 'pos'])['pos'].count().unstack()
sns.kdeplot(dft[0], color='blue', label='negative', kernel='epa', ax=ax)
sns.kdeplot(dft[1], color='green', label='positive', kernel='epa', ax=ax)
ax.legend()
ax.set_title('The test set: distribution of different polarities per movie')

fig.tight_layout()

The distributions of positive and negative reviews are similar between the two sets, further suggesting that class imbalance is not a concern.
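
Class balance can also be checked numerically, by computing the share of positive labels in each part. The sketch below does this in plain Python on made-up `(ds_part, pos)` pairs. The helper name `positive_share_by_part` is hypothetical. On the real DataFrame, something like `df_reviews.groupby('ds_part')['pos'].mean()` should give the same kind of summary.

```python
from collections import defaultdict

# Hypothetical toy rows: (ds_part, pos), where pos == 1 marks a positive review.
rows = [
    ("train", 1), ("train", 0), ("train", 1), ("train", 0),
    ("test", 0), ("test", 1), ("test", 1), ("test", 0),
]

def positive_share_by_part(rows):
    """Return {ds_part: fraction of rows with pos == 1}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for part, pos in rows:
        totals[part] += 1
        positives[part] += pos
    return {part: positives[part] / totals[part] for part in totals}

shares = positive_share_by_part(rows)
print(shares)
```

Shares close to 0.5 in both parts would indicate balanced classes.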

## Evaluation Procedure

Composing an evaluation routine that can be used for all models in this project.

In [None]:
import sklearn.metrics as metrics

def evaluate_model(model, train_features, train_target, test_features, test_target):
    
    eval_stats = {}
    
    fig, axs = plt.subplots(1, 3, figsize=(20, 6)) 
    
    for type, features, target in (('train', train_features, train_target), ('test', test_features, test_target)):
        
        eval_stats[type] = {}
    
        pred_target = model.predict(features)
        pred_proba = model.predict_proba(features)[:, 1]
        
        # F1
        f1_thresholds = np.arange(0, 1.01, 0.05)
        f1_scores = [metrics.f1_score(target, pred_proba>=threshold) for threshold in f1_thresholds]
        
        # ROC
        fpr, tpr, roc_thresholds = metrics.roc_curve(target, pred_proba)
        roc_auc = metrics.roc_auc_score(target, pred_proba)    
        eval_stats[type]['ROC AUC'] = roc_auc

        # PRC
        precision, recall, pr_thresholds = metrics.precision_recall_curve(target, pred_proba)
        aps = metrics.average_precision_score(target, pred_proba)
        eval_stats[type]['APS'] = aps
        
        if type == 'train':
            color = 'blue'
        else:
            color = 'green'

        # F1 Score
        ax = axs[0]
        max_f1_score_idx = np.argmax(f1_scores)
        ax.plot(f1_thresholds, f1_scores, color=color, label=f'{type}, max={f1_scores[max_f1_score_idx]:.2f} @ {f1_thresholds[max_f1_score_idx]:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(f1_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'
            ax.plot(f1_thresholds[closest_value_idx], f1_scores[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('threshold')
        ax.set_ylabel('F1')
        ax.legend(loc='lower center')
        ax.set_title(f'F1 Score') 

        # ROC
        ax = axs[1]    
        ax.plot(fpr, tpr, color=color, label=f'{type}, ROC AUC={roc_auc:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(roc_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'            
            ax.plot(fpr[closest_value_idx], tpr[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.plot([0, 1], [0, 1], color='grey', linestyle='--')
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('FPR')
        ax.set_ylabel('TPR')
        ax.legend(loc='lower center')        
        ax.set_title(f'ROC Curve')
        
        # PRC
        ax = axs[2]
        ax.plot(recall, precision, color=color, label=f'{type}, AP={aps:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(pr_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'
            ax.plot(recall[closest_value_idx], precision[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('recall')
        ax.set_ylabel('precision')
        ax.legend(loc='lower center')
        ax.set_title(f'PRC')        

        eval_stats[type]['Accuracy'] = metrics.accuracy_score(target, pred_target)
        eval_stats[type]['F1'] = metrics.f1_score(target, pred_target)
    
    df_eval_stats = pd.DataFrame(eval_stats)
    df_eval_stats = df_eval_stats.round(2)
    df_eval_stats = df_eval_stats.reindex(index=('Accuracy', 'F1', 'APS', 'ROC AUC'))
    
    print(df_eval_stats)
    
    return
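
To make the F1-versus-threshold panel more concrete, here is a minimal plain-Python sketch of the same sweep: turn probabilities into labels at each threshold from 0 to 1 in steps of 0.05, then compute F1. The labels and probabilities are made-up toy values, and `f1_at_threshold` is a hypothetical helper, not part of the routine above.

```python
def f1_at_threshold(y_true, proba, threshold):
    """F1 for the positive class when predicting 1 if proba >= threshold."""
    tp = fp = fn = 0
    for truth, p in zip(y_true, proba):
        pred = 1 if p >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    # With no positives predicted or present, report 0.0 instead of dividing by zero.
    return 2 * tp / denom if denom else 0.0

# Hypothetical toy labels and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0]
proba = [0.12, 0.42, 0.37, 0.81, 0.72, 0.58]

thresholds = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
scores = [f1_at_threshold(y_true, proba, t) for t in thresholds]

# Index of the first threshold that reaches the maximum F1.
best = max(range(len(thresholds)), key=scores.__getitem__)
print(f"F1 at 0.50: {f1_at_threshold(y_true, proba, 0.5):.2f}")
print(f"best F1: {scores[best]:.2f} at threshold {thresholds[best]:.2f}")
```

On this toy data the default threshold of 0.5 is not the best one, which is the situation the red and orange crosses in the plot are meant to expose.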

## Normalization

We assume all models below accept texts in lowercase and without any digits, punctuation marks, etc.

In [None]:
#moved

In [None]:
nlp = en_core_web_sm.load(disable=['parser', 'ner'])
corpus = df_reviews['review']

def clear_text(text):
    # Lowercase, replace everything except letters and apostrophes with spaces,
    # then collapse repeated whitespace
    pattern = r"[^a-z']"
    text = re.sub(pattern, " ", str(text).lower())
    return " ".join(text.split())

def lemmatize(text):
    # Lemmatize a single review with spaCy and rejoin the lemmas
    doc = nlp(text.lower())
    lemmas = [token.lemma_ for token in doc]
    return " ".join(lemmas)

In [None]:
df_reviews['review_norm'] = corpus.apply(clear_text)
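
As a quick illustration of the normalization assumed in this section (lowercase, no digits, no punctuation), here is a minimal standalone sketch. The function name `normalize_text` and the sample sentence are hypothetical. Apostrophes are kept so that contractions such as "didn't" survive.

```python
import re

def normalize_text(text):
    """Lowercase, replace everything except a-z and apostrophes with spaces,
    then collapse runs of whitespace into single spaces."""
    lowered = str(text).lower()
    letters_only = re.sub(r"[^a-z']", " ", lowered)
    return " ".join(letters_only.split())

sample = "I didn't expect 2 sequels... Wow!!"
print(normalize_text(sample))  # i didn't expect sequels wow
```

Note that the character class is written as `a-z` after lowercasing. A range such as `A-z` would also match the characters between `Z` and `a` in ASCII, for example `[`, `]`, `^`, and `_`.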

## Train / Test Split

Luckily, the whole dataset is already divided into train and test parts. The corresponding flag is 'ds_part'.

In [None]:
df_reviews_train = df_reviews.query('ds_part == "train"').copy()
df_reviews_test = df_reviews.query('ds_part == "test"').copy()

X_train = df_reviews_train['review_norm']
X_test = df_reviews_test['review_norm']
y_train = df_reviews_train['pos']
y_test = df_reviews_test['pos']

print(df_reviews_train.shape)
print(df_reviews_test.shape)

In [None]:
df_reviews_train['review']

In [None]:
df_reviews_train['review_norm']

In [None]:
X_train

## Working with models

### Model 0 - Constant

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
dummy_clf = DummyClassifier()
dummy_clf.fit(df_reviews_train.drop('pos', axis=1), df_reviews_train['pos'])
dummy_clf.predict(df_reviews_train.drop('pos', axis=1))
dummy_clf.score(df_reviews_train.drop('pos', axis=1), df_reviews_train['pos'])

In [None]:
evaluate_model(dummy_clf, df_reviews_train.drop('pos', axis=1), df_reviews_train['pos'], 
               df_reviews_test.drop('pos', axis=1), df_reviews_test['pos'])

### Model 1 - NLTK, TF-IDF and LR

TF-IDF

In [None]:
def lemmatize(text):
    # Lemmatize each token with WordNet and rejoin the lemmas with single spaces
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    return " ".join(lemmatizer.lemmatize(w) for w in tokens)

In [None]:
# Creating a TF-IDF vectorizer with English stop words
# (passed as a list, since recent scikit-learn versions reject a set here)
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords.words('english')))

# Defining train features & target
X_train_lemm = X_train.progress_apply(lemmatize)
X_test_lemm = X_test.progress_apply(lemmatize)

X_train_trans = count_tf_idf.fit_transform(X_train_lemm)
X_test_trans = count_tf_idf.transform(X_test_lemm)

# Defining Logistic Regression Model
model_1 = LogisticRegression()
model_1.fit(X_train_trans, y_train)

In [None]:
evaluate_model(model_1, X_train_trans, y_train, X_test_trans, y_test)
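
For intuition about what the TF-IDF features encode, here is a minimal sketch of the textbook formula (term frequency times the log of inverse document frequency) on three made-up documents. This is an illustration only. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ. The names `doc_freq` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already normalized.
docs = [
    "great movie great cast",
    "boring movie",
    "great fun",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears at least once.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Textbook TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "cast" outweighs "movie" even though both occur once, because "cast" appears in fewer documents and is therefore more distinctive.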

### Model 2 - spaCy, TF-IDF and LR

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [None]:
# spaCy preprocessing function (uses the `nlp` pipeline loaded in the previous cell)

def lemmatize_spacy(text):
    doc = nlp(text.lower())
    lemmas = [token.lemma_ for token in doc]
    return " ".join(lemmas)

In [None]:
# Creating a TF-IDF vectorizer with English stop words
# (passed as a list, since recent scikit-learn versions reject a set here)
count_tf_idf2 = TfidfVectorizer(stop_words=list(stopwords.words('english')))

# Defining train features & target
X_train_spacy = X_train.progress_apply(lemmatize_spacy)
X_test_spacy = X_test.progress_apply(lemmatize_spacy)

X_train_trans = count_tf_idf2.fit_transform(X_train_spacy)
X_test_trans = count_tf_idf2.transform(X_test_spacy)

# Defining Logistic Regression Model
model_2 = LogisticRegression()
model_2.fit(X_train_trans, y_train)

In [None]:
evaluate_model(model_2, X_train_trans, y_train, X_test_trans, y_test)

### Model 3 - spaCy, TF-IDF and LGBMClassifier

In [None]:
from lightgbm import LGBMClassifier

In [None]:
# Creating an LGBM model on the spaCy TF-IDF features built for Model 2
model_3 = LGBMClassifier(num_leaves=12)
model_3.fit(X_train_trans, y_train)

In [None]:
evaluate_model(model_3, X_train_trans, y_train, X_test_trans, y_test)

## My Reviews

In [None]:
my_reviews = pd.DataFrame([
    'I did not simply like it, not my kind of movie.',
    'Well, I was bored and felt asleep in the middle of the movie.',
    'I was really fascinated with the movie',    
    'Even the actors looked really old and disinterested, and they got paid to be in the movie. What a soulless cash grab.',
    'I didn\'t expect the reboot to be so good! Writers really cared about the source material',
    'The movie had its upsides and downsides, but I feel like overall it\'s a decent flick. I could see myself going to see it again.',
    'What a rotten attempt at a comedy. Not a single joke lands, everyone acts annoying and loud, even kids won\'t like this!',
    'Launching on Netflix was a brave move & I really appreciate being able to binge on episode after episode, of this exciting intelligent new drama.'
], columns=['review'])

my_reviews['review_norm'] = my_reviews['review'].apply(clear_text)

my_reviews

### Model 1

In [None]:
texts = my_reviews['review_norm']

# Apply the same NLTK lemmatization used when fitting count_tf_idf
my_reviews_pred_prob = model_1.predict_proba(count_tf_idf.transform(texts.apply(lemmatize)))[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

### Model 2

In [None]:
texts = my_reviews['review_norm']

my_reviews_pred_prob = model_2.predict_proba(count_tf_idf2.transform(texts.apply(lambda x: lemmatize_spacy(x))))[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

### Model 3

In [None]:
texts = my_reviews['review_norm']

my_reviews_pred_prob = model_3.predict_proba(count_tf_idf2.transform(texts.apply(lambda x: lemmatize_spacy(x))))[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

## Conclusions

Models 1 and 2 reached the required F1 score of at least 0.85, but Model 3 did not. Model 3 was trained with only `num_leaves=12` set, so this could be due to inadequate hyperparameter tuning.

Between the first two models, Model 1 (NLTK) is the better choice in terms of speed.
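
A speed comparison like the one above can be backed by a simple timing helper. The sketch below is one way to do that, assuming wall-clock time over a few repeats is an acceptable measure. `time_call` and `toy_preprocess` are hypothetical names, and the toy function only stands in for a real lemmatization step.

```python
import time

def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

def toy_preprocess(texts):
    # Stand-in for a preprocessing step: lowercase and collapse whitespace.
    return [" ".join(text.lower().split()) for text in texts]

texts = ["An Example   review"] * 1000
elapsed = time_call(toy_preprocess, texts)
print(f"best of 3: {elapsed:.6f} s")
```

Timing each preprocessing function on the same sample of reviews would give a like-for-like comparison.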