# Movie reviews prediction
NLP | Binary classification | NLTK • spaCy • BERT

***

## Project Description

### The objective

The Film Junky Union, a new edgy community for classic movie enthusiasts, is developing a system for filtering and categorizing movie reviews. The goal is to train a model that automatically detects negative reviews. To do so, we use a dataset of IMDb movie reviews with polarity labels to build a model that classifies reviews as positive or negative.

### Data description

Here's the description of the fields selected for this task:
- `review`: the review text
- `pos`: the target, `0` for negative and `1` for positive
- `ds_part`: `train`/`test` for the train/test part of the dataset, respectively

*The data was provided by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).*

***

## Basic libraries and settings

In [1]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

import math  # needed for math.ceil when batching BERT inputs
from tqdm.auto import tqdm
import re
import time

import warnings
warnings.filterwarnings("ignore")

In [2]:
plt.style.use("seaborn-v0_8")
sns.set_style("darkgrid", {"axes.facecolor": ".95"})
sns.set_palette("mako")
sns.set_context("notebook")

plt.rcParams["figure.figsize"] = (10, 4)
%matplotlib inline
%config InlineBackend.figure_format = "retina"

# this one is to use progress_apply
tqdm.pandas()

***

## Data Preprocessing

### Read and look at the data

In [3]:
data = pd.read_csv("datasets/imdb_reviews.tsv", sep="\t",
                   usecols=["review","pos","ds_part","start_year","end_year","tconst", "is_adult"])

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47331 entries, 0 to 47330
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   tconst      47331 non-null  object
 1   start_year  47331 non-null  int64 
 2   end_year    47331 non-null  object
 3   is_adult    47331 non-null  int64 
 4   review      47331 non-null  object
 5   pos         47331 non-null  int64 
 6   ds_part     47331 non-null  object
dtypes: int64(3), object(4)
memory usage: 2.5+ MB


In [22]:
data.sample(3)

| | tconst | start_year | end_year | is_adult | review | pos | ds_part |
|---|---|---|---|---|---|---|---|
| 34986 | tt0440963 | 2007 | \N | 0 | I don't understand people. Why is it that this... | 0 | train |
| 36916 | tt0080731 | 1980 | \N | 0 | Even if one didn't realize that Sellers was in... | 0 | train |
| 169 | tt0450951 | 2005 | \N | 0 | I had heard interesting critics on this movie.... | 1 | test |


In [23]:
data.describe(percentiles=np.arange(0.1, 1, 0.1)).T

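| | count | mean | std | min | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| start_year | 47331.0 | 1989.631235 | 19.600364 | 1894.0 | 1957.0 | 1978.0 | 1986.0 | 1993.0 | 1998.0 | 2000.0 | 2003.0 | 2005.0 | 2006.0 | 2010.0 |
| is_adult | 47331.0 | 0.001732 | 0.041587 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| pos | 47331.0 | 0.498954 | 0.500004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |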


In [24]:
data.describe(include="O").T

| | count | unique | top | freq |
|---|---|---|---|---|
| tconst | 47331 | 6648 | tt0067445 | 30 |
| end_year | 47331 | 60 | \N | 45052 |
| review | 47331 | 47240 | Loved today's show!!! It was a variety and not... | 5 |
| ds_part | 47331 | 2 | train | 23796 |


**NOTE**
- The `end_year` column has an object dtype because it contains the "\N" string;
- There are only two rows where values are explicitly missing;
- The distribution of `rating` values seems to be right-skewed;
- The target variable `pos` is fairly class-balanced;
- According to the [IMDb site](https://www.imdb.com/interfaces/#:~:text=tconst%20(string)%20%2D-,alphanumeric%20unique%20identifier%20of%20the%20title,-directors%20(array%20of), `tconst` is a unique identifier of the movie title;
- Several review texts are duplicated.

*Fun notice:*
- More than 90% of the reviewers are not adults :)

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected

</div>

### Cleaning

Duplicate review texts won't bring new information to a model. Check if there are any duplicates. 

In [None]:
# calculate how many duplicated reviews there are
data.duplicated(subset=["review"]).sum()

In [None]:
# make sure duplicated reviews carry the same label (the count should match the one above)
data.duplicated(subset=["review","pos"]).sum()

In [None]:
# drop the rows with duplicated reviews
data.drop_duplicates(subset=["review"], inplace=True)

# reset the dataframe index
data.reset_index(drop=True, inplace=True)
# check the number of rows left
data.shape

<div class="alert alert-success">
<b>Reviewer's comment</b>

Great, duplicate entries were removed

</div>

***

## EDA

### Number of movies/reviews over years

In [None]:
# calculate the number of movies for every year
n_movies = (
    data[["tconst","start_year"]] # select only the unique title identifier and start year
    .drop_duplicates()["start_year"] # drop duplicated rows and select only the column with years
    .value_counts() # count the number of movies released each year
    .sort_index() # sort the series in ascending order
)

In [None]:
# calculate the number of reviews per start year
n_reviews = (
    data.groupby(["start_year","pos"])["pos"]
    .count()
    .unstack() # make positive and negative reviews to be separate columns
)

In [None]:
# calculate the moving average of the reviews-per-movie ratio
ratio = (
    (data["start_year"].value_counts().sort_index() / n_movies)
    .reset_index(drop=True)
    .rolling(5)
    .mean()
)
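
The rolling average used for the ratio can be sketched in plain Python. This is only an illustration of what `.rolling(5).mean()` computes with its default settings, not a replacement for the pandas call; the helper name `rolling_mean` and the toy values are made up for the example.

```python
def rolling_mean(values, window):
    """Mean over the last `window` values; None until the window is full."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            # pandas yields NaN here because the window is not full yet
            result.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            result.append(sum(chunk) / window)
    return result


# toy reviews-per-movie ratios for seven consecutive years (hypothetical)
toy_ratio = [1, 2, 3, 4, 5, 6, 7]
smoothed = rolling_mean(toy_ratio, window=5)
print(smoothed)
```

The first four positions stay empty, which is why the smoothed line in the plot starts a few years after the first bar. Resetting the index to `0..n-1` lets the line share the positional x axis that a bar plot uses for its categories.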

In [None]:
fig, axs = plt.subplots(2, 1, figsize=(12, 8))

# plot the number of movies released every year
ax = axs[0] 
n_movies.plot(kind="bar", ax=ax, color="MediumPurple")
ax.set_title("Number of Movies Over Years")
ax.set_ylabel("Number of Movies")
ax.set_xticklabels(n_movies.index,fontsize=8)

# plot the number of reviews
ax = axs[1]
n_reviews.plot(kind="bar", stacked=True, ax=ax)
ax.set_title("Number of Reviews Over Years")
ax.set_ylabel("Number of Reviews")
ax.set_xlabel("Start Year")
ax.set_xticklabels(n_reviews.index,fontsize=8)
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, ["negative", "positive"], loc="best")

# plot the rolling reviews per movie ratio
ax_t = ax.twinx()
ratio.plot(ax=ax_t, color="MediumPurple", label="Reviews per movie (5-year avg)", grid=False)
ax_t.set_ylabel("Reviews per Movie Ratio")
lines, labels = ax_t.get_legend_handles_labels()
ax_t.legend(lines, labels, loc="center left")

fig.tight_layout();

**NOTE**  

The average number of reviews per movie increases over the years, although not as drastically as the total number of movies.

### Reviews number distribution

In [None]:
fig, axs = plt.subplots(1,2, figsize=(16, 4))
fig.suptitle("Reviews Per Movie", fontsize=14)

ax = axs[0]
(data
 .groupby("tconst")["review"] # group the data by unique movie identifiers
 .count() # count the number of reviews per each movie
 .value_counts() # count the repeating number of reviews
 .sort_index() # sort values before plotting
 .plot(kind="bar", ax=axs[0])
)
ax.set_title("Bar Plot")
ax.set_xlabel("Number of Reviews")
ax.set_ylabel("Quantity")

ax = axs[1]
(data
 .groupby("tconst")["review"]
 .count()
 .plot(kind="kde", ax=axs[1])
)
ax.set_xlim([-5, 35])
ax.set_xlabel("Number of Reviews")
ax.set_title("KDE Plot");

**NOTE**  

Most movies have only a few reviews, though there is a spike in the number of movies with exactly 30 reviews.

### Classes balance

In [None]:
# check the quantity of each target class in the train data
data[data["ds_part"] == "train"]["pos"].value_counts()

In [None]:
data[data["ds_part"] == "test"]["pos"].value_counts()

**NOTE**  

The target classes are balanced.

### Ratings distribution

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(16,4))
plt.suptitle("Ratings Distribution", fontsize=14)

for axis, ds_part in ((0, "train"), (1, "test")):
    
    ax = axs[axis]
    
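    # NOTE: this cell needs the `rating` column, which is not among the
    # `usecols` passed to `pd.read_csv` above; add it there before running
    ratings = (data.query("ds_part == @ds_part")["rating"]
               .value_counts().sort_index())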
    
    ratings = (ratings
               .reindex(index=np.arange(min(ratings.index.min(), 1), max(ratings.index.max(), 11)))
               .fillna(0))
       
    ratings.plot.bar(ax=ax)
    ax.set_ylabel("Quantity")
    ax.set_xlabel("Rating value")
    ax.set_ylim([0, 5000])
    ax.set_title("{} data".format(ds_part))

**NOTE**  

We have an even distribution of ratings between the training and test datasets.

### Neg vs Pos reviews distribution

Distribution of negative and positive reviews over the years for two parts of the dataset.

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(16, 8), gridspec_kw={"width_ratios":(3,2), "height_ratios":(1,1)})

for axis, ds_part in ((0, "train"), (1, "test")):
    
    ax = axs[axis][0]
    reviews = (data
              .query("ds_part == @ds_part")
              .groupby(["start_year", "pos"])["pos"]
              .count()
              .unstack()
             )
    reviews.index = reviews.index.astype("int")
    reviews = reviews.reindex(
        index=np.arange(reviews.index.min(), max(reviews.index.max(), 2015))
    ).fillna(0)
    
    reviews.plot(kind="bar", stacked=True, ax=ax, grid=False)
    ax.set_xticklabels("")
    ax.set_title("The {} set: number of reviews of different polarities per year".format(ds_part))
    ax.set_xlabel("Timeline")
    ax.set_ylabel("Number of reviews")
    lines, labels = ax.get_legend_handles_labels()
    ax.legend(lines, ["negative", "positive"], loc="best")
    
    ax = axs[axis][1]
    reviews = (data
               .query("ds_part == @ds_part")
               .groupby(["tconst","pos"])["pos"]
               .count()
               .unstack()
              )
    
    sns.kdeplot(reviews[0], color="blue", label="negative", ax=ax)
    sns.kdeplot(reviews[1], color="green", label="positive", ax=ax)
    ax.set_title("The {} set: distribution of different polarities per movie".format(ds_part))
    ax.set_xlabel("Number of reviews")
    ax.legend()
    
fig.tight_layout()

**NOTE**  

As we can see from the graphs, the test and training sets have fairly even representation for all features and each target class. 

<div class="alert alert-success">
<b>Reviewer's comment</b>

Very well, you explored the data and made some interesting observations

</div>

***

## ML models

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

import spacy
import torch
import transformers

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics
from sklearn.dummy import DummyClassifier

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

We assume all models in this project accept texts in lowercase and without any digits, punctuation marks, etc.

### Evaluation Procedure
<a id="evaluation-procedure"></a>

Compose an evaluation routine which will be used for all models in this project.

In [None]:
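def evaluate_model(model, train_features, train_target, test_features, test_target):
    """
    Takes in a fitted model plus train/test features and targets, plots
    F1 vs. threshold, ROC and precision-recall curves for both parts, and
    returns a DataFrame with Accuracy, F1, APS and ROC AUC
    (Accuracy and F1 are the maxima over the checked thresholds).
    """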
    eval_stats = {}
    fig, axs = plt.subplots(1, 3, figsize=(10, 4))
    
    for ds_part, features, target in (("train", train_features, train_target), ("test", test_features, test_target)):
        
        eval_stats[ds_part] = {}
        color = "gold" if ds_part == "train" else "steelblue"
        
        pred_target = model.predict(features)
        pred_proba = model.predict_proba(features)[:,1]
            
        # F1 Score
        ax = axs[0]
        f1_thresholds = np.arange(0, 1.01, 0.05)
        
        f1_scores = [metrics.f1_score(target, pred_proba>=threshold) for threshold in f1_thresholds]
        accuracies = [metrics.accuracy_score(target, pred_proba>=threshold) for threshold in f1_thresholds]
        
        max_f1_score_idx = np.argmax(f1_scores)
        max_accuracy_idx = np.argmax(accuracies)
        eval_stats[ds_part]['F1'] = f1_scores[max_f1_score_idx]
        eval_stats[ds_part]['Accuracy'] = accuracies[max_accuracy_idx]
        
        ax.plot(f1_thresholds, f1_scores, color=color, 
                label=f'{ds_part}, max={f1_scores[max_f1_score_idx]:.2f} @ {f1_thresholds[max_f1_score_idx]:.2f}')
        # mark the point for the best threshold
        ax.plot(f1_thresholds[max_f1_score_idx], f1_scores[max_f1_score_idx], color="green", marker='o', markersize=7)
        # setting crosses for some thresholds
        for threshold in np.arange(0.1, 1.01, 0.1):
            closest_value_idx = np.argmin(np.abs(f1_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'
            ax.plot(f1_thresholds[closest_value_idx], f1_scores[closest_value_idx], color=marker_color, marker='X', markersize=5)
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('threshold')
        ax.set_ylabel('F1')
        ax.legend(loc='lower center')
        ax.set_title(f'F1 Score')

        # ROC
        ax = axs[1]    
        fpr, tpr, roc_thresholds = metrics.roc_curve(target, pred_proba)
        roc_auc = metrics.roc_auc_score(target, pred_proba)
        eval_stats[ds_part]["ROC AUC"] = roc_auc
        
        ax.plot(fpr, tpr, color=color, label=f'{ds_part}, ROC AUC={roc_auc:.2f}')
        # setting crosses for some thresholds
        for threshold in np.arange(0.1, 1.01, 0.1):
            closest_value_idx = np.argmin(np.abs(roc_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'            
            ax.plot(fpr[closest_value_idx], tpr[closest_value_idx], color=marker_color, marker='X', markersize=5)
        ax.plot([0, 1], [0, 1], color='grey', linestyle='--')
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('FPR')
        ax.set_ylabel('TPR')
        ax.legend(loc='lower center')        
        ax.set_title(f'ROC Curve')
        
        # Precision Recall Curve
        ax = axs[2]
        precision, recall, pr_thresholds = metrics.precision_recall_curve(target, pred_proba)
        ap_score = metrics.average_precision_score(target, pred_proba) 
        eval_stats[ds_part]["APS"] = ap_score
        
        ax.plot(recall, precision, color=color, label=f'{ds_part}, AP={ap_score:.2f}') 
        # setting crosses for some thresholds
        for threshold in np.arange(0.1, 1.01, 0.1):
            closest_value_idx = np.argmin(np.abs(pr_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'
            ax.plot(recall[closest_value_idx], precision[closest_value_idx], color=marker_color, marker='X', markersize=5)
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([0.48, 1.02])
        ax.set_xlabel('recall')
        ax.set_ylabel('precision')
        ax.legend(loc='lower center')
        ax.set_title('Precision Recall Curve')        
    
    fig.tight_layout()
    
    df_eval_stats = pd.DataFrame(eval_stats)
    df_eval_stats = df_eval_stats.round(2)
    df_eval_stats = df_eval_stats.reindex(index=('Accuracy', 'F1', 'APS', 'ROC AUC'))
    
    return df_eval_stats
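
The threshold sweep at the core of the F1 panel can be sketched without any plotting. The snippet below is a simplified illustration, not the routine above: the helper `f1_at_threshold` and the toy labels and probabilities are made up, and F1 is computed by hand instead of with `sklearn.metrics`.

```python
def f1_at_threshold(y_true, y_proba, threshold):
    """F1 of the positive class when predicting 1 for proba >= threshold."""
    tp = fp = fn = 0
    for label, proba in zip(y_true, y_proba):
        pred = 1 if proba >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# toy labels and predicted probabilities (hypothetical values)
y_true = [0, 0, 1, 1, 1, 0]
y_proba = [0.10, 0.40, 0.35, 0.80, 0.65, 0.55]

# the same grid as in evaluate_model: 0.00, 0.05, ..., 1.00
thresholds = [round(0.05 * i, 2) for i in range(21)]
scores = [f1_at_threshold(y_true, y_proba, t) for t in thresholds]

# like np.argmax, pick the first threshold with the highest score
best_idx = max(range(len(scores)), key=scores.__getitem__)
best_threshold, best_f1 = thresholds[best_idx], scores[best_idx]
print(best_threshold, round(best_f1, 2))
```

Because the reported F1 is the maximum over the grid, it can be higher than the F1 at the default 0.5 threshold, which is worth keeping in mind when comparing models.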

### Text preprocessing

#### Normalization

In [None]:
# create a new column with a normalized review text
data["review_norm"] = [
    " ".join( # split and join to remove extra spaces made after substitution
        re.sub(r"[^a-z']"," ", text.lower()).split() # substitute all characters 
                                                     # that don't fit the pattern
    ) 
    for text 
    in data["review"]
]

In [None]:
# check how normalization has worked
for index in data.sample(5).index:
    print(" RAW: ", data.loc[index, "review"][:100])
    print("NORM: ", data.loc[index, "review_norm"][:100], "\n")
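
To make the effect of the normalization pattern concrete, here is a small stand-alone sketch of the same steps applied to a single string. The function name `normalize` and the sample texts are made up for illustration.

```python
import re


def normalize(text):
    """Lowercase, keep only a-z and the straight apostrophe, squeeze spaces."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z']", " ", lowered)
    # split/join collapses the runs of spaces left by the substitution
    return " ".join(letters_only.split())


sample = "I didn't like it... 2/10!!!   Awful."
print(normalize(sample))

# a typographic apostrophe is not in the pattern, so it becomes a space
print(normalize("didn’t"))
```

One consequence of the pattern is that only the straight apostrophe `'` survives. Texts typed with a typographic apostrophe end up with split contractions, which may matter for reviews copied from other sources.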

<div class="alert alert-success">
<b>Reviewer's comment</b>

Normalization was done successfully

</div>

#### Lemmatize with NLTK

In [None]:
# lemmatize texts with NLTK package
lemmatizer = WordNetLemmatizer()

def lemmatize_nltk(text):
    text_lemmatized = " ".join( # join lemmatized words back into text
        [lemmatizer.lemmatize(token) for token in word_tokenize(text)] # get lemmas for each word in the text
    )
    return text_lemmatized

data["lemmatized_nltk"] = data["review_norm"].apply(lemmatize_nltk)

In [None]:
# check how lemmatization has worked
for index in data.sample(5).index:
    print(" NORM: ", data.loc[index, "review_norm"][:100])
    print("LEMMA: ", data.loc[index, "lemmatized_nltk"][:100], "\n")

In [None]:
# create masks for selecting the train and the test set
mask_train = data["ds_part"] == "train"
mask_test = data["ds_part"] == "test"

In [None]:
# vectorize texts
stop_words = stopwords.words("english")  # keep it a list: newer scikit-learn versions may reject a set

vectorizer_1 = TfidfVectorizer(stop_words=stop_words)

train_features_1 = vectorizer_1.fit_transform(data[mask_train]["lemmatized_nltk"])
test_features_1 = vectorizer_1.transform(data[mask_test]["lemmatized_nltk"])

words_1 = vectorizer_1.get_feature_names_out()

#### Lemmatize with spaCy

In [None]:
nlp = spacy.load("en_core_web_sm", disable=["parser","ner"])

data["lemmatized_spacy"] = [
    " ".join(
        [token.lemma_ for token in nlp(text) if not token.is_stop]
    )
    for text
    in data["review_norm"]
]

In [None]:
# check how lemmatization has worked
for index in data.sample(5).index:
    print(" NORM: ", data.loc[index, "review_norm"][:100])
    print("LEMMA: ", data.loc[index, "lemmatized_spacy"][:100], "\n")

In [None]:
# vectorize texts

vectorizer_2 = TfidfVectorizer() # we already excluded stop words

train_features_2 = vectorizer_2.fit_transform(data[mask_train]["lemmatized_spacy"])
test_features_2 = vectorizer_2.transform(data[mask_test]["lemmatized_spacy"])

words_2 = vectorizer_2.get_feature_names_out()

<div class="alert alert-success">
<b>Reviewer's comment</b>

It's nice that you tried lemmatization! TF-IDF vectorizer was applied correctly

</div>

#### N-grams vectorizing

Try training the model on n-grams (sequences of up to five consecutive words).

In [None]:
vectorizer_3 = TfidfVectorizer(stop_words=stop_words, ngram_range=(1,5))

train_features_3 = vectorizer_3.fit_transform(data[mask_train]["lemmatized_nltk"])
test_features_3 = vectorizer_3.transform(data[mask_test]["lemmatized_nltk"])

words_3 = vectorizer_3.get_feature_names_out()
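
A rough sketch of what an n-gram range of 1 to 5 means for a single text. It only illustrates how quickly the number of candidate features grows; it does not reproduce the internals of `TfidfVectorizer`, and the helper `word_ngrams` and the sample sentence are made up.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "the plot was not good at all".split()

unigrams = word_ngrams(tokens, 1, 1)
up_to_five = word_ngrams(tokens, 1, 5)

print(len(unigrams), len(up_to_five))
print("not good" in up_to_five)
```

Longer n-grams capture phrases such as "not good", but most of them occur in only one or two reviews, so the feature space grows much faster than the amount of information in it.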

<div class="alert alert-success">
<b>Reviewer's comment</b>

4- and 5-grams seems overkill, but cool anyway!

</div>

#### Stemming with NLTK

Preprocess the texts using stemming.

In [None]:
# initialize a stemming module
ps = PorterStemmer()

def stemming_nltk(text):
    text_stemmed = " ".join( # join stemmed words back into text
        [ps.stem(token) for token in word_tokenize(text)] # get the stem of each word in the text
    )
    return text_stemmed

data["stemmed_nltk"] = data["review_norm"].apply(stemming_nltk)

In [None]:
# check how stemming has worked
for index in data.sample(5).index:
    print("NORM: ", data.loc[index, "review_norm"][:100])
    print("STEM: ", data.loc[index, "stemmed_nltk"][:100], "\n")

In [None]:
# vectorize texts
vectorizer_4 = TfidfVectorizer(stop_words=stop_words)

train_features_4 = vectorizer_4.fit_transform(data[mask_train]["stemmed_nltk"])
test_features_4 = vectorizer_4.transform(data[mask_test]["stemmed_nltk"])

words_4 = vectorizer_4.get_feature_names_out()

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright, stemming is another way to normalize text data

</div>

#### BERT embeddings

In [None]:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
config = transformers.BertConfig.from_pretrained('bert-base-uncased')
model = transformers.BertModel.from_pretrained('bert-base-uncased')

In [None]:
def get_bert_embeddings(texts, max_length=512, batch_size=100, force_device=None, disable_progress_bar=False):
    
    ids_list = []
    attention_mask_list = []
    
    for text in tqdm(texts, disable=disable_progress_bar):
        ids = tokenizer.encode(text, add_special_tokens=True, truncation=True, max_length=max_length)
        padded = np.array(ids + [0] * (max_length - len(ids)))
        attention_mask = np.where(padded != 0, 1, 0)
        ids_list.append(padded)
        attention_mask_list.append(attention_mask)
        
    if force_device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    else:
        device = torch.device(force_device)
        
    model.to(device)
    
    if not disable_progress_bar:
        print("Using the {} device.".format(device))
        
    embeddings = []
    
    for i in tqdm(range(math.ceil(len(ids_list)/batch_size)), disable=disable_progress_bar):
        
        ids_batch = torch.LongTensor(ids_list[batch_size*i:batch_size*(i+1)]).to(device)
        attention_mask_batch = torch.LongTensor(attention_mask_list[batch_size*i:batch_size*(i+1)]).to(device)
        
        with torch.no_grad():
            model.eval()
            batch_embeddings = model(input_ids=ids_batch, attention_mask=attention_mask_batch)
        embeddings.append(batch_embeddings[0][:,0,:].detach().cpu().numpy())
        
    return np.concatenate(embeddings)
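
The padding, attention-mask and batching logic of the function above can be sketched on toy data without loading BERT. The helper names `pad_and_mask` and `batch_slices` and the token ids are illustrative assumptions, not part of the `transformers` API.

```python
import math


def pad_and_mask(ids, max_length, pad_id=0):
    """Right-pad token ids to max_length and mark real tokens with 1."""
    padded = ids + [pad_id] * (max_length - len(ids))
    mask = [1 if token_id != pad_id else 0 for token_id in padded]
    return padded, mask


def batch_slices(n_items, batch_size):
    """Start/end indices of consecutive batches, the last one may be shorter."""
    n_batches = math.ceil(n_items / batch_size)
    return [
        (batch_size * i, min(batch_size * (i + 1), n_items))
        for i in range(n_batches)
    ]


# toy token ids for one short text (illustrative values)
toy_ids = [101, 2023, 3185, 102]
padded, mask = pad_and_mask(toy_ids, max_length=8)
print(padded)
print(mask)

# 250 texts in batches of 100 give three batches, the last with 50 texts
print(batch_slices(250, 100))
```

The mask tells the model to ignore the padded positions, so texts of different lengths can share one rectangular tensor per batch.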

<div class="alert alert-success">
<b>Reviewer's comment</b>

The code for generating BERT embeddings is correct

</div>

In [None]:
train_features_5 = get_bert_embeddings(data[mask_train]["review_norm"], force_device="mps", batch_size=64)

In [None]:
test_features_5 = get_bert_embeddings(data[mask_test]["review_norm"], force_device="mps", batch_size=64,)

### Model 0: Constant

Luckily, the whole dataset is already divided into train and test parts (*the corresponding flag is `ds_part`*). Now store the targets and features in variables.

In [None]:
train_features, train_target = data[mask_train]["review_norm"], data[mask_train]["pos"]
test_features, test_target = data[mask_test]["review_norm"], data[mask_test]["pos"]

print(train_features.shape)
print(test_features.shape)

In [None]:
# train a dummy model that predicts classes at random
dummy_model = DummyClassifier(strategy="uniform", random_state=555).fit(train_features, train_target)
# check the classes balance of the predicted values
print("Dummy predictions counts:")
pd.Series(dummy_model.predict(test_features)).value_counts()

In [None]:
evaluate_model(dummy_model, train_features, train_target, test_features, test_target)

**NOTE**  

We get an F1 score of 0.67 with the random baseline.
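
A short derivation of where 0.67 comes from, assuming the classes are balanced (about half of the labels are positive) and that a `DummyClassifier` with `strategy="uniform"` returns a probability of 0.5 for every sample. Under these assumptions every threshold up to 0.5 marks all reviews as positive, so recall $R$ is 1 and precision $P$ equals the share of positive labels:

$$
P \approx 0.5, \qquad R = 1, \qquad F_1 = \frac{2PR}{P + R} \approx \frac{2 \cdot 0.5 \cdot 1}{0.5 + 1} = \frac{2}{3} \approx 0.67
$$

Since `evaluate_model` reports the best F1 over all thresholds, 0.67 is the value any useful model has to beat, not 0.5.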

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright, there is a simple baseline

</div>

### Model 1: LogisticRegression

#### LR and NLTK lemmatization

In [None]:
# initialize and fit the model
lr_1 = LogisticRegression().fit(train_features_1, train_target)
# evaluate the model
evaluate_model(lr_1, train_features_1, train_target, test_features_1, test_target)

**NOTE**

The LogisticRegression model seems to work quite well. Let's try to improve it with different text preprocessing techniques.

#### LR and spaCy lemmatization

In [None]:
# initialize and fit the model again
lr_2 = LogisticRegression().fit(train_features_2, train_target)
# evaluate the model
evaluate_model(lr_2, train_features_2, train_target, test_features_2, test_target)

**NOTE**  

With spaCy lemmatization the model performs slightly worse. Preprocessing the texts with spaCy also takes much longer than with NLTK.

#### LR and NLTK n_grams

In [None]:
# initialize and fit the model
lr_3 = LogisticRegression().fit(train_features_3, train_target)
# evaluate the model
evaluate_model(lr_3, train_features_3, train_target, test_features_3, test_target)

**NOTE**

With n-grams the model overfits. Try different hyperparameters to improve its performance.

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Yep, the ngram range is a better hyperparameter to tune here

</div>

In [None]:
lr_31 = LogisticRegression()

# parameters grid 
parameters = dict(C=(1,10,100,1000))

# initialize and fit the gridsearch module
lr_31 = GridSearchCV(lr_31, parameters, cv=3, verbose=0, scoring="f1").fit(train_features_3, train_target)

# get the best parameters
lr_31.best_params_

In [None]:
# initialize and fit the model with the updated regularization parameter
lr_31 = LogisticRegression(C=1000).fit(train_features_3, train_target)
# evaluate the model
evaluate_model(lr_31, train_features_3, train_target, test_features_3, test_target)

**NOTE**

Now the model fits the training data perfectly, but the key metrics on the test set are the same as they were with plain NLTK lemmatization.

#### LR and NLTK stemming

In [None]:
# initialize and fit the model
lr_4 = LogisticRegression().fit(train_features_4, train_target)
# evaluate the model
evaluate_model(lr_4, train_features_4, train_target, test_features_4, test_target)

**NOTE**

When the texts are preprocessed with stemming, the LR model yields the same results as with lemmatization.

#### LR and BERT embeddings

In [None]:
# initialize and fit the model
lr_5 = LogisticRegression().fit(train_features_5, train_target)
# evaluate the model
evaluate_model(lr_5, train_features_5, train_target, test_features_5, test_target)

**NOTE**  

With BERT embeddings the model does not overfit: the curves for the test data are very close to those for the training set. However, the overall performance on the test set is a bit worse than with the previous preprocessing methods.

### Model 2: RandomForest

#### RF and NLTK lemmatization

In [None]:
forest_1 = RandomForestClassifier(random_state=12345, n_jobs=-1).fit(train_features_1, train_target)

evaluate_model(forest_1, train_features_1, train_target, test_features_1, test_target)

Try to find better hyperparameters:

In [None]:
f1_best = 0
forest_11 = None

best_depth = 0
best_split = 0
best_leaf = 0
best_n_est = 0

for depth in tqdm(range(30, 101, 10)):
    for split in (2,4,8,16):
        for leaf in (1,2,5):
                
            forest = RandomForestClassifier(
                random_state=12345,
                n_jobs=-1,
                max_depth=depth,
                min_samples_split=split,
                min_samples_leaf=leaf
            )

            f1_scores = cross_val_score(
                forest,
                train_features_1,
                train_target,
                scoring="f1",
                cv=5,
                n_jobs=-1
            )

            f1_average = np.mean(f1_scores)

            if f1_average > f1_best:

                f1_best = f1_average
                forest_11 = forest
                best_depth = depth
                best_split = split
                best_leaf = leaf
                    
print(
    "Best params: ",
    "\nmax depth: ", best_depth,
    "\nmin_samples_split: ", best_split,
    "\nmin_samples_leaf: ", best_leaf,
)

In [None]:
forest_11.fit(train_features_1, train_target)

evaluate_model(forest_11, train_features_1, train_target, test_features_1, test_target)

**NOTE**  

Not much of an improvement on the test set.

#### RF and n_grams

In [None]:
forest_3 = RandomForestClassifier(random_state=12345, n_jobs=-1).fit(train_features_3, train_target)

evaluate_model(forest_3, train_features_3, train_target, test_features_3, test_target)

#### RF and BERT embeddings

In [None]:
forest_5 = RandomForestClassifier(random_state=12345, n_jobs=-1).fit(train_features_5, train_target)

evaluate_model(forest_5, train_features_5, train_target, test_features_5, test_target)

### Model 3: LGBMClassifier

#### LGBM and lemmatization

In [None]:
lgbm_1 = LGBMClassifier(random_state=1345, n_jobs=-1).fit(train_features_1, train_target)

evaluate_model(lgbm_1, train_features_1, train_target, test_features_1, test_target)

#### LGBM and n_grams

In [None]:
lgbm_3 = LGBMClassifier(random_state=1345, n_jobs=-1).fit(train_features_3, train_target)

evaluate_model(lgbm_3, train_features_3, train_target, test_features_3, test_target)

#### LGBM and BERT embeddings

In [None]:
lgbm_5 = LGBMClassifier(random_state=1345, n_jobs=-1).fit(train_features_5, train_target)

evaluate_model(lgbm_5, train_features_5, train_target, test_features_5, test_target)

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Great, you tried different models with differently preprocessed texts and tuned the model's hyperparameters using cross-validation. Note that strictly speaking the test set should only be used to evaluate the final model to get an unbiased estimate of the final model's generalization performance. It would be better to use a separate validation set or cross-validation (although it would be a bit trickier to have similar visualizations for the metrics) instead for evaluation of different models.

</div>
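
One way to follow this advice is to carve a validation part out of the training set and compare models on it, leaving the test set for the final check. The sketch below shows a stratified hold-out split in plain Python; the function name `stratified_holdout`, the 20% share and the toy labels are assumptions for illustration. In practice `sklearn.model_selection.train_test_split` with its `stratify` argument covers the same need.

```python
import random


def stratified_holdout(labels, valid_share=0.2, seed=12345):
    """Split row positions into train/validation, keeping the class ratio."""
    rng = random.Random(seed)

    # collect row positions per class
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_valid = round(len(shuffled) * valid_share)
        valid_idx.extend(shuffled[:n_valid])
        train_idx.extend(shuffled[n_valid:])

    return sorted(train_idx), sorted(valid_idx)


# toy balanced target: 50 negative and 50 positive labels
labels = [0, 1] * 50
train_idx, valid_idx = stratified_holdout(labels)
print(len(train_idx), len(valid_idx))
```

The returned positions can then be used to slice both the feature matrix and the target, so that every model is scored on the same validation rows.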

***

## New Reviews Test

Use some reviews and star rating evaluations from Google Audience reviews to test the models.

In [None]:
new_reviews = pd.DataFrame(
    {"review":[
        "I like hard science fiction, when fantasy about possible events is grounded on current scientific facts. This movie does a good job in being fantastic enough, but not far from reality. With that said, I didn't really understand the motivation of the main character. She was presented as a very smart, wise person whose decisions made a huge impact on humanity. However, in the end, with all her knowledge, abilities, and absolute awareness, she makes an irrational, illogical, and unreasonable decision which causes suffering to two of her closest people. So, after all, the amazing events of the main part of the movie, are dimmed with this silly-human ending. Shame! But I like the movie, after all if I were me, I would watch it again.",
        "I have watched every marvel movie, loved most of them, this Thor movie very disappointing, like a really bad B grade sci-fi from syfy channel, I don't blame the actors, Hemsworth & Bale are great actors, script writers and director are to blame. It was like a poorly scripted 80s movie, meant to be funny but just sad, gaps in the story line that didn't add up, trying not to give away the plot for those who may be thinking of seeing the movie. Yes there has always been a humour element to the marvel movies, this one seriously missed the mark, it was  reminiscent of a terrible spoof Thor movie. 3 of us went to see the movie, we were looking at each other with an expression of what is going on. 30 mins into the movie, we decided to go outside, spoke to a couple of the cinema staff who all said your not the first ones to come out, general opinion is the movie is bizarre and waste of money. We debated on going back in to watch a bit more to see if it gets any better, tried again it didn't get any better, Russell Crowe playing the part of Zeus King of the gods, not the image of you would expect, more like a really overweight grandad in a tutu prancing around.",
        "Truly an enjoyable film, in some ways it felt different from most MCU films, and while it didn’t turn out to be as groundbreaking to critics as it’s predecessor, it proves to be the perfect superhero movie for the summer. Many others before me say that it didn’t have enough story, but let me tell you what it lacks in a “stellar” story it more than makes up for in action, humor, outstanding visuals, pop culture references, cameos, and overall creativity. While I was a little surprised not seeing Gorr do as much god butchering on screen, Christian Bale makes up for the character’s limited screen time through a powerful and albeit creepy performance, one scene in particular reminiscent of Pennywise from It, and his powers are unbelievably awesome. My only real complaints is are there is too much focus on the love story between Thor and Jane Foster, the first part of the film feels somewhat rushed (though I did enjoy the action packed ‘80s style fight), and of course without Loki, it just didn’t feel the same. Best of all, the mid and post credits scenes make me optimistic for the future of both Thor’s story and the MCU overall. My complaints are few and my thumbs are all the way up to the heavens to join the gods themselves 😇. Highly recommend this film, and please, don’t worry about what others think, go see it and make you’re own opinion ;). Also, Russell Crowe as Zeus was both powerful and hysterical, respect for holding the Greek accent 👍",
        "I still play it even today. The graphics are amazing for a 2010 game and the story and missions are so well done. It’s incredibly fun with 2 players and since there are so many things to do you never get bored. There is an entire story mode that takes a decent chunk of time to finish and then the toy box mode Has TONS more missions and a little story of its own. You unlock cars. Rideable animals. (Bullseye, dragons),weapons and way more things!. This game will always have a special place in my heart and I’m lucky I found it so many years ago. Honestly one of the best games I’ve ever played I even recommend it today. You get bored of the new games these days it’s nice to just go back and enjoy the old days",
        "Antman as a hero and a movie is underrated. Scott may be weird (which makes him different than the other Avengers and more funny) but his  ‘powers’ are really cool! This movie has action, great visuals, a well-written script, and has a few humorous lines/scenes. Honestly I wasn’t sure it was worth watching but I will now 100% recommend it to Marvel fans. It doesn’t have as much blood/gore, swearing, or violence as other Marvel movies, which can be a good or bad thing, but I think any Marvel fan should have no trouble enjoying it.",
        "Great film, new cast is great, my only thing is that there should have been way, way more action scenes. Think about the Matrix Reloaded, they built a freeway and there was a 30min car vs bike chase against the agents, Morpheus, the key maker and Trinity on a Ducatti no less, with Neo flying back to save them all at the last second before the two lorries collided in a huge explosion, in bullet time with each other. That scene still blows my mind to this day! In this one Neo can't even fly, there is way too much backstory, most people know the story of the Matrix already, so it needed more action and special effects. John Wick 3 Parrabellum was great with Keanu Reeves for that very reason it was full of gun crazed action from start to finish. If you're a fan of the original Matrix and in a nostalgic mood it's a must watch, but I can't help but think they will need to back up (no pun intended) the Matrix Resurrections with another film for newcomers to the franchise, who will be totally confused with the back and forth awkward slowness of this version of the Matrix and lack of action, fight scenes and passion the first Trilogy of Matrix films had to offer. Enjoy!",
        "I love the Matrix movies, and this one I thought was pretty good as well.  It’s been so long since I saw the trilogy that I was a little lost at first, mainly cause I had really forgotten what all happened to everyone, but after a refresher I was up to speed.  Keanu Reeves is great, but what I found that really surprised me is Neil Patrick Harris, he played that part remarkably well.  All in all, if you are a fan of the Matrix movies, as I am, you will like this movie too.  I hope this is the first of another possible trilogy and the story will continue on.  Not trying to disagree with those who didn’t like it, everyone has their own opinions, but maybe many fans was satisfied with how the trilogy concluded.  I could understand that point of view as well, and that some great things so just stay great and to leave it as it was, but hey, it makes for great conversation pieces.  The only negative I will say about the movie without giving any spoilers is that it was hard to follow along at the first, especially if you can’t remember the first three, so my suggestion is to maybe rewatch or watch for the first time, the first three movies and then flow into this new one. Just my opinion, everyone enjoy and god bless.",
        "This movie presented the whole story about the hero saving his lady and making her fall in love again ...in a very Nolanistic way. The action scenes felt very real but not of the level of the movie which is being made in the 2020s. the whole movie had this nostalgic effect to it by which you can make all your OG fans from the OF Matrix decade. This movie explained a lot of things that made people confused from watching the 2nd & 3rd parts of Quadrology. Loved the whole concept of the dream, use of your brains and imagination with your reality related to it in parallel... keep this concept n any type of movie and I am gonna love it, No doubt. But yes, Nothing can match the level - The Matrix(1999). And yes can we please talk about Casting Jonathan Groff as Mr. Smith, Omg this was soo good to see him After the series- Mindhunters.This film made me more eager for the 3rd season.P.S- The Cameo of Chelsea (Cat) in this movie was so unexpected and also the last 30 secs post credit scene was the best thing About the movie",
        "As a life long Matrix fan I was busting to see Resurrections 😀 I enjoyed it because the concepts and philosophy behind true reality intrigue me 🤔 If you are young and just go to the cinema to watch this film you will be so very DISAPPOINTED 😷 You will never be able to follow the plot. If you have not seen the previous 3 Matrix films you will subsequently fall asleep 🛏️ due to induced boardem. Even worse if you wanted great martial arts combat scenes 🥋 you are not getting any of those either 😤 All the action scenes were open to rubbish camera angles and a lack of awe and inspiration 🤯 The book 'Phenomenon' sold on Amazon 'Before Conception and Beyond' was Matrix 4 in many ways but unlike Resurrections ⚰️ the book takes you on the Greatest Adventure Ever Experienced 💡which cannot be said for the film sadly 🛀 The Phenomenon book plot is a very similar plot to the film and smashes into Matrix 5 'The Paradox' At least with the book 📚 'Phenomenon' you could put it down and raid the fridge for food 🌮and beers🍻. Honestly though 🤔 worth a watch if you are a true Matrix fan. If not honestly DON'T bother keep your money and buy the book. It will last longer and be more entertaining on your settee 🛋️ 📺  Hope you have enjoyed my review for you. Now it's time ☎️ for you to choose which pill 💊💊 to take. From a very rainy 🌧️ UK 🇬🇧 have a nice day 🌈🎱",
        "If you have seen the trilogy + you like lot of actions then their is a huge chance that you will be disappointed from this movie. But this movie is almost a decent sequel I would say because it provide us the depth into the characters of not only Neo but also Trinity. This movie gives equal value to both of them and is MORE OF A LOVE STORY than an action packed superhero type movie- so if you don't expected that from the movie you will hate it like most of the others are. For me, the climax and ending satisfied me and I expect that another sequel (if it comes ever) will have all that great fight what we were expecting from this movie. Otherwise the first two parts will only remain favourite for me forever 💗",
        "The Matrix: Resurrections sucks! it's worst Matrix movie ever and one of worst movies of 2021 also it did not feel like Matrix movie at all! The only good stuff I like was acting by some cast were great like Keanu Reeves was great as always as Neo along with Carrie-Anne Moss as Trinity, Jada Pinkett Smith as Niobe, and Priyanka Chopra as Sati, I like new character name Bugs who was played by Jessica Henwick even though it was confusing of where she came from, visuals effects were good, music was great, and fight scene were okay but mainly poorly. But bad stuff I hate recast of Morpheus who was played by Yahya Abdul-Mateen II being because Laurence Fishburne will always be Morpheus so it was stupid move, came out no where of villain name The Analyst when he saw Neo and Trinity dying in Revolutions, some of new powers did not make any sense at all like Analyst's power to slow down time, I hate they underuse Trinity, I hate how this movie rehash of first Matrix, I hate they brought back Agent Smith to life no offense he is defiantly one best cinematic villain especially that he was played by Hugo Weaving from The Matrix Trilogy but it was bad idea to bring him back to life and should stay dead also Jonathan Groff was bad choice to Agent Smith, ending was so bad, and my most biggest problem of this movie was Neo being video game maker and made The Matrix Trilogy and was making 4th one and it so stupid and bad! So yeah I think this movie is very unnecessary sequel and that Revolution was great ending to franchise. But because some few stuff that I like from movie that I had mention I would give this movie a 3/10 and I would defiantly watch 2 first sequels to The Matrix over 4th sequel."
    ],
    "stars":[4, 1, 4, 3, 5, 3, 5, 2, 4, 3, 2]}
)

#### Normalize texts

In [None]:
new_reviews["review_norm"] = [
    # lowercase, replace every character except a-z and apostrophes with a space,
    # then split and re-join to collapse the extra spaces left by the substitution
    " ".join(re.sub(r"[^a-z']", " ", text.lower()).split())
    for text in new_reviews["review"]
]
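
To make the effect of this normalization easier to see, here is a minimal sketch that wraps the same expression in a function and runs it on two short strings. The function name `normalize` and the sample strings are illustrative and not part of the notebook.

```python
import re

def normalize(text):
    """Lowercase, keep only a-z and straight apostrophes, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z']", " ", text.lower()).split())

# straight apostrophe is kept, digits and punctuation are dropped
print(normalize("I don't blame the actors, 3/10!"))
# typographic apostrophe (’) and emoji are replaced by spaces
print(normalize("It’s  GREAT 👍"))
```

Note that the pattern keeps only the straight apostrophe `'`. Several of the custom reviews use the typographic one (`’`), so a contraction such as "didn’t" ends up as two tokens, "didn" and "t". Whether that matters depends on how the training texts were normalized, so this is an observation rather than a fix.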

#### LR and NLTK lemmatization

In [None]:
# lemmatize the normalized texts
corpus = new_reviews["review_norm"].apply(lemmatize_nltk)
stars = new_reviews["stars"]
# vectorize with the already fitted vectorizer
vectors = vectorizer_1.transform(corpus)
# predicted probability of the positive class
new_reviews_prob_pred = lr_1.predict_proba(vectors)[:, 1]

for i, review in enumerate(new_reviews["review"].str.slice(0,100)):
    print("{} stars,  {:.2f} pred | {}".format(stars[i],new_reviews_prob_pred[i], review))

#### LR and BERT embeddings

In [None]:
vectors_bert = get_bert_embeddings(new_reviews["review_norm"], force_device="mps", batch_size=64)

# get predictions
new_reviews_prob_pred = lr_5.predict_proba(vectors_bert)[:, 1]

for i, review in enumerate(new_reviews["review"].str.slice(0,100)):
    print("{} stars,  {:.2f} pred | {}".format(stars[i],new_reviews_prob_pred[i], review))

#### RF and NLTK lemmatization

In [None]:
# get predictions, reusing `vectors` from the LR and NLTK lemmatization cell
new_reviews_prob_pred = forest_11.predict_proba(vectors)[:, 1]

for i, review in enumerate(new_reviews["review"].str.slice(0,100)):
    print("{} stars,  {:.2f} pred | {}".format(stars[i],new_reviews_prob_pred[i], review))
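
The same print loop appears in each of the three cells above. As a sketch only, it could be factored into a small helper. The name `format_predictions` and the toy inputs below are hypothetical, not taken from the notebook.

```python
def format_predictions(reviews, stars, probs, width=100):
    """Return one line per review: stars, predicted probability, text preview."""
    return [
        "{} stars,  {:.2f} pred | {}".format(star, prob, text[:width])
        for text, star, prob in zip(reviews, stars, probs)
    ]

# toy inputs, for illustration only
lines = format_predictions(
    ["Great film, loved it", "Terrible"], [5, 1], [0.93, 0.08], width=10
)
for line in lines:
    print(line)
```

Because the helper iterates with `zip` instead of indexing `stars[i]`, it does not depend on the DataFrame having a default integer index. It should also accept pandas Series directly, since iterating a Series yields its values.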

<div class="alert alert-success">
<b>Reviewer's comment</b>

The models were applied to custom reviews correctly

</div>

***

## Conclusions

1. **Exploratory data analysis** shows that the dataset is split evenly into train and test parts: the target classes are balanced, and the distributions of ratings, number of reviews, and dates are similar in both sets.

2. For **text preprocessing**, NLTK gives both better quality (when tested on a model) and faster preprocessing than spaCy. Stemming shows no obvious difference compared to lemmatization. On the test set, using n-grams shows no improvement over single-word TF-IDF vectorization. BERT embeddings give slightly worse results than simple lemmatization/stemming + TF-IDF.

3. Among all **models** and text preprocessing techniques, a simple logistic regression with either stemming or lemmatization plus TF-IDF vectorization showed the best F1 score of `0.88`, which is 24% better than the baseline. More complex models, for either text processing or prediction, don't give better results.

4. When testing the model's **performance on new examples**, subjectively, the *lemmatization + TF-IDF + LR* setup yields the most reasonable predictions.
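
For reference, the F1 score quoted in point 3 is the harmonic mean of precision and recall. A minimal sketch of the formula, written in terms of confusion-matrix counts, is shown below. The function name and the counts are toy values for illustration, not results from this project.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# toy counts: 8 true positives, 2 false positives, 2 false negatives
print(f1_from_counts(tp=8, fp=2, fn=2))
```

With these toy counts, precision and recall are both 0.8, so F1 is 0.8 as well.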

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Conclusions look good! Note that the custom reviews section is intended for illustration purposes: it doesn't make sense to use them to judge the models' performance, when we have a much more representative test set
    
</div>