## Text classification - For movies and webseries
**In this notebook we will try the following approaches for training machine learning models for text classification:**
1. Text classification using TF-IDF, CountVectorizer and spaCy
2. Text classification using spaCy and spaCy's built-in word embeddings
3. Text classification using spaCy and Gensim models (like google-news-300 and twitter-25)
4. Text classification using spaCy and fastText (by Meta)

In [1]:
# Necessary imports
import numpy as np
import pandas as pd

import spacy

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
# Not considering Decision Trees (Prone to overfit)

from sklearn.model_selection import GridSearchCV
from joblib import dump
from joblib import load

In [2]:
df = pd.read_excel("../data_sets/rotten_tomatoes_100.xlsx")

In [3]:
df.head()

Unnamed: 0,title,synopsis,label
0,Jason's Lyric,"In a violent, drug-infested neighborhood in Ho...",romantic
1,Chocolat,When mysterious Vianne and her child arrive in...,romantic
2,Pretty Woman,A prostitute and a wealthy businessman fall fo...,romantic
3,Love Actually,Nine intertwined stories examine the complexit...,romantic
4,An Affair to Remember,A man and a woman have a romance while on a cr...,romantic


In [4]:
df["label"].unique()

array(['romantic', 'thriller', 'action', 'horror', 'sci-fi', 'drama'],
      dtype=object)

In [5]:
df["label"] = df["label"].replace("sci-fi","scifi")

### 1. Training and evaluating the base form of classification algorithms on the dataset
This gives us a baseline and an idea of which approaches are worth pursuing to improve performance.

**Approach:**  
Here we preprocess the dataset with the spaCy language pipeline, build a vocabulary with a word vectorizer, fit the models, and then evaluate them.
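
The same vectorize-then-fit flow can also be expressed with scikit-learn's `Pipeline`, which is imported above. The following is a minimal sketch under assumptions: the four synopses and their labels are made up for illustration, and logistic regression stands in for whichever classifier is being tried.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical, already-preprocessed synopses (not taken from the dataset)
toy_texts = [
    "couple falls love paris summer romance",
    "lonely widow finds love small town",
    "soldier fights enemy army explosive mission",
    "agent chases terrorists explosive car chase",
]
toy_labels = ["romantic", "romantic", "action", "action"]

# The vectorizer learns its vocabulary only from the data passed to fit(),
# so text seen later at prediction time cannot change the vocabulary
toy_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression()),
])
toy_clf.fit(toy_texts, toy_labels)

toy_pred = toy_clf.predict(["spy mission explosive chase"])
print(toy_pred)
```

Bundling the two steps keeps each model together with the vectorizer it was trained with, which matters once several pair-specific models exist.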

In [6]:
# Loading the spaCy language pipeline
nlp = spacy.load("en_core_web_sm")

In [7]:
# Function for preprocessing the synopsis text
def preprocess(text, spacy_model=nlp):
    """Pass a text and it will preprocess it!"""
    filtered = []
    doc = spacy_model(text)
    for token in doc:
        if (not token.is_stop) and (not token.is_punct):
            filtered.append(token.text)

    return " ".join(filtered)

In [8]:
# Function for training and evaluating the classification algorithms
# (With default hyper parameters)
def base_train_eval(models, X, y, tsize=0.20, rstate=45, 
                    vec_type="tfidf", acc=True, cfreport=True, show_vocab=True, max_performer=True,
                    complete_res=True):
    
    eval_res = {} # Results of evaluation process
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=tsize, random_state=rstate)
    if vec_type == "tfidf":
        vec = TfidfVectorizer()
    elif vec_type == "count":
        vec = CountVectorizer()
    else:
        print("Enter a valid string for choosing the vectorizer!")
        return # terminates here
    
    vec.fit(X_train)
    spX_train = vec.transform(X_train)
    spX_test = vec.transform(X_test)

    if show_vocab == True:
        # bow = vec.vocabulary_ (each word is mapped to an index)
        bow = vec.get_feature_names_out()[200:300] # Just a glimpse of BOW
        print("\n---> Given below is the BOW:\n")
        print(bow)
        print()
    
    print("\n----> Classification report of the classification algorithms")
    for model in models:
        model.fit(spX_train, y_train)
        preds = model.predict(spX_test)

        print(f"Model Name: {model}")
        # Always compute the accuracy: the max-performer bookkeeping below
        # needs it even when acc=False
        acc_score = accuracy_score(y_true=y_test, y_pred=preds)
        if acc == True:
            print(f"Overall accuracy: {round(acc_score * 100, 2)} %")
        if cfreport == True:
            report = classification_report(y_true=y_test, y_pred=preds)
            print(f"Classification report:\n{report}")
        if max_performer == True:
            eval_res[str(model)] = round(acc_score * 100, 2)

    if complete_res == True:
        print("Complete results for all algorithms:")
        print(eval_res)

    if max_performer == True:
        max_a = 0 
        max_p = ""
        for model in eval_res:
            if eval_res[model] > max_a:
                max_a = eval_res[model]
                max_p = model
        # print(f"\n\n---> Max performer: {max_p}")
        # print(f"---> Overall accuracy of this model: {max_a}")

        return max_p, max_a # return the max performer and its accuracy

In [9]:
# Applying the preprocess function we just prepared to every synopsis
df["processed_synop"] = df["synopsis"].apply(preprocess)

In [10]:
df.head() # Now the df will have a new column with processed synopsis

Unnamed: 0,title,synopsis,label,processed_synop
0,Jason's Lyric,"In a violent, drug-infested neighborhood in Ho...",romantic,violent drug infested neighborhood Houston Jas...
1,Chocolat,When mysterious Vianne and her child arrive in...,romantic,mysterious Vianne child arrive tranquil French...
2,Pretty Woman,A prostitute and a wealthy businessman fall fo...,romantic,prostitute wealthy businessman fall forming un...
3,Love Actually,Nine intertwined stories examine the complexit...,romantic,intertwined stories examine complexities emoti...
4,An Affair to Remember,A man and a woman have a romance while on a cr...,romantic,man woman romance cruise Europe New York Despi...


In [11]:
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"SYNOPSIS: {df['synopsis'][0]}")
print(f"PROCESSED SYNOPSIS: {df['processed_synop'][0]}")

SYNOPSIS: In a violent, drug-infested neighborhood in Houston, Jason (Allen Payne) dreams of something better. He works as a TV salesman and helps out his mother, and tries to steer his criminally minded brother, Joshua (Bokeem Woodbine), onto the right path. But real joy enters Jason's life when he meets Lyric (Jada Pinkett). As their romance develops, Jason starts to see a future for himself -- while also being forced to confront a painful secret from his past.
PROCESSED SYNOPSIS: violent drug infested neighborhood Houston Jason Allen Payne dreams better works TV salesman helps mother tries steer criminally minded brother Joshua Bokeem Woodbine right path real joy enters Jason life meets Lyric Jada Pinkett romance develops Jason starts future forced confront painful secret past


In [12]:
max_p, max_a = base_train_eval(
    models=[
        AdaBoostClassifier(algorithm="SAMME"),
        GradientBoostingClassifier(),
        LogisticRegression(),
        RandomForestClassifier(),
        MultinomialNB()
    ],
    X=df["processed_synop"],
    y=df["label"],
    tsize=0.20,
    rstate=455,
    vec_type="tfidf",
    acc=True, cfreport=True, show_vocab=True,
    max_performer=True,
    complete_res=False
)


---> Given below is the BOW:

['aircraft' 'airport' 'aja' 'akin' 'alan' 'alarum' 'alaskan' 'alba'
 'albright' 'alcaino' 'alcohol' 'alden' 'alejandro' 'aleksandr' 'alerting'
 'alessio' 'alex' 'alexander' 'alexandre' 'algerian' 'algout' 'ali'
 'alice' 'alicia' 'alien' 'aliens' 'alike' 'alita' 'alive' 'allegiances'
 'allen' 'alleviate' 'alliance' 'alliances' 'allied' 'allies' 'allow'
 'allowed' 'allows' 'alluring' 'ally' 'almasy' 'almighty' 'alongside'
 'aloof' 'alter' 'altered' 'altering' 'alternate' 'alternative'
 'alzheimer' 'amanda' 'amateur' 'ambassadors' 'amber' 'ambition'
 'ambitious' 'ambitiously' 'ambush' 'ambushed' 'amelia' 'america'
 'american' 'amiable' 'amid' 'amidst' 'amitabh' 'amy' 'amélie' 'am茅lie'
 'ana' 'analyst' 'anand' 'anarchic' 'anatoliy' 'ancestor' 'ancestral'
 'anchors' 'ancient' 'and' 'andrew' 'andrews' 'androids' 'andré' 'andy'
 'aneta' 'angel' 'angela' 'angeles' 'animals' 'animated' 'animator'
 'animatronic' 'anime' 'anjali' 'ann' 'anna' 'anne' 'annie' 'annihil

In [13]:
# Let's print the returned max performer and its accuracy
print(f"Max performer: {max_p}")
print(f"Its overall accuracy: {max_a}")

Max performer: LogisticRegression()
Its overall accuracy: 40.5


**Conclusion:** Performance is very poor compared with what we saw in the reports from the data testing part. The models performed well in binary classification but perform poorly in multiclass classification: the best model, logistic regression, reaches only 40.5% accuracy. The likely cause is the data: there is not enough of it to train a single six-class model with good results.

However, there is another approach we can try: combining multiple binary classifiers into a single multiclass classification system.

### 2. Combining multiple binary class classification models
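
With six genres there are 15 unordered label pairs to evaluate. A minimal sketch of enumerating them with the standard library follows; the `labels` list is copied from the `df["label"].unique()` output, and `label_pairs` is an illustrative name.

```python
from itertools import combinations

labels = ["romantic", "thriller", "action", "horror", "scifi", "drama"]

# combinations() yields each unordered pair exactly once,
# so "scifi-drama" and "drama-scifi" are not both generated
label_pairs = [f"{a}-{b}" for a, b in combinations(labels, 2)]

print(len(label_pairs))
print(label_pairs[:3])
```

The pair search defined later in this section builds both orderings of each pair, which is why its results list pairs such as scifi-drama and drama-scifi twice with the same score.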

In [14]:
df["label"].unique()

array(['romantic', 'thriller', 'action', 'horror', 'scifi', 'drama'],
      dtype=object)

In [15]:
# Conditional filtering for two classes
def pdcond_filter(c1, c2, df, y):
    """Return the rows of df whose column y equals c1 or c2."""
    return df[(df[y]==c1) | (df[y]==c2)]

# Function to find the best pair of labels...
# Pairs on which the model performs binary classification with good results
def pairs_train_eval(X, y, label_col, synop_col, models=None, 
                     tsize=0.20, rstate=45, vec_type="tfidf", acc=True,
                     cfreport=True, show_vocab=True, max_performer=True, complete_res=True,
                     data=df):
    label_pairs = set()
    all_labels = list(y.unique())
    for label in all_labels:
        label_current = label
        for label in all_labels:
            if label_current != label:
                pair = str(label_current) + "-" + str(label)
                label_pairs.add(pair)

    label_pairs = list(label_pairs)
    final_res = {} # Final result from the train & eval
    for i, lb in enumerate(label_pairs):
        lb1 = lb.split("-")[0]
        lb2 = lb.split("-")[1]
        filt_data = data[(data[label_col]==lb1) | (data[label_col]==lb2)] # filtered segment

        print(f"\n\n---> Pair-{i + 1}, Pair Name: {lb}\n\n")
        max_p, max_a = base_train_eval(
            models=models,
            X=filt_data[synop_col],
            y=filt_data[label_col],
            tsize=tsize,
            rstate=rstate,
            vec_type=vec_type,
            acc=acc,
            cfreport=cfreport,
            show_vocab=show_vocab,
            max_performer=max_performer,
            complete_res=complete_res
        )

        final_res[lb] = f"{max_p}  <---> {max_a:.2f}"
        # filt_data = data[(data[y]==lb1) | (data[y]==lb2)]
        # print(filt_data.sample(10)[])
        # print(lb1,lb2)

    return final_res
    

In [16]:
final_res = pairs_train_eval(
    X=df["processed_synop"],
    y=df["label"], 
    label_col="label",
    synop_col="processed_synop",
    models=[
        GradientBoostingClassifier(), 
        AdaBoostClassifier(algorithm="SAMME"),
        LogisticRegression(),
        RandomForestClassifier(),
        MultinomialNB()
    ],
    tsize=0.20, 
    rstate=45, 
    vec_type="tfidf",
    acc=True, 
    cfreport=False, show_vocab=False,
    max_performer=True, complete_res=False, data=df
)



---> Pair-1, Pair Name: scifi-drama



----> Classification report of the classification algorithms
Model Name: GradientBoostingClassifier()
Overall accuracy: 64.29 %
Model Name: AdaBoostClassifier(algorithm='SAMME')
Overall accuracy: 64.29 %
Model Name: LogisticRegression()
Overall accuracy: 85.71 %
Model Name: RandomForestClassifier()
Overall accuracy: 73.81 %
Model Name: MultinomialNB()
Overall accuracy: 83.33 %


---> Pair-2, Pair Name: drama-romantic



----> Classification report of the classification algorithms
Model Name: GradientBoostingClassifier()
Overall accuracy: 75.61 %
Model Name: AdaBoostClassifier(algorithm='SAMME')
Overall accuracy: 63.41 %
Model Name: LogisticRegression()
Overall accuracy: 80.49 %
Model Name: RandomForestClassifier()
Overall accuracy: 65.85 %
Model Name: MultinomialNB()
Overall accuracy: 75.61 %


---> Pair-3, Pair Name: drama-horror



----> Classification report of the classification algorithms
Model Name: GradientBoostingClassifier()
Overall acc

In [17]:
final_res

{'scifi-drama': 'LogisticRegression()  <---> 85.71',
 'drama-romantic': 'LogisticRegression()  <---> 80.49',
 'drama-horror': 'MultinomialNB()  <---> 75.00',
 'scifi-romantic': 'LogisticRegression()  <---> 90.24',
 'romantic-action': 'MultinomialNB()  <---> 92.50',
 'drama-scifi': 'LogisticRegression()  <---> 85.71',
 'drama-thriller': 'MultinomialNB()  <---> 58.54',
 'thriller-scifi': 'GradientBoostingClassifier()  <---> 70.73',
 'horror-action': "AdaBoostClassifier(algorithm='SAMME')  <---> 66.67",
 'action-romantic': 'MultinomialNB()  <---> 92.50',
 'action-horror': "AdaBoostClassifier(algorithm='SAMME')  <---> 66.67",
 'horror-romantic': 'LogisticRegression()  <---> 95.00',
 'horror-scifi': 'MultinomialNB()  <---> 87.80',
 'drama-action': 'GradientBoostingClassifier()  <---> 67.50',
 'scifi-thriller': 'GradientBoostingClassifier()  <---> 70.73',
 'scifi-action': "AdaBoostClassifier(algorithm='SAMME')  <---> 70.73",
 'action-drama': 'GradientBoostingClassifier()  <---> 67.50',
 'thr

**Summarizing the best results:**

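|S.No.|Pair|Accuracy (%)|Algorithm|
|-----|----|------------|---------|
|1|romantic-drama (r-d)|80.49|logistic regression|
|2|romantic-horror (r-h)|95.00|logistic regression|
|3|romantic-action (r-a)|92.50|naive bayes|
|4|scifi-horror (s-h)|87.80|naive bayes|
|5|romantic-scifi (r-s)|90.24|logistic regression|
|6|scifi-drama (s-d)|85.71|logistic regression|
|7|romantic-thriller (r-t)|95.00|naive bayes|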

**We will consider rows 1, 2, 3, 5, and 7**, the five pairs that include the romantic label, then combine the results of these models and see if it works.

**IMPORTANT NOTE:**  
We will not stick rigidly to this table or to the long per-pair report generated above. We use them as a reference for building a generalized model for each pair, and then combine those models.

### 3. Training (and fine-tuning if needed) models for the selected pairs, then combining them

In [18]:
# Defining the hyperparameter grid for logistic regression models
param_grid = {
    "C":np.logspace(1,30,20),
    "class_weight":[None, "balanced"],
    "solver":["liblinear"],
    "max_iter":[1000],
    "multi_class":["ovr"],
}

# We will not use elastic net penalty here
# Because only "saga" solver supports elastic net
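
`np.logspace(start, stop, num)` returns `num` values spaced evenly on a log scale between `10**start` and `10**stop`, so the grid above searches `C` from 10 up to 1e30 and never tries values at or below the default `C=1.0`. The sketch below prints that range; `alt_grid` is a hypothetical alternative shown only for comparison, not something this notebook ran.

```python
import numpy as np

# The grid used in this notebook: exponents run from 1 to 30
notebook_grid = np.logspace(1, 30, 20)
print(notebook_grid.min(), notebook_grid.max())

# A hypothetical alternative centred on the default C=1.0
# (an assumption for illustration, not a tuned recommendation)
alt_grid = np.logspace(-3, 3, 7)
print(alt_grid)
```

This may be one reason the search reported a few cells further down settles on the smallest value in its grid (`C=10.0`).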

In [19]:
# This model is for the pair romantic-drama
# Let's call it rd
base_log = LogisticRegression()
log_rd = GridSearchCV(
    estimator=base_log,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=4,
)

In [20]:
# Let's prepare the dataset for it and then fit the model
df1 = df[(df["label"]=="romantic") | (df["label"]=="drama")]

X_train1, X_test1, y_train1, y_test1 = train_test_split(
    df1["processed_synop"], df1["label"], test_size=0.20, random_state=45
)

tfidf = TfidfVectorizer()
tfidf.fit(X_train1)
spX_train1 = tfidf.transform(X_train1)
spX_test1 = tfidf.transform(X_test1)

spX_train1, spX_test1, spX_train1.shape, spX_test1.shape

(<160x3378 sparse matrix of type '<class 'numpy.float64'>'
 	with 5588 stored elements in Compressed Sparse Row format>,
 <41x3378 sparse matrix of type '<class 'numpy.float64'>'
 	with 827 stored elements in Compressed Sparse Row format>,
 (160, 3378),
 (41, 3378))

In [21]:
# Training the first model, of pair romantic-drama
log_rd.fit(spX_train1, y_train1)
preds_rd = log_rd.predict(spX_test1)

acc_rd = accuracy_score(y_true=y_test1, y_pred=preds_rd)
print(f"Overall accuracy of log_rd model: {acc_rd * 100} percent")
creport = classification_report(y_true=y_test1, y_pred=preds_rd)
print("\nClassification report of log_rd model:")
print(creport)

print("\nConfusion matrix of log_rd model:")
print(confusion_matrix(y_true=y_test1, y_pred=preds_rd))

# Accuracy is a little lower than the baseline value!
# Let's try out the default model once again!

Overall accuracy of log_rd model: 75.60975609756098 percent

Classification report of log_rd model:
              precision    recall  f1-score   support

       drama       0.78      0.70      0.74        20
    romantic       0.74      0.81      0.77        21

    accuracy                           0.76        41
   macro avg       0.76      0.75      0.75        41
weighted avg       0.76      0.76      0.76        41


Confusion matrix of log_rd model:
[[14  6]
 [ 4 17]]


In [22]:
log_rd.best_params_

{'C': 10.0,
 'class_weight': 'balanced',
 'max_iter': 1000,
 'multi_class': 'ovr',
 'solver': 'liblinear'}

In [23]:
base_log.fit(spX_train1, y_train1)
base_preds = base_log.predict(spX_test1)
print(classification_report(y_true=y_test1, y_pred=base_preds))

              precision    recall  f1-score   support

       drama       0.80      0.80      0.80        20
    romantic       0.81      0.81      0.81        21

    accuracy                           0.80        41
   macro avg       0.80      0.80      0.80        41
weighted avg       0.80      0.80      0.80        41



In [24]:
base_log.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [25]:
confusion_matrix(y_true=y_test1, y_pred=base_preds) # Confusion matrix for the base model

array([[16,  4],
       [ 4, 17]], dtype=int64)

**Note:** We tried some hyperparameter tuning here, but the default parameters perform better on the test set (about 80% accuracy versus about 76% for the tuned model). So let's go with the default hyperparameters for now!
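
The gap between the two models can be read straight off the confusion matrices above. The sketch below redoes the arithmetic in plain Python; `accuracy_from_confusion` is a helper introduced here for illustration.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correct predictions (the diagonal) / all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

tuned_cm = [[14, 6], [4, 17]]    # GridSearchCV model (log_rd)
default_cm = [[16, 4], [4, 17]]  # default LogisticRegression (base_log)

tuned_acc = accuracy_from_confusion(tuned_cm)
default_acc = accuracy_from_confusion(default_cm)
print(f"tuned: {tuned_acc:.4f}, default: {default_acc:.4f}")
```

The two models differ by two drama synopses out of 41 test samples, so on a test set this small the comparison is suggestive rather than conclusive.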

In [26]:
# Training the logistic regression model for the pair 1 and 2
# Pair 1 is r-d (romantic and drama)
# Pair 2 is r-h (romantic and horror)

log_rd = LogisticRegression()
log_rh = LogisticRegression()

df1 = df[(df["label"]=="romantic") | (df["label"]=="drama")]
df2 = df[(df["label"]=="romantic") | (df["label"]=="horror")]

X_train1, X_test1, y_train1, y_test1 = train_test_split(
    df1["processed_synop"], df1["label"], test_size=0.20, random_state=45)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    df2["processed_synop"], df2["label"], test_size=0.20, random_state=45)

# Fitting a separate vectorizer for each pair,
# because the vectorizers were separate for each pair in our pair function
vec1 = TfidfVectorizer()
vec1.fit(X_train1, y_train1)
spX_train1 = vec1.transform(X_train1)
spX_test1 = vec1.transform(X_test1)

vec2 = TfidfVectorizer()
vec2.fit(X_train2, y_train2)
spX_train2 = vec2.transform(X_train2)
spX_test2 = vec2.transform(X_test2)

log_rd.fit(spX_train1, y_train1)
log_rh.fit(spX_train2, y_train2)
preds_rd = log_rd.predict(spX_test1)
preds_rh = log_rh.predict(spX_test2)

acc_log = accuracy_score(y_true=y_test1, y_pred=preds_rd)
acc2 = accuracy_score(y_true=y_test2, y_pred=preds_rh)
print(f"Overall accuracy of log_rd: {acc_log}")
print(f"Overall accuracy of log_rh: {acc2}\n")
print(f"\nClassification report of log_rd:{classification_report(y_test1, preds_rd)}")
print(f"\nClassification report of log_rh:{classification_report(y_test2, preds_rh)}")

# Our second model may be overfitting a little!
# In all the best cases, either logistic regression or naive bayes performed well,
# so we can try naive bayes on the same dataset...

Overall accuracy of log_rd: 0.8048780487804879
Overall accuracy of log_rh: 0.95


Classification report of log_rd:              precision    recall  f1-score   support

       drama       0.80      0.80      0.80        20
    romantic       0.81      0.81      0.81        21

    accuracy                           0.80        41
   macro avg       0.80      0.80      0.80        41
weighted avg       0.80      0.80      0.80        41


Classification report of log_rh:              precision    recall  f1-score   support

      horror       1.00      0.90      0.95        20
    romantic       0.91      1.00      0.95        20

    accuracy                           0.95        40
   macro avg       0.95      0.95      0.95        40
weighted avg       0.95      0.95      0.95        40



In [27]:
nb = MultinomialNB()
nb.fit(spX_train2, y_train2)
nb_preds = nb.predict(spX_test2)
print(classification_report(y_true=y_test2, y_pred=nb_preds))
# The naive bayes model here looks a bit more generalized than the logistic regression model

acc_nb = accuracy_score(y_true=y_test2, y_pred=nb_preds)
print(f"Accuracy of the nb model: {acc_nb}")

              precision    recall  f1-score   support

      horror       0.95      0.90      0.92        20
    romantic       0.90      0.95      0.93        20

    accuracy                           0.93        40
   macro avg       0.93      0.93      0.92        40
weighted avg       0.93      0.93      0.92        40

Accuracy of the nb model: 0.925


In [28]:
# Let's test the nb model 
text = "Jack Torrance (Jack Nicholson) becomes winter caretaker at the isolated Overlook Hotel in Colorado, hoping to cure his writer's block. He settles in along with his wife, Wendy (Shelley Duvall), and his son, Danny (Danny Lloyd), who is plagued by psychic premonitions. As Jack's writing goes nowhere and Danny's visions become more disturbing, Jack discovers the hotel's dark secrets and begins to unravel into a homicidal maniac hell-bent on terrorizing his family."
procesed_txt = preprocess(text=text)
print(f"Processed text: {procesed_txt}")
text_vec = vec2.transform([procesed_txt])
output = nb.predict_proba(text_vec)
print(output)

Processed text: Jack Torrance Jack Nicholson winter caretaker isolated Overlook Hotel Colorado hoping cure writer block settles wife Wendy Shelley Duvall son Danny Danny Lloyd plagued psychic premonitions Jack writing goes Danny visions disturbing Jack discovers hotel dark secrets begins unravel homicidal maniac hell bent terrorizing family
[[0.59164289 0.40835711]]


In [29]:
# Let's test our log model for romantic drama pair as well
text = """Hazel Grace Lancaster (Shailene Woodley), a 16-year-old cancer patient, meets and falls in love with Gus Waters (Ansel Elgort), a similarly afflicted teen from her cancer support group. Hazel feels that Gus really understands her. They both share the same acerbic wit and a love of books, especially Grace's touchstone, "An Imperial Affliction" by Peter Van Houten. When Gus scores an invitation to meet the reclusive author, he and Hazel embark on the adventure of their brief lives."""

procesed_txt = preprocess(text=text)
print(f"Processed text: {procesed_txt}")
text_vec = vec1.transform([procesed_txt]) # Use vec1 here
output = log_rd.predict_proba(text_vec)
print(output)

Processed text: Hazel Grace Lancaster Shailene Woodley 16 year old cancer patient meets falls love Gus Waters Ansel Elgort similarly afflicted teen cancer support group Hazel feels Gus understands share acerbic wit love books especially Grace touchstone Imperial Affliction Peter Van Houten Gus scores invitation meet reclusive author Hazel embark adventure brief lives
[[0.43679495 0.56320505]]


**We tested both the models!** Both are performing fine! We can proceed further!

In [30]:
# So far we have trained our two models
# For the romantic-drama pair: 80 percent accuracy (logistic), log_rd
# For the romantic-horror pair: 93 percent accuracy (naive bayes), nb
# Let's move further!

# Training on romantic - action pair
df3 = df[(df["label"]=="romantic") | (df["label"]=="action")]
X_train3, X_test3, y_train3, y_test3 = train_test_split(
    df3["processed_synop"], df3["label"], test_size=0.20, random_state=45)

vec3 = TfidfVectorizer()
vec3.fit(X_train3)
spX_train3 = vec3.transform(X_train3)
spX_test3 = vec3.transform(X_test3)

nb2 = MultinomialNB()
nb2.fit(spX_train3, y_train3)
nb2_preds = nb2.predict(spX_test3)
acc_nb2 = accuracy_score(y_test3, nb2_preds)
print(f"Overall accuracy of nb2: {acc_nb2}")
print(f"Classification report of nb2:\n{classification_report(y_test3, nb2_preds)}")

Overall accuracy of nb2: 0.925
Classification report of nb2:
              precision    recall  f1-score   support

      action       0.95      0.90      0.92        20
    romantic       0.90      0.95      0.93        20

    accuracy                           0.93        40
   macro avg       0.93      0.93      0.92        40
weighted avg       0.93      0.93      0.92        40



In [31]:
# Let's test our nb2 model
text = """Tom Cruise and Cameron Crowe reunite after "Jerry Maguire" for "Vanilla Sky," the story of a young New York City publishing magnate who finds himself on an unexpected roller-coaster ride of romance, comedy, suspicion, love, sex and dreams in a mind-bending search for his soul."""

procesed_txt = preprocess(text=text)
print(f"Processed text: {procesed_txt}")
text_vec = vec3.transform([procesed_txt]) # Use vec3 here
output = nb2.predict_proba(text_vec)
print(output)

Processed text: Tom Cruise Cameron Crowe reunite Jerry Maguire Vanilla Sky story young New York City publishing magnate finds unexpected roller coaster ride romance comedy suspicion love sex dreams mind bending search soul
[[0.36991366 0.63008634]]


**Our nb2 model, which classifies romantic and action movies also works fine, and is a generalized model!**

In [None]:
# So far we have trained our 3 models
# log_rd, nb, and nb2
# Let's sum up and average their accuracy to get an overall accuracy (so far)
(acc_log + acc_nb + acc_nb2) / 3

0.8849593495934961

**So far the three pairwise models average about 88% accuracy! That's really good!**

In [33]:
df["label"].unique()

array(['romantic', 'thriller', 'action', 'horror', 'scifi', 'drama'],
      dtype=object)

In [34]:
# Moving over to our 4th model
# Romantic Scifi - (Logistic recommended by our report)
# So let's try logistic

df4 = df[(df["label"]=="romantic") | (df["label"]=="scifi")]
X_train4, X_test4, y_train4, y_test4 = train_test_split(
    df4["processed_synop"], df4["label"], test_size=0.20, random_state=45)

vec4 = TfidfVectorizer()
vec4.fit(X_train4)
spX_train4 = vec4.transform(X_train4)
spX_test4 = vec4.transform(X_test4)

log_rs = LogisticRegression()
log_rs.fit(spX_train4, y_train4)
pred_rs = log_rs.predict(spX_test4)
acc_rs = accuracy_score(y_test4, pred_rs)
print(f"Accuracy of log_rs model: {acc_rs}")
print(f"Classification report of log_rs model:\n{classification_report(y_test4, pred_rs)}")

Accuracy of log_rs model: 0.9024390243902439
Classification report of log_rs model:
              precision    recall  f1-score   support

    romantic       1.00      0.81      0.89        21
       scifi       0.83      1.00      0.91        20

    accuracy                           0.90        41
   macro avg       0.92      0.90      0.90        41
weighted avg       0.92      0.90      0.90        41



In [35]:
# 4th model seems to overfit
# Let's try other algorithms
nb_rs = MultinomialNB()
nb_rs.fit(spX_train4, y_train4)
pred_rs = nb_rs.predict(spX_test4)

# update the acc_rs here
acc_rs = accuracy_score(y_test4, pred_rs)
print(f"Accuracy: {acc_rs}")
print(f"Classification report:\n{classification_report(y_test4, pred_rs)}\n")

Accuracy: 0.9024390243902439
Classification report:
              precision    recall  f1-score   support

    romantic       0.95      0.86      0.90        21
       scifi       0.86      0.95      0.90        20

    accuracy                           0.90        41
   macro avg       0.91      0.90      0.90        41
weighted avg       0.91      0.90      0.90        41




**This model also seems pretty generalized!**

In [None]:
# Time to test our fourth model, nb_rs
text = """Rachel Stone is an intelligence operative, the only woman who stands between her powerful global peacekeeping organization and the loss of its most valuable -- and dangerous -- asset."""

procesed_txt = preprocess(text=text)
print(f"Processed text: {procesed_txt}")
text_vec = vec4.transform([procesed_txt]) # Use vec4 here
output = nb_rs.predict_proba(text_vec)
print(output)

Processed text: Rachel Stone intelligence operative woman stands powerful global peacekeeping organization loss valuable dangerous asset
[[0.39567084 0.60432916]]


**This model also works fine! Let's proceed.**

In [None]:
# So far we have got four models
# log_rd, nb, nb2, and nb_rs
# Now for the last pair romantic thriller...
# Again we will start with naive bayes, because that's what our report says

df5 = df[(df["label"]=="romantic") | (df["label"]=="thriller")]
X_train5, X_test5, y_train5, y_test5 = train_test_split(
    df5["processed_synop"], df5["label"], test_size=0.20, random_state=45)

vec5 = TfidfVectorizer()
vec5.fit(X_train5)
spX_train5 = vec5.transform(X_train5)
spX_test5 = vec5.transform(X_test5)

nb3 = MultinomialNB()
nb3.fit(spX_train5, y_train5)
pred_nb3 = nb3.predict(spX_test5)
acc_nb3 = accuracy_score(y_test5, pred_nb3)
print(f"Accuracy of nb3 model: {acc_nb3}")
print(f"Classification report of nb3 model:\n{classification_report(y_test5, pred_nb3)}")

Accuracy of nb3 model: 0.95
Classification report of nb3 model:
              precision    recall  f1-score   support

    romantic       1.00      0.90      0.95        20
    thriller       0.91      1.00      0.95        20

    accuracy                           0.95        40
   macro avg       0.95      0.95      0.95        40
weighted avg       0.95      0.95      0.95        40



In [38]:
# The naive bayes seems to overfit a little
# Let's try other algorithms and find a generalized one

log_rt = LogisticRegression()
log_rt.fit(spX_train5, y_train5)
preds_rt = log_rt.predict(spX_test5)
acc_rt = accuracy_score(y_test5, preds_rt)
print(classification_report(y_test5, preds_rt))

              precision    recall  f1-score   support

    romantic       0.95      0.90      0.92        20
    thriller       0.90      0.95      0.93        20

    accuracy                           0.93        40
   macro avg       0.93      0.93      0.92        40
weighted avg       0.93      0.93      0.92        40



In [39]:
# Let's test the last model!
text = """Rachel Stone is an intelligence operative, the only woman who stands between her powerful global peacekeeping organization and the loss of its most valuable -- and dangerous -- asset."""

procesed_txt = preprocess(text=text)
print(f"Processed text: {procesed_txt}")
text_vec = vec5.transform([procesed_txt]) # Use vec5 here
output = log_rt.predict_proba(text_vec)
print(output)

Processed text: Rachel Stone intelligence operative woman stands powerful global peacekeeping organization loss valuable dangerous asset
[[0.41964653 0.58035347]]


**This model also works fine!**

In [None]:
# Now we have all our models
# We have log_rd, nb, nb2, nb_rs, log_rt
# Let's take the average of the accuracies of all 5 models
(acc_log + acc_nb + acc_nb2 + acc_rs + acc_rt) / 5

0.8964634146341464

**On average, the five pairwise models reach approximately 90% accuracy!**

**The model variable names are as follows:**
1. log_rd (d-r)
2. nb (h-r)
3. nb2 (a-r)
4. nb_rs (r-s)
5. log_rt (r-t)

Either logistic regression or naive bayes algorithm is used in each pair!
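
scikit-learn classifiers store their labels in sorted order in `classes_`, and the columns of `predict_proba` follow that order. Any code that combines these models therefore needs to know which column belongs to `romantic`: it is the second column for the drama, horror, and action pairs, but the first for the scifi and thriller pairs. A minimal sketch with made-up count features (the data and the `by_label` name are assumptions for illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count features for four synopses
X_toy = np.array([[2, 0], [3, 1], [0, 2], [1, 3]])
y_toy = np.array(["romantic", "romantic", "drama", "drama"])

toy_nb = MultinomialNB().fit(X_toy, y_toy)
print(toy_nb.classes_)  # labels in sorted order: drama before romantic

# Pair each probability with its label instead of relying on column position
proba = toy_nb.predict_proba(np.array([[2, 1]]))[0]
by_label = dict(zip(toy_nb.classes_, proba))
print(by_label)
```

Looking probabilities up by label avoids hard-coding column indices that silently change when a pair's other genre sorts after `romantic`.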

In [42]:
# Let's give a fancy name to combining function
# 'Fusionator' sounds good.
# For now this version - 0

# Pairs:
# d-r
# h-r
# a-r
# r-s
# r-t

def fusionator_v0(best_models, vecs, text):
    """Fusionator Version - v0"""
    
    ptext = preprocess(text)
    i = 0
    rom_prob = 0
    dra_prob = 0
    hor_prob = 0
    act_prob = 0
    sci_prob = 0
    thr_prob = 0

    while i < len(vecs):
        if i <= 2:
            ptext_vec = vecs[i].transform([ptext])
            prob = best_models[i].predict_proba(ptext_vec)
            rom_prob += prob[0][1]

            if best_models[i].predict(ptext_vec) == "drama":
                dra_prob += prob[0][0]
            if best_models[i].predict(ptext_vec) == "horror":
                hor_prob += prob[0][0]
            if best_models[i].predict(ptext_vec) == "action":
                act_prob += prob[0][0]

        elif i > 2 and i <= 5:
            ptext_vec = vecs[i].transform([ptext])
            prob = best_models[i].predict_proba(ptext_vec)

            # Column order flips here: "romantic" sorts before "scifi" and
            # "thriller", so index 0 is the romantic probability
            rom_prob += prob[0][0]
            if best_models[i].predict(ptext_vec) == "scifi":
                sci_prob += prob[0][1]
            if best_models[i].predict(ptext_vec) == "thriller":
                thr_prob += prob[0][1]
        i += 1 
        # loop ends here
    rom_prob = rom_prob / 5
    sum_prob = rom_prob + dra_prob + hor_prob + act_prob + sci_prob + thr_prob

    rom_prob_f = (rom_prob / sum_prob) * 100
    dra_prob_f = (dra_prob / sum_prob) * 100
    hor_prob_f = (hor_prob / sum_prob) * 100
    act_prob_f = (act_prob / sum_prob) * 100
    sci_prob_f = (sci_prob / sum_prob) * 100
    thr_prob_f = (thr_prob / sum_prob) * 100

    # Final results (combined)
    print(f"Romance: {rom_prob_f}")
    print(f"Drama: {dra_prob_f}")
    print(f"Horror: {hor_prob_f}")
    print(f"Action: {act_prob_f}")
    print(f"Scifi: {sci_prob_f}")
    print(f"Thriller: {thr_prob_f}")

In [43]:
# Testing the fusionator version v0
fusionator_v0(
    best_models=[log_rd, nb, nb2, nb_rs, log_rt],
    vecs=[vec1, vec2, vec3, vec4, vec5],
    text="""Henry Barthes (Adrien Brody) is a substitute teacher who shuns emotional connections, and never stays long enough in one district to bond with his students or colleagues. Troubled and lost, Henry lands at a public school where an apathetic student body and disinterested parents have created a frustrated, burned-out group of teachers and administrators. Inadvertently, Henry becomes a role model to his disaffected students and bonds with a teenage runaway who is just as lost as he is.""")


Romance: 51.056415478588214
Drama: 48.94358452141177
Horror: 0.0
Action: 0.0
Scifi: 0.0
Thriller: 0.0
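
The scoring arithmetic inside `fusionator_v0` can be reproduced without any models. The sketch below is a plain-Python restatement under assumptions: `combine_pairwise` is a name introduced here, the probabilities are made up, and the test `p_other > p_rom` approximates what `predict` returns for a two-class model.

```python
def combine_pairwise(pair_probs):
    """pair_probs maps a rival genre to (p_romantic, p_rival) from its binary model."""
    scores = {"romantic": 0.0}
    for genre, (p_rom, p_other) in pair_probs.items():
        # Romantic collects a probability from every pairwise model
        scores["romantic"] += p_rom
        # A rival genre scores only when its model actually predicts it
        scores[genre] = p_other if p_other > p_rom else 0.0

    # Romantic appears in every pair, so its total is averaged
    scores["romantic"] /= len(pair_probs)
    total = sum(scores.values())
    return {genre: 100 * score / total for genre, score in scores.items()}

# Hypothetical model outputs, not taken from the notebook
hypothetical = {
    "drama": (0.40, 0.60),
    "horror": (0.70, 0.30),
    "action": (0.65, 0.35),
    "scifi": (0.75, 0.25),
    "thriller": (0.55, 0.45),
}
result = combine_pairwise(hypothetical)
print(result)
```

Because a rival genre is zeroed whenever its model favours romantic, the output tends to split between romance and the few genres that beat it, which is consistent with the result above where only Drama receives a non-zero score.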


### 4. Saving the models and vectorizers
Some manual tests were run on the fusionator v0 classification system. Its performance is average, neither very good nor very bad; as a rough estimate, we can assume around 75% accuracy. It could be improved by using word vectors from spaCy, Gensim, or maybe fastText.

For now, let's save the models and the vectorizers.

In [44]:
# Saving all the five models!
dump(value=log_rd, filename="../models/log_rd.pickle")
dump(value=nb, filename="../models/nb.pickle")
dump(value=nb2, filename="../models/nb2.pickle")
dump(value=nb_rs, filename="../models/nb_rs.pickle")
dump(value=log_rt, filename="../models/log_rt.pickle")

['../models/log_rt.pickle']

In [45]:
# Saving all the five vectorizers (TFIDF)
dump(value=vec1, filename="../vectorizers/vec1.pickle")
dump(value=vec2, filename="../vectorizers/vec2.pickle")
dump(value=vec3, filename="../vectorizers/vec3.pickle")
dump(value=vec4, filename="../vectorizers/vec4.pickle")
dump(value=vec5, filename="../vectorizers/vec5.pickle")

['../vectorizers/vec5.pickle']
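
To reuse the saved artifacts in another session, `joblib.load` (imported at the top of the notebook) reads them back. The following is a minimal round-trip sketch that uses a temporary directory and a plain dictionary as a stand-in so that it runs on its own; the path and the `artifact` object are assumptions.

```python
import tempfile
from pathlib import Path

from joblib import dump, load

# Stand-in object; in the notebook this would be a fitted model or vectorizer
artifact = {"name": "log_rd", "labels": ["drama", "romantic"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = str(Path(tmp_dir) / "log_rd.pickle")  # hypothetical path
    dump(value=artifact, filename=path)
    restored = load(path)

print(restored == artifact)
```

Each model should be loaded together with the vectorizer it was trained with (`log_rd` with `vec1`, `nb` with `vec2`, and so on), since every pair has its own vocabulary.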