## Improving text classification using spaCy word vectors and gensim

**Objectives:**
1. We'll see whether the spaCy word vectors outperform basic TF-IDF.
2. Then we'll see whether gensim models outperform the spaCy word vectors.
3. The final aim is to create an improved version of the `fusionator` function that we made in the movie1 notebook, which can also be used from `ml_backend/scripts/fusionator_v0.py`.

**Note:** Many functions used in this notebook come from our custom `utils` package, built and designed for fast implementation of ML algorithms.

In [28]:
# Some necessary imports
from utils.train_eval import base_train_eval
from utils.train_eval_spacy import spacy_vector_train
from sklearn.model_selection import train_test_split

from joblib import load
from joblib import dump

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

import pandas as pd
import numpy as np
import spacy

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

#### 1. Comparing spacy word vectors with TFIDF

In [2]:
# Load in the dataset
df = pd.read_excel("../data_sets/rotten_tomatoes_100.xlsx")

In [3]:
df.sample(5)

|Unnamed: 0|title|synopsis|label|
|----------|-----|--------|-----|
|116|Any Day Now|Steve (Taylor Gray) is a night watchman in his...|thriller|
|471|Predestination|A temporal agent (Ethan Hawke) embarks on a fi...|sci-fi|
|118|The Actor|Based on the novel Memory by Donald E. Westlak...|thriller|
|228|Clone Cops|In a future dumbed down by next-day-delivery, ...|action|
|161|Deep Water|Based on the celebrated novel by famed mystery...|thriller|


In [4]:
# Define the models first
models_list = [
    AdaBoostClassifier(algorithm="SAMME"),
    GradientBoostingClassifier(),
    MultinomialNB(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegression(max_iter=5000)
]

In [5]:
max_p, max_a = base_train_eval(
    models=models_list,
    X=df["synopsis"],
    y=df["label"],
    tsize=0.20,
    rstate=45,
    vec_type="tfidf",
    acc=True,
    cfreport=True,
    show_vocab=True,
    max_performer=True,
    complete_res=False,
)


---> Given below is the BOW:

['agents' 'ages' 'aggressive' 'aggressively' 'agility' 'aging' 'ago'
 'agrees' 'aguirre' 'agutter' 'ah' 'ahead' 'ai' 'aia' 'aidan' 'aided'
 'aids' 'aiello' 'air' 'airport' 'aja' 'akin' 'alan' 'alarum' 'alaskan'
 'alba' 'albright' 'alcaino' 'alcohol' 'alden' 'alec' 'alejandro'
 'aleksandr' 'alerting' 'alessio' 'alex' 'alexander' 'alexandre' 'alfie'
 'algerian' 'algout' 'ali' 'alice' 'alicia' 'alien' 'alienoid' 'aliens'
 'alike' 'alita' 'alive' 'all' 'allegiances' 'allen' 'allergic'
 'alleviate' 'alliance' 'allies' 'allow' 'allowed' 'allows' 'alluring'
 'ally' 'alma' 'almasy' 'almighty' 'almost' 'alone' 'along' 'alongside'
 'aloof' 'alps' 'already' 'also' 'alter' 'altered' 'altering' 'alternate'
 'alternative' 'although' 'altogether' 'always' 'alys' 'alzheimer'
 'amanda' 'amateur' 'ambassadors' 'amber' 'ambition' 'ambitions'
 'ambitious' 'ambitiously' 'ambush' 'ambushed' 'america' 'american'
 'amiable' 'amid' 'amidst' 'amitabh' 'amnesia']


----> Classifica
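
For contrast with the dense vectors used in the next cells, here is a minimal sketch of how TF-IDF weights turn text into sparse, vocabulary-sized vectors. It uses plain Python and the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default; whether `base_train_eval` uses those defaults is an assumption, and `tfidf_vectors` and `toy_docs` are hypothetical names used only for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # L2-normalise each row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

# Hypothetical toy synopses, not taken from the dataset
toy_docs = [
    "alien agents invade the airport",
    "an aging actor falls in love",
    "agents chase an alien",
]
vocab, rows = tfidf_vectors(toy_docs)
print(len(vocab), "vocabulary terms")
print([round(v, 2) for v in rows[0]])
```

Every row has one entry per vocabulary term and most entries are zero, which is why these vectors are called sparse.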

In [6]:
# Load the large spaCy model to get document vectors
nlp = spacy.load("en_core_web_lg")

In [7]:
max_p, max_a = spacy_vector_train(
    models=models_list,
    df=df,
    X="synopsis",
    y="label",
    nlp=nlp,
    normalize=True,
    tsize=0.20,
    rstate=45,
    cfreport=True,
    max_performer=True,
    complete_res=False
)


Wait please...
Preprocessing the text...
Wait please...
Converting the text into spacy word vectors...
-> Text has been converted into word vectors!

-> Training the models now:
Model name: AdaBoostClassifier(algorithm='SAMME') | Acc: 0.34710743801652894
Classification report:
              precision    recall  f1-score   support

      action       0.34      0.53      0.42        19
       drama       0.27      0.22      0.24        18
      horror       0.50      0.20      0.29        25
    romantic       0.50      0.47      0.48        15
      sci-fi       0.50      0.35      0.41        23
    thriller       0.22      0.38      0.28        21

    accuracy                           0.35       121
   macro avg       0.39      0.36      0.35       121
weighted avg       0.39      0.35      0.35       121

Model name: GradientBoostingClassifier() | Acc: 0.4214876033057851
Classification report:
              precision    recall  f1-score   support

      action       0.38      0.47

**Conclusion:** The spaCy word vectors perform much better than the simple sparse vectors created using TF-IDF, so we will use them to create an improved version of `fusionator`.
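
To make the difference concrete: by default, spaCy's `doc.vector` is the average of the token vectors, so every synopsis becomes one dense vector of fixed length regardless of vocabulary size. The sketch below mimics that averaging in plain Python. The three-dimensional `toy_embeddings` are made up for illustration (the vectors in `en_core_web_lg` have 300 dimensions), and `mean_vector` is a hypothetical helper, not a spaCy function.

```python
def mean_vector(token_vectors):
    """Average equal-length token vectors component-wise."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Hypothetical 3-dimensional embeddings, invented for this example
toy_embeddings = {
    "detective": [0.9, 0.1, 0.0],
    "murder":    [0.8, 0.0, 0.2],
    "wedding":   [0.0, 0.9, 0.1],
}

thriller_like = mean_vector([toy_embeddings["detective"], toy_embeddings["murder"]])
romance_like = mean_vector([toy_embeddings["wedding"]])
print(thriller_like)
print(romance_like)
```

Because related words sit close together in the embedding space, two synopses can end up with similar vectors even when they share no exact words, which sparse TF-IDF vectors cannot capture.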

#### 2. Fusionator Version V1, a better version of the fusionator we made in notebook movie1

From notebook 1 (movie1), we know the best pairs. The table below lists each pair with the corresponding saved model from fusionator version v0.

|Pairs|Saved models|
|-----|------------|
|d-r|log_rd|
|h-r|nb|
|a-r|nb2|
|r-s|nb_rs|
|r-t|log_rt|

Let's train and evaluate some classification algorithms, with spaCy word vectors this time, and build fusionator version v1.
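
The five pair subsets built in the following cells all follow the same filtering pattern. As a rough sketch, the pair codes from the table above can be mapped to the dataset labels and the subsets generated in one pass. `PAIRS`, `subset_for_pair` and `toy_rows` are hypothetical names, and the toy rows stand in for the real DataFrame.

```python
# Pair codes from the table above mapped to the dataset labels
PAIRS = {
    "d-r": ("drama", "romantic"),
    "h-r": ("horror", "romantic"),
    "a-r": ("action", "romantic"),
    "r-s": ("romantic", "sci-fi"),
    "r-t": ("romantic", "thriller"),
}

def subset_for_pair(rows, labels):
    """Keep only the rows whose label is one of the two given labels."""
    return [row for row in rows if row["label"] in labels]

# Toy rows standing in for the real DataFrame
toy_rows = [
    {"title": "A", "label": "drama"},
    {"title": "B", "label": "romantic"},
    {"title": "C", "label": "horror"},
    {"title": "D", "label": "sci-fi"},
]

subsets = {code: subset_for_pair(toy_rows, labels) for code, labels in PAIRS.items()}
for code, rows in subsets.items():
    print(code, [row["title"] for row in rows])
```

With pandas, the same filter can be written as `df[df["label"].isin(labels)]`.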


In [8]:
# Pair - 1 (drama - romantic)
df_dr = df[(df["label"]=="drama") | (df["label"]=="romantic")]
df_dr.sample(5)

|Unnamed: 0|title|synopsis|label|vec|
|----------|-----|--------|-----|---|
|8|You've Got Mail|Struggling boutique bookseller Kathleen Kelly ...|romantic|[-0.026945505, 0.14203969, -0.10058184, -0.014...|
|20|Lars and the Real Girl|Extremely shy Lars Ryan Gosling finds impossib...|romantic|[-0.052894183, 0.19409372, -0.14619413, -0.022...|
|547|Fair Play|coveted promotion cutthroat financial firm ari...|drama|[-0.044098582, 0.077271715, 0.05329323, -0.016...|
|79|Spontaneous|students high school inexplicably start explod...|romantic|[0.034950946, 0.17531309, -0.09303985, -0.0488...|
|513|You Hurt my feelin gs|acclaimed filmmaker Nicole Holofcener comes sh...|drama|[-0.024295159, 0.068214044, -0.08313387, 0.028...|


In [9]:
# Drop the existing vec column first, otherwise a copy of it will be created
df_dr = df_dr.drop("vec", axis=1)
max_p1, max_a1 = spacy_vector_train(
    models=models_list, # model list defined earlier
    df=df_dr, X="synopsis", y="label",
    nlp=nlp,
    normalize=True,
    tsize=0.20,
    rstate=45,
    cfreport=True,
    max_performer=True,
    complete_res=False,
)


Wait please...
Preprocessing the text...
Wait please...
Converting the text into spacy word vectors...
-> Text has been converted into word vectors!

-> Training the models now:
Model name: AdaBoostClassifier(algorithm='SAMME') | Acc: 0.8292682926829268
Classification report:
              precision    recall  f1-score   support

       drama       0.84      0.80      0.82        20
    romantic       0.82      0.86      0.84        21

    accuracy                           0.83        41
   macro avg       0.83      0.83      0.83        41
weighted avg       0.83      0.83      0.83        41

Model name: GradientBoostingClassifier() | Acc: 0.7560975609756098
Classification report:
              precision    recall  f1-score   support

       drama       0.75      0.75      0.75        20
    romantic       0.76      0.76      0.76        21

    accuracy                           0.76        41
   macro avg       0.76      0.76      0.76        41
weighted avg       0.76      0.76

In [10]:
# Also let's create a dict which will hold all our algorithms, with their acc
spvec_models = {}

# Function that prepares the dataset, trains a particular model and returns it
def train_component(df:pd.DataFrame, model, tsize=0.20, rstate=45):
    """Trains components for the fusionator!"""
    # Split according to the tsize and rstate arguments
    X_train, X_test, y_train, y_test = train_test_split(
        df["vec"], df["label"], test_size=tsize, random_state=rstate)
    
    X_train_2D = np.stack(X_train, axis=0)
    X_test_2D = np.stack(X_test, axis=0)

    norm = MinMaxScaler()
    norm.fit(X_train_2D)
    X_train_N = norm.transform(X_train_2D)
    X_test_N = norm.transform(X_test_2D)

    model.fit(X_train_N, y_train)
    preds = model.predict(X_test_N)
    acc = accuracy_score(y_test, preds)
    spvec_models[model] = acc

    return model, acc

# Note:
# Run spacy_vector_train on the DataFrame first, then use this function
# to train a specific component of our new fusionator
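
The scaling step inside `train_component` fits the scaler on the training rows only and then reuses those statistics for the test rows. Here is a minimal plain-Python sketch of that idea, assuming the usual min-max formula `(x - min) / (max - min)`. `fit_min_max` and `apply_min_max` are hypothetical helpers, not part of scikit-learn.

```python
def fit_min_max(rows):
    """Learn per-column minimum and range from the training rows only."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    # Guard against constant columns to avoid dividing by zero
    ranges = [(max(col) - min(col)) or 1.0 for col in columns]
    return mins, ranges

def apply_min_max(rows, mins, ranges):
    """Scale rows with statistics learned earlier (no refitting)."""
    return [[(v - lo) / rng for v, lo, rng in zip(row, mins, ranges)] for row in rows]

train = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]
test = [[1.0, 40.0]]  # second feature lies outside the training range

mins, ranges = fit_min_max(train)
train_scaled = apply_min_max(train, mins, ranges)
test_scaled = apply_min_max(test, mins, ranges)
print(train_scaled)
print(test_scaled)
```

Test values may fall outside the 0 to 1 range, which is expected. The same fitted statistics would also need to be applied to any vector passed to the model at prediction time, since the model only ever saw scaled inputs during training.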

In [11]:
# For pair-1, let's go with logistic regression model
# In fusionator version v0, we also used logistic regression model here
# Let's go with the base algorithm itself
# No need of hyper parameter tuning, it's already performing way better

splog_dr, splog_dr_acc = train_component(
    df=df_dr, model=LogisticRegression(max_iter=5000),
    tsize=0.20, rstate=45
)

splog_dr, splog_dr_acc


(LogisticRegression(max_iter=5000), 0.8780487804878049)

In [12]:
# For Pair-2
df_hr = df[(df["label"]=="horror") | (df["label"]=="romantic")]
df_hr = df_hr.drop("vec", axis=1)

max_p2, max_a2 = spacy_vector_train(
    models=models_list, # model list defined earlier
    df=df_hr, X="synopsis", y="label",
    nlp=nlp,
    normalize=True,
    tsize=0.20,
    rstate=45,
    cfreport=True,
    max_performer=True,
    complete_res=False,
)



Wait please...
Preprocessing the text...
Wait please...
Converting the text into spacy word vectors...
-> Text has been converted into word vectors!

-> Training the models now:
Model name: AdaBoostClassifier(algorithm='SAMME') | Acc: 0.925
Classification report:
              precision    recall  f1-score   support

      horror       0.90      0.95      0.93        20
    romantic       0.95      0.90      0.92        20

    accuracy                           0.93        40
   macro avg       0.93      0.93      0.92        40
weighted avg       0.93      0.93      0.92        40

Model name: GradientBoostingClassifier() | Acc: 0.875
Classification report:
              precision    recall  f1-score   support

      horror       0.86      0.90      0.88        20
    romantic       0.89      0.85      0.87        20

    accuracy                           0.88        40
   macro avg       0.88      0.88      0.87        40
weighted avg       0.88      0.88      0.87        40

Mode

In [13]:
# Naive Bayes is the best performing model here
# But it might be a little overfitted
# Precision, recall and accuracy are all exactly 0.95, which seems odd
# It doesn't look like a well-generalized model
# So we go with the AdaBoost classifier here

spada_hr, spada_hr_acc = train_component(
    model=AdaBoostClassifier(algorithm="SAMME"),
    df=df_hr,
    tsize=0.20,
    rstate=45
)

spada_hr, spada_hr_acc

(AdaBoostClassifier(algorithm='SAMME'), 0.925)
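
One way to put numbers on that caution: the test split for this pair holds only 40 synopses (support 20 + 20 in the report above), so each accuracy figure carries a wide margin of error. The sketch below computes an approximate 95% Wilson score interval, assuming that 0.95 corresponds to 38 of 40 correct and 0.925 to 37 of 40. `wilson_interval` is a hypothetical helper written for this example.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# 38 of 40 correct corresponds to 0.95 accuracy; 37 of 40 to 0.925
for correct in (38, 37):
    low, high = wilson_interval(correct, 40)
    print(f"{correct}/40 -> {low:.3f} to {high:.3f}")
```

The two intervals overlap heavily, so on a test set this small the gap between 0.95 and 0.925 is a single synopsis and does not by itself separate the two models.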

In [14]:
# Pair-3 (Action romantic)
df_ar = df[(df["label"]=="action") | (df["label"]=="romantic")]
df_ar = df_ar.drop("vec", axis=1)

max_p3, max_a3 = spacy_vector_train(
    models=models_list, # model list defined earlier
    df=df_ar, X="synopsis", y="label",
    nlp=nlp,
    normalize=True,
    tsize=0.20,
    rstate=45,
    cfreport=True,
    max_performer=True,
    complete_res=False,
)


Wait please...
Preprocessing the text...
Wait please...
Converting the text into spacy word vectors...
-> Text has been converted into word vectors!

-> Training the models now:
Model name: AdaBoostClassifier(algorithm='SAMME') | Acc: 0.95
Classification report:
              precision    recall  f1-score   support

      action       0.95      0.95      0.95        20
    romantic       0.95      0.95      0.95        20

    accuracy                           0.95        40
   macro avg       0.95      0.95      0.95        40
weighted avg       0.95      0.95      0.95        40

Model name: GradientBoostingClassifier() | Acc: 0.875
Classification report:
              precision    recall  f1-score   support

      action       0.86      0.90      0.88        20
    romantic       0.89      0.85      0.87        20

    accuracy                           0.88        40
   macro avg       0.88      0.88      0.87        40
weighted avg       0.88      0.88      0.87        40

Model

In [15]:
# For pair-3 Gradient boosting seems like the most relevant and most generalized
spgrad_ar, spgrad_ar_acc = train_component(
    model=GradientBoostingClassifier(),
    df=df_ar,
    tsize=0.20,
    rstate=45
)

spgrad_ar, spgrad_ar_acc

(GradientBoostingClassifier(), 0.9)

In [16]:
# For Pair-4
# Pair 4 is romance and scifi

df_rs = df[(df["label"]=="romantic") | (df["label"]=="sci-fi")]
df_rs = df_rs.drop("vec", axis=1)

max_p4, max_a4 = spacy_vector_train(
    models=models_list, # model list defined earlier
    df=df_rs, X="synopsis", y="label",
    nlp=nlp,
    normalize=True,
    tsize=0.20,
    rstate=45,
    cfreport=True,
    max_performer=True,
    complete_res=False,
)


Wait please...
Preprocessing the text...
Wait please...
Converting the text into spacy word vectors...
-> Text has been converted into word vectors!

-> Training the models now:
Model name: AdaBoostClassifier(algorithm='SAMME') | Acc: 0.926829268292683
Classification report:
              precision    recall  f1-score   support

    romantic       0.95      0.90      0.93        21
      sci-fi       0.90      0.95      0.93        20

    accuracy                           0.93        41
   macro avg       0.93      0.93      0.93        41
weighted avg       0.93      0.93      0.93        41

Model name: GradientBoostingClassifier() | Acc: 0.9512195121951219
Classification report:
              precision    recall  f1-score   support

    romantic       0.95      0.95      0.95        21
      sci-fi       0.95      0.95      0.95        20

    accuracy                           0.95        41
   macro avg       0.95      0.95      0.95        41
weighted avg       0.95      0.95 

In [17]:
# So many models are performing really well in this section
# Let's take the random forest model
sprand_rs, sprand_rs_acc = train_component(
    model=RandomForestClassifier(),
    df=df_rs,
    tsize=0.20,
    rstate=45
)

sprand_rs, sprand_rs_acc

(RandomForestClassifier(), 0.9024390243902439)

In [18]:
# For pair-5
# The last pair (romantic-thriller)
df_rt = df[(df["label"]=="romantic") | (df["label"]=="thriller")]
df_rt = df_rt.drop("vec", axis=1)

max_p5, max_a5 = spacy_vector_train(
    models=models_list, # model list defined earlier
    df=df_rt, X="synopsis", y="label",
    nlp=nlp,
    normalize=True,
    tsize=0.20,
    rstate=45,
    cfreport=True,
    max_performer=True,
    complete_res=False,
)



Wait please...
Preprocessing the text...
Wait please...
Converting the text into spacy word vectors...
-> Text has been converted into word vectors!

-> Training the models now:
Model name: AdaBoostClassifier(algorithm='SAMME') | Acc: 0.9
Classification report:
              precision    recall  f1-score   support

    romantic       0.94      0.85      0.89        20
    thriller       0.86      0.95      0.90        20

    accuracy                           0.90        40
   macro avg       0.90      0.90      0.90        40
weighted avg       0.90      0.90      0.90        40

Model name: GradientBoostingClassifier() | Acc: 0.875
Classification report:
              precision    recall  f1-score   support

    romantic       0.89      0.85      0.87        20
    thriller       0.86      0.90      0.88        20

    accuracy                           0.88        40
   macro avg       0.88      0.88      0.87        40
weighted avg       0.88      0.88      0.87        40

Model 

In [19]:
# For the last pair
# Adaboost classifier seems the best algorithm here
spada_rt, spada_rt_acc = train_component(
    model=AdaBoostClassifier(algorithm="SAMME"),
    df=df_rt,
    tsize=0.20,
    rstate=45
)

spada_rt, spada_rt_acc

(AdaBoostClassifier(algorithm='SAMME'), 0.9)

In [20]:
# Now we have all our models
# Let's have a look at the models and their accuracy

spvec_models

{LogisticRegression(max_iter=5000): 0.8780487804878049,
 AdaBoostClassifier(algorithm='SAMME'): 0.925,
 GradientBoostingClassifier(): 0.9,
 RandomForestClassifier(): 0.9024390243902439,
 AdaBoostClassifier(algorithm='SAMME'): 0.9}

In [21]:
# Calculating the combined accuracy of our fusionator version v1
np.array(list(spvec_models.values())).sum() / 5

# This is only the arithmetic mean of the five component accuracies
# In practice this fusionator is better than the previous one

0.9010975609756098
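
The same figure can be reproduced from the accuracies printed above. The sketch below keys them by pair name instead of by estimator object, which keeps the entries readable; the key names are chosen here for illustration only.

```python
from statistics import mean

# Accuracies reported by train_component for each pair
component_acc = {
    "drama-romantic": 0.8780487804878049,
    "horror-romantic": 0.925,
    "action-romantic": 0.9,
    "romantic-sci-fi": 0.9024390243902439,
    "romantic-thriller": 0.9,
}

# Unweighted mean over the five pairwise models
print(round(mean(component_acc.values()), 4))

# The pair with the lowest accuracy is the first candidate for tuning
weakest = min(component_acc, key=component_acc.get)
print(weakest)
```

This mean weights every pair equally and summarizes five separate two-class problems. It is not the accuracy of the fusionator on the full six-class task, which would have to be measured on held-out synopses from all genres.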

In [22]:
# Code for the fusionator version v1
nlp = spacy.load("en_core_web_lg")

# Function for preprocessing
def preprocess(text, spacy_model=nlp):
    """Pass a text and it will preprocess it!"""
    filtered = []
    doc = spacy_model(text)
    for token in doc:
        if (not token.is_stop) and (not token.is_punct):
            filtered.append(token.lemma_)

    filt_txt = " ".join(filtered)
    return filt_txt

def fusionator_v1(text, models, nlp=nlp):
    """Print a percentage per genre by combining the five pairwise models."""
    ptext = preprocess(text, spacy_model=nlp)
    doc = nlp(ptext)
    doc_vec = doc.vector

    i = 0
    rom_prob = 0
    dra_prob = 0
    hor_prob = 0
    act_prob = 0
    sci_prob = 0
    thr_prob = 0

    while i < len(models):
        if i <= 2:
            prob = models[i].predict_proba([doc_vec])
            rom_prob += prob[0][1]
            if models[i].predict([doc_vec]) == "drama":
                dra_prob += prob[0][0]
            if models[i].predict([doc_vec]) == "horror":
                hor_prob += prob[0][0]
            if models[i].predict([doc_vec]) == "action":
                act_prob += prob[0][0]

        elif i > 2 and i <= 5:
            prob = models[i].predict_proba([doc_vec])
            # Class order is reversed here:
            # "romantic" sorts first, so index 0 belongs to it
            rom_prob += prob[0][0]
            if models[i].predict([doc_vec]) == "sci-fi":
                sci_prob += prob[0][1]
            if models[i].predict([doc_vec]) == "thriller":
                thr_prob += prob[0][1]

        i += 1
        # Loop ends here
    rom_prob = rom_prob / 5
    sum_prob = rom_prob + dra_prob + hor_prob + act_prob + sci_prob + thr_prob

    rom_prob_f = (rom_prob / sum_prob) * 100
    dra_prob_f = (dra_prob / sum_prob) * 100
    hor_prob_f = (hor_prob / sum_prob) * 100
    act_prob_f = (act_prob / sum_prob) * 100
    sci_prob_f = (sci_prob / sum_prob) * 100
    thr_prob_f = (thr_prob / sum_prob) * 100

    # Final results (Combined)
    print(f"Romance: {rom_prob_f}")
    print(f"Drama: {dra_prob_f}")
    print(f"Horror: {hor_prob_f}")
    print(f"Action: {act_prob_f}")
    print(f"Scifi: {sci_prob_f}")
    print(f"Thriller: {thr_prob_f}")
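
The combination rule in the function above can be isolated from the models: romance is averaged over all five pairs, every other genre counts only when its pairwise model predicts it, and the scores are then rescaled to sum to 100. Below is a minimal sketch of that rule on hypothetical probabilities, using label-keyed dicts instead of positional indices. `fuse_pair_probs` and `toy_outputs` are illustrative names, not part of the notebook's code.

```python
def fuse_pair_probs(pair_outputs, shared="romantic"):
    """Combine {label: probability} dicts from pairwise classifiers.

    The shared label's probability is averaged over all pairs; every other
    label contributes only when its pairwise model predicts it.
    """
    shared_total = 0.0
    scores = {}
    for probs in pair_outputs:
        shared_total += probs[shared]
        other = next(label for label in probs if label != shared)
        predicted = max(probs, key=probs.get)
        scores[other] = probs[other] if predicted == other else 0.0
    scores[shared] = shared_total / len(pair_outputs)

    # Rescale so that the genre scores sum to 100
    total = sum(scores.values())
    return {label: 100 * value / total for label, value in scores.items()}

# Hypothetical pairwise probabilities for one synopsis
toy_outputs = [
    {"drama": 0.7, "romantic": 0.3},
    {"horror": 0.4, "romantic": 0.6},
    {"action": 0.2, "romantic": 0.8},
    {"romantic": 0.9, "sci-fi": 0.1},
    {"romantic": 0.4, "thriller": 0.6},
]
fused = fuse_pair_probs(toy_outputs)
for label, pct in fused.items():
    print(f"{label}: {pct:.1f}")
```

With scikit-learn estimators, each dict could be built as `dict(zip(model.classes_, model.predict_proba([vec])[0]))`, which avoids relying on the alphabetical position of each label.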

In [23]:
fusionator_v1(
    text="Ramanna, a maniac murderer, finds a soulmate in Raghavan, a policeman, who inspects his murder cases. He tries to make Raghavan realize how they both are similar.",
    models=[splog_dr, spada_hr, spgrad_ar, sprand_rs, spada_rt],
    nlp=nlp
)

Romance: 25.425375046185838
Drama: 46.95328759012382
Horror: 27.621337363690344
Action: 0.0
Scifi: 0.0
Thriller: 0.0


**Conclusion:** It's working fine. For this synopsis it ranks drama as the main element (about 47%), ahead of horror (about 28%) and romance (about 25%).

In [24]:
len(list(df["synopsis"])), len(list(df["label"]))

(601, 601)

In [29]:
# Saving all these models that we just trained so that we can load these later on
dump(value=splog_dr, filename="../models/splog_dr.pkl")
dump(value=spada_hr, filename="../models/spada_hr.pkl")
dump(value=spgrad_ar, filename="../models/spgrad_ar.pkl")
dump(value=sprand_rs, filename="../models/sprand_rs.pkl")
dump(value=spada_rt, filename="../models/spada_rt.pkl")

['../models/spada_rt.pkl']