# Training a Sentiment Analysis Classifier

In [1]:
!pip freeze | grep scikit-learn
!pip freeze | grep pandas

scikit-learn==1.1.1
pandas==1.4.3


In [260]:
import nltk


#TODO:  ####### should be set upfront at the environment level (eg docker level) #########
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
##########################################################################################

from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

import mlflow

import pandas as pd
from pathlib import Path

import re

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/bengsoon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/bengsoon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/bengsoon/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/bengsoon/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/bengsoon/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [10]:
!ls {data_path}

ls: cannot access 'data': No such file or directory


In [11]:
data_path = Path("./dataset")

## Load Data

In [137]:
df_train = pd.read_csv(data_path / "training.csv")
df_valid = pd.read_csv(data_path / "validation.csv")

As shown in [`00_preparation.ipynb`](./00_preparation.ipynb), our `df_train` is well balanced, so we don't have to worry about handling an [imbalanced dataset](https://github.com/bengsoon/Handling_Imbalanced_Data) in this project. This is great news, as we do not want to digress from our main objective ➡ **ACTIVE LEARNING** 🙌🥳🎉🎉

In [97]:
df_train["sentiment"].value_counts()

negative    2511
positive    2489
Name: sentiment, dtype: int64

In [99]:
df_valid["sentiment"].value_counts()

negative    5040
positive    4960
Name: sentiment, dtype: int64
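
Because both splits are close to 50/50, a classifier that always predicts the majority class would score only slightly above 50% accuracy. The short sketch below works out that baseline from the counts printed above, which gives a reference point for the accuracy scores later on. The variable and function names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() outputs above.
train_counts = {"negative": 2511, "positive": 2489}
valid_counts = {"negative": 5040, "positive": 4960}


def majority_baseline(counts: dict) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


train_baseline = majority_baseline(train_counts)
valid_baseline = majority_baseline(valid_counts)

print(f"train baseline: {train_baseline:.4f}")  # 0.5022
print(f"valid baseline: {valid_baseline:.4f}")  # 0.5040
```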

## Preprocessing

We will be employing a simple Naive Bayes classification model on TF-IDF features, so our preprocessing pipeline is tailored to that.

A quick search online gave us some insights into the dataset:
1. The review texts contain HTML tags, so we should remove those.
2. There are brackets and non-alphanumeric characters that should be removed.

We will use regular expressions (Python's `re` module) to clean up our dataset.

References: https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews/notebook

### Text Cleaning

In [100]:
def clean_text(text: str) -> str: 
    '''
     - removes all html tags
     - replaces all whitespaces and non-alphanumeric characters with ' '
     - returns lowercase
    '''

    # remove html tags
    text = re.sub(r'<[^>]+>', '', text)

    # replace non-alphanumeric
    text = re.sub("[^a-zA-Z0-9]+", " ", text)

    # replace unnecessary whitespaces
    text = re.sub(r"\s+", " ", text)

    return text.lower()

In [101]:
# test out function
test_string = '''<br /><br />""The Grudge 2"" has scary sound and visual effects, with the creepy woman and boy, and I have startled a couple of times while watching this movie. \
     However, the complex screenplay with three subplots is totally confused, making the entwined story a complete mess. There are too much characters and situations, \
    and in a certain moment I was completely lost with the disconnected and fragmented narrative. In the end, I was completely disappointed with this confused, but also spooky film. \
    My vote is four.<br /><br />'''

cleaned_text = clean_text(test_string)
print(f"Original Text:\n{test_string}\n")
print(f"Cleaned String:\n{cleaned_text}")

Original Text:
<br /><br />""The Grudge 2"" has scary sound and visual effects, with the creepy woman and boy, and I have startled a couple of times while watching this movie.      However, the complex screenplay with three subplots is totally confused, making the entwined story a complete mess. There are too much characters and situations,     and in a certain moment I was completely lost with the disconnected and fragmented narrative. In the end, I was completely disappointed with this confused, but also spooky film.     My vote is four.<br /><br />

Cleaned String:
 the grudge 2 has scary sound and visual effects with the creepy woman and boy and i have startled a couple of times while watching this movie however the complex screenplay with three subplots is totally confused making the entwined story a complete mess there are too much characters and situations and in a certain moment i was completely lost with the disconnected and fragmented narrative in the end i was completely dis

### Stopwords

We will remove commonly occurring English words using NLTK's stopwords list.

In [102]:
def remove_stopwords(text: str) -> str:
    # removes common English stopwords with nltk

    stopwords_dict = set(stopwords.words("english"))
    tokens = word_tokenize(text)
    filtered_tokens = [token for token in tokens if token.lower() not in stopwords_dict]
    
    return ' '.join(filtered_tokens) 

In [103]:
# test remove_stopwords
cleaned_stopwords_text = remove_stopwords(cleaned_text)
print(f"Cleaned Text:\n{cleaned_text}\n")
print(f"Removed Stopwords:\n{cleaned_stopwords_text}")

Cleaned Text:
 the grudge 2 has scary sound and visual effects with the creepy woman and boy and i have startled a couple of times while watching this movie however the complex screenplay with three subplots is totally confused making the entwined story a complete mess there are too much characters and situations and in a certain moment i was completely lost with the disconnected and fragmented narrative in the end i was completely disappointed with this confused but also spooky film my vote is four 

Removed Stopwords:
grudge 2 scary sound visual effects creepy woman boy startled couple times watching movie however complex screenplay three subplots totally confused making entwined story complete mess much characters situations certain moment completely lost disconnected fragmented narrative end completely disappointed confused also spooky film vote four


### Normalization: Lemmatization or Stemming
https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

In [104]:
def pos_tagger(word):
    """
    Obtains the Parts of Speech (POS) for NLTK's lemmatizer mapping
    """
    tag = nltk.pos_tag([word])[0][1][0].lower()
    tag_dict = {"j": wordnet.ADJ,
                "n": wordnet.NOUN,
                "v": wordnet.VERB,
                "r": wordnet.ADV}

    # returns the pos tag, defaults to noun
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

lemmatized_text = ' '.join([lemmatizer.lemmatize(w, pos_tagger(w)) for w in word_tokenize(cleaned_stopwords_text)])
print(f"Cleaned Stopwords Text:\n{cleaned_stopwords_text}\n")
print(f"Lemmatized Text:\n{lemmatized_text}")


Cleaned Stopwords Text:
grudge 2 scary sound visual effects creepy woman boy startled couple times watching movie however complex screenplay three subplots totally confused making entwined story complete mess much characters situations certain moment completely lost disconnected fragmented narrative end completely disappointed confused also spooky film vote four

Lemmatized Text:
grudge 2 scary sound visual effect creepy woman boy startle couple time watch movie however complex screenplay three subplots totally confuse make entwine story complete mess much character situation certain moment completely lose disconnect fragment narrative end completely disappointed confuse also spooky film vote four


In [124]:
stemmer = PorterStemmer()
' '.join([stemmer.stem(word) for word in cleaned_stopwords_text.split()])

'grudg 2 scari sound visual effect creepi woman boy startl coupl time watch movi howev complex screenplay three subplot total confus make entwin stori complet mess much charact situat certain moment complet lost disconnect fragment narr end complet disappoint confus also spooki film vote four'

### Preprocessing: Bringing It All Together

In [187]:
CONFIG = {}
CONFIG["NORMALIZER"] = "stemmer"

def preprocess_text(text: str, normalizer: str = None) -> str:
    """
    Clean a review, drop stopwords, and normalize the remaining tokens.

    Args
        - text: text to be preprocessed
        - normalizer:
            - normalization method ("stemmer" or "lemmatizer"; the short
                forms "stem" and "lemma" are also accepted); defaults to
                CONFIG["NORMALIZER"] when not given
            - a stemmer simply chops off word endings, while a lemmatizer
                uses vocabulary and morphological analysis to return the
                base form of the word (lemma)
            - stemming is computationally cheaper and simpler, at the expense of accuracy
    """
    normalizer = normalizer or CONFIG["NORMALIZER"]

    # remove html tags
    text = re.sub(r'<[^>]+>', '', text)

    # replace non-alphanumeric
    text = re.sub("[^a-zA-Z0-9]+", " ", text)

    # replace unnecessary whitespaces
    text = re.sub(r"\s+", " ", text)

    # removes common English stopwords with nltk
    stopwords_dict = set(stopwords.words("english"))
    tokens = word_tokenize(text)
    filtered_tokens = [token for token in tokens if token.lower() not in stopwords_dict]

    # normalize
    if normalizer in ("stemmer", "stem"):
        stemmer = PorterStemmer()
        text = ' '.join([stemmer.stem(word) for word in filtered_tokens])

    elif normalizer in ("lemmatizer", "lemma"):
        lemmatizer = WordNetLemmatizer()
        text = ' '.join([lemmatizer.lemmatize(w, pos_tagger(w)) for w in filtered_tokens])
    else:
        raise ValueError("please enter normalizer as 'stemmer' or 'lemmatizer'")

    return text

In [158]:
# test out function
test_string = '''"House of Games is spell binding. It's so nice to occasionally see films that are perfect tens.\
     There are few movies I've seen that can grip you so quickly. From the opening scene this movie just gets you.<br /><br />\
    I'm trying really hard not to give to much away to those who may not yet have seen this but there will be a FEW SPOILERS SO DON'T READ ANYMORE IF YOU DON'T WANT TO KNOW.<br /><br />\
    I would say House of Games is not just a superb film but is the best movie about con artists I have ever seen-bar none.\
    From the moment the movie is over it begs to be replayed.<br /><br />Lindsay Crouse as Margaret Ford is simply perfection, from her mannerisms to the inflection of her voice\
    she gets into the role immediately. Joe Mantegna was also wonderful. The dialogue in this movie has an unforced almost unscripted quality\
    and these two people communicate as much in a look as they do with their voices. I also loved the way the movie was filmed, in that grainy,\
    surreal type of way, it fit perfectly and helped make the film what it was.<br /><br />There were a few movies I've seen and loved that this reminded me of including\
    The Grifters and The usual Suspects but really, House of games is completely different in it's way. Margaret and Mike are two of the most absorbing characters\
    I've seen on the big screen and not only do they have screen chemistry that is strong and palpable from the moment they meet,\
    but the buildup that starts from the moment they set eyes on each other is electrifying. You know something's going to happen but you have no idea what.\
    And just when you think you've guessed what the ""something"" is, you realize you haven't even scratched the surface....<br /><br />'''

print(f"Original Text:\n{test_string}\n")
print(f"Preprocessed Text with Stemmer:\n{preprocess_text(test_string)}\n")
print(f"Preprocessed Text with Lemmatizer:\n{preprocess_text(test_string, 'lemmatizer')}")

Original Text:
"House of Games is spell binding. It's so nice to occasionally see films that are perfect tens.     There are few movies I've seen that can grip you so quickly. From the opening scene this movie just gets you.<br /><br />    I'm trying really hard not to give to much away to those who may not yet have seen this but there will be a FEW SPOILERS SO DON'T READ ANYMORE IF YOU DON'T WANT TO KNOW.<br /><br />    I would say House of Games is not just a superb film but is the best movie about con artists I have ever seen-bar none.    From the moment the movie is over it begs to be replayed.<br /><br />Lindsay Crouse as Margaret Ford is simply perfection, from her mannerisms to the inflection of her voice    she gets into the role immediately. Joe Mantegna was also wonderful. The dialogue in this movie has an unforced almost unscripted quality    and these two people communicate as much in a look as they do with their voices. I also loved the way the movie was filmed, in that gr

In [193]:
CONFIG["NORMALIZER"] = "stem"

In [192]:
%%timeit 

list(map(preprocess_text, (df_train["review"].to_list())))

13.2 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [194]:
CONFIG["NORMALIZER"] = "lemma"

In [195]:
%%timeit 

list(map(preprocess_text, (df_train["review"].to_list())))

1min 16s ± 2.62 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
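
The lemmatizer run is roughly six times slower than the stemmer run (about 76 s versus 13.2 s). One possible mitigation, not benchmarked here, is to memoize the per-word normalization: `pos_tagger` tags each word in isolation, so the result for a given word never changes, and reviews reuse many of the same words. The sketch below shows the idea with `functools.lru_cache`. `slow_normalize` is a hypothetical stand-in for an expensive call such as `lemmatizer.lemmatize(w, pos_tagger(w))`, and how much this helps in practice depends on how often words repeat.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive function really runs


def slow_normalize(word: str) -> str:
    """Hypothetical stand-in for an expensive per-word normalizer."""
    global call_count
    call_count += 1
    return word.lower().rstrip("s")  # toy rule, NOT a real lemmatizer


@lru_cache(maxsize=None)
def cached_normalize(word: str) -> str:
    # Identical words are normalized once, then served from the cache.
    return slow_normalize(word)


tokens = "movies films movies film movies films".split()
normalized = [cached_normalize(t) for t in tokens]

print(normalized)
print(f"{len(tokens)} tokens, {call_count} expensive calls")
```

With six tokens but only three distinct words, the expensive function runs three times.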


### TF-IDF
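
TF-IDF weights a term by how often it appears in a document, discounted by how many documents contain it. As a rough illustration, the sketch below computes the weights by hand for a toy corpus. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document's vector. The corpus and helper names are made up for the example.

```python
import math
from collections import Counter

docs = [
    "good movie",
    "bad movie",
    "good good film",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term: str) -> float:
    # Smoothed IDF, matching scikit-learn's documented default (smooth_idf=True).
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens: list) -> dict:
    counts = Counter(tokens)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 norm of the row
    return {t: w / norm for t, w in raw.items()}


weights = [tfidf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 3) for t, v in sorted(w.items())})
```

In the second document, "bad" receives a higher weight than "movie" because it occurs in fewer documents.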

In [None]:
def preprocessor(text: str) -> str:
    # first draft; the next cell redefines this function with shorter normalizer names

    # remove html tags
    text = re.sub(r'<[^>]+>', '', text)

    # replace non-alphanumeric
    text = re.sub("[^a-zA-Z0-9]+", " ", text)

    # replace unnecessary whitespaces
    text = re.sub(r"\s+", " ", text)

    normalizer = CONFIG["NORMALIZER"]

    if normalizer == "stemmer": 
        stemmer = PorterStemmer()
        text = ' '.join([stemmer.stem(word) for word in word_tokenize(text)])
    
    elif normalizer == "lemmatizer":
        lemmatizer = WordNetLemmatizer()
        text = ' '.join([lemmatizer.lemmatize(w, pos_tagger(w)) for w in word_tokenize(text)])
    else:
        raise Exception('Please enter CONFIG["NORMALIZER"] as "stemmer" or "lemmatizer"')

    return text

In [215]:
def preprocessor(text: str) -> str:
    '''
    Preprocessor for the input features
    '''

    # remove html tags
    text = re.sub(r'<[^>]+>', '', text)

    # replace non-alphanumeric
    text = re.sub("[^a-zA-Z0-9]+", " ", text)

    # replace unnecessary whitespaces
    text = re.sub(r"\s+", " ", text)

    normalizer = CONFIG["NORMALIZER"]

    if normalizer == "stem": 
        stemmer = PorterStemmer()
        text = ' '.join([stemmer.stem(word) for word in word_tokenize(text)])
    
    elif normalizer == "lemma":
        lemmatizer = WordNetLemmatizer()
        text = ' '.join([lemmatizer.lemmatize(w, pos_tagger(w)) for w in word_tokenize(text)])
    else:
        raise Exception('Please enter CONFIG["NORMALIZER"] as "stem" or "lemma"')

    return text

In [243]:
# tfidf vectorizer for the input features
tfidf_vectorizer = TfidfVectorizer(
    preprocessor=preprocessor,
    analyzer='word',
    stop_words=stopwords.words("english"),
    ngram_range = (1,3)
)
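
With `ngram_range=(1, 3)`, the vectorizer builds features from single words, word pairs, and word triples. The sketch below reproduces that expansion in plain Python for one short token list, to make the growth of the feature space visible. `word_ngrams` is a helper written for this illustration, not a scikit-learn function.

```python
def word_ngrams(tokens: list, min_n: int = 1, max_n: int = 3) -> list:
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# A few stemmed tokens taken from the example review above.
tokens = ["scari", "sound", "visual", "effect"]
features = word_ngrams(tokens)

print(features)
print(len(features))  # 4 unigrams + 3 bigrams + 2 trigrams = 9
```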

In [244]:
# using stemming as normalization
CONFIG["NORMALIZER"] = "stem"

# transform training data and get labels
X_train = tfidf_vectorizer.fit_transform(df_train["review"].values)
y_train = df_train["sentiment"].values.ravel()



In [245]:
# transform validation data
X_valid = tfidf_vectorizer.transform(df_valid["review"].values)
y_valid = df_valid["sentiment"].values.ravel()

## Model Training

### Set up MLFlow

In [284]:
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("imdb_sentiment_active_learning")

2022/07/06 14:15:02 INFO mlflow.tracking.fluent: Experiment with name 'imdb_sentiment_active_learning' does not exist. Creating a new experiment.


<Experiment: artifact_location='/home/bengsoon/DataScienceProjects/IMDB_active_learning/mlflow/mlruns/1', experiment_id='1', lifecycle_stage='active', name='imdb_sentiment_active_learning', tags={}>

### Naive Bayes

In [281]:
X_train = df_train["review"].values
X_train_preprocessed = list(map(preprocessor, X_train))
y_train = df_train["sentiment"].values.ravel()

X_valid = df_valid["review"].values
X_valid_preprocessed = list(map(preprocessor, X_valid))
y_valid = df_valid["sentiment"].values.ravel()

In [285]:
with mlflow.start_run():
    model_type = "MultinomialNB"
    mlflow.set_tag("developer", "Bengsoon")
    mlflow.set_tag("model_type", model_type)
    # log data paths 
    mlflow.log_param("train-data-path", str(data_path / "training.csv"))
    mlflow.log_param("valid-data-path", str(data_path / "validation.csv"))

    # tfidf_vectorizer = TfidfVectorizer(
    #                                 preprocessor=preprocessor,
    #                                 analyzer='word',
    #                                 stop_words=stopwords.words("english"),
    #                                 ngram_range = (1,3)
    # )

    alpha = 1
    mlflow.log_param("alpha", alpha)

    pipeline = make_pipeline(
        TfidfVectorizer(ngram_range=(1,3)),
        MultinomialNB(alpha=alpha)
    )

    pipeline.fit(X_train_preprocessed, y_train)
    mlflow.sklearn.log_model(pipeline, artifact_path="model")

    score = accuracy_score(y_valid, pipeline.predict(X_valid_preprocessed))
    print(f"Accuracy score for {model_type} is {score}")
    mlflow.log_metric("accuracy", score)

Accuracy score for MultinomialNB is 0.8524
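
`alpha` is the additive (Laplace) smoothing term: it is added to every term count so that a word unseen in one class does not force that class's probability to zero. The sketch below is a hand-rolled multinomial Naive Bayes on raw word counts for a toy dataset, meant only to show where `alpha` enters the calculation. It is not how the pipeline above is implemented, and the data and function names are invented for the example.

```python
import math
from collections import Counter, defaultdict

train = [
    ("great fun great cast", "positive"),
    ("fun story", "positive"),
    ("boring mess", "negative"),
    ("boring cast boring story", "negative"),
]
alpha = 1  # same smoothing value as logged above

class_docs = Counter(label for _, label in train)
term_counts = defaultdict(Counter)
for text, label in train:
    term_counts[label].update(text.split())
vocab = {t for counts in term_counts.values() for t in counts}


def log_posterior(text: str, label: str) -> float:
    """Unnormalized log P(label | text) under multinomial Naive Bayes."""
    total = sum(term_counts[label].values())
    score = math.log(class_docs[label] / len(train))  # class prior
    for term in text.split():
        if term not in vocab:
            continue  # ignore out-of-vocabulary words
        # alpha keeps the likelihood non-zero for terms unseen in this class
        likelihood = (term_counts[label][term] + alpha) / (total + alpha * len(vocab))
        score += math.log(likelihood)
    return score


def predict(text: str) -> str:
    return max(class_docs, key=lambda label: log_posterior(text, label))


print(predict("great story"))        # positive
print(predict("boring boring fun"))  # negative
```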


### Random Forest

In [286]:
with mlflow.start_run():
    model_type = "RandomForest"
    mlflow.set_tag("developer", "Bengsoon")
    mlflow.set_tag("model_type", model_type)
    # log data paths 
    mlflow.log_param("train-data-path", str(data_path / "training.csv"))
    mlflow.log_param("valid-data-path", str(data_path / "validation.csv"))

    # tfidf_vectorizer = TfidfVectorizer(
    #                                 preprocessor=preprocessor,
    #                                 analyzer='word',
    #                                 stop_words=stopwords.words("english"),
    #                                 ngram_range = (1,3)
    # )

    params = dict(max_depth=20, n_estimators=100, min_samples_leaf=10, random_state=158)
    mlflow.log_params(params)
    
    pipeline = make_pipeline(
        TfidfVectorizer(ngram_range=(1,3)),
        RandomForestClassifier(**params, n_jobs=-1)
    )

    pipeline.fit(X_train_preprocessed, y_train)
    mlflow.sklearn.log_model(pipeline, artifact_path="model")

    score = accuracy_score(y_valid, pipeline.predict(X_valid_preprocessed))
    print(f"Accuracy score for {model_type} is {score}")
    mlflow.log_metric("accuracy", score)

Accuracy score for RandomForest is 0.8183


In [254]:
test_sentence = "First of all, let's get a few things straight here: a) I AM an anime fan- always has been as a matter of fact (I used to watch Speed Racer all the time in Preschool). b) I DO like several B-Movies because they're hilarious. c) I like the Godzilla movies- a lot.<br /><br />Moving on, when the movie first comes on, it seems like it's going to be your usual B-movie, down to the crappy FX, but all a sudden- BOOM! the anime comes on! This is when the movie goes WWWAAAAAYYYYY downhill.<br /><br />The animation is VERY bad & cheap, even worse than what I remember from SPEED RACER, for crissakes! In fact, it's so cheap, one of the few scenes from the movie I ""vividly"" remember is when a bunch of kids run out of a school... & it's the same kids over & over again! The FX are terrible, too; the dinosaurs look worse than Godzilla. In addition, the transition to live action to animation is unorganized, the dialogue & voices(especially the English dub that I viewed) was horrid & I was begging my dad to take the tape out of the DVD/ VHS player; The only thing that kept me surviving was cracking out jokes & comments like the robots & Joel/Mike on MST3K (you pick the season). Honestly, this is the only way to barely enjoy this movie & survive it at the same time.<br /><br />Heck, I'm planning to show this to another fellow otaku pal of mine on Halloween for a B-Movie night. Because it's stupid, pretty painful to watch & unintentionally hilarious at the same time, I'm giving this movie a 3/10, an improvement from the 0.5/10 I was originally going to give it.<br /><br />(According to my grading scale: 3/10 means Pretty much both boring & bad. As fun as counting to three unless you find a way to make fun of it, then it will become as fun as counting to 15.)"
# the pipeline contains its own TfidfVectorizer, so pass preprocessed text rather than a TF-IDF matrix
pipeline.predict([preprocessor(test_sentence)])

array(['negative'], dtype=object)

In [255]:
accuracy_score(y_valid, rf_clf.predict(X_valid))

0.8261