# Video Games Reviews - NLP Classification Model
* Notebook by Adam Lang
* Date: 5/21/2024
* In this notebook we will build two Naive Bayes classifiers:
1. Multinomial Naive Bayes Classifier
  * The features represent counts of words or other discrete elements.
2. Bernoulli Naive Bayes Classifier
  * The features represent the presence or absence of words (e.g. one-hot encoding).

# Data Science/NLP Problem
  * Predict whether a player or "gamer" will recommend the game to someone or not.

## Import libraries

In [1]:
import nltk
import pandas as pd
import numpy as np
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold
from sklearn.pipeline import Pipeline

## feature extraction -- CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## MultinomialNB, BernoulliNB
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

## evaluation metrics
from sklearn.metrics import accuracy_score, classification_report

In [2]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Read in text data

In [3]:
## dataset read in
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Deep Learning Notebooks/NLP_deep_learning/Text_Classification/Building a Classification Model/NLP_Classification_Modeling/train.csv')
validation = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Deep Learning Notebooks/NLP_deep_learning/Text_Classification/Building a Classification Model/NLP_Classification_Modeling/validation.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Deep Learning Notebooks/NLP_deep_learning/Text_Classification/Building a Classification Model/NLP_Classification_Modeling/test.csv')

In [4]:
## check shape of datasets
print(f"Shape of train data: {train.shape}")
print(f"Shape of validation data: {validation.shape}")
print(f"Shape of test data: {test.shape}")

Shape of train data: (17877, 5)
Shape of validation data: (3831, 5)
Shape of test data: (3831, 5)


In [5]:
# head of train data
train.head()

Unnamed: 0,review_id,title,year,user_review,user_suggestion
0,460,Black Squad,2018.0,"Early Access ReviewVery great shooter, that ha...",1
1,2166,Tree of Savior (English Ver.),2016.0,I love love love playing this game!Super 100%!...,1
2,17242,Eternal Card Game,2016.0,Early Access ReviewAs a fan of MTG and Hearths...,1
3,6959,Tactical Monsters Rumble Arena,2018.0,Turn based strategy game similiar to FF Tactic...,1
4,8807,Yu-Gi-Oh! Duel Links,2017.0,This game has an insanely huge download for be...,0


In [6]:
# view head of test data
test.head()

Unnamed: 0,review_id,title,year,user_review,user_suggestion
0,12053,Infestation: The New Z,2016.0,Unbelievable that this rehash copy and paste t...,0
1,12536,SMITE®,2015.0,I can't recommened this game in its current st...,0
2,747,Heroes & Generals,2016.0,Early Access ReviewThis game is constantly evo...,0
3,3214,World of Warships,2018.0,I play this game because it scratches an itch....,0
4,4036,World of Guns: Gun Disassembly,2016.0,"Finally, a game for people like us to enjoy! P...",1


In [7]:
# view head of validation data
validation.head()

Unnamed: 0,review_id,title,year,user_review,user_suggestion
0,8604,Dungeon Defenders II,2015.0,Early Access Review* Ok Played the first DD lo...,1
1,20407,Minion Masters,2017.0,Product received for freeEarly Access ReviewSo...,1
2,636,Magic Duels,2018.0,Game is extremely unfun to play unless you wan...,0
3,10217,Robocraft,2016.0,Early Access ReviewThis used to be an amazing ...,0
4,9564,Realm of the Mad God,2014.0,"With stunning visuals, an immersive storyline,...",1


In [8]:
## value_counts of the TARGET class - 'user_suggestion'
test['user_suggestion'].value_counts()

user_suggestion
1    2187
0    1644
Name: count, dtype: int64

Observation:
* There is a mild imbalance in the target class of this split: more users recommend the game (1) than do not (0).
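
As a rough sanity check, the class counts above imply a majority-class baseline: a model that always predicts "recommend" would already be right for a bit more than half of the reviews in this split. The sketch below only redoes that arithmetic; the variable names are illustrative and do not come from the notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 2187, 0: 1644}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(f"Total reviews: {total}")
print(f"Majority class: {majority_class}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier trained later in the notebook should be judged against this baseline rather than against 50%.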

## Load spacy model

In [9]:
## load spacy model and disable NER as we don't need it
nlp = spacy.load("en_core_web_sm", disable=["ner"])

## Preprocessing Function
This is what it does:
1. Creates an empty list `processed_texts` to store the processed versions of the input texts.
2. Iterates through the input texts using spaCy's `nlp.pipe` method with `n_process=-1`, which uses all available CPU cores.
3. Lemmatization and stop word removal: for each text (`doc` object), it
    * extracts the tokens (words)
    * lemmatizes each token, converting words to their base form ("running" --> "run")
    * converts all tokens to lowercase
    * uses the `is_alpha` attribute to keep only purely alphabetic tokens (numbers and punctuation are dropped)
    * removes stop words like "the" and "a" using `nlp.Defaults.stop_words`.
4. Text joining: joins the remaining lemmatized tokens back into a single string.
5. Storage: appends the processed text to the `processed_texts` list.
6. Return: the function returns the list of preprocessed texts.

In [10]:
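## function to preprocess text
def preprocess_text(texts):
  # store the processed version of each input text
  processed_texts = []
  # stream the texts through the spaCy pipeline on all available cores
  for doc in nlp.pipe(texts, n_process=-1):
    # keep alphabetic, non-stop-word tokens as lowercased lemmas
    # NOTE: the stop-word check uses the un-lowercased lemma, so capitalized
    # stop words such as "I" slip through (see the "i" in the output below);
    # compare token.lemma_.lower() instead to drop them as well
    lemmatized_tokens = [token.lemma_.lower() for token in doc if token.is_alpha and token.lemma_ not in nlp.Defaults.stop_words]

    # Join the lemmatized tokens into a string
    processed_text = " ".join(lemmatized_tokens)

    processed_texts.append(processed_text)

  return processed_texts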

In [11]:
## apply the preprocess_text function to user_review column
train['user_review'] = preprocess_text(train['user_review'])
validation['user_review'] = preprocess_text(validation['user_review'])
test['user_review'] = preprocess_text(test['user_review'])

In [12]:
# view first 5 rows of train data -- user review
train['user_review'].head()

0    early access reviewvery great shooter original...
1    i love love love play lot class choose bound s...
2    early access reviewas fan mtg hearthstone fun ...
3    turn base strategy game similiar ff tactic day...
4    game insanely huge download phone game blast v...
Name: user_review, dtype: object

## Vectorization
* One hot encoding (OHE)
    * We set `min_df=0.001`, so any word that appears in fewer than 0.1% of the documents (reviews) is left out of the vocabulary built by `CountVectorizer`.
        * This value is a hyperparameter that we can tune.
* CountVectorizer
  * Recall that `CountVectorizer` counts how many times each vocabulary word occurs in each document.
  * One hot encoding, however, uses binary values 0 or 1 to indicate whether a word is present or not.
  * Count vectorization does not account for semantic information.

In [13]:
## instantiate the CountVectorizer (binary=True gives one hot encoding)
count_vectorizer_ohe = CountVectorizer(min_df=0.001, binary=True)

In [14]:
# fit_transform 'user_review' column
count_vectorizer_ohe_train = count_vectorizer_ohe.fit_transform(train['user_review'])

## Build Naive Bayes Classifier

In [15]:
# Naive Bayes Classifier
naive_bayes_classifier = BernoulliNB()

In [16]:
# fit the naive bayes model on the train data and report training accuracy
naive_bayes_classifier.fit(count_vectorizer_ohe_train, train['user_suggestion'])
naive_bayes_classifier.score(count_vectorizer_ohe_train, train['user_suggestion'])

0.8258096996140292

In [17]:
# transform the validation data with the fitted vectorizer and report validation accuracy
count_vectorizer_ohe_val = count_vectorizer_ohe.transform(validation['user_review'])
naive_bayes_classifier.score(count_vectorizer_ohe_val, validation['user_suggestion'])

0.8120595144870791

## Observations
* The train (0.826) and validation (0.812) accuracy scores of the NB classifier are very similar. This suggests the model generalizes to unseen data without much overfitting.
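
Accuracy alone can hide class-specific behavior when the classes are imbalanced. As a hedged sketch (the labels below are invented, not the notebook's predictions), `classification_report`, which is imported above but not otherwise used, breaks the score down per class:

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy labels invented for illustration: 1 = recommend, 0 = do not recommend.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)

# output_dict=True returns nested dicts keyed by the class label as a string.
report = classification_report(y_true, y_pred, output_dict=True)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall for class 1: {report['1']['recall']:.2f}")
print(f"Recall for class 0: {report['0']['recall']:.2f}")
```

On the real data, the same call would take `validation['user_suggestion']` and `naive_bayes_classifier.predict(count_vectorizer_ohe_val)` as its two arguments.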

## Count Vectorizer
* Can we improve the accuracy of our model using a CountVectorizer instead of one-hot-encoding?

In [18]:
## instantiate the count_vectorizer
count_vectorizer = CountVectorizer(min_df=0.001)

In [19]:
## fit_transform the user_review feature column
count_vectorizer_train = count_vectorizer.fit_transform(train['user_review'])

## Build Naive Bayes Classifier with count vectorization

In [20]:
# Naive Bayes Classifier
naive_bayes_classifier = MultinomialNB()

In [21]:
# create the naive bayes model for train data
naive_bayes_classifier.fit(count_vectorizer_train, train['user_suggestion'])
naive_bayes_classifier.score(count_vectorizer_train, train['user_suggestion'])

0.8388991441517033

In [22]:
# transform the validation data with the fitted vectorizer and report validation accuracy
count_vectorizer_val = count_vectorizer.transform(validation['user_review'])
naive_bayes_classifier.score(count_vectorizer_val, validation['user_suggestion'])

0.8258940224484469

## Observations
* Using count vectorization with `MultinomialNB` improved validation accuracy from 0.812 (one hot encoding) to 0.826.
* Validation accuracy (0.826) is only slightly lower than training accuracy (0.839), which suggests the model is not overfitting much.
* We cannot say that OHE or count vectorization is better than the other in general. Each has its own use case.

# TF-IDF
* Term frequency - inverse document frequency.
* What will this do?
    * Evaluates how **relevant a word is to a document or corpus.**
    * Balances the term frequency with the uniqueness of the term across all documents or the corpus.
* Let's break these terms down.
1. Term Frequency (TF)
    * Measures the frequency of a word in a document.
    * Equation:
        * `TF = number of times term t appears in document d / total number of terms in document d`
    * Captures the most important words based on how often they appear.

2. Inverse Document Frequency (IDF) - 2 parts
    * a) Document Frequency (DF) - the fraction of documents in which a word appears.
        * Equation: `DF = number of documents with term t / total number of documents D`
        * Example: "life" appears in all 3 documents, so DF = 3/3 = 1.
        * Example: "beautiful" appears in 1 document, so DF = 1/3.
    * b) Inverse Document Frequency (IDF)
        * Assesses the importance of the term across the corpus.
        * Equation: `IDF = log(total number of documents D / number of documents with term t)`
        * The raw ratio can become very large for rare terms, so we take its logarithm. Implementations often also add 1 to the denominator to avoid a division by zero for terms that appear in no document.
        * Example: "life" is log(3/3) = 0.
        * Example: "beautiful" is log(3/1) = 1.099 (natural log; with log base 10 it would be 0.477).
            * Thus rarer terms are given more weight.
    * c) Calculations for TF-IDF
        * `TF-IDF = TF * IDF`
        * Example: "life"
            * TF = 4/10 = 0.4, IDF = 0
            * TF-IDF = 0.4 * 0 = 0
        * Example: "beautiful"
            * TF = 1/10 = 0.1, IDF = 1.099
            * TF-IDF = 0.1 * 1.099 = 0.110

  * A term's TF-IDF score increases with the number of times it occurs in a specific document, but is offset by how many documents in the corpus contain that term.
      * This allows us to account for words that are rare but may be more relevant.
      * This is very useful in `information retrieval` and `text mining`:
          * topic modeling
          * document summarization
          * search engine optimization (SEO)
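
The toy figures used above (3 documents; "life" appears 4 times and "beautiful" once in a 10-term document) can be recomputed in a few lines. This is a sketch under stated assumptions: it uses the natural logarithm and the plain, unsmoothed IDF. The smoothed variant printed alongside is the default documented for scikit-learn's `TfidfVectorizer`, which additionally uses raw counts for TF and L2-normalizes each document vector, so its output will not match these hand calculations exactly.

```python
import math

n_documents = 3
doc_length = 10  # total number of terms in the example document

# term -> (occurrences in the example document, number of documents containing it)
terms = {"life": (4, 3), "beautiful": (1, 1)}

tf_idf = {}
for term, (count_in_doc, doc_freq) in terms.items():
    tf = count_in_doc / doc_length
    idf = math.log(n_documents / doc_freq)  # natural log, unsmoothed
    # Smoothed IDF, shown for comparison only.
    smoothed_idf = math.log((1 + n_documents) / (1 + doc_freq)) + 1
    tf_idf[term] = tf * idf
    print(f"{term}: TF={tf:.2f} IDF={idf:.3f} "
          f"TF-IDF={tf_idf[term]:.3f} smoothed IDF={smoothed_idf:.3f}")
```

A term that occurs in every document gets an unsmoothed IDF of 0, so its TF-IDF is 0 no matter how often it appears in a single document.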

# Application of TF-IDF

In [23]:
# import TFIDF vectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
# instantiate the tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.001)

In [25]:
# create a naive bayes model for train data using tfidf
tfidf_vectorizer_train = tfidf_vectorizer.fit_transform(train['user_review'])
naive_bayes_classifier.fit(tfidf_vectorizer_train, train['user_suggestion'])
naive_bayes_classifier.score(tfidf_vectorizer_train, train['user_suggestion'])

0.8414722828215024

In [26]:
# create naive bayes model for validation data with tfidf
tfidf_vectorizer_val = tfidf_vectorizer.transform(validation['user_review'])
naive_bayes_classifier.score(tfidf_vectorizer_val, validation['user_suggestion'])

0.8209344818585226
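
The fit / transform / score steps repeated in the cells above can also be bundled with `Pipeline`, which is imported at the top of the notebook but not otherwise used. The following is a minimal sketch on toy data invented for illustration; the step names `vectorizer` and `classifier` are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration: 1 = recommend, 0 = do not recommend.
toy_reviews = ["great fun game", "love this game", "terrible boring game", "awful waste"]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier in one call and
# reuses the fitted vocabulary automatically at prediction time.
nb_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
nb_pipeline.fit(toy_reviews, toy_labels)

predictions = nb_pipeline.predict(["fun love", "awful boring"])
print(predictions)
```

With the real data, `nb_pipeline.fit(train['user_review'], train['user_suggestion'])` followed by `nb_pipeline.score(validation['user_review'], validation['user_suggestion'])` would replace the separate `fit_transform`, `transform` and `score` calls.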

# N-grams
* Contiguous sequences of N words that appear together in a text.
  * N=1 --> unigram
  * N=2 --> bigram
  * N=3 --> trigram
  * etc....
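
As a small illustration, the helper below (a hypothetical name, not part of any library) lists the n-grams of a short token sequence. The `ngram_range` parameter of the scikit-learn vectorizers builds the same kind of features internally.

```python
def extract_ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "actually play game".split()

unigrams = extract_ngrams(tokens, 1)
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)

print(unigrams)  # ['actually', 'play', 'game']
print(bigrams)   # ['actually play', 'play game']
print(trigrams)  # ['actually play game']
```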

## N-grams with TFIDF

In [27]:
## instantiate the TfidfVectorizer with ngram_range param
## setting min document frequency to 0.001 and the range of ngrams are 1 to 3
tfidf_ngram_vectorizer = TfidfVectorizer(min_df=0.001, ngram_range=(1,3))

## Building a Naive Bayes Model with n-grams and TFIDF

In [28]:
## fit and transform and score a naive bayes model
tfidf_ngram_vectorizer_train = tfidf_ngram_vectorizer.fit_transform(train['user_review'])
naive_bayes_classifier.fit(tfidf_ngram_vectorizer_train, train['user_suggestion'])
naive_bayes_classifier.score(tfidf_ngram_vectorizer_train, train['user_suggestion'])

0.859204564524249

Observations
* Training accuracy rises from 0.841 with the unigram TF-IDF model to 0.859 once bigrams and trigrams are added.

In [29]:
## print some of the ngram features to visualize impact of ngrams
tfidf_ngram_vectorizer.get_feature_names_out()[150:160]

array(['actually good', 'actually like', 'actually look', 'actually play',
       'actually play game', 'actually pretty', 'actually use',
       'actually want', 'actually work', 'actualy'], dtype=object)

Observations
* We have unigrams, bigrams and trigrams as tokens.

In [30]:
## obtain validation scores on classifier
tfidf_ngram_vectorizer_val = tfidf_ngram_vectorizer.transform(validation['user_review'])
naive_bayes_classifier.score(tfidf_ngram_vectorizer_val, validation['user_suggestion'])

0.828765335421561

In [31]:
## instantiate countvectorizer
count_ngram_vectorizer = CountVectorizer(min_df=0.001, ngram_range=(1,3))

In [32]:
### now let's compare a count vectorizer using ngrams and naive bayes classifier
count_ngram_vectorizer_train = count_ngram_vectorizer.fit_transform(train['user_review'])
naive_bayes_classifier.fit(count_ngram_vectorizer_train, train['user_suggestion'])
naive_bayes_classifier.score(count_ngram_vectorizer_train, train['user_suggestion'])

0.8490238854393914

In [33]:
# score the naive bayes model on validation data with count vectorizer and ngrams
count_ngram_vectorizer_val = count_ngram_vectorizer.transform(validation['user_review'])
naive_bayes_classifier.score(count_ngram_vectorizer_val, validation['user_suggestion'])

0.8258940224484469

## Observations
* The validation score was slightly lower for the n-gram count vectorizer (0.826) than for the n-gram TF-IDF vectorizer (0.829).
* A lower score here does not mean the count vectorizer is inferior to TF-IDF in general. The choice between TF-IDF and the count vectorizer depends on:
  * **Document uniqueness:** TF-IDF emphasizes words that are unique to specific documents, which is ideal when differentiating documents is the priority.
  * **Corpus size and diversity:** TF-IDF tends to help more with large and diverse datasets because it downplays words that are common across documents.
  * **Relevance of common words:** if frequent common words are informative for the final analysis, the count vectorizer would be the better choice.

## Advanced Text preprocessing techniques that are important to consider
* Part of Speech Tagging
* Named Entity Recognition