# Naive Bayes Classifier for board game reviews
## By Chesley Cecil

The Naive Bayes algorithm is implemented entirely with functions from sklearn.naive_bayes
https://scikit-learn.org/stable/modules/naive_bayes.html
but the overall program had to be adapted to work with text data instead of the iris dataset used in the documentation example

In [1]:
#Chesley Cecil
#12/10/2020

#https://scikit-learn.org/stable/modules/naive_bayes.html
#Changes:
#Changed test_size to 0.2
#Changed input to be TF-IDF vectorized to allow for text input
#Added a multi-nomial naive bayes classifier
#Changed output to be accuracy instead of incorrect count

import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from joblib import dump, load

# Loading the data
The data comes from https://www.kaggle.com/jvanelteren/boardgamegeek-reviews?select=bgg-15m-reviews.csv and required a lot of pre-processing to be usable. It is hosted on my GitHub in several parts because it was too big to store as a single file

In [2]:
#Load data parts

#on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0 (needs pandas 1.3 or later)
part_0 = pd.read_csv("https://raw.githubusercontent.com/ccecil2/ccecil2.github.io/master/downloads/new_reviews_part_0.csv", on_bad_lines="skip")

part_1 = pd.read_csv("https://raw.githubusercontent.com/ccecil2/ccecil2.github.io/master/downloads/new_reviews_part_1.csv", on_bad_lines="skip")

part_2 = pd.read_csv("https://raw.githubusercontent.com/ccecil2/ccecil2.github.io/master/downloads/new_reviews_part_2.csv", on_bad_lines="skip")

part_3 = pd.read_csv("https://raw.githubusercontent.com/ccecil2/ccecil2.github.io/master/downloads/new_reviews_part_3.csv", on_bad_lines="skip")

part_4 = pd.read_csv("https://raw.githubusercontent.com/ccecil2/ccecil2.github.io/master/downloads/new_reviews_part_4.csv", on_bad_lines="skip")

print(part_0[0:5])

print(part_1[0:5])

print(part_2[0:5])

print(part_3[0:5])

print(part_4[0:5])

   Unnamed: 0  rating                                            comment
0           0    10.0  I tend to either love or easily tire of co-op ...
1           1    10.0  This is an amazing co-op game.  I play mostly ...
2           2    10.0  Hey! I can finally rate this game I've been pl...
3           3    10.0  Love it- great fun with my son. 2 plays so far...
4           4    10.0  Fun, fun game. Strategy is required, but defin...
   Unnamed: 0  rating                                            comment
0      125000     6.5                          My first step into gaming
1      125001     6.5  Still has a certain charm. Lolts of better stu...
2      125002     6.5  I like to say that my first wargame was Guadal...
3      125003     6.5                             classic but too simple
4      125004     6.5  This was a game we played relentlessly until w...
   Unnamed: 0  rating                                            comment
0      252218     7.0  A decent deck-builder that h
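The five read_csv calls differ only in the part number, so they could also be written as a loop. The following is a minimal sketch, assuming pandas 1.3 or later (for on_bad_lines). The in-memory CSV strings and the name fake_parts are hypothetical stand-ins for the hosted files so that the example runs offline:

```python
import io

import pandas as pd

# Hypothetical stand-ins for the hosted CSV parts. The second one contains a
# malformed row with too many fields.
fake_parts = [
    ",rating,comment\n0,10.0,Great co-op game\n1,9.0,Lots of fun\n",
    ",rating,comment\n2,6.5,Classic\n3,6.5,too,many,fields\n4,7.0,Decent\n",
]

# With the real data, each source would be a URL ending in
# new_reviews_part_{i}.csv for i in range(5) instead of a StringIO buffer.
parts = [
    pd.read_csv(io.StringIO(text), on_bad_lines="skip")
    for text in fake_parts
]

for part in parts:
    print(part.shape)
```

The malformed row is dropped rather than raising an error, which is the same behaviour the notebook relies on when loading the real files.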

# Merging the data
Because the data was stored and loaded in parts, it has to be merged into a single DataFrame for use later in the program. Additionally, the ratings (values on a scale from 0 to 10) are rounded to whole numbers to simplify the data, since many of the values were unreasonably precise (one was 9.01799999999999, and others were just as bad)

In [3]:
#Merge data parts

#pd.concat replaces the repeated DataFrame.append calls (append was removed in pandas 2.0)
data = pd.concat([part_0, part_1, part_2, part_3, part_4], ignore_index=True)

#Rounds the ratings to make it easier to train the models
data["rating"] = data["rating"].map(lambda x: round(x))

print(data[0:5])

   Unnamed: 0  rating                                            comment
0           0      10  I tend to either love or easily tire of co-op ...
1           1      10  This is an amazing co-op game.  I play mostly ...
2           2      10  Hey! I can finally rate this game I've been pl...
3           3      10  Love it- great fun with my son. 2 plays so far...
4           4      10  Fun, fun game. Strategy is required, but defin...
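One detail of the rounding step is worth knowing: Python's built-in round rounds halves to the nearest even integer, so a rating of 6.5 becomes 6 while 7.5 becomes 8. A small sketch on illustrative values (the Series below is hypothetical apart from 9.01799999999999, which is quoted above):

```python
import pandas as pd

# Illustrative ratings, including two exact halves
ratings = pd.Series([9.01799999999999, 6.5, 7.5, 10.0])

# Same rounding call as in the notebook
rounded = ratings.map(lambda x: round(x))

print(rounded.tolist())  # [9, 6, 8, 10]
```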


# Extracting the data from the DataFrames
Extracting a column from a DataFrame returns a Series. Sklearn generally accepts Series as input, but here the resulting Series are converted to plain Python lists before being passed to Sklearn

In [4]:
#Turns the data into lists
X = data["comment"].to_list()

y = data["rating"].to_list()

print(X[0:5])
print(y[0:5])

["I tend to either love or easily tire of co-op games.  Pandemic joins Knizia's LoTR as my favorite true co-op.  It edges LoTR out merely in time to set-up and play.  LoTR can be an undertaking to explain enough details so that players enjoy their first time through the game, while Pandemic is fast enough that even if the players don't quite get everything that is going on, they can try again immediately.", "This is an amazing co-op game.  I play mostly with my wife and this is a game that I can't really imagine getting tired of.  Win or lose, I usually want to play again immediately.  On the Brink and the fan-made expansion from on ArtsCow add much more variety and good game play, only enhancing an already great game.", 'Hey! I can finally rate this game I\'ve been playtesting on and off for a couple years. I really like Pandemic, it\'s the best cooperative game I\'ve played since Lord of the Rings. The key is in the pacing; the game successfully ratchets up the tension as infections 

# Splitting the data
After all of the data has been extracted, it has to be split into training and test sets. The ratio of training to test data is 8:2 (test_size=0.2), since Naive Bayes generally benefits from more training data. The split is performed twice, once for each of the two pre-processing methods used in the next section. Because both calls use random_state=0, the two splits contain exactly the same rows

In [5]:
#Base

X_train_base, X_test_base, y_train_base, y_test_base = train_test_split(X, y, test_size=0.2, random_state=0)

In [6]:
#Sub

X_train_sub, X_test_sub, y_train_sub, y_test_sub = train_test_split(X, y, test_size=0.2, random_state=0)
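Because both calls pass the same data and the same random_state, the "base" and "sub" splits should be identical. A minimal sketch on hypothetical toy data (X_demo and y_demo are made-up names and values) that checks this:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the review texts and ratings
X_demo = ["review %d" % i for i in range(10)]
y_demo = [i % 3 for i in range(10)]

first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

X_train, X_test, y_train, y_test = first
print(len(X_train), len(X_test))  # 8 2
print(first == second)  # True
```

In other words, one split could be shared by both vectorizers without changing the results.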

# Pre-processing the data
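Now that the data has been split into training and test sets, it has to be pre-processed into TF-IDF matrices so that it can be used with the Sklearn Naive Bayes functions. Both vectorizers use min_df=0.001, which drops terms that appear in fewer than 0.1% of the training documents. The "base" vectorizer leaves sublinear_tf at its default of False, while the "sub" vectorizer sets it to True, which replaces each term frequency tf with 1 + log(tf)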

In [7]:
vectorizer_base = TfidfVectorizer(min_df=0.001)

train_base = vectorizer_base.fit_transform(X_train_base)
test_base = vectorizer_base.transform(X_test_base)

In [8]:
vectorizer_sublinear = TfidfVectorizer(min_df=0.001, sublinear_tf=True)

train_sub = vectorizer_sublinear.fit_transform(X_train_sub)
test_sub = vectorizer_sublinear.transform(X_test_sub)
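The effect of sublinear_tf can be seen on a single hypothetical document. In this sketch, use_idf=False and norm=None are set only to expose the raw term-frequency part; the notebook itself keeps the defaults for both:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# One made-up document in which "fun" occurs three times
docs = ["fun fun fun game"]

# Disable IDF weighting and normalisation to look at term frequency only
plain = TfidfVectorizer(use_idf=False, norm=None)
sub = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True)

plain_tf = plain.fit_transform(docs).toarray()
sub_tf = sub.fit_transform(docs).toarray()

print(plain.vocabulary_)  # column index of each term
print(plain_tf)           # raw counts
print(sub_tf)             # counts replaced by 1 + log(count)
```

With sublinear scaling, the repeated word counts for 1 + ln(3), roughly 2.1, instead of 3, so heavily repeated words carry less weight.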

# Further pre-processing
The vectorizers return the TF-IDF matrix as a SciPy sparse matrix. MultinomialNB can be fitted on sparse input, but GaussianNB requires dense input, so the matrices are converted to regular dense arrays with the toarray method

In [9]:
#toarray returns a NumPy ndarray; todense returns an np.matrix, which newer Sklearn versions may reject
train_base_dense = train_base.toarray()
test_base_dense = test_base.toarray()

train_sub_dense = train_sub.toarray()
test_sub_dense = test_sub.toarray()
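As far as the Sklearn documentation indicates, the dense conversion matters mainly for the Gaussian model, because MultinomialNB can be fitted on a sparse matrix directly. A minimal sketch on hypothetical toy data (X_sparse, y_demo and the *_demo models are made-up names) that checks both cases:

```python
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Tiny made-up feature matrix in sparse form, with two classes
X_sparse = csr_matrix([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 1.0]])
y_demo = [0, 1, 0, 1]

# MultinomialNB can be fitted on the sparse matrix directly
multi_demo = MultinomialNB().fit(X_sparse, y_demo)
print(multi_demo.predict(X_sparse))

# GaussianNB is expected to reject sparse input
try:
    GaussianNB().fit(X_sparse, y_demo)
    gauss_accepts_sparse = True
except (TypeError, ValueError):
    gauss_accepts_sparse = False
print(gauss_accepts_sparse)

# After conversion to a dense array, GaussianNB fits without complaint
gauss_demo = GaussianNB().fit(X_sparse.toarray(), y_demo)
print(gauss_demo.predict(X_sparse.toarray()))
```

For a dataset of this size, keeping the multinomial runs on the sparse matrices would also save a considerable amount of memory.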

# Model types
The two types of Naive Bayes models that are being examined are Gaussian and Multinomial. According to the Sklearn documentation, Multinomial Naive Bayes is used for text classification, so it is presumed to work better than Gaussian Naive Bayes for this example

In [10]:
gauss = GaussianNB()
multi = MultinomialNB()

# Testing
With everything set up, the models can be tested on the data to determine their accuracies

In [13]:
#Base gauss

y_pred_base_gauss = gauss.fit(train_base_dense, y_train_base).predict(test_base_dense)

base_gauss_acc = ((y_test_base == y_pred_base_gauss).sum() / test_base_dense.shape[0])

print("Accuracy: %f" % (base_gauss_acc))

Accuracy: 0.116057


In [14]:
#Sub gauss

y_pred_sub_gauss = gauss.fit(train_sub_dense, y_train_sub).predict(test_sub_dense)

sub_gauss_acc = ((y_test_sub == y_pred_sub_gauss).sum() / test_sub_dense.shape[0])

print("Accuracy: %f" % (sub_gauss_acc))

Accuracy: 0.115650


In [15]:
#Base multi

y_pred_base_multi = multi.fit(train_base_dense, y_train_base).predict(test_base_dense)

base_multi_acc = ((y_test_base == y_pred_base_multi).sum() / test_base_dense.shape[0])

print("Accuracy: %f" % (base_multi_acc))

Accuracy: 0.352770


In [16]:
#Sub multi

y_pred_sub_multi = multi.fit(train_sub_dense, y_train_sub).predict(test_sub_dense)

sub_multi_acc = ((y_test_sub == y_pred_sub_multi).sum() / test_sub_dense.shape[0])

print("Accuracy: %f" % (sub_multi_acc))

Accuracy: 0.352941
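The accuracy formula used in the cells above (correct predictions divided by the number of test rows) matches what sklearn.metrics.accuracy_score computes. A small sketch on hypothetical labels (y_true_demo and y_pred_demo are made up):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true ratings (list) and predictions (array), as in the notebook
y_true_demo = [10, 8, 7, 6]
y_pred_demo = np.array([10, 8, 5, 6])

# Manual calculation in the same style as the cells above
manual_acc = (y_true_demo == y_pred_demo).sum() / len(y_true_demo)

# Equivalent library call
library_acc = accuracy_score(y_true_demo, y_pred_demo)

print(manual_acc, library_acc)  # 0.75 0.75
```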


For the purposes of loading the model later, the best model (multinomial sublinear in my testing) and its corresponding vectorizer are saved into files

In [None]:
#dump(multi, 'naive_bayes.joblib') 

In [None]:
#dump(vectorizer_sublinear, 'vectorize.joblib')
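Saving and reloading could look roughly like the sketch below. It trains a tiny hypothetical model (texts, labels, vec_demo and model_demo are made up for illustration) and writes to a temporary directory, so it does not touch the real naive_bayes.joblib or vectorize.joblib files:

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data standing in for the review comments and ratings
texts = ["great fun game", "boring slow game", "great game", "slow and boring"]
labels = [10, 3, 10, 3]

vec_demo = TfidfVectorizer(sublinear_tf=True)
model_demo = MultinomialNB().fit(vec_demo.fit_transform(texts), labels)

# Save both objects, then load them back as a later program would
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "naive_bayes.joblib")
    vec_path = os.path.join(tmp, "vectorize.joblib")
    dump(model_demo, model_path)
    dump(vec_demo, vec_path)
    loaded_model = load(model_path)
    loaded_vec = load(vec_path)

# New text must go through the saved vectorizer before prediction
new_review = ["great fun"]
prediction = loaded_model.predict(loaded_vec.transform(new_review))
print(prediction)
```

The vectorizer has to be saved together with the model because the model only understands the feature columns that this particular fitted vectorizer produces.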

# Results
Now that all of the models have run, their accuracies can be evaluated to determine which was the best

In [17]:
#Determine which is best

accuracies = np.array([base_gauss_acc, sub_gauss_acc, base_multi_acc, sub_multi_acc])

max_index = np.argmax(accuracies)

if max_index == 0:
    print("The best option was the Gaussian Naive Bayes without sublinear term frequencies")
elif max_index == 1:
    print("The best option was the Gaussian Naive Bayes with sublinear term frequencies")
elif max_index == 2:
    print("The best option was the Multinomial Naive Bayes without sublinear term frequencies")
else:
    print("The best option was the Multinomial Naive Bayes with sublinear term frequencies")

The best option was the Multinomial Naive Bayes with sublinear term frequencies
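The same selection can be written more compactly by pairing each description with its accuracy. A sketch that uses the four accuracies printed in the Testing section (the dictionary and the name best_name are illustrative):

```python
# Accuracies copied from the outputs in the Testing section
accuracies = {
    "Gaussian Naive Bayes without sublinear term frequencies": 0.116057,
    "Gaussian Naive Bayes with sublinear term frequencies": 0.115650,
    "Multinomial Naive Bayes without sublinear term frequencies": 0.352770,
    "Multinomial Naive Bayes with sublinear term frequencies": 0.352941,
}

# Pick the description whose accuracy is largest
best_name = max(accuracies, key=accuracies.get)

print("The best option was the " + best_name)
```

This avoids keeping the order of the array and the if/elif branches in sync by hand.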
