# I. Introduction

Text classification has seen a surge in use in recent years. With technology now part of nearly every aspect of life, it is not surprising that solutions have been developed for some of the most common problems these technologies bring, and text classification is one of them. In the early days of the internet, even a simple task such as comment flagging was hard to tackle: blogs and chat forums probably struggled to make sure that the exchanges taking place on them were not harmful to their users. Today, text classification algorithms can take care of such vital tasks. Text classification is an ever-growing area that benefits developers and users alike, through applications such as comment flagging and email spam filtering, but also through the identification of negative and positive reviews, which gives companies new insights about their products and their clients. This categorisation of text data has produced solutions not only for web technologies but for everyday life in general.

## Objectives

This motivates the development of a text classifier capable of identifying negative, neutral and positive statements. The first step was to select an adequate source of training data. Many options were available, but ultimately it was decided to work with a dataset consisting of user reviews for the popular 2020 life simulation video game ‘Animal Crossing’. This decision follows directly from the main purpose of the project, which was to bring some novelty to the already widely explored topic of text classification: the idea became to develop a basic text classifier trained specifically on video game review data, so that it would be able to classify reviews on this kind of topic. The objectives for this project were:
1. To outperform, at least minimally, already existing solutions.
2. To classify text data correctly into the selected categories.
3. Although not making a novel discovery, to investigate an already well-explored area from a different point of view.
4. To build on top of already existing technologies and solutions to contribute some new insights in the chosen area.
5. To make use of TensorFlow models to construct the text classifier.

## Dataset

As previously stated, the dataset is a compilation of user reviews for the video game ‘Animal Crossing’, a work by Thomas Mock posted on Kaggle by Jesse Mostipak. It consists of 2999 unique reviews posted by users of the game on the review site ‘Metacritic’. Although no licence is mentioned anywhere in the documentation, it can be inferred from the line ‘Potential Analyses: Reviews: Sentiment analysis, text analysis, scores, date effect’ that it is safe to use for educational purposes. The dataset consists of a single table with four columns: the grade assigned by the user, their username, the text of the review and the date the review was published. Only two of those columns are used in this work. The grade column stores integer values ranging from 0 to 10 that denote the user's numerical evaluation of the game, and the text column stores the user's written opinion of the game. The two unused columns store an alphanumeric value representing the name of the user on the site and the publication date.

## Evaluation Methodology

To evaluate the success of the text classifier, its accuracy and precision are compared against those of an already existing model, in this case a Naive Bayes classifier implemented with scikit-learn. The literature offers many models the classifier could have been compared against, but Naive Bayes was selected because it was already familiar from the content explored in the lectures and because it is one of the most common implementations, which makes it a good point of comparison. The specific baseline used is a work by Preethi Thakur published on Medium on the 23rd of October 2022 [3]. Accuracy was selected because it is the most widely used metric for comparing the performance of models: it measures how often the model makes the right prediction. Precision complements it by measuring, for each class, how many of the reviews assigned to that class actually belong to it. Together, these two metrics already give a good idea of the performance of the models. Both are calculated with the accuracy_score and precision_score functions provided by the scikit-learn library.
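
To make the two metrics concrete, the following minimal sketch computes accuracy and support-weighted precision by hand. The six labels are made up for illustration and are not results from this work; the weighting is meant to mirror what a weighted precision average reports, with each class weighted by how often it occurs among the true labels.

```python
from collections import Counter

# Made-up labels for illustration only: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

def accuracy(y_true, y_pred):
    # share of predictions that match the true label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_precision(y_true, y_pred):
    # per-class precision = correct predictions of a class / all predictions of that class,
    # averaged with each class weighted by its number of true occurrences
    support = Counter(y_true)
    predicted = Counter(y_pred)
    total = 0.0
    for label, n_true in support.items():
        correct = sum(t == p == label for t, p in zip(y_true, y_pred))
        precision = correct / predicted[label] if predicted[label] else 0.0
        total += n_true * precision
    return total / len(y_true)

acc = accuracy(y_true, y_pred)
prec = weighted_precision(y_true, y_pred)
print(acc, prec)
```

In this toy case four of six predictions are correct, while the weighted precision is 0.75, which shows that the two metrics can diverge when classes differ in size.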




# II. Implementation

## Libraries

**The first step is to import all the necessary libraries.**

In [1]:
import pandas as pd 
import numpy as np 
import tensorflow as tf
import sklearn
import os
import pickle
import keras
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import SpatialDropout1D, Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout





## Data Set

In [2]:
#import data set and print the first rows
reviews = pd.read_csv('user_reviews.csv')

reviews.head()

Unnamed: 0,grade,user_name,text,date
0,4,mds27272,My gf started playing before me. No option to ...,2020-03-20
1,5,lolo2178,"While the game itself is great, really relaxin...",2020-03-20
2,0,Roachant,My wife and I were looking forward to playing ...,2020-03-20
3,0,Houndf,We need equal values and opportunities for all...,2020-03-20
4,0,ProfessorFox,BEWARE! If you have multiple people in your h...,2020-03-20


## Preprocessing

**In its current state the data set is not suitable for the model. The following section describes the processing steps applied to bring it into the required format.**


### Null Values

The first step is to check for null values.

In [3]:
#make a copy of the data set and check for null values
c_reviews=reviews.copy()
c_reviews.isnull().sum()

grade        0
user_name    0
text         0
date         0
dtype: int64

Luckily this dataset does not contain null values, so it is possible to move on to the next step.

### Sentiment Column

A key piece of information needed for sentiment analysis is the sentiment associated with each review, which can be 'positive', 'negative' or 'neutral'. An additional column containing this information therefore has to be added to the data set.

The data set contains a grade column which, as the name suggests, holds the grade given with each review, with values ranging from 0 to 10. To assign the sentiment tag, a grade greater than 5 is considered positive, a grade equal to 5 neutral, and a grade lower than 5 negative.

In [4]:
#Count how many reviews correspond to each grade
c_reviews['grade'].value_counts()

0     1158
10     752
1      255
9      253
2      131
4      105
3       98
8       91
5       78
6       44
7       34
Name: grade, dtype: int64

In [5]:
#initialise a function able to assign the appropriate sentiment tag according to the grade given by the user
def sentiment_function(row):
    # grades run from 0 to 10: 5 is neutral, lower grades are negative, higher grades positive
    if row['grade'] == 5:
        return 'neutral'
    elif row['grade'] < 5:
        return 'negative'
    return 'positive'

In [6]:
#apply the previous function to a new column 'sentiment' and print data set
c_reviews['sentiment'] = c_reviews.apply(sentiment_function, axis=1)
c_reviews.head()

Unnamed: 0,grade,user_name,text,date,sentiment
0,4,mds27272,My gf started playing before me. No option to ...,2020-03-20,negative
1,5,lolo2178,"While the game itself is great, really relaxin...",2020-03-20,neutral
2,0,Roachant,My wife and I were looking forward to playing ...,2020-03-20,negative
3,0,Houndf,We need equal values and opportunities for all...,2020-03-20,negative
4,0,ProfessorFox,BEWARE! If you have multiple people in your h...,2020-03-20,negative


In [7]:
#Then print the count for reviews according to sentiment
c_reviews['sentiment'].value_counts()

negative    1747
positive    1174
neutral       78
Name: sentiment, dtype: int64
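
The counts above are strongly imbalanced, which matters when reading accuracy figures later on. As a quick worked check, the sketch below turns the printed counts into class shares and computes the accuracy that a trivial classifier would reach by always predicting the most frequent class. The trivial classifier is a hypothetical reference point, not part of this work.

```python
# Sentiment counts as printed in the output above
counts = {'negative': 1747, 'positive': 1174, 'neutral': 78}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# accuracy of a hypothetical classifier that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
majority_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"always-'{majority_label}' accuracy: {majority_accuracy:.3f}")
```

Neutral reviews make up under 3% of the data, so a model can score well on accuracy while rarely predicting that class.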

### Reviews text cleaning

The next step standardises the text: it modifies or removes linguistic elements that could confuse the model.

In [8]:
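#initialise a function to lower case the data, remove non-alphanumeric characters, numbers and extra spaces
def cleaning_function(text):
    text = str(text).lower()
    # strip @mentions, any character that is not a letter, digit, space or tab, and URLs
    text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)
    # drop every token that contains a digit
    text = re.sub(r'\w*\d\w*', '', text)
    # collapse repeated whitespace
    text = " ".join(text.split())
    return text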
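
To illustrate what the cleaning steps do to a single review, here is a self-contained restatement of the function applied to a made-up sentence. The sample text and the name `clean_text` are assumptions for illustration, and the mention pattern is written as `@[A-Za-z0-9]+`.

```python
import re

def clean_text(text):
    # lower-case, then strip mentions, punctuation and other non-alphanumeric characters, and URLs
    text = str(text).lower()
    text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)
    # drop any token that contains a digit
    text = re.sub(r"\w*\d\w*", "", text)
    # collapse runs of whitespace
    return " ".join(text.split())

sample = "Great game!!! 10/10, see https://example.com  NOW"  # made-up review
cleaned = clean_text(sample)
print(cleaned)
```

Note that the rating "10/10" disappears entirely: the slash is stripped first and the remaining digits are then removed as a numeric token, so scores written inside the review text do not reach the model.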

In [9]:
#Apply the function to the reviews on the data set
c_reviews['text']=c_reviews['text'].apply(lambda x:cleaning_function(x))
c_reviews.head()

Unnamed: 0,grade,user_name,text,date,sentiment
0,4,mds27272,my gf started playing before me no option to c...,2020-03-20,negative
1,5,lolo2178,while the game itself is great really relaxing...,2020-03-20,neutral
2,0,Roachant,my wife and i were looking forward to playing ...,2020-03-20,negative
3,0,Houndf,we need equal values and opportunities for all...,2020-03-20,negative
4,0,ProfessorFox,beware if you have multiple people in your hou...,2020-03-20,negative


### Stop words

In this step the aim is to remove 'stop words', words that do not convey relevant meaning.

The NLTK stop word list is used, but with the words that express negation (such as 'no', 'nor' and 'not') removed from it, so that this meaning is not lost.

In [10]:
#download the NLTK stop word list and tokenizer data, then print the default stop words
nltk.download('stopwords')
nltk.download('punkt')
print(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '


In [11]:
#modified stopwords list
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", 
             "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 
             'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 
             'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 
             'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 
             'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 
             'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 
             'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 
             'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 
             'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 
             't', 'can', 'will', 'just', 'don', 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 
             'y', 'ma']

#filter text in reviews
def filter_stopwords(text):
    words = text.split() 
    filtered = [word for word in words if word not in stop_words] 
    return ' '.join(filtered)
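
The effect of keeping negations out of the stop word list can be seen on a small example. This is a sketch only: the abbreviated list `demo_stop_words`, the function name and the two sentences are made up for illustration.

```python
# Abbreviated, illustrative stop word list; 'no', 'nor' and 'not' are deliberately absent
demo_stop_words = {'i', 'this', 'is', 'a', 'the', 'and', 'did', 'at', 'all'}

def filter_stopwords_demo(text):
    # keep every whitespace-separated word that is not a stop word
    return ' '.join(word for word in text.split() if word not in demo_stop_words)

first = filter_stopwords_demo("this is not a good game")
second = filter_stopwords_demo("i did not like the island at all")
print(first)
print(second)
```

Had 'not' been filtered out as well, the first sentence would reduce to "good game" and read as praise.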


In [12]:
c_reviews['text']=c_reviews['text'].apply(lambda x:filter_stopwords(x))
c_reviews.head()

Unnamed: 0,grade,user_name,text,date,sentiment
0,4,mds27272,gf started playing no option create island guy...,2020-03-20,negative
1,5,lolo2178,game great really relaxing gorgeous cant ignor...,2020-03-20,neutral
2,0,Roachant,wife looking forward playing game released bou...,2020-03-20,negative
3,0,Houndf,need equal values opportunities players island...,2020-03-20,negative
4,0,ProfessorFox,beware multiple people house want play game no...,2020-03-20,negative


### Lemmatization

"Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Lemmatization is similar to stemming but it brings context to the words. So it links words with similar meanings to one word."[1]


In [13]:
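#initialize the lemmatizer (it needs the WordNet corpus: nltk.download('wordnet'))
lemmatizer = WordNetLemmatizer()

#lemmatize every token of a review; without a part-of-speech tag each token is treated as a noun
def lemmatize(text):
    words = nltk.word_tokenize(text)
    sentence = ' '.join([lemmatizer.lemmatize(w) for w in words])
    return sentence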

In [14]:
c_reviews['text']=c_reviews['text'].apply(lambda x: lemmatize(x))
c_reviews.head()

Unnamed: 0,grade,user_name,text,date,sentiment
0,4,mds27272,gf started playing no option create island guy...,2020-03-20,negative
1,5,lolo2178,game great really relaxing gorgeous cant ignor...,2020-03-20,neutral
2,0,Roachant,wife looking forward playing game released bou...,2020-03-20,negative
3,0,Houndf,need equal value opportunity player island wif...,2020-03-20,negative
4,0,ProfessorFox,beware multiple people house want play game no...,2020-03-20,negative


In [15]:
#Shuffle data set and drop non-necessary columns
r = c_reviews[['text', 'sentiment']]
r = r.sample(frac=1).reset_index(drop=True)

## Baseline

**To assess the performance of the model, it is compared against Naive Bayes classifiers implemented with scikit-learn. Naive Bayes was selected as the baseline because it is a simple classifier that is fast to train, even on large amounts of data.**

In [16]:
#first isolate required data from the cleaned data set
nb = c_reviews[['text', 'sentiment']]
nb.head()

Unnamed: 0,text,sentiment
0,gf started playing no option create island guy...,negative
1,game great really relaxing gorgeous cant ignor...,neutral
2,wife looking forward playing game released bou...,negative
3,need equal value opportunity player island wif...,negative
4,beware multiple people house want play game no...,negative


*The following 4 cells of code are taken from [3].*

In [17]:
#Convert sentiment values to integer values (assign returns a new DataFrame, which avoids the SettingWithCopyWarning)
nb = nb.assign(sentiment=nb['sentiment'].replace({'positive':2, 'neutral':1, 'negative':0}))
nb.head()


Unnamed: 0,text,sentiment
0,gf started playing no option create island guy...,0
1,game great really relaxing gorgeous cant ignor...,1
2,wife looking forward playing game released bou...,0
3,need equal value opportunity player island wif...,0
4,beware multiple people house want play game no...,0


In [18]:
#convert the review text to bag-of-words vectors
from sklearn.feature_extraction.text import CountVectorizer

#keep only the 800 most frequent words as features
cv = CountVectorizer(max_features=800)
# vectorizing words and storing in variable X (predictor)
X = cv.fit_transform(nb['text']).toarray()
# X size: (number of reviews, 800)
X.shape
# target
y = nb.iloc[:,-1].values
# y size: (number of reviews,)
y.shape
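
The bag-of-words idea can be sketched without scikit-learn. The following is a simplified stand-in for the vectorizer, not its actual implementation: the three documents are made up, and the vocabulary is limited to 2 words instead of 800 to keep the output readable.

```python
from collections import Counter

docs = ["game great great", "game bad", "great island"]  # made-up, already cleaned reviews
max_features = 2  # the baseline above uses 800

# keep only the most frequent words across the corpus, in alphabetical column order
corpus_counts = Counter(word for doc in docs for word in doc.split())
vocabulary = sorted(word for word, _ in corpus_counts.most_common(max_features))

# one row per document, one column per vocabulary word, each cell a word count
X_demo = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
print(X_demo)
```

Words outside the retained vocabulary ('bad' and 'island' here) are simply ignored, which is the trade-off made when capping the number of features.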

In [19]:
#split data into train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
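
As a worked check of the split sizes, the sketch below computes how many of the 2999 reviews end up in each set. It assumes that the test share is rounded up and the remainder goes to training, which is how scikit-learn is understood to handle a fractional test_size.

```python
import math

n_reviews = 2999   # size of the data set
test_size = 0.2

# assumption: the test share is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_reviews)
n_train = n_reviews - n_test

# one misclassified test review changes the accuracy by this amount
accuracy_step = 1 / n_test

print(n_train, n_test, round(accuracy_step, 4))
```

With 600 test reviews, an accuracy of 0.84 corresponds to 504 correctly classified reviews.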

In [20]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Naive Bayes Classifiers
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()
# fitting and predicting
gnb.fit(X_tr, y_tr)
y_pred_gnb = gnb.predict(X_te)
mnb.fit(X_tr, y_tr)
y_pred_mnb = mnb.predict(X_te)
bnb.fit(X_tr, y_tr)
y_pred_bnb = bnb.predict(X_te)
# accuracy and weighted precision scores
print("Gaussian", accuracy_score(y_te, y_pred_gnb))
print("Gaussian", precision_score(y_te, y_pred_gnb, average='weighted'))
print("Multinomial", accuracy_score(y_te, y_pred_mnb))
print("Multinomial", precision_score(y_te, y_pred_mnb, average='weighted'))
print("Bernoulli", accuracy_score(y_te, y_pred_bnb))
print("Bernoulli", precision_score(y_te, y_pred_bnb, average='weighted'))

Gaussian 0.5383333333333333
Gaussian 0.7249794922963937
Multinomial 0.84
Multinomial 0.8321622416033276
Bernoulli 0.735
Bernoulli 0.8176220934470443


## Text Representation

**Before the data can be used by a sentiment analysis classifier, it has to be converted into a numerical format the model can process. The following section describes that process.**

### Tokenizing

The Keras tokenizer is used to vectorize the text of the reviews by transforming it into sequences of integers that the model can process:

1. Initialize the tokenizer.
2. Update its internal vocabulary based on the reviews.
3. Transform the text into sequences of integers.
4. Pad the sequences into a 2D NumPy array.

In [21]:
tokenizer = Tokenizer(num_words=6000, oov_token='<OOV>')
tokenizer.fit_on_texts(r['text'])
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(r['text'])
sequences_padded = pad_sequences(sequences, maxlen=100, truncating='post')
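
The four steps can be sketched in plain Python to show what the resulting integers mean. This is a simplified illustration rather than the Keras implementation. The texts and helper names are made up, and it assumes the conventions used above: index 0 for padding, index 1 for out-of-vocabulary words, truncation at the end and padding at the front.

```python
from collections import Counter

texts = ["great game", "great island game", "great"]  # made-up cleaned reviews
OOV_INDEX = 1  # reserved for out-of-vocabulary words; 0 is reserved for padding

# more frequent words get smaller indices, starting at 2
counts = Counter(word for text in texts for word in text.split())
word_index_demo = {word: i for i, (word, _) in enumerate(counts.most_common(), start=2)}

def to_sequence(text):
    # unknown words fall back to the OOV index
    return [word_index_demo.get(word, OOV_INDEX) for word in text.split()]

def pad(sequence, maxlen):
    # cut from the end (truncating='post'), then left-pad with zeros (default padding='pre')
    sequence = sequence[:maxlen]
    return [0] * (maxlen - len(sequence)) + sequence

seq = to_sequence("great new island")
short_padded = pad(seq, maxlen=5)
long_padded = pad([2, 3, 4, 2, 3, 4, 2], maxlen=5)
print(word_index_demo)
print(seq, short_padded, long_padded)
```

Every review therefore becomes a row of exactly `maxlen` integers, which is what allows the reviews to be stacked into a single 2D array.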

### One-Hot Encoding

Convert the sentiment labels to one-hot encoding, a way of converting categorical data into a numerical format that machine learning algorithms can work with [2].

In [22]:
sentiment = pd.get_dummies(r['sentiment']).values
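
A minimal pure-Python sketch of one-hot encoding follows. It assumes that the columns follow the alphabetically sorted label names, which is the order pandas is understood to use for string labels; the label list and helper names are made up for illustration.

```python
labels = ['negative', 'neutral', 'positive', 'negative']  # made-up sentiment column

# assumption: columns follow the sorted label names
classes = sorted(set(labels))

def one_hot(label):
    # 1 in the position of the label's class, 0 elsewhere
    return [1 if label == c else 0 for c in classes]

def decode(vector):
    # the position of the largest value gives back the class
    return classes[max(range(len(vector)), key=vector.__getitem__)]

encoded = [one_hot(label) for label in labels]
decoded = decode([0.1, 0.2, 0.7])
print(classes)
print(encoded)
print(decoded)
```

Under this ordering, column 0 stands for negative, column 1 for neutral and column 2 for positive, which is the mapping the prediction function relies on when it takes the argmax of the model output.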

## Model

**The next section concerns the definition of the model for sentiment analysis. It is a TensorFlow sequential model consisting of 6 Keras layers: an embedding layer, a one-dimensional convolutional layer, a pooling layer, two dense layers and a dropout layer.**

*Divide the encoded reviews into training and test data: 80% for training and 20% for testing.*

In [23]:
x_train, x_test, y_train, y_test = train_test_split(sequences_padded, sentiment, test_size=0.2)

*initialise sequential model incrementally*

In [24]:
model = Sequential()
model.add(Embedding(6000, 100, input_length=100))
model.add(Conv1D(32, 3, activation='selu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(32, activation='selu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 100)          600000    
                                                                 
 conv1d (Conv1D)             (None, 98, 32)            9632      
                                                                 
 global_max_pooling1d (Glob  (None, 32)                0         
 alMaxPooling1D)                                                 
                                                                 
 dense (Dense)               (None, 32)                1056      
                                                                 
 dropout (Dropout)           (None, 32)                0         
                                                                 
 dense_1 (Dense)             (None, 3)                 99        
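
The parameter counts and the convolution output length in the summary can be reproduced by hand. The worked calculation below uses the layer sizes from the model definition and assumes the convolution runs with stride 1 and no padding; the variable names are chosen for illustration.

```python
vocab_size, embed_dim, seq_len = 6000, 100, 100
filters, kernel_size = 32, 3
dense_units, n_classes = 32, 3

embedding_params = vocab_size * embed_dim                   # one vector per vocabulary index
conv_params = kernel_size * embed_dim * filters + filters   # kernel weights plus one bias per filter
conv_output_len = seq_len - kernel_size + 1                 # no padding, stride 1
dense_params = filters * dense_units + dense_units          # weights plus biases
output_params = dense_units * n_classes + n_classes         # weights plus biases

total_params = embedding_params + conv_params + dense_params + output_params
print(embedding_params, conv_params, conv_output_len, dense_params, output_params, total_params)
```

Almost all of the parameters sit in the embedding layer, while the pooling and dropout layers add none.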
                                                      

In [25]:
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x1ebfdaaadc0>

In [26]:
y_pred = np.argmax(model.predict(x_test), axis=-1)
print("Accuracy:", accuracy_score(np.argmax(y_test, axis=-1), y_pred))
print("Precision:", precision_score(np.argmax(y_test, axis=-1), y_pred, average='weighted'))

Accuracy: 0.8616666666666667
Precision: 0.8485022982635343


### Making Predictions with Sentiment Analysis Classifier

**Once the model has been trained and its accuracy calculated, it can be used to analyze actual text and infer its sentiment.**

In [27]:
#first save the model
model.save('my_sentinalysis_model.h5')
with open('tknzr.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)



In [28]:
#load model and tokenizer
import keras

model = keras.models.load_model('my_sentinalysis_model.h5')
with open('tknzr.pickle', 'rb') as handle:
    tknzr = pickle.load(handle)

In [29]:
#initialize function to do sentiment analysis
def sentinalysis(text):
    # Tokenize and pad the input text with the same settings used for training
    sequence = tknzr.texts_to_sequences([text])
    sequence = pad_sequences(sequence, maxlen=100, truncating='post')

    # Make a prediction using the trained model
    sentiment_prediction = model.predict(sequence)[0]
    # class order follows the one-hot columns: negative, neutral, positive
    labels = ['Negative', 'Neutral', 'Positive']
    return labels[np.argmax(sentiment_prediction)]

In [30]:
eg_text = "I hate the game"
sentiment = sentinalysis(eg_text)
print(sentiment)

Negative


In [31]:
eg_text = "I love the game"
sentiment = sentinalysis(eg_text)
print(sentiment)

Positive


# III. Conclusions

## Evaluation

With the model implemented, its accuracy and precision can be compared against those of the Naive Bayes baseline. The baseline contains three different implementations of Naive Bayes, namely Gaussian, Multinomial and Bernoulli, with accuracy and precision scores of 0.5383 and 0.7250, 0.8400 and 0.8322, and 0.7350 and 0.8176 respectively. The model implemented in this work reached an accuracy of 0.8617 and a precision of 0.8485, so it performs better than all three Naive Bayes implementations on both metrics. Multinomial Naive Bayes is the closest contender, trailing by only about 0.02 in accuracy and about 0.016 in precision.
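
The margins quoted above can be recomputed directly from the printed scores. The short sketch below does so; the dictionary layout and variable names are chosen for illustration.

```python
# (accuracy, weighted precision) as printed by the baseline cell
scores = {
    'Gaussian NB':    (0.5383333333333333, 0.7249794922963937),
    'Multinomial NB': (0.84,               0.8321622416033276),
    'Bernoulli NB':   (0.735,              0.8176220934470443),
}
# scores printed for the convolutional model
model_accuracy, model_precision = 0.8616666666666667, 0.8485022982635343

# margin of the convolutional model over each baseline
margins = {
    name: (model_accuracy - acc, model_precision - prec)
    for name, (acc, prec) in scores.items()
}
for name, (d_acc, d_prec) in margins.items():
    print(f"{name}: accuracy +{d_acc:.4f}, precision +{d_prec:.4f}")
```

The margin over Multinomial Naive Bayes corresponds to roughly a dozen reviews out of the 600 in the test set.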

## Summary and Conclusions

In light of the results obtained, the project succeeded in its aim of developing a text classifier capable of identifying the sentiment of reviews, slightly beating the accuracy and precision of the existing baseline models. When used to predict the sentiment of new example sentences, it also assigned them the correct sentiment. The model in this work pales in comparison with the current state of the art in text classification. Nevertheless, it may still represent a contribution, however minimal, to the area of text classification, and more specifically to review classification for video games. The work could possibly also be transferred to other areas, such as monitoring comment exchanges on social sites, customer support for the aforementioned video games and product analysis. The work carried out is not out of the ordinary and can easily be replicated with minimal previous knowledge of Python and natural language processing. It could also be replicated in other programming languages, but the libraries used here would not be available in the same form, which would make the task far more complicated than the work already developed.

[1] GeeksForGeeks Contributors. 2023. Python | Lemmatization with NLTK. GeeksForGeeks. https://www.geeksforgeeks.org/python-lemmatization-with-nltk/

[2] DeepChecks. 2023. DeepChecks Glossary: One-hot Encoding. DeepChecks. https://deepchecks.com/glossary/one-hot-encoding/#:~:text=One%2Dhot%20encoding%20in%20machine,algorithms%20to%20improve%20prediction%20accuracy.

[3] Preethi Thakur. 2022. Sentiment Analysis with Naive Bayes Classifier | NLTK | Python Code | Machine Learning. Medium. https://medium.com/@tpreethi/undesrtand-naive-bayes-algorithm-in-simple-explanation-with-python-code-part-2-a2b91cbbf637