# 16. Metacritic Video Game Comments
As in the previous notebook, we want to see how our algorithm performs on other datasets. Whereas the IMDB dataset only labels a review as negative or positive, this dataset pairs each comment with the score from 0 to 10 that the user gave the game. Predicting eleven possible scores will probably be harder than predicting two classes. Dataset: https://www.kaggle.com/dahlia25/metacritic-video-game-comments#metacritic_game_user_comments.csv

## Creating the dataframe

In [2]:
import pandas as pd

df = pd.read_csv('Metacritic/metacritic_game_user_comments.csv', index_col=0)
df

Unnamed: 0,Title,Platform,Userscore,Comment,Username
0,The Legend of Zelda: Ocarina of Time,Nintendo64,10,"Everything in OoT is so near at perfection, it...",SirCaestus
1,The Legend of Zelda: Ocarina of Time,Nintendo64,10,I won't bore you with what everyone is already...,Kaistlin
2,The Legend of Zelda: Ocarina of Time,Nintendo64,10,Anyone who gives the masterpiece below a 7 or ...,Jacody
3,The Legend of Zelda: Ocarina of Time,Nintendo64,10,I'm one of those people who think that this is...,doodlerman
4,The Legend of Zelda: Ocarina of Time,Nintendo64,10,This game is the highest rated game on Metacr...,StevenA
...,...,...,...,...,...
283978,Etrian Odyssey Untold: The Millennium Girl,3DS,7,"Extremely similar to EO:4, which obviously isn...",RileyWRussell
283979,Etrian Odyssey Untold: The Millennium Girl,3DS,0,Typical overrated Atlus trash. A game i should...,TemplarGR
283980,Etrian Odyssey Untold: The Millennium Girl,3DS,9,While I find the story mode to have annoying c...,midipon
283981,Etrian Odyssey Untold: The Millennium Girl,3DS,8,"Pretty good, but it certainly lacks the visual...",night4


## Preprocessing
These comments are still raw text, so we run them through our own preprocessing method.

In [3]:
from preprocessing import PreProcessor

pp = PreProcessor()

df.Comment = df.Comment.apply(lambda c: pp.preprocess(str(c)))
df

Unnamed: 0,Title,Platform,Userscore,Comment,Username
0,The Legend of Zelda: Ocarina of Time,Nintendo64,10,everyth oot near perfect realli wonder game hu...,SirCaestus
1,The Legend of Zelda: Ocarina of Time,Nintendo64,10,wont bore everyon alreadi say amaz game your f...,Kaistlin
2,The Legend of Zelda: Ocarina of Time,Nintendo64,10,anyon give masterpiec either hate astound zeld...,Jacody
3,The Legend of Zelda: Ocarina of Time,Nintendo64,10,im one peopl think greatest game time matter q...,doodlerman
4,The Legend of Zelda: Ocarina of Time,Nintendo64,10,game highest rate game metacrit good reason ta...,StevenA
...,...,...,...,...,...
283978,Etrian Odyssey Untold: The Millennium Girl,3DS,7,extrem similar eo obvious isnt bad thing id sa...,RileyWRussell
283979,Etrian Odyssey Untold: The Millennium Girl,3DS,0,typic overr atlu trash game like sinc oldtim h...,TemplarGR
283980,Etrian Odyssey Untold: The Millennium Girl,3DS,9,find stori mode annoy charact intrus stori cla...,midipon
283981,Etrian Odyssey Untold: The Millennium Girl,3DS,8,pretti good certainli lack visual audio polish...,night4
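The `PreProcessor` class comes from the project's own `preprocessing` module, which is not shown in this notebook. Judging only from the output above, it appears to lowercase the text, drop punctuation and stop words, and stem the remaining words. The following is a minimal sketch of the first three steps under that assumption. The function name `simple_preprocess` and the tiny stop-word list are hypothetical, and stemming is left out, so the result will not match the table above exactly.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the real list is unknown).
STOP_WORDS = {"i", "you", "with", "what", "is", "the", "a", "this", "in", "so", "at", "it"}


def simple_preprocess(comment: str) -> str:
    """Lowercase, strip non-letters, and drop stop words (no stemming)."""
    lowered = comment.lower()
    # Keep only letters and whitespace, so "won't" becomes "wont".
    letters_only = re.sub(r"[^a-z\s]", "", lowered)
    tokens = [t for t in letters_only.split() if t not in STOP_WORDS]
    return " ".join(tokens)


print(simple_preprocess("I won't bore you with what everyone is already saying!"))
# prints: wont bore everyone already saying
```

A stemmer applied afterwards would shorten words such as "everyone" and "already", which is what the stems in the table above suggest.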


## Vectorizing

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2))
features = tfidf.fit_transform(df.Comment)
labels = df.Userscore

features

<283983x525101 sparse matrix of type '<class 'numpy.float64'>'
	with 26055655 stored elements in Compressed Sparse Row format>
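Two of the vectorizer settings explain why the matrix has so many columns and how the raw counts are damped. With `ngram_range=(1, 2)` every single word and every pair of adjacent words becomes a feature, `min_df=5` discards features that occur in fewer than five comments, and `sublinear_tf=True` replaces a raw term count `tf` with `1 + log(tf)`. The sketch below illustrates the n-gram and scaling steps in plain Python. The helper names are hypothetical and the tokenization is simplified to whitespace splitting, so it is not scikit-learn's actual implementation.

```python
import math


def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of the requested sizes, shortest first."""
    tokens = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


def sublinear_tf(count):
    """Dampen a raw term count: 1 + log(count), or 0 for absent terms."""
    return 1.0 + math.log(count) if count > 0 else 0.0


print(word_ngrams("game highest rate game"))
print(sublinear_tf(1), sublinear_tf(10))
```

The damping means that a word appearing ten times in a comment counts for roughly 3.3 times, not ten times, as much as a word appearing once.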

## Splitting and Training

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, df.index, test_size=0.33, random_state=0)
    
linearSVCModel = LinearSVC()
linearSVCModel.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

## Predicting

In [6]:
from sklearn import metrics

y_predLinearSVC = linearSVCModel.predict(X_test)

print(metrics.classification_report(y_test, y_predLinearSVC))

              precision    recall  f1-score   support

           0       0.47      0.63      0.54      6643
           1       0.17      0.06      0.09      2350
           2       0.16      0.05      0.07      2028
           3       0.16      0.06      0.09      2337
           4       0.18      0.07      0.10      2600
           5       0.19      0.11      0.14      3313
           6       0.21      0.13      0.16      4035
           7       0.24      0.15      0.19      5675
           8       0.27      0.22      0.24      9904
           9       0.34      0.31      0.32     17610
          10       0.64      0.84      0.72     37220

    accuracy                           0.48     93715
   macro avg       0.27      0.24      0.24     93715
weighted avg       0.42      0.48      0.44     93715
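The support column shows that the scores are far from evenly distributed, which helps when interpreting the accuracy. As a rough point of comparison, the following sketch (not part of the original notebook) takes the support counts from the report above and computes the accuracy of a hypothetical model that always predicts the most common score.

```python
# Support (number of test comments) per score, copied from the report above.
support = {
    0: 6643, 1: 2350, 2: 2028, 3: 2337, 4: 2600, 5: 3313,
    6: 4035, 7: 5675, 8: 9904, 9: 17610, 10: 37220,
}

total = sum(support.values())
majority_score = max(support, key=support.get)

# A model that always predicts the majority score is right exactly
# as often as that score occurs in the test set.
baseline_accuracy = support[majority_score] / total

print(f"Test comments: {total}")
print(f"Majority score: {majority_score}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Always predicting 10 would already be correct for about 40% of the test comments, so the model's accuracy of 0.48 is a gain of roughly eight percentage points over that baseline.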



## Scoring
The accuracy of the model is quite low (0.48). However, predicting an exact score is difficult: the scores are given by users, so a prediction can easily be off by a few points. It is not as straightforward as negative or positive. For this reason, the Kaggle notebook also scored its model in a different way, by the average number of points by which the predicted score differs from the actual score. We will do exactly the same.

In [7]:
import numpy as np

# Average absolute difference between the actual and the predicted scores.
absoluteDifferences = np.abs(y_test.to_numpy() - y_predLinearSVC)
averageDifference = absoluteDifferences.mean()
print(f'Average difference: {averageDifference}')

Average difference: 1.2506749186362909
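To put an average difference of 1.25 in context, it helps to know what a model that ignores the comment entirely would achieve. The sketch below is not part of the original notebook. It reuses the per-score support counts from the classification report and computes the average difference for a hypothetical model that always predicts the same score.

```python
# Support (number of test comments) per score, copied from the
# classification report in the "Predicting" section.
support = {
    0: 6643, 1: 2350, 2: 2028, 3: 2337, 4: 2600, 5: 3313,
    6: 4035, 7: 5675, 8: 9904, 9: 17610, 10: 37220,
}
total = sum(support.values())


def constant_prediction_difference(constant):
    """Average absolute difference when every prediction equals `constant`."""
    return sum(n * abs(score - constant) for score, n in support.items()) / total


# The constant with the smallest average difference is the median score.
best_constant = min(support, key=constant_prediction_difference)

print(f"Always predicting 10: {constant_prediction_difference(10):.2f}")
print(f"Best constant: {best_constant} "
      f"({constant_prediction_difference(best_constant):.2f})")
```

Always predicting 10 is off by about 2.38 points on average, and the best constant guess (9) by about 2.17, so the model's 1.25 is a clear improvement over guessing without reading the comment.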


## Conclusion
The Kaggle notebook was off by an average of 1.22 points per review. With an average of 1.25 points, our model performs almost as well. Just like with the IMDB reviews, our model holds up very well when compared with the top Kaggle solutions.