# Investigating Amazon games reviews

This dataset is part of the large Amazon review dataset, taken from the games category. It is a balanced subsample: the number of reviews for each score is approximately the same. The goal of this study is to predict the review score from the review text.

## Data preparation

Read the data from the file:

In [1]:
import json
import numpy as np

texts = []
scores = []

with open("./data/Games_5876.json", "r") as json_file:
    for line in json_file:
        # Each line is one JSON record; avoid shadowing the built-in `str`.
        record = json.loads(line)
        texts.append(record["text"])
        scores.append(record["overall"])
        
texts = np.asarray(texts)
scores = np.asarray(scores)

Number of rows in the data file:

In [2]:
len(texts)

5876

Each row in the dataset consists of a review and a score. The review text is the short summary, converted to uppercase, followed by the review body. The score is a value from 1.0 to 5.0. Example:

In [3]:
print(texts[0])
print(scores[0])

THEY ARE JUST OK
These slides are just as they advertised. Now that we have them, I would have preferred to buy a set with just one specimen per slide in a bigger kit. This set is just adequate.
2.0
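
Because the summary occupies the first line of each review, it can be separated from the body when needed. The following is a minimal sketch, assuming the summary and the body are always separated by a single newline, as in the example above. `split_review` is a hypothetical helper and is not part of the original notebook.

```python
def split_review(text):
    """Split a review into (summary, body) at the first newline."""
    summary, _, body = text.partition("\n")
    return summary, body


# Shortened version of the example review shown above.
example = "THEY ARE JUST OK\nThese slides are just as they advertised."
summary, body = split_review(example)
print(summary)
print(body)
```

If a review contains no newline, the whole text is returned as the summary and the body is empty.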


Number of reviews in each group:

In [4]:
for i in range(1, 6):
    print("Score {}, number of items: {}".format(i, sum(scores == i)))

Score 1, number of items: 1175
Score 2, number of items: 1175
Score 3, number of items: 1176
Score 4, number of items: 1175
Score 5, number of items: 1175


Other studies report poor results for five-class score prediction, usually with accuracy no higher than 60%. The main reason is that adjacent scores, e.g. 1 and 2 or 4 and 5, are difficult to tell apart. Instead, we treat the task as binary classification between negative and positive reviews: scores 1 and 2 are negative, scores 4 and 5 are positive. Score 3 is dropped, since it does not clearly belong to either group.

In [5]:
keep = scores != 3   # drop neutral reviews
texts = texts[keep]
scores = scores[keep]
scores = scores > 3  # True = positive (4-5), False = negative (1-2)

for i in [True, False]:
    print("Score {}, number of items: {}".format(i, sum(scores == i)))

Score True, number of items: 2350
Score False, number of items: 2350
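
The same mapping can be written out in plain Python, which makes the rule explicit without NumPy masks. This is only an illustrative sketch on toy data; `binarize` and the toy reviews are hypothetical and do not come from the dataset.

```python
def binarize(texts, scores):
    """Drop neutral reviews (score 3) and map scores to True (4-5) / False (1-2)."""
    kept_texts, labels = [], []
    for text, score in zip(texts, scores):
        if score == 3:
            continue  # neutral reviews are ignored
        kept_texts.append(text)
        labels.append(score > 3)
    return kept_texts, labels


toy_texts = ["awful", "meh", "fine", "great", "bad"]
toy_scores = [1.0, 3.0, 4.0, 5.0, 2.0]
kept, labels = binarize(toy_texts, toy_scores)
print(kept)    # ['awful', 'fine', 'great', 'bad']
print(labels)  # [False, True, True, False]
```

Texts and labels are filtered together, so they stay aligned after the neutral reviews are removed.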


Try combinations of different classifiers and vectorizers. First, define the vectorizers.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizers = {}
vectorizers["CountVectorizer"] = CountVectorizer()
vectorizers["TfidfVectorizer"] = TfidfVectorizer()
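
To illustrate the difference between the two representations, the sketch below computes raw counts and TF-IDF weights by hand on a toy corpus. It assumes whitespace tokenization (a simplification), a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 row normalization, which corresponds to the `smooth_idf=True` and `norm='l2'` settings printed for the best model at the end of this notebook. The corpus and all variable names are hypothetical.

```python
import math
from collections import Counter

docs = ["great game great fun", "boring game", "great fun"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Raw term counts: what a count-based representation stores.
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Smoothed inverse document frequency: rare terms get a larger weight.
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

print(vocab)
print(counts)
print([[round(v, 3) for v in row] for row in tfidf])
```

In the second toy document, "boring" and "game" both occur once, but "boring" receives the larger TF-IDF weight because it appears in fewer documents.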

Try different classifiers (Multinomial Naive Bayes and a linear SVM), each with its own parameter grid:

In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

classifiers = {}
classifiers["MultinomialNB"] = MultinomialNB()
classifiers["LinearSVC"] = LinearSVC(dual=False)

param_grids = {}
param_grids["MultinomialNB"] = {'clf__alpha': (0.1, 1.0)}
param_grids["LinearSVC"] = {'clf__penalty': ('l1', 'l2')}


Loop through all combinations of vectorizers and classifiers and search over the parameter grid for each. Use 5-fold cross-validation and compute the mean accuracy and AUC for every parameter combination.

In [8]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

np.random.seed(123)
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
results = {}

for vect_name, vectorizer in vectorizers.items():
    for cls_name, classifier in classifiers.items():
        name = "{} {}".format(vect_name, cls_name)
        print("Calculating {}".format(name))
        pipeline = Pipeline([
            ('vect', vectorizer),
            ('clf', classifier)
        ])
        
        # Merge the shared vectorizer grid with the classifier-specific grid.
        param_grid = {'vect__max_df': (0.5, 1.0), **param_grids[cls_name]}
        gs = GridSearchCV(
            pipeline,
            param_grid=param_grid,
            scoring=scoring,
            cv=5,
            refit='AUC',  # the best pipeline is selected and refit by mean AUC
            return_train_score=True,
        )
        gs.fit(texts, scores)
        result = gs.cv_results_ 
        print("{}: accuracy = {}; AUC = {};".format(name, result["mean_test_Accuracy"], result["mean_test_AUC"]))
        print("Best score = {}; Best params = {}".format(gs.best_score_, gs.best_params_))
        results[name] = gs.best_estimator_

Calculating CountVectorizer MultinomialNB
CountVectorizer MultinomialNB: accuracy = [ 0.85489362  0.85680851  0.87595745  0.87638298]; AUC = [ 0.92433228  0.92530376  0.94066455  0.94053871];
Best score = 0.9406645540968763; Best params = {'clf__alpha': 1.0, 'vect__max_df': 0.5}
Calculating CountVectorizer LinearSVC
CountVectorizer LinearSVC: accuracy = [ 0.86191489  0.86297872  0.86765957  0.86787234]; AUC = [ 0.93338705  0.93326845  0.93613762  0.93742055];
Best score = 0.9374205522861022; Best params = {'clf__penalty': 'l2', 'vect__max_df': 1.0}
Calculating TfidfVectorizer MultinomialNB
TfidfVectorizer MultinomialNB: accuracy = [ 0.84531915  0.84382979  0.86808511  0.87085106]; AUC = [ 0.92880851  0.92957266  0.94768493  0.94784427];
Best score = 0.9478442734268901; Best params = {'clf__alpha': 1.0, 'vect__max_df': 1.0}
Calculating TfidfVectorizer LinearSVC
TfidfVectorizer LinearSVC: accuracy = [ 0.88489362  0.88170213  0.88404255  0.88659574]; AUC = [ 0.9533789   0.95375373  0.9560
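
Each accuracy and AUC array printed above has four entries because every grid combines two values of `vect__max_df` with two values of one classifier parameter. The sketch below enumerates such a grid with the standard library. The ordering it uses (keys sorted alphabetically, last key varying fastest) is an assumption about how the search enumerates candidates; it appears consistent with the output above, where the third AUC for CountVectorizer MultinomialNB is the highest and the reported best parameters are `clf__alpha = 1.0`, `vect__max_df = 0.5`. `expand_grid` is a hypothetical helper.

```python
from itertools import product


def expand_grid(grid):
    """Return every parameter combination, iterating over sorted keys."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


# Same grid as the MultinomialNB runs: shared vectorizer grid + classifier grid.
nb_grid = {'vect__max_df': (0.5, 1.0), 'clf__alpha': (0.1, 1.0)}
combos = expand_grid(nb_grid)
for i, combo in enumerate(combos):
    print(i, combo)
```

With this ordering, index 2 corresponds to `{'clf__alpha': 1.0, 'vect__max_df': 0.5}`.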

The best model is the combination of TfidfVectorizer and LinearSVC: it reaches an AUC of 0.95722 and an accuracy of 0.8866.
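
Accuracy and AUC measure different things: accuracy depends on a fixed decision threshold, while AUC only depends on how well positive reviews are ranked above negative ones. A minimal plain-Python sketch of both metrics follows. It uses the pairwise definition of AUC (the probability that a random positive is scored above a random negative, ties counted as one half), which is equivalent to the area under the ROC curve but slow for large samples. The labels and decision values are hypothetical toy data, not results from this notebook.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [True, True, True, False, False, False]
scores = [2.1, 0.4, -0.2, 0.1, -0.9, -1.5]  # hypothetical decision values
predictions = [s > 0 for s in scores]       # threshold at zero

acc = accuracy(labels, predictions)
auc = roc_auc(labels, scores)
print(round(acc, 4))  # 0.6667
print(round(auc, 4))  # 0.8889
```

Here two reviews fall on the wrong side of the threshold, so accuracy is 4/6, while only one positive/negative pair out of nine is ranked in the wrong order, so AUC is 8/9.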

In [11]:
print(results["TfidfVectorizer LinearSVC"])

Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])
