# Commonsense Media Game Reviews Classification

## Project Goals

Reviews are one of the better ways to help potential customers decide whether to buy or use a product, and video games are no exception. The goal of this project is to use machine learning to classify a game review as either kid-friendly (`safe`) or intended for more mature audiences (`adult`).

## Dataset Gathering 

The dataset used in this project was scraped with ParseHub and contains parent reviews from https://www.commonsensemedia.org/. To make scraping more efficient, games were scraped in order of rating, starting with the highest-rated ones, so that each ParseHub run was used as fully as possible. Each run of ParseHub's free version takes 40 minutes to complete, and six runs were needed to scrape a total of 2500 reviews.

Then, the 2500 reviews were labelled using the pigeonXT library for Jupyter notebooks (https://github.com/dennisbakhuis/pigeonXT). The annotation categories were `['safe', 'adult', 'remove']`. The `remove` category was for reviews that:
* Contain fewer than five words (e.g. 'love it')
* Are complaints rather than reviews (e.g. a complaint about not receiving a refund instead of a game review)
* Are incoherent (e.g. 'Love love love love love love!')
* Are ambivalent (e.g. reviews along the lines of 'Dependent on your kid's maturity')

Removing these reviews leaves a dataset that, as far as possible, contains only clearly kid-friendly or clearly adult reviews.

After labelling, filtering out the reviews with the `remove` label, and sampling an equal number of `safe` and `adult` reviews, 1300 reviews remained: 650 `safe` and 650 `adult`.
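The filtering and balancing step described above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the code used for the project; the column names and the tiny `labelled` DataFrame are assumptions.

```python
import pandas as pd

# Hypothetical labelled frame; column names and contents are illustrative only.
labelled = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "label": ["safe"] * 5 + ["adult"] * 3 + ["remove"] * 2,
})

# Drop the reviews that were marked for removal during annotation.
kept = labelled[labelled["label"] != "remove"]

# Downsample every remaining class to the size of the smallest one.
n_per_class = kept["label"].value_counts().min()
balanced = pd.concat(
    [group.sample(n=n_per_class, random_state=42) for _, group in kept.groupby("label")]
).reset_index(drop=True)

print(balanced["label"].value_counts())
```

Downsampling to the smaller class keeps the two labels equally represented, which matches the balanced `safe`/`adult` split reported for the final dataset.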

In order to convert the labels for machine learning, scikit-learn's `LabelBinarizer()` was used to transform `safe --> 1` and `adult --> 0`. 
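As a small sketch of that mapping on toy labels (the list below is illustrative, not the project data): `LabelBinarizer` sorts the class names alphabetically, so for two classes `adult` becomes 0 and `safe` becomes 1.

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["safe", "adult", "adult", "safe"]  # toy labels for illustration

binarizer = LabelBinarizer()
# fit_transform returns a column vector for two classes; ravel() flattens it.
encoded = binarizer.fit_transform(labels).ravel()

print(binarizer.classes_)  # class names in sorted order
print(encoded)
```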

Now we have a dataset we can perform classification with. 

# Classification using Machine Learning
 

### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# NLP preprocessing libraries
import nltk
import gensim
import re
import string
import unicodedata
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from gensim.parsing.preprocessing import STOPWORDS
from gensim.parsing.preprocessing import remove_stopwords

# ML libraries
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
from sklearn.pipeline import Pipeline

### Data Preprocessing

In [None]:
reviews = pd.read_csv("final_labelled_dataset.csv")

In [None]:
reviews.head()

| Unnamed: 0 | review | label |
|---|---|---|
| 0 | Although the movie is not that good, the game ... | 0 |
| 1 | I am disappointed that there are guns and shoo... | 0 |
| 2 | My children and I have had a great deal of fun... | 1 |
| 3 | although this game is fun and entertaining, th... | 1 |
| 4 | Recently, I got the directions cut version of ... | 0 |


In [None]:
X = reviews['review']
y = reviews['label'] # 1 = safe, 0 = adult

In [None]:
# Create functions for cleaning text

def remove_url(text):
  # str.replace matches literally, so the regex has to go through re.sub
  return re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)

def remove_url_2(text):
  return re.sub(r'http\S+', '', text)

def remove_twitter_handles(text):
  return re.sub("@[A-Za-z0-9]+", "", text)

def remove_usernames_links(text):
  # Raw strings avoid invalid-escape warnings for \s
  text = re.sub(r'@[^\s]+', '', text)
  text = re.sub(r'http[^\s]+', '', text)
  return text

def remove_punctuations(text):
  additional_punctuations = ['’', '…'] # punctuations not in string.punctuation
  for punctuation in string.punctuation:
      text = text.replace(punctuation, '')
    
  for punctuation in additional_punctuations:
      text = text.replace(punctuation, '')
      
  return text

def remove_hashtags(text):
  return re.sub("#[A-Za-z0-9_]+","", text)

def remove_emojis(text):
  emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                      "]+", re.UNICODE)
  
  return emoji_pattern.sub(r'', text)

# Stemming, Lemmatization, and stopwords removal
stemmer = SnowballStemmer('english')
nltk.download('wordnet')

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(lemmatize_stemming(token))
    return str(result)

# Final data cleaning function
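# Numbers were meant to be kept, since they may be important in telling age requirements.
# Note: gensim's simple_preprocess (called in preprocess) keeps alphabetic tokens only,
# so digits are still dropped from the final output.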
def clean_text(text):
  text = remove_twitter_handles(text)
  text = remove_hashtags(text)
  text = remove_url(text)
  text = remove_url_2(text)
  text = remove_punctuations(text)
  text = remove_emojis(text)
  text = remove_stopwords(text)
  text = text.lower()
  text = preprocess(text)
  return text

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
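The URL and handle helpers in the cleaning cell rely on regular expressions. As a minimal sketch of why that matters (the sample sentence and URL are made up for illustration): Python's `str.replace` treats its first argument as literal text, while `re.sub` interprets it as a pattern.

```python
import re

text = "Great game, see https://example.com/review for details"  # made-up sample
pattern = r"http\S+"

# str.replace searches for the literal characters 'http\S+', which never occur.
literal = text.replace(pattern, "")

# re.sub interprets the pattern, so the whole URL is removed.
regex = re.sub(pattern, "", text)

print(literal)
print(regex)
```

The first call returns the text unchanged, which is easy to miss because no error is raised.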


In [None]:
# preprocess text

X = X.map(clean_text)

In [None]:
pd.set_option('max_colwidth', None)
X.head()

0    ['movi', 'good', 'game', 'actual', 'entertain', 'violenc', 'especi', 'machin', 'gun', 'blood', 'show', 'game', 'base', 'terribl', 'movi', 'pretti', 'fun']
1    ['disappoint', 'gun', 'shoot', 'game', 'tell', 'mom', 'buy', 'son', 'holiday', 'like', 'car', 'look', 'race', 'game', 'instant', 'love', 'race', 'coupl', 'round', 'race', 'violent', 'activ', 'bomb', 'machin', 'gun',

In [None]:
y.value_counts()

0    650
1    650
Name: label, dtype: int64

### Model Training

In [None]:
# define custom functions for machine learning

def crossvalidate_classifier(model, X, y, cm=False):
    
    scores = cross_val_score(model, X, y, scoring='f1_macro', cv=5)
    #scores = cross_val_score(model, X, y, scoring='accuracy', cv=5)
    print(f"{model}: %0.5f f-1 score with a standard deviation of %0.5f" % (scores.mean(), scores.std()))
    print("\n")
    y_pred = cross_val_predict(model, X, y, cv=5)

    if cm:
        # Labels in this dataset are 0 = adult, 1 = safe
        conf_mat = confusion_matrix(y, y_pred, labels=[0, 1])
        metrics.ConfusionMatrixDisplay(conf_mat, display_labels=['adult', 'safe']).plot()
    
def evaluate_classifier(model, X_train, X_test, y_train, y_test):
      
    # Fit the model on the training features
    model.fit(X_train, y_train)

    # Predict labels for the held-out test features
    pred = model.predict(X_test)

    # scikit-learn metrics expect (y_true, y_pred), in that order
    acc_score = metrics.accuracy_score(y_test, pred)
    print(f"{model} Accuracy Score:   %0.5f" % acc_score)
    
    f1score = metrics.f1_score(y_test, pred, average='macro')
    print(f"{model} F-1 Score:   %0.5f" % f1score)

    print("\n")

    # Built-in round() works whether the metric is a Python float or a NumPy scalar
    return round(acc_score, 5), round(f1score, 5)

def best_hyperparam(X_train_data, X_test_data, y_train_data, y_test_data, 
                       model, param_distributions, iterations, cv=5, scoring_fit='f1_macro',  # 'f1_score' is not a valid scorer name
                       do_probabilities = False):
  
    gs = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_distributions, 
        cv=cv, 
        n_jobs=-1, 
        scoring=scoring_fit,
        n_iter=iterations,
        verbose=2
    )

    fitted_model = gs.fit(X_train_data, y_train_data)
    
    if do_probabilities:
      pred = fitted_model.predict_proba(X_test_data)
    else:
      pred = fitted_model.predict(X_test_data)
    
    return fitted_model, pred

def best_hyperparam_grid(X_train_data, X_test_data, y_train_data, y_test_data, 
                       model, param_distributions, cv=5, scoring_fit='f1_macro',  # 'f1_score' is not a valid scorer name
                       do_probabilities = False):
  
    gs = GridSearchCV(
        estimator=model,
        param_grid=param_distributions, 
        cv=cv, 
        n_jobs=-1, 
        scoring=scoring_fit,
        verbose=2
    )

    fitted_model = gs.fit(X_train_data, y_train_data)
    
    if do_probabilities:
      pred = fitted_model.predict_proba(X_test_data)
    else:
      pred = fitted_model.predict(X_test_data)
    
    return fitted_model, pred
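The search helpers above are defined but not called later in the notebook. A rough sketch of the same idea on synthetic data follows; the dataset from `make_classification`, the `param_grid` values, and the variable names are assumptions for illustration. The `scoring` argument takes a scikit-learn scorer name such as `'f1_macro'`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the vectorized reviews (an assumption, not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Hypothetical grid over the regularization strength.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=5,
    scoring="f1_macro",
)
search.fit(X_tr, y_tr)

# After fitting, the search object predicts with the refitted best estimator.
pred = search.predict(X_te)
print(search.best_params_, round(search.best_score_, 5))
```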

In [None]:
# Split the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, shuffle=True, stratify=y)
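A quick sketch of what `stratify` does for a balanced 650/650 label vector; the arrays below are placeholders with the same class counts as the dataset, not the reviews themselves. With `test_size=0.2`, each split keeps the 50/50 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels with the same class counts as the dataset.
X_demo = np.arange(1300).reshape(-1, 1)
y_demo = np.array([0] * 650 + [1] * 650)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, test_size=0.2, shuffle=True, stratify=y_demo
)

print(len(y_tr), len(y_te))                  # split sizes
print(np.bincount(y_tr), np.bincount(y_te))  # per-class counts in each split
```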

In [None]:
# Initialize count vectorizer
count_vectorizer = CountVectorizer(stop_words='english', max_df=0.9, min_df=0.05)

# Create count train and test variables
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

# Initialize tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=0.05)

# Create tfidf train and test variables
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
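To illustrate how `max_df` and `min_df` prune the vocabulary, here is a toy sketch; the four short documents and the thresholds are made up so the effect is visible on a tiny corpus. When given as floats, both thresholds are proportions of documents. The vectorizer is fitted on the training texts only, so unseen test words are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only.
train_docs = [
    "game fun kid",
    "game violent blood",
    "game fun family",
    "game gore blood",
]
test_docs = ["fun game with zombies"]

# Drop terms in more than 90% of documents or in fewer than 50% of them.
vectorizer = CountVectorizer(max_df=0.9, min_df=0.5)
train_counts = vectorizer.fit_transform(train_docs)

# transform() reuses the training vocabulary; new words are not added.
test_counts = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))
print(test_counts.toarray())
```

Here `game` is dropped for appearing in every document, and words that occur only once are dropped for being too rare, leaving `blood` and `fun`.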

In [None]:
classifiers = [MultinomialNB(), RandomForestClassifier(), AdaBoostClassifier(), 
               GradientBoostingClassifier(), SVC(),
               LogisticRegression()]

### Bag of Words Embeddings Training

In [None]:
print(count_train.shape)
print(count_vectorizer.vocabulary_)

(1040, 118)
{'game': 35, 'enjoy': 25, 'parent': 73, 'thing': 99, 'violenc': 107, 'blood': 8, 'player': 77, 'look': 62, 'like': 60, 'real': 82, 'nuditi': 69, 'sex': 88, 'kill': 53, 'languag': 55, 'word': 113, 'recommend': 83, 'love': 64, 'amaz': 3, 'year': 115, 'old': 70, 'play': 76, 'say': 85, 'turn': 103, 'fun': 34, 'buy': 10, 'son': 91, 'good': 37, 'matur': 67, 'great': 41, 'way': 112, 'learn': 56, 'fight': 31, 'peopl': 74, 'need': 68, 'lot': 63, 'world': 114, 'theme': 98, 'violent': 108, 'bite': 7, 'sexual': 89, 'content': 17, 'bad': 4, 'war': 110, 'littl': 61, 'overal': 72, 'gore': 38, 'teen': 97, 'enemi': 24, 'fine': 32, 'children': 14, 'understand': 104, 'graphic': 40, 'easi': 23, 'adult': 1, 'best': 5, 'make': 66, 'isnt': 50, 'famili': 28, 'swear': 96, 'kid': 52, 'age': 2, 'im': 48, 'actual': 0, 'feel': 30, 'younger': 117, 'level': 58, 'friend': 33, 'know': 54, 'high': 46, 'older': 71, 'think': 100, 'time': 101, 'tri': 102, 'pretti': 79, 'person': 75, 'differ': 19, 'charact': 12

In [None]:
# Iterate through the models and find the classifier with the best cross_val_score
# and test score

cv_accuracy = []
cv_f1 = []

for classifier in classifiers:
  crossvalidate_classifier(classifier, count_train, y_train)

for classifier in classifiers:
  accuracy, f1 = evaluate_classifier(classifier, count_train, count_test, y_train, y_test)
  cv_accuracy.append(accuracy)
  cv_f1.append(f1)

MultinomialNB(): 0.83552 f-1 score with a standard deviation of 0.00830


RandomForestClassifier(): 0.83519 f-1 score with a standard deviation of 0.02030


AdaBoostClassifier(): 0.82270 f-1 score with a standard deviation of 0.01462


GradientBoostingClassifier(): 0.82462 f-1 score with a standard deviation of 0.01654


SVC(): 0.81250 f-1 score with a standard deviation of 0.01368


LogisticRegression(): 0.82071 f-1 score with a standard deviation of 0.01110


MultinomialNB() Accuracy Score:   0.80769
MultinomialNB() F-1 Score:   0.80765


RandomForestClassifier() Accuracy Score:   0.81154
RandomForestClassifier() F-1 Score:   0.81091


AdaBoostClassifier() Accuracy Score:   0.83462
AdaBoostClassifier() F-1 Score:   0.83459


GradientBoostingClassifier() Accuracy Score:   0.83077
GradientBoostingClassifier() F-1 Score:   0.83068


SVC() Accuracy Score:   0.82308
SVC() F-1 Score:   0.82281


LogisticRegression() Accuracy Score:   0.83462
LogisticRegression() F-1 Score:   0.83459




### TF-IDF Embedding Training

In [None]:
# Iterate through the models and find the classifier with the best cross_val_score
# and test score

tfidf_accuracy = []
tfidf_f1 = []

for classifier in classifiers:
  crossvalidate_classifier(classifier, tfidf_train, y_train)

for classifier in classifiers:
  accuracy, f1 = evaluate_classifier(classifier, tfidf_train, tfidf_test, y_train, y_test)
  tfidf_accuracy.append(accuracy)
  tfidf_f1.append(f1)

MultinomialNB(): 0.82687 f-1 score with a standard deviation of 0.00609


RandomForestClassifier(): 0.84303 f-1 score with a standard deviation of 0.01608


AdaBoostClassifier(): 0.80664 f-1 score with a standard deviation of 0.01667


GradientBoostingClassifier(): 0.83533 f-1 score with a standard deviation of 0.01380


SVC(): 0.83537 f-1 score with a standard deviation of 0.01341


LogisticRegression(): 0.83353 f-1 score with a standard deviation of 0.00777


MultinomialNB() Accuracy Score:   0.81923
MultinomialNB() F-1 Score:   0.81923


RandomForestClassifier() Accuracy Score:   0.80769
RandomForestClassifier() F-1 Score:   0.80728


AdaBoostClassifier() Accuracy Score:   0.80000
AdaBoostClassifier() F-1 Score:   0.79989


GradientBoostingClassifier() Accuracy Score:   0.79615
GradientBoostingClassifier() F-1 Score:   0.79615


SVC() Accuracy Score:   0.84231
SVC() F-1 Score:   0.84231


LogisticRegression() Accuracy Score:   0.85769
LogisticRegression() F-1 Score:   0.85764




In [None]:
print(cv_accuracy)
print(cv_f1)
print(tfidf_accuracy)
print(tfidf_f1)

[0.80769, 0.81154, 0.83462, 0.83077, 0.82308, 0.83462]
[0.80765, 0.81091, 0.83459, 0.83068, 0.82281, 0.83459]
[0.81923, 0.80769, 0.8, 0.79615, 0.84231, 0.85769]
[0.81923, 0.80728, 0.79989, 0.79615, 0.84231, 0.85764]


In [None]:
columns = ["MultinomialNB", "RandomForest", "AdaBoosting", "GradientBoosting", "SVC", "LogisticRegression"]
indices = ['cv_accuracy', 'tfidf_accuracy', 'cv_f1', 'tfidf_f1']

In [None]:
summary = pd.DataFrame(data=[cv_accuracy, tfidf_accuracy, cv_f1, tfidf_f1], index=indices, columns=columns)

In [None]:
summary

| | MultinomialNB | RandomForest | AdaBoosting | GradientBoosting | SVC | LogisticRegression |
|---|---|---|---|---|---|---|
| cv_accuracy | 0.80769 | 0.81154 | 0.83462 | 0.83077 | 0.82308 | 0.83462 |
| tfidf_accuracy | 0.81923 | 0.80769 | 0.8 | 0.79615 | 0.84231 | 0.85769 |
| cv_f1 | 0.80765 | 0.81091 | 0.83459 | 0.83068 | 0.82281 | 0.83459 |
| tfidf_f1 | 0.81923 | 0.80728 | 0.79989 | 0.79615 | 0.84231 | 0.85764 |
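One way to read the summary is to list, for each row, every model that reaches the row maximum, along with the spread between the best and worst model. The sketch below rebuilds the table from the values printed above; the helper names `is_best`, `best_models`, and `spread` are illustrative.

```python
import pandas as pd

columns = ["MultinomialNB", "RandomForest", "AdaBoosting",
           "GradientBoosting", "SVC", "LogisticRegression"]
summary = pd.DataFrame(
    data=[
        [0.80769, 0.81154, 0.83462, 0.83077, 0.82308, 0.83462],
        [0.81923, 0.80769, 0.80000, 0.79615, 0.84231, 0.85769],
        [0.80765, 0.81091, 0.83459, 0.83068, 0.82281, 0.83459],
        [0.81923, 0.80728, 0.79989, 0.79615, 0.84231, 0.85764],
    ],
    index=["cv_accuracy", "tfidf_accuracy", "cv_f1", "tfidf_f1"],
    columns=columns,
)

row_max = summary.max(axis=1)

# Boolean mask of every model that reaches the row maximum, so ties stay visible.
is_best = summary.eq(row_max, axis=0)
best_models = {
    metric: list(is_best.columns[is_best.loc[metric].to_numpy()])
    for metric in summary.index
}

# Gap between the best and worst model on each row.
spread = row_max - summary.min(axis=1)

print(best_models)
print(spread.round(5))
```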


# Conclusion

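On the held-out test set, the LogisticRegression classifier gives the best scores: with TF-IDF features it reaches the highest accuracy (0.85769) and F-1 score (0.85764), and with bag-of-words features it ties with AdaBoost (0.83462 accuracy). However, the differences between models are small. Test accuracy ranges only from about 0.80 to 0.86, and the 5-fold cross-validation rankings differ from the test rankings (MultinomialNB leads on bag-of-words features, RandomForest on TF-IDF).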

A suggested next step is to gather more data, for instance by also including kids' reviews, in order to increase the variability of the dataset and make kid-friendly (safe) reviews easier to distinguish from adult reviews.