# Source

Originally from a project by [enriqueav](https://github.com/enriqueav), at [GitHub](https://github.com/enriqueav/MetacriticUserscore)

# Predicting the user score of Metacritic user reviews of Video Games using the Keras functional API and TensorFlow


~~Can we predict the score given to a Video Game based on the user review posted on Metacritic? In this post we are going to use the Keras functional API with a TensorFlow backend to try to achieve this task.~~

~~We are going to be using [this amazing Kaggle dataset](https://www.kaggle.com/dahlia25/metacritic-video-game-comments/), which includes the metascore (the one derived from professional reviews) and the user comments (or reviews) for the top 5000 games.~~

In [1]:
# # Install the latest version of TensorFlow
# !pip install -q -U tensorflow==1.7.0

In [2]:
import pandas as pd
import tensorflow as tf

from tensorflow import keras
layers = keras.layers

# The original code was tested with TensorFlow v1.7; the output below comes from v1.14
print("You have TensorFlow version", tf.__version__)


You have TensorFlow version 1.14.0



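Then we use pandas to read the CSV files into DataFrames. After that we lower-case the game names, drop duplicate descriptions, and join the Metacritic scores with the Steam descriptions on the game name.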

In [4]:
# Convert the data to a Pandas data frame
metascores = pd.read_csv('../data/metacritic_data.csv').infer_objects()
applist = pd.read_csv('../data/app_list.csv').set_index('appid')
descriptions = pd.read_csv('../data/steam_description_data.csv').join(applist, on='appid')
descriptions['name'] = descriptions['name'].astype(str).apply(str.lower)
descriptions = descriptions.drop_duplicates(subset='name').set_index('name')
metascores['name'] = metascores['name'].astype(str).apply(str.lower)
metascores = metascores.set_index('name')
df1 = metascores.join(descriptions, on='name', how='inner')
y_col = 'metacritic_score'
X_cols = ['detailed_description', 'about_the_game', 'short_description']
X_col = X_cols[2]

df1.head(2)

| name | metacritic_score | appid | detailed_description | about_the_game | short_description |
|---|---|---|---|---|---|
| tom clancy's splinter cell chaos theory | 94 | 13570 | `The year is 2008.<br>\t\t\t\t\tCitywide blacko...` | `The year is 2008.<br>\t\t\t\t\tCitywide blacko...` | The year is 2008. Citywide blackouts ... stock... |
| tom clancy's splinter cell chaos theory | 87 | 13570 | `The year is 2008.<br>\t\t\t\t\tCitywide blacko...` | `The year is 2008.<br>\t\t\t\t\tCitywide blacko...` | The year is 2008. Citywide blackouts ... stock... |
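The join performed in the cell above can be illustrated in plain Python, without pandas. This is only a sketch on made-up rows: the games, scores, descriptions and the helper name `inner_join_on_name` are hypothetical. It shows why one game can appear more than once after the join, as in the two rows printed above.

```python
def inner_join_on_name(score_rows, description_rows):
    """Join (name, score) rows with (name, description) rows on the lower-cased name."""
    # Keep only the first description per lower-cased name,
    # mirroring drop_duplicates(subset='name') on the descriptions.
    descriptions = {}
    for name, text in description_rows:
        descriptions.setdefault(name.lower(), text)

    joined = []
    for name, score in score_rows:
        key = name.lower()
        if key in descriptions:  # inner join: skip names without a description
            joined.append((key, score, descriptions[key]))
    return joined


# Hypothetical rows: "Game A" has two scores, "Game C" has no description.
score_rows = [("Game A", 94), ("GAME A", 87), ("Game B", 70), ("Game C", 55)]
description_rows = [
    ("game a", "Stealth action."),
    ("Game B", "A puzzle game."),
    ("game b", "Duplicate entry."),
]

rows = inner_join_on_name(score_rows, description_rows)
for row in rows:
    print(row)
```

Because the scores table is not de-duplicated before the join, "game a" shows up twice with different scores; a later de-duplication on the name keeps only one of them.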


In [5]:
# Shuffle with a fixed random seed
# This will help us to have the same training and test set every time
df1 = df1.sample(frac=1, random_state=387)

df2 = df1.reset_index().drop_duplicates('name').set_index('name')

df = df2


In [6]:

# Drop comments with less than 200 characters
# Modify this parameter to obtain different results
# comments = comments[comments['Comment'].str.len() > 200]
# # Print the first 5 rows
# print(comments.count())
# print(comments.head())
print(df.count())
df.head()

metacritic_score        3325
appid                   3325
detailed_description    3325
about_the_game          3325
short_description       3325
dtype: int64


| name | metacritic_score | appid | detailed_description | about_the_game | short_description |
|---|---|---|---|---|---|
| remilore: lost girl in the lands of lore | 61 | 995240 | RemiLore: Lost Girl in the Lands of Lore is a ... | RemiLore: Lost Girl in the Lands of Lore is a ... | RemiLore: Lost Girl in the Lands of Lore is a ... |
| mordhau | 81 | 629760 | `<strong>MORDHAU</strong> is a medieval first &...` | `<strong>MORDHAU</strong> is a medieval first &...` | MORDHAU is a multiplayer medieval slasher. Cre... |
| primordia | 72 | 227000 | `What Happened to the Humans?<br><br>Set in a p...` | `What Happened to the Humans?<br><br>Set in a p...` | Life has ceased. Man is but a myth. And now, e... |
| afterparty | 75 | 762220 | `<h1>Coming Soon From Night School Studio</h1><...` | In Afterparty, you are Milo and Lola, recently... | In Afterparty, you are Milo and Lola, recently... |
| wolfenstein ii: the new colossus | 88 | 612880 | `<h1>Accolades</h1><p><img src="https://cdn.clo...` | Wolfenstein® II: The New Colossus™ is the high... | America, 1961. The assassination of Nazi Gener... |


Then we split the dataset into train and test sets. Since we used a constant random seed, this will return the same result every time.

In [7]:
# Split data into train and test
train_size = int(len(df) * .8)
print("Train size: %d" % train_size)
print("Test size: %d" % (len(df) - train_size))

# Train features
comments_train = df[X_col][:train_size]
# Train labels
labels_train = df[y_col][:train_size]
# Test features
comments_test = df[X_col][train_size:]
# Test labels
labels_test = df[y_col][train_size:]

Train size: 2660
Test size: 665
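The idea of "shuffle once with a fixed seed, then slice" can be sketched with the standard library alone. This is an illustration under assumptions, not the notebook's code: `shuffled_split` is a hypothetical helper, and Python's `random` module uses a different generator than pandas, so the resulting order will not match the notebook's even with the same seed.

```python
import random


def shuffled_split(items, train_fraction=0.8, seed=387):
    """Shuffle a copy of items with a fixed seed, then slice into train and test."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * train_fraction)
    return shuffled[:train_size], shuffled[train_size:]


rows = list(range(3325))  # stand-in for the 3325 rows counted above
train_a, test_a = shuffled_split(rows)
train_b, test_b = shuffled_split(rows)

print(len(train_a), len(test_a))  # 2660 665
print(train_a == train_b)         # True: same seed, same split
```

Changing the seed changes which rows land in the test set, so results reported for one seed are not directly comparable with those for another.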


We define a Keras Tokenizer and fit it on the train set only.

In [8]:
# Create a tokenizer to preprocess our text descriptions
vocab_size = 12000 # This is a hyperparameter, experiment with different values for your dataset
tokenize = keras.preprocessing.text.Tokenizer(num_words=vocab_size, char_level=False)
tokenize.fit_on_texts(comments_train) # only fit on train
len(tokenize.word_counts)

12252
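The tokenizer reports 12252 distinct words even though `vocab_size` is 12000. That is expected: the Keras Tokenizer counts every word it sees while fitting, and `num_words` is only applied later, when texts are converted to matrices or sequences. The following plain-Python sketch mimics that behaviour in a simplified form (whitespace splitting only, no punctuation filtering); `fit_word_index` and `to_binary_bow` are hypothetical helpers, not Keras functions.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank every word by frequency; index 1 is the most common word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def to_binary_bow(text, word_index, num_words):
    """Return a 0/1 vector of length num_words; column 0 is never set."""
    row = [0] * num_words
    for word in text.lower().split():
        idx = word_index.get(word)
        if idx is not None and idx < num_words:  # rarer words are dropped
            row[idx] = 1
    return row


train_texts = ["a fast game", "a slow game", "a game about a robot"]
word_index = fit_word_index(train_texts)

print(len(word_index))                                     # 6 words seen
print(to_binary_bow("a fast robot game", word_index, 3))   # [0, 1, 1]
```

With `num_words=3` only the two most frequent words ("a" and "game") get a column, so "fast" and "robot" vanish from the bag of words even though they were counted during fitting.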

Now we are going to create the Keras models.

First we define the "***wide***" model, which takes a bag of words as input.

In [9]:
# Define our wide model with the functional API
bow_inputs = layers.Input(shape=(vocab_size,))
inter = layers.Dense(256, activation='relu')(bow_inputs)
inter = layers.Dropout(0.3)(inter)
predictions = layers.Dense(1, activation='linear')(inter)
wide_model = keras.Model(inputs=bow_inputs, outputs=predictions)
wide_model.compile(loss='mse', optimizer='adam', metrics=['mse'])
print(wide_model.summary())

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 12000)]           0         
_________________________________________________________________
dense (Dense)                (None, 256)               3072256   
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
Total params: 3,072,513
Trainable params: 3,072,513
Non-trainable params: 0
_________________________________________________________________
None
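The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A small sketch (the helper name `dense_params` is hypothetical):

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


vocab_size = 12000
hidden = dense_params(vocab_size, 256)  # bag of words -> 256 hidden units
output = dense_params(256, 1)           # 256 hidden units -> 1 score

print(hidden, output, hidden + output)  # 3072256 257 3072513
```

Almost all of the wide model's parameters sit in the first layer, so `vocab_size` is the main lever on its size.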


Next we define the second model, the "***deep***" model, which takes the sequences of words and passes them to an Embedding layer.

In [10]:
max_seq_length = 200

# Define our deep model with the Functional API
deep_inputs = layers.Input(shape=(max_seq_length,))
embedding = layers.Embedding(vocab_size, 16, input_length=max_seq_length)(deep_inputs)
embedding = layers.Flatten()(embedding)
embedding = layers.Dense(64, activation='relu')(embedding)
embedding = layers.Dropout(0.3)(embedding)
embed_out = layers.Dense(1, activation='linear')(embedding)
deep_model = keras.Model(inputs=deep_inputs, outputs=embed_out)
deep_model.compile(loss='mse',
                   optimizer='adam',
                   metrics=['mse'])
print(deep_model.summary())

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 200)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 200, 16)           192000    
_________________________________________________________________
flatten (Flatten)            (None, 3200)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                204864    
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
Total p

Then we combine both models using the Keras functional API.

In [11]:
# Combine wide and deep into one model
merged_out = layers.concatenate([wide_model.output, deep_model.output])
merged_out = layers.Dense(64, activation='relu')(merged_out)
merged_out = layers.Dropout(0.3)(merged_out)
merged_out = layers.Dense(1)(merged_out)
combined_model = keras.Model([wide_model.input, deep_model.input], merged_out)
combined_model.compile(loss='mse',
                       optimizer='adam',
                       metrics=['mse'])
print(combined_model.summary())

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, 200)]        0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 200, 16)      192000      input_2[0][0]                    
__________________________________________________________________________________________________
input_1 (InputLayer)            [(None, 12000)]      0                                            
__________________________________________________________________________________________________
flatten (Flatten)               (None, 3200)         0           embedding[0][0]                  
____________________________________________________________________________________________

Generators are used when the whole training set would not fit into memory, or when you want to apply some kind of data augmentation at training time.


In this case we use one because the "***wide***" representation (bags of words) of all the training examples at once would take all of the memory allocated by Google Colab.

Also note that at training time we will call *process_comments*, which creates the bag of words and the sequences of words to send to the Embedding layer. Naturally, this slows down training a little (we could pre-calculate the sequences to speed it up).
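Before a sequence of word indices reaches the Embedding layer it has to be brought to a fixed length. The sketch below imitates, in plain Python, what Keras `pad_sequences` does with `padding="post"` and its default `truncating="pre"`: short sequences get zeros appended, while long ones lose tokens from the front. `pad_post` is a hypothetical stand-in, not the Keras function.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Pad at the end; cut from the front when the sequence is too long."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]  # keep only the last maxlen tokens
    return sequence + [pad_value] * (maxlen - len(sequence))


print(pad_post([5, 2, 9], maxlen=5))              # [5, 2, 9, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

For descriptions longer than the maximum sequence length this means the opening words are the ones discarded, which may or may not be what you want for short marketing blurbs.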

In [12]:
def process_comments(comments, tokenize, max_seq_length):
    # Create the Bag Of Words and the embed version of only this
    # batch of examples. 
    # This is to avoid using all the memory at the same time
    bow = tokenize.texts_to_matrix(comments)
    embed = tokenize.texts_to_sequences(comments)
    embed = keras.preprocessing.sequence.pad_sequences(
        embed, maxlen=max_seq_length, padding="post"
    )
    return [bow, embed]
  
# Create the generator for fit and evaluate
def generator(comments_list, labels_list, batch_size, tokenize, max_seq_length):
    batch_number = 0
    data_set_len = len(comments_list)
    batches_per_epoch = data_set_len // batch_size

    while True:
        initial = (batch_number*batch_size) % data_set_len
        final = initial + batch_size
        comments_to_send = comments_list[initial:final]

        x = process_comments(comments_to_send, tokenize, max_seq_length) 
        y = labels_list[initial:final]

        batch_number = (batch_number+1) % batches_per_epoch
        yield x, y
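It can help to see which rows the generator actually touches. The sketch below reproduces only its index arithmetic (no tokenizing) for the test set size of 665 and a batch size of 128; `batch_slices` is a hypothetical helper written for this illustration.

```python
def batch_slices(data_set_len, batch_size, n_batches):
    """Return the (initial, final) index pairs the generator would slice."""
    batches_per_epoch = data_set_len // batch_size
    batch_number = 0
    slices = []
    for _ in range(n_batches):
        initial = (batch_number * batch_size) % data_set_len
        slices.append((initial, initial + batch_size))
        batch_number = (batch_number + 1) % batches_per_epoch
    return slices


slices = batch_slices(data_set_len=665, batch_size=128, n_batches=6)
print(slices)
# [(0, 128), (128, 256), (256, 384), (384, 512), (512, 640), (0, 128)]

covered = max(final for _, final in slices)
print("rows never yielded:", 665 - covered)  # rows never yielded: 25
```

Because the number of batches per epoch is rounded down, the rows after the last full batch are never yielded: 25 of the 665 test rows here, and by the same arithmetic 100 of the 2660 training rows.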

One more thing before we start training: we define a callback, that is, a function that is called at the end of each epoch.

We use it to check how well the partially trained model predicts the scores of the test set.

In [13]:
def on_epoch_end(epoch, logs, print_preditions=0):
    # Generate predictions
    predictions = combined_model.predict_generator(
        generator(comments_test, labels_test, 128, tokenize, max_seq_length),
        steps=int(len(comments_test)/128)
    )

    # Compare predictions with actual values for the first few items in our test dataset
    diff = 0
    printed = 0
    for i in range(len(predictions)):
        val = predictions[i]
        if print_preditions and printed < print_preditions:
            print(comments_test.iloc[i])
            print('Predicted: ', val[0], 'Actual: ', labels_test.iloc[i], '\n')
            printed += 1
        diff += abs(val[0] - labels_test.iloc[i])

    # Report the average absolute difference between the actual score and the model's predicted score
    print('\nEpoch: %d. Average prediction difference: %0.4f\n' %
            (epoch+1, diff/len(predictions)))
    

print_callback = keras.callbacks.LambdaCallback(on_epoch_end=on_epoch_end)
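The number the callback prints is a mean absolute difference between predictions and labels. Stripped of the model and the generator, the computation looks like the sketch below; the function name and the example values are hypothetical.

```python
def mean_abs_difference(predicted, actual):
    """Average of |prediction - label| over paired values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical predictions and labels on the 0-100 score scale.
result = mean_abs_difference([70.0, 65.0, 80.0], [78, 63, 80])
print(round(result, 4))
```

Unlike the mean squared error used as the training loss, this value is in score points, which makes it easier to read at a glance.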

Since we are going to use generators, we use Model.fit_generator instead of Model.fit.

Note that the generator is responsible for yielding both the inputs and the expected labels of each batch.

We also pass validation_data, so the model is evaluated on the test set after each epoch. Because the model is compiled with loss='mse' and metrics=['mse'], this reports the validation loss and MSE rather than an accuracy.

Finally, the callback we created a couple of steps earlier can be passed through the callbacks argument so that *on_epoch_end* runs after every epoch; in the cell below that argument is commented out.

In [14]:
# Run training
# It is a fairly deep network, so it will take around 5 minutes per epoch
combined_model.fit_generator(
    generator(comments_train, labels_train, 128, tokenize, max_seq_length),
    steps_per_epoch=int(len(comments_train)/128),
    epochs=10,
    validation_data=generator(comments_test, labels_test, 128, tokenize, max_seq_length),
    validation_steps=int(len(comments_test)/128),
    # callbacks=[print_callback]
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1d393cd1668>

At the end, you should get an average difference of around 1.25 points per user review.

Finally, we evaluate the final trained model again, printing some of the results.

In [15]:
combined_model.save('combined_model_test_3.h5')

In [16]:
# We manually call on_epoch_end with the trained model, but this time with
# print_preditions=20
# It will print 20 examples from the test set, with their predicted and actual
# values
on_epoch_end(7, {}, print_preditions=20)

Dungelot: Shattered Lands takes you on an epic roguelike dungeon-crawler adventure to defeat zombie cows, splat giant evil mushrooms, and generally get back home in one piece. Deceptively simple, ever so addictive.
Predicted:  69.22917 Actual:  78 

Solve challenging cable-based puzzles and uncover what really happened to the crew of The Alabaster. Now with Hint System (for those ultra tricky puzzles).
Predicted:  65.48518 Actual:  82 

Pick your faction, Monster, strategy, and find out who gets to become King of the Undead!
Predicted:  64.60827 Actual:  63 

A truly global cricket simulation from Big Ant Studios, the leading name in authentic, realistic cricket action, Cricket 19 allows you to take control of your team, from club through national level, and lead them to T20, ODI, or Test Match glory.
Predicted:  71.13329 Actual:  73 

GUNGRAVE is back! Undead assassin Beyond the Grave returns from a 14-year slumber ready to rumble in VR. Experience this classic series like never befor

Finally, let's try to predict the score of freshly retrieved game descriptions.

In [17]:
# Let's try the descriptions of two games: Inscryption and Age of Empires IV
# (the original example used user reviews of World War Z for PlayStation 4:
# https://www.metacritic.com/game/playstation-4/world-war-z/user-reviews)
test_comments = [
    "From the creator of Pony Island and The Hex comes the latest mind melting, self-destructing love letter to video games. Inscryption is an inky black card-based odyssey that blends the deckbuilding roguelike, escape-room style puzzles, and psychological horror into a blood-laced smoothie. Darker still are the secrets inscrybed upon the cards...\
In Inscryption you will...\
Acquire a deck of woodland creature cards by draft, surgery, and self mutilation\
Unlock the secrets lurking behind the walls of Leshy's cabin\
Embark on an unexpected and deeply disturbing odyssey",
    "Pre-order Age of Empires IV now and get the Age of Empires II: Definitive Edition “Dawn of the Dukes” expansion as a free bonus in August 2021*.\
One of the most beloved real-time strategy games returns to glory with Age of Empires IV, putting you at the center of epic historical battles that shaped the world. Featuring both familiar and innovative new ways to expand your empire in vast landscapes with stunning 4K visual fidelity, Age of Empires IV brings an evolved real-time strategy game to a new generation.\
Return to History – The past is prologue as you are immersed in a rich historical setting of 8 diverse civilizations across the world from the English to the Chinese to the Delhi Sultanate in your quest for victory. Build cities, manage resources, and lead your troops to battle on land and at sea in 4 distinct campaigns with 35 missions that span across 500 years of history from the Dark Ages up to the Renaissance.\
Choose Your Path to Greatness with Historical Figures – Live the adventures of Joan of Arc in her quest to defeat the English, or command mighty Mongol troops as Genghis Khan in his conquest across Asia. The choice is yours – and every decision you make will determine the outcome of history.\
Customize Your Game with Mods – Available in Early 2022, play how you want with user generated content tools for custom games.\
Challenge the World – Jump online to compete, cooperate or spectate with up to 7 of your friends in PVP and PVE multiplayer modes.\
An Age for All Players – Age of Empires IV is an inviting experience for new players with a tutorial system that teaches the essence of real-time strategy and a Campaign Story Mode designed for first time players to help achieve easy setup and success, yet is challenging enough for veteran players with new game mechanics, evolved strategies, and combat techniques.\
*Expansion bonus requires Age of Empires II: Definitive Edition game, sold separately. Valid for pre-orders via Steam, Microsoft Store, and participating retailers. Content requires broadband internet to download. See retailer for details."
]
# The scores are 
# 82
# 83

combined_model.predict(process_comments(test_comments, tokenize, max_seq_length))

array([[ 77.200554],
       [113.5192  ]], dtype=float32)
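The second prediction is above 100 because the final Dense(1) layer is linear, so nothing constrains the output to the 0-100 score range. One possible post-processing step, which is not part of the original notebook, is to clip predictions to the valid range. A minimal sketch, with `clip_score` as a hypothetical helper:

```python
def clip_score(value, low=0.0, high=100.0):
    """Clamp a predicted score to the [low, high] range."""
    return max(low, min(high, value))


raw_predictions = [77.200554, 113.5192]  # values printed above
clipped = [clip_score(v) for v in raw_predictions]
print(clipped)  # [77.200554, 100.0]
```

Clipping only hides out-of-range values; a prediction this far off still suggests the model extrapolates poorly on long descriptions like the second one.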