# Sentiment Analysis For Steam Games

The data was collected with a Scrapy web spider.

We need to organize the data and remove unnecessary elements such as punctuation and newline characters.

In [84]:
import numpy as np
import tensorflow as tf
import json

## Reading and Preparing Data

Each genre combination is encoded as a one-hot vector so that a combined genre gets its own class.

* This lets us differentiate between, for example, "Action" and "Action Adventure" as genres.
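
As a rough illustration of the idea, the sketch below builds combined-genre labels for a toy genre list and assigns each label a one-hot vector. The helper name `build_labels`, the toy genres, and the inclusive `max_combo_size` bound are assumptions made for this example; they are not taken from the scraped data.

```python
import itertools

def build_labels(genre_names, max_combo_size):
    """Return space-joined labels for every combination of 1..max_combo_size genres."""
    labels = []
    for size in range(1, max_combo_size + 1):
        for combo in itertools.combinations(sorted(genre_names), size):
            labels.append(' '.join(combo))
    return labels

# Hypothetical toy genres, used only for illustration
toy_labels = build_labels(['Adventure', 'Action', 'RPG'], 2)

# One-hot vector for label i is row i of an identity matrix
one_hot = {
    label: [1 if j == i else 0 for j in range(len(toy_labels))]
    for i, label in enumerate(toy_labels)
}

print(toy_labels)
print(one_hot['Action Adventure'])
```

With three genres and combinations of up to two, "Action" and "Action Adventure" end up as separate classes with separate one-hot rows.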

In [85]:
with open('Steam-Data/steam_genres.json', 'r') as f:
    genres = f.read()

In [86]:
genres = json.loads(genres)

In [87]:
genre_list = [genre['genre'] for genre in genres]
genre_list.sort()

In [88]:
import itertools
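# Note: `genre_list == 2` compares a list with an int and is always False,
# so genre_length is always 2 and only single-genre labels are generated below.
genre_length = len(genre_list) if genre_list == 2 else 2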

In [89]:
total_labels = []
for i in range(1, genre_length):
    combination_of_genres = list(itertools.combinations(genre_list, i))
    for combined_genre in combination_of_genres:
        label = ' '.join(combined_genre)
        total_labels.append(label)

In [90]:
labels_length = len(total_labels)
labels_length

12

In [91]:
# One-hot encoded labels would be equal to identity matrix with length of the total_labels
output = np.identity(labels_length)

In [92]:
output

array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])

In [93]:
# Preparing the game data
with open('Steam-Data/steam_game_data.json', 'r') as f:
    games = f.read()

In [94]:
games = json.loads(games)

In [95]:
game_url = [game['url'] for game in games]

In [96]:
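# Extracting appid from the url
import re
game_app_id = []
kept_games = []
for game in games:
    # Escape the dots so they match literally
    match = re.search(r"http://store\.steampowered\.com/app/(.+?)/", game['url'])
    if match:
        game_app_id.append(match.group(1))
        kept_games.append(game)
    # Games without an app id (sub products) are dropped. Building a new list
    # avoids deleting from `games` by index, which shifts the remaining indices.
games = kept_games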

In [97]:
len(games) == len(game_app_id)

True

In [98]:
len(games)

14776

In [99]:
game_length = len(games)

### Combine the description and about page of each game

The description is a short summary, and the about page lists some of the game's key features. Combining them helps us categorize each game correctly. Some of the information is repeated, but keeping it still helps the categorization.

In [100]:
game_description_information = [game['description'] for game in games]
game_about_information = [game['about'] for game in games]

joined_about_array = []
for game_about_array in game_about_information:
    stripped_elements = [element.strip() for element in game_about_array]
    joined_about_array.append(' '.join(stripped_elements))

combine_words = []
for i in range(game_length):
    stripped_game_description = str(game_description_information[i]).strip()
    combine_words.append(stripped_game_description + ' ' + joined_about_array[i])

combine_words[0]

"PLAYERUNKNOWN'S BATTLEGROUNDS is a last-man-standing shooter being developed with community feedback. Players must fight to locate weapons and supplies in a massive 8x8 km island to be the lone survivor. This is BATTLE ROYALE.   is a last-man-standing shooter being developed with community feedback. Starting with nothing, players must fight to locate weapons and supplies in a battle to be the lone survivor. This realistic, high tension game is set on a massive 8x8 km island with a level of detail that showcases Unreal Engine 4's capabilities. aka Brendan Greene, is a pioneer of the Battle Royale genre. As the creator of the Battle Royale game-mode found in the ARMA series and H1Z1 : King of the Kill, Greene is co-developing the game with veteran team at Bluehole to create the most diverse and robust Battle Royale experience to date "

In [101]:
len(combine_words)

14776

In [102]:
import re
game_information = []
for combine in combine_words:
    words_in_combine = combine.split(' ')
    individual_game_info = ' '.join([re.sub('[^0-9a-zA-Z]+', '', word) for word in words_in_combine])
    game_information.append(individual_game_info)

game_information[0]

'PLAYERUNKNOWNS BATTLEGROUNDS is a lastmanstanding shooter being developed with community feedback Players must fight to locate weapons and supplies in a massive 8x8 km island to be the lone survivor This is BATTLE ROYALE   is a lastmanstanding shooter being developed with community feedback Starting with nothing players must fight to locate weapons and supplies in a battle to be the lone survivor This realistic high tension game is set on a massive 8x8 km island with a level of detail that showcases Unreal Engine 4s capabilities aka Brendan Greene is a pioneer of the Battle Royale genre As the creator of the Battle Royale gamemode found in the ARMA series and H1Z1  King of the Kill Greene is codeveloping the game with veteran team at Bluehole to create the most diverse and robust Battle Royale experience to date '

In [103]:
len(game_information)

14776

### Encoding Genres

Every genre must be encoded before we can use it. We use one-vs-all classification: Action vs. non-Action, Adventure vs. non-Adventure, and so on. The same neural network design is reused for each genre. This notebook works through the Action genre as the example.
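
To make the one-vs-all idea concrete, here is a minimal sketch that builds one binary label vector per genre. The function name `one_vs_all_labels` and the toy genre lists are hypothetical; the notebook itself only builds the vector for a single genre at a time.

```python
def one_vs_all_labels(game_genres, genre_names):
    """Map each genre name to a 0/1 list: 1 if the game has that genre, else 0."""
    return {
        name: [1 if name in genres else 0 for genres in game_genres]
        for name in genre_names
    }

# Hypothetical toy data: each inner list holds the genres of one game
toy_game_genres = [
    ['Action', 'Adventure'],
    ['Strategy'],
    ['Action', 'Indie'],
]

binary_labels = one_vs_all_labels(toy_game_genres, ['Action', 'Adventure'])
print(binary_labels['Action'])
print(binary_labels['Adventure'])
```

Each vector would then serve as the target for its own copy of the same network design.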

In [104]:
genres_to_int = {genre: i for i, genre in enumerate(genre_list)}

game_genres = [sorted(game['genres']) for game in games]

# Let's do it for the Action games
genre_name = 'Action'
# I believe focusing on a single genre will generate better results overall
game_labels = np.array([1 if genre_name in game_genre_list else 0 for game_genre_list in game_genres])

In [105]:
game_labels

array([1, 0, 1, ..., 0, 0, 0])

In [106]:
game_genres[0]

['Action', 'Adventure', 'Early Access', 'Massively Multiplayer', 'Violent']

In [107]:
len(game_labels)

14776

### Setting up a threshold

By setting a threshold on word usage we can identify the words that are most indicative of a genre. Some of the words used in the target genre are used by other genres as well, and common words such as 'for', 'the' and 'a' appear everywhere, so we need to ignore them. The same threshold also filters out words such as 'game', which appear in the information of almost every game.
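
The scoring used in the following cells can be sketched on toy data. The example assumes the same ratio and log transform as the notebook (`positive / (negative + 1)`, then a signed log); the function name `log_ratios`, the `min_total` parameter, and the toy texts are made up for illustration.

```python
import math
from collections import Counter

def log_ratios(texts, labels, min_total=0):
    """Score each word by a signed log of its positive-to-negative usage ratio."""
    pos, neg = Counter(), Counter()
    for text, label in zip(texts, labels):
        (pos if label == 1 else neg).update(text.split())

    scores = {}
    for word in set(pos) | set(neg):
        if pos[word] + neg[word] > min_total:
            ratio = pos[word] / (neg[word] + 1)
            if ratio > 1:
                scores[word] = math.log(ratio)
            else:
                scores[word] = -math.log(1 / (ratio + 0.01))
    return scores

# Hypothetical toy descriptions: label 1 = Action, 0 = not Action
toy_texts = ['shoot enemies the', 'shoot weapons the', 'shoot the',
             'story the', 'the puzzle']
toy_text_labels = [1, 1, 1, 0, 0]

scores = log_ratios(toy_texts, toy_text_labels)
polarity_cutoff = 0.3
informative = {w for w, s in scores.items() if abs(s) >= polarity_cutoff}
print(sorted(informative))
```

Words used about equally often on both sides, such as 'the', score close to zero and fall below the cutoff, while words concentrated in one class score far from zero.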

In [108]:
from collections import Counter
# Create three Counter objects to store positive, negative and total counts
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

# Loop over all the words in all the game information and increment the counts in the appropriate Counter objects
for i in range(len(game_labels)):
    if(game_labels[i] == 1):
        for word in game_information[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in game_information[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1

In [109]:
positive_counts.most_common()

[('the', 63564),
 ('and', 47815),
 ('', 44276),
 ('to', 38228),
 ('of', 35211),
 ('a', 33245),
 ('in', 19434),
 ('you', 18723),
 ('is', 17616),
 ('your', 16884),
 ('with', 14283),
 ('game', 11007),
 ('for', 9226),
 ('as', 8249),
 ('on', 8213),
 ('The', 7627),
 ('that', 7491),
 ('will', 7229),
 ('an', 6968),
 ('are', 6263),
 ('from', 6114),
 ('can', 5795),
 ('by', 5612),
 ('or', 5343),
 ('be', 5263),
 ('world', 4618),
 ('have', 4141),
 ('this', 4116),
 ('it', 4088),
 ('all', 4049),
 ('new', 4037),
 ('up', 3750),
 ('their', 3435),
 ('You', 3431),
 ('through', 3424),
 ('more', 3261),
 ('has', 3182),
 ('into', 3139),
 ('players', 2969),
 ('them', 2804),
 ('enemies', 2777),
 ('action', 2711),
 ('one', 2673),
 ('at', 2651),
 ('his', 2638),
 ('but', 2593),
 ('weapons', 2567),
 ('way', 2557),
 ('play', 2467),
 ('time', 2465),
 ('out', 2408),
 ('its', 2389),
 ('where', 2379),
 ('which', 2151),
 ('In', 2125),
 ('levels', 2113),
 ('not', 2082),
 ('each', 2081),
 ('unique', 2070),
 ('other', 2031)

In [110]:
negative_counts.most_common()

[('the', 78385),
 ('and', 53972),
 ('', 50643),
 ('of', 43621),
 ('to', 42834),
 ('a', 38628),
 ('in', 23221),
 ('you', 20977),
 ('is', 19994),
 ('your', 18731),
 ('with', 14145),
 ('game', 12774),
 ('for', 10457),
 ('on', 9428),
 ('The', 9281),
 ('that', 8820),
 ('as', 8739),
 ('will', 8063),
 ('an', 7380),
 ('are', 6995),
 ('from', 6734),
 ('by', 6541),
 ('or', 6123),
 ('can', 6042),
 ('be', 6027),
 ('world', 5658),
 ('new', 5026),
 ('it', 4963),
 ('this', 4921),
 ('have', 4498),
 ('all', 4453),
 ('their', 4189),
 ('has', 3771),
 ('his', 3700),
 ('more', 3637),
 ('into', 3635),
 ('You', 3449),
 ('up', 3416),
 ('at', 3304),
 ('through', 3116),
 ('time', 3097),
 ('one', 3071),
 ('them', 2906),
 ('but', 2827),
 ('its', 2796),
 ('out', 2783),
 ('A', 2730),
 ('her', 2721),
 ('where', 2704),
 ('own', 2617),
 ('play', 2586),
 ('who', 2533),
 ('In', 2496),
 ('not', 2482),
 ('players', 2436),
 ('which', 2376),
 ('they', 2375),
 ('adventure', 2373),
 ('story', 2367),
 ('find', 2320),
 ('way', 

In [111]:
pos_neg_ratios = Counter()

# Calculate the ratios of positive and negative uses of the most common words
# Consider words to be "common" if they've been used more than 100 times
for term,cnt in list(total_counts.most_common()):
    if(cnt > 100):
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
        pos_neg_ratios[term] = pos_neg_ratio

In [112]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'action' = {}".format(pos_neg_ratios["action"]))
print("Pos-to-neg ratio for 'time' = {}".format(pos_neg_ratios["time"]))

Pos-to-neg ratio for 'the' = 0.8109101114995024
Pos-to-neg ratio for 'action' = 4.17076923076923
Pos-to-neg ratio for 'time' = 0.7956746287927695


In [113]:
# Convert ratios to logs
for word,ratio in pos_neg_ratios.most_common():
    if(ratio > 1):
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log((1 / (ratio+0.01)))

In [114]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'action' = {}".format(pos_neg_ratios["action"]))
print("Pos-to-neg ratio for 'time' = {}".format(pos_neg_ratios["time"]))

Pos-to-neg ratio for 'the' = -0.19734166212611562
Pos-to-neg ratio for 'action' = 1.4281004866089826
Pos-to-neg ratio for 'time' = -0.2160753043401113


In [115]:
min_count = 25 # Words must appear more than 25 times to be counted
polarity_cutoff = 0.3
info_vocab = set()
for info in game_information:
    for word in info.split(" "):
        if(total_counts[word] > min_count):
            if(word in pos_neg_ratios.keys()):
                if((pos_neg_ratios[word] >= polarity_cutoff) or (pos_neg_ratios[word] <= -polarity_cutoff)):
                    info_vocab.add(word)
            else:
                info_vocab.add(word)

In [116]:
# Convert the vocabulary set to a list so we can access words via indices
info_vocab = list(info_vocab)

In [117]:
info_vocab[:10]

['Frank',
 'relationship',
 'cartoon',
 'invite',
 'cows',
 'selecting',
 'define',
 'outer',
 'whos',
 'Hope']

In [118]:
word2index = {}
for i, word in enumerate(info_vocab):
    word2index[word] = i

In [119]:
word2index

{'coal': 3438,
 'Frank': 0,
 'module': 1106,
 'visible': 3440,
 'details': 3441,
 'relationship': 1,
 'battlegrounds': 3443,
 'Allied': 3439,
 'cartoon': 2,
 'tour': 2821,
 'invite': 3,
 'terribly': 1108,
 'cows': 4,
 'selecting': 5,
 'Builder': 3442,
 'Kickstarter': 3444,
 'plunges': 3445,
 'automatically': 3446,
 'outer': 7,
 'whos': 8,
 'press': 3448,
 'trail': 3449,
 'Must': 567,
 'Designed': 3450,
 'Powerful': 4546,
 'Hope': 9,
 'corpses': 10,
 'named': 11,
 'Express': 3451,
 'charging': 3452,
 'headquarters': 12,
 'Templar': 13,
 'realtime': 14,
 'deceptively': 15,
 'Guardian': 16,
 'GPL': 3454,
 'casino': 17,
 'flip': 3455,
 'respect': 3456,
 'twisting': 3457,
 'Ruler': 5088,
 'rubble': 3458,
 'twists': 18,
 'inventions': 3459,
 'occult': 19,
 'RC': 6635,
 'company': 20,
 'refuse': 21,
 'directed': 3461,
 'guys': 22,
 'intensity': 23,
 'climbing': 25,
 'Management': 3463,
 'Eight': 26,
 'tracking': 27,
 'hopefully': 28,
 'signature': 6633,
 'load': 3464,
 'trigger': 29,
 'ACCESS

In [120]:
game_info2ints = []
info_vocab_set = set(info_vocab)
for each_game_information in game_information:
    game_info2ints.append([word2index[word] for word in each_game_information.split() if word in info_vocab_set])

In [121]:
game_info2ints[0]

[707,
 6636,
 5514,
 2055,
 4612,
 5730,
 4894,
 1603,
 1881,
 561,
 1134,
 707,
 6636,
 6228,
 2055,
 4612,
 5730,
 2478,
 561,
 1134,
 1981,
 3496,
 4894,
 1603,
 1881,
 4900,
 2144,
 6052,
 523,
 3468,
 6109,
 5108,
 759,
 5984,
 3225,
 2846,
 4415,
 2944]

### Fixing and Converting Game Information

Some of the game information contains no words from the vocabulary, that is, nothing specific to the game. We need to eliminate these entries so that the neural network focuses only on informative descriptions.

In [126]:
review_lens = Counter([len(x) for x in game_info2ints])
review_lens[0]

173

In [127]:
max(review_lens)

1257

In [128]:
min(review_lens)

0

In [129]:
non_zero_idx = [ii for ii, description in enumerate(game_info2ints) if len(description) != 0]
len(non_zero_idx)

14603

In [131]:
description = [game_info2ints[ii] for ii in non_zero_idx]
labels = np.array([game_labels[ii] for ii in non_zero_idx])

assert(len(description) == len(labels))

Some of the game information contains more than 1000 words. Compared with the other games, this extra text is unnecessary and would create large differences in sequence length during learning, so we need to set a limit on the sequence length.
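
The limiting step amounts to left-padding short sequences with zeros and keeping only the first `seq_len` ids of long ones. A minimal pure-Python sketch of that behavior follows; the helper name `pad_or_truncate` is hypothetical, and the notebook does the same thing with a NumPy array.

```python
def pad_or_truncate(token_ids, seq_len):
    """Left-pad with zeros, or keep only the first seq_len ids if too long."""
    kept = token_ids[:seq_len]
    return [0] * (seq_len - len(kept)) + kept

short_row = pad_or_truncate([7, 8, 9], 5)
long_row = pad_or_truncate([1, 2, 3, 4, 5, 6], 4)
print(short_row)
print(long_row)
```

One thing to keep in mind: the padding value 0 is also a valid index in `word2index` as built above, so padding and the word at index 0 share the same id.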

In [132]:
seq_len = 150

features = np.zeros((len(description), seq_len), dtype=int)
for i, row in enumerate(description):
    features[i, -len(row):] = np.array(row)[:seq_len]
    
features[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,  707, 6636, 5514, 2055, 4612, 5730, 4894, 1603, 1881,
        561, 1134,  707, 6636, 6228, 2055, 4612, 5730, 2478,  561, 1134,
       1981, 3496, 4894, 1603, 1881, 4900, 2144, 6052,  523, 3468, 6109,
       5108,  759, 5984, 3225, 2846, 4415, 2944])

In [133]:
split_frac = 0.7
split_idx = int(len(features)*split_frac)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels[:split_idx], labels[split_idx:]

test_idx = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_idx], val_x[test_idx:]
val_y, test_y = val_y[:test_idx], val_y[test_idx:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(10222, 150) 
Validation set: 	(2190, 150) 
Test set: 		(2191, 150)


### Building the neural network

We are using a recurrent neural network (RNN) built from LSTM cells.
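
For readers unfamiliar with LSTM cells, the update performed at each time step can be sketched in plain Python for a single hidden unit. This is an illustrative toy using the standard LSTM gate equations, not the code inside `tf.contrib.rnn.BasicLSTMCell`; the names `lstm_step` and `params` and the all-zero weights are assumptions for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single hidden unit with scalar input and state.

    params maps each gate name to a (w_x, w_h, b) tuple.
    """
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    i = gate('input', sigmoid)       # how much new information to write
    f = gate('forget', sigmoid)      # how much of the old cell state to keep
    o = gate('output', sigmoid)      # how much of the cell state to expose
    g = gate('candidate', math.tanh) # proposed new cell content

    c = f * c_prev + i * g           # updated cell state
    h = o * math.tanh(c)             # updated hidden state
    return h, c

# Hypothetical all-zero weights: every sigmoid gate outputs 0.5, candidate is 0
zero_params = {name: (0.0, 0.0, 0.0)
               for name in ('input', 'forget', 'output', 'candidate')}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, params=zero_params)
print(h, c)
```

In the network below, the same kind of update is applied with `lstm_size` hidden units per layer, and the hidden state after the last time step is passed to the output layer.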

In [210]:
lstm_size = 256
lstm_layers = 2
batch_size = 100
learning_rate = 0.001

In [211]:
n_words = len(info_vocab)

# Create the graph object
graph = tf.Graph()
# Add nodes to the graph
with graph.as_default():
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

In [212]:
# Size of the embedding vectors (number of units in the embedding layer)
embed_size = 100

with graph.as_default():
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)

In [213]:
with graph.as_default():
    # Stack up multiple basic LSTM cells, for deep learning.
    # Note: no DropoutWrapper is applied here, so the keep_prob value fed
    # during training has no effect on this graph.
    cell = tf.contrib.rnn.MultiRNNCell([
        tf.contrib.rnn.BasicLSTMCell(lstm_size)
        for _ in range(lstm_layers)
    ])
    
    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)

In [214]:
with graph.as_default():
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                             initial_state=initial_state)

In [215]:
with graph.as_default():
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

In [216]:
with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

In [217]:
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]


In [218]:
epochs = 10

with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration%5==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%25==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
    saver.save(sess, "checkpoints/sentiment.ckpt")

Epoch: 0/10 Iteration: 5 Train loss: 0.239
Epoch: 0/10 Iteration: 10 Train loss: 0.234
Epoch: 0/10 Iteration: 15 Train loss: 0.248
Epoch: 0/10 Iteration: 20 Train loss: 0.247
Epoch: 0/10 Iteration: 25 Train loss: 0.203
Val acc: 0.610
Epoch: 0/10 Iteration: 30 Train loss: 0.199
Epoch: 0/10 Iteration: 35 Train loss: 0.202
Epoch: 0/10 Iteration: 40 Train loss: 0.259
Epoch: 0/10 Iteration: 45 Train loss: 0.227
Epoch: 0/10 Iteration: 50 Train loss: 0.217
Val acc: 0.638
Epoch: 0/10 Iteration: 55 Train loss: 0.207
Epoch: 0/10 Iteration: 60 Train loss: 0.193
Epoch: 0/10 Iteration: 65 Train loss: 0.198
Epoch: 0/10 Iteration: 70 Train loss: 0.271
Epoch: 0/10 Iteration: 75 Train loss: 0.209
Val acc: 0.654
Epoch: 0/10 Iteration: 80 Train loss: 0.230
Epoch: 0/10 Iteration: 85 Train loss: 0.231
Epoch: 0/10 Iteration: 90 Train loss: 0.192
Epoch: 0/10 Iteration: 95 Train loss: 0.170
Epoch: 0/10 Iteration: 100 Train loss: 0.219
Val acc: 0.730
Epoch: 1/10 Iteration: 105 Train loss: 0.224
Epoch: 1/10 Ite

### Testing the Data

After the training is completed, it is time to test the neural network.

In [219]:
test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))

INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt
Test accuracy: 0.745
