Title: Boardgame Recommender neural network <br>
Author: Austin Chou<br>
Date: 2018-11-07

# Purpose

The intent of this exercise is to:<br>

1) Practice programming and data manipulation in Python <br>
2) Implement an NLP-inspired approach to develop a boardgame recommender system

# Problem

There are a variety of methods for building recommendation engines. I want to apply one of them to boardgames, which are a personal hobby.

The website "https://boardgamegeek.com/" has over 10000 ranked boardgames (including expansions and variations of the same base boardgame). For new boardgame collectors and hobbyists, the number of choices is overwhelming. Even if we look at just the games released in 2017, there are over 3000 boardgames.

Thus the primary problem is to identify patterns to help recommend boardgames to people who are looking to grow their boardgame collection.  

# Approach

The approach I chose was inspired by a basic Natural Language Processing technique: Continuous Bag of Words (CBOW). In CBOW (and its sibling, Skip-gram), a word is essentially defined by its surrounding words. In the case of CBOW, we are given a set of context words (as in a sentence) and predict which word is most likely to be surrounded by that set.  

In that way, I thought it would be interesting to build a boardgame recommender based on which boardgame is most likely to appear alongside a given set of boardgames. This is quite different from describing boardgames by their listed features (Categories, Mechanics, player count, etc).

Of course, there are limitations to this approach (assuming the model can be built): because it relies on existing collections, recently released boardgames are effectively ignored by the model, since they appear in very few collections. The CBOW approach would probably work best for a new hobbyist who is interested in building a collection of games that have been around for a while (and can be found in many collections).

## Load appropriate functions, packages, and parameters

In [28]:
# Load in functions
%run functions-bgg_get.py

# General libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math
%matplotlib inline

# Regular expression
import re

# Flags to rerun code chunks
recollect = False
reclean = False

# Create Training Matrix: Aggregate all collections

The first step is to take all the player collections we have scraped and turn them into one giant matrix. CBOW and other NLP techniques use one-hot encoding; analogously, we will one-hot encode our boardgames. Each row represents a Collection, and each column in that row represents a game. <br>
 - If the value is a 0, then the game does not exist in the Collection. <br>
 - If the value is a 1, then the game exists in the Collection.
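
As a rough illustration of the encoding described above, the sketch below turns a list of game titles into a 0/1 row vector. The helper name `encode_collection` and the three-game dictionary are made up for this example; the notebook's own `games2vec` (loaded from functions-bgg_get.py and not shown here) may work differently.

```python
import numpy as np

# Hypothetical title -> column index dictionary
# (the notebook builds its real dictionary `games` with bgg_dict)
toy_games = {'Codenames': 0, 'Pandemic': 1, '7 Wonders': 2}

def encode_collection(titles, game_index):
    """Return a (1, n_games) array with 1 where the collection contains the game."""
    vector = np.zeros((1, len(game_index)))
    for title in titles:
        # Titles outside the dictionary (e.g. games below the cutoff) are ignored
        if title in game_index:
            vector[0, game_index[title]] = 1
    return vector

toy_vector = encode_collection(['Pandemic', 'Codenames', 'Some Unranked Game'], toy_games)
print(toy_vector)
```

Stacking one such row per user gives the collection matrix used in the rest of the notebook.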

We will do additional cleaning and apply some limitations to try and reduce the number of features we will be feeding into our model: <br>
 - Limit the number of games we care about. I chose to only recommend from the top 5000 games (as opposed to trying to consider all 10000+).<br>
 - Remove games from users' collections that fall below a certain Rating by the user. If the user didn't rate the game, we'll use the overall Rating found on boardgamegeek. The idea is that if the game doesn't rate highly, it should not be valued for recommendation.
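
The rating filter could look roughly like the sketch below. It is only an illustration: the column names `UserRating` and `BGGRating`, the function name, and the threshold of 7.0 are assumptions, since the notebook's actual `RatingThreshold` lives in functions-bgg_get.py and is not shown.

```python
import numpy as np
import pandas as pd

def rating_threshold_sketch(collection, threshold=7.0):
    """Keep games whose user rating (or, if missing, overall BGG rating) meets the threshold."""
    # Fall back to the site-wide rating where the user left no rating
    effective = collection['UserRating'].fillna(collection['BGGRating'])
    # Boolean indexing returns a new frame, so the input is left unmodified
    return collection[effective >= threshold].reset_index(drop=True)

toy_collection = pd.DataFrame({
    'Game': ['Codenames', 'Pandemic', 'Monopoly'],
    'UserRating': [8.0, np.nan, 4.0],
    'BGGRating': [7.7, 7.6, 4.4],
})
kept = rating_threshold_sketch(toy_collection)
print(kept['Game'].tolist())
```

Returning a filtered copy rather than dropping rows in place is one way to avoid pandas' "copy of a slice" warning.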

In [14]:
# Number of games we are considering (from top rated)
total = 5000

# Get the game_list
games_list = pd.read_csv('bgg_id_output.csv')
#Remove all NaN rows
games_list.dropna(axis=0,how='any',inplace=True)
games_list.reset_index(drop=True,inplace=True)
#There are repeat titles in the list. Remove them.
rep_games_idx = games_list[games_list['Game'].duplicated()].index.tolist() #Returns the indices of all repeat titles. This list does NOT include the first appearance
games_list.drop(games_list.index[rep_games_idx],inplace=True)
games_list.reset_index(drop=True,inplace=True)
#Reduce to # of games to consider
games_list = games_list.loc[:total,:]

# Get the playerlist
playerlist = pd.read_excel('playerlist.xlsx')
# Generate the bgg dictionary of indices to game
games, games_decode = bgg_dict(games_list, total)

In [15]:
# Track removed users (those left with a single game after cleanup)
removed_users = []
# Collect one encoded row per user; concatenate once after the loop
user_rows = []

for i in range(len(playerlist)):
    # Get the user file
    user = playerlist['Username'][i]
    file = str(user) + '_raw.csv'
    user_collection = pd.read_csv(file, sep='\t', encoding='ISO-8859-1')
    # Drop the column 'Unnamed: 0' that is used to number the rows
    user_collection.drop('Unnamed: 0', inplace=True, axis=1)
    
    # Apply RatingThreshold
    user_collection_cleaned = RatingThreshold(user_collection)
    
    # Skip users left with exactly 1 game: a single game cannot be split
    #  into an input set plus a held-out game
    if user_collection_cleaned.shape[0] == 1:
        print('Only 1 game left after cleaning')
        removed_users.append(user)
    else:
        # Vectorize the collection
        user_vector = games2vec(user_collection_cleaned['Game'], games)
        # Add user number
        user_vector = np.append([[i]], user_vector, axis=1)
        # Create row as a dataframe
        user_row = pd.DataFrame(user_vector, columns=list(['User']) + list(games_list['Game']))
        user_rows.append(user_row)

# Building the frame once avoids relying on user 0 passing the filter
aggregate_collections = pd.concat(user_rows)
aggregate_collections.to_csv('Coded user collections.csv',encoding='utf-8')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  collection.drop(collection.index[low_bgg_idx],inplace=True)


In [18]:
print("Dimensions of our matrix:")
print(aggregate_collections.shape)

Dimensions of our matrix:
(211, 5002)


In [19]:
print("Users removed because only one highly rated game remained:")
print(removed_users)

Users removed because only one highly rated game remained:
[]


## Create Training Matrix: Split user matrix to inputs and outputs

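The next step is to isolate each game from each collection in turn. The isolated game becomes the expected output (a one-hot vector with a "1" at that game's index), and the remaining games in the collection become the input features. In other words, we split our matrix into an input matrix (the encoded collection minus one game) and an output matrix (the encoded game that appears with that input set), so each collection contributes several training rows.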
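
To make the split concrete, here is a minimal leave-one-out sketch. It is an assumption about what `generate_training` does based on the description above, not the notebook's actual implementation (which is loaded from functions-bgg_get.py); the function name is hypothetical.

```python
import numpy as np

def generate_training_sketch(collection_vector):
    """Split one 0/1 collection vector into leave-one-out (input, output) rows."""
    owned = np.flatnonzero(collection_vector)
    n_games = collection_vector.shape[0]
    inputs = np.zeros((len(owned), n_games))
    outputs = np.zeros((len(owned), n_games))
    for row, held_out in enumerate(owned):
        inputs[row] = collection_vector
        inputs[row, held_out] = 0     # hide one game from the input ...
        outputs[row, held_out] = 1    # ... and make it the one-hot target
    return inputs, outputs

# A toy collection that owns games 0, 2 and 3 out of 4
toy = np.array([1, 0, 1, 1])
toy_in, toy_out = generate_training_sketch(toy)
print(toy_in)
print(toy_out)
```

Under this scheme a collection with k games yields k rows, each pairing the remaining k-1 games with the one that was held out.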

In [34]:
# Get the aggregated collections list
aggregate_collections = pd.read_csv('Coded user collections.csv')
aggregate_collections.drop(['Unnamed: 0'], inplace=True, axis=1)

# Number of games we are considering (from top rated)
total = 5000

# Get the game_list
games_list = pd.read_csv('bgg_id_output.csv')
#Remove all NaN rows
games_list.dropna(axis=0,how='any',inplace=True)
games_list.reset_index(drop=True,inplace=True)
#There are repeat titles in the list. Remove them.
rep_games_idx = games_list[games_list['Game'].duplicated()].index.tolist() #Returns the indices of all repeat titles. This list does NOT include the first appearance
games_list.drop(games_list.index[rep_games_idx],inplace=True)
games_list.reset_index(drop=True,inplace=True)
#Reduce to # of games to consider
games_list = games_list.loc[:total,:]

# Get the playerlist
playerlist = pd.read_excel('playerlist.xlsx')
# Generate the bgg dictionary of indices to game
games, games_decode = bgg_dict(total)

In [35]:
#Keep track of how many training rows we will have
total_training_rows = 0
#Collect the per-user blocks and stack them once after the loop
input_blocks = []
output_blocks = []

#Iterate through all users and convert collections into input vectors and expected_output vectors
for i in range(0, aggregate_collections['User'].shape[0]):
    #Use generate_training to create the input and expected output vectors
    # (.values replaces the deprecated .as_matrix())
    user_input, user_exp_output = generate_training(aggregate_collections.iloc[i,1:].values, games)
    
    input_blocks.append(user_input)
    output_blocks.append(user_exp_output)
    
    #Update number of rows generated
    total_training_rows = total_training_rows + user_input.shape[0]

#Append vectors into a matrix
training_data = np.concatenate(input_blocks, axis=0)
exp_output_data = np.concatenate(output_blocks, axis=0)

In [36]:
# For storage, store input and output stacked vertically in a csv
saving = np.concatenate((training_data, exp_output_data), axis=0)
saving = pd.DataFrame(saving, columns=list(games_list['Game']))
saving.to_csv('Vectorized bgg data.csv', encoding='utf-8')

# Split Training Data

As with training any model, we split our encoded data into training and held-out sets. The held-out set is called CV (cross-validation) in the code below.

In [37]:
# Number of games we are considering (from top rated)
total = 5000
# Generate the bgg dictionary of indices to game
games, games_decode = bgg_dict(total)

In [38]:
# Load vectorized data
vectorized = pd.read_csv('Vectorized bgg data.csv')
vectorized.drop('Unnamed: 0', axis=1, inplace=True) #Remove unnecessary first column
row_size = int(vectorized.shape[0]/2) #The actual total datapoints is half the number of rows

# Split the data into the input and expected_output vectors
# .values replaces the deprecated .as_matrix()
input_data = vectorized.iloc[:row_size,:].values
exp_output_data = vectorized.iloc[row_size:,:].values

print(input_data.shape)
print(exp_output_data.shape)

(8515, 5001)
(8515, 5001)


In [9]:
# Divide dataset into Training and Test data
from sklearn.model_selection import train_test_split

Train_inputs, CV_inputs, Train_outputs, CV_outputs = train_test_split(input_data, exp_output_data, test_size=0.25, random_state=101)

print('Training inputs: ' + str(Train_inputs.shape))
print('Training outputs: ' + str(Train_outputs.shape))
print('CV inputs: ' + str(CV_inputs.shape))
print('CV outputs: ' + str(CV_outputs.shape))

Training inputs: (6411, 5001)
Training outputs: (6411, 5001)
CV inputs: (2138, 5001)
CV outputs: (2138, 5001)


# Some extra analysis

Before we hop into training, I did some quick exploratory analysis out of interest.

In [39]:
# Sum all the columns of expected_outputs to see how many collections 
#  include each boardgame
mentions = exp_output_data.sum(axis=0)

In [40]:
# Game that appears the most
print(games_decode[mentions.argmax()] +': '+ str(mentions.max()))

Codenames: 77.0


In [42]:
# Get indices of games that are in user collections and 
#  games that never appear in the user collections
not_in_collections_idx = np.where(mentions==0)[0]
in_collections_idx = np.nonzero(mentions)[0]

In [47]:
print("Number of games from the Top 5000 that don't appear in anyone's collections:")
print(len(not_in_collections_idx))

Number of games from the Top 5000 that don't appear in anyone's collections:
3160


In [48]:
print("Number of games from the Top 5000 that appear in at least one collection:")
print(len(in_collections_idx))

Number of games from the Top 5000 that appear in at least one collection:
1841


Out of the top 5000 games, 3160 are not in any of the user collections. The indices of these games are stored in the array not_in_collections_idx.

# Neural Network Models

I decided to try two different approaches to implementing CBOW. First, I tried sklearn's Multilayer Perceptron (a neural network with one hidden layer) to see how well that would work.

Second, since I wanted to try and learn to use TensorFlow, I decided to implement a model through TensorFlow.

# Black box MultiLayer Perceptron

As a first pass, just try sklearn's MLPClassifier as a black box.

Inspired by: https://towardsdatascience.com/the-perils-of-predictive-policing-11928a9f1d60

http://scikit-learn.org/stable/modules/neural_networks_supervised.html

https://en.wikipedia.org/wiki/Limited-memory_BFGS



In [21]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(300), #1 hidden layer of 300 neurons
                    solver = 'lbfgs', #Limited memory BFGS
                    max_iter = 300, #Converge by 300 iterations or optimization stops
                    random_state=1) 

In [22]:
mlp.fit(Train_inputs,Train_outputs)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=300, learning_rate='constant',
       learning_rate_init=0.001, max_iter=300, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)

In [23]:
# Evaluate training accuracy
mlp.score(Train_inputs, Train_outputs)

0.81547340508501009

In [24]:
# Save the model
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
filename = 'bgg_rec_mlp.joblib.pkl'
_ = joblib.dump(mlp, filename, compress=9)

#### For future use, we can just load the model

In [26]:
from sklearn.neural_network import MLPClassifier
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
filename = 'bgg_rec_mlp.joblib.pkl'

In [27]:
loaded_mlp = joblib.load(filename)

In [28]:
loaded_mlp.score(Train_inputs, Train_outputs)

0.81547340508501009

#### Let's try checking the CV dataset

In [29]:
loaded_mlp.score(CV_inputs, CV_outputs)

0.059401309635173059

The MLP does decently well on the training data set (score = 0.82) but terribly on the CV data set (score = 0.06).
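
Because the targets are 0/1 rows, `MLPClassifier.score` reports exact-match accuracy: a row counts as correct only if the whole predicted vector equals the expected one. For a recommender, a softer check is whether the held-out game lands among the k highest-scored games. The sketch below is one way to compute that; the function name and toy arrays are made up, and in the notebook the scores might come from something like `loaded_mlp.predict_proba(CV_inputs)` (an untested assumption).

```python
import numpy as np

def top_k_hit_rate(probabilities, one_hot_targets, k=10):
    """Fraction of rows whose true game is among the k highest-scored games."""
    true_idx = one_hot_targets.argmax(axis=1)
    # Indices of the k largest scores per row (order within the top k does not matter)
    top_k = np.argpartition(-probabilities, k - 1, axis=1)[:, :k]
    hits = (top_k == true_idx[:, None]).any(axis=1)
    return hits.mean()

toy_probs = np.array([[0.1, 0.6, 0.3],
                      [0.5, 0.2, 0.3],
                      [0.2, 0.3, 0.5]])
toy_targets = np.array([[0, 0, 1],
                        [0, 1, 0],
                        [0, 0, 1]])
print(top_k_hit_rate(toy_probs, toy_targets, k=1))
print(top_k_hit_rate(toy_probs, toy_targets, k=2))
```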

Let's see if there's an effect of different numbers of neurons in the hidden layer.

In [37]:
from sklearn.neural_network import MLPClassifier
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
from IPython.display import clear_output

neurons = [100, 300, 500, 1000]

col = ['#Neurons', 'Training Acc', 'CV Acc']
mlp_results = pd.DataFrame(columns=col)

for n in neurons:
    mlp = MLPClassifier(hidden_layer_sizes = (n,),
                       solver = 'lbfgs',
                       max_iter = 200,
                       random_state = 1)
    
    mlp.fit(Train_inputs, Train_outputs)
    
    trained_acc = mlp.score(Train_inputs, Train_outputs)
    CV_acc = mlp.score(CV_inputs, CV_outputs)
    
    # pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
    mlp_results = pd.concat([mlp_results, pd.DataFrame([[n, trained_acc, CV_acc]], columns=col)])
    
    clear_output(wait = True)
    print(mlp_results)
    
    filename = 'bgg_mlp_' + str(n) + 'neurons.joblib.pkl'
    _ = joblib.dump(mlp, filename, compress=9)

In [38]:
mlp_results

   #Neurons  Training Acc    CV Acc
0       100      0.119482   0.00608
0       300      0.434254  0.018241
0       500      0.433006  0.019177
0      1000       0.67041  0.055192


Let's also vary the iteration cap (max_iter) to see how many iterations the optimizer needs to converge.

In [39]:
from sklearn.neural_network import MLPClassifier
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
from IPython.display import clear_output

iter_ct = [100, 300, 500, 1000]

col = ['Iter', 'Training Acc', 'CV Acc']
mlp_results_iter = pd.DataFrame(columns=col)

for i in iter_ct:
    mlp = MLPClassifier(hidden_layer_sizes = (300,),
                       solver = 'lbfgs',
                       max_iter = i,
                       random_state = 1)
    
    mlp.fit(Train_inputs, Train_outputs)
    
    trained_acc = mlp.score(Train_inputs, Train_outputs)
    CV_acc = mlp.score(CV_inputs, CV_outputs)
    
    # pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
    mlp_results_iter = pd.concat([mlp_results_iter, pd.DataFrame([[i, trained_acc, CV_acc]], columns=col)])
    
    clear_output(wait = True)
    print(mlp_results_iter)
    
    filename = 'bgg_mlp_' + str(i) + 'iter.joblib.pkl'
    _ = joblib.dump(mlp, filename, compress=9)

   Iter  Training Acc    CV Acc
0  1000      0.000000  0.000000
0  1000      0.813914  0.058934
0  1000      0.976447  0.063611
0  1000      0.999844  0.065014


The Iter column above is incorrect (every row shows 1000) because of a bug in the run that produced it. Reset it to the values that were actually used.

In [41]:
# Assign the column directly; chained indexing (['Iter'].loc[:]) triggers SettingWithCopyWarning
mlp_results_iter['Iter'] = [100, 300, 500, 1000]


In [42]:
mlp_results_iter

   Iter  Training Acc    CV Acc
0   100      0.000000  0.000000
0   300      0.813914  0.058934
0   500      0.976447  0.063611
0  1000      0.999844  0.065014


More iterations allow the training accuracy to converge: it takes at least 500 iterations to get above 90% training accuracy. The CV accuracy, however, barely moves beyond the 300-iteration cap (0.059 at 300 versus 0.065 at 1000), whereas increasing the number of neurons did raise the CV accuracy in the previous experiment.

Let's try 1000 neurons and 1500 neurons, each with a 1000-iteration cap.

In [11]:
from sklearn.neural_network import MLPClassifier
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
from IPython.display import clear_output

neurons = [1000, 1500]

col = ['#Neurons', 'Training Acc', 'CV Acc']
mlp_results = pd.DataFrame(columns=col)

for n in neurons:
    mlp = MLPClassifier(hidden_layer_sizes = (n,),
                       solver = 'lbfgs',
                       max_iter = 1000,
                       random_state = 1)
    
    mlp.fit(Train_inputs, Train_outputs)
    
    trained_acc = mlp.score(Train_inputs, Train_outputs)
    CV_acc = mlp.score(CV_inputs, CV_outputs)
    
    # pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
    mlp_results = pd.concat([mlp_results, pd.DataFrame([[n, trained_acc, CV_acc]], columns=col)])
    
    clear_output(wait = True)
    print(mlp_results)
    
    filename = 'bgg_mlp_' + str(n) + 'neurons_1000iter.joblib.pkl'
    _ = joblib.dump(mlp, filename, compress=9)

  #Neurons  Training Acc    CV Acc
0     1000      0.999844  0.080917
0     1500      0.000000  0.000000


In [10]:
from sklearn.neural_network import MLPClassifier
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
from IPython.display import clear_output

neurons = [1500]

col = ['#Neurons', 'Training Acc', 'CV Acc']
mlp_results = pd.DataFrame(columns=col)

for n in neurons:
    # Use the loop variable rather than a hard-coded 1500
    mlp = MLPClassifier(hidden_layer_sizes = (n,),
                       solver = 'lbfgs',
                       max_iter = 1000,
                       random_state = 1)
    
    mlp.fit(Train_inputs, Train_outputs)
    
    trained_acc = mlp.score(Train_inputs, Train_outputs)
    CV_acc = mlp.score(CV_inputs, CV_outputs)
    
    # pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
    mlp_results = pd.concat([mlp_results, pd.DataFrame([[n, trained_acc, CV_acc]], columns=col)])
    
    #clear_output(wait = True)
    print(mlp_results)
    
    #filename = 'bgg_mlp_' + str(n) + 'neurons_1000iter.joblib.pkl'
    #_ = joblib.dump(mlp, filename, compress=9)

  #Neurons  Training Acc  CV Acc
0     1500           0.0     0.0


With the scikit-learn MLP, 1000 neurons with 1000 iterations performs very well on the Training set (accuracy = 0.9998) but still has very low CV accuracy (0.08). The model is essentially overfitting the Training data set.

Furthermore, above 1000 neurons in the hidden layer the model falls apart: the 1500-neuron run scored 0.0 on both sets.

To try to account for overfitting, we can either obtain more training examples (which is not trivial) or reduce the number of features (which is possible: the analysis above found 3160 games in the top 5000 that never show up in any collection).
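
A minimal sketch of the feature-reduction idea, assuming the input and expected-output matrices share the same column order as they do in this notebook. The function name is hypothetical and this step was not run as part of the original analysis.

```python
import numpy as np

def drop_unused_games(inputs, outputs):
    """Remove game columns that never appear as a target in `outputs`."""
    # Same idea as in_collections_idx: columns with at least one mention
    keep = np.flatnonzero(outputs.sum(axis=0))
    return inputs[:, keep], outputs[:, keep], keep

# Toy data: 4 game columns, of which columns 1 and 3 are never owned
toy_inputs = np.array([[0, 0, 1, 0],
                       [1, 0, 0, 0]])
toy_outputs = np.array([[1, 0, 0, 0],
                        [0, 0, 1, 0]])
small_in, small_out, kept_idx = drop_unused_games(toy_inputs, toy_outputs)
print(kept_idx, small_in.shape)
```

The returned `kept_idx` would be needed afterwards to map the reduced column positions back to game titles.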

# Word2Vec and Tensorflow implementation

https://www.tensorflow.org/tutorials/word2vec

http://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/

https://stackoverflow.com/questions/37394970/tensorflow-word2vec-cbow-model

https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

As in word2vec examples, we have created "one-hot" vectors in which each unique integer value (index) is assigned to a specific boardgame. We have the dictionaries required to encode and decode boardgame collections and the corresponding outputs, and we already have our dataset with both the collection inputs and the expected outputs for training.

Word2vec has only 1 hidden layer: a word embedding matrix.

The output layer uses the softmax function, which turns the raw output scores into a probability distribution over all games.
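
For reference, here is a small NumPy version of softmax applied to one row of raw scores. It is only an illustration of the function itself; the TensorFlow cells in this notebook use `tf.nn.softmax`.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps the exponentials from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Raw scores for three hypothetical games
toy_logits = np.array([[2.0, 1.0, 0.0]])
toy_softmax = softmax(toy_logits)
print(toy_softmax, toy_softmax.sum())
```

Each output lies between 0 and 1 and every row sums to 1, so the largest value can be read as the most likely game.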

https://gist.github.com/discorev/b6a0900a52b62cd04f33

https://gist.github.com/yxtay/a94d971955d901c4690129580a4eafb9

http://adventuresinmachinelearning.com/python-tensorflow-tutorial/

In [11]:
import tensorflow as tf

# Python optimization variables
learning_rate = 0.5
epochs = 10
batch_size = 100

# Training data placeholders
#input x will be an unknown # of training examples encoded as vectors of length 13955 (boardgames)
x = tf.placeholder(tf.float32, [None, 13955]) 
#output placeholder will be # of predictions encoded as vectors of length 13955 (boardgames)
y = tf.placeholder(tf.float32, [None, 13955])

#weights connecting input to hidden layer
# tf.random_normal will generate values from a mean=0 and stddev=input
W1 = tf.Variable(tf.random_normal([13955,300], stddev=0.03), name = 'W1') #300 Neuronal layer
b1 = tf.Variable(tf.random_normal([300]), name = 'b1') #First layer bias

#weights connecting hidden layer to output layer
W2 = tf.Variable(tf.random_normal([300,13955],stddev=0.03), name = 'W2') #Convert back to 13955 output
b2 = tf.Variable(tf.random_normal([13955]), name = 'b2') #Second layer bias


# Calculations for the hidden layer
hidden_out = tf.add(tf.matmul(x, W1),b1) #matrix multiply x and W1 weights, add b1 bias
hidden_out = tf.nn.relu(hidden_out) #rectified linear unit activation function: converts all negative values to 0

# Calculate output layer with softmax
y_ = tf.nn.softmax(tf.add(tf.matmul(hidden_out, W2), b2)) #apply softmax after matrix multiply hidden layer output and W2 weights, and adding b2 bias


# Cost function: Cross Entropy
y_clipped = tf.clip_by_value(y_, 1e-10, 0.9999999) #Limits output values between 1e-10 and 0.9999999; prevents log(0) operations
cross_entropy = -tf.reduce_mean(tf.reduce_sum(y * tf.log(y_clipped) 
                                              + (1-y) * tf.log(1 - y_clipped), axis=1))

# Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(cross_entropy)
#optimizer will minimize the cross entropy cost function and use the learning rate we set (0.5) as learning rate alpha

# Initialization operator
init_op = tf.global_variables_initializer()

# Define an accuracy assessment operation
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_, 1)) #Checks that the predicted game matches the expected game
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) #Changes boolean correct_prediction to float, then finds mean. 1 = highest accuracy


In [None]:
# Start tensorflow session
# Note: the placeholders above are 13955 columns wide, so the arrays fed here
#  must have the same width (Train_inputs as loaded in this notebook has 5001 columns)

with tf.Session() as sess:
    # initialize the variables
    sess.run(init_op)
    total_batch = int(len(Train_outputs) / batch_size)
    
    for epoch in range(epochs):
        avg_cost = 0
        for i in range(total_batch):
            # Take the next consecutive slice of rows as the mini-batch
            batch_x = Train_inputs[i * batch_size:(i + 1) * batch_size]
            batch_y = Train_outputs[i * batch_size:(i + 1) * batch_size]
            _, c = sess.run([optimizer, cross_entropy],
                           feed_dict = {x: batch_x, y: batch_y})
            avg_cost += c / total_batch
        
        print("Epoch: ", (epoch + 1), "cost = ", "{:.3f}".format(avg_cost))
        
    print(sess.run(accuracy, feed_dict={x: Train_inputs, y: Train_outputs}))

## Implementing Tensorflow Model


https://towardsdatascience.com/learn-word2vec-by-implementing-it-in-tensorflow-45641adaf2ac

In [1]:
import tensorflow as tf
import numpy as np

In [12]:
print('Training inputs: ' + str(Train_inputs.shape))
print('Training outputs: ' + str(Train_outputs.shape))
print('CV inputs: ' + str(CV_inputs.shape))
print('CV outputs: ' + str(CV_outputs.shape))

Training inputs: (5984, 5001)
Training outputs: (5984, 5001)
CV inputs: (2565, 5001)
CV outputs: (2565, 5001)


In [17]:
games
len(games_decode)

5001

In [92]:
#NEW VERSION
#Create tensorflow variables

# Make placeholders for x_train and y_train (x = data points, y = expected labels)
# x has an unknown number of rows and len(games) columns, the same as our encoded collections
# y_label holds the column index of the held-out game (nce_loss expects class indices, not one-hot rows)

col_size = len(games)
num_sampled = 30 #Number of negative examples to sample
EMBEDDING_DIM = 150 #Hyperparameter to be adjusted; # neurons of the hidden layer


def gamegram():
    x = tf.placeholder(tf.float32, shape = [None, col_size])
    y_label = tf.placeholder(tf.int64, shape=[None, 1])

    with tf.variable_scope("gamegram") as scope:
        embeddings = tf.Variable(tf.random_uniform([col_size,
                                                   EMBEDDING_DIM],
                                                  -1.0, 1.0))
        # CBOW-style hidden layer: average the embeddings of the games in each collection
        games_per_row = tf.maximum(tf.reduce_sum(x, axis=1, keepdims=True), 1.0)
        batch_embeddings = tf.matmul(x, embeddings) / games_per_row

        weights = tf.Variable(tf.truncated_normal([col_size,
                                                  EMBEDDING_DIM],
                                                  stddev = 1.0/math.sqrt(EMBEDDING_DIM)))
        biases = tf.Variable(tf.zeros([col_size]))

        # nce_loss takes the hidden representation as inputs and one true class per row
        loss = tf.reduce_mean(tf.nn.nce_loss(weights = weights,
                                            biases = biases,
                                            labels = y_label,
                                            inputs = batch_embeddings,
                                            num_sampled = num_sampled,
                                            num_classes = col_size))

        # L2 norm of each embedding row
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims = True))

        normalized_embeddings = embeddings/norm
        
#        val_embeddings = tf.nn.embedding_lookup(normalized_embeddings, val_dataset)
#        similarity = tf.matmul(val_embeddings, normalized_embeddings, transpose_b=True)
    
        return x, y_label, normalized_embeddings, loss #, similarity

In [93]:
def run():
    n_iters = 100
    report_every = 10
    
    x, y_label, normalized_embeddings, loss = gamegram()
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
    
    init = tf.global_variables_initializer()
    
    # Convert the one-hot expected outputs into a column of class indices
    label_idx = np.argmax(Train_outputs, axis=1).reshape(-1, 1)
    
    with tf.Session() as sess:
        sess.run(init)
        
        average_loss = 0.0
        
        # train for n_iters iterations on the full training set
        for step in range(n_iters):
            feed_dict = {x: Train_inputs, y_label: label_idx}
            _, loss_val = sess.run([optimizer, loss], feed_dict)
            average_loss += loss_val

            if step % report_every == 0:
                if step > 0:
                    average_loss /= report_every
                print('loss at iter', step, ':', average_loss)
                average_loss = 0        

        final_embeddings = normalized_embeddings.eval()
        return final_embeddings
                    
#         for step, batch_data in enumerate(train_data): #need to change
#            # inputs, labels = batch_data #need to change
#            # feed_dict = {batch_inputs: inputs, batch_labels: labels}
            
#             _, loss_val = session.run([optimizer, loss], feed_dict)
#             average_loss += loss_val
            
#             if step %1000 == 0:
#                 if step > 0:
#                     average_loss /= 1000
#                 print('loss at iter', step, ':', average_loss)
#                 average_loss = 0

In [None]:
final_embeddings = run()

#visualize_embeddings(final_embeddings, games_decode)

In [28]:
#Create tensorflow variables

# Make placeholders for x_train and y_train (x = data points, y = expected labels)
#Create placeholder architecture. shape = (None, len(games)) generates a matrix with unknown number of rows and len(games) columns
#This is the same as the representation of our data in vector form, with each individual example taking a row

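x = tf.placeholder(tf.float32, shape = (None, len(games))) #float32 so x can be matrix-multiplied with the float32 weights
y_label = tf.placeholder(tf.float32, shape=(None, len(games))) #float32 so it can be multiplied with tf.log(prediction)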


# Hidden layer calculation
EMBEDDING_DIM = 700 #Hyperparameter to be adjusted; # neurons of the hidden layer

W1 = tf.Variable(tf.random_normal([len(games), EMBEDDING_DIM])) #Weights

b1 = tf.Variable(tf.random_normal([EMBEDDING_DIM])) #bias

hidden_representation = tf.add(tf.matmul(x,W1), b1) # x * W1 + b

# Output layer calculation
W2 = tf.Variable(tf.random_normal([EMBEDDING_DIM, len(games)])) #Weights

b2 = tf.Variable(tf.random_normal([len(games)]))

prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_representation,W2), b2))
#Apply softmax: Converts to normalized probability

In [29]:
W1.shape

TensorShape([Dimension(501), Dimension(700)])

In [12]:
sess = tf.Session()

init = tf.global_variables_initializer()

sess.run(init) #make sure you do this!

#cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), reduction_indices=[1]))
#Note: the expression below omits the leading minus sign of the cross-entropy, which is why the printed value is negative
cross_entropy_loss = tf.reduce_mean(tf.reduce_sum(y_label * tf.log(prediction + 1e-10) + (1-y_label) * tf.log(1 - prediction + 1e-10), axis=1))
sess.run(cross_entropy_loss, feed_dict={x: Train_inputs, y_label:Train_outputs})

#Added +1e-10 to softmax predictions to try and handle NaN cases. It seems super small values get cut out; happens with extremely sparse matrices.
#https://stackoverflow.com/questions/39583752/nan-from-sparse-softmax-cross-entropy-with-logits-in-tensorflow


-44.917133

In [30]:
# Train the model

sess = tf.Session()

init = tf.global_variables_initializer()

sess.run(init) #make sure you do this!

# Define loss function: Cross Entropy Loss function
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction + 1e-10), reduction_indices=[1]))
# Define training step
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)

n_iters = 100

# train for n_iter iterations
#Recall: x = Train_inputs, y_label = Train_outputs
for _ in range(n_iters):
    sess.run(train_step, feed_dict={x: Train_inputs, y_label: Train_outputs})
    print('loss is : ', sess.run(cross_entropy_loss, feed_dict={x: Train_inputs, y_label: Train_outputs}))    


loss is :  22.961
loss is :  22.9602
loss is :  22.9594
loss is :  22.9587
loss is :  22.958
loss is :  22.9575
loss is :  22.9572
loss is :  22.9571
loss is :  22.957
loss is :  22.9569
loss is :  22.9568
loss is :  22.9567
loss is :  22.9566
loss is :  22.9565
loss is :  22.9564
loss is :  22.9561
loss is :  22.9557
loss is :  22.9552
loss is :  22.9546
loss is :  22.9541
loss is :  22.9536
loss is :  22.9528
loss is :  22.9518
loss is :  22.9506
loss is :  22.9494
loss is :  22.9482
loss is :  22.9472
loss is :  22.9463
loss is :  22.9456
loss is :  22.9452
loss is :  22.9451
loss is :  22.9451
loss is :  22.9451
loss is :  22.9451
loss is :  22.9451
loss is :  22.9451
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.945
loss is :  22.9448
loss is :  22.9438
loss is :  22.9418
loss is : 

Approximate loss with 700 hidden neurons, for different vector sizes: <br>
 - All games in the vector: loss is about 23 <br>
 - Top 5000 games in the vector: loss is about 23.0141 <br>
 - Top 500 games in the vector: loss is about 22.9393
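
One possible reading of these numbers (an interpretation, not something verified against the original runs): a loss near 23 is very close to -ln(1e-10), the value the loss takes when the softmax gives essentially zero probability to the true game and only the 1e-10 added inside `tf.log` remains. A network that guessed uniformly over N games would instead score ln(N), which is far lower for every vector size tried here. The snippet below just works out that arithmetic; the three sizes are the vector widths that appear in this notebook.

```python
import math

EPSILON = 1e-10  # the constant added inside tf.log in the training cell

# Loss per example when the true game gets ~0 probability
saturated_loss = -math.log(EPSILON)

# Loss per example for a uniform guess over N games
uniform_loss = {n: math.log(n) for n in (501, 5001, 13955)}

print(round(saturated_loss, 4))
for n, value in uniform_loss.items():
    print(n, round(value, 3))
```

If this reading is right, the loss is stuck at the saturation floor regardless of vector size, which could point to the unit-variance `tf.random_normal` initialisation producing very large logits. Trying a smaller `stddev` (such as the 0.03 used in the earlier cell) would be one way to test that.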

In [31]:
games

{'1775: Rebellion': 220,
 '1830: Railways & Robber Barons': 148,
 '1960: The Making of the President': 140,
 '1989: Dawn of Freedom': 337,
 '51st State: Master Set': 256,
 '7 Wonders': 37,
 '7 Wonders Duel': 8,
 'A Feast for Odin': 35,
 'A Few Acres of Snow': 207,
 'A Game of Thrones (first edition)': 281,
 'A Game of Thrones: The Board Game (Second Edition)': 73,
 'A Game of Thrones: The Card Game (Second Edition)': 254,
 'Above and Below': 126,
 'Abyss': 301,
 'Acquire': 185,
 'Advanced Squad Leader': 171,
 'Advanced Squad Leader: Starter Kit #1': 360,
 "Aeon's End": 375,
 'Age of Empires III: The Age of Discovery': 116,
 'Age of Industry': 397,
 'Age of Steam': 94,
 'Agricola': 14,
 'Agricola (revised edition)': 291,
 'Agricola: All Creatures Big and Small': 153,
 'Airlines Europe': 267,
 'Alchemists': 74,
 'Alhambra': 381,
 'Alien Frontiers': 132,
 'Amerigo': 250,
 'Among the Stars': 378,
 'Amun-Re': 235,
 'Anachrony': 139,
 'Android: Netrunner': 31,
 'Antiquity': 227,
 'AquaSphere

In [49]:
games['Spyfall']

318

In [59]:
me=pd.DataFrame(data=['7 Wonders','Pandemic','Codenames','Dixit', 'Pandemic Legacy: Season 1','One Night Ultimate Werewolf','The Resistance','Coup',
'Galaxy Trucker','Space Alert','Cosmic Encounter','Spyfall'], columns=['Games'])

In [61]:
my_collection = games2vec(me['Games'],games)
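
Once a collection is encoded, a recommendation step might look like the sketch below: score every game, mask out the ones already owned, and return the best of the rest. The function name, toy scores, and toy dictionary are hypothetical; in the notebook the scores could come from a trained model (for example `loaded_mlp.predict_proba(my_collection)`, which was not run here) and the titles from `games_decode`.

```python
import numpy as np

def recommend(scores, owned_vector, index_to_game, n=3):
    """Return the n highest-scored games that are not already in the collection."""
    # Owned games get -inf so they can never be recommended
    masked = np.where(owned_vector.ravel() == 1, -np.inf, scores.ravel())
    best = np.argsort(masked)[::-1][:n]
    return [index_to_game[int(i)] for i in best]

toy_decode = {0: 'Codenames', 1: 'Pandemic', 2: '7 Wonders', 3: 'Dixit'}
toy_owned = np.array([[1, 0, 0, 0]])            # the user already owns Codenames
toy_scores = np.array([[0.9, 0.2, 0.5, 0.1]])   # made-up model output
picks = recommend(toy_scores, toy_owned, toy_decode, n=2)
print(picks)
```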