# Board game analysis - Looking for similarities

To perform the analysis, it is necessary to import some Python libraries and load the dataset as a pandas DataFrame.
Pandas is a data manipulation library,
Matplotlib is a plotting library,
Seaborn is a visualization library based on Matplotlib,
NumPy adds support for vectors, multidimensional arrays and other useful mathematical functions,
SciPy's stats module provides the correlation functions used later.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np
%matplotlib inline
import scipy.stats as st

First, I load the dataset as a pandas DataFrame and start exploring the data:

In [None]:
df = pd.read_csv('./bgg_db_2017_04.csv', encoding='latin1')

I look at the first rows and the columns they contain in order to find some interesting angles for the analysis, then count the total number of rows:

In [None]:
df.head()

In [None]:
total_row = len(df.index)
print(total_row)

# Let's explore the dataset

Which number of players is most frequent? Here I add two columns at the end of the dataset: the average number of players and the player range (min-max). With these we can see which category is most frequent and look for some simple correlations:

In [None]:
df['avg_players'] = (df['min_players'] + df['max_players']) / 2
df['player_number'] = df['min_players'].astype(str) + '-' + df['max_players'].astype(str)
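
As a quick check of what these two columns contain, here is a minimal sketch on a made-up three-row table (the game names and values are invented for illustration, they are not taken from the dataset):

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    'names': ['Game A', 'Game B', 'Game C'],
    'min_players': [2, 2, 1],
    'max_players': [4, 2, 4],
})

# Same two derived columns as in the cell above.
toy['avg_players'] = (toy['min_players'] + toy['max_players']) / 2
toy['player_number'] = toy['min_players'].astype(str) + '-' + toy['max_players'].astype(str)

print(toy[['names', 'avg_players', 'player_number']])
```

Note that different ranges can share the same average (for example 2-4 and 1-5 both give 3.0), which is why the range column is kept alongside the average.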

# Average player histogram:

In [None]:
avg_pl=df.groupby(['avg_players']).size().to_frame('count').reset_index()
avg_pl = avg_pl.sort_values(['count'], ascending=[False])
ax = avg_pl[['avg_players','count']].head(10).plot(kind='bar', x='avg_players', title="", figsize=(15, 10), legend=True, fontsize=12)
ax.set_xlabel("Average players", fontsize=12)
ax.set_ylabel("# of occurrences", fontsize=12)
plt.show()

# Player range histogram:

In [None]:
net=df.groupby(['player_number']).size().to_frame('count').reset_index()
net_sorted = net.sort_values(['count'], ascending=[False])
ax = net_sorted[['player_number','count']].head(10).plot(kind='bar', x='player_number', title="Players", figsize=(15, 10), legend=True, fontsize=12)
ax.set_xlabel("Player number", fontsize=12)
ax.set_ylabel("# of occurrences", fontsize=12)
plt.show()

2-4 is the most common player range, followed by exactly 2 players (2-2). Let's explore these by building separate dataframes for the five most frequent ranges:

In [None]:
table_2_4 = df.loc[df['player_number'] == '2-4']
table_2_2 = df.loc[df['player_number'] == '2-2']
table_2_5 = df.loc[df['player_number'] == '2-5']
table_2_6 = df.loc[df['player_number'] == '2-6']
table_1_4 = df.loc[df['player_number'] == '1-4']

It is then possible to see which games are inside each histogram bar:

In [None]:
table_2_4['names'].head()

The first five bars are the most common player ranges. Their combined share of the dataset is:

In [None]:
row_sum = len(table_2_4.index) + len(table_2_2.index) + len(table_2_5.index) + len(table_2_6.index) + len(table_1_4.index)
percentage=row_sum/total_row*100
print(percentage)

They represent about 68% of the overall data.
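
The same share can be obtained without building one dataframe per range. The following is only a sketch on an invented list of ranges, assuming `value_counts(normalize=True)` is an acceptable replacement for the manual sum above:

```python
import pandas as pd

# Invented player ranges, standing in for df['player_number'].
ranges = pd.Series(['2-4', '2-4', '2-4', '2-2', '2-2',
                    '2-5', '2-6', '1-4', '3-6', '1-1'])

# Relative frequency of each range, sorted from most to least common.
share = ranges.value_counts(normalize=True)

# Combined share of the five most frequent ranges, as a percentage.
top5_share = share.head(5).sum() * 100
print(top5_share)
```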

# Most common age:

In [None]:
age_count=df.groupby(['age']).size().to_frame('count').reset_index()
age_count = age_count.sort_values(['count'], ascending=[False])
ax = age_count[['age','count']].head(10).plot(kind='bar', x='age', title="", figsize=(15, 10), legend=True, fontsize=12)
ax.set_xlabel("Player age", fontsize=12)
ax.set_ylabel("# of occurrences", fontsize=12)
plt.show()

Again, if we want to see which games are inside a bar:

In [None]:
table_age_12 = df.loc[df['age'] == 12]
table_age_12['names'].head()

# Correlation between age and the complexity of the game (weight):

In [None]:
ax = df.plot(kind='scatter', x='age',y='weight', title ="", figsize=(15, 10), legend=True, fontsize=12)

The Pearson correlation is not very strong:

In [None]:
np.corrcoef(df['age'],df['weight'])[0,1]
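
One thing to keep in mind with `np.corrcoef` is that a single missing value makes the result `nan`. Whether the dataset actually contains missing values is not checked here, so the sketch below uses invented numbers only to show the behaviour, together with `Series.corr`, which drops incomplete pairs:

```python
import numpy as np
import pandas as pd

# Invented values; the last age is missing on purpose.
age = pd.Series([8, 10, 12, 14, np.nan])
weight = pd.Series([1.5, 2.0, 2.2, 3.1, 4.0])

# NumPy propagates the missing value into the coefficient.
r_numpy = np.corrcoef(age, weight)[0, 1]

# pandas ignores pairs where either value is missing.
r_pandas = age.corr(weight)

print(r_numpy, r_pandas)
```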

# Correlation between average rating and weight (how complex the game is):

Is there a correlation between complexity and rating?

In [None]:
ax = sb.jointplot(x='avg_rating', y='weight', data=df, kind='kde', height=10)  # 'size' was renamed to 'height' in recent seaborn

The Pearson correlation coefficient is 0.54, so there is a moderate linear correlation between weight and average rating. If we use the Geek rating instead, the correlation drops to 0.19. Why? The Geek rating is the average rating adjusted for the number of votes. If there are only a few votes, it is pulled strongly towards the middle (5.5 is the number used, I think). If there are a lot of votes, the Geek rating is closer to the average rating.

In [None]:
ax = sb.jointplot(x='geek_rating', y='weight', data=df, kind='kde', height=10)  # 'size' was renamed to 'height' in recent seaborn
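
The exact formula behind the Geek rating is not given here. Assuming it behaves like a Bayesian average, the pull towards the middle can be sketched as follows; `bayesian_average` is a hypothetical helper, the prior mean of 5.5 is the value guessed above, and the prior weight of 100 votes is an arbitrary choice made only for this illustration:

```python
def bayesian_average(avg_rating, num_votes, prior_mean=5.5, prior_votes=100):
    """Blend the observed average with a prior, weighted by vote counts.

    prior_mean and prior_votes are assumptions for illustration,
    not the values actually used for the Geek rating.
    """
    return (prior_votes * prior_mean + num_votes * avg_rating) / (prior_votes + num_votes)

# Same average rating, very different number of votes.
few_votes = bayesian_average(8.0, 50)
many_votes = bayesian_average(8.0, 10000)

print(round(few_votes, 2))   # 6.33, pulled towards 5.5
print(round(many_votes, 2))  # 7.98, close to the raw average
```

Under this assumption, games with few votes are compressed towards the same value regardless of their weight, which would be consistent with the weaker correlation seen for the Geek rating.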

Now I keep only the games with a very high number of votes (at least 5000):

In [None]:
df_stat_sig = df.loc[df['num_votes'] >= 5000]
ax = sb.jointplot(x='avg_rating', y='weight', data=df_stat_sig, kind='kde', height=10)  # 'size' was renamed to 'height' in recent seaborn

In [None]:
len(df_stat_sig)

There are only 375 board games, but the correlation is clearer and higher (0.67).
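
The cells above show the plots but not how the coefficients themselves are obtained. One possible way, sketched on invented data with a hypothetical helper `rating_weight_corr`, is to compute the Pearson coefficient before and after filtering on the number of votes:

```python
import pandas as pd

# Invented values standing in for the real columns.
toy = pd.DataFrame({
    'avg_rating': [6.0, 6.5, 7.0, 7.5, 8.0, 8.2],
    'weight':     [1.2, 2.5, 1.8, 3.0, 3.5, 3.9],
    'num_votes':  [120, 8000, 300, 6500, 12000, 40],
})

def rating_weight_corr(frame, rating_col='avg_rating'):
    """Pearson correlation between a rating column and the weight column."""
    return frame[rating_col].corr(frame['weight'])

# Coefficient on all games, then only on games with at least 5000 votes.
r_all = rating_weight_corr(toy)
popular = toy.loc[toy['num_votes'] >= 5000]
r_popular = rating_weight_corr(popular)

print(len(popular), r_all, r_popular)
```

On the real data the same helper could be called with `rating_col='geek_rating'` to compare the two ratings.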

# For a given game which other games are most similar:

Now I build an array with the values of a generic game that I want to compare with the others.

In [None]:
data = np.zeros((1,), dtype=[('names', 'U25'),('min_players', 'int'),('max_players', 'int'),('avg_time', 'float')
                             ,('min_time', 'int'),('max_time', 'int'),('avg_rating', 'float'),
                             ('geek_rating', 'float'),('age', 'int'),('weight', 'float'),('avg_players', 'float')])

data[:] = [('My Game',1,10,50,10,250,7.8,3.4,18,4,1.5)]

my_game = pd.DataFrame(data)


In [None]:
my_game.head()

Now I need to clean my dataframe, keeping only the columns that describe the characteristics of the games numerically (plus the name):

In [None]:
game_for_model = df.drop(['rank','bgg_url','game_id','num_votes','image_url','mechanic','owned','category','designer','player_number','year'], axis=1)

In [None]:
game_for_model.head(1)

The idea behind this model is to compute the Spearman rank correlation between my game and every game in the dataset, in order to find which games are most similar:

In [None]:
spearman = []
for k in range(total_row):
    spearman.append(st.spearmanr(my_game[my_game.columns[1:]].loc[0],game_for_model[game_for_model.columns[1:]].iloc[k])[0])
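
The explicit loop can also be written with `DataFrame.apply` over the rows. This is only a sketch on invented data with four feature columns; the names `reference`, `games` and `scores` are placeholders, not objects from the notebook:

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Invented reference game and candidate games (same feature order).
reference = pd.Series({'min_players': 1, 'max_players': 10,
                       'avg_time': 50.0, 'weight': 4.0})
games = pd.DataFrame({
    'names':       ['Twin', 'Other 1', 'Other 2'],
    'min_players': [2, 2, 3],
    'max_players': [20, 4, 5],
    'avg_time':    [100.0, 30.0, 1.0],
    'weight':      [8.0, 1.5, 2.0],
})

features = list(reference.index)

# One Spearman coefficient per row, aligned with the index of `games`.
scores = games[features].apply(
    lambda row: st.spearmanr(reference.values, row.values)[0], axis=1)

print(scores)
```

The "Twin" row is the reference with every value doubled and gets a coefficient of 1, which shows that this similarity compares the shape of the profile rather than its magnitude.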

Here I also compute the Pearson correlation. Spearman rank correlation assesses how well the relationship between two variables can be described by a monotonic function. The Pearson correlation coefficient is a measure of the linear correlation between two variables.

In [None]:
pearson = []
for k in range(total_row):
    pearson.append(st.pearsonr(my_game[my_game.columns[1:]].loc[0],game_for_model[game_for_model.columns[1:]].iloc[k])[0])
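
To make the difference between the two coefficients concrete, here is a small sketch on synthetic data (not taken from the dataset): a relationship that is perfectly monotonic but not linear gives a Spearman coefficient of 1, while the Pearson coefficient stays below 1.

```python
import numpy as np
import scipy.stats as st

# Synthetic data: y grows with x, but along a cubic curve, not a line.
x = np.arange(1, 11)
y = x ** 3

# spearmanr and pearsonr both return (coefficient, p-value).
rho = st.spearmanr(x, y)[0]
r = st.pearsonr(x, y)[0]

print(rho, r)
```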

Here I create a new dataframe called "game_spear". It is a copy of the "game_for_model" dataframe, to which I add two columns at the end: the Spearman and Pearson correlations. The reason is to keep the original dataset as is, so that I am free to change "my_game" and repeat the analysis with another mock game.

In [None]:
game_spear = game_for_model.copy(deep=True)

In [None]:
game_spear['spearman'] = spearman
game_spear['pearson'] = pearson

Sorting values by Spearman Rank Correlation:

In [None]:
game_spear = game_spear.sort_values(['spearman'], ascending=[False])

In [None]:
game_spear.head(10)

This is the list of the 10 most similar games by Spearman correlation. If we look at the linear (Pearson) correlation instead:

In [None]:
game_pear = game_spear.sort_values(['pearson'], ascending=[False])

In [None]:
game_pear.head(10)