# Exploration

We will be building a statistical model to estimate the distribution of Pokemon stats and to predict the stats of future (or unseen) Pokemons.

In this notebook, we will explore the Pokemon dataset (https://www.kaggle.com/abcsds/pokemon) and try to answer a few questions about Pokemons:
1. How are the attributes of a Pokemon roughly distributed?
2. Is there any significant difference in attribute allocation between different Pokemon types?
3. Do certain attributes correlate with each other?
4. How much better is a "Legendary" Pokemon?
5. What are the most "unusual" Pokemons? What about the most "normal"?

Let us first inspect the data:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
pokemon = pd.read_csv("./Pokemon.csv")

In [None]:
pokemon.shape

In [None]:
pokemon.head()

Quoting the attribute description on Kaggle:

> #: ID for each pokemon
>
> Name: Name of each pokemon
>
> Type 1: Each pokemon has a type, this determines weakness/resistance to attacks
>
> Type 2: Some pokemon are dual type and have 2
>
> Total: sum of all stats that come after this, a general guide to how strong a pokemon is
>
> HP: hit points, or health, defines how much damage a pokemon can withstand before fainting
>
> Attack: the base modifier for normal attacks (eg. Scratch, Punch)
>
> Defense: the base damage resistance against normal attacks
>
> SP Atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
>
> SP Def: the base damage resistance against special attacks
>
> Speed: determines which pokemon attacks first each round

We begin by inspecting the basic descriptive statistics of the attributes:

In [None]:
pokemon.describe()

In [None]:
pokemon.describe(exclude=[np.number])

Some characteristics of the data:

1. There are 13 attributes (including ID) per pokemon, and there are 800 pokemons in total.
2. All six basic stats have ranges of the same order of magnitude and similar standard deviations. HP has the lowest standard deviation whereas Sp. Atk has the highest. The range of total stat points is pretty large.
3. All pokemons have at least one type, and about half of them have two types.
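
As a minimal sketch of how such characteristics can be read off a DataFrame, the snippet below uses a hypothetical four-row stand-in for the real table (the names and values in `toy` are made up for illustration). It checks that every row has a primary type and computes the share of rows with a second type.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for the real table; values are illustrative only.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Type 1": ["Grass", "Fire", "Water", "Bug"],
    "Type 2": ["Poison", np.nan, np.nan, "Flying"],
    "HP": [45, 39, 44, 60],
})

n_rows, n_cols = toy.shape                      # number of pokemons and attributes
has_primary = toy["Type 1"].notna().all()       # does every row have a first type?
dual_share = toy["Type 2"].notna().mean()       # fraction of rows with a second type

print(n_rows, n_cols, has_primary, dual_share)
```

The same three expressions should work on the full `pokemon` DataFrame, assuming its columns are named as in the Kaggle description.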

Let us compare the pokemon stat values pairwise:

Since we have modelled the effect of types and legendary status on the attributes, the correlation left to be explained by the multivariate normal distribution is smaller than that in the data overall (e.g. in many types, Sp. Atk and Sp. Def are either both high or both low). However, there seems to be a correlation between Defense and Sp. Def that is not sufficiently explained by our type model.

# Investigating the Relationship between Attributes Given Total Points

So far, we have modelled the relationship between the attributes of a pokemon and its types. What if we include the total attribute points of a pokemon as an input variable?

In [None]:
pos = pd.DataFrame(samp['att'].reshape(-1, M), columns=labels)
pos['Total'] = pos.sum(axis=1)
pos.describe()

In [None]:
nr = 6
nc = 3
f, ax = plt.subplots(figsize=(20, 25), nrows=nr, ncols=nc)
START = 100
RANGE = 50
for i in range(nr * nc):
    # keep the rows whose Total falls in this bin, then drop only the trailing 'Total' column
    slice_dat = pos.loc[(pos['Total'] >= START + RANGE * i) & (pos['Total'] < START + RANGE * (1 + i)), :]\
        .iloc[:, :-1]
    ax[i//nc, i%nc].set_title("Total = {0} to {1}, count={2}"\
                            .format(START + RANGE * i, START + RANGE * (1 + i), slice_dat.shape[0]))
    slice_cov = slice_dat.cov()
    slice_diag = np.sqrt(np.diag(slice_cov))
    corr = pd.DataFrame(slice_cov.values / slice_diag.reshape(-1, 1) / slice_diag.reshape(1, -1),
                        columns=labels, index=labels)
    sns.heatmap(corr, cmap=sns.color_palette("coolwarm", 10), ax=ax[i//nc, i%nc], annot=True)
plt.show()

Here we can look at how the relationships between attributes change over the range of total attribute points. There is more positive correlation between certain attribute pairs at very low and very high total ranges (although this is likely due to chance, as the sample size is small at the extreme ends). Opposite attack and defense pairs (e.g. Attack vs Sp. Def) seem to be consistently negatively correlated given the total. Speed and Defense are more consistently negatively correlated.
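
One caveat when reading correlations within a band of totals: holding the total roughly fixed pushes pairwise correlations negative on its own, because a higher value in one attribute must be offset by lower values elsewhere. The sketch below illustrates this with synthetic, independent attributes (the mean of 70 and standard deviation of 20 are assumptions for illustration, not values from the posterior sample).

```python
import numpy as np

rng = np.random.default_rng(0)
# Six independent synthetic attributes: no correlation by construction.
stats = rng.normal(70.0, 20.0, size=(200_000, 6))
total = stats.sum(axis=1)

def mean_offdiag_corr(x):
    """Average of the off-diagonal entries of the correlation matrix of x."""
    corr = np.corrcoef(x, rowvar=False)
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    return float(corr[off_diag].mean())

overall = mean_offdiag_corr(stats)

# Keep only rows whose total falls in one 50-point band, as in the heatmaps.
in_band = (total >= 400) & (total < 450)
within_band = mean_offdiag_corr(stats[in_band])

print(overall, within_band)
```

With the sum fixed exactly, six exchangeable attributes would have a pairwise correlation of -1/5, so mildly negative values inside a band are expected even without any real trade-off between attributes.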

# Fun Time - Find the weirdest Pokemon!

We define "weirdness" as having a low likelihood under our model. To find the most unlikely Pokemon, we need to calculate the log-likelihood of each row:

In [None]:
with model:
    logp = attributes.logp_elemwise(mu=trace['mu'].mean(axis=0), 
                   w=trace['w'].mean(axis=0),
                   nu_interval__=trace['nu_interval__'].mean(axis=0),
                   chol_cholesky_cov_packed__=trace['chol_cholesky_cov_packed__'].mean(axis=0),
                   amp_interval__=trace['amp_interval__'].mean(axis=0))

Now we just have to rank the Pokemons by their log-likelihood:

In [None]:
sort_ind = np.argsort(logp)
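
To make the ranking step concrete, here is a minimal sketch that evaluates a plain multivariate normal log-density in NumPy and sorts the rows by it. The mean, covariance and points are hypothetical, and the helper `mvn_logpdf` is an illustration only; it is not how the model above computes `logp`.

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Row-wise log-density of a multivariate normal (illustrative helper)."""
    k = mean.shape[0]
    chol = np.linalg.cholesky(cov)
    # Solve L z = (x - mean)^T so that the squared norm of z is the Mahalanobis distance.
    z = np.linalg.solve(chol, (x - mean).T)
    maha = (z ** 2).sum(axis=0)
    logdet = 2.0 * np.log(np.diag(chol)).sum()
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + maha)

mean = np.zeros(2)
cov = np.eye(2)
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, -5.0]])

logp_demo = mvn_logpdf(points, mean, cov)
order = np.argsort(logp_demo)   # ascending: least likely rows come first
weirdest_first = points[order]

print(order, logp_demo)
```

Because `np.argsort` sorts in ascending order, the first entries of the index array point at the rows with the lowest log-likelihood, i.e. the "weirdest" ones.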

And here are our 20 weirdest Pokemon:

In [None]:
sns.pairplot(pokemon, hue='Legendary')

We found some further insights:

1. Stats are distributed rather evenly across generations barring a few outliers, but the averages seem to increase by pokemon IDs - likely power creep?
2. Individual stats are somewhat positively correlated with total stats.
3. All stats are skewed to the right.
4. The correlations between individual stats are not clear.
5. Legendaries have significantly higher total stats.

We would also like to know the relationship between type and stats.

In [None]:
vec_concat = np.vectorize(lambda x, y: " ".join([x, y]))

In [None]:
pokemon['CombinedType'] = vec_concat(pokemon['Type 1'].fillna(""), pokemon['Type 2'].fillna(""))
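
A detail worth knowing about this join: single-type rows end up with a trailing space (for example `'Water '`), which is why later cells filter on labels such as `'Water '` and `'Ground '`. The sketch below reproduces this on a hypothetical three-row frame and shows a stripped variant as one possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with the same two type columns.
toy = pd.DataFrame({
    "Type 1": ["Water", "Water", "Ground"],
    "Type 2": [np.nan, "Ground", np.nan],
})

vec_concat = np.vectorize(lambda x, y: " ".join([x, y]))
combined = vec_concat(toy["Type 1"].fillna(""), toy["Type 2"].fillna(""))

# Optional: strip the trailing space left behind for single-type rows.
stripped = pd.Series(combined).str.strip()

print(combined.tolist())
print(stripped.tolist())
```

If the stripped labels were used instead, every later filter on `CombinedType` would have to drop the trailing space as well.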

In [None]:
combined_types = pokemon.groupby('CombinedType')['#'].count().reset_index().\
    rename(columns={'#':'val'}).query('val > 13')['CombinedType']

In [None]:
fig, ax = plt.subplots(figsize=(14, 6))
sns.violinplot(data=pokemon.iloc[np.isin(pokemon['CombinedType'], combined_types), :],
               x='CombinedType', y='Total', ax=ax)

As we see above, type definitely has an effect on the value distribution and allocation of different stats. Notice that for many types, there is a bimodal pattern in their attribute distributions. This is likely due to evolution stages, and we will verify that later.

Now we look at the mean values of attributes for each common pokemon type:

Posterior predictive check (PPC) for model validation: we draw samples of the observed variables from the posterior distribution and check whether they are distributed similarly to the actual data.

In [None]:
with model:
    samp = pm.sample_ppc(trace, 1000)
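
The idea behind a posterior predictive check can be sketched without PyMC3 on a much simpler model. The example below assumes a single attribute with a normal likelihood, a known standard deviation and a flat prior on the mean; the `observed` array is synthetic. It illustrates the procedure only and is not the model used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
observed = rng.normal(70.0, 25.0, size=800)   # synthetic stand-in for one attribute

n = observed.size
sigma = observed.std(ddof=1)                  # treated as known for simplicity

# With a flat prior, the posterior of the mean is Normal(sample mean, sigma / sqrt(n)).
mu_draws = rng.normal(observed.mean(), sigma / np.sqrt(n), size=1000)

# One replicated dataset of size n per posterior draw.
replicated = rng.normal(mu_draws[:, None], sigma, size=(1000, n))

def summarize(values):
    """Mean, standard deviation and 5%/95% quantiles of an array."""
    return {
        "mean": float(np.mean(values)),
        "std": float(np.std(values)),
        "q05": float(np.quantile(values, 0.05)),
        "q95": float(np.quantile(values, 0.95)),
    }

posterior_summary = summarize(replicated)
data_summary = summarize(observed)
print(posterior_summary)
print(data_summary)
```

If the replicated summaries sit far from the data summaries, the model is missing some feature of the data; if they sit close, the check has not found a problem, which is weaker than confirming the model.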

In [None]:
labels = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

Posterior value:

In [None]:
pd.DataFrame(samp['att'].reshape(-1, 6), columns=labels).agg([np.mean, np.std, 
                                                              lambda x: x.quantile(0.05),
                                                              lambda x: x.quantile(0.95)])

Data value:

In [None]:
df[labels].agg([np.mean, np.std, lambda x: x.quantile(0.05), lambda x: x.quantile(0.95)])

As we see here, our sample estimate is somewhat close to the true data values, but there are some underestimates and overestimates. We suspect this is due to the fact that being legendary affects different attributes differently, and when two types mix, they do not affect each attribute equally.

Let us compare the distribution of different attributes from the posterior sample and from the data:

In [None]:
f, ax = plt.subplots(nrows=3, ncols=2, figsize=(12, 10))
tdf = pd.DataFrame(samp['att'].reshape(-1, 6))
for i in range(3):
    for j in range(2):
        tdf.iloc[:, 2 * i + j].sample(10000).plot.kde(xlim=(0, 200), 
                                        title=labels[2 * i + j], 
                                        ax=ax[i, j])
tdf = pd.DataFrame(df[labels])
for i in range(3):
    for j in range(2):
        tdf.iloc[:, 2 * i + j].plot.kde(xlim=(0, 200), 
                                        title=labels[2 * i + j],
                                        ax=ax[i, j], color='red')
tdf = pd.DataFrame(df.query('Legendary == False')[labels])
for i in range(3):
    for j in range(2):
        tdf.iloc[:, 2 * i + j].plot.kde(xlim=(0, 200), 
                                        title=labels[2 * i + j],
                                        ax=ax[i, j], color='green')
plt.show()

There are still some imperfections in this model; for example, it does not fit the peaks of some attributes very well.

One of the easiest inferences we can make is about the bonus of being "Legendary":

In [None]:
pm.plot_posterior(trace, varnames=['amp'])

According to our model, legendary Pokemons are expected to be 1.434 times stronger than an ordinary Pokemon. The 95% credible interval of the multiplier is (1.367, 1.494). This is a reasonable estimate given the scatterplot of total attribute points versus legendary status. However, the scatterplot also shows that being "legendary" is clearly not equal to receiving a flat bonus across the board: a legendary Pokemon might receive a bonus in only a few attributes and none at all in others. This is a limitation of our current model. However, due to the small number of legendary Pokemons, we might not be able to infer the **true** "legendary generating algorithm".

One interesting observation from the posterior is that not all types affect the attributes equally. We plot the distribution and relative mean of the mixture weight variables $w_{type_i}$:

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
pd.DataFrame(trace['w'], columns=cata1.categories).plot.kde(ax=ax)

In [None]:
(pd.DataFrame(trace['w'], columns=cata1.categories).mean() - 1).plot.bar()

Although the effect is subtle and the result is not likely to be very significant, we do find that Steel type is more likely to dominate the stat allocation in mixed-type pokemons and Bug type tends to have a smaller influence in mixed-types.

In [None]:
pm.plots.forestplot(trace, varnames=['w'], ylabels=cata1.categories)
plt.show()

As we see above, pretty much all of the type weights' 95% credible intervals overlap, so we really cannot conclude for certain whether some types affect attributes more than others.
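
The overlap argument can also be checked numerically from posterior draws. The sketch below uses synthetic draws for two hypothetical weights (`w_a` and `w_b` are made up) and an equal-tailed percentile interval, which is an assumption here; the intervals in the plot may be computed differently.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic posterior draws for two hypothetical type weights.
w_a = rng.normal(1.05, 0.15, size=4000)
w_b = rng.normal(0.95, 0.15, size=4000)

def interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws."""
    tail = (1.0 - level) / 2.0 * 100.0
    lo, hi = np.percentile(draws, [tail, 100.0 - tail])
    return float(lo), float(hi)

lo_a, hi_a = interval(w_a)
lo_b, hi_b = interval(w_b)
overlap = lo_a <= hi_b and lo_b <= hi_a

# A more direct comparison: the share of draws in which one weight exceeds the other.
prob_a_greater = float(np.mean(w_a > w_b))

print((lo_a, hi_a), (lo_b, hi_b), overlap, prob_a_greater)
```

Overlapping intervals do not show that two weights are equal. Comparing the draws pairwise, for instance two columns of `trace['w']`, gives a probability statement about the difference itself.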

We may also investigate the covariance matrix and derive our estimate of attribute correlation from it. Let us contrast this with the correlation matrix from the data:

In [None]:
cov_mean = trace['cov'].mean(axis=0)
cov_diag = np.sqrt(np.diag(cov_mean))
corr = pd.DataFrame(cov_mean / cov_diag.reshape(-1, 1) / cov_diag.reshape(1, -1), index=labels, columns=labels)

f, ax = plt.subplots(figsize=(16, 6), ncols=2)
ax[0].set_title("data")
sns.heatmap(df[labels].corr(), ax=ax[0], cmap=sns.color_palette("coolwarm", 10), annot=True)
ax[1].set_title('posterior')
sns.heatmap(corr, ax=ax[1], cmap=sns.color_palette("coolwarm", 10), annot=True)
plt.show()
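
The covariance-to-correlation step divides each entry by the product of the two corresponding standard deviations, i.e. $\rho_{ij} = \Sigma_{ij} / (\sigma_i \sigma_j)$. A minimal check of that formula on synthetic data (the mixing matrix below is arbitrary) compares it against `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated data: independent normals mixed by an arbitrary matrix.
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
data = rng.normal(size=(500, 3)) @ mixing

cov_demo = np.cov(data, rowvar=False)
sd = np.sqrt(np.diag(cov_demo))            # per-attribute standard deviations
corr_demo = cov_demo / np.outer(sd, sd)    # rho_ij = cov_ij / (sd_i * sd_j)

print(np.round(corr_demo, 3))
```

`np.outer(sd, sd)` is equivalent to the two successive reshaped divisions used in the cell above.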

As we see here, the attributes of a combined type are not always the average of those of its two component types.

Now we look at the joint distributions of the attributes.

In [None]:
g = sns.PairGrid(pokemon[["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]], diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3)

There exists some correlation between attributes, but each attribute appears to be roughly normally distributed. It is reasonable to believe that we can roughly fit this distribution pattern with a multivariate normal distribution.

# Modelling

Let us build a basic attribute generation model. We make the following assumptions:

1. All six attributes of a pokemon follow a joint multivariate normal distribution. The mean vector and covariance matrix are random variables to be estimated.
2. The mean vector of the attribute distribution is dependent on the type of the pokemon. For each type, the mean follows a normal distribution prior. If the pokemon has two types, the distribution of the mean is the mixture of the two component distributions.
3. Some types may have a stronger influence on the attributes than others. This is modelled with weight variables that are different for all types.
4. "Legendary" is a multiplier on the mean of the attributes.
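
Assumptions 2 to 4 can be worked through by hand on a tiny example. The sketch below uses two hypothetical types with two attributes each; the values of `type_mu`, `w` and `amp` are made up for illustration. It computes the expected attribute vector for a single-type, a dual-type and a legendary single-type pokemon.

```python
import numpy as np

# Hypothetical per-type attribute means: 2 types x 2 attributes.
type_mu = np.array([[60.0, 80.0],
                    [90.0, 50.0]])
w = np.array([1.2, 0.8])                  # hypothetical type weights
amp = 0.4                                 # hypothetical legendary bonus

type_1 = np.array([0, 0, 1])
type_2 = np.array([0, 1, 1])              # single types repeat their first type
legendary = np.array([False, False, True])

# Mixing ratio of the first type; equals 0.5 when both types coincide.
r = (w[type_1] / (w[type_1] + w[type_2]))[:, None]
mixed_mu = r * type_mu[type_1] + (1 - r) * type_mu[type_2]

# Legendary rows are scaled by (1 + amp); others are left unchanged.
expected = mixed_mu * (legendary[:, None] * amp + 1)

print(expected)
```

For a single-type pokemon the two components are identical, so the mixture collapses to that type's mean regardless of the weight.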

We begin by converting the "type" fields into numerical IDs.

In [None]:
df = pokemon[['Type 1', 'Type 2', 'Legendary', 'HP', 'Attack', 'Defense',
              'Sp. Atk', 'Sp. Def', 'Speed']].copy()
df['Single'] = pd.isnull(df['Type 2'])
df.loc[pd.isnull(df['Type 2']), 'Type 2'] = df[pd.isnull(df['Type 2'])]['Type 1']
cata1 = df['Type 1'].astype('category').cat
cata1_map = {cata1.categories[i]:i for i in range(len(cata1.categories))}
df['Type 1'] = df['Type 1'].apply(lambda x: cata1_map[x])
df['Type 2'] = df['Type 2'].apply(lambda x: cata1_map[x])
df.head()
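
The same encoding can be sketched with `pd.Categorical`, which assigns integer codes directly. The example below is a hedged alternative on a hypothetical three-row frame, not the notebook's method. It builds the category list from both type columns, so a type that appears only as a second type would still receive an ID.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; 'Ground' appears only as a second type.
toy = pd.DataFrame({
    "Type 1": ["Water", "Bug", "Fire"],
    "Type 2": ["Ground", np.nan, np.nan],
})

toy["Single"] = toy["Type 2"].isna()
toy["Type 2"] = toy["Type 2"].fillna(toy["Type 1"])   # single types repeat Type 1

# Shared, sorted category list covering both columns.
categories = sorted(set(toy["Type 1"]) | set(toy["Type 2"]))
codes_1 = pd.Categorical(toy["Type 1"], categories=categories).codes
codes_2 = pd.Categorical(toy["Type 2"], categories=categories).codes

print(categories, codes_1.tolist(), codes_2.tolist())
```

The dictionary-based mapping in the cell above relies on every type occurring at least once in `Type 1`; a shared category list removes that dependency.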

In [None]:
import pymc3 as pm
import theano.tensor as tt
from theano import shared
import theano as th

Let us construct the model in PyMC3. Some tricks must be used to model the prior of the covariance matrix.

In [None]:
M = 6
T = 18

with pm.Model() as model:
    type_1 = shared(df['Type 1'].astype(np.int16).values)
    type_2 = shared(df['Type 2'].astype(np.int16).values)
    leg_ph = shared(df['Legendary'].astype(bool).values)  # builtin bool; the np.bool alias was removed in NumPy 1.24
    obs = shared(df.loc[:, ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].astype(np.float32).values)

    leg = tt.reshape(leg_ph, (df.shape[0], 1))
    
    type_mu = pm.Normal("mu", mu=70., sd=10., shape=(T, M))  # prior of attr mean by type
#     r = pm.Beta("r_{mix}", 2, 2)
    w = pm.Normal("w", mu=1., sd=0.2, shape=T)
    r = w[type_1] / (w[type_1] + w[type_2])
    r = tt.reshape(r, (r.shape[0], 1))
    selected_type_mu = r * type_mu[type_1, :] + (1 - r) * type_mu[type_2, :]

#     sigma = pm.Lognormal('sigma', 30., 5.)
    
    # prior for the covariance matrix
    nu = pm.Uniform('nu', 0., 5.)
    chol = pm.LKJCholeskyCov('chol', n=M, eta=nu, sd_dist=pm.Lognormal.dist(30., 5.))
    chol = pm.expand_packed_triangular(M, chol, lower=True)
    cov = pm.Deterministic("cov", tt.dot(chol, chol.T))  # dummy
    
    amp = pm.Uniform("amp", 0., 2.)  # legendary multiplier, i.e. legendaries are k times stronger
    attributes = pm.MvNormal('att', mu=selected_type_mu * (leg * amp + 1), 
                             chol=chol, observed=obs)
#     attributes = pm.Normal('att', mu=selected_type_mu * (leg * amp + 1), sd=sigma, observed=obs)
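
The covariance "trick" in this model is to sample a Cholesky factor and rebuild the covariance from it, which yields a symmetric positive semi-definite matrix by construction. The NumPy sketch below shows the idea with a hypothetical packed vector; the row-major packing order via `np.tril_indices` is an assumption for illustration and may not match the library's internal layout.

```python
import numpy as np

M_demo = 3
# Hypothetical packed lower-triangular entries: M_demo * (M_demo + 1) / 2 values.
packed = np.array([2.0, 0.5, 1.5, -0.3, 0.2, 1.0])

# Expand the packed vector into a lower-triangular matrix.
chol_demo = np.zeros((M_demo, M_demo))
chol_demo[np.tril_indices(M_demo)] = packed

# Rebuild the covariance matrix from its Cholesky factor.
cov_rebuilt = chol_demo @ chol_demo.T

print(cov_rebuilt)
print(np.linalg.eigvalsh(cov_rebuilt))
```

Any lower-triangular factor with a non-zero diagonal gives a positive definite matrix, so the sampler never has to reject an invalid covariance.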

Model description:
$$
\mu_{type_i} \sim Normal(\mu_0, \sigma_0) \\
w_{type_i} \sim Normal(\mu_{w}, \sigma_{w}) \\
r = \dfrac{w_{type_1}}{w_{type_1} + w_{type_2}} \\
\mu_{actual} = r\mu_{type_1} + (1-r)\mu_{type_2} \\
\nu \sim Uniform(a_\nu, b_\nu) \\
\varsigma \sim Lognormal(\mu_\varsigma, \sigma_\varsigma) \\
\Sigma \sim LKJ(\varsigma, \nu) \\
k \sim Uniform(a_k, b_k) \\
Y \sim MvNormal(\mu_{actual} \cdot (1 + kI_{legendary}), \Sigma)
$$

Input values:
$$
type_1, type_2, I_{legendary}
$$

Observed variable: $$Y$$

We run the sampler to perform inference on the data:

In [None]:
with model:
    trace = pm.sample(2000, tune=2000, init='ADVI')

In [None]:
pm.traceplot(trace, varnames=['mu', 'w', 'amp'])
plt.show()

We can check the inferred attribute means of two opposite types to see if the inferred distribution makes sense:

In [None]:
SELECTED_TYPE = ['Fighting', 'Psychic']
f, ax = plt.subplots(nrows=3, ncols=2, figsize=(12, 12))
pd.DataFrame(trace['mu'][:, :, 0]).rename(columns={i:cata1.categories[i] for i in range(T)})[SELECTED_TYPE]\
    .plot.kde(title="HP", ax = ax[0, 0], legend=False, xlim=(40, 120))
pd.DataFrame(trace['mu'][:, :, 1]).rename(columns={i:cata1.categories[i] for i in range(T)})[SELECTED_TYPE]\
    .plot.kde(title="Attack", ax=ax[0,1], legend=False, xlim=(40, 120))
pd.DataFrame(trace['mu'][:, :, 2]).rename(columns={i:cata1.categories[i] for i in range(T)})[SELECTED_TYPE]\
    .plot.kde(title="Defense", ax=ax[1,0], legend=False, xlim=(40, 120))
pd.DataFrame(trace['mu'][:, :, 3]).rename(columns={i:cata1.categories[i] for i in range(T)})[SELECTED_TYPE]\
    .plot.kde(title="Sp. Atk", ax=ax[1,1], legend=False, xlim=(40, 120))
pd.DataFrame(trace['mu'][:, :, 4]).rename(columns={i:cata1.categories[i] for i in range(T)})[SELECTED_TYPE]\
    .plot.kde(title="Sp. Def", ax=ax[2,0],legend=False, xlim=(40, 120))
pd.DataFrame(trace['mu'][:, :, 5]).rename(columns={i:cata1.categories[i] for i in range(T)})[SELECTED_TYPE]\
    .plot.kde(title="Speed", ax=ax[2,1], xlim=(40, 120))
plt.show()

The model infers that Fighting Type has a higher HP and attack than Psychic Type, yet the Psychic Type has higher Sp. Atk and Sp. Def, which is correct according to Pokemon lore. So at least this small part of the model fits our "real world" experiences!

In [None]:
combined_types = pokemon.groupby('CombinedType')['#'].count().reset_index().\
    rename(columns={'#':'val'}).query('val >= 10')['CombinedType']
type_mean = pokemon.iloc[np.isin(pokemon['CombinedType'], combined_types), :]\
    .groupby('CombinedType')[["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]]\
    .mean()

In [None]:
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(type_mean, cmap=cmap, linewidths=.5, cbar_kws={"shrink": .5})

It is also interesting to look at attribute distributions of single-typed pokemons versus double-typed pokemons.

In [None]:
mm = pokemon.iloc[np.isin(pokemon['CombinedType'], ['Water ', 'Water Ground', 'Ground ']), :]\
    [['CombinedType', "HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]].melt(id_vars=['CombinedType'])
    
f, ax = plt.subplots(figsize=(11, 9))
sns.stripplot(x="value", y="variable", hue="CombinedType", hue_order=['Water ', 'Water Ground', 'Ground '],
              data=mm, dodge=True, jitter=True, palette='dark',
              alpha=.25, zorder=1)
sns.pointplot(x="value", y="variable", hue="CombinedType", hue_order=['Water ', 'Water Ground', 'Ground '],
              data=mm, dodge=.532, join=False, palette="dark",
              markers="d", scale=.75, ci=None)
# Improve the legend 
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[3:], labels[3:], title="Combined Type",
          handletextpad=0, columnspacing=1,
          loc="lower right", ncol=3, frameon=True)

In [None]:
mm = pokemon.iloc[np.isin(pokemon['CombinedType'], ['Bug ', 'Bug Poison', 'Poison ']), :]\
    [['CombinedType', "HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]].melt(id_vars=['CombinedType'])
    
f, ax = plt.subplots(figsize=(11, 9))
sns.stripplot(x="value", y="variable", hue="CombinedType", hue_order=['Bug ', 'Bug Poison', 'Poison '],
              data=mm, dodge=True, jitter=True, palette='dark',
              alpha=.25, zorder=1)
sns.pointplot(x="value", y="variable", hue="CombinedType", hue_order=['Bug ', 'Bug Poison', 'Poison '],
              data=mm, dodge=.532, join=False, palette="dark",
              markers="d", scale=.75, ci=None)
# Improve the legend 
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[3:], labels[3:], title="Combined Type",
          handletextpad=0, columnspacing=1,
          loc="lower right", ncol=3, frameon=True)