# Pokemon type predictor - Exploratory data analysis

### Group information:
##### Team number: 8
##### Team members: Sarah Abdelazim, Wilfred Hass, Vincent Ho, Caroline Tang
##### Source: https://gist.githubusercontent.com/HansAnonymous/56d3c1f8136f7e0385cc781cf18d486c/raw/f91faec7cb2fd08b3c28debf917a576c225d8174/pokemon.csv
##### Introduction: The dataset contains different attributes of 1049 Pokemon. Our project question is whether we can predict a Pokemon’s type from its other attributes.

In [1]:
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split

#

## 1.  Overview and Summary of the dataset

In [2]:
pokemon_df = pd.read_csv("../data/pokemon.csv")

In [3]:
pd.set_option('display.max_columns', None)

pokemon_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
0,NUMBER,CODE,SERIAL,NAME,TYPE1,TYPE2,COLOR,ABILITY1,ABILITY2,ABILITY HIDDEN,GENERATION,LEGENDARY,MEGA_EVOLUTION,HEIGHT,WEIGHT,HP,ATK,DEF,SP_ATK,SP_DEF,SPD,TOTAL
1,1,1,11,Bulbasaur,Grass,Poison,Green,Overgrow,,Chrolophyll,1,0,0,0.7,6.9,45,49,49,65,65,45,318
2,2,1,21,Ivysaur,Grass,Poison,Green,Overgrow,,Chrolophyll,1,0,0,1,13,60,62,63,80,80,60,405
3,3,1,31,Venusaur,Grass,Poison,Green,Overgrow,,Chrolophyll,1,0,0,2,100,80,82,83,100,100,80,525
4,3,2,32,Mega Venusaur,Grass,Poison,Green,Thick Fat,,,1,0,1,2.4,155.5,80,100,123,122,120,80,625


In [4]:
pokemon_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
count,1049,1049,1049,1049,1049,555,1049,1049,523,819,1049,1049,1049,1049.0,1049.0,1049,1049,1049,1049,1049,1049,1049
unique,899,7,1049,951,19,19,11,217,127,158,9,3,3,60.0,474.0,104,124,114,120,107,128,216
top,479,1,SERIAL,Rotom,Water,Flying,Blue,Levitate,Frisk,Telepathy,5,0,0,0.6,0.3,60,100,70,40,50,50,600
freq,6,898,1,6,136,109,181,41,17,21,172,924,998,92.0,14.0,85,57,72,71,70,57,45


In [5]:
pokemon_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1049 entries, 0 to 1048
Data columns (total 22 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       1049 non-null   object
 1   1       1049 non-null   object
 2   2       1049 non-null   object
 3   3       1049 non-null   object
 4   4       1049 non-null   object
 5   5       555 non-null    object
 6   6       1049 non-null   object
 7   7       1049 non-null   object
 8   8       523 non-null    object
 9   9       819 non-null    object
 10  10      1049 non-null   object
 11  11      1049 non-null   object
 12  12      1049 non-null   object
 13  13      1049 non-null   object
 14  14      1049 non-null   object
 15  15      1049 non-null   object
 16  16      1049 non-null   object
 17  17      1049 non-null   object
 18  18      1049 non-null   object
 19  19      1049 non-null   object
 20  20      1049 non-null   object
 21  21      1049 non-null   object
dtypes: object(22)
memory usa

#### Some values are missing because not every Pokemon has a second type, a second ability, or a hidden ability.
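The `head()` and `info()` outputs above also show that the real column names (NUMBER, CODE, ...) were read in as the first data row, so the columns are labelled 0 to 21 and every column has dtype `object`. Below is a minimal sketch of one way to repair this after loading. It uses a small made-up frame rather than the real file, and the helper name `promote_first_row_to_header` is hypothetical, not part of the original notebook.

```python
import pandas as pd


def promote_first_row_to_header(df, numeric_columns):
    """Use the first row as column names and convert the listed columns to numbers."""
    fixed = df.iloc[1:].reset_index(drop=True)
    fixed.columns = df.iloc[0].tolist()
    for col in numeric_columns:
        fixed[col] = pd.to_numeric(fixed[col])
    return fixed


# Toy stand-in for the frame as loaded above: the header sits in row 0.
raw = pd.DataFrame(
    [
        ["NAME", "TYPE1", "HP"],
        ["Bulbasaur", "Grass", "45"],
        ["Ivysaur", "Grass", "60"],
    ],
    columns=["0", "1", "2"],
)

clean = promote_first_row_to_header(raw, numeric_columns=["HP"])
print(clean.dtypes)
```

On the real data, the same idea would be applied to `pokemon_df` before the split, with the numeric columns being the ones plotted later in the notebook. Depending on how the file is laid out, passing a different `header` argument to `pd.read_csv` may achieve the same result; that is an assumption that would need checking against the file itself.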

#

## 2. Partition the data set into training and test sets

##### Before proceeding to the EDA section, we split the data so that 70% of the observations are in the training set and 30% are in the test set.

In [6]:
train_df, test_df = train_test_split(pokemon_df, test_size=0.3, random_state=123)

#

## 3. Exploratory analysis on the training data set

#

#### a. Distributions of numerical columns

In [7]:
alt.Chart(train_df).mark_bar().encode(
    alt.X(alt.repeat(), type='quantitative', bin=alt.Bin(maxbins=40)),
    y='count()',
).properties(
    width=300,
    height=200
).repeat(
    ['NUMBER', 'CODE', 'GENERATION', 'LEGENDARY', 'MEGA_EVOLUTION',
     'HEIGHT', 'WEIGHT', 'HP', 'ATK', 'DEF', 'SP_ATK', 'SP_DEF', 'SPD', 'TOTAL'],
    columns=3
)

Most of the numerical attributes look roughly normally distributed; HEIGHT and WEIGHT are the exceptions. Only a small number of Pokemon have an alternate form (CODE greater than 1), are legendary, or have a mega evolution, so it makes sense that CODE, LEGENDARY and MEGA_EVOLUTION are so imbalanced.
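One way to back up a visual judgement about distribution shape is to compute the sample skewness of each column: values near 0 indicate a roughly symmetric distribution, while large positive values indicate a long right tail. The sketch below uses made-up numbers, not the notebook's data, purely to show the call.

```python
import pandas as pd

# Toy values only; the real notebook would use train_df with numeric dtypes.
toy = pd.DataFrame(
    {
        "HP": [40, 50, 60, 70, 80],            # symmetric around 60
        "WEIGHT": [1.0, 2.0, 3.0, 4.0, 90.0],  # one very heavy outlier
    }
)

# Sample skewness per column.
skewness = toy.skew()
print(skewness)
```

In this toy frame the symmetric column has zero skewness and the column with the heavy outlier is strongly right-skewed.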

#

#### b. Distributions of categorical columns

In [8]:
alt.Chart(train_df).mark_bar().encode(
    x='count()',
    # The repeated field is on the y channel, so use alt.Y and state its type.
    y=alt.Y(alt.repeat(), type='nominal'),
).properties(
    width=300,
    height=1500
).repeat(
    ['TYPE1', 'TYPE2', 'COLOR', 'ABILITY1', 'ABILITY2', 'ABILITY HIDDEN'],
    columns=3
)

For the categorical columns, we can see that the Pokemon types are not well balanced and that some Pokemon share the same ability.
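The class imbalance can also be expressed numerically rather than read off the bars. The sketch below is a minimal example on a made-up series of primary types; in the notebook the input would be the TYPE1 column of the training set.

```python
import pandas as pd

# Toy primary types; in the notebook this would be train_df["TYPE1"].
types = pd.Series(
    ["Water", "Water", "Water", "Grass", "Grass", "Fire"], name="TYPE1"
)

# Share of observations per class, largest first.
type_share = types.value_counts(normalize=True)
print(type_share)

# Ratio between the most and the least common class.
imbalance_ratio = type_share.max() / type_share.min()
print(imbalance_ratio)
```

A ratio well above 1 suggests that accuracy alone may be a misleading metric for the later modelling step.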

#

#### c. Correlation table for numerical columns

In [9]:
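# Restrict to numeric columns; recent pandas versions raise an error on text columns otherwise.
train_df.corr(method="spearman", numeric_only=True).style.background_gradient()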

In the correlation table, we can see that the numerical attributes are highly correlated with each other, which makes sense: the stronger a Pokemon is, the higher its overall score should be. However, these attributes are not highly correlated with identification variables such as NUMBER, CODE, SERIAL and GENERATION. We might therefore consider dropping these columns, since they carry little information about the other attributes.
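The reasoning above can be sketched in code: compute the Spearman correlation matrix, then list the columns whose strongest correlation with the stat columns stays below a cutoff. The data below is made up, and the 0.5 cutoff is an arbitrary assumption chosen for illustration, not a value taken from the analysis.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "NUMBER": [1, 2, 3, 4, 5, 6],  # arbitrary identifier
        "ATK": [82, 49, 100, 62, 30, 55],
        "TOTAL": [525, 318, 625, 405, 200, 330],
    }
)

corr = toy.corr(method="spearman")

# Columns whose strongest absolute rank correlation with the stats is weak.
stat_cols = ["ATK", "TOTAL"]
threshold = 0.5  # assumed cutoff, for illustration only
weakly_related = [
    col
    for col in corr.columns
    if col not in stat_cols and corr.loc[col, stat_cols].abs().max() < threshold
]
print(weakly_related)
```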

#

#### d. Explore some interesting relationships between categorical variables

In [10]:
# Primary ability of a Pokemon vs. primary type of a Pokemon

alt.Chart(train_df).mark_square().encode(
    x='ABILITY1',
    y='TYPE1',
    color='count()',
    size='count()')

ValueError: ABILITY1 encoding field is specified without a type; the type cannot be inferred because it does not match any column in the data.

alt.Chart(...)

It seems that most Pokemon types have abilities of their own, or share them with only a few other types.
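The same relationship can be checked without a chart by counting, for each primary ability, how many distinct primary types it appears in. This is a minimal sketch on made-up rows; the column names match the dataset, but the values are for illustration only.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "ABILITY1": ["Overgrow", "Overgrow", "Blaze", "Levitate", "Levitate", "Levitate"],
        "TYPE1": ["Grass", "Grass", "Fire", "Ghost", "Psychic", "Electric"],
    }
)

# Number of distinct primary types in which each primary ability appears.
types_per_ability = toy.groupby("ABILITY1")["TYPE1"].nunique()
print(types_per_ability)

# Share of abilities that are exclusive to a single type.
exclusive_share = (types_per_ability == 1).mean()
print(exclusive_share)
```

A high exclusive share would support the idea that the primary ability is a strong signal of the primary type.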

#

In [11]:
# TYPE of pokemon vs COLOR of pokemon

alt.Chart(train_df).mark_square().encode(
    x='TYPE1',
    y='COLOR',
    color='count()',
    size='count()')

ValueError: TYPE1 encoding field is specified without a type; the type cannot be inferred because it does not match any column in the data.

alt.Chart(...)

We can see that many Pokemon types span all ten colors, while some types have a dominant color. For example, Water types are mostly blue and Grass types are mostly green.
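A dominant color per type can be read off a contingency table. The sketch below builds one with `pd.crosstab` on made-up rows and reports, for each type, the most frequent color and the fraction of that type it covers.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "TYPE1": ["Water", "Water", "Water", "Grass", "Grass", "Grass"],
        "COLOR": ["Blue", "Blue", "White", "Green", "Green", "Brown"],
    }
)

# Rows are types, columns are colors, cells are counts.
counts = pd.crosstab(toy["TYPE1"], toy["COLOR"])

# Most frequent color per type and the share of that type it covers.
dominant_color = counts.idxmax(axis=1)
dominant_share = counts.max(axis=1) / counts.sum(axis=1)
print(pd.DataFrame({"color": dominant_color, "share": dominant_share}))
```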

In [12]:
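# Preprocessing section for later milestones

imputed_train_df_df = train_df.copy()

# Assign the result back instead of using inplace=True on a column selection,
# which may not update the frame in recent pandas versions.
imputed_train_df_df['TYPE2'] = imputed_train_df_df['TYPE2'].fillna("No Other Type")
imputed_train_df_df['ABILITY2'] = imputed_train_df_df['ABILITY2'].fillna("No Other Ability")
imputed_train_df_df['ABILITY HIDDEN'] = imputed_train_df_df['ABILITY HIDDEN'].fillna("No Ability Hidden")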



KeyError: 'TYPE2'

In [None]:
imputed_train_df_df.info()

In [None]:
binary_features = ['LEGENDARY','MEGA_EVOLUTION']
categorical_features = ['TYPE1','TYPE2','COLOR', 'ABILITY1','ABILITY2','ABILITY HIDDEN']
numeric_features = ['HEIGHT', 'WEIGHT','HP','ATK','DEF','SP_ATK','SP_DEF','SPD']
drop_features = ['NUMBER','CODE','GENERATION', 'TOTAL' ]
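Feature lists like these are typically fed into a scikit-learn column transformer. The following is a sketch of how that might look, not the preprocessing the team actually used: the choice of `StandardScaler` and `OneHotEncoder`, the shortened feature lists, and the toy frame are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Shortened, hypothetical versions of the feature lists above.
binary_features = ["LEGENDARY"]
categorical_features = ["COLOR"]
numeric_features = ["HP", "ATK"]
drop_features = ["NUMBER"]

# Made-up rows standing in for the training set.
toy = pd.DataFrame(
    {
        "NUMBER": [1, 2, 3, 4],
        "LEGENDARY": [0, 0, 1, 0],
        "COLOR": ["Green", "Green", "Blue", "Red"],
        "HP": [45, 60, 80, 100],
        "ATK": [49, 62, 82, 100],
    }
)

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),                            # scale numeric stats
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),  # one column per category
    ("passthrough", binary_features),                                # already 0/1
    ("drop", drop_features),                                         # identifiers
)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

If TYPE1 is the prediction target, it would need to be separated from the features before fitting a transformer like this one.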