# Pokemon type predictor - Exploratory data analysis

### Group information:
##### Team number: 8
##### Team members: Sarah Abdelazim, Wilfred Hass, Vincent Ho, Caroline Tang
##### Source : https://gist.github.com/HansAnonymous/56d3c1f8136f7e0385cc781cf18d486c

# Introduction
This dataset contains different attributes of 1048 pokemon, such as their height, weight, color, types, base stats, and abilities. Using this data, we want to predict a pokemon’s primary type based on its other characteristics.

In [1]:
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split

alt.renderers.enable('default')

#

## 1.  Overview and Summary of the dataset

In [2]:
pokemon_df = pd.read_csv("../data/raw/pokemon.csv")

In [3]:
pd.set_option('display.max_columns', None)

pokemon_df.head()

Unnamed: 0,NUMBER,CODE,SERIAL,NAME,TYPE1,TYPE2,COLOR,ABILITY1,ABILITY2,ABILITY HIDDEN,GENERATION,LEGENDARY,MEGA_EVOLUTION,HEIGHT,WEIGHT,HP,ATK,DEF,SP_ATK,SP_DEF,SPD,TOTAL
0,1,1,11,Bulbasaur,Grass,Poison,Green,Overgrow,,Chrolophyll,1,0,0,0.7,6.9,45,49,49,65,65,45,318
1,2,1,21,Ivysaur,Grass,Poison,Green,Overgrow,,Chrolophyll,1,0,0,1.0,13.0,60,62,63,80,80,60,405
2,3,1,31,Venusaur,Grass,Poison,Green,Overgrow,,Chrolophyll,1,0,0,2.0,100.0,80,82,83,100,100,80,525
3,3,2,32,Mega Venusaur,Grass,Poison,Green,Thick Fat,,,1,0,1,2.4,155.5,80,100,123,122,120,80,625
4,4,1,41,Charmander,Fire,,Red,Blaze,,Solar Power,1,0,0,0.6,8.5,39,52,43,60,50,65,309


In [4]:
pokemon_df.describe()

Unnamed: 0,NUMBER,CODE,SERIAL,GENERATION,LEGENDARY,MEGA_EVOLUTION,HEIGHT,WEIGHT,HP,ATK,DEF,SP_ATK,SP_DEF,SPD,TOTAL
count,1048.0,1048.0,1048.0,1048.0,1048.0,1048.0,1048.0,1048.0,1048.0,1048.0,1048.0,1048.0,1048.0,1048.0,1048.0
mean,442.412214,1.187977,4425.310115,4.260496,0.118321,0.04771,1.257252,71.057634,69.842557,80.25,74.340649,72.668893,71.885496,68.61355,437.601145
std,260.578582,0.528074,2605.778646,2.268484,0.323142,0.213254,1.256724,132.054353,26.037975,32.466227,30.738994,32.707936,27.458648,30.100794,120.150889
min,1.0,1.0,11.0,1.0,0.0,0.0,0.1,0.1,1.0,5.0,5.0,10.0,20.0,5.0,175.0
25%,217.75,1.0,2178.5,2.0,0.0,0.0,0.6,9.0,50.0,55.0,50.0,50.0,50.0,45.0,330.0
50%,438.5,1.0,4386.0,4.0,0.0,0.0,1.0,29.0,68.0,76.5,70.0,65.0,70.0,65.0,455.0
75%,667.25,1.0,6673.5,6.0,0.0,0.0,1.5,70.625,80.0,100.0,90.0,95.0,90.0,90.0,515.0
max,898.0,6.0,8983.0,8.0,1.0,1.0,14.5,999.9,255.0,190.0,230.0,194.0,230.0,200.0,780.0


In [5]:
pokemon_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048 entries, 0 to 1047
Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   NUMBER          1048 non-null   int64  
 1   CODE            1048 non-null   int64  
 2   SERIAL          1048 non-null   int64  
 3   NAME            1048 non-null   object 
 4   TYPE1           1048 non-null   object 
 5   TYPE2           554 non-null    object 
 6   COLOR           1048 non-null   object 
 7   ABILITY1        1048 non-null   object 
 8   ABILITY2        522 non-null    object 
 9   ABILITY HIDDEN  818 non-null    object 
 10  GENERATION      1048 non-null   int64  
 11  LEGENDARY       1048 non-null   int64  
 12  MEGA_EVOLUTION  1048 non-null   int64  
 13  HEIGHT          1048 non-null   float64
 14  WEIGHT          1048 non-null   float64
 15  HP              1048 non-null   int64  
 16  ATK             1048 non-null   int64  
 17  DEF             1048 non-null   i

TYPE2, ABILITY2 and ABILITY HIDDEN contain missing values because not all pokemon have a second type, a second ability or a hidden ability.
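The size of these gaps can be read off the `info()` output above. The following is a minimal sketch that uses only the non-null counts printed there (554, 522 and 818 out of 1048 rows); the variable names are illustrative.

```python
import pandas as pd

# Non-null counts copied from the info() output above.
n_rows = 1048
non_null = pd.Series({"TYPE2": 554, "ABILITY2": 522, "ABILITY HIDDEN": 818})

# Missing entries per column and their share of all rows.
missing = n_rows - non_null
missing_share = (missing / n_rows).round(3)

print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

Roughly half of the rows have no second type or second ability, and about a fifth have no hidden ability. On the loaded data, `pokemon_df.isna().sum()` should report the same counts directly.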

#

## 2. Partition the data set into training and test sets

Before proceeding to the EDA section, we split the data so that 70% of the observations are in the training set and 30% are in the test set.

In [6]:
train_df, test_df = train_test_split(pokemon_df, test_size=0.3, random_state=123)
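As a quick check on the split sizes, the sketch below applies the same call to a placeholder frame with 1048 rows. The `row_id` column and the `toy_*` names are made up for illustration and only stand in for `pokemon_df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with the same number of rows as pokemon_df.
toy_df = pd.DataFrame({"row_id": range(1048)})

toy_train, toy_test = train_test_split(toy_df, test_size=0.3, random_state=123)
print(len(toy_train), len(toy_test))

# Repeating the call with the same random_state reproduces the same partition.
toy_train_again, _ = train_test_split(toy_df, test_size=0.3, random_state=123)
print(toy_train.index.equals(toy_train_again.index))
```

The split yields 733 training rows and 315 test rows, which matches the 733 entries reported by `info()` on the training copy later in this notebook.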

#

## 3. Exploratory analysis on the training data set

#

#### a. Distributions of numerical columns

In [7]:
alt.Chart(train_df).mark_bar().encode(
    alt.X(alt.repeat(), type='quantitative', bin=alt.Bin(maxbins=40)),
    y='count()',
).properties(
    width=300,
    height=200
).repeat(
    ['NUMBER', 'CODE', 'GENERATION', 'LEGENDARY', 'MEGA_EVOLUTION', 'HEIGHT', 'WEIGHT',
     'HP', 'ATK', 'DEF', 'SP_ATK', 'SP_DEF', 'SPD', 'TOTAL'],
    columns=3
)

Figure 1. Distributions of numeric characteristics.

Most of the numerical attributes appear to be roughly symmetric and bell-shaped, except for HEIGHT and WEIGHT, which are right-skewed. Only a small number of pokemon have alternate forms, are legendary or have a mega evolution form, so it makes sense that CODE, LEGENDARY and MEGA_EVOLUTION are so imbalanced.
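One rough way to back up the skewness reading without the plot is to compare mean and median, since a long right tail pulls the mean above the median. The sketch below uses values copied from the `describe()` output above, which covers the full dataset rather than only the training split. The ratio is an informal indicator, not a formal skewness statistic.

```python
import pandas as pd

# Mean and median (50%) copied from the describe() output above.
summary = pd.DataFrame(
    {
        "mean": [1.257252, 71.057634, 69.842557, 80.25],
        "median": [1.0, 29.0, 68.0, 76.5],
    },
    index=["HEIGHT", "WEIGHT", "HP", "ATK"],
)

# A ratio well above 1 suggests a long right tail.
summary["mean_to_median"] = (summary["mean"] / summary["median"]).round(2)
print(summary.sort_values("mean_to_median", ascending=False))
```

WEIGHT and HEIGHT come out at about 2.45 and 1.26, while HP and ATK stay close to 1. On the training split itself, `train_df[['HEIGHT', 'WEIGHT']].skew()` would give a direct estimate.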

#

#### b. Distributions of categorical columns

In [8]:
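alt.Chart(train_df).mark_bar().encode(
    x='count()',
    # Use the Y channel class for the y encoding and state the type explicitly,
    # since it cannot be inferred from a repeated field.
    y=alt.Y(alt.repeat(), type='nominal'),
).properties(
    width=300,
    height=1500
).repeat(
    ['TYPE1', 'TYPE2', 'COLOR', 'ABILITY1', 'ABILITY2', 'ABILITY HIDDEN'],
    columns=3
)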

Figure 2. Distributions of categorical characteristics.

For the categorical columns, we can see that the pokemon types are not evenly balanced and that some pokemon share the same ability.
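Class balance of the target can also be quantified directly. Here is a minimal sketch on the five preview rows shown earlier; five rows are far too few to be representative and only illustrate the call.

```python
import pandas as pd

# TYPE1 values of the five rows from the head() preview.
preview = pd.DataFrame({"TYPE1": ["Grass", "Grass", "Grass", "Grass", "Fire"]})

# Relative frequency of each primary type, most common first.
type_share = preview["TYPE1"].value_counts(normalize=True)
print(type_share)
```

Applied to `train_df['TYPE1']`, the same call gives the share of each primary type in the training set.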

#

#### c. Correlation table for numerical columns

In [9]:
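# Restrict to numeric columns first; recent pandas versions raise an error
# when corr() is called on a frame that still contains string columns.
train_df.select_dtypes('number').corr(method='spearman').style.background_gradient()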

Unnamed: 0,NUMBER,CODE,SERIAL,GENERATION,LEGENDARY,MEGA_EVOLUTION,HEIGHT,WEIGHT,HP,ATK,DEF,SP_ATK,SP_DEF,SPD,TOTAL
NUMBER,1.0,-0.055013,0.999999,0.887848,0.273563,-0.173972,0.017224,0.02142,0.168556,0.14939,0.121486,0.119136,0.084039,0.055392,0.173136
CODE,-0.055013,1.0,-0.05432,0.099148,0.102876,0.547774,0.114481,0.081824,0.0828,0.228664,0.19661,0.159985,0.207702,0.209565,0.26762
SERIAL,0.999999,-0.05432,1.0,0.887965,0.273563,-0.173575,0.017268,0.021456,0.168571,0.149466,0.12156,0.119203,0.084108,0.055413,0.173241
GENERATION,0.887848,0.099148,0.887965,1.0,0.184573,-0.197465,-0.028643,-0.013603,0.107321,0.124351,0.061546,0.047015,0.02543,0.041032,0.0956
LEGENDARY,0.273563,0.102876,0.273563,0.184573,1.0,0.017758,0.292413,0.261617,0.359741,0.338984,0.291545,0.382871,0.332879,0.324851,0.499154
MEGA_EVOLUTION,-0.173972,0.547774,-0.173575,-0.197465,0.017758,1.0,0.238636,0.195193,0.109494,0.258752,0.217803,0.220142,0.258929,0.185273,0.331398
HEIGHT,0.017224,0.114481,0.017268,-0.028643,0.292413,0.238636,1.0,0.836517,0.633884,0.622661,0.483274,0.453955,0.482859,0.344829,0.72246
WEIGHT,0.02142,0.081824,0.021456,-0.013603,0.261617,0.195193,0.836517,1.0,0.591194,0.594795,0.522675,0.319488,0.438963,0.181358,0.635332
HP,0.168556,0.0828,0.168571,0.107321,0.359741,0.109494,0.633884,0.591194,1.0,0.591208,0.429429,0.452477,0.485219,0.278495,0.725479
ATK,0.14939,0.228664,0.149466,0.124351,0.338984,0.258752,0.622661,0.594795,0.591208,1.0,0.50725,0.342016,0.332953,0.358023,0.723272


In the correlation table, we can see that HEIGHT, WEIGHT and the base stats are moderately to strongly correlated with each other and with TOTAL (for example, 0.84 between HEIGHT and WEIGHT and 0.73 between HP and TOTAL), which makes sense because a stronger pokemon should have a higher overall score. However, these attributes are only weakly correlated with identification variables such as NUMBER, CODE, SERIAL and GENERATION. We might consider dropping these columns since they mostly identify a pokemon rather than describe it.
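To make the rank-correlation computation concrete, here is a small sketch on the five preview rows from earlier. The sample is only illustrative, so its coefficient differs from the 0.84 in the training-set table.

```python
import pandas as pd

preview = pd.DataFrame(
    {
        "NAME": ["Bulbasaur", "Ivysaur", "Venusaur", "Mega Venusaur", "Charmander"],
        "HEIGHT": [0.7, 1.0, 2.0, 2.4, 0.6],
        "WEIGHT": [6.9, 13.0, 100.0, 155.5, 8.5],
    }
)

# Keep numeric columns only, then compute the Spearman (rank) correlation.
spearman = preview.select_dtypes("number").corr(method="spearman")
print(spearman.loc["HEIGHT", "WEIGHT"])
```

Spearman correlation depends only on the ordering of the values, which makes it less sensitive than Pearson correlation to the heavy right tails of HEIGHT and WEIGHT.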

#

#### d. Explore some interesting relationships between categorical variables

In [10]:
# Primary ability of pokemon vs primary type of pokemon

alt.Chart(train_df).mark_square().encode(
    x='ABILITY1',
    y='TYPE1',
    color='count()',
    size='count()')

Figure 3. Correlation plot between primary type and primary ability

It seems that most primary abilities belong to a single primary type or are shared with only a few other types.
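This reading of the plot can be checked numerically by counting, for each primary ability, how many distinct primary types it appears with. The following is a minimal sketch on the five preview rows from earlier (illustrative only).

```python
import pandas as pd

preview = pd.DataFrame(
    {
        "TYPE1": ["Grass", "Grass", "Grass", "Grass", "Fire"],
        "ABILITY1": ["Overgrow", "Overgrow", "Overgrow", "Thick Fat", "Blaze"],
    }
)

# Number of distinct primary types per primary ability.
types_per_ability = preview.groupby("ABILITY1")["TYPE1"].nunique()

# Share of abilities that occur with exactly one primary type.
single_type_share = (types_per_ability == 1).mean()

print(types_per_ability)
print(single_type_share)
```

Run on `train_df`, a `single_type_share` close to 1 would support the observation that abilities are largely type-specific.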

#

In [11]:
# TYPE of pokemon vs COLOR of pokemon

alt.Chart(train_df).mark_square().encode(
    x='TYPE1',
    y='COLOR',
    color='count()',
    size='count()')

Figure 4. Correlation plot between primary type and main color

We can see that many pokemon types have pokemon in all ten colors, while some types have one dominant color. For example, Water types are mostly blue and Grass types are mostly green.
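A dominant color per type can be extracted with a row-normalized cross-tabulation. Here is a minimal sketch on the five preview rows from earlier (illustrative only; the `dominant` frame is a made-up helper).

```python
import pandas as pd

preview = pd.DataFrame(
    {
        "TYPE1": ["Grass", "Grass", "Grass", "Grass", "Fire"],
        "COLOR": ["Green", "Green", "Green", "Green", "Red"],
    }
)

# Rows are types, columns are colors, and each row sums to 1.
color_share = pd.crosstab(preview["TYPE1"], preview["COLOR"], normalize="index")

# Most frequent color for each type and the share it accounts for.
dominant = pd.DataFrame(
    {"color": color_share.idxmax(axis=1), "share": color_share.max(axis=1)}
)
print(dominant)
```

On the training set, types with a high `share` would be the ones where color is most informative about the primary type.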

In [12]:
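# Preprocessing step for later milestones

imputed_train_df_df = train_df.copy()

# Assign the result back rather than calling fillna(inplace=True) on a column
# selection, which may act on a temporary copy and leave the frame unchanged.
imputed_train_df_df['TYPE2'] = imputed_train_df_df['TYPE2'].fillna("No Other Type")
imputed_train_df_df['ABILITY2'] = imputed_train_df_df['ABILITY2'].fillna("No Other Ability")
imputed_train_df_df['ABILITY HIDDEN'] = imputed_train_df_df['ABILITY HIDDEN'].fillna("No Ability Hidden")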
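The effect of this imputation can be seen on a small frame built from the five preview rows. This is a sketch with an illustrative `preview` frame. It uses the same placeholder strings as above, passed as a column-to-value mapping in a single `fillna` call.

```python
import numpy as np
import pandas as pd

preview = pd.DataFrame(
    {
        "NAME": ["Bulbasaur", "Ivysaur", "Venusaur", "Mega Venusaur", "Charmander"],
        "TYPE2": ["Poison", "Poison", "Poison", "Poison", np.nan],
        "ABILITY HIDDEN": ["Chrolophyll", "Chrolophyll", "Chrolophyll", np.nan, "Solar Power"],
    }
)

# One placeholder per column; fillna returns a new frame and leaves preview as is.
fill_values = {"TYPE2": "No Other Type", "ABILITY HIDDEN": "No Ability Hidden"}
imputed = preview.fillna(fill_values)

print(imputed)
print(imputed.isna().sum().sum())
```

Using an explicit placeholder category keeps "no second type" as information the model can use, instead of dropping those rows.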



In [13]:
imputed_train_df_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 733 entries, 787 to 1041
Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   NUMBER          733 non-null    int64  
 1   CODE            733 non-null    int64  
 2   SERIAL          733 non-null    int64  
 3   NAME            733 non-null    object 
 4   TYPE1           733 non-null    object 
 5   TYPE2           733 non-null    object 
 6   COLOR           733 non-null    object 
 7   ABILITY1        733 non-null    object 
 8   ABILITY2        733 non-null    object 
 9   ABILITY HIDDEN  733 non-null    object 
 10  GENERATION      733 non-null    int64  
 11  LEGENDARY       733 non-null    int64  
 12  MEGA_EVOLUTION  733 non-null    int64  
 13  HEIGHT          733 non-null    float64
 14  WEIGHT          733 non-null    float64
 15  HP              733 non-null    int64  
 16  ATK             733 non-null    int64  
 17  DEF             733 non-null    

In [14]:
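# Column groups for preprocessing. TYPE1 is the prediction target, so it has to be
# separated from the other categorical columns before a model is fit.
binary_features = ['LEGENDARY', 'MEGA_EVOLUTION']
categorical_features = ['TYPE1', 'TYPE2', 'COLOR', 'ABILITY1', 'ABILITY2', 'ABILITY HIDDEN']
numeric_features = ['HEIGHT', 'WEIGHT', 'HP', 'ATK', 'DEF', 'SP_ATK', 'SP_DEF', 'SPD']
drop_features = ['NUMBER', 'CODE', 'GENERATION', 'TOTAL']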
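A quick sanity check is to confirm that every column ends up in exactly one group. The sketch below compares the lists above against the column names from the `head()` preview; `all_columns`, `assigned`, `unassigned` and `duplicated` are illustrative names.

```python
# Column names as shown in the head() preview.
all_columns = [
    "NUMBER", "CODE", "SERIAL", "NAME", "TYPE1", "TYPE2", "COLOR",
    "ABILITY1", "ABILITY2", "ABILITY HIDDEN", "GENERATION", "LEGENDARY",
    "MEGA_EVOLUTION", "HEIGHT", "WEIGHT", "HP", "ATK", "DEF",
    "SP_ATK", "SP_DEF", "SPD", "TOTAL",
]

binary_features = ["LEGENDARY", "MEGA_EVOLUTION"]
categorical_features = ["TYPE1", "TYPE2", "COLOR", "ABILITY1", "ABILITY2", "ABILITY HIDDEN"]
numeric_features = ["HEIGHT", "WEIGHT", "HP", "ATK", "DEF", "SP_ATK", "SP_DEF", "SPD"]
drop_features = ["NUMBER", "CODE", "GENERATION", "TOTAL"]

assigned = binary_features + categorical_features + numeric_features + drop_features

# Columns that no group covers, and columns that appear in more than one group.
unassigned = [col for col in all_columns if col not in assigned]
duplicated = sorted({col for col in assigned if assigned.count(col) > 1})

print(unassigned)
print(duplicated)
```

With the lists as written, SERIAL and NAME are not assigned to any group, so they would need an explicit decision (for example, adding them to `drop_features`) before modeling.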