# Data cleaning

This data doesn't appear to need much cleaning. The main issue is the sheer number of variables to work with.

The 'classification' column name is misspelled as 'classfication' in the raw file, so I will fix that:

In [55]:
import pandas as pd
import numpy as np

pokemon = pd.read_csv("raw_data/pokemon.csv")

pokemon.rename(columns={"classfication":"classification"}, inplace = True)

pokemon.classification

0             Seed Pokémon
1             Seed Pokémon
2             Seed Pokémon
3           Lizard Pokémon
4            Flame Pokémon
              ...         
796         Launch Pokémon
797    Drawn Sword Pokémon
798      Junkivore Pokémon
799          Prism Pokémon
800     Artificial Pokémon
Name: classification, Length: 801, dtype: object
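By default, rename silently ignores labels it cannot find, so a second typo in the mapping would go unnoticed. A minimal sketch of a stricter rename, using a toy frame (the data here is illustrative, only the misspelled header mirrors the real file):

```python
import pandas as pd

# Toy frame standing in for the raw data; only the misspelled header matters here.
toy = pd.DataFrame({"classfication": ["Seed Pokémon", "Lizard Pokémon"]})

# errors="raise" makes rename fail loudly if the old label is absent,
# instead of silently doing nothing (the default is errors="ignore").
renamed = toy.rename(columns={"classfication": "classification"}, errors="raise")

# The misspelled label no longer exists, so renaming it again raises KeyError.
try:
    renamed.rename(columns={"classfication": "classification"}, errors="raise")
    second_rename_failed = False
except KeyError:
    second_rename_failed = True

print(list(renamed.columns), second_rename_failed)
```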

Capture rate should be numeric but currently has dtype object, because one pokemon (Minior) has non-numeric characters in its capture_rate. This is because it has two forms, Meteor and Core, with two different rates. I'm going to set this value to NA to avoid hassle.
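One way to locate entries like this is to flag every value that is not made up purely of digits. A rough sketch on toy data (the names and values below are made up for illustration, not copied from the real file):

```python
import pandas as pd

# Toy data; names and values are illustrative only.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "capture_rate": ["45", "30 (x)255 (y)", "3"],
})

# True where the whole entry consists of digits; missing values count as False.
is_numeric = toy["capture_rate"].str.fullmatch(r"\d+", na=False)

# Rows whose capture_rate cannot be read as a plain integer.
offenders = toy.loc[~is_numeric, ["name", "capture_rate"]]
print(offenders)
```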

In [100]:
pokemon.loc[pokemon["name"] == "Minior", "capture_rate"]

# replace Minior's two-form value with a missing value
pokemon.loc[pokemon["name"] == "Minior", "capture_rate"] = np.nan

pokemon.loc[pokemon["name"] == "Minior", "capture_rate"]

pokemon.dtypes

abilities             object
against_bug          float64
against_dark         float64
against_dragon       float64
against_electric     float64
against_fairy        float64
against_fight        float64
against_fire         float64
against_flying       float64
against_ghost        float64
against_grass        float64
against_ground       float64
against_ice          float64
against_normal       float64
against_poison       float64
against_psychic      float64
against_rock         float64
against_steel        float64
against_water        float64
attack                 int64
base_egg_steps         int64
base_happiness         int64
base_total             int64
capture_rate          object
classification        object
defense                int64
experience_growth      int64
height_m             float64
hp                     int64
japanese_name         object
name                  object
percentage_male      float64
pokedex_number         int64
sp_attack              int64
sp_defense    

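Now I will convert capture_rate from object to a numeric dtype. I'm using the nullable 'Int64' dtype so that the missing value for Minior can sit alongside integers.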

In [102]:
pokemon.capture_rate = pokemon.capture_rate.astype('Int64')

pokemon.capture_rate

0       45
1       45
2       45
3       45
4       45
      ... 
796     25
797    255
798     15
799      3
800      3
Name: capture_rate, Length: 801, dtype: Int64
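Whether astype('Int64') accepts strings of digits directly may depend on the pandas version. A hedged alternative is to parse the strings with pd.to_numeric first and then cast to the nullable integer dtype. The series below is toy data, not the real column:

```python
import numpy as np
import pandas as pd

# Toy object column: digit strings plus one missing value.
rates = pd.Series(["45", "255", np.nan, "3"], dtype=object)

# Parse the strings; the result is float64 because of the missing value.
numeric = pd.to_numeric(rates, errors="raise")

# Cast to the nullable integer dtype, which stores the gap as <NA>.
as_int = numeric.astype("Int64")
print(as_int)
```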

## Dropping variables

There are some variables I'm not interested in including in my models, so I will drop those now.

In [111]:
# remove every column whose name starts with "against_"
pokemon_trimmed = pokemon.drop(list(pokemon.filter(regex = "^against_")), axis = "columns")
# remove identifier-like columns that are unique or near-unique per pokemon
pokemon_trimmed = pokemon_trimmed.drop(["pokedex_number", "japanese_name", "classification"], axis = "columns")
pokemon_trimmed.head(20)

Unnamed: 0,abilities,attack,base_egg_steps,base_happiness,base_total,capture_rate,defense,experience_growth,height_m,hp,name,percentage_male,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
0,"['Overgrow', 'Chlorophyll']",49,5120,70,318,45,49,1059860,0.7,45,Bulbasaur,88.1,65,65,45,grass,poison,6.9,1,0
1,"['Overgrow', 'Chlorophyll']",62,5120,70,405,45,63,1059860,1.0,60,Ivysaur,88.1,80,80,60,grass,poison,13.0,1,0
2,"['Overgrow', 'Chlorophyll']",100,5120,70,625,45,123,1059860,2.0,80,Venusaur,88.1,122,120,80,grass,poison,100.0,1,0
3,"['Blaze', 'Solar Power']",52,5120,70,309,45,43,1059860,0.6,39,Charmander,88.1,60,50,65,fire,,8.5,1,0
4,"['Blaze', 'Solar Power']",64,5120,70,405,45,58,1059860,1.1,58,Charmeleon,88.1,80,65,80,fire,,19.0,1,0
5,"['Blaze', 'Solar Power']",104,5120,70,634,45,78,1059860,1.7,78,Charizard,88.1,159,115,100,fire,flying,90.5,1,0
6,"['Torrent', 'Rain Dish']",48,5120,70,314,45,65,1059860,0.5,44,Squirtle,88.1,50,64,43,water,,9.0,1,0
7,"['Torrent', 'Rain Dish']",63,5120,70,405,45,80,1059860,1.0,59,Wartortle,88.1,65,80,58,water,,22.5,1,0
8,"['Torrent', 'Rain Dish']",103,5120,70,630,45,120,1059860,1.6,79,Blastoise,88.1,135,115,78,water,,85.5,1,0
9,"['Shield Dust', 'Run Away']",30,3840,70,195,255,35,1000000,0.3,45,Caterpie,50.0,20,20,45,bug,,2.9,1,0
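The prefix-based drop can also be written as a boolean mask over the column names, which avoids building an intermediate frame. A small sketch on a toy frame (the columns are a subset chosen for illustration):

```python
import pandas as pd

# Toy frame with two against_* columns and two columns to keep.
toy = pd.DataFrame({
    "against_bug": [1.0],
    "against_fire": [2.0],
    "attack": [49],
    "name": ["Bulbasaur"],
})

# True for columns that do NOT start with the prefix.
keep = ~toy.columns.str.startswith("against_")

# Select only the kept columns.
trimmed = toy.loc[:, keep]
print(list(trimmed.columns))
```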


## Feature engineering

- I plan to use a loop to make a column that indicates whether a pokemon is evolution step 1, 2, 3, etc., based on abilities remaining the same in consecutive pokemon.

### Caveats

- This might be tough, as some evolutions were added in later generations and are therefore not consecutive in the dataset
- Not all pokemon in an evolution chain have the same abilities (e.g. Caterpie, Metapod, Butterfree)
- Pokemon at the end of the dataset have the same ability (Beast Boost) but are not evolutions of each other

In [112]:
# create a column that is True when the row's abilities match the previous row's
pokemon_trimmed["ability_match"] = pokemon_trimmed.abilities.eq(pokemon_trimmed.abilities.shift())

# pokemon whose only ability is Beast Boost are not evolutions of each other,
# so force ability_match to False for those rows
beast_boost = pokemon_trimmed["abilities"] == "['Beast Boost']"
pokemon_trimmed.loc[beast_boost, "ability_match"] = False


In [113]:
# create an empty array to hold each pokemon's evolution level
evo_level = np.empty(len(pokemon_trimmed), dtype = object)

In [139]:
# loop through the pokemon rows and fill in evo_level
# (relies on the default 0..n-1 index so that index doubles as array position)
for index, row in pokemon_trimmed.iterrows():

    # start a new chain if the abilities differ from the previous row, or if the
    # pokemon is Uxie, Mesprit or Azelf (rows 479-481), which share an ability
    # but are not evolutions of each other
    if (not row["ability_match"]) or (index in [479, 480, 481]):
        evo_level[index] = 1

    else:
        evo_level[index] = evo_level[index - 1] + 1  # one step further along the chain

# add this array to the dataframe
pokemon_trimmed["evolution_level"] = evo_level.astype(int)
pokemon_trimmed.dtypes

abilities             object
attack                 int64
base_egg_steps         int64
base_happiness         int64
base_total             int64
capture_rate           Int64
defense                int64
experience_growth      int64
height_m             float64
hp                     int64
name                  object
percentage_male      float64
sp_attack              int64
sp_defense             int64
speed                  int64
type1                 object
type2                 object
weight_kg            float64
generation             int64
is_legendary           int64
ability_match           bool
evolution_level        int64
dtype: object
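The same counting can be done without an explicit loop by numbering each run of matching rows. This is only a sketch on toy data: it assumes any manual corrections (such as the Beast Boost rows) have already been written into the match column, and the variable names are hypothetical.

```python
import pandas as pd

# Toy abilities column: a chain of three, a chain of two, then a single pokemon.
abilities = pd.Series(["a", "a", "a", "b", "b", "c"])

# True where the abilities equal the previous row's.
match = abilities.eq(abilities.shift())

# Every non-matching row starts a new chain, so a running count of
# non-matches gives each chain its own id.
chain_id = (~match).cumsum()

# Position within the chain (0-based) plus one is the evolution level.
level = match.groupby(chain_id).cumcount() + 1
print(level.tolist())
```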

In [142]:
# no pokemon has more than 3 evolution levels, so check that none exceed that
mask = pokemon_trimmed["evolution_level"] > 3
pokemon_trimmed.loc[mask, ["name", "evolution_level"]]

# returns no rows, so this now works fine

Unnamed: 0,name,evolution_level


Upon inspection, Uxie, Mesprit and Azelf all have the same ability (Levitate) but aren't evolutions of one another, which is why rows 479 to 481 are forced to level 1 in the loop above. This has now been fixed.

In [143]:
pokemon_trimmed.loc[[479, 480, 481], ["name", "abilities", "evolution_level"]]

Unnamed: 0,name,abilities,evolution_level
479,Uxie,['Levitate'],1
480,Mesprit,['Levitate'],1
481,Azelf,['Levitate'],1
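Hard-coded row numbers break if the data is ever re-sorted or filtered. A possible alternative, sketched on a toy frame, is to select the three pokemon by name and clear their ability_match flag before counting levels (the list name not_evolutions is an assumption for illustration):

```python
import pandas as pd

# Toy frame; ability_match values are illustrative.
toy = pd.DataFrame({
    "name": ["Bulbasaur", "Uxie", "Mesprit", "Azelf"],
    "ability_match": [False, False, True, True],
})

# Pokemon that share an ability with their neighbour but are not evolutions.
not_evolutions = ["Uxie", "Mesprit", "Azelf"]

# Clear the flag by name rather than by row position.
toy.loc[toy["name"].isin(not_evolutions), "ability_match"] = False
print(toy)
```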


In [144]:
# drop abilities and ability match cols
pokemon_trimmed = pokemon_trimmed.drop(["abilities", "ability_match"], axis = "columns")
pokemon_trimmed

Unnamed: 0,attack,base_egg_steps,base_happiness,base_total,capture_rate,defense,experience_growth,height_m,hp,name,percentage_male,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary,evolution_level
0,49,5120,70,318,45,49,1059860,0.7,45,Bulbasaur,88.1,65,65,45,grass,poison,6.9,1,0,1
1,62,5120,70,405,45,63,1059860,1.0,60,Ivysaur,88.1,80,80,60,grass,poison,13.0,1,0,2
2,100,5120,70,625,45,123,1059860,2.0,80,Venusaur,88.1,122,120,80,grass,poison,100.0,1,0,3
3,52,5120,70,309,45,43,1059860,0.6,39,Charmander,88.1,60,50,65,fire,,8.5,1,0,1
4,64,5120,70,405,45,58,1059860,1.1,58,Charmeleon,88.1,80,65,80,fire,,19.0,1,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
796,101,30720,0,570,25,103,1250000,9.2,97,Celesteela,,107,101,61,steel,flying,999.9,7,1,1
797,181,30720,0,570,255,131,1250000,0.3,59,Kartana,,59,31,109,grass,steel,0.1,7,1,1
798,101,30720,0,570,15,53,1250000,5.5,223,Guzzlord,,97,53,43,dark,dragon,888.0,7,1,1
799,107,30720,0,600,3,101,1250000,2.4,97,Necrozma,,127,89,79,psychic,,230.0,7,1,1


## Creating dummies

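I'm going to turn my categorical variables, such as pokemon element type, into dummy variables (0/1). Because name is still an object column at this point, it is encoded as well, which accounts for most of the 851 columns in the result.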

In [145]:
pokemon_dummies = pd.get_dummies(pokemon_trimmed, drop_first = True)

pokemon_dummies.head()

pokemon_dummies.columns

Index(['attack', 'base_egg_steps', 'base_happiness', 'base_total',
       'capture_rate', 'defense', 'experience_growth', 'height_m', 'hp',
       'percentage_male',
       ...
       'type2_ghost', 'type2_grass', 'type2_ground', 'type2_ice',
       'type2_normal', 'type2_poison', 'type2_psychic', 'type2_rock',
       'type2_steel', 'type2_water'],
      dtype='object', length=851)
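get_dummies encodes every object column unless told otherwise. If only the type columns are meant to be encoded, the columns argument restricts it. A sketch on a toy frame (the rows are a small illustrative subset):

```python
import pandas as pd

# Toy frame with one identifier column and two categorical type columns.
toy = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander", "Charizard"],
    "attack": [49, 52, 104],
    "type1": ["grass", "fire", "fire"],
    "type2": ["poison", None, "flying"],
})

# Default behaviour: name is expanded into one column per pokemon as well.
all_dummies = pd.get_dummies(toy, drop_first=True)

# Restricting to the type columns leaves name untouched.
type_dummies = pd.get_dummies(toy, columns=["type1", "type2"], drop_first=True)

print(list(all_dummies.columns))
print(list(type_dummies.columns))
```

With this approach, name would still need to be dropped or moved to the index before modelling. Note also that with drop_first = True, a missing type2 and the dropped first category both end up as all zeros.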

## Save cleaned data

In [146]:
# one last check
pokemon_dummies

Unnamed: 0,attack,base_egg_steps,base_happiness,base_total,capture_rate,defense,experience_growth,height_m,hp,percentage_male,...,type2_ghost,type2_grass,type2_ground,type2_ice,type2_normal,type2_poison,type2_psychic,type2_rock,type2_steel,type2_water
0,49,5120,70,318,45,49,1059860,0.7,45,88.1,...,0,0,0,0,0,1,0,0,0,0
1,62,5120,70,405,45,63,1059860,1.0,60,88.1,...,0,0,0,0,0,1,0,0,0,0
2,100,5120,70,625,45,123,1059860,2.0,80,88.1,...,0,0,0,0,0,1,0,0,0,0
3,52,5120,70,309,45,43,1059860,0.6,39,88.1,...,0,0,0,0,0,0,0,0,0,0
4,64,5120,70,405,45,58,1059860,1.1,58,88.1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
796,101,30720,0,570,25,103,1250000,9.2,97,,...,0,0,0,0,0,0,0,0,0,0
797,181,30720,0,570,255,131,1250000,0.3,59,,...,0,0,0,0,0,0,0,0,1,0
798,101,30720,0,570,15,53,1250000,5.5,223,,...,0,0,0,0,0,0,0,0,0,0
799,107,30720,0,600,3,101,1250000,2.4,97,,...,0,0,0,0,0,0,0,0,0,0


In [149]:
from pathlib import Path
filepath = Path("clean_data/pokemon_clean.csv")
pokemon_dummies.to_csv(filepath)
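to_csv does not create missing folders, and by default it also writes the row index, which comes back as an extra "Unnamed: 0" column when the file is read again. A hedged sketch of a save step that handles both, using a temporary directory in place of the project folder and a toy frame in place of the cleaned data:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned data.
toy = pd.DataFrame({"attack": [49, 62], "evolution_level": [1, 2]})

# A temporary directory stands in for the project folder in this sketch.
base = Path(tempfile.mkdtemp())
filepath = base / "clean_data" / "pokemon_clean.csv"

# Create the output folder if it does not exist yet.
filepath.parent.mkdir(parents=True, exist_ok=True)

# index=False leaves the row index out of the file.
toy.to_csv(filepath, index=False)

# Reading the file back gives the same columns, with no extra index column.
reloaded = pd.read_csv(filepath)
print(list(reloaded.columns))
```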