# Data Preprocessing for Pokémon Dataset
#### Source: Kaggle – Pokémon Dataset
#### Author: Tanishk Sharma

### 🧹 Cleaning and Transforming Pokémon Data for Analysis
This notebook focuses on parsing, cleaning, and transforming the original dataset to extract useful information about Pokémon types, stats, and moves. The goal is to prepare a structured and usable dataset for future analysis and simulations, such as battle damage estimation or machine learning tasks.

#### Dataset Loading and Initial Exploration

In [5]:
import pandas as pd

# =================== Load dataset ===================
dat_base = pd.read_csv("../data/pokemon_data.csv")
dat_base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1302 entries, 0 to 1301
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               1302 non-null   int64 
 1   name             1302 non-null   object
 2   base_experience  1302 non-null   int64 
 3   height           1302 non-null   int64 
 4   weight           1302 non-null   int64 
 5   types            1302 non-null   object
 6   abilities        1302 non-null   object
 7   moves            1268 non-null   object
 8   stats            1302 non-null   object
dtypes: int64(4), object(5)
memory usage: 91.7+ KB
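
The summary above reports 1268 non-null `moves` entries out of 1302 rows, so 34 Pokémon have no move list. A minimal sketch of how such rows could be inspected before splitting; the `sample` frame and its values are hypothetical stand-ins, not rows taken from the dataset:

```python
import pandas as pd

# Hypothetical stand-in for pokemon_data.csv (only the columns needed here)
sample = pd.DataFrame({
    "name": ["bulbasaur", "example-form", "charmander"],
    "moves": ["razor-wind, swords-dance", None, "mega-punch, fire-punch"],
})

# Number of missing values in each column
missing_counts = sample.isna().sum()

# Names of the rows whose 'moves' entry is missing
missing_moves = sample.loc[sample["moves"].isna(), "name"].tolist()

print(missing_counts)
print(missing_moves)
```

Running the same two expressions on `dat_base` would list the affected entries in the real data.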


#### Unpacking Composite Columns: Types, Moves, and Stats

In [31]:
# =================== Split types ===================
# Split the 'types' column into 'primary_type' and 'secondary_type'
dat_base[['primary_type', 'secondary_type']] = dat_base['types'].str.split(',', n=1, expand=True)

# Strip surrounding whitespace
dat_base['primary_type'] = dat_base['primary_type'].str.strip()
dat_base['secondary_type'] = dat_base['secondary_type'].str.strip()

# =================== Split moves ===================
# Split the 'moves' column into 'move_1' through 'move_5'.
# With n=4 the string is cut at most four times, so 'move_5' holds
# everything that follows the fourth comma.
move_cols = ['move_1', 'move_2', 'move_3', 'move_4', 'move_5']
dat_base[move_cols] = dat_base['moves'].str.split(',', n=4, expand=True)

# Strip surrounding whitespace
for col in move_cols:
    dat_base[col] = dat_base[col].str.strip()

# =================== Split stats ===================
# Split the 'stats' column into one column per stat
stat_cols = ['hp', 'attack', 'defense', 'special-attack', 'special-defense', 'speed']
dat_base[stat_cols] = dat_base['stats'].str.split(',', n=5, expand=True)

for col in stat_cols:
    # Strip surrounding whitespace
    dat_base[col] = dat_base[col].str.strip()
    # If the values contain '=' (checked on the first row only),
    # keep the part after it and cast to int64
    if '=' in str(dat_base[col].iloc[0]):
        dat_base[col] = dat_base[col].str.split('=').str[1].astype('int64')

# Drop the original composite columns
cols_to_drop = ['types', 'moves', 'stats']
cols_presentes = [col for col in cols_to_drop if col in dat_base.columns]

if cols_presentes:
    dat_base = dat_base.drop(cols_presentes, axis=1)

# Replace missing values with empty strings
dat_base = dat_base.fillna("")
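
The splitting logic in this cell can be tried on a small frame first. The sketch below uses made-up rows, and the `name=value` layout of `stats` is an assumption inferred from the `'='` check in the cell, not something confirmed from the CSV. It also contrasts `n=4`, which leaves the remainder of the list in the fifth column, with a hypothetical alternative that keeps exactly the first five moves:

```python
import pandas as pd

# Hypothetical rows; the "name=value" stats layout is an assumption
toy = pd.DataFrame({
    "moves": ["a, b, c, d, e, f, g", None],
    "stats": ["hp=45, attack=49", "hp=39, attack=52"],
})

# n=4 cuts at most four times: the fifth piece keeps the rest of the string
capped = toy["moves"].str.split(",", n=4, expand=True)
capped.columns = ["move_1", "move_2", "move_3", "move_4", "move_5"]
capped = capped.apply(lambda s: s.str.strip())

# Alternative: split on every comma, then keep only the first five columns
first_five = toy["moves"].str.split(",", expand=True).iloc[:, :5]
first_five.columns = capped.columns
first_five = first_five.apply(lambda s: s.str.strip())

# Stats: split, strip, keep the part after '=', cast to int64
stats = toy["stats"].str.split(",", n=1, expand=True)
stats.columns = ["hp", "attack"]
for col in stats.columns:
    stats[col] = stats[col].str.strip().str.split("=").str[1].astype("int64")

print(capped.loc[0, "move_5"])      # remainder of the move list
print(first_five.loc[0, "move_5"])  # fifth move only
print(stats)
```

Which variant is appropriate depends on whether `move_5` is meant to be a single move or a catch-all for the rest of the list.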

#### Exporting the Cleaned Dataset
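
One detail of the round trip is worth noting: empty strings written by `to_csv` become empty fields, and `read_csv` parses empty fields as missing values by default. This is why the reloaded frame reports fewer non-null entries for `secondary_type` and the move columns even though missing values were filled before export. A small sketch with an in-memory buffer and hypothetical rows; `keep_default_na=False` is one way to read the empty strings back literally:

```python
import io

import pandas as pd

# Hypothetical two-row frame with one missing secondary type
frame = pd.DataFrame({
    "primary_type": ["grass", "fire"],
    "secondary_type": ["poison", None],
}).fillna("")

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Default parsing: the empty field comes back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False: the empty field stays an empty string
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["secondary_type"].isna().sum())
print(repr(literal_read.loc[1, "secondary_type"]))
```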

In [35]:
# Save as CSV
dat_base.to_csv("../data/pokemon_data_mod.csv", index=False)

# Reload the exported file to check the result
# (empty fields are read back as NaN by default)
db = pd.read_csv("../data/pokemon_data_mod.csv")
db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1302 entries, 0 to 1301
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               1302 non-null   int64 
 1   name             1302 non-null   object
 2   base_experience  1302 non-null   int64 
 3   height           1302 non-null   int64 
 4   weight           1302 non-null   int64 
 5   abilities        1302 non-null   object
 6   primary_type     1302 non-null   object
 7   secondary_type   726 non-null    object
 8   move_1           1268 non-null   object
 9   move_2           1265 non-null   object
 10  move_3           1263 non-null   object
 11  move_4           1263 non-null   object
 12  move_5           1262 non-null   object
 13  hp               1302 non-null   int64 
 14  attack           1302 non-null   int64 
 15  defense          1302 non-null   int64 
 16  special-attack   1302 non-null   int64 
 17  special-defense  1302 non-null   