# Create datasets

In this notebook we start from the Kaggle dataset and create the CSV files that we will use for our mock ML pipeline.

The dataset used is taken from Kaggle: https://www.kaggle.com/datasets/rounakbanik/pokemon.

We load it and deliberately introduce some data issues so that we can implement cleaning steps in the ML pipeline.

## 1. Imports

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import json
import os

In [2]:
np.random.seed(100)

In [3]:
data_path = Path.cwd().parent / "data"
dataset_path = data_path / "datasets" / "raw"
credentials_path = data_path / "credentials" / "kaggle.json"

## 2. Load Kaggle data

In [4]:
# Read the Kaggle API token
with open(credentials_path, "r") as file:
    credentials = json.load(file)

# Expose the credentials to the Kaggle client through environment variables
os.environ["KAGGLE_USERNAME"] = credentials["username"]
os.environ["KAGGLE_KEY"] = credentials["key"]
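
The credentials file is expected to contain a "username" and a "key" entry. As a minimal sketch, a small helper could check for both before they are used. The function name `load_kaggle_credentials` and the placeholder values are assumptions made for illustration, not part of the notebook.

```python
import json
import tempfile
from pathlib import Path


def load_kaggle_credentials(path):
    """Hypothetical helper: read kaggle.json and check the two expected keys."""
    with open(path, "r") as file:
        credentials = json.load(file)
    missing = {"username", "key"} - credentials.keys()
    if missing:
        raise KeyError(f"Missing keys in {path}: {sorted(missing)}")
    return credentials


# Demo on a throwaway file with placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "kaggle.json"
    demo_path.write_text(json.dumps({"username": "demo-user", "key": "demo-key"}))
    demo_credentials = load_kaggle_credentials(demo_path)
```

Failing early with a clear message can be easier to debug than an authentication error raised later by the client.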

In [5]:
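# Initialize client and authenticate.
# The import is done here rather than in the imports section because the kaggle
# package attempts to authenticate on import, so the environment variables
# must already be set.
from kaggle.api.kaggle_api_extended import KaggleApi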

api = KaggleApi()
api.authenticate()

In [6]:
# Download the dataset and unzip the CSV into dataset_path
api.dataset_download_files('rounakbanik/pokemon', path=dataset_path, unzip=True)

In [7]:
df_kaggle_data = pd.read_csv(dataset_path / "pokemon.csv")

## 3. Extract list of pokemon

As a first step, we create a clean list of pokemon together with their pokedex numbers. We will use this list as the source of truth for valid pokemon.

In [8]:
df_pokemon_index = df_kaggle_data[["pokedex_number", "name"]]

In [9]:
assert df_pokemon_index.pokedex_number.is_unique
assert df_pokemon_index.pokedex_number.notna().all()
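
To illustrate how such an index can act as a source of truth, here is a minimal sketch on toy data. The frames `index` and `data` and their values are made up for illustration; a left merge with `indicator=True` labels the rows whose number is missing from the index.

```python
import pandas as pd

# Toy stand-ins for the index and a data table; the values are illustrative only.
index = pd.DataFrame(
    {"pokedex_number": [1, 2, 3], "name": ["Bulbasaur", "Ivysaur", "Venusaur"]}
)
data = pd.DataFrame({"pokedex_number": [1, 3, 10002], "hp": [45.0, 80.0, 60.0]})

# A left merge keeps every data row; the "_merge" column reports whether
# the pokedex_number was found in the index ("both") or not ("left_only").
checked = data.merge(index, on="pokedex_number", how="left", indicator=True)
invalid_rows = checked[checked["_merge"] == "left_only"]
valid_rows = checked[checked["_merge"] == "both"]
```

Rows labelled `left_only` have no match in the index and could be dropped or reported by a later cleaning step.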

## 4. Add artificial NaNs

We replace roughly 10% of the cells in the table with artificial NaNs (each cell independently, with probability p), so that we have to clean the data in the preprocessing phase.

In [10]:
p = 0.1

For simplicity, we leave the index, the name and the label untouched and insert NaNs only in the other columns.

In [11]:
cols_to_modify = set(df_kaggle_data.columns) - {"name", "pokedex_number", "is_legendary"}
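
One caveat about the set difference above: Python sets do not keep the original column order, and for strings the iteration order may vary between interpreter sessions. Since a NumPy mask is applied positionally, this can change which cells receive NaNs from run to run even with a fixed seed. A possible order-preserving alternative, sketched on a toy frame with made-up columns:

```python
import pandas as pd

# Toy frame; only the column names matter here.
toy = pd.DataFrame(
    columns=["name", "pokedex_number", "hp", "attack", "is_legendary", "speed"]
)

# Columns that must stay untouched.
protected = {"name", "pokedex_number", "is_legendary"}

# A list comprehension keeps the columns in their original order.
cols_in_order = [col for col in toy.columns if col not in protected]
```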

Create a boolean mask that marks the cells to replace with NaNs.

In [12]:
mask = np.random.choice(
    [True, False],
    p=[p, 1 - p],
    size=(df_kaggle_data.shape[0], len(cols_to_modify))
)

In [13]:
df_data_with_nans = (
    df_kaggle_data[["name", "pokedex_number", "is_legendary"]]
    .join(
        df_kaggle_data[list(cols_to_modify)].mask(mask),
        how="left"
    )
    .rename(columns={"classfication": "classification"})
)
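
The masking technique can be checked on a small synthetic frame. This is a minimal sketch with made-up data (`toy` and its columns are assumptions), showing that `DataFrame.mask` turns every `True` position of the mask into NaN and that the resulting NaN fraction is close to p.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
p = 0.1

# Synthetic frame without missing values.
toy = pd.DataFrame(np.ones((1000, 5)), columns=list("abcde"))

# Each cell is marked True independently with probability p.
toy_mask = np.random.choice([True, False], p=[p, 1 - p], size=toy.shape)

# mask() replaces the cells where the condition is True with NaN.
toy_with_nans = toy.mask(toy_mask)

# Share of cells that are now missing; it should be close to p.
nan_fraction = toy_with_nans.isna().to_numpy().mean()
```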

In [14]:
df_data_with_nans

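| | name | pokedex_number | is_legendary | against_psychic | against_ground | defense | base_egg_steps | sp_attack | against_ice | base_total | ... | against_steel | capture_rate | type2 | speed | against_fairy | against_grass | generation | type1 | classification | hp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bulbasaur | 1 | 0 | 2.0 | 1.0 | 49.0 | 5120.0 | NaN | 2.0 | 318.0 | ... | 1.0 | NaN | poison | 45.0 | 0.5 | 0.25 | NaN | grass | Seed Pokémon | NaN |
| 1 | Ivysaur | 2 | 0 | 2.0 | 1.0 | 63.0 | 5120.0 | 80.0 | NaN | 405.0 | ... | 1.0 | 45 | poison | 60.0 | 0.5 | 0.25 | 1.0 | grass | Seed Pokémon | 60.0 |
| 2 | Venusaur | 3 | 0 | 2.0 | 1.0 | NaN | 5120.0 | 122.0 | NaN | 625.0 | ... | 1.0 | 45 | poison | 80.0 | 0.5 | 0.25 | NaN | grass | Seed Pokémon | 80.0 |
| 3 | Charmander | 4 | 0 | 1.0 | 2.0 | 43.0 | 5120.0 | 60.0 | 0.5 | 309.0 | ... | 0.5 | 45 | NaN | 65.0 | 0.5 | 0.50 | 1.0 | fire | Lizard Pokémon | 39.0 |
| 4 | Charmeleon | 5 | 0 | 1.0 | 2.0 | 58.0 | 5120.0 | 80.0 | 0.5 | 405.0 | ... | 0.5 | 45 | NaN | 80.0 | 0.5 | 0.50 | 1.0 | fire | Flame Pokémon | 58.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 796 | Celesteela | 797 | 1 | 0.5 | 0.0 | 103.0 | NaN | 107.0 | 1.0 | 570.0 | ... | 0.5 | 25 | flying | 61.0 | 0.5 | 0.25 | 7.0 | steel | Launch Pokémon | 97.0 |
| 797 | Kartana | 798 | 1 | 0.5 | 1.0 | 131.0 | 30720.0 | 59.0 | 1.0 | NaN | ... | 0.5 | NaN | steel | 109.0 | 0.5 | 0.25 | 7.0 | grass | Drawn Sword Pokémon | NaN |
| 798 | Guzzlord | 799 | 1 | 0.0 | 1.0 | 53.0 | 30720.0 | 97.0 | 2.0 | NaN | ... | 1.0 | 15 | dragon | 43.0 | 4.0 | 0.50 | NaN | dark | NaN | 223.0 |
| 799 | Necrozma | 800 | 1 | 0.5 | 1.0 | NaN | 30720.0 | 127.0 | 1.0 | 600.0 | ... | 1.0 | 3 | NaN | 79.0 | 1.0 | 1.00 | 7.0 | psychic | Prism Pokémon | 97.0 |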

## 5. Add duplicated rows

We append a random 15% sample of the rows to the table, so that the pipeline has to check for duplicates.

In [15]:
# ignore_index=True already produces a fresh 0..n-1 index
df_duplicated_rows = pd.concat(
    [df_data_with_nans, df_data_with_nans.sample(frac=0.15)],
    axis=0,
    ignore_index=True
)

In [16]:
assert df_data_with_nans.shape[0] < df_duplicated_rows.shape[0]
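
For reference, a minimal sketch of how a later cleaning step might detect and remove such rows, on made-up toy data. Note that `duplicated` treats NaNs in the same position as equal, so rows with artificial NaNs are still recognised as duplicates.

```python
import pandas as pd

# Toy frame with some missing values, mimicking the NaNs added earlier.
toy = pd.DataFrame(
    {
        "pokedex_number": [1, 2, 3, 4],
        "name": ["a", "b", "c", "d"],
        "hp": [45.0, None, 80.0, None],
    }
)

# Append a 50% sample of the rows, the same way the notebook appends 15%.
toy_dup = pd.concat([toy, toy.sample(frac=0.5, random_state=0)], ignore_index=True)

# duplicated() flags every repeat of an earlier row.
n_duplicates = toy_dup.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy_dup.drop_duplicates().reset_index(drop=True)
```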

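## 6. Add rows with invalid indices

We append a copy of half of the rows with pokedex_number shifted by 10000, so that these rows carry numbers that do not appear in the pokemon index.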

In [17]:
df_invalid_numbers = (
    pd.concat(
        [
            df_duplicated_rows,
            df_duplicated_rows
                .sample(frac=0.5)
                .assign(pokedex_number=lambda df: df.pokedex_number + 10000)
        ],
        axis=0,
        ignore_index=True
    )
    .reset_index(drop=True)
)

In [18]:
assert df_duplicated_rows.shape[0] < df_invalid_numbers.shape[0]
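
The effect of the offset can be sketched on toy data (the frame `valid` and its values are illustrative assumptions): every shifted number lies above the largest valid number, so a membership test against the valid numbers finds exactly the appended rows.

```python
import pandas as pd

# Toy list of valid numbers.
valid = pd.DataFrame({"pokedex_number": [1, 2, 3, 4]})

# Shift a 50% sample by 10000, as in the cell above.
shifted = valid.sample(frac=0.5, random_state=0).assign(
    pokedex_number=lambda df: df.pokedex_number + 10000
)
combined = pd.concat([valid, shifted], ignore_index=True)

# Rows whose number is not in the valid list are the invalid ones.
is_invalid = ~combined["pokedex_number"].isin(valid["pokedex_number"])
n_invalid = is_invalid.sum()
```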

## 7. Store as CSVs

In [19]:
df_pokemon_index.to_csv(dataset_path / "raw_pokemon_index.csv", sep=";", index=False)

In [20]:
df_invalid_numbers.to_csv(dataset_path / "raw_pokemon_data.csv", sep=";", index=False)
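
Because the files are written with a semicolon separator, they have to be read back with the same `sep`. A minimal round-trip sketch on toy data, using an in-memory buffer instead of the real files:

```python
import io

import pandas as pd

toy = pd.DataFrame({"pokedex_number": [1, 2], "name": ["Bulbasaur", "Ivysaur"]})

# Write with the same options as the notebook: semicolon separator, no index column.
buffer = io.StringIO()
toy.to_csv(buffer, sep=";", index=False)

# Read back; without sep=";" each line would be parsed as a single column.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
```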