This notebook builds a model that predicts whether a Pokémon is truly legendary. We rely on each Pokémon's stats and other parameters to check whether the legendary Pokémon deserve that status on stats alone, and whether any non-legendary Pokémon could qualify for the category.

In [15]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import sklearn
import os

In [16]:
df = pd.read_csv('pokemon.csv')
df.shape

(801, 41)

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   abilities          801 non-null    object 
 1   against_bug        801 non-null    float64
 2   against_dark       801 non-null    float64
 3   against_dragon     801 non-null    float64
 4   against_electric   801 non-null    float64
 5   against_fairy      801 non-null    float64
 6   against_fight      801 non-null    float64
 7   against_fire       801 non-null    float64
 8   against_flying     801 non-null    float64
 9   against_ghost      801 non-null    float64
 10  against_grass      801 non-null    float64
 11  against_ground     801 non-null    float64
 12  against_ice        801 non-null    float64
 13  against_normal     801 non-null    float64
 14  against_poison     801 non-null    float64
 15  against_psychic    801 non-null    float64
 16  against_rock       801 non

In [18]:
df.describe()

Unnamed: 0,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,against_grass,...,height_m,hp,percentage_male,pokedex_number,sp_attack,sp_defense,speed,weight_kg,generation,is_legendary
count,801.0,801.0,801.0,801.0,801.0,801.0,801.0,801.0,801.0,801.0,...,781.0,801.0,703.0,801.0,801.0,801.0,801.0,781.0,801.0,801.0
mean,0.996255,1.057116,0.968789,1.07397,1.068976,1.065543,1.135456,1.192884,0.985019,1.03402,...,1.163892,68.958801,55.155761,401.0,71.305868,70.911361,66.334582,61.378105,3.690387,0.087391
std,0.597248,0.438142,0.353058,0.654962,0.522167,0.717251,0.691853,0.604488,0.558256,0.788896,...,1.080326,26.576015,20.261623,231.373075,32.353826,27.942501,28.907662,109.354766,1.93042,0.282583
min,0.25,0.25,0.0,0.0,0.25,0.0,0.25,0.25,0.0,0.25,...,0.1,1.0,0.0,1.0,10.0,20.0,5.0,0.1,1.0,0.0
25%,0.5,1.0,1.0,0.5,1.0,0.5,0.5,1.0,1.0,0.5,...,0.6,50.0,50.0,201.0,45.0,50.0,45.0,9.0,2.0,0.0
50%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,65.0,50.0,401.0,65.0,66.0,65.0,27.3,4.0,0.0
75%,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,...,1.5,80.0,50.0,601.0,91.0,90.0,85.0,64.8,5.0,0.0
max,4.0,4.0,2.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,...,14.5,255.0,100.0,801.0,194.0,230.0,180.0,999.9,7.0,1.0


First, we check which columns contain NaNs and decide how to handle them:

In [54]:
col_num = list(df.select_dtypes(exclude=['object']).columns)
col_text = list(df.select_dtypes(include=['object']).columns)
col_nans = df.columns[df.isnull().any()]

print("Numeric columns:", len(col_num))
print("Text columns:", len(col_text))
print("Columns with NaNs:", len(col_nans))

Numeric columns: 34
Text columns: 7
Columns with NaNs: 4


In [55]:
df[col_nans].head()

Unnamed: 0,height_m,percentage_male,type2,weight_kg
0,0.7,88.1,poison,6.9
1,1.0,88.1,poison,13.0
2,2.0,88.1,poison,100.0
3,0.6,88.1,,8.5
4,1.1,88.1,,19.0


In [57]:
print("Number of null values per column:")
df[col_nans].isnull().sum()

Number of null values per column:


height_m            20
percentage_male     98
type2              384
weight_kg           20
dtype: int64
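
The counts above are easier to compare as a share of the 801 rows: 384 missing values in type2 is roughly 47.9% of the data, 98 in percentage_male is about 12.2%, and 20 in height_m or weight_kg is about 2.5%. A minimal sketch of such a summary follows. The helper name `missing_summary` and the toy frame are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return NaN counts and percentages for the columns that contain NaNs."""
    n_missing = frame.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(frame)).round(1),
    })
    # Keep only columns that actually contain NaNs, most affected first.
    affected = summary[summary["n_missing"] > 0]
    return affected.sort_values("n_missing", ascending=False)


# Toy data (hypothetical values) standing in for the real frame.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "type2": ["poison", None, None, "flying"],
    "height_m": [0.7, np.nan, 1.0, 2.0],
})
print(missing_summary(toy))
```

On the notebook's data the call would be `missing_summary(df)`.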

The largest number of NaNs is in type2. Let's use real data to explore why there are so many. Take the best-known example, Pikachu: we know it is an Electric type with no secondary type.

In [62]:
df[df.name == 'Pikachu'][["name","type1","type2"]]

Unnamed: 0,name,type1,type2
24,Pikachu,electric,


type2 is indeed NaN for Pikachu. What happens if we look up a Pokémon that has a secondary type, for example Charizard?

In [63]:
df[df.name == 'Charizard'][["name","type1","type2"]]

Unnamed: 0,name,type1,type2
5,Charizard,fire,flying


It seems that a NaN in type2 marks a Pokémon without a secondary type. To fix this, we map the NaNs to the type "None".

In [73]:
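# Assign the result back; an in-place fillna on a selected column may not update df in recent pandas
df["type2"] = df["type2"].fillna("None")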
df[df.name == 'Pikachu'][["name","type1","type2"]]

Unnamed: 0,name,type1,type2
24,Pikachu,electric,None


Now Pikachu has both types explicitly marked.
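
The replacement can also be sketched on a toy frame. The rows below are a hypothetical three-Pokémon subset typed in by hand, not read from pokemon.csv. The sketch assigns the filled column back rather than using `inplace=True`, since recent pandas versions may warn about in-place calls on a selected column. Note that the placeholder is the string "None", not Python's `None`.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical subset; not read from pokemon.csv).
toy = pd.DataFrame({
    "name": ["Pikachu", "Charizard", "Bulbasaur"],
    "type1": ["electric", "fire", "grass"],
    "type2": [np.nan, "flying", "poison"],
})

# Replace missing secondary types with the placeholder string "None".
toy["type2"] = toy["type2"].fillna("None")

# The placeholder is now an ordinary category that value_counts() includes.
type2_counts = toy["type2"].value_counts()
print(toy)
print(type2_counts)
```

Because `value_counts()` skips NaN by default, the fill is what lets single-type Pokémon appear as their own category when type2 is counted.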

In [74]:
legendary_by_generation = df[df['is_legendary'] == 1].groupby('generation').size().reset_index(name='count')

fig = px.bar(legendary_by_generation, x='generation', y='count', title='Number of Legendary Pokémon per Generation')
fig.show()

In [46]:
# rename_axis makes the category column 'type1' regardless of pandas version
top_types_primary = df['type1'].value_counts().rename_axis('type1').reset_index(name='count')

fig = px.bar(top_types_primary, x='type1', y='count', title='Primary Type Distribution')
fig.update_xaxes(title_text='Primary Type')
fig.update_yaxes(title_text='Number of Pokémon')
fig.show()


In [49]:
# rename_axis makes the category column 'type2' regardless of pandas version
top_types_secundary = df['type2'].value_counts().rename_axis('type2').reset_index(name='count')

fig = px.bar(top_types_secundary, x='type2', y='count', title='Secondary Type Distribution')
fig.update_xaxes(title_text='Secondary Type')
fig.update_yaxes(title_text='Number of Pokémon')
fig.show()
