# Encoding, Scaling, and Transformations

In this notebook we will learn to apply **data preprocessing** techniques, which are essential before modeling:

1. Encoding categorical variables (Label Encoding and One-Hot Encoding)
2. Scaling numeric variables (StandardScaler and MinMaxScaler)
3. Mathematical transformations (logarithm and square root)

We will use the `athlete_events.csv` dataset, which contains information about Olympic athletes.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the dataset
df = pd.read_csv("athlete_events.csv")

# Show the first rows
df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


## 1. Initial inspection of the dataset

In [3]:
print("Rows and columns:", df.shape)
print("\nData types:\n", df.dtypes)
print("\nMissing values:\n", df.isnull().sum())

Rows and columns: (271116, 15)

Data types:
 ID          int64
Name       object
Sex        object
Age       float64
Height    float64
Weight    float64
Team       object
NOC        object
Games      object
Year        int64
Season     object
City       object
Sport      object
Event      object
Medal      object
dtype: object

Missing values:
 ID             0
Name           0
Sex            0
Age         9474
Height     60171
Weight     62875
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64
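
Raw null counts are hard to compare across columns, so it can help to express them as a share of rows. From the counts above, `Medal` is missing in roughly 85% of rows and `Height` in about 22%. The sketch below shows one way to compute that share; `toy` is a hypothetical stand-in for `df`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df
toy = pd.DataFrame({
    "Age": [24.0, 23.0, np.nan, 34.0],
    "Height": [180.0, np.nan, np.nan, 185.0],
    "Team": ["China", "China", "Denmark", "Netherlands"],
})

# isnull().mean() is the fraction of missing values per column;
# multiply by 100 for a percentage and sort with the worst column first
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

On the real dataset the same expression would be `df.isnull().mean() * 100`.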


## 2. Basic cleaning and imputation

We will drop the columns we do not need and fill the missing values in `Height` and `Weight` with each column's median.
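
To see what median imputation does before running it on the full dataset, here is a minimal sketch on a made-up matrix (the values are hypothetical). `SimpleImputer` learns one median per column during `fit` and stores it in `statistics_`; the outlier in the first column illustrates why the median can be a safer fill value than the mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix: two columns, one missing value in each
X = np.array([
    [170.0, 60.0],
    [180.0, np.nan],
    [np.nan, 80.0],
    [250.0, 70.0],   # an outlier in the first column
])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # per-column medians learned during fit
print(X_filled)             # NaNs replaced by those medians
```

Here the first column's median is 180, while its mean would be 200 because of the outlier.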

In [None]:
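from sklearn.impute import SimpleImputer

# Keep only the columns used in the rest of the notebook; .copy() avoids
# pandas' SettingWithCopyWarning on the assignment below
df = df[['Name','Sex','Age','Height','Weight','Team','NOC','Year','Sport','Medal']].copy()

# Impute missing values in Height and Weight with each column's median
imputer = SimpleImputer(strategy='median')
df[['Height','Weight']] = imputer.fit_transform(df[['Height','Weight']])

df.head()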

Unnamed: 0,Name,Sex,Age,Height,Weight,Team,NOC,Year,Sport,Medal
0,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992,Basketball,
1,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012,Judo,
2,Gunnar Nielsen Aaby,M,24.0,175.0,70.0,Denmark,DEN,1920,Football,
3,Edgar Lindenau Aabye,M,34.0,175.0,70.0,Denmark/Sweden,DEN,1900,Tug-Of-War,Gold
4,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988,Speed Skating,


## 3. Encoding categorical variables

We will use two encoders:
- `LabelEncoder`: maps each category to an integer.
- `OneHotEncoder`: creates one binary column per category.

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Label encoding for Sex and Medal (refitting `le` discards the Sex mapping,
# so keep one encoder per column if you need inverse_transform later)
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# Assumption: a missing Medal means the athlete won no medal, so missing
# values get an explicit label ('No Medal') before encoding
df['Medal'] = le.fit_transform(df['Medal'].fillna('No Medal'))

# One-hot encoding for Team
ct = ColumnTransformer(
    transformers=[('team', OneHotEncoder(handle_unknown='ignore'), ['Team'])],
    remainder='drop'
)

team_encoded = ct.fit_transform(df)
team_encoded_df = pd.DataFrame(
    team_encoded.toarray(),
    columns=ct.named_transformers_['team'].get_feature_names_out(['Team'])
)

# Concatenate the results
df_encoded = pd.concat([df.reset_index(drop=True), team_encoded_df], axis=1)

df_encoded.head()

Unnamed: 0,Name,Sex,Age,Height,Weight,Team,NOC,Year,Sport,Medal,...,Team_Ylliam VII,Team_Ylliam VIII,Team_Yugoslavia,Team_Yugoslavia-1,Team_Yugoslavia-2,Team_Zambia,Team_Zefyros,Team_Zimbabwe,Team_Zut,Team_rn-2
0,A Dijiang,1,24.0,180.0,80.0,China,CHN,1992,Basketball,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,A Lamusi,1,23.0,170.0,60.0,China,CHN,2012,Judo,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Gunnar Nielsen Aaby,1,24.0,175.0,70.0,Denmark,DEN,1920,Football,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Edgar Lindenau Aabye,1,34.0,175.0,70.0,Denmark/Sweden,DEN,1900,Tug-Of-War,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Christine Jacoba Aaftink,0,21.0,185.0,82.0,Netherlands,NED,1988,Speed Skating,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 4. Scaling numeric variables

We will apply two techniques:
- **StandardScaler:** centers each column at mean 0 and scales it to unit variance.
- **MinMaxScaler:** rescales each column to the range [0, 1].
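
Both scalers apply a simple per-column formula. The sketch below reproduces each one by hand on a hypothetical column of heights and checks that the result matches scikit-learn's output.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy column of heights in cm
x = np.array([[160.0], [170.0], [180.0], [190.0]])

x_std = StandardScaler().fit_transform(x)
x_mm = MinMaxScaler().fit_transform(x)

# StandardScaler: z = (x - mean) / std, using the population std (ddof=0)
manual_std = (x - x.mean(axis=0)) / x.std(axis=0)

# MinMaxScaler: x' = (x - min) / (max - min)
manual_mm = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(np.allclose(x_std, manual_std), np.allclose(x_mm, manual_mm))
```

Because min-max scaling depends only on the two extreme values, a single outlier can compress the rest of the column into a narrow band; standardization is less sensitive to that, though not immune.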

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler_std = StandardScaler()
scaler_mm = MinMaxScaler()

scaled_std = scaler_std.fit_transform(df_encoded[['Height','Weight','Age']])
scaled_mm = scaler_mm.fit_transform(df_encoded[['Height','Weight','Age']])

scaled_df = pd.DataFrame(scaled_std, columns=['Height_std','Weight_std','Age_std'])
scaled_df_mm = pd.DataFrame(scaled_mm, columns=['Height_mm','Weight_mm','Age_mm'])

df_scaled = pd.concat([df_encoded, scaled_df, scaled_df_mm], axis=1)
df_scaled.head()

Unnamed: 0,Name,Sex,Age,Height,Weight,Team,NOC,Year,Sport,Medal,...,Team_Zefyros,Team_Zimbabwe,Team_Zut,Team_rn-2,Height_std,Weight_std,Age_std,Height_mm,Weight_mm,Age_mm
0,A Dijiang,1,24.0,180.0,80.0,China,CHN,1992,Basketball,2,...,0.0,0.0,0.0,0.0,0.51042,0.752137,-0.243511,0.535354,0.291005,0.16092
1,A Lamusi,1,23.0,170.0,60.0,China,CHN,2012,Judo,2,...,0.0,0.0,0.0,0.0,-0.567265,-0.837921,-0.399918,0.434343,0.185185,0.149425
2,Gunnar Nielsen Aaby,1,24.0,175.0,70.0,Denmark,DEN,1920,Football,2,...,0.0,0.0,0.0,0.0,-0.028423,-0.042892,-0.243511,0.484848,0.238095,0.16092
3,Edgar Lindenau Aabye,1,34.0,175.0,70.0,Denmark/Sweden,DEN,1900,Tug-Of-War,1,...,0.0,0.0,0.0,0.0,-0.028423,-0.042892,1.320566,0.484848,0.238095,0.275862
4,Christine Jacoba Aaftink,0,21.0,185.0,82.0,Netherlands,NED,1988,Speed Skating,2,...,0.0,0.0,0.0,0.0,1.049262,0.911143,-0.712734,0.585859,0.301587,0.126437


## 5. Mathematical transformations

We will apply a logarithmic transformation to `Weight` and a square-root transformation to `Age` to reduce skew.
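
To illustrate how these transformations affect skew, the sketch below measures skewness before and after on a synthetic right-skewed sample. The data is a hypothetical log-normal draw, not the Olympic dataset, so it says nothing about how skewed `Weight` or `Age` actually are.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (log-normal), not the Olympic data
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=5000))

# Series.skew() returns the sample skewness; 0 means symmetric
skew_raw = values.skew()
skew_sqrt = np.sqrt(values).skew()
skew_log = np.log1p(values).skew()

print(f"raw: {skew_raw:.2f}  sqrt: {skew_sqrt:.2f}  log1p: {skew_log:.2f}")
```

The square root is the milder of the two transformations and the logarithm the stronger, so checking `.skew()` on the real columns first is a reasonable way to choose between them.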

In [None]:
# log1p computes log(1 + x), which stays defined when x is 0
df_scaled['Log_Weight'] = np.log1p(df_scaled['Weight'])
df_scaled['Sqrt_Age'] = np.sqrt(df_scaled['Age'])
df_scaled[['Weight','Log_Weight','Age','Sqrt_Age']].head()

Unnamed: 0,Weight,Log_Weight,Age,Sqrt_Age
0,80.0,4.394449,24.0,4.898979
1,60.0,4.110874,23.0,4.795832
2,70.0,4.26268,24.0,4.898979
3,70.0,4.26268,34.0,5.830952
4,82.0,4.418841,21.0,4.582576


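## 6. Summary

In this exercise we learned to:
- Impute missing values.
- Encode categorical variables (label and one-hot encoding).
- Scale and transform numeric variables.

The dataset is now largely ready for analysis or supervised modeling. Note that `Age` still contains the missing values found in the initial inspection, since only `Height` and `Weight` were imputed.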
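
As a possible next step, the imputation, scaling, and one-hot encoding shown in this notebook can be combined into a single `ColumnTransformer`, so the same preprocessing is fitted once and reused on new data. This is a minimal sketch on a hypothetical toy frame; the names `toy`, `numeric`, and `preprocess` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    "Height": [180.0, np.nan, 170.0, 185.0],
    "Weight": [80.0, 60.0, np.nan, 82.0],
    "Team": ["China", "China", "Denmark", "Netherlands"],
})

# Numeric columns: median imputation followed by standardization
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply the numeric pipeline and the one-hot encoder to their own columns
preprocess = ColumnTransformer([
    ("num", numeric, ["Height", "Weight"]),
    ("team", OneHotEncoder(handle_unknown="ignore"), ["Team"]),
])

X = preprocess.fit_transform(toy)
# The result may be sparse or dense depending on how many zeros it contains
X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
print(X.shape)
```

Fitting the transformer on training data only and calling `transform` on test data keeps the medians, means, and category lists from leaking information across the split.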