<img style="float: left;" src='Figures/iteso.jpg' width="100" height="200"/>

# <center> <font color= #000047> Practice: Encoding Categorical Variables </font> </center>

### Variables in the Dataset
> **Rooms:** Number of rooms

> **Price:** Price in dollars

> **Method:** S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

> **Type:** br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

> **SellerG:** Real Estate Agent

> **Date:** Date sold

> **Distance:** Distance from CBD

> **Regionname:** General Region (West, North West, North, North east …etc)

> **Propertycount:** Number of properties that exist in the suburb.

> **Bedroom2:** Scraped number of bedrooms (from a different source)

> **Bathroom:** Number of Bathrooms

> **Car:** Number of carspots

> **Landsize:** Land Size

> **BuildingArea:** Building Size

> **CouncilArea:** Governing council for the area

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the dataset
data = pd.read_csv('melb_data.csv')

In [2]:
# Separate the predictor variables (X) from the target (y)
y = data.Price
X = data.drop(['Price'], axis=1)


In [3]:
# Split the data into training and test sets
X_train_full, X_test_full, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

### Clean the dataset

In [4]:
# Drop every column that contains any NaN (the simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
# Reassign instead of using inplace=True, which triggers pandas' SettingWithCopyWarning here
X_train_full = X_train_full.drop(cols_with_missing, axis=1)
X_test_full = X_test_full.drop(cols_with_missing, axis=1)


In [5]:
# Select the categorical columns with low cardinality (the threshold of 10 is arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                        X_train_full[cname].dtype == "object"]

# Select the numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

In [6]:
# Keep only the selected columns
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

In [7]:
X_train.head(5)

| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 0.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 193.0 | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 555.0 | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 265.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 673.0 | -37.7623 | 144.8272 | 4217.0 |


In [8]:
# Get the list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
object_cols

['Type', 'Method', 'Regionname']

**We define a function, score_codification(), to compare different approaches to handling categorical variables. It trains a RandomForest model on the training set and returns the mean absolute error (MAE) of its predictions on the test set. The lower the MAE, the better.**

In [9]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_codification(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)
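
As a reminder of what the metric measures, MAE is the average of $|y_i - \hat{y}_i|$ over the test rows. The sketch below checks this by hand against `mean_absolute_error`. The prices are made-up values for illustration only and do not come from `melb_data.csv`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices, for illustration only (not taken from the dataset)
y_true = np.array([500_000, 750_000, 1_000_000])
y_pred = np.array([450_000, 800_000, 1_150_000])

# MAE computed by hand: mean of the absolute errors
mae_manual = np.mean(np.abs(y_true - y_pred))

# Same quantity computed by scikit-learn
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Because MAE is expressed in the units of the target, the values returned by `score_codification` can be read directly as an average error in dollars.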

### 1.- Drop the categorical variables from $X\_train$ and $X\_test$ and compute the MAE with the $score\_codification$ function


In [None]:
# Drop the categorical variables
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_test = X_test.select_dtypes(exclude=['object'])
mae_drop = score_codification(drop_X_train, drop_X_test, y_train, y_test)
print(f"MAE after dropping categorical variables: {mae_drop}")

### 2.- Apply $One-Hot$ encoding to the categorical variables and compute the MAE with the $score\_codification$ function


In [None]:
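from sklearn.preprocessing import OneHotEncoder

# One-hot encoding. `sparse_output` requires scikit-learn >= 1.2;
# older releases call this parameter `sparse`.
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test[object_cols]))

# Index alignment: the encoder returns plain arrays, so restore the original index
OH_cols_train.index = X_train.index
OH_cols_test.index = X_test.index

# Use string column names; recent scikit-learn versions reject mixed int/str names
OH_cols_train.columns = OH_encoder.get_feature_names_out(object_cols)
OH_cols_test.columns = OH_encoder.get_feature_names_out(object_cols)

# Remove categorical columns and add one-hot columns
num_X_train = X_train.drop(object_cols, axis=1)
num_X_test = X_test.drop(object_cols, axis=1)

OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)

mae_oh = score_codification(OH_X_train, OH_X_test, y_train, y_test)
print(f"MAE with One-Hot Encoding: {mae_oh}")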

### 3.- Apply $count$ or $frequency$ encoding to the categorical variables and compute the MAE with the $score\_codification$ function

**In which cases can this type of encoding be useful?**

In [None]:
# Count encoding: replace each category with its number of occurrences in the training set
X_train_count = X_train.copy()
X_test_count = X_test.copy()

for col in object_cols:
    freq = X_train_count[col].value_counts()
    X_train_count[col] = X_train_count[col].map(freq)
    # Categories never seen in training get a count of 0 instead of NaN
    X_test_count[col] = X_test_count[col].map(freq).fillna(0)

mae_count = score_codification(X_train_count, X_test_count, y_train, y_test)
print(f"MAE with count encoding: {mae_count}")
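
As a possible starting point for the question above: count or frequency encoding is often considered when a column has many categories (one-hot would add one column per category) and when how common a category is may itself carry signal. One known limitation is that two categories with the same count become indistinguishable. The sketch below contrasts the two variants on a toy frame; the data and the `Type_count` / `Type_freq` column names are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration
train = pd.DataFrame({"Type": ["h", "h", "h", "u", "u", "t"]})
test = pd.DataFrame({"Type": ["h", "t", "dev site"]})  # "dev site" never appears in train

# Absolute counts and relative frequencies, both learned from the training set only
counts = train["Type"].value_counts()
freqs = train["Type"].value_counts(normalize=True)

# Map the learned statistics onto the test set; unseen categories become 0
test_encoded = test.assign(
    Type_count=test["Type"].map(counts).fillna(0),
    Type_freq=test["Type"].map(freqs).fillna(0.0),
)
print(test_encoded)
```

Frequencies are simply counts divided by the number of training rows, so a tree-based model such as the RandomForest used here should treat both variants almost identically; the distinction matters more for scale-sensitive models.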

### 4.- Mean-based encoding

Encode a categorical variable using the mean of the target (`Price`) for each category. Evaluate the MAE.

**What risks does this method carry? How can overfitting be avoided?**

In [None]:
# Target/mean encoding for the first categorical variable
X_train_mean = X_train.copy()
X_test_mean = X_test.copy()

if object_cols:
    col = object_cols[0]
    means = X_train_mean.join(y_train).groupby(col)['Price'].mean()
    X_train_mean[col] = X_train_mean[col].map(means)
    # Categories never seen in training fall back to the global training mean
    X_test_mean[col] = X_test_mean[col].map(means).fillna(y_train.mean())

    # One-hot encode the remaining categorical variables
    other_obj_cols = object_cols[1:]
    OH_X_train3 = X_train_mean.drop(other_obj_cols, axis=1)
    OH_X_test3 = X_test_mean.drop(other_obj_cols, axis=1)

    if other_obj_cols:
        # `sparse_output` requires scikit-learn >= 1.2 (formerly `sparse`)
        OH_encoder3 = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        OH_cols_train3 = pd.DataFrame(OH_encoder3.fit_transform(X_train_mean[other_obj_cols]),
                                      index=X_train_mean.index)
        OH_cols_test3 = pd.DataFrame(OH_encoder3.transform(X_test_mean[other_obj_cols]),
                                     index=X_test_mean.index)
        # String column names avoid mixed int/str feature names
        OH_cols_train3.columns = OH_encoder3.get_feature_names_out(other_obj_cols)
        OH_cols_test3.columns = OH_cols_train3.columns

        OH_X_train3 = pd.concat([OH_X_train3, OH_cols_train3], axis=1)
        OH_X_test3 = pd.concat([OH_X_test3, OH_cols_test3], axis=1)

    mae_mean = score_codification(OH_X_train3, OH_X_test3, y_train, y_test)
    print(f"MAE with Target/Mean Encoding for '{col}': {mae_mean}")
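
One way to approach the overfitting question: with plain mean encoding, each training row's own `Price` contributes to the value it is encoded with, so the model sees a leaked version of the target, and rare categories get noisy means. Two common mitigations are shrinking each category mean toward the global mean and computing the encoding out of fold. The sketch below is an illustration, not the notebook's method; the helper names `smoothed_means` and `out_of_fold_encode`, the smoothing weight `m`, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def smoothed_means(categories, target, m=10):
    """Per-category target mean, shrunk toward the global mean.

    m is a hypothetical smoothing weight: larger values pull rare
    categories closer to the global mean.
    """
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    return (stats["sum"] + m * global_mean) / (stats["count"] + m)


def out_of_fold_encode(categories, target, n_splits=5, m=10, seed=0):
    """Encode each training row with means computed without that row's fold."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(categories):
        means = smoothed_means(categories.iloc[fit_idx], target.iloc[fit_idx], m)
        encoded.iloc[enc_idx] = categories.iloc[enc_idx].map(means).to_numpy()
    # Categories absent from a fold's fitting part fall back to the global mean
    return encoded.fillna(target.mean())


# Toy data, made up for illustration
toy = pd.DataFrame({
    "Type": ["h", "h", "h", "h", "u", "u", "u", "t", "t", "t"],
    "Price": [900, 1100, 1000, 1000, 600, 500, 700, 800, 900, 700],
})
toy["Type_te"] = out_of_fold_encode(toy["Type"], toy["Price"], n_splits=5, m=2)
print(toy)
```

Under this scheme the test set would be encoded with `smoothed_means` fitted once on the full training set, so no test prices are ever used. Both `m` and `n_splits` are tuning choices rather than fixed values.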