# Feature Engineering and Data Imputation

In this notebook we practice **feature engineering** and review **data imputation**.

We use the `Houses.csv` dataset, which contains information about houses and their sale price (`SalePrice`).

Objectives:

1. Identify missing values and apply imputation techniques with `SimpleImputer`.
2. Create new variables derived from existing columns with `MathFeatures` from `feature_engine`.
3. Transform and prepare numeric and categorical data for future models.

Key libraries:
```python
from sklearn.impute import SimpleImputer
from feature_engine.creation import MathFeatures
```

In [1]:
pip install feature_engine


In [2]:
import pandas as pd
import numpy as np

In [3]:
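# Load the dataset
df = pd.read_csv("Houses.csv")

# Show the first rows
df.head()

|  | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |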


## 1. Initial inspection of the dataset

We first look at the structure of the dataset, its data types, and its missing values.

In [4]:
# Size of the dataset
print("Rows and columns:", df.shape)

# Data types
print("\nData types:\n", df.dtypes.head())

# General summary
df.info()

# Basic statistics
df.describe().T.head(10)

Rows and columns: (1460, 81)

Data types:
 Id               int64
MSSubClass       int64
MSZoning        object
LotFrontage    float64
LotArea          int64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Conditio

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Id | 1460.0 | 730.5 | 421.610009 | 1.0 | 365.75 | 730.5 | 1095.25 | 1460.0 |
| MSSubClass | 1460.0 | 56.89726 | 42.300571 | 20.0 | 20.0 | 50.0 | 70.0 | 190.0 |
| LotFrontage | 1201.0 | 70.049958 | 24.284752 | 21.0 | 59.0 | 69.0 | 80.0 | 313.0 |
| LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.5 | 9478.5 | 11601.5 | 215245.0 |
| OverallQual | 1460.0 | 6.099315 | 1.382997 | 1.0 | 5.0 | 6.0 | 7.0 | 10.0 |
| OverallCond | 1460.0 | 5.575342 | 1.112799 | 1.0 | 5.0 | 5.0 | 6.0 | 9.0 |
| YearBuilt | 1460.0 | 1971.267808 | 30.202904 | 1872.0 | 1954.0 | 1973.0 | 2000.0 | 2010.0 |
| YearRemodAdd | 1460.0 | 1984.865753 | 20.645407 | 1950.0 | 1967.0 | 1994.0 | 2004.0 | 2010.0 |
| MasVnrArea | 1452.0 | 103.685262 | 181.066207 | 0.0 | 0.0 | 0.0 | 166.0 | 1600.0 |
| BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.0 | 383.5 | 712.25 | 5644.0 |


## 2. Identifying missing values

We look for the columns that contain null values and compute the percentage of rows affected in each.

In [5]:
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
missing_percent = (missing / len(df)) * 100
pd.DataFrame({'Missing values': missing, '%': missing_percent}).head(10)

|  | Missing values | % |
|---|---|---|
| PoolQC | 1453 | 99.520548 |
| MiscFeature | 1406 | 96.30137 |
| Alley | 1369 | 93.767123 |
| Fence | 1179 | 80.753425 |
| MasVnrType | 872 | 59.726027 |
| FireplaceQu | 690 | 47.260274 |
| LotFrontage | 259 | 17.739726 |
| GarageType | 81 | 5.547945 |
| GarageYrBlt | 81 | 5.547945 |
| GarageFinish | 81 | 5.547945 |
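
Several columns in the table above are missing in more than half of the rows, so it can help to separate them from columns with only scattered gaps before choosing an imputation strategy. The following is a minimal sketch on a toy DataFrame; the values and the 50% cutoff are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for Houses.csv (hypothetical values).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Fraction of missing values per column (mean of a boolean mask).
missing_frac = toy.isnull().mean()

threshold = 0.5  # assumed cutoff; tune it for the problem at hand

# Columns that are mostly empty versus columns with a few gaps.
mostly_missing = missing_frac[missing_frac > threshold].index.tolist()
partly_missing = missing_frac[
    (missing_frac > 0) & (missing_frac <= threshold)
].index.tolist()

print("Mostly missing:", mostly_missing)
print("Partly missing:", partly_missing)
```

Columns in the first group may deserve a dedicated treatment (dropping them, or an explicit "missing" label), while the second group is a more natural fit for median or mode imputation.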


## 3. Imputing numeric data

We use `SimpleImputer` from `sklearn` to fill missing values in the numeric columns with the median of each column.

In [6]:
from sklearn.impute import SimpleImputer

# Select the numeric columns
num_cols = df.select_dtypes(include=np.number).columns

# Median imputer
imputer_num = SimpleImputer(strategy='median')

df[num_cols] = imputer_num.fit_transform(df[num_cols])

# Check that no numeric nulls remain
df[num_cols].isnull().sum().sum()

np.int64(0)
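
The cell above fits the imputer on the whole dataset. If the data is later split for modeling, a common precaution is to learn the median from the training rows only and reuse it on the test rows. A minimal sketch with made-up `LotFrontage` values (the `train` and `test` frames are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test splits of a single numeric column.
train = pd.DataFrame({"LotFrontage": [60.0, 70.0, np.nan, 80.0, 313.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)  # the median is learned from the training rows only

train_filled = imputer.transform(train)
test_filled = imputer.transform(test)  # reuses the training median

print("Learned median:", imputer.statistics_[0])
print("Mean of observed training values:", train["LotFrontage"].mean())
```

In this toy column the outlier 313.0 pulls the mean well above the median, which is one reason the median is often preferred for skewed variables.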

## 4. Imputing categorical data

For categorical columns we use the mode (the most frequent value).

In [7]:
cat_cols = df.select_dtypes(exclude=np.number).columns

imputer_cat = SimpleImputer(strategy='most_frequent')

df[cat_cols] = imputer_cat.fit_transform(df[cat_cols])

# Confirm that no categorical nulls remain
df[cat_cols].isnull().sum().sum()

np.int64(0)
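
Mode imputation copies the most frequent observed value into every gap. For a column such as `PoolQC`, where only 7 of 1460 rows are observed, that value would be learned from very few rows. One alternative, shown here only as a sketch and not used in the notebook, is `SimpleImputer(strategy="constant")` with an explicit label; the label `"Missing"` and the toy `Fence` values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column with several gaps (hypothetical values).
toy = pd.DataFrame(
    {"Fence": ["MnPrv", np.nan, np.nan, np.nan, "MnPrv", "GdWo"]},
    dtype=object,
)

# Strategy used in the notebook: fill with the most frequent value.
mode_imputer = SimpleImputer(strategy="most_frequent")
by_mode = mode_imputer.fit_transform(toy)

# Alternative: mark the gaps with an explicit placeholder label.
const_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
by_const = const_imputer.fit_transform(toy)

print("Mode:    ", list(by_mode[:, 0]))
print("Constant:", list(by_const[:, 0]))
```

The constant label keeps the fact that a value was absent visible to a later model, at the cost of adding one more category.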

## 5. Creating features with `MathFeatures`

We use `feature_engine.creation.MathFeatures` to build `TotalArea` and plain pandas arithmetic for the other two variables:

- `TotalArea = 1stFlrSF + 2ndFlrSF + GrLivArea`
- `BathRoomsTotal = FullBath + HalfBath + BsmtFullBath + BsmtHalfBath`
- `AgeHouse = YrSold - YearBuilt`

Note: in the Ames housing data, which this file appears to follow, `GrLivArea` (above-ground living area) already includes both floor areas, so `TotalArea` as defined here counts them twice. The definition is kept as an exercise in combining columns.

In [9]:
from feature_engine.creation import MathFeatures

# Columns to combine into a single feature
features = ['1stFlrSF', '2ndFlrSF', 'GrLivArea']

math_feature = MathFeatures(variables=features, func='sum', new_variables_names=['TotalArea'])
df = math_feature.fit_transform(df)

# New variable computed manually
df['BathRoomsTotal'] = df['FullBath'] + df['HalfBath'] + df['BsmtFullBath'] + df['BsmtHalfBath']

# Age of the house at the time of sale
df['AgeHouse'] = df['YrSold'] - df['YearBuilt']

df[['TotalArea', 'BathRoomsTotal', 'AgeHouse']].head()


|  | TotalArea | BathRoomsTotal | AgeHouse |
|---|---|---|---|
| 0 | 3420.0 | 4.0 | 5.0 |
| 1 | 2524.0 | 3.0 | 31.0 |
| 2 | 3572.0 | 4.0 | 7.0 |
| 3 | 3434.0 | 2.0 | 91.0 |
| 4 | 4396.0 | 4.0 | 8.0 |
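
With `func='sum'`, `MathFeatures` computes a row-wise sum over the listed columns. The sketch below reproduces that calculation in plain pandas on a toy frame (hypothetical values), which is a quick way to sanity-check a derived feature without depending on `feature_engine`.

```python
import pandas as pd

# Toy rows with the same column names as the notebook (hypothetical values).
toy = pd.DataFrame({
    "1stFlrSF": [800, 1200],
    "2ndFlrSF": [700, 0],
    "GrLivArea": [1500, 1200],
})

features = ["1stFlrSF", "2ndFlrSF", "GrLivArea"]

# Row-wise sum over the selected columns (axis=1 aggregates across columns).
toy["TotalArea"] = toy[features].sum(axis=1)

# Sum of the two floors alone, for comparison with GrLivArea.
toy["FloorsOnly"] = toy["1stFlrSF"] + toy["2ndFlrSF"]

print(toy)
```

Comparing a column like `FloorsOnly` against `GrLivArea` on the real data would show whether the three inputs overlap before they are added together.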


## 6. Scaling (optional)

We can standardize the newly created variables when their range differs greatly from that of other features.

In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['TotalArea', 'AgeHouse']] = scaler.fit_transform(df[['TotalArea', 'AgeHouse']])

df[['TotalArea', 'AgeHouse']].describe()

|  | TotalArea | AgeHouse |
|---|---|---|
| count | 1460.0 | 1460.0 |
| mean | -1.545187e-16 | 5.110068e-17 |
| std | 1.000343 | 1.000343 |
| min | -2.255226 | -1.208604 |
| 25% | -0.7396757 | -0.9440523 |
| 50% | -0.1043691 | -0.05118902 |
| 75% | 0.5055826 | 0.5771222 |
| max | 7.902025 | 3.288781 |
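
The reported `std` is 1.000343 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The two differ by a factor of sqrt(n / (n - 1)). A small sketch with synthetic data of the same length (the random values are an assumption, only the row count matches the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 1460  # same number of rows as the dataset
rng = np.random.default_rng(0)
values = pd.DataFrame({"x": rng.normal(size=n)})  # synthetic column

scaled = StandardScaler().fit_transform(values)[:, 0]

pop_std = scaled.std(ddof=0)     # the quantity StandardScaler normalizes to 1
sample_std = scaled.std(ddof=1)  # the quantity describe() reports
expected = np.sqrt(n / (n - 1))  # ratio between the two definitions

print(round(pop_std, 6), round(sample_std, 6), round(expected, 6))
```

The mean values in the table, on the order of 1e-16, are floating-point rounding around zero.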


## 7. Summary

In this exercise we:

- Imputed missing numeric values (median) and categorical values (mode).
- Created new variables (`TotalArea`, `BathRoomsTotal`, `AgeHouse`).
- Prepared the dataset for later modeling.

**Feature engineering** is one of the most important stages of preprocessing, because it turns raw data into information a model can use.
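
The steps summarized above could also be chained into a single scikit-learn preprocessing object, so the same transformations can be refit and reapplied consistently. This is only a sketch under assumptions: the toy frame, its values, and the choice of which columns to scale are hypothetical and do not come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame with one gap in a numeric and one in a categorical column.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
    "GarageType": pd.Series(["Attchd", np.nan, "Attchd", "Detchd"], dtype=object),
})

num_cols = ["LotFrontage", "LotArea"]
cat_cols = ["GarageType"]

# Numeric branch: median imputation followed by standardization.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply each branch to its own group of columns.
preprocess = ColumnTransformer([
    ("num", numeric_steps, num_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), cat_cols),
])

# Output columns follow the order of the transformers: numeric, then categorical.
prepared = pd.DataFrame(
    preprocess.fit_transform(toy), columns=num_cols + cat_cols
)
print(prepared)
```

Keeping the steps in one object makes it straightforward to call `fit` on training data and `transform` on new data with the statistics learned during fitting.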