# Integrating Preprocessing into a Pipeline (Practice)
 **Objective:** Combine several preprocessing techniques into a single, complete pipeline.

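## Instructions:

 1. Load the dataset:

    * Use the Wine dataset bundled with Scikit-learn (`load_wine`). This is the wine recognition dataset (178 samples, 13 features, 3 classes), not the UCI Wine Quality dataset.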
 2. Tasks:

    * Handle missing values.
    * Encode categorical variables.
    * Scale numeric features.
    * Integrate all transformations into a pipeline.
 

 3. Code example:

In [3]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load the Wine dataset (load_wine is the wine recognition dataset, not Wine Quality)
wine = load_wine()

In [4]:
print(type(wine))

<class 'sklearn.utils._bunch.Bunch'>


In [5]:

X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = wine.target
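
As an alternative to building the DataFrame by hand, `load_wine` accepts `as_frame=True` and returns pandas objects directly. The sketch below is only an illustration; the names `wine_frame`, `X_alt` and `y_alt` are not part of the original notebook.

```python
from sklearn.datasets import load_wine

# as_frame=True makes the Bunch hold pandas objects instead of NumPy arrays
wine_frame = load_wine(as_frame=True)
X_alt = wine_frame.data      # DataFrame with the 13 feature columns
y_alt = wine_frame.target    # Series with the class labels

print(X_alt.shape, y_alt.shape)
```

Either route should give the same 178 x 13 feature table used in the rest of the exercise.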

In [6]:
type(X)

pandas.core.frame.DataFrame

In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
dtypes: float64(13)

In [8]:
# Inject missing values for practice (.loc slices by label and is inclusive: rows 0..10, 11 values)
import numpy as np
X.loc[0:10, 'alcohol'] = np.nan
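
A quicker way to confirm the injection than reading the full `info()` listing is to count missing values per column. This is a minimal sketch that rebuilds the data under a separate, hypothetical name (`X_check`) so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
X_check = pd.DataFrame(wine.data, columns=wine.feature_names)

# Label-based slicing with .loc includes both endpoints: rows 0 through 10
X_check.loc[0:10, "alcohol"] = np.nan

# Count NaN values per column and keep only the columns that have any
missing_per_column = X_check.isna().sum()
n_missing_alcohol = int(missing_per_column["alcohol"])
print(missing_per_column[missing_per_column > 0])
```

The count for `alcohol` should match the drop from 178 to 167 non-null entries reported by `X.info()`.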

In [9]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       167 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
dtypes: float64(13)

In [10]:
# Define the numeric transformations: impute missing values with the column mean, then standardize
numeric_features = X.select_dtypes(include=['float64', 'int']).columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
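
To see what the two numeric steps do, it may help to run the same imputer and scaler on a tiny column. The array `toy` and the name `toy_transformer` below are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One column with a missing value in the middle
toy = np.array([[1.0], [np.nan], [3.0]])

toy_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # NaN becomes the mean of 1 and 3
    ("scaler", StandardScaler()),                 # then subtract the mean, divide by the std
])

toy_scaled = toy_transformer.fit_transform(toy)
print(toy_scaled.ravel())
```

Because the imputed value equals the column mean, it lands at exactly 0 after standardization, while the two observed values end up symmetric around it.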

In [11]:
# This dataset has no categorical columns, so add a dummy one for practice.
# Note: it is derived from the target y, so it should not be used as a feature in a real model.
X['quality'] = np.where(y > 1, 'high', 'low')
categorical_features = ['quality']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [13]:
# Combine the transformations: each transformer is applied only to its own columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Wrap the preprocessor in a pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit the pipeline and apply the preprocessing
X_transformed = pipeline.fit_transform(X)
# Convert the resulting NumPy array into a DataFrame with the generated column names
X_transformed = pd.DataFrame(X_transformed, columns=pipeline.get_feature_names_out())

print("Preprocessing complete. Transformed data ready for modeling.")

Preprocessing complete. Transformed data ready for modeling.


In [14]:
X_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 15 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   num__alcohol                       178 non-null    float64
 1   num__malic_acid                    178 non-null    float64
 2   num__ash                           178 non-null    float64
 3   num__alcalinity_of_ash             178 non-null    float64
 4   num__magnesium                     178 non-null    float64
 5   num__total_phenols                 178 non-null    float64
 6   num__flavanoids                    178 non-null    float64
 7   num__nonflavanoid_phenols          178 non-null    float64
 8   num__proanthocyanins               178 non-null    float64
 9   num__color_intensity               178 non-null    float64
 10  num__hue                           178 non-null    float64
 11  num__od280/od315_of_diluted_wines  178 non-null    float64
