## Example: restaurant reviews


In [11]:
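import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder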

# Create a small sample dataset
X_train = pd.DataFrame(data={'review': ['Good food good service.', 
                                        'Good food with friendly service.',
                                        'Average food and bad service'],
                             'star': [5,4,2],
                             'meal_time': ['dinner','lunch','breakfast'],
                             'tip_%': [0.25, 0.18, np.nan]})
X_train.head()

|   | review                           | star | meal_time | tip_% |
|---|----------------------------------|------|-----------|-------|
| 0 | Good food good service.          | 5    | dinner    | 0.25  |
| 1 | Good food with friendly service. | 4    | lunch     | 0.18  |
| 2 | Average food and bad service     | 2    | breakfast | NaN   |


### Building a ColumnTransformer

1. Encode the categorical column `meal_time` with `OneHotEncoder`.
2. For the numeric columns `star` and `tip_%`, build a `Pipeline` that first imputes NaN values with `SimpleImputer` and then scales the result with `MinMaxScaler`.
3. Pass the remaining column (`review`) through unchanged.
4. Combine the results with `ColumnTransformer`.

In [12]:
cat_col = ['meal_time']
num_col = ['star','tip_%']

# make a pipeline that imputes missing values and then scales
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('scaler', MinMaxScaler())
])

# construct the ColumnTransformer
ColumnTransformation = ColumnTransformer(
    transformers=[
        # one-hot encode categorical cols 
        ('cat_ohe', OneHotEncoder(sparse_output = False, handle_unknown='ignore'), cat_col),
        # pipe transform numeric cols
        ('num_pipe', num_pipe, num_col)
    ],
    # pass the remaining columns through unchanged
    remainder='passthrough'
)

# fit and transform the data
ColumnTransformation.fit_transform(X_train)

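TypeError: OneHotEncoder.__init__() got an unexpected keyword argument 'sparse'

This error comes from a run in which the encoder was created with `sparse=False`. Recent scikit-learn releases only accept `sparse_output`, which is the argument the cell above uses, so the cell as written should not raise it.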

## Scikit-Learn ColumnTransformer
Taken from the [web](https://gist.github.com/iamirmasoud/03b8788e87768691103784af9767cbd8)

In [181]:
import numpy as np
from seaborn import load_dataset

# Set seed
seed = 123

# Load the dataset and keep a five-row sample
df = load_dataset('tips').drop(columns=['tip', 'sex']).sample(n=5, random_state=seed)

# Add missing values to the 'day' and 'size' columns
df.iloc[[1, 2, 4], [2, 4]] = np.nan
df


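|     | total_bill | smoker | day | time   | size |
|-----|------------|--------|-----|--------|------|
| 112 | 38.07      | No     | Sun | Dinner | 3.0  |
| 19  | 20.65      | No     | NaN | Dinner | NaN  |
| 187 | 30.46      | Yes    | NaN | Dinner | NaN  |
| 169 | 10.63      | Yes    | Sat | Dinner | 2.0  |
| 31  | 18.35      | No     | NaN | Dinner | NaN  |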


In [189]:
from sklearn.model_selection import train_test_split
# Partition data
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['total_bill']),
                                                    df['total_bill'],
                                                    test_size=.2,
                                                    random_state=seed)

# Define categorical columns
categorical = list(X_train.select_dtypes('category').columns)
print(f"Categorical columns are: {categorical}")

# Define numeric columns
numerical = list(X_train.select_dtypes('number').columns)
print(f"Numerical columns are: {numerical}")


Categorical columns are: ['smoker', 'day', 'time']
Numerical columns are: ['size']
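Instead of building the column lists by hand, scikit-learn also offers `make_column_selector`, which picks columns by dtype. The sketch below uses a small hand-made frame (an assumption, so that it runs without downloading the `tips` dataset) and the illustrative names `select_categorical` and `select_numerical`.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Small stand-in for the tips sample, with the same kinds of dtypes
X = pd.DataFrame({
    "smoker": pd.Categorical(["No", "Yes", "No"]),
    "day": pd.Categorical(["Sun", "Sat", "Sun"]),
    "size": [3.0, 2.0, 4.0],
})

# Selectors are callables that return the matching column names
select_categorical = make_column_selector(dtype_include="category")
select_numerical = make_column_selector(dtype_include="number")

categorical = select_categorical(X)
numerical = select_numerical(X)
print(f"Categorical columns are: {categorical}")
print(f"Numerical columns are: {numerical}")
```

A selector can also be passed directly to `ColumnTransformer` in place of a list of column names, in which case the columns are resolved when the transformer is fitted.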


<img src="data/columntransformer.png" width="400" />

In [190]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Define the categorical pipeline
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                     ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

# Define the numerical pipeline
num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                     ('scaler', MinMaxScaler())])

# Combine the categorical and numerical pipelines
preprocessor = ColumnTransformer(transformers=[('cat', cat_pipe, categorical),
                                               ('num', num_pipe, numerical)])

# Chain the preprocessor and the estimator, then fit on the training data
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('model', LinearRegression())])
pipe.fit(X_train, y_train)

# Predict on the training data
y_train_pred = pipe.predict(X_train)
print(f"Predictions on training data: {y_train_pred}")

# Predict on the test data
y_test_pred = pipe.predict(X_test)
print(f"Predictions on test data: {y_test_pred}")


Predictions on training data: [10.63 18.35 38.07 30.46]
Predictions on test data: [18.35]
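The categorical pipeline above relies on two settings: missing values are replaced by the constant `'missing'`, and `handle_unknown='ignore'` lets the encoder accept categories it did not see during fitting. A minimal sketch of both effects follows, using a one-column toy frame that is not part of the original data. The encoder's default sparse output is converted with `.toarray()` for printing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Toy training data with one missing value
train = pd.DataFrame({"day": ["Sat", "Sun", np.nan]}, dtype=object)
# 'Fri' never appears in the training data
new = pd.DataFrame({"day": ["Fri"]}, dtype=object)

cat_pipe.fit(train)

# The imputed constant becomes a category of its own
categories = list(cat_pipe.named_steps["encoder"].categories_[0])
print(categories)

# An unseen category is encoded as a row of zeros instead of raising
encoded_new = cat_pipe.transform(new).toarray()
print(encoded_new)
```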


<img src="data/featureunion.png" width="400" />

In [193]:
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer
class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select only specified columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


# Define the categorical pipeline
cat_pipe = Pipeline([('selector', ColumnSelector(categorical)),
                     ('imputer', SimpleImputer(
                         strategy='constant', fill_value='missing')),
                     ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

# Define the numerical pipeline
num_pipe = Pipeline([('selector', ColumnSelector(numerical)),
                     ('imputer', SimpleImputer(strategy='median')),
                     ('scaler', MinMaxScaler())])

# Combine the categorical and numerical pipelines
preprocessor = FeatureUnion(transformer_list=[('cat', cat_pipe),
                                              ('num', num_pipe)])

# Chain the preprocessor and the estimator, then fit on the training data
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('model', LinearRegression())])
pipe.fit(X_train, y_train)

# Predict on the training data
y_train_pred = pipe.predict(X_train)
print(f"Predictions on training data: {y_train_pred}")

# Predict on the test data
y_test_pred = pipe.predict(X_test)
print(f"Predictions on test data: {y_test_pred}")


Predictions on training data: [10.63 18.35 38.07 30.46]
Predictions on test data: [18.35]
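Both designs produce the same predictions, which suggests they build the same feature matrix. The sketch below checks that directly on a small hand-made frame (an assumption, so that it runs without the `tips` dataset). The helper `make_steps` is hypothetical and only avoids repeating the step definitions.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select only specified columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


def make_steps():
    """Return fresh categorical and numerical step lists (hypothetical helper)."""
    cat_steps = [("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                 ("encoder", OneHotEncoder(handle_unknown="ignore"))]
    num_steps = [("imputer", SimpleImputer(strategy="median")),
                 ("scaler", MinMaxScaler())]
    return cat_steps, num_steps


X = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "day": ["Sun", np.nan, np.nan, "Sat"],
    "size": [3.0, np.nan, np.nan, 2.0],
})
cat_cols, num_cols = ["smoker", "day"], ["size"]

# Design 1: ColumnTransformer routes the columns to each pipeline
cat_steps, num_steps = make_steps()
column_transformer = ColumnTransformer(
    transformers=[("cat", Pipeline(cat_steps), cat_cols),
                  ("num", Pipeline(num_steps), num_cols)],
    sparse_threshold=0,  # always return a dense array
)

# Design 2: FeatureUnion, where each pipeline selects its own columns
cat_steps, num_steps = make_steps()
feature_union = FeatureUnion([
    ("cat", Pipeline([("selector", ColumnSelector(cat_cols))] + cat_steps)),
    ("num", Pipeline([("selector", ColumnSelector(num_cols))] + num_steps)),
])

ct_out = column_transformer.fit_transform(X)
fu_out = feature_union.fit_transform(X)
if hasattr(fu_out, "toarray"):  # FeatureUnion may return a sparse matrix
    fu_out = fu_out.toarray()

same = np.allclose(ct_out, fu_out)
print(ct_out.shape, same)
```

The practical difference is where column selection lives: `ColumnTransformer` takes the column lists as arguments, while `FeatureUnion` needs a selector step such as `ColumnSelector` inside every branch.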


# Exercises
1. Build a `FunctionTransformer` that removes duplicate rows from a matrix.
2. Build a `StandardScalerCustom` that passes the test `check_estimator(StandardScalerCustom())`.
3. Build a transformer that removes the rows of a DataFrame whose percentage of null values exceeds a given threshold, or whose label is null.
4. Create a transformer that records the skewness and kurtosis of the columns of the matrix.
5. Create a transformer that converts the columns you specify to categorical, if they are not categorical already, imputing null values with the most frequent value.
6. Create a transformer that removes the data points that exceed a given number of standard deviations.
7. Create a transformer that drops a numeric attribute if a single value accounts for more than a given percentage of the data. Picture a numeric attribute in which 90% of the data equals -1.0.
8. Create a transformer that adds a column with a category for each unique value of a numeric attribute whose count is far above the mean count.
   $$\begin{aligned}&\frac{\text{count}}{n-\text{count}}\times \text{numuniq}>6 \\ &\text{numuniq}\times\text{count}>n+3\sqrt{\text{numuniq}\times\left(\sum_{i=1}^{\text{numuniq}} \text{count}_i^2\right)-n^2}\end{aligned}$$
9. Build a transformer that operates on a DataFrame and categorizes the columns whose number of unique values, relative to the total number of rows, falls below a threshold.
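As a starting point for exercise 8, the sketch below evaluates its two inequalities for every unique value of a column. It is not a full transformer, and the function name `frequency_criteria` and the toy data are assumptions. Dividing the second inequality by the number of unique values shows that it is equivalent to asking whether a count exceeds the mean count by more than three (population) standard deviations, which the last line checks numerically.

```python
import numpy as np


def frequency_criteria(values):
    """Evaluate both inequalities from exercise 8 for each unique value."""
    uniques, counts = np.unique(np.asarray(values), return_counts=True)
    n = counts.sum()
    numuniq = len(uniques)

    # First inequality: count / (n - count) * numuniq > 6
    with np.errstate(divide="ignore"):
        first = counts / (n - counts) * numuniq > 6

    # Second inequality:
    # numuniq * count > n + 3 * sqrt(numuniq * sum(count_i^2) - n^2)
    root = np.sqrt(numuniq * np.sum(counts ** 2) - n ** 2)
    second = numuniq * counts > n + 3 * root

    return uniques, counts, first, second


# Toy column: 90% of the values are -1, the rest are all different
values = [-1] * 90 + list(range(10))
uniques, counts, first, second = frequency_criteria(values)

for value, count, a, b in zip(uniques, counts, first, second):
    print(value, count, a, b)

# Same test as the second inequality, written with mean and standard deviation
matches_three_sigma = np.array_equal(second, counts > counts.mean() + 3 * counts.std())
```

Whether a value must satisfy both inequalities or only one of them is not stated in the exercise, so the sketch reports them separately.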
