# Pipelines

A Pipeline is an object that chains several preprocessing steps and a final estimator. For example:

* Initial transformations (e.g. missing-value imputation, feature scaling, feature selection).
* Final model (e.g. a linear regression, a random forest classifier, an SVM).

With a Pipeline, these steps are combined into a single object that is trained and evaluated as a unit.

* Advantages:
    * Every transformation is applied in exactly the same way during training and prediction.
    * The risk of data leakage is reduced, because each transformer is fitted only on the training data.
    * The code is simpler and plugs directly into hyperparameter search (e.g. GridSearchCV or RandomizedSearchCV) and cross-validation.
    
- Plain Python functions vs. a Pipeline:
    * With plain functions you must manage the train/test split by hand to avoid data leakage; a Pipeline fits each step on the training data only.
    * A Pipeline exploits polymorphism: scikit-learn preprocessors inherit from a common class, TransformerMixin, and therefore share the same methods: fit, transform and fit_transform.
    * Exporting to production is easier, because a single object contains all the preprocessing and the model.
    * Steps can be composed with very little code.
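
To illustrate the leakage point, here is a minimal sketch. The synthetic data from `make_regression` and the variable names are assumptions for illustration, not part of the original notebook. Because the scaler lives inside the pipeline, `cross_val_score` refits it on the training split of every fold, so validation rows never influence the scaling statistics.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data (illustrative assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# The scaler is part of the estimator, so each CV fold fits it on its own training split
leak_free = make_pipeline(MinMaxScaler(), LinearRegression())
scores = cross_val_score(leak_free, X_demo, y_demo, cv=5, scoring='r2')
print(scores.mean())
```

Scaling the full dataset first and cross-validating afterwards would let every fold see statistics computed from its own validation rows.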


* Goal:
    * Build a pipeline that contains the preprocessing and the model, then export it. When it is loaded in another environment it can serve predictions directly, without cleaning or preprocessing the data first, because the pipeline does that itself.
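
One possible way to export a fitted pipeline is `joblib`, sketched below. The synthetic data, the file name `pipeline.joblib` and the temporary directory are assumptions for illustration; the notebook does not say which serialization tool it intends to use.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic data with some missing values (illustrative assumption)
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
y_raw = X_raw @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)
X_raw[::10, 1] = np.nan

export_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', LinearRegression()),
])
export_pipeline.fit(X_raw, y_raw)

# Persist the whole pipeline (preprocessing + model) as a single file
model_path = os.path.join(tempfile.mkdtemp(), 'pipeline.joblib')  # hypothetical path
joblib.dump(export_pipeline, model_path)

# In another environment: load it and predict on raw, uncleaned input
loaded = joblib.load(model_path)
X_new_raw = np.array([[0.5, np.nan, -1.0]])
print(loaded.predict(X_new_raw))
```

The loading environment should use the same scikit-learn version that produced the file, since serialized estimators are not guaranteed to be compatible across versions.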

* Scope: pipelines are designed to transform X, i.e. the input data, through a sequence of steps.
    * When the pipeline's fit and predict methods run, transformations are applied only to X, never to y.
    * If y needs to be modified, do it before training the pipeline.
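
The sketch below shows two ways to handle a transformed target. The first transforms y by hand before fitting, as described above. The second uses `sklearn.compose.TransformedTargetRegressor`, which the original notes do not mention; it is offered here only as one alternative. The synthetic data and variable names are assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, strictly positive target (illustrative assumption)
rng = np.random.default_rng(42)
X_feat = rng.uniform(0, 1, size=(200, 2))
y_target = np.exp(1.0 + 2.0 * X_feat[:, 0] + rng.normal(scale=0.05, size=200))

# Option 1: transform y by hand before fitting, and undo it after predicting
manual = make_pipeline(MinMaxScaler(), LinearRegression())
manual.fit(X_feat, np.log(y_target))
pred_manual = np.exp(manual.predict(X_feat))

# Option 2: the wrapper applies func to y in fit and inverse_func in predict
wrapped = TransformedTargetRegressor(
    regressor=make_pipeline(MinMaxScaler(), LinearRegression()),
    func=np.log,
    inverse_func=np.exp,
)
wrapped.fit(X_feat, y_target)
pred_wrapped = wrapped.predict(X_feat)
```

Both options yield the same predictions here; the wrapper simply keeps the target transformation attached to the model object.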


scikit-learn classes and functions:

* Pipeline:
    * Chains a sequence of transformers and a final estimator.
* make_pipeline:
    * Helper function that creates a Pipeline without naming each step manually; step names are derived from the lowercase class names.

* ColumnTransformer:
    * Applies different transformations to specific subsets (for example, columns) of a dataset. Useful for tabular data with variables of different types.
    * It receives the complete dataset, but each transformer is given the specific columns it operates on.
    * For example, combining MinMaxScaler with OneHotEncoder.

* FeatureUnion:
    * Single input for all: each transformer in the union is applied to the same complete input matrix, and the outputs are concatenated column-wise. For example, combining PCA and SelectKBest.
* make_union:
    * Helper function that creates a FeatureUnion automatically, analogous to make_pipeline.
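
The notebook has no FeatureUnion cell, so here is a minimal sketch of the PCA plus SelectKBest combination mentioned above. The breast cancer dataset bundled with scikit-learn and the values `n_components=2` and `k=3` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_union

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Both transformers receive the same full matrix; their outputs are stacked side by side
union = make_union(PCA(n_components=2), SelectKBest(f_classif, k=3))
X_union = union.fit_transform(X_bc, y_bc)

print(X_bc.shape, '->', X_union.shape)
print([name for name, _ in union.transformer_list])
```

The result has 2 PCA components followed by the 3 selected original features, so 5 columns in total.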

In [16]:
from sklearn.pipeline import Pipeline, make_pipeline
import seaborn as sns 
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, FunctionTransformer, OneHotEncoder
from sklearn.metrics import r2_score
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_breast_cancer


In [2]:
df = sns.load_dataset('penguins')
df = df.dropna(subset=['body_mass_g'])  # drop rows with a missing target 'y', since it is the variable to predict

X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = df['body_mass_g']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [3]:
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', LinearRegression())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(pipeline.named_steps)
print(pipeline.named_steps['imputer'])
print(pipeline.named_steps['model'])



{'imputer': SimpleImputer(strategy='median'), 'model': LinearRegression()}
SimpleImputer(strategy='median')
LinearRegression()


In [4]:
# example prediction
X_new = pd.DataFrame([[39.1, np.nan, 181.0]], columns=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'])
pipeline.predict(X_new)

array([3209.64419227])

In [5]:
# alternative with make_pipeline

pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    LinearRegression()
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(pipeline.named_steps)
print(pipeline.named_steps['simpleimputer'])
print(pipeline.named_steps['linearregression'])


{'simpleimputer': SimpleImputer(strategy='median'), 'linearregression': LinearRegression()}
SimpleImputer(strategy='median')
LinearRegression()


## Pipeline with GridSearchCV

In [6]:
pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('transformer', PowerTransformer()),
    ('scaler', MinMaxScaler()),
    ('model', KNeighborsRegressor())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

params = {
    'imputer__strategy': ['mean', 'median'],
    'transformer__method': ['yeo-johnson', 'box-cox'],
    'scaler__feature_range': [(0, 1), (0, 2)],
    'model__n_neighbors': np.arange(3,20)
}

grid = GridSearchCV(pipeline, params, scoring='r2')
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
print('r2_score',r2_score(y_test, y_pred))
print('grid best params', grid.best_params_)




r2_score 0.8151453148627383
grid best params {'imputer__strategy': 'mean', 'model__n_neighbors': np.int64(11), 'scaler__feature_range': (0, 1), 'transformer__method': 'yeo-johnson'}


In [7]:
X_new = pd.DataFrame([[39.1, np.nan, 181.0]], columns=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'])
grid.predict(X_new)

array([3461.36363636])

In [8]:
pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('transformer', PowerTransformer()),  # optional in this example: the grid may replace it with None
    ('scaler', MinMaxScaler()),  # optional in this example: the grid may replace it with None
    ('model', KNeighborsRegressor())
])
params = {
    'imputer__strategy': ['mean', 'median'],
    'transformer': [None, PowerTransformer(method='yeo-johnson'), PowerTransformer(method='box-cox')],
    'scaler': [None, MinMaxScaler(feature_range=(0, 1)), MinMaxScaler(feature_range=(0, 2))],
    'model__n_neighbors': np.arange(3, 20)
}
grid = GridSearchCV(pipeline, params, scoring='r2')
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
print('r2_score:', r2_score(y_test, y_pred))
print('grid best params:', grid.best_params_)

r2_score: 0.8253040480659294
grid best params: {'imputer__strategy': 'mean', 'model__n_neighbors': np.int64(18), 'scaler': MinMaxScaler(), 'transformer': None}
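
As a hedged variant of the optional-step search above, the string `'passthrough'` can be used in place of `None` to skip a step. The sketch below uses synthetic data and a deliberately small grid; both are assumptions for illustration and do not reproduce the penguins results.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic regression data (illustrative assumption)
X_syn, y_syn = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)

search_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', KNeighborsRegressor()),
])
search_params = {
    # the whole step is a hyperparameter: skip it, or try two different scalers
    'scaler': ['passthrough', MinMaxScaler(), StandardScaler()],
    'model__n_neighbors': [3, 5, 7],
}
search = GridSearchCV(search_pipeline, search_params, scoring='r2', cv=3)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_estimator_.named_steps)
```

`best_estimator_` is the pipeline refitted on all the training data with the winning combination, so it can be used or exported directly.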


In [9]:
# Trying several models
pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('transformer', PowerTransformer()),
    ('scaler', MinMaxScaler()),
    ('model', 'placeholder')  # replaced by each candidate model during the search
])
params = [
    # KNN 
    {
        'imputer__strategy': ['mean', 'median'],
        'transformer__method': ['yeo-johnson','box-cox'],
        'scaler__feature_range': [(0, 1), (0, 2)],
        'model': [KNeighborsRegressor()],
        'model__n_neighbors': np.arange(3, 20)
    },
    # Decision Tree
    {
        'imputer__strategy': ['mean', 'median'],
        'transformer__method': ['yeo-johnson','box-cox'],
        'scaler__feature_range': [(0, 1), (0, 2)],
        'model': [DecisionTreeRegressor()],
        'model__max_depth': [None, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    }
]

grid = GridSearchCV(pipeline, params, scoring='r2', n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
print('r2_score:', r2_score(y_test, y_pred))
print('grid best params:', grid.best_params_)
print('grid results:', pd.DataFrame(grid.cv_results_))

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
r2_score: 0.8151453148627383
grid best params: {'imputer__strategy': 'mean', 'model': KNeighborsRegressor(), 'model__n_neighbors': np.int64(11), 'scaler__feature_range': (0, 1), 'transformer__method': 'yeo-johnson'}
grid results:      mean_fit_time  std_fit_time  ...  std_test_score  rank_test_score
0         0.030225      0.007710  ...        0.021990              129
1         0.042640      0.005279  ...        0.020942              133
2         0.020129      0.002274  ...        0.021990              129
3         0.031061      0.003364  ...        0.020942              133
4         0.019217      0.002212  ...        0.020605              125
..             ...           ...  ...             ...              ...
211       0.038671      0.009988  ...        0.049546              197
212       0.022728      0.001952  ...        0.058265              207
213       0.034355      0.008129  ...        0.044089              

## FunctionTransformer

FunctionTransformer wraps a custom function so that it can be used as a step in the pipeline.

In [10]:
def log_transform(X):
    return np.log(X)


pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('log', FunctionTransformer(log_transform)),  # np.log requires strictly positive values
    ('scaler', MinMaxScaler()),
    ('model', KNeighborsRegressor())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print('r2_score:', r2_score(y_test, y_pred))


r2_score: 0.8134498329740103
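
`np.log` returns `-inf` for zeros and `nan` for negative values. A common alternative, sketched here as an assumption rather than as the notebook's method, is `np.log1p` paired with `np.expm1` as the inverse, so the step can also be undone with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x) is defined at x = 0; expm1 is its exact inverse
log_step = FunctionTransformer(np.log1p, inverse_func=np.expm1, check_inverse=True)

values = np.array([[0.0, 1.0], [9.0, 99.0]])
logged = log_step.fit_transform(values)
restored = log_step.inverse_transform(logged)

print(logged)
print(restored)
```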


## ColumnTransformer

Split and recombine pipelines to apply different treatments to different columns.

In [11]:
df = sns.load_dataset('penguins')
df = df.dropna(subset=['body_mass_g'])  # drop rows with a missing target 'y', since it is the variable to predict

X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = df['body_mass_g']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [12]:
# pipeline for numerical columns
numerical_cols = X_train.select_dtypes(include=[np.number]).columns
pipeline_numerical = Pipeline([
    ('imputer', KNNImputer()),
    ('scaler', MinMaxScaler())
])
# pipeline for categorical columns
# (with the X defined above every column is numeric, so categorical_cols is empty and this branch is skipped)
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns
pipeline_categorical = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False))
])
# join both pipelines with the ColumnTransformer
pipeline_all = ColumnTransformer([
    ('numerical', pipeline_numerical, numerical_cols),
    ('categorical', pipeline_categorical, categorical_cols)])
# final pipeline with the model
pipeline = make_pipeline(
    pipeline_all,
    KNeighborsRegressor(n_neighbors=7)
)
pipeline

In [13]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(r2_score(y_test, y_pred))

0.8212291991117089


In [14]:
# remainder='drop' (the default)
from sklearn.preprocessing import StandardScaler

# Include the categorical columns so that every column referenced below exists in X
X = df[['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'sex']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

pipeline = ColumnTransformer([
        ('numeric', StandardScaler(), ['bill_length_mm', 'bill_depth_mm']),
        ('categorical', OneHotEncoder(sparse_output=False), ['species', 'island']),
    ], remainder='drop'
)

# 'flipper_length_mm' and 'sex' are dropped and never processed
pd.DataFrame(pipeline.fit_transform(X_train, y_train)).head()

In [19]:
# remainder='passthrough'
from sklearn.preprocessing import StandardScaler

# Include the categorical columns so that every column referenced below exists in X
X = df[['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'sex']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

pipeline = ColumnTransformer([
        ('numeric', StandardScaler(), ['bill_length_mm', 'bill_depth_mm']),
        ('categorical', OneHotEncoder(sparse_output=False), ['species', 'island']),
    ], remainder='passthrough'
)

# 'flipper_length_mm' and 'sex' are kept but not processed; they are simply appended to the final result
pd.DataFrame(pipeline.fit_transform(X_train, y_train)).head()

In [20]:
from sklearn.preprocessing import StandardScaler
# remainder with a preprocessor:
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = df['body_mass_g']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)


pipeline_numerical1 = Pipeline([
    ('imputer', KNNImputer(n_neighbors=7)),
    ('scaler', StandardScaler())
])

pipeline_numerical2 = Pipeline([
    ('imputer', KNNImputer(n_neighbors=7)),
    ('scaler', MinMaxScaler())
])

pipeline = ColumnTransformer([
        ('numeric', pipeline_numerical1, ['bill_length_mm', 'bill_depth_mm']),
    ], remainder=pipeline_numerical2
)
pipeline

In [None]:
# StandardScaler is applied to 'bill_length_mm' and 'bill_depth_mm'; MinMaxScaler is applied to 'flipper_length_mm'
pd.DataFrame(pipeline.fit_transform(X_train, y_train)).head()

## Custom transformer

Custom preprocessing transformers can be written as a Python class that inherits from BaseEstimator and TransformerMixin.

In [21]:
# Custom transformer that prints the data so it can be inspected after each pipeline step
class Debugger(BaseEstimator, TransformerMixin):

    def __init__(self, title, show_shape=True):
        self.title = title
        self.show_shape = show_shape

    def fit(self, X, y=None):
        # this is normally where parameters are learned or computed from the input data
        print(f'Ejecutando Debugger.fit {self.title}')
        if self.show_shape:
            print(f'Shape de X:  {X.shape}')
            print(f'X sample: {X[:1]}')  # show one row
        return self  # return the Debugger instance so it can be chained in the Pipeline

    def transform(self, X):
        X_copy = X.copy()
        # transformations on X_copy would go here
        print(f'Ejecutando Debugger.transform {self.title}')
        if self.show_shape:
            print(f'Shape de X:  {X_copy.shape}')
            print(f'X sample: {X_copy[:1]}')  # show one row
        return X_copy

    def fit_transform(self, X, y=None):
        # overridden so that the data is printed only once while fitting; X passes through unchanged
        self.fit(X, y)
        return X

In [22]:
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = df['body_mass_g']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

pipeline = Pipeline([
    ('debug1', Debugger(title='Datos X sin procesar')),

    ('imputer', SimpleImputer()),
    ('debug2', Debugger(title='Datos X tras SimpleImputer')),

    ('transformer', PowerTransformer()),
    ('debug3', Debugger(title='Datos X tras PowerTransformer')),

    ('scaler', MinMaxScaler()),
    ('debug4', Debugger(title='Datos X tras MinMaxScaler')),

    ('model', KNeighborsRegressor())
])
pipeline.fit(X_train, y_train)

In [23]:
pipeline.predict(X_test)

## Custom transformer for outliers

In [24]:
#class OutlierRemover(BaseEstimator, TransformerMixin):
    
    # def __init__(self, factor=1.5):
    #     self.factor = factor  # factor used to compute the lower and upper thresholds (Tukey's method)
        
    # def fit(self, X, y=None):
    #     if not isinstance(X, pd.DataFrame):
    #         X = pd.DataFrame(X)    
    #     self.numerical_cols_ = X.select_dtypes(include=[np.number]).columns
    #     Q1 = X[self.numerical_cols_].quantile(0.25)
    #     Q3 = X[self.numerical_cols_].quantile(0.75)
    #     IQR = Q3 - Q1
    #     self.lower_bound_ = Q1 - self.factor * IQR
    #     self.upper_bound_ = Q3 + self.factor * IQR
    #     return self
    
    # def transform(self, X):
    #     X_copy = X.copy()
    #     filtro = ((X_copy[self.numerical_cols_] < self.lower_bound_)| (X_copy[self.numerical_cols_] > self.upper_bound_)).any(axis=1)
    #     return X_copy[~filtro]  # keep only the rows that contain no outliers
        
        

In [25]:
# class that removes outlier rows; a variant could simply replace the outliers with NaN instead
class OutlierRemover(BaseEstimator, TransformerMixin):
    
    def __init__(self, factor=1.5):
        self.factor = factor 
        
    def fit(self, X, y=None):
       
        Q1 = np.percentile(X, 25, axis=0)
        Q3 = np.percentile(X, 75, axis=0)
        IQR = Q3 - Q1
        
        # compute the lower and upper bounds
        self.lower_bound_ = Q1 - self.factor * IQR
        self.upper_bound_ = Q3 + self.factor * IQR
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        #filtro = ~((X_copy < self.lower_bound_) | (X_copy > self.upper_bound_)).any(axis=1)
        filtro = np.all((X_copy >= self.lower_bound_) & (X_copy <= self.upper_bound_), axis=1)
        return X_copy[filtro]

In [26]:
remover = OutlierRemover(factor=0.4)
remover.fit_transform(X_train).shape

(187, 3)

In [27]:
df = sns.load_dataset('penguins')
df = df.dropna(subset=['body_mass_g']) 

X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = df['body_mass_g']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

pipeline = make_pipeline(
    SimpleImputer(),
    Debugger(title='X tras SimpleImputer'),
    OutlierRemover(factor=0.4),
    Debugger(title='X tras Outlier'),
    PowerTransformer(),
    Debugger(title='X tras PowerTransformer'),
    MinMaxScaler(),
    Debugger(title='X tras MinMaxScaler'),
    KNeighborsRegressor()
)
pipeline.fit(X_train, y_train)

Ejecutando Debugger.fit X tras SimpleImputer
Shape de X:  (273, 3)
X sample: [[ 42.7  18.3 196. ]]
Ejecutando Debugger.fit X tras Outlier
Shape de X:  (187, 3)
X sample: [[ 42.7  18.3 196. ]]
Ejecutando Debugger.fit X tras PowerTransformer
Shape de X:  (187, 3)
X sample: [[-0.24245763  0.66723679 -0.22030696]]
Ejecutando Debugger.fit X tras MinMaxScaler
Shape de X:  (187, 3)
X sample: [[0.41653698 0.66619004 0.47450903]]


ValueError: Found input variables with inconsistent numbers of samples: [187, 273]

The fit fails because OutlierRemover.transform drops rows from X (273 to 187) while y keeps all 273 samples. A scikit-learn Pipeline transforms only X, so a transformer used inside it must not change the number of rows.
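
The comment on OutlierRemover suggests a variant that replaces outliers with NaN instead of removing rows. A minimal sketch of that idea follows. The class name `OutlierToNaN`, the synthetic data and the choice of a median imputer afterwards are assumptions for illustration. Because the number of rows never changes, X and y stay aligned and the pipeline can be fitted.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


class OutlierToNaN(BaseEstimator, TransformerMixin):
    """Hypothetical variant of OutlierRemover: masks outliers as NaN and keeps every row."""

    def __init__(self, factor=1.5):
        self.factor = factor

    def fit(self, X, y=None):
        X_arr = np.asarray(X, dtype=float)
        q1 = np.nanpercentile(X_arr, 25, axis=0)
        q3 = np.nanpercentile(X_arr, 75, axis=0)
        iqr = q3 - q1
        self.lower_bound_ = q1 - self.factor * iqr
        self.upper_bound_ = q3 + self.factor * iqr
        return self

    def transform(self, X):
        X_arr = np.asarray(X, dtype=float).copy()
        outside = (X_arr < self.lower_bound_) | (X_arr > self.upper_bound_)
        X_arr[outside] = np.nan  # mask the values; the imputer that follows fills them in
        return X_arr


# Synthetic data with one obvious outlier (illustrative assumption)
rng = np.random.default_rng(0)
X_out = rng.normal(size=(100, 3))
X_out[0, 0] = 50.0
y_out = rng.normal(size=100)

robust_pipeline = make_pipeline(
    OutlierToNaN(factor=1.5),
    SimpleImputer(strategy='median'),  # must come after the masking step
    MinMaxScaler(),
    KNeighborsRegressor(),
)
robust_pipeline.fit(X_out, y_out)
masked = robust_pipeline[0].transform(X_out)
print(masked.shape, np.isnan(masked).sum())
```

If rows really have to be removed, one option is to filter X and y together before calling `fit`, outside the pipeline.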

## Custom transformer for creating new features

Example that creates a new column in the titanic dataset:

family_size = sibsp + parch + 1

In [28]:
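df = sns.load_dataset('titanic')
# 'alive' is the target; note that 'survived' encodes the same information,
# so it must also be dropped from X before a model is added, to avoid data leakage
X = df.drop('alive', axis=1)
y = df['alive']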

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [30]:
class FamilySizeFeature(BaseEstimator, TransformerMixin):
    def transform(self, X, y=None):
        X_copy = X.copy()
        X_copy['family_size'] = X_copy['sibsp'] +  X_copy['parch']+1
        return X_copy
    
    def fit(self, X, y=None, **fit_params):
        return self

In [31]:
pipeline = make_pipeline(
    FamilySizeFeature(),
    # add more steps, for example a ColumnTransformer with one pipeline for numerical columns and another for categorical ones
    # add a model
    
)
pipeline.fit_transform(X_train, y_train)

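| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alone | family_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 331 | 0 | 1 | male | 45.5 | 0 | 0 | 28.5000 | S | First | man | True | C | Southampton | True | 1 |
| 733 | 0 | 2 | male | 23.0 | 0 | 0 | 13.0000 | S | Second | man | True | | Southampton | True | 1 |
| 382 | 0 | 3 | male | 32.0 | 0 | 0 | 7.9250 | S | Third | man | True | | Southampton | True | 1 |
| 704 | 0 | 3 | male | 26.0 | 1 | 0 | 7.8542 | S | Third | man | True | | Southampton | False | 2 |
| 813 | 0 | 3 | female | 6.0 | 4 | 2 | 31.2750 | S | Third | child | False | | Southampton | False | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 106 | 1 | 3 | female | 21.0 | 0 | 0 | 7.6500 | S | Third | woman | False | | Southampton | True | 1 |
| 270 | 0 | 1 | male | | 0 | 0 | 31.0000 | S | First | man | True | | Southampton | True | 1 |
| 860 | 0 | 3 | male | 41.0 | 2 | 0 | 14.1083 | S | Third | man | True | | Southampton | False | 3 |
| 435 | 1 | 1 | female | 14.0 | 1 | 2 | 120.0000 | S | First | child | False | B | Southampton | False | 4 |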


In [32]:
pipeline.transform(X_test)



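| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alone | family_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 709 | 1 | 3 | male | | 1 | 1 | 15.2458 | C | Third | man | True | | Cherbourg | False | 3 |
| 439 | 0 | 2 | male | 31.0 | 0 | 0 | 10.5000 | S | Second | man | True | | Southampton | True | 1 |
| 840 | 0 | 3 | male | 20.0 | 0 | 0 | 7.9250 | S | Third | man | True | | Southampton | True | 1 |
| 720 | 1 | 2 | female | 6.0 | 0 | 1 | 33.0000 | S | Second | child | False | | Southampton | False | 2 |
| 39 | 1 | 3 | female | 14.0 | 1 | 0 | 11.2417 | C | Third | child | False | | Cherbourg | False | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 433 | 0 | 3 | male | 17.0 | 0 | 0 | 7.1250 | S | Third | man | True | | Southampton | True | 1 |
| 773 | 0 | 3 | male | | 0 | 0 | 7.2250 | C | Third | man | True | | Cherbourg | True | 1 |
| 25 | 1 | 3 | female | 38.0 | 1 | 5 | 31.3875 | S | Third | woman | False | | Southampton | False | 7 |
| 84 | 1 | 2 | female | 17.0 | 0 | 0 | 10.5000 | S | Second | woman | False | | Southampton | True | 1 |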
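
The comments in the pipeline cell mention adding a ColumnTransformer and a model after the feature step. A minimal sketch of that completed pipeline is shown below. The synthetic passenger table stands in for the titanic data so the example is self-contained; the column subset, the `LogisticRegression` model and the toy target rule are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class FamilySizeFeature(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_copy = X.copy()
        X_copy['family_size'] = X_copy['sibsp'] + X_copy['parch'] + 1
        return X_copy


# Synthetic stand-in for the titanic columns (illustrative assumption)
rng = np.random.default_rng(1)
n = 120
passengers = pd.DataFrame({
    'sibsp': rng.integers(0, 4, n),
    'parch': rng.integers(0, 3, n),
    'age': rng.uniform(1, 70, n),
    'fare': rng.uniform(5, 100, n),
    'sex': rng.choice(['male', 'female'], n),
})
passengers.loc[::15, 'age'] = np.nan
alive = np.where(passengers['sex'] == 'female', 'yes', 'no')  # toy target rule

# 'family_size' only exists after the first pipeline step has run
numeric_cols = ['age', 'fare', 'family_size']
preprocess = ColumnTransformer([
    ('numerical', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), numeric_cols),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])
full_pipeline = make_pipeline(FamilySizeFeature(), preprocess, LogisticRegression(max_iter=1000))
full_pipeline.fit(passengers, alive)
print(full_pipeline.score(passengers, alive))
```

The feature step returns a copy, so the caller's DataFrame is left without the `family_size` column.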


## With feature_selection

In [39]:
data = load_breast_cancer()  # binary classification
df = pd.DataFrame(data.data, columns= data.feature_names)
X = df
y = data.target

print(df.shape)
print(df.columns.to_list())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

(569, 30)
['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']


In [40]:
pipeline = make_pipeline(
    OutlierRemover(factor=0.9),
    Debugger(title='X tras Outlier'),
    SimpleImputer(),
    Debugger(title='X tras SimpleImputer'),
    SelectKBest(f_classif, k=10),
    PowerTransformer(),
    Debugger(title='X tras PowerTransformer'),
    MinMaxScaler(),
    Debugger(title='X tras MinMaxScaler'),
    KNeighborsRegressor()
)
pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

Ejecutando Debugger.fit X tras Outlier
Shape de X:  (217, 30)
X sample:      mean radius  mean texture  ...  worst symmetry  worst fractal dimension
248        10.65         25.22  ...          0.3409                  0.08147

[1 rows x 30 columns]
Ejecutando Debugger.fit X tras SimpleImputer
Shape de X:  (217, 30)
X sample: [[1.065e+01 2.522e+01 6.801e+01 3.470e+02 9.657e-02 7.234e-02 2.379e-02
  1.615e-02 1.897e-01 6.329e-02 2.497e-01 1.493e+00 1.497e+00 1.664e+01
  7.189e-03 1.035e-02 1.081e-02 6.245e-03 2.158e-02 2.619e-03 1.225e+01
  3.519e+01 7.798e+01 4.557e+02 1.499e-01 1.398e-01 1.125e-01 6.136e-02
  3.409e-01 8.147e-02]]


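ValueError: Found input variables with inconsistent numbers of samples: [217, 455]

The fit fails because OutlierRemover drops rows from X (455 to 217) while y keeps all 455 samples, so SelectKBest receives X and y of different lengths. Note also that the target is binary, so a classifier such as KNeighborsClassifier is the appropriate final estimator here, rather than KNeighborsRegressor.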

In [38]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
pipeline = make_pipeline(
    OutlierRemover(factor=0.9),
    Debugger(title='X after OutlierRemover'),
    
    SimpleImputer(strategy='median'),
    Debugger(title='X after SimpleImputer'),
    
    SelectKBest(f_classif, k=10),
    Debugger(title='X after SelectKBest'),
        
    PowerTransformer(),
    Debugger(title='X after PowerTransformer'),
    
    MinMaxScaler(), 
    Debugger(title='X after MinMaxScaler'),
    KNeighborsClassifier(), 
)
pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

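TypeError: '<' not supported between instances of 'float' and 'str'

The recorded TypeError suggests that this cell was run while X_train still held non-numeric columns, since np.percentile cannot compare floats with strings. With the breast cancer data defined above, the pipeline would instead hit the same row-count mismatch as the previous cell, because OutlierRemover drops rows from X but not from y.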
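
For reference, a minimal sketch of a feature-selection pipeline that does fit on the breast cancer data is shown below. It deliberately leaves out OutlierRemover, the Debugger steps and PowerTransformer, so it is a simplified assumption rather than a corrected copy of the cells above; the variable names are also hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_sel, y_sel = load_breast_cancer(return_X_y=True, as_frame=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_sel, y_sel, test_size=0.20, random_state=42
)

selection_pipeline = make_pipeline(
    SelectKBest(f_classif, k=10),  # keep the 10 features with the highest ANOVA F-score
    MinMaxScaler(),
    KNeighborsClassifier(),
)
selection_pipeline.fit(Xs_train, ys_train)
accuracy = selection_pipeline.score(Xs_test, ys_test)

# Recover the names of the selected columns from the fitted selector
selector = selection_pipeline.named_steps['selectkbest']
selected = list(Xs_train.columns[selector.get_support()])
print(selected)
print(accuracy)
```

Because SelectKBest is inside the pipeline, the features are chosen from the training split only, which keeps the test score free of selection leakage.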