# INTRODUCTION (CHATGPT)

In the life cycle of a machine learning project, it is crucial to make sure that the data transformations applied during preprocessing can be replicated exactly in production. Several Python frameworks and libraries not only simplify preprocessing but also let you save and reuse the transformations, which keeps training and production inference consistent. Some of the most widely used are listed below:

## 1. scikit-learn

- *Pipelines*: scikit-learn provides the `Pipeline` class, which chains several preprocessing and modeling steps. All the transformers and the final model can be saved as a single object.
- *ColumnTransformer*: applies different transformations to different subsets of features.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
import joblib

numeric_features = ['age', 'income']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_features = ['gender', 'occupation']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Save the pipeline
joblib.dump(pipeline, 'pipeline.pkl')

# Load the pipeline
pipeline = joblib.load('pipeline.pkl')
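
The pipeline above is saved before it is ever fitted, so nothing has been learned yet. The following is a minimal sketch of the full round trip: fit first, save, reload, and check that the reloaded object reproduces the same transformation. The `train` and `new` frames and the temporary file path are made up for illustration, and `sparse_threshold=0` is set only to force a dense result that is easy to compare.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with one missing numeric value
train = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0],
    "income": [30000.0, 45000.0, 52000.0, 80000.0],
    "gender": ["F", "M", "F", "M"],
})

numeric = Pipeline([("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler())])
preprocessor = ColumnTransformer(
    [("num", numeric, ["age", "income"]),
     ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"])],
    sparse_threshold=0,  # always return a dense array
)

# Fit BEFORE saving so the learned medians, means and categories are stored
preprocessor.fit(train)

path = os.path.join(tempfile.mkdtemp(), "preprocessor.pkl")
joblib.dump(preprocessor, path)
restored = joblib.load(path)

# Hypothetical "production" row, including a category never seen in training
new = pd.DataFrame({"age": [40.0], "income": [60000.0], "gender": ["X"]})
same = np.allclose(preprocessor.transform(new), restored.transform(new))
print(same)
```

Because the fitted state travels with the pickled object, the production side only needs to call `transform`; it should never call `fit` again on incoming data.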

## 2. TensorFlow Transform (tf.Transform)

A TensorFlow library designed specifically for preprocessing that must be replicated exactly in training and in production. Transformations are defined declaratively and can be exported for use with TensorFlow Serving.

In [8]:
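import tempfile

import apache_beam as beam
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam

def preprocessing_fn(inputs):
    # Each analyzer below needs a full pass over the dataset (mean/variance, vocabulary)
    outputs = {}
    outputs['age'] = tft.scale_to_z_score(inputs['age'])
    outputs['income'] = tft.scale_to_z_score(inputs['income'])
    outputs['gender'] = tft.compute_and_apply_vocabulary(inputs['gender'])
    return outputs

raw_data = ...  # Your input data
raw_metadata = ...  # Metadata describing raw_data (placeholder; undefined in the original cell)
temp_dir = tempfile.mkdtemp()  # Scratch directory for tf.Transform (assumed location)
with beam.Pipeline() as pipeline:
    with tft_beam.Context(temp_dir=temp_dir):
        transformed_dataset, transform_fn = (
            (raw_data, raw_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
        )
# Save transform_fn to reuse it in production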



## 3. Feature-engine

A data preprocessing library that integrates with scikit-learn and lets you build reusable preprocessing pipelines.

In [None]:
from feature_engine.imputation import MeanMedianImputer
from feature_engine.encoding import OneHotEncoder
from feature_engine.wrappers import SklearnTransformerWrapper
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('median_imputer', MeanMedianImputer(imputation_method='median', variables=['age', 'income'])),
    ('onehot_encoder', OneHotEncoder(variables=['gender', 'occupation'])),
    ('scaler', SklearnTransformerWrapper(transformer=StandardScaler(), variables=['age', 'income']))
])

# Save the pipeline
import joblib
joblib.dump(pipeline, 'pipeline.pkl')

# Load the pipeline
pipeline = joblib.load('pipeline.pkl')

## 4. MLflow

Although MLflow is an ML life-cycle management platform covering experiment tracking, model management, and deployment, it can also save and reload preprocessing pipelines.

In [None]:
import mlflow
import mlflow.sklearn

pipeline = Pipeline(steps=[
    ('median_imputer', MeanMedianImputer(imputation_method='median', variables=['age', 'income'])),
    ('onehot_encoder', OneHotEncoder(variables=['gender', 'occupation'])),
    ('scaler', SklearnTransformerWrapper(transformer=StandardScaler(), variables=['age', 'income']))
])

# Save the pipeline
mlflow.sklearn.save_model(pipeline, 'model_path')

# Load the pipeline
loaded_pipeline = mlflow.sklearn.load_model('model_path')

## 5. Dask-ML

Dask-ML provides tools for parallel and distributed data preprocessing, which can be useful for very large datasets.

In [None]:
import dask.dataframe as dd
from dask_ml.preprocessing import StandardScaler

ddf = dd.read_csv('data.csv')
scaler = StandardScaler()
scaler.fit(ddf)

# Save and load with joblib
import joblib
joblib.dump(scaler, 'scaler.pkl')
scaler = joblib.load('scaler.pkl')

## 6. torchvision.transforms (images)

In [None]:
import torch
from torchvision import transforms
import pickle

# Define the transformations
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Save the transformations
with open('transform.pkl', 'wb') as f:
    pickle.dump(transform, f)

# Load the transformations
with open('transform.pkl', 'rb') as f:
    transform = pickle.load(f)

# SCIKIT LEARN - PIPELINES
> https://scikit-learn.org/stable/modules/compose.html

## 6.1. PIPELINES AND ESTIMATORS

The most common tool for composing estimators (*transformers, predictors, or clustering estimators*) is the *Pipeline*. A pipeline requires every step except the last to be a transformer (that is, to have a `transform` method).

A pipeline exposes all the methods of its last estimator (whether `transform` or `predict`). Data passes through each transformation in turn until it reaches the last estimator; if that estimator is a predictor, its `predict` method is called.

It can be combined with *ColumnTransformer* or *FeatureUnion*. *TransformedTargetRegressor* is used to transform the target variable.

### 6.1.1. Pipeline: chaining estimators

`Pipeline` can be used to chain multiple estimators into one.

Benefits:
- You only need to call `fit` and `predict` once to train a whole sequence of estimators.
- A *grid search* can be run over the parameters of all estimators at once (including the transformers).
- Safety: it helps avoid *leaking statistics* from test data into the trained model during cross-validation, because the transformers and the predictor are fitted on the same samples.
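
The last two benefits can be sketched together. This example, which is not part of the original notebook, runs one grid search over both the transformer and the classifier of a pipeline on the iris dataset; the parameter grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# One grid covers both the transformer and the classifier
param_grid = {
    "reduce_dim__n_components": [2, 3],
    "clf__C": [0.1, 1, 10],
}

# In every CV split the whole pipeline is refitted on the training folds only,
# so PCA never sees the validation fold before scoring
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Fitting PCA on the full dataset before cross-validating only the classifier would let information from the validation folds leak into the features; putting both steps in the pipeline avoids that.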

### 6.1.1.1. Usage

#### 6.1.1.1.1. Build a pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
pipe

#### 6.1.1.1.2. Access pipeline steps

In [None]:
pipe[:1]

In [None]:
pipe[-1:]
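
Besides slicing, which returns a sub-pipeline, a single step can be retrieved by position or by name. A short sketch with the same two-step layout, rebuilt here so the snippet runs on its own:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# An integer index or a step name returns the estimator itself
first_step = pipe[0]
same_step = pipe["reduce_dim"]

# named_steps also exposes the steps as attributes
clf = pipe.named_steps.clf

# Slicing returns a new Pipeline containing only the selected steps
sub_pipe = pipe[:1]

print(type(first_step).__name__, type(clf).__name__, type(sub_pipe).__name__)
```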

#### 6.1.1.1.3. Tracking feature names in a pipeline

In [None]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
iris = load_iris()
pipe = Pipeline(steps=[
    ('select', SelectKBest(k=2)),
    ('clf', LogisticRegression())])
pipe.fit(iris.data, iris.target)
pipe[:-1].get_feature_names_out()

array(['x2', 'x3'], dtype=object)

In [None]:
pipe[:-1].get_feature_names_out(iris.feature_names)

array(['petal length (cm)', 'petal width (cm)'], dtype=object)

#### 6.1.1.1.4. Access to nested parameters

Nested parameters are addressed with the syntax `<estimator>__<parameter>`:

In [None]:
pipe = Pipeline(steps=[("reduce_dim", PCA()), ("clf", SVC())])
pipe.set_params(clf__C=10)
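
The same syntax extends to deeper nesting: each `__` descends one level. A minimal sketch, assuming a pipeline inside a `ColumnTransformer` inside another pipeline (the step names are chosen here for illustration):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
preprocessor = ColumnTransformer([("num", numeric, [0, 1])])
clf = Pipeline([("preprocessor", preprocessor), ("classifier", LogisticRegression())])

# pipeline step -> ColumnTransformer entry -> inner pipeline step -> parameter
clf.set_params(preprocessor__num__imputer__strategy="median", classifier__C=0.5)

params = clf.get_params()
print(params["preprocessor__num__imputer__strategy"], params["classifier__C"])
```

The keys returned by `get_params()` are the same names a grid search accepts, so listing them is a quick way to find the right spelling.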

### 6.1.1.2. Caching transformers: avoid repeated computation

To cache the fitted transformers (their learned parameters) after `fit` is called, use the `memory` parameter. `memory` can be either a string with the directory where the transformers are cached or a `joblib.Memory` object.

In [None]:
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
cachedir = mkdtemp()
pipe = Pipeline(estimators, memory=cachedir)
pipe
# Clear the cache directory when you don't need it anymore
rmtree(cachedir)
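
One side effect of caching is worth knowing: with `memory` set, the pipeline fits a clone of each transformer, so the instance you passed in stays unfitted and the fitted one must be read from the pipeline. A small sketch on the digits dataset (dataset and `n_components=10` are arbitrary choices here):

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

cachedir = mkdtemp()
pca = PCA(n_components=10)
pipe = Pipeline([("reduce_dim", pca), ("clf", SVC())], memory=cachedir)
pipe.fit(X, y)

# The original instance was cloned before fitting, so it has no fitted attributes
original_is_fitted = hasattr(pca, "components_")
# The fitted clone lives inside the pipeline
fitted_shape = pipe.named_steps["reduce_dim"].components_.shape

print(original_is_fitted, fitted_shape)

# Clear the cache directory when it is no longer needed
rmtree(cachedir)
```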

### 6.1.2. Transforming target in regression

*TransformedTargetRegressor* transforms the target variable before fitting a regression model; predictions are mapped back to the original scale through the inverse transformation.

In [None]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X, y = fetch_california_housing(return_X_y=True)
X, y = X[:2000, :], y[:2000]  # select a subset of data
transformer = QuantileTransformer(output_distribution='normal')
regressor = LinearRegression()
regr = TransformedTargetRegressor(regressor=regressor,
                                  transformer=transformer)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regr.fit(X_train, y_train)
print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))
raw_target_regr = LinearRegression().fit(X_train, y_train)
print('R2 score: {0:.2f}'.format(raw_target_regr.score(X_test, y_test)))

R2 score: 0.61
R2 score: 0.59


A pair of functions (a transformation and its inverse) can be passed instead:

In [None]:
def func(x):
    return np.log(x)
def inverse_func(x):
    return np.exp(x)

In [None]:
regr = TransformedTargetRegressor(regressor=regressor,
                                  func=func,
                                  inverse_func=inverse_func)
regr.fit(X_train, y_train)
print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))

R2 score: 0.51
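
To make the mechanics visible, here is a sketch on synthetic, noise-free data (not the housing data used above) where the target is linear in `X` only after taking the log. The inner regressor is trained on `log(y)`, while `predict()` returns values on the original scale of `y`.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
# Synthetic target: log(y) = 1.5 * x + 0.5
y = np.exp(1.5 * X[:, 0] + 0.5)

regr = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
regr.fit(X, y)

# The fitted inner regressor lives in regressor_ and saw log(y), so its slope is ~1.5
inner_coef = regr.regressor_.coef_[0]
# predict() applies inverse_func, so the result is ~exp(1.5 * 1.0 + 0.5)
pred = regr.predict(np.array([[1.0]]))

print(round(float(inner_coef), 3), round(float(pred[0]), 3))
```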


### 6.1.3. FeatureUnion: composite feature spaces

*FeatureUnion* combines several *transformer* objects into a new *transformer* that combines their outputs. It takes a list of transformers; during fitting each one is fitted to the data independently (they can run in parallel), and their outputs are concatenated side by side.

#### 6.1.3.1. Usage

In [None]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(estimators)
combined

In [None]:
# Ignore a transformer by setting it to "drop"
combined.set_params(kernel_pca='drop')
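
The concatenation is easiest to see in the output shape. A minimal sketch on the iris dataset, using `PCA` and `SelectKBest` as two arbitrary transformers chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(k=1)),
])

# Columns are concatenated side by side: 2 from PCA + 1 from SelectKBest
shape_full = union.fit_transform(X, y).shape
print(shape_full)  # (150, 3)

# Dropping a transformer removes its columns from the output
union.set_params(kbest="drop")
shape_after_drop = union.fit_transform(X, y).shape
print(shape_after_drop)  # (150, 2)
```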

### 6.1.4. ColumnTransformer for heterogeneous data

*ColumnTransformer* applies transformations to whole columns, and a different transformation can be used for each column or group of columns.

In [None]:
import pandas as pd
X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

Example: encode 'city' with *OneHotEncoder* and 'title' with *CountVectorizer*. By default, the remaining columns are ignored (`remainder='drop'`).

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
column_trans = ColumnTransformer(
    [('categories', OneHotEncoder(dtype='int'), ['city']),  # list of column names: the transformer receives 2D input
     ('title_bow', CountVectorizer(), 'title')],  # single string: the transformer receives 1D input
    remainder='drop', verbose_feature_names_out=False)
column_trans.fit(X)
column_trans

In [None]:
column_trans.get_feature_names_out()

array(['city_London', 'city_Paris', 'city_Sallisaw', 'bow', 'feast',
       'grapes', 'his', 'how', 'last', 'learned', 'moveable', 'of', 'the',
       'trick', 'watson', 'wrath'], dtype=object)

In [None]:
column_trans.transform(X).toarray()

array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]])

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_selector
ct = ColumnTransformer([
      ('scale', StandardScaler(),
      make_column_selector(dtype_include=np.number)),
      ('onehot',
      OneHotEncoder(),
      make_column_selector(pattern='city', dtype_include=object))])
ct.fit_transform(X)

array([[ 0.90453403,  0.        ,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.41421356,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  0.        ,  1.        ,  0.        ],
       [ 0.90453403, -1.41421356,  0.        ,  0.        ,  1.        ]])

The remaining columns can be kept with `remainder='passthrough'`, in which case their values are appended at the end of the transformed output.

In [None]:
column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'),['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='passthrough')

column_trans.fit_transform(X)

array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 4],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 3, 5],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 5, 3]])

## 6.3. PREPROCESSING DATA

### 6.3.8. Custom transformers

To turn an existing Python function into a transformer, wrap it in a *FunctionTransformer*.

In [None]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)
X = np.array([[0, 1], [2, 3]])
# Since FunctionTransformer is no-op during fit, we can call transform directly
transformer.transform(X)

array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])
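
If the function has an inverse, it can be supplied too, which makes `inverse_transform` available. A short sketch using `np.expm1`, the exact inverse of `np.log1p`:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# check_inverse=True makes fit() verify that inverse_func undoes func
transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1, check_inverse=True)

X = np.array([[0.0, 1.0], [2.0, 3.0]])
X_log = transformer.fit_transform(X)
X_back = transformer.inverse_transform(X_log)

print(np.allclose(X, X_back))
```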

## COLUMN TRANSFORMER WITH HETEROGENEOUS DATA SOURCES
> https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer.html#

When it applies:
- When the dataset contains heterogeneous data.
- When different columns require different processing.

In [None]:
# Author: Matt Terry <matt.terry@gmail.com>
#
# License: BSD 3 clause

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import PCA
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC

In [None]:
categories = ["sci.med", "sci.space"]
X_train, y_train = fetch_20newsgroups(
    random_state=1,
    subset="train",
    categories=categories,
    remove=("footers", "quotes"),
    return_X_y=True,
)
X_test, y_test = fetch_20newsgroups(
    random_state=1,
    subset="test",
    categories=categories,
    remove=("footers", "quotes"),
    return_X_y=True,
)

In [None]:
print(X_train[0])

From: mccall@mksol.dseg.ti.com (fred j mccall 575-3539)
Subject: Re: Metric vs English
Article-I.D.: mksol.1993Apr6.131900.8407
Organization: Texas Instruments Inc
Lines: 31




American, perhaps, but nothing military about it.  I learned (mostly)
slugs when we talked English units in high school physics and while
the teacher was an ex-Navy fighter jock the book certainly wasn't
produced by the military.

[Poundals were just too flinking small and made the math come out
funny; sort of the same reason proponents of SI give for using that.] 

-- 
"Insisting on perfect safety is for people who don't have the balls to live
 in the real world."   -- Mary Shafer, NASA Ames Dryden


### Creating transformers

A transformer that extracts the "subject" and "body" of each post.

In [None]:
def subject_body_extractor(posts):
    # construct object dtype array with two columns
    # first column = 'subject' and second column = 'body'
    features = np.empty(shape=(len(posts), 2), dtype=object)
    for i, text in enumerate(posts):
        # temporary variable `_` stores '\n\n'
        headers, _, body = text.partition("\n\n")
        # store body text in second column
        features[i, 1] = body

        prefix = "Subject:"
        sub = ""
        # save text after 'Subject:' in first column
        for line in headers.split("\n"):
            if line.startswith(prefix):
                sub = line[len(prefix) :]
                break
        features[i, 0] = sub

    return features


subject_body_transformer = FunctionTransformer(subject_body_extractor)

A transformer that extracts the length of the text and the number of sentences:

In [None]:
def text_stats(posts):
    return [{"length": len(text), "num_sentences": text.count(".")} for text in posts]

text_stats_transformer = FunctionTransformer(text_stats)

### Classification pipeline

- Extract the "subject" and "body" of each post with `subject_body_transformer` (produces an array of shape (n_samples, 2)).
- Compute *bag-of-words* features on that array, as well as the text length and number of sentences, using `ColumnTransformer`.
- Finally, combine everything with weights and train a classifier.

In [None]:
pipeline = Pipeline(
    [
        # Extract subject & body
        ("subjectbody", subject_body_transformer),
        # Use ColumnTransformer to combine the subject and body features
        (
            "union",
            ColumnTransformer(
                [
                    # bag-of-words for subject (col 0)
                    ("subject", TfidfVectorizer(min_df=50), 0),
                    # bag-of-words with decomposition for body (col 1)
                    (
                        "body_bow",
                        Pipeline(
                            [
                                ("tfidf", TfidfVectorizer()),
                                ("best", PCA(n_components=50, svd_solver="arpack")),
                            ]
                        ),
                        1,
                    ),
                    # Pipeline for pulling text stats from post's body
                    (
                        "body_stats",
                        Pipeline(
                            [
                                (
                                    "stats",
                                    text_stats_transformer,
                                ),  # returns a list of dicts
                                (
                                    "vect",
                                    DictVectorizer(),
                                ),  # list of dicts -> feature matrix
                            ]
                        ),
                        1,
                    ),
                ],
                # weight above ColumnTransformer features
                transformer_weights={
                    "subject": 0.8,
                    "body_bow": 0.5,
                    "body_stats": 1.0,
                },
            ),
        ),
        # Use a SVC classifier on the combined features
        ("svc", LinearSVC(dual=False)),
    ],
    verbose=True,
)

In [None]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print("Classification report:\n\n{}".format(classification_report(y_test, y_pred)))

[Pipeline] ....... (step 1 of 3) Processing subjectbody, total=   0.0s
[Pipeline] ............. (step 2 of 3) Processing union, total=   0.6s
[Pipeline] ............... (step 3 of 3) Processing svc, total=   0.0s
Classification report:

              precision    recall  f1-score   support

           0       0.84      0.87      0.86       396
           1       0.87      0.84      0.85       394

    accuracy                           0.86       790
   macro avg       0.86      0.86      0.86       790
weighted avg       0.86      0.86      0.86       790



## COLUMN TRANSFORMER WITH MIXED TYPES
> https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html

In [None]:
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

np.random.seed(0)

In [None]:
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

# Alternatively X and y can be obtained directly from the frame attribute:
# X = titanic.frame.drop('survived', axis=1)
# y = titanic.frame['survived']

Example of using Python functions (e.g. `map`, `astype`) inside a `ColumnTransformer`:

In [None]:
ct = ColumnTransformer([
    ('duplicar', FunctionTransformer(lambda x: x.map(lambda y: y * 2)), ['Columna1']),
    ('mayusculas', FunctionTransformer(lambda x: x.map(str.upper)), ['Columna2'])
])
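
The cell above only defines the transformer. A runnable sketch follows, assuming a small made-up DataFrame with the same column names. It uses named functions instead of lambdas, which is a deliberate change: lambdas cannot be pickled with the standard `pickle` module, so a transformer built on them could not be saved for production.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Hypothetical frame using the column names from the cell above
df = pd.DataFrame({"Columna1": [1, 2, 3], "Columna2": ["a", "b", "c"]})


def double(frame):
    # Element-wise arithmetic works directly on the selected columns
    return frame * 2


def to_upper(frame):
    # Apply the string method column by column
    return frame.apply(lambda col: col.str.upper())


ct = ColumnTransformer([
    ("duplicar", FunctionTransformer(double), ["Columna1"]),
    ("mayusculas", FunctionTransformer(to_upper), ["Columna2"]),
])

result = ct.fit_transform(df)
print(result)
```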

Use `ColumnTransformer` to select columns by name. We create separate pipelines for the numeric and the categorical data.

In [None]:
numeric_features = ["age", "fare"]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_features = ["embarked", "sex", "pclass"]
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("selector", SelectPercentile(chi2, percentile=50)),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

Training:

In [None]:
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.798


In [None]:
clf

Same example, but selecting columns by data type:

In [None]:
subset_feature = ["embarked", "sex", "pclass", "age", "fare"]
X_train, X_test = X_train[subset_feature], X_test[subset_feature]
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1047 entries, 1118 to 684
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   embarked  1045 non-null   category
 1   sex       1047 non-null   category
 2   pclass    1047 non-null   int64   
 3   age       841 non-null    float64 
 4   fare      1046 non-null   float64 
dtypes: category(2), float64(2), int64(1)
memory usage: 35.0 KB


In [None]:
from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, selector(dtype_exclude="category")),
        ("cat", categorical_transformer, selector(dtype_include="category")),
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)


clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
clf

model score: 0.798


## CUSTOMIZING SCIKIT-LEARN PIPELINES: WRITE YOUR OWN TRANSFORMER
> https://towardsdatascience.com/customizing-scikit-learn-pipelines-write-your-own-transformer-fdaaefc5e5d7

### How to apply a pipeline?

In [None]:
import pandas as pd

# Define the table data
data = {
    "Patient": [1, 2, 3, 4, 5],
    "Gender": [1, 1, 0, 1, 1],
    "Age": [52, 60, 68, 72, 41],
    "BMI": [30, 28, 23, 26, 22],
    "Smoking": [0, 0, 1, 0, 1],
    "weekly_alcohol_consumption": [6, 0, 3, 3, 4],
    "Blood_pressure": [135, 120, 130, 120, 110],
    "High_risk": [1, 0, 1, 0, 0]
}

# Create the DataFrame
df = pd.DataFrame(data)

# Show the DataFrame
df

   Patient  Gender  Age  BMI  Smoking  weekly_alcohol_consumption  Blood_pressure  High_risk
0        1       1   52   30        0                           6             135          1
1        2       1   60   28        0                           0             120          0
2        3       0   68   23        1                           3             130          1
3        4       1   72   26        0                           3             120          0
4        5       1   41   22        1                           4             110          0


In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['High_risk']), df['High_risk'], test_size=0.2)

The pipeline will include missing-value imputation and standardization. Finally, a `RandomForestClassifier` will be trained.

In [None]:
# import relevant packages
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# define our pipeline
pipe = Pipeline([('imputer', SimpleImputer()),
                 ('scaler', StandardScaler()),
                 ('RF', RandomForestClassifier())])
pipe

In [None]:
pipe.fit(X_train, y_train)
pipe.predict(X_test)

array([1])

### Customize your pipeline by writing your own transformer

Example: impute only the age

In [None]:
# pipe = Pipeline([('age_imputer', AgeImputer()),('imputer', SimpleImputer()),('scaler', StandardScaler()), ('RF', RandomForestClassifier())])

#### How to write a transformer?

Example:

In [None]:
# import packages
from sklearn.base import BaseEstimator, TransformerMixin


# define the transformer
class AgeImputer(BaseEstimator, TransformerMixin):
    def __init__(self, max_age):
        print('Initialising transformer...')
        self.max_age = max_age

    def fit(self, X, y=None):
        # learn the replacement value from the training data
        self.mean_age = round(X['Age'].mean())
        return self

    def transform(self, X):
        print('replacing impossible age values')
        # work on a copy so the caller's DataFrame is not modified
        X = X.copy()
        X.loc[(X['Age'] > self.max_age) | (X['Age'] < 0), 'Age'] = self.mean_age
        return X

In [None]:
# Apply it
pipe = Pipeline([('age_imputer', AgeImputer(50)),('imputer', SimpleImputer()),('scaler', StandardScaler()), ('RF', RandomForestClassifier())])
age_scaled = pipe[0].fit_transform(X_train)
age_scaled

Initialising transformer...
replacing impossible age values


   Patient  Gender  Age  BMI  Smoking  weekly_alcohol_consumption  Blood_pressure
2        3       0   58   23        1                           3             130
0        1       1   58   30        0                           6             135
4        5       1   41   22        1                           4             110
3        4       1   58   26        0                           3             120


## CREATING CUSTOM SCIKIT-LEARN TRANSFORMERS
> https://www.andrewvillazon.com/custom-scikit-learn-transformers/

### Creating a custom transformer

Basic requirements:
- A *Transformer* class (one can also be created from a function).
- It inherits from `BaseEstimator` and `TransformerMixin` in the `sklearn.base` package.
- It implements the `fit()` and `transform()` methods. `fit()` must accept `X` and `y` as parameters, `transform()` must accept at least `X`, and `transform()` must return a *pandas* DataFrame or a *NumPy* array.

Example:

In [None]:
from numpy.random import randint
from sklearn.base import BaseEstimator, TransformerMixin


class CustomTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Perform arbitrary transformation (note: this modifies X in place)
        X["random_int"] = randint(0, 10, X.shape[0])
        return X

It can be used like this:

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

pipe = Pipeline(
    steps=[
        ("use_custom_transformer", CustomTransformer())]
)
transformed_df = pipe.fit_transform(df)

print(df)

   a  b  c  random_int
0  1  4  7           4
1  2  5  8           6
2  3  6  9           9


### Passing arguments to a Custom Transformer

If extra data or objects are needed, give the class an `__init__()` method and pass it whatever parameters are required. For example:

In [None]:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class MultiplyColumns(BaseEstimator, TransformerMixin):
    def __init__(self, by=1, columns=None):
        self.by = by
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        cols_to_transform = list(X.columns)

        if self.columns:
            cols_to_transform = self.columns

        X[cols_to_transform] = X[cols_to_transform] * self.by
        return X


# Use Custom Transformer
df = pd.DataFrame({"a": [1, -2, 3], "b": [-4, 5, 6], "c": [-7, -8, 9]})

pipe = Pipeline(
    steps=[
        ("multiply_cols_by_3", MultiplyColumns(3, columns=["a", "c"]))]
)
transformed_df = pipe.fit_transform(df)

print(df)

   a  b   c
0  3 -4 -21
1 -6  5 -24
2  9  6  27


Another example:

In [None]:
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column_name, multiplier=2):
        self.column_name = column_name
        self.multiplier = multiplier

    def fit(self, X, y=None):
        return self  # The fit method typically does nothing for transformers

    def transform(self, X):
        X_transformed = X.copy()  # Copy the input DataFrame to avoid modifying the original
        # Check if the specified column is numerical
        if pd.api.types.is_numeric_dtype(X_transformed[self.column_name]):
            X_transformed[self.column_name] *= self.multiplier
        else:
            # If categorical, apply a different transformation (e.g., capitalize strings)
            X_transformed[self.column_name] = X_transformed[self.column_name].apply(
                lambda x: str(x).capitalize())
        return X_transformed

In [None]:
# Remove missing values
from sklearn.base import BaseEstimator, TransformerMixin


class MissingValuesFeatureRemover(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.2):
        self.threshold = threshold
        self.features_to_drop = []
        self.output = "pandas"
        self.fitted = False

    def fit(self, X, y=None):
        nan_fracs = X.isna().sum() / X.shape[0]

        self.features_to_drop = nan_fracs[nan_fracs >=
                                          self.threshold].keys().to_list()
        self.fitted = True

        return self

    def transform(self, X):
        if not self.fitted:
            raise ValueError("Fit the transformer first using fit().")

        cleaned_X = X.drop(self.features_to_drop, axis=1)

        return cleaned_X if self.output == "pandas" else cleaned_X.to_numpy()

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

    def set_output(self, transform="pandas"):
        self.output = transform
        # scikit-learn's set_output convention is to return the estimator itself
        return self
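
For comparison, here is a hypothetical, slimmer variant of the class above written along scikit-learn's estimator conventions: `__init__` only stores hyperparameters, learned state gets a trailing underscore, and `check_is_fitted` replaces the manual `fitted` flag. The class name and sample data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class HighMissingDropper(BaseEstimator, TransformerMixin):
    """Hypothetical variant: drop columns whose NaN fraction reaches a threshold."""

    def __init__(self, threshold=0.2):
        # __init__ only stores hyperparameters, unmodified
        self.threshold = threshold

    def fit(self, X, y=None):
        nan_fracs = X.isna().mean()
        # Learned state is stored with a trailing underscore
        self.features_to_drop_ = nan_fracs[nan_fracs >= self.threshold].index.to_list()
        return self

    def transform(self, X):
        # Raises NotFittedError if fit() has not been called yet
        check_is_fitted(self, "features_to_drop_")
        return X.drop(columns=self.features_to_drop_)


df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0], "b": [1, 2, 3, 4]})
dropper = HighMissingDropper(threshold=0.5)
cleaned = dropper.fit_transform(df)  # fit_transform is inherited from TransformerMixin
print(list(cleaned.columns))
```

Keeping `__init__` free of logic is what lets `get_params`, `set_params`, and `clone` work, which grid search and cached pipelines rely on.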

### Function transformers

For convenience, the `FunctionTransformer` class can be used instead; it wraps a plain function in a *Transformer*.

Example (one-hot encoding):

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

data = {
    "id": [1, 2, 3, 4, 5,],
    "fruit": ["Apple", "Apple", "Peach", "Banana"],
}
df = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})

pipe = Pipeline(
    steps=[
        ("simple_one_hot_encode", FunctionTransformer(pd.get_dummies))]
)
transformed_df = pipe.fit_transform(df)

print(transformed_df)

   id  fruit_Apple  fruit_Banana  fruit_Peach
0   1         True         False        False
1   2         True         False        False
2   3        False         False         True
3   4        False          True        False
4   5        False         False        False


If the function takes additional parameters, pass them through the `kw_args` argument:

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


data = {
    "id": [1, 2, 3, 4, 5,],
    "fruit": ["Apple", "Apple", "Peach", "Banana"],
}
df = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})

pipe = Pipeline(
    steps=[
        (
            "simple_one_hot_encode",
            FunctionTransformer(
                pd.get_dummies, kw_args={"dummy_na": True, "dtype": "float"}
            ),
        )
    ]
)
transformed_df = pipe.fit_transform(df)

print(transformed_df)