<h1 style="text-align: center; font-size: 36px;">Machine learning pipeline</h1>

##### Student information:
- Name: Tuan Anh NGUYEN
- Email: tuan.nguyen@etu.univ-cotedazur.fr

---

##### Tasks:

1. Build your custom pipeline (imblearn pipeline)
- Remove observations with too many missing values by row (more than 10 % of missing values)
    + Custom transformer
- Remove outlier observations -> choose the strategy
    + Custom sampler
- Specific stage
    + For categorical values
        + Fill missing values -> choose the strategy and justify it
        + OneHotEncode
    + For numerical values
        + Fill missing values -> choose the strategy and justify it
        + Normalize the values (RobustScaler)
- Choose the correct predictor

2. Try the pipeline with the initial hyperparameters and evaluate it
3. Use RandomizedSearchCV or BayesSearchCV to find a better predictor
- Choose the hyperparameter set (for all pipeline stages)
- Fit the pipeline (try only 10 iterations with RandomizedSearchCV in order to test the parameter set)
- Evaluate the best pipeline
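
The "specific stage" of task 1 can be pictured as two parallel branches, one per column type. The block below is a minimal sketch of that idea using scikit-learn's `ColumnTransformer`; it is not the implementation used in this notebook. The toy values and the `median` strategy are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data (hypothetical values) with one gap per column type.
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", np.nan, "Biscoe"],
    "body_mass_g": [3750.0, np.nan, 4200.0, 5000.0],
})

# Categorical branch: fill with the most frequent value, then one-hot encode.
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
# Numerical branch: fill with the median (an assumed choice), then scale robustly.
numerical = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())

# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocess = ColumnTransformer(
    [("cat", categorical, ["island"]), ("num", numerical, ["body_mass_g"])],
    sparse_threshold=0,
)
features = preprocess.fit_transform(toy)
print(features.shape)  # two one-hot columns plus one scaled numerical column
```

Keeping the two branches separate means each imputation strategy can be justified, and later tuned, independently.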

---





In [1]:
from palmerpenguins import load_penguins
import matplotlib.pyplot as plt
from collections import Counter
import numpy as np
from imblearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from imblearn import FunctionSampler
from sklearn.base import BaseEstimator, TransformerMixin
from scipy.stats import zscore
import pandas as pd
import traceback
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from collections import defaultdict
d = defaultdict(LabelEncoder)

In [2]:
df = load_penguins()        # load the penguins dataframe

Let's take a quick look at the dataset.

In [3]:
Counter(df.species)

Counter({'Adelie': 152, 'Gentoo': 124, 'Chinstrap': 68})

We can see that the dataset is unbalanced: Adelie has 152 samples, Gentoo 124, and Chinstrap only 68.
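
To put a number on the imbalance, the class shares and the majority-to-minority ratio can be computed directly from the counts. This is a small sketch that reuses the counts printed above; the variable names are illustrative.

```python
from collections import Counter

# Class counts copied from the Counter output above.
counts = Counter({"Adelie": 152, "Gentoo": 124, "Chinstrap": 68})
total = sum(counts.values())

# Share of each class and the majority/minority imbalance ratio.
shares = {species: n / total for species, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for species, share in shares.items():
    print(f"{species}: {share:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```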

In [4]:
df.isnull()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False |
| 3 | False | False | True | True | True | True | True | False |
| 4 | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | False | False | False | False | False | False | False | False |
| 340 | False | False | False | False | False | False | False | False |
| 341 | False | False | False | False | False | False | False | False |
| 342 | False | False | False | False | False | False | False | False |


The dataset also contains missing values, for example in row 3.
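
A boolean table is hard to read at a glance, so it often helps to aggregate it. The sketch below counts missing values per column and the number of affected rows on a toy frame; the values are hypothetical stand-ins for the penguins data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the penguins data.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, np.nan, 3250.0, 3450.0],
    "sex": ["male", None, "female", None],
})

# Number of missing values per column.
missing_per_column = toy.isnull().sum()
# Number of rows with at least one missing value.
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```

The same two expressions can be applied to `df` to see which columns and how many rows are affected.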

Extract the features and the target from the DataFrame. `y_data` is the label, which is the ***species*** column of the DataFrame; `X_data` holds the features, which are the other columns of the DataFrame (from `island` onward).

In [5]:
X_data = df.loc[:, 'island':]
y_data = df.loc[:, ['species']]

# Create the test set and training set
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, shuffle=True, random_state=1)
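
Because the classes are unbalanced, a random split can shift the class shares between the training and test sets. One option, sketched below as an assumption rather than what this notebook does, is the `stratify` argument of `train_test_split`. The labels reuse the class counts shown earlier and the single feature is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the Counter output (152 / 124 / 68).
y = np.array(["Adelie"] * 152 + ["Gentoo"] * 124 + ["Chinstrap"] * 68)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, values are irrelevant here

# stratify=y keeps the class shares (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=1, stratify=y
)

def share(labels, name):
    """Fraction of entries in `labels` equal to `name`."""
    return float(np.mean(labels == name))

for name in ("Adelie", "Gentoo", "Chinstrap"):
    print(name, round(share(y, name), 3), round(share(y_te, name), 3))
```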

Now, I will create ***MyCustomTransformer***, which is an estimator. Its roles are:
- Remove the rows that have too many missing values ($\geq 10\%$).
- Fill the missing values in the remaining rows, which have fewer missing values than the removed ones.
- Encode the data
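
To see what the 10 % threshold means in practice, the row-wise fraction can be worked out on a few toy rows. The sketch below uses the same seven feature columns as `X_data`; the cell values are hypothetical.

```python
import numpy as np
import pandas as pd

# One toy row per case, using the seven feature columns kept in X_data.
columns = ["island", "bill_length_mm", "bill_depth_mm",
           "flipper_length_mm", "body_mass_g", "sex", "year"]
toy = pd.DataFrame(
    [["Biscoe", 39.1, 18.7, 181.0, 3750.0, "male", 2007],       # complete row
     ["Dream", 40.3, 18.0, 195.0, 3250.0, np.nan, 2007],        # one missing value
     ["Dream", np.nan, np.nan, np.nan, np.nan, np.nan, 2007]],  # five missing values
    columns=columns,
)

# Fraction of missing values per row, computed the same way as in fit().
missing_fraction = toy.isnull().sum(axis=1) / len(toy.columns)
rows_to_drop = list(missing_fraction[missing_fraction >= 0.1].index)

print(missing_fraction.round(3).tolist())  # [0.0, 0.143, 0.714]
print(rows_to_drop)                        # [1, 2]
```

With seven columns, a single missing value already gives 1/7 ≈ 14.3 %, which is above the threshold, so under this rule every row that has any missing value is removed.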


In [6]:

class MyCustomTransformer(BaseEstimator):
    def __init__(self, hp=None):
        self.hp = hp
        self.row_null = None
        self.categorical_col = None
        self.numerical_col = None
        self.X_raw = None
        self.y_raw = None
    def fit(self, X, y):
        try:
            # Find the rows with too many missing values (at least 10 % of their columns)
            num_of_null = X.isnull().sum(axis=1)
            self.row_null = list(num_of_null[num_of_null/len(X.columns) >= 0.1].index)
            self.categorical_col = X.loc[:, ['island','sex','year']]
            self.numerical_col = X.loc[:, 'bill_length_mm':'body_mass_g']
            self.X_raw = X
            self.y_raw = y
            return self
        except Exception as e:
            print(traceback.format_exc())
            

    def fit_resample(self, X, y):
        self.fit(X, y)
        try:
            # Remove the rows with too many missing values
            self.categorical_col = self.categorical_col.drop(self.row_null)
            self.categorical_col = self.categorical_col.reset_index(drop=True)
            self.numerical_col = self.numerical_col.drop(self.row_null)
            self.numerical_col = self.numerical_col.reset_index(drop=True)
            self.y_raw = self.y_raw.drop(self.row_null)
            self.y_raw = self.y_raw.reset_index(drop=True)
            self.row_null = []

            # Fill the remaining missing values of the categorical columns with the most frequent value
            imp = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')
            self.categorical_col = pd.DataFrame(imp.fit_transform(self.categorical_col), columns=imp.get_feature_names_out())

            # Use OneHotEncoder to the categorical data
            oh_encoder = OneHotEncoder(dtype='int', handle_unknown='ignore')
            oh_encoded_raw = oh_encoder.fit_transform(self.categorical_col)
            self.categorical_col = pd.DataFrame(oh_encoded_raw.toarray(), columns=oh_encoder.get_feature_names_out())

            # Use LabelEncoder to encode the label
            l_encoder = LabelEncoder()
            l_encoded_raw = l_encoder.fit_transform(self.y_raw)
            self.y_raw = pd.DataFrame(l_encoded_raw, columns=self.y_raw.columns)

            # Merge the categorical columns with the numerical columns
            self.X_raw = pd.concat([self.categorical_col, self.numerical_col], axis='columns')
            return self.X_raw, self.y_raw
        except Exception:
            print(traceback.format_exc())
            

Next, I will create ***MyCustomSampler***, which is also an estimator. Its roles are:
- Rebalance the dataset
- Fill missing values
- Handle the outliers with one of 2 strategies: *Z score* or *IQR*
- Normalize the dataset
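
The two outlier strategies can disagree, especially on small samples. The sketch below applies both rules to a toy series with one obvious outlier; the values are hypothetical, and the bounds (3 standard deviations, 1.5 × IQR) are the usual conventions.

```python
import numpy as np
import pandas as pd

# Toy measurements (hypothetical values) with one obvious outlier.
values = pd.Series([10.0, 11.0, 12.0, 11.0, 10.0, 12.0, 11.0, 50.0])

# Z-score rule: keep values within 3 standard deviations of the mean.
# ddof=0 matches the default of scipy.stats.zscore.
z = (values - values.mean()) / values.std(ddof=0)
keep_z = z.abs() <= 3

# IQR rule: keep values within 1.5 * IQR of the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
keep_iqr = (values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)

print(int(keep_z.sum()), int(keep_iqr.sum()))  # 8 7
```

Here the value 50 inflates the standard deviation enough that its z-score stays below 3, so only the IQR rule flags it. This sensitivity to the outlier itself is one argument for preferring IQR on small or heavily skewed samples.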

In [7]:
class MyCustomSampler(BaseEstimator):
    def __init__(self, strategy):
        self.strategy = strategy
        self.categorical_col = None
        self.numerical_col = None
        self.X_raw = None
        self.y_raw = None
    def fit(self, X, y):
        try:
            self.categorical_col = X.iloc[:, :-4]
            self.numerical_col = X.loc[:, 'bill_length_mm':'body_mass_g']
            self.X_raw = X
            self.y_raw = y
            return self
        except Exception:
            print(traceback.format_exc())

    def fit_resample(self, X, y):
        # Rebalance the data set
        X_tmp, y_tmp = self.rebalance(X, y)

        # Split the rebalanced data into categorical and numerical columns
        self.fit(X_tmp, y_tmp)

        # Fill missing values of the numerical columns
        self.numerical_col = self.fill_missing(self.numerical_col)

        # Flag the rows whose numerical values all lie within the bounds
        if self.strategy == 'z_score':                          # Using strategy 'z_score'
            z = self.numerical_col.apply(zscore)
            keep = ((z <= 3) & (z >= -3)).all(axis=1)
        elif self.strategy == 'IQR':                            # Using strategy 'IQR' (Inter-Quartile Range)
            q1 = self.numerical_col.quantile(0.25)              # 25%
            q3 = self.numerical_col.quantile(0.75)              # 75%
            iqr = q3 - q1
            keep = ((self.numerical_col <= q3 + 1.5*iqr) & (self.numerical_col >= q1 - 1.5*iqr)).all(axis=1)
        else:
            raise ValueError(f"Invalid strategy: {self.strategy!r}")

        # Remove the outlier rows from the features and the labels so they stay aligned
        keep = keep.to_numpy()
        self.numerical_col = self.numerical_col[keep].reset_index(drop=True)
        self.categorical_col = self.categorical_col[keep].reset_index(drop=True)
        self.y_raw = self.y_raw[keep].reset_index(drop=True)

        # Normalize the data set
        self.numerical_col = self.normalize(self.numerical_col)
        self.X_raw = pd.concat([self.categorical_col, self.numerical_col], axis='columns')
        return self.X_raw.to_numpy(), self.y_raw.to_numpy()

    def fill_missing(self, X):
        imp = SimpleImputer(missing_values = np.nan, strategy = 'mean')
        X_tmp = pd.DataFrame(imp.fit_transform(X), columns=imp.get_feature_names_out())
        return X_tmp

    def normalize(self, X):
        scaler = RobustScaler()
        X_tmp = pd.DataFrame(scaler.fit_transform(X), columns= scaler.get_feature_names_out())
        return X_tmp

    def rebalance(self, X, y):
        smote = SMOTE()
        X_resampled, y_resampled = smote.fit_resample(X, y)        
        return X_resampled, y_resampled


Build the pipeline and evaluate it with the initial hyperparameters

In [8]:
pipe = make_pipeline(MyCustomTransformer(), MyCustomSampler(strategy='z_score'))        # Transform pipeline

X_train_transformed, y_train_transformed = pipe.fit_resample(X_train, y_train)          # Transform the training set
# Transform the test set (note: fit_resample refits both stages on the test data, so SMOTE resamples the test set as well)
X_test_transformed, y_test_transformed = pipe.fit_resample(X_test, y_test)

reg_model = LogisticRegression(C = 1, penalty = None, multi_class = 'auto', solver='saga')  # Use LogisticRegression with initial hyperparameters
reg_model.fit(X_train_transformed, y_train_transformed)
y_predict = reg_model.predict(X_test_transformed)

  y = column_or_1d(y, warn=True)


In [9]:
print(classification_report(y_test_transformed, y_predict))             # Print the classification report

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        33
           1       1.00      1.00      1.00        33
           2       1.00      1.00      1.00        33

    accuracy                           1.00        99
   macro avg       1.00      1.00      1.00        99
weighted avg       1.00      1.00      1.00        99
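
The supports in this report are 33 per class because the test split went through `fit_resample` as well, so SMOTE rebalanced it. A common alternative is to learn all preprocessing statistics on the training split and leave the test split untouched. The block below is a minimal sketch of that evaluation pattern; it assumes the iris dataset as a stand-in so that it runs on its own, and it is not the pipeline used above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Iris stands in for the penguins data so the sketch runs on its own.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Scaler statistics and model weights are learned from the training split only.
model = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# The test split is only transformed and predicted, never refitted or resampled.
y_hat = model.predict(X_te)
print(classification_report(y_te, y_hat))
```

With this pattern the supports in the report equal the true class counts of the test split, which makes the scores easier to interpret.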



Now, use `RandomizedSearchCV` to search for better hyperparameters for the model.

In [10]:
pipe_est = make_pipeline(LogisticRegression(C = 1, penalty = None, multi_class = 'auto', solver='saga'))        # Estimator pipeline

# Hyperparameter search space
hyper_parameters = {'logisticregression__C': range(1, 11),
                    'logisticregression__penalty': ['l2', 'l1', None],
                    'logisticregression__multi_class': ['auto', 'ovr', 'multinomial']}

# Sample 10 hyperparameter combinations and keep the best one
clf = RandomizedSearchCV(pipe_est, hyper_parameters, n_iter=10, scoring='accuracy', return_train_score=True)
clf.fit(X_train_transformed, y_train_transformed)
y_pred = clf.best_estimator_.predict(X_test_transformed)                                                        # Predict with the best estimator
   

  y = column_or_1d(y, warn=True)

In [11]:
print(classification_report(y_test_transformed, y_pred))                # Print the classification report

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        33
           1       1.00      1.00      1.00        33
           2       1.00      1.00      1.00        33

    accuracy                           1.00        99
   macro avg       1.00      1.00      1.00        99
weighted avg       1.00      1.00      1.00        99
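
The search above tunes only the classifier, while task 3 asks for hyperparameters across all pipeline stages. The block below is a hedged sketch of how such a search space could look. It assumes a plain scikit-learn pipeline on the iris dataset with a few injected gaps; the stage choices and value ranges are illustrative, not the notebook's.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

X, y = load_iris(return_X_y=True)
X = X.copy()
X[::15, 0] = np.nan  # inject a few gaps so the imputer has work to do

pipe = make_pipeline(SimpleImputer(), RobustScaler(), LogisticRegression(max_iter=1000))

# Keys follow the "<step name>__<parameter>" convention of make_pipeline.
search_space = {
    "simpleimputer__strategy": ["mean", "median"],
    "robustscaler__quantile_range": [(25.0, 75.0), (10.0, 90.0)],
    "logisticregression__C": loguniform(1e-2, 1e2),
}

# Sample 10 combinations, scored by 5-fold cross-validated accuracy.
search = RandomizedSearchCV(
    pipe, search_space, n_iter=10, scoring="accuracy", cv=5, random_state=0
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

Sampling `C` from a log-uniform distribution covers several orders of magnitude with few iterations, which suits a budget of only 10 draws.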



---

##### Conclusion:
- In this part, we learned how to build our own pipeline to transform the data in order to improve the performance of the model.

References:
- imbalanced-learn documentation: https://imbalanced-learn.org/stable/
- SimpleImputer: https://scikit-learn.org/1.5/modules/generated/sklearn.impute.SimpleImputer.html