# <font color = 'pickle'>**Introduction**</font>
<font color = 'indianred'>In this lecture, we will learn what data leakage is, what problems can occur due to it, and how to resolve it.

# <font color = 'pickle'>**Import Libraries**

First, we will import all the required libraries that we will use in this lecture.

It is always a good practice to import all the required libraries initially.

In [1]:
# numerical processing
import numpy as np

# control the width of text displayed
import textwrap as tw

# get/create dataset
from sklearn.datasets import load_diabetes
from sklearn.datasets import fetch_openml
from sklearn.datasets import make_classification

# sklearn for pre-processing
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split

from sklearn.preprocessing import PolynomialFeatures, StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.compose import make_column_transformer, ColumnTransformer
from sklearn.metrics import accuracy_score

# sklearn models
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

# <font color = 'pickle'>**Common Pitfalls**

## <font color = 'pickle'>**Inconsistent Preprocessing**


* Before turning to data leakage, we first look at the problem that arises when preprocessing is applied inconsistently.
* For example, suppose we apply a preprocessing step to the training data but forget to apply it to the test data. We will see how this affects the model and how to resolve it.

Sklearn's make_classification() can be used to create dummy classification data.

In ML, we extensively use this to create dummy data, and then do training with that data to better understand the model.

A few of its parameters are as follows:

- n_samples: The total number of samples i.e. data points.
- n_features: The total number of features to be generated.
- n_informative: The total number of informative features.
- n_redundant: The total number of redundant features.
- random_state: To set random state, so that re-running the code will create exact same data.

You can find more about this from their [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html).
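
As a quick illustration of these parameters, the sketch below generates a small dataset and inspects its shape and labels. The names `X_demo` and `y_demo` and the parameter values are chosen only for this example and are not part of the lecture's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification

# 200 samples, 4 features: 2 informative, 1 redundant, and 1 left over as noise
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=1, random_state=0
)

print(X_demo.shape)        # (n_samples, n_features)
print(np.unique(y_demo))   # class labels; two classes by default
```

Because `random_state` is fixed, re-running the cell reproduces exactly the same arrays.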

In [18]:
# Creating 1000 samples with 2 features and 2 class labels using make_classification() that we imported earlier.
X12, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=7)

In [19]:
# Creating a third feature: 1000 standard-normal values scaled by 1000.
np.random.seed(0)
X3 = 1000 * np.random.standard_normal((1000, 1))

In [20]:
# Concatenating the feature
X = np.concatenate((X12,X3), axis =1)

So, we have created our randomly generated dataset.

In [21]:
print(X[:,0].mean(), X[:,1].mean(), X[:,2].mean())

0.04148292912185122 -0.000783387580940758 -45.25670749019538


Now, let's split our data into train and test set.

In [22]:
# split the data into train/test split
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=200, train_size = 0.3)

Now, let's standardize our training data and train a simple KNN model.

**Note:** StandardScaler() is used to standardize the data in such a way that it has a mean of 0 and a standard deviation of 1.
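
To make the note concrete, here is a minimal sketch on made-up data (the array `data` and its scales are assumptions for illustration only). It shows the statistics that `fit` learns and confirms that each transformed column ends up with mean close to 0 and standard deviation close to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# two columns on very different scales
data = np.column_stack([rng.normal(5, 2, 500), rng.normal(-300, 1000, 500)])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# column-wise statistics learned by fit() and reused by transform()
print(scaler.mean_, scaler.scale_)
# after scaling, every column has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```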

In [23]:
# Initializing StandardScaler that we have imported earlier.
preprocessor = StandardScaler()

In [24]:
# fit the scaler on the training data and transform it
X_train = preprocessor.fit_transform(X_train)

In [25]:
# Training KNN classification model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

In [26]:
# score on train data set
knn.score(X_train, y_train)

0.9433333333333334

In [27]:
# Score the model on test dataset
knn.score(X_test, y_test)

0.49

<font color = 'indianred'>**Question**
- <font color = 'indianred'>**Why did the model perform poorly on the test data?**

  - The KNN classifier is sensitive to the scale of the variables.
  - We scaled the train dataset but not the test dataset.
  - Since the magnitude of X3 is much higher, the distance calculations are dominated by X3. Since X3 is randomly generated and has no correlation with y, the performance on X_test is similar to random predictions, i.e. close to 50%.
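
The dominance of an unscaled feature in the distance calculation can be sketched numerically. The data below is freshly generated for illustration (it is not the lecture's `X`), and `share_of_noise` is a hypothetical helper quantity: the fraction of the total squared Euclidean distance contributed by the large-scale noise column.

```python
import numpy as np

rng = np.random.default_rng(0)
# two features on a unit scale plus one noise feature scaled by 1000
informative = rng.standard_normal((1000, 2))
noise = 1000 * rng.standard_normal((1000, 1))
points = np.concatenate((informative, noise), axis=1)

# squared Euclidean distance from the first point to all others, split by feature
sq_diff = (points[1:] - points[0]) ** 2
share_of_noise = sq_diff[:, 2].sum() / sq_diff.sum()
print(f"share of squared distance due to the noise feature: {share_of_noise:.6f}")
```

With almost all of the distance coming from the third column, the nearest neighbours are effectively chosen by noise.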


<font color = 'indianred'>**Solution:**
- Scale both X_train and X_test
- **Better approach - Use a Pipeline** (explained later in the lecture), which makes it easier to chain transformations with estimators, and reduces the possibility of forgetting to apply a transformation on Test data.


In [28]:
X_test  = preprocessor.transform(X_test)
knn.score(X_test, y_test)

0.94

## <font color = 'pickle'>**Data Leakage**

*In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment*

Source: [Wikipedia](https://en.wikipedia.org/wiki/Leakage_(machine_learning))


* Leakage means that information is revealed to the model, giving it an unrealistic advantage to make better predictions.
* This could happen when test data is leaked into the training set or when data from the future is leaked to the past. Any time a model is given information that it shouldn’t have access to when making predictions in real-time in production, there is leakage.

— Page 93, Feature Engineering for Machine Learning, 2018.

### <font color = 'pickle'>**Data Leakage - Preprocessing before train/test split**

This is an indirect form of data leakage. The model is not trained on the test dataset. However, some information from the test dataset is captured during the preprocessing step and made available to the model during training.

In [29]:
# Creating 2000 samples with 5000 standard-normal features and random binary (0/1) class labels.
np.random.seed(123)
X = np.random.standard_normal((2000, 5000))
y = np.random.choice(2, 2000)

In [30]:
y[0:5]

array([1, 1, 1, 0, 1])

In [31]:
X[0:5]

array([[-1.0856306 ,  0.99734545,  0.2829785 , ..., -1.85971515,
         0.91382219, -1.35383977],
       [ 0.3187635 ,  1.51110387, -1.13662678, ..., -0.47226641,
         0.58196437,  0.97061286],
       [-1.24096967, -0.31294679, -0.84894679, ..., -1.82934642,
         0.9741791 , -0.6933265 ],
       [ 0.90756418,  1.68521718, -1.1163093 , ..., -1.40283982,
         1.04454086,  0.36928112],
       [ 1.03159348,  1.33194488,  0.09584389, ...,  0.65930018,
        -0.29068836,  0.98800033]])

Now, let's use SelectKBest() to get the top 20 features among 5000 features.

In [35]:
# select top 20 features
X_selected = SelectKBest(k=20).fit_transform(X, y)

In [36]:
# split the data into train/test split
X_train, X_test, y_train, y_test = train_test_split( X_selected, y, random_state=200, train_size = 0.3)

In [37]:
# fit KNNClassifier on train data
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

In [38]:
knn.score(X_train, y_train)

0.695

In [39]:
knn.score(X_test, y_test)

0.5685714285714286

<font color = 'indianred'>**Questions:**
- What should be the expected score on test data?
- What is wrong in the above model?

  - X and y are randomly generated. X has no correlation with y.
  - Hence the performance of the model should not be better than random predictions.
  - In the preprocessing step, we selected the features that correlate best with y. That is why we observe higher accuracy on the training dataset.
  - However, the model should not generalize to unseen data. Hence, the accuracy of the model on the test dataset should be close to 50%.
  - By performing preprocessing on the complete dataset, we exposed information from the test dataset to the model as well. The best features were selected based on the complete dataset. This is called **"Data Leakage"**. It can artificially inflate the performance on the test dataset.


#### <font color = 'indianred'>**Solution**
- <font color = 'indianred'>**Always do preprocessing after train/test split.**</font>
- <font color = 'indianred'>**Use fit_transform on Train.**</font>
- <font color = 'indianred'>**Use only transform on Test**</font>
- <font color = 'indianred'>**If we use fit_transform on Test data then we will never know how the model performed on unseen data.**</font> If we use fit on the test dataset, we are using information from the test dataset. This is true even for pre-processing steps.
- For example, when doing mean imputation, we use the mean value calculated from the training data to impute missing values in both the train and test datasets.
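
The mean-imputation rule can be sketched with `SimpleImputer`. The toy arrays `train` and `test` below are made up for illustration; the point is that `fit_transform` learns the mean from the training data only and `transform` reuses it on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy single-feature data with missing values (made up for illustration)
train = np.array([[1.0], [2.0], [np.nan], [3.0]])
test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")
train_imputed = imputer.fit_transform(train)  # learns the mean from train only
test_imputed = imputer.transform(test)        # reuses the training mean

print(imputer.statistics_)     # mean of 1, 2, 3 -> [2.]
print(test_imputed.ravel())    # the missing test value is filled with 2.0, not 10.0
```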

**Better approach - Use a Pipeline** (explained later in the lecture), which makes it easier to chain transformations with estimators, and reduces the possibility of data leakage.

In [40]:
# split the data into train/test split
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=200, train_size = 0.3)

In [41]:
preprocessor = SelectKBest(k=20)

In [42]:
X_train_selected = preprocessor.fit_transform(X_train, y_train)

In [43]:
# note we do not need y_test in this step
# We are just selecting subset of X_test determined based on the training data
X_test_selected = preprocessor.transform(X_test)

In [44]:
# fit KNNClassifier on train data
knn = KNeighborsClassifier()
knn.fit(X_train_selected, y_train)

In [45]:
knn.score(X_train_selected, y_train)

0.7233333333333334

In [46]:
knn.score(X_test_selected, y_test)

0.49142857142857144

- Once we apply preprocessing after the train/test split and use only transform on the test dataset, we can see that the model performs as expected, i.e. it gives roughly the same performance as random prediction (50% accuracy).

### <font color = 'pickle'>**Data Leakage in Cross Validation**</font>

Let us revisit KFold Cross Validation.
- The main purpose of cross-validation is to use multiple train/validation folds and average the score across the validation folds of the splits to assess how the model will generalize to unseen data.
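
As a small refresher, the sketch below prints the index splits that `KFold` produces for ten placeholder rows (the array `samples` and the list `valid_seen` are assumptions for this example). Each row lands in the validation fold exactly once across the five splits.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # stand-in for ten rows of data
kfolds = KFold(n_splits=5, shuffle=True, random_state=0)

valid_seen = []
for fold, (train_idx, valid_idx) in enumerate(kfolds.split(samples)):
    # train and validation indices never overlap within a split
    print(f"fold {fold}: train={train_idx}, valid={valid_idx}")
    valid_seen.extend(valid_idx)

# across the five splits, every row is used for validation exactly once
print(sorted(valid_seen))
```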

<img src ="https://drive.google.com/uc?export=view&id=1LQ_9W5Xeqnj4LNuM5mPmZV3M-nYiy8Hv" width =400 >

**We will have similar data leakage issue as in the previous section, if we apply data transformation before cross validation**

In [47]:
# Generate Data
np.random.seed(123)
X = np.random.standard_normal((2000, 5000))
y = np.random.choice(2, 2000)
# split the data into train/test split
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=200, train_size = 0.3)

In [48]:
preprocessor = SelectKBest(k=20)
X_train_selected = preprocessor.fit_transform(X_train, y_train)
X_test_selected = preprocessor.transform(X_test)

# Cross Validation
knn = KNeighborsClassifier()
kfolds = KFold(n_splits = 5, random_state=0, shuffle = True)
scores = cross_val_score(knn, X_train_selected, y_train, cv=kfolds)

scores.mean()

0.5883333333333334

In [49]:
knn.fit(X_train_selected, y_train)
knn.score(X_test_selected, y_test)

0.49142857142857144

- <font color = 'indianred'>**The cross validation score is over-optimistic. The cross validation score should also be close to 50%**.
- **This happened because of data leakage in cross validation step.**</font>


**Let us look at the inner workings of the code above**

In [50]:
preprocessor = SelectKBest(k=20)
X_train_selected = preprocessor.fit_transform(X_train, y_train)
X_test_selected = preprocessor.transform(X_test)

kfolds = KFold(n_splits = 5, random_state=0, shuffle = True)
scores =[]
for train, valid in kfolds.split(X_train_selected, y_train):
  knn = KNeighborsClassifier().fit(X_train_selected[train], y_train[train])

  # the model is evaluated on X_train_selected[valid]
  score = knn.score(X_train_selected[valid], y_train[valid])
  scores.append(score)

np.mean(scores)

0.5883333333333334

<font color = 'indianred'>

- **In cross-validation, the model was evaluated on X_train_selected[valid].**
- **This does not reflect the performance on unseen data. The model has already seen X_train_selected[valid] during the preprocessing step.**
- **The features were selected based on the complete X_train.**
</font>

#### <font color = 'pickle'>**Solution : Need to include pre-processing inside  CV**

In [51]:
# let us do pre-processing inside the CV loop now.

kfolds = KFold(n_splits = 5, random_state=0, shuffle = True)
preprocessor = SelectKBest(k=20)
scores =[]
for train, valid in kfolds.split(X_train, y_train):

  # fit and transform train fold using preprocessor
  X_KBest_train_fold = preprocessor.fit_transform(X_train[train], y_train[train])

  # in each iteration the KBest features are selected only based on train fold
  # the same selected Kbest features are used to transform valid fold
  X_KBest_valid_fold = preprocessor.transform(X_train[valid])

  # output of preprocessor will become input to classifier
  # we fit model on X_KBest_train_fold
  knn = KNeighborsClassifier().fit(X_KBest_train_fold, y_train[train])

  # the scores are calculated on valid fold
  # the model has now never seen the valid fold, hence will reflect the ability of model to
  # generalize on unseen data
  score = knn.score(X_KBest_valid_fold, y_train[valid])
  scores.append(score)

print('Mean Cross Validation Score')
np.mean(scores)

Mean Cross Validation Score


0.49000000000000005

In [52]:
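# refit selector and classifier on the full training set; the knn left over from the loop saw only the last fold
knn = KNeighborsClassifier().fit(preprocessor.fit_transform(X_train, y_train), y_train)
print('Test Score')
knn.score(preprocessor.transform(X_test), y_test)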

Test Score


0.49142857142857144

- In the above example, pre-processing was moved inside the cross-validation
- The pre-processing was done based on X_train[train]
- The best features were selected based on X_train[train]
- The model was fitted on X_train[train]

- The same two steps are applied on X_train[valid]
- However, the features are selected based on X_train[train]
- We only used the transform method on X_train[valid]
- The model has never seen X_train[valid]
- Thus, the cross-validation score reflected the model's ability to generalize on unseen data
- Many pre-processing steps like imputing missing values with mean use statistics from training data. These pre-processing steps should be done inside the cross-validation loop.
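
The same pattern applies to mean imputation. The sketch below is an illustration on made-up data with randomly removed values (`X_demo`, `y_demo`, and the 10% missing rate are assumptions): the imputer is fitted on each training fold and only applied to the matching validation fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 3))
y_demo = rng.integers(0, 2, 100)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # knock out ~10% of the values

kfolds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train, valid in kfolds.split(X_demo, y_demo):
    imputer = SimpleImputer(strategy="mean")
    # column means are computed from the training fold only
    X_train_fold = imputer.fit_transform(X_demo[train])
    # the validation fold is filled with those same training-fold means
    X_valid_fold = imputer.transform(X_demo[valid])
    knn = KNeighborsClassifier().fit(X_train_fold, y_demo[train])
    scores.append(knn.score(X_valid_fold, y_demo[valid]))

print(np.mean(scores))
```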

### <font color = 'pickle'>**Incorrect Hyperparameter Tuning**
<font color = 'indianred'>**Data leakage in Cross Validation step of Hyperparameter Tuning using GridSearch**

In [53]:
np.random.seed(123)
X = np.random.standard_normal((2000, 5000))
y = np.random.choice(2, 2000)
# split the data into train/test split
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=200, train_size = 0.3)

In [54]:
preprocessor = SelectKBest(k=20)
X_train_selected = preprocessor.fit_transform(X_train, y_train)
X_test_selected = preprocessor.transform(X_test)

kfolds = KFold(n_splits = 5, random_state=0, shuffle = True)

# giving the param_grid values
param_grid = {'n_neighbors':  np.arange(1, 16, 2)}

# Using GridSearchCV for kNN classification and returning the train_score as True
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=kfolds,
                   return_train_score=True)

# Now fit GridSearchCV on the pre-selected training features (X_train_selected, y_train)
grid.fit(X_train_selected, y_train)

In [55]:
# grid.best_score_ gives the mean cross-validation score of the best parameter setting
# grid.best_params_ gives the best parameter setting, i.e. n_neighbors
# grid.score(data) gives the score on the data after refitting the model on the
# complete training data using the best hyperparameters

print(f"best mean cross-validation score: {grid.best_score_}")
print(f"best parameters: {grid.best_params_}")

# We can check the accuracy score of training dataset and test dataset.
print(f"train-set score: {grid.score(X_train_selected, y_train):.3f}")
print(f"test-set score: {grid.score(X_test_selected, y_test):.3f}")

best mean cross-validation score: 0.6216666666666667
best parameters: {'n_neighbors': 15}
train-set score: 0.695
test-set score: 0.489


<font color = 'indianred'>**- Here again the cross validation score is over-optimistic**

<font color = 'indianred'>**- Let us look at the inner working of this code**

In [56]:
# split the data into train/test split
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=200, train_size = 0.3)

preprocessor = SelectKBest(k=20)
X_train_selected = preprocessor.fit_transform(X_train, y_train)
X_test_selected = preprocessor.transform(X_test)

# create empty list to store cross validation scores
cross_val_scores = []
kfolds = KFold(n_splits = 5, random_state=0, shuffle = True)

# Taking k values ranging from 1 to 15 with a step of 2
neighbors = np.arange(1, 16, 2)
for i in neighbors:
    knn = KNeighborsClassifier(n_neighbors=i)

    scores = cross_val_score(knn, X_train_selected, y_train, cv=kfolds)

    # scores will give us five values corresponding to five validation splits
    # We will take mean of these five values and append the mean value to the cross_val_scores list
    cross_val_scores.append(np.mean(scores))

# take the highest mean cross-validation score using max()
print(f"best cross-validation score: {np.max(cross_val_scores):.3}")

# pick the value of n_neighbors that gives the maximum mean cross-validation score
best_n_neighbors = neighbors[np.argmax(cross_val_scores)]
print(f"best_value_of_k: {best_n_neighbors}")

# Retrain the model with the best_value_of_k

knn = KNeighborsClassifier(n_neighbors = best_n_neighbors)
knn.fit(X_train_selected, y_train)

best cross-validation score: 0.622
best_value_of_k: 15


**Summary: GridSearchCV (Grid Search Cross Validation to find best parameters)**

<img src ="https://drive.google.com/uc?export=view&id=1iK80BvXepRL1xHwJWqQa14BiNhJMVH9S" width =600 >

In [57]:
print(f"best mean cross-validation score: {np.max(cross_val_scores):.3}")
print(f"best parameters: {best_n_neighbors}")

# We can check the accuracy score of training dataset and test dataset.
print(f"train-set score: {knn.score(X_train_selected, y_train):.3f}")
print(f"test-set score: {knn.score(X_test_selected, y_test):.3f}")

best mean cross-validation score: 0.622
best parameters: 15
train-set score: 0.695
test-set score: 0.489


<font color = 'dodgerblue'>- **The best hyperparameter is selected based on mean of cross_val_score.**

<font color = 'dodgerblue'>- **However, as seen earlier, the cross_val_score is incorrect**

<font color = 'indianred'>- **Key Takeaway - Chain together pre-processing steps and classification/regression steps. The combined preprocessing steps and classification/regression steps should be considered as the final model**

## <font color = 'pickle'>**What is Pipeline**

<font color = 'indianred'>**Pipeline**</font> is a simple way of combining pre-processing and modeling steps so you can use the combination as if it were a single step. This will help us to counter both problems - **Inconsistent Preprocessing and Data Leakage**

<font color = 'indianred'>**Advantages of a Pipeline:**</font>

- <font color = 'indianred'>**Cleaner Code and reduce data leakage**:</font> We may have to apply multiple pre-processing steps. For example, mean imputation followed by variable transformation. Manually keeping track of training and validation folds at each stage can get messy and increase the chance of data leakage. Using pipelines will reduce the likelihood of data leakage significantly.
- <font color = 'indianred'>**Avoid inconsistent pre-processing**:</font> You are less likely to forget to apply a pre-processing step to a test or newer dataset. Hence pipelines can help us to avoid inconsistent pre-processing.
- <font color = 'indianred'>**Easier to Productionize**:</font> Since everything is done in one step, it becomes easier to deploy the model in **production pipelines**.
- <font color = 'indianred'>**More Options for Model Validation**:</font> We can optimize choices in pre-processing and classification/regression steps together.
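
As a minimal sketch of these advantages, the example below chains imputation, scaling, and a classifier and then uses the result as a single estimator. The toy arrays and the name `demo_model` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# tiny made-up dataset with one missing value
X_demo = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 180.0],
                   [8.0, 900.0], [9.0, 950.0], [10.0, 990.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

# imputation -> scaling -> classifier, used as if it were one estimator
demo_model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=3),
)
demo_model.fit(X_demo, y_demo)

# raw, unprocessed rows go straight in; every fitted step is applied automatically
new_rows = np.array([[2.5, np.nan], [9.5, 940.0]])
print(demo_model.predict(new_rows))
```

No separate call to the imputer or the scaler is needed at prediction time, which is what makes it hard to skip a step by accident.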

## <font color = 'pickle'>**Pipelines - Use Cases**

Till now, we have learned some common pitfalls like inconsistent pre-processing, data leakage issues in various scenarios like in cross-validation, during hyperparameter tuning, etc.

So, now we will learn how to create pipelines that resolve these issues.


**Note:** *In the explanations below, we may use machine learning algorithms like Ridge, Lasso, LogisticRegression, etc. In this lecture, we will not learn about these models; they are used only to show how to create pipelines with various ML models. The implementation of these machine learning models will be explained in future lectures.*

### <font color = 'pickle'>**Pipeline for inconsistent pre-processing**

In [58]:
# Generate dataset
X12, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=7)
X3 = 1000 * np.random.standard_normal((1000, 1))
X = np.concatenate((X12,X3), axis =1)
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=200, train_size = 0.3)

In [59]:
# specify the preprocessing step and model
preprocessor = StandardScaler()
knn = KNeighborsClassifier()
# chain preprocessing step and model into one step using pipeline

model = make_pipeline(preprocessor, knn)

- Pipeline chains together multiple transformations and final estimator in one step.
- The transformations and estimators are applied sequentially.
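
To illustrate what "applied sequentially" means, the sketch below compares a hand-written scale-then-classify sequence with the equivalent pipeline on a small generated dataset. All names here (`X_demo`, `pipe`, `knn_manual`, and so on) are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# manual version: fit_transform on train, transform on test, then fit and predict
scaler = StandardScaler()
knn_manual = KNeighborsClassifier().fit(scaler.fit_transform(X_tr), y_tr)
manual_pred = knn_manual.predict(scaler.transform(X_te))

# pipeline version: the same sequence, handled by fit() and predict()
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

print(np.array_equal(manual_pred, pipe_pred))
```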

In [None]:
# fit the model on train dataset
model.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kneighborsclassifier', KNeighborsClassifier())])

In [None]:
print(f'Model score on training dataset: {model.score(X_train, y_train)}')
print(f'Model score on test dataset: {model.score(X_test, y_test)}')

Model score on training dataset: 0.9433333333333334
Model score on test dataset: 0.9471428571428572


**Since we have chained everything into one step, it is hard to forget to apply the pre-processing step to the test data.**

### <font color = 'pickle'>**Pipeline for  Data leakage**

In [None]:
np.random.seed(123)
X = np.random.standard_normal((2000, 5000))
y = np.random.choice(2, 2000)

In [None]:
# make a pipeline to combine SelectKBest and KNeighborsClassifier() in to one step
model = make_pipeline(SelectKBest(k=20), KNeighborsClassifier())

In [None]:
# split the data into train/test split
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=200, train_size = 0.3)

In [None]:
model.fit(X_train, y_train)
print(f'Model score on training dataset: {model.score(X_train, y_train)}')
print(f'Model score on test dataset: {model.score(X_test, y_test)}')

Model score on training dataset: 0.7233333333333334
Model score on test dataset: 0.49142857142857144


<font color = 'indianred'>**Since we fit the combined model (preprocessing + classifier) on training data, the features are selected only based on the training data. When we call the score, both the steps are applied to the test dataset.**



### <font color = 'pickle'>**Pipeline - Data leakage in cross validation**
**Task** - Redo the cross validation in section "Data Leakage in Cross Validation" using Pipelines. Explain how using pipelines will resolve the data leakage.

In [None]:
kfolds = KFold(n_splits = 5, random_state=0, shuffle = True)
scores = cross_val_score(model, X_train, y_train, cv=kfolds)
scores.mean()

0.49000000000000005

- In each cross-validation split, the model (pre-processor + classifier) is fitted on the training fold.

- Therefore, during training, the model is never exposed to data from the validation fold.
- The scores are then calculated by applying the trained model (pre-processor + classifier) on the validation fold.

- Thus, the cross-validation score gives a good approximation of how the model will perform on the unseen data.
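
The bullets above can be checked with a small experiment: passing a pipeline to `cross_val_score` should give the same fold scores as a hand-written loop that refits the selector on every training fold. The sketch uses a smaller noise dataset than the lecture (200 x 500, an assumption made to keep it fast).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 500))   # pure noise features
y_demo = rng.integers(0, 2, 200)           # labels unrelated to X_demo

kfolds = KFold(n_splits=5, shuffle=True, random_state=0)
pipe = make_pipeline(SelectKBest(k=20), KNeighborsClassifier())
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=kfolds)

# what cross_val_score does with a pipeline, written out by hand
manual_scores = []
for train, valid in kfolds.split(X_demo, y_demo):
    selector = SelectKBest(k=20)
    X_tr = selector.fit_transform(X_demo[train], y_demo[train])
    knn = KNeighborsClassifier().fit(X_tr, y_demo[train])
    manual_scores.append(knn.score(selector.transform(X_demo[valid]), y_demo[valid]))

print(np.allclose(pipe_scores, manual_scores))
```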

### <font color = 'pickle'>**Pipeline and GridSearchCV - Correct Hyperparameter Tuning**

**Task** - Redo the hyperparameter tuning in section "Incorrect Hyperparameter Tuning" using Pipelines. Explain how using pipelines will resolve the data leakage.

In [None]:
model = make_pipeline(SelectKBest(k=20), KNeighborsClassifier())
# giving the param_grid values
param_grid = {'kneighborsclassifier__n_neighbors':  np.arange(1, 16, 2)}

# Using GridSearchCV for kNN classification and returning the train_score as True
grid = GridSearchCV(model, param_grid=param_grid, cv=kfolds,
                   return_train_score=True)

# Now fit the  GridSearchCV on the X_train, y_train by using fit() method
grid.fit(X_train, y_train)

GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=Pipeline(steps=[('selectkbest', SelectKBest(k=20)),
                                       ('kneighborsclassifier',
                                        KNeighborsClassifier())]),
             param_grid={'kneighborsclassifier__n_neighbors': array([ 1,  3,  5,  7,  9, 11, 13, 15])},
             return_train_score=True)

In [None]:
print(f"best mean cross-validation score: {grid.best_score_}")
print(f"best parameters: {grid.best_params_}")

# We can check the accuracy score of training dataset and test dataset.
print(f"train-set score: {grid.score(X_train, y_train):.3f}")
print(f"test-set score: {grid.score(X_test, y_test):.3f}")

best mean cross-validation score: 0.5066666666666666
best parameters: {'kneighborsclassifier__n_neighbors': 3}
train-set score: 0.782
test-set score: 0.496


- Since we combined preprocessing and classifier into one step, there is no data leakage in the cross-validation step that is used to find the hyperparameters.
- We no longer see over-optimistic best cross-validation score.
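
The parameter name `kneighborsclassifier__n_neighbors` used in the grid follows the `<step name>__<parameter name>` convention. A short sketch of how those names can be discovered (the variable names `demo_model` and `tunable` are made up for this example):

```python
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

demo_model = make_pipeline(SelectKBest(k=20), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
print(list(demo_model.named_steps))

# tunable parameters are addressed as <step name>__<parameter name>
tunable = [name for name in demo_model.get_params() if "__" in name]
print(tunable)

# the same names work with set_params, so they can be used as param_grid keys
demo_model.set_params(selectkbest__k=10, kneighborsclassifier__n_neighbors=3)
```

Since `selectkbest__k` is addressable in the same way, the number of selected features could be tuned in the same grid as `n_neighbors`.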

### <font color = 'pickle'>**Optimizing PreProcessing and Classifier together**

**Task**
- Create a pipeline where we add polynomial features (PolynomialFeatures()), followed by scaling (StandardScaler()), and finally KNeighborsRegressor().
- Optimize Polynomial features (degree of 1, 2, and 3) and KNeighborsRegressor (n_neighbors = 1 to 9) jointly in a single pipeline. Here you will evaluate different combinations of pre-processing steps and KNeighborsRegressor.

In [None]:
diabetes = load_diabetes()
print(tw.fill(diabetes.DESCR, 100))
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, random_state=1)

.. _diabetes_dataset:  Diabetes dataset ----------------  Ten baseline variables, age, sex, body
mass index, average blood pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a quantitative measure of disease
progression one year after baseline.  **Data Set Characteristics:**    :Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values    :Target: Column 11 is a
quantitative measure of disease progression one year after baseline    :Attribute Information:
- age     age in years       - sex       - bmi     body mass index       - bp      average blood
pressure       - s1      tc, total serum cholesterol       - s2      ldl, low-density lipoproteins
- s3      hdl, high-density lipoproteins       - s4      tch, total cholesterol / HDL       - s5
ltg, possibly log of serum triglycerides level       - s6      glu, blood sugar level  Note: Each of
these 10 feature variables 

In [None]:
model  = make_pipeline(PolynomialFeatures(),
                       StandardScaler(),
                       KNeighborsRegressor()
                      )
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'kneighborsregressor__n_neighbors': range(1, 10)}
grid = GridSearchCV(model, param_grid=param_grid, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures()),
                                       ('standardscaler', StandardScaler()),
                                       ('kneighborsregressor',
                                        KNeighborsRegressor())]),
             n_jobs=-1,
             param_grid={'kneighborsregressor__n_neighbors': range(1, 10),
                         'polynomialfeatures__degree': [1, 2, 3]},
             return_train_score=True)

In [None]:
print(f"best mean cross-validation score: {grid.best_score_}")
print(f"best parameters: {grid.best_params_}")

# We can check the score (R^2 for regressors) on the training dataset and test dataset.
print(f"train-set score: {grid.score(X_train, y_train):.3f}")
print(f"test-set score: {grid.score(X_test, y_test):.3f}")

best mean cross-validation score: 0.41862723410123037
best parameters: {'kneighborsregressor__n_neighbors': 9, 'polynomialfeatures__degree': 1}
train-set score: 0.550
test-set score: 0.414


### <font color = 'pickle'>**Using Named Steps and Multiple Models**
**Task**: Create a pipeline with the following
- scale all variables followed by regression.
  - for scaling, the pipeline should evaluate StandardScaler(), MinMaxScaler(), 'passthrough' as options.
  - the pipeline should evaluate Ridge(), Lasso() as options for regression
  - Both Ridge() and Lasso() have a hyperparameter alpha. Specify the range for this hyperparameter using np.logspace().

In [None]:
model = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])

param_grid = {'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough'],
              'regressor': [Ridge(), Lasso()],
              'regressor__alpha': np.logspace(-3, 3, 7)}


grid = GridSearchCV(model, param_grid, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('regressor', Ridge())]),
             n_jobs=-1,
             param_grid={'regressor': [Ridge(), Lasso()],
                         'regressor__alpha': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'scaler': [StandardScaler(), MinMaxScaler(),
                                    'passthrough']},
             return_train_score=True)
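
The grid above works because a named step can itself be treated as a parameter. The sketch below shows the same mechanism by hand with `set_params` (the name `demo_model` is an assumption for this example): a step is swapped by name, disabled with `'passthrough'`, and its own hyperparameter is reached with the double-underscore syntax.

```python
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

demo_model = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])

# a whole step can be swapped by using its name as the parameter
demo_model.set_params(scaler=MinMaxScaler(), regressor=Lasso())
print(demo_model.named_steps["regressor"])

# 'passthrough' disables a step: data flows through it unchanged
demo_model.set_params(scaler="passthrough")
print(demo_model.named_steps["scaler"])

# parameters of the current step are still reached with <name>__<parameter>
demo_model.set_params(regressor__alpha=0.1)
print(demo_model.named_steps["regressor"].alpha)
```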

In [None]:
print(f"best mean cross-validation score: {grid.best_score_}")
print(f"best parameters: {grid.best_params_}")

# We can check the score (R^2 for regressors) on the training dataset and test dataset.
print(f"train-set score: {grid.score(X_train, y_train):.3f}")
print(f"test-set score: {grid.score(X_test, y_test):.3f}")

best mean cross-validation score: 0.48071210439622797
best parameters: {'regressor': Ridge(), 'regressor__alpha': 1.0, 'scaler': MinMaxScaler()}
train-set score: 0.533
test-set score: 0.438


### <font color = 'pickle'>**Multiple Models with Different Hyper Parameters**

**Task**: Create a pipeline with the following

- The pipeline should evaluate the following two options
  - (1) scaler followed by Ridge(). For Ridge regression, you will tune the hyperparameter alpha and evaluate the following values: [0.1, 1]. For the scaler, give the following options: StandardScaler(), MinMaxScaler(), 'passthrough'

  - (2) scaler followed by DecisionTreeRegressor(). For DecisionTreeRegressor(), you will tune the hyperparameter max_depth and evaluate the following values: [2, 3, 4]. For the scaler you will only use 'passthrough'.


In [None]:
model  = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])

param_grid = [{'regressor': [DecisionTreeRegressor()],
               'regressor__max_depth': [2, 3, 4],
               'scaler': ['passthrough']},
              {'regressor': [Ridge()],
               'regressor__alpha': [0.1, 1],
               'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough']}
             ]
grid = GridSearchCV(model, param_grid, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('regressor', Ridge())]),
             n_jobs=-1,
             param_grid=[{'regressor': [DecisionTreeRegressor()],
                          'regressor__max_depth': [2, 3, 4],
                          'scaler': ['passthrough']},
                         {'regressor': [Ridge(alpha=1)],
                          'regressor__alpha': [0.1, 1],
                          'scaler': [StandardScaler(), MinMaxScaler(),
                                     'passthrough']}],
             return_train_score=True)

In [None]:
print(f"best mean cross-validation score: {grid.best_score_}")
print(f"best parameters: {grid.best_params_}")

# grid.score() uses the refit best pipeline; for a regressor this is R^2, not accuracy.
print(f"train-set score: {grid.score(X_train, y_train):.3f}")
print(f"test-set score: {grid.score(X_test, y_test):.3f}")

best mean cross-validation score: 0.48071210439622797
best parameters: {'regressor': Ridge(alpha=1), 'regressor__alpha': 1, 'scaler': MinMaxScaler()}
train-set score: 0.533
test-set score: 0.438
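
The best score alone hides how the other candidates performed. The sketch below is one way to line up every candidate from a grid like the one above by reading `grid.cv_results_` into a DataFrame. It assumes the diabetes data as a stand-in for `X_train` and `y_train`; the `model` and `scaler_name` columns are helper names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Assumption: the diabetes data stands in for the X_train / y_train used above.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])
param_grid = [
    {"regressor": [DecisionTreeRegressor(random_state=0)],
     "regressor__max_depth": [2, 3, 4],
     "scaler": ["passthrough"]},
    {"regressor": [Ridge()],
     "regressor__alpha": [0.1, 1],
     "scaler": [StandardScaler(), MinMaxScaler(), "passthrough"]},
]
grid = GridSearchCV(model, param_grid, return_train_score=True)
grid.fit(X_train, y_train)

# One row per candidate: 3 tree settings + 2 alphas x 3 scalers.
results = pd.DataFrame(grid.cv_results_)

# Replace estimator objects with readable class names (illustrative helper columns).
results["model"] = results["param_regressor"].map(lambda est: type(est).__name__)
results["scaler_name"] = results["param_scaler"].map(
    lambda s: s if isinstance(s, str) else type(s).__name__
)

summary = results[
    ["model", "scaler_name", "mean_train_score", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate is a hint that it overfits, which is why `return_train_score=True` is worth setting while exploring.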


###  <font color = 'pickle'>**Different PreProcessing Steps for Different Variables**

You can download the Titanic dataset using the command below and read its description at https://www.openml.org/d/40945

In [None]:
X, y = fetch_openml("Titanic", version=1, as_frame=True, return_X_y=True)

In [None]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1309 non-null   float64 
 1   name       1309 non-null   object  
 2   sex        1309 non-null   category
 3   age        1046 non-null   float64 
 4   sibsp      1309 non-null   float64 
 5   parch      1309 non-null   float64 
 6   ticket     1309 non-null   object  
 7   fare       1308 non-null   float64 
 8   cabin      295 non-null    object  
 9   embarked   1307 non-null   category
 10  boat       486 non-null    object  
 11  body       121 non-null    float64 
 12  home.dest  745 non-null    object  
dtypes: category(2), float64(6), object(5)
memory usage: 115.4+ KB


In [None]:
# Keep four predictors that have no missing values
X = X[['pclass', 'sex', 'sibsp', 'parch']]
categorical = ['sex']
continuous = ['pclass', 'sibsp', 'parch']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 981 entries, 747 to 684
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   pclass  981 non-null    float64 
 1   sex     981 non-null    category
 2   sibsp   981 non-null    float64 
 3   parch   981 non-null    float64 
dtypes: category(1), float64(3)
memory usage: 31.7 KB


#### <font color = 'pickle'>**Column Transformer**
Task: Logistic regression with a standard scaler (for continuous variables only) and a one-hot encoder (for the categorical variable only).

In [None]:
preprocess1= make_column_transformer(
    (StandardScaler(),continuous),
    (OneHotEncoder(drop='first'), categorical),
    remainder='passthrough'
)

In [None]:
from sklearn.linear_model import LogisticRegression
model = make_pipeline( preprocess1, LogisticRegression())

In [None]:
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.7865853658536586

#### <font color = 'pickle'>**Feature Engine**

In [None]:
!pip install feature_engine -qq


In [None]:
from feature_engine.encoding import OneHotEncoder as fe_ohe
from feature_engine.wrappers import SklearnTransformerWrapper

In [None]:
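model = Pipeline([
    # feature_engine encoders operate on named DataFrame columns
    ('one_hot_encoder',
     fe_ohe(variables=categorical, drop_last=True, ignore_format=True)),
    # wrap the sklearn scaler so it is applied to the continuous columns only
    ('scaler',
     SklearnTransformerWrapper(StandardScaler(), variables=continuous)),
    ('logreg',
     LogisticRegression())
])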

In [None]:
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.7865853658536586

### <font color = 'pickle'>**Different PreProcessing Steps and Different Models**

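Task: (1) Decision tree with a one-hot encoder (categorical) and (2) Ridge regression with a standard scaler (continuous) and a one-hot encoder (categorical). Both estimators are regressors, so the score reported below is R^2 rather than accuracy.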

In [None]:
preprocess1= make_column_transformer(
    (OneHotEncoder(drop='first'), categorical),
    remainder='passthrough'
)

In [None]:
preprocess2 = make_column_transformer(
    (StandardScaler(),continuous),
    (OneHotEncoder(drop='first'), categorical),
    remainder='passthrough'
)

In [None]:
model  = Pipeline([('preprocessor', preprocess2), ('regressor', Ridge())])

param_grid = [{'regressor': [DecisionTreeRegressor()],
               'preprocessor' : [preprocess1]},

              {'regressor': [Ridge()],
               'preprocessor' : [preprocess2]}

             ]

In [None]:
grid = GridSearchCV(model, param_grid, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)

0.3341705140118807
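
The grid above pairs regressors with a class label. The same pattern of swapping the preprocessor and the estimator together also works with classifiers, in which case `grid.score` reports accuracy. The sketch below is a minimal illustration on synthetic data, not the lecture's dataset: the frame, the label rule, and the names `tree_prep` and `knn_prep` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic subset (assumption, not real data).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "pclass": rng.integers(1, 4, n).astype(float),
    "sex": pd.Categorical(rng.choice(["female", "male"], n)),
    "sibsp": rng.integers(0, 4, n).astype(float),
    "parch": rng.integers(0, 3, n).astype(float),
})
# Hypothetical label: follows 'sex', with roughly 15% of labels flipped.
signal = (X["sex"] == "female").to_numpy()
flip = rng.random(n) < 0.15
y = np.where(flip, ~signal, signal).astype(int)

categorical = ["sex"]
continuous = ["pclass", "sibsp", "parch"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees do not need scaling; KNN is distance based and does.
tree_prep = make_column_transformer(
    (OneHotEncoder(drop="first"), categorical), remainder="passthrough"
)
knn_prep = make_column_transformer(
    (StandardScaler(), continuous), (OneHotEncoder(drop="first"), categorical)
)

model = Pipeline([("preprocessor", knn_prep), ("classifier", KNeighborsClassifier())])
param_grid = [
    {"preprocessor": [tree_prep],
     "classifier": [DecisionTreeClassifier(random_state=0)],
     "classifier__max_depth": [2, 3]},
    {"preprocessor": [knn_prep],
     "classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": [3, 5]},
]

grid = GridSearchCV(model, param_grid)
grid.fit(X_train, y_train)

best_name = type(grid.best_params_["classifier"]).__name__
test_accuracy = grid.score(X_test, y_test)
print(best_name, round(test_accuracy, 3))
```

Because each dictionary in `param_grid` fixes its own preprocessor, the tree never sees scaled inputs and KNN never sees unscaled ones, and every preprocessing step is still fit inside each cross-validation fold.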