# Hyperparameter Optimization in Machine Learning Models
This tutorial explains what parameters and hyperparameters are in a machine learning model and why tuning hyperparameters is vital for improving your model’s performance.
Machine learning involves predicting and classifying data, and to do so you employ various machine learning models depending on the dataset. Machine learning models are **parameterized** so that their behavior can be **tuned** for a given problem. These models can have many parameters, and finding the best combination of parameters can be treated as a search problem. The term parameter may appear unfamiliar if you are new to applied machine learning, but don’t worry! You will get to know it in the very first section of this blog, and you will also discover the difference between a **parameter** and a **hyperparameter** of a machine learning model. This blog consists of the following sections:

- What are a parameter and a hyperparameter in a machine learning model?
- Why is hyperparameter optimization/tuning vital for enhancing your model’s performance?
- Two simple strategies to optimize/tune the hyperparameters
- A simple case study in Python with the two strategies

Let’s jump straight into the first section!

## What is a parameter in a machine learning model?
A model parameter is a **configuration variable** that is internal to the model and whose value can be estimated from the given data.

- They are required by the model when making predictions.
- Their values define the skill of the model on your problem.
- They are estimated or learned from data.
- They are often not set manually by the practitioner.
- They are often saved as part of the learned model.

So your main takeaway from the above points should be that parameters are crucial to machine learning algorithms, and that they are the part of the model that is **learned from historical training data**. Let’s dig a bit deeper. Think of the function parameters that you use while programming in general: you may pass a parameter to a function, and in that case a parameter is a function argument that could have one of a range of values. In machine learning, the specific model you are using is the function, and it requires parameters in order to make a prediction on new data. Whether a model has a fixed or variable number of parameters determines whether it is referred to as “parametric” or “nonparametric”.

Some examples of model parameters include:
- The weights in an artificial neural network.
- The support vectors in a support vector machine.
- The coefficients in a linear regression or logistic regression.

__NOTE__: Determining the best set of parameters is the essential step of training __any__ machine learning model. Even with the same modeling technique and the same training data, model performance can differ, because different parameters may be learned.
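
To make "learned from data" concrete, here is a minimal sketch that estimates the two parameters of a straight line (slope and intercept) with ordinary least squares. The toy data and the helper name `fit_line` are assumptions made for illustration only; they are not part of the case study later on.

```python
# Toy data generated by y = 2x + 1 (hypothetical values for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def fit_line(xs, ys):
    """Estimate slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Nobody sets these two numbers by hand: they are computed from the data.
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The practitioner never chooses `slope` or `intercept` directly, which is exactly what makes them parameters.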

## What is a hyperparameter in a machine learning model?
A model hyperparameter is a configuration that is __external__ to the model and whose value __cannot be estimated__ from data.

- They are often used in processes to help estimate model parameters.
- They are often specified by the practitioner (manually).
- They can often be set using heuristics (rules-of-thumb).
- They are often tuned for a given predictive modeling problem (problem-specific).

You cannot know the best value for a model hyperparameter on a given problem in advance. You may:
1. use rules of thumb, 
2. copy values used on other problems, 
3. or search for the best value by trial and error. 

When a machine learning algorithm is tuned for a specific problem then essentially you are tuning the hyperparameters of the model to discover the parameters of the model that result in the most skillful predictions. 

According to a very popular book called “Applied Predictive Modeling”: “Many models have important parameters which cannot be directly estimated from the data. For example, in the K-nearest neighbor classification model … This type of model parameter is referred to as a tuning parameter because there is no analytical formula available to calculate an appropriate value.” 

Model hyperparameters are often referred to as model parameters, which can make things confusing. A good rule of thumb to overcome this confusion is as follows: “If you have to specify a model parameter **manually**, then it is probably a model hyperparameter.” Some examples of model hyperparameters include:

- The learning rate for training a neural network.
- The C and sigma hyperparameters for support vector machines.
- The k in k-nearest neighbors.
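
As a small illustration of the last example, the sketch below implements a toy k-nearest-neighbors classifier in plain Python. The training points, the query value `1.8` and the function name `knn_predict` are all hypothetical; the point is only that `k` is chosen by you, not learned, and that changing it can change the prediction.

```python
from collections import Counter

# Hypothetical 1-D training set: (feature value, class label).
train = [(1.0, "a"), (2.0, "a"), (3.0, "b"), (6.0, "b"), (7.0, "b")]

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Same data, same query, different hyperparameter value.
pred_k1 = knn_predict(train, 1.8, k=1)
pred_k5 = knn_predict(train, 1.8, k=5)
print(pred_k1, pred_k5)
```

With `k=1` only the closest point votes, while with `k=5` every point votes, so the majority class of the whole training set wins.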

In the next section, you will discover the importance of the right set of hyperparameter values in a machine learning model.

## Importance of the right set of hyperparameter values in a machine learning model:

The best way to think about hyperparameters is like the settings of an algorithm that can be adjusted to optimize performance, just as you might *turn the knobs of an AM radio to get a clear signal*. When creating a machine learning model, you'll be presented with design choices as to how to define your model architecture. Often, you don't immediately know what the **optimal** model architecture should be for a given model, and thus you'd like to be able to explore a range of possibilities. In a true machine learning fashion, you’ll ideally ask the machine to perform this exploration and select the optimal model architecture automatically.

__NOTE__: Do not forget that feature engineering, aside from hyperparameter optimization, is another way of tweaking models for better performance.

You will see in the case study section how the right choice of hyperparameter values affects the performance of a machine learning model. In this context, choosing the right set of values is typically known as “Hyperparameter optimization” or “Hyperparameter tuning”.

## Two simple strategies to optimize/tune the hyperparameters:

Models can have many hyperparameters, and finding the best combination of hyperparameters can be treated as a search problem.

Although there are many hyperparameter optimization/tuning algorithms now, this post discusses two simple strategies: 

1. Grid search and 
2. Random search.

### Grid searching of hyperparameters:
Grid search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid. 

Let’s consider the following example: 

Suppose a machine learning model `X` takes hyperparameters `a1`, `a2` and `a3`. In grid searching, you first define the range of values for each of the hyperparameters `a1`, `a2` and `a3`. You can think of this as an array of values for each of the hyperparameters. __NOTE__ that the ranges are usually heuristic. Now the grid search technique will construct many versions of `X` with all the possible combinations of hyperparameter (`a1`, `a2` and `a3`) values that you defined in the first place. This range of hyperparameter values is referred to as the grid. 

Suppose you defined the grid as follows (__note__ that a grid is a discrete set of candidate values, not a continuous range):
`a1 = [0,1,2,3,4,5]`
`a2 = [10,20,30,40,50,60]`
`a3 = [105,110,115,120,125]`

Note that the values you supply for each hyperparameter have to be legitimate, in the sense that you cannot supply floating-point values if the hyperparameter only accepts integers.

Now, grid search will begin constructing several versions of `X` from the grid that you just defined.

It will start with the combination `[0,10,105]` and end with `[5,60,125]`, going through every intermediate combination between these two (`6 × 6 × 5 = 180` models in total), which makes grid search computationally very expensive.
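
The enumeration that grid search performs can be sketched with `itertools.product`. This is only an illustration of how the combinations are generated for a grid like the one above (here with five distinct values for `a3`); it is not how any particular library implements its search.

```python
from itertools import product

# One list of candidate values per hyperparameter.
grid = {
    "a1": [0, 1, 2, 3, 4, 5],
    "a2": [10, 20, 30, 40, 50, 60],
    "a3": [105, 110, 115, 120, 125],
}

# Cartesian product: every value of a1 paired with every value of a2 and a3.
combos = list(product(*grid.values()))

print(len(combos))            # number of models grid search has to build
print(combos[0], combos[-1])  # first and last combination visited
```

Adding one more hyperparameter with only four candidate values would multiply the number of models by four, which is why grids grow expensive so quickly.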

Let’s take a look at the other search technique, random search:

### Random searching of hyperparameters:
The idea of random searching of hyperparameters was proposed by James Bergstra & Yoshua Bengio. You can check the original paper [here](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf).

Random search differs from grid search in that you no longer provide a discrete set of values to explore for each hyperparameter; rather, you provide a statistical distribution for each hyperparameter from which values may be randomly sampled.

Before going any further, let’s understand what distribution and sampling mean:

In statistics, a distribution describes the values a variable can take and how frequently (observed or theoretical) each of them occurs.

Sampling, in general, is the process of choosing a representative sample from a target population in order to understand something about the population as a whole. In the context of random search, it simply means drawing random values for a hyperparameter according to the distribution you defined for it.
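
As a small sketch of what "sampling from a distribution" looks like in code, the snippet below draws a few hyperparameter settings with Python's standard `random` module. The hyperparameter names and ranges are assumptions chosen to resemble the case study later on, and `sample_config` is a hypothetical helper.

```python
import random

rng = random.Random(7)  # seeded so the draws are repeatable

def sample_config(rng):
    """Draw one configuration, one value per hyperparameter."""
    return {
        "max_iter": rng.randrange(100, 200),  # uniform over the integers 100..199
        "C": rng.uniform(0.0, 5.0),           # uniform over the interval [0, 5]
    }

# Five independent draws give five candidate settings to evaluate.
configs = [sample_config(rng) for _ in range(5)]
for cfg in configs:
    print(cfg)
```

Each call produces a different setting, and the number of draws is entirely up to you, independent of how many hyperparameters there are.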

Now let's again get back to the concept of random search.

You’ll define a sampling distribution for each hyperparameter. You can also define how many iterations you’d like to build when searching for the optimal model. For each iteration, the hyperparameter values of the model will be set by sampling the defined distributions. One of the primary theoretical backings to motivate the use of a random search in place of grid search is the fact that for most cases, hyperparameters are __not equally important__. According to the original paper:

“*….for most datasets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different datasets. This phenomenon makes grid search a poor choice for configuring algorithms for new datasets.*”

In the following figure, we're searching over a hyperparameter space where one hyperparameter has significantly more influence on the model score than the other; the distributions shown on each axis represent the model's score. In each case, we're evaluating `9` different models. The grid search strategy blatantly misses the optimal model and spends redundant time exploring the unimportant hyperparameter. Because the nine grid points share only three distinct values per hyperparameter, the important hyperparameter is tested at just three settings, and the remaining evaluations only vary the hyperparameter that has little effect on the score, which is wasted effort. Conversely, the random search tries nine distinct values of each hyperparameter, so it has much improved exploratory power and is more likely to land near the optimal value of the critical hyperparameter.

<img src = 'http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1531340388/grid_vs_random_jltknd.png' />

Source: [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)
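
The effect shown in the figure can be reproduced with a few lines of plain Python. The sketch below assumes two hypothetical hyperparameters scaled to the interval `[0, 1]` and simply counts how many distinct values of the first one each strategy tries with a budget of nine models.

```python
import random
from itertools import product

# 3 x 3 grid: nine models, but only three values per hyperparameter.
axis = [0.0, 0.5, 1.0]
grid_points = list(product(axis, axis))

# Nine random points: each model gets its own value on every axis.
rng = random.Random(0)
random_points = [(rng.random(), rng.random()) for _ in range(9)]

def distinct_first(points):
    """Number of distinct values tried for the first hyperparameter."""
    return len({p[0] for p in points})

print(distinct_first(grid_points), distinct_first(random_points))
```

If only the first hyperparameter matters, the grid has effectively tested three settings, while the random search has tested nine for the same cost.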

In the following sections, you will see grid search and random search in action with Python. You will also be able to decide which is better regarding the effectiveness and efficiency.

## Case study in Python:

Hyperparameter tuning is a final step in the process of applied machine learning before presenting results.

You will use the Pima Indian diabetes dataset. The dataset corresponds to a classification problem in which you need to predict whether a person has diabetes given the `8` features in the dataset. You can find the complete description of the dataset [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

There are a total of `768` observations in the dataset. Your first task is to load the dataset so that you can proceed. But before that let's import the dependencies you are going to need.

In [0]:
# Dependencies

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

Now that the dependencies are imported let's load Pima Indians dataset into a Dataframe object with the famous `Pandas` library.

In [0]:
data = pd.read_csv("./data/diabetes.csv")

The dataset is successfully loaded into the Dataframe object data. Now, let's take a look at the data.

In [0]:
data.head()

So you can see `8` different features, with each observation labeled with an outcome of `1` or `0`, where `1` means the person has diabetes and `0` means the person does not. The dataset is known to have missing values. Specifically, there are missing observations for some columns that are marked as a zero value. We can corroborate this by the definition of those columns, and the domain knowledge that a zero value is invalid for those measures, e.g., `0` for body mass index or blood pressure is invalid.

Missing values create a lot of problems when you try to build a machine learning model. In this case, you will use a Logistic Regression classifier to predict whether a patient has diabetes. Logistic Regression cannot handle missing values on its own.

(If you want a quick refresher on Logistic Regression you can refer [here](https://www.analyticsvidhya.com/blog/2015/10/basics-logistic-regression/).)

Let's get some statistics about the data with Pandas' `describe()` utility.

In [0]:
data.describe()

This is useful.

We can see that there are columns that have a minimum value of `0`. On some columns, a value of `0` does not make sense and indicates an invalid or missing value.

Specifically, the following columns have an invalid zero minimum value:

- Plasma glucose concentration (`Glucose`)
- Diastolic blood pressure (`BloodPressure`)
- Triceps skinfold thickness (`SkinThickness`)
- 2-Hour serum insulin (`Insulin`)
- Body mass index (`BMI`)

Now you need to identify and mark these values as missing. Let’s first confirm the zeros by looking at the raw data; the example below prints the first `20` rows.

In [0]:
data.head(20)

You can see 0 in several columns, right?

You can get a count of the number of missing values in each of these columns. You can do this by marking all of the values in the subset of the DataFrame you are interested in that have zero values as True. You can then count the number of true values in each column. For this, you will have to reimport the data without the column names.

In [0]:
data = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv",header=None)
print((data[[1,2,3,4,5]] == 0).sum())

You can see that columns `1`, `2` and `5` have just a few `0` values, whereas columns `3` and `4` show a lot more, nearly half of the rows. Column `0` (the number of pregnancies) also contains zeros, but those are natural values rather than missing ones. Column `8` is the target variable, so `0`s in it are natural as well.

This highlights that different “missing value” strategies may be needed for different columns, e.g., to ensure that there are still a sufficient number of records left to train a predictive model.

In Python, specifically `Pandas`, `NumPy` and `Scikit-Learn`, you should mark missing values as `NaN`.

Values marked as `NaN` are ignored by operations like sum, count, etc.

You can mark values as NaN easily with the Pandas DataFrame by using the replace() function on a subset of the columns you are interested in.

After you have marked the missing values, you can use the `isnull()` function to mark all of the NaN values in the dataset as `True` and get a count of the missing values for each column.

In [0]:
# Mark zero values as missing or NaN
data[[1,2,3,4,5]] = data[[1,2,3,4,5]].replace(0, np.nan)
# Count the number of NaN values in each column
print(data.isnull().sum())

You can see that the columns `1` to `5` have the same number of missing values as zero values identified above. This is a sign that you have marked the identified missing values correctly.

This is a useful summary, but you'd like to look at the actual data too, to confirm that you have not fooled yourself.

Below is the same check, this time printing the first `5` rows of data.

In [0]:
data.head()

It is clear from the raw data that marking the missing values had the intended effect. Now, you will **impute** the missing values. Imputing refers to replacing missing values with substituted values, for example ones estimated by a model or a summary statistic. Although there are several solutions for imputing missing values, you will use mean imputation, which means replacing the missing values in a column with the mean of that particular column. Let's do this with `Pandas`' `fillna()` method.

In [0]:
# Fill missing values with mean column values
data.fillna(data.mean(), inplace=True)
# Count the number of NaN values in each column
print(data.isnull().sum())
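
To make the arithmetic behind mean imputation concrete, here is a dependency-free sketch on a single made-up column. The values are hypothetical and the code is not a replacement for `fillna()`; it only spells out what replacing `NaN` with the column mean amounts to.

```python
import math

# Hypothetical column with two missing entries marked as NaN.
column = [30.0, math.nan, 34.0, 32.0, math.nan]

# The mean is computed from the observed values only.
observed = [v for v in column if not math.isnan(v)]
col_mean = sum(observed) / len(observed)

# Replace every NaN with that mean; observed values stay untouched.
imputed = [col_mean if math.isnan(v) else v for v in column]
print(col_mean, imputed)
```

Note that the mean of the column does not change after imputation, which is one reason this simple strategy is a common default.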

Cheers! You have now handled the missing value problem. Now let's use this data to build a Logistic Regression model using `scikit-learn`.

First, you will see the model with some arbitrary hyperparameter values. Then you will build two other Logistic Regression models with two different strategies - **Grid search** and **Random search**.

In [0]:
# Split dataset into inputs and outputs
values = data.values
X = values[:,0:8]
y = values[:,8]

# Initiate the LR model with random hyperparameters
lr = LogisticRegression(penalty='l2',dual=False,max_iter=110)

You have created the Logistic Regression model with some random hyperparameter values. The hyperparameters that you used are:

- `penalty`: Used to specify the norm used in the penalization ([regularization](https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a)).
- `dual`: Dual or primal formulation. The dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features (when you have more observations than features).
- `max_iter`: Maximum number of iterations taken to converge.

Later in the case study, you will optimize/tune these hyperparameters to see the change in the results.

__NOTE__: The main purpose of this tutorial is to optimize hyperparameters, so we did not split the data into training and testing sets. In practice, you should always split your data. Later on, we apply K-fold cross-validation to check that the model(s) are not overfitted.

In [0]:
# Pass data to the LR model
lr.fit(X,y)

It's time to check the __accuracy__ score.

In [0]:
lr.score(X,y)

In the above step, you applied your LR model to the same data and evaluated its score. But there is always a need to validate the stability of your machine learning model. You just can’t fit the model to your training data and hope it would accurately work for the real data it has never seen before. You need some kind of assurance that your model has got most of the patterns from the data correct.

Well, cross-validation comes to the rescue. I will not go into the details of it as it is out of the scope of this blog, but [this post](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f) does a very fine job.

In [0]:
# You will need the following dependencies for applying Cross-validation and evaluating the cross-validated score

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Build the k-fold cross-validator

# shuffle=True is needed for random_state to have an effect
kfold = KFold(n_splits=3, shuffle=True, random_state=2019)

You supplied `n_splits` as `3`, which essentially makes it a 3-fold cross-validation.

__NOTE__: this is because you have a limited-sized dataset. 

You also supplied `random_state` as 2019. This is just to reproduce the results. You could have supplied any integer value as well. 

__NOTE__: you should always ensure the __reproducibility__ of your data analysis.

Now, let's apply this.

In [0]:
result = cross_val_score(lr, X, y, cv=kfold, scoring='accuracy')
print(result.mean())

You can see there's a slight decrease in the score. This is because each fold trains the model on only about two thirds (**67%**) of your data. Anyway, you can do better with hyperparameter tuning/optimization.
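
The "two thirds" figure follows directly from how k-fold splitting works. The sketch below is a simplified, unshuffled version written in plain Python (the helper `kfold_indices` is hypothetical and is not scikit-learn's implementation); it prints the train and test sizes for `768` observations and `3` folds.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread any remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

splits = list(kfold_indices(768, 3))
for train, test in splits:
    print(len(train), len(test))
```

Every observation is used for testing exactly once, and each model is trained on the remaining two folds.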

Let's build another LR model, but this time its hyperparameters will be tuned. You will first do this with grid search.

Let's first import the dependencies you will need. Scikit-learn provides a utility called `GridSearchCV` for this.

In [0]:
from sklearn.model_selection import GridSearchCV

Let's define the grid values of the hyperparameters that you used above.

In [0]:
dual=[True,False]
max_iter=[100,110,120,130,140]
param_grid = dict(dual=dual,max_iter=max_iter)

You have defined the grid. Let's run the grid search over them and see the results with execution time.

In [0]:
import time

# liblinear is the solver that implements the dual formulation searched over in the grid
lr = LogisticRegression(penalty='l2', solver='liblinear')
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv = 3, n_jobs=-1)

start_time = time.time()
grid_result = grid.fit(X, y)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# time.time() differences are measured in seconds
print("Execution time: " + str((time.time() - start_time)) + ' s')

Oh-no! The results decreased again! Does that mean we are totally off track? Rest assured, this only means that the current grid does not contain the optimal set.

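You can define a larger grid of hyperparameters as well and apply grid search.

In [0]:
# Larger grid: same values as before, plus candidate values for C
dual=[True,False]
max_iter=[100,110,120,130,140]
C = [1.0,1.5,2.0,2.5]
param_grid = dict(dual=dual,max_iter=max_iter,C=C)

In [0]:
lr = LogisticRegression(penalty='l2', solver='liblinear')
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv = 3, n_jobs=-1)

start_time = time.time()
grid_result = grid.fit(X, y)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' s')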

You can see an increase in the accuracy score, but there is a noticeable increase in the execution time as well. **The larger the grid, the more execution time**.

Let's rerun everything but this time with the random search. Scikit-learn provides `RandomizedSearchCV` to do that. As usual, you will have to import the necessary dependencies for that.

In [0]:
from sklearn.model_selection import RandomizedSearchCV

In [0]:
import numpy as np
np.random.seed(7)
max_iter1 = np.random.randint(100, 200, 50)
C1 = np.random.uniform(0, 5, 50)
param_grid1 = dict(dual=dual,max_iter=max_iter1,C=C1)

In [0]:
random = RandomizedSearchCV(estimator=lr, param_distributions=param_grid1, cv = 3, n_jobs=-1)

start_time = time.time()
random_result = random.fit(X, y)
# Summarize results
print("Best: %f using %s" % (random_result.best_score_, random_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' s')

Woah! The random search yielded the same accuracy but in much less time.
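
A rough way to see where the time goes is to count model fits. The sketch below assumes the small grid has `2 × 5` settings, the larger grid adds four candidate values for `C` (as in the bonus section's grid), and the random search uses `RandomizedSearchCV`'s default of `n_iter=10` sampled settings. The counts ignore the single final refit on the whole dataset.

```python
from math import prod

cv_folds = 3

# Number of candidate values per hyperparameter.
small_grid = {"dual": 2, "max_iter": 5}
large_grid = {"dual": 2, "max_iter": 5, "C": 4}  # assumes four values for C

def grid_fits(sizes, folds):
    """Model fits needed by an exhaustive grid search with k-fold CV."""
    return prod(sizes.values()) * folds

n_iter = 10  # default number of settings sampled by RandomizedSearchCV
small_fits = grid_fits(small_grid, cv_folds)
large_fits = grid_fits(large_grid, cv_folds)
random_fits = n_iter * cv_folds

print(small_fits, large_fits, random_fits)
```

The cost of grid search is fixed by the grid, whereas the cost of random search is fixed by the budget you choose, no matter how many hyperparameters or candidate values you include.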

#### YOUR TURN HERE
Please check the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) for `RandomizedSearchCV`, for the purpose of understanding why we set `n_jobs` to `-1`.

Also, please list any other random search parameters that you found interesting in the docs.

That is all for the case study part. Now, let's wrap things up!

## Conclusion and further reading:
In this tutorial, you learned about parameters and hyperparameters of a machine learning model and their differences as well. You also got to know about what role hyperparameter optimization plays in building efficient machine learning models. You built a simple Logistic Regression classifier in Python with the help of scikit-learn.

You tuned the hyperparameters with grid search and random search and saw which one performs better.

Besides, you saw small data preprocessing steps (like handling missing values) that are required before you feed your data into the machine learning model. You covered Cross-validation as well.

That is a lot to take in, and all of them are equally important in your data science journey. I will leave you with some further readings that you can do.

Further readings:

- [Problems in hyperparameter optimization](https://blog.sigopt.com/posts/common-problems-in-hyperparameter-optimization)
- [Algorithms for Hyper-Parameter Optimization](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf)
- [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)

For the ones who are a bit more advanced, I would highly recommend reading this paper for effectively [optimizing the hyperparameters of neural networks](https://arxiv.org/abs/1803.09820).

# Bonus Content: Introducing the Implementation of a Pipeline

We have been using the concept of a __pipeline__ frequently in this class. A __pipeline__ refers to a sequence of tasks covering one or more phases in [CRISP-DM](http://www.proglobalbusinesssolutions.com/six-steps-in-crisp-dm-the-standard-data-mining-process/). 

Even though we embraced the idea of pipelines, we have not formally implemented one in `sklearn`. Our implementation so far basically stacks different steps together. 

For instance, we observe that the dataset used in this tutorial contains `NaN` values, so we need to __impute__ them. Additionally, since the features are not on the same scale, naturally we would like to __scale__ the data. After these steps, we want to perform __grid search__ (with cross validation) on the __logistic regression model__ to find the optimal hyperparameters. Then we __build the best model__ with the optimal hyperparameter settings to get the results. 

So, the aforementioned steps form a pipeline like below:
```
INPUT: Raw Data 
---------------
STEP 1: SimpleImputer
STEP 2: StandardScaler
STEP 3: GridSearchCV + LogisticRegression
STEP 4: LogisticRegression.predict
----------------
OUTPUT: Model Evaluation Results
```
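
Conceptually, each preprocessing step in such a pipeline learns something from the data (`fit`) and then applies it (`transform`), and the output of one step feeds the next. The toy classes below are hypothetical stand-ins written in plain Python for a single numeric column; they only sketch the idea and are not the `sklearn` classes named in the diagram.

```python
import math
import statistics

class MeanImputer:
    """Toy imputation step: learn the mean, fill NaN values with it."""
    def fit(self, col):
        self.mean_ = statistics.fmean(v for v in col if not math.isnan(v))
        return self
    def transform(self, col):
        return [self.mean_ if math.isnan(v) else v for v in col]

class Standardizer:
    """Toy scaling step: subtract the mean, divide by the population std."""
    def fit(self, col):
        self.mean_ = statistics.fmean(col)
        self.std_ = statistics.pstdev(col)
        return self
    def transform(self, col):
        return [(v - self.mean_) / self.std_ for v in col]

def run_steps(steps, col):
    """Fit each step on the output of the previous one, in order."""
    for step in steps:
        col = step.fit(col).transform(col)
    return col

raw = [2.0, math.nan, 4.0, 6.0]
out = run_steps([MeanImputer(), Standardizer()], raw)
print(out)
```

The order matters: the scaler can only compute a mean and standard deviation once the missing values have been filled in.
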
So far, we would implement the pipeline step by step, like this.


In [0]:
data_copy = data.copy()
values = data_copy.values
X = values[:,0:8]
y = values[:,8]

In [0]:
# imputation
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputedData = imputer.fit_transform(X)

In [0]:
# scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaledData = scaler.fit_transform(imputedData)

In [0]:
# grid search
# hyper parameters
dual=[True,False]
max_iter=[100,110,120,130,140]
C = [1.0,1.5,2.0,2.5]
param_grid = dict(dual=dual,max_iter=max_iter,C=C)

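# liblinear is the solver that implements the dual formulation searched over in the grid
lr = LogisticRegression(penalty='l2', solver='liblinear')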
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv = 3, n_jobs=-1)

grid_result = grid.fit(scaledData, y)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

We just saw this process in the main body of the tutorial. The only step left is to __fit__ and __evaluate__ the best model. If you have questions about this step, refer to the `Cross Validation in Python` tutorial from last week.

In [0]:
# fit & evaluate best model
from sklearn.model_selection import cross_val_score, RepeatedKFold
lr_best = LogisticRegression(C=1.0, dual=False, max_iter=100)
rkf = RepeatedKFold(n_splits=10, n_repeats=5, random_state=2020)
f1_scores_cv = cross_val_score(lr_best, scaledData, y, scoring='f1', cv=rkf)
roc_auc_cv = cross_val_score(lr_best, scaledData, y, scoring='roc_auc', cv=rkf)

print("mean f1-score:", f1_scores_cv.mean())
print("std of f1-score:", f1_scores_cv.std())
print("mean ROC/AUC:", roc_auc_cv.mean())
print("std of ROC/AUC:", roc_auc_cv.std())

To check whether this really is the best hyperparameter setting, we can create a model for comparison.

In [0]:
# model for comparison
lr_comp = LogisticRegression(C=0.1, dual=False, max_iter=50)
f1_scores_comp = cross_val_score(lr_comp, scaledData, y, scoring='f1', cv=rkf)
roc_auc_comp = cross_val_score(lr_comp, scaledData, y, scoring='roc_auc', cv=rkf)

print("mean f1-score:", f1_scores_comp.mean())
print("mean ROC/AUC:", roc_auc_comp.mean())

Since we have `50` values in both model evaluation results, we can perform the _t-test_ for comparison purpose.
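
For reference, the equal-sample-size, pooled-variance form of the two-sample _t_ statistic can be written as follows. The symbols are notation introduced here for illustration: $\bar{a}$ and $\bar{b}$ are the two sample means, $s_a^2$ and $s_b^2$ are the two sample variances, and $n$ is the number of values in each sample.

$$
s = \sqrt{\frac{s_a^2 + s_b^2}{2}}, \qquad
t = \frac{\bar{a} - \bar{b}}{s\,\sqrt{2/n}}, \qquad
\mathrm{df} = 2n - 2
$$

With $n = 50$ scores per model, this gives $\mathrm{df} = 98$ degrees of freedom.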

The function below (original code from [here](https://towardsdatascience.com/inferential-statistics-series-t-test-using-numpy-2718f8f9bf2f)) returns the _t-test_ results.

In [0]:
from scipy import stats

def my_t_test(a,b,size): # a, b are two arrays for comparison
    n = size # size of each array (both must have the same length)

    ## Sample variances (ddof=1 gives the unbiased estimate)
    var_a = a.var(ddof=1)
    var_b = b.var(ddof=1)

    # pooled standard deviation for equal sample sizes
    s = np.sqrt((var_a + var_b)/2)

    ## Calculate the t-statistic
    t = (a.mean() - b.mean())/(s*np.sqrt(2/n))

    # Degrees of freedom
    df = 2*n - 2

    # Two-sided p-value, since H0 is "the two means are equal"
    p = 2*(1 - stats.t.cdf(abs(t),df=df))

    # return(t, p)

    if p < 0.05:
        return('mean of a and b are significantly different (reject H0)')
    else:
        return('mean of a and b are not significantly different (not reject H0)')

In [0]:
my_t_test(f1_scores_cv, f1_scores_comp, 50)

In [0]:
my_t_test(roc_auc_cv, roc_auc_comp, 50)

The above results suggest that the "best" model is not significantly different from the arbitrary model we created for comparison. And the last part (__model evaluation and comparison__) ties back to the tutorial from last week.

`sklearn` formally supports the idea of pipelines, so we can easily combine STEPs 1 - 3 in a pipeline.

In [0]:
from sklearn.pipeline import make_pipeline
# STEP 1 - 3
best_pipeline = make_pipeline(imputer, # STEP 1
                    scaler, # STEP 2
                    GridSearchCV(LogisticRegression(penalty='l2', solver='liblinear'),
                                 param_grid=param_grid,
                                 cv=3)) # STEP 3

# STEP 4
# Pass the unprocessed X: the pipeline imputes and scales inside each fold
f1_scores_pp = cross_val_score(best_pipeline, X, y, scoring='f1', cv=rkf)
roc_auc_pp = cross_val_score(best_pipeline, X, y, scoring='roc_auc', cv=rkf)

print("mean f1-score:", f1_scores_pp.mean())
print("std of f1-score:", f1_scores_pp.std())
print("mean ROC/AUC:", roc_auc_pp.mean())
print("std of ROC/AUC:", roc_auc_pp.std())

See how elegant your pipeline can be? You can even incorporate your custom functions in there.

Hopefully this part is helpful when you organize your pipelines.