![alt text](https://drive.google.com/uc?export=view&id=1DXUVHxd4t15mfuqMgMCLnsP4jWVI5EWz)

---
<br>
© 2024 Copyright The University of New South Wales - CRICOS 00098G

**Author**: Oscar Perez-Concha: o.perezconcha@unsw.edu.au

**Contributors/Co-authors**: Marta Fredes-Torres, Zhisheng (Sandy) Sa and Matthew Sainsbury-Dale.








---




# Laboratory 3: Model Evaluation and Improvement.
# Cross-Validation for hyper-parameter tuning using `scikit-learn`'s function `GridSearchCV`.


Goal/Research question: To develop a predictive algorithm for forecasting hospital readmissions within 30 days post-discharge. This exercise focuses specifically on tuning the hyper-parameter C (related to the regularisation strength $\lambda$ by $\alpha = C = \frac{1}{\lambda}$).
We will apply this tuning to our predictive algorithm, Lasso.

![alt text](https://drive.google.com/uc?export=view&id=105SGqeyo8RgLhSO8mN7ZE5OsG0YiLPKt)



---





---



# 1. Aims of the Exercise:

1. To become familiar with using a validation set to find the best hyper-parameters for a model. Remember that the hyper-parameters are defined by the user, whereas parameters (e.g., the beta coefficients in logistic regression) are found automatically by fitting the model.
2. To become familiar with a grid search: the commonly used method for tuning hyper-parameters is via a grid search, which entails testing many combinations of the hyper-parameters of interest.
3. To become familiar with k-fold cross-validation (k-CV) and grid search.
4. To become familiar with Python pipelines.

It aligns with these learning outcomes of our course:

1.	Distinguish a range of task specific machine learning techniques appropriate for Health Data Science.
2.	Design machine learning tasks for Health Data Science scenarios.




Follow the instructions given and if you have any questions, please use the **Comment section** in **Open Learning**.



---





---



# 2. Initial Docstring:

All programs should have an initial docstring comment. It must include at least the following elements:

* Purpose: what is the aim of your code?
* Date created
* Author
* Date modified
* Author of the modification
* Method: how did you go about solving the problem?
* Data dictionary: The data dictionary should contain all the important variables and constants defined, their datatype (float, string, int) and a short description of what they are.
* List and definitions of functions: similar to the data dictionary, but with functions.
* List of libraries: libraries used in the program and their functionality.

Is there anything else you think we should include in the docstring? Please comment in the comments section of this week's laboratory.

Please read these two documents:
1. pandas docstring guide: https://pandas.pydata.org/pandas-docs/version/0.23/contributing_docstring.html
2. Style guide: https://www.cse.unsw.edu.au/~en1811/resources/style.html


In addition to this initial docstring, please comment and document all the steps and rationale of your code. Use text cells or docstrings within your Python code.
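As an illustration only, a function-level docstring in the NumPy/pandas style described in the guide above might look like the sketch below. The function `count_labels` is hypothetical and is not part of this laboratory; it is here only to show the layout of the sections.

```python
from collections import Counter


def count_labels(labels):
    """
    Count how many times each label appears.

    Parameters
    ----------
    labels : iterable of str
        Outcome labels, for example 'yes' / 'no' readmission flags.

    Returns
    -------
    dict
        Mapping from each distinct label to its number of occurrences.
    """
    return dict(Counter(labels))


# Example call on three made-up labels
print(count_labels(['yes', 'no', 'no']))
```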

<b> Docstring:</b>
#####################################################################################################################

pipeln:

#####################################################################################################################



---





---



# 3. Dataset

In [2]:
# Insert your comments and explanations


import sys
import numpy as np
import pandas as pd


In [4]:
# Insert your comments and explanations


# Mount Google Drive
# We do not need to run this cell if you are not running this notebook in Google Colab

if 'google.colab' in str(get_ipython()):
    from google.colab import drive # import drive from Google Colab
    root = '/content/drive'     # default location for the drive
    # print(root)                 # print content of ROOT (Optional)
    drive.mount(root)
else:
    print('Not running on CoLab')

Mounted at /content/drive


In [5]:
# Insert your comments and explanations

from pathlib import Path

if 'google.colab' in str(get_ipython()):
    # EDIT THE PROJECT PATH IF YOURS IS DIFFERENT
    # You may need to change 'MyDrive' to 'My Drive'.
    project_path = Path(root) / 'MyDrive' / 'Colab Notebooks' / 'HDAT9500' / 'Week3'

    # OPTIONAL - set working directory according to your google drive project path
    # import os
    # Change directory to the location defined in project_path
    # os.chdir(project_path)
else:
    project_path = Path()

As explained in the previous exercise, we have already cleaned the diabetes dataset and created dummy variables for it. In Exercise 1 of Chapter 2 we saved this 'prepared' dataset using `pickle`. Now we will load this data using `pickle` ([more information](https://docs.python.org/3/library/pickle.html)). In this case, we will read it from last week's folder. You will have to change the path according to the folder where you stored your `hospital_final.pickle`.
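The save-and-load round trip can be sketched as follows. This is a minimal example on a made-up three-row DataFrame written to a temporary folder; the file name `toy_example.pickle` and the toy values are assumptions for illustration, not the real hospital data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data standing in for the prepared hospital dataset
toy = pd.DataFrame({'los': [2, 5, 2], 'readmission': ['no', 'no', 'yes']})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / 'toy_example.pickle'
    toy.to_pickle(toy_path)              # save the DataFrame with pickle
    restored = pd.read_pickle(toy_path)  # load it back

# The restored DataFrame should match the original, including dtypes
print(restored.equals(toy))
```

Because unpickling can execute arbitrary code, it is safest to load only pickle files that you created yourself or that come from a trusted source.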

In [6]:
# Insert your comments and explanations
pickle_data_path = Path(project_path) /'hospital_final.pickle'
hospital = pd.read_pickle(pickle_data_path)

In [7]:
# Insert your comments and explanations

# Sanity checks - Please add as many as you wish.
hospital.head(n=20)

Unnamed: 0,los,Age,sex,max_glu_serum,A1Cresult,number_diagnoses,num_lab_procedures,num_procedures,num_medications,number_emergency,...,group_name_1,group_name_2,group_name_3,readmission,admission_type_id_cat,discharge_disposition_id_cat,admission_source_id_cat,discharge_disposition_grouped,admission_source_grouped,admission_type_grouped
0,2,79,Female,,,9,38,0,12,0,...,Other,Endocrine,Infectious,no,3,18,7,,Emergency Room,Elective
1,5,59,Male,,>8,8,49,0,16,0,...,Endocrine,Cardiac_&_circulatory,Other,no,3,1,7,,Emergency Room,Elective
2,2,33,Female,,,5,62,0,15,1,...,Other,Endocrine,Cardiac_&_circulatory,no,3,1,7,,Emergency Room,Elective
3,6,42,Female,,,9,77,0,30,0,...,Infectious,Respiratory,Endocrine,no,3,6,7,,Emergency Room,Elective
4,1,62,Male,,,7,13,5,6,0,...,Cardiac_&_circulatory,Cardiac_&_circulatory,Cardiac_&_circulatory,no,3,1,7,,Emergency Room,Elective
5,1,53,Female,,,9,11,3,16,0,...,Cardiac_&_circulatory,Respiratory,Cardiac_&_circulatory,no,3,1,7,,Emergency Room,Elective
6,10,80,Female,,,9,56,3,13,0,...,Cardiac_&_circulatory,Other,Cardiac_&_circulatory,no,3,3,7,,Emergency Room,Elective
7,2,44,Male,,,3,1,2,22,0,...,Other,Endocrine,Other,no,3,18,7,,Emergency Room,Elective
8,1,73,Male,,,9,9,0,5,0,...,Cardiac_&_circulatory,Endocrine,Other,no,3,7,7,,Emergency Room,Elective
9,4,64,Female,,,4,19,4,24,0,...,Other,Cardiac_&_circulatory,Endocrine,no,3,18,7,,Emergency Room,Elective




---




# 4. Hyper-parameter tuning: Grid Search with Cross-Validation (GridSearchCV)



---



Remember that:


1. Cross-validation is just a method to better estimate the performance of a model.
2. Cross-validation can be used either:


a. To just compute the performance of a model using different partitions of training and test sets and/or different hyper-parameters.

b. To compute the performance of a model using different partitions of training and test sets and/or different hyper-parameters AND to select the best resulting model out of that process. We will see this second option in this exercise.
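Option (a) above, where cross-validation is used only to estimate performance, can be sketched with `cross_val_score`. The data below are synthetic (generated with `make_classification`) and the variable names are assumptions for illustration; they are not the hospital dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# 3-fold cross-validation: one F1 score per held-out fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=3, scoring='f1')

print(scores)         # three fold-level scores
print(scores.mean())  # their average, the usual summary
```

No model is selected here: the output is only an estimate of how well this one configuration performs.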


---



The most commonly used method for tuning hyper-parameters is via a grid search, which entails testing many combinations of the hyper-parameters of interest.

We want to combine the benefits of cross-validation with the grid search. We will seek the model with the best cross-validated score (in this exercise, the F1 score). We will use the `sklearn.model_selection.GridSearchCV` class from sklearn ([More information](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html))
    


---

We have one hyper-parameter that we would like to tune:
C, ($C=\alpha=1/\lambda$)

---

Let's say we want to try:
C = 0.001, 0.01, 0.1, 1, 10, 100.

---

Please note that we can tune multiple hyper-parameters, including class weight, C, penalty, among others. In this exercise, we will demonstrate how to use grid search to tune only one parameter, 'C'. In addition, we will choose **L1 regularisation (Lasso) for this problem**. Feel free to add other hyper-parameters and evaluate if the model improves.
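To give a feel for how quickly a grid grows, the sketch below counts the candidates in a hypothetical grid that tunes both `C` and `class_weight`. The `Estimator__` prefix assumes the logistic regression step of the pipeline is named 'Estimator', as it is later in this exercise; the choice of `class_weight` values is an assumption for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical wider grid: two hyper-parameters instead of one
wider_grid = {
    'Estimator__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'Estimator__class_weight': [None, 'balanced'],
}

# ParameterGrid enumerates every combination in the grid
n_candidates = len(ParameterGrid(wider_grid))
print(n_candidates)  # 6 values of C x 2 class weights = 12
```

Every extra hyper-parameter multiplies the number of candidates, and each candidate is fitted once per cross-validation fold, so the run time grows accordingly.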



---



## 4.1 Splitting the feature variables from the response

In this step, we will also convert responses to binary for later steps.

In [8]:
# Dividing my dataset in X and y (outcome)
X = hospital.drop(axis=1, columns=['readmission'])
y = hospital['readmission']

In [9]:
# Sanity check
X.head(n=20)

Unnamed: 0,los,Age,sex,max_glu_serum,A1Cresult,number_diagnoses,num_lab_procedures,num_procedures,num_medications,number_emergency,...,number_outpatient,group_name_1,group_name_2,group_name_3,admission_type_id_cat,discharge_disposition_id_cat,admission_source_id_cat,discharge_disposition_grouped,admission_source_grouped,admission_type_grouped
0,2,79,Female,,,9,38,0,12,0,...,0,Other,Endocrine,Infectious,3,18,7,,Emergency Room,Elective
1,5,59,Male,,>8,8,49,0,16,0,...,0,Endocrine,Cardiac_&_circulatory,Other,3,1,7,,Emergency Room,Elective
2,2,33,Female,,,5,62,0,15,1,...,1,Other,Endocrine,Cardiac_&_circulatory,3,1,7,,Emergency Room,Elective
3,6,42,Female,,,9,77,0,30,0,...,0,Infectious,Respiratory,Endocrine,3,6,7,,Emergency Room,Elective
4,1,62,Male,,,7,13,5,6,0,...,0,Cardiac_&_circulatory,Cardiac_&_circulatory,Cardiac_&_circulatory,3,1,7,,Emergency Room,Elective
5,1,53,Female,,,9,11,3,16,0,...,0,Cardiac_&_circulatory,Respiratory,Cardiac_&_circulatory,3,1,7,,Emergency Room,Elective
6,10,80,Female,,,9,56,3,13,0,...,0,Cardiac_&_circulatory,Other,Cardiac_&_circulatory,3,3,7,,Emergency Room,Elective
7,2,44,Male,,,3,1,2,22,0,...,0,Other,Endocrine,Other,3,18,7,,Emergency Room,Elective
8,1,73,Male,,,9,9,0,5,0,...,0,Cardiac_&_circulatory,Endocrine,Other,3,7,7,,Emergency Room,Elective
9,4,64,Female,,,4,19,4,24,0,...,0,Other,Cardiac_&_circulatory,Endocrine,3,18,7,,Emergency Room,Elective


In [11]:
# Sanity check
X.shape

(69267, 21)

We are going to use the F1 score to choose our best model (best hyperparameters) after cross-validation on the training set. Since we are using the F1 score, we need the labels to be 1 and 0 to compute it. This is because the F1 score has a parameter called `pos_label` that is used to specify the positive label. Its default value is set to 1. Therefore, we are going to change the label 'yes' to '1'.

Alternatively, you can specify `pos_label='yes'`. Feel free to do that if you prefer.

This is the alternative code that we are not using in this example:

`from sklearn.metrics import make_scorer, f1_score`

`f1_scorer = make_scorer(f1_score, pos_label='yes')`


For more information, please check the API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
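A small sketch of the `pos_label='yes'` alternative on four made-up labels is given below. The toy vectors `y_true` and `y_pred` are assumptions for illustration only.

```python
from sklearn.metrics import f1_score, make_scorer

# Toy labels kept as strings: one true positive, one false positive,
# one false negative and one true negative for the class 'yes'
y_true = ['yes', 'no', 'yes', 'no']
y_pred = ['yes', 'yes', 'no', 'no']

# Precision = 1/2 and recall = 1/2, so F1 = 0.5
f1_yes = f1_score(y_true, y_pred, pos_label='yes')
print(f1_yes)

# The same choice wrapped as a scorer that could be passed to GridSearchCV
f1_scorer = make_scorer(f1_score, pos_label='yes')
```

With string labels and no `pos_label`, `f1_score` would look for the default positive label 1 and raise an error, which is why this exercise converts the labels to 0 and 1 instead.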

In [28]:
# Sanity Checks:
print('******************************************')
print('y - Yes values =', sum(i =='yes' for i in y))
print('y - No values =', sum(i =='no' for i in y))
print('******************************************\n')

# Create y_binary
y_binary = [0 if x=='no' else 1 for x in y]

# Sanity Checks:
print('\n******************************************')
print('y_binary - 1 values =', sum(i ==1 for i in y_binary))
print('y_binary - 0 values =', sum(i ==0 for i in y_binary))
print('******************************************')

******************************************
y - Yes values = 11919
y - No values = 57348
******************************************


******************************************
y_binary - 1 values = 11919
y_binary - 0 values = 57348
******************************************




---



## 4.2  Split the whole dataset into a train and a test set (20% of the total)
Now let's split the data into a training and test set. We will include the optional argument `stratify = y_binary` to preserve the ratio between readmission = yes and readmission = no.

In [13]:
from sklearn.model_selection import train_test_split
# Split X and y into 80% train and 20% test data (roughly), set random state for reproducibility and stratify responses
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=0, stratify = y_binary)
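The effect of `stratify` can be checked on a small made-up example. The sketch below uses hypothetical labels with 20% positives and confirms that both parts of the split keep that proportion; the arrays `X_toy` and `y_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 positives and 80 negatives
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Proportion of positives in each part of the split
print(y_tr.mean(), y_te.mean())
```

The same check can be applied to `y_train` and `y_test` from the cell above by comparing their means with the mean of `y_binary`.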



---



## 4.3 Define the Pipeline

### <font color='blue'> Question: Define the scaler we will use, and the estimator. As before, choose the scaler ("Transform") to be StandardScaler(), and the estimator ("Estimator") to be L1 Logistic Regression. </font>

To read about Pipelines:
1. Sklearn's API: [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). Check the example for pipelines in the API. Instead of SVC(), we are using LogisticRegression().
2. Textbook.



In [14]:
# Write Python Code Here:

# Import the standard scaler function
from sklearn.preprocessing import StandardScaler
# Import the logistic regression model
from sklearn.linear_model import LogisticRegression
# Import the pipeline function
from sklearn.pipeline import Pipeline

In [17]:
# Scaler object
standard_scaler = StandardScaler()

# Classification model using L1 regularisation
log_reg = LogisticRegression(penalty = 'l1', solver='liblinear')

# Pipeline
pipeln = Pipeline([('Transform', standard_scaler), ('Estimator', log_reg)])




---



## 4.4 Define the hyper-parameters grid

### <font color='blue'> Question: Define the hyper-parameter grid. </font>

In [18]:
#Defining the hyper-parameters grid:
param_grid = {'Estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}



In [19]:
# Sanity check
print("Parameter grid:")
print("C: {}".format(param_grid['Estimator__C']))

Parameter grid:
C: [0.001, 0.01, 0.1, 1, 10, 100]




---



Remember that we are going to use a pipeline. In this case, there are two processes that are executed for each iteration of the cross-validation for hyper-parameter tuning:

1. First, the standardization of the features.
2. Then, the fitting of the logistic model.

This means that we have to indicate which of these processes our hyper-parameter grid belongs to.

If you have named your logistic regression model 'Estimator', you can designate its hyper-parameters by naming the parameter grid "Estimator__'parameter_name'". For example, we tell the computer that C is meant for the logistic regression estimator by defining it as 'Estimator__C' in the param_grid.
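The naming rule can be inspected directly on a pipeline. The sketch below rebuilds a pipeline with the same step names used in this exercise (under the assumed name `demo_pipeln`, to avoid overwriting `pipeln`) and lists the parameter names that belong to the 'Estimator' step.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeln = Pipeline([
    ('Transform', StandardScaler()),
    ('Estimator', LogisticRegression(penalty='l1', solver='liblinear')),
])

# Every tunable parameter of a step is exposed as '<step name>__<parameter>'
tunable = sorted(name for name in demo_pipeln.get_params()
                 if name.startswith('Estimator__'))
print(tunable)

# The same naming is used to set a value by hand
demo_pipeln.set_params(Estimator__C=0.1)
print(demo_pipeln.named_steps['Estimator'].C)
```

Any name printed in `tunable` can be used as a key in `param_grid`.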

### <font color='blue'> Question: Initialise the GridSearchCV class by passing it the pipeline we have created, our parameter grid, the score you would like to use to select the best model, and specifying the number of folds. We must consider the computational complexity of the algorithm, so we can't set cv too high. In this case, let's choose 3 folds.</font>

When `scoring` is not specified, `GridSearchCV` uses the estimator's own `score` method, which for a classifier such as logistic regression is **accuracy**.

Different scoring measures can be used in the grid search. This can be a simple string or you can pass a list of values to use as multiple evaluation metrics, with the results for each metric viewed through cv_results_. If multiple scoring metrics are set then the refit parameter needs to be set to tell the function which scorer will be used to evaluate the best parameter settings.

For example, we could extract the F1-score of our GridSearchCV by using the command: **`GridSearchCV(pipeline, param_grid,...,scoring='f1')`**

[More information](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html?highlight=cross_validate#sklearn.model_selection.cross_validate)
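A minimal sketch of a grid search with two metrics is shown below, on synthetic data rather than the hospital dataset. The names `X_demo`, `y_demo` and `multi_search`, and the three values of `C`, are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

multi_search = GridSearchCV(
    LogisticRegression(penalty='l1', solver='liblinear'),
    {'C': [0.01, 1, 100]},
    cv=3,
    scoring=['accuracy', 'f1'],  # record both metrics for every candidate
    refit='f1',                  # but pick (and refit) the best model by F1
)
multi_search.fit(X_demo, y_demo)

print(multi_search.best_params_)
print(multi_search.cv_results_['mean_test_accuracy'])
print(multi_search.cv_results_['mean_test_f1'])
```

With several metrics, `cv_results_` holds one `mean_test_<metric>` entry per metric, while `best_params_` refers only to the metric named in `refit`.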

In [22]:
# Write Python Code Here:

# Import the grid search CV class
from sklearn.model_selection import GridSearchCV

In [23]:
# Define gridsearch object
grid_search = GridSearchCV(pipeln, param_grid, cv = 3, scoring='f1')
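As a rough guide to the cost of the search, the number of model fits during cross-validation is the number of candidates multiplied by the number of folds. The sketch below counts them for the grid used here and also shows, on made-up labels, how stratified folds keep the class ratio. For a classifier and an integer `cv`, `GridSearchCV` uses stratified folds; the toy arrays are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid, StratifiedKFold

param_grid_demo = {'Estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}
n_folds = 3

# Fits performed during cross-validation (the final refit is one more)
n_fits = len(ParameterGrid(param_grid_demo)) * n_folds
print(n_fits)  # 6 candidates x 3 folds = 18

# Toy labels with 9 negatives and 3 positives
y_toy = [0] * 9 + [1] * 3
X_toy = [[i] for i in range(12)]

# Each validation fold receives 3 negatives and 1 positive
fold_labels = []
for train_idx, val_idx in StratifiedKFold(n_splits=n_folds).split(X_toy, y_toy):
    fold_labels.append([y_toy[i] for i in val_idx])
    print(fold_labels[-1])
```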



---



## 4.5 Find the best parameters

Now we train the grid_search object. Note that grid_search behaves similarly to other classifiers, in the sense that we can use the methods fit, predict and score with it. When we use fit, it performs the grid cross-validation we designed during its initialisation.

**Please note that the following code takes a while to run**:

In [25]:
grid_search.fit(X_train, y_train)

print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.4f}".format(grid_search.best_score_))

ValueError: 
All the 18 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
18 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py", line 401, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py", line 359, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/usr/local/lib/python3.10/dist-packages/joblib/memory.py", line 312, in __call__
    return self.func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 881, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_data.py", line 824, in fit
    return self.partial_fit(X, y, sample_weight)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_data.py", line 861, in partial_fit
    X = self._validate_data(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 565, in _validate_data
    X = check_array(X, input_name="X", **check_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 879, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_array_api.py", line 185, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py", line 1998, in __array__
    arr = np.asarray(values, dtype=dtype)
ValueError: could not convert string to float: 'Male'
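The last line of the traceback, `could not convert string to float: 'Male'`, suggests that the loaded DataFrame still contains text columns such as `sex`, which `StandardScaler` cannot handle. The simplest fix is probably to load the version of the dataset in which the dummy variables were already created. As an alternative, and only as a sketch that is not necessarily what this exercise intends, the encoding can be done inside the pipeline. The toy DataFrame and the names `toy_X`, `toy_y`, `preprocess` and `toy_pipeln` below are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with numeric and text columns
toy_X = pd.DataFrame({
    'los': [2, 5, 2, 6, 1, 1, 10, 2],
    'Age': [79, 59, 33, 42, 62, 53, 80, 44],
    'sex': ['Female', 'Male', 'Female', 'Female',
            'Male', 'Female', 'Female', 'Male'],
})
toy_y = [0, 1, 0, 1, 0, 0, 1, 1]

# Scale the numeric columns and one-hot encode all the other columns
preprocess = ColumnTransformer([
    ('Numeric', StandardScaler(),
     make_column_selector(dtype_include='number')),
    ('Categorical', OneHotEncoder(handle_unknown='ignore'),
     make_column_selector(dtype_exclude='number')),
])

toy_pipeln = Pipeline([
    ('Transform', preprocess),
    ('Estimator', LogisticRegression(penalty='l1', solver='liblinear')),
])

toy_pipeln.fit(toy_X, toy_y)
toy_pred = toy_pipeln.predict(toy_X)
print(toy_pred)
```

Because the step names are unchanged, a grid such as `{'Estimator__C': [...]}` would still apply to a pipeline built this way.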


**To have greater clarity of our results, we proceed to analyse the heatmap.**

Let's visualise the mean cross-validation f1-score as a function of the parameter 'C'.

In [26]:
import seaborn as sns
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore') #prevent warnings

# Creating DataFrame with GridSearchCV results. df stands for Dataframe
df_gridsearch = pd.DataFrame(grid_search.cv_results_)

# Sanity check
print(df_gridsearch)

AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'

In [27]:
# Mean cross-validation F1-score for each value of C (one row per value of C)
max_scores = df_gridsearch.set_index('param_Estimator__C')[['mean_test_score']].astype(float)
# Heatmap with a single column; the x tick labels are hidden
sns.heatmap(max_scores, annot=True, fmt='.5g', xticklabels=False)
plt.xlabel("mean_test_score")
plt.ylabel("C")

NameError: name 'df_gridsearch' is not defined



---



## 4.6 Evaluate the performance of the resulting model

<font color=red> **Very very important:** Recall that up to this point we have not used the test set</font> - only the training set was used for tuning the hyper-parameters.

![alt text](https://drive.google.com/uc?export=view&id=105SGqeyo8RgLhSO8mN7ZE5OsG0YiLPKt)

**Very very important:** "Fitting the GridSearchCV object not only searches for the best parameters, but also
automatically fits a new model on the whole training dataset with the parameters that
yielded the best cross-validation performance".
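The refitting behaviour can be seen in a small sketch on synthetic data. With the default `refit=True`, the fitted search object exposes the refitted model as `best_estimator_`, and calling `predict` on the search object delegates to it. The names `X_demo`, `y_demo` and `demo_search` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_search = GridSearchCV(
    LogisticRegression(penalty='l1', solver='liblinear'),
    {'C': [0.1, 1, 10]},
    cv=3,
    scoring='f1',
)
demo_search.fit(X_demo, y_demo)

# The refitted model carries the best hyper-parameter value
print(demo_search.best_estimator_.C == demo_search.best_params_['C'])

# Predicting with the search object is the same as using best_estimator_
same = np.array_equal(demo_search.predict(X_demo),
                      demo_search.best_estimator_.predict(X_demo))
print(same)
```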

### <font color='blue'> Question: Calculate confusion matrix, and classification report for the training and test sets. Comment on the results below </font>
<font color='green'> Tip: read [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) </font>

<font color='green'>Do we need to scale the test set? Yes, we do, but we do not have to do much. When we apply `.predict` on new data using our `grid_search` object (built as `GridSearchCV(pipeln, ...)`), all the steps defined in the pipeline will be applied automatically, including the scaling, which was the first step. Remember our pipeline was defined as `pipeln = Pipeline([('Transform', standard_scaler), ('Estimator', log_reg)])`.</font>
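The point about automatic scaling can be verified on synthetic data. The sketch below compares predictions from a fitted pipeline with predictions obtained by applying the fitted scaler and the fitted estimator by hand; the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

demo_pipeln = Pipeline([
    ('Transform', StandardScaler()),
    ('Estimator', LogisticRegression(penalty='l1', solver='liblinear')),
])
demo_pipeln.fit(X_tr, y_tr)

# Route 1: let the pipeline scale the raw test set and predict
pred_pipeline = demo_pipeln.predict(X_te)

# Route 2: scale by hand with the training-set statistics, then predict
X_te_scaled = demo_pipeln.named_steps['Transform'].transform(X_te)
pred_manual = demo_pipeln.named_steps['Estimator'].predict(X_te_scaled)

identical = np.array_equal(pred_pipeline, pred_manual)
print(identical)
```

Passing already-scaled data to the pipeline would scale it a second time, so the raw test set is what should be given to `.predict`.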

In [None]:
# Write Python code here:

# Import evaluation metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt

Evaluation on the training set.

In [None]:
# Output best parameters and score based on the F1 score
print(f'Best parameters on training data: {grid_search.best_params_}')
print(f'Best cross-validation F1 score on training data: {grid_search.best_score_}\n')

In [None]:
# Generate confusion matrix for evaluation
labels = {'No', 'Yes'}
confusion_train = confusion_matrix(y_train, grid_search.predict(X_train))



# Visualising the confusion matrix:
print('Confusion matrix - Training set:')
ax= plt.subplot()
sns.heatmap(confusion_train, annot=True, fmt='.0f', ax= ax, cmap="viridis")

# labels, title and ticks
ax.set_xlabel('Predicted labels - Training set');ax.set_ylabel('True labels - Training set');
ax.set_title('Confusion Matrix - Training set');
ax.xaxis.set_ticklabels(['No', 'Yes']); ax.yaxis.set_ticklabels(['No', 'Yes'])

Evaluation on the test set.

In [None]:
# Generate predictions for test data based on the model with the best parameters (based on the F1 score) generated by GridSearchCV
pred_grid_search = grid_search.predict(X_test)

# Generate confusion matrix for evaluation
labels = {'No', 'Yes'}
confusion_test = confusion_matrix(y_test, pred_grid_search)



# Heading for the test-set results
print('Results of applying model to test data:\n')


# Visualising the confusion matrix:
print(f'Confusion matrix - Test set:')
ax= plt.subplot()

sns.heatmap(confusion_test, annot=True, fmt='.0f', ax= ax, cmap="viridis")



# labels, title and ticks
ax.set_xlabel('Predicted labels - Test set');ax.set_ylabel('True labels - Test set');
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['No', 'Yes']); ax.yaxis.set_ticklabels(['No', 'Yes'])

In [None]:
from sklearn.metrics import classification_report
print("Classification report for the training set:")
print("")
print(classification_report(y_train, grid_search.predict(X_train)))


print("Classification report for the test set:")
print("")
print(classification_report(y_test, pred_grid_search))
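To help read the outputs above, here is a small sketch on six made-up labels showing how the cells of the confusion matrix map to true negatives, false positives, false negatives and true positives. The vectors `y_true` and `y_hat` are assumptions for illustration.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels: 0 = not readmitted ('No'), 1 = readmitted ('Yes')
y_true = [0, 0, 1, 1, 0, 1]
y_hat = [0, 1, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)  # 2 1 1 2

# Precision, recall and F1 per class, with readable class names
print(classification_report(y_true, y_hat, target_names=['No', 'Yes']))
```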

<b> Write your thoughts here:</b>
#####################################################################################################################


#####################################################################################################################

Quiz1 Q7

In [None]:
# Example 1
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Classifer_Readmission = LogisticRegression(random_state=0).fit(X_train, y_train)

y_pred_test = Classifer_Readmission.predict(X_train)
y_pred_test = Classifer_Readmission.predict(X_test)

logprob_train = Classifer_Readmission.predict_log_probability(X_train)
logprob_test  = Classifer_Readmission.predict_log_probabloty(X_test)

accuracy_train = Classifer_Readmission.score(X_train, y_train)
accuracy_test = Classifer_Readmission.score(X_test, y_test)

In [None]:
# Example 2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Classifer_Readmission = LogisticRegression(random_state=0).fit(X_train, y_train)

y_pred_test = Classifer_Readmission.predict(X_train)
y_pred_test = Classifer_Readmission.predict(X_test)

logprob_train = Classifer_Readmission.predict_log_proba(X_train)
logprob_test  = Classifer_Readmission.predict_log_proba(X_test)

accuracy_train = Classifer_Readmission.score(X_train, y_train)
accuracy_test = Classifer_Readmission.score(X_test, y_test)

In [None]:
# Example 3
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Classifer_Readmission = LogisticRegression(random_state=0).fit(X_train, y_train)

y_pred_test = Classifer_Readmission.predict(X_train)
y_pred_test = Classifer_Readmission.predict(X_test)

logprob_train = Classifer_Readmission.predict_log_prob(X_train)
logprob_test  = Classifer_Readmission.predict_log_prob(X_test)

accuracy_train = Classifer_Readmission.score(X_train, y_train)
accuracy_test = Classifer_Readmission.score(X_test, y_test)

Quiz1 Q8

In [21]:
# Import the standard scaler function
from sklearn.preprocessing import StandardScaler
# Import the logistic regression model
from sklearn.linear_model import LogisticRegression
# Import the pipeline function
from sklearn.pipeline import Pipeline

# Scaler object
standard_scaler = StandardScaler()

# Classification model using L1 regularisation
log_reg = LogisticRegression(penalty = 'l1', solver='liblinear')

# Pipeline
pipeln = Pipeline([('Transform', standard_scaler), ('Estimator', log_reg)])

#Defining the hyper-parameters grid:
param_grid = {'Estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}

# Sanity check
print("Parameter grid:")
print("C: {}".format(param_grid['Estimator__C']))

# Import the grid search CV class
from sklearn.model_selection import GridSearchCV

# Define gridsearch object
grid_search = GridSearchCV(pipeln, param_grid, cv = 3, scoring='f1')

grid_search.fit(X_train, y_train)

print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.4f}".format(grid_search.best_score_))


Parameter grid:
C: [0.001, 0.01, 0.1, 1, 10, 100]


ValueError: 
All the 18 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
18 fits failed with the following error:
ValueError: could not convert string to float: 'Male'


In [None]:
#a) generate predictions for test data based on model with best parameters generated by GridSearchCV
pred_grid_search = grid_search.predict(X_test)

#b) scale the test set using the mean and variance previously calculated from the training set
X_test_scaled = standard_scaler.transform(X_test)

# generate predictions for test data based on model with best parameters generated by GridSearchCV
pred_grid_search = grid_search.predict(X_test_scaled)

#c) None of above