What's new in this notebook: hyperparameter tuning with Grid Search.

# Hyperparameter tuning

Hyperparameter tuning is the process of adjusting a model's hyperparameters to improve its performance. Unlike model parameters, which are learned from the training data, hyperparameters are set before training and govern the training process itself. The goal is to find the combination of hyperparameters that gives the best performance on a predefined metric, such as accuracy or precision.
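To make the distinction concrete, here is a minimal sketch on synthetic data (not the insurance dataset). The names `X_demo`, `y_demo`, and `model` are hypothetical and used only for illustration: `C` and `solver` are hyperparameters chosen before training, while `coef_` and `intercept_` are parameters learned by `fit`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, illustrative data (assumption: not the notebook's dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Hyperparameters: chosen by us before training starts.
model = LogisticRegression(C=1.0, solver="lbfgs")
hyperparams = model.get_params()

# Model parameters: learned from the data during fit.
model.fit(X_demo, y_demo)
learned_coef = model.coef_
learned_intercept = model.intercept_

print("C (hyperparameter):", hyperparams["C"])
print("coef_ (learned):", learned_coef)
```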

## Imports

In [13]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

## Load data

In [10]:
df = pd.read_csv('../Resources/Feature_Engineered_Insurance_Data.csv', decimal='.', delimiter=',')
df.head()

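| Unnamed: 0 | age | gender | bmi | bloodpressure | diabetic | children | smoker |
|---|---|---|---|---|---|---|---|
| 0 | 39.0 | 1 | 23.2 | 91 | 1 | 0 | 0 |
| 1 | 24.0 | 1 | 30.1 | 87 | 0 | 0 | 0 |
| 2 | 38.0 | 1 | 33.3 | 82 | 1 | 0 | 0 |
| 3 | 38.0 | 1 | 33.7 | 80 | 0 | 0 | 0 |
| 4 | 38.0 | 1 | 34.1 | 100 | 0 | 0 | 0 |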


We'll use Grid Search, which evaluates every combination of a set of candidate hyperparameter values, to look for the most effective model.

## Define the parameter grid 

Set up a dictionary where keys are the hyperparameters, and values are lists of settings to try. For Logistic Regression, common hyperparameters to tune are *C* (inverse of regularization strength) and *solver*.

In [11]:
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
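As a quick sanity check, the number of candidates in a grid is the product of the list lengths. The sketch below uses scikit-learn's `ParameterGrid` to enumerate the same grid (copied into the hypothetical name `param_grid_demo`) and multiplies by 5, the number of cross-validation folds used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, copied under a separate name for illustration.
param_grid_demo = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
}

# ParameterGrid expands the dictionary into every combination.
candidates = list(ParameterGrid(param_grid_demo))
n_candidates = len(candidates)   # 6 values of C * 5 solvers
n_fits = n_candidates * 5        # one fit per candidate per fold (cv=5)

print(n_candidates, n_fits)      # prints: 30 150
```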

## Instantiate the grid search

In [12]:
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, verbose=2, n_jobs=-1)
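The arguments can be easier to read when spelled out one per line. The following is a sketch on synthetic binary data with a small hypothetical grid, not the notebook's actual search. It also passes `scoring` explicitly; when `scoring` is omitted, `GridSearchCV` falls back to the estimator's own `score` method, which is accuracy for classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data, for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),  # estimator to tune
    {"C": [0.01, 1, 100]},   # small hypothetical grid
    scoring="accuracy",      # metric used to rank candidates
    cv=5,                    # 5-fold cross-validation per candidate
    n_jobs=1,                # the notebook uses -1 (all available CPU cores)
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_, demo_search.best_score_)
```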

## Split the data into training and testing sets

In [14]:
X = df.drop('bloodpressure', axis=1)
y = df['bloodpressure']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Fit GridSearchCV to the data

The `GridSearchCV` object behaves like any other estimator: call `fit` with the training data, and it cross-validates every candidate in the grid.

In [15]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
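
The warning above suggests two remedies: raising `max_iter` or scaling the data. One possible way to apply both, sketched here on synthetic data, is to wrap a `StandardScaler` and the classifier in a `Pipeline` so that scaling is refit inside each cross-validation fold. The step names `scale` and `clf`, the reduced grid, and the data are assumptions for illustration; grid keys take the `<step name>__<parameter>` form.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on different scales.
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(40, 10, size=200),   # an age-like column
    rng.normal(30, 5, size=200),    # a bmi-like column
])
y_demo = (X_demo[:, 0] / 10 + X_demo[:, 1] / 5 > 10).astype(int)

# Scale first, then classify; max_iter is raised from its default of 100.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Parameters of a pipeline step are addressed as "<step>__<param>".
param_grid_scaled = {"clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid_scaled, cv=5)
search.fit(X_demo, y_demo)
best_params_scaled = search.best_params_

print(best_params_scaled, search.best_score_)
```

Whether this removes the warning on the insurance data has not been verified here; it only illustrates the mechanism the warning points to.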


## Review the best parameters and model

In [16]:
print("Best parameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_

# Evaluate the best model
best_accuracy = best_model.score(X_test, y_test)
print(f"Best Model Accuracy: {best_accuracy}")

Best parameters: {'C': 10, 'solver': 'lbfgs'}
Best Model Accuracy: 0.05970149253731343
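
Beyond the single best combination, a fitted `GridSearchCV` also exposes `cv_results_` (per-candidate scores) and `best_score_` (the mean cross-validated score of the best candidate, computed on the training folds rather than on the test set). The sketch below shows one way to tabulate them, using synthetic data and a hypothetical three-value grid.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 1, 100]},
    cv=5,
)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read.
results = pd.DataFrame(search.cv_results_)
summary = results[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(summary)
print("Mean CV score of best candidate:", search.best_score_)
```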


# Conclusion

Accuracy remains consistently low across multiple models and after hyperparameter tuning (about 6% for the best grid-search model here), so the available data is probably not sufficient to predict blood pressure. The limitation may lie in the chosen dataset itself rather than in how the data was processed or how the models were trained.
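
One way to put a low accuracy figure in context is to compare it with a trivial baseline that always predicts the most frequent target value. The sketch below does this with scikit-learn's `DummyClassifier` on synthetic data; the integer target range and all variable names are assumptions, and the numbers it prints say nothing about the insurance dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: integer readings unrelated to the features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = rng.integers(80, 101, size=300)   # hypothetical range, 21 possible values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Baseline that ignores the features and predicts the most common class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_accuracy = baseline.score(X_te, y_te)

n_classes = len(np.unique(y_tr))
print(f"Distinct target values in training data: {n_classes}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Running the same comparison with the notebook's own `X_train`, `y_train`, `X_test`, and `y_test` would show how far the tuned model is from this baseline.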