In this exercise, the Pima Indians Diabetes dataset is used to build a logistic regression model that predicts whether a patient has diabetes. K-fold cross-validation is then applied to train and evaluate the model on the training set.

In [8]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import linear_model
from sklearn.metrics import f1_score
from sklearn import model_selection
from sklearn.model_selection import train_test_split

In [9]:
pima_dataset = pd.read_csv('diabetes.csv')
pima_dataset.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


The goal of any ML model is to perform well on unseen data. To check that the trained model generalizes, first divide the data into two sets: a training set and a test set. The training set is used to fit the model, and the held-out test set is used to evaluate it. Start by splitting the data into these two sets in a stratified manner, so that both sets keep the same proportion of positive and negative outcomes.

In [10]:
X = pima_dataset.loc[:,['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 
                        'DiabetesPedigreeFunction', 'Age']]
y = pima_dataset['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify = y)

In [11]:
# Inspect whether the 2 sets are indeed stratified.

print(pd.Series(y_train).value_counts(normalize = True))
print(pd.Series(y_test).value_counts(normalize = True))

0    0.651466
1    0.348534
Name: Outcome, dtype: float64
0    0.649351
1    0.350649
Name: Outcome, dtype: float64
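
To see what `stratify` contributes, the sketch below compares a stratified split with a plain random split. It uses a small made-up label vector rather than the Pima data; the names `X_demo` and `y_demo` and the 65/35 class ratio are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 65 negatives and 35 positives
y_demo = np.array([0] * 65 + [1] * 35)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

# Stratified split: the class proportions are preserved in both sets
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Plain random split: the proportions are free to drift, especially in the small test set
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("stratified: train positive rate =", y_tr_s.mean(), "| test positive rate =", y_te_s.mean())
print("random:     train positive rate =", y_tr_r.mean(), "| test positive rate =", y_te_r.mean())
```

With stratification, both positive rates equal the overall rate of the label vector; without it, they depend on the random draw.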


In [12]:
# Split the training set into 5 folds so that, in each round,
# 4 folds are used for training and the remaining fold is used for validation.

k = 5

# Concat X_train and y_train
kfold_df = X_train.copy()
kfold_df['target'] = y_train

# Shuffle the rows of the data frame
kfold_df = kfold_df.sample(frac=1, replace=False, random_state=42).reset_index(drop = True)
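kfold_df['kfold'] = -1  # integer placeholder; overwritten with the fold number below

# Instantiate the KFold splitter from scikit-learn (the rows were already shuffled above)
kf = model_selection.KFold(n_splits = k)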

# Populate the kfold column
for fold, (train, validation) in enumerate(kf.split(X=kfold_df)):
    kfold_df.loc[validation, 'kfold'] = fold

# Inspect training data frame
kfold_df

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,target,kfold
0,4,96,56,17,49,20.8,0.340,26,0,0
1,0,78,88,29,40,36.9,0.434,21,0,0
2,0,162,76,56,100,53.2,0.759,25,1,0
3,2,134,70,0,0,28.9,0.542,23,1,0
4,8,100,76,0,0,38.7,0.190,42,0,0
...,...,...,...,...,...,...,...,...,...,...
609,4,183,0,0,0,28.4,0.212,36,1,4
610,7,136,90,0,0,29.9,0.210,50,0,4
611,9,164,78,0,0,32.8,0.148,45,1,4
612,5,97,76,27,0,35.6,0.378,52,1,4
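
Plain `KFold` does not look at the labels, so the class ratio can differ from fold to fold. One alternative, which is not what the cell above does, is `StratifiedKFold`, which keeps the ratio roughly constant in every fold. A minimal sketch on made-up data follows; `demo_df` and its columns are hypothetical stand-ins for `kfold_df`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical data frame: one random feature and an imbalanced binary target
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    "feature": rng.normal(size=100),
    "target": np.array([0] * 65 + [1] * 35),
})
demo_df["kfold"] = -1  # placeholder until folds are assigned

# StratifiedKFold needs the labels (y) to balance the classes across folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(skf.split(X=demo_df, y=demo_df["target"])):
    demo_df.loc[valid_idx, "kfold"] = fold

# Positive rate per fold; with stratification these should be (nearly) identical
fold_rates = demo_df.groupby("kfold")["target"].mean()
print(fold_rates)
```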


The model is trained 5 times. Each time it is fit on 4 folds and validated on the remaining fold, and the validation fold changes from one run to the next. This procedure is called k-fold cross-validation. The model's performance is then averaged over the 5 runs, and that average is used to compare candidate models. Here K (the number of folds) is 5, so the model is fit 5 times; K is a configurable parameter.

In [13]:
# Train Model with Cross-Validation
mean_f1 = []

# Instantiate Model
model = linear_model.LogisticRegression(fit_intercept = True, random_state=42)

for fold in range(0,k):

    # Training data is all but the provided fold
    fold_df = kfold_df[kfold_df['kfold'] != fold].reset_index(drop = True)

    # Validation fold is the fold provided
    valid_df = kfold_df[kfold_df['kfold'] == fold].reset_index(drop = True)

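    # Assign feature & target columns; also drop the helper 'kfold' column
    # so that the fold label is not fed to the model as a feature
    X_train_cv = fold_df.drop(["target", "kfold"], axis = 1)
    y_train_cv = fold_df['target']

    # Same for validation set
    X_valid_cv = valid_df.drop(["target", "kfold"], axis = 1)
    y_valid_cv = valid_df['target']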

    # Fit the model on training data
    model.fit(X_train_cv, y_train_cv)

    # Get predictions for validation samples
    predictions = model.predict(X_valid_cv)

    # Evaluate model
    f1 = f1_score(y_valid_cv, predictions)
    print(f"Fold = {fold}, F1 = {f1}")

    mean_f1.append(f1)

print(f"Average model F1-Score is {np.mean(mean_f1)}")

Fold = 0, F1 = 0.5135135135135136
Fold = 1, F1 = 0.5974025974025973
Fold = 2, F1 = 0.6966292134831461
Fold = 3, F1 = 0.7058823529411765
Fold = 4, F1 = 0.6881720430107526
Average model F1-Score is 0.6403199440702372


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
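
The warning above says the solver hit its iteration limit before converging. The message itself suggests two remedies: scale the features and raise `max_iter`. The sketch below does both inside a pipeline, so the scaler is refit on the training folds only, and uses `cross_val_score` as a compact alternative to the manual loop. It runs on synthetic data from `make_classification` as a stand-in for the Pima features, and `max_iter=1000` is an illustrative choice, so the scores will not match the ones printed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, roughly 65/35 class balance
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=42,
)

# Standardize the features, then fit logistic regression with a higher iteration cap
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold stratified cross-validation, scored with F1 as in the loop above
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1")

print("F1 per fold:", scores)
print("Mean F1:", scores.mean())
```

Putting the scaler inside the pipeline matters: scaling the whole training set before splitting would leak statistics from each validation fold into training.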

#### In this exercise, you learned how to train and evaluate a model using cross-validation. Cross-validation is a common technique for estimating the performance of candidate models so that they can be compared. Try changing K (the number of folds) and check how the validation scores of the trained model change.
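
One way to try different values of K is to loop over them and record the mean and spread of the fold scores. This is a sketch on synthetic data, not on the Pima dataset; the candidate values `(3, 5, 10)` and the names `X_demo`, `y_demo`, and `results` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption) with 8 features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=42
)
model = LogisticRegression(max_iter=1000)

results = {}
for k in (3, 5, 10):
    # Shuffle once per K with a fixed seed so the runs are reproducible
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
    results[k] = (scores.mean(), scores.std())
    print(f"K = {k:2d}: mean F1 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

A larger K gives each run more training data but smaller validation folds, so the per-fold scores tend to vary more while each fit costs more time in total.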