# Required assignment 13.1: Optimising a logistic function in Python

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt

## Introduction

In this notebook, you will explore how to use logistic regression to predict the onset of diabetes using the [Pima Indians diabetes data set](https://www.kaggle.com/uciml/pima-indians-diabetes-database). The notebook walks you through data preparation, model building, parameter selection and model evaluation, with a focus on practical interpretation.

Specifically, you will:

- Work with the Pima Indians diabetes data set. 

- Build a logistic regression model.

- Choose the best parameters.

- Select the probability threshold based on the false positive rate (FPR) and the false negative rate (FNR).

## Download and prepare the data

### Question 1

- Read the `diabetes.csv` data set and store it in the variable `data`.

- Display the columns of the data set and assign them to `columns`.  

In [None]:
###GRADED CELL
data = ...
columns = ...

###BEGIN SOLUTION
data = pd.read_csv('data/diabetes.csv')
columns = data.columns
###END SOLUTION

print("The columns in the dataset are given by ", columns)

## Preprocess the data

Check for null values in each column of the data.

In [None]:
print(data.isna().sum())

Split the data into inputs and outputs.

### Question 2

- Identify the input and output columns of the data. Assign their names to the variables `inputs` and `outputs`, respectively.

- Assign the input columns of `data` to `X`.

- Assign the output column of `data` to `Y`.

Hint: Ensure that the output is a 1D array of shape `(n_samples,)`.
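
As a rough illustration of the hint, the sketch below shows how selecting a column with a list of names yields a 2D array, and how `reshape(-1)` flattens it. The frame `toy` and its values are made up for this example and are not the real data set.

```python
import pandas as pd

# Toy frame with made-up values (an assumption for illustration only)
toy = pd.DataFrame({"Glucose": [148, 85, 183], "Outcome": [1, 0, 1]})

# Selecting with a list of column names keeps two dimensions: shape (3, 1)
y_2d = toy[["Outcome"]].to_numpy()

# reshape(-1) flattens the column into a 1D array: shape (3,)
y_1d = y_2d.reshape(-1)

print(y_2d.shape, y_1d.shape)
```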

In [None]:
###GRADED CELL
inputs = ...
outputs = ...
X = ...
Y = ...

###BEGIN SOLUTION
inputs = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
outputs = ['Outcome']


X = data[inputs]
Y = data[outputs].to_numpy().reshape(-1)

###END SOLUTION
print("The inputs are given by:\n", X.head())
print("The outputs are given by:", Y[:5])

Preprocess the data using the `StandardScaler()` function.

In [None]:
scaler = StandardScaler().fit(X)
X = scaler.transform(X)
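
To see what standardisation does, here is a minimal sketch on a made-up array (`X_toy` is an assumption, not part of the data set). After scaling, each column should have a mean of approximately 0 and a standard deviation of approximately 1, regardless of its original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

# fit_transform learns each column's mean and standard deviation, then rescales
X_scaled = StandardScaler().fit_transform(X_toy)

print("Column means:", X_scaled.mean(axis=0))
print("Column standard deviations:", X_scaled.std(axis=0))
```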

Divide the data set into a 70/30 training and testing split.

### Question 3

- Use `train_test_split()` to split the data into 70 per cent training and 30 per cent testing sets.

- Use `random_state=42`.
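
The following sketch, on made-up arrays, illustrates how `test_size` controls the split proportions and how a fixed `random_state` makes the split reproducible. The names `X_toy` and `y_toy` are assumptions for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.3 holds out 30 per cent of the samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
print(np.array_equal(X_te, X_te2))
```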

In [None]:
###GRADED CELL

X_train, X_test, Y_train, Y_test = None, None, None, None

###BEGIN SOLUTION
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)
###END SOLUTION
print("The shape of X_train is ", X_train.shape)
print("The shape of X_test is ", X_test.shape)

## Build a logistic regression model

Recall that logistic regression models the relationship between one or more independent variables and a binary dependent variable by applying a sigmoid (logistic) function to produce an output between 0 and 1, representing the estimated probability of the event. It is a simple, interpretable model that is commonly used for binary classification problems.
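
The mapping from a linear score to a probability can be sketched as follows. The weights `w`, bias `b` and sample `x` are hypothetical values chosen for illustration; they are not fitted to the diabetes data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one hypothetical standardised sample
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.2, 0.4])

z = w @ x + b      # linear score
p = sigmoid(z)     # estimated probability that Y = 1 given x

print(f"score = {z:.3f}, probability = {p:.3f}")
```

A score of 0 maps to a probability of 0.5, large positive scores approach 1 and large negative scores approach 0.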

### Question 4

- Create a `LogisticRegression` model with `penalty=None` and assign it to `model`.

- Use `.fit()` to train the model on the training set.

- Make predictions on the training and testing sets and assign them to `Y_train_pred` and `Y_test_pred`, respectively.

In [None]:
###GRADED CELL
model = None
Y_train_pred = None
Y_test_pred = None

###BEGIN SOLUTION
model = LogisticRegression(penalty=None)
model.fit(X_train, Y_train)
Y_train_pred = model.predict(X_train)
Y_test_pred = model.predict(X_test)
###END SOLUTION
print("The shape of Y_train_pred is ", Y_train_pred.shape)
print("The shape of Y_test_pred is ", Y_test_pred.shape)

### Question 5

- Compute the confusion matrices for the training and testing data sets and assign them to `confusion_matrix_train` and `confusion_matrix_test`, respectively. 

- Compute the `train_accuracy` and `test_accuracy` using `accuracy_score()` from `sklearn.metrics`.
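
As a small illustration of how to read the output, the sketch below uses made-up labels (`y_true` and `y_pred` are assumptions). In scikit-learn's `confusion_matrix`, rows correspond to the actual class and columns to the predicted class.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels for illustration only
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Row i, column j counts samples of actual class i predicted as class j
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy is the fraction of correct predictions (the diagonal of cm)
acc = accuracy_score(y_true, y_pred)
print(acc)
```

Here three of the five predictions are correct, so the diagonal sums to 3 and the accuracy is 0.6.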

In [None]:
###GRADED CELL
confusion_matrix_train = None
confusion_matrix_test = None
train_accuracy = None
test_accuracy = None

###BEGIN SOLUTION
confusion_matrix_train = confusion_matrix(Y_train, Y_train_pred)
confusion_matrix_test = confusion_matrix(Y_test, Y_test_pred)

train_accuracy = accuracy_score(Y_train, Y_train_pred)
test_accuracy = accuracy_score(Y_test, Y_test_pred)
###END SOLUTION

print('Training Confusion Matrix', confusion_matrix_train)

print()
print('Testing Confusion Matrix', confusion_matrix_test)

print()
print('Training accuracy', train_accuracy)

print()
print('Testing accuracy', test_accuracy)


## Choose the best parameters

Regularisation involves adding a penalty to the loss function to reduce model complexity and help prevent overfitting. Ideally, this leads to better generalisation.

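- Analyse the effect of L2 regularisation by examining how the cross-validated accuracy changes with different values of $C$, the inverse of the regularisation strength (a smaller $C$ means a stronger penalty).

- Create a plot of the cross-validated accuracy against $C$, with $C$ ranging from $10^{-6}$ to $10^{-2}$.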
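
To build intuition for the role of `C`, the sketch below fits models with three values of `C` on synthetic data and prints the size of the coefficient vector. The data from `make_classification` and the chosen values of `C` are assumptions for illustration. Because scikit-learn treats `C` as the inverse of the regularisation strength, smaller values should shrink the coefficients more.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real data set (illustration only)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

norms = []
for C in [1e-4, 1e-2, 1.0]:
    # LogisticRegression applies an L2 penalty by default
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_syn, y_syn)
    norms.append(np.linalg.norm(clf.coef_))
    print(f"C = {C:g}: coefficient norm = {norms[-1]:.4f}")
```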

### Question 6

- Create a `LogisticRegression` model with `penalty='l2'` and `max_iter=1000`.

- Use `GridSearchCV` with cross-validation `cv=5` and `scoring='accuracy'`.

- Extract the `mean_test_score` from `grid_search.cv_results_` and assign it to `mean_accuracies`.

In [None]:
###GRADED CELL

#Note: 1e-6 is 10^-6, whereas 10e-6 would be 10^-5
C_space = np.linspace(1e-6, 1e-2, 200)
param_grid = {'C': C_space}
logreg = None
grid_search = None
mean_accuracies = None

###BEGIN SOLUTION

logreg = LogisticRegression(penalty='l2', max_iter=1000)

grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, Y_train.ravel())


mean_accuracies = grid_search.cv_results_['mean_test_score']
###END SOLUTION
print('The mean_accuracies is given by ', mean_accuracies)


In [None]:
plt.plot(C_space, mean_accuracies)
plt.xlabel('C')
plt.ylabel('Cross-validated Accuracy')
plt.title('Analysis of Regularisation')



### Question 7

- Identify the `best_C` value and its corresponding cross-validated accuracy.

- Compute `best_index`, the index of the highest mean accuracy, and use it to obtain and print `best_C` and `best_accuracy`.

- Compute and print the `test_accuracy` of the `best_model` on the test set.

Hint: `best_model` can be obtained using `grid_search.best_estimator_`.
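
The relationship between `cv_results_` and the fitted search object can be sketched on synthetic data as follows. The data, the grid `C_grid` and the variable names are assumptions for this example. The maximum of `mean_test_score` should match `best_score_`, and `best_estimator_` is the model refitted on the full training data with the selected parameters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small hypothetical grid (illustration only)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
C_grid = np.array([0.001, 0.01, 0.1, 1.0])

search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": C_grid}, cv=5, scoring="accuracy"
)
search.fit(X_syn, y_syn)

# One mean cross-validated score per candidate value of C, in grid order
scores = search.cv_results_["mean_test_score"]
idx = np.argmax(scores)

print("Best C from argmax:", C_grid[idx], "with accuracy", scores[idx])
print("Reported by GridSearchCV:", search.best_params_["C"], search.best_score_)
```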

In [None]:
###GRADED CELL
best_C = None
best_accuracy = None
best_model = None
test_accuracy = None
###BEGIN SOLUTION
best_index = np.argmax(mean_accuracies)
best_C = C_space[best_index]
best_accuracy = mean_accuracies[best_index]
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, Y_test)
###END SOLUTION
print(f'Best cross-validated accuracy was achieved with C = {best_C}, giving an accuracy of {best_accuracy:.4f}.')



print(f'Accuracy of the best model on the test set: {test_accuracy:.4f}')


## Select the probability threshold based on the FPR and FNR

Recall that you're not estimating $Y$ directly; you're estimating the probability $P(Y = 1 \mid X)$.

So far, classification has been based on selecting the class with the higher probability. In other words, you've been using the following classifier:
$$
  \hat{Y}(x) =
    \begin{cases}
      0 & \text{if } \hat{p}(x) < 0.5 \\
      1 & \text{if } \hat{p}(x) \geq 0.5 \\
    \end{cases}       
$$

However, you also need to consider that the probability threshold of 0.5 may not be optimal.
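
A minimal sketch of how the threshold changes the predicted classes is shown below. The probabilities in `p_hat` are made up for illustration. Raising the threshold turns borderline positives into negatives, while lowering it does the opposite.

```python
import numpy as np

# Made-up estimated probabilities of class 1
p_hat = np.array([0.10, 0.35, 0.50, 0.65, 0.90])

results = {}
for t in (0.3, 0.5, 0.7):
    # Predict class 1 whenever the probability is at least the threshold
    results[t] = (p_hat >= t).astype(int)
    print(f"threshold = {t}: predictions = {results[t]}")
```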

### Question 8

- Implement a function `predict_with_threshold` that takes a trained logistic regression model, a feature matrix `X` and a probability `threshold`. It should return class predictions (0 or 1) based on whether the predicted probability for class 1 is greater than or equal to the threshold.

- For thresholds ranging from 0 to 1, compute and store the FNR and FPR on the test set.

- Plot the FNR and FPR as functions of the threshold to observe how these metrics change with varying thresholds.

Hint: Use the `confusion_matrix` function from `sklearn.metrics` to compute the FNR and FPR.
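
As a sketch of the hint, the two rates can be read off a confusion matrix as shown below. The labels are made up for illustration. The FPR is the share of actual negatives predicted as positive, and the FNR is the share of actual positives predicted as negative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 3 actual negatives and 4 actual positives
y_true = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1])

# labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

fpr = fp / (fp + tn)   # false positives / actual negatives
fnr = fn / (fn + tp)   # false negatives / actual positives

print(f"FPR = {fpr:.3f}, FNR = {fnr:.3f}")
```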

In [None]:
###GRADED CELL

model = LogisticRegression(penalty = 'l2', C = best_C)
model.fit(X_train, Y_train)
#Define predict_with_threshold(model, X, threshold):
#Calculate the class probabilities and keep the second column, which holds the probabilities of class 1
#Build a Boolean array that checks whether each probability is greater than or equal to the threshold
#Convert the Boolean array to integers (0 or 1) to obtain the predictions
#Return the predictions

###BEGIN SOLUTION
def predict_with_threshold(model, X, threshold):
  probabilities = model.predict_proba(X)
  probabilities = probabilities[:, 1]
  boolean_threshold = (probabilities >= threshold)
  predictions = boolean_threshold.astype(int)

  return predictions
###END SOLUTION



### Question 9

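- Compute the number of actual positives in the test set (samples with `Y_test` equal to 1) and assign it to `num_of_positives`.

- Compute the number of actual negatives in the test set (samples with `Y_test` equal to 0) and assign it to `num_of_negatives`.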

In [None]:
###GRADED CELL
thresholds = np.linspace(0, 1, 101)

false_negatives = []
false_positives = []

###BEGIN SOLUTION

#Count the actual positives and actual negatives in the test set
num_of_positives = np.sum(Y_test)
num_of_negatives = len(Y_test) - num_of_positives

###END SOLUTION

print("The number of actual positives is ", num_of_positives)
print("The number of actual negatives is ", num_of_negatives)
for threshold in thresholds:
  #Predict classes at the current threshold
  Y_pred = predict_with_threshold(model, X_test, threshold)
  #Compute the confusion matrix (rows: actual class, columns: predicted class)
  cm = confusion_matrix(Y_test, Y_pred, labels=[0, 1])
  #Calculate the FNR (false negatives / actual positives) and the FPR (false positives / actual negatives)
  fnr = cm[1, 0] / num_of_positives
  fpr = cm[0, 1] / num_of_negatives
  #Append to lists
  false_negatives.append(fnr)
  false_positives.append(fpr)



The FNR and FPR are plotted against the probability threshold.

In [None]:
#Create a plot to see how the rates change with the threshold
fig, ax = plt.subplots(nrows=1, ncols=1)
fig.set_figheight(6)
fig.set_figwidth(10)
fig.suptitle('Investigation of the FNR and FPR against Probability Threshold')

ax.plot(thresholds, false_negatives, 'r', label = 'FNR')
ax.plot(thresholds, false_positives, 'b', label = 'FPR')
ax.set_ylabel('Performance Metrics')
ax.set_xlabel('Probability Thresholds')
ax.legend()

When the threshold is lowered, the FPR increases, while the FNR decreases. Conversely, increasing the threshold decreases the FPR and increases the FNR. The optimal threshold depends on the context of the problem and the relative costs of false positives and false negatives. 

The point where the FPR and FNR curves intersect on the probability threshold plot is a reasonable candidate as the 'best threshold', assuming that false positives and false negatives are equally costly.
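
One way to locate that intersection numerically is to pick the threshold at which the absolute gap between the two rates is smallest. The sketch below is a minimal illustration under that assumption. The helper `closest_crossing` is a hypothetical name, and the rate values are made-up placeholders rather than results from the model above.

```python
import numpy as np

def closest_crossing(thresholds, fnr, fpr):
    """Return the index and threshold where |FNR - FPR| is smallest."""
    gap = np.abs(np.asarray(fnr) - np.asarray(fpr))
    idx = int(np.argmin(gap))
    return idx, thresholds[idx]

# Made-up placeholder curves: FNR rises and FPR falls as the threshold grows
thresholds_demo = np.linspace(0, 1, 11)
fnr_demo = [0.00, 0.02, 0.05, 0.10, 0.18, 0.30, 0.45, 0.60, 0.80, 0.95, 1.00]
fpr_demo = [1.00, 0.80, 0.60, 0.45, 0.32, 0.28, 0.10, 0.05, 0.02, 0.01, 0.00]

idx, best_t = closest_crossing(thresholds_demo, fnr_demo, fpr_demo)
print(f"Closest crossing at threshold {best_t:.2f} "
      f"(FNR = {fnr_demo[idx]:.2f}, FPR = {fpr_demo[idx]:.2f})")
```

In the notebook, the same helper could be called with `thresholds`, `false_negatives` and `false_positives`. If the two error types are not equally costly, a weighted criterion would be more appropriate than the plain crossing point.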