# Self-study try-it activity 13.1: Changing the parameters of a logistic function

## Introduction

In this notebook, you'll analyse logistic regression using the [Pima Indians diabetes data set](https://www.kaggle.com/uciml/pima-indians-diabetes-database). The goal is to predict the onset of diabetes based on various diagnostic measurements.

Start by downloading and importing the data set.

Then, review a brief recap of logistic regression and its training procedure. You’ll fit a logistic regression model to the data set.

Next, you'll explore how to choose the best model for classification. This includes a short introduction to regularisation. The second set of exercises focuses on selecting the best regularisation constant and examining its effect on the model.

In [None]:
#Importing the necessary libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt

### Download and preprocess the data

- Download the `diabetes.csv` data set and store it in the variable `data`. 

- Display the first five rows of the data set.

In [None]:
data = ...

In [None]:
data = pd.read_csv('data/diabetes.csv')
data.head()

Identify the `inputs` and `outputs` and assign them to the variables `X` and `Y`, respectively.

In [None]:
outputs = ...
inputs = ...
X = ...
Y = ...

In [None]:
outputs = ['Outcome']
inputs = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'DiabetesPedigreeFunction', 'Age']

X = data[inputs]
Y = data[outputs].to_numpy().reshape(-1)

Scale the data and explain why this step needs to be performed.

In [None]:
scaler = StandardScaler().fit(X)
X = scaler.transform(X)

Standardisation transforms each feature so that it has a mean of 0 and a standard deviation of 1. The features in the `diabetes.csv` data set are measured in different units and on different scales. Without scaling, features with larger numeric ranges could dominate the learning process and distort the fitted model.

The `fit` method computes the mean and standard deviation of each feature from the data, and `transform` uses these statistics to centre and scale every feature accordingly.
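
As a minimal sketch of what `fit` and `transform` compute, the same standardisation can be written by hand with NumPy. The small array `X_demo` below is made-up illustrative data, not taken from `diabetes.csv`. `StandardScaler` uses the population standard deviation (`ddof=0`), which the sketch mirrors.

```python
import numpy as np

# Made-up illustrative data: three samples, two features on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# "fit": compute the per-feature mean and (population) standard deviation.
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)  # ddof=0 by default

# "transform": centre each feature and divide by its standard deviation.
X_scaled = (X_demo - mu) / sigma

print(X_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_scaled.std(axis=0))   # approximately 1 for each feature
```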

Finally, divide the data set into an 80/20 training and testing split.

In [None]:
num_of_points = len(Y)

idx = list(range(num_of_points))
np.random.shuffle(idx)
idx_train = idx[:int(num_of_points * 0.8)]
idx_train.sort()
idx_test = idx[int(num_of_points * 0.8):]
idx_test.sort()

X_train = X[idx_train, :]
X_test = X[idx_test, :]

Y_train = Y[idx_train]
Y_test = Y[idx_test]

Note: scikit-learn provides the `train_test_split` function in `sklearn.model_selection` to split data. Try splitting the data using `train_test_split`.
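
One possible way to do this is sketched below. In the notebook you would pass `X` and `Y` directly; the stand-in arrays `X_demo` and `Y_demo` are generated here only so that the snippet runs on its own, and `random_state=0` is an arbitrary choice made for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays so the snippet is self-contained; in the notebook, use X and Y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 7))
Y_demo = rng.integers(0, 2, size=50)

# test_size=0.2 gives the same 80/20 proportions as the manual index shuffling.
X_train_demo, X_test_demo, Y_train_demo, Y_test_demo = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0
)

print(X_train_demo.shape, X_test_demo.shape)
```

Passing `stratify=Y_demo` as well would keep the class proportions similar in the training and testing sets.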

## Logistic regression

Logistic regression is a binary classifier that predicts the probability that an input belongs to the positive class.

Assume you have a set of predictors $x \in \mathcal{X}$ and a binary output $y \in \{0, 1 \}$. You want to estimate the probability of an input belonging to the positive class, i.e. you want to build an estimator $\hat{p}(x)$ such that:

$$
\hat{p}(x) = \mathbb{P}( Y = 1 | X = x )
$$

You might first consider using a linear regression model:

$$
\hat{p}(x) = \beta_0 + x \beta_1
$$

However, this leads to a problem: probabilities must lie between $0$ and $1$, and the linear model does not guarantee this.

To address this issue, you can wrap the linear model in a function that constrains the output to the interval $(0, 1)$. Here, you can use the sigmoid function:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

Applying the sigmoid to the linear model gives you the logistic regression model:

$$
\hat{p}(x) = \sigma(\beta_0 + x \beta_1)
$$

which can also be written as:

$$
\hat{p}(x) = \frac{e^{\beta_0 + x \beta_1}}{1 + e^{\beta_0 + x \beta_1}}
$$
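
The short sketch below illustrates these formulas numerically. The coefficients `beta_0` and `beta_1` are arbitrary values chosen for illustration, not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative coefficients (not fitted to the diabetes data).
beta_0, beta_1 = 0.5, 2.0
x = np.linspace(-3, 3, 7)

linear = beta_0 + x * beta_1   # plain linear model: can fall outside [0, 1]
p_hat = sigmoid(linear)        # logistic model: strictly between 0 and 1

# The two ways of writing the logistic model agree.
p_hat_alt = np.exp(linear) / (1.0 + np.exp(linear))

print(linear.min(), linear.max())
print(p_hat.min(), p_hat.max())
print(np.allclose(p_hat, p_hat_alt))
```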

## Train the function

The parameters of a linear regression model are typically estimated using least squares. However, this method isn’t well suited to logistic regression because you're not estimating $Y$ directly. Instead, you're estimating *the probability* that $Y = 1$, which makes maximum likelihood estimation a more appropriate choice. The likelihood of the observed labels is:

$$
\mathcal{L}(\beta) = \prod_{i : y_i = 1} \mathbb{P}(Y = 1 | X = x_i) \prod_{i' : y_{i'} = 0} (1 - \mathbb{P}(Y = 1 | X = x_{i'}))
$$

Rather than maximising the likelihood directly, it’s more common to minimise the negative log-likelihood:

$$
\ell(\beta) = - \log \mathcal{L}(\beta) = - \sum_{i : y_i = 1} \log\sigma(x_i^T \beta) - \sum_{i' : y_{i'} = 0}\log \left(1 - \sigma(x_{i'}^T\beta)\right)
$$

The resulting parameter estimate $\hat{\beta}$ is the minimiser of this loss:

$$
\hat{\beta} = \arg\min_{\beta} \ell(\beta)
$$

You can solve this using any gradient-based optimiser. Additionally, you can incorporate regularisation to reduce overfitting. You’ll explore this concept in more detail later in this notebook.
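
To make the training procedure concrete, the following is a minimal sketch of gradient descent on the negative log-likelihood using NumPy only. The synthetic data, the coefficients `beta_true`, the learning rate and the number of steps are all assumptions chosen for illustration; `LogisticRegression` in scikit-learn relies on more sophisticated solvers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(beta, X, y):
    """Negative log-likelihood of logistic regression.

    Uses -log(sigmoid(z)) = log(1 + exp(-z)) and
    -log(1 - sigmoid(z)) = log(1 + exp(z)) for numerical stability.
    """
    z = X @ beta
    return np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))

def gradient(beta, X, y):
    """Gradient of the negative log-likelihood: X^T (sigmoid(X beta) - y)."""
    return X.T @ (sigmoid(X @ beta) - y)

# Synthetic data (an assumption for illustration): intercept column plus two features.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)

# Plain gradient descent on the average loss, starting from beta = 0.
beta = np.zeros(3)
learning_rate = 0.1
initial_loss = neg_log_likelihood(beta, X, y)
for _ in range(2000):
    beta -= learning_rate * gradient(beta, X, y) / n

final_loss = neg_log_likelihood(beta, X, y)
print(f'Loss at beta = 0: {initial_loss:.2f}')
print(f'Loss after gradient descent: {final_loss:.2f}')
print(f'Estimated coefficients: {beta}')
```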

### Question 1

Prove that the sigmoid function always produces an output between 0 and 1.
Hint: Consider the limits as $x \rightarrow \pm \infty$ and show that the function is increasing.

### Solution

The sigmoid function is $\sigma(x) = \frac{1}{1 + e^{-x}}$. Its output is always between 0 and 1.

**Limits as $x \to -\infty$ and $x \to +\infty$**

**As $x \to -\infty$:**

- $-x$ becomes very large and positive.

- Therefore, $e^{-x}$ becomes very **large**.

- The denominator $1 + e^{-x}$ is also **very large**.

- Therefore:
  $$
  \sigma(x) = \frac{1}{\text{very large number}} \approx 0
  $$

**As $x \to +\infty$:**

- $-x$ becomes very large and **negative**.

- Therefore, $e^{-x} \to 0$.

- The denominator $1 + e^{-x} \to 1$.

- Therefore:
  $$
  \sigma(x) = \frac{1}{1 + 0} = 1
  $$

**The output is always between 0 and 1.**

The exponential function $e^{-x}$ is always **positive** for any real $x$.

Thus, the denominator $1 + e^{-x}$ is always **greater than 1**.

Therefore, the sigmoid function:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

is always:

- **Greater than 0** because the numerator is 1 and the denominator is always positive.
- **Less than 1** because the denominator is always greater than 1.

Hence, the output of $\sigma(x)$ always lies **strictly between 0 and 1**.

**The function always increases.**

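The sigmoid function is **monotonically increasing**: as $x$ increases, $\sigma(x)$ also increases. To see this, note that $e^{-x}$ is strictly decreasing in $x$, so the denominator $1 + e^{-x}$ is positive and strictly decreasing, and its reciprocal $\sigma(x)$ is therefore strictly increasing. Equivalently, the derivative $\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)(1 - \sigma(x))$ is positive for every real $x$.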

This means the function **never decreases** but smoothly transitions from values near 0 to values near 1 as $x$ goes from $-\infty$ to $+\infty$.

In other words:

$$
x_1 < x_2 \quad \Rightarrow \quad \sigma(x_1) < \sigma(x_2)
$$

The output of $\sigma(x)$ **steadily rises**, making it a smooth and useful activation function for binary classification and neural networks.
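
The monotonicity claim can also be checked numerically. The sketch below compares a finite-difference estimate of the slope with the closed-form derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$; the grid and the step size `h` are arbitrary choices, and a numerical check on a finite grid complements the argument rather than replacing it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 2001)
s = sigmoid(x)

# Closed-form derivative of the sigmoid.
analytic = s * (1.0 - s)

# Central finite-difference approximation of the derivative.
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-8))  # do the two slopes agree?
print(np.all(analytic > 0))                       # is the slope positive on the grid?
print(np.all(np.diff(s) > 0))                     # do the values strictly increase?
```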

### Question 2

Using the `LogisticRegression` class from scikit-learn (`sklearn.linear_model`), train a model on the data set above.

Make sure you are not regularising. For more information, review the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

Show the model's training and testing accuracy and build a confusion matrix for each set.

(Hint: The methods and functions required are `.fit()`, `.predict()`, `confusion_matrix()` and `accuracy_score()`.)

In [None]:
model = ...
Y_train_pred = ...
Y_test_pred = ...
confusion_matrix_train = ...
confusion_matrix_test = ...
train_accuracy = ...
test_accuracy = ...

In [None]:
#Define the model
model = LogisticRegression(penalty = None)
#Train the model
model.fit(X_train, Y_train)
#Create the predictions
Y_train_pred = model.predict(X_train)
Y_test_pred = model.predict(X_test)
#Build the confusion matrices
confusion_matrix_train = confusion_matrix(Y_train, Y_train_pred)
confusion_matrix_test = confusion_matrix(Y_test, Y_test_pred)
#Calculate the accuracies
train_accuracy = accuracy_score(Y_train, Y_train_pred)
test_accuracy = accuracy_score(Y_test, Y_test_pred)
#Display the results
print('Training Confusion Matrix')
print(confusion_matrix_train)
print()
print(f'Training accuracy = {train_accuracy}')
print()
print('Testing Confusion Matrix')
print(confusion_matrix_test)
print()
print(f'Testing accuracy = {test_accuracy}')
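
As a sketch of how the two outputs relate, the accuracy can be recovered from a confusion matrix. scikit-learn's `confusion_matrix` places the true labels on the rows and the predicted labels on the columns, so for binary labels the layout is `[[TN, FP], [FN, TP]]`. The labels below are made up for illustration and are not results from the diabetes data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Correct predictions sit on the diagonal of the confusion matrix.
accuracy_from_cm = (tn + tp) / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```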

## Regularisation

Regularisation involves adding a penalty term to the loss function to reduce model complexity and help prevent overfitting. Ideally, this leads to better generalisation.

In the case of L2 regularisation, you can modify the optimisation objective as follows:

$$
\hat{\beta} = \arg\min_\beta \{ C \cdot \ell(\beta) + \frac{1}{2}\beta^T \beta \}
$$

This penalty encourages the parameters $\beta$ to stay closer to 0 and effectively simplifies the model by reducing the influence of any single feature. 
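
The shrinkage effect can be seen in a small experiment. The sketch below fits `LogisticRegression` (whose default penalty is L2) on synthetic data and prints the norm of the fitted coefficients for several values of $C$. The synthetic data, `beta_true` and the chosen values of `C` are assumptions made so that the snippet runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
beta_true = np.array([1.5, -2.0, 0.5])
p = 1.0 / (1.0 + np.exp(-(X_demo @ beta_true)))
y_demo = (rng.random(200) < p).astype(int)

# Smaller C means a stronger L2 penalty and smaller coefficients.
norms = []
for C in [1e-4, 1e-2, 1.0]:
    model = LogisticRegression(C=C)  # default penalty is L2
    model.fit(X_demo, y_demo)
    norms.append(np.linalg.norm(model.coef_))
    print(f'C = {C:g}: ||beta|| = {norms[-1]:.4f}')
```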

### Question 3

- Analyse the effect of L2 regularisation. In particular, focus on how the testing accuracy changes for different values of $C$. 

- Create a plot that shows how the testing accuracy varies with $C$, for values of $C$ starting at $10^{-6}$ and ending at $10^{-2}$.

### Solution

In [None]:

#Values of C from 10^-6 to 10^-2 (note that 10e-6 would be 10^-5)
C_space = np.linspace(1e-6, 1e-2, 200)
accuracies = []

for C in C_space:
  model = LogisticRegression(penalty = 'l2', C = C)
  model.fit(X_train, Y_train)
  Y_test_pred = model.predict(X_test)
  accuracies.append(accuracy_score(Y_test, Y_test_pred))

plt.plot(C_space, accuracies)
plt.xlabel('C')
plt.ylabel('Testing Accuracy')
plt.title('Analysis of Regularisation')

best_model = np.argmax(accuracies)
best_accuracy = accuracies[best_model]
best_C = C_space[best_model]

print(f'Best test accuracy was achieved with C = {best_C}, giving an accuracy of {best_accuracy}.')
print(f'Without regularisation, we achieved an accuracy of {test_accuracy}.')
print()
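
Because $C$ spans four orders of magnitude in this question, an evenly spaced grid places almost all of its points near the upper end of the range. One alternative, sketched below as a suggestion rather than as part of the original solution, is a logarithmically spaced grid from `np.logspace`; the grid size of 200 simply mirrors the solution above.

```python
import numpy as np

C_linear = np.linspace(1e-6, 1e-2, 200)
C_log = np.logspace(-6, -2, 200)  # from 10**-6 to 10**-2, evenly spaced in log10(C)

# Count how many grid points fall in the lowest two decades (below 1e-4).
print(np.sum(C_linear < 1e-4))
print(np.sum(C_log < 1e-4))
```

When plotting against a log-spaced grid, calling `plt.xscale('log')` spreads the points evenly along the horizontal axis.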

### Question 4

- What behaviour do you observe as regularisation increases (i.e. as $C$ decreases)?

- From your analysis, which value of the regularisation constant gives the best result? How does the corresponding testing accuracy compare with your earlier results?

- For which values of $C$ does the model recover the previous training accuracy? Why do you think this occurs?

### Solutions

- When $C$ is very small in logistic regression, it implies strong regularisation, which heavily shrinks the model coefficients towards 0. This simplification can help prevent overfitting but may also lead to underfitting when the model fails to capture important patterns in the data. As a result, generalisation to unseen data may improve, but training performance can decline due to increased bias and the model's inability to represent more complex relationships.

- There may be some improvement in test performance with regularisation, but this is not guaranteed, as it depends on the data and the severity of overfitting in the unregularised model.

- You can recover the previous training accuracy when $C$ is large, which corresponds to weak regularisation. In this case, the penalty on the model coefficients is minimal, allowing them to grow larger and fit the training data more closely. This often leads to high training accuracy but at the risk of overfitting, meaning the model captures noise or specific idiosyncrasies in the training set that don’t generalise well to new data. As a result, validation or test performance may degrade despite strong training results.