# Lesson 26: multilayer perceptron activity

## Notebook setup
### Imports

In [1]:
# Third party imports
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

## 1. Data preparation

### 1.1. Load diabetes dataset

In [2]:
diabetes_df = pd.read_csv('https://gperdrizet.github.io/FSA_devops/assets/data/unit3/diabetes_prediction_train.csv')

In [3]:
diabetes_df.head()

| | id | age | alcohol_consumption_per_week | physical_activity_minutes_per_week | diet_score | sleep_hours_per_day | screen_time_hours_per_day | bmi | waist_to_hip_ratio | systolic_bp | ... | gender | ethnicity | education_level | income_level | smoking_status | employment_status | family_history_diabetes | hypertension_history | cardiovascular_history | diagnosed_diabetes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 31 | 1 | 45 | 7.7 | 6.8 | 6.1 | 33.4 | 0.93 | 112 | ... | Female | Hispanic | Highschool | Lower-Middle | Current | Employed | 0 | 0 | 0 | 1.0 |
| 1 | 1 | 50 | 2 | 73 | 5.7 | 6.5 | 5.8 | 23.8 | 0.83 | 120 | ... | Female | White | Highschool | Upper-Middle | Never | Employed | 0 | 0 | 0 | 1.0 |
| 2 | 2 | 32 | 3 | 158 | 8.5 | 7.4 | 9.1 | 24.1 | 0.83 | 95 | ... | Male | Hispanic | Highschool | Lower-Middle | Never | Retired | 0 | 0 | 0 | 0.0 |
| 3 | 3 | 54 | 3 | 77 | 4.6 | 7.0 | 9.2 | 26.6 | 0.83 | 121 | ... | Female | White | Highschool | Lower-Middle | Current | Employed | 0 | 1 | 0 | 1.0 |
| 4 | 4 | 54 | 1 | 55 | 5.7 | 6.2 | 5.1 | 28.8 | 0.9 | 108 | ... | Male | White | Highschool | Upper-Middle | Never | Retired | 0 | 1 | 0 | 1.0 |


In [4]:
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700000 entries, 0 to 699999
Data columns (total 26 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   id                                  700000 non-null  int64  
 1   age                                 700000 non-null  int64  
 2   alcohol_consumption_per_week        700000 non-null  int64  
 3   physical_activity_minutes_per_week  700000 non-null  int64  
 4   diet_score                          700000 non-null  float64
 5   sleep_hours_per_day                 700000 non-null  float64
 6   screen_time_hours_per_day           700000 non-null  float64
 7   bmi                                 700000 non-null  float64
 8   waist_to_hip_ratio                  700000 non-null  float64
 9   systolic_bp                         700000 non-null  int64  
 10  diastolic_bp                        700000 non-null  int64  
 11  heart_rate                

Define the label and feature columns below. The label is `diagnosed_diabetes` (binary: 0 or 1). Use numeric columns as features, taking the names from the `diabetes_df.info()` output above, for example `age`, `bmi`, `systolic_bp`, `diastolic_bp`, `hypertension_history` and `cardiovascular_history`.

In [5]:
label = # YOUR CODE HERE
features = # YOUR CODE HERE

### 1.2. Train test split

Use `train_test_split` to split the data into training and testing sets. Use `random_state=315` for reproducibility.
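
As a rough illustration of how `train_test_split` behaves when handed a whole dataframe, here is a minimal sketch. The `toy_df` dataframe and its columns are hypothetical stand-ins, not the diabetes data. With no `test_size` given, scikit-learn holds out 25% of the rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe standing in for diabetes_df
toy_df = pd.DataFrame({
    'x': range(100),
    'y': [i % 2 for i in range(100)]
})

# With neither test_size nor train_size set, 25% of rows go to the test set
toy_training_df, toy_testing_df = train_test_split(toy_df, random_state=315)

print(toy_training_df.shape, toy_testing_df.shape)
```

Passing the same `random_state` on every run should give the same split, which is what makes later results comparable between runs.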

In [None]:
training_df, testing_df = # YOUR CODE HERE

### 1.3. Standard scale features

Neural networks perform better when features are scaled. Use `StandardScaler` to fit on the training features and transform both training and testing features.

**Hint:** Fit the scaler on `training_df[features]`, then transform both `training_df[features]` and `testing_df[features]`.
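
The fit-on-train, transform-both pattern from the hint might look like the sketch below. The `toy_train`, `toy_test` and `toy_features` names and the random data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for training_df and testing_df
rng = np.random.default_rng(315)
toy_train = pd.DataFrame({'a': rng.normal(50, 10, 200), 'b': rng.normal(0, 3, 200)})
toy_test = pd.DataFrame({'a': rng.normal(50, 10, 50), 'b': rng.normal(0, 3, 50)})
toy_features = ['a', 'b']

toy_scaler = StandardScaler()

# Learn the mean and standard deviation from the training features only
toy_scaler.fit(toy_train[toy_features])

# Apply the same training-set statistics to both splits
toy_train[toy_features] = toy_scaler.transform(toy_train[toy_features])
toy_test[toy_features] = toy_scaler.transform(toy_test[toy_features])

print(toy_train[toy_features].mean().round(3))
```

Fitting on the training split alone keeps information about the test set out of the preprocessing step, so the scaled test features will have a mean close to, but not exactly, zero.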

In [None]:
feature_scaler = StandardScaler()

# YOUR CODE HERE: fit and transform the features

## 2. Logistic regression model

Logistic regression is a linear model for classification. It serves as a good baseline before trying more complex models like neural networks.

### 2.1. Fit

Create a `LogisticRegression` model and fit it on the training data. Use `max_iter=1000` to ensure convergence.
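
A minimal sketch of the create-and-fit step, run on synthetic data from `make_classification` rather than the diabetes dataframe. The names `X_toy`, `y_toy`, `baseline_sketch` and `fit_result_sketch` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption: stands in for scaled features)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=315)

# max_iter raises the solver's iteration limit above the default of 100
baseline_sketch = LogisticRegression(max_iter=1000)

# fit() trains the model in place and returns the same estimator object
fit_result_sketch = baseline_sketch.fit(X_toy, y_toy)

print(baseline_sketch.coef_.shape)
```

In the activity itself, the inputs to `fit` would be the scaled training features and the training label column instead of the synthetic arrays.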

In [None]:
logistic_model = # YOUR CODE HERE
fit_result = # YOUR CODE HERE

### 2.2. Test set evaluation

For classification, use metrics such as accuracy, F1 score and AUC-ROC instead of R². All of these are available in sklearn's [`metrics`](https://scikit-learn.org/stable/api/sklearn.metrics.html) module.
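
To make the three metrics concrete, here is a small sketch on hand-written labels. The `y_true`, `y_pred` and `y_score` lists are hypothetical values chosen for illustration, not model output. Note that `roc_auc_score` is normally given scores or probabilities for the positive class (for example the second column of `predict_proba`) rather than hard 0/1 predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical true labels, hard predictions and positive-class scores
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.8, 0.7, 0.4, 0.2]

# Fraction of labels predicted correctly
toy_accuracy = accuracy_score(y_true, y_pred)

# Harmonic mean of precision and recall for the positive class
toy_f1 = f1_score(y_true, y_pred)

# Probability that a random positive is scored above a random negative
toy_auc = roc_auc_score(y_true, y_score)

print(f'{toy_accuracy:.4f} {toy_f1:.4f} {toy_auc:.4f}')
```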

In [None]:
logistic_predictions = # YOUR CODE HERE
logistic_accuracy = # YOUR CODE HERE
logistic_f1 = # YOUR CODE HERE
logistic_auc = # YOUR CODE HERE
print(f'Logistic regression accuracy on test set: {logistic_accuracy:.4f}')
print(f'Logistic regression F1 score on test set: {logistic_f1:.4f}')
print(f'Logistic regression AUC-ROC score on test set: {logistic_auc:.4f}')

### 2.3. Performance analysis

For classification, visualize performance using a confusion matrix.
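
Before plotting, it can help to see what the matrix itself contains. The sketch below computes one from hypothetical labels with `confusion_matrix`; rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

toy_cm = confusion_matrix(y_true, y_pred)

# For binary labels the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = toy_cm.ravel()

print(toy_cm)
print(f'TN={tn} FP={fp} FN={fn} TP={tp}')
```

For the plot, one option is `ConfusionMatrixDisplay.from_predictions` from the same module, which draws the matrix directly from true and predicted labels.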

In [None]:
# YOUR CODE HERE

## 3. Multilayer perceptron (MLP) classifier

Now let's build a neural network classifier using sklearn's [`MLPClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

### 3.1. Single epoch training function

Complete the training function below. It should:
1. Split the data into training and validation sets
2. Call `partial_fit` on the model (remember to pass `classes=[0, 1]` on the first call)
3. Record training and validation [`log_loss`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html) (aka binary cross-entropy) in the history dictionary

**Hint:** Use `model.partial_fit(X, y, classes=[0, 1])` for the first epoch. For subsequent epochs, `partial_fit` remembers the classes.
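
The three steps above can be sketched on synthetic arrays as follows. This is not the `train` function the activity asks for: it works on NumPy arrays instead of a dataframe, and the names `toy_model`, `toy_history`, `X_train` and `X_val` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data, split into training and validation sets
X, y = make_classification(n_samples=500, n_features=6, random_state=315)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=315)

toy_model = MLPClassifier(hidden_layer_sizes=(8,), random_state=315)
toy_history = {'training_loss': [], 'validation_loss': []}

for epoch in range(3):

    # Each partial_fit call makes one pass over the data;
    # the class labels are required on the first call
    if epoch == 0:
        toy_model.partial_fit(X_train, y_train, classes=[0, 1])
    else:
        toy_model.partial_fit(X_train, y_train)

    # log_loss compares true labels against predicted class probabilities
    toy_history['training_loss'].append(
        log_loss(y_train, toy_model.predict_proba(X_train))
    )
    toy_history['validation_loss'].append(
        log_loss(y_val, toy_model.predict_proba(X_val))
    )

print(toy_history)
```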

In [None]:
def train(model: MLPClassifier, df: pd.DataFrame, training_history: dict, classes: list = None) -> tuple[MLPClassifier, dict]:
    '''Trains sklearn MLP classifier model on given dataframe using validation split.
    Returns the updated model and training history dictionary containing training and
    validation log loss. If classes are not provided, assumes 0 and 1.'''

    global features, label

    df, val_df = train_test_split(df, random_state=315)
    
    # YOUR CODE HERE: call partial_fit on the model
    # If classes is provided, pass it to partial_fit
    
    # YOUR CODE HERE: append training and validation log loss to history
    
    return model, training_history

### 3.2. Model training

Create an `MLPClassifier` with:
- `hidden_layer_sizes=(64, 32)` - two hidden layers
- `activation='relu'` - ReLU activation function
- `learning_rate_init=0.001` - initial learning rate
- `warm_start=True` - keep weights between calls to fit
- `random_state=315` - for reproducibility

Train for 10 epochs using the training function above.
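
As a sketch, the settings listed above map onto constructor arguments like this. The variable name `mlp_sketch` is hypothetical. Note that the model must be an instance created by calling `MLPClassifier(...)`, not the class itself.

```python
from sklearn.neural_network import MLPClassifier

# Instantiate the classifier with the hyperparameters listed above
mlp_sketch = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # two hidden layers with 64 and 32 units
    activation='relu',            # ReLU activation in the hidden layers
    learning_rate_init=0.001,     # initial learning rate
    warm_start=True,              # reuse previous weights on repeated fit calls
    random_state=315              # reproducible weight initialization
)

print(mlp_sketch.get_params()['hidden_layer_sizes'])
```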

In [None]:
epochs = 10

training_history = {
    'training_loss': [],
    'validation_loss': []
}

mlp_model = # YOUR CODE HERE

for epoch in range(epochs):

    # YOUR CODE HERE

### 3.3. Learning curves

Plot the training and validation loss over epochs to visualize the learning process.
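
One possible shape for such a plot is sketched below. To keep it self-contained, it trains a small model on synthetic data for a few epochs; the names `curve_model` and `curve_history` and the data are assumptions, and the resulting curves say nothing about the diabetes dataset.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data and a small model, used only to produce a loss history
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=315)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, random_state=315)

curve_model = MLPClassifier(hidden_layer_sizes=(8,), random_state=315)
curve_history = {'training_loss': [], 'validation_loss': []}

for _ in range(5):
    curve_model.partial_fit(X_tr, y_tr, classes=[0, 1])
    curve_history['training_loss'].append(log_loss(y_tr, curve_model.predict_proba(X_tr)))
    curve_history['validation_loss'].append(log_loss(y_va, curve_model.predict_proba(X_va)))

# One line per curve, with epoch index on the x-axis
plt.figure()
plt.plot(curve_history['training_loss'], label='Training')
plt.plot(curve_history['validation_loss'], label='Validation')
plt.title('Learning curves')
plt.xlabel('Epoch')
plt.ylabel('Log loss')
plt.legend()
```

A validation curve that flattens or rises while the training curve keeps falling is the usual sign of overfitting.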

In [None]:
# YOUR CODE HERE: plot training and validation loss
# Use plt.plot() for each curve
# Add title, xlabel, ylabel, and legend

### 3.4. Test set evaluation

Evaluate the MLP model on the test set, similar to how you evaluated the logistic regression model.

In [None]:
mlp_predictions = # YOUR CODE HERE
mlp_accuracy = # YOUR CODE HERE
mlp_f1 = # YOUR CODE HERE
mlp_auc = # YOUR CODE HERE
print(f'MLP accuracy on test set: {mlp_accuracy:.4f}')
print(f'MLP F1 score on test set: {mlp_f1:.4f}')
print(f'MLP AUC-ROC score on test set: {mlp_auc:.4f}')

### 3.5. Performance analysis

Create a confusion matrix for the MLP model predictions.

In [None]:
# YOUR CODE HERE: create confusion matrix for MLP predictions
# Follow the same pattern as the logistic regression confusion matrix

## 4. Model comparison

Compare the performance of both models side by side.

In [None]:
print(f'Logistic Regression accuracy on test set: {logistic_accuracy:.4f}')
print(f'MLP accuracy on test set: {mlp_accuracy:.4f}')

Create a side-by-side comparison of the confusion matrices for both models.

In [None]:
# YOUR CODE HERE