# Diabetes Detection Models

This [dataset](https://raw.githubusercontent.com/mansont/datasets-tests/main/diabetese.csv) contains patient data and each patient's diabetes status: "1" means the patient has diabetes, "0" means they do not.


Build the following models and compare their performance:
* A logistic regression model
* A single-layer perceptron model
* A multilayer perceptron

#### Imports

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

#### Data prep

In [None]:
url = 'https://raw.githubusercontent.com/mansont/datasets-tests/main/diabetese.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
df.isna().sum()

pregnancies    0
glucose        0
diastolic      0
triceps        0
insulin        0
bmi            0
dpf            0
age            0
diabetes       0
dtype: int64
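
`isna()` only catches values stored as NaN. The preview above shows zeros in `triceps` and `insulin`, which may be placeholders for missing measurements rather than real readings; that interpretation is an assumption, not something the dataset states. A minimal sketch for counting zeros per column, using the five preview rows as stand-in data (the `sample` frame and `suspect_cols` list are illustrative names):

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head() above.
sample = pd.DataFrame({
    'pregnancies': [6, 1, 8, 1, 0],
    'glucose': [148, 85, 183, 89, 137],
    'diastolic': [72, 66, 64, 66, 40],
    'triceps': [35, 29, 0, 23, 35],
    'insulin': [0, 0, 0, 94, 168],
    'bmi': [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns where a zero is ASSUMED (not confirmed) to mean "not measured".
suspect_cols = ['glucose', 'diastolic', 'triceps', 'insulin', 'bmi']

# Count exact zeros per suspect column.
zero_counts = (sample[suspect_cols] == 0).sum()
print(zero_counts)
```

On the full data, the same expression with `df` in place of `sample` would show how widespread the zeros are before deciding whether to impute them.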

In [None]:
features = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age']
target = 'diabetes'
X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
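
The scaler is fitted on the training split only, so the test split is transformed with the training minimum and maximum. A small sketch with made-up numbers (the arrays and the name `demo_scaler` are illustrative, not taken from the dataset) showing the underlying formula, and that test values outside the training range land outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative single-feature data, not from the diabetes dataset.
train = np.array([[20.0], [30.0], [40.0]])
test = np.array([[25.0], [50.0]])

demo_scaler = MinMaxScaler().fit(train)

# Manual version of the same transform: (x - min) / (max - min),
# with min and max taken from the training data only.
train_min, train_max = train.min(axis=0), train.max(axis=0)
manual = (test - train_min) / (train_max - train_min)

scaled = demo_scaler.transform(test)
print(scaled.ravel())
print(np.allclose(scaled, manual))
```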

#### Models

A helper that reports train and test accuracy for a list of fitted models

In [None]:
def compare_accuracies(models: list, X_train, X_test, y_train, y_test):
    """Display train and test accuracy for each already-fitted model."""
    results = {'model': [], 'accuracy_score_test': [], 'accuracy_score_train': []}

    for model in models:
        y_pred_test = model.predict(X_test)
        y_pred_train = model.predict(X_train)

        accuracy_test = accuracy_score(y_test, y_pred_test)
        accuracy_train = accuracy_score(y_train, y_pred_train)

        # str(model) lists only the parameters that differ from the defaults
        results['model'].append(str(model))
        results['accuracy_score_test'].append(accuracy_test)
        results['accuracy_score_train'].append(accuracy_train)

    report_df = pd.DataFrame(results)

    # display() is provided by Jupyter/IPython sessions
    display(report_df)
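
The imports include `roc_auc_score`, but the helper only reports accuracy. A minimal sketch of computing both metrics for one model, using synthetic stand-in data from `make_classification` (the data and the names `X_demo`, `clf`, `acc` and `auc` are assumptions for illustration, not part of the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook itself would use X_train / X_test.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

clf = LogisticRegression().fit(Xtr, ytr)

# Accuracy uses hard 0/1 predictions; ROC AUC uses the positive-class score.
acc = accuracy_score(yte, clf.predict(Xte))
auc = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])
print(f"accuracy={acc:.3f}  roc_auc={auc:.3f}")
```

`Perceptron` does not provide `predict_proba`, so for that model the score passed to `roc_auc_score` would have to come from `decision_function` instead.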

##### Logistic Regression

In [None]:
log_reg_model = LogisticRegression()
log_reg_model.fit(X_train, y_train)

##### Single-layer perceptron model

In [None]:
ppn = Perceptron()
ppn.fit(X_train, y_train)

##### Multilayer perceptron

In [None]:
def build_and_fit_mlp(X, y, layers_dim: tuple[int, ...] = (100,), activation: str = 'relu'):
  mlp = MLPClassifier(max_iter=1000, hidden_layer_sizes=layers_dim, activation=activation)
  mlp.fit(X, y)
  return mlp

In [None]:
mlp = build_and_fit_mlp(X_train, y_train)

In [None]:
compare_accuracies([log_reg_model, ppn, mlp], X_train, X_test, y_train, y_test)

Unnamed: 0,model,accuracy_score_test,accuracy_score_train
0,LogisticRegression(),0.729167,0.763889
1,Perceptron(),0.708333,0.776042
2,MLPClassifier(max_iter=1000),0.739583,0.809028


### Is there a notable difference in the MLP performance when a ReLU, Sigmoid or SoftMax activation function is used?


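Note: the baseline `mlp` fitted above uses ReLU, the default. `MLPClassifier` does not offer softmax as a hidden-layer activation, so tanh, logistic (sigmoid) and identity are compared against it instead.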

In [None]:
mlp2 = build_and_fit_mlp(X_train, y_train, activation='tanh')
mlp3 = build_and_fit_mlp(X_train, y_train, activation='logistic')
mlp4 = build_and_fit_mlp(X_train, y_train, activation='identity')

compare_accuracies([mlp, mlp2, mlp3, mlp4], X_train, X_test, y_train, y_test)

Unnamed: 0,model,accuracy_score_test,accuracy_score_train
0,MLPClassifier(max_iter=1000),0.739583,0.809028
1,"MLPClassifier(activation='tanh', max_iter=1000)",0.729167,0.776042
2,"MLPClassifier(activation='logistic', max_iter=...",0.744792,0.769097
3,"MLPClassifier(activation='identity', max_iter=...",0.729167,0.776042


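The logistic (sigmoid) activation has the highest test accuracy, 0.745 versus 0.740 for ReLU and 0.729 for tanh and identity. The gaps are around one percentage point or less, and the models are fitted without a fixed `random_state`, so this is not a notable difference.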
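
One way to judge whether a gap this small is meaningful is to refit the same architecture with several random seeds and look at the spread of test accuracy. A minimal sketch on synthetic stand-in data (the data, the smaller `(20,)` layer, and the helper name `seed_scores` are assumptions made to keep the example quick):

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Short training runs may not converge; silence that warning for the demo.
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Synthetic stand-in data, not the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)


def seed_scores(activation, seeds=(0, 1, 2)):
    """Test accuracy of the same architecture under different random seeds."""
    scores = []
    for seed in seeds:
        model = MLPClassifier(hidden_layer_sizes=(20,), activation=activation,
                              max_iter=300, random_state=seed)
        model.fit(Xtr, ytr)
        scores.append(model.score(Xte, yte))
    return np.array(scores)


for activation in ("relu", "logistic"):
    scores = seed_scores(activation)
    print(f"{activation}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the difference between the mean scores is smaller than the seed-to-seed standard deviation, the ranking of activations is probably not reliable.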

### Does the network performance change when the density (number of neurons) of the hidden layers change?

Let's increase the number of neurons in the hidden layer

In [None]:
mlp5 = build_and_fit_mlp(X_train, y_train, layers_dim=(200,))
mlp6 = build_and_fit_mlp(X_train, y_train, layers_dim=(500,))
mlp7 = build_and_fit_mlp(X_train, y_train, layers_dim=(1000,))

compare_accuracies([mlp, mlp5, mlp6, mlp7], X_train, X_test, y_train, y_test)

Unnamed: 0,model,accuracy_score_test,accuracy_score_train
0,MLPClassifier(max_iter=1000),0.739583,0.809028
1,"MLPClassifier(hidden_layer_sizes=(200,), max_i...",0.713542,0.833333
2,"MLPClassifier(hidden_layer_sizes=(500,), max_i...",0.723958,0.848958
3,"MLPClassifier(hidden_layer_sizes=(1000,), max_...",0.734375,0.883681


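Overfitting! Train accuracy climbs from 0.809 to 0.884 as the layer widens, while test accuracy stays below the 100-neuron baseline (0.714 to 0.734 versus 0.740). The extra capacity fits the training set more closely without generalizing better.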
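
A growing gap between train and test accuracy is the signal being read here. `MLPClassifier` has two built-in options that are commonly used against it: the L2 penalty `alpha` and `early_stopping`. The sketch below measures the gap with and without them on synthetic stand-in data; the data, the value `alpha=1.0`, and the helper name `train_test_gap` are assumptions for illustration, and whether the gap actually shrinks depends on the data.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Synthetic, deliberately noisy stand-in data (flip_y adds label noise).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     flip_y=0.2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)


def train_test_gap(model):
    """Fit the model and return (train accuracy, test accuracy, gap)."""
    model.fit(Xtr, ytr)
    train_acc = model.score(Xtr, ytr)
    test_acc = model.score(Xte, yte)
    return train_acc, test_acc, train_acc - test_acc


wide = MLPClassifier(hidden_layer_sizes=(200,), max_iter=500, random_state=0)
# Same width, but with a stronger L2 penalty and early stopping enabled.
wide_reg = MLPClassifier(hidden_layer_sizes=(200,), max_iter=500, random_state=0,
                         alpha=1.0, early_stopping=True)

for name, model in [("default", wide), ("alpha=1.0 + early stopping", wide_reg)]:
    train_acc, test_acc, gap = train_test_gap(model)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```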

Let's use more layers

In [None]:
mlp8 = build_and_fit_mlp(X_train, y_train, layers_dim=(100, 100))
mlp9 = build_and_fit_mlp(X_train, y_train, layers_dim=(100, 100, 100))
mlp10 = build_and_fit_mlp(X_train, y_train, layers_dim=(100, 100, 100, 100))
# Note: mlp11 (seven hidden layers) is fitted here but is not part of the comparison below
mlp11 = build_and_fit_mlp(X_train, y_train, layers_dim=(100, 100, 100, 100, 100, 100, 100))

compare_accuracies([mlp, mlp8, mlp9, mlp10], X_train, X_test, y_train, y_test)

Unnamed: 0,model,accuracy_score_test,accuracy_score_train
0,MLPClassifier(max_iter=1000),0.739583,0.809028
1,"MLPClassifier(hidden_layer_sizes=(100, 100), m...",0.744792,0.902778
2,"MLPClassifier(hidden_layer_sizes=(100, 100, 10...",0.697917,0.916667
3,"MLPClassifier(hidden_layer_sizes=(100, 100, 10...",0.666667,0.953125


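Overfitting again! With two hidden layers, test accuracy is about the same as the baseline (0.745 versus 0.740) while train accuracy jumps to 0.903. With three and four layers, train accuracy keeps rising (0.917, 0.953) and test accuracy drops (0.698, 0.667), so the deeper networks overfit even more strongly than the wider ones.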
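
All comparisons above rest on a single train/test split. Cross-validation averages over several splits and gives a steadier basis for comparing depths. A minimal sketch on synthetic stand-in data, with scaling placed inside a pipeline so it is refitted on each training fold (the data, the smaller `(20, ...)` layers, and the name `depth_results` are assumptions to keep the example quick):

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Synthetic stand-in data, not the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

depth_results = {}
for layers in [(20,), (20, 20), (20, 20, 20)]:
    # The scaler is part of the pipeline, so each fold scales with its own training data.
    pipeline = make_pipeline(
        MinMaxScaler(),
        MLPClassifier(hidden_layer_sizes=layers, max_iter=300, random_state=0),
    )
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
    depth_results[layers] = scores.mean()
    print(f"{layers}: mean accuracy={scores.mean():.3f} (std={scores.std():.3f})")
```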