### MACHINE LEARNING MODELS FOR THE DIABETES DATASET (RAW)

First of all, we will perform data curation and preparation. Next, we will apply the following Machine Learning models:

- Linear Regression
- k-Nearest Neighbors(k-NN)
- Support Vector Machines (SVM)
- Classification Trees
- Random Forest
- Gradient Boosting
- Neural Networks(NN)

**Import the necessary libraries:**

In [None]:
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV**Load the dataset**


from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

import matplotlib.pyplot as plt

import shap

**Load the dataset**

In [2]:
df = pd.read_csv("/home/carmen/Escritorio/TFM/ml_anonymization/datasets/bank-additional-full_raw.csv", sep=";")

**We check that we have loaded the dataset correctly by showing the first 5 rows of the dataset**

In [5]:
df.head()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,Male,No,Yes,No,Yes,No,No,No,Yes,No,Yes,No,Yes,Yes,Yes,Positive
1,58,Male,No,No,No,Yes,No,No,Yes,No,No,No,Yes,No,Yes,No,Positive
2,41,Male,Yes,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,Yes,No,Positive
3,45,Male,No,No,Yes,Yes,Yes,Yes,No,Yes,No,Yes,No,No,No,No,Positive
4,60,Male,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Positive


**We explore the original dataset**

In [None]:
print(df.head())
print(df.info())
print(df.describe())
print(df.shape)

**We check if there is any null value to remove it if exists:**

In [1]:
df.isnull().sum()

NameError: name 'df' is not defined

**Eliminamos datos duplicados si los hubiese**

In [5]:
df.drop_duplicates(inplace=True)

In [6]:
df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


**Eliminamos las columnas que no parecen ser relevantes para nuestro análisis**
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

In [7]:
df = df.drop(["duration", "campaign", "pdays", "previous", "poutcome"], axis=1)

In [8]:
df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,-1.1,94.767,-50.8,1.028,4963.6,yes


**Convertimos las variables categóricas a factores:**

In [9]:
df["job"] = df["job"].astype("category").cat.codes
df["marital"] = df["marital"].astype("category").cat.codes
df["education"] = df["education"].astype("category").cat.codes
df["default"] = df["default"].astype("category").cat.codes
df["housing"] = df["housing"].astype("category").cat.codes
df["loan"] = df["loan"].astype("category").cat.codes
df["contact"] = df["contact"].astype("category").cat.codes
df["month"] = df["month"].astype("category").cat.codes
df["day_of_week"] = df["day_of_week"].astype("category").cat.codes
df["y"] = df["y"].astype("category").cat.codes

In [10]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,3,1,0,0,0,0,1,6,1,1.1,93.994,-36.4,4.857,5191.0,0
1,57,7,1,3,1,0,0,1,6,1,1.1,93.994,-36.4,4.857,5191.0,0
2,37,7,1,3,0,2,0,1,6,1,1.1,93.994,-36.4,4.857,5191.0,0
3,40,0,1,1,0,0,0,1,6,1,1.1,93.994,-36.4,4.857,5191.0,0
4,56,7,1,3,0,0,2,1,6,1,1.1,93.994,-36.4,4.857,5191.0,0


DUDA: Escalo entre 0 y 1 las edades? todas las variables? SI ESCALO ESCALO TODO

Hacer el escalado solo sobre train, luego se aplica, después ese escalado sobre test

**Dividimos el dataset en train y en test**

In [11]:
X = df.drop(["y"], axis=1)
y = df["y"]

# add stratify
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


In [12]:
X_train

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
36454,24,9,2,5,0,0,0,0,4,3,-2.9,92.963,-40.8,1.262,5076.2
1233,32,1,1,5,0,0,0,1,6,2,1.1,93.994,-36.4,4.855,5191.0
24111,33,6,2,6,0,0,0,1,7,2,-0.1,93.200,-42.0,4.245,5195.8
15516,38,2,1,2,0,0,2,1,3,0,1.4,93.918,-42.7,4.957,5228.1
17916,39,7,1,3,0,2,2,0,3,3,1.4,93.918,-42.7,4.961,5228.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33267,26,7,1,2,0,0,0,1,6,3,-1.8,92.893,-46.2,1.291,5099.1
22714,35,4,1,6,0,0,2,0,1,0,1.4,93.444,-36.1,4.964,5228.1
6971,32,4,1,6,0,2,0,1,6,2,1.1,93.994,-36.4,4.860,5191.0
18503,34,9,1,5,1,0,0,0,3,2,1.4,93.918,-42.7,4.968,5228.1


In [19]:
X_test

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
8490,35,7,1,3,1,0,0,1,4,4,1.4,94.465,-41.8,4.864,5228.1
40844,30,8,2,5,0,2,0,0,9,3,-1.1,94.199,-37.5,0.880,4963.6
35681,37,6,2,0,0,2,0,0,6,1,-1.8,92.893,-46.2,1.244,5099.1
35994,31,1,2,5,0,0,0,0,6,3,-1.8,92.893,-46.2,1.266,5099.1
21961,31,9,1,6,0,1,1,0,1,4,1.4,93.444,-36.1,4.964,5228.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16062,30,1,0,2,0,2,0,0,3,3,1.4,93.918,-42.7,4.961,5228.1
23700,60,0,1,2,0,2,0,0,1,2,1.4,93.444,-36.1,4.962,5228.1
34189,21,8,2,3,0,0,0,0,6,4,-1.8,92.893,-46.2,1.281,5099.1
1963,38,9,2,5,0,2,0,1,6,0,1.1,93.994,-36.4,4.855,5191.0


In [14]:
y_train

36454    0
1233     1
24111    0
15516    0
17916    0
        ..
33267    0
22714    1
6971     0
18503    0
24790    0
Name: y, Length: 32940, dtype: int8

In [15]:
y_test

8490     0
40844    1
35681    0
35994    0
21961    0
        ..
16062    0
23700    0
34189    0
1963     0
12395    0
Name: y, Length: 8236, dtype: int8

**Finalmente hacemos el escalado por separado en train y en test**

In [22]:
scaler = MinMaxScaler()

X_train = pd.DataFrame(X_train)
for column in X_train.columns:
    df[[X_train]] = scaler.fit_transform(X_train[[column]])
X_test = pd.DataFrame(X_test)   
for column in X_test.columns:
    df[[X_test]] = scaler.fit_transform(X_test[[column]])

TypeError: Must pass DataFrame or 2-d ndarray with boolean values only

### KNN

Cargamos las librerias

In [20]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

Elegimos los distintos parámetros para la gridSearch

In [18]:
param_grid = {'n_neighbors': [3, 5, 7, 9, 11],
              'weights': ['uniform', 'distance'],
              'algorithm': ['ball_tree', 'kd_tree', 'brute']}

Creamos el clasificador 

In [21]:
knn = KNeighborsClassifier()


Creamos el objeto de GridSearch

In [22]:
# refit true, para que entrenemos con todos los datos
grid_search = GridSearchCV(knn, param_grid, cv=5, refit=True)

Hacemos el fit

In [23]:
grid_search.fit(X_train, y_train)

In [24]:
print("Best hyperparameters:", grid_search.best_params_)
print("Accuracy score:", grid_search.best_score_)

Best hyperparameters: {'algorithm': 'kd_tree', 'n_neighbors': 11, 'weights': 'uniform'}
Accuracy score: 0.887370977534912


### RANDOM FOREST

In [26]:
from sklearn.ensemble import RandomForestClassifier

In [27]:
param_grid = {"n_estimators": [50, 100, 200],
              "max_depth": [None, 5, 10],
              "min_samples_split": [2, 5, 10],
              "min_samples_leaf": [1, 2, 4]}

In [28]:
rfc = RandomForestClassifier(random_state=42)

In [29]:
grid_search = GridSearchCV(rfc, param_grid, cv=5)

In [30]:
grid_search.fit(X_train, y_train)

In [32]:
print("Best hyperparameters:", grid_search.best_params_)
print("Accuracy score:", grid_search.best_score_)

Best hyperparameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 100}
Accuracy score: 0.8973891924711597


### SVM (Support Vector Machine)

In [33]:
from sklearn.svm import SVC

In [9]:
param_grid1 = {"C": [0.1, 1],
              "kernel": ["linear", "poly", "rbf", "sigmoid"],
              "gamma": ["scale", "auto"]}

In [None]:
param_grid2 = {"C": [0.1, 1],
              "kernel": ["rbf", "sigmoid"],
              "gamma": ["scale", "auto"]}

In [35]:
svm = SVC()

In [36]:
grid_search = GridSearchCV(svm, param_grid, cv=5)

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
print("Best hyperparameters:", grid_search.best_params_)
print("Accuracy score:", grid_search.best_score_)

### RED NEURONAL: MLP (Perceptrón Multicapa) Hecho con scik-learn, hacer con keras?

In [1]:
from sklearn.neural_network import MLPClassifier

In [2]:
param_grid = {"hidden_layer_sizes": [(10,), (50,), (100,)],
              "activation": ["logistic", "relu"],
              "solver": ["lbfgs", "sgd", "adam"]}

In [3]:
mlp = MLPClassifier(random_state=42)

In [4]:
grid_search = GridSearchCV(mlp, param_grid, cv=5)

NameError: name 'GridSearchCV' is not defined

In [None]:
grid_search.fit(X_train, y_train)


In [None]:
print("Best hyperparameters:", grid_search.best_params_)
print("Accuracy score:", grid_search.best_score_)

adaboost, gradient boosting, plot_classifier_comparisson

mirar accuracy, matriz de confusion,roc. auc

### TODO:visualizacion de features 

#### RED NEURONAL KERAS:

In [6]:
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

2023-06-13 10:37:05.272671: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-13 10:37:05.539479: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


In [14]:
X_train

array([[ 4.0000e+01,  1.0000e+00,  1.0000e+00, ..., -4.2700e+01,
         4.9600e+00,  5.2281e+03],
       [ 3.1000e+01,  0.0000e+00,  1.0000e+00, ..., -4.6200e+01,
         1.2440e+00,  5.0991e+03],
       [ 5.9000e+01,  5.0000e+00,  1.0000e+00, ..., -4.6200e+01,
         1.3540e+00,  5.0991e+03],
       ...,
       [ 3.5000e+01,  0.0000e+00,  1.0000e+00, ..., -2.6900e+01,
         7.5400e-01,  5.0175e+03],
       [ 4.0000e+01,  4.0000e+00,  1.0000e+00, ..., -3.6400e+01,
         4.8560e+00,  5.1910e+03],
       [ 2.9000e+01,  0.0000e+00,  2.0000e+00, ..., -4.2700e+01,
         4.9600e+00,  5.2281e+03]])

In [25]:

# Initialize the model
model = Sequential()

# Add input layer and first hidden layer
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))

# Add second hidden layer
model.add(Dense(32, activation='relu'))

# Add output layer
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=1)
# Meter validacion para ver cuando parar en que época(para encontrar el mejor modelo)
# Cuando encuentre el bueno, reentreno con todo, scikitlearn lo hace solo, pero keras no

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test.values, verbose=0)
print(f'Test Loss: {loss:.4f}')
print(f'Test Accuracy: {accuracy*100:.2f}%')

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test Loss: 0.2978
Test Accuracy: 88.06%


In [None]:
# poner semillas a tensorflow y a 