### MACHINE LEARNING MODELS FOR THE DIABETES DATASET (RAW)

First of all, we will perform data curation and preparation. Next, we will apply the following Machine Learning models:

- Logistic Regression
- k-Nearest Neighbors (k-NN)
- Classification Trees
- Random Forest
- Support Vector Machines (SVM)
- Neural Networks (NN)
- AdaBoost
- Gradient Boosting

**Import the necessary libraries:**

In [103]:
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV


from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier


import matplotlib.pyplot as plt

import shap


**Load the dataset**

In [2]:
df = pd.read_csv("/home/carmen/Escritorio/TFM/ml_anonymization/datasets/diabetes_data_raw.csv", sep=",")

**We check that the dataset loaded correctly by showing its first 5 rows**

In [3]:
df.head()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,Male,No,Yes,No,Yes,No,No,No,Yes,No,Yes,No,Yes,Yes,Yes,Positive
1,58,Male,No,No,No,Yes,No,No,Yes,No,No,No,Yes,No,Yes,No,Positive
2,41,Male,Yes,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,Yes,No,Positive
3,45,Male,No,No,Yes,Yes,Yes,Yes,No,Yes,No,Yes,No,No,No,No,Positive
4,60,Male,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Positive


**We explore the original dataset**

In [4]:
print(df.head())
print(df.info())
print(df.describe())
print(df.shape)

   Age Gender Polyuria Polydipsia sudden weight loss weakness Polyphagia   
0   40   Male       No        Yes                 No      Yes         No  \
1   58   Male       No         No                 No      Yes         No   
2   41   Male      Yes         No                 No      Yes        Yes   
3   45   Male       No         No                Yes      Yes        Yes   
4   60   Male      Yes        Yes                Yes      Yes        Yes   

  Genital thrush visual blurring Itching Irritability delayed healing   
0             No              No     Yes           No             Yes  \
1             No             Yes      No           No              No   
2             No              No     Yes           No             Yes   
3            Yes              No     Yes           No             Yes   
4             No             Yes     Yes          Yes             Yes   

  partial paresis muscle stiffness Alopecia Obesity     class  
0              No              Yes      

**Check for null values, so that they can be removed if any exist:**

In [5]:
df.isnull().sum()

Age                   0
Gender                0
Polyuria              0
Polydipsia            0
sudden weight loss    0
weakness              0
Polyphagia            0
Genital thrush        0
visual blurring       0
Itching               0
Irritability          0
delayed healing       0
partial paresis       0
muscle stiffness      0
Alopecia              0
Obesity               0
class                 0
dtype: int64

**Remove duplicate rows, if any:**

In [6]:
df.drop_duplicates(inplace=True)

**Convert categorical variables to integer codes:**

In [7]:
df["Age"] = df["Age"].astype("category").cat.codes
df["Gender"] = df["Gender"].astype("category").cat.codes
df["Polyuria"] = df["Polyuria"].astype("category").cat.codes
df["Polydipsia"] = df["Polydipsia"].astype("category").cat.codes
df["sudden weight loss"] = df["sudden weight loss"].astype("category").cat.codes  
df["weakness"] = df["weakness"].astype("category").cat.codes
df["Polyphagia"] = df["Polyphagia"].astype("category").cat.codes
df["Genital thrush"] = df["Genital thrush"].astype("category").cat.codes


df["visual blurring"] = df["visual blurring"].astype("category").cat.codes
df["Itching"] = df["Itching"].astype("category").cat.codes
df["Irritability"] = df["Irritability"].astype("category").cat.codes
df["delayed healing"] = df["delayed healing"].astype("category").cat.codes
df["partial paresis"] = df["partial paresis"].astype("category").cat.codes  
df["muscle stiffness"] = df["muscle stiffness"].astype("category").cat.codes
df["Alopecia"] = df["Alopecia"].astype("category").cat.codes
df["Obesity"] = df["Obesity"].astype("category").cat.codes

df["class"] = df["class"].astype("category").cat.codes
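
The cell above also passes `Age` through `cat.codes`, which replaces each age with its position among the distinct ages rather than keeping the value in years. A minimal sketch of an alternative that encodes only the non-numeric columns, shown on a small made-up frame (`toy`, its rows and `categorical_cols` are illustrative names, not part of the original notebook):

```python
import pandas as pd

# Toy rows in the same style as the dataset (values are illustrative only).
toy = pd.DataFrame({
    "Age": [40, 58, 41],
    "Gender": ["Male", "Female", "Male"],
    "Polyuria": ["No", "Yes", "Yes"],
    "class": ["Positive", "Negative", "Positive"],
})

# Select every column that is not numeric; Age is numeric and is left untouched.
categorical_cols = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
]

encoded = toy.copy()
for col in categorical_cols:
    # Categories are sorted alphabetically, so No/Yes map to 0/1.
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

With this approach `Age` stays in years and can be scaled later, while the Yes/No, gender and class columns receive the same 0/1 codes as before.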

In [8]:
df.head()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,16,1,0,1,0,1,0,0,0,1,0,1,0,1,1,1,1
1,34,1,0,0,0,1,0,0,1,0,0,0,1,0,1,0,1
2,17,1,1,0,0,1,1,0,0,1,0,1,0,1,1,0,1
3,21,1,0,0,1,1,1,1,0,1,0,1,0,0,0,0,1
4,36,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1


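**Next, we scale `Age` to the [0, 1] range. Ideally the scaler is fit on the training set and then applied unchanged to the test set; the cell below fits it on the whole dataset, before the split:**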

In [9]:
scaler = MinMaxScaler()
df[["Age"]] = scaler.fit_transform(df[["Age"]])

**We split the dataset into train and test**

In [10]:
X = df.drop(["class"], axis=1)
y = df["class"]

# stratify=y keeps the class proportions the same in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
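
A minimal sketch of splitting first and then fitting the scaler on the training partition only, so that no information from the test rows reaches the scaler. The frame `toy` and the names `X_tr`, `X_te` are illustrative assumptions; in the notebook the same pattern would apply to `X_train` and `X_test`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Illustrative data only; Age is stored as float so the scaled values fit the column.
toy = pd.DataFrame({
    "Age": [25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0],
    "Polyuria": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "class": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X_toy = toy.drop(columns=["class"])
y_toy = toy["class"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
# Work on copies so the assignments below do not touch the original frame.
X_tr = X_tr.copy()
X_te = X_te.copy()

scaler = MinMaxScaler()
# Learn min and max from the training rows only...
X_tr[["Age"]] = scaler.fit_transform(X_tr[["Age"]])
# ...and reuse those same statistics on the test rows.
X_te[["Age"]] = scaler.transform(X_te[["Age"]])

print(X_tr["Age"].min(), X_tr["Age"].max())
```

Test values can fall slightly outside [0, 1] with this approach, which is expected: the test set is treated as unseen data.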


**We check that the shapes of the split sets are as expected**

In [22]:
print("X_train shape:",X_train.shape)
print("X_test shape:",X_test.shape)
print("y_train:",y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (200, 16)
X_test shape: (51, 16)
y_train: (200,)
y_test shape: (51,)


### LOGISTIC REGRESSION:
#### a linear model for binary classification problems

We create and train the logistic regression model

In [30]:
model = LogisticRegression()
model.fit(X_train, y_train)


We make the predictions on the test partition:

In [35]:
y_pred_logistic = model.predict(X_test)

We evaluate the model obtained:

In [55]:
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
report_logistic = classification_report(y_test, y_pred_logistic)
print("Accuracy:", accuracy_logistic)
print("Classification Report:\n", report_logistic)

Accuracy: 0.8235294117647058
Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.88      0.76        16
           1       0.93      0.80      0.86        35

    accuracy                           0.82        51
   macro avg       0.80      0.84      0.81        51
weighted avg       0.85      0.82      0.83        51



### KNN

We define the hyperparameter grid for GridSearchCV

In [39]:
param_grid = {'n_neighbors': [3, 5, 7, 9, 11],
              'weights': ['uniform', 'distance'],
              'algorithm': ['ball_tree', 'kd_tree', 'brute']}

We create the classifier

In [41]:
knn = KNeighborsClassifier()

We create the GridSearchCV object

In [42]:
# refit=True retrains the best configuration on all the training data
grid_search = GridSearchCV(knn, param_grid, cv=5, refit=True)

We fit the grid search on the training set

In [43]:
grid_search.fit(X_train, y_train)

In [44]:
print("Best hyperparameters:", grid_search.best_params_)
print("Accuracy score:", grid_search.best_score_)

Best hyperparameters: {'algorithm': 'ball_tree', 'n_neighbors': 3, 'weights': 'distance'}
Accuracy score: 0.9200000000000002
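
The `best_score_` printed above is the mean accuracy over the 5 cross-validation folds of the training set, so it is not directly comparable with accuracy on the test partition. As a sketch on synthetic data (`X_demo`, `y_demo` and `search` are assumed names, and `make_classification` stands in for the real dataset), the fold-to-fold spread behind that mean can be read from `cv_results_`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the diabetes features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=16, random_state=42)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    cv=5,
)
search.fit(X_demo, y_demo)

# Row of cv_results_ that corresponds to the best hyperparameters.
best = search.best_index_
mean_cv = search.cv_results_["mean_test_score"][best]
std_cv = search.cv_results_["std_test_score"][best]

print(f"Cross-validated accuracy: {mean_cv:.3f} +/- {std_cv:.3f}")
```

Reporting the standard deviation next to the mean gives a rough idea of how much the score could move on a small test set such as the 51 rows used here.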


Get the best model and its predictions

In [49]:
best_model = grid_search.best_estimator_
y_pred_knn = best_model.predict(X_test)

Evaluate the best model

In [54]:
accuracy_knn = accuracy_score(y_test, y_pred_knn)
report_knn = classification_report(y_test, y_pred_knn)
print("Best Parameters:", grid_search.best_params_)
print("Accuracy:", accuracy_knn)
print("Classification Report:\n", report_knn)

Best Parameters: {'algorithm': 'ball_tree', 'n_neighbors': 3, 'weights': 'distance'}
Accuracy: 0.8235294117647058
Classification Report:
               precision    recall  f1-score   support

           0       0.64      1.00      0.78        16
           1       1.00      0.74      0.85        35

    accuracy                           0.82        51
   macro avg       0.82      0.87      0.82        51
weighted avg       0.89      0.82      0.83        51



### CLASSIFICATION TREES

In [74]:
param_grid = {'max_depth': [3, 5, 7, 9, None],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4]}

In [75]:
model = DecisionTreeClassifier(random_state=42)

In [76]:
grid_search = GridSearchCV(model, param_grid, cv=5)

In [77]:
grid_search.fit(X_train, y_train)

In [78]:
best_model = grid_search.best_estimator_
y_pred_ct = best_model.predict(X_test)

In [80]:
accuracy_ct = accuracy_score(y_test, y_pred_ct)
report_ct = classification_report(y_test, y_pred_ct)
print("Best Parameters:", grid_search.best_params_)
print("Accuracy:", accuracy_ct)
print("Classification Report:\n", report_ct)

Best Parameters: {'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 10}
Accuracy: 0.8235294117647058
Classification Report:
               precision    recall  f1-score   support

           0       0.65      0.94      0.77        16
           1       0.96      0.77      0.86        35

    accuracy                           0.82        51
   macro avg       0.81      0.85      0.81        51
weighted avg       0.87      0.82      0.83        51



### RANDOM FOREST

We define the hyperparameter grid for GridSearchCV

In [56]:
param_grid = {"n_estimators": [50, 100, 200],
              "max_depth": [None, 5, 10],
              "min_samples_split": [2, 5, 10],
              "min_samples_leaf": [1, 2, 4]}

We create the classifier

In [57]:
rfc = RandomForestClassifier(random_state=42)

We create the GridSearchCV object

In [58]:
grid_search = GridSearchCV(rfc, param_grid, cv=5)

We fit the grid search on the training set

In [59]:
grid_search.fit(X_train, y_train)

In [61]:
print("Best hyperparameters:", grid_search.best_params_)
print("Accuracy score:", grid_search.best_score_)

Best hyperparameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Accuracy score: 0.93


In [62]:
best_model_rf = grid_search.best_estimator_
y_pred_rf = best_model_rf.predict(X_test)

In [63]:
accuracy_rf = accuracy_score(y_test, y_pred_rf)
report_rf = classification_report(y_test, y_pred_rf)
print("Best Parameters:", grid_search.best_params_)
print("Accuracy:", accuracy_rf)
print("Classification Report:\n", report_rf)

Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Accuracy: 0.9215686274509803
Classification Report:
               precision    recall  f1-score   support

           0       0.80      1.00      0.89        16
           1       1.00      0.89      0.94        35

    accuracy                           0.92        51
   macro avg       0.90      0.94      0.91        51
weighted avg       0.94      0.92      0.92        51



### SVM (Support Vector Machine)

In [64]:
param_grid = {"C": [0.1, 0.25, 0.5, 0.75, 1, 2],
              "kernel": ["linear", "poly", "rbf", "sigmoid"],
              "gamma": ["scale", "auto"]}

In [65]:
svm = SVC()

In [66]:
grid_search = GridSearchCV(svm, param_grid, cv=5)

In [67]:
grid_search.fit(X_train, y_train)

In [68]:
print("Best hyperparameters:", grid_search.best_params_)
print("Accuracy score:", grid_search.best_score_)

Best hyperparameters: {'C': 0.5, 'gamma': 'scale', 'kernel': 'rbf'}
Accuracy score: 0.925


In [69]:
best_model_svm = grid_search.best_estimator_
y_pred_svm = best_model_svm.predict(X_test)

In [70]:
accuracy_svm = accuracy_score(y_test, y_pred_svm)
report_svm = classification_report(y_test, y_pred_svm)
print("Best Parameters:", grid_search.best_params_)
print("Accuracy:", accuracy_svm)
print("Classification Report:\n", report_svm)

Best Parameters: {'C': 0.5, 'gamma': 'scale', 'kernel': 'rbf'}
Accuracy: 0.8823529411764706
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.81      0.81        16
           1       0.91      0.91      0.91        35

    accuracy                           0.88        51
   macro avg       0.86      0.86      0.86        51
weighted avg       0.88      0.88      0.88        51



### NEURAL NETWORK:

In [83]:
param_grid = {'hidden_units': [(16,), (32,), (64,)],
              'activation': ['relu', 'sigmoid'],
              'optimizer': ['adam', 'sgd']}

In [108]:
# Initialize the model
model = Sequential()

# Add input layer and first hidden layer
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))

# Add second hidden layer
model.add(Dense(32, activation='relu'))

# Add output layer
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model_nn = model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=1)
# Add a validation set to see at which epoch to stop (to find the best model)
# Once the best one is found, retrain on all the data; scikit-learn does this automatically, but Keras does not

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test.values, verbose=0)
print(f'Test Loss: {loss:.4f}')
print(f'Test Accuracy: {accuracy*100:.2f}%')

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test Loss: 0.3159
Test Accuracy: 84.31%
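
The grid defined for the network is not used by the Keras cell, and the notes in that cell ask for a validation set to decide when to stop. As a sketch only, and swapping Keras for scikit-learn's `MLPClassifier` so that `GridSearchCV` can be reused as in the previous sections, both ideas could look like this. The data is synthetic, and `mlp_grid`, `X_demo` and `y_demo` are assumed names:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the diabetes features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=16, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Rough scikit-learn equivalents of hidden_units / activation / optimizer.
mlp_grid = {
    "hidden_layer_sizes": [(16,), (32,), (64,)],
    "activation": ["relu", "logistic"],
    "solver": ["adam", "sgd"],
}

# early_stopping holds out part of the training data and stops when the
# validation score has not improved for n_iter_no_change epochs.
mlp = MLPClassifier(
    early_stopping=True,
    validation_fraction=0.2,
    n_iter_no_change=10,
    max_iter=500,
    random_state=42,
)

search = GridSearchCV(mlp, mlp_grid, cv=3)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(X_te, y_te))
```

This is not the Keras model trained above; it only illustrates how validation-based stopping and a grid search over the same kind of hyperparameters can be combined.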


### AdaBoost (Adaptive Boosting):

In [94]:
adaboost = AdaBoostClassifier(n_estimators=100, random_state=42)
adaboost.fit(X_train, y_train)
adaboost_predictions = adaboost.predict(X_test)
adaboost_accuracy = accuracy_score(y_test, adaboost_predictions)

In [96]:
print("AdaBoost Accuracy:", adaboost_accuracy)

AdaBoost Accuracy: 0.803921568627451


### Gradient Boosting:

In [98]:
gradient_boosting = GradientBoostingClassifier(n_estimators=100, random_state=42)
gradient_boosting.fit(X_train, y_train)
gradient_boosting_predictions = gradient_boosting.predict(X_test)
gradient_boosting_accuracy = accuracy_score(y_test, gradient_boosting_predictions)

In [99]:
print("Gradient Boosting Accuracy:", gradient_boosting_accuracy)

Gradient Boosting Accuracy: 0.8627450980392157
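
To compare the models at a glance, the test accuracies printed in the sections above can be collected and sorted. The values below are copied from those outputs and rounded to four decimals; the names `test_accuracy` and `ranking` are illustrative:

```python
import pandas as pd

# Test-set accuracies reported by each section of this notebook.
test_accuracy = pd.Series(
    {
        "Logistic Regression": 0.8235,
        "k-NN": 0.8235,
        "Classification Tree": 0.8235,
        "Random Forest": 0.9216,
        "SVM": 0.8824,
        "Neural Network": 0.8431,
        "AdaBoost": 0.8039,
        "Gradient Boosting": 0.8627,
    },
    name="test_accuracy",
)

# Sort from the most to the least accurate model.
ranking = test_accuracy.sort_values(ascending=False)
print(ranking)
```

Since the test partition has only 51 rows, one misclassified row changes accuracy by about 0.02, so small differences in this ranking should be read with care.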


### TODO: feature visualization and plots; look at accuracy, confusion matrix, ROC, AUC
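
A minimal sketch of two of the metrics listed in this TODO, the confusion matrix and the ROC AUC, using scikit-learn on synthetic data. The names `X_demo`, `y_demo` and `clf` are assumptions; in the notebook the same calls would take `y_test` and the predictions of any of the fitted models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes features and labels.
X_demo, y_demo = make_classification(n_samples=250, n_features=16, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te))

# ROC AUC needs scores, so we use the predicted probability of class 1.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("Confusion matrix:\n", cm)
print(f"ROC AUC: {auc:.3f}")
```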

Adding SHAP:

In [110]:
# Convert the DataFrame inputs to NumPy array
X_train_array = X_train.values
X_test_array = X_test.values

In [130]:
# Initialize the SHAP explainer
explainer = shap.DeepExplainer(model, X_train_array)
# Compute the SHAP values
shap_values = explainer.shap_values(X_test_array)

AttributeError: in user code:

    File "/home/carmen/mambaforge/envs/TFM/lib/python3.10/site-packages/shap/explainers/_deep/deep_tf.py", line 252, in grad_graph  *
        x_grad = tape.gradient(out, shap_rAnD)
    File "/home/carmen/mambaforge/envs/TFM/lib/python3.10/site-packages/shap/explainers/_deep/deep_tf.py", line 378, in custom_grad
        out = op_handlers[type_name](self, op, *grads) # we cut off the shap_ prefex before the lookup
    File "/home/carmen/mambaforge/envs/TFM/lib/python3.10/site-packages/shap/explainers/_deep/deep_tf.py", line 562, in handler
        var = explainer._variable_inputs(op)
    File "/home/carmen/mambaforge/envs/TFM/lib/python3.10/site-packages/shap/explainers/_deep/deep_tf.py", line 222, in _variable_inputs
        out = np.zeros(len(op.inputs), dtype=np.bool)
    File "/home/carmen/mambaforge/envs/TFM/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__
        raise AttributeError(__former_attrs__[attr])

    AttributeError: module 'numpy' has no attribute 'bool'.
    `np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
    The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
        https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations


In [131]:
# explain the model's predictions using SHAP
explainer = shap.explainers.Linear(model, X_train_array)
shap_values = explainer(X_train_array)

AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

In [132]:
explainer = shap.KernelExplainer(model_nn, X_test_array)
shap_values = explainer.shap_values(X_test_array, nsamples=500)

Provided model function fails when applied to the provided data set.


TypeError: 'History' object is not callable

In [133]:
# Calculate SHAP values
shap_values = explainer.shap_values(X_test_array)

# Get feature names
feature_names = X_train.columns.tolist()

# Plot a summary of the SHAP values for each feature
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

AttributeError: in user code:

    File "/home/carmen/mambaforge/envs/TFM/lib/python3.10/site-packages/shap/explainers/_deep/deep_tf.py", line 252, in grad_graph  *
        x_grad = tape.gradient(out, shap_rAnD)
    File "/home/carmen/mambaforge/envs/TFM/lib/python3.10/site-packages/shap/explainers/_deep/deep_tf.py", line 378, in custom_grad
        out = op_handlers[type_name](self, op, *grads) # we cut off the shap_ prefex before the lookup
    File "/home/carmen/mambaforge/envs/TFM/lib/python3.10/site-packages/shap/explainers/_deep/deep_tf.py", line 562, in handler
        var = explainer._variable_inputs(op)
    File "/home/carmen/mambaforge/envs/TFM/lib/python3.10/site-packages/shap/explainers/_deep/deep_tf.py", line 222, in _variable_inputs
        out = np.zeros(len(op.inputs), dtype=np.bool)
    File "/home/carmen/mambaforge/envs/TFM/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__
        raise AttributeError(__former_attrs__[attr])

    AttributeError: module 'numpy' has no attribute 'bool'.
    `np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
    The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
        https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations


In [134]:
mean_absolute_shap_values = np.mean(np.abs(shap_values[0]), axis=0)

NameError: name 'shap_values' is not defined

In [137]:
# Get the names of the features we want to show
feature_names = X_train.columns.tolist()

# Sort from most to least important
sorted_features = sorted(zip(feature_names, mean_absolute_shap_values), key=lambda x: x[1], reverse=True)

# Print the features
for feature, importance in sorted_features:
    print(f'{feature}: {importance:.4f}')

NameError: name 'mean_absolute_shap_values' is not defined
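
The ranking logic of the last two cells does not depend on SHAP itself, so it can be checked in isolation. The sketch below assumes the attributions form a 2-D array of shape (n_samples, n_features) and uses random placeholder numbers in place of real SHAP output; `placeholder_values` and the shortened `feature_names` list are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["Age", "Gender", "Polyuria", "Polydipsia"]

# Placeholder attributions, one row per test sample; NOT real SHAP values.
placeholder_values = rng.normal(size=(51, len(feature_names)))

# Mean absolute attribution per feature (average over the samples).
mean_abs = np.mean(np.abs(placeholder_values), axis=0)

# Indices of the features from most to least important.
order = np.argsort(mean_abs)[::-1]
ranked = [(feature_names[i], float(mean_abs[i])) for i in order]

for name, importance in ranked:
    print(f"{name}: {importance:.4f}")
```

Once an explainer runs successfully, its output for the positive class could be passed in place of `placeholder_values`, provided it has the assumed shape.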