![Colegio Bourbaki](./Images/Bourbaki.png)

# Machine Learning

## Introduction to Neural Networks

### Libraries

In [None]:
#Data Analysis
import pandas as pd

#Plotting
import matplotlib.pyplot as plt
import seaborn as sns

#Neural Network Architecture
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

#Metrics
from sklearn.metrics import classification_report, confusion_matrix, auc, roc_curve

#Utils
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from torchsummary import summary

### Helper functions

In [None]:
def plot_confusion_matrix(y_true, y_pred):
    """Plot the confusion matrix of true vs. predicted labels as a heatmap."""
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt="g")
    plt.xlabel("Predicted labels")
    plt.ylabel("True labels")
    plt.show()

### Loading the data

Information about the dataset

This database contains 76 attributes, but all published experiments refer to a subset of 14 of them. In particular, the Cleveland database is the only one that ML researchers have used to date. The "goal" field (the `num` column in this notebook) refers to the presence of heart disease in the patient. It takes an integer value from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply distinguishing presence (values 1, 2, 3, 4) from absence (value 0).

The names and social security numbers of the patients were recently removed from the database and replaced with dummy values.

One file has been "processed": the one containing the Cleveland database. The four unprocessed files also exist in this directory.

To learn more about the dataset, see the following links:

* https://github.com/uci-ml-repo/ucimlrepo

* https://archive.ics.uci.edu/dataset/45/heart+disease

* https://pubmed.ncbi.nlm.nih.gov/2756873/

In [None]:
df = pd.read_csv('./Data/data.csv')

In [None]:
df

| Variable Name | Role     | Type         | Demographic | Description                                      | Units  | Missing Values |
|---------------|----------|--------------|-------------|--------------------------------------------------|--------|----------------|
| age           | Feature  | Integer      | Age         |                                                  | years  | no             |
| sex           | Feature  | Categorical  | Sex         |                                                  |        | no             |
| cp            | Feature  | Categorical  |             |                                                  |        | no             |
| trestbps      | Feature  | Integer      |             | resting blood pressure (on admission to the hospital) | mm Hg | no             |
| chol          | Feature  | Integer      |             | serum cholesterol                                | mg/dl  | no             |
| fbs           | Feature  | Categorical  |             | fasting blood sugar > 120 mg/dl                  |        | no             |
| restecg       | Feature  | Categorical  |             |                                                  |        | no             |
| thalach       | Feature  | Integer      |             | maximum heart rate achieved                      |        | no             |
| exang         | Feature  | Categorical  |             | exercise induced angina                          |        | no             |
| oldpeak       | Feature  | Integer      |             | ST depression induced by exercise relative to rest |        | no             |


### Exploratory analysis

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df[df.columns[:13]].describe()

In [None]:
df[df.columns[:13]].hist(bins=50, figsize=(15, 12), layout=(4, 4))
plt.suptitle("Feature histograms", y=1, fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
df['num'].value_counts()

In [None]:
df['num'].value_counts().plot(kind='bar')
plt.title('Class distribution')
plt.show()

We will turn the problem into a binary one, treating 0 as absence of coronary disease and 1 as presence of disease. This not only simplifies the problem (this is an introductory exercise), it also leaves the classes better balanced.

In [None]:
df_binary = df.copy()

In [None]:
df_binary['num'] = df_binary['num'].apply(lambda x: 1 if x>0 else 0)

In [None]:
df_binary['num'].value_counts() 

In [None]:
df_binary['num'].value_counts().plot(kind='bar')
plt.title('Class distribution')
plt.show()

As we can see, the classes are now more evenly distributed.
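To make the balance argument concrete, here is a minimal sketch in plain Python. The label list is hypothetical (it is not the real dataset), and the helper `imbalance_ratio` is a name introduced only for this illustration.

```python
from collections import Counter

# Hypothetical labels on the original 0-4 scale (illustrative, not the real data).
labels = [0, 0, 0, 0, 0, 1, 1, 2, 3, 4]


def imbalance_ratio(values):
    """Return the ratio between the largest and the smallest class count."""
    counts = Counter(values)
    return max(counts.values()) / min(counts.values())


# Collapse classes 1-4 into a single "disease present" class.
binary_labels = [1 if value > 0 else 0 for value in labels]

print(Counter(labels))         # five classes with uneven counts
print(Counter(binary_labels))  # two classes of equal size in this toy example
print(imbalance_ratio(labels), imbalance_ratio(binary_labels))
```

A ratio close to 1 means the classes are roughly the same size; on real data the binary version will rarely be perfectly balanced, only less skewed than the five-class version.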

We will drop the records with null values, since there are very few of them.

In [None]:
df_binary = df_binary.dropna()

In [None]:
X = df_binary[df_binary.columns[:13]]
Y = df_binary['num']

In [None]:
# Feature scaling
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
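Min-max scaling maps each feature to `(x - min) / (max - min)`. The cell above fits the scaler on the whole dataset before splitting; a common alternative is to compute the minimum and maximum on the training rows only, so that no statistics from the test rows leak into preprocessing. The sketch below shows that variant in plain Python. The function names `fit_min_max` and `apply_min_max` and the toy values are assumptions made for illustration; this is not how `MinMaxScaler` is implemented internally.

```python
def fit_min_max(rows):
    """Compute the per-column minimum and maximum from the given rows."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def apply_min_max(rows, mins, maxs):
    """Scale every value with (x - min) / (max - min); constant columns map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Hypothetical (age, resting blood pressure) rows, already split.
train_rows = [[50, 120], [60, 140], [70, 160]]
test_rows = [[65, 150], [80, 130]]

mins, maxs = fit_min_max(train_rows)             # statistics from training rows only
train_scaled = apply_min_max(train_rows, mins, maxs)
test_scaled = apply_min_max(test_rows, mins, maxs)

print(train_scaled)
print(test_scaled)  # a test value may fall outside [0, 1]; that is expected
```

With scikit-learn, the equivalent pattern would be calling `fit` (or `fit_transform`) on the training split and only `transform` on the validation and test splits.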

In [None]:
# Convert the scaled features and the labels to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
Y_tensor = torch.tensor(Y.values, dtype=torch.float32)

We split the data into training, validation, and test sets.

In [None]:
X_temp, X_test, Y_temp, Y_test = train_test_split(X_tensor, Y_tensor, test_size=0.2, random_state=42)
X_train, X_val, Y_train, Y_val = train_test_split(X_temp, Y_temp, test_size=0.05, random_state=42)
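The two chained splits leave a very small validation set. The sketch below estimates the resulting sizes. It assumes a hypothetical total of 297 rows after dropping nulls, and it assumes the held-out part is rounded up, which is my understanding of how `train_test_split` treats a fractional `test_size`; check `X_train.shape`, `X_val.shape`, and `X_test.shape` for the real numbers.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_kept, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(test_fraction * n_samples)
    return n_samples - n_held_out, n_held_out


n_rows = 297  # hypothetical number of rows after dropna()

n_temp, n_test = split_sizes(n_rows, 0.2)    # first split: 20% for test
n_train, n_val = split_sizes(n_temp, 0.05)   # second split: 5% of the rest for validation

print(n_train, n_val, n_test)  # -> 225 12 60
```

Under these assumptions the validation metrics are computed on about a dozen patients, so a single misclassified example moves validation accuracy by several percentage points.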

In [None]:
# Create TensorDatasets
train_dataset = TensorDataset(X_train, Y_train)
val_dataset = TensorDataset(X_val, Y_val)
test_dataset = TensorDataset(X_test, Y_test)

In [None]:
BATCH_SIZE = 1

In [None]:
# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

### Neural network architecture

In [None]:
X_train.shape, X_val.shape, X_test.shape

In [None]:
input_size = X_train.shape[1]
hidden_size = 16
output_size = 1

In [None]:
# Model definition
class NonLinearModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size=1):
        super().__init__()
        self.dense1 = nn.Linear(input_size, hidden_size)
        self.dense2 = nn.Linear(hidden_size, 2*hidden_size)
        self.dense3 = nn.Linear(2*hidden_size, hidden_size)
        self.dense4 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.dense1(x))
        x = self.relu(self.dense2(x))
        x = self.relu(self.dense3(x))
        # No ReLU on the output layer: the sigmoid needs the raw score,
        # otherwise its output could never fall below 0.5.
        x = self.sigmoid(self.dense4(x))
        return x
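The final activation determines which values the network can output. As a small sketch in plain Python (the helpers `sigmoid` and `relu` are defined here for illustration and are not the PyTorch modules), the snippet compares a sigmoid applied to a raw score with a sigmoid applied after a ReLU.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear unit: negative inputs become 0."""
    return max(0.0, z)


scores = [-3.0, -1.0, 0.0, 1.0, 3.0]  # hypothetical raw outputs of the last linear layer

plain = [sigmoid(z) for z in scores]          # covers values below and above 0.5
clipped = [sigmoid(relu(z)) for z in scores]  # never drops below 0.5

print([round(p, 3) for p in plain])
print([round(p, 3) for p in clipped])
```

With a ReLU in front of the sigmoid, every negative score collapses to a probability of exactly 0.5, so the model cannot express a confident "absence" prediction and the gradient through those units is zero. This is why binary classifiers usually feed the last linear layer straight into the sigmoid.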

In [None]:
EPOCHS = 30
LR = 0.0001

In [None]:
model = NonLinearModel(input_size, hidden_size)
summary(model, (input_size,), batch_size=BATCH_SIZE, device='cpu')
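The parameter counts reported by `summary` can be checked by hand: a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The sketch below applies that formula to the four layers of the model, assuming `input_size` is 13 (the 13 feature columns selected earlier); `linear_params` is a helper name introduced for this example.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


input_size = 13   # assumption: the 13 feature columns selected above
hidden_size = 16
output_size = 1

layer_shapes = [
    (input_size, hidden_size),       # dense1
    (hidden_size, 2 * hidden_size),  # dense2
    (2 * hidden_size, hidden_size),  # dense3
    (hidden_size, output_size),      # dense4
]

per_layer = [linear_params(n_in, n_out) for n_in, n_out in layer_shapes]
total = sum(per_layer)

print(per_layer, total)  # -> [224, 544, 528, 17] 1313
```

Comparing this total with the number of training rows gives a quick sense of how easily the network can memorize a small dataset.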

In [None]:
cost = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=LR)
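Binary cross-entropy scores a predicted probability `p` against a label `y` as `-(y * log(p) + (1 - y) * log(1 - p))`. The sketch below evaluates that formula in plain Python for a few hypothetical predictions. The function name and the `eps` clipping are assumptions made for numerical safety in this example; `nn.BCELoss` handles extreme values in its own way.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Per-example binary cross-entropy for a label in {0, 1}."""
    p = min(max(p_pred, eps), 1.0 - eps)  # keep log() away from 0
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


confident_right = binary_cross_entropy(1, 0.9)  # small loss
confident_wrong = binary_cross_entropy(1, 0.1)  # large loss
undecided = binary_cross_entropy(0, 0.5)        # log(2), regardless of the label

print(round(confident_right, 4), round(confident_wrong, 4), round(undecided, 4))
```

The loss grows without bound as a confident prediction turns out to be wrong, which is what pushes the optimizer to correct such cases first.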

In [None]:
# Model training
train_losses = []
val_losses = []
train_accuracies = []
val_accuracies = []


for epoch in range(EPOCHS):
    # Training phase
    model.train()  # Set the model to training mode
    total_loss = 0.0
    correct_predictions = 0
    total_samples = 0

    for inputs, labels in train_loader:
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        loss = cost(outputs, labels.float().view(-1, 1))  # labels are 0 or 1

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        # Calculate accuracy
        predictions = (outputs > 0.5).float()
        correct_predictions += (predictions == labels.float().view(-1, 1)).sum().item()
        total_samples += labels.size(0)

        total_loss += loss.item()

    epoch_loss = total_loss / len(train_loader)
    epoch_accuracy = correct_predictions / total_samples

    train_losses.append(epoch_loss)
    train_accuracies.append(epoch_accuracy)

    # Validation phase
    model.eval()  # Set the model to evaluation mode
    total_loss = 0.0
    correct_predictions = 0
    total_samples = 0

    with torch.no_grad():
        for val_inputs, val_labels in val_loader:
            val_outputs = model(val_inputs)
            val_loss = cost(val_outputs, val_labels.float().view(-1, 1))

            # Calculate accuracy
            val_predictions = (val_outputs > 0.5).float()
            correct_predictions += (val_predictions == val_labels.float().view(-1, 1)).sum().item()
            total_samples += val_labels.size(0)

            total_loss += val_loss.item()

    val_epoch_loss = total_loss / len(val_loader)
    val_epoch_accuracy = correct_predictions / total_samples

    val_losses.append(val_epoch_loss)
    val_accuracies.append(val_epoch_accuracy)

    print(f'Epoch {epoch + 1}/{EPOCHS}, Train Loss: {epoch_loss:.4f}, Train Acc: {epoch_accuracy:.4f}, Val Loss: {val_epoch_loss:.4f}, Val Acc: {val_epoch_accuracy:.4f}')

In [None]:
fig, ax1 = plt.subplots(figsize=(10, 6))

# Plotting Accuracy on the primary y-axis
ax1.plot(train_accuracies, label='Training Accuracy', color='blue')
ax1.plot(val_accuracies, label='Validation Accuracy', color='green')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.set_ylim(0, 1)
ax1.legend(loc='upper left')

# Creating a secondary y-axis for Loss
ax2 = ax1.twinx()
ax2.plot(train_losses, label='Training Loss', color='blue', linestyle='--')
ax2.plot(val_losses, label='Validation Loss', color='green', linestyle='--')
ax2.set_ylabel('Loss')
ax2.legend(loc='upper right')
ax2.set_ylim(0, 2)

plt.title('Training and Validation Accuracy/Loss')
plt.show()

In [None]:
# Model evaluation on the test set
model.eval()
test_predictions = []

with torch.no_grad():
    for test_inputs, _ in test_loader:  # labels are not needed for prediction
        test_outputs = model(test_inputs)
        test_predictions.extend(test_outputs.cpu().numpy())

test_predictions = np.concatenate(test_predictions)

# Convert predictions to binary (0 or 1) based on a threshold (e.g., 0.5)
threshold = 0.5
y_pred = (torch.tensor(test_predictions) > threshold).int()

In [None]:
# Keep the raw predicted probabilities: the ROC curve needs continuous scores,
# so values below the threshold must not be replaced with 0.
prob_pos = test_predictions

In [None]:
# Classification Report
print("Classification Report:")
print(classification_report(Y_test, y_pred))

In [None]:
plot_confusion_matrix(Y_test, y_pred)

In [None]:
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(Y_test, prob_pos)

# Compute Area Under the Curve (AUC)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.show()
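The AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way in plain Python and shows how replacing every score below 0.5 with 0 changes the result. The labels and scores are hypothetical, and `pairwise_auc` is a name introduced for this illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.45]  # hypothetical predicted probabilities

# Zeroing every score below the 0.5 threshold discards the ranking among them.
truncated = [s if s > 0.5 else 0.0 for s in scores]

auc_raw = pairwise_auc(y_true, scores)
auc_truncated = pairwise_auc(y_true, truncated)
print(round(auc_raw, 3), round(auc_truncated, 3))
```

In this toy case the truncated scores give a lower AUC because most positive/negative pairs become ties, which is why ROC analysis is normally run on the continuous model outputs.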

Exercises:

* Would you accept this model? **What might be happening, judging by the training and validation plot?**

* Research activation functions (ReLU, Sigmoid, etc.). Learn what they do, along with their strengths and weaknesses. **Why do we need them in general?**

* Improve the model, if possible. Experiment with different optimizers, batch sizes, layers, activations, losses, etc. Tune the hyperparameters to get better results. **Is it necessary to include every feature in the model?**

* On a separate note, is the **XOR** function linearly separable?

Reference links:

* PyTorch Docs: https://pytorch.org/

Links of interest:

* Neural Network PlayGround: https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.99698&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false
