<a href="https://colab.research.google.com/github/anelglvz/Working-Analyst/blob/main/ML-AI-for-the-Working-Analyst/Semana10/Semana10_0_Intro_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

IMPORTANT: Before starting, change the runtime type to GPU.

# Problem

This week we will use the same dataset for both sessions. This first session analyzes a fraud problem in a simple way, that is, with a simple binary classifier (which will let us introduce some tools).

The dataset is available at https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

The data contain credit card transactions made in September 2013 by European cardholders. They took place over two days, during which there were 492 frauds out of 284,807 transactions.

It contains only numerical variables. The features V1 to V28 are the principal components obtained from a PCA transformation; the original features are not provided for confidentiality reasons.

The "Time" column contains the seconds elapsed between each transaction and the first transaction in the dataset. "Amount" is the transaction amount, and "Class" is the target variable, which is 1 in case of fraud and 0 otherwise.



Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
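
To illustrate why plain accuracy is not meaningful here, the following minimal sketch uses synthetic labels (an assumption, not the real dataset) with 2 positives per 1,000 samples. A model that never predicts fraud still reaches very high accuracy, while the average precision (an estimate of AUPRC) of uninformative scores falls to the positive-class prevalence.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Synthetic labels (assumption): 20 positives out of 10,000 samples.
y_true = np.zeros(10_000, dtype=int)
y_true[:20] = 1

# A useless "classifier" that always predicts the negative class.
always_negative = np.zeros_like(y_true)
accuracy = accuracy_score(y_true, always_negative)

# Constant scores carry no information about the classes.
constant_scores = np.full(len(y_true), 0.5)
baseline_auprc = average_precision_score(y_true, constant_scores)

print(f"Accuracy of the always-negative model: {accuracy:.4f}")
print(f"AUPRC of uninformative scores:         {baseline_auprc:.4f}")
```

Any AUPRC reported for a real model is therefore best read against the prevalence of the positive class, not against 0.5.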


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Read the data using a different method than we are used to

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

In [None]:
# Get the real data from https://www.kaggle.com/mlg-ulb/creditcardfraud/
link = '/content/drive/MyDrive/Curso-WorkingAnalyst/semana9/creditcard.csv'
data = pd.read_csv(link)
data

In [None]:
X = data.drop(columns = ['Class','Time'])
X.head()

In [None]:
y = data['Class']
y

# Prepare the validation set

We will hold out 25% of the data with `train_test_split`. The split is random, but fixing `random_state` makes it reproducible.

In [None]:
train_features, val_features, train_targets, val_targets = train_test_split(X,y,test_size=0.25, random_state=42)

print("Number of training samples:", len(train_features))
print("Number of validation samples:", len(val_features))
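
With so few positive samples, a purely random split can leave the two subsets with slightly different fraud rates. One possible variation, sketched here on synthetic data (the arrays `X_demo` and `y_demo` are assumptions for illustration), is to pass `stratify` so that both subsets keep the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 10,000 rows, 40 positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:40] = 1

# stratify=y_demo preserves the positive rate in both subsets.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print("Positives in training set:  ", y_tr.sum(), "of", len(y_tr))
print("Positives in validation set:", y_va.sum(), "of", len(y_va))
```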

# Analyze the class imbalance

In [None]:
# Note: y is the full dataset, not only the training split
print(f"Number of positive samples in the full dataset: {y.sum()} ({100 * y.sum() / len(y):.2f}% of total)")

We will assign class weights in a balanced way (this is the "balanced" class-weight heuristic used in scikit-learn). Computed manually, it is:

In [None]:
n_samples = len(y)
counts = np.bincount(np.array(y))


weight_for_0 = n_samples / (counts[0]*2)
weight_for_1 = n_samples / (counts[1]*2)

#counts = np.bincount(np.array(y))

#weight_for_0 = 1.0 / counts[0]
#weight_for_1 = 1.0 / counts[1]

print(f"{weight_for_0: .6f}")
print(f"{weight_for_1: .4f}")
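
As a sanity check, the manual formula can be compared with scikit-learn's `compute_class_weight`. This is a minimal sketch on toy labels (`y_demo` is an assumption, not the real target column).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption): 990 negatives and 10 positives.
y_demo = np.array([0] * 990 + [1] * 10)

# Manual formula: n_samples / (n_classes * samples_per_class)
counts = np.bincount(y_demo)
manual = len(y_demo) / (2 * counts)

# scikit-learn's "balanced" heuristic
balanced = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

print("Manual:      ", manual)
print("scikit-learn:", balanced)
```

Both arrays should agree, and the ratio between the two weights equals the ratio between the class counts.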

We can see that the two weights are in a ratio of roughly 578 to 1 (284,315 legitimate transactions for 492 frauds), the inverse of the class frequencies.

# Standardize the data

In [None]:
# This can also be done with sklearn.preprocessing.StandardScaler (exercise).
# The mean and std come from the training set only and are reused for validation.
mean = np.mean(train_features, axis=0)
train_features -= mean
val_features -= mean
std = np.std(train_features, axis=0)
train_features /= std
val_features /= std

In [None]:
train_features

# Binary classification model

In [None]:
# Keras: high-level module built on TensorFlow for creating neural networks
import tensorflow as tf
from tensorflow import keras

In [None]:
train_features.shape[1]

In [None]:
model = keras.Sequential(
    [
        keras.layers.Dense(
            256, activation="relu", input_shape=(train_features.shape[1],)
        ),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.summary()
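
The parameter counts reported by `model.summary()` can be reproduced by hand: a Dense layer has `(inputs + 1) * units` parameters (one weight per input plus one bias, for each unit). The sketch below assumes 29 input features (V1 to V28 plus Amount, after dropping "Class" and "Time"); the helper `dense_params` is hypothetical, written only for this illustration.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return (n_inputs + 1) * n_units

n_features = 29                   # assumption: V1..V28 plus Amount
layer_units = [256, 256, 256, 1]  # Dropout layers add no parameters

total = 0
n_in = n_features
for units in layer_units:
    params = dense_params(n_in, units)
    print(f"Dense({units}): {params:,} parameters")
    total += params
    n_in = units  # the output of this layer feeds the next one

print(f"Total: {total:,} parameters")
```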

In [None]:
tf.keras.utils.plot_model( 
    model,
    #to_file="model.png",
    show_shapes=True,
    show_dtype=False,
    show_layer_names=True,
    #rankdir="LR",
    #dpi=96,
)

# Train the model using class weights


In [None]:
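# Fix TensorFlow's global random seed for reproducibility.
# The model above is already built, so its initial weights were drawn before this call.
tf.random.set_seed(11)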

In [None]:
metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

model.compile(
    optimizer=keras.optimizers.Adam(0.01), loss="binary_crossentropy", metrics=metrics
)

callbacks = [keras.callbacks.ModelCheckpoint("fraud_model_at_epoch_{epoch}.h5")]
class_weight = {0: weight_for_0, 1: weight_for_1}

model.fit(
    train_features,
    train_targets,
    batch_size=2048,
    epochs=30,
    verbose=1,
    callbacks=callbacks,
    validation_data=(val_features, val_targets),
    class_weight=class_weight,
)

In [None]:
# Turn predicted probabilities into class labels using a 0.5 cutoff
y_train_pred = (model.predict(train_features) >= 0.5).astype(int)

In [None]:
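# Validation predictions use a stricter 0.8 cutoff than the 0.5 used for training,
# so the training and validation reports below are not directly comparable
y_test_pred = (model.predict(val_features) >= 0.8).astype(int)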
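
The two cells above use different cutoffs (0.5 for training, 0.8 for validation). To show how the cutoff trades precision against recall, here is a minimal sketch on toy probabilities (`y_true` and `y_prob` are made-up values, not model output).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and predicted probabilities (assumption, not model output).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.60, 0.70, 0.30, 0.55, 0.75, 0.85, 0.95])

results = {}
for threshold in (0.5, 0.8):
    # Label a sample as positive when its probability reaches the threshold.
    y_pred = (y_prob >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case, raising the cutoff removes the false positives but also misses half of the true positives.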

In [None]:
len(y_test_pred)

In [None]:
y_test_pred.sum()

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

In [None]:
ConfusionMatrixDisplay.from_predictions(train_targets, y_train_pred, cmap=plt.cm.Greens)

In [None]:
# Restore the weights saved at epoch "X"
# BE CAREFUL WHEN RUNNING THIS CELL: it overwrites the current model weights

# model.load_weights("/content/fraud_model_at_epoch_28.h5")

In [None]:
ConfusionMatrixDisplay.from_predictions(val_targets, y_test_pred, cmap=plt.cm.Greens)

In [None]:
from sklearn.metrics import average_precision_score,roc_auc_score

In [None]:
y_test_prob = model.predict(val_features)
y_test_prob

In [None]:
auprc = average_precision_score(val_targets, y_test_prob)
auprc
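
`roc_auc_score` is imported above but not used. As a hedged illustration of why AUPRC is preferred on imbalanced data, the sketch below builds synthetic scores (an assumption, not model output) in which ROC AUC looks strong while average precision stays low.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic scores (assumption): 1,000 negatives spread over [0, 0.9]
# and 10 positives that all receive a score of 0.85.
neg_scores = np.linspace(0.0, 0.9, 1000)
pos_scores = np.full(10, 0.85)

y_true = np.concatenate([np.zeros(1000, dtype=int), np.ones(10, dtype=int)])
y_score = np.concatenate([neg_scores, pos_scores])

roc_auc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)

print(f"ROC AUC: {roc_auc:.3f}")
print(f"AUPRC:   {auprc:.3f}")
```

Only a small fraction of the negatives outrank the positives, which is enough for a high ROC AUC. Because negatives are so numerous, that small fraction still outnumbers the positives, and precision suffers.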

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(train_targets, y_train_pred))

In [None]:
print(classification_report(val_targets, y_test_pred))

# Exercises:

* What would happen to the model if we did not use the `class_weight` parameter?

* Can we recover only the best epoch of our training?

* Why does the confusion matrix not seem to shed much light on what is happening?