# Integrative Project 2023-2

Students:

* Juan Felipe Cardona Arango
* Juan Sebastian Sanin Villareal
* Samuel Ceballos Posada
* Daniela Ximena Niño Barbosa



# Phase I. Business Understanding
Defining the client's needs (understanding the business)

## Business context

Consumer lending is a competitive market in which financial institutions compete to attract customers and generate revenue. Institutions in this market must make informed decisions about who should receive credit and on what terms.

Information relevant to decision-making:

* Applicant data: Financial institutions collect information about credit applicants, including income, expenses, existing debts, credit history, and the purpose of the loan. This information is used to assess the applicant's credit risk.
* Credit models: Financial institutions use credit models to assess applicants' credit risk. These models use historical data to predict the probability that an applicant will default.
* Business rules: Financial institutions must follow business rules when granting credit. These rules are designed to protect the institution from credit risk.

Key decisions:

* Whether to grant credit: The most important decision a financial institution makes is whether to approve or reject a credit application. This decision is based on the assessment of the applicant's credit risk.
* Credit terms: Once the institution decides to grant credit, it must negotiate the terms with the applicant. These include the loan amount, the term, the interest rate, and the collateral.

Optimizations:

* Automation: Financial institutions can automate their credit approval process to make it more efficient. This can reduce costs and shorten response times.
* Credit models: Financial institutions can use more advanced credit models to improve the accuracy of their credit risk assessments.
* Data analysis: Financial institutions can use data analysis to identify patterns and trends that are useful for credit decisions.

## Business question

The goal is to answer the following business question:

*XXXXXXX*

## Business rules

The business rules for granting or rejecting credit are the following:

* Ability to pay: The applicant must be able to repay the loan, including interest and fees. This can be assessed from the applicant's income, expenses, existing debts, and other financial commitments.
* Credit history: The applicant must have a positive credit history. This can be verified with the credit bureaus.
* Purpose of the loan: The purpose must be legitimate and backed by a clear business or personal plan.
* Collateral: The applicant may offer collateral to back the loan. This can include real estate, vehicles, or business assets.


# Phase II. Data Understanding
Studying and understanding the data

## Libraries and reading the dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import scipy.stats as stats
from sklearn.metrics import *
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import label_binarize
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
random_state = 42

In [None]:
# # Access to Drive
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# Read the dataset using Pandas
df = pd.read_csv('train.csv')
df

## EDA

### Exploring the dataset

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.info()

Observations:
1. The dataset contains null values.
2. The dataset has both numeric and string variables.

In [None]:
df.describe().T

In [None]:
df.describe(exclude=np.number).T

Observations:
1. The Customer_ID field has 12500 unique values, which means we have data for 12500 customers.
2. The Month field has only 8 unique values. We need to check which months are present.
3. The Age field has 1788 unique values. This is an anomaly, since valid ages should only fall between 0 and 100.
4. The SSN field has 12501 unique values, while Customer_ID has only 12500. An incorrect SSN may have been entered for one of the customers, since the same person cannot have multiple SSNs.
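
The mismatch between SSN and Customer_ID counts can be checked directly by counting distinct SSN values per customer. The following is a minimal sketch on toy data: the values are invented for illustration, and only the column names `Customer_ID` and `SSN` come from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration: customer B has two distinct SSN values
toy = pd.DataFrame({
    'Customer_ID': ['A', 'A', 'B', 'B', 'C'],
    'SSN': ['111-11-1111', '111-11-1111', '222-22-2222', 'garbled', '333-33-3333'],
})

# Count distinct SSN values per customer; more than one signals a bad entry
ssn_per_customer = toy.groupby('Customer_ID')['SSN'].nunique()
suspicious = ssn_per_customer[ssn_per_customer > 1].index.tolist()
print(suspicious)
```

Applied to the real `df`, the same `groupby(...).nunique()` call would list the customers whose SSN needs review.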

In [None]:
# Check missing value or incorrect data
for i in df:
  print('\n', i, df[i].unique())

Observations:
1. The dataset contains null values.
2. Some values contain invalid characters, for example the value 28_ in the Age field.
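
Malformed entries such as `28_` can also be located programmatically instead of by reading the full list of unique values. A minimal sketch, assuming the column should be numeric; the sample values are invented for illustration.

```python
import pandas as pd

# Toy column, invented for illustration, mimicking values such as '28_'
age = pd.Series(['23', '28_', '-500', '34', None], name='Age')

# Values that fail numeric parsing but are not missing are malformed
parsed = pd.to_numeric(age, errors='coerce')
malformed = age[parsed.isna() & age.notna()].tolist()
print(malformed)
```

Note that a value like `-500` parses as a number, so this check finds formatting problems only; implausible values still need a separate range check.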

### Visualizing missing values

In [None]:
df = df.replace(['None', 'nan', 'NaN'], np.nan)

In [None]:
# Missing values in each column
df.isna().sum()

In [None]:
# Visualize the missing values
plt.bar(df.columns, list(df.isna().sum()))
plt.xticks(rotation = 90)
plt.show()

In [None]:
# Percentage of missing values
df.isnull().mean()*100

# Phase III. Data Preparation
Data analysis and feature selection

## Data transformation

### Dropping unnecessary columns and data

In [None]:
# Drop unnecessary columns
df = df.drop(columns=['ID', 'Name', 'SSN', 'Type_of_Loan'])

In [None]:
# Convert ID from hexadecimal to integer
df['Customer_ID'] = df['Customer_ID'].apply(lambda x: int(x[4:], 16))

In [None]:
# Replace the character _ with a space in Payment_Behaviour
df['Payment_Behaviour'] = df['Payment_Behaviour'].astype(str).str.replace('_', ' ', regex=False)
# Remove invalid characters (regex=False so each symbol is matched literally)
sym = "\\`*_{}[]()>#@+!$:;%"
for i in df.columns:
    for c in sym:
        df[i] = df[i].astype(str).str.replace(c, '', regex=False)
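
The nested loop above makes one pass over every column for every symbol. As an alternative sketch, the same symbol set can be compiled into a single regular-expression character class and removed in one pass per column. The sample values are invented for illustration.

```python
import re
import pandas as pd

sym = "\\`*_{}[]()>#@+!$:;%"  # same symbol set as in the cell above
# Build a character class; re.escape makes every symbol literal
pattern = '[' + re.escape(sym) + ']'

toy = pd.Series(['28_', '$1200.5', '!@9#%8', 'Low_spent'])  # invented values
cleaned = toy.str.replace(pattern, '', regex=True)
print(cleaned.tolist())
```

Passing `regex=True` explicitly avoids depending on the default, which differs between pandas versions.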

In [None]:
df = df.replace(['None', 'nan', 'NaN'], np.nan)

In [None]:
# Replace empty strings with nan values
df = df.replace('', np.nan)

### Changing data types

In [None]:
# Transform the month names in Month to month numbers
df['Month'] = pd.to_datetime(df['Month'], format='%B').dt.month

In [None]:
# Transform the values in Credit_Mix to numbers
df['Credit_Mix'] = df['Credit_Mix'].astype(str).str.replace('Bad','1')
df['Credit_Mix'] = df['Credit_Mix'].astype(str).str.replace('Standard','2')
df['Credit_Mix'] = df['Credit_Mix'].astype(str).str.replace('Good','3')

In [None]:
# Transform the values in Credit_History_Age to numbers
def str_to_int(string):
    if string != 'nan':
        years = int(string[:string.index('.')])
        months = int(string[string.index('.')+1:])/12
        return(years + months)
    else:
        return np.nan

df['Credit_History_Age'] = df['Credit_History_Age'].astype(str).str.replace(' Years and ','.')
df['Credit_History_Age'] = df['Credit_History_Age'].astype(str).str.replace('Months','')
df['Credit_History_Age'] = df['Credit_History_Age'].apply(lambda x: str_to_int(x))
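
The conversion above relies on two chained string replacements followed by a custom parser. An equivalent result can be sketched with a single regular expression that captures years and months directly. This assumes the raw values follow the `N Years and M Months` pattern seen in the replacements above; the sample values are invented.

```python
import numpy as np
import pandas as pd

# Invented sample values in the 'N Years and M Months' format
raw = pd.Series(['22 Years and 1 Months', '0 Years and 6 Months', np.nan])

# Capture the two integers; rows that do not match yield NaN
parts = raw.str.extract(r'(\d+)\s+Years\s+and\s+(\d+)\s+Months').astype(float)
age_years = parts[0] + parts[1] / 12
print(age_years.round(3).tolist())
```

Missing values pass through as NaN without a special case, which keeps the later interpolation step unchanged.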

In [None]:
# Transform the values in Payment_of_Min_Amount to numbers
df['Payment_of_Min_Amount'] = df['Payment_of_Min_Amount'].str.replace('NM', '0')
df['Payment_of_Min_Amount'] = df['Payment_of_Min_Amount'].str.replace('No', '1')
df['Payment_of_Min_Amount'] = df['Payment_of_Min_Amount'].str.replace('Yes', '2')

In [None]:
# Transform the values in Payment_Behaviour to numbers
df['Payment_Behaviour'] = df['Payment_Behaviour'].astype(str).str.replace('Low spent Small value payments','1')
df['Payment_Behaviour'] = df['Payment_Behaviour'].astype(str).str.replace('Low spent Medium value payments','2')
df['Payment_Behaviour'] = df['Payment_Behaviour'].astype(str).str.replace('Low spent Large value payments','3')
df['Payment_Behaviour'] = df['Payment_Behaviour'].astype(str).str.replace('High spent Small value payments','4')
df['Payment_Behaviour'] = df['Payment_Behaviour'].astype(str).str.replace('High spent Medium value payments','5')
df['Payment_Behaviour'] = df['Payment_Behaviour'].astype(str).str.replace('High spent Large value payments','6')

In [None]:
# Transform the values in Credit_Score to numbers
df['Credit_Score'] = df['Credit_Score'].astype(str).str.replace('Poor','1')
df['Credit_Score'] = df['Credit_Score'].astype(str).str.replace('Standard','2')
df['Credit_Score'] = df['Credit_Score'].astype(str).str.replace('Good','3')

In [None]:
df = df.replace(['None', 'nan', 'NaN'], np.nan)

In [None]:
# Missing values in each column
df.isna().sum()

In [None]:
# Convert datatypes
# Do not include data with missing values in int_cols
int_cols = ['Customer_ID', 'Month', 'Age', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan',
            'Delay_from_due_date', 'Payment_of_Min_Amount', 'Payment_Behaviour', 'Credit_Score']
float_cols = ['Annual_Income', 'Monthly_Inhand_Salary', 'Changed_Credit_Limit', 'Outstanding_Debt',
              'Credit_Utilization_Ratio', 'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance']

for i in int_cols:
    df[i] = df[i].astype(int)
for i in float_cols:
    df[i] = df[i].astype(float)

In [None]:
# Missing values in each column
df.isna().sum()

### Replacing null values in discrete/categorical columns

In [None]:
df['Month'].value_counts()

In [None]:
df['Age'].unique()

In [None]:
# Assign each customer's most frequent Age to all of that customer's rows
df['Age'] = df.groupby('Customer_ID')['Age'].transform(lambda s: s.mode().iat[0])
df['Age'].unique()

In [None]:
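def fill_nan_with_mode(df, groupby, column):
    # Fill missing values with the mode of each group (local mode);
    # groups with no valid values are left as NaN
    def fill_mode(x):
        modes = x.mode()
        return x.fillna(modes.iat[0]) if not modes.empty else x
    df[column] = df.groupby(groupby)[column].transform(fill_mode)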
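
To make the behaviour of the group-wise mode fill concrete, here is a minimal sketch on toy data. The values are invented for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'Customer_ID': [1, 1, 1, 2, 2, 2],
    'Occupation': ['Engineer', np.nan, 'Engineer', 'Doctor', 'Doctor', np.nan],
})

# Fill each customer's missing values with that customer's most frequent value
toy['Occupation'] = toy.groupby('Customer_ID')['Occupation'].transform(
    lambda s: s.fillna(s.mode().iat[0])
)
print(toy['Occupation'].tolist())
```

Filling within each customer rather than with the global mode keeps one customer's attributes from leaking into another's rows.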

In [None]:
df['Occupation'].unique()

In [None]:
fill_nan_with_mode(df, 'Customer_ID', 'Occupation')
df['Occupation'].unique()

In [None]:
df['Credit_Mix'].unique()

In [None]:
fill_nan_with_mode(df, 'Customer_ID', 'Credit_Mix')
df['Credit_Mix'] = df['Credit_Mix'].astype(int)
df['Credit_Mix'].unique()

In [None]:
df['Payment_of_Min_Amount'].unique()

In [None]:
df['Payment_Behaviour'].unique()

In [None]:
df['Payment_Behaviour'] = df['Payment_Behaviour'].replace(98,np.nan)
fill_nan_with_mode(df, 'Customer_ID', 'Payment_Behaviour')
df['Payment_Behaviour'] = df['Payment_Behaviour'].astype(int)
df['Payment_Behaviour'].unique()

In [None]:
df['Credit_Score'].unique()

In [None]:
# Missing values in each column
df.isna().sum()

### Replacing null values and outliers in continuous columns

In [None]:
# Define Outlier Range
def get_iqr_lower_upper(df, column, multiply=1.5):
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 -q1

    lower = q1-iqr*multiply
    upper = q3+iqr*multiply
    affect = df.loc[(df[column]<lower)|(df[column]>upper)].shape
    print('Outliers:', affect)
    return lower, upper
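
As a worked illustration of the interquartile-range rule used by `get_iqr_lower_upper`, the sketch below computes the fences by hand for a small invented series and lists the values falling outside them.

```python
import pandas as pd

# Invented values: 1000 is an obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 1000])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)].tolist()
print(lower, upper, outliers)
```

A larger multiplier than 1.5 widens the fences and flags fewer points, which is what the `multiply` parameter of the function controls.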

In [None]:
def Numeric_Wrong_Values_Reassign_Group_Min_Max(df, groupby, column, inplace=True):
    # Identify Wrong values Range
    def get_group_min_max(df, groupby, column):
        # Mode of each group, then the smallest and largest of those modes
        group_modes = df.groupby(groupby)[column].agg(custom_mode).dropna()

        if group_modes.empty:
            return np.nan, np.nan

        return group_modes.min(), group_modes.max()

    # Assigning Wrong values
    def make_group_NaN_and_fill_mode(df, groupby, column, inplace=True):
        # Group the column by the grouping key
        grouped_data = df.groupby(groupby)[column]

        # Compute the mode of each group
        mode_result = grouped_data.apply(lambda x: custom_mode(x))

        # Nothing to do if no group has a valid mode
        if mode_result.dropna().empty:
            return df[column] if inplace else mode_result

        # Smallest and largest of the group modes
        mini_value = mode_result.min()
        maxi_value = mode_result.max()

        # Work on a copy of the original column
        col = df[column].copy()

        # Mark values outside the range as NaN
        col[(col < mini_value) | (col > maxi_value)] = np.nan

        # Fill with the local (per-group) mode
        mode_by_group = grouped_data.transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
        result = col.fillna(mode_by_group)

        # Modify in place if requested
        if inplace:
            df[column] = result
        else:
            return result

    # Custom mode function to handle NaN and multiple modes
    def custom_mode(x):
        mode_result = x.mode()
        return mode_result.iloc[0] if not mode_result.empty else np.nan

    # Run
    if inplace:
        # Before Assigning NaN values
        nan_count_before = df[column].isna().sum()
        if nan_count_before > 0:
            print(f'\nBefore Assigning: {column}: have {nan_count_before} NaN Values', end='\n')

        print("\nExisting Min, Max Values:", df[column].agg(['min', 'max']), sep='\n', end='\n')
        mini, maxi = get_group_min_max(df, groupby, column)
        print(f"\nGroupby by {groupby}'s Actual min, max Values:", f'min:\t{mini},\nmax:\t{maxi}', sep='\n', end='\n')

        a = df.groupby(groupby)[column].apply(list)
        print(f'\nBefore Assigning Example {column}:\n', *a.head().values, sep='\n', end='\n')

        # Assigning
        make_group_NaN_and_fill_mode(df, groupby, column)

        # After Assigning NaN values
        nan_count_after = df[column].isna().sum()
        if nan_count_after > 0:
            print(f'\nAfter Assigning: {column}: have {nan_count_after} NaN Values', end='\n')

        b = df.groupby(groupby)[column].apply(list)
        print(f'\nAfter Assigning Example {column}:\n', *b.head().values, sep='\n', end='\n')
    else:
        # Show
        return make_group_NaN_and_fill_mode(df, groupby, column)

In [None]:
df.Annual_Income.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Annual_Income')

In [None]:
df.Monthly_Inhand_Salary.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Monthly_Inhand_Salary')

In [None]:
df.Monthly_Inhand_Salary.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Num_Bank_Accounts')

In [None]:
df.Num_Credit_Card.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Num_Credit_Card')

In [None]:
df.Interest_Rate.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Interest_Rate')

In [None]:
df.Num_of_Loan.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Num_of_Loan')

In [None]:
df.Delay_from_due_date.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Delay_from_due_date')

In [None]:
# df.Num_of_Delayed_Payment.value_counts(dropna=False)

In [None]:
# Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Num_of_Delayed_Payment')

In [None]:
df.Changed_Credit_Limit.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Changed_Credit_Limit')

In [None]:
# df.Num_Credit_Inquiries.value_counts(dropna=False)

In [None]:
# Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Num_Credit_Inquiries')

In [None]:
df.Outstanding_Debt.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Outstanding_Debt')

In [None]:
df.Credit_Utilization_Ratio.value_counts(dropna=False)

In [None]:
df.Credit_Utilization_Ratio.isna().sum()

In [None]:
df.Credit_History_Age.value_counts(dropna=False)

In [None]:
df['Credit_History_Age'] = df.groupby('Customer_ID', group_keys=False)['Credit_History_Age'].apply(lambda x: x.interpolate().bfill().ffill())
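
Because credit history age grows steadily from month to month, interpolating within each customer is a natural way to fill its gaps. The sketch below shows on invented toy data what the `interpolate().bfill().ffill()` chain does to interior, leading, and trailing gaps.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: ages in years, one row per month
toy = pd.DataFrame({
    'Customer_ID': [1, 1, 1, 1, 2, 2, 2],
    'Credit_History_Age': [np.nan, 10.0, np.nan, 10.5, 3.0, np.nan, np.nan],
})

# interpolate() fills interior gaps linearly and trailing gaps with the last value;
# bfill() then covers leading gaps, and ffill() is a final safety net
filled = toy.groupby('Customer_ID', group_keys=False)['Credit_History_Age'].apply(
    lambda s: s.interpolate().bfill().ffill()
)
print(filled.sort_index().tolist())
```

Grouping by `Customer_ID` first matters: without it, values would be interpolated across the boundary between two different customers.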

In [None]:
df.Total_EMI_per_month.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Total_EMI_per_month')

In [None]:
df.Amount_invested_monthly.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Amount_invested_monthly')

In [None]:
df.Monthly_Balance.value_counts(dropna=False)

In [None]:
Numeric_Wrong_Values_Reassign_Group_Min_Max(df, 'Customer_ID', 'Monthly_Balance')

### Change datatypes

In [None]:
# Convert datatypes
int_cols = ['Customer_ID', 'Month', 'Age', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan',
            'Delay_from_due_date', 'Payment_of_Min_Amount', 'Payment_Behaviour', 'Credit_Score']
float_cols = ['Annual_Income', 'Monthly_Inhand_Salary', 'Changed_Credit_Limit', 'Outstanding_Debt',
              'Credit_Utilization_Ratio', 'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance']

for i in int_cols:
    df[i] = df[i].astype(int)
for i in float_cols:
    df[i] = df[i].astype(float)

# Phase IV. Modeling

## Helper functions

In [None]:
def calculate_mode(lst):
    if not lst:
        return None
    return statistics.mode(lst)

In [None]:
def preprocessor(data):
    # Step 1: Data Preprocessing
    data = data.sort_values(by=['Customer_ID', 'Month'])

    # Encoding categorical values
    categorical_columns = [cname for cname in data.columns if data[cname].nunique() and data[cname].dtype == "object"]

    ordinal_encoder = OrdinalEncoder()
    data[categorical_columns] = ordinal_encoder.fit_transform(data[categorical_columns])
    return data

In [None]:
def client_summary(data, target_month, num_prev_months):
    months = [1, 2, 3, 4, 5, 6, 7, 8]
    start_index = months.index(target_month) - num_prev_months
    start_index = max(0, start_index)
    previous_months = months[start_index:months.index(target_month)]

    agg_data = data[data['Month'].between(previous_months[0], previous_months[-1], inclusive='both')].groupby('Customer_ID').agg({
        'Age'                      :'mean',
        'Occupation'               : list,
        'Annual_Income'            :'mean',
        'Monthly_Inhand_Salary'    :'mean',
        'Num_Bank_Accounts'        :'mean',
        'Num_Credit_Card'          :'mean',
        'Interest_Rate'            :'mean',
        'Num_of_Loan'              :'mean',
        'Delay_from_due_date'      :'mean',
        #'Num_of_Delayed_Payment'   :'mean',
        'Changed_Credit_Limit'     :'mean',
        #'Num_Credit_Inquiries'     :'mean',
        'Credit_Mix'               :'first',
        'Outstanding_Debt'         :'mean',
        'Credit_Utilization_Ratio' :'mean',
        'Credit_History_Age'       :'mean',
        'Payment_of_Min_Amount'    :list,
        'Total_EMI_per_month'      :'mean',
        'Amount_invested_monthly'  :'mean',
        'Payment_Behaviour'        :'mean',
        'Monthly_Balance'          :'mean',
        'Credit_Score'             : list   # List of credit scores for the previous months
    }).reset_index()


    # Expand the list of credit scores into separate columns
    expanded_scores = agg_data['Credit_Score'].apply(pd.Series)
    expanded_scores.columns = [f'Credit_Score_{i}' for i in range(num_prev_months)]
    agg_data = pd.concat([agg_data, expanded_scores], axis=1)
    agg_data.drop('Credit_Score', axis=1, inplace=True)

    # Calculate mode for categorical variables
    agg_data['Occupation'] = agg_data['Occupation'].apply(calculate_mode)
    agg_data['Payment_of_Min_Amount'] = agg_data['Payment_of_Min_Amount'].apply(calculate_mode)

    # Target variable
    target_month_data = data[data['Month'] == target_month]
    target_month_data = target_month_data[["Customer_ID", "Credit_Score"]].rename(columns={'Credit_Score': 'Target_Score'})

    # Merge data and target
    agg_data = agg_data.merge(target_month_data, on="Customer_ID", how="inner")

    agg_data = agg_data.drop(['Customer_ID'], axis=1)

    return agg_data

In [None]:
def train_valid_test(data, test_month, prev_months):
    df_train = client_summary(data, prev_months+1, prev_months)
    for month in range(prev_months+2, test_month-1):
        summary = client_summary(data, month, prev_months)
        # Stack the windows row-wise; an outer merge on all columns could collapse identical rows
        df_train = pd.concat([df_train, summary], ignore_index=True)
    df_val = client_summary(data, test_month-1, prev_months)
    df_test = client_summary(data, test_month, prev_months)

    X_train, y_train = df_train.drop(['Target_Score'], axis=1), df_train['Target_Score']
    X_val, y_val = df_val.drop(['Target_Score'], axis=1), df_val['Target_Score']
    X_test, y_test = df_test.drop(['Target_Score'], axis=1), df_test['Target_Score']

    return X_train, X_val, X_test, y_train, y_val, y_test
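
To see which target months end up in each split, the month arithmetic of `train_valid_test` can be reproduced in isolation. This is a small sketch; the helper name `window_layout` is hypothetical and exists only for this illustration.

```python
def window_layout(test_month, prev_months):
    """Return the target months used for train, validation and test.

    Mirrors the month arithmetic of train_valid_test; this helper is
    hypothetical and is not part of the notebook's pipeline.
    """
    # First training window, then any further windows before validation
    train_targets = [prev_months + 1] + list(range(prev_months + 2, test_month - 1))
    return {
        'train': train_targets,
        'val': test_month - 1,
        'test': test_month,
    }

layout = window_layout(test_month=8, prev_months=3)
print(layout)
```

With `test_month=8` and `prev_months=6`, the same arithmetic yields month 7 as the only training target, which is also the validation month. Validation scores for that setting would therefore appear to be measured on the training rows, which is worth keeping in mind when comparing window sizes.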

## Validating the models with different numbers of months

### Model testing functions

In [None]:
def train_lr(X_train, y_train):
    # Hyperparameters to evaluate
    param_grid = {
        'classifier__C': np.logspace(-4, 4, 50),
        'classifier__max_iter': [100, 1000],
    }
    # Model to use
    model = LogisticRegression(random_state=random_state, class_weight='balanced')
    # Pipeline to use
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', model),
    ])
    # Search for the best hyperparameters; 'f1_macro' because the target has three classes
    lr = GridSearchCV(pipeline, cv=5, param_grid=param_grid, n_jobs=-1, scoring='f1_macro', verbose=0)
    lr.fit(X_train, y_train)
    return lr

In [None]:
def train_lr_ridge(X_train, y_train):
    # Hyperparameters to evaluate
    param_grid = {
        'classifier__C': np.logspace(-4, 4, 50),
        'classifier__max_iter': [100, 1000],
    }
    # Model to use (L2 penalty, stated explicitly)
    model = LogisticRegression(random_state=random_state, class_weight='balanced', penalty='l2', solver='lbfgs')
    # Pipeline to use
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', model),
    ])
    # Search for the best hyperparameters; 'f1_macro' because the target has three classes
    lr = GridSearchCV(pipeline, cv=5, param_grid=param_grid, n_jobs=-1, scoring='f1_macro', verbose=0)
    lr.fit(X_train, y_train)
    return lr

In [None]:
def train_lr_lasso(X_train, y_train):
    # Hyperparameters to evaluate
    param_grid = {
        'classifier__C': np.logspace(-4, 4, 50),
        'classifier__max_iter': [100, 1000],
    }
    # Model to use (L1 penalty requires a solver that supports it, such as liblinear)
    model = LogisticRegression(random_state=random_state, class_weight='balanced', penalty='l1', solver='liblinear')
    # Pipeline to use
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', model),
    ])
    # Search for the best hyperparameters; 'f1_macro' because the target has three classes
    lr = GridSearchCV(pipeline, cv=5, param_grid=param_grid, n_jobs=-1, scoring='f1_macro', verbose=0)
    lr.fit(X_train, y_train)
    return lr

In [None]:
def train_knn(X_train, y_train):
    param_grid = {
      'classifier__n_neighbors': [3, 5, 7, 9, 11, 15],
    }
    model = KNeighborsClassifier()
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', model),
    ])
    knn = GridSearchCV(pipeline, cv=5, param_grid=param_grid, n_jobs=-1, scoring='f1_macro')
    knn.fit(X_train, y_train)
    return knn

In [None]:
def train_lda(X_train, y_train):
    param_grid = {
        'classifier__solver': ['svd', 'lsqr', 'eigen'],
    }
    model = LinearDiscriminantAnalysis()
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', model),
    ])
    lda = GridSearchCV(pipeline, cv=5, param_grid=param_grid, n_jobs=-1, scoring='f1_macro')
    lda.fit(X_train, y_train)
    return lda

In [None]:
def train_qda(X_train, y_train):
    param_grid = {
        'classifier__reg_param': [0.1, 0.2, 0.3, 0.4, 0.5]
    }
    model = QuadraticDiscriminantAnalysis()
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', model),
    ])
    qda = GridSearchCV(pipeline, cv=5, param_grid=param_grid, n_jobs=-1, scoring='f1_macro')
    qda.fit(X_train, y_train)
    return qda

In [None]:
def train_nb(X_train, y_train):
    param_grid = {
        'classifier__var_smoothing': np.logspace(0,-9, num=100)
    }
    model = GaussianNB()
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', model),
    ])
    nb = GridSearchCV(pipeline, cv=5, param_grid=param_grid, n_jobs=-1, scoring='f1_macro')
    nb.fit(X_train, y_train)
    return nb
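
All the grids above select hyperparameters by F1. Since `Credit_Score` has three classes, F1 must be averaged across classes: scikit-learn's plain `'f1'` scorer is the binary variant, while `'f1_macro'` takes the unweighted mean of the per-class scores. The sketch below illustrates the macro average on invented labels; choosing macro over weighted averaging is an assumption here, not something the notebook states.

```python
from sklearn.metrics import f1_score

# Invented labels using the notebook's 1/2/3 encoding of Credit_Score
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

# Per-class F1, then their unweighted mean (macro average)
per_class = f1_score(y_true, y_pred, average=None, labels=[1, 2, 3])
macro = f1_score(y_true, y_pred, average='macro')
print(per_class, macro)
```

Macro averaging gives each class equal weight regardless of its frequency, which fits the use of `class_weight='balanced'` in the logistic regression models.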

### Functions for printing metrics

In [None]:
def clf_report_models(lr, lr_ridge, lr_lasso, knn, lda, qda, nb, X_val, y_val, months):
    # Pair each model with its display name so that none is skipped
    # (the previous loop over range(6) never reached Naive Bayes)
    models = [
        ('Logistic Regression', lr),
        ('Logistic Regression Ridge', lr_ridge),
        ('Logistic Regression Lasso', lr_lasso),
        ('KNN', knn),
        ('LDA', lda),
        ('QDA', qda),
        ('Naive Bayes', nb),
    ]
    for name, model in models:
        print(f'\n{name} - {months} prev months')
        y_pred = model.predict(X_val)
        print(classification_report(y_val, y_pred))

In [None]:
# Wrapped in a function because no fitted model named `qda` exists at this point
# in the notebook; the function name is illustrative
def plot_roc_multiclass(model, X_val, y_val, model_name='QDA'):
    # Decision scores for the validation set (one column per class)
    y_score = model.decision_function(X_val)

    # Binarize the labels
    classes = np.unique(y_val)
    y_val_bin = label_binarize(y_val, classes=classes)

    # ROC curve and AUC for each class
    fpr, tpr, roc_auc = {}, {}, {}
    for i in range(len(classes)):
        fpr[i], tpr[i], _ = roc_curve(y_val_bin[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Micro-average ROC curve and AUC
    fpr["micro"], tpr["micro"], _ = roc_curve(y_val_bin.ravel(), y_score.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

    # Plot the ROC curves
    plt.figure(figsize=(10, 6))
    plt.plot(fpr["micro"], tpr["micro"],
             label=f'Micro-average ROC curve (AUC = {roc_auc["micro"]:0.2f})',
             color='deeppink', linestyle=':', linewidth=4)

    colors = ['aqua', 'darkorange', 'cornflowerblue']
    for i, color in zip(range(len(classes)), colors):
        plt.plot(fpr[i], tpr[i], color=color, lw=2,
                 label=f'ROC curve (class {classes[i]}, AUC = {roc_auc[i]:0.2f})')

    plt.plot([0, 1], [0, 1], 'k--', lw=2)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'Receiver Operating Characteristic (ROC) Curve for {model_name}')
    plt.legend(loc="lower right")
    plt.show()

# Example call once the models are trained: plot_roc_multiclass(qda_3, X_val, y_val, 'QDA')


### Model validation

In [None]:
# Data processing
df = preprocessor(data=df)

In [None]:
month_to_pred = 8
prev_months_to_pred = 3

X_train, X_val, X_test, y_train, y_val, y_test = train_valid_test(df, month_to_pred, prev_months_to_pred)
lr_3 = train_lr(X_train, y_train)
lr_ridge_3 = train_lr_ridge(X_train, y_train)
lr_lasso_3 = train_lr_lasso(X_train, y_train)
knn_3 = train_knn(X_train, y_train)
lda_3 = train_lda(X_train, y_train)
qda_3 = train_qda(X_train, y_train)
nb_3 = train_nb(X_train, y_train)

In [None]:
month_to_pred = 8
prev_months_to_pred = 4

X_train, X_val, X_test, y_train, y_val, y_test = train_valid_test(df, month_to_pred, prev_months_to_pred)
lr_4 = train_lr(X_train, y_train)
lr_ridge_4 = train_lr_ridge(X_train, y_train)
lr_lasso_4 = train_lr_lasso(X_train, y_train)
knn_4 = train_knn(X_train, y_train)
lda_4 = train_lda(X_train, y_train)
qda_4 = train_qda(X_train, y_train)
nb_4 = train_nb(X_train, y_train)

In [None]:
month_to_pred = 8
prev_months_to_pred = 5

X_train, X_val, X_test, y_train, y_val, y_test = train_valid_test(df, month_to_pred, prev_months_to_pred)
lr_5 = train_lr(X_train, y_train)
lr_ridge_5 = train_lr_ridge(X_train, y_train)
lr_lasso_5 = train_lr_lasso(X_train, y_train)
knn_5 = train_knn(X_train, y_train)
lda_5 = train_lda(X_train, y_train)
qda_5 = train_qda(X_train, y_train)
nb_5 = train_nb(X_train, y_train)

In [None]:
month_to_pred = 8
prev_months_to_pred = 6

X_train, X_val, X_test, y_train, y_val, y_test = train_valid_test(df, month_to_pred, prev_months_to_pred)
lr_6 = train_lr(X_train, y_train)
lr_ridge_6 = train_lr_ridge(X_train, y_train)
lr_lasso_6 = train_lr_lasso(X_train, y_train)
knn_6 = train_knn(X_train, y_train)
lda_6 = train_lda(X_train, y_train)
qda_6 = train_qda(X_train, y_train)
nb_6 = train_nb(X_train, y_train)