<a href="https://colab.research.google.com/github/agonzalezl2025/Parcial2/blob/main/Parcial2HE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **0. Environment setup**

## 0.1: Library installation and imports

In [None]:
# === 0.1 Library installation and imports ===

# Quiet installation of libraries

# Deep learning frameworks
!pip install -q tensorflow torch keras keras-tuner

# Data analysis and manipulation
!pip install -q pandas numpy scikit-learn

# Visualization
!pip install -q matplotlib seaborn plotly

# Optimization and evaluation
!pip install -q optuna tensorboard scikit-optimize

# Utilities
!pip install -q tqdm joblib

# Exploratory analysis and data access
!pip install -q ydata-profiling datasets huggingface_hub kaggle

# Model explainability (imported in the SHAP cell at the end of the notebook)
!pip install -q shap

# Library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from ydata_profiling import ProfileReport


## 0.2: Setting the seed

In [None]:
# Define the random seed

SEED = 42
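
Defining `SEED` does not by itself make results reproducible; the value has to be passed to every random number generator in use. A minimal sketch, assuming only the standard library and NumPy (the helper name `set_global_seed` is hypothetical, and TensorFlow and PyTorch have their own seeding calls that are not shown here):

```python
import random

import numpy as np

SEED = 42


def set_global_seed(seed: int) -> None:
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


set_global_seed(SEED)
first_draw = (random.random(), np.random.rand())

set_global_seed(SEED)
second_draw = (random.random(), np.random.rand())

# Re-seeding with the same value reproduces the same draws
print(first_draw == second_draw)
```

scikit-learn estimators and splitters take the seed through their `random_state` argument instead of the global generators.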

# **1. Loading and merging the datasets**

## 1.1: Downloading and loading the datasets from Kaggle

In [None]:
# === 1.1 Download and load the datasets from Kaggle ===
!kaggle datasets download -d sazidthe1/world-gdp-data
!unzip world-gdp-data.zip

# Load the extracted CSV files
gdp_data = pd.read_csv("gdp_data.csv")
country_codes = pd.read_csv("country_codes.csv")



Dataset URL: https://www.kaggle.com/datasets/sazidthe1/world-gdp-data
License(s): Attribution 4.0 International (CC BY 4.0)
world-gdp-data.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  world-gdp-data.zip
replace country_codes.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace gdp_data.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
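
The `replace ...?` prompts in the output appear because `unzip` asks before overwriting files left by an earlier run (`unzip -o` suppresses the prompt). The same step can be sketched with the standard library's `zipfile` module, which overwrites existing files without asking. The helper name `extract_archive` and the throwaway archive below are illustrative only, not part of the notebook:

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path: Path, target_dir: Path) -> list[str]:
    """Extract every member of zip_path into target_dir, overwriting old copies."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())


# Demo on a throwaway archive so the sketch runs without Kaggle credentials
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    demo_zip = tmp_dir / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("gdp_data.csv", "country_code,year,value\nAFG,1960,1.0\n")

    extracted = extract_archive(demo_zip, tmp_dir / "data")
    extracted_again = extract_archive(demo_zip, tmp_dir / "data")  # rerun, no prompt
    file_exists = (tmp_dir / "data" / "gdp_data.csv").exists()

print(extracted, file_exists)
```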


## 1.2: Inspecting the data

In [None]:
# === 1.2 Inspecting the data ===
print("First rows of GDP Data:")
print(gdp_data.head())

print("\nFirst rows of Country Codes:")
print(country_codes.head())

First rows of GDP Data:
  country_name country_code  year        value
0  Afghanistan          AFG  1960  537777811.1
1  Afghanistan          AFG  1961  548888895.6
2  Afghanistan          AFG  1962  546666677.8
3  Afghanistan          AFG  1963  751111191.1
4  Afghanistan          AFG  1964  800000044.4

First rows of Country Codes:
  country_code                     region         income_group
0          ABW  Latin America & Caribbean          High income
1          AFG                 South Asia           Low income
2          AGO         Sub-Saharan Africa  Lower middle income
3          ALB      Europe & Central Asia  Upper middle income
4          AND      Europe & Central Asia          High income


## 1.3: Merging the datasets

In [None]:
# === 1.3 Merge the datasets ===
df = pd.merge(gdp_data, country_codes, on='country_code', how='inner')

df.head()


Unnamed: 0,country_name,country_code,year,value,region,income_group
0,Afghanistan,AFG,1960,537777811.1,South Asia,Low income
1,Afghanistan,AFG,1961,548888895.6,South Asia,Low income
2,Afghanistan,AFG,1962,546666677.8,South Asia,Low income
3,Afghanistan,AFG,1963,751111191.1,South Asia,Low income
4,Afghanistan,AFG,1964,800000044.4,South Asia,Low income
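
An inner join silently drops rows whose `country_code` is missing from either table. A minimal sketch for auditing that, using toy frames rather than the Kaggle data: `indicator=True` and `validate=` are standard `pd.merge` options, while the toy values (including the code `XXX`) are made up for illustration.

```python
import pandas as pd

gdp_toy = pd.DataFrame({
    "country_code": ["AFG", "AFG", "ALB", "XXX"],
    "year": [1960, 1961, 1960, 1960],
    "value": [1.0, 2.0, 3.0, 4.0],
})
codes_toy = pd.DataFrame({
    "country_code": ["AFG", "ALB", "AND"],
    "region": ["South Asia", "Europe & Central Asia", "Europe & Central Asia"],
})

# An outer join with an indicator column shows where each row came from;
# validate raises if country_code is duplicated in the lookup table.
audit = pd.merge(gdp_toy, codes_toy, on="country_code", how="outer",
                 indicator=True, validate="many_to_one")

only_gdp = audit.loc[audit["_merge"] == "left_only", "country_code"].tolist()
only_codes = audit.loc[audit["_merge"] == "right_only", "country_code"].tolist()
inner_rows = int((audit["_merge"] == "both").sum())

print(only_gdp, only_codes, inner_rows)
```

Rows flagged `left_only` are the GDP records an inner join would discard.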


# **2. Classification and train/test split**

## 2.1: Classifying the data into three GDP bands (30%-40%-30%)

In [None]:
import numpy as np
import pandas as pd

# === 1. Function to assign a GDP band using per-year thresholds ===
def categorize_gdp_by_year(row, p30_dict, p70_dict):
    year = row['year']
    value = row['value']

    # Look up the 30th and 70th percentiles for that year
    p30 = p30_dict.get(year, np.nan)
    p70 = p70_dict.get(year, np.nan)

    # 30%-40%-30% classification.
    # Note: comparisons with NaN are False, so a missing value or
    # threshold falls through to 'High'.
    if value < p30:
        return 'Low'
    elif value < p70:
        return 'Medium'
    else:
        return 'High'

# === 2. Compute the 30th and 70th percentiles for each year ===
p30_by_year = df.groupby('year')['value'].quantile(0.30).to_dict()
p70_by_year = df.groupby('year')['value'].quantile(0.70).to_dict()

# === 3. Apply the per-year categorization using those thresholds ===
df['historic_gdp'] = df.apply(lambda row: categorize_gdp_by_year(row, p30_by_year, p70_by_year), axis=1)

# === 4. Apply one-hot encoding ===
df = pd.get_dummies(df, columns=['historic_gdp'], prefix='GDP')

# Convert True/False to 0/1
df[['GDP_Low', 'GDP_Medium', 'GDP_High']] = df[['GDP_Low', 'GDP_Medium', 'GDP_High']].astype(int)


# === 5. Verify that the new columns were created correctly ===
print("\nColumns after one-hot encoding:")
print(df.columns)

# Show the first records to validate
print("\nFirst rows of the DataFrame after one-hot encoding:")
display(df.head())



Columns after one-hot encoding:
Index(['country_name', 'country_code', 'year', 'value', 'region',
       'income_group', 'GDP_High', 'GDP_Low', 'GDP_Medium'],
      dtype='object')

First rows of the DataFrame after one-hot encoding:


Unnamed: 0,country_name,country_code,year,value,region,income_group,GDP_High,GDP_Low,GDP_Medium
0,Afghanistan,AFG,1960,537777811.1,South Asia,Low income,0,0,1
1,Afghanistan,AFG,1961,548888895.6,South Asia,Low income,0,0,1
2,Afghanistan,AFG,1962,546666677.8,South Asia,Low income,0,0,1
3,Afghanistan,AFG,1963,751111191.1,South Asia,Low income,0,0,1
4,Afghanistan,AFG,1964,800000044.4,South Asia,Low income,0,0,1
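
The row-wise `apply` above can also be written with vectorized operations. A sketch on toy data, assuming the same per-year 30th and 70th percentile thresholds: `groupby(...).transform` broadcasts each year's threshold back to its rows, and `np.select` assigns the band.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "year": [2000] * 5 + [2001] * 5,
    "value": [10, 20, 30, 40, 50, 5, 15, 25, 35, 45],
})

by_year = toy.groupby("year")["value"]
p30 = by_year.transform(lambda s: s.quantile(0.30))  # per-row copy of the year's p30
p70 = by_year.transform(lambda s: s.quantile(0.70))  # per-row copy of the year's p70

# Conditions are checked in order, mirroring the if/elif/else chain
toy["historic_gdp"] = np.select(
    [toy["value"] < p30, toy["value"] < p70],
    ["Low", "Medium"],
    default="High",
)
print(toy["historic_gdp"].tolist())
```

As in the row-wise version, a NaN value fails both conditions and ends up in the default band, so missing values are worth handling before this step.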


## 2.2: 80% train / 20% test split

In [None]:

# === Sort by year and split the data into train and test ===
df = df.sort_values(by='year')
year_cutoff = int(round(df['year'].quantile(0.80)))

print(f"\nCutoff year for the train/test split: {year_cutoff}")

df_train = df[df['year'] <= year_cutoff].copy()
df_test = df[df['year'] > year_cutoff].copy()

# Check that the one-hot GDP columns are present in both sets after the split
print("\nColumns in df_train:")
print(df_train.columns)

print("\nColumns in df_test:")
print(df_test.columns)

# Show the first training and test rows
print("\nFirst training rows:")
display(df_train.head())

print("\nFirst test rows:")
display(df_test.head())


Cutoff year for the train/test split: 2012

Columns in df_train:
Index(['country_name', 'country_code', 'year', 'value', 'region',
       'income_group', 'GDP_High', 'GDP_Low', 'GDP_Medium'],
      dtype='object')

Columns in df_test:
Index(['country_name', 'country_code', 'year', 'value', 'region',
       'income_group', 'GDP_High', 'GDP_Low', 'GDP_Medium'],
      dtype='object')

First training rows:


Unnamed: 0,country_name,country_code,year,value,region,income_group,GDP_High,GDP_Low,GDP_Medium
0,Afghanistan,AFG,1960,537777800.0,South Asia,Low income,0,0,1
2083,Colombia,COL,1960,4031153000.0,Latin America & Caribbean,Upper middle income,1,0,0
8399,Singapore,SGP,1960,704751700.0,East Asia & Pacific,High income,0,0,1
909,Belize,BLZ,1960,28072480.0,Latin America & Caribbean,Upper middle income,0,1,0
4142,Haiti,HTI,1960,273187200.0,Latin America & Caribbean,Lower middle income,0,1,0



First test rows:


Unnamed: 0,country_name,country_code,year,value,region,income_group,GDP_High,GDP_Low,GDP_Medium
2136,Colombia,COL,2013,382000000000.0,Latin America & Caribbean,Upper middle income,1,0,0
8699,South Sudan,SSD,2013,18426470000.0,Sub-Saharan Africa,Low income,0,0,1
8881,St. Kitts and Nevis,KNA,2013,874896300.0,Latin America & Caribbean,High income,0,1,0
1406,Brunei Darussalam,BRN,2013,18094330000.0,East Asia & Pacific,High income,0,0,1
2242,"Congo, Dem. Rep.",COD,2013,32679750000.0,Sub-Saharan Africa,Low income,0,0,1
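
Because the cutoff is the 80th percentile of the `year` column, the split is temporal and only approximately 80/20 in row count. A minimal sketch on a toy panel (the country codes `AAA`, `BBB`, `CCC` are placeholders) that computes the cutoff the same way and checks both the training share and that no year appears on both sides:

```python
import pandas as pd

# Toy panel: 3 countries observed yearly from 2000 to 2009
toy = pd.DataFrame({
    "country_code": ["AAA", "BBB", "CCC"] * 10,
    "year": sorted(list(range(2000, 2010)) * 3),
})

year_cutoff = int(round(toy["year"].quantile(0.80)))
train = toy[toy["year"] <= year_cutoff]
test = toy[toy["year"] > year_cutoff]

train_share = len(train) / len(toy)
# A temporal split must not share any year between the two sets
overlap = set(train["year"]) & set(test["year"])

print(year_cutoff, round(train_share, 2), overlap)
```

Splitting by year rather than at random keeps future observations out of training, which is the usual motivation for this kind of cutoff.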


# **Data analysis**

## Dictionary

In [None]:
dict_by_country = (
    df
    .groupby('country_code')
    .apply(lambda x: x.to_dict(orient='records'))
    .to_dict()
)

# Example: show the contents for 'BEN'
dict_by_country['BEN']


[{'country_name': 'Benin',
  'country_code': 'BEN',
  'year': 1960,
  'value': 226195578.1,
  'region': 'Sub-Saharan Africa',
  'income_group': 'Lower middle income',
  'historic_gdp': 'Low'},
 {'country_name': 'Benin',
  'country_code': 'BEN',
  'year': 1961,
  'value': 235668220.9,
  'region': 'Sub-Saharan Africa',
  'income_group': 'Lower middle income',
  'historic_gdp': 'Low'},
 {'country_name': 'Benin',
  'country_code': 'BEN',
  'year': 1962,
  'value': 236434954.3,
  'region': 'Sub-Saharan Africa',
  'income_group': 'Lower middle income',
  'historic_gdp': 'Low'},
 {'country_name': 'Benin',
  'country_code': 'BEN',
  'year': 1963,
  'value': 253927697.6,
  'region': 'Sub-Saharan Africa',
  'income_group': 'Lower middle income',
  'historic_gdp': 'Low'},
 {'country_name': 'Benin',
  'country_code': 'BEN',
  'year': 1964,
  'value': 269819005.8,
  'region': 'Sub-Saharan Africa',
  'income_group': 'Lower middle income',
  'historic_gdp': 'Low'},
 {'country_name': 'Benin',
  'count
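
The same per-country dictionary can be built without `groupby.apply` by iterating over the groups directly. A sketch on toy rows (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "country_code": ["BEN", "BEN", "AFG"],
    "year": [1960, 1961, 1960],
    "value": [1.0, 2.0, 3.0],
})

# Each group becomes a list of row dictionaries keyed by its country code
dict_by_country = {
    code: group.to_dict(orient="records")
    for code, group in toy.groupby("country_code")
}

print(dict_by_country["BEN"])
```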

## Data analysis: TRAIN

In [None]:
# Confirm the TRAIN data:
df_train.head()

Unnamed: 0,country_name,country_code,year,value,region,income_group,historic_gdp
0,Afghanistan,AFG,1960,537777800.0,South Asia,Low income,Medium
6938,Nicaragua,NIC,1960,227223300.0,Latin America & Caribbean,Lower middle income,Low
10422,Zambia,ZMB,1960,713000000.0,Sub-Saharan Africa,Lower middle income,Medium
7385,Panama,PAN,1960,537147100.0,Latin America & Caribbean,High income,Medium
10252,"Venezuela, RB",VEN,1960,7663938000.0,Latin America & Caribbean,Upper middle income,High


In [None]:
# Descriptive analysis of the training data before any transformation:
from ydata_profiling import ProfileReport
reporte_train = ProfileReport(df_train, title="Profiling Report Train dataset")
reporte_train.to_file("reporte_train.html")
reporte_train



## Data analysis: TEST

In [None]:
# Confirm the TEST data:
df_test.head()

Unnamed: 0,country_name,country_code,year,value,region,income_group,historic_gdp
803,Barbados,BRB,2013,4677248000.0,Latin America & Caribbean,High income,Low
5126,Kenya,KEN,2013,61671440000.0,Sub-Saharan Africa,Lower middle income,Medium
754,Bangladesh,BGD,2013,150000000000.0,South Asia,Lower middle income,High
7933,Romania,ROU,2013,190000000000.0,Europe & Central Asia,High income,High
7501,Papua New Guinea,PNG,2013,21261340000.0,East Asia & Pacific,Lower middle income,Medium


In [None]:
# Descriptive analysis of the test data before any transformation:
from ydata_profiling import ProfileReport
reporte_test = ProfileReport(df_test, title="Profiling Report Test dataset")
reporte_test.to_file("reporte_test.html")
reporte_test



# Scaling and splitting into X and y



## Normalization

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,1))  # Scale to the [0, 1] range
df_train['value'] = scaler.fit_transform(df_train[['value']])  # Fit and transform on train
df_test['value'] = scaler.transform(df_test[['value']])  # Transform test with the scale fitted on train

# Show the first rows of train and test to verify
print("\nFirst rows of df_train after normalization:")
display(df_train.head())

print("\nFirst rows of df_test after normalization:")
display(df_test.head())


First rows of df_train after standardization:


Unnamed: 0,country_name,country_code,year,value,region,income_group,GDP_High,GDP_Low,GDP_Medium
0,Afghanistan,AFG,1960,-0.188385,South Asia,Low income,0,0,1
2083,Colombia,COL,1960,-0.183591,Latin America & Caribbean,Upper middle income,1,0,0
8399,Singapore,SGP,1960,-0.188156,East Asia & Pacific,High income,0,0,1
909,Belize,BLZ,1960,-0.189085,Latin America & Caribbean,Upper middle income,0,1,0
4142,Haiti,HTI,1960,-0.188748,Latin America & Caribbean,Lower middle income,0,1,0



First rows of df_test after standardization:


Unnamed: 0,country_name,country_code,year,value,region,income_group,GDP_High,GDP_Low,GDP_Medium
2136,Colombia,COL,2013,0.335088,Latin America & Caribbean,Upper middle income,1,0,0
8699,South Sudan,SSD,2013,-0.163837,Sub-Saharan Africa,Low income,0,0,1
8881,St. Kitts and Nevis,KNA,2013,-0.187923,Latin America & Caribbean,High income,0,1,0
1406,Brunei Darussalam,BRN,2013,-0.164293,East Asia & Pacific,High income,0,0,1
2242,"Congo, Dem. Rep.",COD,2013,-0.144278,Sub-Saharan Africa,Low income,0,0,1
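
The values printed above are negative, which is characteristic of z-score standardization; min-max scaling fitted on train would keep training values within [0, 1]. A sketch with plain pandas on toy numbers that contrasts the two, fitting both transforms on the training data only. It also shows that test values above the training maximum land outside [0, 1], which is plausible here because the test years are the most recent:

```python
import pandas as pd

train_values = pd.Series([10.0, 20.0, 30.0, 40.0])
test_values = pd.Series([20.0, 60.0])

# Min-max: the minimum and maximum come from the training data only
lo, hi = train_values.min(), train_values.max()
train_minmax = (train_values - lo) / (hi - lo)
test_minmax = (test_values - lo) / (hi - lo)

# Standardization: population std (ddof=0), the convention StandardScaler uses
mean, std = train_values.mean(), train_values.std(ddof=0)
train_z = (train_values - mean) / std

print(train_minmax.tolist())
print(test_minmax.tolist())
print(train_z.round(3).tolist())
```

With heavily skewed values such as GDP, both transforms leave most rows bunched near the low end; a log transform before scaling is one common option, though the notebook does not use it.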


## Splitting into X and y

In [None]:

# === 1. Split into X (features) and y (labels) ===
X_train = df_train.drop(columns=['income_group'])  # Independent variables
y_train = df_train['income_group']  # Dependent variable

X_test = df_test.drop(columns=['income_group'])
y_test = df_test['income_group']

# === 2. Verify the results ===
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Show which years fall in each set
print("\nYears in train:", df_train['year'].unique())
print("Years in test:", df_test['year'].unique())

# Show the first rows of X and y
display(X_train.head(), y_train.head())
display(X_test.head(), y_test.head())

X_train shape: (8479, 8), y_train shape: (8479,)
X_test shape: (2069, 8), y_test shape: (2069,)

Years in train: [1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987
 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012]
Years in test: [2013 2014 2015 2016 2017 2018 2019 2020 2021 2022]


Unnamed: 0,country_name,country_code,year,value,region,GDP_High,GDP_Low,GDP_Medium
0,Afghanistan,AFG,1960,-0.188385,South Asia,0,0,1
2083,Colombia,COL,1960,-0.183591,Latin America & Caribbean,1,0,0
8399,Singapore,SGP,1960,-0.188156,East Asia & Pacific,0,0,1
909,Belize,BLZ,1960,-0.189085,Latin America & Caribbean,0,1,0
4142,Haiti,HTI,1960,-0.188748,Latin America & Caribbean,0,1,0


Unnamed: 0,income_group
0,Low income
2083,Upper middle income
8399,High income
909,Upper middle income
4142,Lower middle income


Unnamed: 0,country_name,country_code,year,value,region,GDP_High,GDP_Low,GDP_Medium
2136,Colombia,COL,2013,0.335088,Latin America & Caribbean,1,0,0
8699,South Sudan,SSD,2013,-0.163837,Sub-Saharan Africa,0,0,1
8881,St. Kitts and Nevis,KNA,2013,-0.187923,Latin America & Caribbean,0,1,0
1406,Brunei Darussalam,BRN,2013,-0.164293,East Asia & Pacific,0,0,1
2242,"Congo, Dem. Rep.",COD,2013,-0.144278,Sub-Saharan Africa,0,0,1


Unnamed: 0,income_group
2136,Upper middle income
8699,Low income
8881,High income
1406,High income
2242,Low income


## Encoding

### Encoding the X features

In [None]:
from sklearn.preprocessing import LabelEncoder

# === 1. Encode 'country_name' ===
country_encoder = LabelEncoder()
X_train['country_name'] = country_encoder.fit_transform(X_train['country_name'])
X_test['country_name'] = country_encoder.transform(X_test['country_name'])

# Store the country-to-number mapping in a DataFrame
country_mapping = pd.DataFrame({'country_name': country_encoder.classes_, 'country_id': range(len(country_encoder.classes_))})

# === 2. Encode 'region' ===
region_encoder = LabelEncoder()
X_train['region'] = region_encoder.fit_transform(X_train['region'])
X_test['region'] = region_encoder.transform(X_test['region'])

# Store the region-to-number mapping in a DataFrame
region_mapping = pd.DataFrame({'region': region_encoder.classes_, 'region_id': range(len(region_encoder.classes_))})

# === 3. Drop the 'country_code' column ===
X_train = X_train.drop(columns=['country_code'], errors='ignore')
X_test = X_test.drop(columns=['country_code'], errors='ignore')

# === 4. Show the first rows to verify ===
print("\nFirst values of X_train after the transformation:")
display(X_train.head())

print("\nFull list of numbers assigned to each country:")
display(country_mapping)

print("\nFull list of numbers assigned to each region:")
display(region_mapping)


First values of X_train after the transformation:


Unnamed: 0,country_name,year,value,region,GDP_High,GDP_Low,GDP_Medium
0,0,1960,-0.188385,5,0,0,1
2083,41,1960,-0.183591,2,1,0,0
8399,168,1960,-0.188156,0,0,0,1
909,19,1960,-0.189085,2,0,1,0
4142,82,1960,-0.188748,2,0,1,0



Full list of numbers assigned to each country:


Unnamed: 0,country_name,country_id
0,Afghanistan,0
1,Albania,1
2,Algeria,2
3,American Samoa,3
4,Andorra,4
...,...,...
209,Virgin Islands (U.S.),209
210,West Bank and Gaza,210
211,"Yemen, Rep.",211
212,Zambia,212



Full list of numbers assigned to each region:


Unnamed: 0,region,region_id
0,East Asia & Pacific,0
1,Europe & Central Asia,1
2,Latin America & Caribbean,2
3,Middle East & North Africa,3
4,North America,4
5,South Asia,5
6,Sub-Saharan Africa,6
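
`LabelEncoder.transform` raises an error when the test set contains a category that was not present during fitting, for example a country whose first record falls after the cutoff year. A minimal pandas-only sketch of a train-fitted mapping that sends unseen categories to `-1`; the helper names and the toy country lists are hypothetical:

```python
import pandas as pd


def fit_category_mapping(train_col: pd.Series) -> dict:
    """Map each category seen in training to an integer id, in sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(train_col.unique()))}


def encode_with_mapping(col: pd.Series, mapping: dict, unknown: int = -1) -> pd.Series:
    """Apply a fitted mapping; categories absent from it get the unknown id."""
    return col.map(mapping).fillna(unknown).astype(int)


train_countries = pd.Series(["Colombia", "Afghanistan", "Belize", "Colombia"])
test_countries = pd.Series(["Belize", "South Sudan"])  # South Sudan unseen in train

mapping = fit_category_mapping(train_countries)
train_ids = encode_with_mapping(train_countries, mapping)
test_ids = encode_with_mapping(test_countries, mapping)

print(mapping)
print(train_ids.tolist(), test_ids.tolist())
```

Integer ids also imply an ordering that countries and regions do not have; one-hot encoding or an embedding layer is a common alternative when the model is a neural network.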


### Encoding the y labels

In [None]:
from sklearn.preprocessing import LabelEncoder

# === 1. Encode 'income_group' as numeric values ===
income_encoder = LabelEncoder()
y_train = income_encoder.fit_transform(y_train)  # Convert to integers
y_test = income_encoder.transform(y_test)  # Transform with the same encoding

# === 2. Show the mapping between categories and numbers ===
income_mapping = dict(zip(income_encoder.classes_, range(len(income_encoder.classes_))))
print("Numeric values assigned to income_group:")
print(income_mapping)

# === 3. Check the first values of y_train and y_test ===
print("\nFirst values of y_train after the conversion:")
print(y_train[:5])

print("\nFirst values of y_test after the conversion:")
print(y_test[:5])

Numeric values assigned to income_group:
{'High income': 0, 'Low income': 1, 'Lower middle income': 2, 'Upper middle income': 3}

First values of y_train after the conversion:
[1 3 0 3 2]

First values of y_test after the conversion:
[3 1 0 0 1]
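
To read model predictions, the integer labels have to be mapped back to income groups. `LabelEncoder` assigns ids in sorted order of the class names, which is what the printed mapping shows. The sketch below reproduces that mapping in plain Python and inverts it; with the fitted encoder itself, `income_encoder.inverse_transform` does the same job. The `sample_ids` list reuses the first five `y_train` values printed above:

```python
income_groups = [
    "Low income", "Upper middle income", "High income", "Lower middle income",
]

# Sorted class names -> consecutive integer ids
income_mapping = {name: idx for idx, name in enumerate(sorted(income_groups))}
inverse_mapping = {idx: name for name, idx in income_mapping.items()}

sample_ids = [1, 3, 0, 3, 2]
decoded = [inverse_mapping[i] for i in sample_ids]

print(income_mapping)
print(decoded)
```

The ids follow alphabetical order, not income level (`Low income` is 1 while `High income` is 0), so they should be treated as class labels rather than as an ordinal scale.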


In [None]:
# TODO: `model` is not defined yet; a trained model is needed before this SHAP value analysis can run
import shap

explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_train)

np.shape(shap_values.values)

shap.plots.waterfall(shap_values[0])

shap.plots.waterfall(shap_values[1], max_display=4)


