## ETAPA 4: MODELOS DE MACHINE LEARNING SIN LOS AGREGADOS DE "SEX" Y "AGE GROUP"

En este notebook se remueven los datos de agregados "Both" y "All" de las categorías "Sex" y "Age group", respectivamente, durante la generación variables *dummy*.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV

# modelos lineales
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import HuberRegressor 
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import TheilSenRegressor
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, ElasticNet, ElasticNetCV

# modelos de arboles
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
#!pip install xgboost
#pip install pydot
from xgboost import XGBRegressor
#!pip install mapie
from mapie.regression import MapieRegressor

# otros
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from sklearn.metrics import r2_score
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import PowerTransformer

import pickle
import joblib

### *Preparar Datos*

In [2]:
# Importar el dataset
df = pd.read_csv("../13 - Exports (preprocesamiento)/inmigrantes_merge.csv")

df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9360 entries, 0 to 9359
Data columns (total 28 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Year                               9360 non-null   int64  
 1   Nationality code                   9360 non-null   object 
 2   Sex                                9360 non-null   object 
 3   Age group                          9360 non-null   object 
 4   Immigrant count                    9360 non-null   int64  
 5   Unemployment %                     9360 non-null   float64
 6   Political and Violence Percentile  9360 non-null   float64
 7   Probability of dying young         9360 non-null   float64
 8   Rule of Law Percentile             9360 non-null   float64
 9   Salaried workers %                 9360 non-null   float64
 10  GDP_growth                         9360 non-null   float64
 11  Inflation_annual                   9360 non-null   float

Unnamed: 0,Year,Nationality code,Sex,Age group,Immigrant count,Unemployment %,Political and Violence Percentile,Probability of dying young,Rule of Law Percentile,Salaried workers %,...,Non-state_deaths,Intrastate_deaths,Interstate_deaths,Number of residents,Political regime,Homicide Rate,Number of Turist,Spanish language,Restricciones_pandemia,Año post_pandemia
0,2008,DZA,Both,0 - 14,759,11.33,14.90,3.7,24.52,67.41,...,0,345,0,51922,3,0.95,44400000,0,0,0
1,2008,PER,Males,35 - 44,2938,4.03,17.31,5.1,25.96,44.47,...,0,40,0,60185,7,5.27,44400000,1,0,0
2,2008,PER,Males,45 - 54,1128,4.03,17.31,5.1,25.96,44.47,...,0,40,0,60185,7,5.27,44400000,1,0,0
3,2008,PER,Males,55 - 64,265,4.03,17.31,5.1,25.96,44.47,...,0,40,0,60185,7,5.27,44400000,1,0,0
4,2008,PER,Males,65+,156,4.03,17.31,5.1,25.96,44.47,...,0,40,0,60185,7,5.27,44400000,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9355,2022,PAK,Males,55 - 64,330,5.60,6.60,5.8,25.00,42.14,...,0,670,0,68821,6,4.21,59310000,0,0,1
9356,2022,PAK,Females,55 - 64,146,5.60,6.60,5.8,25.00,42.14,...,0,670,0,31675,6,4.21,59310000,0,0,1
9357,2022,PAK,Both,65+,169,5.60,6.60,5.8,25.00,42.14,...,0,670,0,100496,6,4.21,59310000,0,0,1
9358,2022,PAK,Males,65+,99,5.60,6.60,5.8,25.00,42.14,...,0,670,0,68821,6,4.21,59310000,0,0,1


Primero, evaluremos modelos que no aceptan datos nulos y, posteriormente los modelos de árboles que sí los aceptan. Luego compararemos las distintas métricas juntándolas en dataframes para una mejor comparación.

### *Modelos Que No Aceptan Datos Nulos*

Antes de proceder, debemos remover a Senegal de la lista de de paises ya que tenemos datos nulos para la variable Tasa de Homidicidios, de manera que no haya conflicto con los modelos que evaluaremos. Además, haremos de la varaible "Year" una variable ordinal y el resto de variables categóricas a variables dummy (los regímenes políticos ya están en formato ordinal).

En el caso de "Year", simplemente restaremos 2007 a la columna entera, y para el resto de las variables objeto usaremos la funcion *.get_dummies()*.

In [4]:
# Removemos las categorías de agregados
df_noagg = df[(df["Sex"] != "Both") & (df["Age group"] != "All")]

# Hacer copia del df removiendo Senegal que presenta datos nulos para tasa de homicidios
df_nonull = df_noagg[df_noagg['Nationality code'] != 'SEN'].copy()

# Transformar Year a variable ordinal de 1 (2008) a 15 (2022)
df_nonull['Year'] = df_nonull['Year'] - 2007

# Generar variables dummies a partir de nuestras variables categóricas "object" (no ordinales)
df_nonull = pd.get_dummies(df_nonull)

# Convertir las variables dummies booleanas en "int"
col_bool = df_nonull.select_dtypes(include = ['bool']).columns
df_nonull[col_bool] = df_nonull[col_bool].astype(int)

# Verificar cambio
df_nonull.info()
df_nonull

<class 'pandas.core.frame.DataFrame'>
Index: 5250 entries, 1 to 9359
Data columns (total 68 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Year                                      5250 non-null   int64  
 1   Immigrant count                           5250 non-null   int64  
 2   Unemployment %                            5250 non-null   float64
 3   Political and Violence Percentile         5250 non-null   float64
 4   Probability of dying young                5250 non-null   float64
 5   Rule of Law Percentile                    5250 non-null   float64
 6   Salaried workers %                        5250 non-null   float64
 7   GDP_growth                                5250 non-null   float64
 8   Inflation_annual                          5250 non-null   float64
 9   Liberal democracy index                   5250 non-null   float64
 10  Health equality                          

Unnamed: 0,Year,Immigrant count,Unemployment %,Political and Violence Percentile,Probability of dying young,Rule of Law Percentile,Salaried workers %,GDP_growth,Inflation_annual,Liberal democracy index,...,Continent_America,Continent_Asia,Continent_Europe,Sub-region_Africa,Sub-region_Asia,Sub-region_Central America and Caribbean,Sub-region_European Union,Sub-region_North America,Sub-region_Rest of Europe,Sub-region_South America
1,1,2938,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
2,1,1128,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
3,1,265,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
4,1,156,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
8,1,4703,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9353,15,452,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9355,15,330,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9356,15,146,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9358,15,99,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0


Separemos el conjunto train/test y escalemos los datos.

In [67]:
# Separar variables input y variable target "Immigrant count" de df_null (dataframe sin atos nulos)
X = df_nonull.drop("Immigrant count", axis = 1) # variables predictoras
y = df_nonull["Immigrant count"]  # Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 58) # separar datos en conjunto train y test en un 75% / 25%
scaler = MinMaxScaler() # definir scaler de datos 
X_train = scaler.fit_transform(X_train) # escalar los datos de entrenamiento
X_test = scaler.transform(X_test) # transformar  los datos de prueba

Ahora realizamos una primera evaluación de cada modelo y su rendimiento, para luego compararlos.

#### Regresion Lineal

In [68]:
model_lineal = LinearRegression() # definicion del modelo 

model_lineal.fit(X_train, y_train) # ajuste del modelo 

# Aplicar modelo sobre los datos de traint y test para predecir el target
y_train_pred_lineal = model_lineal.predict(X_train)
y_test_pred_lineal = model_lineal.predict(X_test)

# Calcular métricas en conjunto train
r2_train_lineal = np.round(r2_score(y_train, y_train_pred_lineal), 3)
rmse_train_lineal = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_lineal)), 0)
mae_train_lineal = np.round(mean_absolute_error(y_train, y_train_pred_lineal), 0)
mape_train_lineal = np.round(mean_absolute_percentage_error(y_train, y_train_pred_lineal), 2)

# Calcular métricas en conjunto test
r2_test_lineal = np.round(r2_score(y_test, y_test_pred_lineal), 3)
rmse_test_lineal = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_lineal)), 0)
mae_test_lineal = np.round(mean_absolute_error(y_test, y_test_pred_lineal), 0)
mape_test_lineal = np.round(mean_absolute_percentage_error(y_test, y_test_pred_lineal), 2)

# Mostrar métricas
print("R2 train:", r2_train_lineal)
print("RMSE - train:", rmse_train_lineal)
print("MAE - train:", mae_train_lineal)
print("MAPE - train:", mape_train_lineal)
print("")
print("R2 test:", r2_test_lineal)
print("RMSE - test:", rmse_test_lineal)
print("MAE - test:", mae_test_lineal)
print("MAPE - test:", mape_test_lineal)

R2 train: 0.52
RMSE - train: 1129.0
MAE - train: 621.0
MAPE - train: 3.3

R2 test: 0.513
RMSE - test: 966.0
MAE - test: 614.0
MAPE - test: 2.57


In [69]:
# Observar Coeficientes de cada variable para cada modelo en un dataframe
coefficients_lineal = pd.DataFrame({'Variable':df_nonull.drop(["Immigrant count"], axis=1, inplace=False).columns})
coefficients_lineal['modelo_lineal']= model_lineal.coef_

# Mostrar coeficientes
coefficients_lineal.head(20)

Unnamed: 0,Variable,modelo_lineal
0,Year,-1976.253
1,Unemployment %,1579.405
2,Political and Violence Percentile,1181.398
3,Probability of dying young,3645.572
4,Rule of Law Percentile,-4808.757
5,Salaried workers %,-980.4175
6,GDP_growth,-695.4065
7,Inflation_annual,1611.803
8,Liberal democracy index,2200.988
9,Health equality,546.4825


#### Regresion lineal - Huber (ventaja: bajo efecto de outliers)

In [70]:
model_huber = HuberRegressor(epsilon=1.15, alpha = 0.05) # definicion del modelo 

model_huber.fit(X_train, y_train) # ajuste del modelo 

# Aplicar modelo sobre los datos de traint y test para predecir el target
y_train_pred_huber = model_huber.predict(X_train)
y_test_pred_huber = model_huber.predict(X_test)

# Calcular métricas en conjunto train
r2_train_huber = np.round(r2_score(y_train, y_train_pred_huber), 3)
rmse_train_huber = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_huber)), 0)
mae_train_huber = np.round(mean_absolute_error(y_train, y_train_pred_huber), 0)
mape_train_huber = np.round(mean_absolute_percentage_error(y_train, y_train_pred_huber), 2)

# Calcular métricas en conjunto test
r2_test_huber = np.round(r2_score(y_test, y_test_pred_huber), 3)
rmse_test_huber = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_huber)), 0)
mae_test_huber = np.round(mean_absolute_error(y_test, y_test_pred_huber), 0)
mape_test_huber = np.round(mean_absolute_percentage_error(y_test, y_test_pred_huber), 2)

# Mostrar métricas
print("R2 train:", r2_train_huber)
print("RMSE - train:", rmse_train_huber)
print("MAE - train:", mae_train_huber)
print("MAPE - train:", mape_train_huber)
print("")
print("R2 test:", r2_test_huber)
print("RMSE - test:", rmse_test_huber)
print("MAE - test:", mae_test_huber)
print("MAPE - test:", mape_test_huber)

R2 train: 0.293
RMSE - train: 1370.0
MAE - train: 526.0
MAPE - train: 0.97

R2 test: 0.34
RMSE - test: 1125.0
MAE - test: 493.0
MAPE - test: 0.83


In [71]:
# Observar Coeficientes de cada variable para cada modelo en un dataframe
coefficients_huber = pd.DataFrame({'Variable':df_nonull.drop(["Immigrant count"], axis=1, inplace=False).columns})
coefficients_huber['modelo_lineal']= model_huber.coef_

# Mostrar coeficientes
coefficients_huber.head(20)

Unnamed: 0,Variable,modelo_lineal
0,Year,49.189602
1,Unemployment %,191.122568
2,Political and Violence Percentile,-52.315324
3,Probability of dying young,353.186116
4,Rule of Law Percentile,-98.252585
5,Salaried workers %,-310.378187
6,GDP_growth,-229.849611
7,Inflation_annual,288.534681
8,Liberal democracy index,178.501573
9,Health equality,-152.139687


#### Regresion lineal - RANSAC (ventaja: bueno para grandes outliers en "y")

In [72]:
model_ransac = RANSACRegressor(min_samples = 15) # definicion del modelo 

model_ransac.fit(X_train, y_train) # ajuste del modelo 

# Aplicar modelo sobre los datos de traint y test para predecir el target
y_train_pred_ransac = model_ransac.predict(X_train)
y_test_pred_ransac = model_ransac.predict(X_test)

# Calcular métricas en conjunto train
r2_train_ransac = np.round(r2_score(y_train, y_train_pred_ransac), 3)
rmse_train_ransac = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_ransac)), 0)
mae_train_ransac = np.round(mean_absolute_error(y_train, y_train_pred_ransac), 0)
mape_train_ransac = np.round(mean_absolute_percentage_error(y_train, y_train_pred_ransac), 2)

# Calcular métricas en conjunto test
r2_test_ransac = np.round(r2_score(y_test, y_test_pred_ransac), 3)
rmse_test_ransac = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_ransac)), 0)
mae_test_ransac = np.round(mean_absolute_error(y_test, y_test_pred_ransac), 0)
mape_test_ransac = np.round(mean_absolute_percentage_error(y_test, y_test_pred_ransac), 2)

# Mostrar métricas
print("R2 train:", r2_train_ransac)
print("RMSE - train:", rmse_train_ransac)
print("MAE - train:", mae_train_ransac)
print("MAPE - train:", mape_train_ransac)
print("")
print("R2 test:", r2_test_ransac)
print("RMSE - test:", rmse_test_ransac)
print("MAE - test:", mae_test_ransac)
print("MAPE - test:", mape_test_ransac)

R2 train: -16.904
RMSE - train: 6897.0
MAE - train: 992.0
MAPE - train: 0.87

R2 test: -4.94
RMSE - test: 3373.0
MAE - test: 662.0
MAPE - test: 0.8


#### Regresion lineal - TheilSen (ventaja: bueno para outliers pequeños tanto en "X" como en "y")

In [73]:
model_theilsen = TheilSenRegressor() # definicion del modelo 

model_theilsen.fit(X_train, y_train) # ajuste del modelo 

# Aplicar modelo sobre los datos de traint y test para predecir el target
y_train_pred_theilsen = model_theilsen.predict(X_train)
y_test_pred_theilsen = model_theilsen.predict(X_test)

# Calcular métricas en conjunto train
r2_train_theilsen = np.round(r2_score(y_train, y_train_pred_theilsen), 3)
rmse_train_theilsen = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_theilsen)), 0)
mae_train_theilsen = np.round(mean_absolute_error(y_train, y_train_pred_theilsen), 0)
mape_train_theilsen = np.round(mean_absolute_percentage_error(y_train, y_train_pred_theilsen), 2)

# Calcular métricas en conjunto test
r2_test_theilsen = np.round(r2_score(y_test, y_test_pred_theilsen), 3)
rmse_test_theilsen = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_theilsen)), 0)
mae_test_theilsen = np.round(mean_absolute_error(y_test, y_test_pred_theilsen), 0)
mape_test_theilsen = np.round(mean_absolute_percentage_error(y_test, y_test_pred_theilsen), 2)

# Mostrar métricas
print("R2 train:", r2_train_theilsen)
print("RMSE - train:", rmse_train_theilsen)
print("MAE - train:", mae_train_theilsen)
print("MAPE - train:", mape_train_theilsen)
print("")
print("R2 test:", r2_test_theilsen)
print("RMSE - test:", rmse_test_theilsen)
print("MAE - test:", mae_test_theilsen)
print("MAPE - test:", mape_test_theilsen)

R2 train: 0.242
RMSE - train: 1419.0
MAE - train: 640.0
MAPE - train: 2.82

R2 test: 0.251
RMSE - test: 1198.0
MAE - test: 611.0
MAPE - test: 2.06


#### Modelos lineales regularizados (Ridge, Lasso, E-Net)

**Buscar Alfa Optimo**

In [74]:
# Definir modelo Ridge y para Evaluar el valor del "alpha" óptimo
ridgecv = RidgeCV()
ridgecv.fit(X_train, y_train)
print("Alfa Optimo Ridge:", ridgecv.alpha_)

# Definir modelo Lasso y para Evaluar el valor del "alpha" óptimo
lassocv = LassoCV()
lassocv.fit(X_train, y_train)
print("Alfa Optimo Lasso:", lassocv.alpha_)

# Definir modelo E-Net y para Evaluar el valor del "alpha" óptimo
enetcv = ElasticNetCV()
enetcv.fit(X_train, y_train)
print("Alfa Optimo E-Net:", enetcv.alpha_)

Alfa Optimo Ridge: 0.1
Alfa Optimo Lasso: 0.2213522106249588
Alfa Optimo E-Net: 0.2912699474431206


In [75]:
# Ingresamos el valor de alpha en una variable
alpha_opt_ridge = ridgecv.alpha_

# Ingresamos el valor de alpha en una variable
alpha_opt_lasso = lassocv.alpha_

# Ingresamos el valor de alpha en una variable
alpha_opt_enet = enetcv.alpha_

**Entrenar modelo con Alfa optimo**

In [76]:
# Definir modelo Ridge con nuestro valor optimo de alpha, entrenar y predecir
modelo_ridge = Ridge(alpha = alpha_opt_ridge)
y_test_ridge = modelo_ridge.fit(X_train, y_train).predict(X_test)

# Definir modelo Lasso con nuestro valor optimo de alpha, entrenar y predecir
modelo_lasso = Lasso(alpha = alpha_opt_lasso)
y_test_lasso = modelo_lasso.fit(X_train, y_train).predict(X_test)

# Definir modelo E-Net con nuestro valor optimo de alpha, entrenar y predecir
modelo_enet = ElasticNet(alpha = alpha_opt_enet)
y_test_enet = modelo_enet.fit(X_train, y_train).predict(X_test)

In [77]:
# Observar Coeficientes de cada variable para cada modelo en un dataframe
coefficients = pd.DataFrame({'Variable':df_nonull.drop(["Immigrant count"], axis=1, inplace=False).columns})
coefficients['modelo_ridge']= modelo_ridge.coef_
coefficients['modelo_lasso']= modelo_lasso.coef_
coefficients['modelo_net']= modelo_enet.coef_

# Mostrar coeficientes
coefficients

Unnamed: 0,Variable,modelo_ridge,modelo_lasso,modelo_net
0,Year,-1934.147404,-1870.121708,308.146821
1,Unemployment %,1505.991885,1370.396956,139.321300
2,Political and Violence Percentile,1135.577297,1065.273359,-92.489791
3,Probability of dying young,3435.317088,3216.512909,119.515290
4,Rule of Law Percentile,-4685.771781,-4235.870745,16.718544
...,...,...,...,...
62,Sub-region_Central America and Caribbean,109.695126,-0.000000,-139.274441
63,Sub-region_European Union,81.001137,4.236960,43.240084
64,Sub-region_North America,484.464010,1.975162,-55.615174
65,Sub-region_Rest of Europe,20.141710,-0.000000,2.202235


**Evaluar Métricas**

In [78]:
# Métricas en test - Ridge
r2_test_ridge = np.round(r2_score(y_test, y_test_ridge), 3)
rmse_test_ridge = np.round(np.sqrt(mean_squared_error(y_test, y_test_ridge)), 0)
mae_test_ridge = np.round(mean_absolute_error(y_test, y_test_ridge), 0)
mape_test_ridge = np.round(mean_absolute_percentage_error(y_test, y_test_ridge), 2)

# Mostrar métricas - Ridge
print("R2 test - Ridge:", r2_test_ridge)
print("RMSE test - Ridge:", rmse_test_ridge)
print("MAE test - Ridge:", mae_test_ridge)
print("MAPE test - Ridge:", mape_test_ridge)

R2 test - Ridge: 0.514
RMSE test - Ridge: 965.0
MAE test - Ridge: 611.0
MAPE test - Ridge: 2.56


In [79]:
# Métricas en test - Lasso
r2_test_lasso = np.round(r2_score(y_test, y_test_lasso), 3)
rmse_test_lasso = np.round(np.sqrt(mean_squared_error(y_test, y_test_lasso)), 0)
mae_test_lasso = np.round(mean_absolute_error(y_test, y_test_lasso), 0)
mape_test_lasso = np.round(mean_absolute_percentage_error(y_test, y_test_lasso), 2)

# Mostrar métricas - Lasso
print("R2 test - Lasso:", r2_test_lasso)
print("RMSE test - Lasso:", rmse_test_lasso)
print("MAE test - Lasso:", mae_test_lasso)
print("MAPE test - Lasso:", mape_test_lasso)

R2 test - Lasso: 0.516
RMSE test - Lasso: 963.0
MAE test - Lasso: 608.0
MAPE test - Lasso: 2.53


In [80]:
# Métricas en test - E-Net
r2_test_enet = np.round(r2_score(y_test, y_test_enet), 3)
rmse_test_enet = np.round(np.sqrt(mean_squared_error(y_test, y_test_enet)), 0)
mae_test_enet = np.round(mean_absolute_error(y_test, y_test_enet), 0)
mape_test_enet = np.round(mean_absolute_percentage_error(y_test, y_test_enet), 2)

# Mostrar métricas - E-Net
print("R2 test - E-Net:", r2_test_enet)
print("RMSE test - E-Net:", rmse_test_enet)
print("MAE test - E-Net:", mae_test_enet)
print("MAPE test - E-Net:", mape_test_enet)

R2 test - E-Net: 0.339
RMSE test - E-Net: 1125.0
MAE test - E-Net: 625.0
MAPE test - E-Net: 2.37


#### Desicion Tree

In [29]:
# Definir diccionario de valores para parámetros 
params = {'max_depth': range(6,8), 
          'min_samples_leaf' : [1, 3, 4], 
          'min_samples_split': [20, 30], 
          "criterion" : ["squared_error", "absolute_error", "poisson"] 
          } 

# Definir modelo y aplicar combinaciones de parametros según el diccinario 
tree = DecisionTreeRegressor() 
tree_cv = GridSearchCV(tree, params, cv = 3, refit = True, scoring = "neg_mean_squared_error")

# Entrenar modelo con cada combinación de parámetro 
tree_cv.fit(X_train, y_train) 

# Montrar los valores de los parámetros 
print(tree_cv.best_params_)

{'criterion': 'absolute_error', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 20}


In [81]:
# Definir modelo con los mejores valores de parámetros 
tree_best =  DecisionTreeRegressor(max_depth = 8, 
                                   min_samples_leaf = 2,
                                   min_samples_split = 15, 
                                   criterion = tree_cv.best_params_['criterion']) 

# Entrenar con el conjunto de entrenamiento 
tree_best.fit(X_train, y_train) 

# Aplicar modelo sobre los datos de traint y test para predecir el target
y_test_pred_tree = tree_best.predict(X_test) 
y_train_pred_tree = tree_best.predict(X_train) 

# Calcular métricas en conjunto train
r2_train_tree = np.round(r2_score(y_train, y_train_pred_tree), 3)
rmse_train_tree = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_tree)), 0)
mae_train_tree = np.round(mean_absolute_error(y_train, y_train_pred_tree), 0)
mape_train_tree = np.round(mean_absolute_percentage_error(y_train, y_train_pred_tree), 2)

# Calcular métricas en conjunto test
r2_test_tree = np.round(r2_score(y_test, y_test_pred_tree), 3)
rmse_test_tree = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_tree)), 0)
mae_test_tree = np.round(mean_absolute_error(y_test, y_test_pred_tree), 0)
mape_test_tree = np.round(mean_absolute_percentage_error(y_test, y_test_pred_tree), 2)

# Mostrar métricas
print("R2 - train:", r2_train_tree)
print("RMSE - train:", rmse_train_tree)
print("MAE - train:", mae_train_tree)
print("MAPE - train:", mape_train_tree)
print("")
print("R2 - test:", r2_test_tree)
print("RMSE - test:", rmse_test_tree)
print("MAE - test:", mae_test_tree)
print("MAPE - test:", mape_test_tree)


R2 - train: 0.69
RMSE - train: 908.0
MAE - train: 386.0
MAPE - train: 0.59

R2 - test: 0.551
RMSE - test: 927.0
MAE - test: 437.0
MAPE - test: 0.68


#### Random Forest

In [37]:
# Definir diccionario de valores para parámetros 
params = {'n_estimators': [100], 
	      'criterion' : ['squared_error', 'friedman_mse', 'poisson'],
          "min_samples_split": [30, 50, 70], 
          'min_samples_leaf' : [2, 3, 5],
          "max_depth": [7, 8],
          }

# Definir modelo y aplicar combinaciones de parametros según el diccinario 
rf = RandomForestRegressor() 
rf_cv = GridSearchCV(rf, params, cv=3, scoring='neg_mean_squared_error').fit(X_train, y_train)

# Motrar mejores valores para parámeros
rf_cv.best_estimator_

In [82]:
# Definir modelo con los mejores valores de parámetros (Nota: usar los mejores, pero hacer modificaciones para comparar metricas)
rf_best = RandomForestRegressor(n_estimators = 100, 
                           max_depth = 8, 
                           criterion = 'poisson', 
                           min_samples_split = 20,  #se ajustó a 20 para mejorar resultados en test
                           min_samples_leaf = 2,  
                           )

# Entrenar con el conjunto de entrenamiento 
rf_best.fit(X_train, y_train) 

# Aplicar modelo sobre los datos de traint y test para predecir el target
y_train_pred_rf = rf_best.predict(X_train)
y_test_pred_rf = rf_best.predict(X_test)

# Calcular métricas en conjunto train
r2_train_rf = np.round(r2_score(y_train, y_train_pred_rf), 3)
rmse_train_rf = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_rf)), 0)
mae_train_rf = np.round(mean_absolute_error(y_train, y_train_pred_rf), 0)
mape_train_rf = np.round(mean_absolute_percentage_error(y_train, y_train_pred_rf), 2)

# Calcular métricas en conjunto test
r2_test_rf = np.round(r2_score(y_test, y_test_pred_rf), 3)
rmse_test_rf = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_rf)), 0)
mae_test_rf = np.round(mean_absolute_error(y_test, y_test_pred_rf), 0)
mape_test_rf = np.round(mean_absolute_percentage_error(y_test, y_test_pred_rf), 2)

# Mostrar métricas
print("R2 - train:", r2_train_rf)
print("RMSE - train:", rmse_train_rf)
print("MAE - train:", mae_train_rf)
print("MAPE - train:", mape_train_rf)
print("")
print("R2 - test:", r2_test_rf)
print("RMSE - test:", rmse_test_rf)
print("MAE - test:", mae_test_rf)
print("MAPE - test:", mape_test_rf)

R2 - train: 0.793
RMSE - train: 742.0
MAE - train: 337.0
MAPE - train: 0.74

R2 - test: 0.765
RMSE - test: 672.0
MAE - test: 356.0
MAPE - test: 0.79


In [83]:
# Estimación de importancia relativa de variables en el modelo
imp_rel_rf = rf_best.feature_importances_
importancias = pd.DataFrame({"variable": X.columns, "importancia relativa": imp_rel_rf}) \
.sort_values(by='importancia relativa', ascending = False)
importancias[:15]

Unnamed: 0,variable,importancia relativa
15,Number of residents,0.340538
51,Age group_25 - 34,0.099056
55,Age group_65+,0.057912
3,Probability of dying young,0.055079
0,Year,0.049398
54,Age group_55 - 64,0.047293
6,GDP_growth,0.040452
1,Unemployment %,0.037375
18,Number of Turist,0.031265
5,Salaried workers %,0.029773


#### KNN

In [42]:
# Definir diccionario de valores para parámetros 
params = {'n_neighbors': range(1, 20),
          'weights' : ['uniform', 'distance'],
          }

# Definir modelo y aplicar combinaciones de parametros según el diccinario entrenando el conjunto train
knn = KNeighborsRegressor()
knn_cv = GridSearchCV(knn, params, cv=3, scoring='neg_mean_squared_error').fit(X_train,y_train)

knn_cv.best_params_

{'n_neighbors': 11, 'weights': 'distance'}

In [84]:
knn_best =  KNeighborsRegressor(n_neighbors = 11, weights = 'uniform', leaf_size=30, p = 1)

knn_best.fit(X_train, y_train)

# Obtener predicciones con conjunto de entrenamiento y prueba
y_train_pred_knn = knn_best.predict(X_train)  
y_test_pred_knn = knn_best.predict(X_test)  

# Calcular métricas en conjunto train
r2_train_knn = np.round(r2_score(y_train, y_train_pred_knn), 3)
rmse_train_knn = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_knn)), 0)
mae_train_knn = np.round(mean_absolute_error(y_train, y_train_pred_knn), 0)
mape_train_knn = np.round(mean_absolute_percentage_error(y_train, y_train_pred_knn), 2)

# Calcular métricas en conjunto train
r2_test_knn = np.round(r2_score(y_test, y_test_pred_knn), 3)
rmse_test_knn = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_knn)), 0)
mae_test_knn = np.round(mean_absolute_error(y_test, y_test_pred_knn), 0)
mape_test_knn = np.round(mean_absolute_percentage_error(y_test, y_test_pred_knn), 2)

# Print metrics
print("R2 - train:", r2_train_knn)
print("RMSE - train:", rmse_train_knn)
print("MAE - train:", mae_train_knn)
print("MAPE - train:", mape_train_knn)
print("")
print("R2 - test:", r2_test_knn)
print("RMSE - test:", rmse_test_knn)
print("MAE - test:", mae_test_knn)
print("MAPE - test:", mape_test_knn)

R2 - train: 0.721
RMSE - train: 861.0
MAE - train: 376.0
MAPE - train: 0.87

R2 - test: 0.697
RMSE - test: 762.0
MAE - test: 381.0
MAPE - test: 0.96


#### SVR

In [5]:
# Definir diccionario de valores para parámetros 
params = {'kernel': ['rbf', 'linear'],
          'gamma': ['scale', 'auto'],
          'C' : [1.0, 0.85, 0.75] ,
          'max_iter': [-1, 100],
          "tol" : [0.001, 0.002, 0.0015, 0.1, 0.2]
          }

# Definir modelo y aplicar combinaciones de parametros según el diccinario 
svr = SVR()
svr_cv = GridSearchCV(svr, params, cv = 3, refit = True, scoring = 'neg_mean_squared_error')

# Entrenar modelo con cada combinación de parámetro 
svr_cv.fit(X_train, y_train)

# Motrar mejores valores para parámeros
svr_cv.best_params_



{'C': 1.0, 'gamma': 'scale', 'kernel': 'linear', 'max_iter': -1, 'tol': 0.0015}

In [85]:
# Definir modelo con mejors parámetros
svr_best = SVR(kernel = 'linear', 
               gamma = 'scale', 
               C = 1.0, 
               max_iter = -1, 
               tol = 0.0015
               )

# Enrenar con el conjunto de entrenamiento 
svr_best.fit(X_train, y_train) 

# Obtener predicciones con conjunto de entrenamiento y prueba
y_train_pred_svr = svr_best.predict(X_train)  
y_test_pred_svr = svr_best.predict(X_test)  

# Calcular métricas en conjunto train
r2_train_svr = np.round(r2_score(y_train, y_train_pred_svr), 3)
rmse_train_svr = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_svr)), 0)
mae_train_svr = np.round(mean_absolute_error(y_train, y_train_pred_svr), 0)
mape_train_svr = np.round(mean_absolute_percentage_error(y_train, y_train_pred_svr), 2)

# Calcular métricas en conjunto test
r2_test_svr = np.round(r2_score(y_test, y_test_pred_svr), 3)
rmse_test_svr = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_svr)), 0)
mae_test_svr = np.round(mean_absolute_error(y_test, y_test_pred_svr), 0)
mape_test_svr = np.round(mean_absolute_percentage_error(y_test, y_test_pred_svr), 2)

# Mostrar métricas
print("R2 - train:", r2_train_svr)
print("RMSE - train:", rmse_train_svr)
print("MAE - train:", mae_train_svr)
print("MAPE - train:", mape_train_svr)
print("")
print("R2 - test:", r2_test_svr)
print("RMSE - test:", rmse_test_svr)
print("MAE - test:", mae_test_svr)
print("MAPE - test:", mape_test_svr)


R2 - train: 0.055
RMSE - train: 1585.0
MAE - train: 620.0
MAPE - train: 0.89

R2 - test: 0.074
RMSE - test: 1332.0
MAE - test: 579.0
MAPE - test: 0.89


#### Red Neuronal Básica

Obtengamos primero un número de referencia en cuanto a el número de neuronas a usar. 

In [24]:
# Dos métodos para estimar numero de neuronas a usar
print(np.sqrt(len(X.columns)*2))
print(2/3 * len(X.columns) + 2)

11.575836902790225
46.666666666666664


In [60]:
# Definir diccionario de valores para parámetros 
params = {'max_iter': [200],
          'hidden_layer_sizes':[11, 44, (44, 11)],
          'batch_size': [30, 50, 70],
          'activation': ['relu', 'identity', 'tanh'],
          'alpha': [0.1, 0.01],
          'early_stopping' : [True],
          'solver' : ['adam', 'sgd', 'lbfgs']}

# Definir modelo y aplicar combinaciones de parametros según el diccinario 
rn = MLPRegressor()
rn_cv = GridSearchCV(rn, param_grid = params, cv = 3, scoring='neg_mean_squared_error', verbose=True, n_jobs = -1)

# Entrenar modelo con cada combinación de parámetro 
rn_cv.fit(X_train, y_train)

# Motrar mejores valores para parámeros
rn_cv.best_params_

Fitting 3 folds for each of 162 candidates, totalling 486 fits


54 fits failed out of a total of 486.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
54 fits failed with the following error:
Traceback (most recent call last):
  File "c:\ProgramFiles\Anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\ProgramFiles\Anaconda3\Lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py", line 749, in fit
    return self._fit(X, y, incremental=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\ProgramFiles\Anaconda3\Lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py", line 471, in _fit
    self._fit_stochastic(
  File "c:\ProgramFiles\Anaconda3\Lib\site-packages\skl

{'activation': 'relu',
 'alpha': 0.01,
 'batch_size': 50,
 'early_stopping': True,
 'hidden_layer_sizes': 44,
 'max_iter': 200,
 'solver': 'lbfgs'}

In [16]:
# hacer un RN con los mejores parametros obtenidos y entrenar
rn_best = MLPRegressor(activation = 'relu',
                           alpha = 0.01, 
                           batch_size= 50, 
                           early_stopping = True, 
                           hidden_layer_sizes = 60, 
                           max_iter = 200, 
                           solver = 'lbfgs',
                           )

# # Entrenar con el conjunto de entrenamiento 
rn_best.fit(X_train, y_train)

# Aplicar modelo sobre los datos de traint y test para predecir el target
y_test_pred_rn = rn_best.predict(X_test) 
y_train_pred_rn = rn_best.predict(X_train) 

# Calculo de metricas en train
r2_train_rn = np.round(r2_score(y_train, y_train_pred_rn), 3)
rmse_train_rn = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_rn)), 0)
mae_train_rn = np.round(mean_absolute_error(y_train, y_train_pred_rn), 0)
mape_train_rn = np.round(mean_absolute_percentage_error(y_train, y_train_pred_rn), 2)

# Calculo de metricas en test
r2_test_rn = np.round(r2_score(y_test, y_test_pred_rn), 3)
rmse_test_rn = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_rn)), 0) 
mae_test_rn = np.round(mean_absolute_error(y_test, y_test_pred_rn), 0)
mape_test_rn = np.round(mean_absolute_percentage_error(y_test, y_test_pred_rn), 2)  

# Mostrar métricas
print("R2 - train:", r2_train_rn)
print("RMSE - train:", rmse_train_rn)
print("MAE - train:", mae_train_rn)
print("MAPE - train:", mape_train_rn)
print("")
print("R2 - test:", r2_test_rn)
print("RMSE - test:", rmse_test_rn)
print("MAE - test:", mae_test_rn)
print("MAPE - test:", mape_test_rn)

R2 - train: 0.968
RMSE - train: 290.0
MAE - train: 188.0
MAPE - train: 0.57

R2 - test: 0.911
RMSE - test: 413.0
MAE - test: 242.0
MAPE - test: 0.57


### *Modelos Que Aceptan Datos Nulos*

Ahora evalueremos los datos con dos modelos que aceptan datos nulos (Hist Gradient Boosting y XGBoost), por lo que haremos una copia del cojunto de datos con todos los países y, al igual que antes, removeremos las categorías de agregados, haremos de la variable "Year" una variable ordinal y el resto de variable categóricas a variables *dummy*.

In [98]:
# Filtar y remover agregados
df_noagg = df[(df["Sex"] != "Both") & (df["Age group"] != "All")]

# hacer copia del df_noagg
df_copy = df_noagg.copy()

# Transformar Year a variable ordinal de 1 (2008) a 15 (2022)
df_copy['Year'] = df_copy['Year'] - 2007

# Generar variables dummies a partir de nuestras variables categóricas "object" (no ordinales)
df_copy = pd.get_dummies(df_copy)

# Convertir las variables dummies booleanas en "int"
col_bool = df_copy.select_dtypes(include = ['bool']).columns
df_copy[col_bool] = df_copy[col_bool].astype(int)

# Verificar cambio
df_copy.info()
df_copy

<class 'pandas.core.frame.DataFrame'>
Index: 5460 entries, 1 to 9359
Data columns (total 69 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Year                                      5460 non-null   int64  
 1   Immigrant count                           5460 non-null   int64  
 2   Unemployment %                            5460 non-null   float64
 3   Political and Violence Percentile         5460 non-null   float64
 4   Probability of dying young                5460 non-null   float64
 5   Rule of Law Percentile                    5460 non-null   float64
 6   Salaried workers %                        5460 non-null   float64
 7   GDP_growth                                5460 non-null   float64
 8   Inflation_annual                          5460 non-null   float64
 9   Liberal democracy index                   5460 non-null   float64
 10  Health equality                          

Unnamed: 0,Year,Immigrant count,Unemployment %,Political and Violence Percentile,Probability of dying young,Rule of Law Percentile,Salaried workers %,GDP_growth,Inflation_annual,Liberal democracy index,...,Continent_America,Continent_Asia,Continent_Europe,Sub-region_Africa,Sub-region_Asia,Sub-region_Central America and Caribbean,Sub-region_European Union,Sub-region_North America,Sub-region_Rest of Europe,Sub-region_South America
1,1,2938,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
2,1,1128,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
3,1,265,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
4,1,156,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
8,1,4703,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9353,15,452,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9355,15,330,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9356,15,146,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9358,15,99,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0


Separemos el conjunto train/test y escalemos los datos.

In [99]:
# Separar variables input y variable target "Immigrant count" de df_copy
X = df_copy.drop("Immigrant count", axis = 1) # variables predictoras
y = df_copy["Immigrant count"]  # Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 58) # separar datos en conjunto train y test en un 75% / 25%
scaler = MinMaxScaler() # definir scaler de datos 
X_train = scaler.fit_transform(X_train) # escalar los datos de entrenamiento
X_test = scaler.transform(X_test) # # escalar los datos de prueba

#### Hist Gradient Boosting

In [90]:
# Definir diccionario de valores para parámetros 
params = {'max_iter': [120], 
	      'loss' : ['squared_error', 'gamma', 'poisson'],
          "learning_rate": [0.1, 0.01], 
          'min_samples_leaf' : [2, 3, 5],
          "max_depth": [7, 8],
          'l2_regularization' : [0.0, 0.1, 0.3] #usar si se tienen muchas variables
          }

# Definir modelo y aplicar combinaciones de parametros según el diccinario 
hgb = HistGradientBoostingRegressor() 
hgb_cv = GridSearchCV(hgb, params, cv=3, scoring='neg_mean_squared_error').fit(X_train, y_train)

hgb_cv.best_estimator_

108 fits failed out of a total of 324.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
108 fits failed with the following error:
Traceback (most recent call last):
  File "c:\ProgramFiles\Anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\ProgramFiles\Anaconda3\Lib\site-packages\sklearn\ensemble\_hist_gradient_boosting\gradient_boosting.py", line 353, in fit
    self._validate_params()
  File "c:\ProgramFiles\Anaconda3\Lib\site-packages\sklearn\base.py", line 600, in _validate_params
    validate_parameter_constraints(
  File "c:\ProgramFiles\Anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py", line 97, in validate_param

In [100]:
# Definir modelo con los mejores valores de parámetros (Nota: usar los mejores, pero hacer modificaciones para comparar metricas)
hgb_best = HistGradientBoostingRegressor(
                           max_iter = 120, 
                           max_depth = 8,
                           loss = 'poisson', 
                           learning_rate = 0.1, 
                           min_samples_leaf = 32,
                           max_leaf_nodes = 37,
                           )

# Entrenar con el conjunto de entrenamiento 
hgb_best.fit(X_train, y_train) 

# Aplicar modelo sobre los datos de traint y test para predecir el target
y_train_pred_hgb = hgb_best.predict(X_train)
y_test_pred_hgb = hgb_best.predict(X_test)

# Calculo de metricas en train
r2_train_hgb = np.round(r2_score(y_train, y_train_pred_hgb), 3)
rmse_train_hgb = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_hgb)), 0)
mae_train_hgb = np.round(mean_absolute_error(y_train, y_train_pred_hgb), 0)
mape_train_hgb = np.round(mean_absolute_percentage_error(y_train, y_train_pred_hgb), 2)

# Calculo de metricas en train
r2_test_hgb = np.round(r2_score(y_test, y_test_pred_hgb), 3)
rmse_test_hgb = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_hgb)), 0)
mae_test_hgb = np.round(mean_absolute_error(y_test, y_test_pred_hgb), 0)
mape_test_hgb = np.round(mean_absolute_percentage_error(y_test, y_test_pred_hgb), 2)

# Mostrar métricas
print("R2 - train:", r2_train_hgb)
print("RMSE - train:", rmse_train_hgb)
print("MAE - train:", mae_train_hgb)
print("MAPE - train:", mape_train_hgb)
print('')
print("R2 - test:", r2_test_hgb)
print("RMSE - test:", rmse_test_hgb)
print("MAE - test:", mae_test_hgb)
print("MAPE - test:", mape_test_hgb)

R2 - train: 0.981
RMSE - train: 213.0
MAE - train: 110.0
MAPE - train: 0.23

R2 - test: 0.957
RMSE - test: 321.0
MAE - test: 152.0
MAPE - test: 0.29


In [89]:
# Importancia por permutaciones
importancias_permu = permutation_importance(hgb_best, X_train, y_train, n_repeats=10, random_state=58)

# Importancias en dataframe
importances_hgb = pd.DataFrame({
    'feature': X.columns,
    'importance_mean': importancias_permu.importances_mean,
    'importance_std': importancias_permu.importances_std
}).sort_values(by='importance_mean', ascending=False)

importances_hgb.head(15)

Unnamed: 0,feature,importance_mean,importance_std
15,Number of residents,0.822939,0.028863
0,Year,0.20197,0.007397
52,Age group_25 - 34,0.186353,0.014187
1,Unemployment %,0.184697,0.019868
56,Age group_65+,0.176634,0.033134
55,Age group_55 - 64,0.138479,0.025939
18,Number of Turist,0.097825,0.009394
5,Salaried workers %,0.09444,0.009744
51,Age group_15 - 24,0.083975,0.005891
54,Age group_45 - 54,0.046044,0.007716


In [90]:
# Exportar tablas comparativas de metricas
importances_hgb.to_csv("../16 - Exports Modelos/sin agregados/feature_importance_hgb_noagg.csv", index = False)

#### XGBoost

In [5]:
# Definir diccionario de valores para parámetros 
params = {'objective' : ['reg:squarederror', 'reg:squaredlogerror'], # seleccionar 2 como max, reg:squarederror es el default
          'eval_metric' : ['rmse'],   # rmsle disminuye efecto de outliers
          'booster' : ["gbtree", "gblinear", "dart"], # gbtree (default) y dart estan basados en arboles
          'n_estimators': [200],
          'max_depth': [7, 9], # dependiendo de la dimensionalidad de los datos, usar valores de profundidad menor
          "learning_rate" : [0.1, 0.3, 0.05], 
       #   "colsample_bytree" : [1, 0.7], # porcentaje de variables a usar (buen parametro para cuando se tiene una gran cantidad de variables)
          }

# Definir modelo y aplicar combinaciones de parametros según el diccinario 
xgb = XGBRegressor()
xgb_cv = GridSearchCV(xgb, params, cv=3, scoring = 'neg_mean_squared_error') # elegir scoring deseano (r2, mae, mse, mape...)

# Entrenar modelo con cada combinación de parámetro 
xgb_cv.fit(X_train,y_train)

# Motrar mejores valores para parámeros
xgb_cv.best_params_

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters:

{'booster': 'dart',
 'eval_metric': 'rmse',
 'learning_rate': 0.3,
 'max_depth': 7,
 'n_estimators': 200,
 'objective': 'reg:squarederror'}

In [91]:
# Definir modelo con los mejores valores de parámetros (Nota: usar los mejores, pero hacer modificaciones al comparar metricas)
xgb_best = XGBRegressor(objective = 'reg:squarederror', 
                        eval_metric = 'rmse',
                        booster = 'dart',
                        n_estimators = 200,
                        max_depth = 7,
                        #alpha = 0.05, 
                        learning_rate = 0.3,
                        min_child_weight = 30, 
                      #  colsample_bytree = 0.8,
                    )

# Entrenar modelo con el conjunto de entrenamiento 
xgb_best.fit(X_train, y_train)

# Aplicar modelo sobre los datos de traint y test para predecir el target
y_train_pred_xgb = xgb_best.predict(X_train) 
y_test_pred_xgb = xgb_best.predict(X_test) 


# Calculate metrics for train set
r2_train_xgb = np.round(r2_score(y_train, y_train_pred_xgb), 3)
rmse_train_xgb = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred_xgb)), 0)
mae_train_xgb = np.round(mean_absolute_error(y_train, y_train_pred_xgb), 0)
mape_train_xgb = np.round(mean_absolute_percentage_error(y_train, y_train_pred_xgb), 2)

# Calculate metrics for test set
r2_test_xgb = np.round(r2_score(y_test, y_test_pred_xgb), 3)
rmse_test_xgb = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred_xgb)), 0)
mae_test_xgb = np.round(mean_absolute_error(y_test, y_test_pred_xgb), 0)
mape_test_xgb = np.round(mean_absolute_percentage_error(y_test, y_test_pred_xgb), 2)

# Print metrics
print("R2 - train:", r2_train_xgb)
print("RMSE - train:", rmse_train_xgb)
print("MAE - train:", mae_train_xgb)
print("MAPE - train:", mape_train_xgb)
print("")
print("R2 - test:", r2_test_xgb)
print("RMSE - test:", rmse_test_xgb)
print("MAE - test:", mae_test_xgb)
print("MAPE - test:", mape_test_xgb)


R2 - train: 0.978
RMSE - train: 232.0
MAE - train: 119.0
MAPE - train: 0.6

R2 - test: 0.91
RMSE - test: 465.0
MAE - test: 226.0
MAPE - test: 0.98


### *Comparar Modelos*

Juntemos los resultados todos los modelos en un dataframe por tipo de métrica, ordenando por el mejor valor en el conjunto test.

In [92]:
# Juntar en un dataframe los datos
modelos_r2 = pd.DataFrame({
    'Modelo' : ['Lineal', 'Lineal - Huber', 'Lineal - RANSAC', 'Lineal - Theilsen', 'Lineal - Ridge', 'Lineal - Lasso', 'Lineal - E-Net', 'Decision Tree', 'Random Forest', 'KNN', 'SVR', 'Red Neuronal', 'HGB', 'XGBoost'],
    'R² train' : [r2_train_lineal, r2_train_huber, r2_train_ransac, r2_train_theilsen, np.nan, np.nan, np.nan, r2_train_tree, r2_train_rf, r2_train_knn, r2_train_svr, r2_train_rn, r2_train_hgb, r2_train_xgb],
    'R² test' : [r2_test_lineal, r2_test_huber, r2_test_ransac, r2_test_theilsen, r2_test_ridge, r2_test_lasso, r2_test_enet, r2_test_tree, r2_test_rf, r2_test_knn, r2_test_svr, r2_test_rn, r2_test_hgb, r2_test_xgb],
})

# Ordenar de forma descendente por R² test
modelos_r2.sort_values(by = 'R² test', ascending = False)

Unnamed: 0,Modelo,R² train,R² test
12,HGB,0.981,0.957
11,Red Neuronal,0.968,0.911
13,XGBoost,0.978,0.91
8,Random Forest,0.793,0.765
9,KNN,0.721,0.697
7,Decision Tree,0.69,0.551
5,Lineal - Lasso,,0.516
4,Lineal - Ridge,,0.514
0,Lineal,0.52,0.513
1,Lineal - Huber,0.293,0.34


In [93]:
# Juntar en un dataframe los datos
modelos_rsme = pd.DataFrame({
    'Modelo' : ['Lineal', 'Lineal - Huber', 'Lineal - RANSAC', 'Lineal - Theilsen', 'Lineal - Ridge', 'Lineal - Lasso', 'Lineal - E-Net', 'Decision Tree', 'Random Forest', 'KNN','SVR', 'Red Neuronal', 'HGB', 'XGBoost'],
    'RSME train' : [rmse_train_lineal, rmse_train_huber, rmse_train_ransac, rmse_train_theilsen, np.nan, np.nan, np.nan, rmse_train_tree, rmse_train_rf, rmse_train_knn, rmse_train_svr, rmse_train_rn, rmse_train_hgb, rmse_train_xgb],
    'RSME test' : [rmse_test_lineal, rmse_test_huber, rmse_test_ransac, rmse_test_theilsen, rmse_test_ridge, rmse_test_lasso, rmse_test_enet, rmse_test_tree, rmse_test_rf, rmse_test_knn, rmse_test_svr, rmse_test_rn, rmse_test_hgb, rmse_test_xgb],
    })

# Ordenar de forma ascendente por RSME test
modelos_rsme.sort_values(by = 'RSME test', ascending = True)

Unnamed: 0,Modelo,RSME train,RSME test
12,HGB,213.0,321.0
11,Red Neuronal,290.0,413.0
13,XGBoost,232.0,465.0
8,Random Forest,742.0,672.0
9,KNN,861.0,762.0
7,Decision Tree,908.0,927.0
5,Lineal - Lasso,,963.0
4,Lineal - Ridge,,965.0
0,Lineal,1129.0,966.0
1,Lineal - Huber,1370.0,1125.0


In [94]:
# Juntar en un dataframe los datos
modelos_mae = pd.DataFrame({
    'Modelo' : ['Lineal', 'Lineal - Huber', 'Lineal - RANSAC', 'Lineal - Theilsen', 'Lineal - Ridge', 'Lineal - Lasso', 'Lineal - E-Net', 'Decision Tree', 'Random Forest', 'KNN', 'SVR', 'Red Neuronal', 'HGB', 'XGBoost'],
    'MAE train' : [mae_train_lineal, mae_train_huber, mae_train_ransac, mae_train_theilsen, np.nan, np.nan, np.nan, mae_train_tree, mae_train_rf, mae_train_knn, mae_train_svr, mae_train_rn, mae_train_hgb, mae_train_xgb],
    'MAE test' : [mae_test_lineal, mae_test_huber, mae_test_ransac, mae_test_theilsen, mae_test_ridge, mae_test_lasso, mae_test_enet, mae_test_tree, mae_test_rf, mae_test_knn, mae_test_svr, mae_test_rn, mae_test_hgb, mae_test_xgb],
})

# Ordenar de forma descendente por MAE test
modelos_mae.sort_values(by = 'MAE test', ascending = True)

Unnamed: 0,Modelo,MAE train,MAE test
12,HGB,110.0,152.0
13,XGBoost,119.0,226.0
11,Red Neuronal,188.0,242.0
8,Random Forest,337.0,356.0
9,KNN,376.0,381.0
7,Decision Tree,386.0,437.0
1,Lineal - Huber,526.0,493.0
10,SVR,620.0,579.0
5,Lineal - Lasso,,608.0
3,Lineal - Theilsen,640.0,611.0


In [95]:
# Juntar en un dataframe los datos
modelos_mape = pd.DataFrame({
    'Modelo' : ['Lineal', 'Lineal - Huber', 'Lineal - RANSAC', 'Lineal - Theilsen', 'Lineal - Ridge', 'Lineal - Lasso', 'Lineal - E-Net', 'Decision Tree', 'Random Forest', 'KNN', 'SVR', 'Red Neuronal', 'HGB', 'XGBoost'],
    'MAPE train' : [mape_train_lineal, mape_train_huber, mape_train_ransac, mape_train_theilsen, np.nan, np.nan, np.nan, mape_train_tree, mape_train_rf, mape_train_knn, mape_train_svr, mape_train_rn, mape_train_hgb, mape_train_xgb],
    'MAPE test' : [mape_test_lineal, mape_test_huber, mape_test_ransac, mape_test_theilsen, mape_test_ridge, mape_test_lasso, mape_test_enet, mape_test_tree, mape_test_rf, mape_test_knn, mape_test_svr, mape_test_rn, mape_test_hgb, mape_test_xgb],
})

# Ordenar de forma ascendente por MAPE test
modelos_mape.sort_values(by = 'MAPE test', ascending = True)

Unnamed: 0,Modelo,MAPE train,MAPE test
12,HGB,0.23,0.29
11,Red Neuronal,0.57,0.57
7,Decision Tree,0.59,0.68
8,Random Forest,0.74,0.79
2,Lineal - RANSAC,0.87,0.8
1,Lineal - Huber,0.97,0.83
10,SVR,0.89,0.89
9,KNN,0.87,0.96
13,XGBoost,0.6,0.98
3,Lineal - Theilsen,2.82,2.06


In [96]:
# Exportar tablas comparativas de metricas
modelos_r2.to_csv("../16 - Exports Modelos/sin agregados/metrics_r2_noagg.csv", index = False)
modelos_rsme.to_csv("../16 - Exports Modelos/sin agregados/metrics_rsme_noagg.csv", index = False)
modelos_mae.to_csv("../16 - Exports Modelos/sin agregados/metrics_mae_noagg.csv", index = False)
modelos_mape.to_csv("../16 - Exports Modelos/sin agregados/metrics_mape_noagg.csv", index = False)

De acuerdo a nuestra exploración inicial de modelos, HGB es el que da mejor resultados, seguido de la red neuronal (RN) de una capa. 

Guardemos nuestro mejor modelo de esta etapa y estudiemos el efecto de normalizar lso datos sobre el modelo RN.

In [101]:
# Exportar modelo HGB
with open("../16 - Exports Modelos/sin agregados/hgb_best_noagg.pkl", 'wb') as file:
    pickle.dump(hgb_best, file)

# Exportar scaler
joblib.dump(scaler, '../16 - Exports Modelos/sin agregados/scaler_nogg.pkl')

['../16 - Exports Modelos/sin agregados/scaler_nogg.pkl']

### *Transformar Datos para modelo de Redes Neuronales*

Considerando que en etapas previas observamos las variables inputs no presentan una distribución normal y que uno de los modelos con mejores métricas, además del HGB, fue el de RN de una capa, agregaremos unos pasos de transformación para normalizar los datos (target y target/inputs) para analizar una posible mejoría de las métricas en este modelo.

In [22]:
# Función para calcular R² Ajustado 
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * ((n - 1) / (n - p - 1))

**Normalizar (ajustar a Gaussiana) sólo la variable target ("Immigrant count")**

In [61]:
# Cagar PowerTransformers y modelo
pt = joblib.load('../16 - Exports Modelos/sin agregados/power_transformer.pkl')
rn_best_trans1 = joblib.load('../16 - Exports Modelos/sin agregados/rn_trans1_noagg.pkl')

In [62]:
# Removemos las categorías de agregados
df_noagg = df[(df["Sex"] != "Both") & (df["Age group"] != "All")]

# hacer copia del df removiendo Senegal que presenta datos nulos para tasa de homicidios
df_nonull = df_noagg[df_noagg['Nationality code'] != 'SEN'].copy()

# Transformar Year a variable ordinal de 1 (2008) a 15 (2022)
df_nonull['Year'] = df_nonull['Year'] - 2007

# Generar variables dummies a partir de nuestras variables categóricas "object" (no ordinales)
df_nonull = pd.get_dummies(df_nonull)

# Convertir las variables dummies booleanas en "int"
col_bool = df_nonull.select_dtypes(include = ['bool']).columns
df_nonull[col_bool] = df_nonull[col_bool].astype(int)

# Separar variables input y variable target "Immigrant count" de df_null (dataframe sin atos nulos)
X = df_nonull.drop("Immigrant count", axis = 1) # variables predictoras
y = df_nonull["Immigrant count"]  # Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 58) # separar datos en conjunto train y test en un 75% / 25%
scaler_trans1 = MinMaxScaler() # definir scaler de datos 
X_train = scaler_trans1.fit_transform(X_train) # escalar los datos de entrenamiento
X_test = scaler_trans1.fit_transform(X_test) # # escalar los datos de prueba

# Aplicar PowerTransformer a variable target
pt = PowerTransformer(method='box-cox')
y_train_trans = pt.fit_transform(y_train.values.reshape(-1, 1)).flatten()
y_test_trans = pt.transform(y_test.values.reshape(-1, 1)).flatten()

# Definir modelo con los mejores valores de parámetros (Nota: usar los mejores, pero hacer modificaciones para comparar metricas)
rn_best_trans1 = MLPRegressor(activation = 'relu',
                              alpha = 0.01, 
                               batch_size= 30, 
                               early_stopping = True, 
                               hidden_layer_sizes = 60, 
                               max_iter = 220, 
                               solver = 'lbfgs',
                            )

# Entrenar con el conjunto de entrenamiento 
rn_best_trans1.fit(X_train, y_train_trans)

# Aplicar modelo sobre los datos de train y test para predecir el target 
y_train_pred_trans = rn_best_trans1.predict(X_train) 
y_test_pred_trans = rn_best_trans1.predict(X_test)

# Inverse transform the predictions
y_train_pred = pt.inverse_transform(y_train_pred_trans.reshape(-1, 1)).flatten()
y_test_pred = pt.inverse_transform(y_test_pred_trans.reshape(-1, 1)).flatten()

# Calculo de metricas en train 
r2_train_rn_trans1 = np.round(r2_score(y_train, y_train_pred), 3) 
rmse_train_rn_trans1 = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred)), 0) 
mae_train_rn_trans1 = np.round(mean_absolute_error(y_train, y_train_pred), 0) 
mape_train_rn_trans1 = np.round(mean_absolute_percentage_error(y_train, y_train_pred), 2)

# Calculo de metricas en test
r2_test_rn_trans1 = np.round(r2_score(y_test, y_test_pred), 3) 
rmse_test_rn_trans1 = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred)), 0) 
mae_test_rn_trans1 = np.round(mean_absolute_error(y_test, y_test_pred), 0) 
mape_test_rn_trans1 = np.round(mean_absolute_percentage_error(y_test, y_test_pred), 2)

# Calcular R² Ajustado
adj_r2_train_rn_trans1 = np.round(adjusted_r2(r2_train_rn_trans1, len(y_train), X_train.shape[1]), 3)
adj_r2_test_rn_trans1 = np.round(adjusted_r2(r2_test_rn_trans1, len(y_test), X_test.shape[1]), 3)

# Mostrar métricas
print("R2 - train:", r2_train_rn_trans1)
print("Adjusted R2 - train:", adj_r2_train_rn_trans1)
print("RMSE - train:", rmse_train_rn_trans1)
print("MAE - train:", mae_train_rn_trans1)
print("MAPE - train:", mape_train_rn_trans1)
print('')
print("R2 - test:", r2_test_rn_trans1)
print("Adjusted R2 - test:", adj_r2_test_rn_trans1)
print("RMSE - test:", rmse_test_rn_trans1)
print("MAE - test:", mae_test_rn_trans1)
print("MAPE - test:", mape_test_rn_trans1)


R2 - train: 0.971
Adjusted R2 - train: 0.97
RMSE - train: 278.0
MAE - train: 123.0
MAPE - train: 0.13

R2 - test: 0.939
Adjusted R2 - test: 0.936
RMSE - test: 343.0
MAE - test: 162.0
MAPE - test: 0.18


In [63]:
# Guardar modelo
with open('../16 - Exports Modelos/sin agregados/rn_trans1_noagg.pkl', 'wb') as file:
    pickle.dump(rn_best_trans1, file)

# Guardar PowerTransformer de target
joblib.dump(pt, '../16 - Exports Modelos/sin agregados/power_transformer.pkl')

# Guardar scaler
joblib.dump(scaler_trans1, '../16 - Exports Modelos/sin agregados/scaler_trans1.pkl')

['scaler_trans1.pkl']

**Normalizar (ajustar a Gaussiana) variables inputs y target ("Immigrant count")**

In [129]:
# Removemos las categorías de agregados
df_noagg = df[(df["Sex"] != "Both") & (df["Age group"] != "All")]

# hacer copia del df removiendo Senegal que presenta datos nulos para tasa de homicidios
df_nonull = df_noagg[df_noagg['Nationality code'] != 'SEN'].copy()

# Transformar Year a variable ordinal de 1 (2008) a 15 (2022)
df_nonull['Year'] = df_nonull['Year'] - 2007

# Generar variables dummies a partir de nuestras variables categóricas "object" (no ordinales)
df_nonull = pd.get_dummies(df_nonull)

# Convertir las variables dummies booleanas en "int"
col_bool = df_nonull.select_dtypes(include = ['bool']).columns
df_nonull[col_bool] = df_nonull[col_bool].astype(int)

# Separar variables input y variable target "Immigrant count" de df_null (dataframe sin datos nulos)
X = df_nonull.drop("Immigrant count", axis = 1) # variables predictoras
y = df_nonull["Immigrant count"]  # Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 58) # separar datos en conjunto train y test en un 75% / 25%

# Aplicar PowerTransformer a variables input
pt_X = PowerTransformer(method='yeo-johnson')
X_train_trans = pt_X.fit_transform(X_train)
X_test_trans = pt_X.transform(X_test)

scaler = MinMaxScaler()  # definir scaler de datos
X_train_trans_scale = scaler.fit_transform(X_train_trans)  # escalar los datos de entrenamiento
X_test_trans_scale = scaler.transform(X_test_trans)  # transformar los datos de prueba

# Aplicar PowerTransformer a variable target
pt_y = PowerTransformer(method='box-cox')
y_train_trans = pt_y.fit_transform(y_train.values.reshape(-1, 1)).flatten()
y_test_trans = pt_y.transform(y_test.values.reshape(-1, 1)).flatten()


# Definir modelo con los mejores valores de parámetros (Nota: usar los mejores, pero hacer modificaciones para comparar metricas)
rn_best_trans2 = MLPRegressor(activation = 'relu',
                              alpha = 0.01, 
                              batch_size= 35, 
                              early_stopping = True, 
                              hidden_layer_sizes = 60, 
                              max_iter = 220, 
                              solver = 'lbfgs',
                           )

# Entrenar con el conjunto de entrenamiento 
rn_best_trans2.fit(X_train_trans_scale, y_train_trans)

# Aplicar modelo sobre los datos de train y test para predecir el target 
y_train_pred_trans = rn_best_trans2.predict(X_train_trans_scale) 
y_test_pred_trans = rn_best_trans2.predict(X_test_trans_scale)

# Transformacion inversa de predicciones
y_train_pred = pt_y.inverse_transform(y_train_pred_trans.reshape(-1, 1)).flatten()
y_test_pred = pt_y.inverse_transform(y_test_pred_trans.reshape(-1, 1)).flatten()

# Calculo de metricas en train 
r2_train_rn_trans2 = np.round(r2_score(y_train, y_train_pred), 3) 
rmse_train_rn_trans2 = np.round(np.sqrt(mean_squared_error(y_train, y_train_pred)), 0) 
mae_train_rn_trans2 = np.round(mean_absolute_error(y_train, y_train_pred), 0) 
mape_train_rn_trans2 = np.round(mean_absolute_percentage_error(y_train, y_train_pred), 2)

# Calculo de metricas en test
r2_test_rn_trans2 = np.round(r2_score(y_test, y_test_pred), 3) 
rmse_test_rn_trans2 = np.round(np.sqrt(mean_squared_error(y_test, y_test_pred)), 0) 
mae_test_rn_trans2 = np.round(mean_absolute_error(y_test, y_test_pred), 0)
mape_test_rn_trans2 = np.round(mean_absolute_percentage_error(y_test, y_test_pred), 2)

# Calcular Adjusted R²
adj_r2_train_rn_trans2 = np.round(adjusted_r2(r2_train_rn, len(y_train), X_train.shape[1]), 3)
adj_r2_test_rn_trans2 = np.round(adjusted_r2(r2_test_rn, len(y_test), X_test.shape[1]), 3)

# Mostrar métricas
print("R2 - train:", r2_train_rn_trans2)
print("Adjusted R2 - train:", adj_r2_train_rn_trans2)
print("RMSE - train:", rmse_train_rn_trans2)
print("MAE - train:", mae_train_rn_trans2)
print("MAPE - train:", mape_train_rn_trans2)
print('')
print("R2 - test:", r2_test_rn_trans2)
print("Adjusted R2 - test:", adj_r2_test_rn_trans2)
print("RMSE - test:", rmse_test_rn_trans2)
print("MAE - test:", mae_test_rn_trans2)
print("MAPE - test:", mape_test_rn_trans2)

R2 - train: 0.972
Adjusted R2 - train: 0.967
RMSE - train: 275.0
MAE - train: 121.0
MAPE - train: 0.13

R2 - test: 0.937
Adjusted R2 - test: 0.906
RMSE - test: 347.0
MAE - test: 157.0
MAPE - test: 0.17


In [59]:
# Guardar modelo
with open('../16 - Exports Modelos/sin agregados/rn_trans2_noagg.pkl', 'wb') as file:
    pickle.dump(rn_best_trans2, file)

# Guardar PowerTransformer de inputs
joblib.dump(pt_X, '../16 - Exports Modelos/sin agregados/power_transformer_X.pkl')

# Guardar PowerTransformerpara target
joblib.dump(pt_y, '../16 - Exports Modelos/sin agregados/power_transformer_y.pkl')

# Guardar scaler
joblib.dump(scaler, '../16 - Exports Modelos/sin agregados/scaler_trans2.pkl')

['../16 - Exports Modelos/sin agregados/scaler_trans2.pkl']

In [64]:
# Juntar metricas en un dataframe
modelos_rn_trans_noagg = pd.DataFrame({
    'Métricas' : ['R² test', 'R² adjusted test', 'RMSE test', 'MAE test', 'MAPE test'],
    'RN + target normalizado' : [r2_test_rn_trans1, adj_r2_test_rn_trans1, rmse_test_rn_trans1, mae_test_rn_trans1, mape_test_rn_trans1],
    'RN + inputs/target normalizado' : [r2_test_rn_trans2, adj_r2_test_rn_trans2, rmse_test_rn_trans2, mae_test_rn_trans2, mape_test_rn_trans2],
})

modelos_rn_trans_noagg

Unnamed: 0,Métricas,RN + target normalizado,RN + inputs/target normalizado
0,R² test,0.939,0.937
1,R² adjusted test,0.936,0.906
2,RMSE test,343.0,347.0
3,MAE test,162.0,157.0
4,MAPE test,0.18,0.17


In [65]:
# Exportar tablas comparativas de metricas
modelos_rn_trans_noagg.to_csv("../16 - Exports Modelos/sin agregados/rn_trans_metrics_noagg.csv", index = False)

Vemos que los valores de métricas entre ambos modelos transformados son muy similares, pasemos a obtener predicciones para compararlos en gráficos.

### *Predicciones con mejores modelos para comparación*

Estas predicciones sobre todo el conjunto de datos "X" para compararlas en un gráfico de Tableau y visualizar el comportamiento frente a los datos reales de inmigración.

**HGB**

In [102]:
# Filtar y remover agregados
df_noagg = df[(df["Sex"] != "Both") & (df["Age group"] != "All")]

# Hacer copia del df_noagg
df_copy = df_noagg.copy()

# Transformar Year a variable ordinal de 1 (2008) a 15 (2022)
df_copy['Year'] = df_copy['Year'] - 2007

# Generar variables dummies a partir de nuestras variables categóricas "object" (no ordinales)
df_copy = pd.get_dummies(df_copy)

# Convertir las variables dummies booleanas en "int"
col_bool = df_copy.select_dtypes(include = ['bool']).columns
df_copy[col_bool] = df_copy[col_bool].astype(int)

# Verificar cambio
df_copy.info()
df_copy

<class 'pandas.core.frame.DataFrame'>
Index: 5460 entries, 1 to 9359
Data columns (total 69 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Year                                      5460 non-null   int64  
 1   Immigrant count                           5460 non-null   int64  
 2   Unemployment %                            5460 non-null   float64
 3   Political and Violence Percentile         5460 non-null   float64
 4   Probability of dying young                5460 non-null   float64
 5   Rule of Law Percentile                    5460 non-null   float64
 6   Salaried workers %                        5460 non-null   float64
 7   GDP_growth                                5460 non-null   float64
 8   Inflation_annual                          5460 non-null   float64
 9   Liberal democracy index                   5460 non-null   float64
 10  Health equality                          

Unnamed: 0,Year,Immigrant count,Unemployment %,Political and Violence Percentile,Probability of dying young,Rule of Law Percentile,Salaried workers %,GDP_growth,Inflation_annual,Liberal democracy index,...,Continent_America,Continent_Asia,Continent_Europe,Sub-region_Africa,Sub-region_Asia,Sub-region_Central America and Caribbean,Sub-region_European Union,Sub-region_North America,Sub-region_Rest of Europe,Sub-region_South America
1,1,2938,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
2,1,1128,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
3,1,265,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
4,1,156,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
8,1,4703,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9353,15,452,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9355,15,330,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9356,15,146,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9358,15,99,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0


In [103]:
# Separar variables input y variable target "Immigrant count" de df_copy
X = df_copy.drop("Immigrant count", axis = 1) # variables predictoras
y = df_copy["Immigrant count"]  # Target

In [108]:
# Cagar PowerTransformers y modelo
scaler = joblib.load('../16 - Exports Modelos/sin agregados/scaler_nogg.pkl')
modelo_hgb = joblib.load('../16 - Exports Modelos/sin agregados/hgb_best_noagg.pkl')

# Transformar variables inputs
X_scale = scaler.transform(X)

# Predecir
y_pred_hgb = modelo_hgb.predict(X_scale)

# Hacer copi de dataframe inicial sin nulos
df_hgb_pred = df_noagg.iloc[:, :5].copy()

# Agregar predicicones al dataframe
df_hgb_pred['Prediccion_hgb_noagg'] = np.round(y_pred_hgb, 0).astype(int)

df_hgb_pred.info()
df_hgb_pred

<class 'pandas.core.frame.DataFrame'>
Index: 5460 entries, 1 to 9359
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Year                  5460 non-null   int64 
 1   Nationality code      5460 non-null   object
 2   Sex                   5460 non-null   object
 3   Age group             5460 non-null   object
 4   Immigrant count       5460 non-null   int64 
 5   Prediccion_hgb_noagg  5460 non-null   int32 
dtypes: int32(1), int64(2), object(3)
memory usage: 277.3+ KB


Unnamed: 0,Year,Nationality code,Sex,Age group,Immigrant count,Prediccion_hgb_noagg
1,2008,PER,Males,35 - 44,2938,2360
2,2008,PER,Males,45 - 54,1128,868
3,2008,PER,Males,55 - 64,265,184
4,2008,PER,Males,65+,156,137
8,2008,PER,Males,25 - 34,4703,4715
...,...,...,...,...,...,...
9353,2022,PAK,Females,45 - 54,452,538
9355,2022,PAK,Males,55 - 64,330,253
9356,2022,PAK,Females,55 - 64,146,166
9358,2022,PAK,Males,65+,99,114


In [117]:
# Exportar archivo con predicciones
df_hgb_pred[df_hgb_pred['Nationality code'] != 'SEN'].to_csv("../17 - Prediciones/predicciones_hgb_noagg.csv", index = False)

**RN TRANS 1**

In [120]:
# Hacer copia del df removiendo Senegal que presenta datos nulos para tasa de homicidios
df_nonull = df_noagg[df_noagg['Nationality code'] != 'SEN'].copy()
df_nonull_copy = df_nonull.copy()

# Transformar Year a variable ordinal de 1 (2008) a 15 (2022)
df_nonull_copy['Year'] = df_nonull_copy['Year'] - 2007

# Generar variables dummies a partir de nuestras variables categóricas "object" (no ordinales)
df_nonull_copy = pd.get_dummies(df_nonull_copy)

# Convertir las variables dummies booleanas en "int"
col_bool = df_nonull_copy.select_dtypes(include = ['bool']).columns
df_nonull_copy[col_bool] = df_nonull_copy[col_bool].astype(int)

# Verificar cambio
df_nonull_copy.info()
df_nonull_copy

<class 'pandas.core.frame.DataFrame'>
Index: 5250 entries, 1 to 9359
Data columns (total 68 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Year                                      5250 non-null   int64  
 1   Immigrant count                           5250 non-null   int64  
 2   Unemployment %                            5250 non-null   float64
 3   Political and Violence Percentile         5250 non-null   float64
 4   Probability of dying young                5250 non-null   float64
 5   Rule of Law Percentile                    5250 non-null   float64
 6   Salaried workers %                        5250 non-null   float64
 7   GDP_growth                                5250 non-null   float64
 8   Inflation_annual                          5250 non-null   float64
 9   Liberal democracy index                   5250 non-null   float64
 10  Health equality                          

Unnamed: 0,Year,Immigrant count,Unemployment %,Political and Violence Percentile,Probability of dying young,Rule of Law Percentile,Salaried workers %,GDP_growth,Inflation_annual,Liberal democracy index,...,Continent_America,Continent_Asia,Continent_Europe,Sub-region_Africa,Sub-region_Asia,Sub-region_Central America and Caribbean,Sub-region_European Union,Sub-region_North America,Sub-region_Rest of Europe,Sub-region_South America
1,1,2938,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
2,1,1128,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
3,1,265,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
4,1,156,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
8,1,4703,4.03,17.31,5.1,25.96,44.47,9.13,1.10,0.649,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9353,15,452,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9355,15,330,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9356,15,146,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0
9358,15,99,5.60,6.60,5.8,25.00,42.14,4.71,13.96,0.234,...,0,1,0,0,1,0,0,0,0,0


In [121]:
# Separar variables input y variable target "Immigrant count" de df_null (dataframe sin atos nulos)
X = df_nonull_copy.drop("Immigrant count", axis = 1) # variables predictoras
y = df_nonull_copy["Immigrant count"]  # Target

In [125]:
# Cagar PowerTransformers, scaler y modelo
pt = joblib.load('../16 - Exports Modelos/sin agregados/power_transformer.pkl')
scaler_trans1 = joblib.load('../16 - Exports Modelos/sin agregados/scaler_trans1.pkl')
modelo_rn1 = joblib.load('../16 - Exports Modelos/sin agregados/rn_trans1_noagg.pkl')

# Transformar variables inputs
X_scale = scaler_trans1.transform(X)

# Predecir
y_pred_rn_trans1 = modelo_rn1.predict(X_scale)

# Transformacion inversa de predicicones para obtener escala normal
y_pred_rn1 = pt.inverse_transform(y_pred_rn_trans1.reshape(-1, 1)).flatten()

# Hacer copia de dataframe inicial sin nulos
df_rn_pred = df_nonull.iloc[:, :5].copy()

# Agregar predicicones al dataframe
df_rn_pred['Prediccion_rn1_noagg'] = np.round(y_pred_rn1 , 0).astype(int)

df_rn_pred.info()
df_rn_pred

<class 'pandas.core.frame.DataFrame'>
Index: 5250 entries, 1 to 9359
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Year                  5250 non-null   int64 
 1   Nationality code      5250 non-null   object
 2   Sex                   5250 non-null   object
 3   Age group             5250 non-null   object
 4   Immigrant count       5250 non-null   int64 
 5   Prediccion_rn1_noagg  5250 non-null   int32 
dtypes: int32(1), int64(2), object(3)
memory usage: 266.6+ KB


Unnamed: 0,Year,Nationality code,Sex,Age group,Immigrant count,Prediccion_rn1_noagg
1,2008,PER,Males,35 - 44,2938,1633
2,2008,PER,Males,45 - 54,1128,828
3,2008,PER,Males,55 - 64,265,239
4,2008,PER,Males,65+,156,154
8,2008,PER,Males,25 - 34,4703,2546
...,...,...,...,...,...,...
9353,2022,PAK,Females,45 - 54,452,433
9355,2022,PAK,Males,55 - 64,330,306
9356,2022,PAK,Females,55 - 64,146,143
9358,2022,PAK,Males,65+,99,114


**RN TRANS 2**

In [130]:
# Cagar PowerTransformers, scaler y modelo
pt_X = joblib.load('../16 - Exports Modelos/sin agregados/power_transformer_X.pkl')
pt_y = joblib.load('../16 - Exports Modelos/sin agregados/power_transformer_y.pkl')
scaler_trans2 = joblib.load('../16 - Exports Modelos/sin agregados/scaler_trans2.pkl')
modelo_rn2 = joblib.load('../16 - Exports Modelos/sin agregados/rn_trans2_noagg.pkl')

# Transformar variables inputs
X_trans = pt_X.transform(X)

# Transformar variables inputs
X_trans_scale = scaler_trans2.transform(X_trans)

# Predecir
y_pred_rn_trans2 = modelo_rn2.predict(X_trans_scale)

# Transformacion inversa de prediccicones para obtener escala normal
y_pred_rn2 = pt_y.inverse_transform(y_pred_rn_trans2.reshape(-1, 1)).flatten()

# Agregar predicicones al dataframe
df_rn_pred['Prediccion_rn2_noagg'] = np.round(y_pred_rn2 , 0).astype(int)

df_rn_pred.info()
df_rn_pred

<class 'pandas.core.frame.DataFrame'>
Index: 5250 entries, 1 to 9359
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Year                  5250 non-null   int64 
 1   Nationality code      5250 non-null   object
 2   Sex                   5250 non-null   object
 3   Age group             5250 non-null   object
 4   Immigrant count       5250 non-null   int64 
 5   Prediccion_rn1_noagg  5250 non-null   int32 
 6   Prediccion_rn2_noagg  5250 non-null   int32 
dtypes: int32(2), int64(2), object(3)
memory usage: 287.1+ KB


Unnamed: 0,Year,Nationality code,Sex,Age group,Immigrant count,Prediccion_rn1_noagg,Prediccion_rn2_noagg
1,2008,PER,Males,35 - 44,2938,1633,1867
2,2008,PER,Males,45 - 54,1128,828,755
3,2008,PER,Males,55 - 64,265,239,183
4,2008,PER,Males,65+,156,154,159
8,2008,PER,Males,25 - 34,4703,2546,3088
...,...,...,...,...,...,...,...
9353,2022,PAK,Females,45 - 54,452,433,545
9355,2022,PAK,Males,55 - 64,330,306,233
9356,2022,PAK,Females,55 - 64,146,143,154
9358,2022,PAK,Males,65+,99,114,65


In [131]:
df_rn_pred.to_csv("../17 - Prediciones/predicciones_rn12_noagg.csv", index = False)

A través de observaciones en gráficas de las prediciones de ambos (no mostrado en este apartado, ver https://public.tableau.com/app/profile/cristian.de.andrade.correia/viz/ComparaciondeModelos-Sinagregados/Dashboard1), se notó que el modelo con normalización de inputs y target tenía una mejor capacidad predictiva (más consistente) en relación al valor real durante los períodos regulares, es decir, fuera de escenarios atípicos como en los año 2020, 2021 y 2022, de modo que se seleccionó este para realizar y comparar prediciones a futuro de los años 2023 y 2024. 

### *Prediciciones finales con modelo seleccionado*

Ahora obtengamos las predicicones para todo el conjunto de datos incluyendo dos años adicionales (2023 y 2024) para dos nacionalidades con imigracion alta y media: Colombia y Brasil. Nuetsro onjetivo es comparar los resultados de este modelo sin agregadas "Both" y "All" de sexo y grupo de edad, frente al mejor modelo obtenido con éstos.

*Nota: Los datos de los años 2023 y 2024 fueron preparados investigando los valores de variables inputs o planteando escenarios con base a la tendencia de los valores en años previos.*

In [4]:
# Importar el dataset para prediciones
df_predicciones = pd.read_csv("../17 - Prediciones/datos a predecir.csv")

df_predicciones.info()
df_predicciones

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9456 entries, 0 to 9455
Data columns (total 28 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Year                               9456 non-null   int64  
 1   Nationality code                   9456 non-null   object 
 2   Sex                                9456 non-null   object 
 3   Age group                          9456 non-null   object 
 4   Immigrant count                    9360 non-null   float64
 5   Unemployment %                     9456 non-null   float64
 6   Political and Violence Percentile  9456 non-null   float64
 7   Probability of dying young         9456 non-null   float64
 8   Rule of Law Percentile             9456 non-null   float64
 9   Salaried workers %                 9456 non-null   float64
 10  GDP_growth                         9456 non-null   float64
 11  Inflation_annual                   9456 non-null   float

Unnamed: 0,Year,Nationality code,Sex,Age group,Immigrant count,Unemployment %,Political and Violence Percentile,Probability of dying young,Rule of Law Percentile,Salaried workers %,...,Non-state_deaths,Intrastate_deaths,Interstate_deaths,Number of residents,Political regime,Homicide Rate,Number of Turist,Spanish language,Restricciones_pandemia,Año post_pandemia
0,2008,DZA,Both,0 - 14,759.0,11.33,14.90,3.7,24.52,67.41,...,0,345,0,51922,3,0.95,44400000,0,0,0
1,2008,PER,Males,35 - 44,2938.0,4.03,17.31,5.1,25.96,44.47,...,0,40,0,60185,7,5.27,44400000,1,0,0
2,2008,PER,Males,45 - 54,1128.0,4.03,17.31,5.1,25.96,44.47,...,0,40,0,60185,7,5.27,44400000,1,0,0
3,2008,PER,Males,55 - 64,265.0,4.03,17.31,5.1,25.96,44.47,...,0,40,0,60185,7,5.27,44400000,1,0,0
4,2008,PER,Males,65+,156.0,4.03,17.31,5.1,25.96,44.47,...,0,40,0,60185,7,5.27,44400000,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9451,2024,BRA,Males,35 - 44,,7.00,34.00,7.7,45.40,68.04,...,2100,0,0,92400,6,21.40,70500000,0,0,0
9452,2024,BRA,Males,45 - 54,,7.00,34.00,7.7,45.40,68.04,...,2100,0,0,92400,6,21.40,70500000,0,0,0
9453,2024,BRA,Males,55 - 64,,7.00,34.00,7.7,45.40,68.04,...,2100,0,0,92400,6,21.40,70500000,0,0,0
9454,2024,BRA,Males,65+,,7.00,34.00,7.7,45.40,68.04,...,2100,0,0,92400,6,21.40,70500000,0,0,0


In [5]:
# Filtar y remover agregados
df_noagg = df_predicciones[(df_predicciones["Sex"] != "Both") & (df_predicciones["Age group"] != "All")]

# Hacer copia del df removiendo Senegal que presenta datos nulos para tasa de homicidios
df_nonull = df_noagg[df_noagg['Nationality code'] != 'SEN'].copy()
df_nonull_copy = df_nonull.copy()

# Transformar Year a variable ordinal de 1 (2008) a 15 (2022)
df_nonull_copy['Year'] = df_nonull_copy['Year'] - 2007

# Generar variables dummies a partir de nuestras variables categóricas "object" (no ordinales)
df_nonull_copy = pd.get_dummies(df_nonull_copy)

# Convertir las variables dummies booleanas en "int"
col_bool = df_nonull_copy.select_dtypes(include = ['bool']).columns
df_nonull_copy[col_bool] = df_nonull_copy[col_bool].astype(int)

# Verificar cambio
df_nonull_copy.info()
df_nonull_copy

<class 'pandas.core.frame.DataFrame'>
Index: 5306 entries, 1 to 9454
Data columns (total 68 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Year                                      5306 non-null   int64  
 1   Immigrant count                           5250 non-null   float64
 2   Unemployment %                            5306 non-null   float64
 3   Political and Violence Percentile         5306 non-null   float64
 4   Probability of dying young                5306 non-null   float64
 5   Rule of Law Percentile                    5306 non-null   float64
 6   Salaried workers %                        5306 non-null   float64
 7   GDP_growth                                5306 non-null   float64
 8   Inflation_annual                          5306 non-null   float64
 9   Liberal democracy index                   5306 non-null   float64
 10  Health equality                          

Unnamed: 0,Year,Immigrant count,Unemployment %,Political and Violence Percentile,Probability of dying young,Rule of Law Percentile,Salaried workers %,GDP_growth,Inflation_annual,Liberal democracy index,...,Continent_America,Continent_Asia,Continent_Europe,Sub-region_Africa,Sub-region_Asia,Sub-region_Central America and Caribbean,Sub-region_European Union,Sub-region_North America,Sub-region_Rest of Europe,Sub-region_South America
1,1,2938.0,4.03,17.31,5.1,25.96,44.47,9.13,1.1,0.649,...,1,0,0,0,0,0,0,0,0,1
2,1,1128.0,4.03,17.31,5.1,25.96,44.47,9.13,1.1,0.649,...,1,0,0,0,0,0,0,0,0,1
3,1,265.0,4.03,17.31,5.1,25.96,44.47,9.13,1.1,0.649,...,1,0,0,0,0,0,0,0,0,1
4,1,156.0,4.03,17.31,5.1,25.96,44.47,9.13,1.1,0.649,...,1,0,0,0,0,0,0,0,0,1
8,1,4703.0,4.03,17.31,5.1,25.96,44.47,9.13,1.1,0.649,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9450,17,,7.00,34.00,7.7,45.40,68.04,2.01,4.0,0.670,...,1,0,0,0,0,0,0,0,0,1
9451,17,,7.00,34.00,7.7,45.40,68.04,2.01,4.0,0.670,...,1,0,0,0,0,0,0,0,0,1
9452,17,,7.00,34.00,7.7,45.40,68.04,2.01,4.0,0.670,...,1,0,0,0,0,0,0,0,0,1
9453,17,,7.00,34.00,7.7,45.40,68.04,2.01,4.0,0.670,...,1,0,0,0,0,0,0,0,0,1


In [6]:
# Separar variables input y variable target "Immigrant count" de df_null (dataframe sin atos nulos)
X = df_nonull_copy.drop("Immigrant count", axis = 1) # variables predictoras
y = df_nonull_copy["Immigrant count"]  # Target

In [7]:
# Cagar PowerTransformers, scaler y modelo
pt_X = joblib.load('../16 - Exports Modelos/sin agregados/power_transformer_X.pkl')
pt_y = joblib.load('../16 - Exports Modelos/sin agregados/power_transformer_y.pkl')
scaler_trans2 = joblib.load('../16 - Exports Modelos/sin agregados/scaler_trans2.pkl')
modelo_rn2 = joblib.load('../16 - Exports Modelos/sin agregados/rn_trans2_noagg.pkl')

# Transformar variables inputs
X_trans = pt_X.transform(X)

# Escalar variables inputs
X_trans_scale = scaler_trans2.transform(X_trans)

# Predecir
y_pred_trans = modelo_rn2.predict(X_trans_scale)

# Transformacion inversa de prediccicones para obtener escala normal
y_pred = pt_y.inverse_transform(y_pred_trans.reshape(-1, 1)).flatten()

**Intervalos de confianza**

Adicionalmente, estimemos un intervalo de confianza del 90% para nuestras predicciones en base a nuestro modelo elegido. Para ello usaremos el método de Conformal Prediction.

In [10]:
# Filtar y remover agregados
df_noagg = df_predicciones[(df_predicciones["Sex"] != "Both") & (df_predicciones["Age group"] != "All")]

# Hacer copia del df removiendo Senegal que presenta datos nulos para tasa de homicidios
df_nonull = df_noagg[df_noagg['Nationality code'] != 'SEN'].copy()
df_nonull_copy = df_nonull.copy()

# Remover años 2023 y 2024 de target e inputs que poseen datos nulos (se incluyen para prediciones futuras)
df_nonull_copy = df_nonull_copy[(df_nonull_copy['Year'] != 2023) & (df_nonull_copy['Year'] != 2024)]

# Transformar Year a variable ordinal de 1 (2008) a 15 (2022)
df_nonull_copy['Year'] = df_nonull_copy['Year'] - 2007

# Generar variables dummies a partir de nuestras variables categóricas "object" (no ordinales)
df_nonull_copy = pd.get_dummies(df_nonull_copy)

# Convertir las variables dummies booleanas en "int"
col_bool = df_nonull_copy.select_dtypes(include = ['bool']).columns
df_nonull_copy[col_bool] = df_nonull_copy[col_bool].astype(int)

# Separar target de inputs
X = df_nonull_copy.drop("Immigrant count", axis = 1) # variables predictoras
y = df_nonull_copy["Immigrant count"]  # Target

# Separar entrenamiento y prueba
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=58)  # separar datos en conjunto train y test en un 75% / 25%

# Aplicar PowerTransformer a variables input de entrenamiento
X_train_trans = pt_X.fit_transform(X_train)

# Escarlar variable inputs de entrenamiento
X_train_trans_scale = scaler_trans2.fit_transform(X_train_trans)  # escalar los datos de entrenamiento

# Aplicar PowerTransformer a variable target de entrenamineto
y_train_trans = pt_y.fit_transform(y_train.values.reshape(-1, 1)).flatten()

# Definir modelo con los mejores valores de parámetros (Nota: usar los mejores, pero hacer modificaciones para comparar metricas)
rn_trans2 = MLPRegressor(activation = 'relu',
                              alpha = 0.01, 
                              batch_size= 35, 
                              early_stopping = True, 
                              hidden_layer_sizes = 60, 
                              max_iter = 220, 
                              solver = 'lbfgs',
                           )

# MapieRegressor
mapie = MapieRegressor(rn_trans2, method="naive")

# Fit MapieRegressor en datos de entrenamiento
mapie.fit(X_train_trans_scale, y_train_trans)

# Predecir intervalos en datos
predictions = mapie.predict(X_trans_scale, alpha=0.1)  # 90% intervalo de confianza

# Obtener la tupla correspondiente a los intervalos
y_test_pred_interval = predictions[1]

# Extraer valores de intervalos
y_pred_interval_lower_t = y_test_pred_interval[:, 0]
y_pred_interval_upper_t = y_test_pred_interval[:, 1]

# Transform inverse predictions and intervals
y_pred_lower = pt_y.inverse_transform(y_pred_interval_lower_t.reshape(-1, 1)).flatten()
y_pred_upper = pt_y.inverse_transform(y_pred_interval_upper_t.reshape(-1, 1)).flatten()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


**Agregar predicicones e intervalos de conf. al dataframe y exportar**

In [11]:
# Hacer copia de dataframe inicial sin nulos y variables inputs que no necesitamos
df_pred = df_nonull.iloc[:, :5].copy()

df_pred['Prediction_rn_noagg'] = np.round(y_pred , 0).astype(int)
df_pred['Immigrant count'] = df_pred['Immigrant count'].astype('Int64') # pasar a int porque tiene datos nulos y automaticamente se asigna como float
df_pred['Lower limit_noagg'] = np.round(y_pred_lower, 0).astype(int)
df_pred['Upper limit_noagg'] = np.round(y_pred_upper, 0).astype(int)

df_pred.info()
df_pred

<class 'pandas.core.frame.DataFrame'>
Index: 5306 entries, 1 to 9454
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Year                 5306 non-null   int64 
 1   Nationality code     5306 non-null   object
 2   Sex                  5306 non-null   object
 3   Age group            5306 non-null   object
 4   Immigrant count      5250 non-null   Int64 
 5   Prediction_rn_noagg  5306 non-null   int32 
 6   Lower limit_noagg    5306 non-null   int32 
 7   Upper limit_noagg    5306 non-null   int32 
dtypes: Int64(1), int32(3), int64(1), object(3)
memory usage: 316.1+ KB


Unnamed: 0,Year,Nationality code,Sex,Age group,Immigrant count,Prediction_rn_noagg,Lower limit_noagg,Upper limit_noagg
1,2008,PER,Males,35 - 44,2938,1867,1266,2219
2,2008,PER,Males,45 - 54,1128,755,534,993
3,2008,PER,Males,55 - 64,265,183,170,343
4,2008,PER,Males,65+,156,159,101,214
8,2008,PER,Males,25 - 34,4703,3088,2406,4054
...,...,...,...,...,...,...,...,...
9450,2024,BRA,Males,25 - 34,,4203,2102,3572
9451,2024,BRA,Males,35 - 44,,2585,1281,2244
9452,2024,BRA,Males,45 - 54,,1040,480,898
9453,2024,BRA,Males,55 - 64,,411,216,429


In [13]:
# Exportar prediccicones de modelo seleccionado para el conjunto de datos sin agregados
df_pred.to_csv("../17 - Prediciones/predicciones_rn_noagg.csv", index = False)

En el siguiente link se puede ver la comparativa de las prediciones de los modelos con y sin agregados: https://public.tableau.com/app/profile/cristian.de.andrade.correia/viz/PrediccindeInmigrantesenEspaa/Dashboard1