# 1. Data Extraction

In this notebook we extract our data and analyse it. To do so, we import the ```bcrp_dataframe``` function from our own library. This function uses the API of the Central Reserve Bank of Peru (BCRP) to build a pandas dataframe automatically from the series codes we pass in.

## 1.1 Libraries

We import the necessary libraries, including our own functions from the ```modules``` package.

In [33]:
# Warnings
import warnings
warnings.filterwarnings("ignore")

# Basic Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import seaborn as sns
from scipy import stats
from functools import reduce

# Statsmodels
import statsmodels.api as sm
import pmdarima as pmd
from pmdarima.arima import auto_arima
from statsmodels.tsa.api import VAR
from statsmodels.tsa.vector_ar.var_model import VARResults
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import STL

# Machine Learning models
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.linear_model import Ridge, Lasso, ElasticNet, ElasticNetCV, LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
    median_absolute_error,
    r2_score,
    precision_score,
)

from xgboost import XGBRegressor



In [34]:
# We import our own functions
import sys
sys.path.append('../../..')  # Move three levels up to the project root
from modules.functions import *

## 1.2 Extraction
We define our inputs and pass them to the ```bcrp_dataframe``` function to obtain a pandas dataframe with the corresponding series.

We define the following inputs:

    series     = the code of the series we are going to extract
    start_date = the starting date, when the BCRP starts using the interest rate as a policy measure
    end_date   = December 2023
    freq       = Monthly frequency
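
The ```bcrp_dataframe``` function lives in ```modules.functions``` and is not shown in this notebook. As a rough illustration of what a helper of this kind might do, the sketch below builds a request URL and converts a JSON payload into a dataframe. The URL layout, the payload layout, and the names ```build_bcrp_url``` and ```payload_to_frame``` are all assumptions made for illustration, not the author's implementation, and the sample payload is invented so that no network call is needed.

```python
import pandas as pd

# Assumed base URL of the BCRP statistics API; check the official docs before use.
BCRP_API = "https://estadisticas.bcrp.gob.pe/estadisticas/series/api"


def build_bcrp_url(series, start_date, end_date, fmt="json", lang="esp"):
    """Build a request URL: codes joined by '-', then format, period range, language."""
    codes = "-".join(series)
    return f"{BCRP_API}/{codes}/{fmt}/{start_date}/{end_date}/{lang}"


def payload_to_frame(payload, start_date):
    """Turn an (assumed) JSON payload into a dataframe with a monthly index.

    Assumed layout: payload["config"]["series"] holds the series names and
    payload["periods"] holds one entry per month with a list of string values.
    """
    names = [s["name"] for s in payload["config"]["series"]]
    rows = [p["values"] for p in payload["periods"]]
    frame = pd.DataFrame(rows, columns=names)
    # Non-numeric placeholders (for example "n.d.") become NaN.
    frame = frame.apply(pd.to_numeric, errors="coerce")
    # Assumes the periods are consecutive months starting at start_date.
    frame.index = pd.date_range(start=start_date, periods=len(frame), freq="MS")
    frame.index.name = "Fecha"
    return frame


# Hypothetical payload used only to exercise the parser (no network call).
sample_payload = {
    "config": {"series": [{"name": "Serie A"}, {"name": "Serie B"}]},
    "periods": [
        {"name": "Sep.2003", "values": ["2.75", "3.48"]},
        {"name": "Oct.2003", "values": ["2.75", "n.d."]},
    ],
}

url = build_bcrp_url(["PD04722MM", "PN01207PM"], "2003-09", "2023-12")
sample_df = payload_to_frame(sample_payload, "2003-09")
```

In a real run the payload would come from something like ```requests.get(url).json()```; the actual function may handle dates, frequencies, and missing values differently.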

### df_1
We can now create the first dataframe with the ```bcrp_dataframe``` function. This dataframe contains our target variable, headline inflation.

In [35]:
series     = ['PN01271PM']
start_date = '2003-09'
end_date   = '2023-12'
freq       = 'Mensual'

In [36]:
df_1 = bcrp_dataframe( series , start_date , end_date , freq )
df_1.head()

Fecha,Índice de precios Lima Metropolitana (var% mensual) - IPC
2003-09-01,0.558598
2003-10-01,0.049032
2003-11-01,0.167685
2003-12-01,0.563951
2004-01-01,0.537447


In [37]:
df_1 = get_trend(df_1)
df_1.head()

Fecha,Índice de precios Lima Metropolitana (var% mensual) - IPC
2003-09-01,0.518834
2003-10-01,0.492247
2003-11-01,0.465462
2003-12-01,0.438413
2004-01-01,0.411138


### df_2
We create the second dataframe with the ```bcrp_dataframe``` function. This dataframe contains the rate variables (the monetary policy reference rate and the exchange rate). We use these variables in levels; it is not necessary to difference them.

In [38]:
series     = ['PD04722MM', 'PN01207PM']
start_date = '2003-09'
end_date   = '2023-12'
freq       = 'Mensual'

In [39]:
df_2 = bcrp_dataframe( series , start_date , end_date , freq )
df_2.head()

Fecha,Tasas de interés del Banco Central de Reserva - Tasa de Referencia de la Política Monetaria,Tipo de cambio - promedio del periodo (S/ por US$) - Interbancario - Promedio
2003-09-01,2.75,3.480898
2003-10-01,2.75,3.478177
2003-11-01,2.5,3.477635
2003-12-01,2.5,3.471176
2004-01-01,2.5,3.467352


### df_3
We create the third dataframe with the ```bcrp_dataframe``` function. This dataframe contains monetary variables, the real minimum wage index, and commodity prices. We take the natural logarithm of these variables, so they enter the dataframe as log levels rather than monthly % changes.
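
To make the link between logs and monthly % changes concrete, here is a minimal sketch on a toy series (the values are invented, not BCRP data). It shows that ```np.log``` alone yields log levels, while the first difference of the logs approximates the monthly % change.

```python
import numpy as np
import pandas as pd

# Toy price series (hypothetical values, not BCRP data).
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0],
    index=pd.date_range("2003-09-01", periods=4, freq="MS"),
    name="price",
)

log_prices = np.log(prices)               # log levels
log_growth = log_prices.diff() * 100      # approximate monthly % change
pct_growth = prices.pct_change() * 100    # exact monthly % change

# Side-by-side comparison; the first month has no previous value and is dropped.
comparison = pd.DataFrame(
    {"log_diff": log_growth, "pct_change": pct_growth}
).dropna()
```

For small month-to-month movements the two columns of ```comparison``` are very close, which is why log differences are often used as a stand-in for % changes.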

In [46]:
series     = ['PN00495MM', 'PN06481IM', 'PN02125PM', 'PN01661XM','PN01662XM','PN01664XM','PN01660XM']
start_date = '2003-09'
end_date   = '2023-12'
freq       = 'Mensual'

In [47]:
df_3 = bcrp_dataframe( series , start_date , end_date , freq )
df_3 = np.log(df_3)
df_3 = df_3.dropna()
df_3.head()

Fecha,Emisión primaria y multiplicador (millones S/) - Circulante Desestacionalizado,Liquidez internacional del BCRP - RIN - Reservas Internacionales Netas (millones US$),Remuneraciones - Remuneración Mínima Vital - Índice Real (base 1994 = 100),Cotizaciones de productos (promedio del periodo) - Trigo - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Maíz - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Aceite Soya - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Petróleo - WTI (US$ por barriles)
2003-09-01,8.591877,9.185581,5.440117,4.884206,4.419001,6.238332,3.348787
2003-10-01,8.602051,9.191108,5.495508,4.900658,4.3908,6.40489,3.41251
2003-11-01,8.61343,9.240173,5.493832,5.009696,4.463461,6.421768,3.434971
2003-12-01,8.629901,9.229584,5.488208,5.046169,4.511148,6.471752,3.472088
2004-01-01,8.649204,9.265177,5.482848,5.049709,4.571827,6.49818,3.531787


In [48]:
df = df_1.join(df_2).join(df_3)
df.dropna(inplace=True)
df.head()

Fecha,Índice de precios Lima Metropolitana (var% mensual) - IPC,Tasas de interés del Banco Central de Reserva - Tasa de Referencia de la Política Monetaria,Tipo de cambio - promedio del periodo (S/ por US$) - Interbancario - Promedio,Emisión primaria y multiplicador (millones S/) - Circulante Desestacionalizado,Liquidez internacional del BCRP - RIN - Reservas Internacionales Netas (millones US$),Remuneraciones - Remuneración Mínima Vital - Índice Real (base 1994 = 100),Cotizaciones de productos (promedio del periodo) - Trigo - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Maíz - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Aceite Soya - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Petróleo - WTI (US$ por barriles)
2003-09-01,0.518834,2.75,3.480898,8.591877,9.185581,5.440117,4.884206,4.419001,6.238332,3.348787
2003-10-01,0.492247,2.75,3.478177,8.602051,9.191108,5.495508,4.900658,4.3908,6.40489,3.41251
2003-11-01,0.465462,2.5,3.477635,8.61343,9.240173,5.493832,5.009696,4.463461,6.421768,3.434971
2003-12-01,0.438413,2.5,3.471176,8.629901,9.229584,5.488208,5.046169,4.511148,6.471752,3.472088
2004-01-01,0.411138,2.5,3.467352,8.649204,9.265177,5.482848,5.049709,4.571827,6.49818,3.531787


In [49]:
df.tail()

Fecha,Índice de precios Lima Metropolitana (var% mensual) - IPC,Tasas de interés del Banco Central de Reserva - Tasa de Referencia de la Política Monetaria,Tipo de cambio - promedio del periodo (S/ por US$) - Interbancario - Promedio,Emisión primaria y multiplicador (millones S/) - Circulante Desestacionalizado,Liquidez internacional del BCRP - RIN - Reservas Internacionales Netas (millones US$),Remuneraciones - Remuneración Mínima Vital - Índice Real (base 1994 = 100),Cotizaciones de productos (promedio del periodo) - Trigo - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Maíz - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Aceite Soya - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Petróleo - WTI (US$ por barriles)
2023-08-01,0.182324,7.75,3.697768,11.194037,11.182374,5.646268,5.714305,5.288844,7.354495,4.399204
2023-09-01,0.133933,7.5,3.730995,11.196305,11.173721,5.646104,5.677783,5.17894,7.283402,4.490772
2023-10-01,0.085365,7.25,3.845759,11.202967,11.172515,5.649334,5.548594,5.191875,7.138916,4.448655
2023-11-01,0.036736,7.0,3.760795,11.200187,11.180961,5.650966,5.567082,5.14624,7.07603,4.351357
2023-12-01,-0.011845,6.75,3.733942,11.186627,11.1709,5.646966,5.599153,5.145342,7.04632,4.276196


## 1.3 Data Inspection
We inspect ```df```. We first verify that there are no null values. Then, we apply the ```describe``` function to see summary statistics for each variable.
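
Beyond counting nulls, it can be useful to confirm that the monthly index has no gaps and to read off the sample span directly. The sketch below does this on a toy frame that stands in for ```df```; the values and the name ```toy``` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df (hypothetical values).
idx = pd.date_range("2003-09-01", "2003-12-01", freq="MS")
toy = pd.DataFrame(
    {"CPI": [0.52, 0.49, 0.47, 0.44], "rate": [2.75, 2.75, 2.5, 2.5]},
    index=idx,
)

# 1) No missing values in any column.
no_nulls = bool(toy.isna().sum().sum() == 0)

# 2) No missing months between the first and last observation.
expected = pd.date_range(toy.index.min(), toy.index.max(), freq="MS")
missing_months = expected.difference(toy.index)

# 3) Number of observations and sample span.
n_obs, first, last = len(toy), toy.index.min(), toy.index.max()
```

An empty ```missing_months``` means every month in the span is present, which matters later when lags are built with ```shift```.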

In [50]:
df.isna().sum()

Índice de precios Lima Metropolitana (var% mensual) - IPC                                       0
Tasas de interés del Banco Central de Reserva  - Tasa de Referencia de la Política Monetaria    0
Tipo de cambio - promedio del periodo (S/ por US$) - Interbancario - Promedio                   0
Emisión primaria y multiplicador (millones S/) - Circulante Desestacionalizado                  0
Liquidez internacional del BCRP - RIN - Reservas Internacionales Netas (millones US$)           0
Remuneraciones - Remuneración Mínima Vital - Índice Real (base 1994 = 100)                      0
Cotizaciones de productos (promedio del periodo) - Trigo - EEUU (US$ por toneladas)             0
Cotizaciones de productos (promedio del periodo) - Maíz - EEUU (US$ por toneladas)              0
Cotizaciones de productos (promedio del periodo) - Aceite Soya - EEUU (US$ por toneladas)       0
Cotizaciones de productos (promedio del periodo) - Petróleo - WTI (US$ por barriles)            0
dtype: int64

In [51]:
df.describe()

Unnamed: 0,Índice de precios Lima Metropolitana (var% mensual) - IPC,Tasas de interés del Banco Central de Reserva - Tasa de Referencia de la Política Monetaria,Tipo de cambio - promedio del periodo (S/ por US$) - Interbancario - Promedio,Emisión primaria y multiplicador (millones S/) - Circulante Desestacionalizado,Liquidez internacional del BCRP - RIN - Reservas Internacionales Netas (millones US$),Remuneraciones - Remuneración Mínima Vital - Índice Real (base 1994 = 100),Cotizaciones de productos (promedio del periodo) - Trigo - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Maíz - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Aceite Soya - EEUU (US$ por toneladas),Cotizaciones de productos (promedio del periodo) - Petróleo - WTI (US$ por barriles)
count,244.0,244.0,244.0,244.0,244.0,244.0,244.0,244.0,244.0,244.0
mean,0.268869,3.658811,3.21474,10.207171,10.662819,5.616074,5.434735,5.026962,6.717469,4.182901
std,0.148128,1.710424,0.363194,0.766773,0.62592,0.103723,0.313514,0.390336,0.344587,0.34895
min,-0.011845,0.25,2.552173,8.591877,9.185581,5.439158,4.792837,4.164457,6.106831,2.823541
25%,0.157445,2.75,2.870174,9.678975,10.365558,5.507451,5.198043,4.828253,6.46286,3.935347
50%,0.248044,3.75,3.251714,10.392176,11.010612,5.646617,5.407049,4.942209,6.624612,4.21659
75%,0.330206,4.25,3.399687,10.748751,11.107878,5.705018,5.691647,5.383466,7.004992,4.464242
max,0.679208,7.75,4.108055,11.305032,11.288803,5.777829,6.276845,5.757876,7.581216,4.897019


We have 244 observations ranging from ```2003-09-01``` to ```2023-12-01```. The mean of the CPI series (monthly % change, after applying ```get_trend```) is around 0.27. The means of the monetary policy rate and the exchange rate are 3.66% and S/ 3.21 per US$, respectively. The remaining variables are in logs, so their summary statistics are expressed in log units rather than as monthly % changes.

## 1.4 Data adjustment
We will rename the columns for easier identification of the variables. We will also create a new dataframe with the lags of the variables. 
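
The lag construction can be sketched as a small reusable helper. The function name ```add_lags``` and the toy data are hypothetical; the idea is the same as in the cells that follow: keep the target, replace every other variable by its lagged values, and drop the rows that the shifts leave incomplete.

```python
import pandas as pd


def add_lags(frame, target, n_lags):
    """Return the target plus lags 1..n_lags of every other column."""
    out = frame[[target]].copy()
    for col in frame.columns:
        if col == target:
            continue
        for lag in range(1, n_lags + 1):
            # shift(lag) moves values down, so row t holds the value from t - lag.
            out[f"{col}_lag_{lag}"] = frame[col].shift(lag)
    # The first n_lags rows contain NaN lags and are removed.
    return out.dropna()


# Toy data (hypothetical values).
toy = pd.DataFrame(
    {"CPI": [0.5, 0.4, 0.3, 0.2, 0.1], "rate": [2.75, 2.75, 2.5, 2.5, 2.25]},
    index=pd.date_range("2003-09-01", periods=5, freq="MS"),
)
lagged = add_lags(toy, target="CPI", n_lags=2)
```

Using only lagged regressors means that a forecast for month t relies on information available up to month t - 1.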

In [52]:
# New column names
columns = {
    'Índice de precios Lima Metropolitana (var% mensual) - IPC': 'CPI',
    'Tasas de interés del Banco Central de Reserva  - Tasa de Referencia de la Política Monetaria': 'Monetary Policy Rate',
    'Tipo de cambio - promedio del periodo (S/ por US$) - Interbancario - Promedio' : 'Exchange rate', 
    'Emisión primaria y multiplicador (millones S/) - Circulante Desestacionalizado': 'Circulating Currency Seasonally Adjusted (mill S/)',
    'Liquidez internacional del BCRP - RIN - Reservas Internacionales Netas (millones US$)': 'Net International Reserves (mill $)',
    'Remuneraciones - Remuneración Mínima Vital - Índice Real (base 1994 = 100)': 'Real Minimum Wage (Index)',
    'Cotizaciones de productos (promedio del periodo) - Trigo - EEUU (US$ por toneladas)': 'Wheat (US$ per ton)',
    'Cotizaciones de productos (promedio del periodo) - Maíz - EEUU (US$ por toneladas)': 'Corn  (US$ per ton)',
    'Cotizaciones de productos (promedio del periodo) - Aceite Soya - EEUU (US$ por toneladas)': 'Soybean oil (US$ per ton)',
    'Cotizaciones de productos (promedio del periodo) - Petróleo - WTI (US$ por barriles)': 'Crude oil (US$ per barrel)'  
}

# We rename the columns so they are easier to analyse
df.rename(columns=columns, inplace=True)

In [53]:
df_lags = df.copy()

# Lags 1 to 4 of every explanatory variable (the first column, the target, is skipped)
for variable in df_lags.columns[1:]:
    for lag in range(1, 5):
        df_lags[f'{variable}_lag_{lag}'] = df_lags[variable].shift(lag)

In [54]:
# We drop the contemporaneous variables
df_lags.drop(columns = ['Monetary Policy Rate','Circulating Currency Seasonally Adjusted (mill S/)',
       'Net International Reserves (mill $)', 'Real Minimum Wage (Index)', 'Wheat (US$ per ton)', 'Corn  (US$ per ton)', 
       'Soybean oil (US$ per ton)', 'Crude oil (US$ per barrel)', 'Exchange rate'], inplace = True)

df_lags = df_lags.dropna()

## 1.5 Save Results
We save both dataframes to the ```input``` folder, where the next notebook can use them for forecasting.
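
When the saved files are read back, the date index comes in as plain text unless it is parsed explicitly. The following sketch shows a round trip on a toy frame written to a temporary directory; the file name ```df_toy.csv``` and the data are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a named monthly index, similar in shape to df.
toy = pd.DataFrame(
    {"CPI": [0.52, 0.49, 0.47]},
    index=pd.date_range("2003-09-01", periods=3, freq="MS", name="Fecha"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_toy.csv")  # hypothetical file name
    toy.to_csv(path)
    # index_col + parse_dates restore the DatetimeIndex that to_csv wrote as text.
    restored = pd.read_csv(path, index_col="Fecha", parse_dates=True)
```

The same ```index_col``` and ```parse_dates``` arguments would apply when loading the saved files in a later notebook, assuming the index column is still named ```Fecha```.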

In [55]:
df.to_csv('../../../input/df_raw_test.csv')

In [56]:
df_lags.to_csv('../../../input/df_lags_test.csv')

In [57]:
df_lags.tail()

Fecha,CPI,Monetary Policy Rate_lag_1,Monetary Policy Rate_lag_2,Monetary Policy Rate_lag_3,Monetary Policy Rate_lag_4,Exchange rate_lag_1,Exchange rate_lag_2,Exchange rate_lag_3,Exchange rate_lag_4,Circulating Currency Seasonally Adjusted (mill S/)_lag_1,...,Corn (US$ per ton)_lag_3,Corn (US$ per ton)_lag_4,Soybean oil (US$ per ton)_lag_1,Soybean oil (US$ per ton)_lag_2,Soybean oil (US$ per ton)_lag_3,Soybean oil (US$ per ton)_lag_4,Crude oil (US$ per barrel)_lag_1,Crude oil (US$ per barrel)_lag_2,Crude oil (US$ per barrel)_lag_3,Crude oil (US$ per barrel)_lag_4
2023-08-01,0.182324,7.75,7.75,7.75,7.75,3.601255,3.650419,3.688668,3.76545,11.196625,...,5.461808,5.531248,7.342273,7.158705,7.063081,7.124298,4.327658,4.253043,4.271259,4.37587
2023-09-01,0.133933,7.75,7.75,7.75,7.75,3.697768,3.601255,3.650419,3.688668,11.194037,...,5.487035,5.461808,7.354495,7.342273,7.158705,7.063081,4.399204,4.327658,4.253043,4.271259
2023-10-01,0.085365,7.5,7.75,7.75,7.75,3.730995,3.697768,3.601255,3.650419,11.196305,...,5.368204,5.487035,7.283402,7.354495,7.342273,7.158705,4.490772,4.399204,4.327658,4.253043
2023-11-01,0.036736,7.25,7.5,7.75,7.75,3.845759,3.730995,3.697768,3.601255,11.202967,...,5.288844,5.368204,7.138916,7.283402,7.354495,7.342273,4.448655,4.490772,4.399204,4.327658
2023-12-01,-0.011845,7.0,7.25,7.5,7.75,3.760795,3.845759,3.730995,3.697768,11.200187,...,5.17894,5.288844,7.07603,7.138916,7.283402,7.354495,4.351357,4.448655,4.490772,4.399204


In [58]:
df.tail()

Fecha,CPI,Monetary Policy Rate,Exchange rate,Circulating Currency Seasonally Adjusted (mill S/),Net International Reserves (mill $),Real Minimum Wage (Index),Wheat (US$ per ton),Corn (US$ per ton),Soybean oil (US$ per ton),Crude oil (US$ per barrel)
2023-08-01,0.182324,7.75,3.697768,11.194037,11.182374,5.646268,5.714305,5.288844,7.354495,4.399204
2023-09-01,0.133933,7.5,3.730995,11.196305,11.173721,5.646104,5.677783,5.17894,7.283402,4.490772
2023-10-01,0.085365,7.25,3.845759,11.202967,11.172515,5.649334,5.548594,5.191875,7.138916,4.448655
2023-11-01,0.036736,7.0,3.760795,11.200187,11.180961,5.650966,5.567082,5.14624,7.07603,4.351357
2023-12-01,-0.011845,6.75,3.733942,11.186627,11.1709,5.646966,5.599153,5.145342,7.04632,4.276196


In [59]:
df_lags.columns

Index(['CPI', 'Monetary Policy Rate_lag_1', 'Monetary Policy Rate_lag_2',
       'Monetary Policy Rate_lag_3', 'Monetary Policy Rate_lag_4',
       'Exchange rate_lag_1', 'Exchange rate_lag_2', 'Exchange rate_lag_3',
       'Exchange rate_lag_4',
       'Circulating Currency Seasonally Adjusted (mill S/)_lag_1',
       'Circulating Currency Seasonally Adjusted (mill S/)_lag_2',
       'Circulating Currency Seasonally Adjusted (mill S/)_lag_3',
       'Circulating Currency Seasonally Adjusted (mill S/)_lag_4',
       'Net International Reserves (mill $)_lag_1',
       'Net International Reserves (mill $)_lag_2',
       'Net International Reserves (mill $)_lag_3',
       'Net International Reserves (mill $)_lag_4',
       'Real Minimum Wage (Index)_lag_1', 'Real Minimum Wage (Index)_lag_2',
       'Real Minimum Wage (Index)_lag_3', 'Real Minimum Wage (Index)_lag_4',
       'Wheat (US$ per ton)_lag_1', 'Wheat (US$ per ton)_lag_2',
       'Wheat (US$ per ton)_lag_3', 'Wheat (US$ per ton)_l