
In [None]:
#| include: false
#skip
! [ -e /content ] && pip install -Uqq gingado nbdev # install or upgrade gingado on colab

In [2]:
#| include: false
from nbdev.showdoc import show_doc

`gingado` provides data augmentation functionality that helps users add a time series dimension to their datasets. This can be done either on a stand-alone basis, as the user incorporates new data on top of the original dataset, or as part of a `scikit-learn` [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that also includes other steps such as data transformation and model estimation.
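To make the idea concrete, here is a minimal sketch of what a `scikit-learn`-compatible augmentation step can look like. The class `JoinExtraSeries` and all values are hypothetical and purely illustrative; this is not how `gingado` implements its own augmenters, only an example of a transformer that adds time series columns by aligning them on the date index.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class JoinExtraSeries(BaseEstimator, TransformerMixin):
    """Hypothetical augmenter: left-joins extra series onto X by its date index."""

    def __init__(self, extra=None):
        self.extra = extra  # DataFrame holding the additional time series

    def fit(self, X, y=None):
        # Nothing is estimated in this sketch.
        return self

    def transform(self, X):
        # Keep every row of X and align the extra columns on the shared index.
        return X.join(self.extra, how="left")


# Made-up values, for illustration only
dates = pd.date_range("2022-01-03", periods=3, freq="D")
X = pd.DataFrame({"USD_lag_1": [1.13, 1.12, 1.13]}, index=dates)
extra = pd.DataFrame({"policy_rate": [0.25, 0.25]}, index=dates[:2])

X_aug = JoinExtraSeries(extra=extra).fit_transform(X)
print(X_aug)
```

Because the class exposes `fit` and `transform`, it can be used on its own or as a step in a `Pipeline`. Dates missing from the extra series show up as missing values, which a later step can impute.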

## Data augmentation with SDMX

The **S**tatistical **D**ata and **M**etadata e**X**change (SDMX) is an ISO standard comprising:
* technical standards
* statistical guidelines, including cross-domain concepts and codelists
* an IT architecture and tools

SDMX is sponsored by the Bank for International Settlements, European Central Bank, Eurostat, International Monetary Fund, Organisation for Economic Co-operation and Development, United Nations, and World Bank Group.

More information about SDMX is available on its [webpage](http://sdmx.org).

In [1]:
#|output: asis
#| echo: false
show_doc(AugmentSDMX)


---

[source](https://github.com/dkgaraujo/gingado/tree/main/blob/main/gingado/augmentation.py#L15){target="_blank" style="float:right; font-size:smaller"}

### AugmentSDMX

>      AugmentSDMX (sources={'BIS': 'WS_CBPOL_D'}, variance_threshold=None,
>                   propagate_last_known_value=True, fillna=0, verbose=True)

A `scikit-learn`-compatible transformer that augments a dataset with data from SDMX sources.

As mentioned above, `gingado`'s transformers are built to be compatible with `scikit-learn`. The code below demonstrates this compatibility.

First, we create the example dataset. In this case, it comprises the daily foreign exchange rates of selected currencies against the euro. The Brazilian real (BRL) is the dependent variable in this example.

In [5]:
#collapse_output
from gingado.utils import load_SDMX_data, Lag
from sklearn.model_selection import TimeSeriesSplit

X = load_SDMX_data(
    sources={'ECB': 'EXR'}, 
    keys={'FREQ': 'D', 'CURRENCY': ['EUR', 'AUD', 'BRL', 'CAD', 'CHF', 'GBP', 'JPY', 'SGD', 'USD']},
    params={"startPeriod": 2003}
    )
# drop rows with empty values
X.dropna(inplace=True)
# adjust column names in this simple example for ease of understanding:
# remove parts related to source and dataflow names
X.columns = X.columns.str.replace("ECB__EXR_D__", "").str.replace("__EUR__SP00__A", "")
X = Lag(lags=1, jump=0, keep_contemporaneous_X=True).fit_transform(X)
y = X.pop('BRL')
# retain only the lagged variables in the X variable
X = X[X.columns[X.columns.str.contains('_lag_')]]

Querying data from ECB's dataflow 'EXR' - Exchange Rates...
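The `keys` argument above selects series by dimension. In the SDMX REST convention, such a selection is written as a series key: dimension values are separated by dots, alternatives within a dimension are joined with `+`, and an empty position matches every value. The sketch below builds such a key. The helper name `build_sdmx_key` is hypothetical, and the dimension order of the `EXR` dataflow is an assumption inferred from the column names in this example; it does not describe how `gingado` builds its queries internally.

```python
def build_sdmx_key(dimension_order, keys):
    """Build an SDMX REST series key from a dict of dimension values."""
    parts = []
    for dim in dimension_order:
        value = keys.get(dim, "")          # an empty slot acts as a wildcard
        if isinstance(value, (list, tuple)):
            value = "+".join(value)        # several values for one dimension
        parts.append(value)
    return ".".join(parts)


# Assumed dimension order of the ECB 'EXR' dataflow
EXR_DIMENSIONS = ["FREQ", "CURRENCY", "CURRENCY_DENOM", "EXR_TYPE", "EXR_SUFFIX"]

key = build_sdmx_key(EXR_DIMENSIONS, {"FREQ": "D", "CURRENCY": ["AUD", "BRL", "USD"]})
print(key)  # D.AUD+BRL+USD...
```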


In [6]:
X_train, X_test = X.iloc[:-1], X.tail(1)
y_train, y_test = y.iloc[:-1], y.tail(1)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((4970, 8), (4970,), (1, 8), (1,))
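The lagging and hold-out steps above can be reproduced with plain `pandas` on a toy dataset. This is a sketch under the assumption that `Lag(lags=1, ...)` behaves like a one-period shift that appends a `_lag_1` suffix; the figures are made up for illustration.

```python
import pandas as pd

# Made-up exchange rates, for illustration only
dates = pd.date_range("2022-05-23", periods=5, freq="D")
fx = pd.DataFrame(
    {"BRL": [5.15, 5.16, 5.18, 5.17, 5.17], "USD": [1.07, 1.07, 1.07, 1.07, 1.07]},
    index=dates,
)

# Shift every column by one period so each row holds the previous day's values.
lagged = fx.shift(1).add_suffix("_lag_1")

# Drop the first row, which has no previous observation.
data = pd.concat([fx, lagged], axis=1).dropna()

y = data.pop("BRL")              # contemporaneous target
X = data.filter(like="_lag_")    # keep the lagged regressors only

# Hold out the most recent observation for testing.
X_train, X_test = X.iloc[:-1], X.tail(1)
y_train, y_test = y.iloc[:-1], y.tail(1)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Using only lagged regressors means that every feature was already known on the day before the value being predicted.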

Next, the data augmentation object provided by `gingado` adds more data. For brevity, only one dataflow is listed for each of the two sources. To add more SDMX sources, add more keys to the dictionary. To request all dataflows from a given source (provided that keys and parameters such as frequency and dates match), set the value to `'all'`, as in `{'ECB': ['CISS'], 'BIS': 'all'}`.

In [7]:
#collapse_output
from gingado.augmentation import AugmentSDMX

test_src = {'ECB': ['CISS'], 'BIS': ['WS_CBPOL_D']}

X_train__fit_transform = AugmentSDMX(sources=test_src).fit_transform(X=X_train)
X_train__fit_then_transform = AugmentSDMX(sources=test_src).fit(X=X_train).transform(X=X_train, training=True)

assert X_train__fit_transform.shape == X_train__fit_then_transform.shape

Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...


2022-06-01 01:11:00,886 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


Querying data from BIS's dataflow 'WS_CBPOL_D' - Policy rates daily...


2022-06-01 01:12:24,776 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
Querying data from BIS's dataflow 'WS_CBPOL_D' - Policy rates daily...
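The examples in this page pass `sources` in three shapes: a single dataflow code as a string, a list of dataflow codes, and the special value `'all'`. The helper below is a hypothetical illustration of how these shapes can be brought to one common form; it is not part of `gingado`.

```python
def normalise_sources(sources):
    """Hypothetical helper: map each source code to a list of dataflow codes or 'all'."""
    normalised = {}
    for source, dataflows in sources.items():
        if dataflows == "all":
            normalised[source] = "all"          # every dataflow of this source
        elif isinstance(dataflows, str):
            normalised[source] = [dataflows]    # a single dataflow given as a string
        else:
            normalised[source] = list(dataflows)
    return normalised


print(normalise_sources({"ECB": ["CISS"], "BIS": "all"}))
print(normalise_sources({"BIS": "WS_CBPOL_D"}))
```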


[`AugmentSDMX`](https://dkgaraujo.github.io/gingado/augmentation.html#augmentsdmx) can also be part of a `Pipeline` object, which minimises operational errors during modelling and avoids using testing data during training. A pipeline of this kind is defined after the augmented dataset shown next.

This is the dataset after this particular augmentation:

In [8]:
#collapse_output
print(f"No of columns: {len(X_train__fit_transform.columns)} {X_train__fit_transform.columns}")
X_train__fit_transform

No of columns: 68 Index(['AUD_lag_1', 'BRL_lag_1', 'CAD_lag_1', 'CHF_lag_1', 'GBP_lag_1',
       'JPY_lag_1', 'SGD_lag_1', 'USD_lag_1',
       'ECB__CISS_D__AT__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__BE__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__CN__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__DE__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__ES__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__FI__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__FR__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__GB__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__IE__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__IT__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__NL__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__PT__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_BM__CON',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_CI__IDX',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_CO__CON',
       'ECB__CISS_D__U2__Z0Z__4F__E

| TIME_PERIOD | AUD_lag_1 | BRL_lag_1 | CAD_lag_1 | CHF_lag_1 | GBP_lag_1 | JPY_lag_1 | SGD_lag_1 | USD_lag_1 | ECB__CISS_D__AT__Z0Z__4F__EC__SS_CIN__IDX | ECB__CISS_D__BE__Z0Z__4F__EC__SS_CIN__IDX | ... | BIS__WS_CBPOL_D_D__RO | BIS__WS_CBPOL_D_D__RS | BIS__WS_CBPOL_D_D__RU | BIS__WS_CBPOL_D_D__SA | BIS__WS_CBPOL_D_D__SE | BIS__WS_CBPOL_D_D__TH | BIS__WS_CBPOL_D_D__TR | BIS__WS_CBPOL_D_D__US | BIS__WS_CBPOL_D_D__XM | BIS__WS_CBPOL_D_D__ZA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003-01-03 | 1.8554 | 3.6770 | 1.6422 | 1.4528 | 0.65200 | 124.40 | 1.8188 | 1.0446 | 0.021899 | 0.043292 | ... |  | 9.5 |  |  | 3.75 | 1.75 | 44.0 | 1.250 | 2.75 | 13.50 |
| 2003-01-06 | 1.8440 | 3.6112 | 1.6264 | 1.4555 | 0.65000 | 124.56 | 1.8132 | 1.0392 | 0.020801 | 0.039924 | ... | 19.75 | 9.5 |  | 2.00 | 3.75 | 1.75 | 44.0 | 1.250 | 2.75 | 13.50 |
| 2003-01-07 | 1.8281 | 3.5145 | 1.6383 | 1.4563 | 0.64950 | 124.40 | 1.8210 | 1.0488 | 0.019738 | 0.038084 | ... | 19.75 | 9.5 |  | 2.00 | 3.75 | 1.75 | 44.0 | 1.250 | 2.75 | 13.50 |
| 2003-01-08 | 1.8160 | 3.5139 | 1.6257 | 1.4565 | 0.64960 | 124.82 | 1.8155 | 1.0425 | 0.019947 | 0.040338 | ... | 19.75 | 9.5 | 21.0 | 2.00 | 3.75 | 1.75 | 44.0 | 1.250 | 2.75 | 13.50 |
| 2003-01-09 | 1.8132 | 3.4405 | 1.6231 | 1.4586 | 0.64950 | 124.90 | 1.8102 | 1.0377 | 0.017026 | 0.040535 | ... | 19.75 | 9.5 | 21.0 | 2.00 | 3.75 | 1.75 | 44.0 | 1.250 | 2.75 | 13.50 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-05-24 | 1.4982 | 5.1623 | 1.3626 | 1.0310 | 0.84783 | 136.05 | 1.4639 | 1.0659 | 0.269626 | 0.200358 | ... | 3.75 | 2.0 | 14.0 | 1.75 | 0.25 | 0.50 | 14.0 | 0.875 | 0.00 | 4.75 |
| 2022-05-25 | 1.5152 | 5.1793 | 1.3714 | 1.0334 | 0.85750 | 136.49 | 1.4722 | 1.0720 | 0.264778 | 0.204644 | ... | 3.75 | 2.0 | 14.0 | 1.75 | 0.25 | 0.50 | 14.0 | 0.875 | 0.00 | 4.75 |
| 2022-05-26 | 1.5126 | 5.1736 | 1.3720 | 1.0269 | 0.85295 | 135.34 | 1.4676 | 1.0656 | 0.249738 | 0.198993 | ... | 3.75 | 2.0 | 14.0 | 1.75 | 0.25 | 0.50 | 14.0 | 0.875 | 0.00 | 4.75 |
| 2022-05-27 | 1.5110 | 5.1741 | 1.3715 | 1.0283 | 0.85073 | 135.95 | 1.4709 | 1.0697 | 0.237198 | 0.188693 | ... | 3.75 | 2.0 | 14.0 | 1.75 | 0.25 | 0.50 | 14.0 | 0.875 | 0.00 | 4.75 |


In [9]:
from gingado.augmentation import AugmentSDMX
from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('augmentation', AugmentSDMX(sources={'BIS': 'WS_CBPOL_D'})),
    ('imp', IterativeImputer(max_iter=10)),
    ('forest', RandomForestRegressor())
], verbose=True)

### Tuning the data augmentation to enhance model performance

Since [`AugmentSDMX`](https://dkgaraujo.github.io/gingado/augmentation.html#augmentsdmx) can be included in a `Pipeline`, it can also be tuned with parameter search techniques such as grid search, helping users make the best of the available data to enhance the performance of their models.

In [10]:
#collapse_output
grid = GridSearchCV(
    estimator=pipeline,
    param_grid={'augmentation': ['passthrough', AugmentSDMX(sources={'ECB': 'CISS'})]},
    verbose=3,
    cv=TimeSeriesSplit()
    )

y_pred_grid = grid.fit(X_train, y_train).predict(X_test)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.0s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   0.3s
[CV 1/5] END ..........augmentation=passthrough;, score=0.623 total time=   0.3s
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.0s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   0.6s
[CV 2/5] END ..........augmentation=passthrough;, score=0.423 total time=   0.6s
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.0s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   0.9s
[CV 3/5] END ..........augmentation=passthrough;, score=0.912 total time=   0.9s
[Pipeline] ...... (step 1 of 3) Processing augmentation, t

2022-06-01 01:31:08,006 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=   3.4s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.1s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   0.7s


2022-06-01 01:31:12,084 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[CV 1/5] END augmentation=AugmentSDMX(sources={'ECB': 'CISS'});, score=0.453 total time=   7.8s


2022-06-01 01:31:15,679 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=  13.6s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.2s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   1.8s


2022-06-01 01:31:31,223 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[CV 2/5] END augmentation=AugmentSDMX(sources={'ECB': 'CISS'});, score=0.385 total time=  24.2s


2022-06-01 01:31:39,924 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=  13.4s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.2s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   2.8s


2022-06-01 01:31:56,266 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[CV 3/5] END augmentation=AugmentSDMX(sources={'ECB': 'CISS'});, score=0.917 total time=  28.6s


2022-06-01 01:32:08,450 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=  16.7s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.3s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   3.9s


2022-06-01 01:32:29,346 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[CV 4/5] END augmentation=AugmentSDMX(sources={'ECB': 'CISS'});, score=0.926 total time=  36.8s
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...


2022-06-01 01:32:45,419 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


[Pipeline] ...... (step 1 of 3) Processing augmentation, total=  21.5s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.3s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   4.8s


2022-06-01 01:33:11,953 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message


Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[CV 5/5] END augmentation=AugmentSDMX(sources={'ECB': 'CISS'});, score=-1.428 total time=  47.8s
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.0s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   1.9s
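The five folds in the log above come from `TimeSeriesSplit`, which with its default settings uses an expanding training window: each fold trains on all observations that precede its test block, so no future data leaks into training. The pure-Python sketch below mirrors that default behaviour (no gap, no cap on the training size); the function name `expanding_window_splits` is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) with training always preceding testing."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(test_start))                         # everything before the test block
        test = list(range(test_start, test_start + test_size))  # the next block of observations
        yield train, test


for train, test in expanding_window_splits(n_samples=12, n_splits=5):
    print(f"train 0..{train[-1]}  test {test[0]}..{test[-1]}")
```

Early folds train on comparatively few observations, which is one reason why scores can differ markedly across folds.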


In [11]:
grid.best_params_

{'augmentation': 'passthrough'}
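In a `scikit-learn` `Pipeline`, setting a step to `'passthrough'` skips it, which is what lets the grid search compare the model with and without augmentation on the same folds. The self-contained sketch below shows the same mechanism on synthetic data. The function `add_squares` is a hypothetical stand-in for an augmentation step and has nothing to do with SDMX.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_squares(A):
    """Hypothetical 'augmentation': append the squared features as new columns."""
    return np.hstack([A, A ** 2])


# Synthetic data in which the target depends on a squared feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy = 2.0 * X_toy[:, 0] ** 2 + rng.normal(scale=0.1, size=60)

augmenter = FunctionTransformer(add_squares)
pipe = Pipeline([("augmentation", augmenter), ("model", LinearRegression())])

search = GridSearchCV(
    estimator=pipe,
    # 'passthrough' disables the step; the second candidate keeps it
    param_grid={"augmentation": ["passthrough", augmenter]},
    cv=TimeSeriesSplit(n_splits=3),
)
search.fit(X_toy, y_toy)

print(search.best_params_["augmentation"])
```

In this toy setting the extra columns carry the signal, so the search keeps the step; with other data the search may just as well prefer `'passthrough'`.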

In [18]:
print(f"The best model was achieved by {'not ' if grid.best_params_['augmentation'] == 'passthrough' else ''}using the data augmentation.")

The best model was achieved by not using the data augmentation.


In [12]:
print(f"The last value in the training dataset was {y_train.tail(1).to_numpy()}. The predicted value was {y_pred_grid}, and the actual value was {y_test.to_numpy()}.")

The last value in the training dataset was [5.0629]. The predicted value was [5.102027], and the actual value was [5.0965].
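The fold scores printed during the grid search are coefficients of determination (R²), which is the default `score` of `scikit-learn` regressors when no `scoring` argument is given. R² equals 1 for perfect predictions, 0 for a model that always predicts the mean of the test targets, and becomes negative when the model does worse than that, as in the fold that scored -1.428. A small pure-Python sketch, with a hypothetical helper `r_squared` and made-up numbers:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1, 2, 3, 4]
print(r_squared(y_true, [1, 2, 3, 4]))          # perfect predictions
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(r_squared(y_true, [10, 10, 10, 10]))      # far worse than the mean
```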


### Sources of data

`gingado` deliberately lists only reliable data sources, with a focus on official sources. This is meant to give users confidence that their dataset will be complemented by reliable data. Unfortunately, it is not possible at this stage to include *all* official sources, given the substantial manual and maintenance work involved. `gingado` leverages the [Statistical Data and Metadata eXchange (SDMX)](https://sdmx.org), an initiative sponsored by official institutions that establishes common data and metadata formats, to download data that is relevant (and hopefully also useful) to users.

The function below from the package [simpledmx](https://github.com/dkgaraujo/simpledmx) returns a list of codes corresponding to the data sources available to provide `gingado` users with data through SDMX.

In [2]:
#collapse_output
from gingado.utils import list_SDMX_sources
list_SDMX_sources()

['ABS',
 'ABS_XML',
 'BBK',
 'BIS',
 'CD2030',
 'ECB',
 'ESTAT',
 'ILO',
 'IMF',
 'INEGI',
 'INSEE',
 'ISTAT',
 'LSD',
 'NB',
 'NBB',
 'OECD',
 'SGR',
 'SPC',
 'STAT_EE',
 'UNICEF',
 'UNSD',
 'WB',
 'WB_WDI']

You can also list the available dataflows. The code below returns the dataflows grouped by SDMX source code: for each source, it gives the code and name of each dataflow.

In [3]:
#| collapse: true
from gingado.utils import list_all_dataflows

dflows = list_all_dataflows()
dflows

ABS_XML  ABORIGINAL_POP_PROJ                 Projected population, Aboriginal and Torres St...
         ABORIGINAL_POP_PROJ_REMOTE          Projected population, Aboriginal and Torres St...
         ABS_ABORIGINAL_POPPROJ_INDREGION    Projected population, Aboriginal and Torres St...
         ABS_ACLD_LFSTATUS                   Australian Census Longitudinal Dataset (ACLD):...
         ABS_ACLD_TENURE                     Australian Census Longitudinal Dataset (ACLD):...
                                                                   ...                        
UNSD     DF_UNData_UNFCC                                                       SDMX_GHG_UNDATA
WB       DF_WITS_Tariff_TRAINS                                WITS - UNCTAD TRAINS Tariff Data
         DF_WITS_TradeStats_Development                             WITS TradeStats Devlopment
         DF_WITS_TradeStats_Tariff                                      WITS TradeStats Tariff
         DF_WITS_TradeStats_Trade                 

For example, the dataflows from the World Bank are:

In [21]:
dflows['WB']

DF_WITS_Tariff_TRAINS             WITS - UNCTAD TRAINS Tariff Data
DF_WITS_TradeStats_Development          WITS TradeStats Devlopment
DF_WITS_TradeStats_Tariff                   WITS TradeStats Tariff
DF_WITS_TradeStats_Trade                     WITS TradeStats Trade
dtype: object
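With many sources and dataflows, a keyword search over the dataflow names can help locate candidates for augmentation. The sketch below uses a plain nested dictionary as stand-in data, copied by hand from the listing above; the helper `find_dataflows` is hypothetical and not part of `gingado`.

```python
def find_dataflows(dataflows, keyword):
    """Return (source, code, name) triples whose name contains the keyword."""
    keyword = keyword.lower()
    return [
        (source, code, name)
        for source, flows in dataflows.items()
        for code, name in flows.items()
        if keyword in name.lower()
    ]


# Stand-in data: an excerpt of the dataflows shown above
sample = {
    "UNSD": {"DF_UNData_UNFCC": "SDMX_GHG_UNDATA"},
    "WB": {
        "DF_WITS_Tariff_TRAINS": "WITS - UNCTAD TRAINS Tariff Data",
        "DF_WITS_TradeStats_Development": "WITS TradeStats Devlopment",
        "DF_WITS_TradeStats_Tariff": "WITS TradeStats Tariff",
        "DF_WITS_TradeStats_Trade": "WITS TradeStats Trade",
    },
}

for source, code, name in find_dataflows(sample, "tariff"):
    print(source, code, name)
```

The matching source and dataflow codes can then be passed in the `sources` dictionary, for example `{'WB': ['DF_WITS_TradeStats_Tariff']}`.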
