In [None]:
#hide
%load_ext autoreload
%autoreload 2

In [None]:
# default_exp augmentation

# Data augmentation

> Functions to augment the user's dataset with information from official sources.

In [None]:
#hide
from nbdev.showdoc import *

## Introduction

Most users will probably find it more convenient to use the `scikit-learn`-compatible `DataAugment` transformer than the underlying functions. These functions are nevertheless documented below in case they address a specific need.

## Sources of data

By design, `gingado` lists only official data sources, so that users can trust that their dataset is complemented by reliable data. Unfortunately, it is not possible at this stage to include *all* official sources - let alone all reliable sources - because that would require substantial manual and maintenance work. `gingado` leverages the [Statistical Data and Metadata eXchange (SDMX)](https://sdmx.org), an initiative sponsored by international organisations that establishes common data and metadata formats, to download data that is relevant (and hopefully also useful) to users.

The function below from the package [simpledmx](https://github.com/dkgaraujo/simpledmx) returns a list of codes corresponding to the data sources available to provide `gingado` users with data through SDMX.

In [None]:
#export
from simpledmx import *

In [None]:
list_sdmx_sources()

['ABS',
 'ABS_XML',
 'BBK',
 'BIS',
 'CD2030',
 'ECB',
 'ESTAT',
 'ILO',
 'IMF',
 'INEGI',
 'INSEE',
 'ISTAT',
 'LSD',
 'NB',
 'NBB',
 'OECD',
 'SGR',
 'SPC',
 'STAT_EE',
 'UNICEF',
 'UNSD',
 'WB',
 'WB_WDI']
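
Since a misspelled source code would only surface once a download is attempted, it may be worth checking the requested codes against this list first. The sketch below is not part of `gingado` or `simpledmx`: `check_sources` is a hypothetical helper, and the list of codes is copied from the output above rather than fetched live.

```python
# Codes copied from the output of list_sdmx_sources() shown above;
# in practice, call list_sdmx_sources() to obtain the current list.
AVAILABLE_SOURCES = [
    'ABS', 'ABS_XML', 'BBK', 'BIS', 'CD2030', 'ECB', 'ESTAT', 'ILO',
    'IMF', 'INEGI', 'INSEE', 'ISTAT', 'LSD', 'NB', 'NBB', 'OECD',
    'SGR', 'SPC', 'STAT_EE', 'UNICEF', 'UNSD', 'WB', 'WB_WDI',
]

def check_sources(sources, available=AVAILABLE_SOURCES):
    """Hypothetical helper: return `sources` as a list, raising if any code is unknown."""
    # accept a single code given as a string, e.g. 'BIS'
    if isinstance(sources, str):
        sources = [sources]
    unknown = [s for s in sources if s not in available]
    if unknown:
        raise ValueError(f"Unknown SDMX source(s): {unknown}")
    return list(sources)

checked = check_sources('BIS')
```

Accepting either a single string or a list mirrors the way a single source is passed as `sources='BIS'` later in this page.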

In [None]:
#export
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

In [None]:
#export
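def augm_with_sdmx(df, freq, sources, variance_threshold=None, time_col=None, start_date=None, end_date=None):
    """Downloads relevant data from SDMX sources to complement the original dataset

    Arguments:
      df: a pandas DataFrame, or None to return only the data downloaded from SDMX
      freq: the frequency of the desired data from SDMX; for example, 'A' is annual
      sources: the list of SDMX sources or None; a list of possible sources can be obtained by running the function list_sdmx_sources()
      variance_threshold: a value larger than or equal to 0 or None, where 0 will lead to the removal of all data that does not vary across the dataset and None uses the scikit-learn default
      time_col: the name of the column in `df` with the time values; if None, the index of `df` is used
      start_date: the first period to download; if None, the earliest time value in `df` is used
      end_date: the last period to download; if None, the latest time value in `df` is used
    """
    if df is None:
        # without an original dataset, the period cannot be inferred
        if start_date is None or end_date is None:
            raise ValueError("start_date and end_date must be provided when df is None")
    else:
        # infer any missing date bound from the time values of the original dataset
        time_values = df.index if time_col is None else df[time_col]
        start_date = min(time_values) if start_date is None else start_date
        end_date = max(time_values) if end_date is None else end_date

    sdmx_data = get_sdmx_data(
        start_date=start_date,
        end_date=end_date,
        freq=freq,
        sources=sources
        )
    # keep only the series without missing values and index them by date
    sdmx_data = sdmx_data.dropna(axis=1).sort_index()
    sdmx_data.reset_index(inplace=True)
    sdmx_data['TIME_PERIOD'] = pd.to_datetime(sdmx_data['TIME_PERIOD'])
    sdmx_data.set_index('TIME_PERIOD', inplace=True)

    feat_sel = VarianceThreshold() if variance_threshold is None else VarianceThreshold(threshold=variance_threshold)
    feat_sel.fit(sdmx_data)

    # TODO: log which features were not kept and why
    sdmx_data = sdmx_data.iloc[:, feat_sel.get_support()]

    #sdmx_data = feat_sel.fit_transform(sdmx_data)

    if df is None:
        return sdmx_data
    # a left join keeps every row of the original dataset
    if time_col is None:
        return df.merge(sdmx_data, how='left', left_index=True, right_index=True)
    return df.merge(sdmx_data, how='left', left_on=time_col, right_index=True)

In [None]:
show_doc(augm_with_sdmx)

<h4 id="augm_with_sdmx" class="doc_header"><code>augm_with_sdmx</code><a href="__main__.py#L2" class="source_link" style="float:right">[source]</a></h4>

> <code>augm_with_sdmx</code>(**`df`**, **`freq`**, **`sources`**, **`variance_threshold`**=*`None`*, **`time_col`**=*`None`*, **`start_date`**=*`None`*, **`end_date`**=*`None`*)

Downloads relevant data from SDMX sources to complement the original dataset

Arguments:
  df: a pandas DataFrame, or None to return only the data downloaded from SDMX
  freq: the frequency of the desired data from SDMX; for example, 'A' is annual
  sources: the list of SDMX sources or None; a list of possible sources can be obtained by running the function list_sdmx_sources()
  variance_threshold: a value larger than or equal to 0 or None, where 0 will lead to the removal of all data that does not vary across the dataset and None uses the scikit-learn default
  time_col: the name of the column in `df` with the time values; if None, the index of `df` is used
  start_date: the first period to download; if None, the earliest time value in `df` is used
  end_date: the last period to download; if None, the latest time value in `df` is used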

## Using `gingado` to jumpstart a dataset

Since `gingado` downloads data from official sources through SDMX, users may want to use this functionality to build the dataset of interest from scratch instead of augmenting existing data. In these cases, the argument `df` must be set to `None`, like so:

In [None]:
new_data = augm_with_sdmx(df=None, start_date='2018', end_date='2020', freq='A', time_col=None, sources='BIS')

Source: BIS, dataflow: WS_CBPOL_D ok!
Source: BIS, dataflow: WS_CBPOL_M ok!
Source: BIS, dataflow: WS_CBS_PUB ok!
Source: BIS, dataflow: WS_CREDIT_GAP ok!
Source: BIS, dataflow: WS_DEBT_SEC2_PUB ok!
Source: BIS, dataflow: WS_DER_OTC_TOV ok!
Source: BIS, dataflow: WS_DSR ok!
Source: BIS, dataflow: WS_EER_D ok!
Source: BIS, dataflow: WS_EER_M ok!
Source: BIS, dataflow: WS_GLI ok!
Source: BIS, dataflow: WS_LBS_D_PUB ok!
Source: BIS, dataflow: WS_LONG_CPI ok!
Source: BIS, dataflow: WS_OTC_DERIV2 ok!
Source: BIS, dataflow: WS_SPP ok!
Source: BIS, dataflow: WS_TC ok!
Source: BIS, dataflow: WS_XRU ok!
Source: BIS, dataflow: WS_XRU_D ok!
Source: BIS, dataflow: WS_XTD_DERIV ok!

Trying to download WS_CBPOL_D from BIS... not possible.
Trying to download WS_CBPOL_M from BIS... not possible.
Trying to download WS_CBS_PUB from BIS... not possible.
Trying to download WS_CREDIT_GAP from BIS... not possible.
Trying to download WS_DEBT_SEC2_PUB from BIS... not possible.
Trying to download WS_DER_OTC_TOV from BIS... ok!
Trying to download WS_DSR from BIS... not possible.
Trying to download WS_EER_D from BIS... not possible.
Trying to download WS_EER_M from BIS... not possible.
Trying to download WS_GLI from BIS... not possible.
Trying to download WS_LBS_D_PUB from BIS... not possible.
Trying to download WS_LONG_CPI from BIS... ok!
Trying to download WS_OTC_DERIV2 from BIS... not possible.
Trying to download WS_SPP from BIS... not possible.
Trying to download WS_TC from BIS... not possible.
Trying to download WS_XRU from BIS... ok!
Trying to download WS_XRU_D from BIS... not possible.
Trying to download WS_XTD_DERIV from BIS... ok!

Getting data from BIS's WS_DER_OTC_TOV
Successful
Getting data from BIS's WS_LONG_CPI
Successful
Getting data from BIS's WS_XRU
Successful
Getting data from BIS's WS_XTD_DERIV
Successful


In [None]:
new_data

| TIME_PERIOD | BIS_WS_LONG_CPI_A_AE_628 | BIS_WS_LONG_CPI_A_AE_771 | BIS_WS_LONG_CPI_A_AR_628 | BIS_WS_LONG_CPI_A_AR_771 | BIS_WS_LONG_CPI_A_AT_628 | BIS_WS_LONG_CPI_A_AT_771 | BIS_WS_LONG_CPI_A_AU_628 | BIS_WS_LONG_CPI_A_AU_771 | BIS_WS_LONG_CPI_A_BE_628 | BIS_WS_LONG_CPI_A_BE_771 | ... | BIS_WS_XTD_DERIV_A_U_J_T_MXN_8A | BIS_WS_XTD_DERIV_A_U_J_T_TO1_8A | BIS_WS_XTD_DERIV_A_U_J_T_TO1_8B | BIS_WS_XTD_DERIV_A_U_J_T_TO1_8C | BIS_WS_XTD_DERIV_A_U_J_T_TO1_8E | BIS_WS_XTD_DERIV_A_U_J_T_TO1_8F | BIS_WS_XTD_DERIV_A_U_J_T_TO1_8G | BIS_WS_XTD_DERIV_A_U_J_T_TO1_8K | BIS_WS_XTD_DERIV_A_U_J_T_USD_8A | BIS_WS_XTD_DERIV_A_U_J_T_ZAR_8A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-01 | 116.781662 | 3.070301 | 416.975221 | 34.1503 | 116.259014 | 2.004903 | 117.898023 | 1.911401 | 115.451626 | 2.053165 | ... | 11.0 | 846590.0 | 471264.0 | 284211.0 | 90731.0 | 59179.0 | 31552.0 | 383.0 | 462350.0 | 350.0 |
| 2019-01-01 | 114.524888 | -1.932473 | 637.155776 | 52.80423 | 118.03518 | 1.527766 | 119.797086 | 1.610768 | 117.110457 | 1.43682 | ... | 7.0 | 852979.0 | 514054.0 | 244999.0 | 93613.0 | 61141.0 | 32472.0 | 312.0 | 503307.0 | 278.0 |
| 2020-01-01 | 112.143786 | -2.079113 | 895.091635 | 40.482386 | 119.741496 | 1.4456 | 120.811655 | 0.846906 | 117.978002 | 0.740792 | ... | 1.0 | 774172.0 | 446158.0 | 238840.0 | 88831.0 | 58457.0 | 30373.0 | 343.0 | 435049.0 | 298.0 |


The code above uses a greedy SDMX downloader that is not too concerned about selecting datasets in advance; rather, it downloads all the data it possibly can from those official sources for the time period and frequency in question. It then drops the series with missing values and filters out the series that do not vary throughout the period, avoiding the use of memory to store data that does not contribute to the predictive power of the model. The dataset is then ready to be used.
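
The filtering step can be illustrated on toy data. The sketch below assumes nothing about the actual SDMX download: the DataFrame and its column names are made up, and only the two filtering operations (`dropna(axis=1)` followed by scikit-learn's `VarianceThreshold`) are taken from the function above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy stand-in for downloaded SDMX data; the column names are made up.
toy = pd.DataFrame(
    {
        "varies": [1.0, 2.0, 3.0],
        "constant": [5.0, 5.0, 5.0],
        "has_gap": [1.0, np.nan, 3.0],
    },
    index=pd.to_datetime(["2018-01-01", "2019-01-01", "2020-01-01"]),
)

# Step 1: drop every column that has at least one missing value.
toy = toy.dropna(axis=1)

# Step 2: keep only the columns whose variance exceeds the threshold.
# The scikit-learn default threshold is 0, which removes constant columns.
selector = VarianceThreshold()
selector.fit(toy)
filtered = toy.loc[:, selector.get_support()]

kept_columns = list(filtered.columns)
```

Only the column that both has no gaps and varies over time survives, which is the behaviour the paragraph above describes.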

Two things are important to highlight. First, choosing even one source (the [BIS](https://www.bis.org) in this example) leads to the download of hundreds of variables. Some of them might represent the same underlying concept, but for different jurisdictions. Second, downloading and, in particular, parsing the SDMX data can take some time, depending on your local setup.

To use `gingado` to augment your dataset instead of creating a completely new one as done above, simply pass the original DataFrame as the argument `df` and the name of the column holding the time values as the argument `time_col`.
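
What such an augmentation amounts to can be sketched with toy data in place of a real SDMX download. Everything below is illustrative: the column names `date`, `target` and `BIS_TOY_SERIES` are made up, and the left join on the time column is only assumed to mirror the merge performed by `augm_with_sdmx`.

```python
import pandas as pd

# Toy user dataset with an explicit time column; the names are made up.
user_df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2018-01-01", "2019-01-01", "2020-01-01", "2021-01-01"]
        ),
        "target": [0.1, 0.2, 0.3, 0.4],
    }
)

# Toy stand-in for SDMX data, indexed by TIME_PERIOD like the output above.
sdmx_like = pd.DataFrame(
    {"BIS_TOY_SERIES": [10.0, 11.0, 12.0]},
    index=pd.DatetimeIndex(
        ["2018-01-01", "2019-01-01", "2020-01-01"], name="TIME_PERIOD"
    ),
)

# Left join: every row of the user's data is kept; periods without
# SDMX data (2021 here) get missing values in the new columns.
augmented = user_df.merge(
    sdmx_like, how="left", left_on="date", right_index=True
)
```

A left join is the conservative choice here, because it never drops rows from the original dataset, at the cost of introducing missing values that may need handling before modelling.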