## Split Dataset

The purpose of this notebook is to split the dataset into `training` and `test` sets. Doing this at the beginning of the project avoids leaking information about the data and lets the `test` set stand in for "production" data, which would become available only after the model is put into production.

To split the dataset, it is necessary to inspect its columns and check the distribution of the data to understand which attribute should be used to split it.


### Tasks:
 - [x] Load external dataset;
 - [x] Identify split column and value;
 - [x] Split dataset into training and test sets;
 - [x] Save datasets.

## Libraries and Configurations

In [1]:
import pandas as pd
from IPython.display import HTML, display

from application.code.core.configurations import configs
from application.code.adapters.storage import save_dataset
from application.code.core.dataset_split_service import compute_cumulative_records_by_date

## Local Functions

In [2]:
def summarize_set(df: pd.DataFrame, title: str) -> None:
    """Render the record count and the period range of a dataset as HTML."""

    display(HTML(f'<h4>{title}</h4>'))
    display(HTML(
        f''' - <strong>Records:</strong> {len(df):,}<br />
             - <strong>Periods:</strong> from {df.period.min()} to {df.period.max()}
        '''))

## Load External Dataset

In [3]:
df = pd.read_csv(configs.external_dataset, sep=';', encoding='latin1')

print(f'Records: {len(df)}')

print('\nSample:')
df.head(3).T

Records: 4955

Sample:


| | 0 | 1 | 2 |
|---|---|---|---|
| id | 4,53E+11 | 4,53E+11 | 4,53E+11 |
| safra_abertura | 201405 | 201405 | 201405 |
| cidade | CAMPO LIMPO PAULISTA | CAMPO LIMPO PAULISTA | CAMPO LIMPO PAULISTA |
| estado | SP | SP | SP |
| idade | 37 | 37 | 37 |
| sexo | F | F | F |
| limite_total | 4700 | 4700 | 4700 |
| limite_disp | 5605 | 5343 | 2829 |
| data | 4.12.2019 | 9.11.2019 | 6.05.2019 |
| valor | 31 | 15001 | 50 |


## Analyze Split Configurations

Based on the sample, transactions are recorded by date. Since a model should be trained only on known data, **the transaction date should be the split column**, separating the past from the future and avoiding data leakage.

To choose the split date, it is necessary to analyze how the records are distributed across dates.
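The helper `compute_cumulative_records_by_date` comes from the project's `dataset_split_service` module, and its implementation is not shown in this notebook. As a rough idea of what such a computation might involve, the sketch below builds a table with the same column names as the output of the next cell. The function name `cumulative_records_by_date_sketch` and its parameters are hypothetical; the sketch assumes the `data` column uses the `%d.%m.%Y` format and that `percentage` is measured against the total number of records.

```python
import pandas as pd


def cumulative_records_by_date_sketch(
    df: pd.DataFrame,
    date_column: str = "data",
    date_format: str = "%d.%m.%Y",
) -> pd.DataFrame:
    """Count records per day and accumulate them in chronological order.

    Hypothetical stand-in for the project helper, not its actual code.
    """
    # ISO strings (YYYY-MM-DD) sort chronologically when compared as text.
    period = pd.to_datetime(df[date_column], format=date_format).dt.strftime("%Y-%m-%d")

    # groupby sorts its keys, so the rows come out from oldest to newest.
    daily = (
        df.assign(period=period)
        .groupby("period")
        .size()
        .reset_index(name="transactions")
    )
    daily["total_transactions"] = daily["transactions"].cumsum()
    daily["percentage"] = 100 * daily["total_transactions"] / len(df)
    return daily


# Toy data, invented for illustration only.
toy_df = pd.DataFrame({"data": ["1.04.2019", "1.04.2019", "2.04.2019", "3.05.2019"]})
toy_cumulative = cumulative_records_by_date_sketch(toy_df)
print(toy_cumulative)
```

Keeping the period as an ISO string rather than a timestamp makes it directly comparable with a string cutoff, which is how the split cell later in this notebook filters the records.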

In [4]:
cumulative_transactions_df = compute_cumulative_records_by_date(df)

display(HTML('<h3>Cumulative Transactions by Date</h3>'))
display(cumulative_transactions_df)

display(HTML('<h3>Last ~20% of Transactions</h3>'))

# Earliest period at which at least 80% of the records have accumulated
cutoff_period = (
    cumulative_transactions_df
    .loc[lambda f: f['percentage'] >= 80]
    ['period']
    .min()
)

display(HTML(f'<strong>Cutoff Period</strong>: {cutoff_period}'))
(
    cumulative_transactions_df
    .loc[lambda f: f['period'] >= cutoff_period]
)

| | period | transactions | total_transactions | percentage |
|---|---|---|---|---|
| 0 | 2019-04-01 | 9 | 9 | 0.181635 |
| 1 | 2019-04-02 | 5 | 14 | 0.282543 |
| 2 | 2019-04-03 | 9 | 23 | 0.464178 |
| 3 | 2019-04-04 | 12 | 35 | 0.706357 |
| 4 | 2019-04-05 | 12 | 47 | 0.948537 |
| ... | ... | ... | ... | ... |
| 398 | 2020-05-03 | 5 | 4926 | 99.414733 |
| 399 | 2020-05-04 | 7 | 4933 | 99.556004 |
| 400 | 2020-05-05 | 9 | 4942 | 99.737639 |
| 401 | 2020-05-06 | 2 | 4944 | 99.778002 |


| | period | transactions | total_transactions | percentage |
|---|---|---|---|---|
| 301 | 2020-01-27 | 25 | 3969 | 80.100908 |
| 302 | 2020-01-28 | 19 | 3988 | 80.484359 |
| 303 | 2020-01-29 | 23 | 4011 | 80.948537 |
| 304 | 2020-01-30 | 8 | 4019 | 81.109990 |
| 305 | 2020-01-31 | 15 | 4034 | 81.412714 |
| ... | ... | ... | ... | ... |
| 398 | 2020-05-03 | 5 | 4926 | 99.414733 |
| 399 | 2020-05-04 | 7 | 4933 | 99.556004 |
| 400 | 2020-05-05 | 9 | 4942 | 99.737639 |
| 401 | 2020-05-06 | 2 | 4944 | 99.778002 |
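
The cutoff rule applied above takes the earliest period whose cumulative percentage reaches 80, so the day that crosses the threshold and every later day fall on the test side. A minimal sketch of that rule on a toy table follows; the name `pick_cutoff`, its `threshold` parameter, and the toy values are all invented for illustration and are not part of the project code.

```python
import pandas as pd


def pick_cutoff(cumulative_df: pd.DataFrame, threshold: float = 80.0) -> str:
    """Return the earliest period whose cumulative percentage reaches the threshold."""
    reached = cumulative_df.loc[cumulative_df["percentage"] >= threshold, "period"]
    if reached.empty:
        raise ValueError(f"No period reaches {threshold}% of the records")
    # Periods are ISO strings, so the text minimum is also the earliest date.
    return reached.min()


# Toy cumulative table with made-up values.
toy_cumulative = pd.DataFrame({
    "period": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04", "2021-01-05"],
    "percentage": [30.0, 55.0, 79.5, 80.5, 100.0],
})

toy_cutoff = pick_cutoff(toy_cumulative)
print(toy_cutoff)
```

Because the threshold is tested with `>=`, the training set ends up with slightly less than 80% of the records whenever no day lands exactly on the threshold.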


## Split Dataset

With a cutoff period defined, it is possible to split the main dataset into train and test sets.

In [5]:
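# Normalize the raw `data` column (D.M.YYYY) into ISO `YYYY-MM-DD` strings,
# which sort chronologically when compared as text.
df_with_period = (
    df
    .assign(period=lambda f: pd.to_datetime(f["data"], format="%d.%m.%Y")
                               .dt.strftime("%Y-%m-%d"))
)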

train_df = (
    df_with_period
    .loc[lambda f: f['period'] < cutoff_period]
    .reset_index(drop=True)
)
test_df = (
    df_with_period
    .loc[lambda f: f['period'] >= cutoff_period]
    .reset_index(drop=True)
)

Check that the two datasets do not share dates

In [6]:
train_dates = set(train_df['data'].tolist())
test_dates = set(test_df['data'].tolist())

assert len(train_dates & test_dates) == 0, \
    'There is shared information between train and test datasets'
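
The check above confirms that no raw date appears in both sets, but disjoint dates alone do not show that every training record precedes every test record. A stricter check, sketched below under the assumption that `period` holds ISO `YYYY-MM-DD` strings as built in the split cell, compares the latest training period with the earliest test period. The name `assert_temporal_split` is hypothetical and not part of the project code.

```python
import pandas as pd


def assert_temporal_split(
    train: pd.DataFrame,
    test: pd.DataFrame,
    period_column: str = "period",
) -> tuple:
    """Fail unless the latest training period is strictly earlier than the earliest test period."""
    if train.empty or test.empty:
        raise ValueError("Both sets must contain at least one record")

    last_train = train[period_column].max()
    first_test = test[period_column].min()
    assert last_train < first_test, (
        f"Training data ends at {last_train}, "
        f"but test data starts at {first_test}"
    )
    return last_train, first_test


# Toy sets with made-up periods.
toy_train = pd.DataFrame({"period": ["2021-01-01", "2021-01-02"]})
toy_test = pd.DataFrame({"period": ["2021-01-03", "2021-01-04"]})

boundaries = assert_temporal_split(toy_train, toy_test)
print(boundaries)
```

In this notebook the call would be `assert_temporal_split(train_df, test_df)`, placed right after the shared-dates assertion.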

Summarize information about the datasets

In [7]:
summarize_set(train_df, 'Training')
summarize_set(test_df, 'Test')

## Save Datasets

In [8]:
save_dataset(train_df, configs.base_path, 'raw', 'train')
save_dataset(test_df, configs.base_path, 'raw', 'test')