In [1]:
import numpy as np
import polars as pl
import pandas as pd
from data_prep_utilities import load_df, load_all_dfs, train_val_test_split
from dataset_descriptions import dataset_full, dataset_small

# Data Preparation Notebook

This notebook loads the data, performs feature selection and engineering, and joins the tables. The end result is a train/validation/test split that can be used to train any model.

## Data Explanation

A couple of notes on data interpretation:

Predictors that were transformed carry a capital-letter suffix on the column name that indicates the type of transformation:
* P - Transform DPD (Days past due)
* M - Masking categories
* A - Transform amount
* D - Transform date
* T - Unspecified Transform
* L - Unspecified Transform
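
As a small illustration of the naming convention above, the suffix can be read off a column name with a regular expression. This is only a sketch: the helper name `transform_suffix` is hypothetical, and the pattern assumes every transformed predictor ends in an underscore, some digits, and one capital letter (as in `actualdpd_943P`).

```python
import re

# Assumed pattern: "<name>_<digits><CAPITAL>", e.g. "actualdpd_943P".
_SUFFIX_PATTERN = re.compile(r"_\d+([A-Z])$")

def transform_suffix(column_name: str) -> str:
    """Return the transform letter of a predictor name, or "" if it has none."""
    match = _SUFFIX_PATTERN.search(column_name)
    return match.group(1) if match else ""

columns = ["actualdpd_943P", "addres_zip_823M", "case_id", "WEEK_NUM"]
suffixes = {name: transform_suffix(name) for name in columns}
print(suffixes)
```

Matching on the digits as well as the letter keeps columns such as `WEEK_NUM` from being misread as a masked category.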

On depths: the depth of a table is the number of num_group# columns used to index it. Each case_id appears at most once for each unique combination of index values, although it may not have a row for every combination. The indexing is not necessarily chronological either; dates where num_group1 == 2 may be earlier than dates where num_group1 == 0. For tables with depth > 0 it may therefore be useful to pull summary information for each case_id, e.g. min, max, median, fraction_empty.
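
The per-case_id summaries mentioned above could be computed roughly as follows. This is a sketch on a toy pandas table, not the aggregation used by the utilities in this notebook; the column `pmtamount_36A` is borrowed for illustration and the values are made up.

```python
import pandas as pd

# Toy depth-1 table: several rows per case_id, indexed by num_group1.
toy = pd.DataFrame({
    "case_id": [1, 1, 1, 2, 2],
    "num_group1": [0, 1, 2, 0, 1],
    "pmtamount_36A": [100.0, None, 300.0, 50.0, 70.0],
})

# One row per case_id; missing values are skipped by min/max/median
# and counted separately by fraction_empty.
summary = toy.groupby("case_id")["pmtamount_36A"].agg(
    min="min",
    max="max",
    median="median",
    fraction_empty=lambda s: s.isna().mean(),
)
print(summary)
```

Keeping `fraction_empty` next to the other statistics preserves the information that a value was missing, which min, max, and median silently drop.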

In [2]:
# for exploration purposes: this gives more information about each feature
dataPath = "/kaggle/input/home-credit-credit-risk-model-stability/"
feature_definitions = pl.read_csv(dataPath + "feature_definitions.csv")
print(feature_definitions.head())

shape: (5, 2)
┌─────────────────────────┬───────────────────────────────────┐
│ Variable                ┆ Description                       │
│ ---                     ┆ ---                               │
│ str                     ┆ str                               │
╞═════════════════════════╪═══════════════════════════════════╡
│ actualdpd_943P          ┆ Days Past Due (DPD) of previous … │
│ actualdpdtolerance_344P ┆ DPD of client with tolerance.     │
│ addres_district_368M    ┆ District of the person's address… │
│ addres_role_871L        ┆ Role of person's address.         │
│ addres_zip_823M         ┆ Zip code of the address.          │
└─────────────────────────┴───────────────────────────────────┘


In [3]:
# for exploration: investigate a particular df or set of dfs
df_info = {
    "name":"tax_registry_c_1",
    "depth":2,
}
train_df, submit_df = load_df(**df_info)
train_df.head()

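shape: (5, 5)
┌─────────┬───────────────────────┬────────────────┬───────────────────┬─────────────────────────┐
│ case_id ┆ employername_160M_max ┆ num_group1_max ┆ pmtamount_36A_max ┆ processingdate_168D_max │
│ ---     ┆ ---                   ┆ ---            ┆ ---               ┆ ---                     │
│ i64     ┆ str                   ┆ i64            ┆ f64               ┆ str                     │
╞═════════╪═══════════════════════╪════════════════╪═══════════════════╪═════════════════════════╡
│ 7515    ┆ 733f5e7b              ┆ 11             ┆ 600.0             ┆ 2018-11-26              │
│ 1288686 ┆ 3e7e3591              ┆ 5              ┆ 7336.8003         ┆ 2019-02-11              │
│ 1451892 ┆ 12da981f              ┆ 4              ┆ 3744.6423         ┆ 2019-06-18              │
│ 1446810 ┆ 596cd835              ┆ 5              ┆ 4726.728          ┆ 2019-07-12              │
│ 4517    ┆ 2f6e71e3              ┆ 5              ┆ 2717.208          ┆ 2019-02-14              │
└─────────┴───────────────────────┴────────────────┴───────────────────┴─────────────────────────┘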


In [4]:
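# create a generator to step through features and their descriptions
cols = train_df.columns
if df_info['depth'] > 0:
    # aggregated columns carry a 4-character suffix (e.g. "_max");
    # strip it so the names match the Variable column of feature_definitions
    cols = [c[:-4] for c in cols]
pl.Config.set_tbl_width_chars(100)
desc = feature_definitions.filter(pl.col('Variable').is_in(cols)).rows()

def next_row(desc):
    # each call to next() prints one (Variable, Description) pair
    for name, description in desc:
        print(name, ":")
        print(description)
        yield

row = next_row(desc)
print(len(desc))  # number of columns that have a description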

3


In [5]:
next(row)

employername_160M :
Employer's name.


# Load Data

## Example: Generating splits from dataset descriptions

Below is a small dataset description; in fact, it describes the same dataset used in the starter notebook.

In [27]:
####################################################
# stores dataset info, arguments for load_df
#    description: notes to self. Ignored by load functions
#    name: from the actual name of the file, ignoring extra info (e.g., train/train_{NAME}_1.csv)
#    features (default all): specify columns to keep (ignore all others)
#    feature_types (default all): from kept features, select only those ending with these tags
#    depth (default 0): from the Kaggle description. If >0, aggregation will be performed
#    agg_max (default True): if depth>0, return the max of each feature for each case_id
#    agg_min (default False): if depth>0, return the min of each feature for each case_id
#    agg_median (default False): if depth>0, return the median of each feature for each case_id
#####################################################
dataset_small = {
    "base":{
        "description": "links case_id to WEEK_NUM and target",
        "name":"base",
    },
    "static_0":{
        "description":"contains transaction history for each case_id (late payments, total debt, etc)",
        "name":"static_0",
        "feature_types":["A", "M"],
    },
    "static_cb":{
        "description":"data from an external cb: demographic data, risk assessment, number of credit checks",
        "name":"static_cb",
        "feature_types":["A", "M"],
    },
    "person_1_feats_1":{
        "description":" internal demographic information: zip code, marital status, gender etc (all hashed)",
        "name":"person_1",
        "features":["mainoccupationinc_384A", "incometype_1044T"],
        "depth":1,
    },
    "person_1_feats_2":{
        "description":" internal demographic information: zip code, marital status, gender etc (all hashed)",
        "name":"person_1",
        "features":["housetype_905L"],
        "depth":1,
    },
    "credit_bureau_b_2":{
        "description":"historical data from an external source, num and value of overdue payments",
        "name":"credit_bureau_b_2",
        "features":["pmts_pmtsoverdue_635A","pmts_dpdvalue_108P"],
        "depth":2,
    }
}

We call load_all_dfs to load the specified datasets from CSV, select features, aggregate as indicated, and then join the results.
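
To make the select, aggregate, and join steps concrete, here is a rough sketch on toy pandas tables. It is not the implementation of load_all_dfs: the helper names `select_by_suffix` and `aggregate_max`, the made-up column names, and the choice of a left join on case_id are all assumptions for illustration.

```python
import pandas as pd

def select_by_suffix(df: pd.DataFrame, feature_types: list) -> pd.DataFrame:
    """Keep case_id plus the columns whose last character is in feature_types."""
    keep = [c for c in df.columns if c == "case_id" or c[-1] in feature_types]
    return df[keep]

def aggregate_max(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse a depth>0 table to one row per case_id using the max."""
    index_cols = [c for c in df.columns if c.startswith("num_group")]
    out = df.drop(columns=index_cols).groupby("case_id", as_index=False).max()
    return out.rename(columns={c: f"{c}_max" for c in out.columns if c != "case_id"})

# Toy tables with made-up feature names and values.
base = pd.DataFrame({"case_id": [1, 2, 3], "WEEK_NUM": [0, 0, 1], "target": [0, 1, 0]})
static = pd.DataFrame({
    "case_id": [1, 2, 3],
    "amount_001A": [1000.0, 2500.0, 400.0],
    "flag_002L": [0, 1, 0],
})
person = pd.DataFrame({
    "case_id": [1, 1, 3],
    "num_group1": [0, 1, 0],
    "income_003A": [10.0, 30.0, 20.0],
})

# Assumed join: left join onto base, so every case_id in base is kept.
joined = (
    base
    .merge(select_by_suffix(static, ["A", "M"]), on="case_id", how="left")
    .merge(aggregate_max(person), on="case_id", how="left")
)
print(joined)
```

With a left join, a case_id that is absent from a table (case_id 2 in `person` here) ends up with missing values rather than being dropped.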

In [2]:
train_df, submission_df = load_all_dfs(dataset_small)

We will only use submission_df at the end, when we save the model's predictions on it for Kaggle to evaluate. train_df is passed to our split function, which returns the splits ready for scaling and training.

In [29]:
train_sets, val_sets, test_sets = train_val_test_split(train_df, train_split=0.6)

Each of these sets is a list of three pandas DataFrames:
* base (case_id, WEEK_NUM, target)
* X (all predictor columns)
* y (target only)
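
For readers who want a picture of what such a split might look like, the sketch below builds the same [base, X, y] structure on a toy table. It is not the code behind train_val_test_split: the function name `split_sketch`, the random shuffling of case_ids, the seed, and the even division of the remaining rows between validation and test are all assumptions.

```python
import numpy as np
import pandas as pd

BASE_COLS = ["case_id", "WEEK_NUM", "target"]

def split_sketch(df: pd.DataFrame, train_split: float = 0.6, seed: int = 0):
    """Shuffle case_ids, then return (train, val, test) lists of [base, X, y]."""
    rng = np.random.default_rng(seed)
    case_ids = rng.permutation(df["case_id"].unique())
    n_train = int(len(case_ids) * train_split)
    # Assumption: the rest is divided evenly between validation and test.
    n_val = (len(case_ids) - n_train) // 2
    id_groups = (
        case_ids[:n_train],
        case_ids[n_train:n_train + n_val],
        case_ids[n_train + n_val:],
    )

    def to_sets(ids):
        part = df[df["case_id"].isin(ids)]
        return [part[BASE_COLS], part.drop(columns=BASE_COLS), part[["target"]]]

    return tuple(to_sets(ids) for ids in id_groups)

# Toy data with made-up predictors x1 and x2.
toy = pd.DataFrame({
    "case_id": range(10),
    "WEEK_NUM": [0] * 5 + [1] * 5,
    "target": [0, 1] * 5,
    "x1": np.arange(10, dtype=float),
    "x2": np.arange(10, dtype=float) * 2,
})
train_sets, val_sets, test_sets = split_sketch(toy, train_split=0.6)
print(train_sets[1].shape, val_sets[1].shape, test_sets[1].shape)
```

Splitting on case_id rather than on row position guarantees that no case appears in more than one of the three sets.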

In [30]:
train_sets[1].shape # X_train

(915995, 48)

We can now use these splits to train a model. Note that depending on the model, there may still be imputation/scaling/other augmentation necessary.
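
As one example of such a follow-up step, the sketch below imputes missing values with the median and standardizes numeric columns, taking all statistics from the training split only so that nothing leaks from validation or test. The toy frames, the `amount` column, and the choice of median imputation are assumptions; the right preprocessing depends on the model.

```python
import pandas as pd

# Toy numeric predictors with missing values (made-up data).
X_train = pd.DataFrame({"amount": [1.0, None, 3.0, 5.0]})
X_val = pd.DataFrame({"amount": [None, 7.0]})

# Statistics come from the training split only.
medians = X_train.median()
X_train_filled = X_train.fillna(medians)
mean = X_train_filled.mean()
std = X_train_filled.std(ddof=0)  # a constant column would give std == 0

def prepare(X: pd.DataFrame) -> pd.DataFrame:
    """Impute with the training medians, then standardize with training mean/std."""
    return (X.fillna(medians) - mean) / std

X_train_ready = prepare(X_train)
X_val_ready = prepare(X_val)
print(X_val_ready)
```

Applying the same `prepare` function to every split keeps the transformation consistent between training and evaluation.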