### `Intro Notes`

1. Large number of tables stored in `.csv` and `.parquet` formats.
2. Train a baseline model using the `base` tables.
    - The `base` tables include both `train_base.csv` and `test_base.csv`.
3. Build primary pipeline using internal data sources and then move to incorporating external sources.


### Special Columns

- `case_id` - This is the unique identifier for each credit case. You'll need this ID to join relevant tables to the base table.
- `date_decision` - This refers to the date when a decision was made regarding the approval of the loan.
- `WEEK_NUM` - This is the week number used for aggregation. **In the test sample, `WEEK_NUM` continues sequentially from the last training value of `WEEK_NUM`.**
- `MONTH` - This column represents the month and is _intended for aggregation purposes_.
- `target` - This is the target value, determined after a certain period based on whether or not the client defaulted on the specific credit case (loan).
- `num_group1` - This is an indexing column used for the historical records of `case_id` in both depth=1 and depth=2 tables.
- `num_group2` - This is the second indexing column for depth=2 tables' historical records of `case_id`. The order of `num_group1` and `num_group2` is important and will be clarified in feature definitions.

### Additional 

 -  All other raw columns serve as predictors. Their definitions are in `feature_definitions.csv`.
 -  depth=0 tables can use predictors as features directly.
 -  depth>0 tables may require aggregations to condense the records for each `case_id`.
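
As a rough illustration of the depth>0 point, the sketch below condenses a toy depth=1 table to one row per `case_id` and joins it onto a toy base table. The data values and the column `amount_demo_A` are invented for the example and are not taken from the competition files; the choice of `max`, `mean`, and `count` is only one possible set of aggregations.

```python
import pandas as pd

# Toy stand-ins for a base table and a depth=1 table (all values are invented).
base = pd.DataFrame({"case_id": [1, 2, 3], "target": [0, 1, 0]})
depth1 = pd.DataFrame({
    "case_id": [1, 1, 2, 2, 2],
    "num_group1": [0, 1, 0, 1, 2],
    "amount_demo_A": [100.0, 250.0, 50.0, 75.0, 20.0],  # hypothetical column name
})

# Condense the depth=1 history to a single row per case_id.
agg = (
    depth1.groupby("case_id")
    .agg(
        amount_demo_A_max=("amount_demo_A", "max"),
        amount_demo_A_mean=("amount_demo_A", "mean"),
        n_records=("num_group1", "count"),
    )
    .reset_index()
)

# A left join keeps every base case, including cases with no history (case_id 3).
features = base.merge(agg, on="case_id", how="left")
print(features)
```

Cases without any historical records end up with missing values in the aggregated columns, so a downstream model or imputation step has to handle them.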
 
### Transformations

    P - Transform DPD (Days past due)
    M - Masking categories
    A - Transform amount
    D - Transform date
    T - Unspecified Transform
    L - Unspecified Transform
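
One way to use these letters is to bucket columns by suffix before choosing a transformation. The sketch below assumes the letter is the last character of the column name (as in `pmts_month_158T` or `dateofcredstart_181D`) and that the special columns are set aside first, since a name like `WEEK_NUM` would otherwise be mistaken for an `M` column. The helper name `group_columns_by_suffix` is hypothetical.

```python
from collections import defaultdict

# Suffix letter -> meaning, copied from the list above.
SUFFIX_MEANING = {
    "P": "Transform DPD (days past due)",
    "M": "Masking categories",
    "A": "Transform amount",
    "D": "Transform date",
    "T": "Unspecified transform",
    "L": "Unspecified transform",
}

# Special columns are not predictors and must not be classified by suffix.
SPECIAL = {"case_id", "date_decision", "WEEK_NUM", "MONTH",
           "target", "num_group1", "num_group2"}

def group_columns_by_suffix(columns):
    """Bucket column names by trailing letter; unknown suffixes go to 'other'."""
    buckets = defaultdict(list)
    for col in columns:
        if col in SPECIAL:
            buckets["special"].append(col)
        elif col[-1:] in SUFFIX_MEANING:
            buckets[col[-1:]].append(col)
        else:
            buckets["other"].append(col)
    return dict(buckets)

cols = ["case_id", "WEEK_NUM", "pmts_month_158T",
        "pmts_month_706T", "dateofcredstart_181D"]
groups = group_columns_by_suffix(cols)
print(groups)
```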
    
### Final Edits    
    pmts_month_158T is for active contract
    pmts_month_706T is for closed contract
    dateofcredstart_181D - Start date of a credit contract.




In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import polars as pl
from fastai.tabular.all import *
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestRegressor
import warnings

pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

In [2]:
path = "/kaggle/input/home-credit-credit-risk-model-stability/"

We need a way to visualize individual file sizes to determine which library is best suited to handling the data frames.

## 1. Table Sizes to Put Things Into Perspective

Ref: function from [notebook](https://www.kaggle.com/code/sergiosaharovskiy/home-credit-crms-2024-eda-and-submission/notebook#Overview)

In [3]:
def display_all(df):
    with pd.option_context('display.max_rows', 1000, 'display.max_columns', 1000):
        print(df)

In [4]:
import subprocess
from pathlib import Path

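def get_disk_usage(directory):
    """List the entries in `directory` with human-readable sizes, largest first."""
    # `du -h` prints "<size>\t<path>" per entry; `sort -rh` orders by size, descending.
    cmd = f'du -h {directory}/* | sort -rh'
    result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, text=True)
    output_lines = result.stdout.split('\n')
    # Split each non-empty line into its size and path fields.
    data = [line.split('\t') for line in output_lines if line]
    # 'str' holds the size string (e.g. '2.9G'); 'path' holds the full file path.
    df = pd.DataFrame(data, columns=['str', 'path'])
    # Strip the train_/test_ prefix and the extension so train and test files share a key.
    df['file_name']  = df.path.str.replace('train_|test_', '', regex=True)\
                         .apply(lambda x: Path(x).stem)
    return df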

In [5]:
train_disk_usage = get_disk_usage(f'{path}/csv_files/train').reset_index()
test_disk_usage = get_disk_usage(f'{path}/csv_files/test')

In [6]:
train_disk_usage.reset_index().merge(test_disk_usage, on=['file_name'], how='outer',
                                    suffixes=['_train', '_test'])\
                              .sort_values(by='index')\
                              .drop(columns=['index'])

| | level_0 | str_train | path_train | file_name | str_test | path_test |
|---|---|---|---|---|---|---|
| 17 | 0.0 | 2.9G | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/train/train_credit_bureau_a_2_5.csv | credit_bureau_a_2_5 | 4.0K | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/test/test_credit_bureau_a_2_5.csv |
| 16 | 1.0 | 2.4G | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/train/train_credit_bureau_a_2_4.csv | credit_bureau_a_2_4 | 4.0K | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/test/test_credit_bureau_a_2_4.csv |
| 15 | 2.0 | 2.3G | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/train/train_credit_bureau_a_2_3.csv | credit_bureau_a_2_3 | 4.0K | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/test/test_credit_bureau_a_2_3.csv |
| 18 | 3.0 | 2.2G | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/train/train_credit_bureau_a_2_6.csv | credit_bureau_a_2_6 | 4.0K | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/test/test_credit_bureau_a_2_6.csv |
| 21 | 4.0 | 1.6G | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/train/train_credit_bureau_a_2_9.csv | credit_bureau_a_2_9 | 4.0K | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/test/test_credit_bureau_a_2_9.csv |
| 14 | 5.0 | 1.6G | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/train/train_credit_bureau_a_2_2.csv | credit_bureau_a_2_2 | 4.0K | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/test/test_credit_bureau_a_2_2.csv |
| 6 | 6.0 | 1.5G | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/train/train_credit_bureau_a_1_1.csv | credit_bureau_a_1_1 | 8.0K | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/test/test_credit_bureau_a_1_1.csv |
| 20 | 7.0 | 1.2G | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/train/train_credit_bureau_a_2_8.csv | credit_bureau_a_2_8 | 4.0K | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/test/test_credit_bureau_a_2_8.csv |
| 7 | 8.0 | 949M | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/train/train_credit_bureau_a_1_2.csv | credit_bureau_a_1_2 | 8.0K | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/test/test_credit_bureau_a_1_2.csv |
| 5 | 9.0 | 837M | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/train/train_credit_bureau_a_1_0.csv | credit_bureau_a_1_0 | 8.0K | /kaggle/input/home-credit-credit-risk-model-stability//csv_files/test/test_credit_bureau_a_1_0.csv |
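
For environments where `du` and `sort` are not available, a shell-free variant can be sketched with `pathlib` alone. This is an alternative to, not a copy of, the `get_disk_usage` function in this notebook: the helper name `list_file_sizes` is hypothetical, and it reports the apparent file size in bytes (`st_size`) instead of the on-disk block usage that `du` reports, so the numbers can differ slightly.

```python
from pathlib import Path
import tempfile

def list_file_sizes(directory):
    """Return (file_name, size_in_bytes) pairs for files in `directory`, largest first."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    sizes = [(p.name, p.stat().st_size) for p in files]
    return sorted(sizes, key=lambda item: item[1], reverse=True)

# Demo on a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train_small.csv").write_text("a,b\n1,2\n")
    (Path(tmp) / "train_large.csv").write_text("a,b\n" + "1,2\n" * 1000)
    demo_sizes = list_file_sizes(tmp)
    print(demo_sizes)
```

Working with raw byte counts also makes the sizes sortable and comparable as numbers, which the human-readable strings from `du -h` are not.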


## 2. Base EDA