# 🧾 Table of Contents

- [2. Getting the Data](#2.-Getting-the-Data)
  - [Overviewing the Data](#Overviewing-the-Data)
  - [Preparing our toolbox](#Preparing-our-toolbox)
    - [Application_train.csv](#Application_train.csv)
      - [Highlights & Ideas](#Highlights-&-Ideas)
    - [bureau.csv](#bureau.csv)
      - [Highlights & Ideas](#Highlights-&-Ideas)
    - [bureau_balance.csv](#bureau_balance.csv)
      - [Highlights & Ideas](#Highlights-&-Ideas)
    - [previous_application.csv](#previous_application.csv)
      - [Highlights & Ideas](#Highlights-&-Ideas)
    - [POS_CASH_balance.csv](#POS_CASH_balance.csv)
      - [Highlights & Ideas](#Highlights-&-Ideas)
    - [installments_payments.csv](#installments_payments.csv)
      - [Highlights & Ideas](#Highlights-&-Ideas)
    - [credit_card_balance.csv](#credit_card_balance.csv)
      - [Highlights & Ideas](#Highlights-&-Ideas)
    - [Final Thoughts](#Final-Thoughts)

# 2. Getting the Data

The data is available at this [link](https://drive.google.com/file/d/17fyteuN2MdGdbP5_Xq_sySN_yH91vTup/view). It is a 673 MB ```.zip``` file. There are no impediments related to legal obligations, authorizations or sensitive information.

A file of this size will definitely exceed GitHub's 100 MB per-file upload limit. So I can either use Git Large File Storage or automate the process of getting the data with ```make```. I chose the latter to improve reproducibility and keep the project structure as light as possible. To use it:

1. From a shell terminal in the root folder of the project:

    ```make data```

2. From the IPython interpreter in this notebook:

    ```!cd .. && make data```

The package requirements will be installed and the data will be downloaded into the ```/data/raw/``` folder.
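The Makefile itself is not reproduced in this notebook. Purely as an illustration, the unpacking step of such a target could look like the Python sketch below. The function name `extract_raw_data`, the demo archive, and its contents are assumptions made for the example, not the project's actual code.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_raw_data(archive_path, target_dir):
    """Unpack a .zip archive into target_dir and return the extracted file names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return sorted(p.name for p in target.iterdir() if p.is_file())


# Demo on a throwaway archive, so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("bureau.csv", "SK_ID_CURR,SK_ID_BUREAU\n1,10\n")
    extracted = extract_raw_data(demo_zip, Path(tmp) / "data" / "raw")

print(extracted)
```

Keeping the download and extraction scripted means the raw files never have to be committed to the repository.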

## Overviewing the Data

There are 10 datasets available, plus a metadata PDF file. Of the 10 datasets:
- 1 is the column descriptions.
- 1 is a sample submission.
- 1 is the test set for the final submission.
- The remaining 7 are data to be explored.

The ER diagram below shows how they relate:

<div align="center">
<img src="../references/home_credit.png" alt="Home Credit ER Diagram" style="height: 500px"/>
</div>

OK, now it is time to take a high-level look at the datasets.

## Preparing our toolbox

In [9]:
%load_ext autoreload
%autoreload 2

from src.data.explore_data import (
    list_datasets,
    describe_feature,
    overview_data,
    create_dataframe,
    describe_features,
)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
list_datasets()

('application_test_student.csv',
 'HomeCredit_columns_description.csv',
 'POS_CASH_balance.csv',
 'credit_card_balance.csv',
 'installments_payments.csv',
 'application_train.csv',
 'bureau.csv',
 'previous_application.csv',
 'bureau_balance.csv',
 'sample_submission.csv')

### Application_train.csv
*Starting with the flagship dataset*

In [10]:
overview_data("application_train.csv", sample_mode="random", sample_size=5)

#### Dataset name (.csv)

application_train.csv


#### Dataset Random Samples

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
161368,412052,0,Cash loans,F,N,N,1,135000.0,755190.0,36459.0,675000.0,Unaccompanied,Working,Higher education,Married,Rented apartment,0.018801,-10343,-370,-4547.0,-1583,,1,1,0,1,0,0,Accountants,3.0,2,2,TUESDAY,11,0,0,0,0,0,0,Medicine,0.271976,0.165908,0.643026,,,0.9687,,,,,,,,,0.0023,,,,,0.9687,,,,,,,,,0.0024,,,,,0.9687,,,,,,,,,0.0023,,,,block of flats,0.0018,,Yes,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
177644,391895,0,Cash loans,M,Y,Y,1,121500.0,621900.0,49266.0,562500.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.030755,-11080,-635,-3896.0,-3722,18.0,1,1,0,1,1,0,Drivers,3.0,2,2,TUESDAY,12,0,0,0,0,1,1,Industry: type 11,0.55266,0.420643,0.42413,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.0,0.0,3.0,0.0,-2416.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,3.0
42295,279443,0,Cash loans,F,N,Y,0,135000.0,651595.5,33399.0,526500.0,Unaccompanied,Working,Higher education,Single / not married,House / apartment,0.02461,-10779,-482,-7.0,-3451,,1,1,0,1,0,0,Laborers,1.0,2,2,THURSDAY,15,0,0,0,0,0,0,Business Entity Type 3,,0.362556,,0.0124,0.0,0.9995,0.9932,0.0018,0.0,0.0345,0.0833,0.0417,0.0039,0.0151,0.0098,0.0232,0.0208,0.0126,0.0,0.9995,0.9935,0.0018,0.0,0.0345,0.0833,0.0417,0.004,0.0165,0.0102,0.0233,0.022,0.0125,0.0,0.9995,0.9933,0.0018,0.0,0.0345,0.0833,0.0417,0.004,0.0154,0.0099,0.0233,0.0212,reg oper account,block of flats,0.0086,Block,No,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,1.0,0.0,0.0,5.0
161674,237294,0,Cash loans,F,N,N,0,99000.0,334152.0,14157.0,270000.0,Unaccompanied,Pensioner,Secondary / secondary special,Married,House / apartment,0.009175,-21365,365243,-10459.0,-4599,,1,0,0,1,1,0,,2.0,2,2,SUNDAY,19,0,0,0,0,0,0,XNA,0.834642,0.368393,0.146442,0.0247,0.063,0.9841,0.7824,0.0431,0.0,0.069,0.0833,0.0417,0.0,0.0202,0.0242,0.0,0.0,0.0252,0.0654,0.9841,0.7909,0.0435,0.0,0.069,0.0833,0.0417,0.0,0.022,0.0252,0.0,0.0,0.025,0.063,0.9841,0.7853,0.0433,0.0,0.069,0.0833,0.0417,0.0,0.0205,0.0246,0.0,0.0,,block of flats,0.0426,"Stone, brick",No,1.0,1.0,1.0,1.0,-2520.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,1.0,0.0,1.0
235457,401672,0,Cash loans,M,N,N,1,90000.0,319981.5,12316.5,243000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.005144,-13078,-4128,-4192.0,-4190,,1,1,0,1,0,0,Low-skill Laborers,3.0,2,2,MONDAY,17,0,0,0,0,0,0,Self-employed,0.395638,0.108437,0.707699,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.0,2.0,3.0,0.0,-851.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0


#### Dataset info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246008 entries, 0 to 246007
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 229.0+ MB


#### Dataset descriptive statistics

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
count,246008.0,246008.0,246008,246008,246008,246008,246008.0,246008.0,246008.0,245998.0,245782.0,244960,246008,246008,246008,246008,246008.0,246008.0,246008.0,246008.0,246008.0,83649.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,168771,246007.0,246008.0,246008.0,246008,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008,107205.0,245464.0,197280.0,121053.0,101918.0,125912.0,82328.0,74030.0,114800.0,122064.0,123525.0,79009.0,99921.0,77730.0,122391.0,75094.0,110148.0,121053.0,101918.0,125912.0,82328.0,74030.0,114800.0,122064.0,123525.0,79009.0,99921.0,77730.0,122391.0,75094.0,110148.0,121053.0,101918.0,125912.0,82328.0,74030.0,114800.0,122064.0,123525.0,79009.0,99921.0,77730.0,122391.0,75094.0,110148.0,77616,122468,127171.0,120844,129247,245195.0,245195.0,245195.0,245195.0,246007.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,246008.0,212836.0,212836.0,212836.0,212836.0,212836.0,212836.0
unique,,,2,3,2,2,,,,,,7,8,5,6,6,,,,,,,,,,,,,18,,,,7,,,,,,,,58,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4,3,,7,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
top,,,Cash loans,F,N,Y,,,,,,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,,,,,,,,,,,,,Laborers,,,,TUESDAY,,,,,,,,Business Entity Type 3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,reg oper account,block of flats,,Panel,No,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
freq,,,222622,161867,162355,170535,,,,,,198726,126919,174831,157074,218258,,,,,,,,,,,,,44126,,,,43046,,,,,,,,54495,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,58983,120287,,52819,127375,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,278280.072908,0.081176,,,,,0.415527,168912.2,599628.3,27129.162648,538928.9,,,,,,0.020882,-16042.794393,63963.755699,-4988.0333,-2991.647642,12.034346,0.999996,0.819481,0.199095,0.998138,0.281023,0.056722,,2.15076,2.052092,2.031206,,12.064518,0.015186,0.050616,0.040499,0.078254,0.230541,0.179592,,0.502277,0.5143916,0.510838,0.117426,0.088385,0.977715,0.752387,0.044643,0.078917,0.149693,0.226097,0.231765,0.066152,0.100702,0.107329,0.008674,0.028299,0.114261,0.087517,0.977066,0.75953,0.042587,0.074466,0.145183,0.222161,0.227929,0.06477,0.105589,0.105934,0.007917,0.026986,0.117851,0.087902,0.977722,0.75564,0.044617,0.07803,0.149187,0.225701,0.231482,0.066967,0.101897,0.108541,0.008503,0.028191,,,0.102518,,,1.423598,0.144045,1.406803,0.100691,-962.70539,4.1e-05,0.71079,8.5e-05,0.015101,0.087932,0.000191,0.081103,0.003821,1.2e-05,0.003906,8e-06,0.003581,0.002951,0.001175,0.009996,0.000264,0.007975,0.000589,0.000508,0.000289,0.006291,0.006944,0.034487,0.267403,0.264109,1.90004
std,102790.909988,0.273106,,,,,0.719922,260381.8,403067.2,14504.965232,369973.8,,,,,,0.013852,4365.973763,141400.318322,3520.987048,1510.020637,11.861705,0.002016,0.38462,0.399321,0.043108,0.4495,0.231311,,0.909167,0.509063,0.502715,,3.264923,0.122294,0.219213,0.197126,0.268571,0.42118,0.383848,,0.211078,0.1908912,0.19488,0.108059,0.082228,0.059405,0.113013,0.075547,0.134542,0.099886,0.144524,0.161308,0.080915,0.092138,0.110374,0.046717,0.069178,0.107892,0.084158,0.064524,0.109838,0.074033,0.132289,0.100822,0.143613,0.161073,0.08137,0.097539,0.111664,0.045087,0.069954,0.108926,0.081984,0.060121,0.111812,0.075697,0.134423,0.100206,0.144954,0.161801,0.081824,0.093252,0.112075,0.046307,0.069892,,,0.107308,,,2.423894,0.449464,2.402398,0.364917,826.831325,0.006376,0.453397,0.009239,0.121956,0.283197,0.013821,0.272994,0.061696,0.003492,0.062379,0.002851,0.059736,0.054244,0.034255,0.099477,0.016253,0.088948,0.024271,0.022536,0.016986,0.083236,0.109538,0.204179,0.91664,0.611269,1.868217
min,100002.0,0.0,,,,,0.0,25650.0,45000.0,1615.5,40500.0,,,,,,0.00029,-25229.0,-17912.0,-23738.0,-7197.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.014568,8.173617e-08,0.000527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0,,,0.0,0.0,0.0,0.0,-4292.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,189165.5,0.0,,,,,0.0,112500.0,270000.0,16561.125,238500.0,,,,,,0.010006,-19691.0,-2758.0,-7481.0,-4297.25,5.0,1.0,1.0,0.0,1.0,0.0,0.0,,2.0,2.0,2.0,,10.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.334285,0.392653,0.37065,0.0577,0.0442,0.9767,0.6872,0.0078,0.0,0.069,0.1667,0.0833,0.0187,0.0504,0.0453,0.0,0.0,0.0525,0.0407,0.9767,0.6994,0.0072,0.0,0.069,0.1667,0.0833,0.0165,0.0542,0.0427,0.0,0.0,0.0583,0.0437,0.9767,0.6914,0.0078,0.0,0.069,0.1667,0.0833,0.0187,0.0513,0.0457,0.0,0.0,,,0.0412,,,0.0,0.0,0.0,0.0,-1570.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,278392.5,0.0,,,,,0.0,148500.0,514777.5,24930.0,450000.0,,,,,,0.01885,-15763.0,-1215.0,-4503.0,-3250.0,9.0,1.0,1.0,0.0,1.0,0.0,0.0,,2.0,2.0,2.0,,12.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.505994,0.5657089,0.535276,0.0876,0.0764,0.9816,0.7552,0.021,0.0,0.1379,0.1667,0.2083,0.0481,0.0756,0.0745,0.0,0.0036,0.084,0.0747,0.9816,0.7648,0.019,0.0,0.1379,0.1667,0.2083,0.0458,0.0771,0.0731,0.0,0.0011,0.0874,0.0759,0.9816,0.7585,0.0208,0.0,0.1379,0.1667,0.2083,0.0486,0.0761,0.0749,0.0,0.00305,,,0.0688,,,0.0,0.0,0.0,0.0,-757.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,367272.25,0.0,,,,,1.0,202500.0,808650.0,34599.375,679500.0,,,,,,0.028663,-12418.0,-289.0,-2018.0,-1715.0,15.0,1.0,1.0,0.0,1.0,1.0,0.0,,3.0,2.0,2.0,,14.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.6752,0.6634999,0.669057,0.1485,0.1122,0.9866,0.8232,0.0515,0.12,0.2069,0.3333,0.375,0.0854,0.121,0.1299,0.0039,0.0277,0.145,0.1125,0.9866,0.8236,0.049,0.1208,0.2069,0.3333,0.375,0.084,0.1313,0.1251,0.0039,0.023,0.1489,0.1116,0.9866,0.8256,0.0514,0.12,0.2069,0.3333,0.375,0.0865,0.1231,0.1303,0.0039,0.0266,,,0.1275,,,2.0,0.0,2.0,0.0,-273.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0


In [12]:
describe_features("application_train.csv", display_option="condensed")

Unnamed: 0,Dataset,Column,Description,Special
0,application_{train|test}.csv,SK_ID_CURR,ID of loan in our sample,
1,application_{train|test}.csv,TARGET,Target variable (1 - client with payment diffi...,
2,application_{train|test}.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,application_{train|test}.csv,CODE_GENDER,Gender of the client,
4,application_{train|test}.csv,FLAG_OWN_CAR,Flag if the client owns a car,
...,...,...,...,...
117,application_{train|test}.csv,AMT_REQ_CREDIT_BUREAU_DAY,Number of enquiries to Credit Bureau about the...,
118,application_{train|test}.csv,AMT_REQ_CREDIT_BUREAU_WEEK,Number of enquiries to Credit Bureau about the...,
119,application_{train|test}.csv,AMT_REQ_CREDIT_BUREAU_MON,Number of enquiries to Credit Bureau about the...,
120,application_{train|test}.csv,AMT_REQ_CREDIT_BUREAU_QRT,Number of enquiries to Credit Bureau about the...,


#### Highlights & Ideas

There are **246k+** records and **122** columns. A sizeable dataset.

- ```NAME_CONTRACT_TYPE```: research about types of loan contracts.
- From column 21 ```OWN_CAR_AGE``` to column 27 ```FLAG_EMAIL```, there is room for studying correlation and feature engineering, such as rearrangement.
- From column 44 ```APARTMENTS_AVG``` to column 90 ```EMERGENCYSTATE_MODE``` we get normalized aspects about the building where the client lives.
    - Their descriptions are all the same in the column description dataset: "Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor" (special: "normalized").
    - Well, here it gets tricky. I intend to split the data into train and validation sets. Once the model is deployed, will incoming data always arrive normalized?
    - Some columns, such as ```WALLSMATERIAL_MODE```, are categorical. They look like ordinary categorical columns rather than modes, despite the ```_MODE``` suffix.
- ```DAYS_LAST_PHONE_CHANGE```: caught my attention. Some applicants changed their phone on the exact date of the application? Seems odd.
- From column 96 ```FLAG_DOCUMENT_2``` to column 115 ```FLAG_DOCUMENT_21```, there is also room for studying correlation and feature engineering, such as rearrangement.
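Two of the checks in the list above can be sketched with plain pandas: the share of applicants whose phone change falls on the application day, and the correlation and a simple "rearrangement" (a row-wise count) of a block of flag columns. The helper names `zero_share` and `flag_summary` and the toy frame are hypothetical; on the real data the frame would come from ```application_train.csv```.

```python
import pandas as pd


def zero_share(series):
    """Fraction of non-missing values that are exactly zero."""
    valid = series.dropna()
    return float((valid == 0).mean()) if len(valid) else float("nan")


def flag_summary(df, prefix):
    """Correlation matrix of the columns starting with prefix, plus their row-wise sum."""
    flag_cols = [c for c in df.columns if c.startswith(prefix)]
    corr = df[flag_cols].corr()
    count = df[flag_cols].sum(axis=1)
    return corr, count


# Tiny hand-made frame standing in for application_train.csv (values are illustrative).
toy = pd.DataFrame(
    {
        "DAYS_LAST_PHONE_CHANGE": [0.0, -2416.0, 0.0, -2520.0, None],
        "FLAG_DOCUMENT_3": [1, 0, 1, 0, 1],
        "FLAG_DOCUMENT_6": [0, 0, 0, 1, 0],
        "FLAG_DOCUMENT_8": [0, 1, 0, 0, 0],
    }
)

phone_zero_share = zero_share(toy["DAYS_LAST_PHONE_CHANGE"])
doc_corr, doc_count = flag_summary(toy, "FLAG_DOCUMENT_")

print(phone_zero_share)
print(doc_corr.round(2))
print(doc_count.tolist())
```

If the flag columns turn out to be strongly correlated or almost always zero, a single count column may carry most of their information. That would still need to be validated against ```TARGET```.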

### bureau.csv

In [40]:
overview_data(
    "bureau.csv", display_option="condensed", sample_mode="random", sample_size=5
)

#### Dataset name (.csv)

bureau.csv


#### Dataset Random Samples

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
1236926,305481,5916012,Closed,currency 1,-1531,0,-435.0,-846.0,,0,450000.0,0.0,,0.0,Credit card,-846,0.0
904815,288975,5667606,Closed,currency 1,-2263,0,-1166.0,-1165.0,,0,169884.0,0.0,0.0,0.0,Consumer credit,-1164,
849614,309911,6052617,Closed,currency 1,-932,0,911.0,-233.0,,0,112500.0,0.0,,0.0,Credit card,-228,
16038,358247,6025723,Closed,currency 1,-800,0,-70.0,-636.0,,0,140611.5,,,0.0,Consumer credit,-636,0.0
124732,452117,5242575,Closed,currency 1,-782,0,-417.0,-415.0,0.0,0,43960.5,0.0,0.0,0.0,Consumer credit,-415,


#### Dataset info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_CURR              int64  
 1   SK_ID_BUREAU            int64  
 2   CREDIT_ACTIVE           object 
 3   CREDIT_CURRENCY         object 
 4   DAYS_CREDIT             int64  
 5   CREDIT_DAY_OVERDUE      int64  
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64  
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object 
 15  DAYS_CREDIT_UPDATE      int64  
 16  AMT_ANNUITY             float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB


#### Dataset descriptive statistics

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
count,1716428.0,1716428.0,1716428,1716428,1716428.0,1716428.0,1610875.0,1082775.0,591940.0,1716428.0,1716415.0,1458759.0,1124648.0,1716428.0,1716428,1716428.0,489637.0
unique,,,4,4,,,,,,,,,,,15,,
top,,,Closed,currency 1,,,,,,,,,,,Consumer credit,,
freq,,,1079273,1715020,,,,,,,,,,,1251615,,
mean,278214.9,5924434.0,,,-1142.108,0.8181666,510.5174,-1017.437,3825.418,0.006410406,354994.6,137085.1,6229.515,37.91276,,-593.7483,15712.76
std,102938.6,532265.7,,,795.1649,36.54443,4994.22,714.0106,206031.6,0.09622391,1149811.0,677401.1,45032.03,5937.65,,720.7473,325826.9
min,100001.0,5000000.0,,,-2922.0,0.0,-42060.0,-42023.0,0.0,0.0,0.0,-4705600.0,-586406.1,0.0,,-41947.0,0.0
25%,188866.8,5463954.0,,,-1666.0,0.0,-1138.0,-1489.0,0.0,0.0,51300.0,0.0,0.0,0.0,,-908.0,0.0
50%,278055.0,5926304.0,,,-987.0,0.0,-330.0,-897.0,0.0,0.0,125518.5,0.0,0.0,0.0,,-395.0,0.0
75%,367426.0,6385681.0,,,-474.0,0.0,474.0,-425.0,0.0,0.0,315000.0,40153.5,0.0,0.0,,-33.0,13500.0


In [47]:
describe_features("bureau.csv", display_option="condensed")

Unnamed: 0,Dataset,Column,Description,Special
122,bureau.csv,SK_ID_CURR,ID of loan in our sample - one loan in our sam...,hashed
123,bureau.csv,SK_BUREAU_ID,Recoded ID of previous Credit Bureau credit re...,hashed
124,bureau.csv,CREDIT_ACTIVE,Status of the Credit Bureau (CB) reported credits,
125,bureau.csv,CREDIT_CURRENCY,Recoded currency of the Credit Bureau credit,recoded
126,bureau.csv,DAYS_CREDIT,How many days before current application did c...,time only relative to the application
127,bureau.csv,CREDIT_DAY_OVERDUE,Number of days past due on CB credit at the ti...,
128,bureau.csv,DAYS_CREDIT_ENDDATE,Remaining duration of CB credit (in days) at t...,time only relative to the application
129,bureau.csv,DAYS_ENDDATE_FACT,Days since CB credit ended at the time of appl...,time only relative to the application
130,bureau.csv,AMT_CREDIT_MAX_OVERDUE,Maximal amount overdue on the Credit Bureau cr...,
131,bureau.csv,CNT_CREDIT_PROLONG,How many times was the Credit Bureau credit pr...,


#### Highlights & Ideas

There are **1.7M+** records and **17** columns. A lengthy dataset. There are two main challenges here:

1. This is not our flagship dataset. How can we extract information from it to enrich the data we have from ```application_train.csv```?
2. There are multiple entries for some IDs (every previous loan in the Credit Bureau is stored here). How can we summarize the data to integrate it with the other dataset? Using the most recent or oldest record, the average, or the mode, or even ignoring some columns, are all possible ideas.
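One possible way to approach both challenges is to collapse the bureau records to one row per ```SK_ID_CURR``` and left-join the result onto the application data. The sketch below is a minimal illustration on toy rows; the function name `summarize_bureau`, the `BUREAU_*` output column names, and the choice of statistics are assumptions, not a settled design.

```python
import pandas as pd


def summarize_bureau(bureau):
    """Collapse bureau records to one row per SK_ID_CURR."""
    # DAYS_CREDIT is negative (days before the application), so the
    # largest value belongs to the most recent credit.
    latest = bureau.loc[bureau.groupby("SK_ID_CURR")["DAYS_CREDIT"].idxmax()]
    latest_type = latest.set_index("SK_ID_CURR")["CREDIT_TYPE"].rename(
        "BUREAU_LATEST_CREDIT_TYPE"
    )

    stats = bureau.groupby("SK_ID_CURR").agg(
        BUREAU_N_CREDITS=("SK_ID_BUREAU", "count"),
        BUREAU_N_CLOSED=("CREDIT_ACTIVE", lambda s: int((s == "Closed").sum())),
        BUREAU_AMT_CREDIT_SUM_MEAN=("AMT_CREDIT_SUM", "mean"),
        BUREAU_DAYS_CREDIT_OLDEST=("DAYS_CREDIT", "min"),
    )
    return stats.join(latest_type)


# Toy rows with illustrative values only.
bureau = pd.DataFrame(
    {
        "SK_ID_CURR": [100001, 100001, 100002],
        "SK_ID_BUREAU": [5000001, 5000002, 5000003],
        "CREDIT_ACTIVE": ["Closed", "Active", "Closed"],
        "DAYS_CREDIT": [-1531, -800, -932],
        "AMT_CREDIT_SUM": [450000.0, 150000.0, 112500.0],
        "CREDIT_TYPE": ["Credit card", "Consumer credit", "Credit card"],
    }
)
applications = pd.DataFrame(
    {"SK_ID_CURR": [100001, 100002, 100003], "TARGET": [0, 0, 1]}
)

summary = summarize_bureau(bureau)
# Left join keeps applicants without any bureau history (their features stay NaN).
enriched = applications.merge(
    summary, how="left", left_on="SK_ID_CURR", right_index=True
)
print(enriched)
```

A left join is used so that applicants with no Credit Bureau history are kept; whether their missing values should be imputed or treated as a signal is a separate decision.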

### bureau_balance.csv

In [45]:
overview_data(
    "bureau_balance.csv",
    display_option="condensed",
    sample_mode="random",
    sample_size=5,
)

#### Dataset name (.csv)

bureau_balance.csv


#### Dataset Random Samples

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
10077170,6423524,-10,X
15189524,6059583,-20,0
6921920,6298033,-70,C
2380938,6562465,-2,0
21676529,5929260,-31,C


#### Dataset info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   SK_ID_BUREAU    int64 
 1   MONTHS_BALANCE  int64 
 2   STATUS          object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB


#### Dataset descriptive statistics

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
count,27299920.0,27299920.0,27299925
unique,,,8
top,,,C
freq,,,13646993
mean,6036297.0,-30.74169,
std,492348.9,23.86451,
min,5001709.0,-96.0,
25%,5730933.0,-46.0,
50%,6070821.0,-25.0,
75%,6431951.0,-11.0,


In [46]:
describe_features("bureau_balance.csv", display_option="condensed")

Unnamed: 0,Dataset,Column,Description,Special
139,bureau_balance.csv,SK_BUREAU_ID,Recoded ID of Credit Bureau credit (unique cod...,hashed
140,bureau_balance.csv,MONTHS_BALANCE,Month of balance relative to application date ...,time only relative to the application
141,bureau_balance.csv,STATUS,Status of Credit Bureau loan during the month ...,


#### Highlights & Ideas

As expected from a balance dataset, there are far too many records: **27.3M**.

Not every ```SK_ID_BUREAU``` from **bureau.csv** has balance data, and not all balance data is complete: some credits cover only a few months, and some show only the value X (unknown) in their ```STATUS```. So it will be hard to extract meaningful data from this dataset while balancing the effort-value tradeoff.
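Before investing effort here, it may help to quantify the two gaps just described: how many bureau credits have any balance rows at all, and how many of those carry only the unknown status. The sketch below assumes a hypothetical helper `balance_coverage` and uses toy frames in place of the real files.

```python
import pandas as pd


def balance_coverage(bureau, balance):
    """Per-credit balance summary plus the share of bureau credits with balance data."""
    per_credit = balance.groupby("SK_ID_BUREAU").agg(
        N_MONTHS=("MONTHS_BALANCE", "count"),
        # True when every monthly status of the credit is "X" (unknown).
        ONLY_UNKNOWN=("STATUS", lambda s: bool((s == "X").all())),
    )
    has_balance = bureau["SK_ID_BUREAU"].isin(per_credit.index)
    return per_credit, float(has_balance.mean())


# Toy frames with illustrative IDs.
bureau = pd.DataFrame({"SK_ID_BUREAU": [1, 2, 3, 4]})
balance = pd.DataFrame(
    {
        "SK_ID_BUREAU": [1, 1, 1, 2, 2],
        "MONTHS_BALANCE": [-2, -1, 0, -1, 0],
        "STATUS": ["C", "C", "0", "X", "X"],
    }
)

per_credit, covered_share = balance_coverage(bureau, balance)
print(per_credit)
print(covered_share)
```

If the covered share is low, or most covered credits are unknown-only, that would be an argument for giving this dataset a low priority.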

### previous_application.csv

In [48]:
overview_data(
    "previous_application.csv",
    display_option="condensed",
    sample_mode="random",
    sample_size=5,
)

#### Dataset name (.csv)

previous_application.csv


#### Dataset Random Samples

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,...,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
876718,1327604,127360,Consumer loans,5318.1,27616.5,26082.0,2763.0,27616.5,WEDNESDAY,10,...,Consumer electronics,6.0,high,POS household with interest,365243.0,-1174.0,-1024.0,-1024.0,-1017.0,0.0
1463966,1830718,218834,Cash loans,19737.945,135000.0,165226.5,,135000.0,WEDNESDAY,8,...,XNA,12.0,high,Cash Street: high,365243.0,-1193.0,-863.0,-863.0,-855.0,1.0
354631,1258084,104079,Consumer loans,3304.71,36184.5,32562.0,3622.5,36184.5,FRIDAY,11,...,Connectivity,12.0,middle,POS mobile with interest,365243.0,-574.0,-244.0,-574.0,-569.0,0.0
408441,2257201,389250,Consumer loans,10912.545,240948.0,240948.0,0.0,240948.0,SATURDAY,9,...,Consumer electronics,24.0,low_action,POS household without interest,365243.0,-724.0,-34.0,-34.0,-26.0,0.0
37031,1038283,385546,Cash loans,,0.0,0.0,,,THURSDAY,13,...,XNA,,XNA,Cash,,,,,,


#### Dataset info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-nu

#### Dataset descriptive statistics

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,...,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
count,1670214.0,1670214.0,1670214,1297979.0,1670214.0,1670213.0,774370.0,1284699.0,1670214,1670214.0,...,1670214,1297984.0,1670214,1669868,997149.0,997149.0,997149.0,997149.0,997149.0,997149.0
unique,,,4,,,,,,7,,...,11,,5,17,,,,,,
top,,,Cash loans,,,,,,TUESDAY,,...,XNA,,XNA,Cash,,,,,,
freq,,,747553,,,,,,255118,,...,855720,,517215,285990,,,,,,
mean,1923089.0,278357.2,,15955.12,175233.9,196114.0,6697.402,227847.3,,12.48418,...,,16.05408,,,342209.855039,13826.269337,33767.774054,76582.403064,81992.343838,0.33257
std,532598.0,102814.8,,14782.14,292779.8,318574.6,20921.5,315396.6,,3.334028,...,,14.56729,,,88916.115834,72444.869708,106857.034789,149647.415123,153303.516729,0.471134
min,1000001.0,100001.0,,0.0,0.0,0.0,-0.9,0.0,,0.0,...,,0.0,,,-2922.0,-2892.0,-2801.0,-2889.0,-2874.0,0.0
25%,1461857.0,189329.0,,6321.78,18720.0,24160.5,0.0,50841.0,,10.0,...,,6.0,,,365243.0,-1628.0,-1242.0,-1314.0,-1270.0,0.0
50%,1923110.0,278714.5,,11250.0,71046.0,80541.0,1638.0,112320.0,,12.0,...,,12.0,,,365243.0,-831.0,-361.0,-537.0,-499.0,0.0
75%,2384280.0,367514.0,,20658.42,180360.0,216418.5,7740.0,234000.0,,15.0,...,,24.0,,,365243.0,-411.0,129.0,-74.0,-44.0,1.0


In [52]:
describe_features("previous_application.csv", display_option="condensed")

Unnamed: 0,Dataset,Column,Description,Special
173,previous_application.csv,SK_ID_PREV,ID of previous credit in Home credit related t...,hashed
174,previous_application.csv,SK_ID_CURR,ID of loan in our sample,hashed
175,previous_application.csv,NAME_CONTRACT_TYPE,"Contract product type (Cash loan, consumer loa...",
176,previous_application.csv,AMT_ANNUITY,Annuity of previous application,
177,previous_application.csv,AMT_APPLICATION,For how much credit did client ask on the prev...,
178,previous_application.csv,AMT_CREDIT,Final credit amount on the previous applicatio...,
179,previous_application.csv,AMT_DOWN_PAYMENT,Down payment on the previous application,
180,previous_application.csv,AMT_GOODS_PRICE,Goods price of good that client asked for (if ...,
181,previous_application.csv,WEEKDAY_APPR_PROCESS_START,On which day of the week did the client apply ...,
182,previous_application.csv,HOUR_APPR_PROCESS_START,Approximately at what day hour did the client ...,rounded


#### Highlights & Ideas

Again, there are multiple records per ```SK_ID_CURR```: every previous application the applicant made with Home Credit is stored here. In this case it is even harder to decide how to extract information: we may summarize the data per ```SK_ID_CURR``` using statistical measures, or decide which previous application to use.

This dataset is connected to 3 others, which contain installments and more detailed data for each ```SK_ID_PREV```.
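The two options for collapsing previous applications can be sketched side by side: per-client statistics, and picking a single application per client. The helper names and the `PREV_*` columns are hypothetical, and using ```DAYS_FIRST_DUE``` as the recency column is only an assumption for the example; another day-offset column may be a better choice on the real data.

```python
import pandas as pd


def previous_application_stats(previous):
    """Statistical summary of previous applications, one row per SK_ID_CURR."""
    return previous.groupby("SK_ID_CURR").agg(
        PREV_N_APPLICATIONS=("SK_ID_PREV", "count"),
        PREV_AMT_CREDIT_MEAN=("AMT_CREDIT", "mean"),
        PREV_AMT_CREDIT_MAX=("AMT_CREDIT", "max"),
    )


def latest_previous_application(previous, recency_col="DAYS_FIRST_DUE"):
    """Keep, per SK_ID_CURR, the row with the largest (most recent) recency_col."""
    ordered = previous.dropna(subset=[recency_col]).sort_values(recency_col)
    return ordered.drop_duplicates("SK_ID_CURR", keep="last").set_index("SK_ID_CURR")


# Toy rows with illustrative values only.
previous = pd.DataFrame(
    {
        "SK_ID_PREV": [1, 2, 3, 4],
        "SK_ID_CURR": [100001, 100001, 100002, 100002],
        "AMT_CREDIT": [26082.0, 165226.5, 32562.0, 0.0],
        "DAYS_FIRST_DUE": [-1174.0, -1193.0, -574.0, None],
    }
)

prev_stats = previous_application_stats(previous)
prev_latest = latest_previous_application(previous)
print(prev_stats)
print(prev_latest)
```

Note that the day columns in the samples above contain values such as 365243.0, which look like placeholders rather than real offsets, so they would probably need handling before being used to rank applications by recency.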



### POS_CASH_balance.csv

In [51]:
overview_data(
    "POS_CASH_balance.csv",
    display_option="condensed",
    sample_mode="random",
    sample_size=5,
)

#### Dataset name (.csv)

POS_CASH_balance.csv


#### Dataset Random Samples

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
6385902,1141316,278001,-37,12.0,0.0,Completed,0,0
389847,2682740,248899,-67,12.0,4.0,Active,0,0
7302029,2136647,209965,-73,18.0,18.0,Active,0,0
6010681,1678541,432875,-8,24.0,16.0,Active,0,0
2031441,1870173,117636,-86,10.0,7.0,Active,0,0


#### Dataset info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SK_ID_PREV             int64  
 1   SK_ID_CURR             int64  
 2   MONTHS_BALANCE         int64  
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object 
 6   SK_DPD                 int64  
 7   SK_DPD_DEF             int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB


#### Dataset descriptive statistics

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
count,10001360.0,10001360.0,10001360.0,9975287.0,9975271.0,10001358,10001360.0,10001360.0
unique,,,,,,9,,
top,,,,,,Active,,
freq,,,,,,9151119,,
mean,1903217.0,278403.9,-35.01259,17.08965,10.48384,,11.60693,0.6544684
std,535846.5,102763.7,26.06657,11.99506,11.10906,,132.714,32.76249
min,1000001.0,100001.0,-96.0,1.0,0.0,,0.0,0.0
25%,1434405.0,189550.0,-54.0,10.0,3.0,,0.0,0.0
50%,1896565.0,278654.0,-28.0,12.0,7.0,,0.0,0.0
75%,2368963.0,367429.0,-13.0,24.0,14.0,,0.0,0.0


In [53]:
describe_features("POS_CASH_balance.csv", display_option="condensed")

Unnamed: 0,Dataset,Column,Description,Special
142,POS_CASH_balance.csv,SK_ID_PREV,ID of previous credit in Home Credit related t...,
143,POS_CASH_balance.csv,SK_ID_CURR,ID of loan in our sample,
144,POS_CASH_balance.csv,MONTHS_BALANCE,Month of balance relative to application date ...,time only relative to the application
145,POS_CASH_balance.csv,CNT_INSTALMENT,Term of previous credit (can change over time),
146,POS_CASH_balance.csv,CNT_INSTALMENT_FUTURE,Installments left to pay on the previous credit,
147,POS_CASH_balance.csv,NAME_CONTRACT_STATUS,Contract status during the month,
148,POS_CASH_balance.csv,SK_DPD,DPD (days past due) during the month of previo...,
149,POS_CASH_balance.csv,SK_DPD_DEF,DPD during the month with tolerance (debts wit...,


#### Highlights & Ideas

It seems that every "balance" dataframe presents the same challenge. However, this dataset may provide useful information about days past due (```SK_DPD```), whether past or present.
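
One possible way to turn the monthly ```SK_DPD``` records into applicant-level signals is sketched below. The toy rows and the output names (```POS_SK_DPD_MAX```, ```POS_MONTHS_WITH_DPD```, ```POS_LAST_MONTH```) are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-in for POS_CASH_balance.csv (values are made up for illustration).
pos = pd.DataFrame(
    {
        "SK_ID_PREV": [10, 10, 11, 12],
        "SK_ID_CURR": [100, 100, 100, 101],
        "MONTHS_BALANCE": [-3, -2, -5, -1],
        "SK_DPD": [0, 15, 0, 0],
    }
)

# Flag the months in which the client was past due.
pos["HAS_DPD"] = pos["SK_DPD"] > 0

# Summarize per applicant: worst delay, number of late months,
# and the most recent month on record (closest to zero).
dpd_agg = pos.groupby("SK_ID_CURR").agg(
    POS_SK_DPD_MAX=("SK_DPD", "max"),
    POS_MONTHS_WITH_DPD=("HAS_DPD", "sum"),
    POS_LAST_MONTH=("MONTHS_BALANCE", "max"),
)
print(dpd_agg)
```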

### installments_payments.csv

In [88]:
overview_data(
    "installments_payments.csv",
    display_option="condensed",
    sample_mode="random",
    sample_size=5,
)

#### Dataset name (.csv)

installments_payments.csv


#### Dataset Random Samples

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
5672587,2652367,232137,0.0,18,-2398.0,-2403.0,4500.0,4500.0
566432,1845882,167969,1.0,14,-136.0,-140.0,8043.48,8043.48
1780626,2430317,105881,1.0,15,-120.0,-120.0,48222.855,48222.855
2601634,1052426,170461,1.0,7,-1083.0,-1090.0,11721.735,11721.735
4353865,2148550,265570,0.0,47,-1125.0,-1128.0,7875.0,7875.0


#### Dataset info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_PREV              int64  
 1   SK_ID_CURR              int64  
 2   NUM_INSTALMENT_VERSION  float64
 3   NUM_INSTALMENT_NUMBER   int64  
 4   DAYS_INSTALMENT         float64
 5   DAYS_ENTRY_PAYMENT      float64
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB


#### Dataset descriptive statistics

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
count,13605400.0,13605400.0,13605400.0,13605400.0,13605400.0,13602500.0,13605400.0,13602500.0
mean,1903365.0,278444.9,0.8566373,18.8709,-1042.27,-1051.114,17050.91,17238.22
std,536202.9,102718.3,1.035216,26.66407,800.9463,800.5859,50570.25,54735.78
min,1000001.0,100001.0,0.0,1.0,-2922.0,-4921.0,0.0,0.0
25%,1434191.0,189639.0,0.0,4.0,-1654.0,-1662.0,4226.085,3398.265
50%,1896520.0,278685.0,1.0,8.0,-818.0,-827.0,8884.08,8125.515
75%,2369094.0,367530.0,1.0,19.0,-361.0,-370.0,16710.21,16108.42
max,2843499.0,456255.0,178.0,277.0,-1.0,-1.0,3771488.0,3771488.0


In [86]:
describe_features("installments_payments.csv", display_option="condensed")

Unnamed: 0,Dataset,Column,Description,Special
211,installments_payments.csv,SK_ID_PREV,ID of previous credit in Home credit related t...,hashed
212,installments_payments.csv,SK_ID_CURR,ID of loan in our sample,hashed
213,installments_payments.csv,NUM_INSTALMENT_VERSION,Version of installment calendar (0 is for cred...,
214,installments_payments.csv,NUM_INSTALMENT_NUMBER,On which installment we observe payment,
215,installments_payments.csv,DAYS_INSTALMENT,When the installment of previous credit was su...,time only relative to the application
216,installments_payments.csv,DAYS_ENTRY_PAYMENT,When was the installments of previous credit p...,time only relative to the application
217,installments_payments.csv,AMT_INSTALMENT,What was the prescribed installment amount of ...,
218,installments_payments.csv,AMT_PAYMENT,What the client actually paid on previous cred...,


#### Highlights & Ideas

This dataset has only numerical features and presents the same challenge as the other datasets: multiple records per ```SK_ID_CURR```.
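
Because every feature here is numerical, simple differences between columns may already be informative. The sketch below assumes that ```DAYS_INSTALMENT``` is the due date and ```DAYS_ENTRY_PAYMENT``` the actual payment date, both in days relative to the application; the derived names ```DAYS_LATE``` and ```AMT_SHORTFALL``` are hypothetical.

```python
import pandas as pd

# Toy stand-in for installments_payments.csv (values are made up for illustration).
inst = pd.DataFrame(
    {
        "SK_ID_CURR": [100, 100, 101],
        "DAYS_INSTALMENT": [-30.0, -60.0, -10.0],
        "DAYS_ENTRY_PAYMENT": [-25.0, -65.0, -10.0],
        "AMT_INSTALMENT": [500.0, 500.0, 200.0],
        "AMT_PAYMENT": [500.0, 400.0, 200.0],
    }
)

# Positive values mean the payment arrived after the due date (assumed interpretation).
inst["DAYS_LATE"] = inst["DAYS_ENTRY_PAYMENT"] - inst["DAYS_INSTALMENT"]
# Positive values mean the client paid less than the prescribed amount.
inst["AMT_SHORTFALL"] = inst["AMT_INSTALMENT"] - inst["AMT_PAYMENT"]

# Summarize per applicant: worst delay and total amount left unpaid.
inst_agg = inst.groupby("SK_ID_CURR").agg(
    INST_DAYS_LATE_MAX=("DAYS_LATE", "max"),
    INST_AMT_SHORTFALL_SUM=("AMT_SHORTFALL", "sum"),
)
print(inst_agg)
```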

### credit_card_balance.csv

In [127]:
overview_data(
    "credit_card_balance.csv",
    display_option="condensed",
    sample_mode="random",
    sample_size=5,
)

#### Dataset name (.csv)

credit_card_balance.csv


#### Dataset Random Samples

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_INST_MIN_REGULARITY,...,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
2515502,2229653,284348,-45,136399.635,135000,0.0,0.0,0.0,0.0,6750.0,...,136399.635,136399.635,0.0,0,0.0,0.0,48.0,Active,1,1
1307940,1786293,156412,-77,0.0,90000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,5.0,Active,0,0
318549,2199394,397441,-88,0.0,180000,0.0,0.0,0.0,0.0,9000.0,...,0.0,0.0,0.0,0,0.0,0.0,1.0,Active,0,0
2558004,1171732,364620,-6,345647.43,337500,84600.0,84600.0,0.0,0.0,16579.215,...,335103.93,335103.93,6.0,6,0.0,0.0,2.0,Active,0,0
3802735,1798989,361308,-69,93392.505,202500,0.0,0.0,0.0,0.0,10125.0,...,93392.505,93392.505,0.0,0,0.0,0.0,18.0,Active,0,0


#### Dataset info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   SK_ID_PREV                  int64  
 1   SK_ID_CURR                  int64  
 2   MONTHS_BALANCE              int64  
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int64  
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float64
 16  CNT_DRAWINGS_CURRENT        int64  
 17  CNT_DRAWINGS_OTHER_CURRENT  float64
 18  CNT_DRAWINGS_POS_CURRENT    float64
 19  CNT_INSTALMENT_MATURE

#### Dataset descriptive statistics

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_INST_MIN_REGULARITY,...,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
count,3840312.0,3840312.0,3840312.0,3840312.0,3840312.0,3090496.0,3840312.0,3090496.0,3090496.0,3535076.0,...,3840312.0,3840312.0,3090496.0,3840312.0,3090496.0,3090496.0,3535076.0,3840312,3840312.0,3840312.0
unique,,,,,,,,,,,...,,,,,,,,7,,
top,,,,,,,,,,,...,,,,,,,,Active,,
freq,,,,,,,,,,,...,,,,,,,,3698436,,
mean,1904504.0,278324.2,-34.52192,58300.16,153808.0,5961.325,7433.388,288.1696,2968.805,3540.204,...,58088.81,58098.29,0.309449,0.7031439,0.004812496,0.5594791,20.82508,,9.283667,0.331622
std,536469.5,102704.5,26.66775,106307.0,165145.7,28225.69,33846.08,8201.989,20796.89,5600.154,...,105965.4,105971.8,1.100401,3.190347,0.08263861,3.240649,20.05149,,97.5157,21.47923
min,1000018.0,100006.0,-96.0,-420250.2,0.0,-6827.31,-6211.62,0.0,0.0,0.0,...,-420250.2,-420250.2,0.0,0.0,0.0,0.0,0.0,,0.0,0.0
25%,1434385.0,189517.0,-55.0,0.0,45000.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,,0.0,0.0
50%,1897122.0,278396.0,-28.0,0.0,112500.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,15.0,,0.0,0.0
75%,2369328.0,367580.0,-11.0,89046.69,180000.0,0.0,0.0,0.0,0.0,6633.911,...,88899.49,88914.51,0.0,0.0,0.0,0.0,32.0,,0.0,0.0


In [128]:
describe_features("credit_card_balance.csv", display_option="condensed")

Unnamed: 0,Dataset,Column,Description,Special
150,credit_card_balance.csv,SK_ID_PREV,ID of previous credit in Home credit related t...,hashed
151,credit_card_balance.csv,SK_ID_CURR,ID of loan in our sample,hashed
152,credit_card_balance.csv,MONTHS_BALANCE,Month of balance relative to application date ...,time only relative to the application
153,credit_card_balance.csv,AMT_BALANCE,Balance during the month of previous credit,
154,credit_card_balance.csv,AMT_CREDIT_LIMIT_ACTUAL,Credit card limit during the month of the prev...,
155,credit_card_balance.csv,AMT_DRAWINGS_ATM_CURRENT,Amount drawing at ATM during the month of the ...,
156,credit_card_balance.csv,AMT_DRAWINGS_CURRENT,Amount drawing during the month of the previou...,
157,credit_card_balance.csv,AMT_DRAWINGS_OTHER_CURRENT,Amount of other drawings during the month of t...,
158,credit_card_balance.csv,AMT_DRAWINGS_POS_CURRENT,Amount drawing or buying goods during the mont...,
159,credit_card_balance.csv,AMT_INST_MIN_REGULARITY,Minimal installment for this month of the prev...,


#### Highlights & Ideas

Here we observe much the same structure as in **POS_CASH_balance.csv**, but with credit card characteristics.

### Final Thoughts

From this first analysis, it seems we might have a considerable amount of the infamous **NaNs** in the data. We also have multiple huge datasets whose relations are not easy to turn into feature data. Therefore, I will dive a little deeper into the exploratory data analysis to gather more insights.
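
A quick way to quantify the **NaNs** is to compute the share of missing values per column. The snippet below is a minimal sketch on a toy dataframe; the column names are placeholders, not columns from the datasets above.

```python
import numpy as np
import pandas as pd

# Toy dataframe with missing values in a numeric and a text column.
df = pd.DataFrame(
    {
        "A": [1.0, np.nan, 3.0, np.nan],
        "B": [1, 2, 3, 4],
        "C": [np.nan, "x", "y", "z"],
    }
)

# Fraction of missing values per column, worst columns first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to each of the real datasets, the same two calls would show which columns are affected the most before any imputation strategy is chosen.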

*Next notebook: [3.0-ejk-eda-applications](3.0-ejk-eda-applications.ipynb).*