# How to use wranglev.py functions

In [1]:
import pandas as pd
import numpy as np
from wranglev import get_reports_data
from wranglev import get_application_data
import wranglev as prep
import warnings
warnings.filterwarnings("ignore")

### `get_reports_data(creditrecordcsv)`

Getting the data from `credit_record.csv` is simple:

In [2]:
expanded = get_reports_data('credit_record.csv')

What does `get_reports_data` do for us? 

Returns `expanded`

This function reads `credit_record.csv` and builds a DataFrame from every record with at least 12 months of history. For records that have a default (150 or more days late), it takes the 12 months prior to that default. For records without a default, it takes the most recent 12 months.

It then drops the most recent 6 months of each 12-month window to create a blind period for forecasting, which leaves a 6-month observation window per record.
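The window selection described above can be sketched for a single account as follows. This is only an illustration, not the code in `wranglev.py`: the column names `months_balance` and `status`, the status code `"5"` for a default, and the helper name `observation_window` are all assumptions.

```python
import pandas as pd

DEFAULT_STATUS = "5"  # assumed code for 150+ days past due


def observation_window(history: pd.DataFrame):
    """Return the 6 observed months for one account, or None if there are
    fewer than 12 usable months.

    Assumes `months_balance` counts backwards from 0 (latest month) and
    `status` holds the raw status code. Both names are hypothetical.
    """
    ordered = history.sort_values("months_balance").reset_index(drop=True)
    defaults = ordered.index[ordered["status"] == DEFAULT_STATUS]
    if len(defaults) > 0:
        end = defaults[0]      # window ends just before the first default
    else:
        end = len(ordered)     # window ends at the most recent month
    window = ordered.iloc[max(end - 12, 0):end]
    if len(window) < 12:
        return None
    # Drop the most recent 6 months (blind period) and keep the earlier 6.
    return window.iloc[:6]


# 14 months of history with a default in the latest month
history = pd.DataFrame({
    "months_balance": range(-13, 1),
    "status": ["X"] * 2 + ["0"] * 11 + ["5"],
})
observed = observation_window(history)
print(observed["months_balance"].tolist())  # [-12, -11, -10, -9, -8, -7]
```

Here the 12 months before the default are months -12 through -1, and the blind period removes months -6 through -1.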

A scoring system was developed to represent increasing degrees of lateness:
- No debt = 0
- Debt paid off = 1
- 0-29 days late = 2
- 30-59 days late = 3
- 60-89 days late = 4
- 90-119 days late = 5
- 120-149 days late = 6
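As a rough sketch, the scoring above amounts to a lookup table. The raw status codes used as keys here (`X`, `C`, `0` through `4`) are an assumption about how `credit_record.csv` encodes each month; only the score values come from the list above.

```python
import pandas as pd

# Assumed raw status codes (keys) mapped to the lateness scores listed above
STATUS_TO_SCORE = {
    "X": 0,  # no debt
    "C": 1,  # debt paid off
    "0": 2,  # 0-29 days late
    "1": 3,  # 30-59 days late
    "2": 4,  # 60-89 days late
    "3": 5,  # 90-119 days late
    "4": 6,  # 120-149 days late
}

statuses = pd.Series(["X", "C", "0", "0", "1", "4"])
scores = statuses.map(STATUS_TO_SCORE)
print(scores.tolist())  # [0, 1, 2, 2, 3, 6]
```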

1. `expanded` - Each row of this DataFrame is a unique observation representing the history of a single credit card.
    - `id`: Unique ID (may have a matching ID in the application data)
    - `0-29`: Number of months within the 6-month observation window that a balance was 0-29 days past due
    - `30-59`: Number of months within the 6-month observation window that a balance was 30-59 days past due
    - `60-89`: Number of months within the 6-month observation window that a balance was 60-89 days past due
    - `90-119`: Number of months within the 6-month observation window that a balance was 90-119 days past due
    - `120-149`: Number of months within the 6-month observation window that a balance was 120-149 days past due
    - `bad_debt`: Number of months within the 6-month observation window that a balance was 150+ days past due
    - `no_debt`: Number of months within the 6-month observation window that there was no debt on record
    - `paid_off`: Number of months within the 6-month observation window where the existing debt was paid off
    - `months_exist`: Number of months the credit line has been active
    - `month_01` through `month_06`: Each month is given a score based on the degree of credit use and lateness 
    - `total_score`: The combined score of all 6 months
    - `odd_months_score`: The combined score of the odd months
    - `last_half_score`: The combined score of the last 3 months
    - `first_half_score`: The combined score of the first 3 months
    - `difference_score`: The difference between `last_half_score` and `first_half_score` (last half minus first half)
    - `odds_evens_score`: The combined score of months 1, 3, 4, and 6
    - `begining_score`: The combined score of the first 2 months
    - `middle_score`: The combined score of months 3 and 4
    - `ending_score`: The combined score of months 5 and 6
    - `spread_score`: The combined score of months 2 through 5
    - `alpha_omgea_score`: The combined score of months 1 and 6
    - `begining_ending_score`: The combined score of months 1, 2, 5, and 6
    
The purpose of these features is to identify distinct patterns within the 6-month observation window.
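To make the definitions concrete, the sketch below recomputes the derived scores from six monthly scores. The monthly values are illustrative, chosen to be consistent with the scores in the first row of the `expanded` output in this notebook. The formulas follow the descriptions above and are not copied from `wranglev.py`.

```python
import pandas as pd

month_cols = [f"month_0{i}" for i in range(1, 7)]
# Illustrative monthly scores for one account
scores = pd.DataFrame([[2, 2, 2, 1, 1, 1]], columns=month_cols)
m1, m2, m3, m4, m5, m6 = (scores[c] for c in month_cols)

features = pd.DataFrame({
    "total_score": m1 + m2 + m3 + m4 + m5 + m6,
    "odd_months_score": m1 + m3 + m5,
    "last_half_score": m4 + m5 + m6,
    "first_half_score": m1 + m2 + m3,
    "odds_evens_score": m1 + m3 + m4 + m6,
    "begining_score": m1 + m2,
    "middle_score": m3 + m4,
    "ending_score": m5 + m6,
    "spread_score": m2 + m3 + m4 + m5,
    "alpha_omgea_score": m1 + m6,
    "begining_ending_score": m1 + m2 + m5 + m6,
})
# Last half minus first half: negative means lateness eased over the window
features["difference_score"] = (
    features["last_half_score"] - features["first_half_score"]
)
print(features.iloc[0].to_dict())
```

The column spellings (`begining_score`, `alpha_omgea_score`) match the names used in the DataFrame.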

In [3]:
expanded

Unnamed: 0,id,0-29,30-59,60-89,90-119,120-149,paid_off,no_debt,months_exist,month_01,...,first_half_score,difference_score,odds_evens_score,begining_score,middle_score,ending_score,spread_score,alpha_omgea_score,begining_ending_score,defaulted
0,5001712,3.0,0.0,0.0,0.0,0.0,3.0,0.0,19,2,...,6,-3,6,4,3,2,6,3,6,0
1,5001713,0.0,0.0,0.0,0.0,0.0,0.0,6.0,22,0,...,0,0,0,0,0,0,0,0,0,0
2,5001714,0.0,0.0,0.0,0.0,0.0,0.0,6.0,15,0,...,0,0,0,0,0,0,0,0,0,0
3,5001715,0.0,0.0,0.0,0.0,0.0,0.0,6.0,60,0,...,0,0,0,0,0,0,0,0,0,0
4,5001717,6.0,0.0,0.0,0.0,0.0,0.0,0.0,22,2,...,6,0,8,4,4,4,8,4,8,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32187,5150481,0.0,0.0,0.0,0.0,0.0,0.0,6.0,43,0,...,0,0,0,0,0,0,0,0,0,0
32188,5150482,6.0,0.0,0.0,0.0,0.0,0.0,0.0,18,2,...,6,0,8,4,4,4,8,4,8,0
32189,5150483,0.0,0.0,0.0,0.0,0.0,0.0,6.0,18,0,...,0,0,0,0,0,0,0,0,0,0
32190,5150484,6.0,0.0,0.0,0.0,0.0,0.0,0.0,13,2,...,6,0,8,4,4,4,8,4,8,0


In [4]:
expanded.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32192 entries, 0 to 32191
Data columns (total 28 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     32192 non-null  int64  
 1   0-29                   32192 non-null  float64
 2   30-59                  32192 non-null  float64
 3   60-89                  32192 non-null  float64
 4   90-119                 32192 non-null  float64
 5   120-149                32192 non-null  float64
 6   paid_off               32192 non-null  float64
 7   no_debt                32192 non-null  float64
 8   months_exist           32192 non-null  int64  
 9   month_01               32192 non-null  int64  
 10  month_02               32192 non-null  int64  
 11  month_03               32192 non-null  int64  
 12  month_04               32192 non-null  int64  
 13  month_05               32192 non-null  int64  
 14  month_06               32192 non-null  int64  
 15  to

### `get_application_data(applicationrecordcsv)`

Getting the data from `application_record.csv` is simple:

In [5]:
apps = get_application_data('application_record.csv')

What does `get_application_data` do for us?
1. Reads the csv into a dataframe
1. Converts all column headers to lowercase
1. Converts 'id' to string type
1. Fills null values in occupation type with 'Other'
1. Adds a column for years employed
1. Adds a column for age in years (rounded down)
1. Replaces "Y" and "N" throughout the dataframe with 1s and 0s (except for gender)
1. Reverses the sign on days_birth and days_employed
1. Changes the employed_years for permanently retired pensioners to an estimate based on their age, gender, and occupation
1. Converts the days_employed for retired pensioners from a placeholder value to an estimate based on their age, gender, and occupation

Returns `apps`, the cleaned dataframe
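A few of the steps above can be sketched on a tiny frame. This is a hedged illustration rather than the function's actual code: the raw column names, the negative day counts, and the use of floor division by 365 for both year columns are assumptions, and the pensioner estimates are left out.

```python
import pandas as pd

# Hypothetical raw rows; the IDs are made up
raw = pd.DataFrame({
    "ID": [1, 2],
    "FLAG_OWN_CAR": ["Y", "N"],
    "OCCUPATION_TYPE": [None, "Sales staff"],
    "DAYS_BIRTH": [-12005, -19110],
    "DAYS_EMPLOYED": [-4542, -3051],
})

apps_demo = raw.copy()
apps_demo.columns = apps_demo.columns.str.lower()            # lowercase headers
apps_demo["occupation_type"] = apps_demo["occupation_type"].fillna("Other")
apps_demo["flag_own_car"] = apps_demo["flag_own_car"].map({"Y": 1, "N": 0})

day_cols = ["days_birth", "days_employed"]
apps_demo[day_cols] = -apps_demo[day_cols]                   # reverse the sign

apps_demo["age"] = apps_demo["days_birth"] // 365            # rounded down
apps_demo["employed_years"] = apps_demo["days_employed"] // 365
print(apps_demo[["age", "employed_years"]])
```

With these inputs the ages come out as 32 and 52 and the years employed as 12 and 8.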

In [6]:
apps

Unnamed: 0,id,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,name_income_type,name_education_type,name_family_status,name_housing_type,days_birth,days_employed,flag_mobil,flag_work_phone,flag_phone,flag_email,occupation_type,cnt_fam_members,employed_years,age
0,5008804,M,1,1,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,12005,4542,1,1,0,0,Other,2.0,12.0,32.0
1,5008805,M,1,1,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,12005,4542,1,1,0,0,Other,2.0,12.0,32.0
2,5008806,M,1,1,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,21474,1134,1,0,0,0,Security staff,2.0,3.0,58.0
3,5008808,F,0,1,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,19110,3051,1,0,1,1,Sales staff,1.0,8.0,52.0
4,5008809,F,0,1,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,19110,3051,1,0,1,1,Sales staff,1.0,8.0,52.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
438552,6840104,M,0,1,0,135000.0,Pensioner,Secondary / secondary special,Separated,House / apartment,22717,15705,1,0,0,0,Other,1.0,43.0,62.0
438553,6840222,F,0,0,0,103500.0,Working,Secondary / secondary special,Single / not married,House / apartment,15939,3007,1,0,0,0,Laborers,1.0,8.0,43.0
438554,6841878,F,0,0,0,54000.0,Commercial associate,Higher education,Single / not married,With parents,8169,372,1,1,0,0,Sales staff,1.0,1.0,22.0
438555,6842765,F,0,1,0,72000.0,Pensioner,Secondary / secondary special,Married,House / apartment,21673,13879,1,0,0,0,Other,2.0,38.0,59.0


In [7]:
apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 438557 entries, 0 to 438556
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   438557 non-null  int64  
 1   code_gender          438557 non-null  object 
 2   flag_own_car         438557 non-null  int64  
 3   flag_own_realty      438557 non-null  int64  
 4   cnt_children         438557 non-null  int64  
 5   amt_income_total     438557 non-null  float64
 6   name_income_type     438557 non-null  object 
 7   name_education_type  438557 non-null  object 
 8   name_family_status   438557 non-null  object 
 9   name_housing_type    438557 non-null  object 
 10  days_birth           438557 non-null  int64  
 11  days_employed        438557 non-null  int64  
 12  flag_mobil           438557 non-null  int64  
 13  flag_work_phone      438557 non-null  int64  
 14  flag_phone           438557 non-null  int64  
 15  flag_email       

### `prep.encode_dummies(apps)`
Passing the `apps` dataframe into this function adds dummy variables that encode the categorical variables. The function does not drop the original categorical variables, as certain exploration tasks are easier with the originals in place.

Naturally, the original variables will need to be dropped before modeling. 

`encode_dummies()` returns the dataframe `apps_encoded`
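One plausible way to get this behavior is `pd.get_dummies` followed by `pd.concat`, as sketched below. The helper name `encode_dummies_sketch` and its `categorical_cols` parameter are hypothetical. The lowercase, underscore-separated naming is inferred from the column names in the output and may not match how `wranglev.py` builds them.

```python
import pandas as pd


def encode_dummies_sketch(df, categorical_cols):
    """Append 0/1 dummy columns while keeping the original categoricals."""
    dummies = pd.get_dummies(df[categorical_cols], dtype=int)
    # Normalize names, e.g. "name_income_type_Commercial associate"
    # becomes "name_income_type_commercial_associate"
    dummies.columns = (
        dummies.columns.str.lower().str.replace(" ", "_", regex=False)
    )
    return pd.concat([df, dummies], axis=1)


demo = pd.DataFrame({
    "id": [1, 2, 3],
    "name_income_type": ["Working", "Commercial associate", "Pensioner"],
})
encoded = encode_dummies_sketch(demo, ["name_income_type"])
print(encoded.columns.tolist())
```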

In [8]:
apps_encoded = prep.encode_dummies(apps)

In [9]:
apps_encoded.columns

Index(['id', 'code_gender', 'flag_own_car', 'flag_own_realty', 'cnt_children',
       'amt_income_total', 'name_income_type', 'name_education_type',
       'name_family_status', 'name_housing_type', 'days_birth',
       'days_employed', 'flag_mobil', 'flag_work_phone', 'flag_phone',
       'flag_email', 'occupation_type', 'cnt_fam_members', 'employed_years',
       'age', 'name_income_type_commercial_associate',
       'name_income_type_pensioner', 'name_income_type_state_servant',
       'name_income_type_student', 'name_income_type_working',
       'name_education_type_academic_degree',
       'name_education_type_higher_education',
       'name_education_type_incomplete_higher',
       'name_education_type_lower_secondary',
       'name_education_type_secondary_/_secondary_special',
       'name_housing_type_co-op_apartment',
       'name_housing_type_house_/_apartment',
       'name_housing_type_municipal_apartment',
       'name_housing_type_office_apartment',
       'name_housing

In [10]:
apps_encoded

Unnamed: 0,id,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,name_income_type,name_education_type,name_family_status,name_housing_type,...,occupation_type_low-skill_laborers,occupation_type_managers,occupation_type_medicine_staff,occupation_type_other,occupation_type_private_service_staff,occupation_type_realty_agents,occupation_type_sales_staff,occupation_type_secretaries,occupation_type_security_staff,occupation_type_waiters/barmen_staff
0,5008804,M,1,1,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,...,0,0,0,1,0,0,0,0,0,0
1,5008805,M,1,1,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,...,0,0,0,1,0,0,0,0,0,0
2,5008806,M,1,1,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,...,0,0,0,0,0,0,0,0,1,0
3,5008808,F,0,1,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,...,0,0,0,0,0,0,1,0,0,0
4,5008809,F,0,1,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
438552,6840104,M,0,1,0,135000.0,Pensioner,Secondary / secondary special,Separated,House / apartment,...,0,0,0,1,0,0,0,0,0,0
438553,6840222,F,0,0,0,103500.0,Working,Secondary / secondary special,Single / not married,House / apartment,...,0,0,0,0,0,0,0,0,0,0
438554,6841878,F,0,0,0,54000.0,Commercial associate,Higher education,Single / not married,With parents,...,0,0,0,0,0,0,1,0,0,0
438555,6842765,F,0,1,0,72000.0,Pensioner,Secondary / secondary special,Married,House / apartment,...,0,0,0,1,0,0,0,0,0,0


### `prep.wrangle_credit`
This function calls both `get_reports_data` and `get_application_data` and merges their results into a single DataFrame. It then splits that DataFrame into train, validate, and test sets, stratified on the target variable (`defaulted`).
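The merge step can be pictured as a join on `id`. The sketch below assumes an inner join, so only accounts present in both sources survive. The join type and the toy values are assumptions, not taken from `wranglev.py`.

```python
import pandas as pd

# Toy stand-ins for the two cleaned DataFrames
reports_demo = pd.DataFrame({"id": [1, 2, 3], "defaulted": [0, 1, 0]})
applications_demo = pd.DataFrame({"id": [2, 3, 4], "age": [40, 31, 58]})

# Keep only ids that appear in both frames
merged = applications_demo.merge(reports_demo, on="id", how="inner")
print(merged)
```

For the join to match rows, `id` needs the same dtype in both frames.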

In [11]:
train, validate, test = prep.wrangle_credit()

In [12]:
train.head(5)

Unnamed: 0,id,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,name_income_type,name_education_type,name_family_status,name_housing_type,...,first_half_score,difference_score,odds_evens_score,begining_score,middle_score,ending_score,spread_score,alpha_omgea_score,begining_ending_score,defaulted
22151,5142128,F,0,0,0,283500.0,Commercial associate,Secondary / secondary special,Married,Municipal apartment,...,0,0,0,0,0,0,0,0,0,0
21572,5136981,F,0,0,0,306000.0,State servant,Higher education,Married,House / apartment,...,7,-2,9,4,6,2,9,3,6,0
8328,5052719,F,0,0,0,126000.0,Pensioner,Secondary / secondary special,Married,House / apartment,...,3,0,4,2,2,2,4,2,4,0
19787,5117901,F,0,0,0,76500.0,Pensioner,Secondary / secondary special,Widow,House / apartment,...,3,0,4,2,2,2,4,2,4,0
12424,5069147,M,0,1,0,216000.0,Commercial associate,Higher education,Married,House / apartment,...,2,-1,3,2,0,1,0,3,3,0


In [13]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17633 entries, 22151 to 3971
Data columns (total 47 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     17633 non-null  int64  
 1   code_gender            17633 non-null  object 
 2   flag_own_car           17633 non-null  int64  
 3   flag_own_realty        17633 non-null  int64  
 4   cnt_children           17633 non-null  int64  
 5   amt_income_total       17633 non-null  float64
 6   name_income_type       17633 non-null  object 
 7   name_education_type    17633 non-null  object 
 8   name_family_status     17633 non-null  object 
 9   name_housing_type      17633 non-null  object 
 10  days_birth             17633 non-null  int64  
 11  days_employed          17633 non-null  int64  
 12  flag_mobil             17633 non-null  int64  
 13  flag_work_phone        17633 non-null  int64  
 14  flag_phone             17633 non-null  int64  
 15 

### `prep.split_data(df, pct=0.10)`

This simply splits the dataframe of your choice into train, validate, and test sets. The random_state is set to 123 for reproducibility.
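A pandas-only sketch of a three-way split is shown below. The real function may well use a different tool (for example scikit-learn), so treat the helper `split_data_sketch` as hypothetical. The 20% validate share of the remaining rows is an inference from the output shapes in this notebook, not something the source states.

```python
import pandas as pd


def split_data_sketch(df, pct=0.10, validate_pct=0.20, seed=123):
    """Split df into train, validate, and test. Assumes a unique index."""
    test = df.sample(frac=pct, random_state=seed)
    remainder = df.drop(test.index)
    validate = remainder.sample(frac=validate_pct, random_state=seed)
    train = remainder.drop(validate.index)
    return train, validate, test


demo = pd.DataFrame({"x": range(100)})
train_d, validate_d, test_d = split_data_sketch(demo)
print(train_d.shape, validate_d.shape, test_d.shape)  # (72, 1) (18, 1) (10, 1)
```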

In [14]:
train2, validate2, test2 = prep.split_data(apps_encoded)

In [15]:
train2.shape, validate2.shape, test2.shape

((315760, 55), (78941, 55), (43856, 55))

### `prep.split_stratify_data(df, stratify_target, pct=0.10)`
This is similar to `split_data`, but it also stratifies the split on a variable of your choice. The random_state is likewise fixed for reproducibility.
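Stratifying means each split keeps roughly the same class proportions as the full dataframe. A pandas-only sketch, sampling within each group, is given below. The helper name, the 20% validate share, and the approach itself are assumptions rather than the actual implementation.

```python
import pandas as pd


def split_stratify_sketch(df, stratify_target, pct=0.10,
                          validate_pct=0.20, seed=123):
    """Three-way split sampled per group. Assumes a unique index."""
    test = df.groupby(stratify_target).sample(frac=pct, random_state=seed)
    remainder = df.drop(test.index)
    validate = remainder.groupby(stratify_target).sample(
        frac=validate_pct, random_state=seed
    )
    train = remainder.drop(validate.index)
    return train, validate, test


# 150 "F" rows and 50 "M" rows: a 3-to-1 ratio
demo = pd.DataFrame({"code_gender": ["F"] * 150 + ["M"] * 50})
train_s, validate_s, test_s = split_stratify_sketch(demo, "code_gender")
print(test_s["code_gender"].value_counts().to_dict())
```

Each of the three splits keeps the 3-to-1 ratio of the toy data.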

In [16]:
train3, validate3, test3 = prep.split_stratify_data(apps_encoded, 'code_gender')

In [17]:
train3.shape, validate3.shape, test3.shape

((315760, 55), (78941, 55), (43856, 55))

### `prep.calc_vif`
This function calculates the Variance Inflation Factor (VIF) of each variable as a measure of multicollinearity. To use it, first remove any non-numeric columns.
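For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The NumPy sketch below uses the identity that these values equal the diagonal of the inverse correlation matrix. It is a stand-in for illustration and not necessarily how `prep.calc_vif` computes its result; the helper name and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def vif_sketch(df):
    """VIF per column via the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.DataFrame({"feature": df.columns, "vif": vif})


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),  # nearly a copy of x1
    "x3": rng.normal(size=200),                  # unrelated to x1 and x2
})
vif_demo = vif_sketch(demo)
print(vif_demo)
```

The near-duplicate columns `x1` and `x2` get large VIF values, while `x3` stays close to 1. Columns that are exact linear combinations of others make the correlation matrix singular, so the inversion can fail or return extreme values.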

In [18]:
# Removing columns that will not be used during modeling as well as non-numeric columns
cols_to_drop = ['id', 'code_gender', 'flag_own_car', 'flag_own_realty', 'cnt_children','name_income_type', 'name_education_type',
       'name_family_status', 'name_housing_type', 'days_birth',
       'days_employed', 'flag_mobil', 'flag_work_phone', 'flag_phone',
       'flag_email', 'occupation_type', 'cnt_fam_members', 'employed_years',
       'age']

train = train.drop(columns=cols_to_drop)
validate = validate.drop(columns=cols_to_drop)
test = test.drop(columns=cols_to_drop)

# t is a reduced copy of train without the columns that have a VIF score greater than 5 (train itself is unchanged)
t = train.drop(columns = ['odd_months_score', 'last_half_score', 'first_half_score',
       'difference_score', 'odds_evens_score', 'begining_score',
       'middle_score', 'ending_score', 'spread_score', 'alpha_omgea_score',
       'begining_ending_score','0-29', '30-59', '60-89', '90-119', '120-149','month_01', 'month_02',
       'month_03', 'month_04', 'month_05', 'month_06', 'months_exist'])

In [19]:
# getting the vif dataframe
vif_df = prep.calc_vif(train)

### `prep.create_scaled_x_y(train, validate, test, target)`
This separates each of the train, validate, and test sets into features (X) and target (y), and scales the X sets.
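The sketch below shows one common way to do this: standardize each feature using the mean and standard deviation of the training set only, then apply the same transform to validate and test. The choice of standard scaling is an assumption based on how the scaled output looks, and `create_scaled_x_y_sketch` is a hypothetical stand-in for the real function.

```python
import pandas as pd


def create_scaled_x_y_sketch(train, validate, test, target):
    """Split each set into X and y, scaling X with train statistics only."""
    X_train = train.drop(columns=[target])
    mean = X_train.mean()
    std = X_train.std(ddof=0)  # population std; a constant column gives 0

    def scale(df):
        return (df.drop(columns=[target]) - mean) / std

    return (
        scale(train), train[target],
        scale(validate), validate[target],
        scale(test), test[target],
    )


train_demo = pd.DataFrame({"x": [1.0, 2.0, 3.0], "defaulted": [0, 1, 0]})
validate_demo = pd.DataFrame({"x": [2.0], "defaulted": [0]})
test_demo = pd.DataFrame({"x": [4.0], "defaulted": [1]})

X_tr, y_tr, X_va, y_va, X_te, y_te = create_scaled_x_y_sketch(
    train_demo, validate_demo, test_demo, "defaulted"
)
print(X_tr["x"].round(3).tolist(), X_va["x"].tolist())
```

Fitting the scaling statistics on train alone keeps information from validate and test out of the model inputs.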

In [20]:
X_train_scaled, y_train, X_validate_scaled, y_validate, X_test_scaled, y_test = prep.create_scaled_x_y(train, validate, test, 'defaulted')

In [21]:
X_train_scaled.shape, y_train.shape

((17633, 27), (17633,))

In [22]:
X_validate_scaled.shape, y_validate.shape

((4409, 27), (4409,))

In [23]:
X_test_scaled.shape, y_test.shape

((2450, 27), (2450,))

In [24]:
X_train_scaled.head()

Unnamed: 0,amt_income_total,0-29,30-59,60-89,90-119,120-149,paid_off,no_debt,months_exist,month_01,...,last_half_score,first_half_score,difference_score,odds_evens_score,begining_score,middle_score,ending_score,spread_score,alpha_omgea_score,begining_ending_score
22151,0.935789,-0.765542,-0.1742,-0.047443,-0.022598,-0.016103,-1.056056,2.451417,-1.049106,-1.646331,...,-1.810907,-1.810369,0.139035,-1.867549,-1.773187,-1.765746,-1.782479,-1.837328,-1.838006,-1.885666
21572,1.158089,0.035407,5.112661,-0.047443,-0.022598,-0.016103,-0.358624,-0.483858,-1.203828,1.06673,...,0.805976,1.698891,-1.70082,1.699445,1.162507,2.723102,-0.229131,1.666547,0.506492,0.515919
8328,-0.62031,-0.765542,-0.1742,-0.047443,-0.022598,-0.016103,1.03624,-0.483858,0.730199,-0.2898,...,-0.240778,-0.3064,0.139035,-0.282218,-0.30534,-0.269463,-0.229131,-0.28005,-0.275007,-0.284609
19787,-1.10937,-0.765542,-0.1742,-0.047443,-0.022598,-0.016103,1.03624,-0.483858,0.266032,-0.2898,...,-0.240778,-0.3064,0.139035,-0.282218,-0.30534,-0.269463,-0.229131,-0.28005,-0.275007,-0.284609
12424,0.268889,-0.365067,-0.1742,-0.047443,-0.022598,-0.016103,-0.70734,1.472992,-0.739662,1.06673,...,-1.287531,-0.807723,-0.780892,-0.678551,-0.30534,-1.765746,-1.005805,-1.837328,0.506492,-0.684873


In [25]:
y_train.head()

22151    0
21572    0
8328     0
19787    0
12424    0
Name: defaulted, dtype: int64