# Source

https://www.kaggle.com/datasets/saurabhbagchi/dish-network-hackathon

Predict whether a client will default on the vehicle loan payment or not

A non-banking financial institution (NBFI) or non-bank financial company (NBFC) is a financial institution that does not have a full banking license or is not supervised by a national or international banking regulatory agency. NBFCs facilitate bank-related financial services, such as investment, risk pooling, contractual savings, and market brokering.

An NBFI is struggling to make a profit due to an increase in defaults in the vehicle loan category. The company aims to assess each client’s ability to repay the loan and to understand the relative importance of each parameter contributing to that ability.

Goal:

The goal of the problem is to predict whether a client will default on the vehicle loan payment or not. For each ID in the Test_Dataset, you must predict the “Default” level.

Datasets

The problem contains two datasets, Train_Dataset and Test_Dataset. Models are built on Train_Dataset and tested on Test_Dataset. The predictions for Test_Dataset are submitted to the hackathon platform.

Metric to measure

The evaluation metric is the F1 score, the harmonic mean of recall and precision. In this hackathon, you will get the F1 score of 1 (presumably the F1 score computed for class 1, i.e. Default = 1). For more details on the F1 score, see https://en.wikipedia.org/wiki/F-score
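As a minimal sketch of the metric, the function below computes F1 from raw counts without any external library. It assumes the positive class is `Default = 1`; the function name `f1_score_binary` and the toy labels are made up for illustration and are not part of the competition material.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for a single class, computed from true/false positive counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # Common convention: F1 is 0 when there are no true positives.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for this example only.
y_true = [0, 0, 1, 1, 1, 0, 0, 0]
y_pred = [0, 1, 1, 1, 0, 0, 0, 0]
score = f1_score_binary(y_true, y_pred)
print(round(score, 4))
```

Here there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 and so is F1.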

Submission File Format:

You should submit a CSV file with exactly 80900 entries plus a header row.

The file should have exactly two columns:

● ID (sorted in any order)
● Default (contains 0 & 1, 1 represents default)
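The format rules above can be checked before uploading. The sketch below is one possible validator, not an official one: the helper name `check_submission` is hypothetical, and the requirement that IDs be unique is an assumption that the rules do not state explicitly.

```python
import pandas as pd

EXPECTED_ROWS = 80900  # row count taken from the submission rules


def check_submission(sub: pd.DataFrame, expected_rows: int = EXPECTED_ROWS) -> list:
    """Return a list of problems; an empty list means the frame looks valid."""
    problems = []
    if len(sub.columns) != 2 or set(sub.columns) != {'ID', 'Default'}:
        # Without the right columns the remaining checks cannot run.
        return [f'unexpected columns: {list(sub.columns)}']
    if len(sub) != expected_rows:
        problems.append(f'expected {expected_rows} rows, got {len(sub)}')
    if sub['ID'].duplicated().any():
        # Assumption: each test ID should appear exactly once.
        problems.append('duplicate IDs found')
    if not sub['Default'].isin([0, 1]).all():
        problems.append('Default must contain only 0 and 1')
    return problems


# Toy frames, invented for illustration; a real file has 80900 rows.
toy = pd.DataFrame({'ID': [101, 102, 103], 'Default': [0, 1, 0]})
problems_ok = check_submission(toy, expected_rows=3)
problems_bad = check_submission(toy.assign(Default=[0, 2, 0]), expected_rows=3)
print(problems_ok)
print(problems_bad)
```

When writing the real file, `to_csv(..., index=False)` keeps the pandas index out of the CSV so that only the two required columns are present.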

# Init

In [2]:
import numpy as np
import pandas as pd

In [3]:
!pwd

/home/dmdp/workspace/datasets/car_loan_default


# Data

In [4]:
df_dict = pd.read_csv('v0/Data_Dictionary.csv')

In [5]:
df_dict

Unnamed: 0,Variable,Description
0,ID,Client Loan application ID
1,Client_Income,Client Income in $
2,Car_Owned,Any Car owned by client before applying for th...
3,Bike_Owned,Any bike owned by client (0 means No and 1 mea...
4,Active_Loan,Any other active loan at the time of aplicatio...
5,House_Own,Any house owned by client (0 means No and 1 me...
6,Child_Count,Number of children the client has
7,Credit_Amount,Credit amount of the loan in $
8,Loan_Annuity,Loan annuity in $
9,Accompany_Client,Who accompanied the client when client applied...


In [6]:
df = pd.read_csv('v0/Train_Dataset.csv', low_memory=False)

In [7]:
df

Unnamed: 0,ID,Client_Income,Car_Owned,Bike_Owned,Active_Loan,House_Own,Child_Count,Credit_Amount,Loan_Annuity,Accompany_Client,...,Client_Permanent_Match_Tag,Client_Contact_Work_Tag,Type_Organization,Score_Source_1,Score_Source_2,Score_Source_3,Social_Circle_Default,Phone_Change,Credit_Bureau,Default
0,12142509,6750,0.0,0.0,1.0,0.0,0.0,61190.55,3416.85,Alone,...,Yes,Yes,Self-employed,0.568066,0.478787,,0.0186,63.0,,0
1,12138936,20250,1.0,0.0,1.0,,0.0,15282,1826.55,Alone,...,Yes,Yes,Government,0.563360,0.215068,,,,,0
2,12181264,18000,0.0,0.0,1.0,0.0,1.0,59527.35,2788.2,Alone,...,Yes,Yes,Self-employed,,0.552795,0.329655054,0.0742,277.0,0.0,0
3,12188929,15750,0.0,0.0,1.0,1.0,0.0,53870.4,2295.45,Alone,...,Yes,Yes,XNA,,0.135182,0.631354537,,1700.0,3.0,0
4,12133385,33750,1.0,0.0,1.0,0.0,2.0,133988.4,3547.35,Alone,...,Yes,Yes,Business Entity Type 3,0.508199,0.301182,0.355638717,0.2021,674.0,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
121851,12207714,29250,0.0,0.0,,1.0,0.0,107820,3165.3,Relative,...,Yes,No,Business Entity Type 2,,0.173527,0.184116156,0.0577,0.0,1.0,1
121852,12173765,15750,0.0,1.0,1.0,0.0,0.0,104256,3388.05,Alone,...,Yes,Yes,Self-employed,,0.371559,0.406617437,0.0825,4.0,0.0,0
121853,12103937,8100,0.0,1.0,0.0,1.0,1.0,55107.9,2989.35,Alone,...,No,No,Trade: type 6,0.169049,0.048079,,,0.0,,0
121854,12170623,38250,1.0,1.0,0.0,1.0,0.0,45000,2719.35,Alone,...,Yes,Yes,Business Entity Type 3,0.182737,0.103538,0.077498546,0.0979,0.0,2.0,0


In [8]:
df['Default'].mean()

0.0807920824579832
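About 8% of the training rows are defaults, so the classes are heavily imbalanced. The arithmetic below is a small sketch of what that implies for two trivial baselines on the training set, assuming F1 is computed for class 1: predicting 1 for everyone gives an F1 of roughly 0.15, while predicting 0 for everyone reaches about 92% accuracy but an F1 of 0.

```python
default_rate = 0.0807920824579832  # df['Default'].mean() from the cell above

# Baseline 1: predict 1 for every client.
# Every true default is caught (recall = 1); precision equals the default rate.
precision_all_ones = default_rate
recall_all_ones = 1.0
f1_all_ones = (
    2 * precision_all_ones * recall_all_ones
    / (precision_all_ones + recall_all_ones)
)

# Baseline 2: predict 0 for every client.
# There are no true positives, so F1 for class 1 is taken as 0 by convention.
f1_all_zeros = 0.0
accuracy_all_zeros = 1 - default_rate

print(f'F1 (all ones):        {f1_all_ones:.4f}')
print(f'F1 (all zeros):       {f1_all_zeros:.4f}')
print(f'Accuracy (all zeros): {accuracy_all_zeros:.4f}')
```

Any useful model therefore needs to beat an F1 of about 0.15 on class 1, and accuracy alone is a misleading yardstick here.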

In [9]:
df.columns = [name.lower() for name in df.columns]

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121856 entries, 0 to 121855
Data columns (total 40 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   id                          121856 non-null  int64  
 1   client_income               118249 non-null  object 
 2   car_owned                   118275 non-null  float64
 3   bike_owned                  118232 non-null  float64
 4   active_loan                 118221 non-null  float64
 5   house_own                   118195 non-null  float64
 6   child_count                 118218 non-null  float64
 7   credit_amount               118224 non-null  object 
 8   loan_annuity                117044 non-null  object 
 9   accompany_client            120110 non-null  object 
 10  client_income_type          118155 non-null  object 
 11  client_education            118211 non-null  object 
 12  client_marital_status       118383 non-null  object 
 13  client_gender 
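The non-null counts above show that most columns have some missing values; for example, `client_income` has 118249 non-null values out of 121856, so roughly 3% are missing. A compact way to see this for every column at once is sketched below on a toy frame (the values are invented); on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the real data.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'client_income': ['6750', None, '18000', '15750'],
    'car_owned': [0.0, 1.0, np.nan, np.nan],
})

# isna() marks missing cells; the column-wise mean is the share of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```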

# Preprocessing

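Several numeric fields are read as `object` (strings) because they contain stray markers such as `$`, `x` or `#VALUE!`. Strip those markers and convert the columns to floats.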

In [11]:
column_replacement_patterns = {
    'credit_amount': ['$'],
    'client_income': ['$'],
    'employed_days': ['x'],
    'loan_annuity': ['$', '#VALUE!'],
    'id_days': ['x'],
    'population_region_relative': ['@', '#'],
    'age_days': ['x'],
    'registration_days': ['x'],
    'score_source_3': ['&'],
}
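The notebook does not show how the patterns listed above were found. One possible way to discover them, sketched here under that caveat, is to list the values in an `object` column that fail to parse as numbers. The helper name `non_numeric_values` and the toy series are hypothetical.

```python
import pandas as pd


def non_numeric_values(series: pd.Series) -> pd.Series:
    """Count the distinct non-missing values that cannot be parsed as numbers."""
    as_text = series.dropna().astype(str).str.strip()
    # errors='coerce' turns anything unparseable into NaN instead of raising.
    parsed = pd.to_numeric(as_text, errors='coerce')
    return as_text[parsed.isna()].value_counts()


# Toy column mixing valid numbers, stray markers and a missing value.
toy = pd.Series(['6750', '$', '20250', '#VALUE!', None, '$', ' 3416.85 '])
bad = non_numeric_values(toy)
print(bad)
```

On the real data this could be called as `non_numeric_values(df['loan_annuity'])` for each `object` column that is expected to be numeric.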

In [12]:
for column, patterns in column_replacement_patterns.items():
    # Skip columns that pandas has already parsed as numeric.
    if df[column].dtype != object:
        continue
    print(f'Converting {column}')
    # Remove each stray marker as a literal string, not as a regex.
    for pattern in patterns:
        df[column] = df[column].str.replace(pattern, '', regex=False)
    # Values left empty after stripping become NaN; the rest are cast to float.
    df[column] = (
        df[column]
        .str.strip()
        .replace('', 'NaN')
        .astype('float')
    )

Converting credit_amount
Converting client_income
Converting employed_days
Converting loan_annuity
Converting id_days
Converting population_region_relative
Converting age_days
Converting registration_days
Converting score_source_3
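To make the effect of the loop above concrete, the sketch below applies the same steps to a small toy series. The function name `strip_and_cast` is introduced here for illustration only; the values are invented.

```python
import numpy as np
import pandas as pd


def strip_and_cast(series: pd.Series, patterns: list) -> pd.Series:
    """Remove literal patterns, trim whitespace, and cast to float."""
    cleaned = series
    for pattern in patterns:
        cleaned = cleaned.str.replace(pattern, '', regex=False)
    # Strings that are empty after stripping are mapped to NaN by the cast.
    return cleaned.str.strip().replace('', 'NaN').astype('float')


toy = pd.Series(['3416.85', '$', ' 1826.55 ', '#VALUE!', np.nan])
result = strip_and_cast(toy, ['$', '#VALUE!'])
print(result)
```

Entries that consist only of a marker turn into NaN, which is why the non-null counts for the converted columns shrink slightly after this step. Because `astype('float')` raises on any leftover unparseable string, an unexpected marker surfaces as an error instead of being silently dropped, as it would be with `pd.to_numeric(..., errors='coerce')`.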


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121856 entries, 0 to 121855
Data columns (total 40 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   id                          121856 non-null  int64  
 1   client_income               118234 non-null  float64
 2   car_owned                   118275 non-null  float64
 3   bike_owned                  118232 non-null  float64
 4   active_loan                 118221 non-null  float64
 5   house_own                   118195 non-null  float64
 6   child_count                 118218 non-null  float64
 7   credit_amount               118219 non-null  float64
 8   loan_annuity                117030 non-null  float64
 9   accompany_client            120110 non-null  object 
 10  client_income_type          118155 non-null  object 
 11  client_education            118211 non-null  object 
 12  client_marital_status       118383 non-null  object 
 13  client_gender 

In [14]:
df.sample(5).T

Unnamed: 0,53710,21430,74398,25099,85937
id,12110299,12149915,12114647,12128128,12147418
client_income,7650.0,22500.0,,9000.0,11250.0
car_owned,,0.0,0.0,0.0,1.0
bike_owned,1.0,0.0,0.0,1.0,0.0
active_loan,1.0,0.0,0.0,0.0,0.0
house_own,1.0,0.0,1.0,1.0,0.0
child_count,1.0,0.0,2.0,0.0,0.0
credit_amount,15638.4,80865.0,36000.0,31276.8,50849.55
loan_annuity,1655.1,2377.35,1370.25,2433.15,2154.15
accompany_client,Partner,Alone,Relative,Relative,Alone


# Save

In [15]:
df.to_csv('v1/train.zip')

df.shape

(121856, 40)
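Because `to_csv` is called without `index=False`, the row index is written as an unnamed first column. The sketch below shows, on a toy frame and a temporary file (the file name is hypothetical), how that column appears on reload and how `index_col=0` restores the original layout. pandas infers zip compression from the `.zip` extension for both writing and reading.

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy = pd.DataFrame({'id': [1, 2, 3], 'client_income': [6750.0, np.nan, 18000.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.zip')      # hypothetical file name
    toy.to_csv(path)                             # index written as an unnamed column
    naive = pd.read_csv(path)                    # index comes back as a data column
    restored = pd.read_csv(path, index_col=0)    # index restored from the first column

print(list(naive.columns))
print(list(restored.columns))
```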

In [16]:
df_reduced_1 = df \
    .dropna(subset=['credit_amount', 'client_income'])

df_reduced_1.to_csv('v1/train_reduced_1.zip')

df_reduced_1.shape

(114708, 40)
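`dropna(subset=...)` removes a row only when one of the listed columns is missing; gaps in other columns are kept. The sketch below illustrates this on a toy frame (values invented) and works out how much data the reduction above discards, using the two shapes printed in this section.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'credit_amount': [61190.55, np.nan, 59527.35, 53870.40],
    'client_income': [6750.0, 20250.0, np.nan, 15750.0],
    'loan_annuity': [3416.85, 1826.55, 2788.20, np.nan],
})

# Rows 1 and 2 are dropped; row 3 stays even though loan_annuity is missing.
reduced = toy.dropna(subset=['credit_amount', 'client_income'])
print(reduced.shape)

# Row counts taken from the shapes printed above.
rows_before, rows_after = 121856, 114708
dropped = rows_before - rows_after
dropped_share = dropped / rows_before
print(dropped, f'{dropped_share:.2%}')
```

The reduced training set therefore loses 7148 rows, a little under 6% of the original data.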