# AC - Loan or Not to Loan

In this project we predict whether a loan application should be accepted. We train several prediction methods on a dataset of loan applications reviewed in the past, each labelled with its current status (paid, delayed, etc.).

For this we use Python libraries suited to this kind of task and a dataset provided by Kaggle. Let's start by importing the libraries and loading the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [79]:
account = pd.read_csv("./data/account.csv", na_values= ['?', 'NA', ''],sep= ';')
card_test = pd.read_csv("./data/card_test.csv",na_values= ['?', 'NA', ''],sep=  ';')
card_train = pd.read_csv("./data/card_train.csv",na_values= ['?', 'NA', ''],sep=  ';')
client = pd.read_csv("./data/client.csv",na_values= ['?', 'NA', ''],sep=  ';')
disposition = pd.read_csv("./data/disp.csv",na_values= ['?', 'NA', ''],sep=  ';')
district = pd.read_csv("./data/district.csv",na_values= ['?', 'NA', ''],sep= ';')
loan_test = pd.read_csv("./data/loan_test.csv", na_values= ['?', 'NA', ''],sep= ';')
loan_train = pd.read_csv("./data/loan_train.csv",na_values= ['?', 'NA', ''],sep=  ';')
transference_test = pd.read_csv("./data/trans_test.csv",na_values= ['?', 'NA', ''],sep=  ';')
transference_train = pd.read_csv("./data/trans_train.csv",na_values= ['?', 'NA', ''],sep=  ';')

After reading all the CSV files into pandas DataFrames, we merge them into a single table and clean the data as we go, comparing the tables during each merge to avoid repeated columns. We must also verify that no merge creates new rows: since the Disposition DataFrame is the primary table that all the others are merged into, every result should have the same number of rows as that table. We also save the intermediate data to a CSV file after every merge.
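The row-count check described above can also be delegated to pandas. The following is a minimal sketch on toy data rather than the project's tables: `merge_checked` is a hypothetical helper, and the `many_to_one` relationship is an assumption about how disposition rows relate to the table being merged in.

```python
import pandas as pd

def merge_checked(left, right, on, validate="many_to_one"):
    """Inner-merge `right` into `left` and fail loudly if the row count changes."""
    # validate="many_to_one" makes pandas raise MergeError if `right` has duplicate keys
    merged = left.merge(right, on=on, how="inner", validate=validate)
    if len(merged) != len(left):
        raise ValueError(
            f"merge on {on} changed the row count: {len(left)} -> {len(merged)}")
    return merged

# toy stand-ins for the disposition and client tables
disp_toy = pd.DataFrame({"disp_id": [1, 2, 3], "client_id": [10, 11, 10]})
client_toy = pd.DataFrame({"client_id": [10, 11], "district_id": [5, 7]})

merged_toy = merge_checked(disp_toy, client_toy, on="client_id")
print(merged_toy.shape[0])
```

With a check like this, a merge that duplicates or drops rows stops the notebook instead of relying on someone reading the printed counts.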

In [80]:
#merge 1
disposition_client = disposition.merge(client, on=["client_id"], how='inner')
disposition_client = disposition_client.rename(columns={'district_id': 'client_district_id'})

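# note: the index is written too (no index=False), so it comes back as an "Unnamed: 0" column when this file is re-read
disposition_client.to_csv("disp_client.csv")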

#always check if no new rows were created between merges
print("---------------MERGE 1---------------")
print("Rows of disposition:", disposition.shape[0])
print("Rows of disposition + client:", disposition_client.shape[0])
print("-------------------------------------\n")


#merge 2
disposition_client_account = disposition_client.merge(
    account, on=["account_id"], how="inner")
disposition_client_account = disposition_client_account.rename(
    columns={'district_id': 'account_district_id'})
    
disposition_client_account.to_csv("disp_cli_acc.csv", index=False)

#always check if no new rows were created between merges
print("---------------MERGE 2---------------")
print("Rows of MERGE 1:", disposition_client.shape[0])
print("Rows of MERGE 1 + account:", disposition_client_account.shape[0])
print("-------------------------------------\n")


# If a client's district can differ from the account's district, we must first join with
# district.csv to get the client's district info, then join with district.csv again after
# the account merge to get the account's district info, renaming columns between the joins.
# Check whether client_district_id always equals account_district_id; if so, one column is enough.

print("Check if Client's district is always the same as associated Account's district:"+"\n")
comparison_column = np.where(disposition_client_account["client_district_id"] == disposition_client_account["account_district_id"], True, False)
print("Comparison between the two district columns:", comparison_column)
print("Unique values of the comparison column:", np.unique(comparison_column))


---------------MERGE 1---------------
Rows of disposition: 5369
Rows of disposition + client: 5369
-------------------------------------

---------------MERGE 2---------------
Rows of MERGE 1: 5369
Rows of MERGE 1 + account: 5369
-------------------------------------

Check if Client's district is always the same as associated Account's district:

Comparison between the two district columns: [ True  True  True ...  True  True  True]
Unique values of the comparison column: [False  True]


The comparison column contains "False" values, which means that **there are accounts whose district differs from that of the associated client**. We therefore have to go one step back: before merging the accounts, we first attach the district data to the client, and only then proceed with the account merge (previously Merge 2).
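To see how many rows are affected, rather than only whether any are, the comparison can be summed. This is a small sketch on toy data; the two column names match the ones used in this notebook, but the values are made up for illustration.

```python
import pandas as pd

# toy stand-in for the merged disposition/client/account table
toy = pd.DataFrame({
    "client_district_id": [1, 2, 3, 4],
    "account_district_id": [1, 2, 5, 4],
})

# boolean mask of rows where the two districts differ
mismatch = toy["client_district_id"] != toy["account_district_id"]
n_mismatch = int(mismatch.sum())

print(f"{n_mismatch} of {len(toy)} rows have different districts")
print(toy[mismatch])
```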

In [81]:
#new Merge 2
disposition_client = pd.read_csv("disp_client.csv")

disposition_client = disposition_client.rename(
    columns={'client_district_id': 'district_id'})

district = district.rename(columns={'code ': 'district_id'})  # the key needs the trailing space ('code '): the headers in district.csv appear to end with a space, so plain 'code' is not found

disposition_client_district = disposition_client.merge(
    district, on=["district_id"], how='inner')

disposition_client_district = disposition_client_district.rename(
    columns={'district_id': 'client_district_id', 'name ': 'client_district_name', 
    'region': 'client_district_region', 'no. of inhabitants': 'client_district_no. of inhabitants', 
    'no. of municipalities with inhabitants < 499 ': 'client_district_no. of municipalities with inhabitants < 499', 
    'no. of municipalities with inhabitants 500-1999': 'client_district_no. of municipalities with inhabitants 500-1999', 
    'no. of municipalities with inhabitants 2000-9999 ': 'client_district_no. of municipalities with inhabitants 2000-9999', 
    'no. of municipalities with inhabitants >10000 ': 'client_district_no. of municipalities with inhabitants >10000', 
    'no. of cities ': 'client_district_no. of cities', 'ratio of urban inhabitants ': 'client_district_ratio of urban inhabitants', 
    'average salary ': 'client_district_average salary', 'unemploymant rate \'95 ': 'client_district_unemploymant rate \'95', 
    'unemploymant rate \'96 ': 'client_district_unemploymant rate \'96', 
    'no. of enterpreneurs per 1000 inhabitants ': 'client_district_no. of enterpreneurs per 1000 inhabitants', 
    'no. of commited crimes \'95 ': 'client_district_no. of commited crimes \'95', 
    'no. of commited crimes \'96 ': 'client_district_no. of commited crimes \'96'})

disposition_client_district.to_csv("disp_client-d.csv", index=False)

#always check if no new rows were created between merges
print("---------------New MERGE 2---------------")
print("Rows of MERGE 1:", disposition_client.shape[0])
print("Rows of MERGE 1 + district:", disposition_client_district.shape[0])
print("-------------------------------------\n")


---------------New MERGE 2---------------
Rows of MERGE 1: 5369
Rows of MERGE 1 + district: 5369
-------------------------------------



Now that we have the client's district data, we can merge the account data and then the account's district data.

In [82]:
#merge 3
disposition_client_d_account = disposition_client_district.merge(
    account, on=["account_id"], how="inner")

disposition_client_d_account.to_csv("disp_cli-d_acc.csv", index=False)

#always check if no new rows were created between merges
print("---------------MERGE 3---------------")
print("Rows of MERGE 2:", disposition_client_district.shape[0])
print("Rows of MERGE 2 + account:", disposition_client_d_account.shape[0])
print("-------------------------------------\n")


#merge 4
disposition_client_account_districts = disposition_client_d_account.merge(district, on=["district_id"], how="inner")
disposition_client_account_districts = disposition_client_account_districts.rename(
    columns={'district_id': 'account_district_id', 'name ': 'account_district_name',
             'region': 'account_district_region', 'no. of inhabitants': 'account_district_no. of inhabitants',
             'no. of municipalities with inhabitants < 499 ': 'account_district_no. of municipalities with inhabitants < 499',
             'no. of municipalities with inhabitants 500-1999': 'account_district_no. of municipalities with inhabitants 500-1999',
             'no. of municipalities with inhabitants 2000-9999 ': 'account_district_no. of municipalities with inhabitants 2000-9999',
             'no. of municipalities with inhabitants >10000 ': 'account_district_no. of municipalities with inhabitants >10000',
             'no. of cities ': 'account_district_no. of cities', 'ratio of urban inhabitants ': 'account_district_ratio of urban inhabitants',
             'average salary ': 'account_district_average salary', 'unemploymant rate \'95 ': 'account_district_unemploymant rate \'95',
             'unemploymant rate \'96 ': 'account_district_unemploymant rate \'96',
             'no. of enterpreneurs per 1000 inhabitants ': 'account_district_no. of enterpreneurs per 1000 inhabitants',
             'no. of commited crimes \'95 ': 'account_district_no. of commited crimes \'95',
             'no. of commited crimes \'96 ': 'account_district_no. of commited crimes \'96'})


disposition_client_account_districts.to_csv("disp_cli_acc_dist.csv", index=False)

#always check if no new rows were created between merges
print("---------------MERGE 4---------------")
print("Rows of MERGE 3:", disposition_client_d_account.shape[0])
print("Rows of MERGE 3 + district:", disposition_client_account_districts.shape[0])
print("-------------------------------------\n")


---------------MERGE 3---------------
Rows of MERGE 2: 5369
Rows of MERGE 2 + account: 5369
-------------------------------------

---------------MERGE 4---------------
Rows of MERGE 3: 5369
Rows of MERGE 3 + district: 5369
-------------------------------------
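The two long rename dictionaries above differ only in their prefix. A possible way to shorten them is sketched below on a toy table; `prefixed_district` is a hypothetical helper, and stripping the whitespace from the headers is an assumption based on the trailing spaces visible in the column names.

```python
import pandas as pd

def prefixed_district(district_df, prefix, key="district_id"):
    """Strip whitespace from column names and prefix every column except the merge key."""
    out = district_df.copy()
    out.columns = out.columns.str.strip()
    return out.rename(columns={c: f"{prefix}{c}" for c in out.columns if c != key})

# toy stand-in for the district table, with trailing spaces in the headers
district_toy = pd.DataFrame({
    "district_id": [1, 2],
    "name ": ["A", "B"],
    "average salary ": [9000, 8500],
})

client_district_toy = prefixed_district(district_toy, "client_district_")
print(list(client_district_toy.columns))
```

The key column is left unprefixed so the result can still be merged on it; it can be renamed after the merge, as the cells above do.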



The client's birth number encodes both the birth date and the gender: it has the form YYMMDD, and for women 50 is added to the month. This encoding is not directly usable for prediction, so let's undo the offset to recover a plain date and create a new column "gender" to hold the information it carried.

In [83]:
static_dataset = pd.read_csv("disp_cli_acc_dist.csv")

# women have 50 added to the month, so both month digits must be read
static_dataset['gender'] = static_dataset['birth_number'].apply(
    lambda x: 'M' if int(str(x)[2:4]) <= 12 else 'F')

def fix_birthday(x):
    """Return the birth number as YYMMDD with the +50 month offset removed."""
    s = str(x)
    month = int(s[2:4])
    if month <= 12:
        return x
    # zero-pad the corrected month so the result keeps six digits
    return int(s[0:2] + str(month - 50).zfill(2) + s[4:])

static_dataset["birth_number"] = static_dataset["birth_number"].apply(fix_birthday)

print(static_dataset['birth_number'])

0       701213
1       780313
2       350708
3       800413
4       791021
5       460224
6       410124
7       660929
8       791123
9       440203
10      390806
11      790218
12      611027
13      520522
14      440204
15      320306
16      261018
17      460716
18      730102
19      240909
20      210507
21      700521
22      630225
23      650122
24      680413
25      730410
26      230322
27      410206
28      470908
29      630722
         ...  
5339    480108
5340    651115
5341    570909
5342    710106
5343    800312
5344    610725
5345    660820
5346    201025
5347    230108
5348    410127
5349    410225
5350    650828
5351    730512
5352    510709
5353    771122
5354    760506
5355    190823
5356    460430
5357    550411
5358    760916
5359    840918
5360    600606
5361    600308
5362    620102
5363    681030
5364    680417
5365    411104
5366    630515
5367    570725
5368    561206
Name: birth_number, Length: 5369, dtype: int64
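If a real date is more convenient than a YYMMDD integer, the same decoding can return a timestamp. This is only a sketch: `decode_birth_number` is a hypothetical helper, and it assumes every client was born in the 1900s, which the two-digit year alone cannot confirm.

```python
import pandas as pd

def decode_birth_number(birth_number):
    """Split a YYMMDD birth number (women: month + 50) into gender and birth date."""
    s = f"{int(birth_number):06d}"
    year, month, day = int(s[:2]), int(s[2:4]), int(s[4:])
    gender = "F" if month > 50 else "M"
    if month > 50:
        month -= 50
    # assumption: all birth years fall in the 1900s
    return gender, pd.Timestamp(year=1900 + year, month=month, day=day)

print(decode_birth_number(706213))
print(decode_birth_number(450204))
```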


In [84]:
print(static_dataset["client_district_unemploymant rate '95"].isnull().sum())
print(static_dataset["client_district_no. of commited crimes '95"].isnull().sum())
print(static_dataset["account_district_unemploymant rate '95"].isnull().sum())
print(static_dataset["account_district_no. of commited crimes '95"].isnull().sum())
static_dataset.dtypes

61
61
57
57


Unnamed: 0                                                             int64
disp_id                                                                int64
client_id                                                              int64
account_id                                                             int64
type                                                                  object
birth_number                                                           int64
client_district_id                                                     int64
client_district_name                                                  object
client_district_region                                                object
client_district_no. of inhabitants                                     int64
client_district_no. of municipalities with inhabitants < 499           int64
client_district_no. of municipalities with inhabitants 500-1999        int64
client_district_no. of municipalities with inhabitants 2000-9999       int64

As we can see, four columns that should be numeric contain missing or non-numeric entries: client_district_unemploymant rate '95 and client_district_no. of commited crimes '95 (61 rows each), and account_district_unemploymant rate '95 and account_district_no. of commited crimes '95 (57 rows each). To fix this, we coerce each column to numeric and replace the entries that cannot be parsed with the mean of the remaining values.
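Before overwriting anything, it can help to look at which raw entries fail to parse. The sketch below uses a made-up toy column (the values `"?"` and `"abc"` are illustrative, not taken from the dataset) and then applies the same mean replacement.

```python
import pandas as pd

# toy column mixing numeric strings, a placeholder, a missing value and junk
raw = pd.Series(["1.5", "?", "2.5", None, "abc"], name="toy rate")
as_number = pd.to_numeric(raw, errors="coerce")

# entries that were present but could not be parsed as numbers
bad_values = raw[as_number.isna() & raw.notna()].unique().tolist()
print(bad_values)

# replace every unparsable or missing entry with the mean of the valid ones
filled = as_number.fillna(as_number.mean())
print(filled.tolist())
```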

In [85]:

# coerce the four '95 columns to numeric (unparsable entries become NaN),
# then fill every NaN with the mean of that column
columns_95 = [
    "client_district_unemploymant rate '95",
    "client_district_no. of commited crimes '95",
    "account_district_unemploymant rate '95",
    "account_district_no. of commited crimes '95",
]
for col in columns_95:
    static_dataset[col] = pd.to_numeric(static_dataset[col], errors='coerce')
    static_dataset[col] = static_dataset[col].fillna(static_dataset[col].mean())

static_dataset.dtypes


Unnamed: 0                                                             int64
disp_id                                                                int64
client_id                                                              int64
account_id                                                             int64
type                                                                  object
birth_number                                                           int64
client_district_id                                                     int64
client_district_name                                                  object
client_district_region                                                object
client_district_no. of inhabitants                                     int64
client_district_no. of municipalities with inhabitants < 499           int64
client_district_no. of municipalities with inhabitants 500-1999        int64
client_district_no. of municipalities with inhabitants 2000-9999       int64

In [86]:
static_dataset.to_csv("static_dataset.csv", index=False)

In [87]:
# suffixes follow the merge order: left = account-level static data, right = loans
train_dataset = pd.merge(static_dataset, loan_train, on='account_id', suffixes=('_account', '_loan'))
train_dataset = pd.merge(train_dataset, card_train, on = 'disp_id', how = 'outer', suffixes = ('', '_card'))
# transactions are joined per account, so each account's rows repeat once per transaction
train_dataset = pd.merge(train_dataset, transference_train, on = 'account_id', suffixes = ('', '_trans'))
train_dataset.info()
#train_dataset.to_csv("train_dataset.csv", index=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30373 entries, 0 to 30372
Data columns (total 59 columns):
Unnamed: 0                                                           30373 non-null float64
disp_id                                                              30373 non-null int64
client_id                                                            30373 non-null float64
account_id                                                           30373 non-null float64
type                                                                 30373 non-null object
birth_number                                                         30373 non-null float64
client_district_id                                                   30373 non-null float64
client_district_name                                                 30373 non-null object
client_district_region                                               30373 non-null object
client_district_no. of inhabitants                                  

Converting nominal data columns to numerical values

Frequency column:
Issuance after transaction : 0
Weekly issuance : 1
Monthly issuance : 2

In [76]:
static_dataset_int = static_dataset.copy()  # copy, so static_dataset keeps the original labels
# map each label to an integer code; any label not listed here becomes NaN
frequency_codes = {'issuance after transaction': 0, 'weekly issuance': 1, 'monthly issuance': 2}
static_dataset_int['frequency'] = static_dataset_int['frequency'].map(frequency_codes)
static_dataset_int.to_csv("dataset_int.csv", index=False)
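Since the three frequency labels have a natural order, an ordered categorical is another way to obtain the codes listed above. This is a sketch on a toy series; the `"unknown label"` entry is invented to show how unexpected values are handled.

```python
import pandas as pd

# category order defines the integer codes: 0, 1, 2
order = ["issuance after transaction", "weekly issuance", "monthly issuance"]
freq_toy = pd.Series([
    "monthly issuance",
    "weekly issuance",
    "issuance after transaction",
    "unknown label",
])

# labels outside `order` get the code -1
codes = pd.Categorical(freq_toy, categories=order, ordered=True).codes
print(codes.tolist())
```

A code of -1 makes unexpected labels easy to spot before the column is passed to a model.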