# AC - Loan or Not to Loan

In this project we predict whether a loan application should be accepted. The predictions are based on a training dataset of loan applications reviewed in the past, together with their current status (paid, delayed, etc.), and we compare several prediction methods.

For this we use Python libraries suited to this kind of task and a dataset provided by Kaggle. Let's start by importing the libraries and loading the data.

In [1]:
import pandas as pd
import numpy as np


In [2]:
# the files are semicolon-separated; `sep` must be passed by keyword in recent pandas versions
account = pd.read_csv("./data/account.csv", sep=';')
card_test = pd.read_csv("./data/card_test.csv", sep=';')
card_train = pd.read_csv("./data/card_train.csv", sep=';')
client = pd.read_csv("./data/client.csv", sep=';')
disposition = pd.read_csv("./data/disp.csv", sep=';')
district = pd.read_csv("./data/district.csv", sep=';')
loan_test = pd.read_csv("./data/loan_test.csv", sep=';')
loan_train = pd.read_csv("./data/loan_train.csv", sep=';')
transference_test = pd.read_csv("./data/trans_test.csv", sep=';')
transference_train = pd.read_csv("./data/trans_train.csv", sep=';')



After reading every CSV into a pandas DataFrame, we can merge the data into a single table and clean it as we go, comparing columns at each step to avoid duplicates. We must also verify that no merge creates new rows: since the Disposition dataframe is the primary table onto which all the others are merged, the result should always have the same number of rows as that table. After every merge we also save the intermediate result to a CSV file.
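The row-count check described above can also be enforced rather than only printed. The following is a minimal sketch on toy data: `merge_keep_rows` is a hypothetical helper that is not part of the notebook, and the toy tables contain made-up values. It relies on the `validate` argument of `DataFrame.merge`, which raises an error when the join key is duplicated on the right-hand side.

```python
import pandas as pd


def merge_keep_rows(left, right, key):
    """Inner-join `right` onto `left` and fail loudly if the row count changes.

    Hypothetical helper, not part of the original notebook.
    """
    # validate="many_to_one" makes pandas raise a MergeError if `key` is
    # duplicated in `right`, which is what would multiply rows.
    merged = left.merge(right, on=key, how="inner", validate="many_to_one")
    # An inner join can also silently drop rows whose key is missing in `right`.
    if len(merged) != len(left):
        raise ValueError(f"row count changed: {len(left)} -> {len(merged)}")
    return merged


# Toy stand-ins for the disposition and client tables (made-up values).
toy_disposition = pd.DataFrame({"disp_id": [1, 2, 3], "client_id": [10, 11, 12]})
toy_client = pd.DataFrame({"client_id": [10, 11, 12], "district_id": [18, 1, 5]})

toy_merged = merge_keep_rows(toy_disposition, toy_client, "client_id")
print("Rows before:", len(toy_disposition), "rows after:", len(toy_merged))
```

With a check like this, a merge that adds or loses rows stops the notebook immediately instead of depending on someone reading the printed counts.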

In [7]:
#merge 1
disposition_client = disposition.merge(client, on=["client_id"], how='inner')
disposition_client.rename(columns={'district_id': 'client_district_id'}, inplace=True, errors='raise')

disposition_client.to_csv("disp_client.csv", index=False)

#always check if no new rows were created between merges
print("---------------MERGE 1---------------")
print("Rows of disposition:", disposition.shape[0])
print("Rows of disposition + client:", disposition_client.shape[0])
print("-------------------------------------\n")


#merge 2
disposition_client_account = disposition_client.merge(
    account, on=["account_id"], how="inner")
disposition_client_account.rename(
    columns={'district_id': 'account_district_id'}, inplace=True, errors='raise')
    
disposition_client_account.to_csv("disp_cli_acc.csv", index=False)

#always check if no new rows were created between merges
print("---------------MERGE 2---------------")
print("Rows of MERGE 1:", disposition_client.shape[0])
print("Rows of MERGE 1 + account:", disposition_client_account.shape[0])
print("-------------------------------------\n")


# If a client's district can differ from the account's district, district.csv has to be joined twice:
# first for the client's district, then (after joining the accounts) for the account's district.
# Don't forget to rename the columns between joins.
# Check whether client_district_id always equals account_district_id; if so, one of the columns is enough.

print("Check if Client's district is always the same as associated Account's district:\n")
comparison_column = (disposition_client_account["client_district_id"]
                     == disposition_client_account["account_district_id"]).to_numpy()
print("Comparison between the two district columns:", comparison_column)
print("Unique values of the comparison column:", np.unique(comparison_column))


---------------MERGE 1---------------
Rows of disposition: 5369
Rows of disposition + client: 5369
-------------------------------------

---------------MERGE 2---------------
Rows of MERGE 1: 5369
Rows of MERGE 1 + account: 5369
-------------------------------------

Check if Client's district is always the same as associated Account's district:

Comparison between the two district columns: [ True  True  True ...  True  True  True]
Unique values of the comparison column: [False  True]


The comparison column contains "False" values, which means that **some accounts belong to a different district than their associated clients**. We therefore have to go one step back: before the second merge we first associate the district data with the client, and only then proceed with the previously mentioned Merge 2.
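To see how many rows are affected, and which ones, the comparison can be kept as a boolean mask. This is a minimal sketch on toy data with made-up values; only the two district column names are taken from the notebook.

```python
import pandas as pd

# Toy rows (made-up values) using the same district column names as the merged table.
toy = pd.DataFrame({
    "client_id": [1, 2, 3, 4],
    "client_district_id": [18, 1, 5, 12],
    "account_district_id": [18, 1, 7, 12],
})

# Boolean Series: True where the client and the account are in different districts.
differs = toy["client_district_id"] != toy["account_district_id"]

n_mismatches = int(differs.sum())  # number of rows where the districts differ
mismatched_rows = toy[differs]     # the rows themselves, for inspection

print("Rows with different districts:", n_mismatches)
print(mismatched_rows)
```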

In [11]:
#new Merge 2
disposition_client = pd.read_csv("disp_client.csv")

disposition_client.rename(
    columns={'client_district_id': 'district_id'}, inplace=True, errors='raise')

disposition_client_district = disposition_client.merge(
    district, on=["district_id"], how='inner')

disposition_client_district.rename(
    columns={'district_id': 'client_district_id', 'name ': 'client_district_name', 
    'region': 'client_district_region', 'no. of inhabitants': 'client_district_no. of inhabitants', 
    'no. of municipalities with inhabitants < 499 ': 'client_district_no. of municipalities with inhabitants < 499', 
    'no. of municipalities with inhabitants 500-1999': 'client_district_no. of municipalities with inhabitants 500-1999', 
    'no. of municipalities with inhabitants 2000-9999 ': 'client_district_no. of municipalities with inhabitants 2000-9999', 
    'no. of municipalities with inhabitants >10000 ': 'client_district_no. of municipalities with inhabitants >10000', 
    'no. of cities ': 'client_district_no. of cities', 'ratio of urban inhabitants ': 'client_district_ratio of urban inhabitants', 
    'average salary ': 'client_district_average salary', 'unemploymant rate \'95 ': 'client_district_unemploymant rate \'95', 
    'unemploymant rate \'96 ': 'client_district_unemploymant rate \'96', 
    'no. of enterpreneurs per 1000 inhabitants ': 'client_district_no. of enterpreneurs per 1000 inhabitants', 
    'no. of commited crimes \'95 ': 'client_district_no. of commited crimes \'95', 
    'no. of commited crimes \'96 ': 'client_district_no. of commited crimes \'96'}, inplace=True, errors='raise')

disposition_client_district.to_csv("disp_client-d.csv", index=False)

#always check if no new rows were created between merges
print("---------------New MERGE 2---------------")
print("Rows of MERGE 1:", disposition_client.shape[0])
print("Rows of MERGE 1 + district:", disposition_client_district.shape[0])
print("-------------------------------------\n")


      disp_id  client_id  account_id       type  birth_number  \
0           1          1           1      OWNER        706213   
1         420        420         343      OWNER        780313   
2         499        499         413      OWNER        355708   
3         519        519         431      OWNER        800413   
4         682        682         568      OWNER        791021   
...       ...        ...         ...        ...           ...   
5364     9622       9930        8039      OWNER        720623   
5365     9762      10070        8153      OWNER        740423   
5366    10958      11266        9153      OWNER        380925   
5367    10959      11267        9153  DISPONENT        365826   
5368    11393      11701        9504      OWNER        490414   

      client_district_id client_district_name client_district_region  \
0                     18                Pisek          south Bohemia   
1                     18                Pisek          south Bohemia   
2  

Now that we have the client's district data, we can move on and merge the account data and, after that, the account's district data.

In [16]:
#merge 3
disposition_client_d_account = disposition_client_district.merge(
    account, on=["account_id"], how="inner")

disposition_client_d_account.to_csv("disp_cli-d_acc.csv", index=False)

#always check if no new rows were created between merges
print("---------------MERGE 3---------------")
print("Rows of MERGE 2:", disposition_client_district.shape[0])
print("Rows of MERGE 2 + account:", disposition_client_d_account.shape[0])
print("-------------------------------------\n")


#merge 4
disposition_client_account_districts = disposition_client_d_account.merge(district, on=["district_id"], how="inner")
disposition_client_account_districts.rename(
    columns={'district_id': 'account_district_id', 'name ': 'account_district_name',
             'region': 'account_district_region', 'no. of inhabitants': 'account_district_no. of inhabitants',
             'no. of municipalities with inhabitants < 499 ': 'account_district_no. of municipalities with inhabitants < 499',
             'no. of municipalities with inhabitants 500-1999': 'account_district_no. of municipalities with inhabitants 500-1999',
             'no. of municipalities with inhabitants 2000-9999 ': 'account_district_no. of municipalities with inhabitants 2000-9999',
             'no. of municipalities with inhabitants >10000 ': 'account_district_no. of municipalities with inhabitants >10000',
             'no. of cities ': 'account_district_no. of cities', 'ratio of urban inhabitants ': 'account_district_ratio of urban inhabitants',
             'average salary ': 'account_district_average salary', 'unemploymant rate \'95 ': 'account_district_unemploymant rate \'95',
             'unemploymant rate \'96 ': 'account_district_unemploymant rate \'96',
             'no. of enterpreneurs per 1000 inhabitants ': 'account_district_no. of enterpreneurs per 1000 inhabitants',
             'no. of commited crimes \'95 ': 'account_district_no. of commited crimes \'95',
             'no. of commited crimes \'96 ': 'account_district_no. of commited crimes \'96'}, inplace=True, errors='raise')


disposition_client_account_districts.to_csv("disp_cli_acc_dist.csv", index=False)

#always check if no new rows were created between merges
print("---------------MERGE 4---------------")
print("Rows of MERGE 3:", disposition_client_d_account.shape[0])
print("Rows of MERGE 3 + district:", disposition_client_account_districts.shape[0])
print("-------------------------------------\n")


---------------MERGE 3---------------
Rows of MERGE 2: 5369
Rows of MERGE 2 + account: 5369
-------------------------------------

---------------MERGE 4---------------
Rows of MERGE 3: 5369
Rows of MERGE 3 + district: 5369
-------------------------------------
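The two long `rename` dictionaries above differ only in their prefix, so they could be generated instead of typed out. The sketch below is one possible way to do that, not the approach used in this notebook: `prefixed_district` is a hypothetical helper, and the toy district table holds made-up values with only a few of the real columns.

```python
import pandas as pd


def prefixed_district(district, prefix, key="district_id"):
    """Return a copy of `district` with stripped and prefixed column names.

    Hypothetical helper: `key` becomes "<prefix>_id", every other column
    becomes "<prefix>_<column>".
    """
    # Some headers carry trailing spaces (e.g. "name "), so strip them first.
    cleaned = district.rename(columns=str.strip)
    mapping = {
        col: (f"{prefix}_id" if col == key else f"{prefix}_{col}")
        for col in cleaned.columns
    }
    return cleaned.rename(columns=mapping)


# Toy district table (made-up values, subset of the real columns).
toy_district = pd.DataFrame({
    "district_id": [1, 18],
    "name ": ["Praha", "Pisek"],
    "region": ["Prague", "south Bohemia"],
    "average salary ": [12000, 9000],
})
toy_clients = pd.DataFrame({"client_id": [1, 2, 3], "client_district_id": [18, 1, 18]})

client_district = prefixed_district(toy_district, "client_district")
toy_with_district = toy_clients.merge(client_district, on="client_district_id", how="inner")

print(list(client_district.columns))
print(toy_with_district.shape)
```

Calling the same helper with the prefix `"account_district"` would produce the columns for the account's district, which keeps both renames consistent with each other.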



The client's birth number is stored in a specific format that also tells us whether the client is a man or a woman: it is a YYMMDD number in which the month is at most 12 for men and is offset for women (in this dataset, 50 is added to the month). For prediction purposes this format isn't very readable or usable. Let's reverse the date conversion and create a new column "gender" to hold the information that was encoded in the date.

In [18]:
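static_dataset = pd.read_csv("disp_cli_acc_dist.csv")

# birth_number is an integer YYMMDD, so the month is extracted arithmetically
birth_month = (static_dataset["birth_number"] // 100) % 100

# a month of at most 12 means the client is a man; women have 50 added to the month
static_dataset["gender"] = np.where(birth_month <= 12, "M", "F")

# undo the +50 month offset so every birth_number is a plain YYMMDD value
static_dataset["birth_number"] = np.where(
    birth_month <= 12, static_dataset["birth_number"], static_dataset["birth_number"] - 5000)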


0       706213
1       780313
2       355708
3       800413
4       791021
         ...  
5364    680417
5365    416104
5366    635515
5367    575725
5368    561206
Name: birth_number, Length: 5369, dtype: int64
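Going one step further, the birth number can be decoded into a proper date. This is a minimal sketch under two assumptions that the notebook does not state explicitly: women have exactly 50 added to the month, and every client was born in the 1900s. `decode_birth_number` is a hypothetical helper, and the sample values are the first three birth numbers from the dataframe preview shown earlier.

```python
import pandas as pd


def decode_birth_number(birth_number):
    """Split an integer YYMMDD birth number into (gender, birth date).

    Assumptions: women have 50 added to the month, and all years are 19YY.
    """
    year = birth_number // 10000
    month = (birth_number // 100) % 100
    day = birth_number % 100

    gender = "M" if month <= 12 else "F"
    if gender == "F":
        month -= 50  # undo the offset that marks female clients

    return gender, pd.Timestamp(year=1900 + year, month=month, day=day)


for raw in [706213, 780313, 355708]:
    gender, birth_date = decode_birth_number(raw)
    print(raw, gender, birth_date.date())
```

A real date column makes it straightforward to derive features such as the client's age at the time of the loan.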
