# Data Cleaning

The goal of the analysis is to build a model that predicts whether a bank account should be approved for a loan. In this data cleaning notebook, we build an account-level dataset for use in the analysis notebook.

The dataset comes from a Czech bank. Its client information is split across 8 tables:

- **account** - static characteristics of an account
- **client** - characteristics of a client
- **disposition** - relationship between a client and their account
- **permanent order** - characteristics of a payment
- **transaction** - transaction on an account
- **loan** - loan granted for a given account
- **credit card** - credit card issued to an account
- **demographic** - demographic characteristics of a district

More information on this dataset can be found here: 
[Financial Dataset](https://sorry.vse.cz/~berka/challenge/pkdd1999/berka.htm)

In [187]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
# import seaborn as sns
# import pymysql
import datetime
plt.style.use('fivethirtyeight')
#plt.style.use('default')
from functools import reduce
pd.set_option('display.max_columns',None)
pd.set_option('display.min_rows', None)
from glob import glob

## Data Import

In [4]:
df_account = pd.read_csv('csv_files/raw_csv/df_account.csv', low_memory = False) 
df_client = pd.read_csv('csv_files/raw_csv/df_client.csv', low_memory = False)
df_disp = pd.read_csv('csv_files/raw_csv/df_disp.csv', low_memory = False)
df_order = pd.read_csv('csv_files/raw_csv/df_order.csv', low_memory = False)
df_trans = pd.read_csv('csv_files/raw_csv/df_trans.csv', low_memory = False)
df_loan = pd.read_csv('csv_files/raw_csv/df_loan.csv', low_memory = False)
df_card = pd.read_csv('csv_files/raw_csv/df_card.csv', low_memory = False)
df_district = pd.read_csv('csv_files/raw_csv/df_district.csv', low_memory = False)

There are 8 raw CSV files that we need to clean. Because we will inspect each of them, the function below prints an overview of a dataframe.

In [58]:
def df_overview(df):
    
    '''
    Outputs an overview of the dataframe:
    - Sample of dataframe
    - Shape
    - Data types
    - Missing values (%)
    - Descriptive statistics
    '''
    print('\nShape of dataframe:\n')
    print(f'{df.shape[0]} rows | {df.shape[1]} columns')
    print('-' * 42)
    
    print('\nSample of dataframe:\n')
    print(df.head())
    print('-' * 42)
    
    print('\nData types of dataframe:\n')
    print(df.dtypes)
    print('-' * 42)
    
    print('\nMissing values by % in dataframe\n')
    print(df.isnull().sum()*100/df.shape[0])
    print('-' * 42)
    
    print('\nDecriptive statistics of dataframe:\n')
    print(df.describe())
    print('-' * 42)

In [59]:
df_overview(df_account)


Shape of dataframe:

4500 rows | 4 columns
------------------------------------------

Sample of dataframe:

   account_id  district_id         frequency        date
0           1           18  POPLATEK MESICNE  1995-03-24
1           2            1  POPLATEK MESICNE  1993-02-26
2           3            5  POPLATEK MESICNE  1997-07-07
3           4           12  POPLATEK MESICNE  1996-02-21
4           5           15  POPLATEK MESICNE  1997-05-30
------------------------------------------

Data types of dataframe:

account_id      int64
district_id     int64
frequency      object
date           object
dtype: object
------------------------------------------

Missing values by % in dataframe

account_id     0.0
district_id    0.0
frequency      0.0
date           0.0
dtype: float64
------------------------------------------

Decriptive statistics of dataframe:

         account_id  district_id
count   4500.000000  4500.000000
mean    2786.067556    37.310444
std     2313.811984    25.1

Changes that need to be made:
- translate the values in the `frequency` column to English
- change data type of `date` column

In [60]:
# Let's convert to English for better understanding.
# Assigning the result back avoids calling inplace=True on a column selection,
# which newer pandas versions warn about.
df_account['frequency'] = df_account['frequency'].replace({
    'POPLATEK MESICNE': 'monthly',
    'POPLATEK TYDNE': 'weekly',
    'POPLATEK PO OBRATU': 'after_trans',
})
df_account.rename(columns={'frequency': 'stmt_frq'}, inplace=True)  # statement frequency

Frequency of an account is defined as: "frequency of issuance of statements"
- "POPLATEK MESICNE" stands for monthly issuance
- "POPLATEK TYDNE" stands for weekly issuance
- "POPLATEK PO OBRATU" stands for issuance after transaction

In [62]:
# Convert date to a datetime type variable
df_account['date'] = pd.to_datetime(df_account['date'])
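
The date conversions in this notebook rely on pandas inferring the format. As a minimal sketch on toy data (the malformed value is invented for illustration, and the ISO format is an assumption based on the sample rows shown above), an explicit format plus `errors='coerce'` makes unparseable values visible as `NaT`:

```python
import pandas as pd

# Toy data in the same ISO format as the `date` column sample above;
# the last value is deliberately malformed (hypothetical).
raw_dates = pd.Series(['1995-03-24', '1993-02-26', 'not-a-date'])

# An explicit format avoids format guessing, and errors='coerce'
# turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

# Count how many values failed to parse before trusting the column.
n_bad = int(parsed.isna().sum())
print(parsed.dtype, n_bad)
```

If the count of failed values is zero, the plain `pd.to_datetime` call used in this notebook gives the same result.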

In [67]:
df_overview(df_client)


Shape of dataframe:

5369 rows | 4 columns
------------------------------------------

Sample of dataframe:

   client_id gender  birth_date  district_id
0          1      F  1970-12-13           18
1          2      M  1945-02-04            1
2          3      F  1940-10-09            1
3          4      M  1956-12-01            5
4          5      F  1960-07-03            5
------------------------------------------

Data types of dataframe:

client_id       int64
gender         object
birth_date     object
district_id     int64
dtype: object
------------------------------------------

Missing values by % in dataframe

client_id      0.0
gender         0.0
birth_date     0.0
district_id    0.0
dtype: float64
------------------------------------------

Decriptive statistics of dataframe:

          client_id  district_id
count   5369.000000  5369.000000
mean    3359.011920    37.310114
std     2832.911984    25.043690
min        1.000000     1.000000
25%     1418.000000    14.000000


Changes that need to be made:
- change data type of `birth_date` column

In [68]:
# Convert date to a datetime type variable
df_client['birth_date'] = pd.to_datetime(df_client['birth_date'])

In [72]:
df_overview(df_disp)


Shape of dataframe:

5369 rows | 4 columns
------------------------------------------

Sample of dataframe:

   disp_id  client_id  account_id       type
0        1          1           1      OWNER
1        2          2           2      OWNER
2        3          3           2  DISPONENT
3        4          4           3      OWNER
4        5          5           3  DISPONENT
------------------------------------------

Data types of dataframe:

disp_id        int64
client_id      int64
account_id     int64
type          object
dtype: object
------------------------------------------

Missing values by % in dataframe

disp_id       0.0
client_id     0.0
account_id    0.0
type          0.0
dtype: float64
------------------------------------------

Decriptive statistics of dataframe:

            disp_id     client_id    account_id
count   5369.000000   5369.000000   5369.000000
mean    3337.097970   3359.011920   2767.496927
std     2770.418826   2832.911984   2307.843630
min        1.0

Note: Multiple clients can be on one account.

In [73]:
df_overview(df_order)


Shape of dataframe:

6471 rows | 6 columns
------------------------------------------

Sample of dataframe:

   order_id  account_id bank_to  account_to  amount k_symbol
0     29401           1      YZ    87144583  2452.0     SIPO
1     29402           2      ST    89597016  3372.7     UVER
2     29403           2      QR    13943797  7266.0     SIPO
3     29404           3      WX    83084338  1135.0     SIPO
4     29405           3      CD    24485939   327.0      NaN
------------------------------------------

Data types of dataframe:

order_id        int64
account_id      int64
bank_to        object
account_to      int64
amount        float64
k_symbol       object
dtype: object
------------------------------------------

Missing values by % in dataframe

order_id       0.000000
account_id     0.000000
bank_to        0.000000
account_to     0.000000
amount         0.000000
k_symbol      21.310462
dtype: float64
------------------------------------------

Decriptive statistics of da

Changes that need to be made:

- translate the values in the `k_symbol` column to English
    - 21% of the values are missing

In [76]:
# Let's convert k_symbol to English and label missing values as 'unknown'
df_order['k_symbol'] = df_order['k_symbol'].replace({
    'POJISTNE': 'insurance',
    'SIPO': 'household',
    'LEASING': 'leasing',
    'UVER': 'loan',
}).fillna('unknown')
df_order.rename(columns={'k_symbol': 'order_payment_type'}, inplace=True)

`k_symbol` is defined as: "characterization of the payment"
- "POJISTNE" stands for insurance payment
- "SIPO" stands for household
- "LEASING" stands for leasing
- "UVER" stands for loan payment

In [92]:
df_order['order_payment_type'].value_counts(normalize=True) * 100

household    54.118374
unknown      21.310462
loan         11.080204
insurance     8.221295
leasing       5.269665
Name: order_payment_type, dtype: float64

For now, the 21% of missing values are labelled `unknown`. We will decide how to handle them once we have a proper dataset to work with.

In [93]:
df_overview(df_trans)


Shape of dataframe:

1056320 rows | 10 columns
------------------------------------------

Sample of dataframe:

   trans_id  account_id        date    type      operation  amount  balance  \
0         1           1  1995-03-24  PRIJEM          VKLAD    1000     1000   
1         5           1  1995-04-13  PRIJEM  PREVOD Z UCTU    3679     4679   
2         6           1  1995-05-13  PRIJEM  PREVOD Z UCTU    3679    20977   
3         7           1  1995-06-13  PRIJEM  PREVOD Z UCTU    3679    26835   
4         8           1  1995-07-13  PRIJEM  PREVOD Z UCTU    3679    30415   

  k_symbol bank     account  
0      NaN  NaN         NaN  
1      NaN   AB  41403269.0  
2      NaN   AB  41403269.0  
3      NaN   AB  41403269.0  
4      NaN   AB  41403269.0  
------------------------------------------

Data types of dataframe:

trans_id        int64
account_id      int64
date           object
type           object
operation      object
amount          int64
balance         int64
k_symbo

There are 4 columns with missing values:
- `operation` - mode of transaction
- `k_symbol` - characterization of the transaction
- `bank` - bank of the partner
- `account` - account of the partner

We won't use `bank` and `account`: over 70% of their rows are missing, and the information is not relevant to our analysis.


`type` - type of transaction
- "PRIJEM" stands for credit
- "VYDAJ" stands for withdrawal

`operation` - mode of transaction
- "VYBER KARTOU" credit card withdrawal
- "VKLAD" credit in cash
- "PREVOD Z UCTU" collection from another bank
- "VYBER" withdrawal in cash
- "PREVOD NA UCET" remittance to another bank

`k_symbol` - characterization of transaction
- "POJISTNE" stands for insurance payment
- "SLUZBY" stands for payment for statement
- "UROK" stands for interest credited
- "SANKC. UROK" sanction interest if negative balance
- "SIPO" stands for household
- "DUCHOD" stands for old-age pension
- "UVER" stands for loan payment
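
A code that is absent from a translation table passes through `replace` unchanged, so it helps to check for leftovers. The sketch below uses a hypothetical helper, `translate_values`, on toy data. Neither the helper nor the toy values come from the dataset documentation.

```python
import pandas as pd

def translate_values(series, mapping, missing_label='unknown'):
    """Translate category codes and report any code the mapping does not cover."""
    cleaned = series.str.strip()            # normalise stray whitespace such as ' '
    cleaned = cleaned.mask(cleaned == '')   # treat empty strings as missing
    unmapped = sorted(set(cleaned.dropna()) - set(mapping))
    translated = cleaned.replace(mapping).fillna(missing_label)
    return translated, unmapped

# Toy example: 'VYBER' is not in the mapping, so it is reported and left as-is.
type_map = {'PRIJEM': 'credit', 'VYDAJ': 'withdrawal'}
codes = pd.Series(['PRIJEM', 'VYDAJ', 'VYBER', ' ', None])

translated, unmapped = translate_values(codes, type_map)
print(translated.tolist())
print(unmapped)
```

Returning the unmapped codes rather than raising keeps the decision about how to treat them with the caller.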

In [113]:
# Replace names for type
# Note: values not listed here (e.g. 'VYBER', which shows up later as
# num_trans_VYBER) are left untranslated.
df_trans['type'] = df_trans['type'].replace({'PRIJEM': 'credit', 'VYDAJ': 'withdrawal'})
df_trans.rename(columns={'type': 'trans_type'}, inplace=True)

# Replace names for operation
df_trans['operation'] = df_trans['operation'].replace({
    'VYBER KARTOU': 'cc_withdrawal',  # credit card withdrawal
    'VKLAD': 'c_cash',                # credit in cash
    'PREVOD Z UCTU': 'col_bank',      # collection from another bank
    'VYBER': 'withdrawal_c',          # withdrawal in cash
    'PREVOD NA UCET': 'remittance',   # remittance to another bank
}).fillna('unknown')

# Replace names for k_symbol
df_trans['k_symbol'] = df_trans['k_symbol'].replace({
    'POJISTNE': 'insurance',
    'SLUZBY': 'statement',
    'UROK': 'int_cred',         # interest credited
    'SANKC. UROK': 'sanc_int',  # sanction interest if negative balance
    'SIPO': 'household',
    'DUCHOD': 'pension',
    'UVER': 'loan',
    ' ': 'unknown',             # blank strings are treated as missing
}).fillna('unknown')
df_trans.rename(columns={'k_symbol': 'trans_payment_type'}, inplace=True)

In [114]:
df_trans['operation'].value_counts(normalize=True) * 100

withdrawal_c     41.172940
remittance       19.717794
unknown          17.335088
c_cash           14.838591
col_bank          6.174833
cc_withdrawal     0.760754
Name: operation, dtype: float64

The mode of transaction is missing for 17% of transactions. These are labelled `unknown`.

In [116]:
df_trans['trans_payment_type'].value_counts(normalize=True) * 100

unknown      50.677257
int_cred     17.335088
statement    14.752348
household    11.177011
pension       2.872046
insurance     1.751363
loan          1.285595
sanc_int      0.149292
Name: trans_payment_type, dtype: float64

The characterization is missing for 51% of transactions. These are labelled `unknown`, and we may drop this variable if it turns out to carry little information.

In [119]:
# Convert date to a datetime type variable
df_trans['date'] = pd.to_datetime(df_trans['date'])

In [121]:
df_overview(df_loan)


Shape of dataframe:

682 rows | 7 columns
------------------------------------------

Sample of dataframe:

   loan_id  account_id        date  amount  duration  payments status
0     4959           2  1994-01-05   80952        24    3373.0      A
1     4961          19  1996-04-29   30276        12    2523.0      B
2     4962          25  1997-12-08   30276        12    2523.0      A
3     4967          37  1998-10-14  318480        60    5308.0      D
4     4968          38  1998-04-19  110736        48    2307.0      C
------------------------------------------

Data types of dataframe:

loan_id         int64
account_id      int64
date           object
amount          int64
duration        int64
payments      float64
status         object
dtype: object
------------------------------------------

Missing values by % in dataframe

loan_id       0.0
account_id    0.0
date          0.0
amount        0.0
duration      0.0
payments      0.0
status        0.0
dtype: float64
--------------

Changes to be made:
- `date` to datetime type variable

In [122]:
# Convert date to a datetime type variable
df_loan['date'] = pd.to_datetime(df_loan['date'])

In [123]:
# Rename loan columns
df_loan = df_loan.rename(columns={'amount': 'loan_amount', 'duration':'loan_duration', 'payments':'monthly_loan_payment', 'status':'loan_status'})

In [124]:
df_overview(df_card)


Shape of dataframe:

892 rows | 4 columns
------------------------------------------

Sample of dataframe:

   card_id  disp_id     type      issued
0        1        9     gold  1998-10-16
1        2       19  classic  1998-03-13
2        3       41     gold  1995-09-03
3        4       42  classic  1998-11-26
4        5       51   junior  1995-04-24
------------------------------------------

Data types of dataframe:

card_id     int64
disp_id     int64
type       object
issued     object
dtype: object
------------------------------------------

Missing values by % in dataframe

card_id    0.0
disp_id    0.0
type       0.0
issued     0.0
dtype: float64
------------------------------------------

Decriptive statistics of dataframe:

           card_id       disp_id
count   892.000000    892.000000
mean    480.855381   3511.862108
std     306.933982   2984.373626
min       1.000000      9.000000
25%     229.750000   1387.000000
50%     456.500000   2938.500000
75%     684.250000   445

Changes to be made:
- `issued` to datetime variable

In [125]:
df_card['issued'] = pd.to_datetime(df_card['issued'])

In [127]:
df_overview(df_district)


Shape of dataframe:

77 rows | 16 columns
------------------------------------------

Sample of dataframe:

   district_id           A2               A3       A4  A5  A6  A7  A8  A9  \
0            1  Hl.m. Praha           Prague  1204953   0   0   0   1   1   
1            2      Benesov  central Bohemia    88884  80  26   6   2   5   
2            3       Beroun  central Bohemia    75232  55  26   4   1   5   
3            4       Kladno  central Bohemia   149893  63  29   6   2   6   
4            5        Kolin  central Bohemia    95616  65  30   4   1   6   

     A10    A11  A12   A13  A14      A15    A16  
0  100.0  12541  0.2  0.43  167  85677.0  99107  
1   46.7   8507  1.6  1.85  132   2159.0   2674  
2   41.7   8980  1.9  2.21  111   2824.0   2813  
3   67.4   9753  4.6  5.05  109   5244.0   5892  
4   51.4   9307  3.8  4.43  118   2616.0   3040  
------------------------------------------

Data types of dataframe:

district_id      int64
A2              object
A3          

In [128]:
df_district = df_district.rename(columns={'A2':'district_name', 
                                          'A3':'region', 
                                          'A4':'population', 
                                          'A5':'nmu_lt499',
                                          'A6':'nmu_500to1999', 
                                          'A7':'nmu_2000to9999', 
                                          'A8':'nmu_gt10000',
                                          'A9':'n_cty', 
                                          'A10':'ratio_urban', 
                                          'A11':'avg_salary', 
                                          'A12':'unemp_95', 
                                          'A13': 'unemp_96',
                                          'A14':'nentrep_p1000', 
                                          'A15':'ncrimes_95', 
                                          'A16':'ncrimes_96'})

## Merge Dataframes

In [129]:
# Time to merge the different dataframes
df_final = pd.merge(df_account, df_disp, on = 'account_id') #shape: 5369, 7

In [133]:
df_final = pd.merge(df_final, df_district, on = 'district_id') # shape: 5369, 22

In [136]:
df_final = pd.merge(df_final, df_client, on = 'client_id') #shape: 5369, 25

#can drop one of the district_id

In [139]:
df_final = pd.merge(df_final, df_card, on='disp_id', how='outer', suffixes=('_disp','_card')) #shape: 5369, 28

In [142]:
df_final = pd.merge(df_final, df_loan, on='account_id', how='inner', suffixes=('_account','_loan')) #shape: 827, 34

We only keep the accounts with loans, as we'll use the loan information to differentiate between good and bad clients.

In [144]:
len(df_final['account_id'].unique()) 

682

This makes sense: the semi-anonymized dataset contains 606 successful and 76 unsuccessful loans, 682 in total, so there is one loan per account.
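
The merges above turn 682 accounts into 827 rows because several clients can share one account. As a rough sketch on toy tables (all values are invented for illustration), the `validate` argument of `pd.merge` can assert the expected key relationship, and comparing the row count with the number of unique accounts shows the fan-out:

```python
import pandas as pd

# Toy stand-ins for the account, disposition and loan tables.
accounts = pd.DataFrame({'account_id': [1, 2, 3],
                         'stmt_frq': ['monthly', 'monthly', 'weekly']})
disp = pd.DataFrame({'disp_id': [1, 2, 3, 4],
                     'client_id': [1, 2, 3, 4],
                     'account_id': [1, 2, 2, 3],
                     'type': ['OWNER', 'OWNER', 'DISPONENT', 'OWNER']})
loans = pd.DataFrame({'loan_id': [10, 11],
                      'account_id': [2, 3],
                      'loan_status': ['A', 'B']})

# validate raises pandas.errors.MergeError if the assumed relationship
# between the keys does not hold.
merged = pd.merge(accounts, disp, on='account_id', validate='one_to_many')
merged = pd.merge(merged, loans, on='account_id', how='inner',
                  validate='many_to_one')

n_rows = len(merged)                           # one row per client on a loan account
n_accounts = merged['account_id'].nunique()    # one per loan account
print(n_rows, n_accounts)
```

Here account 2 has two clients, so the merged table has more rows than accounts, mirroring the 827 versus 682 seen above.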

In [149]:
df_final.head()

Unnamed: 0,account_id,district_id_x,stmt_frq,date_account,disp_id,client_id,type_disp,district_name,region,population,nmu_lt499,nmu_500to1999,nmu_2000to9999,nmu_gt10000,n_cty,ratio_urban,avg_salary,unemp_95,unemp_96,nentrep_p1000,ncrimes_95,ncrimes_96,gender,birth_date,district_id_y,card_id,type_card,issued,loan_id,date_loan,loan_amount,loan_duration,monthly_loan_payment,loan_status
0,2350,18,monthly,1996-08-04,2841,2841,OWNER,Pisek,south Bohemia,70699,60,13,2,1,4,65.3,8968,2.8,3.35,131,1740.0,1910,F,1973-04-10,18,446.0,classic,1998-06-04,5451,1997-05-11,159744,48,3328.0,C
1,9156,18,monthly,1997-09-14,10963,11271,OWNER,Pisek,south Bohemia,70699,60,13,2,1,4,65.3,8968,2.8,3.35,131,1740.0,1910,M,1937-12-08,18,,,NaT,6856,1998-12-01,163332,36,4537.0,C
2,10973,18,weekly,1993-04-20,13182,13490,OWNER,Pisek,south Bohemia,70699,60,13,2,1,4,65.3,8968,2.8,3.35,131,1740.0,1910,F,1969-05-25,18,,,NaT,7235,1993-10-13,154416,48,3217.0,A
3,2,1,monthly,1993-02-26,2,2,OWNER,Hl.m. Praha,Prague,1204953,0,0,0,1,1,100.0,12541,0.2,0.43,167,85677.0,99107,M,1945-02-04,1,,,NaT,4959,1994-01-05,80952,24,3373.0,A
4,2,1,monthly,1993-02-26,3,3,DISPONENT,Hl.m. Praha,Prague,1204953,0,0,0,1,1,100.0,12541,0.2,0.43,167,85677.0,99107,F,1940-10-09,1,,,NaT,4959,1994-01-05,80952,24,3373.0,A


In [151]:
# all transactions for accounts with loans
df_trans_account = pd.merge(df_trans, df_final[['account_id','date_loan']], on = 'account_id') #shape: 233627, 11 

In [153]:
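# Drop duplicate transactions (df_final has one row per client, so accounts
# with several clients had their transactions repeated by the merge)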
df_trans_account = df_trans_account.drop_duplicates() #shape: 191556, 11

In [157]:
# Gets the difference between the date of the loan and the date of the transaction
df_trans_account['date_diff'] = (df_trans_account['date_loan'] - df_trans_account['date']) 

In [162]:
# Let's drop transactions that occurred after the loan date
df_trans_account.drop(df_trans_account[df_trans_account['date_diff'] < datetime.timedelta(0)].index, inplace=True) #shape 54860, 12
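
The same filter can be written as a boolean mask, which some readers find easier to follow than dropping by index. This is a sketch on toy data (dates and balances are invented) and keeps transactions dated on or before the loan date, as the cell above does:

```python
import pandas as pd

# Toy transactions and loan dates.
trans = pd.DataFrame({
    'account_id': [1, 1, 1, 2],
    'date': pd.to_datetime(['1995-01-10', '1995-03-01', '1995-06-15', '1996-02-01']),
    'balance': [1000, 2500, 1800, 700],
})
loans = pd.DataFrame({
    'account_id': [1, 2],
    'date_loan': pd.to_datetime(['1995-04-01', '1996-01-01']),
})

merged = trans.merge(loans, on='account_id')
merged['date_diff'] = merged['date_loan'] - merged['date']

# A non-negative difference means the transaction happened on or before the loan date.
before_loan = merged[merged['date_diff'] >= pd.Timedelta(0)]
print(before_loan[['account_id', 'date', 'date_diff']])
```

Restricting features to pre-loan transactions keeps information from after the lending decision out of the model inputs.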

In [164]:
# Here, we get counts of the different transaction types, operations, and payment types
df_trans_type_counts = df_trans_account.groupby('account_id')['trans_type'].value_counts().to_frame()
df_operation_counts = df_trans_account.groupby('account_id')['operation'].value_counts().to_frame()
df_payment_type_counts = df_trans_account.groupby('account_id')['trans_payment_type'].value_counts().to_frame()

In [177]:
df_trans_type_counts.index = df_trans_type_counts.index.set_names(['account_id', 'transaction_type'])
df_operation_counts.index = df_operation_counts.index.set_names(['account_id', 'operation_type'])
df_payment_type_counts.index = df_payment_type_counts.index.set_names(['account_id', 'payment_type'])

In [179]:
df_trans_type_counts.reset_index(inplace=True)
df_operation_counts.reset_index(inplace=True)
df_payment_type_counts.reset_index(inplace=True)

In [180]:
# flatten the counts using a pivot table
df_trans_type_counts = df_trans_type_counts.pivot(index='account_id', columns='transaction_type', values='trans_type').fillna(0).reset_index(inplace=False)
df_operation_counts = df_operation_counts.pivot(index='account_id', columns='operation_type', values='operation').fillna(0).reset_index(inplace=False)
df_payment_type_counts = df_payment_type_counts.pivot(index='account_id', columns='payment_type', values='trans_payment_type').fillna(0).reset_index(inplace=False)

In [182]:
df_trans_type_counts.columns = ['num_trans_' + str(col) for col in df_trans_type_counts.columns]
df_operation_counts.columns = ['num_ops_' + str(col) for col in df_operation_counts.columns]
df_payment_type_counts.columns = ['num_pay_' + str(col) for col in df_payment_type_counts.columns]

In [184]:
df_trans_type_counts.rename(columns={'num_trans_account_id':'account_id'}, inplace=True)
df_operation_counts.rename(columns={'num_ops_account_id':'account_id'}, inplace=True)
df_payment_type_counts.rename(columns={'num_pay_account_id':'account_id'}, inplace=True)
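
The count, rename, pivot, and prefix steps above can also be expressed with `pd.crosstab`. This is a sketch of an alternative on toy data, not the method used in this notebook. One possible advantage: the name of the column produced by `value_counts().to_frame()` may differ between pandas versions, which the `pivot(values=...)` calls above depend on, while `crosstab` does not rely on it.

```python
import pandas as pd

# Toy transactions for two accounts.
trans = pd.DataFrame({
    'account_id': [1, 1, 1, 2, 2],
    'trans_type': ['credit', 'credit', 'withdrawal', 'credit', 'credit'],
})

# One row per account, one column per category, zero where a category is absent.
counts = pd.crosstab(trans['account_id'], trans['trans_type'])
counts = counts.add_prefix('num_trans_').reset_index()
counts.columns.name = None  # drop the leftover 'trans_type' axis label

print(counts)
```

The result has integer counts, whereas the pivot plus `fillna(0)` approach yields floats.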

In [188]:
#Create new transaction counts data frame 
counts_dataframes = [df_trans_type_counts, df_operation_counts, df_payment_type_counts]
df_counts = reduce(lambda left,right: pd.merge(left,right,on='account_id'), counts_dataframes)#shape: 682,17

In [189]:
df_final = pd.merge(df_final, df_counts, on = 'account_id') #shape: 827, 50

In [192]:
df_final.head()

Unnamed: 0,account_id,district_id_x,stmt_frq,date_account,disp_id,client_id,type_disp,district_name,region,population,nmu_lt499,nmu_500to1999,nmu_2000to9999,nmu_gt10000,n_cty,ratio_urban,avg_salary,unemp_95,unemp_96,nentrep_p1000,ncrimes_95,ncrimes_96,gender,birth_date,district_id_y,card_id,type_card,issued,loan_id,date_loan,loan_amount,loan_duration,monthly_loan_payment,loan_status,num_trans_VYBER,num_trans_credit,num_trans_withdrawal,num_ops_c_cash,num_ops_cc_withdrawal,num_ops_col_bank,num_ops_remittance,num_ops_unknown,num_ops_withdrawal_c,num_pay_household,num_pay_insurance,num_pay_int_cred,num_pay_loan,num_pay_sanc_int,num_pay_statement,num_pay_unknown
0,2350,18,monthly,1996-08-04,2841,2841,OWNER,Pisek,south Bohemia,70699,60,13,2,1,4,65.3,8968,2.8,3.35,131,1740.0,1910,F,1973-04-10,18,446.0,classic,1998-06-04,5451,1997-05-11,159744,48,3328.0,C,0.0,15.0,13.0,10.0,0.0,0.0,3.0,5.0,10.0,2.0,0.0,5.0,0.0,0.0,2.0,19.0
1,9156,18,monthly,1997-09-14,10963,11271,OWNER,Pisek,south Bohemia,70699,60,13,2,1,4,65.3,8968,2.8,3.35,131,1740.0,1910,M,1937-12-08,18,,,NaT,6856,1998-12-01,163332,36,4537.0,C,4.0,34.0,28.0,11.0,0.0,0.0,0.0,23.0,32.0,0.0,0.0,23.0,0.0,0.0,6.0,37.0
2,10973,18,weekly,1993-04-20,13182,13490,OWNER,Pisek,south Bohemia,70699,60,13,2,1,4,65.3,8968,2.8,3.35,131,1740.0,1910,F,1969-05-25,18,,,NaT,7235,1993-10-13,154416,48,3217.0,A,2.0,16.0,14.0,10.0,0.0,0.0,0.0,6.0,16.0,0.0,0.0,6.0,0.0,0.0,4.0,22.0
3,2,1,monthly,1993-02-26,2,2,OWNER,Hl.m. Praha,Prague,1204953,0,0,0,1,1,100.0,12541,0.2,0.43,167,85677.0,99107,M,1945-02-04,1,,,NaT,4959,1994-01-05,80952,24,3373.0,A,3.0,22.0,30.0,2.0,0.0,10.0,6.0,10.0,27.0,6.0,0.0,10.0,0.0,0.0,6.0,33.0
4,2,1,monthly,1993-02-26,3,3,DISPONENT,Hl.m. Praha,Prague,1204953,0,0,0,1,1,100.0,12541,0.2,0.43,167,85677.0,99107,F,1940-10-09,1,,,NaT,4959,1994-01-05,80952,24,3373.0,A,3.0,22.0,30.0,2.0,0.0,10.0,6.0,10.0,27.0,6.0,0.0,10.0,0.0,0.0,6.0,33.0


Now we have a dataframe with our accounts which have loans and different counts of their transaction information.

Most lenders ask to see at least two to three months' worth of statements before they issue a loan.

Let's look at the account balance over the 3 months before the loan date.

In [193]:
df_trans_account_copy = df_trans_account.copy()

In [195]:
# Let's take a look at transactions within different time frames up to 3 months
df_trans_account_30 = df_trans_account_copy.copy()
df_trans_account_60 = df_trans_account_copy.copy()
df_trans_account_90 = df_trans_account_copy.copy()

In [196]:
# Reducing transactions by months (30, 60, 90 days)
df_trans_account_30.drop(df_trans_account_30[df_trans_account_30['date_diff'] > datetime.timedelta(30)].index, inplace=True)
df_trans_account_60.drop(df_trans_account_60[df_trans_account_60['date_diff'] > datetime.timedelta(60)].index, inplace=True)
df_trans_account_90.drop(df_trans_account_90[df_trans_account_90['date_diff'] > datetime.timedelta(90)].index, inplace=True)

In [197]:
df_trans_account_30.shape

(4960, 12)

In [198]:
df_trans_account_90.shape

# Makes sense: a longer window contains more transactions

(14316, 12)

In [199]:
mon_1_balance = df_trans_account_30.groupby('account_id')['balance'].agg(['min','max','mean','count']).reset_index()
mon_2_balance = df_trans_account_60.groupby('account_id')['balance'].agg(['min','max','mean','count']).reset_index()
mon_3_balance = df_trans_account_90.groupby('account_id')['balance'].agg(['min','max','mean','count']).reset_index()

In [200]:
mon_1_balance.rename(columns = {'min':'min1','max':'max1','mean':'mean1','count':'count1'}, inplace=True)
mon_2_balance.rename(columns = {'min':'min2','max':'max2','mean':'mean2','count':'count2'}, inplace=True)
mon_3_balance.rename(columns = {'min':'min3','max':'max3','mean':'mean3','count':'count3'}, inplace=True)
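
The three windows follow the same pattern, so they could be computed in a loop. The sketch below assumes a `date_diff` column like the one built earlier and uses invented toy balances. Note that `pd.concat` keeps accounts that are missing from a window (as `NaN`), whereas the inner merges in this notebook drop them.

```python
import pandas as pd

# Toy transactions with the number of days between transaction and loan date.
trans = pd.DataFrame({
    'account_id': [1, 1, 1, 2, 2],
    'balance': [1000, 3000, 2000, 500, 900],
    'date_diff': pd.to_timedelta([10, 45, 80, 20, 100], unit='D'),
})

frames = []
for i, days in enumerate([30, 60, 90], start=1):
    # Keep transactions at most `days` days before the loan date.
    window = trans[trans['date_diff'] <= pd.Timedelta(days=days)]
    stats = window.groupby('account_id')['balance'].agg(['min', 'max', 'mean', 'count'])
    # Suffix the columns with the window number: min1, max1, ..., count3.
    frames.append(stats.add_suffix(str(i)))

balance_stats = pd.concat(frames, axis=1).reset_index()
print(balance_stats)
```

The windows are cumulative: the 90-day statistics include the transactions already counted in the 30-day and 60-day windows, as in the cells above.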

In [201]:
#Created new client data frame with different balance statistics for different time frames
balance_dataframes = [df_final, mon_1_balance, mon_2_balance, mon_3_balance]
df_final = reduce(lambda left,right: pd.merge(left,right,on='account_id'), balance_dataframes)#shape:827,62 

In [202]:
# Let's focus on the owners of the account but we'll keep a count if there are multiple people on an account 
df_num_clients = df_final.groupby('account_id', as_index=False)['type_disp'].count().rename(columns={'type_disp':'num_clients'})

In [205]:
df_final = pd.merge(df_final, df_num_clients, on = 'account_id')

In [208]:
df_final_owner = df_final[df_final['type_disp']=='OWNER']

In [209]:
df_final_owner.shape

(682, 63)
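
The count-then-filter step can be illustrated on toy data (values invented). This sketch uses `transform('count')` to attach the number of clients to every row in one step, instead of the separate groupby and merge used above, and then checks that keeping only owners leaves one row per account:

```python
import pandas as pd

# Toy client-level rows: account 2 has an owner and a disponent.
df = pd.DataFrame({
    'account_id': [2, 2, 3, 4],
    'client_id': [2, 3, 4, 5],
    'type_disp': ['OWNER', 'DISPONENT', 'OWNER', 'OWNER'],
})

# Number of clients attached to each account, repeated on every row of that account.
df['num_clients'] = df.groupby('account_id')['client_id'].transform('count')

# Keep only the owner rows; .copy() avoids modifying a view of df later on.
owners = df[df['type_disp'] == 'OWNER'].copy()

# One row per account is what an account-level dataset requires.
one_row_per_account = owners['account_id'].is_unique
print(owners, one_row_per_account)
```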

Our final dataframe has one row per account with a loan (the owner's row). It includes counts of the account's transactions by type, operation, and payment type, plus balance statistics for the 30, 60, and 90 days before the loan date.

## Save our dataframe

In [210]:
df_final_owner.to_csv('csv_files/trans_csv/df_final_owner.csv')