# 1. Data preprocessing

The goal of this worksheet is to fully process the dataset so it can be used for modeling.

This worksheet contains the following steps:
1. Obtaining the data
2. Inspecting the data
3. Cleaning and filtering the data
4. Feature engineering
5. Constructing the target variable
6. Data selection for modeling


In [98]:
import pandas as pd
from requests import get
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

## Obtaining the data

The dataset is quite large, so it should be downloaded only once. If you work with it over a longer period, however, consider re-downloading it about once a month, because Bondora adds new loans and updates some of the existing ones. After downloading, we extract the zip file, load the extracted .csv file into a DataFrame, and sort it chronologically.

In [127]:
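def download():
    """Download the Bondora loan data archive and extract it into data/."""
    import os
    import zipfile
    url = "https://bondora.com/marketing/media/LoanData.zip"

    # Make sure the target directory exists before writing to it.
    os.makedirs("data", exist_ok=True)

    # Fetch first and stop on an HTTP error, so an error page is never saved as a zip.
    response = get(url)
    response.raise_for_status()
    with open("data/Data.zip", "wb") as file:
        file.write(response.content)

    with zipfile.ZipFile("data/Data.zip", "r") as zip_ref:
        zip_ref.extractall("data/")

#download()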
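
The monthly refresh suggested above could be automated with a small freshness check. The sketch below is only an illustration: the helper name `needs_refresh` and the 30-day threshold are assumptions, not part of the original worksheet.

```python
import os
import time


def needs_refresh(path, max_age_days=30):
    """Return True if `path` is missing or older than `max_age_days` days."""
    if not os.path.exists(path):
        return True
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds > max_age_days * 24 * 60 * 60


# Hypothetical usage: only hit the network when the local copy is stale.
# if needs_refresh("data/Data.zip"):
#     download()
```

The check relies on the file's modification time, so it assumes the archive is not touched by anything other than the download step.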

In [None]:
# The archive is extracted into data/, so the .csv file is read from there.
data = pd.read_csv("data/LoanData.csv")
data = data.sort_values(by=['ListedOnUTC'])

## Inspecting the data

The dataset contains 192771 loans and 112 columns. The loans were listed between February 2009 and October 2021, the date of this data export.

In [100]:
print(f"DataFrame shape: {data.shape}")

DataFrame shape: (192771, 112)


In [101]:
data.head(10)

Unnamed: 0,ReportAsOfEOD,LoanId,LoanNumber,ListedOnUTC,BiddingStartedOn,BidsPortfolioManager,BidsApi,BidsManual,UserName,NewCreditCustomer,LoanApplicationStartedDate,LoanDate,ContractEndDate,FirstPaymentDate,MaturityDate_Original,MaturityDate_Last,ApplicationSignedHour,ApplicationSignedWeekday,VerificationType,LanguageCode,Age,DateOfBirth,Gender,Country,AppliedAmount,Amount,Interest,LoanDuration,MonthlyPayment,County,City,UseOfLoan,Education,MaritalStatus,NrOfDependants,EmploymentStatus,EmploymentDurationCurrentEmployer,EmploymentPosition,WorkExperience,OccupationArea,HomeOwnershipType,IncomeFromPrincipalEmployer,IncomeFromPension,IncomeFromFamilyAllowance,IncomeFromSocialWelfare,IncomeFromLeavePay,IncomeFromChildSupport,IncomeOther,IncomeTotal,ExistingLiabilities,LiabilitiesTotal,RefinanceLiabilities,DebtToIncome,FreeCash,MonthlyPaymentDay,ActiveScheduleFirstPaymentReached,PlannedPrincipalTillDate,PlannedInterestTillDate,LastPaymentOn,CurrentDebtDaysPrimary,DebtOccuredOn,CurrentDebtDaysSecondary,DebtOccuredOnForSecondary,ExpectedLoss,LossGivenDefault,ExpectedReturn,ProbabilityOfDefault,DefaultDate,PrincipalOverdueBySchedule,PlannedPrincipalPostDefault,PlannedInterestPostDefault,EAD1,EAD2,PrincipalRecovery,InterestRecovery,RecoveryStage,StageActiveSince,ModelVersion,Rating,EL_V0,Rating_V0,EL_V1,Rating_V1,Rating_V2,Status,Restructured,ActiveLateCategory,WorseLateCategory,CreditScoreEsMicroL,CreditScoreEsEquifaxRisk,CreditScoreFiAsiakasTietoRiskGrade,CreditScoreEeMini,PrincipalPaymentsMade,InterestAndPenaltyPaymentsMade,PrincipalWriteOffs,InterestAndPenaltyWriteOffs,PrincipalBalance,InterestAndPenaltyBalance,NoOfPreviousLoansBeforeLoan,AmountOfPreviousLoansBeforeLoan,PreviousRepaymentsBeforeLoan,PreviousEarlyRepaymentsBefoleLoan,PreviousEarlyRepaymentsCountBeforeLoan,GracePeriodStart,GracePeriodEnd,NextPaymentDate,NextPaymentNr,NrOfScheduledPayments,ReScheduledOn,PrincipalDebtServicingCost,InterestAndPenaltyDebtServicingCost,ActiveLateLastPaymentCategory
957,2021-10-20,FA160D69-2682-4A60-8D8E-9BB700EA30CE,37,2009-02-21 14:12:39,2009-02-21 14:12:39,0,0.0,63.91,BO57KKKA,True,2009-02-21 14:12:39,2009-03-07,2009-09-10,2009-04-10,2009-09-10,2009-09-10,14,7,2.0,1,21,,0.0,EE,63.91,63.91,20.0,6,,,,3,5.0,4.0,0.0,5.0,UpTo2Years,,5To10Years,,,15500.0,0.0,0.0,0.0,0.0,0.0,0.0,15500.0,0,0.0,0,0.0,0.0,10,True,1000.0,34.14,2009-09-10,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,,,,,,63.91,2.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,
958,2021-10-20,8E929B92-7C99-421D-8499-9BB70125F390,42,2009-02-21 17:50:14,2009-02-21 17:50:14,0,0.0,83.09,Maret,True,2009-02-21 17:50:14,2009-03-03,2011-03-14,2009-04-13,2011-03-14,2011-03-14,17,7,2.0,1,19,,1.0,EE,958.67,83.08,10.0,24,,,,6,2.0,3.0,2.0,3.0,UpTo1Year,,LessThan2Years,,,4000.0,0.0,0.0,0.0,0.0,0.0,0.0,4000.0,0,0.0,0,0.0,0.0,12,True,1300.0,142.73,2011-03-14,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,180+,,,,,83.08,9.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,
959,2021-10-20,33B3F669-D0E3-4474-8045-9BB70128D064,43,2009-02-21 18:00:40,2009-02-21 18:00:40,0,0.0,322.75,OUPEN,True,2009-02-21 18:00:40,2009-02-28,2010-03-10,2009-04-10,2010-03-10,2010-03-10,18,7,2.0,1,38,,0.0,EE,639.12,322.75,25.0,12,,,,6,4.0,4.0,4.0,3.0,UpTo4Years,,15To25Years,,,9000.0,0.0,0.0,0.0,0.0,0.0,0.0,9000.0,0,0.0,0,0.0,0.0,10,True,5050.0,697.28,2010-03-10,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,151-180,,,,,322.75,44.56,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,
960,2021-10-20,7074D9E8-E8F5-403B-8614-9BB701338AD4,44,2009-02-21 18:39:43,2009-02-21 18:39:43,0,0.0,252.45,Aime,True,2009-02-21 18:39:43,2009-03-03,2011-04-04,2009-05-01,2011-04-01,2011-04-01,18,7,2.0,1,55,,1.0,EE,958.67,252.45,10.0,24,,,,2,3.0,2.0,3.0,3.0,MoreThan5Years,,MoreThan25Years,,,6000.0,0.0,0.0,0.0,0.0,0.0,0.0,6000.0,0,0.0,0,0.0,0.0,1,True,3950.0,454.3,2011-04-04,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,121-150,,,,,252.45,29.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,
961,2021-10-20,39F2A312-CD6C-4F60-A7F6-9BB7014CE0A9,45,2009-02-21 20:11:58,2009-02-21 20:11:58,0,0.0,63.91,element,True,2009-02-21 20:11:58,2009-03-07,2010-06-10,2009-05-06,2010-04-06,2010-04-06,20,7,2.0,1,46,,1.0,EE,766.94,63.91,20.0,12,,,,8,5.0,1.0,5.0,3.0,MoreThan5Years,,MoreThan25Years,,,9000.0,0.0,0.0,0.0,0.0,0.0,1700.0,10700.0,0,0.0,0,0.0,0.0,6,True,1000.0,127.58,2010-06-10,,,,,,,,,2010-05-04,,0.0,0.0,3.02,-4.92,3.02,0.71,,,,,,,,,,Repaid,False,,61-90,,,,,63.91,8.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,
962,2021-10-20,01594183-1C98-454E-BB45-9BB801594F50,53,2009-02-22 20:57:14,2009-02-22 20:57:14,0,0.0,127.82,koort681,True,2009-02-22 20:57:14,2009-03-08,2009-05-15,2009-04-15,2009-04-15,2009-04-15,20,1,2.0,1,48,,1.0,EE,127.82,127.82,10.0,1,,,,6,5.0,4.0,0.0,3.0,,,MoreThan25Years,,,9340.0,0.0,0.0,0.0,0.0,0.0,2000.0,11340.0,0,0.0,0,0.0,0.0,15,True,2000.0,18.84,2009-04-14,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,,,,,,127.82,1.16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,
963,2021-10-20,74B28519-DCA6-4C35-8AA0-9BBC00C7F4A6,61,2009-02-26 12:08:00,2009-02-26 12:08:00,0,0.0,70.3,eiks,True,2009-02-26 12:08:00,2009-03-01,2009-09-11,2009-04-13,2009-09-11,2009-09-11,12,5,2.0,1,30,,0.0,EE,447.38,70.3,12.0,6,,,,2,4.0,3.0,0.0,3.0,,,10To15Years,,,12000.0,0.0,0.0,0.0,0.0,0.0,0.0,12000.0,0,0.0,0,0.0,0.0,11,True,1100.0,42.37,2009-09-11,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,31-60,,,,,70.3,2.73,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,
964,2021-10-20,4400B605-6CB0-475A-8494-9BBC00D582A8,64,2009-02-26 12:57:22,2009-02-26 12:57:22,0,0.0,185.34,Optimist,True,2009-02-26 12:57:22,2009-03-12,2009-09-21,2009-04-20,2009-09-21,2009-09-21,12,5,2.0,1,43,,1.0,EE,319.56,185.34,25.0,6,,,,7,4.0,1.0,5.0,5.0,,,10To15Years,,,3000.0,0.0,0.0,0.0,0.0,0.0,3000.0,6000.0,0,0.0,0,0.0,0.0,20,True,2900.0,230.98,2009-09-21,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,31-60,,,,,185.34,14.76,0.0,0.0,0.0,0.0,1.0,281.19,0.0,0.0,0.0,,,,,,,0.0,0.0,
965,2021-10-20,C90D976B-B5C9-454A-BD13-9BC0008E12CD,81,2009-03-02 08:37:16,2009-03-02 08:37:16,0,0.0,172.56,mona35,True,2009-03-02 08:37:16,2009-03-16,2011-04-11,2009-05-11,2011-04-11,2011-04-11,8,2,2.0,1,39,,1.0,EE,319.56,172.58,32.0,24,,,,8,4.0,1.0,1.0,3.0,UpTo1Year,,15To25Years,,,7600.0,0.0,0.0,0.0,0.0,0.0,2000.0,9600.0,0,0.0,0,0.0,0.0,11,True,2700.0,1035.78,2011-04-11,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,121-150,,,,,172.58,67.37,0.0,0.0,0.0,0.0,1.0,185.34,0.0,0.0,0.0,,,,,,,0.0,0.0,
966,2021-10-20,DC59F40B-A669-4255-8BE8-9BC000C0C30A,83,2009-03-02 11:41:49,2009-03-02 11:41:49,0,0.0,127.82,enep,True,2009-03-02 11:41:49,2009-03-03,2009-05-25,2009-04-27,2009-04-27,2009-04-27,11,2,2.0,1,34,,1.0,EE,127.82,127.82,10.0,1,,,,4,2.0,2.0,3.0,3.0,UpTo5Years,,2To5Years,,,4000.0,0.0,0.0,0.0,0.0,0.0,900.0,4900.0,0,0.0,0,0.0,0.0,25,True,2000.0,28.73,2009-04-21,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,,,,,,127.82,1.72,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,


In [102]:
data.tail(10)

Unnamed: 0,ReportAsOfEOD,LoanId,LoanNumber,ListedOnUTC,BiddingStartedOn,BidsPortfolioManager,BidsApi,BidsManual,UserName,NewCreditCustomer,LoanApplicationStartedDate,LoanDate,ContractEndDate,FirstPaymentDate,MaturityDate_Original,MaturityDate_Last,ApplicationSignedHour,ApplicationSignedWeekday,VerificationType,LanguageCode,Age,DateOfBirth,Gender,Country,AppliedAmount,Amount,Interest,LoanDuration,MonthlyPayment,County,City,UseOfLoan,Education,MaritalStatus,NrOfDependants,EmploymentStatus,EmploymentDurationCurrentEmployer,EmploymentPosition,WorkExperience,OccupationArea,HomeOwnershipType,IncomeFromPrincipalEmployer,IncomeFromPension,IncomeFromFamilyAllowance,IncomeFromSocialWelfare,IncomeFromLeavePay,IncomeFromChildSupport,IncomeOther,IncomeTotal,ExistingLiabilities,LiabilitiesTotal,RefinanceLiabilities,DebtToIncome,FreeCash,MonthlyPaymentDay,ActiveScheduleFirstPaymentReached,PlannedPrincipalTillDate,PlannedInterestTillDate,LastPaymentOn,CurrentDebtDaysPrimary,DebtOccuredOn,CurrentDebtDaysSecondary,DebtOccuredOnForSecondary,ExpectedLoss,LossGivenDefault,ExpectedReturn,ProbabilityOfDefault,DefaultDate,PrincipalOverdueBySchedule,PlannedPrincipalPostDefault,PlannedInterestPostDefault,EAD1,EAD2,PrincipalRecovery,InterestRecovery,RecoveryStage,StageActiveSince,ModelVersion,Rating,EL_V0,Rating_V0,EL_V1,Rating_V1,Rating_V2,Status,Restructured,ActiveLateCategory,WorseLateCategory,CreditScoreEsMicroL,CreditScoreEsEquifaxRisk,CreditScoreFiAsiakasTietoRiskGrade,CreditScoreEeMini,PrincipalPaymentsMade,InterestAndPenaltyPaymentsMade,PrincipalWriteOffs,InterestAndPenaltyWriteOffs,PrincipalBalance,InterestAndPenaltyBalance,NoOfPreviousLoansBeforeLoan,AmountOfPreviousLoansBeforeLoan,PreviousRepaymentsBeforeLoan,PreviousEarlyRepaymentsBefoleLoan,PreviousEarlyRepaymentsCountBeforeLoan,GracePeriodStart,GracePeriodEnd,NextPaymentDate,NextPaymentNr,NrOfScheduledPayments,ReScheduledOn,PrincipalDebtServicingCost,InterestAndPenaltyDebtServicingCost,ActiveLateLastPaymentCategory
191981,2021-10-20,814566B1-0023-44B3-8F72-ADC50185CD50,2405917,2021-10-19 18:59:27,2021-10-19 21:59:27,274,0.0,4.0,BO7451973A,False,2021-10-18 23:39:14,2021-10-19,,2021-11-15,2026-10-15,2026-10-15,23,2,4.0,4,38,,1.0,FI,3943.0,3943.0,18.72,60,98.58,,,-1,3.0,-1.0,,-1.0,UpTo5Years,,,-1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1800.0,2,138.75,0,0.0,0.0,15,False,,,,,,,,0.09,0.79,0.07,0.12,,0.0,,,,,,,,,6.0,D,,,,,,Current,False,,,M,,5.0,,0.0,0.0,,,3943.0,,1.0,4150.0,507.22,,0.0,,,2021-11-15,1.0,60.0,,,,
192241,2021-10-20,48CC95D8-C950-4331-A336-ADC6014DCE7A,2407633,2021-10-19 19:04:03,2021-10-19 22:04:03,34,3.0,114.0,BO96429K3A,False,2021-10-19 20:15:21,2021-10-19,,2021-11-15,2026-10-13,2026-10-13,20,3,1.0,1,24,,0.0,EE,2126.0,2126.0,43.21,60,84.6,,,-1,4.0,-1.0,,-1.0,UpTo1Year,,,-1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1100.0,5,495.57,0,0.0,0.0,13,False,,,,,,,,0.19,0.6,0.09,0.31,,0.0,,,,,,,,,6.0,F,,,,,,Current,False,,,M,,,700.0,0.0,0.0,,,2126.0,,2.0,7442.0,1072.06,,0.0,,,2021-11-15,1.0,60.0,,,,
192258,2021-10-20,3EE5ED96-E7F6-4D21-AF20-ADC6016BA019,2407758,2021-10-19 19:06:34,2021-10-19 22:06:34,2377,176.0,1677.0,BO92K97K2,False,2021-10-19 22:03:55,2021-10-19,,2021-11-10,2026-10-12,2026-10-12,22,3,4.0,1,32,,0.0,EE,5316.0,5316.0,28.99,60,176.13,,,-1,3.0,-1.0,,-1.0,UpTo5Years,,,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1600.0,5,1006.33,0,0.0,0.0,10,False,,,,,,,,0.09,0.59,0.12,0.16,,0.0,,,,,,,,,6.0,D,,,,,,Current,False,,,M,,,900.0,0.0,0.0,,,5316.0,,3.0,15203.0,4550.89,,0.0,,,2021-11-10,1.0,60.0,,,,
192242,2021-10-20,92F130ED-F8A1-40C4-B933-ADC6014E71BA,2407640,2021-10-19 19:07:48,2021-10-19 22:07:48,283,1.0,11.0,BOA2519KAA,True,2021-10-19 20:17:40,2021-10-19,,2021-11-18,2026-10-19,2026-10-19,22,3,4.0,4,34,,1.0,FI,4150.0,4150.0,18.7,60,103.75,,,-1,3.0,-1.0,,-1.0,UpTo5Years,,,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2100.0,0,0.0,0,0.0,0.0,18,False,,,,,,,,0.09,0.73,0.07,0.12,,0.0,,,,,,,,,6.0,D,,,,,,Current,False,,,M,,4.0,,0.0,0.0,,,4150.0,,0.0,0.0,,,0.0,,,2021-11-18,1.0,60.0,,,,
192259,2021-10-20,D674C17C-65B4-41F7-9BF1-ADC6016FD158,2407766,2021-10-19 19:26:57,2021-10-19 22:26:57,2,0.0,52.0,BO2957557,False,2021-10-19 22:19:11,2021-10-19,,2021-11-08,2022-10-10,2022-10-10,22,3,4.0,1,39,,1.0,EE,744.0,744.0,13.03,12,68.7,,,-1,4.0,-1.0,,-1.0,MoreThan5Years,,,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1300.0,3,289.91,0,0.0,0.0,8,False,,,,,,,,0.03,0.61,0.09,0.04,,0.0,,,,,,,,,6.0,A,,,,,,Current,False,,,M,,,1000.0,0.0,0.0,,,744.0,,3.0,5314.0,3614.62,,0.0,,,2021-11-08,1.0,12.0,,,,
192260,2021-10-20,C262C3F8-448D-4AA6-9FFB-ADC601748CDC,2407784,2021-10-19 19:44:29,2021-10-19 22:44:29,69,7.0,75.0,BO1456553A,True,2021-10-19 22:36:25,2021-10-19,,2021-10-26,2026-09-28,2026-09-28,22,3,4.0,3,41,,0.0,EE,2126.0,2126.0,35.29,60,76.71,,,-1,4.0,-1.0,,-1.0,UpTo1Year,,,-1.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1850.0,1,350.0,0,0.0,0.0,26,False,,,,,,,,0.11,0.55,0.13,0.21,,0.0,,,,,,,,,6.0,D,,,,,,Current,False,,,M,,,1000.0,0.0,0.0,,,2126.0,,0.0,0.0,,,0.0,,,2021-10-26,1.0,60.0,,,,
192261,2021-10-20,40497E17-CEEF-4546-8B28-ADC6017DA044,2407807,2021-10-19 20:27:12,2021-10-19 23:27:12,158,3.0,118.0,BO3K52741,False,2021-10-19 23:09:27,2021-10-19,,2021-11-17,2026-10-19,2026-10-19,23,3,4.0,1,38,,0.0,EE,3934.0,3934.0,12.87,60,100.97,,,-1,4.0,-1.0,,-1.0,UpTo1Year,,,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1200.0,3,148.76,0,0.0,0.0,17,False,,,,,,,,0.03,0.58,0.09,0.04,,0.0,,,,,,,,,6.0,A,,,,,,Current,False,,,M,,,800.0,0.0,0.0,,,3934.0,,1.0,2020.0,170.07,,0.0,,,2021-11-17,1.0,60.0,,,,
192256,2021-10-20,1CA21C9E-608C-42DD-B822-ADC60166FB54,2407738,2021-10-19 20:56:23,2021-10-19 23:56:23,2,0.0,106.0,BO7A4A69AA,False,2021-10-19 21:47:01,2021-10-20,,2021-11-15,2023-04-13,2023-04-13,21,3,4.0,2,32,,0.0,EE,1488.0,1488.0,33.04,18,107.84,,,-1,5.0,-1.0,,-1.0,UpTo1Year,,,-1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1000.0,1,14.89,0,0.0,0.0,13,False,,,,,,,,0.1,0.61,0.14,0.16,,0.0,,,,,,,,,6.0,D,,,,,,Current,False,,,M,,,1000.0,0.0,0.0,,,1488.0,,1.0,531.0,118.62,,0.0,,,2021-11-15,1.0,18.0,,,,
192237,2021-10-20,D0F13371-4180-4B55-A08C-ADC6014317FF,2407570,2021-10-19 20:58:02,2021-10-19 23:58:02,311,2.0,63.0,BO4943343A,True,2021-10-19 19:36:21,2021-10-20,,2021-11-15,2026-10-13,2026-10-13,19,3,1.0,1,21,,0.0,EE,5316.0,5316.0,43.21,60,211.52,,,-1,3.0,-1.0,,-1.0,Other,,,-1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,900.0,0,0.0,0,0.0,0.0,13,False,,,,,,,,0.18,0.57,0.1,0.32,,0.0,,,,,,,,,6.0,F,,,,,,Current,False,,,M,,,1000.0,0.0,0.0,,,5316.0,,0.0,0.0,,,0.0,,,2021-11-15,1.0,60.0,,,,
192055,2021-10-20,44C564D1-0E2F-43A8-8032-ADC600AE879F,2406346,2021-10-19 21:01:56,2021-10-20 00:01:56,288,0.0,7.0,BOK9A7193,True,2021-10-19 10:35:27,2021-10-20,,2021-11-18,2026-10-19,2026-10-19,10,3,4.0,4,42,,1.0,FI,4150.0,4150.0,18.7,60,103.75,,,-1,3.0,-1.0,,-1.0,MoreThan5Years,,,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2400.0,4,1063.75,0,0.0,0.0,18,False,,,,,,,,0.09,0.79,0.07,0.12,,0.0,,,,,,,,,6.0,D,,,,,,Current,False,,,M,,3.0,,0.0,0.0,,,4150.0,,1.0,4150.0,207.5,,0.0,,,2021-11-18,1.0,60.0,,,,


The data contains various numerical and categorical columns. To understand the data and clean it properly, one must work through most of the columns. Visual inspection is important with this dataset because some columns stop being populated partway through it. Descriptive statistics for individual columns also help to build a better picture of the data. Below are some examples of how numerical and categorical columns were analyzed.

In [103]:
# Describing a numerical column
data.Amount.describe()

count   192771.00
mean      2564.41
std       2198.90
min          6.39
25%        740.00
50%       2125.00
75%       4000.00
max      10632.00
Name: Amount, dtype: float64

In [104]:
# Numerical column value counts
data.Amount.value_counts(dropna=False)

530.00     20337
531.00     13669
4150.00    10337
2125.00     9046
2126.00     6827
           ...  
923.00         1
63.81          1
2708.00        1
671.09         1
1281.89        1
Name: Amount, Length: 6543, dtype: int64

In [105]:
# Describing a categorical column
data.Rating.describe()

count     190038
unique         8
top            D
freq       43741
Name: Rating, dtype: object

In [106]:
# Categorical column value counts
data.Rating.value_counts(dropna=False)

D      43741
E      34137
C      31077
F      26000
B      24232
HR     14832
A       8565
AA      7454
NaN     2733
Name: Rating, dtype: int64

## Cleaning and filtering the data

The first step is to drop all unnecessary columns. Many of them are marked as obsolete in the Bondora API docs or serve no purpose for us. Some columns, however, are not marked as obsolete but contain only null values from roughly the midpoint of the dataset onward. The most likely reason is the EU's data protection law (GDPR). These columns were found by visually going through the dataset, and since the second half of their data is missing, there is no point in keeping them.

In [107]:
# Drop unnecessary columns.
df = data.drop(['BiddingStartedOn', 'LoanApplicationStartedDate', 'ApplicationSignedHour',
                'ApplicationSignedWeekday', 'DateOfBirth', 'County', 'City', 'UseOfLoan',
                'MaritalStatus', 'NrOfDependants', 'EmploymentStatus', 'EmploymentPosition',
                'WorkExperience', 'OccupationArea', 'IncomeFromPrincipalEmployer', 'IncomeFromPension',
                'IncomeFromFamilyAllowance', 'IncomeFromSocialWelfare', 'IncomeFromLeavePay',
                'IncomeFromChildSupport', 'IncomeOther', 'RefinanceLiabilities', 'DebtToIncome',
                'FreeCash', 'MonthlyPaymentDay', 'EL_V0', 'Rating_V0', 'EL_V1', 'Rating_V1',
                'Rating_V2', 'PrincipalWriteOffs', 'InterestAndPenaltyWriteOffs',
                'PlannedPrincipalTillDate', 'CreditScoreEsEquifaxRisk', 'CreditScoreEsMicroL',
                'BidsPortfolioManager', 'BidsApi', 'BidsManual', 'PrincipalDebtServicingCost',
                'InterestAndPenaltyDebtServicingCost'], axis=1)
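
The visual check for columns that stop being populated could be complemented with a per-year share of missing values. This is a minimal sketch under stated assumptions: the helper `null_share_by_year` is hypothetical, and the toy rows below are made up rather than taken from the Bondora data.

```python
import numpy as np
import pandas as pd


def null_share_by_year(frame, date_col, columns):
    """Fraction of missing values per listing year for the given columns."""
    years = pd.to_datetime(frame[date_col], errors="coerce").dt.year
    return frame[columns].isna().groupby(years).mean()


# Toy data standing in for the real loan table (values are made up).
toy = pd.DataFrame({
    "ListedOnUTC": ["2016-01-05", "2016-06-01", "2019-03-10", "2019-08-20"],
    "City": ["Tallinn", "Tartu", np.nan, np.nan],
    "Amount": [500.0, 1000.0, 530.0, 2125.0],
})
print(null_share_by_year(toy, "ListedOnUTC", ["City", "Amount"]))
```

A column whose share jumps to 1.0 from some year onward is a candidate for dropping; on the real data the call would look like `null_share_by_year(data, "ListedOnUTC", ["City", "County"])`.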

To further reduce the size of the dataset, we apply some filters. Several columns are first parsed into date format, which makes them easier to use in the filters.

We do not want currently active loans, because our goal is to classify whether a loan is worth investing in, and we do not yet know the outcome of those loans. However, if a loan is active but has been restructured, we keep it and treat it as a "bad" loan in the target variable, since we know for sure that it has had issues. Active loans that have had no issues so far are excluded because we have no way of knowing whether issues will arise in the future.

We also filter numerical and categorical columns to deal with anomalies, and we fill in null values where appropriate.

In [108]:
# Parse date columns into datetime format. ListedOnUTC also carries a time of day,
# so no fixed format is passed. Values that cannot be parsed become NaT.
date_columns = ['ListedOnUTC', 'LoanDate', 'ContractEndDate', 'MaturityDate_Original', 'MaturityDate_Last', 'ReportAsOfEOD']
df[date_columns] = df[date_columns].apply(pd.to_datetime, errors='coerce')

# Do not include loans that are currently active, and have not been restructured (the original maturity date of the loan has not increased by at least 60 days).
df = df.loc[~((df.Status == 'Current') & (df.Restructured == 0))]

# Some loans were repaid very soon after the contract was signed.
# Drop repaid loans whose last maturity date is less than 180 days after the loan date.
df = df.loc[~(((df.MaturityDate_Last - df.LoanDate) < pd.to_timedelta("180days")) & (df.Status == 'Repaid'))]

# Remove loans issued during the last 60 days before the report date, because most of them seem to have the status "Late".
df = df.loc[(df.LoanDate < (df.ReportAsOfEOD - pd.to_timedelta("60days")))]
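
To make the first filter concrete, here is a toy example showing which loans survive it. The four rows are made up for illustration and are not Bondora data; only the mask mirrors the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Status": ["Current", "Current", "Repaid", "Late"],
    "Restructured": [False, True, False, False],
})

# Same mask as in the cell above: drop loans that are active AND not restructured.
active_untouched = (toy.Status == "Current") & (toy.Restructured == 0)
kept = toy.loc[~active_untouched]
print(kept)
```

Only the first row is removed: the active but restructured loan stays, as do the repaid and late loans.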

In [109]:
# Filters regarding numerical values

# Age between 18 and 70.
df = df.loc[(df.Age >= 18) & (df.Age <= 70)]

# Monthly payment is over 0.
df = df.loc[(df.MonthlyPayment > 0)]

In [110]:
# Filters regarding categorical values

# Select allowed loan durations. This dataset contains various contract lengths, but we are interested in the most common ones. Values represent loan duration in months.
df.LoanDuration = df.LoanDuration.astype('category')
df = df.loc[(df.LoanDuration.isin([60, 48, 36, 30, 24, 18, 12, 9, 6]))]

# We only want loans from Estonia (EE).
df = df.loc[(df.Country.isin(['EE']))]

# The main verification types.
df.VerificationType = df.VerificationType.astype('category')
df = df.loc[(df.VerificationType.isin([1, 2, 3, 4]))]

# Gender is either male or female.
df.Gender = df.Gender.astype('category')
df = df.loc[(df.Gender.isin([0, 1]))]

# Education
df.Education = df.Education.astype('category')
df = df.loc[(df.Education.isin([1, 2, 3, 4, 5]))]

# Employment duration
df = df.loc[(df.EmploymentDurationCurrentEmployer.isin(['MoreThan5Years', 'UpTo5Years', 'UpTo4Years', 'UpTo3Years', 'UpTo2Years', 'UpTo1Year', 'TrialPeriod', 'Retiree', 'Other']))]

# Home ownership
df.HomeOwnershipType = df.HomeOwnershipType.astype('category')
df = df.loc[(df.HomeOwnershipType.isin([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))]

# Rating
df = df.loc[(df.Rating.isin(['AA', 'A', 'B', 'C', 'D', 'E', 'F', 'HR']))]

# Credit score
df.CreditScoreEeMini = df.CreditScoreEeMini.astype('category')
df = df.loc[(df.CreditScoreEeMini.isin([1000, 900, 800, 700, 600, 500]))]

In [111]:
# Dealing with null values

df.PreviousRepaymentsBeforeLoan = df.PreviousRepaymentsBeforeLoan.fillna(0)

df.PreviousEarlyRepaymentsBefoleLoan = df.PreviousEarlyRepaymentsBefoleLoan.fillna(0)

In [112]:
print(f"DataFrame shape after cleaning: {df.shape}")

DataFrame shape after cleaning: (80918, 72)


## Feature engineering

TODO

## Constructing the target variable

We construct a binary target variable called "PreferLoan". As stated before, loans that have had issues are not preferred and are assigned the value 0; loans without issues are assigned the value 1. There is one small exception: if the worst late category is 16-30 days or less and the loan has had no other issues, the loan is still preferred.

In [None]:
# Constructing the target value.

# Loan status must be 'Repaid'.
# WorseLateCategory must not be higher than 16-30 (can be null).
# Loan must be repaid before or on the original maturity date.
# Loan must not be restructured.
# Loan must not be defaulted.

# Set the default value for all loans to be 0.
df["PreferLoan"] = 0

# Select preferred loans and set their value to 1.
df.loc[(
               (df.Status == 'Repaid') &
               (df.WorseLateCategory.isin([np.nan, '1-7', '8-15', '16-30'])) &
               (df.MaturityDate_Last <= df.MaturityDate_Original) &
               (df.Restructured != 1) &
               (df.DefaultDate.isnull())
       ), 'PreferLoan'] = 1
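
The rule can be checked on a handful of made-up loans. This is only a sketch: the toy rows are assumptions for illustration, and the null check on `WorseLateCategory` is written out explicitly with `isna()` instead of passing `np.nan` to `isin`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Status": ["Repaid", "Repaid", "Repaid", "Late"],
    "WorseLateCategory": [np.nan, "16-30", "61-90", "1-7"],
    "MaturityDate_Original": pd.to_datetime(["2020-01-01"] * 4),
    "MaturityDate_Last": pd.to_datetime(["2020-01-01"] * 4),
    "Restructured": [False, False, False, False],
    "DefaultDate": pd.to_datetime([None, None, None, None]),
})

# Never late, or late by at most 16-30 days.
ok_late = toy.WorseLateCategory.isna() | toy.WorseLateCategory.isin(["1-7", "8-15", "16-30"])

preferred = (
    (toy.Status == "Repaid")
    & ok_late
    & (toy.MaturityDate_Last <= toy.MaturityDate_Original)
    & (toy.Restructured != 1)
    & toy.DefaultDate.isnull()
)
toy["PreferLoan"] = preferred.astype(int)
print(toy[["Status", "WorseLateCategory", "PreferLoan"]])
```

The first two loans are preferred; the third was late by more than 30 days and the fourth is not repaid, so both get the value 0.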

## Data selection for modeling

Many of the columns we used for filtering are of no use in the next step, creating a classification model. Below we select the columns that can be used for modeling. We also check for null values, as some models cannot deal with them, and we one-hot encode the categorical values.

In [114]:
# We do not include the "Country" column, because we predict only loans from EE.
modeling_cols = ['NewCreditCustomer', 'VerificationType', 'LanguageCode', 'Age', 'Gender',
        'Amount', 'Interest', 'LoanDuration', 'MonthlyPayment',
        'Education', 'EmploymentDurationCurrentEmployer', 'HomeOwnershipType', 'IncomeTotal',
        'ExistingLiabilities', 'LiabilitiesTotal', 'Rating',
        'CreditScoreEeMini', 'NoOfPreviousLoansBeforeLoan', 'AmountOfPreviousLoansBeforeLoan',
        'PreviousRepaymentsBeforeLoan', 'PreviousEarlyRepaymentsBefoleLoan', 'PreferLoan']

df = df[modeling_cols]

In [115]:
# Check for null values
df.isna().sum()

NewCreditCustomer                    0
VerificationType                     0
LanguageCode                         0
Age                                  0
Gender                               0
Amount                               0
Interest                             0
LoanDuration                         0
MonthlyPayment                       0
Education                            0
EmploymentDurationCurrentEmployer    0
HomeOwnershipType                    0
IncomeTotal                          0
ExistingLiabilities                  0
LiabilitiesTotal                     0
Rating                               0
CreditScoreEeMini                    0
NoOfPreviousLoansBeforeLoan          0
AmountOfPreviousLoansBeforeLoan      0
PreviousRepaymentsBeforeLoan         0
PreviousEarlyRepaymentsBefoleLoan    0
PreferLoan                           0
dtype: int64

In [116]:
# One-hot encoding.
df = pd.get_dummies(df)
df

Unnamed: 0,NewCreditCustomer,LanguageCode,Age,Amount,Interest,MonthlyPayment,IncomeTotal,ExistingLiabilities,LiabilitiesTotal,NoOfPreviousLoansBeforeLoan,AmountOfPreviousLoansBeforeLoan,PreviousRepaymentsBeforeLoan,PreviousEarlyRepaymentsBefoleLoan,PreferLoan,VerificationType_1.0,VerificationType_3.0,VerificationType_4.0,Gender_0.0,Gender_1.0,Gender_2.0,LoanDuration_3,LoanDuration_6,LoanDuration_9,LoanDuration_12,LoanDuration_18,LoanDuration_24,LoanDuration_30,LoanDuration_36,LoanDuration_42,LoanDuration_48,LoanDuration_54,LoanDuration_60,Education_-1.0,Education_1.0,Education_2.0,Education_3.0,Education_4.0,Education_5.0,EmploymentDurationCurrentEmployer_MoreThan5Years,EmploymentDurationCurrentEmployer_Other,EmploymentDurationCurrentEmployer_Retiree,EmploymentDurationCurrentEmployer_TrialPeriod,EmploymentDurationCurrentEmployer_UpTo1Year,EmploymentDurationCurrentEmployer_UpTo2Years,EmploymentDurationCurrentEmployer_UpTo3Years,EmploymentDurationCurrentEmployer_UpTo4Years,EmploymentDurationCurrentEmployer_UpTo5Years,HomeOwnershipType_1.0,HomeOwnershipType_2.0,HomeOwnershipType_3.0,HomeOwnershipType_4.0,HomeOwnershipType_5.0,HomeOwnershipType_6.0,HomeOwnershipType_7.0,HomeOwnershipType_8.0,HomeOwnershipType_9.0,HomeOwnershipType_10.0,Rating_A,Rating_AA,Rating_B,Rating_C,Rating_D,Rating_E,Rating_F,Rating_HR,CreditScoreEeMini_500.0,CreditScoreEeMini_600.0,CreditScoreEeMini_700.0,CreditScoreEeMini_800.0,CreditScoreEeMini_900.0,CreditScoreEeMini_1000.0
11326,True,1,52,700.00,26.00,88.48,590.00,4,626.00,0.00,0.00,0.00,0.00,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
11294,False,1,44,500.00,30.00,17.64,1195.00,9,721.50,1.00,1000.00,123.30,0.00,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
11346,True,3,43,4150.00,31.00,190.57,1386.00,3,831.00,0.00,0.00,0.00,0.00,1,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
11342,False,1,35,3500.00,32.00,133.62,1298.00,5,1133.82,1.00,3000.00,556.53,0.00,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0
11351,True,1,39,1000.00,34.00,74.60,1033.00,2,670.00,0.00,0.00,0.00,0.00,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183394,True,3,25,2658.00,11.39,66.45,1600.00,2,363.57,1.00,531.00,180.79,0.00,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
183567,True,1,20,106.00,36.57,3.88,500.00,1,19.33,1.00,531.00,34.44,0.00,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1
183568,False,1,33,956.00,22.13,28.62,2200.00,6,721.35,6.00,21050.00,4533.74,0.00,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
183574,True,1,34,6911.00,15.28,184.98,2100.00,2,206.00,0.00,0.00,0.00,0.00,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
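
The encoded table above contains columns such as `LoanDuration_3`, `Gender_2.0` and `Education_-1.0`, although those values were filtered out earlier. A likely explanation is that a column converted with `astype('category')` keeps its unused categories after row filtering, and `pd.get_dummies` emits one column per registered category. The sketch below reproduces this on made-up data and shows one possible remedy; it is an illustration, not the worksheet's own method.

```python
import pandas as pd

toy = pd.DataFrame({"LoanDuration": [3, 6, 12, 60]})
toy["LoanDuration"] = toy["LoanDuration"].astype("category")
toy = toy.loc[toy["LoanDuration"].isin([6, 12, 60])].copy()

# The filtered-out category 3 is still registered, so it gets an all-zero column.
with_unused = pd.get_dummies(toy)

# Dropping unused categories first yields only the columns that occur in the data.
toy["LoanDuration"] = toy["LoanDuration"].cat.remove_unused_categories()
without_unused = pd.get_dummies(toy)

print(list(with_unused.columns))
print(list(without_unused.columns))
```

All-zero columns carry no information for a model, so removing unused categories before encoding keeps the feature matrix smaller.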


In [118]:
df.to_csv('data/processed.csv', index=False)