# 1. Data preprocessing

The goal of this worksheet is to fully process the dataset so the data can be used for modeling.

This worksheet is structured as follows:
1. Obtaining the data
2. Inspecting the data
3. Cleaning and filtering the data
4. Feature engineering
5. Constructing the target variable
6. Data selection for modeling


In [54]:
# Imports and display settings.
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

## Obtaining the data

The dataset is quite large, so it should only be downloaded once. However, if you plan to work on it continuously, consider re-downloading it about once a month, because Bondora adds new loans and updates some of the older ones. After downloading, we extract the zip file and load the extracted .csv file into a DataFrame. The data is then sorted chronologically by listing time.

In [55]:
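def download():
    """Download the Bondora loan data archive and extract it into data/."""
    import os
    import zipfile
    from requests import get

    url = "https://bondora.com/marketing/media/LoanData.zip"

    # Use the same relative "data/" directory for the archive and the extracted
    # files, so that read_csv("data/LoanData.csv") below finds the result.
    os.makedirs("data", exist_ok=True)

    response = get(url)
    response.raise_for_status()  # Stop here rather than saving an error page.
    with open("data/Data.zip", "wb") as file:
        file.write(response.content)

    with zipfile.ZipFile("data/Data.zip", "r") as zip_ref:
        zip_ref.extractall("data/")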

#download()
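
The "download once, refresh roughly monthly" advice can be automated with a small freshness check. The following is only a sketch: the helper name `is_stale` and the 30-day threshold are assumptions chosen for illustration, not part of the original worksheet.

```python
import os
import time


def is_stale(path, max_age_days=30, now=None):
    """Return True if `path` is missing or older than `max_age_days`.

    Hypothetical helper: the name and the 30-day default are assumptions.
    """
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    age_days = (now - os.path.getmtime(path)) / 86400  # seconds per day
    return age_days > max_age_days


# Possible usage together with the download() function defined above:
# if is_stale("data/Data.zip"):
#     download()
```

The optional `now` argument exists only to make the check easy to test without waiting a month.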

In [56]:
data = pd.read_csv("data/LoanData.csv")
data = data.sort_values(by=['ListedOnUTC'])
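
Sorting `ListedOnUTC` as plain text gives chronological order only as long as every timestamp uses the same zero-padded format. A more defensive variant, sketched below on a tiny made-up CSV, parses the column into real datetimes while loading. The sample rows are invented stand-ins, and `low_memory=False` is an optional setting that makes pandas infer each column's type from the whole file at once.

```python
import io

import pandas as pd

# Tiny stand-in for LoanData.csv; only the column names come from the dataset.
sample_csv = io.StringIO(
    "LoanId,ListedOnUTC,Amount\n"
    "B,2021-11-09 21:14:48,637.0\n"
    "A,2009-02-21 14:12:39,63.91\n"
)

# Parse the listing timestamp while reading, then sort chronologically.
sample = pd.read_csv(sample_csv, parse_dates=["ListedOnUTC"], low_memory=False)
sample = sample.sort_values(by="ListedOnUTC").reset_index(drop=True)
```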



## Inspecting the data

We can see that the dataset contains roughly 200,000 loans and 112 columns. The earliest loans are from 2009, which means we have over 10 years of loan data.

In [57]:
print(f"DataFrame shape: {data.shape}")

DataFrame shape: (197625, 112)


In [58]:
data.head(5)

Unnamed: 0,ReportAsOfEOD,LoanId,LoanNumber,ListedOnUTC,BiddingStartedOn,BidsPortfolioManager,BidsApi,BidsManual,PartyId,NewCreditCustomer,LoanApplicationStartedDate,LoanDate,ContractEndDate,FirstPaymentDate,MaturityDate_Original,MaturityDate_Last,ApplicationSignedHour,ApplicationSignedWeekday,VerificationType,LanguageCode,Age,DateOfBirth,Gender,Country,AppliedAmount,Amount,Interest,LoanDuration,MonthlyPayment,County,City,UseOfLoan,Education,MaritalStatus,NrOfDependants,EmploymentStatus,EmploymentDurationCurrentEmployer,EmploymentPosition,WorkExperience,OccupationArea,HomeOwnershipType,IncomeFromPrincipalEmployer,IncomeFromPension,IncomeFromFamilyAllowance,IncomeFromSocialWelfare,IncomeFromLeavePay,IncomeFromChildSupport,IncomeOther,IncomeTotal,ExistingLiabilities,LiabilitiesTotal,RefinanceLiabilities,DebtToIncome,FreeCash,MonthlyPaymentDay,ActiveScheduleFirstPaymentReached,PlannedPrincipalTillDate,PlannedInterestTillDate,LastPaymentOn,CurrentDebtDaysPrimary,DebtOccuredOn,CurrentDebtDaysSecondary,DebtOccuredOnForSecondary,ExpectedLoss,LossGivenDefault,ExpectedReturn,ProbabilityOfDefault,DefaultDate,PrincipalOverdueBySchedule,PlannedPrincipalPostDefault,PlannedInterestPostDefault,EAD1,EAD2,PrincipalRecovery,InterestRecovery,RecoveryStage,StageActiveSince,ModelVersion,Rating,EL_V0,Rating_V0,EL_V1,Rating_V1,Rating_V2,Status,Restructured,ActiveLateCategory,WorseLateCategory,CreditScoreEsMicroL,CreditScoreEsEquifaxRisk,CreditScoreFiAsiakasTietoRiskGrade,CreditScoreEeMini,PrincipalPaymentsMade,InterestAndPenaltyPaymentsMade,PrincipalWriteOffs,InterestAndPenaltyWriteOffs,PrincipalBalance,InterestAndPenaltyBalance,NoOfPreviousLoansBeforeLoan,AmountOfPreviousLoansBeforeLoan,PreviousRepaymentsBeforeLoan,PreviousEarlyRepaymentsBefoleLoan,PreviousEarlyRepaymentsCountBeforeLoan,GracePeriodStart,GracePeriodEnd,NextPaymentDate,NextPaymentNr,NrOfScheduledPayments,ReScheduledOn,PrincipalDebtServicingCost,InterestAndPenaltyDebtServicingCost,ActiveLateLastPaymentCategory
957,2021-11-10,FA160D69-2682-4A60-8D8E-9BB700EA30CE,37,2009-02-21 14:12:39,2009-02-21 14:12:39,0,0.0,63.91,{544DFBAC-374F-4039-AE45-9BB700E44853},True,2009-02-21 14:12:39,2009-03-07,2009-09-10,2009-04-10,2009-09-10,2009-09-10,14,7,2.0,1,21,,0.0,EE,63.91,63.91,20.0,6,,,,3,5.0,4.0,0.0,5.0,UpTo2Years,,5To10Years,,,15500.0,0.0,0.0,0.0,0.0,0.0,0.0,15500.0,0,0.0,0,0.0,0.0,10,True,1000.0,34.14,2009-09-10,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,,,,,,63.91,2.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,
958,2021-11-10,8E929B92-7C99-421D-8499-9BB70125F390,42,2009-02-21 17:50:14,2009-02-21 17:50:14,0,0.0,83.09,{B7FDCB11-4CE9-4CDE-993F-9BB70103B180},True,2009-02-21 17:50:14,2009-03-03,2011-03-14,2009-04-13,2011-03-14,2011-03-14,17,7,2.0,1,19,,1.0,EE,958.67,83.08,10.0,24,,,,6,2.0,3.0,2.0,3.0,UpTo1Year,,LessThan2Years,,,4000.0,0.0,0.0,0.0,0.0,0.0,0.0,4000.0,0,0.0,0,0.0,0.0,12,True,1300.0,142.73,2011-03-14,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,180+,,,,,83.08,9.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,
959,2021-11-10,33B3F669-D0E3-4474-8045-9BB70128D064,43,2009-02-21 18:00:40,2009-02-21 18:00:40,0,0.0,322.75,{E58803E6-77B6-40EB-83C4-9BB70118C245},True,2009-02-21 18:00:40,2009-02-28,2010-03-10,2009-04-10,2010-03-10,2010-03-10,18,7,2.0,1,38,,0.0,EE,639.12,322.75,25.0,12,,,,6,4.0,4.0,4.0,3.0,UpTo4Years,,15To25Years,,,9000.0,0.0,0.0,0.0,0.0,0.0,0.0,9000.0,0,0.0,0,0.0,0.0,10,True,5050.0,697.28,2010-03-10,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,151-180,,,,,322.75,44.56,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,
960,2021-11-10,7074D9E8-E8F5-403B-8614-9BB701338AD4,44,2009-02-21 18:39:43,2009-02-21 18:39:43,0,0.0,252.45,{DE67C0EB-7534-47F2-87BD-9BB7011E112A},True,2009-02-21 18:39:43,2009-03-03,2011-04-04,2009-05-01,2011-04-01,2011-04-01,18,7,2.0,1,55,,1.0,EE,958.67,252.45,10.0,24,,,,2,3.0,2.0,3.0,3.0,MoreThan5Years,,MoreThan25Years,,,6000.0,0.0,0.0,0.0,0.0,0.0,0.0,6000.0,0,0.0,0,0.0,0.0,1,True,3950.0,454.3,2011-04-04,,,,,,,,,,,,,,,,,,,,,,,,,,Repaid,False,,121-150,,,,,252.45,29.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,
961,2021-11-10,39F2A312-CD6C-4F60-A7F6-9BB7014CE0A9,45,2009-02-21 20:11:58,2009-02-21 20:11:58,0,0.0,63.91,{C41C37A5-B2D7-4B5A-B2D2-9BB70149B419},True,2009-02-21 20:11:58,2009-03-07,2010-06-10,2009-05-06,2010-04-06,2010-04-06,20,7,2.0,1,46,,1.0,EE,766.94,63.91,20.0,12,,,,8,5.0,1.0,5.0,3.0,MoreThan5Years,,MoreThan25Years,,,9000.0,0.0,0.0,0.0,0.0,0.0,1700.0,10700.0,0,0.0,0,0.0,0.0,6,True,1000.0,127.58,2010-06-10,,,,,,,,,2010-05-04,,0.0,0.0,3.02,-4.92,3.02,0.71,,,,,,,,,,Repaid,False,,61-90,,,,,63.91,8.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,0.0,0.0,


In [59]:
data.tail(5)

Unnamed: 0,ReportAsOfEOD,LoanId,LoanNumber,ListedOnUTC,BiddingStartedOn,BidsPortfolioManager,BidsApi,BidsManual,PartyId,NewCreditCustomer,LoanApplicationStartedDate,LoanDate,ContractEndDate,FirstPaymentDate,MaturityDate_Original,MaturityDate_Last,ApplicationSignedHour,ApplicationSignedWeekday,VerificationType,LanguageCode,Age,DateOfBirth,Gender,Country,AppliedAmount,Amount,Interest,LoanDuration,MonthlyPayment,County,City,UseOfLoan,Education,MaritalStatus,NrOfDependants,EmploymentStatus,EmploymentDurationCurrentEmployer,EmploymentPosition,WorkExperience,OccupationArea,HomeOwnershipType,IncomeFromPrincipalEmployer,IncomeFromPension,IncomeFromFamilyAllowance,IncomeFromSocialWelfare,IncomeFromLeavePay,IncomeFromChildSupport,IncomeOther,IncomeTotal,ExistingLiabilities,LiabilitiesTotal,RefinanceLiabilities,DebtToIncome,FreeCash,MonthlyPaymentDay,ActiveScheduleFirstPaymentReached,PlannedPrincipalTillDate,PlannedInterestTillDate,LastPaymentOn,CurrentDebtDaysPrimary,DebtOccuredOn,CurrentDebtDaysSecondary,DebtOccuredOnForSecondary,ExpectedLoss,LossGivenDefault,ExpectedReturn,ProbabilityOfDefault,DefaultDate,PrincipalOverdueBySchedule,PlannedPrincipalPostDefault,PlannedInterestPostDefault,EAD1,EAD2,PrincipalRecovery,InterestRecovery,RecoveryStage,StageActiveSince,ModelVersion,Rating,EL_V0,Rating_V0,EL_V1,Rating_V1,Rating_V2,Status,Restructured,ActiveLateCategory,WorseLateCategory,CreditScoreEsMicroL,CreditScoreEsEquifaxRisk,CreditScoreFiAsiakasTietoRiskGrade,CreditScoreEeMini,PrincipalPaymentsMade,InterestAndPenaltyPaymentsMade,PrincipalWriteOffs,InterestAndPenaltyWriteOffs,PrincipalBalance,InterestAndPenaltyBalance,NoOfPreviousLoansBeforeLoan,AmountOfPreviousLoansBeforeLoan,PreviousRepaymentsBeforeLoan,PreviousEarlyRepaymentsBefoleLoan,PreviousEarlyRepaymentsCountBeforeLoan,GracePeriodStart,GracePeriodEnd,NextPaymentDate,NextPaymentNr,NrOfScheduledPayments,ReScheduledOn,PrincipalDebtServicingCost,InterestAndPenaltyDebtServicingCost,ActiveLateLastPaymentCategory
197112,2021-11-10,9443551E-9656-4C9C-B036-ADDB017E027D,2438363,2021-11-09 21:14:48,2021-11-09 23:14:48,28,3.0,16.0,{7B3BFFB4-9CA2-4AD8-A38F-AD3300342FCB},True,2021-11-09 23:10:51,2021-11-09,,2021-12-08,2026-11-09,2026-11-09,23,3,4.0,1,31,,0.0,EE,637.0,637.0,43.26,60,25.34,,,-1,1.0,-1.0,,-1.0,UpTo1Year,,,-1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1400.0,0,0.0,0,0.0,0.0,8,False,,,,,,,,0.2,0.55,0.06,0.36,,0.0,,,,,,,,,6.0,F,,,,,,Current,False,,,M,,,900.0,0.0,0.0,,,637.0,,0.0,0.0,,,0.0,,,2021-12-08,1.0,60.0,,,,
197111,2021-11-10,96A41405-759B-4936-BA22-ADDB017B8C18,2438353,2021-11-09 21:23:44,2021-11-09 23:23:44,127,0.0,91.0,{9E336188-56C7-4FDB-B370-ADDB017B8C46},True,2021-11-09 23:01:54,2021-11-09,,2021-11-15,2026-10-13,2026-10-13,23,3,4.0,4,22,,0.0,FI,3113.0,3113.0,18.98,60,77.82,,,-1,3.0,-1.0,,-1.0,UpTo5Years,,,-1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1600.0,0,0.0,0,0.0,0.0,13,False,,,,,,,,0.09,0.73,0.07,0.12,,0.0,,,,,,,,,6.0,D,,,,,,Current,False,,,M,,4.0,,0.0,0.0,,,3113.0,,0.0,0.0,,,0.0,,,2021-11-15,1.0,60.0,,,,
197113,2021-11-10,337C80A1-882F-4E27-A199-ADDB01861DD6,2438387,2021-11-09 21:46:45,2021-11-09 23:46:45,15,0.0,56.0,{16C016AA-FEE1-4BA1-8F8A-ACED00B658DF},True,2021-11-09 23:40:22,2021-11-09,,2021-12-08,2024-11-08,2024-11-08,23,3,4.0,1,64,,1.0,EE,956.0,956.0,34.05,36,43.59,,,-1,5.0,-1.0,,-1.0,Retiree,,,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,656.0,3,248.27,0,0.0,0.0,8,False,,,,,,,,0.11,0.6,0.13,0.18,,0.0,,,,,,,,,6.0,D,,,,,,Current,False,,,M,,,1000.0,0.0,0.0,,,956.0,,1.0,2126.0,140.0,,0.0,,,2021-12-08,1.0,36.0,,,,
197114,2021-11-10,B0100B25-49CA-4DA6-B8F3-ADDB018A8159,2438395,2021-11-09 22:02:55,2021-11-10 00:02:55,79,9.0,63.0,{D9FFA1F8-0B6B-4DC1-9C3B-ADDB018A8159},True,2021-11-09 23:56:21,2021-11-10,,2021-11-18,2026-10-19,2026-10-19,23,3,4.0,3,43,,0.0,EE,2126.0,2126.0,19.72,60,61.26,,,-1,4.0,-1.0,,-1.0,UpTo5Years,,,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1250.0,0,0.0,0,0.0,0.0,18,False,,,,,,,,0.05,0.55,0.1,0.1,,0.0,,,,,,,,,6.0,B,,,,,,Current,False,,,M,,,1000.0,0.0,0.0,,,2126.0,,0.0,0.0,,,0.0,,,2021-11-18,1.0,60.0,,,,
197115,2021-11-10,F254CC92-13ED-4565-839F-ADDC0001D790,2438402,2021-11-09 22:09:40,2021-11-10 00:09:40,217,0.0,1.0,{9666D351-A66E-43A5-8BD5-A3E10147B94E},False,2021-11-10 00:06:42,2021-11-10,,2021-12-02,2026-11-02,2026-11-02,0,4,4.0,4,41,,0.0,FI,3113.0,3113.0,18.79,60,77.82,,,-1,5.0,-1.0,,-1.0,MoreThan5Years,,,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3650.0,3,232.46,0,0.0,0.0,2,False,,,,,,,,0.09,0.79,0.07,0.12,,0.0,,,,,,,,,6.0,D,,,,,,Current,False,,,M,,3.0,,0.0,0.0,,,3113.0,,4.0,12300.0,408.08,,0.0,,,2021-12-02,1.0,60.0,,,,


The data contains various numerical and categorical columns. To fully understand the data and proceed with cleaning it, one must work through most of the columns. Visual inspection is important with this dataset because some columns become obsolete halfway through it. Descriptive statistics for individual columns also help to build a better understanding of the data. Below are some examples of how numerical and categorical columns were analyzed.

Another helpful resource is the overview of the codes and terms used in the dataset, which can be found in the [public-reports](https://www.bondora.com/en/public-reports) section of Bondora's website.

In [60]:
# Describing a numerical column
data.Amount.describe()

count   197625.00
mean      2572.08
std       2199.15
min          6.39
25%        740.00
50%       2125.00
75%       4100.00
max      10632.00
Name: Amount, dtype: float64

In [61]:
# Numerical column value counts
data.Amount.value_counts(dropna=False)

530.00     20337
531.00     13920
4150.00    11813
2125.00     9046
2126.00     7108
           ...  
6576.00        1
4623.00        1
5263.00        1
4174.00        1
1178.90        1
Name: Amount, Length: 6606, dtype: int64

In [62]:
# Describing a categorical column
data.Rating.describe()

count     194892
unique         8
top            D
freq       46991
Name: Rating, dtype: object

In [63]:
# Categorical column value counts
data.Rating.value_counts(dropna=False)

D      46991
E      34391
C      31424
F      26159
B      24633
HR     14832
A       8798
AA      7664
NaN     2733
Name: Rating, dtype: int64
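
Working through 112 columns one at a time is slow, so a per-column overview can help decide where to look first. The helper below is a minimal sketch run on a made-up frame; `column_overview` is a hypothetical name, and on the real data it would be called as `column_overview(data)`.

```python
import pandas as pd


def column_overview(frame):
    """Summarise dtype, share of nulls and number of distinct values per column."""
    overview = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "null_share": frame.isna().mean(),
        "n_unique": frame.nunique(dropna=True),
    })
    # Columns with the most missing values come first.
    return overview.sort_values("null_share", ascending=False)


# Made-up toy data that reuses two column names from the dataset.
toy = pd.DataFrame({
    "Amount": [530.0, 531.0, 530.0, 4150.0],
    "Rating": ["D", None, "E", "D"],
})
overview = column_overview(toy)
```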

## Cleaning and filtering the data

The first step is to drop all the unnecessary columns. Many of these columns are obsolete (see the Bondora API docs) or serve no purpose for us. Some columns, however, are not marked as obsolete in the API docs, yet from about halfway through the dataset they contain only null values. The most likely reason is the EU's data protection law (GDPR). These columns were detected by visually going through the dataset, and since the second half of their data is missing, there is no point in keeping them.
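
The visual check for columns that "go dark" partway through can be complemented by comparing the share of nulls in the first and second half of the chronologically sorted data. This is a sketch on invented rows, not the procedure used to build the drop list; `null_share_by_half` is a hypothetical helper and assumes the frame is already sorted by time.

```python
import pandas as pd


def null_share_by_half(frame):
    """Compare the share of nulls in the first and second half of the rows."""
    half = len(frame) // 2
    return pd.DataFrame({
        "first_half": frame.iloc[:half].isna().mean(),
        "second_half": frame.iloc[half:].isna().mean(),
    })


# Made-up rows: "City" is filled early on and empty later.
toy = pd.DataFrame({
    "City": ["Tallinn", "Tartu", None, None],
    "Amount": [63.91, 83.08, 637.0, 3113.0],
})
shares = null_share_by_half(toy)

# Columns that are (almost) entirely null in the second half are drop candidates.
candidates = shares.index[shares["second_half"] > 0.99].tolist()
```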

In [64]:
# Drop unnecessary columns.
df = data.drop(['BiddingStartedOn', 'LoanApplicationStartedDate', 'ApplicationSignedHour',
                'ApplicationSignedWeekday', 'DateOfBirth', 'County', 'City', 'UseOfLoan',
                'MaritalStatus', 'NrOfDependants', 'EmploymentStatus', 'EmploymentPosition',
                'WorkExperience', 'OccupationArea', 'IncomeFromPrincipalEmployer', 'IncomeFromPension',
                'IncomeFromFamilyAllowance', 'IncomeFromSocialWelfare', 'IncomeFromLeavePay',
                'IncomeFromChildSupport', 'IncomeOther', 'RefinanceLiabilities', 'DebtToIncome', 'FreeCash', 'MonthlyPaymentDay',
                'EL_V0', 'Rating_V0', 'EL_V1', 'Rating_V1', 'Rating_V2', 'PrincipalWriteOffs', 'InterestAndPenaltyWriteOffs', 'PlannedPrincipalTillDate', 'CreditScoreEsEquifaxRisk', 'CreditScoreEsMicroL', 'BidsPortfolioManager', 'BidsApi', 'BidsManual', 'PrincipalDebtServicingCost', 'InterestAndPenaltyDebtServicingCost',
                'ContractEndDate', 'LoanNumber', 'FirstPaymentDate', 'PlannedInterestTillDate', 'LastPaymentOn', 'CurrentDebtDaysPrimary',
                'DebtOccuredOn', 'CurrentDebtDaysSecondary', 'DebtOccuredOnForSecondary', 'ExpectedLoss', 'LossGivenDefault', 'ExpectedReturn',
                'ProbabilityOfDefault', 'ActiveScheduleFirstPaymentReached', 'PlannedPrincipalPostDefault', 'PlannedInterestPostDefault', 'EAD1', 'EAD2',
                'PrincipalRecovery', 'InterestRecovery', 'RecoveryStage', 'StageActiveSince', 'ModelVersion', 'NextPaymentNr', 'ReScheduledOn'], axis=1)

# Set "LoanId" as the index
df = df.set_index('LoanId')

To further reduce the size of the dataset, we apply some filters. We first parse some of the columns into date format, which makes them easier to use in the filters.

Since we do not know the outcomes of active loans, we do not include them: for an active loan without issues, there is no way of knowing whether issues will arise in the future. However, if a loan is active but has been restructured, we keep it and treat it as a "bad" loan in the target variable, since we know for sure that the loan has had issues.

We also apply filters to numerical and categorical columns to deal with anomalies, and we handle null values.

In [65]:
data.ActiveScheduleFirstPaymentReached.value_counts(dropna=False)

True     184208
False     13417
Name: ActiveScheduleFirstPaymentReached, dtype: int64

In [66]:
df.head(5)

LoanId,ReportAsOfEOD,ListedOnUTC,PartyId,NewCreditCustomer,LoanDate,MaturityDate_Original,MaturityDate_Last,VerificationType,LanguageCode,Age,Gender,Country,AppliedAmount,Amount,Interest,LoanDuration,MonthlyPayment,Education,EmploymentDurationCurrentEmployer,HomeOwnershipType,IncomeTotal,ExistingLiabilities,LiabilitiesTotal,DefaultDate,PrincipalOverdueBySchedule,Rating,Status,Restructured,ActiveLateCategory,WorseLateCategory,CreditScoreFiAsiakasTietoRiskGrade,CreditScoreEeMini,PrincipalPaymentsMade,InterestAndPenaltyPaymentsMade,PrincipalBalance,InterestAndPenaltyBalance,NoOfPreviousLoansBeforeLoan,AmountOfPreviousLoansBeforeLoan,PreviousRepaymentsBeforeLoan,PreviousEarlyRepaymentsBefoleLoan,PreviousEarlyRepaymentsCountBeforeLoan,GracePeriodStart,GracePeriodEnd,NextPaymentDate,NrOfScheduledPayments,ActiveLateLastPaymentCategory
FA160D69-2682-4A60-8D8E-9BB700EA30CE,2021-11-10,2009-02-21 14:12:39,{544DFBAC-374F-4039-AE45-9BB700E44853},True,2009-03-07,2009-09-10,2009-09-10,2.0,1,21,0.0,EE,63.91,63.91,20.0,6,,5.0,UpTo2Years,,15500.0,0,0.0,,,,Repaid,False,,,,,63.91,2.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,
8E929B92-7C99-421D-8499-9BB70125F390,2021-11-10,2009-02-21 17:50:14,{B7FDCB11-4CE9-4CDE-993F-9BB70103B180},True,2009-03-03,2011-03-14,2011-03-14,2.0,1,19,1.0,EE,958.67,83.08,10.0,24,,2.0,UpTo1Year,,4000.0,0,0.0,,,,Repaid,False,,180+,,,83.08,9.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,
33B3F669-D0E3-4474-8045-9BB70128D064,2021-11-10,2009-02-21 18:00:40,{E58803E6-77B6-40EB-83C4-9BB70118C245},True,2009-02-28,2010-03-10,2010-03-10,2.0,1,38,0.0,EE,639.12,322.75,25.0,12,,4.0,UpTo4Years,,9000.0,0,0.0,,,,Repaid,False,,151-180,,,322.75,44.56,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,
7074D9E8-E8F5-403B-8614-9BB701338AD4,2021-11-10,2009-02-21 18:39:43,{DE67C0EB-7534-47F2-87BD-9BB7011E112A},True,2009-03-03,2011-04-01,2011-04-01,2.0,1,55,1.0,EE,958.67,252.45,10.0,24,,3.0,MoreThan5Years,,6000.0,0,0.0,,,,Repaid,False,,121-150,,,252.45,29.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,
39F2A312-CD6C-4F60-A7F6-9BB7014CE0A9,2021-11-10,2009-02-21 20:11:58,{C41C37A5-B2D7-4B5A-B2D2-9BB70149B419},True,2009-03-07,2010-04-06,2010-04-06,2.0,1,46,1.0,EE,766.94,63.91,20.0,12,,5.0,MoreThan5Years,,10700.0,0,0.0,2010-05-04,,,Repaid,False,,61-90,,,63.91,8.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,


In [67]:
# Parse date columns into date format.
date_columns = ['LoanDate', 'MaturityDate_Original', 'MaturityDate_Last', 'ReportAsOfEOD']
df[date_columns] = df[date_columns].apply(pd.to_datetime, format='%Y-%m-%d', errors='coerce')

# Do not include loans that are currently active, and have not been restructured (the original maturity date of the loan has not increased by at least 60 days).
df = df.loc[~((df.Status == 'Current') & (df.Restructured == 0))]

# Some loans were repaid very soon after the contract was signed. Filter out repaid loans whose last maturity date is less than 180 days after the loan date.
df = df.loc[~(((df.MaturityDate_Last - df.LoanDate) < pd.to_timedelta("180days")) & (df.Status == 'Repaid'))]

# Remove loans issued within 60 days of the report date, because most of these recent loans seem to have the status "Late".
df = df.loc[(df.LoanDate < (df.ReportAsOfEOD - pd.to_timedelta("60days")))]
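
To make the effect of the three date and status filters concrete, here is a small sketch on five invented loans. The row labels and values are made up; the conditions mirror the filters in the cell above, written as named boolean masks so that each rule can be inspected separately.

```python
import pandas as pd

# Five invented loans, one per situation the filters are meant to handle.
toy = pd.DataFrame(
    {
        "Status": ["Current", "Current", "Repaid", "Repaid", "Late"],
        "Restructured": [False, True, False, False, False],
        "LoanDate": pd.to_datetime(["2020-01-01"] * 4 + ["2021-10-20"]),
        "MaturityDate_Last": pd.to_datetime(
            ["2025-01-01", "2025-01-01", "2020-03-01", "2021-01-01", "2026-10-20"]
        ),
        "ReportAsOfEOD": pd.to_datetime(["2021-11-10"] * 5),
    },
    index=["active_ok", "active_restructured", "quick_repaid", "normal_repaid", "too_recent"],
)

# Active loans that were never restructured: outcome unknown.
active_untouched = (toy.Status == "Current") & ~toy.Restructured
# Repaid loans whose last maturity date is under 180 days after the loan date.
short_repaid = ((toy.MaturityDate_Last - toy.LoanDate) < pd.Timedelta(days=180)) & (
    toy.Status == "Repaid"
)
# Loans issued within 60 days of the report date.
too_recent = toy.LoanDate >= (toy.ReportAsOfEOD - pd.Timedelta(days=60))

kept = toy.loc[~active_untouched & ~short_repaid & ~too_recent]
```

Only `active_restructured` and `normal_repaid` survive. Because every condition is evaluated row by row, combining the masks gives the same result as applying the filters one after another.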

In [68]:
# Filters regarding numerical values.

# Age must be between 18 and 70.
df = df.loc[(df.Age >= 18) & (df.Age <= 70)]

# Monthly payment must be over 0.
df = df.loc[(df.MonthlyPayment > 0)]

In [69]:
# Filters regarding categorical values.

# Select allowed loan durations. This dataset contains various contract lengths, but we are interested in the most common ones. Values represent loan duration in months.
df.LoanDuration = df.LoanDuration.astype('category')
df = df.loc[(df.LoanDuration.isin([60, 48, 36, 30, 24, 18, 12, 9, 6]))]

# We only want loans from EE.
df = df.loc[(df.Country.isin(['EE']))]

# The main verification types.
df.VerificationType = df.VerificationType.astype('category')
df = df.loc[(df.VerificationType.isin([1, 2, 3, 4]))]

# Gender is either male or female.
df.Gender = df.Gender.astype('category')
df = df.loc[(df.Gender.isin([0, 1]))]

# Select allowed education levels.
df.Education = df.Education.astype('category')
df = df.loc[(df.Education.isin([1, 2, 3, 4, 5]))]

# Select allowed employment durations.
df = df.loc[(df.EmploymentDurationCurrentEmployer.isin(['MoreThan5Years', 'UpTo5Years', 'UpTo4Years', 'UpTo3Years', 'UpTo2Years', 'UpTo1Year', 'TrialPeriod', 'Retiree', 'Other']))]

# Select allowed home ownership types.
df.HomeOwnershipType = df.HomeOwnershipType.astype('category')
df = df.loc[(df.HomeOwnershipType.isin([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))]

# Select allowed ratings.
df = df.loc[(df.Rating.isin(['AA', 'A', 'B', 'C', 'D', 'E', 'F', 'HR']))]

# Select allowed credit score levels.
df.CreditScoreEeMini = df.CreditScoreEeMini.astype('category')
df = df.loc[(df.CreditScoreEeMini.isin([1000, 900, 800, 700, 600, 500]))]
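
The categorical filters all follow the same pattern: keep a row only if its value is in an allowed list. One way to express that pattern compactly is a dictionary of allowed values, sketched below on invented rows. The helper `filter_allowed` is a hypothetical name, and the dictionary shows only three of the columns filtered above.

```python
import pandas as pd


def filter_allowed(frame, allowed):
    """Keep rows whose value is in the allowed list for every listed column."""
    mask = pd.Series(True, index=frame.index)
    for column, values in allowed.items():
        # isin() is False for nulls, so rows with missing values are dropped too.
        mask &= frame[column].isin(values)
    return frame.loc[mask]


# Invented rows; the allowed values are copied from the filters above.
toy = pd.DataFrame({
    "Country": ["EE", "FI", "EE", "EE"],
    "Gender": [0.0, 1.0, 2.0, 1.0],
    "Rating": ["A", "B", "C", None],
})
allowed = {
    "Country": ["EE"],
    "Gender": [0, 1],
    "Rating": ["AA", "A", "B", "C", "D", "E", "F", "HR"],
}
filtered = filter_allowed(toy, allowed)
```

Note that `isin` never matches a null unless the list contains one, so these filters also remove rows with missing values in the filtered columns.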

In [70]:
# Dealing with null values

df.PreviousRepaymentsBeforeLoan = df.PreviousRepaymentsBeforeLoan.fillna(0)

df.PreviousEarlyRepaymentsBefoleLoan = df.PreviousEarlyRepaymentsBefoleLoan.fillna(0)

In [71]:
print(f"DataFrame shape after cleaning: {df.shape}")

DataFrame shape after cleaning: (82291, 46)


We can see that the dataset contains far fewer rows and columns after processing.

## Feature engineering

TODO

## Constructing the target variable

We construct a binary target variable called "PreferLoan". As stated before, loans that have had issues are not preferred and are assigned the value 0, while loans without issues are assigned the value 1. There is one small exception: if the worst late category is 16-30 days or less and the loan has had no other issues, the loan is still preferred.

In [72]:
# Constructing the target value.

# Loan status must be 'Repaid'.
# WorseLateCategory must not be higher than 16-30 (can be null).
# Loan must be repaid before or on the original maturity date.
# Loan must not be restructured.
# Loan must not be defaulted.

# Set the default value for all loans to be 0.
df["PreferLoan"] = 0

# Select preferred loans and set their value to 1.
df.loc[(
               (df.Status == 'Repaid') &
               (df.WorseLateCategory.isin([np.nan, '1-7', '8-15', '16-30'])) &
               (df.MaturityDate_Last <= df.MaturityDate_Original) &
               (df.Restructured != 1) &
               (df.DefaultDate.isnull())
       ), 'PreferLoan'] = 1
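
The rule can be sanity-checked on a few invented loans. The sketch below applies the same five conditions, with the null check on `WorseLateCategory` written out explicitly via `isna()`; all rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Four invented loans that differ only in status and worst late category.
toy = pd.DataFrame({
    "Status": ["Repaid", "Repaid", "Repaid", "Late"],
    "WorseLateCategory": [np.nan, "16-30", "31-60", np.nan],
    "MaturityDate_Original": pd.to_datetime(["2020-01-01"] * 4),
    "MaturityDate_Last": pd.to_datetime(["2020-01-01"] * 4),
    "Restructured": [False] * 4,
    "DefaultDate": [pd.NaT] * 4,
})

# Never late, or late by at most 16-30 days.
acceptable_lateness = toy.WorseLateCategory.isna() | toy.WorseLateCategory.isin(
    ["1-7", "8-15", "16-30"]
)

preferred = (
    (toy.Status == "Repaid")
    & acceptable_lateness
    & (toy.MaturityDate_Last <= toy.MaturityDate_Original)
    & ~toy.Restructured
    & toy.DefaultDate.isna()
)
toy["PreferLoan"] = preferred.astype(int)

# Share of each class; worth checking on the real data before modelling.
class_share = toy["PreferLoan"].value_counts(normalize=True)
```

The first two loans are preferred. The third was late by more than 30 days and the fourth is not repaid, so both get the value 0.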

## Next steps
We have successfully cleaned and processed the dataset. It can now be used in the next steps: modelling and validating the model.

In [76]:
df.to_csv('data/processed.csv', index=True)
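
A CSV file stores no type information, so datetime and category columns come back as plain text or numbers when the file is reloaded. The round trip below is a sketch of one way to restore the index and a date column in the modelling worksheet. It uses an in-memory buffer and two invented rows instead of `data/processed.csv`.

```python
import io

import pandas as pd

# Two invented rows standing in for the processed DataFrame.
frame = pd.DataFrame(
    {
        "LoanDate": pd.to_datetime(["2009-03-07", "2009-03-03"]),
        "PreferLoan": [1, 0],
    },
    index=pd.Index(["loan-a", "loan-b"], name="LoanId"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=True)
buffer.seek(0)

# Restore the index and re-parse the date column while reading.
restored = pd.read_csv(buffer, index_col="LoanId", parse_dates=["LoanDate"])
```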