# **PROJECT OUTLINE**
- Business Understanding & Problem Framework

- The Task

- Exploratory Data Analysis
  * Understanding the Data
  * Cleaning the Data
  * Relationship Analysis between the variables

- Feature Engineering & Modelling
  * Model Training
  * Model Evaluation
  * Shortlisting Promising Models
  * Predictions

- Findings & Recommendations



# **UNDERSTANDING THE PROBLEM**

SuperLender, a local digital lending company, uses credit risk models to offer profitable and impactful loan alternatives. Its assessment approach rests on two key factors that predict loan default: 
(a) the customer's willingness to pay; and 
(b) the customer's ability to pay. 



However, since not all customers repay their loans, the company is ready to invest in experienced data scientists to develop robust models for predicting the odds of repayment. 



To make informed decisions about loan approvals, credit grantors need to evaluate these two factors (the customer's willingness and ability to pay) at the point of each application. The evaluation determines how likely repayment is, whether the applicant is eligible for a loan and, if so, the specific terms of the offer (such as loan size, price, and tenure).



There are two categories of risk model: (a) the new business risk model; and (b) the repeat (or behavior) risk model. The former assesses the risk associated with a customer's first loan application, while the latter takes the customer's repayment history into account when they apply for a repeat loan. Incorporating the customer's prior loan performance can enhance the accuracy of the repeat risk model.
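
As a rough illustration of what "prior loan performance" could look like as model inputs, the sketch below collapses a toy previous-loans table into one row per customer. The column names `customerid`, `loanamount`, `firstduedate` and `firstrepaiddate` follow the previous-loans dataset used later in this notebook; the data values and the feature names (`n_prev_loans`, `late_share`, `avg_loanamount`) are assumptions made up for this example, not the project's final features.

```python
import pandas as pd

# Toy stand-in for a previous-loans table; the values are invented for illustration.
prev_loans = pd.DataFrame(
    {
        "customerid": ["a", "a", "a", "b", "b"],
        "loanamount": [10000, 20000, 20000, 10000, 10000],
        "firstduedate": pd.to_datetime(
            ["2017-01-10", "2017-02-10", "2017-03-10", "2017-01-15", "2017-02-15"]
        ),
        "firstrepaiddate": pd.to_datetime(
            ["2017-01-09", "2017-02-12", "2017-03-10", "2017-01-20", "2017-02-14"]
        ),
    }
)

# A loan counts as late when the first repayment came after the first due date.
prev_loans["repaid_late"] = prev_loans["firstrepaiddate"] > prev_loans["firstduedate"]

# One row per customer: number of prior loans, share repaid late, average size.
history = prev_loans.groupby("customerid").agg(
    n_prev_loans=("loanamount", "size"),
    late_share=("repaid_late", "mean"),
    avg_loanamount=("loanamount", "mean"),
)
print(history)
```

A table like `history` has one row per customer, so it can be joined to the current loan application without duplicating rows.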



For me, this is a great opportunity to use my skills to help solve this challenging problem for SuperLender, a company that values data-driven decision-making.

# **EXPLORATORY DATA ANALYSIS**

## **Understanding the Data**

In [1]:
# Importing Necessary Libraries

import pandas as pd     # useful for data manipulation
import numpy as np

In [2]:
# Importing Datasets

cust_demo = pd.read_csv("https://raw.githubusercontent.com/VICTORIA-OKESIPE/KaggleX-BIPOC-Mentorship-Program/main/Data/customerdemographics.csv", sep=",")
cust_perf = pd.read_csv("https://raw.githubusercontent.com/VICTORIA-OKESIPE/KaggleX-BIPOC-Mentorship-Program/main/Data/customerperf.csv", sep=",")
cust_prev_loans = pd.read_csv("https://raw.githubusercontent.com/VICTORIA-OKESIPE/KaggleX-BIPOC-Mentorship-Program/main/Data/customerprevloans.csv", sep=",")

In [None]:
# Inspecting Datasets

# ASSIGNMENTS
#head
#tail
#shapes (size)
#columns
#datatypes ----> having trouble changing data type (date related ones)
#describe
#nunique
#unique  ----> is giving error too
#merge data  ----> has issue merging datasets too
#correlating the variables  ---->

In [4]:
cust_demo.head()      # checking the first five rows of the demographic data

Unnamed: 0,customerid,birthdate,bank_account_type,longitude_gps,latitude_gps,bank_name_clients,bank_branch_clients,employment_status_clients,level_of_education_clients
0,8a858e135cb22031015cbafc76964ebd,00:00.0,Savings,3.319219,6.528604,GT Bank,,,
1,8a858e275c7ea5ec015c82482d7c3996,00:00.0,Savings,3.325598,7.119403,Sterling Bank,,Permanent,
2,8a858e5b5bd99460015bdc95cd485634,00:00.0,Savings,5.7461,5.563174,Fidelity Bank,,,
3,8a858efd5ca70688015cabd1f1e94b55,00:00.0,Savings,3.36285,6.642485,GT Bank,,Permanent,
4,8a858e785acd3412015acd48f4920d04,00:00.0,Savings,8.455332,11.97141,GT Bank,,Permanent,


In [None]:
cust_demo.tail()      # checking the last five rows of the data set

Unnamed: 0,customerid,birthdate,bank_account_type,longitude_gps,latitude_gps,bank_name_clients,bank_branch_clients,employment_status_clients,level_of_education_clients
4341,8a858f155554552501555588ca2b3b40,00:00.0,Other,3.236753,7.030168,Stanbic IBTC,,Permanent,Graduate
4342,8a858fc65cf978f4015cf97cee3a02ce,00:00.0,Savings,7.01375,4.875662,GT Bank,,,
4343,8a858f4f5b66de3a015b66fc83c61902,00:00.0,Savings,6.29553,7.092508,GT Bank,,Permanent,
4344,8aaae7a74400b28201441c8b62514150,00:00.0,Savings,3.354206,6.53907,GT Bank,HEAD OFFICE,Permanent,Primary
4345,8a85896653e2e18b0153e69c1b90265c,00:00.0,Savings,6.661014,7.4727,UBA,,Permanent,


In [None]:
cust_perf.head()      # checking the first five rows of the data set 

Unnamed: 0,customerid,systemloanid,loannumber,approveddate,creationdate,loanamount,totaldue,termdays,referredby,good_bad_flag
0,8a2a81a74ce8c05d014cfb32a0da1049,301994762,12,22:56.0,22:47.0,30000,34500.0,30,,Good
1,8a85886e54beabf90154c0a29ae757c0,301965204,2,04:41.0,04:18.0,15000,17250.0,30,,Good
2,8a8588f35438fe12015444567666018e,301966580,7,52:57.0,52:51.0,20000,22250.0,15,,Good
3,8a85890754145ace015429211b513e16,301999343,3,00:41.0,00:35.0,10000,11500.0,15,,Good
4,8a858970548359cc0154883481981866,301962360,9,42:45.0,42:39.0,40000,44000.0,30,,Good


In [None]:
cust_perf.tail()      # checking the last five rows of the data set 

Unnamed: 0,customerid,systemloanid,loannumber,approveddate,creationdate,loanamount,totaldue,termdays,referredby,good_bad_flag
4363,8a858e6d58b0cc520158beeb14b22a5a,302003163,2,19:42.0,18:30.0,10000,13000.0,30,,Bad
4364,8a858ee85cf400f5015cf44ab1c42d5c,301998967,2,35:47.0,35:40.0,10000,13000.0,30,,Bad
4365,8a858f365b2547f3015b284597147c94,301995576,3,25:57.0,24:47.0,10000,11500.0,15,,Bad
4366,8a858f935ca09667015ca0ee3bc63f51,301977679,2,50:27.0,50:21.0,10000,13000.0,30,8a858eda5c8863ff015c9dead65807bb,Bad
4367,8a858fd458639fcc015868eb14b542ad,301967124,8,01:06.0,01:01.0,30000,34500.0,30,,Bad


In [None]:
cust_prev_loans.head()      # checking the first five rows of the data set 

Unnamed: 0,customerid,systemloanid,loannumber,approveddate,creationdate,loanamount,totaldue,termdays,closeddate,referredby,firstduedate,firstrepaiddate
0,8a2a81a74ce8c05d014cfb32a0da1049,301682320,2,22:40.0,22:32.0,10000,13000.0,30,06:48.0,,00:00.0,51:43.0
1,8a2a81a74ce8c05d014cfb32a0da1049,301883808,9,39:07.0,38:53.0,10000,13000.0,30,44:49.0,,00:00.0,00:00.0
2,8a2a81a74ce8c05d014cfb32a0da1049,301831714,8,56:25.0,56:19.0,20000,23800.0,30,18:56.0,,00:00.0,03:47.0
3,8a8588f35438fe12015444567666018e,301861541,5,25:55.0,25:42.0,10000,11500.0,15,35:52.0,,00:00.0,48:43.0
4,8a85890754145ace015429211b513e16,301941754,2,29:57.0,29:50.0,10000,11500.0,15,18:43.0,,00:00.0,08:35.0


In [None]:
cust_prev_loans.tail()      # checking the last five rows of the data set 

Unnamed: 0,customerid,systemloanid,loannumber,approveddate,creationdate,loanamount,totaldue,termdays,closeddate,referredby,firstduedate,firstrepaiddate
18178,8a858899538ddb8e0153a2b555421fc5,301611754,2,36:34.0,36:28.0,10000,13000.0,30,04:52.0,,00:00.0,05:07.0
18179,8a858899538ddb8e0153a2b555421fc5,301761267,9,26:07.0,25:51.0,30000,34400.0,30,08:57.0,,00:00.0,53:48.0
18180,8a858899538ddb8e0153a2b555421fc5,301631653,4,30:56.0,30:50.0,10000,13000.0,30,39:00.0,,00:00.0,23:56.0
18181,8a858f0656b7820c0156c92ca3ba436f,301697691,1,03:45.0,03:34.0,10000,13000.0,30,17:54.0,,00:00.0,02:45.0
18182,8a858faf5679a838015688de3028143d,301715255,2,42:14.0,42:05.0,10000,13000.0,30,51:04.0,,00:00.0,35:55.0


In [None]:
print("The Customer Demographics Data:", cust_demo.shape)     # getting to know the size of the datasets
print("The Customer Performance Data:", cust_perf.shape)      # getting to know the size of the datasets
print("The Customer Previous Loan Data:", cust_prev_loans.shape)  # getting to know the size of the datasets

The Customer Demographics Data: (4346, 9)
The Customer Performance Data: (4368, 10)
The Customer Previous Loan Data: (18183, 12)


In [None]:
cust_demo.columns

Index(['customerid', 'birthdate', 'bank_account_type', 'longitude_gps',
       'latitude_gps', 'bank_name_clients', 'bank_branch_clients',
       'employment_status_clients', 'level_of_education_clients'],
      dtype='object')

In [None]:
cust_perf.columns

Index(['customerid', 'systemloanid', 'loannumber', 'approveddate',
       'creationdate', 'loanamount', 'totaldue', 'termdays', 'referredby',
       'good_bad_flag'],
      dtype='object')

In [None]:
cust_prev_loans.columns

Index(['customerid', 'systemloanid', 'loannumber', 'approveddate',
       'creationdate', 'loanamount', 'totaldue', 'termdays', 'closeddate',
       'referredby', 'firstduedate', 'firstrepaiddate'],
      dtype='object')

In [None]:
cust_demo.dtypes      # checking the datatypes

customerid                     object
birthdate                      object
bank_account_type              object
longitude_gps                 float64
latitude_gps                  float64
bank_name_clients              object
bank_branch_clients            object
employment_status_clients      object
level_of_education_clients     object
dtype: object

In [None]:
cust_demo["birthdate"] = pd.to_datetime(cust_demo["birthdate"])       # changing data type to "datetime"
cust_demo.dtypes

customerid                            object
birthdate                     datetime64[ns]
bank_account_type                     object
longitude_gps                        float64
latitude_gps                         float64
bank_name_clients                     object
bank_branch_clients                   object
employment_status_clients             object
level_of_education_clients            object
dtype: object

In [None]:
cust_perf.dtypes

customerid        object
systemloanid       int64
loannumber         int64
approveddate      object
creationdate      object
loanamount         int64
totaldue         float64
termdays           int64
referredby        object
good_bad_flag     object
dtype: object

In [None]:
# cust_perf["approveddate"] = pd.to_datetime(cust_perf["approveddate"])
# cust_perf["creationdate"] = pd.to_datetime(cust_perf["creationdate"])

# cust_perf.dtypes

In [None]:
cust_prev_loans.dtypes

customerid          object
systemloanid         int64
loannumber           int64
approveddate        object
creationdate        object
loanamount           int64
totaldue           float64
termdays             int64
closeddate          object
referredby          object
firstduedate        object
firstrepaiddate     object
dtype: object

In [None]:
# cust_prev_loans["approveddate"] = pd.to_datetime(cust_prev_loans["approveddate"])
# cust_prev_loans["creationdate"] = pd.to_datetime(cust_prev_loans["creationdate"])
# cust_prev_loans["closeddate"] = pd.to_datetime(cust_prev_loans["closeddate"])
# cust_prev_loans["firstduedate"] = pd.to_datetime(cust_prev_loans["firstduedate"])
# cust_prev_loans["firstrepaiddate"] = pd.to_datetime(cust_prev_loans["firstrepaiddate"])

# cust_prev_loans.dtypes
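
The date columns previewed above contain fragments such as `22:56.0`, which look like times without a date, so it may help to measure how many values actually parse before overwriting a column. The sketch below uses a toy column; the timestamp format string is an assumption for illustration and would need to match whatever the real source data uses.

```python
import pandas as pd

# Toy column mixing full timestamps with time-only fragments like those in the previews.
raw = pd.DataFrame(
    {"approveddate": ["2017-07-25 08:22:56", "22:56.0", "2017-07-05 17:04:41", "04:41.0"]}
)

# An explicit format plus errors="coerce" turns anything that is not a full
# timestamp into NaT instead of raising an error or guessing a date.
parsed = pd.to_datetime(raw["approveddate"], format="%Y-%m-%d %H:%M:%S", errors="coerce")

n_failed = int(parsed.isna().sum())
print(parsed.dtype)
print(f"{n_failed} of {len(parsed)} values could not be parsed")
```

If most values fail to parse, the column probably lost its date part before it reached the CSV, and converting the dtype alone will not recover it.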

In [None]:
cust_demo.describe()      # summarising the numerical variables in the datasets

Unnamed: 0,longitude_gps,latitude_gps
count,4346.0,4346.0
mean,4.626189,7.251356
std,7.184832,3.055052
min,-118.247009,-33.868818
25%,3.354953,6.47061
50%,3.593302,6.621888
75%,6.54522,7.425052
max,151.20929,71.228069


In [None]:
cust_perf.describe()  # systemloanid is an identifier, so its summary statistics are not meaningful

Unnamed: 0,systemloanid,loannumber,loanamount,totaldue,termdays
count,4368.0,4368.0,4368.0,4368.0,4368.0
mean,301981000.0,5.17239,17809.065934,21257.377679,29.261676
std,13431.15,3.653569,10749.694571,11943.510416,11.512519
min,301958500.0,2.0,10000.0,10000.0,15.0
25%,301969100.0,2.0,10000.0,13000.0,30.0
50%,301980100.0,4.0,10000.0,13000.0,30.0
75%,301993500.0,7.0,20000.0,24500.0,30.0
max,302004000.0,27.0,60000.0,68100.0,90.0
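
One way to keep an identifier such as `systemloanid` out of numeric summaries is to store it as text. This is a minimal sketch on a three-row toy frame (the values are taken from the preview of the performance data above), not a required cleaning step.

```python
import pandas as pd

loans = pd.DataFrame(
    {
        "systemloanid": [301994762, 301965204, 301966580],
        "loanamount": [30000, 15000, 20000],
        "termdays": [30, 30, 15],
    }
)

# Store the identifier as text so it is no longer treated as a quantity.
loans["systemloanid"] = loans["systemloanid"].astype(str)

# describe() summarises only the numeric columns by default.
summary = loans.describe()
print(summary.columns.tolist())
```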


In [None]:
cust_prev_loans.describe()

Unnamed: 0,systemloanid,loannumber,loanamount,totaldue,termdays
count,18183.0,18183.0,18183.0,18183.0,18183.0
mean,301839500.0,4.189353,16501.23742,19573.202931,26.69279
std,93677.67,3.24949,9320.547516,10454.245277,10.946556
min,301600100.0,1.0,3000.0,3450.0,15.0
25%,301776600.0,2.0,10000.0,11500.0,15.0
50%,301855000.0,3.0,10000.0,13000.0,30.0
75%,301919700.0,6.0,20000.0,24500.0,30.0
max,302000300.0,26.0,60000.0,68100.0,90.0


In [None]:
cust_demo.nunique()    # checking the number of unique entries

customerid                    4334
birthdate                        1
bank_account_type                3
longitude_gps                 4103
latitude_gps                  4313
bank_name_clients               18
bank_branch_clients             45
employment_status_clients        6
level_of_education_clients       4
dtype: int64
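
The output above reports 4,334 distinct `customerid` values, while the demographics table has 4,346 rows, so 12 rows repeat an id that already appears. The sketch below shows one way to inspect and drop such repeats on a toy frame; keeping the first occurrence is an assumption, and the real rows should be checked before deciding which copy to keep.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "customerid": ["a", "b", "b", "c"],
        "bank_account_type": ["Savings", "Savings", "Savings", "Other"],
    }
)

# Rows whose customerid appears more than once (all copies kept for inspection).
dupes = demo[demo.duplicated("customerid", keep=False)]
print(dupes)

# Keep the first row per customer so later joins on customerid stay one-to-one.
demo_unique = demo.drop_duplicates("customerid", keep="first")
print(len(demo), "->", len(demo_unique))
```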

In [None]:
cust_perf.nunique()     # checking the number of unique entries

customerid       4368
systemloanid     4368
loannumber         23
approveddate     2505
creationdate     2537
loanamount         10
totaldue           47
termdays            4
referredby        521
good_bad_flag       2
dtype: int64

In [None]:
cust_prev_loans.nunique()      # checking the number of unique entries

customerid          4359
systemloanid       18183
loannumber            26
approveddate        3585
creationdate        3573
loanamount            16
totaldue              97
termdays               4
closeddate          3570
referredby           521
firstduedate           1
firstrepaiddate     3572
dtype: int64

In [None]:
cust_demo["level_of_education_clients"].unique()    # the column name must match cust_demo.columns exactly

## **Cleaning the Data**

## **Relationship Analysis between the Variables**

In [None]:
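# Merge Datasets
# pd.merge joins two frames at a time, so the performance data is joined to the demographics on customerid.
# cust_prev_loans has many rows per customer and needs aggregating before it can be joined the same way.
df = pd.merge(cust_perf, cust_demo.drop_duplicates("customerid"), on="customerid", how="left")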
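
`pd.merge` combines exactly two frames per call, and its third parameter is `how`, so three tables are usually joined by chaining two merges. The sketch below works on toy frames that reuse this notebook's column names; the summary columns `n_prev_loans` and `total_prev_amount` are hypothetical names chosen for the example.

```python
import pandas as pd

demo = pd.DataFrame({"customerid": ["a", "b"], "bank_account_type": ["Savings", "Other"]})
perf = pd.DataFrame({"customerid": ["a", "b", "c"], "good_bad_flag": ["Good", "Bad", "Good"]})
prev = pd.DataFrame({"customerid": ["a", "a", "b"], "loanamount": [10000, 20000, 10000]})

# Collapse previous loans to one row per customer before joining,
# otherwise each performance row is repeated once per previous loan.
prev_summary = (
    prev.groupby("customerid")
    .agg(n_prev_loans=("loanamount", "size"), total_prev_amount=("loanamount", "sum"))
    .reset_index()
)

# Chain two left joins, starting from the table that holds the target (good_bad_flag).
merged = (
    perf.merge(demo, on="customerid", how="left", validate="one_to_one")
    .merge(prev_summary, on="customerid", how="left", validate="one_to_one")
)
print(merged)
```

The `validate="one_to_one"` argument makes pandas raise an error if either side has repeated keys, which catches accidental row duplication early. Left joins keep every performance row, with missing values where a customer has no demographics or no previous loans.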

In [None]:
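# Match each loan in cust_perf to its customer's birthdate in cust_demo.
# A merge on customerid replaces the nested loop, which indexed cust_demo
# with a row position taken from cust_perf and so ran out of bounds.
matched = cust_perf[["customerid", "loannumber"]].merge(
    cust_demo.drop_duplicates("customerid")[["customerid", "birthdate"]],
    on="customerid",
    how="inner",
)

customerid = matched["customerid"].tolist()
birthdate = matched["birthdate"].tolist()
loannumber = matched["loannumber"].tolist()

print(customerid)
print(birthdate)
print(loannumber)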

In [None]:
final_table=pd.DataFrame()
final_table["Customer_id"]=customerid
final_table["Birth_Date"]=birthdate
final_table['Loan Number']=loannumber

final_table.head(10)

In [None]:
import os
import pandas as pd

# Reference example: looping over the rows of a dataframe with iterrows().
# It assumes a file "data.csv" whose "image" column contains binary image data
# and whose "name" column contains a first and last name separated by a space.
# If the columns use a different format, the code needs changing accordingly.
images = pd.read_csv("data.csv")

output_dir = "."    # change this path to save the images somewhere else

# loop through each row of the dataframe
for index, row in images.iterrows():
    # get the image from the column "image"
    image = row["image"]

    # get first name and last name from column "name"
    first_name, last_name = row["name"].split(" ", 1)

    # create file name
    file_name = os.path.join(output_dir, first_name + "_" + last_name + ".jpg")

    # save the image under the file name built above
    with open(file_name, "wb") as f:
        f.write(image)

# **PREPROCESSING THE DATA**