# Loan Flagging Dataset

We have a dataset of 4,157 loans with features such as loan dates, overdue counts, payment types, past billings, credit scores, age, and gender. The target variable, `bad_flag`, indicates whether the loan was flagged.

In [1]:
import pandas as pd

In [2]:
# importing the dataset
data = pd.read_csv("test_task.csv")

In [3]:
# we have 4157 rows and 22 columns
data.shape

(4157, 22)

In [6]:
# prints the first 10 rows of the data.
data.head(10)

,loanKey,rep_loan_date,first_loan,dpd_5_cnt,dpd_15_cnt,dpd_30_cnt,first_overdue_date,close_loans_cnt,federal_district_nm,TraderKey,...,payment_type_2,payment_type_3,payment_type_4,payment_type_5,past_billings_cnt,score_1,score_2,age,gender,bad_flag
0,708382,2016-10-06,2015-11-13,,,,,3.0,region_6,6,...,10,0,0,0,10.0,,,21.0,False,0
1,406305,2016-03-26,2015-09-28,1.0,0.0,0.0,2016-01-30,0.0,region_6,6,...,6,0,0,0,5.0,,,20.0,False,0
2,779736,2016-10-30,2015-12-21,,,,,2.0,region_1,6,...,0,5,0,0,5.0,,,19.0,False,0
3,556376,2016-06-29,2015-06-30,,,,,1.0,region_6,14,...,4,0,0,0,6.0,,,21.0,False,0
4,266968,2015-12-01,2015-08-03,,,,,0.0,region_5,22,...,0,0,0,0,3.0,,,33.0,False,0
5,697186,2016-10-01,2015-08-30,,,,,2.0,region_3,38,...,6,0,0,0,5.0,,,34.0,False,0
6,347907,2016-02-18,2015-06-07,1.0,0.0,0.0,2015-11-06,2.0,region_3,6,...,9,0,0,0,8.0,,,32.0,False,0
7,256097,2015-11-23,2015-06-04,1.0,1.0,0.0,2015-11-06,0.0,region_3,6,...,5,0,0,0,5.0,,,23.0,False,1
8,670540,2016-09-19,2015-12-03,3.0,1.0,0.0,2016-01-15,1.0,region_2,6,...,4,0,0,0,6.0,,,33.0,False,0
9,254453,2015-11-22,2015-06-04,1.0,1.0,0.0,2015-11-06,0.0,region_3,6,...,5,0,0,0,5.0,,,23.0,False,1


In [7]:
# prints number of non-null values and type of each column
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4157 entries, 0 to 4156
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   loanKey              4157 non-null   int64  
 1   rep_loan_date        4157 non-null   object 
 2   first_loan           4157 non-null   object 
 3   dpd_5_cnt            675 non-null    float64
 4   dpd_15_cnt           675 non-null    float64
 5   dpd_30_cnt           675 non-null    float64
 6   first_overdue_date   675 non-null    object 
 7   close_loans_cnt      4142 non-null   float64
 8   federal_district_nm  4146 non-null   object 
 9   TraderKey            4157 non-null   int64  
 10  payment_type_0       4157 non-null   int64  
 11  payment_type_1       4157 non-null   int64  
 12  payment_type_2       4157 non-null   int64  
 13  payment_type_3       4157 non-null   int64  
 14  payment_type_4       4157 non-null   int64  
 15  payment_type_5       4157 non-null   i

In [8]:
# we manually select the numerical columns to describe,
# intentionally leaving out the target variable (bad_flag) and the 'key' columns
data[[
    "dpd_5_cnt",
    "dpd_15_cnt",
    "dpd_30_cnt",
    "close_loans_cnt",
    "payment_type_0",
    "payment_type_1",
    "payment_type_2",
    "payment_type_3",
    "payment_type_4",
    "payment_type_5",
    "past_billings_cnt",
    "score_1",
    "score_2",
    "age"
]].describe()

,dpd_5_cnt,dpd_15_cnt,dpd_30_cnt,close_loans_cnt,payment_type_0,payment_type_1,payment_type_2,payment_type_3,payment_type_4,payment_type_5,past_billings_cnt,score_1,score_2,age
count,675.0,675.0,675.0,4142.0,4157.0,4157.0,4157.0,4157.0,4157.0,4157.0,3909.0,3507.0,239.0,4157.0
mean,1.444444,0.733333,0.28,1.184693,0.018523,0.596103,3.755834,0.758239,0.019485,0.0,4.979023,578.911345,552.54661,34.561222
std,0.900599,0.764572,0.502339,1.723715,0.330359,2.564887,3.810703,2.212487,0.24596,0.0,3.491556,48.989869,21.49284,10.834143
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,384.220628,485.874267,18.0
25%,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,547.429791,535.545724,26.0
50%,1.0,1.0,0.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0,4.0,588.531315,556.757944,32.0
75%,2.0,1.0,1.0,2.0,0.0,0.0,6.0,0.0,0.0,0.0,6.0,612.32309,567.608057,41.0
max,7.0,5.0,3.0,31.0,15.0,81.0,33.0,35.0,7.0,0.0,21.0,691.52842,603.311653,74.0


In [9]:
# next, we 'describe' the categorical columns
# note that the output differs between the two calls to `.describe()`
data[[
    "federal_district_nm",
    "gender"
]].describe()

,federal_district_nm,gender
count,4146,4157
unique,8,2
top,region_3,False
freq,1595,3570


# Question One 

### Exploratory Data Analysis

Walk me through how you would explore this dataset and perform some initial EDA. What are some key things you would look at to start understanding the data, relationships between features, and what might be predictive of our target variable? Feel free to speak generally about your approach first, then get more specific in terms of what you might examine with this particular dataset. 
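One possible starting point, sketched on a small made-up frame rather than on `test_task.csv`: check how much of each column is missing, how balanced the target is, and how the flag rate varies across a categorical feature. The `toy` rows below are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for `data`: made-up rows that reuse the notebook's column names.
nan = float("nan")
toy = pd.DataFrame({
    "dpd_5_cnt": [nan, 1.0, nan, 3.0, nan, 1.0],
    "score_1": [580.0, nan, 610.0, 540.0, nan, 525.0],
    "federal_district_nm": ["region_6", "region_6", "region_1",
                            "region_3", "region_3", "region_3"],
    "age": [21.0, 20.0, 19.0, 33.0, 34.0, 23.0],
    "bad_flag": [0, 0, 0, 0, 0, 1],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Class balance of the target: the fraction of flagged loans.
bad_rate = toy["bad_flag"].mean()

# Flag rate and row count for each level of a categorical feature.
rate_by_region = toy.groupby("federal_district_nm")["bad_flag"].agg(["mean", "count"])

# Pairwise correlation of each numeric feature with the target.
corr_with_target = toy.select_dtypes("number").corr()["bad_flag"].drop("bad_flag")

print(missing_share, bad_rate, rate_by_region, corr_with_target, sep="\n\n")
```

On the real data the same calls would run against `data`; group-level flag rates are only meaningful where the `count` column is reasonably large.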

# Question Two

### Feature Engineering

Walk me through how you would explore this data and identify opportunities for feature engineering. What types of new features might you extract or derive from the existing data that could help a model better predict loan risk?
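As an illustration of the kind of derived features the date and count columns allow, here is a minimal sketch on made-up rows. The feature names (`days_since_first_loan`, `has_overdue_history`, and so on) are hypothetical, and treating a missing `dpd_5_cnt` as zero is an assumption that would need checking against how the data was collected.

```python
import pandas as pd

# Hypothetical rows that reuse the notebook's column names.
toy = pd.DataFrame({
    "rep_loan_date": ["2016-10-06", "2016-03-26", "2015-11-23"],
    "first_loan": ["2015-11-13", "2015-09-28", "2015-06-04"],
    "first_overdue_date": [None, "2016-01-30", "2015-11-06"],
    "dpd_5_cnt": [float("nan"), 1.0, 1.0],
    "past_billings_cnt": [10.0, 5.0, 5.0],
})

# The date columns are read as strings (dtype `object`), so parse them first.
for col in ["rep_loan_date", "first_loan", "first_overdue_date"]:
    toy[col] = pd.to_datetime(toy[col])

features = pd.DataFrame(index=toy.index)

# Length of the customer relationship at the time of the repeat loan.
features["days_since_first_loan"] = (toy["rep_loan_date"] - toy["first_loan"]).dt.days

# Missingness itself can carry signal: did the customer ever go overdue?
features["has_overdue_history"] = toy["first_overdue_date"].notna().astype(int)

# Recency of the first overdue event (stays NaN when there was none).
features["days_since_first_overdue"] = (
    toy["rep_loan_date"] - toy["first_overdue_date"]
).dt.days

# Overdue events per billing; ASSUMES a missing count means "no overdue events".
features["dpd_5_per_billing"] = toy["dpd_5_cnt"].fillna(0) / toy["past_billings_cnt"]

print(features)
```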

# Question Three

### Algorithm Selection

As you already know, we are trying to predict whether a customer is 'bad' or not based on their attributes and past behavior. What type of machine learning problem is this, and how would you approach selecting an appropriate algorithm?
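Since `bad_flag` takes two values, this is a binary classification problem. One way to compare candidate algorithms is stratified cross-validation with a metric that tolerates class imbalance, such as ROC AUC. The sketch below uses synthetic data from scikit-learn's `make_classification`, not the loan data; the 10% positive share and the two candidate models are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan features and target (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# A simple linear baseline and a tree ensemble, both reweighted for imbalance.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

# Stratified folds keep the positive share roughly equal in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
print(mean_auc)
```

Starting from an interpretable baseline makes it easier to judge whether a more complex model earns its extra cost.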

# Question Four

### Hyperparameter Optimisation

Earlier we discussed selecting a classification algorithm for predicting 'bad' customers. Now, assuming we've chosen an appropriate model, such as a random forest classifier, how would you go about tuning its hyperparameters to optimise performance?
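A common approach is a cross-validated search over a hyperparameter space. The sketch below uses scikit-learn's `RandomizedSearchCV` on synthetic data; the search ranges, the number of iterations, and the ROC AUC scoring are illustrative assumptions, not tuned recommendations for this dataset.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the loan features and target (assumption).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Illustrative search space; sensible ranges depend on the real data.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": randint(1, 20),
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions=param_distributions,
    n_iter=8,                      # number of sampled configurations
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, best_auc)
```

Random search samples a fixed number of configurations, so its cost stays predictable as the space grows. A held-out test set that the search never sees is still needed for an unbiased final estimate.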