# Retail Analysis

## Summary of the business

HELOC

This analysis uses the HELOC (Home Equity Line of Credit) dataset from FICO. Each entry in the dataset is a line of credit, typically offered by a bank as a percentage of home equity (the difference between the current market value of a home and its purchase price). The customers in this dataset have requested a credit line in the range of $5,000 - $150,000. The fundamental task is to use the information in the applicant's credit report to predict whether they will repay their HELOC account within 2 years.

In [1]:
import pandas as pd

# Data Exploration

The data comes in a single CSV file, heloc.csv.

In [2]:
heloc_df = pd.read_csv('data/heloc.csv')

In [3]:
print(f'There are {len(heloc_df.columns)} columns: ')
print()
# Print one column name per line
for column in heloc_df.columns:
    print(column)

There are 24 columns: 

RiskPerformance
ExternalRiskEstimate
MSinceOldestTradeOpen
MSinceMostRecentTradeOpen
AverageMInFile
NumSatisfactoryTrades
NumTrades60Ever2DerogPubRec
NumTrades90Ever2DerogPubRec
PercentTradesNeverDelq
MSinceMostRecentDelq
MaxDelq2PublicRecLast12M
MaxDelqEver
NumTotalTrades
NumTradesOpeninLast12M
PercentInstallTrades
MSinceMostRecentInqexcl7days
NumInqLast6M
NumInqLast6Mexcl7days
NetFractionRevolvingBurden
NetFractionInstallBurden
NumRevolvingTradesWBalance
NumInstallTradesWBalance
NumBank2NatlTradesWHighUtilization
PercentTradesWBalance


In [16]:
# Let's check that no column contains null values:
heloc_df.isnull().sum().sum() == 0

True

None of the columns contain null values, which simplifies preprocessing. The absence of nulls says nothing about the quality of the values themselves, though, so we check that next.

In [12]:
heloc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10459 entries, 0 to 10458
Data columns (total 24 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   RiskPerformance                     10459 non-null  object
 1   ExternalRiskEstimate                10459 non-null  int64 
 2   MSinceOldestTradeOpen               10459 non-null  int64 
 3   MSinceMostRecentTradeOpen           10459 non-null  int64 
 4   AverageMInFile                      10459 non-null  int64 
 5   NumSatisfactoryTrades               10459 non-null  int64 
 6   NumTrades60Ever2DerogPubRec         10459 non-null  int64 
 7   NumTrades90Ever2DerogPubRec         10459 non-null  int64 
 8   PercentTradesNeverDelq              10459 non-null  int64 
 9   MSinceMostRecentDelq                10459 non-null  int64 
 10  MaxDelq2PublicRecLast12M            10459 non-null  int64 
 11  MaxDelqEver                         10459 non-null  in

In [18]:
heloc_df.describe(include=[object])

Unnamed: 0,RiskPerformance
count,10459
unique,2
top,Bad
freq,5459


In [19]:
heloc_df['RiskPerformance'].unique()

array(['Bad', 'Good'], dtype=object)

The only categorical column is RiskPerformance, a nominal binary variable with the values 'Bad' and 'Good'.
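Since the target is binary, it is worth checking how balanced the two classes are before modelling. The sketch below uses only the counts reported by `describe` above (10459 rows, 5459 of them `Bad`, which leaves 5000 `Good`). The Series `risk` is a stand-in rebuilt from those counts; on the real data, `heloc_df['RiskPerformance']` would take its place.

```python
import pandas as pd

# Stand-in for heloc_df['RiskPerformance'], rebuilt from the describe() counts
risk = pd.Series(['Bad'] * 5459 + ['Good'] * 5000, name='RiskPerformance')

# Absolute number of rows per class
class_counts = risk.value_counts()

# Share of each class, as a fraction of all rows
class_shares = risk.value_counts(normalize=True)

print(class_counts)
print(class_shares.round(3))
```

The two classes are close to balanced (roughly 52% Bad and 48% Good), so accuracy should be a reasonable first metric.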

In [17]:
heloc_df.describe()

Unnamed: 0,ExternalRiskEstimate,MSinceOldestTradeOpen,MSinceMostRecentTradeOpen,AverageMInFile,NumSatisfactoryTrades,NumTrades60Ever2DerogPubRec,NumTrades90Ever2DerogPubRec,PercentTradesNeverDelq,MSinceMostRecentDelq,MaxDelq2PublicRecLast12M,...,PercentInstallTrades,MSinceMostRecentInqexcl7days,NumInqLast6M,NumInqLast6Mexcl7days,NetFractionRevolvingBurden,NetFractionInstallBurden,NumRevolvingTradesWBalance,NumInstallTradesWBalance,NumBank2NatlTradesWHighUtilization,PercentTradesWBalance
count,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,...,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0
mean,67.425758,184.205373,8.543455,73.843293,19.428052,0.042738,-0.142843,86.661536,6.762406,4.928291,...,32.16646,-0.325366,0.868152,0.812602,31.629888,39.158906,3.185008,0.976097,0.018071,62.079166
std,21.121621,109.683816,13.301745,38.782803,13.004327,2.51391,2.367397,25.999584,20.50125,3.756275,...,20.128634,6.067556,3.179304,3.143698,30.06014,42.101601,4.413173,4.060995,3.358135,27.711565
min,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,...,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0
25%,63.0,118.0,3.0,52.0,12.0,0.0,0.0,87.0,-7.0,4.0,...,20.0,-7.0,0.0,0.0,5.0,-8.0,2.0,1.0,0.0,47.0
50%,71.0,178.0,5.0,74.0,19.0,0.0,0.0,96.0,-7.0,6.0,...,31.0,0.0,1.0,1.0,25.0,47.0,3.0,2.0,0.0,67.0
75%,79.0,249.5,11.0,95.0,27.0,1.0,0.0,100.0,14.0,7.0,...,44.0,1.0,2.0,2.0,54.0,79.0,5.0,3.0,1.0,82.0
max,94.0,803.0,383.0,383.0,79.0,19.0,19.0,100.0,83.0,9.0,...,100.0,24.0,66.0,66.0,232.0,471.0,32.0,23.0,18.0,100.0
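The minimum of -9 in every column, together with the negative quartiles in some of them, suggests that the negative numbers are codes rather than measurements. As far as we know, the FICO data dictionary defines -9 (no bureau record or no investigation), -8 (no usable or valid trades or inquiries) and -7 (condition not met) as special values; treat that mapping as an assumption to verify against the dictionary shipped with the data. The sketch below counts such values and masks them as missing. The helper names and `toy_df` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Assumed sentinel codes; verify against the data dictionary before relying on them
SPECIAL_VALUES = [-9, -8, -7]

def count_special_values(df, special_values=SPECIAL_VALUES):
    """Count sentinel codes per numeric column."""
    numeric = df.select_dtypes(include='number')
    return numeric.isin(special_values).sum()

def mask_special_values(df, special_values=SPECIAL_VALUES):
    """Return a copy in which sentinel codes in numeric columns become NaN."""
    out = df.copy()
    for col in out.select_dtypes(include='number').columns:
        out[col] = out[col].mask(out[col].isin(special_values))
    return out

# Small made-up frame with the same kind of columns as heloc_df
toy_df = pd.DataFrame({
    'RiskPerformance': ['Good', 'Bad', 'Good'],
    'ExternalRiskEstimate': [83, -9, 71],
    'MSinceMostRecentDelq': [-7, 14, -8],
})

special_counts = count_special_values(toy_df)
cleaned_df = mask_special_values(toy_df)
print(special_counts)
print(cleaned_df)
```

Masking turns the affected integer columns into floats, because NaN has no integer representation. Whether to impute, drop, or keep the codes as separate indicator features is a modelling decision this sketch leaves open.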


## Searching for Correlation with RiskPerformance

In [None]:
heloc_df.sample()

Unnamed: 0,RiskPerformance,ExternalRiskEstimate,MSinceOldestTradeOpen,MSinceMostRecentTradeOpen,AverageMInFile,NumSatisfactoryTrades,NumTrades60Ever2DerogPubRec,NumTrades90Ever2DerogPubRec,PercentTradesNeverDelq,MSinceMostRecentDelq,...,PercentInstallTrades,MSinceMostRecentInqexcl7days,NumInqLast6M,NumInqLast6Mexcl7days,NetFractionRevolvingBurden,NetFractionInstallBurden,NumRevolvingTradesWBalance,NumInstallTradesWBalance,NumBank2NatlTradesWHighUtilization,PercentTradesWBalance
112,Good,-9,-9,-9,-9,-9,-9,-9,-9,-9,...,-9,-9,-9,-9,-9,-9,-9,-9,-9,-9


In [None]:
# Encode the target as an integer: 1 for 'Good', 0 for 'Bad'
heloc_df['RiskPerformance'] = (heloc_df['RiskPerformance'] == 'Good').astype(int)

In [None]:
heloc_df.corr()[['RiskPerformance']]

Unnamed: 0,RiskPerformance
RiskPerformance,1.0
ExternalRiskEstimate,0.21677
MSinceOldestTradeOpen,0.185155
MSinceMostRecentTradeOpen,0.046937
AverageMInFile,0.209168
NumSatisfactoryTrades,0.12308
NumTrades60Ever2DerogPubRec,-0.067211
NumTrades90Ever2DerogPubRec,-0.043402
PercentTradesNeverDelq,0.12201
MSinceMostRecentDelq,-0.057067
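With a 0/1 target, each of these Pearson coefficients is a point-biserial correlation, and the sign only indicates direction. Sorting by absolute value makes the strongest relationships easier to spot. Below is a minimal sketch on made-up data; `rank_by_correlation` and `toy_df` are hypothetical names, and the numbers have nothing to do with the HELOC values.

```python
import pandas as pd

def rank_by_correlation(df, target):
    """Correlation of every numeric column with `target`, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up data: one positive, one negative and one unrelated feature
toy_df = pd.DataFrame({
    'target':   [0, 0, 1, 1],
    'strong':   [1, 2, 3, 4],
    'weak':     [1, 0, 0, 1],
    'negative': [5, 4, 2, 1],
})

ranked = rank_by_correlation(toy_df, 'target')
print(ranked)
```

On the real data the call would be `rank_by_correlation(heloc_df, 'RiskPerformance')`. Rows that consist entirely of -9, like the sampled row 112 above, enter the computation as ordinary numbers, so the coefficients may shift once such codes are handled.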


In [13]:
heloc_df.sample(3)

Unnamed: 0,RiskPerformance,ExternalRiskEstimate,MSinceOldestTradeOpen,MSinceMostRecentTradeOpen,AverageMInFile,NumSatisfactoryTrades,NumTrades60Ever2DerogPubRec,NumTrades90Ever2DerogPubRec,PercentTradesNeverDelq,MSinceMostRecentDelq,...,PercentInstallTrades,MSinceMostRecentInqexcl7days,NumInqLast6M,NumInqLast6Mexcl7days,NetFractionRevolvingBurden,NetFractionInstallBurden,NumRevolvingTradesWBalance,NumInstallTradesWBalance,NumBank2NatlTradesWHighUtilization,PercentTradesWBalance
9899,Good,83,140,4,75,27,0,0,100,-7,...,52,0,3,2,12,100,2,3,0,71
6036,Good,84,155,5,65,28,0,0,100,-7,...,25,16,0,0,4,67,3,2,0,56
5197,Good,82,151,4,66,10,1,0,91,51,...,23,10,0,0,0,-8,0,1,-8,33


TODO:
- [ ] Remove weakly correlated columns?
    - [ ] Design experiments for the column subsets in MLflow
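One way to approach the first TODO item is to derive candidate feature subsets from a correlation threshold and treat each subset as one experiment. The sketch below uses only the nine correlation values shown in the (truncated) table above. The thresholds and the helper `select_features` are hypothetical choices, not something the data prescribes.

```python
# Correlations with RiskPerformance, copied from the truncated table above
correlations = {
    'ExternalRiskEstimate': 0.21677,
    'MSinceOldestTradeOpen': 0.185155,
    'MSinceMostRecentTradeOpen': 0.046937,
    'AverageMInFile': 0.209168,
    'NumSatisfactoryTrades': 0.12308,
    'NumTrades60Ever2DerogPubRec': -0.067211,
    'NumTrades90Ever2DerogPubRec': -0.043402,
    'PercentTradesNeverDelq': 0.12201,
    'MSinceMostRecentDelq': -0.057067,
}

def select_features(correlations, threshold):
    """Return the features whose absolute correlation is at least `threshold`."""
    return [name for name, r in correlations.items() if abs(r) >= threshold]

# Hypothetical thresholds; each one defines a candidate experiment
candidate_thresholds = [0.05, 0.10, 0.20]
experiments = {t: select_features(correlations, t) for t in candidate_thresholds}

for threshold, features in experiments.items():
    print(f'threshold={threshold:.2f}: {len(features)} features -> {features}')
```

Each entry could then become one MLflow run, with the threshold and the feature list logged as parameters. A low linear correlation does not rule out a useful non-linear relationship, so dropped columns are worth revisiting with tree-based models.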

# Machine Learning Risk Performance Predictor

In [50]:
import mlflow

from sklearn.model_selection import train_test_split

Get the feature matrix X and the target vector y from the DataFrame

In [47]:
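# Features: every column except the first (RiskPerformance)
X = heloc_df.iloc[:, 1:].values
# Target: the first column, RiskPerformance
y = heloc_df.iloc[:, 0].values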


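Split into training and test sets. With no further arguments, train_test_split holds out 25% of the rows for testing and shuffles them differently on every run.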

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
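For comparable experiments it may help to fix the random seed and to stratify on the target, so that every run sees the same rows and both sets keep the same class ratio. Below is a minimal sketch on made-up arrays. The `_demo` names, the seed 42 and the explicit `test_size=0.25` are illustrative assumptions, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 40 rows, 2 features, two perfectly balanced classes
X_demo = np.arange(80).reshape(40, 2)
y_demo = np.array([0] * 20 + [1] * 20)

# random_state makes the shuffle repeatable; stratify keeps the class ratio
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_train_demo.shape, X_test_demo.shape)
print(np.bincount(y_train_demo), np.bincount(y_test_demo))
```

On the real data the same call would take `X` and `y` from the cell above, with `stratify=y`.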