# Retail Analysis

## Summary of the business

HELOC

This analysis uses the HELOC (home equity line of credit) dataset from FICO. Each entry in the dataset is a line of credit, typically offered by a bank as a percentage of home equity (the difference between the current market value of a home and its purchase price). The customers in this dataset have requested a credit line in the range of $5,000 to $150,000. The fundamental task is to use the information in the applicant's credit report to predict whether they will repay their HELOC account within 2 years.

In [1]:
import pandas as pd
import numpy as np

# Data Exploration

The data comes in a single CSV file: heloc.csv

In [2]:
heloc_df = pd.read_csv('data/heloc.csv')

In [3]:
print(f'There are {len(heloc_df.columns)} columns: ')
print()
for column in heloc_df.columns:
    print(column)

There are 24 columns: 

RiskPerformance
ExternalRiskEstimate
MSinceOldestTradeOpen
MSinceMostRecentTradeOpen
AverageMInFile
NumSatisfactoryTrades
NumTrades60Ever2DerogPubRec
NumTrades90Ever2DerogPubRec
PercentTradesNeverDelq
MSinceMostRecentDelq
MaxDelq2PublicRecLast12M
MaxDelqEver
NumTotalTrades
NumTradesOpeninLast12M
PercentInstallTrades
MSinceMostRecentInqexcl7days
NumInqLast6M
NumInqLast6Mexcl7days
NetFractionRevolvingBurden
NetFractionInstallBurden
NumRevolvingTradesWBalance
NumInstallTradesWBalance
NumBank2NatlTradesWHighUtilization
PercentTradesWBalance


In [4]:
# Check that no column contains null values:
heloc_df.isnull().sum().sum() == 0

True

No values are missing, which simplifies preparation. However, we still need to assess the quality of the data in the dataframe.
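
A null check that covers every column at once can be sketched as follows. The small frame `toy_df` and its column names are hypothetical stand-ins for `heloc_df`; the point is only that a per-column count and a whole-frame test answer different questions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for heloc_df.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# Number of missing values in each column.
nulls_per_column = toy_df.isnull().sum()

# True only if no column contains a missing value.
has_no_nulls = not toy_df.isnull().values.any()

print(nulls_per_column)
print(has_no_nulls)
```

Inspecting the per-column counts is useful when the whole-frame test fails, because it shows where the gaps are.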

In [5]:
heloc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10459 entries, 0 to 10458
Data columns (total 24 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   RiskPerformance                     10459 non-null  object
 1   ExternalRiskEstimate                10459 non-null  int64 
 2   MSinceOldestTradeOpen               10459 non-null  int64 
 3   MSinceMostRecentTradeOpen           10459 non-null  int64 
 4   AverageMInFile                      10459 non-null  int64 
 5   NumSatisfactoryTrades               10459 non-null  int64 
 6   NumTrades60Ever2DerogPubRec         10459 non-null  int64 
 7   NumTrades90Ever2DerogPubRec         10459 non-null  int64 
 8   PercentTradesNeverDelq              10459 non-null  int64 
 9   MSinceMostRecentDelq                10459 non-null  int64 
 10  MaxDelq2PublicRecLast12M            10459 non-null  int64 
 11  MaxDelqEver                         10459 non-null  in

In [6]:
heloc_df.describe(include=[object])

Unnamed: 0,RiskPerformance
count,10459
unique,2
top,Bad
freq,5459


In [7]:
heloc_df['RiskPerformance'].unique()

array(['Bad', 'Good'], dtype=object)

The only categorical column is RiskPerformance, a nominal binary variable with the values 'Bad' and 'Good'.
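
Before modelling a binary target it helps to know how balanced the classes are. The `describe` output above reports 5459 'Bad' rows out of 10459, so the classes are close to balanced. A minimal sketch of the same check with `value_counts`, using a hypothetical `labels` series in place of the real column:

```python
import pandas as pd

# Hypothetical labels standing in for heloc_df['RiskPerformance'].
labels = pd.Series(["Bad", "Good", "Bad", "Good", "Bad"])

# Absolute and relative frequency of each class.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)

# The same arithmetic for the figures reported above: 5459 'Bad' rows out of 10459.
bad_share = 5459 / 10459
print(round(bad_share, 3))
```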

In [8]:
heloc_df.describe()

Unnamed: 0,ExternalRiskEstimate,MSinceOldestTradeOpen,MSinceMostRecentTradeOpen,AverageMInFile,NumSatisfactoryTrades,NumTrades60Ever2DerogPubRec,NumTrades90Ever2DerogPubRec,PercentTradesNeverDelq,MSinceMostRecentDelq,MaxDelq2PublicRecLast12M,...,PercentInstallTrades,MSinceMostRecentInqexcl7days,NumInqLast6M,NumInqLast6Mexcl7days,NetFractionRevolvingBurden,NetFractionInstallBurden,NumRevolvingTradesWBalance,NumInstallTradesWBalance,NumBank2NatlTradesWHighUtilization,PercentTradesWBalance
count,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,...,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0
mean,67.425758,184.205373,8.543455,73.843293,19.428052,0.042738,-0.142843,86.661536,6.762406,4.928291,...,32.16646,-0.325366,0.868152,0.812602,31.629888,39.158906,3.185008,0.976097,0.018071,62.079166
std,21.121621,109.683816,13.301745,38.782803,13.004327,2.51391,2.367397,25.999584,20.50125,3.756275,...,20.128634,6.067556,3.179304,3.143698,30.06014,42.101601,4.413173,4.060995,3.358135,27.711565
min,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,...,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0
25%,63.0,118.0,3.0,52.0,12.0,0.0,0.0,87.0,-7.0,4.0,...,20.0,-7.0,0.0,0.0,5.0,-8.0,2.0,1.0,0.0,47.0
50%,71.0,178.0,5.0,74.0,19.0,0.0,0.0,96.0,-7.0,6.0,...,31.0,0.0,1.0,1.0,25.0,47.0,3.0,2.0,0.0,67.0
75%,79.0,249.5,11.0,95.0,27.0,1.0,0.0,100.0,14.0,7.0,...,44.0,1.0,2.0,2.0,54.0,79.0,5.0,3.0,1.0,82.0
max,94.0,803.0,383.0,383.0,79.0,19.0,19.0,100.0,83.0,9.0,...,100.0,24.0,66.0,66.0,232.0,471.0,32.0,23.0,18.0,100.0
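
The minimum of -9 in every column, and quartiles of -7 and -8 in some of them, suggest that negative numbers are special codes rather than measurements. Assuming that -9, -8 and -7 mark missing or non-applicable records (an assumption that should be verified against the dataset's data dictionary), one way to count and mask them is sketched below. `toy_df` and `SENTINELS` are illustrative names, and the values are made up.

```python
import pandas as pd

# Assumed special codes; verify against the dataset's data dictionary.
SENTINELS = [-9, -8, -7]

# Hypothetical toy frame with two HELOC-style columns.
toy_df = pd.DataFrame({
    "MSinceMostRecentDelq": [-7, 14, -8, 3],
    "NetFractionInstallBurden": [47, -8, 79, -9],
})

# Count how many entries in each column are special codes.
sentinel_counts = toy_df.isin(SENTINELS).sum()

# Replace the codes with NaN so that summary statistics ignore them.
masked_df = toy_df.mask(toy_df.isin(SENTINELS))

print(sentinel_counts)
print(masked_df.mean())
```

If the assumption holds, means, standard deviations and correlations computed on the raw columns are distorted by these codes, which is worth keeping in mind when reading the statistics above.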


RiskPerformance appears to indicate whether or not the client is eligible for the loan. It is therefore the target variable (y) that we will predict with machine learning.

## Searching for Correlations with RiskPerformance

In [9]:
heloc_df.sample()

Unnamed: 0,RiskPerformance,ExternalRiskEstimate,MSinceOldestTradeOpen,MSinceMostRecentTradeOpen,AverageMInFile,NumSatisfactoryTrades,NumTrades60Ever2DerogPubRec,NumTrades90Ever2DerogPubRec,PercentTradesNeverDelq,MSinceMostRecentDelq,...,PercentInstallTrades,MSinceMostRecentInqexcl7days,NumInqLast6M,NumInqLast6Mexcl7days,NetFractionRevolvingBurden,NetFractionInstallBurden,NumRevolvingTradesWBalance,NumInstallTradesWBalance,NumBank2NatlTradesWHighUtilization,PercentTradesWBalance
5491,Good,79,366,3,94,53,0,0,100,-7,...,23,16,0,0,16,64,5,4,0,39


In [10]:
# Encode the target as 1 for 'Good' and 0 for 'Bad'
heloc_df['RiskPerformance'] = (heloc_df['RiskPerformance'] == 'Good').astype(int)

In [11]:
heloc_df.corr()[['RiskPerformance']]

Unnamed: 0,RiskPerformance
RiskPerformance,1.0
ExternalRiskEstimate,0.21677
MSinceOldestTradeOpen,0.185155
MSinceMostRecentTradeOpen,0.046937
AverageMInFile,0.209168
NumSatisfactoryTrades,0.12308
NumTrades60Ever2DerogPubRec,-0.067211
NumTrades90Ever2DerogPubRec,-0.043402
PercentTradesNeverDelq,0.12201
MSinceMostRecentDelq,-0.057067
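
To see which features have the strongest linear relationship with the target, the correlation column can be sorted by absolute value. The sketch below uses synthetic data; the column names `signal`, `noise` and `target` are hypothetical and chosen only to make the ranking easy to verify.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: 'signal' drives the binary target, 'noise' does not.
n = 1000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

toy_df = pd.DataFrame({"target": target, "signal": signal, "noise": noise})

# Pearson correlation of every feature with the target, strongest first.
ranking = (
    toy_df.corr()["target"]
    .drop("target")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranking)
```

Pearson correlation with a 0/1 target only captures linear association, so a low value does not by itself prove that a column is useless to a model.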


In [12]:
heloc_df.sample(3)

Unnamed: 0,RiskPerformance,ExternalRiskEstimate,MSinceOldestTradeOpen,MSinceMostRecentTradeOpen,AverageMInFile,NumSatisfactoryTrades,NumTrades60Ever2DerogPubRec,NumTrades90Ever2DerogPubRec,PercentTradesNeverDelq,MSinceMostRecentDelq,...,PercentInstallTrades,MSinceMostRecentInqexcl7days,NumInqLast6M,NumInqLast6Mexcl7days,NetFractionRevolvingBurden,NetFractionInstallBurden,NumRevolvingTradesWBalance,NumInstallTradesWBalance,NumBank2NatlTradesWHighUtilization,PercentTradesWBalance
1818,0,69,57,4,37,14,0,0,100,-7,...,50,0,4,4,52,152,6,7,2,100
8741,0,71,149,3,63,27,0,0,96,-8,...,36,0,1,1,41,97,7,1,2,73
2153,1,58,421,1,142,24,0,0,96,12,...,20,0,3,3,78,101,9,4,7,87


TODO:
- [ ] Remove some uncorrelated columns?
    - [ ] Design experiments for the columns in MLflow

# Machine Learning Risk Performance Predictor

In [13]:
import mlflow

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Get values from the dataframe

In [14]:
X = heloc_df.iloc[:, 1:].values
y = heloc_df.iloc[:, 0].values


Split into train and test

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
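
The call above relies on scikit-learn's defaults: 25% of the rows go to the test set, and the shuffle differs from run to run. A reproducible, stratified variant could look like the sketch below. The synthetic `X_demo` and `y_demo` and the seed 42 are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 100 rows, 3 features, 40% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 40 + [0] * 60)

# Fixed seed for reproducibility; stratify keeps the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Fixing `random_state` makes the confusion matrix and accuracy further down repeatable across runs.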

## Feature Scaling

In [16]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [17]:
X_train[1]

array([-0.30351006, -0.07187352, -0.04015581, -0.22571106,  0.36504154,
       -0.01850728,  0.05938707, -0.25451283,  1.23659479,  0.28532569,
        0.12133222,  0.29630724,  0.57289325,  0.58451674,  0.05375399,
        0.04329737, -0.25502427,  0.74380605,  0.78269115,  0.41398057,
        0.49912962,  0.59329716,  0.97204882])
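
`StandardScaler` standardizes each column as z = (x - mean) / std, with the mean and standard deviation estimated on the training set only and then reused for the test set. The sketch below checks this against a manual computation on hypothetical data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test matrices with two features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[2.0, 30.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual equivalent, using the population standard deviation (ddof=0).
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(test_scaled, manual)
```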

## Training the Logistic Regression model on the Training set

In [18]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

## Predicting the Test set results

In [19]:
y_pred = classifier.predict(X_test)
# Left column: predicted label, right column: actual label
print(np.column_stack((y_pred, y_test)))


[[0 0]
 [1 0]
 [0 0]
 ...
 [1 0]
 [0 1]
 [1 1]]


## Making the Confusion Matrix

In [20]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[1035  336]
 [ 377  867]]


0.7273422562141492
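
The accuracy follows directly from the confusion matrix, because the correct predictions sit on its diagonal. As a worked check using the matrix printed above (rows are actual classes and columns are predicted classes in scikit-learn's convention), with precision and recall for the 'Good' class added for context:

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted.
cm_reported = np.array([[1035, 336],
                        [377, 867]])

tn, fp, fn, tp = cm_reported.ravel()

accuracy = (tp + tn) / cm_reported.sum()  # share of correct predictions
precision = tp / (tp + fp)                # predicted 'Good' that were actually 'Good'
recall = tp / (tp + fn)                   # actual 'Good' that were predicted 'Good'

print(accuracy, precision, recall)
```

Accuracy alone hides how the errors are split between the two classes, which matters when approving a bad loan and rejecting a good one carry different costs.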

In [21]:
# Note: new inputs must be scaled with the fitted scaler before predicting,
# and input_array must be 2-D (one row per applicant), like so:

# print(classifier.predict( sc.transform( input_array ) ) )
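
One way to make the scaling step impossible to forget is to bundle the scaler and the classifier in a scikit-learn pipeline, so that `predict` applies the scaling itself. This is a minimal sketch on synthetic data; the names `model`, `X_demo`, `y_demo` and `raw_input` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with very different feature scales; only the first
# feature determines the label.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(size=200), 1000 * rng.normal(size=200)])
y_demo = (X_demo[:, 0] > 0).astype(int)

# The pipeline fits the scaler, then the classifier on the scaled data.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
model.fit(X_demo, y_demo)

# Raw, unscaled inputs can be passed directly; they must be 2-D.
raw_input = np.array([[2.0, 0.0], [-2.0, 0.0]])
print(model.predict(raw_input))
```

A pipeline also keeps cross-validation honest, since the scaler is refitted on each training fold instead of seeing the validation rows.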

# MLflow

In [22]:
def evaluate(y, pred) -> float:
    """Return the root mean squared error between true and predicted labels."""
    rmse = np.sqrt(mean_squared_error(y, pred))
    return float(rmse)
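
For 0/1 labels, each squared error is 1 for a misclassification and 0 otherwise, so the mean squared error equals the misclassification rate and RMSE = sqrt(1 - accuracy). The relationship can be checked on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical true and predicted labels: 3 of 4 are correct.
y_true = np.array([0, 1, 1, 0])
y_hat = np.array([0, 1, 0, 0])

rmse = np.sqrt(mean_squared_error(y_true, y_hat))
acc = accuracy_score(y_true, y_hat)

# Both lines print the same value.
print(rmse)
print(np.sqrt(1 - acc))
```

With the test accuracy of about 0.727 reported earlier, this corresponds to an RMSE of roughly 0.52, so the logged score carries the same information as accuracy on a different scale.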

In [23]:
mlflow.set_tracking_uri("http://localhost:5000")
# mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("Heloc Experiment")


<Experiment: artifact_location='mlflow-artifacts:/1', creation_time=1690128332051, experiment_id='1', last_update_time=1690128332051, lifecycle_stage='active', name='Heloc Experiment', tags={}>

In [30]:
from hyperopt import fmin, tpe, hp, space_eval, STATUS_OK, Trials
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Search space for the hyperparameters. The "newton-cholesky" solver
# supports only an L2 penalty or no penalty at all.
space = {
    'penalty': hp.choice('penalty', ['l2', None]),
    'C': hp.uniform('C', 0.001, 10.0),  # C must be strictly positive
    'fit_intercept': True,
    'solver': 'newton-cholesky',
}


# Objective to minimize: negative mean cross-validated accuracy
def objective(params):
    classifier = LogisticRegression(**params)
    accuracy = cross_val_score(classifier, X_train, y_train, cv=5).mean()
    return {'loss': -accuracy, 'status': STATUS_OK}

# Run the optimizer
trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=50,
            trials=trials)

# fmin returns the *index* of the chosen option for hp.choice parameters
# (e.g. {'penalty': 1}), so map the result back to actual values first.
best_params = space_eval(space, best)

# Log the best parameters and the refitted model to MLflow
with mlflow.start_run():
    classifier = LogisticRegression(**best_params)
    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)
    score = evaluate(y_test, y_pred)

    mlflow.log_params(best_params)
    mlflow.log_metric("score", score)
    mlflow.sklearn.log_model(classifier, "model")

100%|██████████| 50/50 [00:07<00:00,  6.34trial/s, best loss: -0.7122650589872659]