# Retail Analysis

## Summary of the business

This analysis uses the HELOC (Home Equity Line of Credit) dataset from FICO. Each entry in the dataset is a line of credit, typically offered by a bank as a percentage of home equity (the difference between the current market value of a home and its purchase price). The customers in this dataset have requested a credit line in the range of $5,000 - $150,000. The fundamental task is to use the information about the applicant in their credit report to predict whether they will repay their HELOC account within 2 years.

In [1]:
import pandas as pd
import numpy as np

# Data Exploration

There is only one CSV file: heloc.csv

In [2]:
heloc_df = pd.read_csv('data/heloc.csv')

In [3]:
print(f'There are {len(heloc_df.columns)} columns: ')
print()
for column in heloc_df.columns:
    print(column)

There are 24 columns: 

RiskPerformance
ExternalRiskEstimate
MSinceOldestTradeOpen
MSinceMostRecentTradeOpen
AverageMInFile
NumSatisfactoryTrades
NumTrades60Ever2DerogPubRec
NumTrades90Ever2DerogPubRec
PercentTradesNeverDelq
MSinceMostRecentDelq
MaxDelq2PublicRecLast12M
MaxDelqEver
NumTotalTrades
NumTradesOpeninLast12M
PercentInstallTrades
MSinceMostRecentInqexcl7days
NumInqLast6M
NumInqLast6Mexcl7days
NetFractionRevolvingBurden
NetFractionInstallBurden
NumRevolvingTradesWBalance
NumInstallTradesWBalance
NumBank2NatlTradesWHighUtilization
PercentTradesWBalance


In [4]:
# Let's check that there are no null values in any column
# (the total count of nulls across the whole dataframe must be 0):
heloc_df.isnull().sum().sum() == 0

True

None of the values are empty, which helps us greatly. However, we still have to check the quality of the data in the dataframe.

In [5]:
heloc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10459 entries, 0 to 10458
Data columns (total 24 columns):

|   #    | Column                              | Non-Null Count | Dtype |
|--------|-------------------------------------|----------------|-------|
|   0    | RiskPerformance                     | 10459 non-null | object|
|   1    | ExternalRiskEstimate                | 10459 non-null | int64 |
|   2    | MSinceOldestTradeOpen               | 10459 non-null | int64 |
|   3    | MSinceMostRecentTradeOpen           | 10459 non-null | int64 |
|   4    | AverageMInFile                      | 10459 non-null | int64 |
|   5    | NumSatisfactoryTrades               | 10459 non-null | int64 |
|   6    | NumTrades60Ever2DerogPubRec         | 10459 non-null | int64 |
|   7    | NumTrades90Ever2DerogPubRec         | 10459 non-null | int64 |
|   8    | PercentTradesNeverDelq              | 10459 non-null | int64 |
|   9    | MSinceMostRecentDelq                | 10459 non-null | int64 |
|   10   | MaxDelq2PublicRecLast12M            | 10459 non-null | int64 |
|   11   | MaxDelqEver                         | 10459 non-null | int64 |
|   12   | NumTotalTrades                      | 10459 non-null | int64 |
|   13   | NumTradesOpeninLast12M              | 10459 non-null | int64 |
|   14   | PercentInstallTrades                | 10459 non-null | int64 |
|   15   | MSinceMostRecentInqexcl7days        | 10459 non-null | int64 |
|   16   | NumInqLast6M                        | 10459 non-null | int64 |
|   17   | NumInqLast6Mexcl7days               | 10459 non-null | int64 |
|   18   | NetFractionRevolvingBurden          | 10459 non-null | int64 |
|   19   | NetFractionInstallBurden            | 10459 non-null | int64 |
|   20   | NumRevolvingTradesWBalance          | 10459 non-null | int64 |
|   21   | NumInstallTradesWBalance            | 10459 non-null | int64 |
|   22   | NumBank2NatlTradesWHighUtilization  | 10459 non-null | int64 |
|   23   | PercentTradesWBalance               | 10459 non-null | int64 |

In [6]:
heloc_df.describe(include=[object])

|        | RiskPerformance |
|--------|-----------------|
| count  | 10459           |
| unique | 2               |
| top    | Bad             |
| freq   | 5459            |


In [7]:
heloc_df['RiskPerformance'].unique()

array(['Bad', 'Good'], dtype=object)

The only categorical column is RiskPerformance, which is a nominal binary variable with the values 'Bad' and 'Good'.

In [8]:
heloc_df.describe()

| | ExternalRiskEstimate | MSinceOldestTradeOpen | MSinceMostRecentTradeOpen | AverageMInFile | NumSatisfactoryTrades | NumTrades60Ever2DerogPubRec | NumTrades90Ever2DerogPubRec | PercentTradesNeverDelq | MSinceMostRecentDelq | MaxDelq2PublicRecLast12M | ... | PercentInstallTrades | MSinceMostRecentInqexcl7days | NumInqLast6M | NumInqLast6Mexcl7days | NetFractionRevolvingBurden | NetFractionInstallBurden | NumRevolvingTradesWBalance | NumInstallTradesWBalance | NumBank2NatlTradesWHighUtilization | PercentTradesWBalance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 | ... | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 | 10459.0 |
| mean | 67.425758 | 184.205373 | 8.543455 | 73.843293 | 19.428052 | 0.042738 | -0.142843 | 86.661536 | 6.762406 | 4.928291 | ... | 32.16646 | -0.325366 | 0.868152 | 0.812602 | 31.629888 | 39.158906 | 3.185008 | 0.976097 | 0.018071 | 62.079166 |
| std | 21.121621 | 109.683816 | 13.301745 | 38.782803 | 13.004327 | 2.51391 | 2.367397 | 25.999584 | 20.50125 | 3.756275 | ... | 20.128634 | 6.067556 | 3.179304 | 3.143698 | 30.06014 | 42.101601 | 4.413173 | 4.060995 | 3.358135 | 27.711565 |
| min | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 | ... | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 | -9.0 |
| 25% | 63.0 | 118.0 | 3.0 | 52.0 | 12.0 | 0.0 | 0.0 | 87.0 | -7.0 | 4.0 | ... | 20.0 | -7.0 | 0.0 | 0.0 | 5.0 | -8.0 | 2.0 | 1.0 | 0.0 | 47.0 |
| 50% | 71.0 | 178.0 | 5.0 | 74.0 | 19.0 | 0.0 | 0.0 | 96.0 | -7.0 | 6.0 | ... | 31.0 | 0.0 | 1.0 | 1.0 | 25.0 | 47.0 | 3.0 | 2.0 | 0.0 | 67.0 |
| 75% | 79.0 | 249.5 | 11.0 | 95.0 | 27.0 | 1.0 | 0.0 | 100.0 | 14.0 | 7.0 | ... | 44.0 | 1.0 | 2.0 | 2.0 | 54.0 | 79.0 | 5.0 | 3.0 | 1.0 | 82.0 |
| max | 94.0 | 803.0 | 383.0 | 383.0 | 79.0 | 19.0 | 19.0 | 100.0 | 83.0 | 9.0 | ... | 100.0 | 24.0 | 66.0 | 66.0 | 232.0 | 471.0 | 32.0 | 23.0 | 18.0 | 100.0 |
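
The summary above shows a minimum of -9 in every column, and quartiles of -7 or -8 in several of them, which is implausible for counts, percentages and durations in months. A minimal sketch for quantifying this, assuming that negative values are special codes rather than real measurements (an assumption to verify against the dataset's data dictionary). The toy frame and the helper name `negative_share` are hypothetical:

```python
import pandas as pd


def negative_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of negative entries per numeric column, largest first."""
    numeric = df.select_dtypes(include="number")
    return (numeric < 0).mean().sort_values(ascending=False)


# Toy stand-in for heloc_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "MSinceMostRecentDelq": [-7, -7, 14, -9],
    "NetFractionInstallBurden": [-8, 47, 79, -9],
    "ExternalRiskEstimate": [63, 71, 79, -9],
})

shares = negative_share(toy_df)
print(shares)
```

On the real data this would be called as `negative_share(heloc_df)`. Columns with a large share are candidates for explicit missing-value handling before scaling.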


RiskPerformance appears to indicate whether or not the client is eligible for the loan. Therefore, it is the y value that we will predict with ML.

## Searching for Correlations with RiskPerformance

In [9]:
heloc_df.sample()

| | RiskPerformance | ExternalRiskEstimate | MSinceOldestTradeOpen | MSinceMostRecentTradeOpen | AverageMInFile | NumSatisfactoryTrades | NumTrades60Ever2DerogPubRec | NumTrades90Ever2DerogPubRec | PercentTradesNeverDelq | MSinceMostRecentDelq | ... | PercentInstallTrades | MSinceMostRecentInqexcl7days | NumInqLast6M | NumInqLast6Mexcl7days | NetFractionRevolvingBurden | NetFractionInstallBurden | NumRevolvingTradesWBalance | NumInstallTradesWBalance | NumBank2NatlTradesWHighUtilization | PercentTradesWBalance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 630 | Good | 90 | 230 | 9 | 84 | 26 | 0 | 0 | 100 | -7 | ... | 57 | 9 | 0 | 0 | 1 | 85 | 1 | 3 | 0 | 44 |


In [10]:
# Encode the target as integers: 1 for 'Good', 0 for 'Bad'
heloc_df['RiskPerformance'] = (heloc_df['RiskPerformance'] == 'Good').astype(int)

In [11]:
heloc_df.corr()[['RiskPerformance']]

|                             | RiskPerformance |
|-----------------------------|-----------------|
| RiskPerformance             | 1.0             |
| ExternalRiskEstimate        | 0.21677         |
| MSinceOldestTradeOpen       | 0.185155        |
| MSinceMostRecentTradeOpen   | 0.046937        |
| AverageMInFile              | 0.209168        |
| NumSatisfactoryTrades       | 0.12308         |
| NumTrades60Ever2DerogPubRec | -0.067211       |
| NumTrades90Ever2DerogPubRec | -0.043402       |
| PercentTradesNeverDelq      | 0.12201         |
| MSinceMostRecentDelq        | -0.057067       |


In [12]:
heloc_df.sample(3)

| | RiskPerformance | ExternalRiskEstimate | MSinceOldestTradeOpen | MSinceMostRecentTradeOpen | AverageMInFile | NumSatisfactoryTrades | NumTrades60Ever2DerogPubRec | NumTrades90Ever2DerogPubRec | PercentTradesNeverDelq | MSinceMostRecentDelq | ... | PercentInstallTrades | MSinceMostRecentInqexcl7days | NumInqLast6M | NumInqLast6Mexcl7days | NetFractionRevolvingBurden | NetFractionInstallBurden | NumRevolvingTradesWBalance | NumInstallTradesWBalance | NumBank2NatlTradesWHighUtilization | PercentTradesWBalance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2428 | 0 | 69 | 129 | 13 | 63 | 24 | 1 | 0 | 80 | 20 | ... | 24 | -7 | 0 | 0 | 51 | 71 | 4 | 3 | 2 | 53 |
| 4556 | 1 | 77 | 464 | 31 | 138 | 16 | 1 | 1 | 94 | 50 | ... | 31 | -7 | 0 | 0 | 11 | -8 | 3 | 2 | 0 | 71 |
| 4520 | 1 | 61 | 183 | 26 | 111 | 3 | 1 | 1 | 60 | 7 | ... | 20 | -7 | 1 | 1 | 94 | -8 | 1 | 1 | -8 | 100 |


# Machine Learning Risk Performance Predictor

In [13]:
import mlflow

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Get values from the dataframe

In [14]:
X = heloc_df.iloc[:, 1:].values
y = heloc_df.iloc[:, 0].values


Split into train and test

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
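
Called without further arguments, `train_test_split` holds out 25% of the rows and shuffles them differently on every run, so the scores further down change between executions. A hedged sketch of a reproducible, stratified variant on synthetic data. The seed `42` and the toy arrays are arbitrary assumptions, not values used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows, 3 features, 40% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 40 + [0] * 60)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Note that adopting this in the notebook would change the split, and therefore the exact numbers reported in the following cells.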

## Feature Scaling

In [16]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [17]:
X_train[1]

array([-3.65250537, -1.77117284, -1.31199757, -2.13873074, -2.18994908,
       -3.62333165, -3.76959543, -3.70355811, -0.76658089, -3.7454938 ,
       -3.68597658, -2.0574265 , -3.36021475, -2.04921054, -1.44239851,
       -3.15344583, -3.17291901, -1.35377495, -1.13291145, -2.76028026,
       -2.44477193, -2.69474731, -2.56630471])
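
`StandardScaler` learns each column's mean and standard deviation from the training set and maps every value to z = (x - mean) / std. The test set is transformed with the training statistics, not its own. The row printed above, with nearly every feature several standard deviations below the mean, is what a record made up mostly of the minimum value -9 would look like after this transformation (an inference from the summary statistics, not something verified here). A small sketch with made-up numbers that checks the scaler against the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test values for a single feature.
train = np.array([[60.0], [70.0], [80.0]])
test = np.array([[-9.0], [70.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual z-score with the population standard deviation (ddof=0),
# which is what StandardScaler uses.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(test_scaled.ravel())
print(np.allclose(test_scaled, manual))
```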

## Training the Logistic Regression model on the Training set

In [18]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

## Predicting the Test set results

In [19]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))


[[1 1]
 [1 1]
 [0 0]
 ...
 [1 0]
 [0 0]
 [0 0]]


## Making the Confusion Matrix

In [20]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[1051  332]
 [ 409  823]]


0.7166347992351817
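
In scikit-learn's convention the rows of the confusion matrix are the true classes and the columns the predicted ones, so the matrix above reads [[TN, FP], [FN, TP]] for the positive class 1 ('Good'). A short sketch that derives accuracy, precision and recall from those four counts. Only the counts are taken from the output above:

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm = np.array([[1051, 332],
               [409, 823]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)       # of predicted 'Good', how many were 'Good'
recall = tp / (tp + fn)          # of actual 'Good', how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy computed this way matches the `accuracy_score` result shown above.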

In [21]:
# Note: to predict on new data, scale the input with the fitted scaler first,
# like so:

# print(classifier.predict(sc.transform(input_array)))

Save the model

In [26]:
import pickle

In [28]:
filename = 'api/model/final_model.sav'
# Use a context manager so the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(classifier, f)

In [30]:
with open('api/model/scaler.pkl','wb') as f:
    pickle.dump(sc, f)
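
Both artifacts are needed at prediction time: the scaler to transform raw inputs, then the classifier. A minimal round-trip sketch on synthetic data, using a temporary directory instead of the `api/model/` paths. The variable names and the toy data are illustrative assumptions:

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy training data: one feature, label is 1 when the value is positive.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_scaler = StandardScaler().fit(X_toy)
toy_model = LogisticRegression(random_state=0).fit(toy_scaler.transform(X_toy), y_toy)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "final_model.sav"
    scaler_path = Path(tmp) / "scaler.pkl"

    # Save both objects, closing each file when done.
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(scaler_path, "wb") as f:
        pickle.dump(toy_scaler, f)

    # Load them back, as a separate prediction service would.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)

# Scale raw inputs with the loaded scaler before predicting.
raw_inputs = np.array([[-2.0], [2.0]])
predictions = loaded_model.predict(loaded_scaler.transform(raw_inputs))
print(predictions)
```

Pickle files can execute arbitrary code when loaded, so only unpickle artifacts from a trusted source.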


# MLFlow

In [22]:
def evaluate(y: np.ndarray, pred: np.ndarray) -> float:
    # Root mean squared error between true and predicted labels (lower is better)
    rmse = np.sqrt(mean_squared_error(y, pred))
    return rmse
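
For 0/1 labels every squared error is either 0 or 1, so the mean squared error equals the misclassification rate and the RMSE returned by `evaluate` is sqrt(1 - accuracy). The score is therefore a monotone transform of accuracy, with lower values being better. A quick check with made-up labels:

```python
import numpy as np

# Made-up binary labels and predictions: 2 of 8 predictions are wrong.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

accuracy = (y_true == y_hat).mean()
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(accuracy, rmse)
print(np.isclose(rmse, np.sqrt(1 - accuracy)))
```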

In [23]:
mlflow.set_tracking_uri("http://localhost:5000")
# mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("Heloc Experiment")


<Experiment: artifact_location='mlflow-artifacts:/1', creation_time=1690128332051, experiment_id='1', last_update_time=1690128332051, lifecycle_stage='active', name='Heloc Experiment', tags={}>

In [24]:
from hyperopt import hp, tpe, fmin, Trials
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

In [25]:
from hyperopt import fmin, tpe, hp, space_eval, STATUS_OK, Trials
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# specify the search space for hyperparameters
space = {
    # None disables regularization; the string 'none' is deprecated
    'penalty': hp.choice('penalty', ['l2', None]),
    'C': hp.uniform('C', 0.0, 10.0),
    'fit_intercept': True,
    'solver': "newton-cholesky",
}

# define objective function to minimize
def objective(params):
    classifier = LogisticRegression(**params)
    accuracy = cross_val_score(classifier, X_train, y_train, cv=5).mean()
    # hyperopt minimizes the loss, so return the negated accuracy
    return {'loss': -accuracy, 'status': STATUS_OK}

# running the optimizer
trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=50,
            trials=trials)

# fmin returns the *index* of each hp.choice option (e.g. {'penalty': 0}) and
# omits the fixed entries; space_eval maps it back to actual parameter values
best_params = space_eval(space, best)

# log the best parameters and model to mlflow
with mlflow.start_run():
    classifier = LogisticRegression(**best_params)
    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)
    score = evaluate(y_test, y_pred)

    mlflow.log_params(best_params)
    mlflow.log_metric("score", score)
    mlflow.sklearn.log_model(classifier, "model")

100%|██████████| 50/50 [00:02<00:00, 20.03trial/s, best loss: -0.7166001677917821]