# Retail Analysis

## Summary of the business

HELOC

This analysis uses the HELOC (Home Equity Line of Credit) dataset from FICO. Each entry in the dataset is a line of credit, typically offered by a bank as a percentage of home equity (the difference between the current market value of a home and its purchase price). The customers in this dataset requested a credit line in the range of $5,000 to $150,000. The fundamental task is to use the information in the applicant's credit report to predict whether they will repay their HELOC account within 2 years.

In [2]:
import pandas as pd
import numpy as np

# Data Exploration

The data comes in a single CSV file: heloc.csv

In [3]:
heloc_df = pd.read_csv('data/heloc.csv')

In [4]:
print(f'There are {len(heloc_df.columns)} columns: ')
print()
for col in heloc_df.columns:
    print(col)

There are 24 columns: 

RiskPerformance
ExternalRiskEstimate
MSinceOldestTradeOpen
MSinceMostRecentTradeOpen
AverageMInFile
NumSatisfactoryTrades
NumTrades60Ever2DerogPubRec
NumTrades90Ever2DerogPubRec
PercentTradesNeverDelq
MSinceMostRecentDelq
MaxDelq2PublicRecLast12M
MaxDelqEver
NumTotalTrades
NumTradesOpeninLast12M
PercentInstallTrades
MSinceMostRecentInqexcl7days
NumInqLast6M
NumInqLast6Mexcl7days
NetFractionRevolvingBurden
NetFractionInstallBurden
NumRevolvingTradesWBalance
NumInstallTradesWBalance
NumBank2NatlTradesWHighUtilization
PercentTradesWBalance


In [5]:
# Let's check that there are no null values in any column
# (indexing the sums with [1] would only inspect the second column):
not heloc_df.isnull().values.any()

True

None of the values are null, which helps greatly. However, non-null does not mean valid, so the next step is to check the quality of the data in the dataframe.

In [6]:
heloc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10459 entries, 0 to 10458
Data columns (total 24 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   RiskPerformance                     10459 non-null  object
 1   ExternalRiskEstimate                10459 non-null  int64 
 2   MSinceOldestTradeOpen               10459 non-null  int64 
 3   MSinceMostRecentTradeOpen           10459 non-null  int64 
 4   AverageMInFile                      10459 non-null  int64 
 5   NumSatisfactoryTrades               10459 non-null  int64 
 6   NumTrades60Ever2DerogPubRec         10459 non-null  int64 
 7   NumTrades90Ever2DerogPubRec         10459 non-null  int64 
 8   PercentTradesNeverDelq              10459 non-null  int64 
 9   MSinceMostRecentDelq                10459 non-null  int64 
 10  MaxDelq2PublicRecLast12M            10459 non-null  int64 
 11  MaxDelqEver                         10459 non-null  in

In [7]:
heloc_df.describe(include=[object])

Unnamed: 0,RiskPerformance
count,10459
unique,2
top,Bad
freq,5459


In [8]:
heloc_df['RiskPerformance'].unique()

array(['Bad', 'Good'], dtype=object)

The only categorical column is RiskPerformance, a nominal binary variable with the values Bad and Good.

In [9]:
heloc_df.describe()

Unnamed: 0,ExternalRiskEstimate,MSinceOldestTradeOpen,MSinceMostRecentTradeOpen,AverageMInFile,NumSatisfactoryTrades,NumTrades60Ever2DerogPubRec,NumTrades90Ever2DerogPubRec,PercentTradesNeverDelq,MSinceMostRecentDelq,MaxDelq2PublicRecLast12M,...,PercentInstallTrades,MSinceMostRecentInqexcl7days,NumInqLast6M,NumInqLast6Mexcl7days,NetFractionRevolvingBurden,NetFractionInstallBurden,NumRevolvingTradesWBalance,NumInstallTradesWBalance,NumBank2NatlTradesWHighUtilization,PercentTradesWBalance
count,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,...,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0,10459.0
mean,67.425758,184.205373,8.543455,73.843293,19.428052,0.042738,-0.142843,86.661536,6.762406,4.928291,...,32.16646,-0.325366,0.868152,0.812602,31.629888,39.158906,3.185008,0.976097,0.018071,62.079166
std,21.121621,109.683816,13.301745,38.782803,13.004327,2.51391,2.367397,25.999584,20.50125,3.756275,...,20.128634,6.067556,3.179304,3.143698,30.06014,42.101601,4.413173,4.060995,3.358135,27.711565
min,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,...,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0
25%,63.0,118.0,3.0,52.0,12.0,0.0,0.0,87.0,-7.0,4.0,...,20.0,-7.0,0.0,0.0,5.0,-8.0,2.0,1.0,0.0,47.0
50%,71.0,178.0,5.0,74.0,19.0,0.0,0.0,96.0,-7.0,6.0,...,31.0,0.0,1.0,1.0,25.0,47.0,3.0,2.0,0.0,67.0
75%,79.0,249.5,11.0,95.0,27.0,1.0,0.0,100.0,14.0,7.0,...,44.0,1.0,2.0,2.0,54.0,79.0,5.0,3.0,1.0,82.0
max,94.0,803.0,383.0,383.0,79.0,19.0,19.0,100.0,83.0,9.0,...,100.0,24.0,66.0,66.0,232.0,471.0,32.0,23.0,18.0,100.0
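
The `min` row above is -9 for every column, and -7 and -8 appear among the quartiles. Such values are implausible for counts, months, or percentages, so they look like special codes rather than measurements. The sketch below assumes that -9, -8, and -7 are sentinel codes (an assumption that should be checked against the dataset documentation) and shows one way to count them and mask them as missing. `demo_df` is a hypothetical miniature frame with made-up values; in the notebook the same calls would be applied to `heloc_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for heloc_df (values are made up).
demo_df = pd.DataFrame({
    "ExternalRiskEstimate": [62, -9, 79, 71],
    "MSinceMostRecentDelq": [4, -7, -7, 14],
    "NetFractionInstallBurden": [25, 98, -8, 47],
})

# Assumed sentinel codes, based on the negative values seen in describe().
SENTINELS = [-9, -8, -7]

# Count how often each column holds one of the assumed sentinel codes.
sentinel_counts = demo_df.isin(SENTINELS).sum()
print(sentinel_counts)

# Mask the codes as NaN so they no longer distort means and correlations.
cleaned_df = demo_df.replace(SENTINELS, np.nan)
print(cleaned_df.isna().sum())
```

If the assumption holds, the means and standard deviations reported by `describe()` are pulled down by these codes, which is worth keeping in mind when reading the correlations later on.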


RiskPerformance appears to indicate whether or not the client is eligible for the loan. It is therefore the y value that the ML model will predict.

## Searching for Correlations with RiskPerformance

In [10]:
heloc_df.sample()

Unnamed: 0,RiskPerformance,ExternalRiskEstimate,MSinceOldestTradeOpen,MSinceMostRecentTradeOpen,AverageMInFile,NumSatisfactoryTrades,NumTrades60Ever2DerogPubRec,NumTrades90Ever2DerogPubRec,PercentTradesNeverDelq,MSinceMostRecentDelq,...,PercentInstallTrades,MSinceMostRecentInqexcl7days,NumInqLast6M,NumInqLast6Mexcl7days,NetFractionRevolvingBurden,NetFractionInstallBurden,NumRevolvingTradesWBalance,NumInstallTradesWBalance,NumBank2NatlTradesWHighUtilization,PercentTradesWBalance
7221,Good,62,161,1,60,16,1,0,79,4,...,16,0,1,1,20,25,4,2,0,47


In [11]:
# Encode the target: Good -> 1, Bad -> 0
heloc_df['RiskPerformance'] = (heloc_df['RiskPerformance'] == 'Good').astype(int)

In [12]:
heloc_df.corr()[['RiskPerformance']]

Unnamed: 0,RiskPerformance
RiskPerformance,1.0
ExternalRiskEstimate,0.21677
MSinceOldestTradeOpen,0.185155
MSinceMostRecentTradeOpen,0.046937
AverageMInFile,0.209168
NumSatisfactoryTrades,0.12308
NumTrades60Ever2DerogPubRec,-0.067211
NumTrades90Ever2DerogPubRec,-0.043402
PercentTradesNeverDelq,0.12201
MSinceMostRecentDelq,-0.057067


In [13]:
heloc_df.sample(3)

Unnamed: 0,RiskPerformance,ExternalRiskEstimate,MSinceOldestTradeOpen,MSinceMostRecentTradeOpen,AverageMInFile,NumSatisfactoryTrades,NumTrades60Ever2DerogPubRec,NumTrades90Ever2DerogPubRec,PercentTradesNeverDelq,MSinceMostRecentDelq,...,PercentInstallTrades,MSinceMostRecentInqexcl7days,NumInqLast6M,NumInqLast6Mexcl7days,NetFractionRevolvingBurden,NetFractionInstallBurden,NumRevolvingTradesWBalance,NumInstallTradesWBalance,NumBank2NatlTradesWHighUtilization,PercentTradesWBalance
3106,0,66,23,3,36,21,0,0,100,-7,...,19,0,1,1,45,72,11,3,5,74
2323,0,63,84,2,65,25,0,0,96,0,...,50,-7,2,2,46,98,3,1,2,50
4537,0,79,86,9,36,13,0,0,100,-7,...,8,-7,0,0,19,-8,6,1,0,70


TODO:   
 [ ] Remove some weakly correlated columns?  
    [ ] Design experiments for the columns in MLflow
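
The first TODO item could be approached roughly as follows: rank the features by the absolute value of their correlation with the target and drop those under a cutoff. This is only a sketch. `demo_df` is a hypothetical frame with made-up values, and the `THRESHOLD` of 0.1 is an arbitrary assumption, not a recommendation. Note also that Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless.

```python
import pandas as pd

# Hypothetical miniature frame; the notebook would use heloc_df instead.
demo_df = pd.DataFrame({
    "RiskPerformance": [1, 0, 1, 0, 1, 0],
    "ExternalRiskEstimate": [80, 55, 75, 60, 85, 50],
    "NumInqLast6M": [1, 1, 2, 2, 3, 3],
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = demo_df.corr()["RiskPerformance"].drop("RiskPerformance")

# Rank by strength of the linear relationship, ignoring the sign.
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)

# Assumed cutoff: features below it are treated as weakly correlated.
THRESHOLD = 0.1
weak_columns = ranked[ranked < THRESHOLD].index.tolist()

reduced_df = demo_df.drop(columns=weak_columns)
print(reduced_df.columns.tolist())
```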

# Machine Learning Risk Performance Predictor

In [14]:
import mlflow

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Get values from the dataframe

In [15]:
X = heloc_df.iloc[:, 1:].values
y = heloc_df.iloc[:, 0].values


Split into train and test

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
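
Called without arguments, `train_test_split` holds out 25% of the rows and shuffles them differently on every run, so the scores further down will vary between executions. A minimal sketch of a reproducible split is shown here on hypothetical toy arrays (`X_demo`, `y_demo`); the seed 42 is an arbitrary choice. `stratify` keeps the class ratio the same in both parts, which matters less for a nearly balanced target like this one but costs nothing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 rows, 3 features, 60% class 0 and 40% class 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 120 + [1] * 80)

# A fixed random_state makes the split repeatable; stratify preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```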

## Feature Scaling

In [17]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [18]:
X_train[1]

array([-0.40323754, -0.62172153, -0.41908247,  0.61973185, -0.42289862,
        0.37937265,  0.48336977,  0.00985558, -0.1787039 , -0.25361311,
        0.12112622, -0.40675474,  0.24137398,  1.03600317, -1.09429611,
        0.03723953,  0.05508805, -0.9172518 ,  1.36386899, -0.26931694,
        0.24843962, -0.00427352,  1.36996771])
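
To make explicit what the scaler does: `fit_transform` learns each column's mean and standard deviation from the training set, and `transform` reuses those same statistics on the test set, so no information from the test set leaks into preprocessing. The small check below, on hypothetical arrays, confirms that this is equivalent to computing `(x - mean) / std` by hand with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy arrays standing in for the train and test features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[4.0, 30.0]])

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(train)
scaled_test = scaler.transform(test)

# Manual standardization using the training statistics (population std, ddof=0).
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaled_test)
print(np.allclose(scaled_test, manual_test))
```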

## Training the Logistic Regression model on the Training set

In [19]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

## Predicting the Test set results

In [20]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))


[[0 0]
 [0 1]
 [0 0]
 ...
 [0 0]
 [0 1]
 [1 1]]


## Making the Confusion Matrix

In [21]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[1038  311]
 [ 431  835]]


0.7162523900573614
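
Accuracy alone hides how the errors are distributed. In scikit-learn's convention the rows of the confusion matrix are the true labels and the columns are the predictions, in label order 0 (Bad) then 1 (Good). The sketch below recomputes the accuracy from the matrix printed above and derives precision and recall from it: roughly 0.73 precision and 0.66 recall for Good, against about 0.77 recall for Bad. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = np.array([[1038, 311],
               [431, 835]])

# For a binary problem, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # share of all predictions that are correct
precision_good = tp / (tp + fp)    # of those predicted Good, how many are Good
recall_good = tp / (tp + fn)       # of the true Good, how many were found
recall_bad = tn / (tn + fp)        # of the true Bad, how many were found

print(f"accuracy={accuracy:.4f}")
print(f"precision(Good)={precision_good:.4f} recall(Good)={recall_good:.4f}")
print(f"recall(Bad)={recall_bad:.4f}")
```

The model misses more Good applicants (431) than it wrongly approves Bad ones (311); which of the two errors is costlier depends on the business context.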

In [22]:
# Note: to predict on new data, remember to scale the input
# with the already fitted scaler first, like so:

# print(classifier.predict( sc.transform( input_array ) ) )
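
One way to avoid forgetting the scaling step is to bundle the scaler and the classifier into a single scikit-learn pipeline, which then scales any input automatically before predicting. This is a sketch of an alternative, not what the notebook does; `X_raw`, `y_toy`, and `input_array` are hypothetical toy data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the unscaled features and the target.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 4))
y_toy = (X_raw[:, 0] > 50.0).astype(int)

# The pipeline fits the scaler and the classifier together on raw inputs.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
model.fit(X_raw, y_toy)

# New, unscaled input: the pipeline applies the fitted scaler before predicting.
input_array = np.array([[70.0, 50.0, 50.0, 50.0]])
prediction = model.predict(input_array)
print(prediction)

# Same result as scaling by hand and calling the final estimator directly.
manual = model[-1].predict(model[0].transform(input_array))
print(np.array_equal(prediction, manual))
```

A pipeline can also be passed to `cross_val_score`, in which case the scaler is refitted on each training fold instead of once on the whole training set.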

# MLflow

In [23]:
def evaluate(y: np.ndarray, pred: np.ndarray) -> float:
    """Return the RMSE; for 0/1 labels this equals sqrt(misclassification rate)."""
    rmse = np.sqrt(mean_squared_error(y, pred))
    return rmse

In [24]:
mlflow.set_tracking_uri("http://localhost:5000")
# mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("Heloc Experiment")


<Experiment: artifact_location='mlflow-artifacts:/1', creation_time=1690128332051, experiment_id='1', last_update_time=1690128332051, lifecycle_stage='active', name='Heloc Experiment', tags={}>

In [25]:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import mlflow
import mlflow.sklearn

# specify solver choices
solver_choice_list = [{'solver':'newton-cg', 'penalty': 'l2'},
                      {'solver':'lbfgs', 'penalty': 'l2'},
                      {'solver': 'liblinear', 'penalty': hp.choice('p_lib',['l1','l2'])}, 
                      {'solver':'saga', 'penalty':'elasticnet', 'l1_ratio':hp.uniform('l1_ratio', 0, 1)}]

# specify search space
space= {'C': hp.uniform('C', 0.0, 10.0),
       'fit_intercept': hp.choice('fit_intercept', [True, False]),
       'multi_class': hp.choice('multi_class', ['auto', 'ovr']),
       'solver':  hp.choice('solver', solver_choice_list)}

# define objective function to minimize
def objective(space):
    solver_dict = space['solver']
    solver = solver_dict['solver']
    penalty = solver_dict['penalty']
    l1_ratio = solver_dict.get('l1_ratio', None)  
    C = space['C']
    fit_intercept = space['fit_intercept']
    multi_class = space['multi_class']
    classifier = LogisticRegression(solver=solver, penalty=penalty, C=C, fit_intercept=fit_intercept, multi_class=multi_class)
    if penalty == 'elasticnet':
        classifier.set_params(l1_ratio=l1_ratio)
    accuracy = cross_val_score(classifier, X_train, y_train, cv=5).mean()
    return {'loss': -accuracy, 'status': STATUS_OK}

# running the optimizer
trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=50,
            trials=trials)

# log the best parameters and model to mlflow
from hyperopt import space_eval

with mlflow.start_run():
    # fmin returns every hp.choice pick as an index into its option list
    # (e.g. fit_intercept=0 means True here); space_eval maps them back to values
    best_params = space_eval(space, best)
    # flatten the nested solver dict (solver, penalty and, for saga, l1_ratio)
    best_params.update(best_params.pop('solver'))
    classifier = LogisticRegression(**best_params)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    mlflow.log_params(best_params)
    mlflow.log_metric("score", score)
    mlflow.sklearn.log_model(classifier, "model")



100%|██████████| 50/50 [00:36<00:00,  1.39trial/s, best loss: -0.7163422204445832]

In [26]:
from hyperopt import space_eval
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# X_train and X_test were already standardized in the Feature Scaling step,
# so they are reused here without scaling them a second time
X_train_scaled = X_train
X_test_scaled = X_test

# Specify the search space for RandomForest
space_rf = {
    'n_estimators': hp.choice('n_estimators', range(100, 500)),
    'max_depth': hp.choice('max_depth', range(1, 20)),
    'min_samples_split': hp.choice('min_samples_split', range(2, 20)),
    'min_samples_leaf': hp.choice('min_samples_leaf', range(1, 20)),
}

# Define the objective function
def objective_rf(space):
    classifier = RandomForestClassifier(n_estimators=space['n_estimators'],
                                        max_depth=space['max_depth'],
                                        min_samples_split=space['min_samples_split'],
                                        min_samples_leaf=space['min_samples_leaf'])
    accuracy = cross_val_score(classifier, X_train_scaled, y_train, cv=5).mean()
    return {'loss': -accuracy, 'status': STATUS_OK}

# Run the optimizer
trials = Trials()
best = fmin(fn=objective_rf, space=space_rf, algo=tpe.suggest, max_evals=50, trials=trials)

# fmin returns each hp.choice pick as an index into its range (index 0 of
# range(100, 500) means 100 trees); space_eval maps the indices back to values
best_params = space_eval(space_rf, best)

# Train the model using best hyperparameters
classifier = RandomForestClassifier(**best_params)
classifier.fit(X_train_scaled, y_train)

# Feature selection using RandomForest feature importance
selector = SelectFromModel(classifier, prefit=True)
X_train_reduced = selector.transform(X_train_scaled)
X_test_reduced = selector.transform(X_test_scaled)

# Train a new model on the reduced dataset (a fresh estimator, so the
# forest used by the selector is left untouched)
classifier_reduced = RandomForestClassifier(**best_params)
classifier_reduced.fit(X_train_reduced, y_train)

# Predict and calculate score
y_pred = classifier_reduced.predict(X_test_reduced)
score = accuracy_score(y_test, y_pred)

# Log parameters, score and model to mlflow
with mlflow.start_run():
    mlflow.log_params(best_params)
    mlflow.log_metric("score", score)
    mlflow.sklearn.log_model(classifier_reduced, "random_forest")


100%|██████████| 50/50 [08:09<00:00,  9.80s/trial, best loss: -0.7283267322225258]


