# Case Study: Modelling: AutoML

## 01 Project Description

Lending Bank wants to attract term deposits to fund its lending business. In a term deposit, a client agrees to deposit funds and allow the bank to use them for a fixed length of time. In return, the bank will pay interest on the deposit.

The bank’s sales manager wants to market the product to their existing clients. They have historical information from a previous marketing campaign that includes client demographics, prior call experience, market conditions and the interest rate offered.

Using machine learning, help the bank predict which clients are likely to subscribe to a new term deposit. Explain how different features affect the decision.

**Problem**

Perform an analysis of the given data to determine how different features are related to credit card eligibility. Build a machine learning model that can predict the subs_deposit.

## 02 Load Training and Test Data

In [82]:
import pandas as pd
data_filepath = '../data/raw/train.csv'
df_preview = pd.read_csv(data_filepath, nrows=10)
display(df_preview.head())

Unnamed: 0,client_id,age_bracket,job,marital,education,has_housing_loan,has_personal_loan,prev_call_duration,days_since_last_call,num_contacts_prev,poutcome,contact_date,cpi,subs_deposit
0,41020,41-60,white-collar,divorced,bachelors,yes,no,283,3,1,success,07/09/18,92.379,1
1,23720,60+,other,divorced,secondary,no,yes,169,6,2,success,05/07/18,94.215,1
2,29378,41-60,white-collar,married,bachelors,no,no,552,999,0,nonexistent,01/08/18,93.444,1
3,36636,25-40,technician,single,senior_secondary,yes,yes,206,999,0,nonexistent,02/11/18,93.2,0
4,38229,18-24,white-collar,single,bachelors,no,no,341,999,0,nonexistent,04/04/18,93.075,1


In [83]:
sample_type = {
    'client_id': 'str',
 'age_bracket': 'str',
 'job': 'str',
 'marital': 'str',
 'education': 'str',
 'has_housing_loan': 'str',
 'has_personal_loan': 'str',
 'prev_call_duration': 'int',  
 'days_since_last_call': 'int',  
 'num_contacts_prev':'int', 
 'poutcome': 'str',
 'contact_date': 'str',
 'cpi': 'float',
 'subs_deposit':'int', 
}
df_raw_train = pd.read_csv(
    data_filepath,
    index_col=['client_id'],
    parse_dates=['contact_date'],
    dtype=sample_type,
)
display(df_raw_train.head())

Unnamed: 0_level_0,age_bracket,job,marital,education,has_housing_loan,has_personal_loan,prev_call_duration,days_since_last_call,num_contacts_prev,poutcome,contact_date,cpi,subs_deposit
client_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
41020,41-60,white-collar,divorced,bachelors,yes,no,283,3,1,success,2018-07-09,92.379,1
23720,60+,other,divorced,secondary,no,yes,169,6,2,success,2018-05-07,94.215,1
29378,41-60,white-collar,married,bachelors,no,no,552,999,0,nonexistent,2018-01-08,93.444,1
36636,25-40,technician,single,senior_secondary,yes,yes,206,999,0,nonexistent,2018-02-11,93.2,0
38229,18-24,white-collar,single,bachelors,no,no,341,999,0,nonexistent,2018-04-04,93.075,1


In [84]:
def data_factory(df_in):
    df_out = df_in.copy()
    # set client_id as the index
    # df_out.set_index('client_id', inplace=True)
    # normalize the prev_call duration.
    # reset outliters
    df_out.prev_call_duration = df_out.prev_call_duration.apply(lambda x: x/100 if x >120000 else x)
    # set upperbound to 4200
    df_out.prev_call_duration = df_out.prev_call_duration.apply(lambda x: x if x < 4200 else 4200)
    # normalize the days_since_last_call, replace 999 to -1
    df_out.days_since_last_call = df_out.days_since_last_call.apply(lambda x: -1 if x == 999 else x)
    # normalize the cpi
    # reset outliers
    df_out.cpi = df_out.cpi.apply(lambda x: x/10 if x >100 else x)
    # set upperbound to 100
    df_out.cpi = df_out.cpi.apply(lambda x: 100 if x > 100 else x)
    df_out.dropna(inplace=True)

    # replace subs_deposit with Yes to 1 and No to 0
    df_out.subs_deposit = df_out.subs_deposit.apply(lambda x: 'yes' if x == 1 else 'no')
    return df_out

df_train = data_factory(df_raw_train)

In [85]:
sample_type_test = {
    'client_id': 'str',
 'age_bracket': 'str',
 'job': 'str',
 'marital': 'str',
 'education': 'str',
 'has_housing_loan': 'str',
 'has_personal_loan': 'str',
 'prev_call_duration': 'int',  
 'days_since_last_call': 'int',  
 'num_contacts_prev':'int', 
 'poutcome': 'str',
 'contact_date': 'str',
 'cpi': 'float',
}

data_filepath_test = '../data/raw/test.csv'

df_raw_test = pd.read_csv(
    data_filepath,
    index_col=['client_id'],
    parse_dates=['contact_date'],
    dtype=sample_type,
)
display(df_raw_test.head())

 

Unnamed: 0_level_0,age_bracket,job,marital,education,has_housing_loan,has_personal_loan,prev_call_duration,days_since_last_call,num_contacts_prev,poutcome,contact_date,cpi,subs_deposit
client_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
41020,41-60,white-collar,divorced,bachelors,yes,no,283,3,1,success,2018-07-09,92.379,1
23720,60+,other,divorced,secondary,no,yes,169,6,2,success,2018-05-07,94.215,1
29378,41-60,white-collar,married,bachelors,no,no,552,999,0,nonexistent,2018-01-08,93.444,1
36636,25-40,technician,single,senior_secondary,yes,yes,206,999,0,nonexistent,2018-02-11,93.2,0
38229,18-24,white-collar,single,bachelors,no,no,341,999,0,nonexistent,2018-04-04,93.075,1


In [86]:
df_test = data_factory(df_raw_test)
display(df_test)

Unnamed: 0_level_0,age_bracket,job,marital,education,has_housing_loan,has_personal_loan,prev_call_duration,days_since_last_call,num_contacts_prev,poutcome,contact_date,cpi,subs_deposit
client_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
41020,41-60,white-collar,divorced,bachelors,yes,no,283.0,3,1,success,2018-07-09,92.379,yes
23720,60+,other,divorced,secondary,no,yes,169.0,6,2,success,2018-05-07,94.215,yes
29378,41-60,white-collar,married,bachelors,no,no,552.0,-1,0,nonexistent,2018-01-08,93.444,yes
36636,25-40,technician,single,senior_secondary,yes,yes,206.0,-1,0,nonexistent,2018-02-11,93.200,no
38229,18-24,white-collar,single,bachelors,no,no,341.0,-1,0,nonexistent,2018-04-04,93.075,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7519,41-60,entrepreneur,single,secondary,yes,no,396.0,-1,0,nonexistent,2018-02-05,92.893,yes
29822,41-60,white-collar,married,bachelors,yes,no,115.0,-1,0,nonexistent,2018-01-08,93.444,no
24462,25-40,white-collar,married,senior_secondary,yes,no,214.0,-1,0,nonexistent,2018-06-07,93.918,no
26089,25-40,pink-collar,married,secondary,yes,no,76.0,-1,0,nonexistent,2018-02-07,93.918,no


## 03 Model Development

In [87]:
from autogluon.tabular import TabularPredictor

label = 'subs_deposit'
save_path = '../models/agModels-CleanRawAcc'  # specifies folder to store trained models

# here we used all the feature.
predictor = TabularPredictor(label=label, path=save_path, eval_metric='balanced_accuracy').fit(df_train)

Beginning AutoGluon training ...
AutoGluon will save models to "../models/agModels-CleanRawAcc/"
AutoGluon Version:  0.4.0
Python Version:     3.8.10
Operating System:   Darwin
Train Data Rows:    4000
Train Data Columns: 12
Label Column: subs_deposit
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  ['yes', 'no']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = yes, class 0 = no
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (yes) vs negative (no) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting Aut

In [88]:
save_report_leaderboard_csv = '../reports/df_clean_raw_acc_leaderboard.csv'
df_report = predictor.leaderboard(df_train, silent=True, extra_metrics=[predictor.eval_metric])
df_report.to_csv(save_report_leaderboard_csv )
display(df_report)

Unnamed: 0,model,score_test,balanced_accuracy,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,RandomForestEntr,0.980293,0.980293,0.842376,0.123698,0.047108,0.503647,0.123698,0.047108,0.503647,1,True,6
1,RandomForestGini,0.979872,0.979872,0.839012,0.102994,0.044437,0.460479,0.102994,0.044437,0.460479,1,True,5
2,ExtraTreesEntr,0.975369,0.975369,0.803027,0.146977,0.04524,0.4541,0.146977,0.04524,0.4541,1,True,9
3,ExtraTreesGini,0.974947,0.974947,0.799663,0.157998,0.046098,0.437306,0.157998,0.046098,0.437306,1,True,8
4,KNeighborsDist,0.969339,0.969339,0.76806,0.048678,0.008131,0.00748,0.048678,0.008131,0.00748,1,True,2
5,WeightedEnsemble_L2,0.961864,0.961864,0.87913,0.395431,0.097837,11.537287,0.003996,0.001163,0.508546,2,True,14
6,XGBoost,0.932587,0.932587,0.863287,0.026145,0.006806,0.545246,0.026145,0.006806,0.545246,1,True,11
7,LightGBMLarge,0.924265,0.924265,0.851533,0.013458,0.006187,1.101521,0.013458,0.006187,1.101521,1,True,13
8,CatBoost,0.921526,0.921526,0.860732,0.016372,0.007974,3.671559,0.016372,0.007974,3.671559,1,True,7
9,LightGBMXT,0.909729,0.909729,0.844847,0.029335,0.007528,1.422638,0.029335,0.007528,1.422638,1,True,3


In [89]:
best_model_name = df_report[df_report.balanced_accuracy == df_report.balanced_accuracy.max()].model[0]
predictor.set_model_best(best_model_name)
predictor.save()
print('Best model is saved as {}'.format(best_model_name))

TabularPredictor saved. To load, use: predictor = TabularPredictor.load("../models/agModels-CleanRawAcc/")


Best model is saved as RandomForestEntr


### Training Performacne

In [90]:
data = df_train.drop(columns='subs_deposit')
target = df_train.subs_deposit 

y_pred = predictor.predict(data)
# print("Predictions:  \n", y_pred)
perf = predictor.evaluate_predictions(y_true=target,
                                         y_pred=y_pred,
                                         auxiliary_metrics=True)

Evaluation: balanced_accuracy on test data: 0.9802930661029776
Evaluations on test data:
{
    "balanced_accuracy": 0.9802930661029776,
    "accuracy": 0.98025,
    "mcc": 0.9588796642456014,
    "f1": 0.975289333750391,
    "precision": 0.970130678282514,
    "recall": 0.980503144654088
}


## Test Prediction

In [91]:
y_pred_test = predictor.predict(df_test)

Create a submission file for the test data.

In [92]:
df_submission = y_pred_test.to_frame('subs_deposit')
df_submission.subs_deposit = df_submission.subs_deposit.apply(lambda x: 1 if x == 'yes' else 0)
df_submission.to_csv('../reports/submission_acc.csv')

In [93]:
df_submission.head()

Unnamed: 0_level_0,subs_deposit
client_id,Unnamed: 1_level_1
41020,1
23720,1
29378,1
36636,0
38229,1


## 04 Try Model with F1 evaluation metric

In [94]:
save_path = '../models/agModels-CleanRawF1'  # specifies folder to store trained models

# here we used all the feature.
predictor = TabularPredictor(label=label, path=save_path, eval_metric='f1').fit(df_train, hyperparameters={'GBM':{}, 'XGB':{},'XT':{}, 'RF':{}, 'LR':{},'KNN':{},'FASTAI':{}})

Beginning AutoGluon training ...
AutoGluon will save models to "../models/agModels-CleanRawF1/"
AutoGluon Version:  0.4.0
Python Version:     3.8.10
Operating System:   Darwin
Train Data Rows:    4000
Train Data Columns: 12
Label Column: subs_deposit
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  ['yes', 'no']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = yes, class 0 = no
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (yes) vs negative (no) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting Auto

In [95]:
save_report_leaderboard_csv = '../reports/df_clean_raw_f1_leaderboard.csv'
df_report = predictor.leaderboard(df_train, silent=True, extra_metrics=[predictor.eval_metric])
df_report.to_csv(save_report_leaderboard_csv )
display(df_report)

Unnamed: 0,model,score_test,f1,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,RandomForest,0.974953,0.974953,0.805825,0.08764,0.048268,0.442106,0.08764,0.048268,0.442106,1,True,3
1,ExtraTrees,0.969811,0.969811,0.758794,0.117578,0.045478,0.500015,0.117578,0.045478,0.500015,1,True,4
2,XGBoost,0.91514,0.91514,0.833333,0.041911,0.007735,0.885735,0.041911,0.007735,0.885735,1,True,6
3,WeightedEnsemble_L2,0.91514,0.91514,0.833333,0.043708,0.009382,1.539423,0.001797,0.001647,0.653688,2,True,8
4,NeuralNetFastAI,0.858094,0.858094,0.809524,0.067836,0.014951,3.805497,0.067836,0.014951,3.805497,1,True,5
5,LightGBM,0.857852,0.857852,0.821256,0.010096,0.006438,1.135036,0.010096,0.006438,1.135036,1,True,2
6,KNeighbors,0.795003,0.795003,0.744526,0.029963,0.006463,0.006851,0.029963,0.006463,0.006851,1,True,1
7,LinearModel,0.774693,0.774693,0.762626,0.013357,0.013102,2.025372,0.013357,0.013102,2.025372,1,True,7


In [96]:
best_model_name = df_report[df_report.f1 == df_report.f1.max()].model[0]
predictor.set_model_best(best_model_name)
predictor.save()
print('Best model is saved as {}'.format(best_model_name))

TabularPredictor saved. To load, use: predictor = TabularPredictor.load("../models/agModels-CleanRawF1/")


Best model is saved as RandomForest


In [97]:
# Training performance
data = df_train.drop(columns='subs_deposit')
target = df_train.subs_deposit 

y_pred = predictor.predict(data)
# print("Predictions:  \n", y_pred)
perf = predictor.evaluate_predictions(y_true=target,
                                         y_pred=y_pred,
                                         auxiliary_metrics=True)

Evaluation: f1 on test data: 0.9749530369442705
Evaluations on test data:
{
    "f1": 0.9749530369442705,
    "accuracy": 0.98,
    "balanced_accuracy": 0.9798716041650356,
    "mcc": 0.9583332181532717,
    "precision": 0.9706982543640897,
    "recall": 0.9792452830188679
}


In [98]:
y_pred_test = predictor.predict(df_test)
df_submission = y_pred_test.to_frame('subs_deposit')
df_submission.subs_deposit = df_submission.subs_deposit.apply(lambda x: 1 if x == 'yes' else 0)
df_submission.to_csv('../reports/submission_f1.csv')

In [99]:
df_submission.head()

Unnamed: 0_level_0,subs_deposit
client_id,Unnamed: 1_level_1
41020,1
23720,1
29378,1
36636,0
38229,1


## 05 Feature Importances.

In [100]:
df_fi= predictor.feature_importance(df_train)
df_fi

Computing feature importance via permutation shuffling for 12 features using 1000 rows with 3 shuffle sets...
	2.98s	= Expected runtime (0.99s per shuffle set)
	0.81s	= Actual runtime (Completed 3 of 3 shuffle sets)


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
prev_call_duration,0.363587,0.019175,0.000463,3,0.473464,0.25371
contact_date,0.1514,0.001646,2e-05,3,0.160832,0.141968
cpi,0.114571,0.005748,0.000419,3,0.14751,0.081633
job,0.030849,0.003014,0.001583,3,0.048119,0.01358
marital,0.024129,0.008665,0.020199,3,0.073778,-0.025521
education,0.02391,0.00994,0.026535,3,0.08087,-0.03305
age_bracket,0.017644,0.001202,0.000771,3,0.02453,0.010759
has_housing_loan,0.012618,0.004196,0.017468,3,0.036659,-0.011424
days_since_last_call,0.011674,0.003862,0.017302,3,0.033805,-0.010457
num_contacts_prev,0.009759,0.002117,0.007664,3,0.021891,-0.002373


In [101]:
# df_fi.importance.sum()
df_fi.to_csv('../reports/feature_importance.csv')

## 06 Web Service REST API

In [102]:
import pandas as pd
from autogluon.tabular import TabularPredictor


In [104]:
model = TabularPredictor.load(save_path)

In [108]:
model.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                 model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0              XGBoost   0.833333       0.007735  0.885735                0.007735           0.885735            1       True          6
1  WeightedEnsemble_L2   0.833333       0.009382  1.539423                0.001647           0.653688            2       True          8
2             LightGBM   0.821256       0.006438  1.135036                0.006438           1.135036            1       True          2
3      NeuralNetFastAI   0.809524       0.014951  3.805497                0.014951           3.805497            1       True          5
4         RandomForest   0.805825       0.048268  0.442106                0.048268           0.442106            1       True          3
5          LinearModel   0.762626       0.013102  2.025372                0.013102           2.025372            1  



{'model_types': {'KNeighbors': 'KNNModel',
  'LightGBM': 'LGBModel',
  'RandomForest': 'RFModel',
  'ExtraTrees': 'XTModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'XGBoost': 'XGBoostModel',
  'LinearModel': 'LinearModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighbors': 0.7445255474452555,
  'LightGBM': 0.8212560386473431,
  'RandomForest': 0.8058252427184466,
  'ExtraTrees': 0.7587939698492462,
  'NeuralNetFastAI': 0.8095238095238095,
  'XGBoost': 0.8333333333333334,
  'LinearModel': 0.7626262626262627,
  'WeightedEnsemble_L2': 0.8333333333333334},
 'model_best': 'RandomForest',
 'model_paths': {'KNeighbors': '../models/agModels-CleanRawF1/models/KNeighbors/',
  'LightGBM': '../models/agModels-CleanRawF1/models/LightGBM/',
  'RandomForest': '../models/agModels-CleanRawF1/models/RandomForest/',
  'ExtraTrees': '../models/agModels-CleanRawF1/models/ExtraTrees/',
  'NeuralNetFastAI': '../models/agModels-CleanRawF1/models/NeuralNetFastAI/',


In [113]:
sample = {'client_id': '123456', 'age_bracket': '18-24', 'job': 'white-collar', 'marital': 'married', 'education': 'bachelors', 'has_housing_loan': 'yes', 'has_personal_loan': 'yes', 'prev_call_duration': '237', 'days_since_last_call': '3', 'num_contacts_prev': '1', 'poutcome': 'success', 'contact_date': '2018-02-05', 'cpi': '92.81'}
print(sample)

{'client_id': '123456', 'age_bracket': '18-24', 'job': 'white-collar', 'marital': 'married', 'education': 'bachelors', 'has_housing_loan': 'yes', 'has_personal_loan': 'yes', 'prev_call_duration': '237', 'days_since_last_call': '3', 'num_contacts_prev': '1', 'poutcome': 'success', 'contact_date': '2018-02-05', 'cpi': '92.81'}


In [121]:
df_xin = pd.DataFrame([sample])
df_xin.set_index('client_id', inplace=True)
df_xin

Unnamed: 0_level_0,age_bracket,job,marital,education,has_housing_loan,has_personal_loan,prev_call_duration,days_since_last_call,num_contacts_prev,poutcome,contact_date,cpi
client_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
123456,18-24,white-collar,married,bachelors,yes,yes,237,3,1,success,2018-02-05,92.81


In [122]:
y_pred = model.predict(df_xin)
y_proba = model.predict_proba(df_xin)
print(y_pred, y_proba)

client_id
123456    yes
Name: subs_deposit, dtype: object                  no       yes
client_id                    
123456     0.176667  0.823333


## END  OF NOTEBOOK