# Case Study: Using Pre-trained Models to Classify Samples

## 01 Project Description

Lending Bank wants to attract term deposits to fund its lending business. In a term deposit, a client agrees to deposit funds and allow the bank to use them for a fixed length of time. In return, the bank will pay interest on the deposit.

The bank’s sales manager wants to market the product to their existing clients. They have historical information from a previous marketing campaign that includes client demographics, prior call experience, market conditions and the interest rate offered.

Using machine learning, help the bank predict which clients are likely to subscribe to a new term deposit. Explain how different features affect the decision.

**Problem**

Analyze the given data to determine how different features relate to term deposit subscription. Build a machine learning model that predicts the subs_deposit label.

## 02 Data Preparation

In [55]:
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
import sys 
sys.path.append("..") 
from src.data.sample_format import convert_sample



In [56]:
data_filepath = '../data/raw/train.csv'
# Read just the first 10 rows to understand the sample values.
df_test = pd.read_csv(data_filepath, nrows=10)
display(df_test.head())

Unnamed: 0,client_id,age_bracket,job,marital,education,has_housing_loan,has_personal_loan,prev_call_duration,days_since_last_call,num_contacts_prev,poutcome,contact_date,cpi,subs_deposit
0,41020,41-60,white-collar,divorced,bachelors,yes,no,283,3,1,success,07/09/18,92.379,1
1,23720,60+,other,divorced,secondary,no,yes,169,6,2,success,05/07/18,94.215,1
2,29378,41-60,white-collar,married,bachelors,no,no,552,999,0,nonexistent,01/08/18,93.444,1
3,36636,25-40,technician,single,senior_secondary,yes,yes,206,999,0,nonexistent,02/11/18,93.2,0
4,38229,18-24,white-collar,single,bachelors,no,no,341,999,0,nonexistent,04/04/18,93.075,1


### Features Factory for one sample

In [101]:
from pprint import pprint

data = df_test.drop(columns='subs_deposit')
target = df_test[['client_id', 'subs_deposit']]
 

In [102]:
display(data.head())
display(target.head())

Unnamed: 0,client_id,age_bracket,job,marital,education,has_housing_loan,has_personal_loan,prev_call_duration,days_since_last_call,num_contacts_prev,poutcome,contact_date,cpi
0,41020,41-60,white-collar,divorced,bachelors,yes,no,283,3,1,success,07/09/18,92.379
1,23720,60+,other,divorced,secondary,no,yes,169,6,2,success,05/07/18,94.215
2,29378,41-60,white-collar,married,bachelors,no,no,552,999,0,nonexistent,01/08/18,93.444
3,36636,25-40,technician,single,senior_secondary,yes,yes,206,999,0,nonexistent,02/11/18,93.2
4,38229,18-24,white-collar,single,bachelors,no,no,341,999,0,nonexistent,04/04/18,93.075


Unnamed: 0,client_id,subs_deposit
0,41020,1
1,23720,1
2,29378,1
3,36636,0
4,38229,1


### Create a Sample as a Dict for Testing

In [115]:
# assume the input sample is from the REST API in JSON format.
sample = data.iloc[0, :].to_dict()
y_true = target['subs_deposit'][0]
pprint(sample)
# pprint(sample.to_json())


{'age_bracket': '41-60',
 'client_id': 41020,
 'contact_date': '07/09/18',
 'cpi': 92.379,
 'days_since_last_call': 3,
 'education': 'bachelors',
 'has_housing_loan': 'yes',
 'has_personal_loan': 'no',
 'job': 'white-collar',
 'marital': 'divorced',
 'num_contacts_prev': 1,
 'poutcome': 'success',
 'prev_call_duration': 283}
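
The cell above takes the sample from a DataFrame row. If it really arrived over a REST API, it would start out as a JSON string. A minimal sketch, assuming a flat payload with the same field names (the payload literal is hypothetical and simply mirrors the values printed above):

```python
import json

# Hypothetical request body; field names and values mirror the sample printed above.
payload = """
{
  "client_id": 41020,
  "age_bracket": "41-60",
  "job": "white-collar",
  "marital": "divorced",
  "education": "bachelors",
  "has_housing_loan": "yes",
  "has_personal_loan": "no",
  "prev_call_duration": 283,
  "days_since_last_call": 3,
  "num_contacts_prev": 1,
  "poutcome": "success",
  "contact_date": "07/09/18",
  "cpi": 92.379
}
"""

# json.loads returns a plain dict, the same shape that convert_sample receives above.
sample_from_api = json.loads(payload)
print(sorted(sample_from_api))
```

Parsing the payload this way yields only built-in Python types, which keeps the test sample close to what a deployed endpoint would see.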


In [124]:
# Convert the sample to the format required by the model.
new_sample = convert_sample(sample)
pprint(new_sample)

# Check that contact_date is also accepted in ISO format.
sample['contact_date'] = '2018-09-07'
new_sample = convert_sample(sample)
assert new_sample['contact_weekday'] == 5, 'contact_weekday is not 5'

{'age_bracket': 3,
 'client_id': 41020,
 'contact_weekday': 5,
 'cpi': 92.379,
 'days_since_last_call': 3,
 'education': 1,
 'has_housing_loan': 1,
 'has_personal_loan': 0,
 'job': 1,
 'marital': 3,
 'num_contacts_prev': 1,
 'poutcome': 1,
 'prev_call_duration': 283}
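
The source of `convert_sample` is not shown here, so the following is only a sketch of how the `contact_weekday` feature could be derived. It assumes the CSV dates are day-first (`dd/mm/yy`) and that the weekday is the ISO weekday (Monday = 1), which is consistent with both `'07/09/18'` and `'2018-09-07'` mapping to 5 above. The function name `contact_weekday` and the format list are illustrative, not part of the project.

```python
from datetime import datetime

# Assumed input formats: day-first 'dd/mm/yy' as in the CSV, and ISO 'yyyy-mm-dd'.
_DATE_FORMATS = ("%d/%m/%y", "%Y-%m-%d")


def contact_weekday(date_str: str) -> int:
    """Return the ISO weekday (Monday=1 ... Sunday=7) for a contact date string."""
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(date_str, fmt).isoweekday()
        except ValueError:
            # This format did not match; try the next one.
            continue
    raise ValueError(f"unsupported contact_date format: {date_str!r}")


# 7 September 2018 was a Friday, so both spellings give the same weekday.
print(contact_weekday("07/09/18"), contact_weekday("2018-09-07"))
```

Trying an explicit list of formats, rather than guessing, makes an unexpected date string fail loudly instead of being silently parsed month-first.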


#### Convert one Sample to DataFrame

In [126]:
df_sample = pd.DataFrame(new_sample, index=[0])
df_sample.set_index('client_id', inplace=True)
df_sample

client_id,age_bracket,job,marital,education,has_housing_loan,has_personal_loan,prev_call_duration,days_since_last_call,num_contacts_prev,poutcome,contact_weekday,cpi
41020,3,1,3,1,1,0,283,3,1,1,5,92.379


In [127]:
from autogluon.tabular import TabularPredictor

save_path = '../models/agModels-predictClass'
model_autoML = TabularPredictor.load(save_path)


In [128]:
y_pred = model_autoML.predict(df_sample)
y_proba = model_autoML.predict_proba(df_sample)
print(f'target label: {y_true}') 
print(f'predicted label: {y_pred.values[0]}')
display(y_proba)

target label: 1
predicted label: 1


client_id,0,1
41020,0.030215,0.969785


In [129]:
from joblib import load

# Load the decision tree model and score the same sample as a NumPy array.
model = load('../models/model_DT.joblib')
x = df_sample.values

print(f'input sample {x.size}: {x}')

y_pred = model.predict(x)
y_proba = model.predict_proba(x)
y_pred, y_proba

input sample 12: [[  3.      1.      3.      1.      1.      0.    283.      3.      1.
    1.      5.     92.379]]


(array([1]), array([[0.06185567, 0.93814433]]))

In [130]:
x

array([[  3.   ,   1.   ,   3.   ,   1.   ,   1.   ,   0.   , 283.   ,
          3.   ,   1.   ,   1.   ,   5.   ,  92.379]])

In [131]:
from joblib import load

# Load the random forest model and score the same sample.
model = load('../models/model_RF.joblib')
x = df_sample.values
y_pred = model.predict(x)
y_proba = model.predict_proba(x)
y_pred, y_proba

(array([1]), array([[0.12238699, 0.87761301]]))

### Features Factory for a batch of samples

In [132]:
data 

Unnamed: 0,client_id,age_bracket,job,marital,education,has_housing_loan,has_personal_loan,prev_call_duration,days_since_last_call,num_contacts_prev,poutcome,contact_date,cpi
0,41020,41-60,white-collar,divorced,bachelors,yes,no,283,3,1,success,07/09/18,92.379
1,23720,60+,other,divorced,secondary,no,yes,169,6,2,success,05/07/18,94.215
2,29378,41-60,white-collar,married,bachelors,no,no,552,999,0,nonexistent,01/08/18,93.444
3,36636,25-40,technician,single,senior_secondary,yes,yes,206,999,0,nonexistent,02/11/18,93.2
4,38229,18-24,white-collar,single,bachelors,no,no,341,999,0,nonexistent,04/04/18,93.075
5,27202,25-40,self-employed,married,secondary,no,no,81,999,0,nonexistent,06/08/18,93.444
6,1409,60+,white-collar,married,bachelors,no,no,1076,6,1,success,07/05/18,92.893
7,24379,41-60,other,married,senior_secondary,no,no,133,999,0,nonexistent,06/07/18,93.918
8,10036,25-40,blue-collar,married,secondary,no,no,253,999,1,failure,03/05/18,92.893
9,18115,41-60,self-employed,married,bachelors,no,no,467,999,0,nonexistent,01/06/18,94.465


In [133]:
target

Unnamed: 0,client_id,subs_deposit
0,41020,1
1,23720,1
2,29378,1
3,36636,0
4,38229,1
5,27202,0
6,1409,1
7,24379,0
8,10036,0
9,18115,1


#### Convert a Batch of Samples to DataFrame

In [139]:
# Convert every row with the same function used for a single sample.
df_batch = pd.DataFrame(data.apply(convert_sample, axis=1).to_list())
df_batch.set_index('client_id', inplace=True)

# Index the labels by client_id so they align with the predictions.
target = target.set_index('client_id')

In [140]:
y_pred = model_autoML.predict(df_batch)
y_proba = model_autoML.predict_proba(df_batch)

perf = model_autoML.evaluate_predictions(y_true=target.subs_deposit,
                                         y_pred=y_pred,
                                         auxiliary_metrics=True)


In [141]:
perf

{'accuracy': 0.8,
 'balanced_accuracy': 0.75,
 'mcc': 0.6123724356957946,
 'f1': 0.8571428571428571,
 'precision': 0.75,
 'recall': 1.0}

In [142]:
# print(f'target label: {target.subs_deposit.values}')
# print(f'predicted label: {y_pred.values}')
df_results = target.copy()
df_results['prediction'] = y_pred
df_results = pd.concat([df_results, y_proba], axis=1)
df_results.rename(columns={0: 'No', 1: 'Yes'}, inplace=True)
df_results.sort_values('Yes', ascending=False, inplace=True)
display(df_results)

client_id,subs_deposit,prediction,No,Yes
41020,1,1,0.030215,0.969785
1409,1,1,0.057885,0.942115
18115,1,1,0.107665,0.892335
23720,1,1,0.109889,0.890111
38229,1,1,0.129909,0.870091
10036,0,1,0.141878,0.858122
29378,1,1,0.145968,0.854032
36636,0,1,0.423301,0.576699
24379,0,0,0.956612,0.043388
27202,0,0,0.970214,0.029786
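
The metrics reported by `evaluate_predictions` can be checked by hand from the results table. The sketch below copies the labels and predictions from that table and applies the standard confusion-matrix definitions. It is not AutoGluon's implementation, and the variable names are illustrative.

```python
from math import sqrt

# Labels and predictions copied from the results table above, in table order.
labels = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
preds = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# Confusion-matrix counts for the positive class (1 = subscribed).
tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)

accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)       # true negative rate
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print({"accuracy": accuracy, "balanced_accuracy": balanced_accuracy,
       "mcc": mcc, "f1": f1, "precision": precision, "recall": recall})
```

The counts are 6 true positives, 2 true negatives, 2 false positives, and 0 false negatives. The two false positives are clients 10036 and 36636, which is why recall is 1.0 while precision is 0.75.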
