# Bank Marketing Dataset
- The [Bank Marketing Dataset](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) contains data from the direct marketing campaigns (phone calls) of a Portuguese banking institution. The goal is to predict whether a client will subscribe to a term deposit.
- It is a fairly large dataset with 41K+ rows, a mixture of categorical and continuous columns, and data imperfections to identify and manage.

## Dataset
The data has the following columns:

Bank client data:

|col num | col name | description |
|:---|:---|:---|
| 1 | age | client age (numeric) |
| 2 | job | type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown') |
| 3 | marital | marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed) |
| 4 | education | education level (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown') |
| 5 | default | has credit in default? (categorical: 'no','yes','unknown') |
| 6 | housing | has housing loan? (categorical: 'no','yes','unknown') |
| 7 | loan | has personal loan? (categorical: 'no','yes','unknown') |

Related to the last contact of the current campaign:

|col num | col name | description |
|:---|:---|:---|
| 8 | contact | contact communication type (categorical: 'cellular','telephone') |
| 9 | month | last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec') |
| 10 | day_of_week | last contact day of the week (categorical: 'mon','tue','wed','thu','fri') |


Other attributes:

|col num | col name | description |
|:---|:---|:---|
| 11 | campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
| 12 | pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) |
| 13 | previous | number of contacts performed before this campaign and for this client (numeric) |
| 14 | poutcome | outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success') |

Social and economic context attributes:

|col num | col name | description |
|:---|:---|:---|
| 15 | emp.var.rate | employment variation rate - quarterly indicator (numeric) |
| 16 | cons.price.idx | consumer price index - monthly indicator (numeric) |
| 17 | cons.conf.idx | consumer confidence index - monthly indicator (numeric) |
| 18 | euribor3m | euribor 3 month rate - daily indicator (numeric) |
| 19 | nr.employed | number of employees - quarterly indicator (numeric) |

Output variable (desired target):

|col num | col name | description |
|:---|:---|:---|
| 20 | y | This is the target column. Has the client subscribed a term deposit? (binary: 'yes','no') |

## Goal
The goals of this project are to:
1. Build a scikit-learn model to predict the target column `y` and tune its hyperparameters using AWS SageMaker
1. Deploy the model as a `Serverless Inference Endpoint` and test it
1. Run `Batch Transform` on the entire input dataset
1. Calculate the performance of the model predictions on the entire input dataset
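
As a rough illustration of the first goal, the sketch below shows what a script-mode training entry point for the SageMaker scikit-learn container could look like. The `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables and the `model_fn` hook follow the container's conventions. Everything else is an assumption made for illustration: the file name `train.csv`, the hyperparameter names, the choice of `RandomForestClassifier`, and a training file that is already numeric with a 0/1 target column `y`.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def parse_args(argv=None):
    """Parse hyperparameters and paths; SageMaker passes hyperparameters as CLI flags."""
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameter names, chosen for this sketch.
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=10)
    # Locations provided by the container; the fallbacks are for local runs only.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data"))
    args, _ = parser.parse_known_args(argv)
    return args


def train(args):
    """Fit the model on train.csv (assumed file name) and save it to the model directory."""
    df = pd.read_csv(os.path.join(args.train, "train.csv"))
    X, y = df.drop(columns=["y"]), df["y"]
    model = RandomForestClassifier(
        n_estimators=args.n_estimators, max_depth=args.max_depth, random_state=42
    )
    model.fit(X, y)
    os.makedirs(args.model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))
    return model


def model_fn(model_dir):
    """Called by the serving container to load the trained model."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))

# In the actual entry-point script, the last line would call: train(parse_args())
```

Keeping the hyperparameters as command-line flags is what lets a tuning job vary them between training runs without changing the script.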

## Recommended Steps
1. **Data Exploration:** Understand the data by looking at distributions and unique values in the columns. Are there any issues with the data?
1. **Data Cleaning:** Handle any issues you found with the data.
1. **Feature Engineering:** Handle the various data types by applying the appropriate feature engineering techniques.
1. **Model Selection:** Choose an appropriate scikit-learn model for this problem and implement the SageMaker model training code.
1. **Hyperparameter tuning:** Choose appropriate hyperparameter ranges and an objective metric for the chosen model and implement the SageMaker hyperparameter tuning code.
1. **Model training:** Submit the hyperparameter tuning job to SageMaker and monitor its progress.
1. **Model deployment as serverless inference:** Pick the best model from hyperparameter tuning, deploy it as a SageMaker serverless inference endpoint, and test it by posting some sample data to it.
1. **Batch transform:** Store the input dataset in a JSON Lines file, create a batch transform job from the model, and run it on the JSON Lines file.
1. **Performance calculation:** Calculate model performance on the entire input dataset using output of the batch transform job.
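
For the batch transform step, the input needs to be written as JSON Lines (one JSON object per row). A minimal sketch of that conversion with pandas follows; the two-row `sample` frame and the helper name `to_json_lines` are made up for illustration, and the exact payload shape your inference code expects is an assumption you would need to match.

```python
import io

import pandas as pd


def to_json_lines(frame: pd.DataFrame) -> str:
    """Serialize each row as one JSON object per line (JSON Lines)."""
    return frame.to_json(orient="records", lines=True)


# Tiny illustrative frame; the real input would be the feature columns of the dataset.
sample = pd.DataFrame({"age": [56.0, 37.0], "job": ["housemaid", "services"]})
payload = to_json_lines(sample)

# Round-trip check: the payload can be read back line by line.
restored = pd.read_json(io.StringIO(payload), lines=True)
```

Writing `payload` to a file and uploading it to the S3 bucket would then give the batch transform job its input.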

## Tips
- You can use the code below to get the S3 bucket for writing artifacts:
    ```
    import sagemaker
    session = sagemaker.Session()
    bucket = session.default_bucket()
    ```
- Are all the columns necessary or can we drop any?
- Does the data contain any issues?
- What ML task is this? Classification? Regression? Clustering?
- What are the data types of the columns? What pre-processing should you apply?
- What is the most appropriate metric for this model?

In [2]:
import pandas as pd
%matplotlib inline

df = pd.read_csv("https://raw.githubusercontent.com/stephenleo/sagemaker-deployment/main/data/final_project_bank.csv")

print(df.shape)
df.head()

(41188, 20)


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56.0,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57.0,services,married,high.school,unknown,no,,telephone,may,mon,1.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37.0,services,married,high.school,no,yes,no,telephone,may,mon,1.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40.0,admin.,married,basic.6y,no,no,no,telephone,may,mon,1.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56.0,services,married,high.school,no,no,yes,,may,mon,1.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


## All the best!
Get started below...

In [5]:
# Import libraries
import pandas as pd
import numpy as np
import sagemaker
from sagemaker import get_execution_role
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score





In [6]:
# Inspect the dataset
print(df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             40767 non-null  float64
 1   job             40704 non-null  object 
 2   marital         40775 non-null  object 
 3   education       40764 non-null  object 
 4   default         40797 non-null  object 
 5   housing         40809 non-null  object 
 6   loan            40733 non-null  object 
 7   contact         40748 non-null  object 
 8   month           40767 non-null  object 
 9   day_of_week     40752 non-null  object 
 10  campaign        40775 non-null  float64
 11  pdays           40739 non-null  float64
 12  previous        40770 non-null  float64
 13  poutcome        40757 non-null  object 
 14  emp.var.rate    40770 non-null  float64
 15  cons.price.idx  40819 non-null  float64
 16  cons.conf.idx   40784 non-null  float64
 17  euribor3m       40759 non-null 

In [7]:
print(df.describe())

                age     campaign         pdays      previous  emp.var.rate  \
count  40767.000000  40775.00000  40739.000000  40770.000000  40770.000000   
mean      40.021120      2.56699    962.340730      0.172823      0.082460   
std       10.419903      2.76876    187.242913      0.494873      1.570749   
min       17.000000      1.00000      0.000000      0.000000     -3.400000   
25%       32.000000      1.00000    999.000000      0.000000     -1.800000   
50%       38.000000      2.00000    999.000000      0.000000      1.100000   
75%       47.000000      3.00000    999.000000      0.000000      1.400000   
max       98.000000     56.00000    999.000000      7.000000      1.400000   

       cons.price.idx  cons.conf.idx     euribor3m   nr.employed  
count    40819.000000   40784.000000  40759.000000  40751.000000  
mean        93.575781     -40.504127      3.620653   5167.062656  
std          0.578958       4.624825      1.734620     72.224169  
min         92.201000     -50

In [8]:
# Check for missing values
print(df.isnull().sum())

age               421
job               484
marital           413
education         424
default           391
housing           379
loan              455
contact           440
month             421
day_of_week       436
campaign          413
pdays             449
previous          418
poutcome          431
emp.var.rate      418
cons.price.idx    369
cons.conf.idx     404
euribor3m         429
nr.employed       437
y                 398
dtype: int64
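
The counts above are easier to judge as a share of the 41,188 rows: each column is missing roughly 0.9% to 1.2% of its values (for example, 421 of 41,188 for `age`). A small sketch of a per-column report is shown below, using a toy frame rather than the real data; `missing_report` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, worst first."""
    report = pd.DataFrame({
        "n_missing": frame.isnull().sum(),
        "pct_missing": frame.isnull().mean() * 100,
    })
    return report.sort_values("n_missing", ascending=False)


# Toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "age": [56.0, np.nan, 37.0, 40.0],
    "job": ["housemaid", "services", None, None],
})
report = missing_report(toy)
print(report)
```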


In [9]:
# Check target variable distribution
print(df['y'].value_counts())

y
no     36199
yes     4591
Name: count, dtype: int64


In [10]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

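# Separate features and target variable
X = df.drop(columns=['y'])
# Encode the target as 1 ('yes') / 0 (anything else).
# Caution: the 398 rows with a missing `y` also become 0 here rather than being dropped.
y = df['y'].apply(lambda x: 1 if x == 'yes' else 0)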
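
In the cell above, any row whose `y` is missing falls into the `else` branch and is labeled 0, which adds mislabeled negatives to the data. One possible alternative, sketched here on toy data, is to drop unlabeled rows before encoding. The helper name `split_features_target` is hypothetical, and adopting this would change the row counts reported in the outputs below.

```python
import numpy as np
import pandas as pd


def split_features_target(frame: pd.DataFrame, target: str = "y"):
    """Drop rows with a missing target, then encode 'yes'/'no' as 1/0."""
    labeled = frame.dropna(subset=[target])
    X = labeled.drop(columns=[target])
    y = labeled[target].map({"yes": 1, "no": 0}).astype(int)
    return X, y


# Toy data: the third row has no label and should be excluded.
toy = pd.DataFrame({
    "age": [56.0, 57.0, 37.0, 40.0],
    "y": ["no", "yes", np.nan, "no"],
})
X_toy, y_toy = split_features_target(toy)
```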

In [11]:
# Define categorical and numeric columns
categorical_cols = ['job', 'marital', 'education', 'default', 'housing', 
                    'loan', 'contact', 'month', 'day_of_week', 'poutcome']
numeric_cols = ['age', 'campaign', 'pdays', 'previous', 'emp.var.rate', 
                'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']


In [12]:
# Handle missing values for numeric columns
numeric_imputer = SimpleImputer(strategy='mean')
X[numeric_cols] = numeric_imputer.fit_transform(X[numeric_cols])

In [13]:
# Handle missing values for categorical columns
categorical_imputer = SimpleImputer(strategy='constant', fill_value='unknown')
X[categorical_cols] = categorical_imputer.fit_transform(X[categorical_cols])

In [14]:
# Verify no missing values remain
print(X.isnull().sum())

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
dtype: int64


In [15]:
# Preprocessing for numeric and categorical columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


In [16]:
# Combine preprocessors
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)


In [17]:
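# Apply preprocessing to X
# Note: fitting on the full dataset before splitting lets test-set statistics
# leak into the imputers and the scaler.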
X_processed = preprocessor.fit_transform(X)

# Ensure target variable is a NumPy array
y_array = np.array(y)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_processed, y_array, test_size=0.2, random_state=42)
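
Because the preprocessor above is fit before the split, the imputation means and scaling statistics are computed with the test rows included. A common way to avoid this is to split first and wrap the preprocessor and the model in a single `Pipeline`, so that everything is fit on the training rows only. The sketch below demonstrates this on synthetic data; the two toy columns, the `stratify` argument, and the small forest are assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data: one numeric and one categorical column.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=n).astype(float),
    "contact": rng.choice(["cellular", "telephone"], size=n),
})
toy.loc[::25, "age"] = np.nan  # inject some missing values
toy_y = (toy["contact"] == "cellular").astype(int)

preprocess = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["age"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["contact"]),
])
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=20, random_state=42)),
])

# Split the raw frame first; stratify keeps the class ratio equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)
clf.fit(X_tr, y_tr)  # imputers, scaler, and encoder see the training rows only
test_accuracy = clf.score(X_te, y_te)
```

A fitted pipeline like this is also convenient to deploy, since raw rows can be sent to it without a separate preprocessing step.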

In [23]:
# Apply SMOTE to the training split only, so the test set keeps its original class balance
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("Before SMOTE:", pd.Series(y_train).value_counts())
print("After SMOTE:", pd.Series(y_train_resampled).value_counts())

Before SMOTE: 0    29290
1     3660
Name: count, dtype: int64
After SMOTE: 0    29290
1    29290
Name: count, dtype: int64
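
Oversampling is one way to address the imbalance; class weighting is another that leaves the training data unchanged. As a hedged illustration, the snippet below computes scikit-learn's `'balanced'` weights for the pre-SMOTE training counts printed above (29,290 negatives and 3,660 positives). The label array is rebuilt from those counts purely for the demonstration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label vector with the training-split counts printed above (before SMOTE).
y_train_counts = np.concatenate([
    np.zeros(29290, dtype=int),
    np.ones(3660, dtype=int),
])
classes = np.array([0, 1])

# 'balanced' weights are n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train_counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)  # roughly {0: 0.56, 1: 4.50}
```

Such a dictionary can be passed as `class_weight` to estimators that support it. Applying extra class weights on top of SMOTE-balanced data up-weights the positive class twice, so it is usually worth picking one of the two approaches.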


In [25]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Train a model on resampled data
model = RandomForestClassifier(random_state=42)
model.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("AUC-ROC:", roc_auc_score(y_test, y_pred_proba))

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.96      0.94      7307
           1       0.50      0.33      0.40       931

    accuracy                           0.89      8238
   macro avg       0.71      0.64      0.67      8238
weighted avg       0.87      0.89      0.88      8238

AUC-ROC: 0.7679214360756729


In [26]:
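# Refit the Random Forest with a heavier weight on the positive class
# (this model is not evaluated in the cells below)
model = RandomForestClassifier(random_state=42, class_weight={0: 1, 1: 3})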
model.fit(X_train_resampled, y_train_resampled)

In [27]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_recall_curve, roc_auc_score

# Hyperparameter tuning for Random Forest
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'class_weight': [None, 'balanced', {0: 1, 1: 3}]
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, scoring='f1', cv=3, n_jobs=-1)
grid_search.fit(X_train_resampled, y_train_resampled)

# Best model
best_model = grid_search.best_estimator_

# Predict and evaluate
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
y_pred = (y_pred_proba > 0.4).astype(int)

print("Classification Report:")
print(classification_report(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_pred_proba))

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93      7307
           1       0.43      0.42      0.43       931

    accuracy                           0.87      8238
   macro avg       0.68      0.68      0.68      8238
weighted avg       0.87      0.87      0.87      8238

AUC-ROC: 0.772380324209809


In [29]:
from xgboost import XGBClassifier

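# Train XGBoost with class weights.
# Note: scale_pos_weight is computed from the original (imbalanced) y_train, but the
# model is fit on the SMOTE-balanced data, so the positive class is up-weighted twice.
# use_label_encoder is ignored by recent XGBoost versions (hence the warning in the output).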
xgb_model = XGBClassifier(scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),
                           random_state=42, use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_train_resampled, y_train_resampled)

# Predictions
y_pred_proba_xgb = xgb_model.predict_proba(X_test)[:, 1]
y_pred_xgb = (y_pred_proba_xgb > 0.5).astype(int)

# Evaluation
print("Classification Report:")
print(classification_report(y_test, y_pred_xgb))
print("AUC-ROC:", roc_auc_score(y_test, y_pred_proba_xgb))

Parameters: { "use_label_encoder" } are not used.



Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.84      0.89      7307
           1       0.33      0.62      0.43       931

    accuracy                           0.82      8238
   macro avg       0.64      0.73      0.66      8238
weighted avg       0.88      0.82      0.84      8238

AUC-ROC: 0.7780237951425122


In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Train Logistic Regression
log_model = LogisticRegression(class_weight='balanced', random_state=42, max_iter=500)
log_model.fit(X_train_resampled, y_train_resampled)

# Predictions
y_pred_proba_log = log_model.predict_proba(X_test)[:, 1]
y_pred_log = (y_pred_proba_log > 0.5).astype(int)

# Evaluation
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_log))
print("AUC-ROC:", roc_auc_score(y_test, y_pred_proba_log))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.83      0.88      7307
           1       0.32      0.62      0.42       931

    accuracy                           0.81      8238
   macro avg       0.63      0.73      0.65      8238
weighted avg       0.87      0.81      0.83      8238

AUC-ROC: 0.7740018583477992


In [31]:
from sklearn.tree import DecisionTreeClassifier

# Train Decision Tree
tree_model = DecisionTreeClassifier(class_weight='balanced', random_state=42, max_depth=10)
tree_model.fit(X_train_resampled, y_train_resampled)

# Predictions
y_pred_proba_tree = tree_model.predict_proba(X_test)[:, 1]
y_pred_tree = (y_pred_proba_tree > 0.5).astype(int)

# Evaluation
print("Decision Tree Classification Report:")
print(classification_report(y_test, y_pred_tree))
print("AUC-ROC:", roc_auc_score(y_test, y_pred_proba_tree))

Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93      7307
           1       0.44      0.43      0.43       931

    accuracy                           0.87      8238
   macro avg       0.68      0.68      0.68      8238
weighted avg       0.87      0.87      0.87      8238

AUC-ROC: 0.7117275828528093


In [32]:
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest
rf_model = RandomForestClassifier(class_weight={0: 1, 1: 3}, random_state=42, n_estimators=200, max_depth=15)
rf_model.fit(X_train_resampled, y_train_resampled)

# Predictions
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]
y_pred_rf = (y_pred_proba_rf > 0.5).astype(int)

# Evaluation
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("AUC-ROC:", roc_auc_score(y_test, y_pred_proba_rf))

Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.82      0.87      7307
           1       0.30      0.61      0.40       931

    accuracy                           0.79      8238
   macro avg       0.62      0.71      0.64      8238
weighted avg       0.87      0.79      0.82      8238

AUC-ROC: 0.7654382735857806


In [33]:
optimal_threshold = 0.4  # Example value, tune as needed
y_pred_xgb_optimal = (y_pred_proba_xgb > optimal_threshold).astype(int)

print("Adjusted Threshold Classification Report:")
print(classification_report(y_test, y_pred_xgb_optimal))
print("AUC-ROC:", roc_auc_score(y_test, y_pred_proba_xgb))

Adjusted Threshold Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.72      0.82      7307
           1       0.24      0.70      0.36       931

    accuracy                           0.72      8238
   macro avg       0.60      0.71      0.59      8238
weighted avg       0.87      0.72      0.77      8238

AUC-ROC: 0.7780237951425122
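
The threshold of 0.4 above is a hand-picked example. One way to choose it more systematically is to scan the thresholds returned by `precision_recall_curve` and keep the one with the highest F1. The sketch below does this on a small made-up validation set; `best_f1_threshold` is a hypothetical helper, and in practice the scan should run on a held-out validation split rather than on the test set used for the final report.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_f1_threshold(y_true, y_score):
    """Return the (threshold, F1) pair that maximizes F1 for the positive class."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall have one more entry than thresholds; drop the final point.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


# Made-up validation labels and predicted probabilities.
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])
threshold, f1 = best_f1_threshold(y_val, scores)

# precision_recall_curve treats score >= threshold as positive,
# whereas the cells above use a strict > comparison.
y_pred = (scores >= threshold).astype(int)
```

If recall matters more than precision, the same scan can maximize a different criterion, for example the highest precision subject to a minimum recall.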


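**1. Are all the columns necessary, or can we drop any?**

- `pdays` is the main candidate to drop: most of its values are 999 (no prior contact), so it carries little information. Note, however, that the code above still lists `pdays` in `numeric_cols`, so it was part of the features the models were trained on.
- The other columns were retained because they add meaningful information to the prediction.
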
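
An alternative to dropping `pdays` outright is to separate its two meanings: whether the client was contacted before, and how many days ago. The sketch below shows one possible encoding on toy data. The helper `encode_pdays` and the new column name `previously_contacted` are hypothetical, and the downstream imputation would need to be adjusted if this were adopted.

```python
import numpy as np
import pandas as pd


def encode_pdays(frame: pd.DataFrame, sentinel: float = 999) -> pd.DataFrame:
    """Add a prior-contact flag and replace the 999 sentinel with NaN."""
    out = frame.copy()
    # Rows with a missing pdays are flagged 0 here, although their status is really unknown.
    out["previously_contacted"] = (
        out["pdays"].notna() & (out["pdays"] != sentinel)
    ).astype(int)
    # Keep real day counts; the sentinel becomes NaN so it no longer skews mean or scale.
    out["pdays"] = out["pdays"].where(out["pdays"] != sentinel)
    return out


toy = pd.DataFrame({"pdays": [999.0, 3.0, np.nan, 6.0]})
encoded = encode_pdays(toy)
```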
**2. Does the data contain any issues?**

- **Missing values:** Present in every column, including the target (roughly 1% of rows per column). Handled with mean imputation for numeric features and an `'unknown'` category for categorical features.
- **Class imbalance:** The target `y` is heavily imbalanced: about 89% `no` and 11% `yes` among labeled rows (36,199 vs. 4,591). Addressed by using SMOTE to oversample the minority class in the training split.

**3. What ML task is this?**

This is a binary classification task: predicting whether a client will subscribe to a term deposit (`y`).

**4. What are the data types of the columns? What pre-processing should you apply?**

Data types:

- **Numeric:** columns such as `age`, `campaign`, `previous`, and `emp.var.rate`.
- **Categorical:** columns such as `job`, `marital`, `education`, and `contact`.

Preprocessing applied:

- **Numeric columns:** missing values imputed with the mean, then scaled with `StandardScaler`.
- **Categorical columns:** missing values imputed with `'unknown'`, then encoded with `OneHotEncoder`.

**5. What is the most appropriate metric for this model?**

- **AUC-ROC:** the primary metric for assessing the model's ability to discriminate between the classes.
- **F1-score:** balances precision and recall, especially for the minority class (1).
- **Precision and recall:** key metrics for evaluating the trade-off between false positives and false negatives.

## Project Outcome

Models trained:

- Logistic Regression
- Decision Tree
- Random Forest
- XGBoost

Performance summary (test set, 8,238 rows):

| **Model**               | **Accuracy** | **F1 (Class 1)** | **Precision (Class 1)** | **Recall (Class 1)** | **AUC-ROC** |
|-------------------------|--------------|------------------|-------------------------|----------------------|-------------|
| Logistic Regression     | 81%          | 0.42             | 0.32                    | 0.62                 | 0.774       |
| Random Forest           | 79%          | 0.40             | 0.30                    | 0.61                 | 0.765       |
| XGBoost (Threshold 0.5) | 82%          | 0.43             | 0.33                    | 0.62                 | 0.778       |
| XGBoost (Threshold 0.4) | 72%          | 0.36             | 0.24                    | 0.70                 | 0.778       |
| Decision Tree           | 87%          | 0.43             | 0.44                    | 0.43                 | 0.711       |

Key observations:

- XGBoost performed best in terms of AUC-ROC (0.778).
- Lowering the XGBoost decision threshold from 0.5 to 0.4 raised recall for the minority class from 0.62 to 0.70, at the cost of precision (0.33 to 0.24) and overall accuracy (82% to 72%). AUC-ROC is unchanged because it does not depend on the threshold.

Final recommendation:

- If recall for the minority class (1) is critical, XGBoost with a threshold of 0.4 is the best choice.
- If overall accuracy and a balance between precision and recall are more important, Logistic Regression is a good, simpler option, though XGBoost at the default 0.5 threshold matched or slightly exceeded it on every metric in the table above.
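
To keep a comparison like the one above reproducible, the per-model metrics can be collected programmatically instead of being copied from printed reports. The sketch below is one way to do that; the helper name `summarize` and the toy labels and scores are illustrative assumptions rather than the notebook's actual predictions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def summarize(name, y_true, y_score, threshold=0.5):
    """Compute the summary-table metrics for one model at one threshold."""
    y_pred = (np.asarray(y_score) > threshold).astype(int)
    return {
        "model": name,
        "threshold": threshold,
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_class_1": f1_score(y_true, y_pred, zero_division=0),
        "precision_class_1": precision_score(y_true, y_pred, zero_division=0),
        "recall_class_1": recall_score(y_true, y_pred, zero_division=0),
        # AUC-ROC uses the raw scores, so it is the same for every threshold.
        "auc_roc": roc_auc_score(y_true, y_score),
    }


# Toy labels and scores standing in for y_test and a model's predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.45, 0.6, 0.7, 0.9])
table = pd.DataFrame([summarize("toy model", y_true, scores, t) for t in (0.5, 0.4)])
print(table.round(3))
```

The two rows differ in accuracy, precision, and recall but share the same AUC-ROC, which mirrors the two XGBoost rows in the table above.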

