**Background**

We are a small startup focusing mainly on providing machine learning solutions in the European banking market. We work on a variety of problems including fraud detection, sentiment classification and customer intention prediction and classification.

We are interested in developing a robust machine learning system that leverages information coming from call center data.

Ultimately, we are looking for ways to improve the success rate of calls made to customers for any product that our clients offer. Toward this goal, we are designing an ever-evolving machine learning product that achieves a high success rate while remaining interpretable, so that our clients can make informed decisions.

**Data Description:**

The data comes from direct marketing efforts of a European banking institution. The marketing campaign involves making a phone call to a customer, often multiple times to ensure a product subscription, in this case a term deposit. Term deposits are usually short-term deposits with maturities ranging from one month to a few years. The customer must understand when buying a term deposit that they can withdraw their funds only after the term ends. All customer information that might reveal personal information is removed due to privacy concerns.

**Attributes:**

    age : age of customer (numeric)

    job : type of job (categorical)

    marital : marital status (categorical)

    education : education level (categorical)

    default: has credit in default? (binary)

    balance: average yearly balance, in euros (numeric)

    housing: has a housing loan? (binary)

    loan: has personal loan? (binary)

    contact: contact communication type (categorical)

    day: last contact day of the month (numeric)

    month: last contact month of year (categorical)

    duration: last contact duration, in seconds (numeric)

    campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

**Output (desired target):**

    y - has the client subscribed to a term deposit? (binary)

**Goal(s):**

    Predict if the customer will subscribe (yes/no) to a term deposit (variable y)

**Success Metric(s):**

    Hit 81% accuracy or above
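Because the success metric is plain accuracy, it helps to know what a model that always predicts the most frequent class would score, since that is the floor any real model has to beat. The sketch below is illustrative only: the helper names `majority_baseline_accuracy` and `meets_target` and the toy labels are assumptions, not part of the project data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

TARGET_ACCURACY = 0.81  # success metric stated above


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class in y_true."""
    y_true = np.asarray(y_true).astype(int)
    majority_class = np.bincount(y_true).argmax()
    baseline_pred = np.full_like(y_true, majority_class)
    return accuracy_score(y_true, baseline_pred)


def meets_target(y_true, y_pred, target=TARGET_ACCURACY):
    """True if the accuracy of y_pred reaches the target."""
    return accuracy_score(y_true, y_pred) >= target


# Toy labels (hypothetical): 8 "no" and 2 "yes"
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
baseline = majority_baseline_accuracy(y_toy)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

If the baseline on the real labels is already close to the target, accuracy alone says little about how well subscribers are identified, and the confusion matrix becomes the more informative check.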

In [4]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

Classification function

In [1]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV, RandomizedSearchCV

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy import stats

label_encode function

In [2]:
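def label_encode(df, feature):
    """Replace a categorical column with integer codes, in place."""
    from sklearn.preprocessing import LabelEncoder
    lencoder = LabelEncoder()
    # fit_transform already returns a 1-D array, so no reshape is needed
    df[feature] = lencoder.fit_transform(df[feature])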

### Import data

In [5]:
X_train_dep = pd.read_csv(r'C:\Users\dgarb\OneDrive\Documents\APZIVA\Project 2\data\X_train_dep.csv',index_col=0)

X_test_dep = pd.read_csv(r'C:\Users\dgarb\OneDrive\Documents\APZIVA\Project 2\data\X_test_dep.csv',index_col=0)

y_train_dep_df = pd.read_csv(r'C:\Users\dgarb\OneDrive\Documents\APZIVA\Project 2\data\y_train_dep_df.csv',index_col=0)

y_test_dep_df  = pd.read_csv(r'C:\Users\dgarb\OneDrive\Documents\APZIVA\Project 2\data\y_test_dep_df.csv',index_col=0)

In [6]:
y_test_dep_df.head()

Unnamed: 0,y_label
13130,False
11127,True
17899,True
39652,False
25890,False


In [7]:
char_cols_lst = list(X_train_dep.select_dtypes(exclude=['int64','float64','bool']).columns)
char_cols_lst

['job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month']

In [8]:
X_train_dep

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct
39336,30,admin,single,secondary,no,2,yes,no,telephone,18,...,0,0,0,0,0,0,0,1,0,0
25184,36,services,single,secondary,no,1482,yes,no,cellular,18,...,0,0,0,0,0,0,0,0,1,0
34484,35,blue-collar,married,primary,no,147,yes,no,cellular,5,...,0,0,0,0,0,0,0,1,0,0
25237,33,management,married,secondary,no,480,yes,no,cellular,18,...,0,0,0,0,0,0,0,0,1,0
16170,46,blue-collar,married,secondary,no,209,yes,no,cellular,22,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2072,31,blue-collar,married,primary,no,0,yes,yes,unknown,12,...,0,0,0,0,0,0,0,1,0,0
27318,39,management,married,tertiary,no,3216,no,no,cellular,21,...,0,0,0,0,0,0,0,0,1,0
32879,33,blue-collar,single,primary,no,557,yes,no,cellular,17,...,0,0,0,0,0,0,0,0,0,0
2577,31,admin,single,secondary,no,513,yes,no,unknown,13,...,0,0,0,0,0,0,0,1,0,0


In [12]:
X_train_dep.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct
39336,30,admin,single,secondary,no,2,yes,no,telephone,18,...,0,0,0,0,0,0,0,1,0,0
25184,36,services,single,secondary,no,1482,yes,no,cellular,18,...,0,0,0,0,0,0,0,0,1,0
34484,35,blue-collar,married,primary,no,147,yes,no,cellular,5,...,0,0,0,0,0,0,0,1,0,0
25237,33,management,married,secondary,no,480,yes,no,cellular,18,...,0,0,0,0,0,0,0,0,1,0
16170,46,blue-collar,married,secondary,no,209,yes,no,cellular,22,...,0,0,0,0,1,0,0,0,0,0



### Label Encode

In [14]:

char_cols_lst

In [15]:
for categ_feat in char_cols_lst:
    label_encode(X_train_dep,categ_feat)

In [16]:
X_train_dep[char_cols_lst].head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct
39336,11,0,2,1,0,806,1,0,1,17,...,0,0,0,0,0,0,0,1,0,0
25184,17,7,2,1,0,2279,1,0,0,17,...,0,0,0,0,0,0,0,0,1,0
34484,16,1,1,0,0,951,1,0,0,4,...,0,0,0,0,0,0,0,1,0,0
25237,14,4,1,1,0,1284,1,0,0,17,...,0,0,0,0,0,0,0,0,1,0
16170,27,1,1,1,0,1013,1,0,0,21,...,0,0,0,0,1,0,0,0,0,0


In [None]:
for categ_feat in char_cols_lst:
    label_encode(X_test_dep,categ_feat)

In [None]:
X_test_dep[char_cols_lst].head()
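Calling `label_encode` separately on the training and test sets fits two independent encoders, so the same category can receive different integer codes in each set whenever the two sets do not contain exactly the same categories. A minimal sketch of one alternative, assuming scikit-learn's `OrdinalEncoder` is acceptable here; the toy frames and the `-1` code for unseen categories are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical), standing in for X_train_dep / X_test_dep
train = pd.DataFrame({"job": ["admin", "services", "management"],
                      "marital": ["single", "married", "single"]})
test = pd.DataFrame({"job": ["services", "student"],
                     "marital": ["married", "single"]})
cat_cols = ["job", "marital"]

# Fit on the training set only; categories unseen in training map to -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train[cat_cols] = encoder.fit_transform(train[cat_cols])
test[cat_cols] = encoder.transform(test[cat_cols])

print(test)
```

The fitted `encoder` keeps the category-to-code mapping in `encoder.categories_`, which also makes the encoded columns easier to interpret later.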

In [None]:
X_train_dep.columns

### Drop dummies

In [None]:
dummies_lst = ['job_blue-collar',
 'job_entrepreneur',
 'job_housemaid',
 'job_management',
 'job_retired',
 'job_self-employed',
 'job_services',
 'job_student',
 'job_technician',
 'job_unemployed',
 'job_unknown',
 'marital_married',
 'marital_single',
 'education_secondary',
 'education_tertiary',
 'education_unknown',
 'default_yes',
 'housing_yes',
 'loan_yes',
 'contact_telephone',
 'contact_unknown',
 'month_aug',
 'month_dec',
 'month_feb',
 'month_jan',
 'month_jul',
 'month_jun',
 'month_mar',
 'month_may',
 'month_nov',
 'month_oct']

In [None]:
# X_train_dep['Unnamed: 0'] 

In [None]:
X_test_dep_dd = X_test_dep.drop(columns = dummies_lst)
X_test_dep_dd.shape

In [None]:
X_test_dep_dd.info()

In [None]:
X_test_dep_dd.head()


In [None]:
# stop here: the cells below are work in progress

In [None]:
# y_test_dep_df = y_test_dep_df.drop(columns= ['Unnamed: 0'])

In [None]:
features= ['age', 'job', 'marital', 'education', 'default',
       'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration',
       'campaign']

In [None]:
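# Drop the dummy columns from the training set too, so both sets share the same columns
X_train_dep_dd = X_train_dep.drop(columns=dummies_lst)
features = list(X_train_dep_dd.columns)

# Fit the scaler on the training set only, then reuse it on the test set
ss = StandardScaler()
x_train = pd.DataFrame(ss.fit_transform(X_train_dep_dd), columns=features)
x_test = pd.DataFrame(ss.transform(X_test_dep_dd[features]), columns=features)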
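Putting the scaler inside a `Pipeline` means it is refit on the training folds only during cross-validation, so no statistics from held-out rows leak into the scaling step. The sketch below runs on synthetic data from `make_classification` (an assumption standing in for the encoded features); the fold count and model settings are illustrative, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix and binary target
X_demo, y_demo = make_classification(n_samples=500, n_features=13, random_state=0)

# The scaler and the classifier are fitted together on each training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"mean CV accuracy: {scores.mean():.3f}")
```

A simple cross-validated baseline like this gives a reference point for judging whether the neural network is worth its extra complexity.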




In [None]:
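from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

# Assumption: X and Y are the scaled training features and the binary target
# (the target column is assumed to be named 'y_label', as in y_test_dep_df)
X = x_train.values
Y = y_train_dep_df['y_label'].astype(int).values

# build a model
model = Sequential()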
model.add(Dense(16, input_shape=(X.shape[1],), activation='relu')) # Add an input shape! (features,)
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary() 

# compile the model
model.compile(optimizer='Adam', 
              loss='binary_crossentropy',
              metrics=['accuracy'])

# early stopping callback
# This callback will stop the training when there is no improvement in
# the validation accuracy for 10 consecutive epochs.
es = EarlyStopping(monitor='val_accuracy', 
                                   mode='max', # don't minimize the accuracy!
                                   patience=10,
                                   restore_best_weights=True)

# now we just update our model fit call
history = model.fit(X,
                    Y,
                    callbacks=[es],
                    epochs=80, # you can set this to a big number!
                    batch_size=10,
                    validation_split=0.2,
                    shuffle=True,
                    verbose=1)
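The `patience` setting can be read as: stop once the monitored value has failed to improve for that many epochs in a row. The function below is a simplified pure-Python sketch of that rule, not Keras's implementation; `early_stopping_epoch` and the sample scores are hypothetical.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a metric to maximize.

    Training stops at the first epoch where the score has not improved
    for `patience` consecutive epochs, or at the last epoch otherwise.
    """
    best_score = float("-inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_scores), best_epoch


# Hypothetical validation accuracies by epoch
val_accuracy = [0.70, 0.75, 0.74, 0.73, 0.72]
stop_epoch, best_epoch = early_stopping_epoch(val_accuracy, patience=2)
print(stop_epoch, best_epoch)
```

With these sample scores the best value occurs at epoch 2, and training stops two epochs later because neither following epoch improved on it.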

### Evaluate the Model

    Learning curves (Loss)
    Learning curves (Accuracy)
    Confusion matrix

In [None]:
history_dict = history.history
# Learning curve(Loss)
# let's see the training and validation loss by epoch

# loss
loss_values = history_dict['loss'] # you can change this
val_loss_values = history_dict['val_loss'] # you can also change this

# range of X (no. of epochs)
epochs = range(1, len(loss_values) + 1) 

# plot
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'orange', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
# Learning curve(accuracy)
# let's see the training and validation accuracy by epoch

# accuracy
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

# range of X (no. of epochs)
epochs = range(1, len(acc) + 1)

# plot
# "bo" is for "blue dot"
plt.plot(epochs, acc, 'bo', label='Training accuracy')
# 'orange' draws a solid orange line
plt.plot(epochs, val_acc, 'orange', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

# this is the max value - the HIGHEST validation accuracy
# reached across all epochs
np.max(val_acc)

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# model.predict returns numbers between 0 and 1:
# the predicted probability that the customer subscribes
probs = model.predict(X)

# so we need to round to a whole number (0 or 1),
# or the confusion matrix won't work!
preds = np.round(probs, 0)

# confusion matrix
print(confusion_matrix(Y, preds)) # order matters! (actual, predicted)

## example of the output layout:
## array([[490,  59],   ([[TN, FP],
##       [105, 235]])     [FN, TP]])

print(classification_report(Y, preds))
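The four cells of a binary confusion matrix can be unpacked by name and used to compute accuracy, precision and recall directly. A small sketch on hypothetical labels and probabilities (not results from this dataset); it uses an explicit 0.5 threshold instead of `np.round`, which rounds exactly 0.5 down to 0.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
probs = np.array([0.1, 0.6, 0.2, 0.8, 0.4, 0.9, 0.3, 0.7])

# Threshold at 0.5 to get hard 0/1 predictions
y_pred = (probs >= 0.5).astype(int)

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted subscribers that were correct
recall = tp / (tp + fn)     # share of actual subscribers that were found
print(tn, fp, fn, tp, accuracy, precision, recall)
```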

https://medium.com/luca-chuangs-bapm-notes/build-a-neural-network-in-python-binary-classification-49596d7dcabf

https://www.kaggle.com/code/mirichoi0218/ann-making-model-for-binary-classification

https://www.deeplearning.ai/courses/

https://www.youtube.com/watch?v=QDX-1M5Nj7s&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=1

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Generate example data (replace this with your dataset)
X_train = ...
y_train = ...
X_test = ...

# Build the MLP model
model = Sequential()
model.add(Dense(units=64, activation='relu', input_dim=X_train.shape[1]))
model.add(Dense(units=32, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))

# Compile the model
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Predict probabilities on test data
predicted_probabilities = model.predict(X_test)
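The predicted probabilities still have to be compared with the true test labels. A minimal sketch, assuming the model output has shape `(n_samples, 1)` as a single sigmoid unit produces; the arrays below are hypothetical stand-ins for `predicted_probabilities` and the test labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical stand-ins for the model output and the true test labels
predicted_probabilities = np.array([[0.05], [0.40], [0.35], [0.80], [0.65]])
y_test = np.array([0, 0, 1, 1, 1])

# Flatten the (n_samples, 1) output to 1-D before scoring
probs = predicted_probabilities.ravel()

# ROC AUC uses the probabilities directly; accuracy needs hard labels
auc = roc_auc_score(y_test, probs)
labels = (probs >= 0.5).astype(int)
acc = accuracy_score(y_test, labels)
print(f"ROC AUC: {auc:.3f}, accuracy: {acc:.2f}")
```

Accuracy is the stated success metric, while ROC AUC does not depend on the 0.5 threshold and can be a useful second check.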