Input variables:
## bank client data:  
- age (numeric)    
- job : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
- marital : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
- education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
- default: has credit in default? (categorical: "no","yes","unknown")
- housing: has housing loan? (categorical: "no","yes","unknown")
- loan: has personal loan? (categorical: "no","yes","unknown")

## related to the last contact of the current campaign:
- contact: contact communication type (categorical: "cellular","telephone") 
- month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
- day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
- duration: last contact duration, in seconds (numeric). *Important note: this attribute strongly affects the output target (e.g., if duration=0 then y="no"). However, the duration is not known before a call is made, and once the call ends y is known. This input should therefore be included only for benchmark purposes and discarded if the intention is to build a realistic predictive model.*
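
The note on `duration` can be illustrated with a minimal sketch. The toy DataFrame below is invented for illustration (its values are not taken from the dataset), and the names `X_benchmark` and `X_realistic` are hypothetical; the sketch only shows the column being kept for a benchmark and dropped for a realistic model.

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [0, 210, 640],  # known only after the call has ended
    "campaign": [1, 2, 1],
    "y": ["no", "no", "yes"],
})

# Benchmark setting: every input, including duration.
X_benchmark = toy.drop(columns=["y"])

# Realistic setting: drop duration, since it is unavailable before the call.
X_realistic = toy.drop(columns=["y", "duration"])

print(list(X_benchmark.columns))
print(list(X_realistic.columns))
```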

## other attributes:
- campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays: number of days since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client (numeric)
- poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
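
Because `pdays` uses 999 as a sentinel rather than a real number of days, it usually needs special handling before modelling. The sketch below shows one possible approach on invented toy values: flag the sentinel in a separate indicator column, then replace it with the mean of the real values. The column name `pdays999` and the choice of the mean are assumptions made for illustration.

```python
import pandas as pd

# Toy values, invented for illustration; 999 means "not previously contacted".
toy = pd.DataFrame({"pdays": [3, 999, 6, 999]})

# Flag rows that carry the sentinel before it is overwritten.
toy["pdays999"] = (toy["pdays"] == 999).astype(int)

# Mean of the real (non-sentinel) values only.
mean_pdays = toy.loc[toy["pdays"] != 999, "pdays"].mean()

# Cast to float first so the replacement value fits the column dtype,
# then swap the sentinel for the mean.
toy["pdays"] = toy["pdays"].astype(float).replace(999, mean_pdays)

print(toy)
```

Keeping the indicator column means the "never contacted" information is not lost when the sentinel is replaced.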

## social and economic context attributes:
- emp.var.rate: employment variation rate - quarterly indicator (numeric)
- cons.price.idx: consumer price index - monthly indicator (numeric)     
- cons.conf.idx: consumer confidence index - monthly indicator (numeric)     
- euribor3m: euribor 3 month rate - daily indicator (numeric)
- nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
- y: has the client subscribed to a term deposit? (binary: "yes","no")
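
As a minimal sketch of how the binary target can be turned into a single 0/1 column, the example below one-hot encodes a toy `y` column with `pd.get_dummies`. The toy values are invented for illustration, and the final cast to `int` is an assumption made so the result does not depend on the dummy dtype of the installed pandas version.

```python
import pandas as pd

# Toy target values, invented for illustration only.
toy = pd.DataFrame({"y": ["no", "yes", "no", "no"]})

# drop_first=True drops the first level ("no"), leaving one "y_yes" column.
encoded = pd.get_dummies(toy, columns=["y"], drop_first=True)

# Cast to int so the column holds 0/1 whatever dtype get_dummies returned.
target = encoded["y_yes"].astype(int)

print(list(encoded.columns))
print(target.tolist())
```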


In [1]:
import pandas as pd
from matplotlib import pyplot as plt

In [2]:
df = pd.read_csv('bank-additional/bank-additional/bank-additional-full.csv', sep=';')

In [44]:
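categorical_vars = ['y','job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']
# One-hot encode the categorical columns. drop_first expects a bool: True drops the
# first level of each column, so the binary target 'y' becomes a single 'y_yes' column.
df2 = pd.get_dummies(data=df, columns=categorical_vars, drop_first=True)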

In [45]:
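# pdays uses 999 as a sentinel for "not previously contacted".
# Record that in an indicator column, then replace the sentinel with the
# mean of the real values so that 999 is not treated as a number of days.
mean_pdays = df2.loc[df2['pdays'] != 999, 'pdays'].mean()
df2['pdays999'] = (df2['pdays'] == 999).astype(int)

df2['pdays'] = df2['pdays'].replace(999, mean_pdays)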

In [46]:
df2.pdays.plot.hist()
plt.show()

In [47]:
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split

# Exclude 'duration' (not known before the call is made) and the target from the features.
X = df2.drop(['duration', 'y_yes'], axis=1)
y = df2.y_yes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=0)

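# MaxAbsScaler divides each feature by its maximum absolute value.
# Fit on the training split only, then apply the same scaling to the test split.
scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)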

In [48]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators=10)

scores = cross_val_score(clf, X_train_scaled, y_train)

scores.mean()

0.88972631931664115

In [50]:
from sklearn.metrics import roc_curve, auc

clf.fit(X_train_scaled, y_train)
y_score = clf.predict_proba(X_test_scaled)[:,1]
fpr, tpr, thresh = roc_curve(y_score=y_score, y_true=y_test.astype(int))

roc_auc = auc(fpr, tpr)

plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()