Each section contains separate code lines for the training set and the testing set. To enable the testing-set lines, remove the '#' at the beginning of each one. Lines starting with '##' are instructions.

### Data Loading and Processing

In [3]:
import pandas as pd
import numpy as np

In [4]:
## Loading the training dataset
df_train=pd.read_csv('lixan23.csv')

## Loading the testing dataset - PLEASE INPUT THE TEST SET NAME BELOW
#df_test=pd.read_csv('INPUT.csv')

In [5]:
## Create transformation function
def data_transform(df):
    ## Separate categorical and numerical data
    df_cate = df.select_dtypes(include=['object']).copy()
    df_numeric=df.select_dtypes(include=['int64']).copy()
    ## Transform binary data (work on a copy so the assignments below
    ## do not write to a slice of df_cate)
    df_binary=df_cate[['default','housing','loan','y']].copy()
    for i in df_binary.columns:
        df_binary.loc[df_cate[i]=='no',i]=0
        df_binary.loc[df_cate[i]=='yes',i]=1
    ## Transform multinomial data
    from sklearn.preprocessing import LabelBinarizer
    lb = LabelBinarizer()
    def binary_convert(a):
        lb_results = lb.fit_transform(df_cate[a])
        df_onehot = pd.DataFrame(lb_results, columns=lb.classes_)
        for col in list(df_onehot.columns):
            df_onehot.rename(columns={col:(a+'_'+col)},inplace = True)
        return df_onehot
    df_job_onehot=binary_convert('job')
    df_mar_onehot=binary_convert('marital')
    df_edu_onehot=binary_convert('education')
    df_contact_onehot=binary_convert('contact')
    df_pout_onehot=binary_convert('poutcome')
    ## Merging
    df_convert = pd.concat([df_numeric,df_binary, df_job_onehot,df_mar_onehot, 
                        df_edu_onehot, df_contact_onehot, df_pout_onehot], 
                       axis=1, sort=False)
    df_cv=df_convert.astype('int64')
    return df_cv

In [6]:
## Apply function to train & test set
df_train_cv = data_transform(df_train)
#df_test_cv = data_transform(df_test)



### Model fitting and prediction

In [7]:
## Library import
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
cnb=CategoricalNB()
gnb=GaussianNB()

Project stage 1: OPTIMISING PRECISION

In [8]:
## Splitting input and output features
## (y is selected as a 1-D Series, the shape scikit-learn expects for the target)
X_train_1=df_train_cv[['poutcome_success','housing']]
Y_train_1=df_train_cv['y']
#X_test_1=df_test_cv[['poutcome_success','housing']]
#Y_test_1=df_test_cv['y']

In [9]:
## Fitting model to training set
cnb.fit(X_train_1,Y_train_1)


CategoricalNB(alpha=1.0, class_prior=None, fit_prior=True)

In [None]:
## Predict the test set and check model performance
#Y_predict_1=cnb.predict(X_test_1)
#print (confusion_matrix(Y_test_1,Y_predict_1))
#print (precision_score(Y_test_1,Y_predict_1))
#print (recall_score(Y_test_1,Y_predict_1))
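
When reading the printed output, it may help to see how the two scores follow from the confusion matrix. The sketch below assumes the layout that scikit-learn's `confusion_matrix` uses for binary 0/1 labels (rows are true classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`). The helper name `precision_recall_from_cm` and the counts are made up for illustration and do not come from this dataset.

```python
def precision_recall_from_cm(cm):
    """Return (precision, recall) from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    predicted_pos = tp + fp   # everyone the model says to call
    actual_pos = tp + fn      # everyone who would really subscribe
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical counts, for illustration only
example_cm = [[850, 50],
              [70, 30]]
precision, recall = precision_recall_from_cm(example_cm)
print(precision, recall)
```

Precision answers "of the customers we called, how many subscribed", while recall answers "of the customers who would subscribe, how many did we call".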

Project stage 2: PROFIT OPTIMISATION
If the customer database is limited and the opportunity cost of missing a subscriber is much larger than the cost of a wasted cold call (the false-positive cost), the analysis below helps decide how many input features to use.


In [10]:
## Calculate the trade-off cost with the increase of variables
## 1. Decision Tree factor importance ranking (for precision)
from sklearn import tree
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
folds = KFold(n_splits=5, shuffle=True, random_state=42)
## Only the non-default settings are passed; min_impurity_split and presort
## are left out because recent scikit-learn releases no longer accept them
ct=tree.DecisionTreeClassifier(criterion='entropy', max_depth=4,
                       min_samples_leaf=26, random_state=42)
X_train_no_du = df_train_cv.drop(['y', 'duration'], axis=1)
Y_train = df_train_cv['y']
ct.fit(X_train_no_du,Y_train)
feat_importance = ct.tree_.compute_feature_importances(normalize=False)
feat_imp_dict = dict(zip(X_train_no_du.columns, ct.feature_importances_))
feat_imp = pd.DataFrame.from_dict(feat_imp_dict, orient='index')
feat_imp.rename(columns = {0:'FeatureImportance'}, inplace = True)
feat_imp_top=feat_imp.sort_values(by=['FeatureImportance'], ascending=False)

## 2. Trade-off table:
## Potential reach: number of subscribers correctly identified per 1000 actual potential subscribers (1000 * recall)
## Total call: number of calls needed to achieve that reach (potential reach / precision)
gnb=GaussianNB()
name=list(range(1,36))
nb_pre=[]
nb_re=[]
reach_1000=[]
call_1000=[]

## Add features one at a time in order of importance and re-score the model.
## CategoricalNB is used when the feature being added is one of the three
## listed in cnb_features; GaussianNB is used for every other feature.
cnb_features = ('poutcome_success', 'contact_unknown', 'housing')
selected = []
for i in feat_imp_top.index:
    selected.append(i)
    ## Re-select from the source frame each time, so nothing is written to a slice
    X_temp = X_train_no_du[selected]
    model = cnb if i in cnb_features else gnb
    precision = np.mean(cross_val_score(model, X_temp, Y_train, scoring='precision', cv=folds))
    recall = np.mean(cross_val_score(model, X_temp, Y_train, scoring='recall', cv=folds))
    nb_pre.append(precision)
    nb_re.append(recall)
    reach_1000.append(1000*recall)
    call_1000.append(1000*recall/precision)
table = {'added_features': feat_imp_top.index,'precision': nb_pre, 'recall': nb_re,'potential_reach':reach_1000,'total_call':call_1000}
trade_off_table = pd.DataFrame(data=table)
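
The two derived columns follow from the definitions of recall and precision. Among 1000 customers who would subscribe, the model finds 1000 × recall of them; because precision is true positives divided by predicted positives, the number of calls needed is that reach divided by precision. A minimal sketch, with a hypothetical helper `reach_and_calls` and round illustrative scores rather than values from this dataset:

```python
def reach_and_calls(precision, recall, actual_positives=1000):
    """Return (potential_reach, total_call) for a given pool of actual positives."""
    potential_reach = actual_positives * recall   # true positives found
    total_call = potential_reach / precision      # predicted positives, i.e. calls made
    return potential_reach, total_call

# Illustrative scores only
reach, calls = reach_and_calls(precision=0.6, recall=0.18)
print(reach, calls)
```

The division fails when precision is 0, which happens if a model predicts no true positives in any fold, so such rows would need separate handling.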


In [11]:
## To select the best set of inputs, maximise:
## (potential_reach (TP among 1000 actual positives) * benefit per subscriber) - (total_call (predicted positives) * call cost)
trade_off_table

       added_features  precision    recall  potential_reach  total_call
0    poutcome_success   0.596248  0.182733       182.732928  306.471306
1     contact_unknown   0.599877  0.182733       182.732928   304.61727
2             housing   0.629877  0.163602       163.602493  259.737162
3               pdays   0.538877  0.196356       196.355604  364.378927
4     job_blue-collar   0.527834  0.198483       198.483263  376.033179
5             balance   0.436829  0.217436       217.435659  497.759426
6            campaign   0.380631  0.249349       249.349014  655.093113
7  education_tertiary   0.383002  0.250587       250.586502  654.269661
8     marital_married    0.37659  0.252326       252.325632  670.027768
9      marital_single   0.372923  0.254134       254.133661  681.463715
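
The selection rule from the comment in the cell above can be sketched as follows. The reach and call figures are copied from the first five rows of the table; the benefit per subscriber and the call cost are hypothetical placeholders, and `expected_profit` is an illustrative helper, not part of the original notebook.

```python
def expected_profit(potential_reach, total_call, benefit_per_subscriber, call_cost):
    """Profit per 1000 actual positives for each row of the trade-off table."""
    return [reach * benefit_per_subscriber - calls * call_cost
            for reach, calls in zip(potential_reach, total_call)]

# First five rows of the trade-off table
added_features = ['poutcome_success', 'contact_unknown', 'housing',
                  'pdays', 'job_blue-collar']
potential_reach = [182.732928, 182.732928, 163.602493, 196.355604, 198.483263]
total_call = [306.471306, 304.61727, 259.737162, 364.378927, 376.033179]

# Hypothetical economics: a subscriber is worth 2 units, a call costs 1 unit
profits = expected_profit(potential_reach, total_call,
                          benefit_per_subscriber=2.0, call_cost=1.0)
best = max(range(len(profits)), key=profits.__getitem__)
print(added_features[best], round(profits[best], 2))
```

The best row depends on the ratio of benefit to call cost: a larger benefit per subscriber favours rows with higher recall, even though their precision is lower. With the real DataFrame, the same function can take `trade_off_table['potential_reach']` and `trade_off_table['total_call']` directly.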


The selected features can be re-entered in Project stage 1 for prediction.
Note that when the input includes non-binary variables, Gaussian NB should be used instead of Categorical NB.