# Introduction

> In this project, we look at Gain and Lift charts, two measures used to assess the benefit of a logistic regression model in business contexts such as target marketing, customer churn, and customer conversion.

> In target marketing or marketing campaigns, customer response rates are usually very low, sometimes less than 1%. Another example of such low conversion is the response to internet advertising such as Google AdWords and mobile advertisements.

> The organization incurs a cost for each customer contact, so it would like to minimize the cost of the marketing campaign while still achieving the desired response level from customers.

> The formulas for Gain and Lift are as follows:

> GAIN = (Cumulative number of positive observations up to decile i) / (Total number of positive observations in the data)

> LIFT = (Cumulative number of positive observations up to decile i using the logistic model) / (Cumulative number of positive observations up to decile i based on a random model)
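
> As a small worked example of the two formulas, the sketch below uses made-up positive counts per decile (illustrative numbers, not taken from the bank dataset). It assumes the random model captures positives in proportion to the share of customers contacted, i.e. i/10 of all positives at decile i.

```python
import numpy as np

# Hypothetical number of positives captured in each decile (1..10),
# after sorting customers by predicted probability in descending order.
positives_per_decile = np.array([40, 25, 12, 8, 5, 4, 3, 2, 1, 0])

total_positives = positives_per_decile.sum()
cumulative_positives = positives_per_decile.cumsum()

# GAIN: share of all positives captured up to decile i.
gain = cumulative_positives / total_positives

# Assumed random model: positives are found in proportion to the
# fraction of customers contacted (i/10 of all positives at decile i).
deciles = np.arange(1, 11)
random_cumulative = total_positives * deciles / 10

# LIFT: cumulative positives with the model vs. with random selection.
lift = cumulative_positives / random_cumulative

for d, g, l in zip(deciles, gain, lift):
    print(f"decile {d:2d}: gain = {g:.2f}, lift = {l:.2f}")
```

> With these toy numbers the lift starts well above 1 and falls to exactly 1 at the last decile, where the model and random selection have both covered every customer.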

> Apart from this, we will also render a decision tree graph using pydotplus and Graphviz.


> The dataset is taken from the UCI Machine Learning Repository and is part of the book Machine Learning using Python by U Dinesh Kumar.

# Feature description

1. **age** - Age of the client
2. **job** - type of job
3. **marital** - marital status
4. **education** - education qualification
5. **default** - customer has credit in default? (yes/no)
6. **balance** - average yearly balance, in euros
7. **housing-loan** - has housing loan? (yes/no)
8. **personal-loan** - has personal loan? (yes/no)
9. **current campaign** - number of contacts performed during this campaign
10. **previous campaign** - number of contacts performed before this campaign
11. **subscribed(Target)** - subscribed or not for term deposit? (yes/no)

In [18]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import export_graphviz
from IPython.display import Image
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, auc, confusion_matrix, accuracy_score
from sklearn.utils import resample, shuffle


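# matplotlib defaults
# "seaborn-darkgrid" was renamed to "seaborn-v0_8-darkgrid" in matplotlib 3.6
style = "seaborn-v0_8-darkgrid" if "seaborn-v0_8-darkgrid" in plt.style.available else "seaborn-darkgrid"
plt.style.use(style)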
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [19]:
# download and install pydotplus
!pip install pydotplus

In [20]:
import pydotplus as pdot

In [21]:
df = pd.read_csv("/kaggle/input/bank-dataset/bank.csv")
df.head()

# Basic data analysis

In [22]:
df.info()
print("*******************************************************")
print(df.shape)

In [23]:
# check class balance.
df.subscribed.value_counts().plot(kind='bar', xlabel='subscribed',
                                 ylabel='Count of users',
                                 title='Count of subscribers for term deposit')

> The classes are highly imbalanced, as is typical of targeted marketing, where conversions are always low. In some marketing campaigns, customer conversion is below 1%.
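
> To put a number on the imbalance rather than reading it off the bar chart, a sketch like the following can help. The series below is a hypothetical stand-in for `df['subscribed']`; the 88/12 split is made up for illustration.

```python
import pandas as pd

# Hypothetical target column standing in for df['subscribed'].
subscribed = pd.Series(["no"] * 88 + ["yes"] * 12, name="subscribed")

# Share of each class instead of raw counts.
class_share = subscribed.value_counts(normalize=True)
print(class_share)

# Ratio of majority to minority class size.
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f} to 1")
```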

In [24]:
num_cols = [col for col in df.columns if df[col].dtype == 'int64']
print(num_cols)
print("**********************************************************")
cat_cols = [col for col in df.columns if df[col].dtype == 'object']
print(cat_cols)

In [25]:
for col in cat_cols:
    vc=df[col].value_counts()
    print("for cat feature {}, values are: \n{}".format(col,vc))
    print("******************************************")

> More than 50% of customers have housing loans.

In [26]:
# see the distributions of num features
# displot is figure-level and creates its own figure, so no plt.figure() call is needed
for col in num_cols:
    sns.displot(data=df, x=col, hue='subscribed', kde=False, height=6)
    plt.show()

> All age groups appear important in the graph, so we will keep all the features.

> The current and previous campaign contact counts show that customers contacted approximately 3-4 times are the ones being converted to term deposit subscribers.

> The balance feature has some outliers.
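
> One common way to make "some outliers" concrete is the 1.5 x IQR rule. The sketch below applies it to a handful of hypothetical balances; the values and the 1.5 multiplier are assumptions for illustration, not something the notebook itself uses.

```python
import pandas as pd

# Hypothetical balances in euros; the last value is an obvious outlier.
balance = pd.Series([120, 450, 800, 1300, 1500, 2100, 2500, 71000], name="balance")

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = balance.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = balance[(balance < lower) | (balance > upper)]
print(f"bounds: [{lower:.1f}, {upper:.1f}]")
print(outliers)
```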

# Logistic regression model

In [27]:
# pre-process the data and build logistic regression model
X = df.drop('subscribed', axis=1)
y = df.subscribed

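# get dummy variables
# dtype=int keeps the dummies numeric; recent pandas versions default to bool,
# which statsmodels cannot mix with the integer columns
X = pd.get_dummies(X, 
                   columns=['job','marital','education','default','housing-loan','personal-loan'],
                   drop_first=True,
                   dtype=int)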

# encode target columns
y = y.map({'yes': 1,
           'no': 0}
         )

# logit model
X = sm.add_constant(X)

lm = sm.Logit(y,X).fit()
lm.summary2()

In [28]:
# see the predicted probability
pred_df = pd.DataFrame({'actual': y,
                         'pred_prob': lm.predict(X)
                       })
sorted_df = pred_df.sort_values('pred_prob', ascending=False)
sorted_df.head()

> Because the classes are unbalanced, we will not assign hard 0/1 predictions. Instead, we assess the model's predictive performance through Gain and Lift charts.

# GAIN-LIFT CHART

In [29]:
#calculate counts per decile
num_per_decile = int(len(sorted_df['pred_prob'])/10)
print("Number of observation per decile: ", num_per_decile)

In [30]:
li = [num for num in range(0,4520,452)]
li

In [31]:
# assign deciles to each observation (rows are already sorted by pred_prob);
# any leftover rows after 10 equal blocks fall into decile 10
sorted_df['decile'] = np.minimum(np.arange(len(sorted_df)) // num_per_decile, 9) + 1
sorted_df

In [32]:
deciles_df = sorted_df.copy()

gain_decile_df = pd.DataFrame(deciles_df.groupby('decile')['actual'].sum().reset_index())
gain_decile_df.columns = ['decile','gain']
gain_decile_df

In [33]:
gain_decile_df['gain_percentage'] = (100*gain_decile_df['gain'].cumsum()/gain_decile_df['gain'].sum())
gain_decile_df

In [34]:
# plot the gain against the decile
plt.figure(figsize=(12,6))

plt.subplot(1,2,1)
plt.plot(gain_decile_df['decile'], gain_decile_df['gain'])
plt.xlabel('Decile')
plt.ylabel('Gain')

plt.subplot(1,2,2)
plt.plot(gain_decile_df['decile'], gain_decile_df['gain_percentage'])
plt.xlabel('Decile')
plt.ylabel('Gain percentage')

plt.suptitle('GAIN chart of logistic regression model', weight='bold')
plt.show()

> From the first plot, the first decile (the top 10% of customers) contains almost 120 customers who subscribe to the term deposit.

> From the second plot, we can see that by contacting the first 60% of customers we reach 80% of the subscribers.

> For business cases with imbalanced classes, rather than predicting the actual outcome, we can use a logistic regression model that predicts probabilities and then measure what percentage of the customers likely to convert is captured at each decile.
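
> Reading the chart as "contact the top x% to reach y% of subscribers" can also be done programmatically. The sketch below uses a made-up gain table shaped like `gain_decile_df` and a hypothetical helper, `first_decile_reaching`; neither the values nor the helper come from the notebook.

```python
import pandas as pd

# Illustrative gain table shaped like gain_decile_df (values are made up),
# assumed to be sorted by decile in ascending order.
toy_gain_df = pd.DataFrame({
    "decile": range(1, 11),
    "gain_percentage": [23, 38, 50, 60, 70, 80, 87, 93, 97, 100],
})

def first_decile_reaching(gain_df, target_pct):
    """Return the smallest decile whose cumulative gain is >= target_pct."""
    reached = gain_df[gain_df["gain_percentage"] >= target_pct]
    if reached.empty:
        raise ValueError("target_pct is never reached")
    return int(reached["decile"].iloc[0])

decile_for_80 = first_decile_reaching(toy_gain_df, 80)
print(f"Contact the top {decile_for_80 * 10}% of customers to reach 80% of subscribers")
```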

In [35]:
gain_decile_df['Lift'] = (gain_decile_df.gain_percentage/(gain_decile_df.decile*10))
gain_decile_df

In [36]:
# plot the lift chart
plt.figure(figsize=(6,6))
plt.plot(gain_decile_df['decile'], gain_decile_df['Lift'])
plt.xlabel('Decile')
plt.ylabel('Lift')
plt.title("LIFT Chart")
plt.show()

> Customers up to the 2nd decile are the most likely to be subscribers according to the prediction model.

# Decision tree model with unbalanced classes

**> Experiment no. 1**

> We build a decision tree classifier on the unbalanced data using the Gini impurity criterion.

In [37]:
X_n = X.drop('const', axis=1)

# divide into train and test data
X_train, X_test, y_train, y_test = train_test_split(X_n, y, test_size=0.20, random_state=42)

# decision tree classifier
tree1 = DecisionTreeClassifier(criterion='gini', max_depth=10)
tree1.fit(X_train, y_train)
score = tree1.score(X_test, y_test)
pred_y = tree1.predict(X_test)
accuracy = accuracy_score(y_test, pred_y)
print("Score: ", score)
print("Accuracy of model: ", accuracy)

In [38]:
# confusion matrix
def confusion_matri(actual, predicted):
    # with labels=[1, 0], rows/columns are ordered (subscribed, not subscribed)
    cm = confusion_matrix(actual, predicted, labels=[1,0])
    tp, fn, fp, tn = cm.ravel()
    # plot confusion matrix (counts are integers, so use the 'd' format)
    plt.figure(figsize=(6,6))
    sns.heatmap(cm, annot=True, fmt='d', 
                xticklabels =['subscribed','not subscribed'],
                yticklabels =['subscribed','not subscribed'])
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.title("Confusion Matrix")
    plt.show()
    return tn,fp,fn,tp

In [39]:
confusion_matri(y_test, pred_y)
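
> Since accuracy can look high on unbalanced data even when few subscribers are found, the four counts returned above are often turned into precision and recall for the positive class. The sketch below is one way to do that; `precision_recall` is a hypothetical helper and the counts are made up.

```python
def precision_recall(tn, fp, fn, tp):
    """Compute precision and recall for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts in the (tn, fp, fn, tp) order returned by confusion_matri.
precision, recall = precision_recall(tn=780, fp=20, fn=80, tp=25)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```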

In [40]:
# creating graph of the tree
# export the tree into a Graphviz dot file
export_graphviz(tree1,
               out_file="tree1_file.dot",
               feature_names=list(X_n.columns),
               filled=True,
               proportion=False)

# read the dot file and create the image file
tree_graph = pdot.graphviz.graph_from_dot_file("tree1_file.dot")
# write PNG data so the file content matches its .png extension
tree_graph.write_png("tree1_file.png")
# render the png file
Image(filename="tree1_file.png")

**> Experiment no. 2**

> We build a decision tree classifier on the unbalanced data using the entropy criterion.

In [41]:
#decision tree classifier 2 
tree2 = DecisionTreeClassifier(criterion = 'entropy', max_depth=10)
tree2.fit(X_train,y_train)
score2 = tree2.score(X_test, y_test)
pred_y2 = tree2.predict(X_test)
# use the predictions of tree2 (pred_y2), not those of tree1
accuracy2 = accuracy_score(y_test, pred_y2)
print("Score 2 : ", score2)
print("Accuracy of model 2: ", accuracy2)

> Scores are almost the same with gini and entropy criteria
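
> For context on why the two criteria tend to behave alike, the sketch below evaluates the standard Gini impurity and Shannon entropy formulas for a binary node. The helper names are illustrative and the proportions are arbitrary; both measures peak at a 50/50 node and shrink as the node becomes purer.

```python
import numpy as np

def gini_impurity(p):
    """Gini impurity of a binary node with positive-class proportion p."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy(p):
    """Shannon entropy (in bits) of a binary node with positive-class proportion p."""
    probs = np.array([p, 1.0 - p])
    probs = probs[probs > 0]  # treat 0 * log2(0) as 0
    return float(-(probs * np.log2(probs)).sum())

for p in (0.5, 0.25, 0.1):
    print(f"p = {p:.2f}: gini = {gini_impurity(p):.3f}, entropy = {entropy(p):.3f}")
```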

**> Experiment no. 3**

> We use sklearn.utils to oversample the minority class.

In [42]:
# create full data frame with all observations
d_n = pd.concat([X_n,y], axis=1)

# separate the yes-subscribed and no-subscribed rows
df_no = d_n[d_n['subscribed'] == 0]
df_yes = d_n[d_n['subscribed'] == 1]

# upsample the minority dataframe
df_upsampled = resample(df_yes,
                       replace=True,
                       n_samples=2600)

# concat to create final dataset
final_df = pd.concat([df_no,df_upsampled])

# shuffle the data set
final_df = shuffle(final_df)

In [47]:
# Build model using Decision tree classifier
Xn = final_df.drop('subscribed', axis=1)
yn = final_df.subscribed

# train and test dataset
X_trn, X_tst, y_trn, y_tst = train_test_split(Xn,yn, test_size=0.20, random_state=42)

# Decision tree model
tree3 = DecisionTreeClassifier(criterion='gini', max_depth=14, random_state=42)
tree3.fit(X_trn, y_trn)
pred = tree3.predict(X_tst)
accuracy3 = accuracy_score(y_tst, pred)
print("Accuracy score of the model: ", accuracy3)

In [48]:
report = classification_report(y_tst, pred)
print(report)

> Here, we can see that the model does not perform well with max_depth below 10, whereas the previous models achieved the desired accuracy at a max_depth of 10.

> We tune the max_depth hyperparameter to 14 to achieve 84% accuracy on the dataset rebalanced with sklearn.utils.
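
> In Experiment no. 3 the oversampling is done before the train/test split, so copies of the same minority-class row can end up in both sets and the test accuracy may be optimistic. A minimal sketch of the alternative, splitting first and oversampling only the training portion, is shown below on synthetic data; the column `feature`, the sizes, and the stratified split are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the encoded bank data (feature and sizes are made up).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "subscribed": np.r_[np.ones(100, dtype=int), np.zeros(900, dtype=int)],
})

# 1) Split first, so the test set keeps the original class balance.
train_df, test_df = train_test_split(
    toy, test_size=0.20, stratify=toy["subscribed"], random_state=42
)

# 2) Oversample the minority class inside the training set only.
train_no = train_df[train_df["subscribed"] == 0]
train_yes = train_df[train_df["subscribed"] == 1]
train_yes_up = resample(train_yes, replace=True, n_samples=len(train_no), random_state=42)
train_balanced = pd.concat([train_no, train_yes_up]).sample(frac=1, random_state=42)

# Check that no original row is present in both the training and test sets.
overlap = set(train_balanced.index) & set(test_df.index)
print("class counts in balanced training set:")
print(train_balanced["subscribed"].value_counts())
print("rows shared between train and test:", len(overlap))
```

> The test set here is never resampled, so a score measured on it reflects the class mix the model would face in practice.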

# Conclusions:

> In this project we applied a logistic regression model to unbalanced classes to predict probabilistic outcomes and evaluated it with Gain and Lift charts.

> We also built decision tree models with the Gini and entropy criteria and rendered the tree using Graphviz through pydotplus.

> Finally, we applied a decision tree model with class balancing using sklearn.utils.

> We found that, whether the data is unbalanced or balanced, a decision tree can perform well on test data, while logistic regression is well suited to producing probabilistic outcomes for unbalanced data.