# Introduction 

Term deposits are a major source of income for a bank. A term deposit is a cash investment held at a financial institution: the money is invested at an agreed rate of interest over a fixed period of time, or term. Banks use various outreach channels to sell term deposits to their customers, such as email marketing, advertisements, telephone marketing, and digital marketing.

Telephone marketing campaigns remain one of the most effective ways to reach people. However, they require a large investment, because large call centers are hired to run them. It is therefore crucial to identify in advance the customers most likely to convert, so that they can be targeted by phone.

In this notebook I will create a profile of the customers most likely to subscribe to a term deposit.

## Columns

1. age (numeric)
2. job: type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")
3. marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
4. education (categorical: "unknown","secondary","primary","tertiary")
5. default: has credit in default? (binary: "yes","no")
6. balance: average yearly balance, in euros (numeric)
7. housing: has housing loan? (binary: "yes","no")
8. loan: has personal loan? (binary: "yes","no")
9. contact: contact communication type (categorical: "unknown","telephone","cellular")
10. day: last contact day of the month (numeric)
11. month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")
12. duration: last contact duration, in seconds (numeric)
13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14. pdays: number of days that passed after the client was last contacted in a previous campaign (numeric, -1 means the client was not previously contacted)
15. previous: number of contacts performed before this campaign and for this client (numeric)
16. poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")
17. y: has the client subscribed to a term deposit? (binary: "yes","no")
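
Because `pdays` uses -1 as a sentinel rather than a real day count, a model that treats the column as purely numeric may read "never contacted" as "contacted very recently". One possible way to separate the two meanings is sketched below on made-up toy data; the `previously_contacted` and `pdays_clean` columns are hypothetical names and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data mimicking the `pdays` column; -1 marks clients never contacted before.
toy = pd.DataFrame({"pdays": [-1, 5, 180, -1]})

# Hypothetical helper columns (not part of the dataset):
# a boolean flag for prior contact, and the day count with the sentinel masked out.
toy["previously_contacted"] = toy["pdays"] != -1
toy["pdays_clean"] = toy["pdays"].where(toy["previously_contacted"], np.nan)

print(toy)
```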

In [4]:
import pandas as pd 
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from functools import reduce

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder, RobustScaler
from sklearn.compose import make_column_transformer

In [6]:
df = pd.read_csv("data/data.csv", sep=";")
df["y_numeric"] = df["y"].map({"no": 0, "yes": 1})
assert not df.isna().any().any() # Assert that there are no missing values

categorical_features = ["job", "marital", "education", "default", "housing", "loan", "contact", "poutcome"]
cols_to_drop = ["y", "month"]

# Transform variables into categorical type
df[categorical_features] = df[categorical_features].astype('category')

# Encode the months using sin transform
df["month_encoded"] = pd.to_datetime(df["month"], format="%b").apply(lambda x: np.sin((x.month/12)*2*np.pi))


df.shape

(45211, 19)
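
A sine transform alone is not a one-to-one encoding of the month: January and May, for example, both map to 0.5. A common remedy is to pair the sine with a cosine so that every month gets a unique point on the unit circle. The sketch below illustrates this on a few toy values; the `month_sin` and `month_cos` column names are assumptions, and the notebook itself keeps the single `month_encoded` column.

```python
import numpy as np
import pandas as pd

ordered_months = ["jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"]
# Map month abbreviations to 1..12.
month_to_num = {m: i for i, m in enumerate(ordered_months, start=1)}

months = pd.Series(["jan", "may", "dec"])
angle = 2 * np.pi * months.map(month_to_num) / 12

# Hypothetical column names; each month becomes a (sin, cos) pair.
encoded = pd.DataFrame({
    "month": months,
    "month_sin": np.sin(angle),
    "month_cos": np.cos(angle),
})
print(encoded)
```

In this toy frame the sine values of `jan` and `may` coincide, while their cosine values have opposite signs, so the pair distinguishes them.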

In [8]:
df.dtypes

age                 int64
job              category
marital          category
education        category
default          category
balance             int64
housing          category
loan             category
contact          category
day                 int64
month              object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome         category
y                  object
y_numeric           int64
month_encoded     float64
dtype: object

In [10]:
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,y_numeric,month_encoded
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,0,0.5
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,0,0.5
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,0,0.5
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,0,0.5
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,0,0.5


In [12]:
def label_counts(df: pd.DataFrame):
    """ Plots the label counts """
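    # Success rate = share of "yes" among all contacted customers (not yes/no)
    success_rate = df["y"].eq("yes").mean()
    fig = px.bar(df["y"].value_counts(), title=f"Success Rate {round(success_rate * 100)}%")
    fig.layout.yaxis.title = "# Customers"
    fig.layout.xaxis.title = "Successful contact"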
    fig.layout.showlegend = False
    return fig
label_counts(df)

In [14]:
def plot_monthly_success(df: pd.DataFrame):
    """ Plots the monthly success rate and total number of calls """
    ordered_months = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]
    plot_df = df[["month", "y_numeric"]].groupby("month").agg(
        n_calls=pd.NamedAgg(column="y_numeric", aggfunc=lambda x: x.shape[0]),
        success_rate=pd.NamedAgg(column="y_numeric", aggfunc=lambda x: (x.sum()/x.shape[0])*100)
        ).loc[ordered_months, :]

    # Create figure with secondary y-axis
    fig = make_subplots(specs=[[{"secondary_y": True}]])

    # Add traces
    fig.add_trace(
        go.Scatter(x=plot_df.index, y=plot_df["n_calls"], name="# Calls"),
        secondary_y=False,
    )

    fig.add_trace(
        go.Scatter(x=plot_df.index, y=plot_df["success_rate"], name="Success Rate"),
        secondary_y=True,
    )

    # Add figure title
    fig.update_layout(
        title_text="Number of Calls and Success Rate per Month"
    )

    # Set x-axis title
    fig.update_xaxes(title_text="Month")

    # Set y-axes titles
    fig.update_yaxes(title_text="# Calls", secondary_y=False)
    fig.update_yaxes(title_text="Success Rate [%]", secondary_y=True)

    fig.show()
plot_monthly_success(df)
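
The monthly aggregation can also be written with built-in aggregators. Using an ordered categorical keeps the months in calendar order without a label lookup, which in recent pandas versions raises a `KeyError` if a month is missing from the data. This is a minimal sketch on made-up toy data, not a replacement for the function above.

```python
import pandas as pd

ordered_months = ["jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"]

# Toy data: one row per call, 1 = subscribed.
toy = pd.DataFrame({
    "month": ["may", "may", "jan", "jan", "jan", "dec"],
    "y_numeric": [0, 1, 0, 0, 1, 1],
})

# An ordered categorical makes groupby sort by calendar order.
toy["month"] = pd.Categorical(toy["month"], categories=ordered_months, ordered=True)

monthly = (
    toy.groupby("month", observed=True)["y_numeric"]
    .agg(n_calls="count", success_rate="mean")
)
monthly["success_rate"] *= 100  # express as a percentage

print(monthly)
```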

In [None]:
def box_plot_cont(df: pd.DataFrame):
    """ Plots the box plot of the continuous variables """
    fig = px.box(df[["y", "balance", "age", "campaign"]].melt(id_vars="y"), x="y", y="value", facet_col="variable")
    fig.update_yaxes(matches=None)
    return fig
box_plot_cont(df)

In [62]:
def bar_plot_disc_variables(df: pd.DataFrame, feature: str):
    """ Plots the total and success counts for a discrete variable """
    # Group by the requested feature rather than a hard-coded column
    fig = px.bar(df[[feature, "y_numeric"]].groupby(feature).agg(
        TotalCount = pd.NamedAgg(column="y_numeric", aggfunc="count"), 
        SuccessCount = pd.NamedAgg(column="y_numeric", aggfunc="sum")
        ).reset_index().melt(id_vars=feature).rename(columns={"value": "Count Value", "variable": "Count Type"}), x=feature, y="Count Value", color="Count Type")

    fig.update_xaxes(matches=None)
    fig.update_layout(barmode='group')
    fig.layout.title.text = f"Total and Success Counts of variable {feature}"
    return fig
bar_plot_disc_variables(df, "job")
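
Raw counts are dominated by the largest categories, so a per-category success rate is often easier to compare across groups. The sketch below shows one way to compute it; `success_rate_by` is a hypothetical helper and the toy data is made up.

```python
import pandas as pd

def success_rate_by(df: pd.DataFrame, feature: str, label: str = "y_numeric") -> pd.DataFrame:
    """Total count, success count and success rate per category of `feature`."""
    return (
        df.groupby(feature, observed=True)[label]
        .agg(total="count", successes="sum", rate="mean")
        .sort_values("rate", ascending=False)
    )

# Toy data: 1 = subscribed.
toy = pd.DataFrame({
    "job": ["student", "student", "retired", "admin.", "admin.", "admin.", "admin."],
    "y_numeric": [1, 0, 1, 0, 0, 0, 1],
})

rates = success_rate_by(toy, "job")
print(rates)
```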

# Model Fitting
## Fit Linear Model
I will now use Logistic Regression to see the most important features.


In [17]:
def transform_data(df: pd.DataFrame) -> pd.DataFrame:
    """ Transforms the Data frame """
    train_data = df.drop(cols_to_drop, axis=1).copy()
    oh = OneHotEncoder(drop="first")
    oh.fit(train_data[categorical_features])
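    # get_feature_names_out (scikit-learn >= 1.0) replaces get_feature_names, which was removed in 1.2
    cat_enc_df = pd.DataFrame(oh.transform(train_data[categorical_features]).toarray(), columns=oh.get_feature_names_out(categorical_features), index=train_data.index)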
    return train_data.select_dtypes(exclude="category").join(cat_enc_df)
train_data = transform_data(df)   
train_data.columns

Index(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous',
       'y_numeric', 'month_encoded', 'job_blue-collar', 'job_entrepreneur',
       'job_housemaid', 'job_management', 'job_retired', 'job_self-employed',
       'job_services', 'job_student', 'job_technician', 'job_unemployed',
       'job_unknown', 'marital_married', 'marital_single',
       'education_secondary', 'education_tertiary', 'education_unknown',
       'default_yes', 'housing_yes', 'loan_yes', 'contact_telephone',
       'contact_unknown', 'poutcome_other', 'poutcome_success',
       'poutcome_unknown'],
      dtype='object')

In [18]:
label = "y_numeric"
X = train_data.drop(label, axis=1)
y = df[label]

mean_cross_val_score = cross_val_score(LogisticRegression(), X, y, cv=StratifiedKFold(5), scoring="f1").mean()
clf = LogisticRegression()
clf.fit(X, y)
fig = px.bar(pd.Series(clf.coef_.flatten(), index=X.columns).sort_values(), title=f"Logistic regression feature importance. 5 Fold Cross Validation F1 score: {round(mean_cross_val_score, 2)}")
fig.layout.showlegend = False
fig.show()
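
Logistic regression coefficients are only comparable across features when the features share a scale, and the fit above uses raw units (euros, seconds, days). A possible refinement is to put a scaler in front of the model. The sketch below assumes `RobustScaler` (imported at the top of the notebook but not used so far) and runs on synthetic data; the `x_small` and `x_large` columns are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data (hypothetical): two features with the same true effect on the
# label, but one is expressed in units 1000 times larger than the other.
rng = np.random.default_rng(0)
n = 2000
z_small = rng.normal(size=n)
z_large = rng.normal(size=n)
toy_X = pd.DataFrame({"x_small": z_small, "x_large": z_large * 1000})
toy_y = (z_small + z_large + rng.logistic(size=n) > 0).astype(int)

# Scale first, then fit; the pipeline keeps both steps together,
# so the scaler is refit inside each fold when used with cross_val_score.
pipe = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
pipe.fit(toy_X, toy_y)

# Coefficients of the final step, now on a comparable scale.
scaled_coefs = pd.Series(pipe[-1].coef_.ravel(), index=toy_X.columns)
print(scaled_coefs.sort_values())
```

Passing the same kind of pipeline to `cross_val_score` in place of the bare `LogisticRegression()` would keep the scaling inside each fold.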