# Marketing Campaigns Subscription Predictor
Shiying Wang,  Karlos Muradyan, Gaurav Sinha

# Introduction

Businesses offer many services to their clients, so it is important for them to know which clients actually need those services. Banking is one such sector: when a client subscribes to a service, the bank's revenue increases. One service offered by banks is the *term deposit*, a deposit in which money is held with the financial institution for a fixed period of time. Whether a client subscribes to a term deposit depends on many characteristics of that client. Banks generally hold this information, which can help predict whether a client will subscribe. This is an interesting problem that can be approached by analysing the data and building a model to predict client behaviour.

In this project, we analyze the *Bank Marketing* data of a Portuguese banking institution and predict whether a client will subscribe to a term deposit as a result of the marketing campaign.


# Methods

## Data

The dataset we chose is the Bank Marketing data of a Portuguese banking institution, obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing). The dataset is publicly available for research, and its details are described in [Moro et al., 2014](#Reference). Details of the source are provided in the Reference section.

The dataset has 4521 observations, 16 features, and a response variable `y` that indicates whether the client subscribed to a term deposit (binary: 'yes', 'no'). The classes are imbalanced: there are 4000 'no' and 521 'yes' observations. No observations in the dataset have missing values.

## Exploratory Data Analysis

We split the data into an 80% training set and a 20% test set, and used only the training set for the exploratory analysis. Figure 1 shows the proportion of each class of the response variable `y`.
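The split can be sketched as follows. This is a minimal standard-library illustration, not the project's actual code: the helper name `stratified_split`, the seed, and the decision to stratify by class are assumptions made here, since the report only states the 80/20 ratio. In practice, `sklearn.model_selection.train_test_split` with its `stratify` argument provides similar behaviour.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both parts.

    Hypothetical helper for illustration; stratification is an assumption.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Labels with the class counts reported in the Data section.
labels = ["no"] * 4000 + ["yes"] * 521
train_idx, test_idx = stratified_split(labels)

# Share of "yes" in each part stays close to the overall share.
train_yes_share = sum(labels[i] == "yes" for i in train_idx) / len(train_idx)
test_yes_share = sum(labels[i] == "yes" for i in test_idx) / len(test_idx)
```

Stratifying is one reasonable choice for imbalanced data because it keeps the minority class represented in the test set; a plain random split would also satisfy the 80/20 description.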

![](../results/proportion_of_class.png)

Figure 1. Proportion of response variable: whether the client subscribed.

From the plot, we can see that the classes are highly imbalanced: around 90% of clients did not subscribe after the marketing campaign, and only around 10% did. With such an imbalanced dataset, a model may end up predicting only the majority class. We discuss how we address this in the [Model](#Model) section.

This dataset has a combination of numerical and categorical features, which we explored separately. Figures 2 and 3 show the relationships between the numerical features.
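To illustrate why this matters, the sketch below scores a hypothetical baseline that always predicts 'no'. It uses the overall class counts from the Data section (4000 'no', 521 'yes') rather than the training split, and the helper names `accuracy` and `recall` are introduced here for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive="yes"):
    """Fraction of truly positive cases that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


y_true = ["no"] * 4000 + ["yes"] * 521
y_pred = ["no"] * len(y_true)  # baseline: always predict the majority class

baseline_accuracy = accuracy(y_true, y_pred)  # 4000 / 4521, about 0.885
baseline_recall = recall(y_true, y_pred)      # no subscriber is ever identified
```

The baseline reaches high accuracy while finding none of the subscribers, which is why accuracy alone is a misleading measure on this dataset.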

<img src="../results/kendall_corr_matrix.png" width="400" height="400"> <img src="../results/pearson_corr_matrix.png" width="400" height="400">

Figure 2: Kendall and Pearson correlation matrices.

From the Kendall and Pearson correlation matrices, most numerical features show little correlation with one another. The only strong correlation is between `pdays` and `previous`, for which Kendall's correlation coefficient is greater than 0.9.
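As a rough illustration of how the two coefficients differ, the sketch below implements Pearson's r and Kendall's tau-b from their textbook definitions. The function names and the toy data are assumptions for illustration; the matrices in Figure 2 were presumably produced with a library routine (for example, pandas' `DataFrame.corr` accepts `method="pearson"` or `method="kendall"`).

```python
import math


def pearson(x, y):
    """Pearson's r: measures linear association. Assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def kendall_tau_b(x, y):
    """Kendall's tau-b: rank-based, with a correction for ties. O(n^2) pair scan."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue          # tied in both variables: excluded from all terms
            elif dx == 0:
                ties_x += 1       # tied in x only
            elif dy == 0:
                ties_y += 1       # tied in y only
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + ties_x) * (pairs + ties_y))


# Toy data: y increases with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r = pearson(x, y)          # below 1 because the relationship is not linear
tau = kendall_tau_b(x, y)  # exactly 1 because the ordering agrees perfectly
```

Kendall's coefficient depends only on the ordering of the values, so it can report a strong association where Pearson's coefficient, which measures linear association, is lower.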

![](../results/pairplot_numeric.png)

Figure 3: Pair plot of numerical features.

From Figure 3, we found some interesting patterns. For clients who did not subscribe (blue), the density plot for `duration` is more concentrated on the left, which indicates that this class tends to have a shorter last contact. Most of the clients who subscribed after the campaign (red) had no previous contacts. These patterns indicate that `duration`, `pdays` and `previous` might be important features.

For categorical features, we looked at the counts of each category to identify common characteristics of the clients in our dataset.

![](../results/count_of_cat_features.png)

Figure 4: Counts of categorical features.

From Figure 4, we observed some common characteristics:
- In this dataset:
    - Most clients have the job type management or blue-collar.
    - Most clients are married.
    - Most clients have a secondary education level.
    - Almost all clients do not have credit in default.
    - A large portion of clients have a housing loan.
    - Most clients do not have a personal loan.
    - Clients are mostly contacted by cellular.
    - Most clients have no previous campaign.

- For those clients who subscribed to the service after the campaign:
    - Most of them work in management or as technicians.
    - Most of them are married, and many of them are single.
    - Most of them have a secondary education level, and many of them have a tertiary education level.
    - Most of them do not have credit in default, a housing loan or a personal loan.
    - They are mostly contacted by cellular.
    - Most of them have no previous campaign.
    

## Model

In data preprocessing, we standardized the numerical features and used one-hot encoding to convert the categorical features into dummy variables. Since this is a classification problem, we used traditional classification models from the `sklearn` package, such as logistic regression, support vector machines and random forests. To deal with the class imbalance, we tuned the `class_weight` hyperparameter that these `sklearn` models provide. Furthermore, we used models that can handle imbalanced classes by their nature, such as gradient boosting trees.
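The preprocessing and weighting steps can be sketched in plain Python as below. This is an illustration under assumptions, not the project's pipeline: the helper names and the toy column values are hypothetical, and the weighting shown is the `n_samples / (n_classes * class_count)` heuristic that `sklearn` applies when `class_weight="balanced"`, which is only one of the values that could be tuned. In `sklearn`, the corresponding tools are `StandardScaler` and `OneHotEncoder`.

```python
import math
from collections import Counter


def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Return (categories, rows); each row has a single 1 marking its category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy columns (hypothetical values, not taken from the dataset).
numeric_column = [30, 40, 50]
categorical_column = ["married", "single", "married"]

scaled = standardize(numeric_column)
categories, encoded = one_hot(categorical_column)

# Class counts from the Data section: the minority class gets the larger weight.
weights = balanced_class_weights(["no"] * 4000 + ["yes"] * 521)
```

With these counts, errors on 'yes' observations are weighted roughly 7.7 times more heavily than errors on 'no' observations, which discourages a model from ignoring the minority class.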

We used the F1 score as our main evaluation metric. It is defined as $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, the harmonic mean of precision and recall. It therefore balances the two, which is very useful in our imbalanced case.
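As a worked example of the formula, the sketch below computes F1 from confusion-matrix counts. The function name `f1_from_counts` and the counts are made up for illustration and are not results from this project; returning 0 when there are no true positives is a convention adopted here to avoid division by zero.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # convention: precision and recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 50/75, recall = 50/100.
example_f1 = f1_from_counts(tp=50, fp=25, fn=50)  # 4/7, about 0.571

# A model that never predicts the positive class scores 0, however accurate it is.
majority_only_f1 = f1_from_counts(tp=0, fp=0, fn=521)
```

Because true negatives do not appear in the formula, correctly labelling the large 'no' class does not inflate the score.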

# Results & Discussion

## Results

## Limitation

## Further directions

## Credit

# Reference

Reitz, Kenneth. 2019. _Requests: HTTP for Humans_. https://pypi.org/project/requests/.

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.

UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

