# Exploratory data analysis of the UCI Bank Marketing data set
This notebook contains the exploratory data analysis for our project [proposal](https://github.com/UBC-MDS/DSCI_522_Group_10).

In [1]:
import warnings
warnings.filterwarnings('ignore')
from IPython.display import HTML, display

import numpy as np
import pandas as pd

bank_df = pd.read_csv("../data/raw/bank-additional/bank-additional-full.csv", sep=";")
train_df = pd.read_csv("../data/processed/bank-additional-full_train.csv")
test_df = pd.read_csv("../data/processed/bank-additional-full_test.csv")

## Summary of the data set

The data we are using for this project, [bank-additional-full.csv](https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip), comes from a marketing campaign of a Portuguese bank. It was sourced from the UCI Machine Learning Repository and can be found on this [website](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing).

Each row of the data relates to the bank's direct telemarketing campaigns, in which the bank used telemarketing to try to get customers to sign up for its term deposit product. The target in this data set is whether the customer subscribed to the term deposit product (yes or no).

Some categorical features, such as education, contain the value 'unknown'. We are considering imputation but will re-assess this while preprocessing the features.
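
As a minimal sketch of how the 'unknown' values could be tallied before deciding on imputation, the snippet below counts them per categorical column. It uses a small made-up data frame (`toy_df`) rather than the real `bank_df`, and the list of categorical columns is an assumption for illustration only.

```python
import pandas as pd

# Toy stand-in for bank_df; these rows are made up for illustration.
toy_df = pd.DataFrame({
    "education": ["basic.4y", "unknown", "university.degree", "unknown"],
    "job": ["admin.", "retired", "unknown", "student"],
    "age": [34, 61, 45, 23],
})

# Assumed list of categorical columns to check.
categorical_cols = ["education", "job"]

# Compare every cell with the string 'unknown' and sum the matches per column.
unknown_counts = (toy_df[categorical_cols] == "unknown").sum()
print(unknown_counts)
```

The same expression could be applied to the real data by swapping in `bank_df` and its categorical column names. Note that these 'unknown' strings are not counted by `isna()`.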


In [2]:
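# Number of rows (one row per example)
total_examples = len(bank_df)
# All columns except the target column `y`
total_features = len(bank_df.columns) - 1
# Total number of missing (NaN) cells; 'unknown' strings are not counted here
count_NA = bank_df.isna().sum().sum()
print(f"There is a total of {total_examples} examples, {total_features} features and {count_NA} observations with missing values in the dataset")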

There is a total of 41188 examples, 20 features and 0 observations with missing values in the dataset


In [3]:
class_distribution = bank_df.y.value_counts().to_frame().T.rename(columns={"no":"Not Subscribed", "yes":"Subscribed"}, index={"y":"Class Distribution"})
class_distribution

| | Not Subscribed | Subscribed |
|---|---|---|
| Class Distribution | 36548 | 4640 |


Table 1. Counts of observations for each class.

The data used in this analysis is very similar to the data used in [Moro et al., 2014].

## Partition the data set into training and test sets

Before proceeding further, we split the data such that 80% of observations are in the training set and 20% of observations are in the test set. Below we list the counts of observations for each class:

In [4]:
train_class_distribution = train_df.target.value_counts().to_frame().T.rename(columns={0:"Not Subscribed", 1:"Subscribed"}, index={"target":"Training Set Class Distribution"})
test_class_distribution = test_df.target.value_counts().to_frame().T.rename(columns={0:"Not Subscribed", 1:"Subscribed"}, index={"target":"Test Set Class Distribution"})
pd.concat([train_class_distribution, test_class_distribution])

| | Not Subscribed | Subscribed |
|---|---|---|
| Training Set Class Distribution | 29250 | 3700 |
| Test Set Class Distribution | 7298 | 940 |


Table 2. Counts of observations for each class in each data partition.
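
The notebook loads the training and test sets from pre-split files, so the actual splitting code is not shown here. As a hedged sketch, an 80/20 random partition could be produced with pandas as follows. The helper name `split_train_test`, the seed, and the toy data are all hypothetical and not taken from the project.

```python
import pandas as pd

def split_train_test(df, train_frac=0.8, seed=522):
    """Randomly assign train_frac of the rows to train and the rest to test."""
    train = df.sample(frac=train_frac, random_state=seed)
    # Rows whose index labels were not sampled form the test set.
    test = df.drop(train.index)
    return train, test

# Toy imbalanced data: 45 negatives and 5 positives (made up).
toy_df = pd.DataFrame({"target": [0] * 45 + [1] * 5})
toy_train, toy_test = split_train_test(toy_df)
print(len(toy_train), len(toy_test))
```

A plain random split like this does not guarantee identical class proportions in both partitions. A stratified split would be one alternative if that matters.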

As shown above, there is class imbalance. The positive class for this analysis is a customer subscribing to the term deposit. We want to identify this class so that we capture as many potential subscribers as we can, so we care somewhat more about recall than precision. The better we can tune our prediction model to minimize false negatives, the more customers we can hope to sign up for this term deposit product.
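
To make the recall versus precision trade-off concrete, here is a small worked example. The confusion counts are hypothetical and are not results from this analysis.

```python
def recall(tp, fn):
    """Share of actual subscribers that the model finds."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of predicted subscribers that really subscribe."""
    return tp / (tp + fp)

# Hypothetical counts: 60 true positives, 40 false positives, 20 false negatives.
tp, fp, fn = 60, 40, 20

print(f"recall = {recall(tp, fn):.2f}")        # 60 / 80
print(f"precision = {precision(tp, fp):.2f}")  # 60 / 100
```

Reducing false negatives raises recall directly, which is why it is the metric we emphasize here.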

We will first choose appropriate metrics to find and tune the best model. Given the class imbalance shown in the table above, in addition to our initial tuning we are prepared to change the training procedure (e.g. class weights) and possibly the data itself (over- or under-sampling) as we continue our analysis. These decisions will also depend on whether the metrics from our initial tuning reveal any other problems.
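
As an illustration of the class weight idea, the sketch below computes inverse-frequency weights from the training counts in Table 2, using the common heuristic `n_samples / (n_classes * class_count)`. We believe this matches the "balanced" class weight option in scikit-learn, but the helper name `balanced_class_weights` is hypothetical and this is not necessarily the weighting we will end up using.

```python
# Training set counts taken from Table 2.
train_counts = {"Not Subscribed": 29250, "Subscribed": 3700}

def balanced_class_weights(counts):
    """Weight each class inversely to its frequency."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

weights = balanced_class_weights(train_counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

With these counts the minority class receives a weight roughly eight times that of the majority class, which mirrors the ratio between the two class counts.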




## Exploratory analysis on the training set

To understand which features could be helpful in predicting the positive class, we plotted histograms of the numeric features (didn't subscribe: blue; subscribed: orange) and bar graphs of the percentage subscribed for each categorical feature, using all observations in the training data set. Although the histograms for the two classes overlap to a certain degree for all of the numeric features, they do show differences in their centres and spreads; see, for example, the `age` histogram. For the categorical features, some show similar proportions subscribed across their values, while others look promising for predicting the positive class. The `poutcome` (previous outcome) feature seems to be the best, as previous success is highly associated with the positive class. In addition, the feature values (`contact`: cellphone, `education`: illiterate, `age_category`: older adults then young adults, and `job`: retired and student) seem to be associated with the positive class.
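
The percentages behind a percent-subscribed bar graph can be sketched as a group-wise mean of a 0/1 target. The example below uses made-up rows, and the column names `poutcome` and `target` are assumptions about the processed training data, which may name them differently.

```python
import pandas as pd

# Toy training rows (made up); target is 1 for subscribed, 0 otherwise.
toy_train = pd.DataFrame({
    "poutcome": ["success", "success", "failure", "failure",
                 "nonexistent", "nonexistent"],
    "target": [1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 target within each category is the proportion subscribed.
percent_subscribed = (
    toy_train.groupby("poutcome")["target"]
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(percent_subscribed)
```

Repeating this for each categorical feature gives the values one would plot as bars.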


In [5]:

display(HTML("<table>" + 
             "<tr><td><img src='../results/age.png'></td><td><img src='../results/last_contact_duration.png'></td></tr>" +
             "<tr><td><img src='../results/contacts_during_campaign.png'></td><td><img src='../results/days_after_previous_contact.png'></td></tr>" +
             "<tr><td><img src='../results/previous_contacts.png'></td><td><img src='../results/employment_variation_rate.png'></td></tr>" +
             "<tr><td><img src='../results/consumer_price_index.png'></td><td><img src='../results/consumer_confidence_index.png'></td></tr>" +
             "<tr><td><img src='../results/euribor_3_month_rate.png'></td><td><img src='../results/number_of_employees.png'></td></tr>" +
            "</table>"))


Figure 1. Distribution of numeric features in the training set for subscribers and non-subscribers to the bank's term deposit product.

In [7]:

display(HTML("<table>" + 
             "<tr><td><img src='../results/job.png'></td><td><img src='../results/month.png'></td></tr>" +
             "<tr><td><img src='../results/education.png'></td><td><img src='../results/day_of_week.png'></td></tr>" +
             "<tr><td><img src='../results/loan.png'></td><td><img src='../results/previous_outcome.png'></td></tr>" +
             "<tr><td><img src='../results/marital_status.png'></td><td><img src='../results/housing.png'></td></tr>" +
             "<tr><td><img src='../results/contact.png'></td><td><img src='../results/default.png'></td></tr>" +
            "</table>"))

Figure 2. Distribution of categorical features in the training set for subscribers to the bank's term deposit product.

## References

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014. https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#.
