In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as stats

I decided to look at some data about outbound sales calls by a bank for a new product of theirs. 

[TODO: more details about the bank here]

My goal will be to build a Machine Learning model to predict what types of customers are most likely to buy the product because of a telemarketing call. Such a model could help the bank spend its sales and marketing budget more effectively. It could also be used to predict the success of using telemarketing to promote new products.

I obtained this data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#

This data was used in the following research paper:

```[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014```

Let's import, clean up, and analyze the data to determine which Machine Learning algorithms are suitable for classifying this data.

In [None]:
## file obtained from https://archive.ics.uci.edu/ml/machine-learning-databases/00222/

bank_data = pd.read_csv("bank-additional-full.csv", delimiter=";")

bank_data.sample()

In [None]:
for col in bank_data.columns:
    if bank_data[col].isnull().any():
        print(f"Nulls in {col}")

Based on this, the data looks pretty clean. We have some unknown values, but no `na` values to contend with. Let's start looking at individual columns. For starters, `age` seems pretty reasonable. The youngest age is 17 and the max is 98.

In [None]:
print(bank_data.age.describe())

bank_data.age.hist()

Next, let's check the categorical variables, such as `job`, `marital`, `education`, `default`, `housing`, `loan`, and `contact`. The simplest way to inspect these is to call `value_counts()` on every column that does not have a numeric type.

In [None]:
for col in bank_data.columns:
    if bank_data[col].dtype == 'object':
        print(bank_data[col].value_counts())
        print("\n")

Many of the categorical variables have `unknown` values: `job`, `marital`, `education`, `default`, `housing`, `loan`. (For reference, `default` is whether the customer is in default on any loans, `housing` is whether the customer has a housing loan, and `loan` is whether the customer has a non-housing loan.)

While unknown values are not ideal, I don't believe it makes sense to throw out these values. If we find that customers with `unknown` values are less likely to buy the product, that tells us something too: that the business should stop trying to market to customers that they don't have enough demographic information about. In other words, `unknown` conveys less information than the other categorical values, but not zero information.
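One way to check whether `unknown` carries signal is to compare the success rate within each category. The sketch below uses a small made-up frame (the rows are invented for illustration and are not taken from the bank data); on the real data, the same `groupby` could be run against `bank_data`.

```python
import pandas as pd

# Toy stand-in for bank_data; these rows are invented for illustration only.
toy = pd.DataFrame({
    "job": ["admin.", "admin.", "unknown", "unknown", "technician", "technician"],
    "y":   ["yes",    "no",     "no",      "no",      "yes",        "no"],
})

# Share of successful calls within each category, including 'unknown'.
success_rate = (toy["y"] == "yes").groupby(toy["job"]).mean()
print(success_rate)
```

If the rate for `unknown` differs clearly from the other categories, that is a hint the value is worth keeping as its own level.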

The rest of the categorical variables look good: `month`, `day_of_week`, and `poutcome`. (`month` and `day_of_week` indicate when the customer was contacted. `poutcome` is whether previous marketing campaigns to this person were successful.)

`y` is whether the sales attempt was successful or not. About 89% of sales calls are unsuccessful. That's a pretty big class imbalance, which means we need to be careful with how we select and score our model: a model that always predicts `no`, regardless of the feature data, would be accurate about 89% of the time:

In [None]:
print(100 * sum(bank_data.y == "no") / len(bank_data.y))

For instance, when using `sklearn.svm.SVC` to classify this data, we will want to use the argument `class_weight='balanced'`.
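To make the effect of `class_weight='balanced'` concrete, here is a minimal sketch of the weights scikit-learn derives for that setting. The labels are synthetic, with an assumed 89/11 split chosen to resemble the imbalance above; they are not the real data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with an assumed 89/11 split (not the real bank data).
labels = np.array(["no"] * 89 + ["yes"] * 11)
classes = np.unique(labels)

# 'balanced' weights are n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)

# The same formula written out by hand, for comparison.
counts = np.array([(labels == c).sum() for c in classes])
manual = len(labels) / (len(classes) * counts)

print(dict(zip(classes, weights)))
```

The rare `yes` class ends up with a much larger weight than `no`, so misclassifying a buyer costs the model more than misclassifying a non-buyer.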

Now, let's look at the rest of the non-categorical values. There is one 'magic' value, in the `pdays` column. `pdays` is the number of days since last contact, and if there hasn't been a previous contact, the value 999 is assigned. It looks like this is the most common value in the column, indicating most people haven't been contacted before.

The column `previous` indicates the number of contacts previously with the customer. This column should always be 0 if `pdays` is 999. Let's make sure that's true.

In [None]:
possibly_bad_pdays = bank_data[(bank_data.pdays == 999) & (bank_data.previous > 0) ]

print("possibly bad 'pdays' values: %.2f %%" % (100 * len(possibly_bad_pdays) / len(bank_data)))

In other words, almost 10% of our rows have a nonsensical value for `pdays`. For now I am just going to drop that column. I will also drop the `duration` column, as this is not a realistic column to use for a predictive model. According to the documentation:

> this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

In [None]:
bank_data = bank_data.drop(['pdays', 'duration'], axis=1)

`campaign` is a bit of an odd field. According to the data dictionary, it is the "number of contacts performed during this campaign and for this client (numeric, includes last contact)". It has a very unusual distribution:

In [None]:
bank_data.campaign.hist()

print(bank_data.campaign.describe())

The largest value is 56 times. That seems like a huge number of contacts for one person. Does the distribution of this data make sense?

Intuitively, it seems like `campaign` should follow an exponential distribution. We can use a Q-Q plot to check this. A Q-Q plot plots the quantiles of the observed data against the quantiles of the best-fit theoretical distribution (in this case, exponential):

In [None]:
sm.qqplot(bank_data.campaign, fit=True, line='r', dist=stats.expon)

Ideally, the points would fall along the straight reference line. Based on the eye test, at least, the data doesn't appear to follow an exponential distribution exactly, but it's close.
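For intuition about what such a plot compares, here is a rough sketch that builds the Q-Q points by hand on synthetic exponential data (not the `campaign` column). The plotting positions used here are one common convention and are an assumption on my part; `statsmodels` may use slightly different ones, but the idea is the same.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=1000)  # synthetic data, not the bank data

# Fit the exponential, then pair each sorted observation with the quantile
# the fitted distribution predicts at the same plotting position.
loc, scale = stats.expon.fit(sample)
positions = (np.arange(1, len(sample) + 1) - 0.5) / len(sample)
theoretical = stats.expon.ppf(positions, loc=loc, scale=scale)
observed = np.sort(sample)

# For a good fit, the points (theoretical, observed) stay close to the line y = x.
corr = np.corrcoef(theoretical, observed)[0, 1]
print(f"correlation between quantiles: {corr:.3f}")
```

Systematic curvature away from the line, especially in the upper tail, is the usual sign that the assumed distribution is the wrong shape.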

I found a better fit using the `gamma` distribution, which is just a more general version of the exponential with 2 parameters instead of one:

In [None]:
sm.qqplot(bank_data.campaign, fit=True, line='r',dist=stats.gamma)

This looks like a pretty good fit, so for now I'm going to assume that the data in the `campaign` field is organic, rather than being due to some type of error. However, the quartile a value falls into may end up being a better predictor than the raw value due to the long tail, so let's add that in:

In [None]:
bank_data['campaign_quartile'] = pd.qcut(bank_data['campaign'], 4, labels=False, duplicates='drop')
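One thing to keep in mind with `duplicates='drop'`: when many rows share the same value, several quantile edges coincide and `qcut` returns fewer than the requested four bins. The sketch below shows this on invented, heavy-tailed counts (not the real `campaign` values).

```python
import pandas as pd

# Invented heavy-tailed counts: most values are 1, as with a contact counter.
counts = pd.Series([1] * 60 + [2] * 20 + [3] * 10 + [5] * 5 + [10] * 4 + [40])

# Ask for 4 quantile bins; duplicate edges are silently merged.
bins = pd.qcut(counts, 4, labels=False, duplicates="drop")

print(bins.value_counts().sort_index())
print("distinct bins:", bins.nunique())
```

It is worth checking `nunique()` on the real column before treating the result as true quartiles.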

The remaining variables are all macroeconomic statistics (what their values were at time of contacting the customer): `emp.var.rate`, `cons.price.idx`, `cons.conf.idx`, `euribor3m` and `nr.employed`.

It's always interesting to look at which correlations are in the dataset. If we have 20 columns, are we really getting 20 dimensions worth of data, or are there some we could ignore because they are highly correlated with others? Let's check that for the economic factors.

In [None]:
macro_predictors = ['emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
                    'euribor3m', 'nr.employed']

bank_data.loc[:, macro_predictors].corr()

We can see that `emp.var.rate`, `euribor3m` and `nr.employed` are very highly correlated. This is a bit surprising, since `euribor3m` is the 3-month Euribor (the European inter-bank lending rate), `emp.var.rate` is the quarterly employment variation rate, and `nr.employed` is the quarterly number of employees. The strong correlation between the lending rate and the employment indicators is curious, but may be explained by the period the data covers (May 2008 to November 2010), which spans the global financial crisis of late 2008 and its aftermath.

Since the correlations are so high, I may end up using only one of the 3 metrics in the final model, since the other two add dimensions without adding much new information.
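A quick way to list which pairs of columns are candidates for pruning is to scan the upper triangle of the correlation matrix. This is only a sketch on synthetic columns; the column names and the 0.9 threshold are illustrative assumptions, not values from the analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic columns: a and b are nearly copies of each other, c is independent.
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()
threshold = 0.9  # illustrative cut-off, not a recommendation

# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [(r, c) for r in upper.index for c in upper.columns
              if upper.loc[r, c] > threshold]
print(high_pairs)
```

On the real data, the same scan could be applied to `bank_data[macro_predictors]`.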

Let's start with a simple linear regression model to see how well we can do with the simplest possible model. We're going to need to encode all the categorical data numerically to be able to apply most learning algorithms.

Most of these categorical variables don't have a natural order to them, so we will just encode them as dummy variables. For example, for the `marital` field, instead of arbitrarily assigning integer values to `{single, married, divorced, unknown}`, we're adding variables like `{is_single, is_married, is_divorced, is_unknown}` that will be 1 or 0.
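As a small illustration of what dummy encoding produces, here is `pd.get_dummies` applied to a made-up `marital` column (the five rows are invented). Note that pandas names the columns `<prefix>_<value>` rather than `is_<value>`.

```python
import pandas as pd

# Invented values for illustration; the real column has many more rows.
marital = pd.Series(["single", "married", "divorced", "unknown", "married"],
                    name="marital")

# One 0/1 column per category, named "<prefix>_<value>".
dummies = pd.get_dummies(marital, prefix="marital", dtype=int)
print(dummies)
```

Each row has exactly one 1, so one of the columns is redundant; that is why one category per field is usually dropped before fitting a linear model.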

Day of week and month do have a natural order, so we can convert them back to integers rather than treating them as categories.

In [None]:
### TODO: a basic 

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

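y = bank_data['y']
x = bank_data.copy().drop(['y'], axis=1)

# Convert month and day of week to integers. This assumes the lowercase
# three-letter abbreviations shown by value_counts() above; any other
# spelling would map to NaN.
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
days = ['mon', 'tue', 'wed', 'thu', 'fri']
x['month'] = x['month'].map({m: i + 1 for i, m in enumerate(months)})
x['day_of_week'] = x['day_of_week'].map({d: i + 1 for i, d in enumerate(days)})

## we now need to convert all the categorical variables in x.
categorical_fields = ['job', 'marital', 'education', 'default', 'housing',
                      'loan', 'contact', 'poutcome']

for field in categorical_fields:
    # I am dropping the most frequent category for each of these. I believe this will
    # make the model's output more understandable. If not, I will use the built in pandas
    # feature to drop the first one instead.
    # dummies = pd.get_dummies(x[field], prefix=field, drop_first=True)
    dummies = pd.get_dummies(x[field], prefix=field)
    # get_dummies names columns "<prefix>_<value>", so the prefix is the field name.
    toss_cat_var_name = f"{field}_{x[field].value_counts().idxmax()}"

    x = pd.concat([x, dummies], axis='columns') \
          .drop([field, toss_cat_var_name], axis='columns')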

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

# linear_model = LinearRegression(normalize=True)

# linear_model.fit(x_train, y_train)



In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(x_train, y_train)
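Because of the class imbalance noted earlier, plain accuracy can make a useless model look good. The sketch below compares accuracy with balanced accuracy for a baseline that always predicts the majority class. It runs on synthetic data from `make_classification` with an assumed 89/11 split, not on the bank data, and is meant only as a reference point for scoring a fitted classifier.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the bank features (assumed 89/11 split).
X, labels = make_classification(n_samples=2000, weights=[0.89, 0.11], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42)

# A baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
pred = baseline.predict(X_te)

acc = accuracy_score(y_te, pred)
bal_acc = balanced_accuracy_score(y_te, pred)
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
```

Balanced accuracy averages the recall of each class, so the always-majority baseline scores 0.5 regardless of how lopsided the classes are. Any real model should be compared against that number rather than against raw accuracy.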