## Problem statement

The goal of this project is to predict how likely a customer is to respond to an offer, based on demographic data and data about the offers sent. The target variable is a boolean value that indicates whether a customer is going to buy (1) or not (0). 

Note that this is meant as an introduction to the subject. In the real world this kind of problem can be really complicated, as the results show. This is a solution under continuous improvement, and I'm aiming to provide a minimum viable product for it.

## Packages used

The following packages are used:
<ol>
    <li> <strong>pandas</strong>: Python package for data analysis.
    <li> <strong>numpy</strong>: used for numerical processing.
    <li> <strong>scikit-learn</strong>: mainstream machine learning library.
</ol>

## Algorithms and tools used


The target variable (`offer_completed`) is binary. From [sklearn's documentation](https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes):
>BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable.

All of our features are binary, so this makes BernoulliNB an obvious first choice. <br>

This algorithm has a smoothing parameter, `alpha`, which can be tuned for the best fit. The main goal is to better predict low probabilities (ours is 0.2), so I'll try values really close to zero. Wikipedia has a [really interesting story](https://en.wikipedia.org/wiki/Sunrise_problem) about the origin of the technique. <br>
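
As a rough illustration of what the smoothing parameter does, the sketch below assumes the standard additive-smoothing estimate for a binary outcome. The function name `smoothed_probability` and the counts are hypothetical and not part of scikit-learn. With `alpha = 1` the estimate matches the rule of succession from the sunrise problem, and as `alpha` approaches zero it approaches the raw frequency.

```python
def smoothed_probability(successes, trials, alpha):
    """Additive (Laplace/Lidstone) smoothing for a binary outcome.

    Adds `alpha` pseudo-counts to each of the two outcomes, so the
    estimate is (successes + alpha) / (trials + 2 * alpha).
    """
    return (successes + alpha) / (trials + 2 * alpha)


# Hypothetical counts: a feature value seen 0 times in 10 observations.
for alpha in (1.0, 0.1, 1e-9):
    print(alpha, smoothed_probability(0, 10, alpha))
```

Smaller values of `alpha` pull the estimate for an unseen event closer to zero, which is why values near zero are worth trying when the probabilities of interest are low.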

The second algorithm involves introducing a certain level of randomness into the system, then choosing based on that: a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier). From the website:

>A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Here we're controlling the number of estimators, i.e., the number of trees in the forest. More trees generally give better results but take longer to compute. Also, [there is a ceiling for this](https://scikit-learn.org/stable/modules/ensemble.html#forest), where the model performance approaches a plateau, as described:

> In addition, note that results will stop getting significantly better beyond a critical number of trees.

I'll start with 100, then increase it additively to check the effect on performance.
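
The plateau can be illustrated with a deliberately idealized sketch: assume each tree votes independently and is correct with the same probability (real trees in a forest are correlated, so this is only an intuition, not how scikit-learn computes anything). The function name `majority_vote_accuracy` and the per-tree accuracy of 0.6 are hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a strict majority of independent voters is right.

    Idealized: every voter is correct with the same probability and errs
    independently of the others, which real trees in a forest do not.
    """
    needed = n_voters // 2 + 1  # smallest strict majority
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# Hypothetical per-tree accuracy of 0.6; odd sizes avoid ties.
for n in (1, 11, 101, 501):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions each additional batch of voters adds less than the previous one, which is the same diminishing-returns pattern the documentation describes for the number of trees.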

GridSearchCV is a really powerful tool that runs through a grid of parameter values, cross-validating the model on each combination and keeping the best one. I'll use it for model tuning.
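
To make the idea of a parameter grid concrete, here is a minimal sketch of the search loop in plain Python. It is not scikit-learn's implementation: `expand_grid`, `pick_best` and `toy_score` are hypothetical names, the grid values are illustrative, and `toy_score` merely stands in for "fit with cross-validation and average the score".

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination in the grid as a dict."""
    names = list(param_grid)
    return [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]


def pick_best(param_grid, score_fn):
    """Score every combination and return the best (params, score) pair."""
    scored = [(params, score_fn(params)) for params in expand_grid(param_grid)]
    return max(scored, key=lambda pair: pair[1])


def toy_score(params):
    # Hypothetical stand-in for a cross-validated score.
    return -abs(params["n_estimators"] - 300) - params["max_depth"]


grid = {"n_estimators": [100, 200, 300], "max_depth": [2, 4]}
print(len(expand_grid(grid)))
print(pick_best(grid, toy_score))
```

The number of fits grows with the product of the list lengths (times the number of cross-validation folds), so each extra parameter in the grid multiplies the cost.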

## Metrics

The following metrics are going to be used to measure the models' performance:
<ol>
    <li> <strong>Confusion matrix</strong>: the predicted variable, whether a user will respond to an offer or not, is binary. The confusion matrix is therefore 2x2 and gives a clear view of how the model is doing. We're trying to minimize both kinds of error (types I and II), with priority on minimizing false negatives (type II): it is better to send an offer to a customer who won't respond than to keep a potential buyer from actually buying. Therefore, the higher the numbers on the main diagonal, the better.
    <li> <strong>Balanced accuracy score</strong>: this is the average of the recall obtained on each class, so each class counts equally regardless of how many samples it has. The target variable is binary with a dataset mean of 0.2, which means it is highly imbalanced; plain accuracy would reward always predicting the majority class, and balanced accuracy corrects for that. We're aiming to get it as close as possible to 1. 
</ol>
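
The difference between plain and balanced accuracy can be seen with a small sketch. The counts below are hypothetical (100 customers, 20 of whom respond, and a model that always predicts 0), and `binary_metrics` is an illustrative helper, not a scikit-learn function. The arguments follow the layout of a 2x2 confusion matrix with true classes in rows and predicted classes in columns: `[[tn, fp], [fn, tp]]`.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, per-class recall and balanced accuracy from 2x2 counts."""
    recall_negative = tn / (tn + fp)  # share of non-responders identified
    recall_positive = tp / (tp + fn)  # share of responders identified
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "recall_negative": recall_negative,
        "recall_positive": recall_positive,
        "balanced_accuracy": (recall_negative + recall_positive) / 2,
    }


# Hypothetical counts: the model never predicts a response.
print(binary_metrics(tn=80, fp=0, fn=20, tp=0))
```

In this case plain accuracy looks respectable while balanced accuracy sits at chance level, because every responder is a false negative.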

## Possible improvements
<ol>
    <li>Check how long customers take to respond to an offer (this might help to better calibrate the expirations).
    <li>Compare amount spent versus number of offers (maybe better offers for high-conversion customers).
</ol>

## Loading data

Data was cleaned in `data_modeling.ipynb`.

In [22]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

As mentioned in `data_analysis.ipynb`, we can only use customers who have been sent an offer. 

In [2]:
# Forward slashes avoid backslash escape issues and work on every OS.
events = pd.read_csv('data/transcript_clean.csv')
customers = pd.read_csv('data/profile_clean.csv')
portfolio = pd.read_csv('data/portfolio_clean.csv')
df_merge = pd.merge(customers, 
                    events[events['offer_id'].notna()],
                    how='left',
                    on='customer_id')
df = pd.merge(df_merge,
            portfolio,
            how='outer',
            on='offer_id')
df = df[df['offer_completed'].notna()] # check text below
df.head()

Unnamed: 0,customer_id,became_member_on,gender_F,gender_M,gender_O,age_range_age_0_to_18,age_range_age_18_to_25,age_range_age_25_to_30,age_range_age_30_to_35,age_range_age_35_to_40,...,channel_0_email,channel_0_web,channel_1_email,channel_1_mobile,channel_2_mobile,channel_2_social,channel_3_social,offer_type_bogo,offer_type_discount,offer_type_informational
0,0610b486422d4921ae7d2bf64640c50b,2017-07-15,1,0,0,0,0,0,0,0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1,0610b486422d4921ae7d2bf64640c50b,2017-07-15,1,0,0,0,0,0,0,0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,1,0,0,0,0,0,0,0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,1,0,0,0,0,0,0,0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,1,0,0,0,0,0,0,0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


We had to remove the rows where `offer_completed` is NaN because scikit-learn's estimators can't handle missing target values. Only 5 entries were affected, so this should be acceptable.

### Split data into training and test

- To keep the model comparisons fair and controlled, the test set is always 20% of the data.

In [3]:
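def split_data(df, features, target):
    """Split df into train and test sets, holding out a fixed 20% for testing."""
    test_size = 0.20
    # Fixed random_state so every model is evaluated on the same split.
    x_train, x_test, y_train, y_test = train_test_split(df[features],
                                                        df[target],
                                                        test_size=test_size,
                                                        random_state=42)
    return x_train, x_test, y_train, y_test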

In [None]:
features = ['gender_F', 
            'gender_M',
            'gender_O',
            'age_range_age_0_to_18', 
            'age_range_age_18_to_25', 
            'age_range_age_25_to_30',
            'age_range_age_30_to_35', 
            'age_range_age_35_to_40', 
            'age_range_age_40_to_45',
            'age_range_age_45_to_50', 
            'age_range_age_50_to_55', 
            'age_range_age_55_to_60',
            'age_range_age_60_to_65', 
            'age_range_age_65_to_101',
            'income_range_income_10000.0_to_30000.0', 
            'income_range_income_30000.0_to_50000.0', 
            'income_range_income_50000.0_to_70000.0',
            'income_range_income_70000.0_to_90000.0', 
            'income_range_income_90000.0_to_110000.0', 
            'income_range_income_110000.0_to_120000.0',
            'offer_type_bogo',
            'offer_type_discount',
            'offer_type_informational']
target = 'offer_completed'
x_train, x_test, y_train, y_test = split_data(df, features, target)

Unnamed: 0,gender_F,gender_M,gender_O,age_range_age_0_to_18,age_range_age_18_to_25,age_range_age_25_to_30,age_range_age_30_to_35,age_range_age_35_to_40,age_range_age_40_to_45,age_range_age_45_to_50,...,age_range_age_65_to_101,income_range_income_10000.0_to_30000.0,income_range_income_30000.0_to_50000.0,income_range_income_50000.0_to_70000.0,income_range_income_70000.0_to_90000.0,income_range_income_90000.0_to_110000.0,income_range_income_110000.0_to_120000.0,offer_type_bogo,offer_type_discount,offer_type_informational
2211,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,1.0,0.0,0.0
139542,0,1,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,1.0,0.0,0.0
130552,1,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0.0,1.0,0.0
100404,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0.0,1.0,0.0
109941,0,1,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119879,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0.0,1.0,0.0
103694,0,1,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0.0,1.0,0.0
131932,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0.0,1.0,0.0
146867,0,1,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,1.0,0.0,0.0


For a fair comparison between models, I'm using the same test-train datasets.

In [40]:
model = RandomForestClassifier()
# Start at 100 trees and increase in steps of 100: 100, 200, ..., 1000.
param_grid = {'n_estimators': [100 + x for x in range(0, 1000, 100)]}
# Rank candidates by balanced accuracy because the target is imbalanced.
clf_forest = GridSearchCV(model, param_grid, scoring='balanced_accuracy')
clf_forest.fit(x_train, y_train)
y_pred = clf_forest.predict(x_test)
print(balanced_accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

In [39]:
clf_forest.cv_results_

{'mean_fit_time': array([0.0717999 , 0.06113319, 0.05787168, 0.05412049, 0.06391301,
        0.05911183, 0.05838885, 0.06119337, 0.06188035]),
 'std_fit_time': array([0.01408196, 0.0035162 , 0.00522388, 0.00372874, 0.0056876 ,
        0.00727348, 0.00299065, 0.00600461, 0.00733104]),
 'mean_score_time': array([0.01253395, 0.01071825, 0.00871606, 0.01304812, 0.0114779 ,
        0.01084881, 0.01016064, 0.01248031, 0.00981469]),
 'std_score_time': array([0.0036812 , 0.00330097, 0.00087009, 0.00412799, 0.00306839,
        0.00459823, 0.0033111 , 0.00377423, 0.00305934]),
 'param_alpha': masked_array(data=[0.1, 0.01, 0.001, 0.0001, 1e-05, 1e-06, 1e-07, 1e-08,
                    1e-09],
              mask=[False, False, False, False, False, False, False, False,
                    False],
        fill_value='?',
             dtype=object),
 'params': [{'alpha': 0.1},
  {'alpha': 0.01},
  {'alpha': 0.001},
  {'alpha': 0.0001},
  {'alpha': 1e-05},
  {'alpha': 1e-06},
  {'alpha': 1e-07},
  {'a

In [32]:
model = BernoulliNB()
# Smoothing values from 0.1 down to 1e-9.
param_grid = {'alpha': [10**(-x) for x in range(1, 10, 1)]}
# No `scoring` is passed, so candidates are ranked by the default score (accuracy).
clf_bernoulli = GridSearchCV(model, param_grid)
clf_bernoulli.fit(x_train, y_train)
y_pred = clf_bernoulli.predict(x_test)
print(balanced_accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

0.5


array([[23332,     0],
       [ 6429,     0]], dtype=int64)

In [None]:
clf_bernoulli.cv_results_

## Checking if the offer type matters

Now, we run the same tests considering the dataset without the offer types.

In [None]:
features = ['gender_F', 
            'gender_M',
            'gender_O',
            'age_range_age_0_to_18', 
            'age_range_age_18_to_25', 
            'age_range_age_25_to_30',
            'age_range_age_30_to_35', 
            'age_range_age_35_to_40', 
            'age_range_age_40_to_45',
            'age_range_age_45_to_50', 
            'age_range_age_50_to_55', 
            'age_range_age_55_to_60',
            'age_range_age_60_to_65', 
            'age_range_age_65_to_101',
            'income_range_income_10000.0_to_30000.0', 
            'income_range_income_30000.0_to_50000.0', 
            'income_range_income_50000.0_to_70000.0',
            'income_range_income_70000.0_to_90000.0', 
            'income_range_income_90000.0_to_110000.0', 
            'income_range_income_110000.0_to_120000.0']
target = 'offer_completed'
x_train, x_test, y_train, y_test = split_data(df, features, target)
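
As an alternative to retyping the list, the reduced feature set could be derived from the full one by filtering on the column prefix. This is only a sketch: `all_features` is a shortened, illustrative sample of the column names above, and `features_without_offer` is a hypothetical variable name.

```python
# Short illustrative sample of the column names used in this notebook.
all_features = [
    "gender_F",
    "age_range_age_18_to_25",
    "income_range_income_30000.0_to_50000.0",
    "offer_type_bogo",
    "offer_type_discount",
    "offer_type_informational",
]

# Keep everything that is not an offer-type indicator.
features_without_offer = [
    name for name in all_features if not name.startswith("offer_type_")
]
print(features_without_offer)
```

Deriving the list this way keeps the two feature sets in sync if demographic columns are added or renamed later.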