## Problem statement

The goal of this project is to predict how likely a customer is to respond to an offer, based on demographic and behavioural data.

## Packages used

The following packages are used:
<ol>
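    <li> <strong>pandas</strong>: Python package for data analysis</li>
    <li> <strong>numpy</strong>: numerical arrays and vectorised maths</li>
    <li> <strong>matplotlib</strong>: plotting</li>
    <li> <strong>seaborn</strong>: statistical visualisation built on matplotlib</li>
    <li> <strong>scikit-learn</strong>: machine learning models and utilities</li>
    <li> <strong>scipy.stats</strong>: statistical functions and distributions</li>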
</ol>

## Metrics

The following metrics will be used to measure model performance:

## Loading data

Data was cleaned in `data_modeling.ipynb`.

In [111]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB

In [102]:
# Forward slashes avoid backslash-escape problems and work on every OS.
events = pd.read_csv('data/transcript_clean.csv')
customers = pd.read_csv('data/profile_clean.csv')
portfolio = pd.read_csv('data/portfolio_clean.csv')
df_merge = pd.merge(customers, 
                    events,
                    how='outer',
                    on='customer_id')
df = pd.merge(df_merge,
            portfolio,
            how='outer',
            on='offer_id')
df.head()

Unnamed: 0,customer_id,became_member_on,gender_F,gender_M,gender_O,age_range_age_0_to_18,age_range_age_18_to_25,age_range_age_25_to_30,age_range_age_30_to_35,age_range_age_35_to_40,...,channel_0_email,channel_0_web,channel_1_email,channel_1_mobile,channel_2_mobile,channel_2_social,channel_3_social,offer_type_bogo,offer_type_discount,offer_type_informational
0,0610b486422d4921ae7d2bf64640c50b,2017-07-15,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
1,0610b486422d4921ae7d2bf64640c50b,2017-07-15,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
2,0610b486422d4921ae7d2bf64640c50b,2017-07-15,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
3,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
4,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
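
The NaN values in the rightmost columns are what an outer join produces: a row with no match on the other side is kept, and the columns it lacks are filled with NaN. A minimal sketch on toy data illustrates this. The two small frames and their values are hypothetical stand-ins, not taken from the project files. It uses the `indicator` option of `pd.merge` to label where each row came from.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values).
customers = pd.DataFrame({"customer_id": ["a", "b"], "gender_F": [1.0, 0.0]})
events = pd.DataFrame({"customer_id": ["a", "c"], "offer_id": ["o1", "o2"]})

# An outer join keeps unmatched rows from both sides and pads them with NaN.
merged = pd.merge(customers, events, how="outer", on="customer_id", indicator=True)

# Count rows by origin (both / left_only / right_only) and NaNs per column.
print(merged["_merge"].value_counts())
print(merged.isna().sum())
```

Inspecting the `_merge` column this way may help decide whether unmatched rows should be dropped, imputed, or avoided with a different join type.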


### Split data into training and test sets

- To keep model comparisons fair, every model is evaluated on the same hold-out test set: 20% of the data, drawn with a fixed random seed.

In [107]:
def split_data(df, features, target):
    """Split df into train/test sets, holding out a fixed 20% for testing."""
    test_size = 0.20
    x_train, x_test, y_train, y_test = train_test_split(df[features],
                                                        df[target],
                                                        test_size=test_size,
                                                        random_state=42)
    return x_train, x_test, y_train, y_test

In [110]:
features = ['gender_F', 
            'gender_M',
            'gender_O',
            'age_range_age_0_to_18', 
            'age_range_age_18_to_25', 
            'age_range_age_25_to_30',
            'age_range_age_30_to_35', 
            'age_range_age_35_to_40', 
            'age_range_age_40_to_45',
            'age_range_age_45_to_50', 
            'age_range_age_50_to_55', 
            'age_range_age_55_to_60',
            'age_range_age_60_to_65', 
            'age_range_age_65_to_101',
            'income_range_income_10000.0_to_30000.0', 
            'income_range_income_30000.0_to_50000.0', 
            'income_range_income_50000.0_to_70000.0',
            'income_range_income_70000.0_to_90000.0', 
            'income_range_income_90000.0_to_110000.0', 
            'income_range_income_110000.0_to_120000.0',
            'offer_type_bogo',
            'offer_type_discount',
            'offer_type_informational']
target = 'offer_completed'
x_train, x_test, y_train, y_test = split_data(df, features, target)
x_train

Unnamed: 0,gender_F,gender_M,gender_O,age_range_age_0_to_18,age_range_age_18_to_25,age_range_age_25_to_30,age_range_age_30_to_35,age_range_age_35_to_40,age_range_age_40_to_45,age_range_age_45_to_50,...,age_range_age_65_to_101,income_range_income_10000.0_to_30000.0,income_range_income_30000.0_to_50000.0,income_range_income_50000.0_to_70000.0,income_range_income_70000.0_to_90000.0,income_range_income_90000.0_to_110000.0,income_range_income_110000.0_to_120000.0,offer_type_bogo,offer_type_discount,offer_type_informational
72012,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,
161242,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
294034,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
210966,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
180646,,,,,,,,,,,...,,,,,,,,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119879,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,
259178,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
131932,,,,,,,,,,,...,,,,,,,,,,
146867,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [113]:
clf = GaussianNB()
clf.fit(x_train, y_train)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
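
The error is raised because `GaussianNB` does not accept missing values, and the training data shown above still contains NaN rows. One possible way forward, sketched here on toy data, is to drop incomplete rows before splitting. The helper name `drop_incomplete_rows` and the toy values are assumptions for illustration only. Whether dropping, imputing, or changing the join type is the right choice depends on the data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def drop_incomplete_rows(df, features, target):
    """Return only the rows where every feature and the target are present."""
    return df.dropna(subset=features + [target])


# Toy data with gaps, standing in for the merged frame (hypothetical values).
toy = pd.DataFrame({
    "gender_F": [1.0, 0.0, None, 1.0, 0.0, 1.0, 0.0, 1.0],
    "offer_type_bogo": [1.0, 0.0, 1.0, None, 0.0, 1.0, 1.0, 0.0],
    "offer_completed": [1, 0, 1, 0, 0, 1, 1, 0],
})
features = ["gender_F", "offer_type_bogo"]
target = "offer_completed"

# Remove rows with missing values, then split exactly as before.
clean = drop_incomplete_rows(toy, features, target)
x_train, x_test, y_train, y_test = train_test_split(
    clean[features], clean[target], test_size=0.20, random_state=42
)

clf = GaussianNB()
clf.fit(x_train, y_train)  # no NaN left, so fitting no longer raises
predictions = clf.predict(x_test)
```

In the notebook itself, the same helper could be applied to `df` before calling `split_data`. Note that dropping rows changes the sample, so it is worth checking how many rows remain.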