In [1]:
import pandas as pd
import numpy as np
import math
import json
from collections import deque
% matplotlib inline

# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

- offers were made on days 0,7,14,17,21,24
- 75% of the population received offers each time offers were made
    - `received.time.value_counts()/profile.shape[0]`
- 75% of all the offers given were viewed
    - `viewed.shape[0]/received.shape[0]`
- at a given time a person received at most one offer
    - `received.groupby('person')['time'].value_counts().value_counts()`
- an person might receive the same offer multiple times over the course of 6 weeks
    - `received.groupby('person')['id'].value_counts().value_counts()`
- There are only 6 people who have not received a single offer

If these set of people recived an offer:
- would they view it?
    - if yes would it encourage/discourage them to complete the offer
- would they have completed the offer 
- or to rephrase the question does viewing the offer have an effect on completing the offer
- depends on the person and the type of offer (bogo, discount)
- maybe someone will complete the offer only 

- (view and complete) or (not view and complete) -> 1
- does not view or does not complete -> 0
- for each (offer, person) we can train a decision tree that tells us whether the person will view and complete the offer

## Business understanding (CRISP-DM Step 1)
- In this project our goal will be to try and predict if we should show a given offer to a given customer
- From a business perspective we want to show offers to those people who satisfy the following conditions
    - who are likely to complete the offer if they viewed the offer
    - who are not likely to complete the offer if they did not view the offer
- The second condition is important because if a customer would complete the offer even without viewing it would mean we need to not have given them the offer in the first place, and starbucks could have saved revenue here.
- If we can identify a subset of customers that satisfies both these conditions then by showing these individuals specific offers we can capture revenue that we would have otherwise missed out on.

### Problem Statement:
Our goal will be identify customers which show a strong positive relationship between viewing an offer and completing it. We will call such customer **responsive** customers.
We will formulate this as a classification problem where a user will be given a positive label (called responsive) when either
1. the user has completed the offer and viewed the offer OR
2. the user has not completed the offer and not viewed the offer

With such a labeling in place our attempt will be to come up with a classifier that predicts whether a customer is responsive based on the features from the customer and the offer being shown to them.

### Strategy
- We will begin with 3 different types of classifiers and conduct a preliminary analysis on their ability to perform the required classification
- After evaluating the preliminary modeling stage we will pick one out of the 3 and perform a grid search to further optimize the solution

### Metrics
- In order to evaluate the performance of the classification model we will employ a F-beta score that gives more importance to precision as compared to recall.
- This is because there is a cost associated with sending offers to unresponsive customers.
    - An unresponsive customer might complete the offer without even seeing it, which means we could have saved money by not showing the customer this offer
    - An unresposive cusomter might not complete the offer if they see it, which means the offer is negatively impacting their spending.
- With this in mind we will want to minimize the False Positives and hence give more importance to precision rather than recall.

## Data understanding (CRISP-DM Step 2)
- we have been given data regarding each user which we can make use of as features
- starbucks has also provided is with transcripts regarding if/when a user viewed/completed their shown offers.
- we can use this to figure out which users actually viewed their offers before completing their offers.
    - we have to be careful here though, an offer that was viewed very early and an offer that was viewed late might have different effects on the behavior of the users
    - For instance assume a user has completed \$8 out of \$10 required to complete an offer before seeing the offer. And if the offer has a reward of \$4, then then the user might be compelled to go out and spend \$2 to get a bigger reward. This should not let us bias our answer to say that the we should give such offers to those people.
    - We should also be aware that we do not have control over when the user is going to view the offer.
- looking at the type of offers that were completed we see that informational offers cannot be completed, since there is no difficulty/reward associated with them. 
    - So in order to evaluate whether an information offer was effective we will have to see if the individual performed some transaction in the duration corresponding to the informational offer
- for bogo/discount orders difficulty/rewards are specified which we can use as features as well

## Add some plots

## Data Preparation  (CRISP-DM Step 3)
- are there any missing values?
    - the gender and income columns in the profile seem to be have missing values
    - if a row is missing gender then it is missing income and vice versa
    - we choose to drop these users out of our analysis
    - another thing we observe is that there are users who have not been shown a single offer
- are there any duplicate values?
    - There are users who might receive the same offer twice, and both of them can be marked complete using a common set of transactions.
- are there any categorical variables?
    - the offers in the portfolios have channels and offer_type which we will convert into dummy variables
- are there variables that need cleaning?
    - each transaction type has a different type of value stored as a dict, we wil need to extract the data into separate columns
    - since a user might be shown the same offer more than once, we need to make sure that we use the time columns to identify which offer receivals are related to which offer views and completions

In [2]:
def attach_dict_cols(df, event, col='value'):
    """
    A helper function to clean the transactions data.
    
    Args:
        df: A dataframe that contains all the transactions
        event: the type of event we want to filter and clean. This should be one of the values in the 'event' column in df
        col: The column which contains the dict describing the various attributes of this event
    Returns:
        A dataframe of the specific event with the dict in the value col transformed to separate columns.
    """
    df = df[df['event'] == event].copy()
    attributes = pd.DataFrame(list(df[col]), index=df.index)
    attributes.columns = [col.replace(' ', '_') for col in attributes.columns]
    df = pd.concat([df.drop('value', axis=1), attributes], axis=1, sort=False).reset_index(drop=True)
    return df

transcript['transcript_id'] = np.arange(transcript.shape[0])
received = attach_dict_cols(transcript, 'offer received')
viewed = attach_dict_cols(transcript, 'offer viewed')
completed = attach_dict_cols(transcript, 'offer completed')
transaction = attach_dict_cols(transcript, 'transaction')

#received.groupby('offer_id')['time'].value_counts().unstack()

In [3]:
portfolio['duration_hours'] = portfolio.duration * 24
portfolio['offer_id'] = portfolio.id

received['receive_id'] = transcript['transcript_id']
received['time_receive'] = received['time']
received = pd.merge(received, portfolio[['offer_id', 'duration_hours', 'offer_type', 'difficulty', 'channels']],
                    on='offer_id') 
received['end_time'] = received['time'] + received.duration_hours

viewed = pd.merge_asof(viewed, received[['person', 'time', 'offer_id', 'time_receive', 'receive_id']].sort_values('time'),
                    by=['person', 'offer_id'], on='time')
received = received.drop(['receive_id', 'time'], axis=1)

In [4]:
rc = pd.concat([received, completed], sort=False).sort_values('time')

rec_dict = {person:{offer:deque() for offer in portfolio.id} for person in profile.id}

# loop through all the receives and completes
# each time you see a complete find the first receive of the same offer that is still open 
#rec_dict = person -> offer -> [(transcript_id, end_time)]
complete_list = []
for tup in (received.itertuples(index=False)):
    rec_dict[tup.person][tup.offer_id].append((tup.transcript_id,tup.end_time))

for tup in (completed.itertuples(index=False)):
    stash = rec_dict[tup.person][tup.offer_id]
    found = False
    while stash and (not found):
        rec = stash.popleft()
        if rec[1] >= tup.time:
            complete_list.append(rec[0])
            found = True
    if not found:
        raise

assert completed.shape[0] == len(complete_list)

In [5]:
completed['receive_id'] = complete_list

received = received.merge(viewed[['receive_id', 'time']], left_on='transcript_id', right_on='receive_id', 
                          how='left', suffixes=('', '_view'))

received = received.merge(completed[['receive_id', 'time']], left_on='transcript_id', right_on='receive_id',
                         how='left', suffixes=('_view', '_complete'))

In [6]:
df = received.copy()

df['viewed'] = (~df.time_view.isnull())
df['completed'] = (~df.time_complete.isnull())

channels = set(x for list_ in portfolio.channels.values for x in list_)
for channel in channels:
    df[channel] = df.channels.apply(lambda _: (channel in _))

df['offer_type_dummy'] = df['offer_type']
df = pd.get_dummies(df, columns=['offer_type'])

In [7]:
df.groupby(['offer_type_dummy', 'viewed'])['completed'].value_counts().unstack()

Unnamed: 0_level_0,completed,False,True
offer_type_dummy,viewed,Unnamed: 2_level_1,Unnamed: 3_level_1
bogo,False,10614.0,11748.0
bogo,True,4216.0,3921.0
discount,False,9740.0,13440.0
discount,True,2893.0,4470.0
informational,False,11449.0,
informational,True,3786.0,


In [8]:
clean_profile = profile.dropna().copy()
clean_profile['person'] = clean_profile['id']

In [9]:
clean_profile = pd.get_dummies(clean_profile, columns=['gender'], prefix='gender')
profile_cols = ['age', 'became_member_on', 'gender_M', 'gender_O', 'gender_F', 'income', 'person']
#profile_cols = ['age', 'became_member_on', 'gender', 'income', 'person']

df = pd.merge(df, clean_profile[profile_cols], on='person')

In [10]:
df.groupby('viewed')['completed'].value_counts().unstack()

completed,False,True
viewed,Unnamed: 1_level_1,Unnamed: 2_level_1
False,25451,24319
True,8606,8125


In [11]:
df['label'] = (df['viewed'] == df['completed']).astype(int)

In [12]:
all_cols = ['duration_hours', 'difficulty', 'mobile', 'social', 'email', 'web', 
            'offer_type_bogo', 'offer_type_discount', 'offer_type_informational', 
            'age', 'became_member_on', 'gender_M', 'gender_F', 'gender_O', 'income', 'label']

In [13]:
df = df[all_cols].copy().astype(int)

In [14]:
df = df[df.offer_type_informational == 0].copy()

## Modeling  (CRISP-DM Step 4)
- In this section we will try and predict the labels for customers that have been shown an offer
- the label we have assigned is positive when the customer completes and offer if and only if they have been seen the offer
- We will split the data into training and test sets 
- we will then train a AdaBoostClassifier and evaluate its performance on unseen data, the test set

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, fbeta_score

In [16]:
features = df.drop('label', axis=1)
labels = df['label']
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=42)

In [18]:
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test): 
    results = {}
    learner.fit(X_train[:sample_size], y_train[:sample_size])
    
    training_pred = learner.predict(X_train[:1000])
    results['train_acc'] = accuracy_score(y_train[:1000], training_pred)
    results['train_f'] = fbeta_score(y_train[:1000], training_pred, 0.5)
    results['train_prec'] = precision_score(y_train[:1000], training_pred)
    results['train_rec'] = recall_score(y_train[:1000], training_pred)
    
    y_pred = learner.predict(X_test)
    results['test_acc'] = accuracy_score(y_test, y_pred)
    results['test_f'] = fbeta_score(y_test, y_pred, 0.5)
    results['test_prec'] = precision_score(y_test, y_pred)
    results['test_rec'] = recall_score(y_test, y_pred)
    
    return results

In [79]:
clf_1 = DecisionTreeClassifier()
clf_2 = AdaBoostClassifier()
clf_3 = GradientBoostingClassifier()
clf_4 = RandomForestClassifier()

sizes = [int(X_train.shape[0] * 0.02), int(X_train.shape[0]*0.2), X_train.shape[0]]
results = {}
for clf in [clf_2, clf_3, clf_4, clf_1]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i in sizes:
        results[clf_name][i] = train_predict(clf, i, X_train, y_train, X_test, y_test)

In [80]:
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)
from plotly.tools import make_subplots

In [82]:
fig = make_subplots(rows=2, cols=4, subplot_titles=('Training Accuracy', 'Training F-Score', 'Training Precision', 'Training Recall',
                                                    'Testing Accuracy', 'Testing F-Score', 'Testing Precision', 'Testing Recall'))
trace_dict = {}
title_dict = {}
for i, metric in enumerate(['acc', 'f', 'prec', 'rec']):
    for j, tt in enumerate(['train', 'test']):
        trace_dict[(j+1, i+1)] = []
        for (clf_name, results_by_size), color in zip(results.items(), ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']):
            trace = go.Bar(name=clf_name, x=['2%', '20%', '100%'], y=[results_by_size[s][f'{tt}_{metric}'] for s in sizes],
                          legendgroup=clf_name, showlegend=(i==0 and j==0), marker={'color':color})
            trace_dict[(j+1, i+1)].append(trace)
        title_dict[(j+1, i+1)] = f'{tt}_{metric}'
for (r, c), traces in trace_dict.items():
    for trace in traces:
        fig.append_trace(trace, row=r, col=c)
    #title = title_dict[(r,c)]
    if r==1 and c==1:
        fig.update({'layout':{'xaxis':{'type':'category'}}})
    else:
        fig.update({'layout':{f'xaxis{(r-1)*4 + c}':{'type':'category'}}})
iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]  [ (1,3) x3,y3 ]  [ (1,4) x4,y4 ]
[ (2,1) x5,y5 ]  [ (2,2) x6,y6 ]  [ (2,3) x7,y7 ]  [ (2,4) x8,y8 ]

