In [1]:
import pandas as pd
import numpy as np
import math
import json
from collections import deque
%matplotlib inline

# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

- offers were made on days 0,7,14,17,21,24
- 75% of the population received offers each time offers were made
    - `received.time.value_counts()/profile.shape[0]`
- 75% of all the offers given were viewed
    - `viewed.shape[0]/received.shape[0]`
- at a given time a person received at most one offer
    - `received.groupby('person')['time'].value_counts().value_counts()`
- a person might receive the same offer multiple times over the course of 6 weeks
    - `received.groupby('person')['id'].value_counts().value_counts()`
- There are only 6 people who have not received a single offer
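
The last observation has no snippet attached. A minimal sketch of how it could be checked, assuming `received` holds the 'offer received' events with a `person` column; the frames and ids below are toy stand-ins, not the real data:

```python
import pandas as pd

# Toy stand-ins for the real frames; the ids are illustrative only.
profile = pd.DataFrame({'id': ['a', 'b', 'c', 'd']})
received = pd.DataFrame({'person': ['a', 'a', 'c'],
                         'offer_id': ['o1', 'o2', 'o1']})

# People in profile who never appear in an 'offer received' event.
no_offer = profile.loc[~profile['id'].isin(received['person']), 'id']
print(no_offer.tolist())
```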

If this set of people received an offer:
- would they view it?
    - if yes, would it encourage or discourage them from completing the offer?
- would they have completed the offer?
- or, to rephrase the question: does viewing the offer have an effect on completing the offer?
- depends on the person and the type of offer (bogo, discount)
- maybe someone will complete the offer only 

- view and complete -> 1
- does not view or does not complete -> 0
- for each (offer, person) we can train a decision tree that tells us whether the person will view and complete the offer
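
A minimal sketch of the 0/1 target described in the last two bullets (1 when an offer is both viewed and completed, 0 otherwise). The `pairs` frame and its boolean `viewed`/`completed` columns are assumptions made for illustration:

```python
import pandas as pd

# Toy (offer, person) rows; the viewed/completed flags are assumed to exist already.
pairs = pd.DataFrame({
    'viewed':    [True, True, False, False],
    'completed': [True, False, True, False],
})

# Label is 1 only when the offer was both viewed and completed, 0 otherwise.
pairs['label'] = (pairs['viewed'] & pairs['completed']).astype(int)
print(pairs['label'].tolist())
```

A column like `label` could then serve as the target for a classifier such as a decision tree.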

## Business understanding (CRISP-DM Step 1)
- In this project our goal is to predict whether we should show a given offer to a given customer.
- From a business perspective we want to show offers to people who satisfy both of the following conditions:
    - they would complete the offer if they viewed it
    - they would not complete the offer if they did not view it
- The second condition is important: a customer who would complete the offer even without viewing it did not need to be given the offer in the first place, and Starbucks could have saved revenue by not giving it to them.
- If we can identify a subset of customers that satisfies both these conditions then by showing these individuals specific offers we can capture revenue that we would have otherwise missed out on.

## Data understanding (CRISP-DM Step 2)
- we have been given data regarding each user which we can make use of as features
- Starbucks has also provided us with transcripts recording if and when a user viewed or completed the offers they were shown.
- we can use this to figure out which users actually viewed their offers before completing them.
    - we have to be careful here though: an offer that was viewed very early and an offer that was viewed late might have different effects on the behavior of the users
    - For instance, assume a user has already spent \$8 of the \$10 required to complete an offer before seeing it. If the offer has a reward of \$4, the user might be compelled to go out and spend the remaining \$2 to get the bigger reward. This should not bias us into concluding that we should give such offers to those people.
    - We should also be aware that we do not have control over when the user is going to view the offer.
- looking at the types of offers that were completed, we see that informational offers cannot be completed, since there is no difficulty/reward associated with them.
    - So in order to evaluate whether an informational offer was effective we will have to check whether the individual made a transaction during the validity period of that informational offer
- for bogo/discount offers a difficulty and a reward are specified, which we can use as features as well
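
One way the informational-offer check could look, as a sketch on toy data. It assumes each receipt carries a `time_receive`/`end_time` window in hours and that transactions have `person` and `time` columns; the frame names and values are illustrative:

```python
import pandas as pd

# Toy informational receipts with their validity window (hours).
info = pd.DataFrame({
    'person': ['a', 'b'],
    'transcript_id': [0, 1],
    'time_receive': [0, 0],
    'end_time': [72, 72],
})
# Toy transactions; amounts are illustrative.
transaction = pd.DataFrame({
    'person': ['a', 'b'],
    'time': [30, 100],
    'amount': [4.5, 7.0],
})

# Pair every receipt with the same person's transactions.
pairs = info.merge(transaction, on='person', how='left')
# Flag the transactions that fall inside the receipt's window.
in_window = pairs['time'].between(pairs['time_receive'], pairs['end_time'])
# A receipt counts as effective if at least one transaction fell inside its window.
effective = in_window.groupby(pairs['transcript_id']).any()
print(effective.to_dict())
```

This ignores whether the offer was viewed before the transaction, which the notes above suggest also matters.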

## Data preparation (CRISP-DM Step 3)
- are there any missing values?
    - the gender and income columns in the profile have missing values
    - if a row is missing gender then it is missing income and vice versa
    - we choose to drop these users out of our analysis
    - another thing we observe is that there are users who have not been shown a single offer
- are there any duplicate values?
    - There are users who might receive the same offer twice, and both of them can be marked complete using a common set of transactions.
- are there any categorical variables?
    - the offers in the portfolios have channels and offer_type which we will convert into dummy variables
- are there variables that need cleaning?
    - each event type stores a different kind of value as a dict; we will need to extract that data into separate columns
    - since a user might be shown the same offer more than once, we need to use the time columns to identify which offer receipts are related to which offer views and completions
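
A minimal sketch of two of the steps listed above: checking that gender and income are missing on the same rows, and turning `offer_type` and the list-valued `channels` column into dummy variables. The toy frames only mimic the shape of `profile` and `portfolio`; their values are made up:

```python
import pandas as pd

# Toy profile rows: gender and income are missing together.
profile = pd.DataFrame({'gender': ['F', None, 'M'],
                        'income': [72000.0, None, 55000.0]})
same_rows = (profile['gender'].isnull() == profile['income'].isnull()).all()
profile_clean = profile.dropna(subset=['gender', 'income'])

# Toy portfolio rows shaped like the real ones.
portfolio = pd.DataFrame({
    'id': ['o1', 'o2'],
    'offer_type': ['bogo', 'discount'],
    'channels': [['email', 'web'], ['email', 'mobile', 'web']],
})

# One indicator column per offer type.
type_dummies = pd.get_dummies(portfolio['offer_type'], dtype=int)
# 'channels' holds lists, so join each list into a string and split it into indicators.
channel_dummies = portfolio['channels'].str.join('|').str.get_dummies(sep='|')

encoded = pd.concat([portfolio[['id']], type_dummies, channel_dummies], axis=1)
print(encoded)
```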

In [15]:
def attach_dict_cols(df, event, col='value'):
    """
    A helper function to clean the transactions data.
    
    Args:
        df: A dataframe that contains all the transactions
        event: the type of event we want to filter and clean. This should be one of the values in the 'event' column in df
        col: The column which contains the dict describing the various attributes of this event
    Returns:
        A dataframe of the specific event with the dict in the value col transformed to separate columns.
    """
    df = df[df['event'] == event].copy()
    attributes = pd.DataFrame(list(df[col]), index=df.index)
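    # normalise keys such as 'offer id' to 'offer_id'
    attributes.columns = [name.replace(' ', '_') for name in attributes.columns]
    # drop the dict column named by `col` (not a hard-coded 'value') and attach the expanded attributes
    df = pd.concat([df.drop(col, axis=1), attributes], axis=1, sort=False).reset_index(drop=True)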
    return df

transcript['transcript_id'] = np.arange(transcript.shape[0])
received = attach_dict_cols(transcript, 'offer received')
viewed = attach_dict_cols(transcript, 'offer viewed')
completed = attach_dict_cols(transcript, 'offer completed')
transaction = attach_dict_cols(transcript, 'transaction')

#received.groupby('offer_id')['time'].value_counts().unstack()
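
To illustrate what the helper does, here is the same flattening step applied to a tiny hand-made frame. The rows and dict keys are toy values chosen to resemble the transcript; they are not taken from the data:

```python
import pandas as pd

toy = pd.DataFrame({
    'person': ['a', 'a', 'b'],
    'event': ['offer received', 'transaction', 'offer received'],
    'time': [0, 6, 0],
    'value': [{'offer id': 'o1'}, {'amount': 3.5}, {'offer id': 'o2'}],
})

rows = toy[toy['event'] == 'offer received']
# Expand each dict into columns, keeping the original row index for alignment.
attrs = pd.DataFrame(list(rows['value']), index=rows.index)
attrs.columns = [c.replace(' ', '_') for c in attrs.columns]
flat = pd.concat([rows.drop('value', axis=1), attrs], axis=1).reset_index(drop=True)
print(flat)
```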

In [16]:
portfolio['duration_hours'] = portfolio.duration * 24
portfolio['offer_id'] = portfolio.id

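# take the id from received itself: its index was reset, so transcript['transcript_id'] would align by row position
received['receive_id'] = received['transcript_id']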
received['time_receive'] = received['time']
received = pd.merge(received, portfolio[['offer_id', 'duration_hours']], on='offer_id') 
received['end_time'] = received['time'] + received.duration_hours

viewed = pd.merge_asof(viewed, received[['person', 'time', 'offer_id', 'time_receive', 'receive_id']].sort_values('time'),
                    by=['person', 'offer_id'], on='time')
received = received.drop(['receive_id', 'time'], axis=1)
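
The `merge_asof` call above links each view to a receipt of the same offer by the same person. A toy sketch of that behaviour, with made-up ids and times: with the default backward direction, each view is paired with the latest receipt at or before its own time.

```python
import pandas as pd

received = pd.DataFrame({
    'person': ['a', 'a'],
    'offer_id': ['o1', 'o1'],
    'time': [0, 168],
    'receive_id': [10, 20],
})
viewed = pd.DataFrame({
    'person': ['a', 'a'],
    'offer_id': ['o1', 'o1'],
    'time': [12, 180],
})

# Both frames must be sorted by the 'on' key before the as-of join.
matched = pd.merge_asof(viewed.sort_values('time'), received.sort_values('time'),
                        on='time', by=['person', 'offer_id'])
print(matched['receive_id'].tolist())
```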

In [17]:
rc = pd.concat([received, completed], sort=False).sort_values('time')

rec_dict = {person:{offer:deque() for offer in portfolio.id} for person in profile.id}

# Match every completion to the receipt it belongs to.
# rec_dict: person -> offer -> deque of (transcript_id, end_time), oldest receipt first
complete_list = []
for tup in received.sort_values('time_receive').itertuples(index=False):
    rec_dict[tup.person][tup.offer_id].append((tup.transcript_id, tup.end_time))

# For each completion, take the first receipt of the same offer that is still open;
# receipts that expired before the completion are discarded.
for tup in completed.itertuples(index=False):
    stash = rec_dict[tup.person][tup.offer_id]
    found = False
    while stash and (not found):
        rec = stash.popleft()
        if rec[1] >= tup.time:
            complete_list.append(rec[0])
            found = True
    if not found:
        # a bare `raise` has no active exception to re-raise here
        raise ValueError(f'no open receipt for completion {tup.transcript_id}')

assert completed.shape[0] == len(complete_list)
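
The matching rule in the loop above can be traced on a small example for a single person and offer. The receipt ids and times are toy values:

```python
from collections import deque

# Receipts of one offer for one person, oldest first: (transcript_id, end_time).
open_receipts = deque([(10, 168), (20, 240), (30, 400)])
completion_times = [100, 300]

matched = []
for t in completion_times:
    while open_receipts:
        receive_id, end_time = open_receipts.popleft()
        if end_time >= t:
            # First receipt that has not expired at completion time.
            matched.append(receive_id)
            break
print(matched)
```

The completion at hour 300 skips receipt 20, which expired at hour 240, and is attributed to receipt 30.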

In [18]:
completed['receive_id'] = complete_list

received = received.merge(viewed[['receive_id', 'time']], left_on='transcript_id', right_on='receive_id', 
                          how='left', suffixes=('', '_view'))

received = received.merge(completed[['receive_id', 'time']], left_on='transcript_id', right_on='receive_id',
                         how='left', suffixes=('_view', '_complete'))

In [20]:
df = received.copy()

In [23]:
df['viewed'] = (~df.time_view.isnull())
df['completed'] = (~df.time_complete.isnull())

In [25]:
df.groupby('viewed')['completed'].value_counts().unstack()

| viewed | completed = False | completed = True |
|---|---|---|
| False | 31803 | 25188 |
| True | 10895 | 8391 |


| | person | event | transcript_id | offer_id | time_receive | duration_hours | end_time | receive_id_view | time_view | receive_id_complete | time_complete |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | 0 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 0 | 168 | 168 | 0.0 | 6.0 | 0.0 | 132.0 |
| 1 | ebe7ef46ea6f4963a7dd49f501b26779 | offer received | 18 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 0 | 168 | 168 | NaN | NaN | NaN | NaN |
| 2 | f082d80f0aac47a99173ba8ef8fc1909 | offer received | 21 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 0 | 168 | 168 | 21.0 | 48.0 | 21.0 | 12.0 |
| 3 | c0d210398dee4a0895b24444a5fcd1d2 | offer received | 28 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 0 | 168 | 168 | 28.0 | 30.0 | 28.0 | 66.0 |
| 4 | 57dd18ec5ddc46828afb81ec5977bef2 | offer received | 30 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 0 | 168 | 168 | NaN | NaN | NaN | NaN |
