# Build matrices notebook
This notebook develops the matrices that will be used for analysis. Based on learnings from the exploratory notebook "Starbucks_Capstone_notebook.ipynb", I have defined a set of parameters that are derived from the combination of the portfolio, profile and transcript data.

## Interaction matrices
In principle, I have decided to build two main matrices: one focusing on the users and one focusing on the offers given.

Documentation of the columns for each matrix is given in a separate description. XXINSERT_REF

### Offer based interactions
The offers dataframe will have one row per offer given to a user. Each row consists of data related to the offer (taken from the portfolio data), data related to user identity (taken from the profile data), and data related to the user's interactions with the given offer (derived from the transcript data). This matrix will be the basis for investigating the user-offer interactions.

### User based interactions
The profile_exp dataframe will be built with one row per user (in principle an expansion of the profile.json data). The expansion will provide features about the user: the user details as provided in the original profile data, plus aggregated features about the user's offer and spending history. This matrix will be used for segmentation analysis of the users and their interactions.
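
A minimal sketch of how such an expansion could be aggregated, assuming an offer-level frame with the columns `user_id`, `viewed`, `completed` and `amount_in_window`. The toy data and the aggregate column names (`n_offers`, `n_viewed`, `n_completed`) are illustrative assumptions, not the final feature set.

```python
import pandas as pd

# Toy stand-ins for the real frames; all values are made up for illustration.
profile = pd.DataFrame({'id': ['u1', 'u2', 'u3']})
offer_df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'viewed': [1, 0, 1],
    'completed': [1, 0, 0],
    'amount_in_window': [10.0, 0.0, 5.5],
})

# Aggregate the offer-level rows to one row per user.
user_features = offer_df.groupby('user_id').agg(
    n_offers=('viewed', 'size'),
    n_viewed=('viewed', 'sum'),
    n_completed=('completed', 'sum'),
    amount_in_window=('amount_in_window', 'sum'),
).reset_index()

# Left-join onto the profile so that users without any offer are kept.
profile_exp = profile.merge(user_features, left_on='id', right_on='user_id', how='left')
profile_exp = profile_exp.drop(columns='user_id')

# Only the aggregate columns are filled with 0; other profile columns are left untouched.
agg_cols = ['n_offers', 'n_viewed', 'n_completed', 'amount_in_window']
profile_exp[agg_cols] = profile_exp[agg_cols].fillna(0)
```

The left join is chosen so that users who never received an offer still get a row, with zeros in the aggregated columns.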

## Info
The functions created in this notebook will be moved to different python modules as seen fit. There are a lot of helper methods needed to build both matrices, and these can be investigated in detail below, or in the 




In [34]:
import pandas as pd
import numpy as np
import math
import json
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from collections import OrderedDict
%matplotlib inline

from utils.cleaning import clean_data

In [2]:
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

#for simplicity I will not keep the original dataframes
portfolio, profile, transcript = clean_data(portfolio, profile, transcript)

In [47]:
transcript['event'].unique()

array(['offer received', 'offer viewed', 'transaction', 'offer completed'],
      dtype=object)

In [52]:
profile.shape

(16911, 5)

In [51]:
transcript.loc[transcript['event']=='offer received', 'id'].unique().shape

(16905,)
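
The two counts above differ by six (16911 profiles against 16905 ids with an "offer received" event), which suggests that a handful of users never received an offer, assuming every id in the transcript also exists in the profile. A small sketch of how those users could be listed, shown on toy data that reuses the `id` and `event` column names from the cells above:

```python
import pandas as pd

# Toy frames with the same column names as the cleaned data; values are made up.
profile = pd.DataFrame({'id': ['u1', 'u2', 'u3']})
transcript = pd.DataFrame({
    'id': ['u1', 'u1', 'u2', 'u3'],
    'event': ['offer received', 'transaction', 'offer received', 'transaction'],
})

# Ids that have at least one "offer received" event.
received_ids = transcript.loc[transcript['event'] == 'offer received', 'id'].unique()

# Profile ids that never show up among them.
no_offer_users = profile.loc[~profile['id'].isin(received_ids), 'id']
```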

## Offer based interactions

I need to separate each offer the user receives. Each offer will be treated as a unique offering, identified by the offer_id and the user_id. Based on the instructions accompanying the data, I make a few assumptions and definitions to limit the solution space.

First, I define a "valid window" as the window from when the offer is viewed until its completion. Completion is defined as whichever comes first: the "offer completed" event or the expiry of the offer. Parameters describing what happens inside this valid window are denoted with "...\_in\_window".

The rest of the total window is then defined as not valid and denoted with "...\_out\_window".
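
The window definition can be sketched as a small helper. This is an illustration under the definitions above, not the function used later in the notebook; the name `valid_window` and its signature are hypothetical. Times are in hours, as in the transcript data.

```python
def valid_window(start, duration, viewed_time=None, completed_time=None):
    """
    Return (window_start, window_end) of the valid window, or None if there is none.

    The valid window opens when the offer is viewed and closes at completion
    or, if the offer is never completed, at expiry (start + duration).
    """
    end = start + duration
    if viewed_time is None:
        return None  # never viewed: no valid window
    if completed_time is not None and viewed_time > completed_time:
        return None  # viewed only after completion: treated as not viewed
    window_end = completed_time if completed_time is not None else end
    return viewed_time, window_end


# Received at hour 0 with a 168 hour duration, viewed at 6, completed at 132.
print(valid_window(0, 168, viewed_time=6, completed_time=132))
# Received at hour 168, viewed at 216, never completed: the window runs to expiry.
print(valid_window(168, 168, viewed_time=216))
```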

Anything that happens outside the total window, which runs from the "offer received" time until the duration has passed, is not considered in this matrix. The user-based interaction matrix will contain parameters that transcend the specific offer, such as spending outside valid windows and offers.

One user can receive a specific offer_id several times. If these offers overlap, it becomes difficult to tell them apart. For now, this is ignored, in the hope that it does not happen to any user. It would seem counterproductive to give two BOGO offers of the same kind to the same user at the same time. One would also hope that the user does not need to be informed twice about the same offer.
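
Rather than only hoping, the assumption can be checked once the offer matrix exists. Below is a minimal sketch, assuming a frame with the columns `user_id`, `offer_id`, `start_time` and `end_time`; the helper name `find_overlapping_repeats` is hypothetical. The toy times for offer `A` mirror rows 2 and 4 of the sample output at the end of this notebook (windows 408 to 648 and 576 to 816 for the same user and offer_id), which look like exactly such an overlap.

```python
import pandas as pd

def find_overlapping_repeats(offers):
    """
    Return the rows that start before the previous receipt of the same
    offer_id by the same user has ended.
    """
    ordered = offers.sort_values(['user_id', 'offer_id', 'start_time'])
    # End time of the previous receipt of the same offer by the same user (NaN for the first).
    prev_end = ordered.groupby(['user_id', 'offer_id'])['end_time'].shift()
    return ordered[ordered['start_time'] < prev_end]

# Toy data with shortened ids; offer A overlaps with itself, offer B does not.
offers = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u1'],
    'offer_id': ['A', 'A', 'B', 'B'],
    'start_time': [408, 576, 0, 200],
    'end_time': [648, 816, 100, 300],
})
overlaps = find_overlapping_repeats(offers)
```

Comparing only against the previous receipt is sufficient here because repeated receipts of one offer_id share the same duration.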

We also do not differentiate whether the user has two or more offers at the same time. The transaction data is aggregated per offer the user receives. Thus, transactions that occur inside the window of more than one offer will be double-booked. In principle this is not a problem when focusing on each offer by itself, as we cannot (at least not immediately) tell which offer influences the user the most. Hence, we assume that the offers are independent of one another, regardless of when they occur.
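
To make the double booking concrete, here is a small worked example on made-up numbers. The column names `window_start` and `window_end` are assumptions for illustration; the per-offer filter mirrors the inclusive time comparison used in the matrix-building code.

```python
import pandas as pd

# Two offers of one user whose valid windows overlap between hour 24 and hour 48.
offers = pd.DataFrame({
    'offer': ['A', 'B'],
    'window_start': [0, 24],
    'window_end': [48, 72],
})
# The transaction at hour 30 falls inside both windows.
transactions = pd.DataFrame({'time': [12, 30, 60], 'amount': [5.0, 10.0, 2.0]})

def amount_in_window(row):
    """Sum the transaction amounts that fall inside the window of one offer."""
    mask = ((transactions['time'] >= row['window_start']) &
            (transactions['time'] <= row['window_end']))
    return transactions.loc[mask, 'amount'].sum()

offers['amount_in_window'] = offers.apply(amount_in_window, axis=1)

# The per-offer sums add up to more than the user actually spent.
double_booked = offers['amount_in_window'].sum() - transactions['amount'].sum()
```

Offer A is credited with 15.0 and offer B with 12.0, so the per-offer total of 27.0 exceeds the actual spending of 17.0 by the 10.0 transaction that was counted twice.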



In [56]:
def get_user_offer_ids(user_transcript):
    """
    Extracts offer ids presented to the user    
    """
    offer_ids = [(i, offer_id) for i, offer_id in
                 enumerate(user_transcript.loc[user_transcript['event'] == 'offer received', 'offer_id'])]
    return offer_ids

def get_user_offer_starts(user_transcript):
    """
    Extracts start times of offers presented to the user    
    """
    offers_start = np.array(user_transcript.loc[user_transcript['event'] == 'offer received', 'time'])
    return offers_start

def get_user_offer_types(portfolio, offer_ids):
    """
    Extracts offer types of offers presented to the user    
    """
    offers_type = np.array(
        [portfolio.loc[portfolio['id'] == offer_id, 'offer_type'].values.astype(str)[0] for offer_id in offer_ids])
    return offers_type

def get_user_offer_difficulties(portfolio, offer_ids):
    """
    Extracts difficulty of offers presented to the user
    """
    offers_difficulty = np.array(
        [portfolio.loc[portfolio['id'] == offer_id, 'difficulty'].values.astype(str)[0] for offer_id in offer_ids])
    return offers_difficulty

def get_user_offer_rewards(portfolio, offer_ids):
    """
    Extracts reward of offers presented to the user
    """
    offers_reward = np.array(
        [portfolio.loc[portfolio['id'] == offer_id, 'reward'].values.astype(str)[0] for offer_id in offer_ids])
    return offers_reward

def get_user_offer_durations(portfolio, offer_ids):
    """
    Extracts duration of offers presented to the user, converted from days to hours
    """
    offers_duration = np.array(
        [portfolio.loc[portfolio['id'] == offer_id, 'duration'].values.astype(int)[0] * 24 for offer_id in offer_ids])
    return offers_duration

def get_user_offer_views(user_transcript):
    """
    Extracts (time, offer_id) pairs of the offers viewed by the user
    """
    offers_viewed = np.array(user_transcript.loc[user_transcript['event'] == 'offer viewed', ['time', 'offer_id']])
    return offers_viewed

def get_user_offer_completions(user_transcript):
    """
    Extracts (time, offer_id) pairs of the offers completed by the user
    """
    offers_completed = np.array(
        user_transcript.loc[user_transcript['event'] == 'offer completed', ['time', 'offer_id']])
    return offers_completed

    
def build_offer_df(portfolio, profile, transcript):
    #iterate over users
    users = profile['id'].unique()
    
    offers = {}
    count = 0
    count_users_no_offer = 0
    for user in users[:3000]:
        # transcripts for specific user
        user_transcript = transcript.loc[transcript['id'] == user, :]
        user_transactions = user_transcript.loc[user_transcript['event'] == 'transaction', ['time', 'amount']]        
        offer_ids_tuples = get_user_offer_ids(user_transcript)
        if len(offer_ids_tuples) < 1:  # if there are no offers given to user, skip to the next user
            count_users_no_offer += 1
            continue  # `break` would stop processing all remaining users
        offer_ids = list(list(zip(*offer_ids_tuples))[1])
        
        offers_start = get_user_offer_starts(user_transcript)
        offers_duration = get_user_offer_durations(portfolio, offer_ids)
        offers_difficulty = get_user_offer_difficulties(portfolio, offer_ids)
        offers_reward = get_user_offer_rewards(portfolio, offer_ids) 
        offers_type = get_user_offer_types(portfolio, offer_ids)
        offers_viewed = get_user_offer_views(user_transcript) 
        offers_completed = get_user_offer_completions(user_transcript)
        
        offers_end = offers_start + offers_duration
        
        #Test if results are as expected
        assert len(offer_ids) == len(offers_start) , "The number of offerings ({}) are not the same as the number of starting points ({})".format(len(offer_ids), len(offers_start))
        assert len(offer_ids) == len(offers_type) , "The number of offerings ({}) are not the same as the number of offer types ({})".format(len(offer_ids), len(offers_type))
        assert len(offer_ids) == len(offers_difficulty) , "The number of offerings ({}) are not the same as the number of offer difficulties ({})".format(len(offer_ids), len(offers_difficulty))
        assert len(offer_ids) == len(offers_reward) , "The number of offerings ({}) are not the same as the number of offer rewards ({})".format(len(offer_ids), len(offers_reward))
        assert len(offer_ids) == len(offers_duration) , "The number of offerings ({}) are not the same as the number of offer durations ({})".format(len(offer_ids), len(offers_duration))
        
        #iterate over offers and build dict to be used to fill a dataframe
        for i, offer_id in offer_ids_tuples:
            start = offers_start[i]
            duration = offers_duration[i]
            end = offers_end[i]
            kind = offers_type[i]
            reward = offers_reward[i]
            difficulty = offers_difficulty[i]
            
            # identify completion event within the offer
            completed_time = None
            completed = 0  # 0 if no completion event, 1 if completion event
            for time, completion_offer_id in offers_completed:
                if completion_offer_id == offer_id and start <= time <= end:
                    completed_time = time
                    completed = 1
                    break

            # identify view event within the offer, views after completion will be regarded as not viewed
            viewed_time = None
            viewed = 0  # 0 if no view event, 1 if view event
            for time, viewed_offer_id in offers_viewed:
                # compare against None so that a completion at time 0 is not ignored
                if completed_time is not None and time > completed_time:
                    break  # do not accept if time of viewing is after time of completion
                if viewed_offer_id == offer_id and start <= time <= end:
                    viewed_time = time
                    viewed = 1
                    break

            # calculate valid window related parameters
            time_in_window = 0
            amount_in_window = 0
            if viewed:
                # time from viewed to completion or end of offer window
                if completed_time is not None:
                    time_in_window = completed_time - viewed_time
                else:
                    time_in_window = end - viewed_time
                # cumulative amount spent in valid window, if no valid window, no amount spent due to offer
                transactions_in_window = user_transactions.loc[(user_transactions['time'] >= viewed_time) &
                                                               (user_transactions['time'] <= viewed_time + time_in_window), :]

                amount_in_window = transactions_in_window['amount'].sum()
            
            
            
            
            offers.update({count: {'offer_id': offer_id,
                                   'user_id': user,
                                   'offer_type': kind,
                                   'difficulty': difficulty,
                                   'reward': reward,
                                   'start_time': start,
                                   'duration': duration,
                                   'end_time': end,
                                   'viewed': viewed,
                                   'view_time': viewed_time,
                                   'completed': completed,
                                   'complet_time': completed_time,
                                   'time_in_window': time_in_window,
                                   'amount_in_window': amount_in_window}})
            count+=1
    dtype = [str, str, str, float, float, float, float, float, int, float, int, float, float, float]
    offer_df = pd.DataFrame.from_dict(offers, orient='index')
    print("{} received no offer".format(count_users_no_offer))
    return offer_df

offer_df = build_offer_df(portfolio, profile, transcript)
pd.to_pickle(offer_df, 'offer_df.pkl')  # the file name is a placeholder; adjust the path as needed

1 received no offer


| | offer_id | user_id | offer_type | difficulty | reward | start_time | duration | end_time | viewed | view_time | completed | complet_time | time_in_window | amount_in_window |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2906b810c7d4411798c6938adc9daaa5 | 68be06ca386d4c31939f3a4f0e3dd783 | discount | 10 | 2 | 168 | 168 | 336 | 1 | 216.0 | 0 | | 120 | 0.00 |
| 1 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 68be06ca386d4c31939f3a4f0e3dd783 | discount | 20 | 5 | 336 | 240 | 576 | 1 | 348.0 | 0 | | 228 | 10.52 |
| 2 | fafdcd668e3743c1bb461111dcafc2a4 | 68be06ca386d4c31939f3a4f0e3dd783 | discount | 10 | 2 | 408 | 240 | 648 | 1 | 408.0 | 1 | 552.0 | 144 | 10.17 |
| 3 | 2298d6c36e964ae4a3e7e9706d1fb8c2 | 68be06ca386d4c31939f3a4f0e3dd783 | discount | 7 | 3 | 504 | 168 | 672 | 1 | 504.0 | 1 | 552.0 | 48 | 7.54 |
| 4 | fafdcd668e3743c1bb461111dcafc2a4 | 68be06ca386d4c31939f3a4f0e3dd783 | discount | 10 | 2 | 576 | 240 | 816 | 1 | 582.0 | 0 | | 234 | 9.88 |
| 5 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 0610b486422d4921ae7d2bf64640c50b | bogo | 5 | 5 | 408 | 168 | 576 | 0 | | 1 | 528.0 | 0 | 0.00 |
| 6 | 3f207df678b143eea3cee63160fa8bed | 0610b486422d4921ae7d2bf64640c50b | informational | 0 | 0 | 504 | 96 | 600 | 0 | | 0 | | 0 | 0.00 |
| 7 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 38fe809add3b4fcf9315a9694bb96ff5 | bogo | 5 | 5 | 168 | 168 | 336 | 1 | 168.0 | 0 | | 168 | 0.00 |
| 8 | 5a8bc65990b245e5a138643cd4eb9837 | 38fe809add3b4fcf9315a9694bb96ff5 | informational | 0 | 0 | 576 | 72 | 648 | 0 | | 0 | | 0 | 0.00 |
| 9 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 78afa995795e4d85b5d9ceeca43f5fef | bogo | 5 | 5 | 0 | 168 | 168 | 1 | 6.0 | 1 | 132.0 | 126 | 19.89 |


In [None]:
#pickle