# Initial Data Processing (Starbucks customer behaviour)

This notebook provides an initial understanding of the datasets Starbucks has provided on customer behaviour in the context of its incentivised marketing campaigns. It looks at which data points are provided, where data is missing, and how the data could be used to inform predictions for future marketing efforts.

The files looked at in this notebook are:
- portfolio.json: offer ids and metadata about each offer (duration, type, etc.)
- profile.json: demographic data for each customer
- transcript.json: records for transactions, offers received, offers viewed, and offers completed

### Imports

In [4]:
import pandas as pd
import numpy as np
import json

### Functions

In [34]:
# Space for any functions if needed

### Global Variables

The first thing I will do is load in each of the data files and look at which variables they each contain to gain some understanding of each of the datasets.

In [40]:
# read in each of the data files & look at the variables involved
portfolio_df = pd.read_json('data/portfolio.json', lines=True)
portfolio_df.head()

Unnamed: 0,channels,difficulty,duration,id,offer_type,reward
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5


In [30]:
profile_df = pd.read_json('data/profile.json', lines=True)
profile_df.head()

Unnamed: 0,age,became_member_on,gender,id,income
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,


In [41]:
transcript_df = pd.read_json('data/transcript.json', lines=True)
transcript_df.head()

Unnamed: 0,event,person,time,value
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'}
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'}
2,offer received,e2127556f4f64592b11af22de27a7932,0,{'offer id': '2906b810c7d4411798c6938adc9daaa5'}
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'}


### Initial Analysis

Next I will perform some initial analysis. This will be very basic (checking for missing data, the values in each column, etc.) and will provide the basis for the processing in the next step.

#### Portfolio data

In [42]:
# First I'll look at the shape of the portfolio data
portfolio_df.shape

(10, 6)

In [46]:
# look into the types of offers available
portfolio_df['offer_type'].unique()

array(['bogo', 'informational', 'discount'], dtype=object)

In [49]:
# look into the spread of the numerical data difficulty, duration & reward
portfolio_df.describe()

Unnamed: 0,difficulty,duration,reward
count,10.0,10.0,10.0
mean,7.7,6.5,4.2
std,5.831905,2.321398,3.583915
min,0.0,3.0,0.0
25%,5.0,5.0,2.0
50%,8.5,7.0,4.0
75%,10.0,7.0,5.0
max,20.0,10.0,10.0


In [54]:
# display the channels that each campaign was delivered on
portfolio_df.channels

0         [email, mobile, social]
1    [web, email, mobile, social]
2            [web, email, mobile]
3            [web, email, mobile]
4                    [web, email]
5    [web, email, mobile, social]
6    [web, email, mobile, social]
7         [email, mobile, social]
8    [web, email, mobile, social]
9            [web, email, mobile]
Name: channels, dtype: object

From the above analysis I can see that there are 10 campaigns covering three types of offer: bogo (buy one, get one free), informational and discount. They were distributed across web, email, mobile and social media, and ran for between 3 and 10 days. The median difficulty (minimum spend) was 8.5 dollars (mean 7.7) and the median reward was 4 dollars (mean 4.2).
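Because each entry in `channels` is a list, the distinct channels can also be derived programmatically rather than read off the printed column. The following is a minimal sketch on a hypothetical three-row stand-in (`sample_portfolio`) built from rows in the preview above; it assumes a pandas version that provides `Series.explode`.

```python
import pandas as pd

# Hypothetical stand-in for portfolio_df, using three rows from the preview
sample_portfolio = pd.DataFrame({
    "id": ["ae264e3637204a6fb9bb56bc8210ddfd",
           "3f207df678b143eea3cee63160fa8bed",
           "0b1e1539f2cc45b7b9fa7c272da2e1d7"],
    "channels": [["email", "mobile", "social"],
                 ["web", "email", "mobile"],
                 ["web", "email"]],
})

# explode() gives each list element its own row, so the usual Series methods apply
channel_counts = sample_portfolio["channels"].explode().value_counts()
unique_channels = sorted(channel_counts.index)

print(unique_channels)
print(channel_counts.to_dict())
```

Run against the full `portfolio_df`, the same two lines would show how many of the 10 campaigns used each channel.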

#### Profile data

In [56]:
profile_df.head()

Unnamed: 0,age,became_member_on,gender,id,income
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,


In [55]:
# see how many consumers are in the profile dataset
profile_df.shape

(17000, 5)

In [60]:
# check the spread of ages and income in the data
profile_df.describe()

Unnamed: 0,age,income
count,17000.0,14825.0
mean,62.531412,65404.991568
std,26.73858,21598.29941
min,18.0,30000.0
25%,45.0,49000.0
50%,58.0,64000.0
75%,73.0,80000.0
max,118.0,120000.0


In [61]:
# check spread of data in the gender column
profile_df.groupby('gender').count()

Unnamed: 0_level_0,age,became_member_on,id,income
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
F,6129,6129,6129,6129
M,8484,8484,8484,8484
O,212,212,212,212


In [63]:
# check the spread of dates from when users became a member
# the values look like YYYYMMDD, so state the format explicitly when parsing
pd.to_datetime(profile_df['became_member_on'], format='%Y%m%d').describe()

count                   17000
unique                   1716
top       2017-12-07 00:00:00
freq                       43
first     2013-07-29 00:00:00
last      2018-07-26 00:00:00
Name: became_member_on, dtype: object
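The preview shows `became_member_on` as values such as 20170212. If the column is stored as integers, `pd.to_datetime` called without a format treats them as offsets from the Unix epoch rather than calendar dates, so it can be safer to state the layout explicitly. This is a minimal sketch on a hypothetical three-value sample (`joined_raw`), assuming the values are always eight digits in year-month-day order.

```python
import pandas as pd

# Hypothetical sample mirroring the YYYYMMDD values seen in became_member_on
joined_raw = pd.Series([20170212, 20170715, 20180712], name="became_member_on")

# Casting to str and giving an explicit format makes pandas read the digits
# as year, month and day regardless of how the column was stored
joined = pd.to_datetime(joined_raw.astype(str), format="%Y%m%d")

print(joined.min(), joined.max())
print(joined.dt.year.value_counts().sort_index().to_dict())
```

Once parsed, the `.dt` accessor makes it straightforward to derive features such as the joining year or the length of membership.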

In [66]:
# see how many missing values are in the income column
profile_df['income'].isna().sum()

2175

From the profile data I can see that there are 17,000 consumers, of whom 6,129 are recorded as female, 8,484 as male and 212 as 'O' (presumably other); the remaining 2,175 have no gender recorded. The users joined between 29th July 2013 and 26th July 2018. Income is available for 14,825 users and ranges from 30,000 to 120,000, with a median of 64,000 (mean about 65,400). Ages start at 18 and the upper quartile is 73, but the maximum of 118 looks like a placeholder for an unspecified age: in the preview, every row with age 118 is also missing gender and income. These could be users who are more wary of sharing personal data.
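One way to check the age observation is to test whether the rows with age 118 are the same rows that lack gender and income, and then treat that age as missing. The sketch below uses a hypothetical stand-in (`sample_profile`) built from the five preview rows; reading 118 as "not provided" is an assumption, not something the data files state.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for profile_df, built from the five rows in the preview
sample_profile = pd.DataFrame({
    "age": [118, 55, 118, 75, 118],
    "gender": [None, "F", None, "F", None],
    "income": [np.nan, 112000.0, np.nan, 100000.0, np.nan],
})

# Do the rows with age 118 coincide with missing gender and income?
is_placeholder = sample_profile["age"] == 118
print(sample_profile.loc[is_placeholder, ["gender", "income"]].isna().all().to_dict())

# Assumption: 118 means "not provided", so replace it with NaN
cleaned = sample_profile.assign(age=sample_profile["age"].where(~is_placeholder))
print(cleaned["age"].describe()[["count", "min", "max"]].to_dict())
```

If the same pattern holds on the full `profile_df`, the age statistics should be recomputed after the replacement, since the placeholder inflates both the mean and the maximum.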

#### Transcript data

In [67]:
transcript_df.head()

Unnamed: 0,event,person,time,value
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'}
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'}
2,offer received,e2127556f4f64592b11af22de27a7932,0,{'offer id': '2906b810c7d4411798c6938adc9daaa5'}
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'}


In [68]:
# how many interactions have been encountered
transcript_df.shape

(306534, 4)

In [69]:
# look into how the time column is formatted/the spread of the data
transcript_df.describe()

Unnamed: 0,time
count,306534.0
mean,366.38294
std,200.326314
min,0.0
25%,186.0
50%,408.0
75%,528.0
max,714.0


In [72]:
# count how many events of each type were logged
transcript_df.groupby('event').count()

Unnamed: 0_level_0,person,time,value
event,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
offer completed,33579,33579,33579
offer received,76277,76277,76277
offer viewed,57725,57725,57725
transaction,138953,138953,138953


In [74]:
# number of unique users in the dataset
len(transcript_df['person'].unique())

17000

In [73]:
# see if there is any missing data in the dataset
transcript_df.isna().sum()

event     0
person    0
time      0
value     0
dtype: int64

In [101]:
# look at the value column: pull out the first key and its value from each dict
# use .copy() so the new columns are not added to transcript_df itself
transcript_df_copy = transcript_df.copy()
transcript_df_copy['type'] = [list(x.keys())[0] for x in transcript_df['value']]
transcript_df_copy['campaign'] = [list(x.values())[0] for x in transcript_df['value']]
transcript_df_copy.groupby('type').count()

Unnamed: 0_level_0,event,person,time,value,campaign
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
amount,138953,138953,138953,138953,138953
offer id,134002,134002,134002,134002,134002
offer_id,33579,33579,33579,33579,33579


From the transcript data we can see that 306,534 events have been logged by Starbucks. The largest group (138,953) are transactions, alongside 76,277 offers received, 57,725 viewed and 33,579 completed. The time column ranges from 0 to 714 and appears to be measured in hours rather than minutes, since 714 hours is around 30 days. The dataset contains 17,000 unique users, the same number as in the profile data. The value column contains either a campaign id or the amount of the transaction processed; the campaign id is stored under two different keys, 'offer id' (134,002 rows) and 'offer_id' (33,579 rows), which will need to be unified.
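The dictionaries in the value column can be flattened into ordinary columns so that both spellings of the offer key end up in one place. The following is a minimal sketch on a hypothetical three-row stand-in (`sample_transcript`); the helper name `get_offer_id` and the transaction amount are illustrative, and the sketch only handles the three keys seen in the output above.

```python
import pandas as pd

# Hypothetical stand-in for transcript_df; the amount is an illustrative value
sample_transcript = pd.DataFrame({
    "event": ["offer received", "offer completed", "transaction"],
    "value": [
        {"offer id": "9b98b8c7a33c4b65b9aebfe6a799e6d9"},
        {"offer_id": "9b98b8c7a33c4b65b9aebfe6a799e6d9"},
        {"amount": 12.5},
    ],
})

def get_offer_id(value):
    """Return the offer id under either spelling of the key, else None."""
    return value.get("offer id", value.get("offer_id"))

# Build one column per kind of value, then drop the original dict column
tidy = sample_transcript.assign(
    offer_id=sample_transcript["value"].map(get_offer_id),
    amount=sample_transcript["value"].map(lambda v: v.get("amount")),
).drop(columns="value")

print(tidy)
```

With the offer id in a single column, the transcript can later be joined to the portfolio data on that id.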

To gain further insights, and to split the data into the demographics needed for the modelling, more processing needs to be performed.