# Starbucks Capstone Challenge
Notebook 1 of 4

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. 

Not all users receive the same offer, and that is the challenge to solve with this data set.

The task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

Transactional data shows user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer. 

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

### Example

To give an example, a user could receive a discount offer of 'buy 10 dollars get 2 off' on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off offer", but the user never opens the offer during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
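The distinction between a completed offer and an influenced one can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `offer_influenced` and the rule "viewed before completion, inside the validity window" are hypothetical choices, not something defined by the data set. Times are in hours, matching the `time` field of the transcript.

```python
HOURS_PER_DAY = 24

def offer_influenced(received_h, viewed_h, completed_h, duration_days):
    """Return True only if the offer was viewed and then completed while still valid.

    All times are hours since the start of the test; None means the event
    never happened. The rule itself is an assumption made for illustration.
    """
    if viewed_h is None or completed_h is None:
        return False
    expires_h = received_h + duration_days * HOURS_PER_DAY
    return received_h <= viewed_h <= completed_h <= expires_h

# Viewed on day 1 and completed on day 3 of a 10-day offer
print(offer_influenced(0, 24, 72, 10))    # True
# Completed without ever being viewed: a completion record exists, but no influence
print(offer_influenced(0, None, 72, 10))  # False
```

Under this rule, the second case mirrors the example above: the completion is recorded, yet it should not be credited to the offer.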

### Cleaning

The data cleaning will be especially important and tricky due to the complexity of the transactional data.

Additionally, some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.

### General Direction

Because this is a capstone project, the direction of the analysis is left to the learner. Udacity provides a few suggestions, stating you could:
- build a machine learning model that predicts how much someone will spend based on demographics and offer type
- build a model that predicts whether or not someone will respond to an offer
- develop a set of heuristics that determine what offer you should send to each customer

### Problem Statement
This is a classic project of evaluating marketing strategies. A simplification of the problem is a marketing organization that has applied promotions across a wide range of customers without knowledge of whether or not these promotions are successful and without targeting customers with specific promotions. The next logical step is to apply analytics to determine effectiveness and optimize the marketing strategy to positively impact the company’s bottom line. Various promotions occurred across different customer segments and analysis on the results can provide valuable insights to employ customer-targeting techniques.

Inputs to this problem are 10 different marketing promotions, basic demographic information of 17,000 customers, and 300,000 interactions of the customers with the app. After clustering the customers into segments, linear regression will determine whether each promotional strategy had a positive, negative, or neutral effect on customer spending within a given segment.

### Solution Statement
A proposed solution will be to determine whether each promotional strategy has a significant effect (positive or negative) on each customer segment. Based on the datasets, we will cluster the customer data into similar groups using k-means and then separately evaluate efficacy of promotional strategies for each group. Efficacy will be determined through linear regression to determine whether the promotion is statistically significant (rejecting the null hypothesis) compared to no promotion. If it is significant and the coefficient is positive, then the promotion will be considered good. If it is significant but with a negative coefficient then it will be considered bad. If it is not significant then the promotion will be considered neutral. A determination of good, neutral, and bad promotions will become evident for each segment, supporting a potential improvement in overall performance of the Starbucks marketing department.
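The good / neutral / bad rule described above can be written as a small helper. This is a minimal sketch: the function name `classify_promotion` and the 0.05 significance threshold are assumptions, since no threshold is fixed in the text.

```python
def classify_promotion(coefficient, p_value, alpha=0.05):
    """Label a promotion from its regression coefficient and p-value.

    alpha is an assumed significance level, not one taken from the project.
    """
    if p_value >= alpha:
        # cannot reject the null hypothesis of no effect
        return "neutral"
    return "good" if coefficient > 0 else "bad"

print(classify_promotion(2.1, 0.01))   # good
print(classify_promotion(-1.4, 0.03))  # bad
print(classify_promotion(0.8, 0.40))   # neutral
```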

### Project Design
This project uses 3 different datasets containing a significant amount of data. The data must be thoroughly cleaned and explored before it can support our machine learning models. Careful work at this stage should also yield additional insights and improve the quality of the final product.

A preliminary workflow will follow these steps:

1.      Data cleaning and exploration - missing data in the profile.json file will need to be cleaned and the data in the transcript file will need to be better understood. transcript.json will be transformed to determine whether a specific purchase was made during a promotional period. Transformation of the features in the data will be required to support regression.

2.      Principal Component Analysis (PCA) on profile.json to determine key relationships between variables and rank them in order of importance. Determination of how many components to include in final model.

3.      K-means clustering on the PCA dataset to define customer segments. An elbow graph will determine the appropriate number of clusters.

4.      Merge the transformed transcript matrix with the newly defined customer segments.

5.      Processing of the transaction and promotion data to flag transactions as to whether they were influenced by a specific promotion.

6.      Split the merged matrix into separate training and test datasets for each customer segment.

7.      Perform linear regression on this data to determine how each customer segment reacts to different promotional strategies vs. the baseline of no promotion.

8.      Repeat regression as needed and evaluate results of linear regression to determine if each promotion was good, bad, or neutral for each customer segment.

9.      Repeat the analysis following only steps 1, 5, 6 and 7, i.e. develop a benchmark model. Evaluate the benchmark model vs. the complete model to observe the usefulness of customer segmentation.


# Starting Data Sets

The data is contained in three files:

* portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
* profile.json - demographic data for each customer
* transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer, i.e. BOGO, discount, informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings)

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

**transcript.json**
* event (str) - record description (i.e. transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value (dict of strings) - either an offer id or transaction amount depending on the record
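Because `value` holds a different key depending on the event, it usually has to be expanded into columns before use. The sketch below uses toy records shaped like the schema above; the `'offer id'` key name and the `day` column are assumptions made for illustration (only the `'amount'` key appears in the output shown later in this notebook).

```python
import pandas as pd

# Toy records shaped like transcript.json; the 'offer id' key name is assumed
toy_transcript = pd.DataFrame({
    "event": ["offer received", "transaction"],
    "person": ["a", "a"],
    "time": [0, 30],
    "value": [{"offer id": "offer_1"}, {"amount": 12.5}],
})

# Expand each dict into its own columns; keys missing from a record become NaN
expanded = pd.concat(
    [toy_transcript.drop(columns="value"),
     pd.DataFrame(toy_transcript["value"].tolist())],
    axis=1,
)

# time is in hours, while offer durations are in days
expanded["day"] = expanded["time"] // 24
print(expanded)
```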

# Data importing and observation
Our first step will be to read in the data and do some simple observations.

In [1]:
import pandas as pd
import numpy as np
import math
import json
from IPython.display import display, HTML
%matplotlib inline

prefix = 'raw_data'

# read in the json files
portfolio = pd.read_json(prefix + '/portfolio.json', orient='records', lines=True)
profile = pd.read_json(prefix + '/profile.json', orient='records', lines=True)
transcript = pd.read_json(prefix + '/transcript.json', orient='records', lines=True)

In [2]:
print("portfolio:   {} rows,      {} columns".format(portfolio.shape[0],portfolio.shape[1]))
print("profile:     {} rows,   {} columns".format(profile.shape[0],profile.shape[1]))
print("transcript:  {} rows,  {} columns".format(transcript.shape[0],transcript.shape[1]))

portfolio:   10 rows,      6 columns
profile:     17000 rows,   5 columns
transcript:  306534 rows,  4 columns


# Customer Data
First we will dive into our profile.json dataset which contains information on 17000 customers. 

In [3]:
display(HTML(profile.iloc[0:15].to_html()))

Unnamed: 0,age,became_member_on,gender,id,income
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,
5,68,20180426,M,e2127556f4f64592b11af22de27a7932,70000.0
6,118,20170925,,8ec6ce2a7e7949b1bf142def7d0e0586,
7,118,20171002,,68617ca6246f4fbc85e91a2a49552598,
8,65,20180209,M,389bc3fa690240e798340f5a15918d5c,53000.0
9,118,20161122,,8974fc5686fe429db53ddde067b88302,


It is immediately apparent that age, gender and income each have a significant number of invalid entries. We will assume that an age of 118 is invalid, that gender should not be None, and that income should not be NaN. Many rows in this table appear to have all 3 invalid values at once. For the purposes of our analysis, we will treat those rows as a separate group and not use them, since it does not seem reasonable that we could accurately impute all 3 pieces of data.
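One way to check whether the three invalid values always occur together is to compare the rows where all three conditions hold with the rows where any one of them holds. The sketch below runs on a small toy frame, not on the real profile data, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy rows imitating the pattern seen in the profile table
toy_profile = pd.DataFrame({
    "age": [118, 55, 118, 75],
    "gender": [None, "F", None, "F"],
    "income": [np.nan, 112000.0, np.nan, 100000.0],
})

bad_age = toy_profile["age"] == 118
bad_gender = toy_profile["gender"].isnull()
bad_income = toy_profile["income"].isna()

all_three = bad_age & bad_gender & bad_income
any_of_three = bad_age | bad_gender | bad_income

# Rows with only some of the invalid values would need separate handling
partial_rows = any_of_three & ~all_three
print(all_three.sum(), partial_rows.sum())
```

If `partial_rows` is empty on the real data, the three conditions mark exactly the same customers and a single flag is enough.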

In [4]:
profile['invalid'] = 0
profile.loc[(profile['age'] == 118) & (profile['gender'].isnull()) & (profile['income'].isna()), 'invalid'] = 1
display(HTML(profile.iloc[0:15].to_html()))

Unnamed: 0,age,became_member_on,gender,id,income,invalid
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,,1
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0,0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,,1
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0,0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,,1
5,68,20180426,M,e2127556f4f64592b11af22de27a7932,70000.0,0
6,118,20170925,,8ec6ce2a7e7949b1bf142def7d0e0586,,1
7,118,20171002,,68617ca6246f4fbc85e91a2a49552598,,1
8,65,20180209,M,389bc3fa690240e798340f5a15918d5c,53000.0,0
9,118,20161122,,8974fc5686fe429db53ddde067b88302,,1


For now we will add a customer segment column with i (for invalid) for invalid rows and o (for others) for all other rows. We'll then separate the valid and invalid rows into 2 separate datasets and use only the valid rows for our customer segmentation analysis.

In [5]:
profile['segment'] = 'o'
profile.loc[(profile['invalid']==1), 'segment'] = 'i'

In [6]:
valid_customers = profile.loc[(profile['segment']=='o')]
valid_customers = valid_customers.drop(columns=['invalid','segment'])
valid_customers = valid_customers.reset_index(drop = True)
display(HTML(valid_customers.iloc[0:15].to_html()))
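# count the invalid rows instead of hard-coding the number
n_invalid = profile.shape[0] - valid_customers.shape[0]
print("{:,} customers are labeled invalid. This means {}% of our customers have valid input.".format(n_invalid, round(100*valid_customers.shape[0]/profile.shape[0],1)))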

Unnamed: 0,age,became_member_on,gender,id,income
0,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
1,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
2,68,20180426,M,e2127556f4f64592b11af22de27a7932,70000.0
3,65,20180209,M,389bc3fa690240e798340f5a15918d5c,53000.0
4,58,20171111,M,2eeac8d8feae4a8cad5a6af0499a211d,51000.0
5,61,20170911,F,aa4862eba776480b8bb9c68455b8c2e1,57000.0
6,26,20140213,M,e12aeaf2d47d42479ea1c4ac3d8286c6,46000.0
7,62,20160211,F,31dda685af34476cad5bc968bdb01c53,71000.0
8,49,20141113,M,62cf5e10845442329191fc246e7bcea3,52000.0
9,57,20171231,M,6445de3b47274c759400cd68131d91b4,42000.0


2,175 customers are labeled invalid. This means 87.2% of our customers have valid input.


# Spending Variables

We will create 2 new variables in this dataset, which represent the number of purchases during the period and the total amount spent. In order to develop these variables, we must extract them from the transactions in transcript.json.
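As a compact illustration of the two variables, both can be computed with a single `groupby` aggregation. This is a sketch on toy data, offered as an alternative to the step-by-step cells that follow rather than a replacement for them; the frame names are illustrative.

```python
import pandas as pd

# Toy transactions: one row per purchase
toy_transactions = pd.DataFrame({
    "id": ["a", "b", "a", "c", "a"],
    "amount": [5.0, 12.5, 7.5, 0.5, 2.5],
})

# Count and sum the purchases per customer in one pass
toy_spending = (
    toy_transactions.groupby("id")["amount"]
    .agg(["count", "sum"])
    .rename(columns={"count": "num purchases", "sum": "total amount purchased"})
)
print(toy_spending)
```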

In [7]:
#display(HTML(transcript.iloc[0:15].to_html()))
print(transcript.event.unique())

['offer received' 'offer viewed' 'transaction' 'offer completed']


In [8]:
transactions = transcript[transcript.event == 'transaction']
transactions = transactions.reset_index(drop = True)
display(HTML(transactions.iloc[0:5].to_html()))

Unnamed: 0,event,person,time,value
0,transaction,02c083884c7d45b39cc68e1314fec56c,0,{'amount': 0.8300000000000001}
1,transaction,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,{'amount': 34.56}
2,transaction,54890f68699049c2a04d415abc25e717,0,{'amount': 13.23}
3,transaction,b2f1cd155b864803ad8334cdf13c4bd2,0,{'amount': 19.51}
4,transaction,fe97aa22dd3e48c8b143116a8403dd52,0,{'amount': 18.97}


In [9]:
#cleaning up
transcript = None

To extract our features and compare the 2 dataframes, we need to transform this dataframe into a new dataframe with index of person (equal to the id in the profile data) and the column data equal to the amount of the purchase.

In [10]:
#create amount column
# Series.iteritems() was removed in pandas 2.0; build the frame from the list of dicts instead
transactions = pd.concat([transactions, pd.DataFrame(transactions['value'].tolist())], axis=1)

In [11]:
transactions['id'] = transactions['person']
transactions = transactions.loc[:, ['id', 'amount']]

In [12]:
display(HTML(transactions.iloc[0:5].to_html()))
print(transactions.shape)

Unnamed: 0,id,amount
0,02c083884c7d45b39cc68e1314fec56c,0.83
1,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,34.56
2,54890f68699049c2a04d415abc25e717,13.23
3,b2f1cd155b864803ad8334cdf13c4bd2,19.51
4,fe97aa22dd3e48c8b143116a8403dd52,18.97


(138953, 2)


In [13]:
transaction_counts = transactions.groupby(['id']).count()
amount_purchased = transactions.groupby(['id']).sum()
transaction_counts = transaction_counts.rename(columns={'amount':'num purchases'})
amount_purchased = amount_purchased.rename(columns={'amount':'total amount purchased'})

In [14]:
spending_features = pd.concat([transaction_counts, amount_purchased], axis=1)

In [15]:
#cleaning up
transaction_counts = amount_purchased = None

In [16]:
display(HTML(spending_features.iloc[0:5].to_html()))

id,num purchases,total amount purchased
0009655768c64bdeb2e877511632db8f,8,127.6
00116118485d4dfda04fdbaba9a87b5c,3,4.09
0011e0d4e6b944f998e987f904e8c1e5,5,79.46
0020c2b971eb4e9188eac86d93036a77,8,196.86
0020ccbbb6d84e358d3414a3ff76cffd,12,154.05


In [17]:
spending_features.describe()

Unnamed: 0,num purchases,total amount purchased
count,16578.0,16578.0
mean,8.381771,107.096874
std,5.009822,126.393939
min,1.0,0.05
25%,5.0,23.6825
50%,7.0,72.41
75%,11.0,150.9375
max,36.0,1608.69


In [18]:
sorted_spending = spending_features.sort_values(by=['total amount purchased'], ascending=False)
display(HTML(sorted_spending.iloc[0:5].to_html()))
display(HTML(sorted_spending.iloc[-6:-1].to_html()))

id,num purchases,total amount purchased
3c8d541112a74af99e88abbd0692f00e,8,1608.69
f1d65ae63f174b8f80fa063adcaa63b7,13,1365.66
ae6f43089b674728a50b8727252d3305,16,1327.74
626df8678e2a4953b9098246418c9cfa,13,1321.42
73afdeca19e349b98f09e928644610f8,10,1319.97


id,num purchases,total amount purchased
c65086b345504ed398ffa2ed28e13d51,1,0.13
37ca07481c124d98ac1eaca5ee1f4146,1,0.1
11d87e606c2f4d649fe09a5e84d048c2,1,0.05
4828b93dd6dd44eb9ec8417f0564a9b9,1,0.05
fc3444ae44044a218e160522f7de8d8d,1,0.05


Some totals are very high and some are very low, but several customers have similar values at each extreme, so we will not assume that any of the amounts are erroneous.

Because the total amount purchased is likely to grow with the number of purchases, we will create a 3rd feature, amount per purchase, to replace total amount purchased.

In [19]:
corr = spending_features.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,num purchases,total amount purchased
num purchases,1.0,0.331584
total amount purchased,0.331584,1.0


In [20]:
spending_features['amount per purchase'] = spending_features['total amount purchased']/spending_features['num purchases']
corr = spending_features.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,num purchases,total amount purchased,amount per purchase
num purchases,1.0,0.331584,-0.0940862
total amount purchased,0.331584,1.0,0.772835
amount per purchase,-0.0940862,0.772835,1.0


In [21]:
spending_features = spending_features.drop(columns = 'total amount purchased')

# Cleaning Customer Data

For the remaining data, we need to clean the features as follows: 

became_member_on: in order for this feature to be usable, we will convert it to an integer by comparing the date vs. the newest customer. We will then create a new variable called age of account where the newest customer has a value of 0 and all others have an integer value representing the age of the account in months versus the newest customer.
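The conversion described for became_member_on can be sketched on toy dates as follows. The dates are illustrative, and dividing by an average month length of 30.436875 days (365.2425 / 12) is an assumption about how "months" is approximated.

```python
import pandas as pd

AVG_DAYS_PER_MONTH = 30.436875  # 365.2425 / 12, an assumed approximation

# Toy membership dates stored as integers in YYYYMMDD form
toy = pd.DataFrame({"became_member_on": [20170715, 20180426, 20180726]})

joined = pd.to_datetime(toy["became_member_on"].astype(str), format="%Y%m%d")
newest = joined.max()

# Whole months between each customer and the newest customer
toy["age of account"] = ((newest - joined).dt.days // AVG_DAYS_PER_MONTH).astype(int)
print(toy["age of account"].tolist())  # [12, 2, 0]
```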

income: should be integer instead of float. Additionally, we will convert units to thousands to make it more readable.

age: no change

id: set index to id.

num purchases & amount per purchase: import features from spending_features dataframe.

gender: examine frequency and separate datasets if necessary. PCA requires numerical inputs and gender cannot be transformed into a number that would generate useful results.

In [22]:
# errors='ignore' is deprecated in recent pandas; let unparseable dates raise instead
valid_customers['became_member_on'] = pd.to_datetime(valid_customers['became_member_on'], format='%Y%m%d')

In [23]:
#Newest Customer
newest_customer = max(valid_customers['became_member_on'])

In [24]:
# astype('timedelta64[M]') is not supported by recent pandas, so approximate whole months
# using the average Gregorian month length (365.2425 / 12 days)
account_age = newest_customer - valid_customers['became_member_on']
valid_customers['age of account'] = (account_age.dt.days // 30.436875).astype(int)
valid_customers = valid_customers.drop(columns = 'became_member_on')

In [25]:
valid_customers['income'] = valid_customers['income']/1000
valid_customers['income'] = valid_customers.income.astype(int)

In [26]:
valid_customers.index = valid_customers['id']
valid_customers = valid_customers.drop(columns = 'id')

In [27]:
display(HTML(valid_customers.iloc[0:10].to_html()))

id,age,gender,income,age of account
0610b486422d4921ae7d2bf64640c50b,55,F,112,12
78afa995795e4d85b5d9ceeca43f5fef,75,F,100,14
e2127556f4f64592b11af22de27a7932,68,M,70,2
389bc3fa690240e798340f5a15918d5c,65,M,53,5
2eeac8d8feae4a8cad5a6af0499a211d,58,M,51,8
aa4862eba776480b8bb9c68455b8c2e1,61,F,57,10
e12aeaf2d47d42479ea1c4ac3d8286c6,26,M,46,53
31dda685af34476cad5bc968bdb01c53,62,F,71,29
62cf5e10845442329191fc246e7bcea3,49,M,52,44
6445de3b47274c759400cd68131d91b4,57,M,42,6


In [28]:
#import features from spending_features

valid_customers = valid_customers.join(spending_features, how='left')
display(HTML(valid_customers.iloc[0:10].to_html()))

id,age,gender,income,age of account,num purchases,amount per purchase
0610b486422d4921ae7d2bf64640c50b,55,F,112,12,3.0,25.67
78afa995795e4d85b5d9ceeca43f5fef,75,F,100,14,7.0,22.752857
e2127556f4f64592b11af22de27a7932,68,M,70,2,3.0,19.243333
389bc3fa690240e798340f5a15918d5c,65,M,53,5,3.0,12.143333
2eeac8d8feae4a8cad5a6af0499a211d,58,M,51,8,4.0,3.905
aa4862eba776480b8bb9c68455b8c2e1,61,F,57,10,6.0,14.258333
e12aeaf2d47d42479ea1c4ac3d8286c6,26,M,46,53,11.0,5.110909
31dda685af34476cad5bc968bdb01c53,62,F,71,29,8.0,20.03125
62cf5e10845442329191fc246e7bcea3,49,M,52,44,9.0,16.012222
6445de3b47274c759400cd68131d91b4,57,M,42,6,6.0,3.183333


In [29]:
valid_customers.loc[(valid_customers['num purchases'].isnull())].shape

(333, 6)

333 customers have not made any purchases! Logically, we cannot determine the effect of a promotion on these customers since they are not making any purchases, regardless of the existence of any type of promotion. Perhaps these are customers who moved away, lost their phone, or were instructed by their doctors to give up caffeine. In any case, we will remove these customers from our dataset.
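The missing values arise because a left join keeps every customer, including those with no row in the spending table. A minimal sketch on toy frames (the names are illustrative):

```python
import pandas as pd

# Three customers, but only two of them appear in the spending table
toy_customers = pd.DataFrame(
    {"age": [55, 75, 68]}, index=pd.Index(["a", "b", "c"], name="id")
)
toy_spending = pd.DataFrame(
    {"num purchases": [3, 7]}, index=pd.Index(["a", "b"], name="id")
)

joined = toy_customers.join(toy_spending, how="left")
no_purchases = joined["num purchases"].isnull()
print(no_purchases.sum())

# Keep only customers with purchases; copy so the dtype change does not warn
active = joined.loc[~no_purchases].copy()
active["num purchases"] = active["num purchases"].astype(int)
```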

In [30]:
# copy the filtered rows so the dtype change below does not act on a view
valid_customers = valid_customers.loc[(valid_customers['num purchases'].notnull())].copy()
valid_customers['num purchases'] = valid_customers['num purchases'].astype(int)

Let's check for outliers

In [31]:
valid_customers.describe()

Unnamed: 0,age,income,age of account,num purchases,amount per purchase
count,14492.0,14492.0,14492.0,14492.0,14492.0
mean,54.3061,65.21812,16.893458,8.553478,15.252287
std,17.434828,21.599247,13.803784,5.082696,16.543371
min,18.0,30.0,0.0,1.0,0.15
25%,42.0,49.0,6.0,5.0,3.890833
50%,55.0,63.0,11.0,7.0,14.311214
75%,66.0,79.0,26.0,11.0,21.467625
max,101.0,120.0,59.0,36.0,451.47


In [32]:
genders = valid_customers.gender.unique()
for gender in genders:
    print("{} occurrences of gender {}.".format(valid_customers[(valid_customers.gender == gender)].shape[0],gender))

5993 occurrences of gender F.
8295 occurrences of gender M.
204 occurrences of gender O.


From a purely practical standpoint, gender O accounts for only 204 of the 17,000 customers, which is 1.2% of the total. This is too small to support segmentation, since any clusters within it would each be smaller than 1% of the total. For this reason, these customers will not be considered in this analysis. However, if this group were to become a larger portion of the overall data, it should be included in a future analysis.

In [33]:
204/17000

0.012

In [34]:
valid_customers = valid_customers[valid_customers.gender != 'O']

The remaining data will be split into different datasets for each gender. 

In [35]:
male_customers = valid_customers[valid_customers.gender == 'M']
female_customers = valid_customers[valid_customers.gender == 'F']
male_customers = male_customers.drop(columns = 'gender')
female_customers = female_customers.drop(columns = 'gender')
customers_dict = {'male':male_customers, "female":female_customers}

This concludes our preprocessing. Now we need to save the data to be used in the next step.

In [38]:
import os

def make_csv(data, filename, data_dir):
    '''Saves a dataframe as a csv file, keeping the header row and the index.
       :param data: Dataframe to save
       :param filename: Name of csv file, ex. 'train.csv'
       :param data_dir: The directory where files will be saved
       '''
    # make data dir, if it does not exist
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    # write the dataframe passed in as `data` rather than relying on a global variable
    data.to_csv(os.path.join(data_dir, filename), header=True, index=True)

    # nothing is returned, but a print statement indicates that the function has run
    print('Path created: '+str(data_dir)+'/'+str(filename))

In [39]:
data_dir = 'preprocessed_data'
for file_prefix, df in customers_dict.items():
    make_csv(df,file_prefix+'.csv',data_dir)

Path created: preprocessed_data/male.csv
Path created: preprocessed_data/female.csv


That concludes the preprocessing notebook. The next notebook is Customer Segmentation PCA and K-means.
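Since the files are written with the customer id as the index, a later step would likely want to restore that index when reading them back. The round trip below is a sketch on a toy frame written to a temporary directory; how the next notebook actually loads the data is not shown here and is an assumption.

```python
import os
import tempfile

import pandas as pd

# Toy frame shaped like one of the saved gender datasets
toy_segment = pd.DataFrame(
    {"age": [55, 75], "income": [112, 100]},
    index=pd.Index(["id_1", "id_2"], name="id"),
)

with tempfile.TemporaryDirectory() as data_dir:
    path = os.path.join(data_dir, "female.csv")
    toy_segment.to_csv(path, header=True, index=True)
    # index_col restores the customer id as the index instead of a plain column
    reloaded = pd.read_csv(path, index_col="id")

print(reloaded)
```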