# Starbucks Rewards: Predicting Consumer Responses

## Project Overview

This project asks how we can use data to discover which offers excite people. We want to find the most valuable offer, not just for customers as a whole but for each individual customer.

Link to an academic paper where machine learning was applied to this type of problem: http://ceur-ws.org/Vol-3026/paper18.pdf

## Problem Statement

Predict whether a customer will respond to an offer. Transaction data and demographic information must be combined, and the data will also be accessible to GitHub users. We will analyze customer attributes to build customer classifications. Because this is a multi-class classification problem, the key metric is the F1-score. The simulated dataset contains only one product, whereas Starbucks sells dozens, so it is a simplified version of the real Starbucks app.
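
To make the metric concrete, here is a minimal pure-Python sketch of a support-weighted F1-score, the same averaging mode (`average='weighted'`) used in the benchmark later in this notebook. The labels below are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: predicted this class and it was correct
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Hypothetical labels for three offer classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(weighted_f1(y_true, y_pred), 3))
```

Weighting by support means that frequent offer classes influence the score more than rare ones.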

In [None]:
from collections import defaultdict
import pandas as pd
import sagemaker
from sklearn.model_selection import train_test_split
from sagemaker.xgboost.estimator import XGBoost
from sklearn import preprocessing
import matplotlib.pyplot as plt

## Data Exploration

### Dataset details

profile.json
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

transcript.json
* event (str) - record description (e.g. transaction, offer received, offer viewed)
* person (str) - customer id
* time (int) - time in hours since the start of the test. The data begins at time t=0
* value (dict of strings) - either an offer id or transaction amount depending on the record

portfolio.json
* id (string) - offer id
* offer_type (string) - type of offer, i.e. BOGO, discount, informational
* difficulty (int) - the minimum spend required to complete an offer
* reward (int) - the reward given for completing an offer
* duration (int) - time for the offer to be open, in days
* channels (list of strings)

In [None]:
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

## Profile Analysis

In [None]:
print(profile.columns)
profile.hist()

In [None]:
profile.info()

In [None]:
profile.describe()

In [None]:
profile['gender'].value_counts().plot(kind='bar')

There are 17000 records; the average age is 63 years and the average income is about $65000. Income and gender have 2175 missing data points, and we will need to decide what to do with these values later on. We can safely ignore the id, as the identifiers carry no information. As for became_member_on, we can confirm that most customers became members in 2017 and 2018.

With M corresponding to male, we can see that males account for the majority of the customers. **This may be a problem for our model, but if our accuracy is too low, we can always normalize the presence of each gender in the dataset to get predictions that aren’t skewed towards male customers**.
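
One possible way to even out group sizes, shown here only as a sketch on hypothetical toy data, is to downsample every gender group to the size of the smallest one. The helper name `downsample_to_smallest` is an assumption for illustration; the notebook itself does not apply this step.

```python
import pandas as pd


def downsample_to_smallest(df, column, seed=0):
    """Sample every group in `column` down to the size of the smallest group."""
    smallest = df[column].value_counts().min()
    return df.groupby(column).sample(n=smallest, random_state=seed)


# Hypothetical toy profile with an imbalanced gender column
toy = pd.DataFrame({
    "gender": ["M"] * 6 + ["F"] * 3 + ["O"] * 2,
    "age": range(11),
})

balanced = downsample_to_smallest(toy, "gender")
print(balanced["gender"].value_counts())
```

Downsampling discards rows, so it trades data volume for balance; class weights or oversampling are alternatives with different trade-offs.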

## Transcript Analysis

In [None]:
transcript.info()

In [None]:
transcript.head()

In [None]:
transcript.describe()

In [None]:
transcript['event'].value_counts().plot(kind='bar')

Given the 300000+ records, it makes sense that offer completed has the smallest count and transaction has the largest, since there can't be more offers completed than offers received or offers viewed. We also see that the test lasted 714 hours, but this doesn't seem useful for our analysis and modeling stage.

## Portfolio Analysis

In [None]:
portfolio.head(10)

In [None]:
portfolio['reward'].value_counts()

Note that there are 8 rows with rewards and 2 that are purely informational, so completing an informational offer would yield no reward. Discounts give rewards with values 2, 3, and 5 and come with multiple difficulties.

## Missing Values

Previously we observed that the missing values come from gender and income. We will add a fourth gender category, U for unknown, and fill income with the mean, since income values appear to be normally distributed.

In [None]:
profile['gender'] = profile['gender'].fillna('U')
# Call mean() so the fill value is a number, not the bound method itself
profile['income'] = profile['income'].fillna(profile['income'].mean())

In [None]:
profile.info()

## Outliers

In [None]:
len(profile[profile['age']>100])/len(profile)

In [None]:
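# The % format spec already appends a percent sign
print('percentage of centenarians {0:.0%}'.format(len(profile[profile['age']>100])/len(profile)))
# Compute the average instead of hard-coding it
print('average age of customer', round(profile['age'].mean()))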

Note that so far the only possible outliers come from age, as there are many centenarians. Out of a US population of approximately 300 million, there were approximately 90000 centenarians (age 100+), a prevalence of 0.03%. From the above we see that centenarians make up 13% of the customers, but the average age is high as well, so we can ignore age for now.
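
As a quick check of the arithmetic, using the approximate figures quoted in this section (the variable names are just for illustration):

```python
us_population = 300_000_000   # approximate figure quoted in the text
us_centenarians = 90_000      # approximate figure quoted in the text
dataset_share = 0.13          # share of customers with age > 100 in this dataset

prevalence = us_centenarians / us_population
print(f"US prevalence: {prevalence:.2%}")
print(f"Dataset share is roughly {dataset_share / prevalence:.0f}x the US prevalence")
```

The gap between the two rates is what marks these ages as possible outliers.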

## Data Preprocessing

Note: We will only consider customers who received offers. Customers who did not receive an offer could not have viewed or responded to one, so they provide no valid information for the purposes of predicting consumer response to offers.

In [None]:
new_transcript = transcript[transcript['event'].str.startswith('o')]
new_transcript.head()

In [None]:
# Split / Explode a column of dictionaries into separate columns 
new_transcript = pd.concat([new_transcript.drop(['value'], axis=1), new_transcript['value'].apply(pd.Series)], axis=1)
new_transcript.head()
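
The dictionary expansion above can also be sketched with `pd.json_normalize`, which tends to be faster than `apply(pd.Series)` on large frames. The records below are hypothetical; they only mimic the two key spellings (`offer id` and `offer_id`) that later cells in this notebook refer to.

```python
import pandas as pd

# Hypothetical records shaped like offer rows of the transcript
toy = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "event": ["offer received", "offer completed", "offer viewed"],
    "value": [
        {"offer id": "a"},
        {"offer_id": "a", "reward": 5},
        {"offer id": "b"},
    ],
})

# Expand each dict into columns; keys missing from a row become NaN
expanded = pd.json_normalize(toy["value"].tolist())
# Re-use the original index so the concat aligns even after filtering
expanded.index = toy.index

result = pd.concat([toy.drop(columns="value"), expanded], axis=1)
print(result)
```

Only the row whose dictionary uses the `offer_id` key ends up with a non-null `offer_id`, which is why a later `dropna` on that column narrows the data considerably.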

In [None]:
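# Keep only rows that carry an offer_id
new_transcript = new_transcript.dropna(subset=["offer_id"])

# Align key names so the three tables can be joined
temp1 = portfolio.rename(columns={"id": "offer_id"})
temp2 = profile.rename(columns={"person": "id"})  # profile already uses "id", so this is a no-op

# Attach offer attributes first, then customer attributes
data = pd.merge(new_transcript, temp1, on="offer_id").rename(columns={"person": "id"})
data = pd.merge(data, temp2, on="id")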

## Feature Engineering

In [None]:
# Creating additional features
data['became_member_on'] = pd.to_datetime(data['became_member_on'],format='%Y%m%d')

data["year"] = data.became_member_on.dt.year
data["month"] = data.became_member_on.dt.month
data["day"] = data.became_member_on.dt.day

data.head()

In [None]:
to_drop = ['offer id',
           'event',
           'id',
           'became_member_on',
           'reward_y',
           'reward_x',
           'time',
           'offer_type']
           
data.drop(to_drop, inplace=True, axis=1)
print(data)

In [None]:
target_vals = set(data['offer_id'])
print(target_vals)

In [None]:
# Set target to be offer_id
data['target'] = data['offer_id']
data.drop('offer_id', inplace=True, axis=1)
data.head()

In [None]:
# Encode non-numeric columns as integers to avoid errors downstream
lbl = preprocessing.LabelEncoder()
data['gender'] = lbl.fit_transform(data['gender'].astype(str))
data['target'] = lbl.fit_transform(data['target'].astype(str))
# Encode channels from its own values (each list is cast to a string label first)
data['channels'] = lbl.fit_transform(data['channels'].astype(str))
data.info()

In [None]:
data = data.drop_duplicates()

### Class distributions of offers completed

In [None]:
# defaultdict(int) starts every missing key at 0, so no membership check is needed
offer_count = defaultdict(int)
for offer in data['target']:
    offer_count[offer] += 1

class_count = pd.DataFrame.from_dict(offer_count, orient='index')
bar_plot = class_count.plot.bar(title="Number of occurrences")
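
The same class counts can be obtained directly from pandas. This is a small sketch on a hypothetical target column, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical encoded targets
target = pd.Series([0, 2, 2, 1, 2, 0], name="target")

# Count occurrences per class and order the result by class label
class_count = target.value_counts().sort_index()
print(class_count)
# class_count.plot.bar(title="Number of occurrences") would draw the chart
```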

In [None]:
data.to_csv('data.csv', index = False)

In [None]:
train, test = train_test_split(data, test_size=0.2, random_state=0)
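
The split above is purely random. Because the target has several classes, one option (not used in this notebook) is a stratified split, which keeps the class proportions the same in both parts. A minimal sketch on hypothetical toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: two offer classes with 10 rows each
toy = pd.DataFrame({"age": range(20), "target": [0] * 10 + [1] * 10})

# stratify= preserves the class ratio of the given column in train and test
train_s, test_s = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["target"]
)
print(test_s["target"].value_counts().to_dict())
```

Stratifying matters most when some offer classes are rare, since a random split could leave them underrepresented in the test set.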

In [None]:
train.to_csv('data/train/train.csv', index = False)
test.to_csv('data/test/test.csv', index = False)

## Upload to S3 bucket

In [None]:
sagemaker_session = sagemaker.Session()

bucket = "my-project-bucket-123"
region = "us-east-1"
role = "arn:aws:iam::657240468511:role/service-role/AmazonSageMaker-ExecutionRole-20211225T133929"

print("Default Bucket: {}".format(bucket))
print("AWS Region: {}".format(region))
print("RoleArn: {}".format(role))

In [None]:
# Reuse the session created above and keep both S3 URIs instead of overwriting one
s3_train_path = sagemaker_session.upload_data(bucket=bucket,
                                              path='data/train',
                                              key_prefix='train')

s3_test_path = sagemaker_session.upload_data(bucket=bucket,
                                             path='data/test',
                                             key_prefix='test')


## Benchmark

As a benchmark, we will use multiclass logistic regression, against which the final model can be compared.

In [None]:
benchmark_data = pd.read_csv('data.csv')
# Split the frame that was just loaded rather than the in-memory `data`
benchmark_train, benchmark_test = train_test_split(benchmark_data, test_size=0.2, random_state=0)

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(multi_class='multinomial', solver='saga', max_iter=10000 )
X = benchmark_train.iloc[:, :-1]
y = benchmark_train.iloc[:, -1:]
model.fit(X, y.values.ravel())


In [None]:
from sklearn.metrics import f1_score

preds = model.predict(benchmark_test.iloc[:,:-1])
score=f1_score(benchmark_test["target"], preds, average='weighted')
print(f"f1-score: {score}")

## Model Training
**Note:** You will need to use the `train.py` script to train your model.
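
`train.py` itself is not shown in this notebook. As a rough sketch of how such a script typically receives its inputs in SageMaker script mode: hyperparameters arrive as command-line flags, while the model directory and the data channels arrive through `SM_*` environment variables. The default values and the `parse_args` helper below are assumptions for illustration, and the actual training logic is omitted.

```python
import argparse
import os


def parse_args(argv=None):
    """Collect hyperparameters and data locations for a training script."""
    parser = argparse.ArgumentParser()

    # Hyperparameters are passed as command-line flags (defaults are assumptions)
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--num_round", type=int, default=50)

    # Locations come from environment variables set inside the training container;
    # the fallbacks only exist so the sketch also runs locally
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data/train"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION", "data/test"))

    return parser.parse_args(argv)


# Example: simulate the flags a training job might pass
args = parse_args(["--max_depth", "6", "--eta", "0.3"])
print(args.max_depth, args.eta, args.num_round)
```

The channel names `train` and `validation` match the keys passed to `fit` in this notebook, which is where the `SM_CHANNEL_TRAIN` and `SM_CHANNEL_VALIDATION` names come from.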

In [None]:
output_path = 's3://{}/output'.format(bucket)
input_data = 's3://{}/'.format(bucket)

# Raw string so the regex backslashes are not treated as string escapes
metric_definitions = [{'Name': 'validation:f1', 'Regex': r'.*\[[0-9]+\].*#011validation-f1:([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*'}]

xgb_estimator = XGBoost(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    framework_version="1.3-1",
    output_path=output_path,

)
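
A metric regex like the one above can be checked locally before launching a job. The log line below is hypothetical; it assumes the training log prints the round number in brackets and separates fields with `#011`, which is how a tab character appears in the captured logs.

```python
import re

pattern = r'.*\[[0-9]+\].*#011validation-f1:([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*'

# Hypothetical log line in the assumed format
sample_line = "[12]#011train-f1:0.71#011validation-f1:0.6425"

match = re.match(pattern, sample_line)
# Group 1 is the value that would be reported as validation:f1
print(match.group(1))
```

If the pattern does not match what `train.py` actually prints, no metric value is extracted, so it is worth testing against a real log line.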

In [None]:
train_input = input_data + "train"
test_input = input_data + "test"

In [None]:
xgb_estimator.fit({'train': train_input, 'validation': test_input})

## Standout suggestions

### Hyperparameter Tuning

In [None]:

from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "max_depth": IntegerParameter(2, 8),
    "eta": ContinuousParameter(0.1, 0.5),
    "num_round" : CategoricalParameter([10, 50, 100]),
}

objective_metric_name = "validation:f1"


In [None]:
xgb_estimator = XGBoost(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    framework_version="1.3-1",
    output_path=output_path,

)

tuner = HyperparameterTuner(
    xgb_estimator,
    objective_metric_name,
    hyperparameter_ranges, 
    max_jobs=2,
    max_parallel_jobs=2,  # no benefit in allowing more parallel jobs than total jobs
)

In [None]:
tuner.fit({'train': train_input, "validation": test_input})

The final result is higher than the benchmark result by a large margin.
Therefore, the final model and solution are significant enough to have adequately solved the problem.

In [None]:
tuner.describe()['BestTrainingJob']['FinalHyperParameterTuningJobObjectiveMetric']

In [None]:
best_estimator = tuner.describe()['BestTrainingJob']['TunedHyperParameters']
print(best_estimator)

In [None]:
max_depth = int(best_estimator["max_depth"])
eta = float(best_estimator['eta'])
# Categorical values come back wrapped in double quotes, so strip them before casting
num_round = int(best_estimator['num_round'].strip('"'))

print("max_depth: {}".format(max_depth))
print("eta: {}".format(eta))
print("num_round: {}".format(num_round))

In [None]:
hyperparameters = {
    "max_depth": max_depth,
    "eta": eta,
    "num_round": num_round
}

In [None]:
output_path = 's3://{}/output'.format(bucket)
input_data = 's3://{}/'.format(bucket)

xgb_estimator = XGBoost(
    entry_point="train.py",
    hyperparameters=hyperparameters,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    framework_version="1.3-1",
    output_path=output_path,

)

In [None]:
xgb_estimator.fit({'train': train_input, "validation": test_input})

In [None]:
pt_model_data = xgb_estimator.model_data
print("Model artifact saved at:\n", pt_model_data)

### Model Deploying and Querying

In [None]:
from time import gmtime, strftime

timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
endpoint_name = "inference-pipeline-ep-" + timestamp_prefix

In [None]:
predictor=xgb_estimator.deploy(instance_type="ml.m5.large", initial_instance_count=1, endpoint_name=endpoint_name) 

In [None]:
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

payload = '0,10,7,0,69,70000.0,2018,5,14'

# The serializer sets the text/csv content type; content_type and accept are
# no longer Predictor arguments in SageMaker Python SDK v2.
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=CSVSerializer())

print(predictor.predict(payload))
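
The payload is a single CSV row whose values must follow the column order of the training data. The sketch below builds such a row from named features. The `feature_order` list is an assumption inferred from the preprocessing steps in this notebook and should be checked against the header of `train.csv`.

```python
# Assumed feature order (target column excluded); verify against train.csv
feature_order = [
    "channels", "difficulty", "duration", "gender",
    "age", "income", "year", "month", "day",
]


def to_csv_payload(row, order):
    """Join the values of `row` in the given column order into one CSV line."""
    return ",".join(str(row[name]) for name in order)


# Hypothetical customer/offer combination using already-encoded values
row = {
    "channels": 0, "difficulty": 10, "duration": 7, "gender": 0,
    "age": 69, "income": 70000.0, "year": 2018, "month": 5, "day": 14,
}

payload = to_csv_payload(row, feature_order)
print(payload)
```

Building the row from names rather than typing the string by hand makes it harder to swap two values by accident.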

In [None]:
predictor.delete_endpoint()