# A/B Test: A New Menu Launch
## Project Overview
You're a business analyst for Round Roasters, a coffee restaurant chain in the United States of America. The executive team conducted a market test with a new menu and needs to figure out whether the new menu can drive enough sales to offset the cost of marketing it. Your job is to analyze the A/B test and write up a recommendation on whether the Round Roasters chain should launch this new menu.

In [None]:
# Load packages
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, ttest_rel
from sklearn.neighbors import KDTree
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

from statsmodels.tsa.seasonal import seasonal_decompose

import matplotlib.pyplot as plt
# plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [12, 12]

## Step 1: Plan Your Analysis
To perform the correct analysis, you will need to prepare a data set. Prior to rolling up your sleeves and preparing the data, it’s a good idea to have a plan of what you need to do in order to prepare the correct data set. A good plan will help you with your analysis. Here are a few questions to get you started:

- What is the performance metric you’ll use to evaluate the results of your test?
- What is the test period?
- At what level (day, week, month, etc.) should the data be aggregated?
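
As a quick sanity check on the test period, the sketch below assumes the 12-week test window used later in this notebook (2016-04-29 to 2016-07-21) and the same calendar dates one year earlier as the comparison window. The variable names are illustrative only.

```python
from datetime import date

# Assumed windows: the test period and the same calendar dates one year earlier
test_start, test_end = date(2016, 4, 29), date(2016, 7, 21)
comp_start, comp_end = date(2015, 4, 29), date(2015, 7, 21)

# Both windows are inclusive of their end date
test_days = (test_end - test_start).days + 1
comp_days = (comp_end - comp_start).days + 1
test_weeks = test_days // 7
comp_weeks = comp_days // 7

# Weekday on which each window starts
test_start_day = test_start.strftime('%A')
comp_start_day = comp_start.strftime('%A')

# 52 weeks of history plus both 12-week windows
total_weeks = 52 + test_weeks + comp_weeks
print(test_weeks, comp_weeks, test_start_day, comp_start_day, total_weeks)
```

Because the test starts on a Friday, aggregating on Friday-start weeks keeps each weekly bin aligned with the test window.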

In [None]:
# load Stores data
stores_data = pd.read_csv('round-roaster-stores.csv')
stores_data.info()

In [None]:
stores_data.head(3)

In [None]:
# load Transactions data
# force Invoice Date column to datetime 
transactions_data = pd.read_csv('RoundRoastersTransactions.csv', parse_dates=['Invoice Date'])
transactions_data.info()

In [None]:
transactions_data.head()

In [None]:
# load Treatment Stores data
treatment_stores_data = pd.read_csv('treatment-stores.csv')
treatment_stores_data.info()

In [None]:
treatment_stores_data.head(3)

## Step 2: Clean Up Your Data
In this step, you should prepare the data for steps 3 and 4. You should aggregate the transaction data to the appropriate level and filter on the appropriate date ranges. You can assume that there is no missing, incomplete, duplicate, or dirty data. You’re ready to move on to the next step when you have weekly transaction data for all stores.
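
To illustrate the weekly aggregation on its own, here is a minimal sketch on a made-up transactions table. The toy values are hypothetical; the column names mirror the transactions file. It assumes Friday-start weeks, with each bin covering Friday through the following Thursday and labelled by its starting Friday.

```python
import pandas as pd

# Hypothetical toy transactions for a single store (values invented for illustration)
toy = pd.DataFrame({
    'Invoice Date': pd.to_datetime(
        ['2016-04-29', '2016-04-30', '2016-05-05', '2016-05-06', '2016-05-06']),
    'StoreID': [1, 1, 1, 1, 1],
    'Invoice Number': [100, 100, 101, 102, 103],
    'Gross Margin': [2.0, 3.0, 4.0, 1.0, 1.5],
})

# closed='left' with label='left': each bin is [Friday, next Friday)
# and carries the date of its starting Friday
weekly = (
    toy.groupby([pd.Grouper(key='Invoice Date', freq='W-FRI',
                            closed='left', label='left'), 'StoreID'])
       .agg(**{
           'Gross Margin': ('Gross Margin', 'sum'),
           # Unique invoices per week stand in for foot traffic
           'Weekly Foot Traffic': ('Invoice Number', 'nunique'),
       })
       .reset_index()
)
print(weekly)
```

Counting unique invoice numbers rather than rows avoids inflating traffic when one invoice spans several line items.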

In [None]:
# Test cities: Denver and Chicago
# Treatment: 12 Weeks [2016-April-29 to 2016-July-21], start on Friday
# Control: 12 Weeks [2015-April-29 to 2015-July-21], start on Wednesday
# Total weeks to identify trend and season: 52 Weeks + treatment weeks + control weeks = 52 + 12 + 12 = 76 Weeks

data_end_date = datetime(2016, 7, 21)
data_start_date = data_end_date - timedelta(weeks=76)
print(f'Data Start Date: {data_start_date} \nData End Date: {data_end_date}')

In [None]:
# Filter data for further process
filtered_transactions_data = transactions_data.query('`Invoice Date` > @data_start_date and `Invoice Date` <= @data_end_date')
# Make Invoice Date Index
filtered_transactions_data.set_index('Invoice Date', inplace=True)

# Aggregate the data to get the weekly gross margin and weekly traffic count unique invoices
weekly_gross_and_traffic = filtered_transactions_data.groupby([pd.Grouper(freq='W-FRI', closed='left'), 'StoreID'], as_index=True).agg({'Gross Margin': 'sum', 'Invoice Number': 'nunique'}).reset_index()
weekly_gross_and_traffic.rename(columns={'Invoice Number': 'Weekly Foot Traffic'}, inplace=True)

# Hack to start week on first date
weekly_gross_and_traffic['Invoice Date'] = weekly_gross_and_traffic['Invoice Date'] - pd.offsets.Week(1)

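# Create Trend and Seasonal per store; decomposing the stacked column would mix rows from different stores
# (assumes every store has at least 24 weekly observations, the minimum seasonal_decompose needs for period=12)
def _decompose(group):
    result = seasonal_decompose(group['Weekly Foot Traffic'].to_numpy(), period=12, extrapolate_trend='freq')
    return group.assign(Trend=result.trend, Seasonal=result.seasonal)

# Add Trend and Seasonal to the weekly gross data
weekly_gross_and_traffic = pd.concat([_decompose(g) for _, g in weekly_gross_and_traffic.groupby('StoreID')], ignore_index=True)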
# Test the progress
weekly_gross_and_traffic.query('StoreID == 10018 and `Invoice Date` == @datetime(2015, 2, 6)')

In [None]:
# Filter Post and Pre test data
post_data = weekly_gross_and_traffic.query('`Invoice Date` >= @datetime(2016, 4, 29)')
pre_data = weekly_gross_and_traffic.query('`Invoice Date` >= @datetime(2015, 4, 29) and `Invoice Date` <= @datetime(2015, 7, 21)')

pre_data.head()

In [None]:
# week gross per store

# group pre data by store id
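# numeric_only=True skips the datetime 'Invoice Date' column, which cannot be summed
_temp_df = pre_data.groupby(['StoreID'], as_index=False).sum(numeric_only=True)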

# Merge grouped pre gross and store data
stores_columns = ['StoreID', 'Sq_Ft', 'AvgMonthSales', 'Region']
weekly_gross_per_store = _temp_df.merge(stores_data[stores_columns], on='StoreID')

# Add Treatment and Control Group
_cond = weekly_gross_per_store['StoreID'].isin(treatment_stores_data['StoreID'])
weekly_gross_per_store = weekly_gross_per_store.assign(Group = np.where(_cond, 'Treatment', 'Control'))

weekly_gross_per_store.head()

## Step 3: Match Treatment and Control Units
In this step, you should create the trend and seasonality variables, and use them along with your other control variable(s) to match two control units to each treatment unit. Treatment stores should be matched to control stores in the same region. Note: Calculate the number of transactions per store per week and use 12 periods to calculate trend and seasonality.

Apart from trend and seasonality...  

- What control variables should be considered? Note: Only consider variables in the RoundRoastersStore file.
- What is the correlation between each potential control variable and your performance metric? (Example of correlation matrix below)
- What control variables will you use to match treatment and control stores?
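
The matching idea described in this step can be sketched with a brute-force nearest-neighbour search. This is only an illustration under stated assumptions: the store table and its values are hypothetical, `match_controls` is a made-up helper name, and the features are standardized over all stores so that no single variable dominates the distance.

```python
import numpy as np
import pandas as pd

# Hypothetical store-level features (values invented for illustration)
stores = pd.DataFrame({
    'StoreID': [1, 2, 3, 4, 5, 6, 7],
    'Region': ['West', 'West', 'West', 'West', 'Central', 'Central', 'Central'],
    'Group': ['Treatment', 'Control', 'Control', 'Control',
              'Treatment', 'Control', 'Control'],
    'Trend': [10.0, 11.0, 30.0, 9.0, 20.0, 21.0, 40.0],
    'AvgMonthSales': [100.0, 110.0, 300.0, 95.0, 200.0, 190.0, 500.0],
})
features = ['Trend', 'AvgMonthSales']

# Standardize each feature to zero mean and unit variance
z = (stores[features] - stores[features].mean()) / stores[features].std(ddof=0)

def match_controls(stores, z, k=2):
    """Pair each treatment store with its k nearest control stores in the same region."""
    rows = []
    for region, grp in stores.groupby('Region'):
        treat = grp[grp['Group'] == 'Treatment']
        ctrl = grp[grp['Group'] == 'Control']
        ctrl_z = z.loc[ctrl.index].to_numpy()
        for t_idx, t_id in zip(treat.index, treat['StoreID']):
            # Euclidean distance from this treatment store to every regional control
            dist = np.linalg.norm(ctrl_z - z.loc[t_idx].to_numpy(), axis=1)
            for c_pos in np.argsort(dist)[:k]:
                rows.append((int(t_id), int(ctrl['StoreID'].iloc[c_pos]), region))
    return pd.DataFrame(rows, columns=['Treatment StoreID', 'Control StoreID', 'Region'])

pairs = match_controls(stores, z)
print(pairs)
```

A tree-based index such as a KD-tree returns the same neighbours and scales better; brute force is used here only to keep the sketch dependency-free.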

In [None]:
# Correlation Gross Margin and Stores variables
weekly_gross_per_store[['Gross Margin', 'Sq_Ft', 'AvgMonthSales']].corr().round(2)


In [None]:
selected_variables = ['Trend', 'Seasonal', 'AvgMonthSales']
selected_variables

## Step 4: Analysis and Writeup
Conduct your A/B analysis and create a short report outlining your results and recommendations.  
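
One common way to frame the comparison is to compute each store's lift between the comparison period and the test period, then check whether the average lift differs between treatment and matched control stores. The sketch below uses invented per-store totals and Welch's t-test from SciPy; it is an illustrative framing under those assumptions, not necessarily the analysis this notebook performs.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical per-store gross margin totals (values invented for illustration)
treat_pre = np.array([100.0, 120.0, 90.0, 110.0])
treat_post = np.array([140.0, 165.0, 128.0, 150.0])
ctrl_pre = np.array([100.0, 115.0, 95.0, 105.0, 98.0, 112.0, 101.0, 108.0])
ctrl_post = np.array([105.0, 118.0, 99.0, 104.0, 103.0, 115.0, 106.0, 110.0])

# Relative change per store between the two periods
treat_lift = treat_post / treat_pre - 1
ctrl_lift = ctrl_post / ctrl_pre - 1

# Welch's t-test: does the mean lift differ between the two groups?
stat, p_value = ttest_ind(treat_lift, ctrl_lift, equal_var=False)

# Lift attributable to the treatment, net of the change seen in control stores
incremental_lift = treat_lift.mean() - ctrl_lift.mean()
print(f'incremental lift: {incremental_lift:.1%}, p-value: {p_value:.4f}')
```

Subtracting the control lift removes changes that affected all stores, such as general year-over-year growth.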

In an A/B analysis, we use the correlation matrix to find the variables most strongly correlated with the performance metric. Including those variables as controls helps find the best-matched control stores.
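
As a minimal sketch of that selection step, the example below ranks candidate control variables by the absolute value of their Pearson correlation with the metric. The table values are hypothetical; the column names follow the store file.

```python
import pandas as pd

# Hypothetical store-level table (values invented for illustration)
df = pd.DataFrame({
    'Gross Margin': [10.0, 12.0, 15.0, 19.0, 25.0, 30.0],
    'Sq_Ft': [1500, 900, 1400, 1000, 1600, 1100],
    'AvgMonthSales': [100.0, 125.0, 140.0, 200.0, 240.0, 310.0],
})

metric = 'Gross Margin'

# Absolute correlation of each candidate with the metric, strongest first
corr_with_metric = (
    df.corr()[metric]
      .drop(metric)
      .abs()
      .sort_values(ascending=False)
)
best = corr_with_metric.idxmax()
print(corr_with_metric.round(2))
print('Strongest candidate:', best)
```

Taking the absolute value matters because a strongly negative correlation is as useful for matching as a strongly positive one.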

In [None]:
# Match Treatment and Control Stores
control_stores = weekly_gross_per_store.query('Group == "Control"')
treatment_stores = weekly_gross_per_store.query('Group == "Treatment"')
regions = ['Central', 'West']
transformer = ColumnTransformer([('scaler', StandardScaler(), selected_variables)], remainder='drop')
transformer.fit(control_stores)


In [None]:
# Match each treatment store to its two nearest control stores within the same region

def matched_stores(control, treatment, regions_list):
    control_store_ids = []
    treatment_store_ids = []
    store_id_region = []

    for region in regions_list:
        control_region = control.query('Region == @region')
        treatment_region = treatment.query('Region == @region')
        
        kdtree_region = KDTree(transformer.transform(control_region), leaf_size=2)

        for i, store_id in enumerate(treatment_region['StoreID']):
            _data = treatment_region.iloc[i:i+1]
            _idx = kdtree_region.query(transformer.transform(_data), k=2, return_distance=False)

            treatment_store_ids += [store_id, store_id]
            control_store_ids += list(control_region.iloc[_idx[0]]['StoreID'])
            store_id_region += [region, region]

    _merge_list = list(zip(treatment_store_ids, control_store_ids, store_id_region))
    _columns_name = ['Treatment StoreID', 'Control StoreID', 'Region']
    return pd.DataFrame(_merge_list, columns=_columns_name)


In [None]:
result_df = matched_stores(control_stores, treatment_stores, regions)
result_df

In [None]:
# Match Pre and Post samples

# Each treatment store appears once per matched control in result_df, so de-duplicate its IDs
_treatment_ids = result_df['Treatment StoreID'].unique()

# Post Treatment Samples
_treat_post = []
for i in _treatment_ids:
    # append (not reassign) so every store is kept, not just the last one
    _treat_post.append(post_data.query('StoreID == @i'))

post_treatment_sample = pd.concat(_treat_post)

# Post Control Samples
_cont_post = []
for i in result_df['Control StoreID']:
    _cont_post.append(post_data.query('StoreID == @i'))

post_control_sample = pd.concat(_cont_post)

# Pre Treatment Samples
_treat_pre = []
for i in _treatment_ids:
    _treat_pre.append(pre_data.query('StoreID == @i'))

pre_treatment_sample = pd.concat(_treat_pre)

# Pre Control Samples
_cont_pre = []
for i in result_df['Control StoreID']:
    _cont_pre.append(pre_data.query('StoreID == @i'))

pre_control_sample = pd.concat(_cont_pre)


In [None]:
post_treatment_sample

In [None]:
# Is the change unlikely to be random? Treat it as significant at the 5% level if p < 0.05
s, p = ttest_ind(post_treatment_sample['Gross Margin'], pre_treatment_sample['Gross Margin'])
# 1 - p expressed as a percentage
(1 - p) * 100