# A/B Test: A New Menu Launch
## Project Overview
You're a business analyst for Round Roasters, a coffee restaurant chain in the United States of America. The executive team conducted a market test with a new menu and needs to determine whether the new menu can drive enough sales to offset the cost of marketing it. Your job is to analyze the A/B test and write up a recommendation on whether the Round Roasters chain should launch this new menu.

In [None]:
# Load package
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
from sklearn.neighbors import KDTree
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

from statsmodels.tsa.seasonal import seasonal_decompose

import matplotlib.pyplot as plt
# plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [12, 12]

## Step 1: Plan Your Analysis
To perform the correct analysis, you will need to prepare a data set. Prior to rolling up your sleeves and preparing the data, it’s a good idea to have a plan of what you need to do in order to prepare the correct data set. A good plan will help you with your analysis. Here are a few questions to get you started:

- What is the performance metric you’ll use to evaluate the results of your test?
- What is the test period?
- At what level (day, week, month, etc.) should the data be aggregated?

In [None]:
# load Stores data
stores_data = pd.read_csv('round-roaster-stores.csv')
stores_data.info()

In [None]:
stores_data.head(3)

In [None]:
# load Transactions data
# force Invoice Date column to datetime 
transactions_data = pd.read_csv('RoundRoastersTransactions.csv', parse_dates=['Invoice Date'])
transactions_data.info()

In [None]:
transactions_data.head()

In [None]:
# load Treatment Stores data
treatment_stores_data = pd.read_csv('treatment-stores.csv')
treatment_stores_data.info()

In [None]:
treatment_stores_data.head(3)

## Step 2: Clean Up Your Data
In this step, you should prepare the data for steps 3 and 4. You should aggregate the transaction data to the appropriate level and filter on the appropriate date ranges. You can assume that there is no missing, incomplete, duplicate, or dirty data. You’re ready to move on to the next step when you have weekly transaction data for all stores.

In [None]:
# Test cities: Denver and Chicago
# Treatment period: 12 weeks [2016-04-29 to 2016-07-21], starts on a Friday
# Control (comparison) period: 12 weeks [2015-04-29 to 2015-07-21], starts on a Wednesday
# Total weeks needed to identify trend and seasonality: 52 weeks + treatment weeks + control weeks = 52 + 12 + 12 = 76 weeks

data_end_date = datetime(2016, 7, 21)
data_start_date = data_end_date - timedelta(weeks=76)
print(f'Data Start Date: {data_start_date} \nData End Date: {data_end_date}')
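
As a quick sanity check on the dates quoted in the comments above, the following sketch uses only the standard library to confirm the weekdays, the length of the treatment period, and the start of the 76-week window. The variable names are illustrative and not part of the notebook.

```python
from datetime import datetime, timedelta

treatment_start = datetime(2016, 4, 29)
treatment_end = datetime(2016, 7, 21)
control_start = datetime(2015, 4, 29)

# weekday(): Monday is 0, so Wednesday is 2 and Friday is 4
treatment_weekday = treatment_start.weekday()
control_weekday = control_start.weekday()

# Inclusive day count of the treatment period, expressed in weeks
treatment_weeks = ((treatment_end - treatment_start).days + 1) / 7

# Start of the analysis window: 76 weeks before the end of the test
window_start = treatment_end - timedelta(weeks=76)

print(treatment_weekday, control_weekday, treatment_weeks, window_start.date())
```

The window opens on Thursday 2015-02-05, so a strict `>` comparison against that timestamp keeps Friday 2015-02-06 as the first full day, assuming the invoice dates carry no time-of-day component.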

In [None]:
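# Keep only the transactions inside the 76-week analysis window
filtered_transactions_data = transactions_data.query('`Invoice Date` > @data_start_date and `Invoice Date` <= @data_end_date')
# Use Invoice Date as the index (pd.Grouper needs a datetime index below);
# reassigning avoids an in-place change on a filtered result
filtered_transactions_data = filtered_transactions_data.set_index('Invoice Date')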

filtered_transactions_data.head(10)

In [None]:
# Aggregate the data to get the weekly gross margin and weekly traffic count unique invoices
weekly_gross_and_traffic = filtered_transactions_data.groupby([pd.Grouper(freq='W-FRI', closed='left'), 'StoreID'], as_index=True).agg({'Gross Margin': 'sum', 'Invoice Number': 'nunique'}).reset_index()
weekly_gross_and_traffic.rename(columns={'Invoice Number': 'Weekly Foot Traffic'}, inplace=True)

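# With closed='left', each bin is labeled with its right edge (the following Friday);
# shift back one week so every row is labeled with the Friday that starts its week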
weekly_gross_and_traffic['Invoice Date'] = weekly_gross_and_traffic['Invoice Date'] - pd.offsets.Week(1)
weekly_gross_and_traffic.head()
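
The one-week shift above can also be expressed through the `label` argument of `pd.Grouper`. The sketch below, on a small made-up table (the dates, store IDs, and invoice numbers are invented for illustration), compares both approaches; it is an alternative worth considering rather than a required change.

```python
import pandas as pd

# Toy transactions indexed by invoice date (values are made up)
toy = pd.DataFrame(
    {
        'StoreID': [1, 1, 1, 2],
        'Invoice Number': [101, 102, 103, 201],
    },
    index=pd.to_datetime(['2016-04-29', '2016-05-05', '2016-05-06', '2016-04-30']),
)
toy.index.name = 'Invoice Date'


def weekly_counts(label):
    """Count unique invoices per store per Friday-to-Thursday week."""
    grouper = pd.Grouper(freq='W-FRI', closed='left', label=label)
    return (
        toy.groupby([grouper, 'StoreID'])['Invoice Number']
        .nunique()
        .reset_index()
    )


# Approach used in the notebook: right-edge labels shifted back one week
shifted = weekly_counts('right')
shifted['Invoice Date'] = shifted['Invoice Date'] - pd.offsets.Week(1)

# Alternative: ask the grouper for left-edge labels directly
direct = weekly_counts('left')

print(direct)
```

Both frames label the first week 2016-04-29 and the second 2016-05-06, so either form can be used as long as the choice is applied consistently.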

In [None]:
# Find Trend and Seasonality
stores_ID_list = list(weekly_gross_and_traffic['StoreID'].unique())
df_list = []
for store_id in stores_ID_list:
    store_df = weekly_gross_and_traffic.query('StoreID == @store_id')
    result = seasonal_decompose(store_df['Weekly Foot Traffic'].values, period=12, extrapolate_trend='freq')
    store_df = store_df.assign(Trend = result.trend, Seasonal = result.seasonal)
    df_list.append(store_df)

cleaned_data = pd.concat(df_list)

In [None]:
# How many transactions did you get for Store 10018 in the week starting 2015-02-06?
cleaned_data.query('StoreID == 10018 and `Invoice Date` == @datetime(2015, 2, 6)')

In [None]:
# aggregate the data by store ID
# numeric_only=True skips the 'Invoice Date' column, which cannot be summed in recent pandas versions
cleaned_data = cleaned_data.groupby(['StoreID'], as_index=False).sum(numeric_only=True)

# merge cleaned data with stores data
stores_columns = ['StoreID', 'Sq_Ft', 'AvgMonthSales', 'Region']
transactions_stores_merged = cleaned_data.merge(stores_data[stores_columns], on='StoreID')
transactions_stores_merged['Group'] = np.where(transactions_stores_merged['StoreID'].isin(treatment_stores_data['StoreID'].values), 'Treatment', 'Control')

transactions_stores_merged.head()

## Step 3: Match Treatment and Control Units
In this step, you should create the trend and seasonality variables, and use them along with your other control variable(s) to match two control units to each treatment unit. Treatment stores should be matched to control stores in the same region. Note: Calculate the number of transactions per store per week and use 12 periods to calculate trend and seasonality.  

Apart from trend and seasonality...  

- What control variables should be considered? Note: Only consider variables in the RoundRoastersStore file.
- What is the correlation between each potential control variable and your performance metric? (Example of correlation matrix below)
- What control variables will you use to match treatment and control stores?
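
The correlation check asked for in the questions above can be sketched on a toy table. The values below are invented, and `toy_stores` and `metric` are illustrative names; only the column names are borrowed from the store data.

```python
import pandas as pd

# Toy store-level table (all values are made up)
toy_stores = pd.DataFrame({
    'Gross Margin': [10.0, 20.0, 30.0, 40.0, 50.0],
    'AvgMonthSales': [1.0, 2.0, 3.0, 4.0, 5.0],
    'Sq_Ft': [5.0, 1.0, 4.0, 2.0, 3.0],
    'Region': ['West', 'West', 'Central', 'Central', 'West'],
})

metric = 'Gross Margin'

# numeric_only=True leaves out text columns such as Region;
# keep only each candidate's correlation with the performance metric
correlations = (
    toy_stores.corr(numeric_only=True)[metric]
    .drop(metric)
    .round(2)
)
print(correlations)
```

In this toy table `AvgMonthSales` is perfectly correlated with the metric while `Sq_Ft` is only weakly (and negatively) correlated, so only the former would be a useful matching variable.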

In [None]:
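# Correlation between control variables and the performance metric
# numeric_only=True excludes the text columns (Region, Group), which recent pandas versions refuse to correlate
transactions_stores_merged.drop(columns='StoreID').corr(numeric_only=True).round(2)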

In [None]:
# Correlated variables to be used in KDTree 
control_variables = ['Gross Margin', 'Weekly Foot Traffic', 'Trend', 'AvgMonthSales']

control_variables

In [None]:
# region data control
control_data = transactions_stores_merged.query('Group == "Control"').reset_index(drop=True)
treatment_data = transactions_stores_merged.query('Group == "Treatment"').reset_index(drop=True)

west_region_control = control_data.query('Region == "West"').reset_index(drop=True)
central_region_control = control_data.query('Region == "Central"').reset_index(drop=True)

# region data treatment
west_region_treatment = treatment_data.query('Region == "West"')
central_region_treatment = treatment_data.query('Region == "Central"')

# Scale the features
transformer = ColumnTransformer([('scaler', StandardScaler(), control_variables)], remainder='drop')
transformer.fit(control_data)

# KDtree for west
west_kdtree = KDTree(transformer.transform(west_region_control), leaf_size=2)
# KDtree for central
central_kdtree = KDTree(transformer.transform(central_region_control), leaf_size=2)
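
The idea behind the scaled nearest-neighbor search can be illustrated with a handful of made-up points. In this sketch the arrays are invented and plain `StandardScaler` stands in for the `ColumnTransformer` used above; it is meant only to show how `KDTree.query` returns the two closest control rows.

```python
import numpy as np
from sklearn.neighbors import KDTree
from sklearn.preprocessing import StandardScaler

# Four hypothetical control stores described by two features
controls = np.array([
    [100.0, 10.0],
    [200.0, 20.0],
    [300.0, 30.0],
    [400.0, 40.0],
])
# One hypothetical treatment store
treatment = np.array([[210.0, 21.0]])

# Fit the scaler on the controls, then apply the same scaling to the treatment store
scaler = StandardScaler().fit(controls)
tree = KDTree(scaler.transform(controls), leaf_size=2)

# k=2 returns the two nearest controls, ordered from closest to farthest
dist, idx = tree.query(scaler.transform(treatment), k=2)
print(idx[0], dist[0])
```

Each query is independent, so the same control store can be returned for more than one treatment store; whether that is acceptable is a design decision for the analysis.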

In [None]:
# For each treatment store, find its two nearest control stores within the same region

def neighborStores(region_treatment_data, region_control_data, region_kdtree):
    control_indexes = []
    treatment_indexes = []
    treatment_store_id = []
    control_dist = []

    for store_id in region_treatment_data['StoreID'].values:
        treatment_store_id += [store_id, store_id]
        store_id_data = region_treatment_data.query('StoreID == @store_id')
        treatment_indexes.append(store_id_data.index.values[0])
        dis, idx = region_kdtree.query(transformer.transform(store_id_data), k=2)
        control_dist += list(dis[0])
        control_indexes += list(idx[0])
    
    control_store_id = list(region_control_data.iloc[control_indexes]['StoreID'].values)
    treat_cont_dist_df = pd.DataFrame(list(zip(treatment_store_id, control_store_id, control_dist)), columns=['Treatment', 'Control', 'Distance'])

    return treat_cont_dist_df, control_indexes, treatment_indexes


In [None]:
west_df, west_control_indexes, west_treatment_indexes = neighborStores(west_region_treatment, west_region_control, west_kdtree)
central_df, central_control_indexes, central_treatment_indexes = neighborStores(central_region_treatment, central_region_control, central_kdtree)

In [None]:
west_df

In [None]:
central_df

## Step 4: Analysis and Writeup
Conduct your A/B analysis and create a short report outlining your results and recommendations.  

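In an A/B analysis, the correlation matrix is used to find the variables most strongly correlated with the performance metric. Those variables are then included as controls in the matching step (the AB Controls tool, or the KDTree search used in this notebook) so that each treatment store is paired with the most similar control stores.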
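
One possible way to carry out the comparison itself is sketched below with `ttest_ind`, which the notebook imports but does not yet use. The lift values are placeholders invented purely for illustration, not results from the Round Roasters data, and the choice of Welch's unequal-variance test is an assumption.

```python
import numpy as np
from scipy.stats import ttest_ind

# Placeholder year-over-year lifts in weekly gross margin per store.
# These numbers are invented for illustration only.
treatment_lift = np.array([0.42, 0.35, 0.51, 0.38, 0.47])
control_lift = np.array([0.05, -0.02, 0.08, 0.01, 0.03,
                         0.06, -0.01, 0.04, 0.02, 0.07])

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = ttest_ind(treatment_lift, control_lift, equal_var=False)

# Incremental lift attributable to the new menu under this comparison
incremental_lift = treatment_lift.mean() - control_lift.mean()

print(f't = {t_stat:.2f}, p = {p_value:.4f}, incremental lift = {incremental_lift:.1%}')
```

With real data, the lifts would come from comparing each store's test-period gross margin with the same weeks a year earlier, and the recommendation would weigh the incremental lift against the marketing cost.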