## Quantium - Module 2

We will compare the performance of trial stores against control stores and provide a recommendation for each location based on our findings.

Select control stores – explore the data and define metrics for control store selection – "What would make them a control store?" Visualize the drivers to see suitability.

Assessment of the trial – gain insights into each store. Compare each trial store with its control store to assess its overall performance. We want to know whether the trial stores were successful.

Collate findings – summarise findings for each store and provide recommendations to share with client outlining the impact on sales during trial period.

In [2]:
import pandas as pd
import numpy as np
import regex as re
from plotnine import *
import matplotlib.pyplot as plt
%matplotlib inline
qviData = pd.read_csv("QVI_data.csv")

In [3]:
qviData.head(10)

Unnamed: 0,LYLTY_CARD_NBR,DATE,STORE_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,PACK_SIZE,BRAND,LIFESTAGE,PREMIUM_CUSTOMER
0,1000,2018-10-17,1,1,5,Natural Chip Compny SeaSalt175g,2,6.0,175,NATURAL,YOUNG SINGLES/COUPLES,Premium
1,1002,2018-09-16,1,2,58,Red Rock Deli Chikn&Garlic Aioli 150g,1,2.7,150,RRD,YOUNG SINGLES/COUPLES,Mainstream
2,1003,2019-03-07,1,3,52,Grain Waves Sour Cream&Chives 210G,1,3.6,210,GRNWVES,YOUNG FAMILIES,Budget
3,1003,2019-03-08,1,4,106,Natural ChipCo Hony Soy Chckn175g,1,3.0,175,NATURAL,YOUNG FAMILIES,Budget
4,1004,2018-11-02,1,5,96,WW Original Stacked Chips 160g,1,1.9,160,WOOLWORTHS,OLDER SINGLES/COUPLES,Mainstream
5,1005,2018-12-28,1,6,86,Cheetos Puffs 165g,1,2.8,165,CHEETOS,MIDAGE SINGLES/COUPLES,Mainstream
6,1007,2018-12-04,1,7,49,Infuzions SourCream&Herbs Veg Strws 110g,1,3.8,110,INFUZIONS,YOUNG SINGLES/COUPLES,Budget
7,1007,2018-12-05,1,8,10,RRD SR Slow Rst Pork Belly 150g,1,2.7,150,RRD,YOUNG SINGLES/COUPLES,Budget
8,1009,2018-11-20,1,9,20,Doritos Cheese Supreme 330g,1,5.7,330,DORITOS,NEW FAMILIES,Premium
9,1010,2018-09-09,1,10,51,Doritos Mexicana 170g,2,8.8,170,DORITOS,YOUNG SINGLES/COUPLES,Mainstream


## Select control stores
The client has selected store numbers 77, 86 and 88 as trial stores and wants
control stores to be established stores that are operational for the entire
observation period.
We would want to match trial stores to control stores that are similar to the trial
store prior to the trial period of Feb 2019 in terms of :
- Monthly overall sales revenue
- Monthly number of customers
- Monthly number of transactions per customer

Let's first create the metrics of interest and filter to stores that are present
throughout the pre-trial period.

#### Calculate these measures over time for each store
#### Over to you! Add a new month ID column in the data with the format yyyymm.
data[, YEARMONTH := ]

In [4]:
qviData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264834 entries, 0 to 264833
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   LYLTY_CARD_NBR    264834 non-null  int64  
 1   DATE              264834 non-null  object 
 2   STORE_NBR         264834 non-null  int64  
 3   TXN_ID            264834 non-null  int64  
 4   PROD_NBR          264834 non-null  int64  
 5   PROD_NAME         264834 non-null  object 
 6   PROD_QTY          264834 non-null  int64  
 7   TOT_SALES         264834 non-null  float64
 8   PACK_SIZE         264834 non-null  int64  
 9   BRAND             264834 non-null  object 
 10  LIFESTAGE         264834 non-null  object 
 11  PREMIUM_CUSTOMER  264834 non-null  object 
dtypes: float64(1), int64(6), object(5)
memory usage: 24.2+ MB


In [37]:
qviData["DATE"] = pd.to_datetime(qviData["DATE"])
qviData["YEARMONTH"] = qviData["DATE"].dt.strftime("%Y%m").astype("int")

In [40]:
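# Set metrics: one row per store per month
store_ym_group = qviData.groupby(["STORE_NBR", "YEARMONTH"])
total = store_ym_group["TOT_SALES"].sum()  # monthly sales revenue
num_cust = store_ym_group["LYLTY_CARD_NBR"].nunique()  # unique customers
# Rows per customer; equals transactions per customer only if each TXN_ID occupies one row
avg_trans_per_cust = store_ym_group.size() / num_cust
# NOTE: this is chips per customer, although the column below is labelled nChipsPerTxn
avg_per_cust = store_ym_group["PROD_QTY"].sum() / num_cust
avg_price = total / store_ym_group["PROD_QTY"].sum()  # average price per unit
combine = [total, num_cust, avg_trans_per_cust, avg_per_cust, avg_price]
qvi_metrics = pd.concat(combine, axis=1)
qvi_metrics.columns = ["TOT_SALES", "nCustomers", "nTxnPerCust", "nChipsPerTxn", "avgPricePerUnit"]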

In [50]:
qvi_metrics = qvi_metrics.reset_index()

In [52]:
qvi_metrics.head(5)

Unnamed: 0,index,STORE_NBR,YEARMONTH,TOT_SALES,nCustomers,nTxnPerCust,nChipsPerTxn,avgPricePerUnit
0,0,1,201807,206.9,49,1.061224,1.265306,3.337097
1,1,1,201808,176.1,42,1.02381,1.285714,3.261111
2,2,1,201809,278.8,59,1.050847,1.271186,3.717333
3,3,1,201810,188.1,44,1.022727,1.318182,3.243103
4,4,1,201811,192.6,46,1.021739,1.23913,3.378947


In [51]:
qvi_metrics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3169 entries, 0 to 3168
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   index            3169 non-null   int64  
 1   STORE_NBR        3169 non-null   int64  
 2   YEARMONTH        3169 non-null   int64  
 3   TOT_SALES        3169 non-null   float64
 4   nCustomers       3169 non-null   int64  
 5   nTxnPerCust      3169 non-null   float64
 6   nChipsPerTxn     3169 non-null   float64
 7   avgPricePerUnit  3169 non-null   float64
dtypes: float64(4), int64(4)
memory usage: 198.2 KB


#### Next, we define the measure calculations to use during the analysis.


#### Filter to the pre-trial period and stores with full observation periods
storesWithFullObs <- unique(measureOverTime[, .N, STORE_NBR][N == 12, STORE_NBR])
preTrialMeasures <- measureOverTime[YEARMONTH < 201902 & STORE_NBR %in%
storesWithFullObs, ]

In [54]:
#pre trial observation
#filter only stores with full 12 months observation
observe_counts = qvi_metrics["STORE_NBR"].value_counts()
full_observe_index = observe_counts[observe_counts == 12].index
full_observe = qvi_metrics[qvi_metrics["STORE_NBR"].isin(full_observe_index)]
pretrial_full_observe = full_observe[full_observe["YEARMONTH"] < 201902]

pretrial_full_observe.head(5)

Unnamed: 0,index,STORE_NBR,YEARMONTH,TOT_SALES,nCustomers,nTxnPerCust,nChipsPerTxn,avgPricePerUnit
0,0,1,201807,206.9,49,1.061224,1.265306,3.337097
1,1,1,201808,176.1,42,1.02381,1.285714,3.261111
2,2,1,201809,278.8,59,1.050847,1.271186,3.717333
3,3,1,201810,188.1,44,1.022727,1.318182,3.243103
4,4,1,201811,192.6,46,1.021739,1.23913,3.378947


Now we need to work out a way of ranking how similar each potential control store
is to the trial store. We can calculate how correlated the performance of each
store is to the trial store.

Let's write a function for this so that we don't have to calculate this for each
trial store and control store pair.

#### Over to you! Create a function to calculate correlation for a measure, looping
through each control store.

#### Let's define inputTable as a metric table with potential comparison stores,
metricCol as the store metric used to calculate correlation on, and storeComparison
as the store number of the trial store.
calculateCorrelation <- function(inputTable, metricCol, storeComparison) {
calcCorrTable = data.table(Store1 = numeric(), Store2 = numeric(), corr_measure =
numeric())
storeNumbers <-
for (i in storeNumbers) {
calculatedMeasure = data.table("Store1" = ,
"Store2" = ,
"corr_measure" =
)
calcCorrTable <- rbind(calcCorrTable, calculatedMeasure)
}
return(calcCorrTable)
}
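Since the notebook cells above are written in pandas, the correlation step can be sketched in Python as follows. This is a minimal sketch, not the template's solution: the function name `calculate_correlation` and the toy table are assumptions for illustration, and the column names follow `qvi_metrics`.

```python
import pandas as pd

def calculate_correlation(input_table, metric_col, store_comparison):
    """Correlate one metric of every store with the trial store, month by month."""
    # One column per store, indexed by month, so values line up on YEARMONTH.
    wide = input_table.pivot(index="YEARMONTH", columns="STORE_NBR", values=metric_col)
    corr = wide.corrwith(wide[store_comparison])
    return pd.DataFrame({
        "Store1": store_comparison,
        "Store2": corr.index,
        "corr_measure": corr.values,
    })

# Toy pre-trial table with made-up numbers, for illustration only.
toy = pd.DataFrame({
    "STORE_NBR": [77] * 3 + [1] * 3 + [2] * 3,
    "YEARMONTH": [201807, 201808, 201809] * 3,
    "TOT_SALES": [100.0, 120.0, 140.0, 10.0, 12.0, 14.0, 30.0, 20.0, 10.0],
})
corr_sales = calculate_correlation(toy, "TOT_SALES", 77)
print(corr_sales)
```

In the toy data, store 1 rises in step with store 77 and store 2 falls, so their correlations come out at opposite ends of the scale.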

Apart from correlation, we can also calculate a standardised metric based on the
absolute difference between the trial store's performance and each control store's
performance.
Let's write a function for this.

#### Create a function to calculate a standardised magnitude distance for a measure,

#### looping through each control store
calculateMagnitudeDistance <- function(inputTable, metricCol, storeComparison) {
calcDistTable = data.table(Store1 = numeric(), Store2 = numeric(), YEARMONTH =
numeric(), measure = numeric())
storeNumbers <- unique(inputTable[, STORE_NBR])
for (i in storeNumbers) {
calculatedMeasure = data.table("Store1" = storeComparison
, "Store2" = i
, "YEARMONTH" = inputTable[STORE_NBR ==
storeComparison, YEARMONTH]
, "measure" = abs(inputTable[STORE_NBR ==
storeComparison, eval(metricCol)]
- inputTable[STORE_NBR == i,
eval(metricCol)])
)
calcDistTable <- rbind(calcDistTable, calculatedMeasure)
}

#### Standardise the magnitude distance so that the measure ranges from 0 to 1
minMaxDist <- calcDistTable[, .(minDist = min(measure), maxDist = max(measure)),
by = c("Store1", "YEARMONTH")]
distTable <- merge(calcDistTable, minMaxDist, by = c("Store1", "YEARMONTH"))
distTable[, magnitudeMeasure := 1 - (measure - minDist)/(maxDist - minDist)]
finalDistTable <- distTable[, .(mag_measure = mean(magnitudeMeasure)), by =
.(Store1, Store2)]
return(finalDistTable)
}
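A pandas version of the same magnitude measure might look like the sketch below. It assumes the `qvi_metrics` column names; the function name `calculate_magnitude_distance` and the toy numbers are illustrative only. As in the R code, the absolute gap is min-max scaled within each month across stores and then flipped so that 1 means "closest to the trial store".

```python
import pandas as pd

def calculate_magnitude_distance(input_table, metric_col, store_comparison):
    """Average standardised closeness (0 to 1) of each store to the trial store."""
    wide = input_table.pivot(index="YEARMONTH", columns="STORE_NBR", values=metric_col)
    # Absolute gap to the trial store, per month and per store.
    dist = wide.sub(wide[store_comparison], axis=0).abs()
    # Min-max scale within each month, then flip so that 1 = identical.
    min_d = dist.min(axis=1)
    max_d = dist.max(axis=1)
    scaled = 1 - dist.sub(min_d, axis=0).div(max_d - min_d, axis=0)
    mag = scaled.mean(axis=0)
    return pd.DataFrame({
        "Store1": store_comparison,
        "Store2": mag.index,
        "mag_measure": mag.values,
    })

# Toy pre-trial table with made-up numbers, for illustration only.
toy = pd.DataFrame({
    "STORE_NBR": [77] * 3 + [1] * 3 + [2] * 3,
    "YEARMONTH": [201807, 201808, 201809] * 3,
    "TOT_SALES": [100.0, 120.0, 140.0, 90.0, 110.0, 130.0, 50.0, 70.0, 90.0],
})
mag_sales = calculate_magnitude_distance(toy, "TOT_SALES", 77)
print(mag_sales)
```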

Now let's use the functions to find the control stores! We'll select control stores
based on how similar monthly total sales in dollar amounts and monthly number of
customers are to the trial stores. So we will need to use our functions to get four
scores, two for each of total sales and total customers.

#### Over to you! Use the function you created to calculate correlations against
store 77 using total sales and number of customers.

#### Hint: Refer back to the input names of the functions we created.
trial_store <-
corr_nSales <- calculateCorrelation(, quote(), )
corr_nCustomers <- calculateCorrelation(, quote(), )

#### Then, use the functions for calculating magnitude.
magnitude_nSales <- calculateMagnitudeDistance(preTrialMeasures, quote(totSales),
trial_store)
magnitude_nCustomers <- calculateMagnitudeDistance(preTrialMeasures,
quote(nCustomers), trial_store)

We'll need to combine all the scores calculated using our functions to create a
composite score to rank on.
Let's take a simple average of the correlation and magnitude scores for each
driver. Note that if we consider it more important for the trend of the drivers to
be similar, we can increase the weight of the correlation score (a simple average
gives a weight of 0.5 to the corr_weight) or if we consider the absolute size of
the drivers to be more important, we can lower the weight of the correlation score.

#### Over to you! Create a combined score composed of correlation and magnitude, by
first merging the correlations table with the magnitude table.
#### Hint: A simple average on the scores would be 0.5 * corr_measure + 0.5 *
mag_measure
corr_weight <- 0.5
score_nSales <- merge(, , by = )[, scoreNSales := ]
score_nCustomers <- merge(, , by = )[, scoreNCust := ]
Now we have a score for each of total sales and number of customers.
Let's combine the two via a simple average.

#### Over to you! Combine scores across the drivers by first merging our sales
scores and customer scores into a single table
score_Control <- merge(, , by = )
score_Control[, finalControlScore := scoreNSales * 0.5 + scoreNCust * 0.5]
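The weighted averaging described above can be sketched in pandas as follows. The helper `combine_scores` and all score values are made up for illustration; only the 0.5 weights and the column names come from the template.

```python
import pandas as pd

corr_weight = 0.5  # weight on correlation; the remainder goes to magnitude

# Made-up correlation and magnitude scores against trial store 77.
corr_sales = pd.DataFrame({"Store1": 77, "Store2": [77, 1, 2], "corr_measure": [1.0, 0.9, -0.2]})
mag_sales = pd.DataFrame({"Store1": 77, "Store2": [77, 1, 2], "mag_measure": [1.0, 0.7, 0.6]})
corr_cust = pd.DataFrame({"Store1": 77, "Store2": [77, 1, 2], "corr_measure": [1.0, 0.8, 0.4]})
mag_cust = pd.DataFrame({"Store1": 77, "Store2": [77, 1, 2], "mag_measure": [1.0, 0.6, 0.2]})

def combine_scores(corr, mag, weight, name):
    """Merge the two score tables and take their weighted average."""
    out = corr.merge(mag, on=["Store1", "Store2"])
    out[name] = weight * out["corr_measure"] + (1 - weight) * out["mag_measure"]
    return out[["Store1", "Store2", name]]

score_sales = combine_scores(corr_sales, mag_sales, corr_weight, "scoreNSales")
score_cust = combine_scores(corr_cust, mag_cust, corr_weight, "scoreNCust")

# Simple average across the two drivers.
score_control = score_sales.merge(score_cust, on=["Store1", "Store2"])
score_control["finalControlScore"] = (
    0.5 * score_control["scoreNSales"] + 0.5 * score_control["scoreNCust"]
)
print(score_control)
```

Raising `corr_weight` above 0.5 favours stores whose trend matches the trial store; lowering it favours stores of similar absolute size.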

The store with the highest score is then selected as the control store since it is
most similar to the trial store.

#### Select control stores based on the highest matching store (closest to 1 but
#### not the store itself, i.e. the second ranked highest store)
#### Over to you! Select the most appropriate control store for trial store 77 by
finding the store with the highest final score.
control_store <-
control_store
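One way to pick the control store in pandas is sketched below, with made-up scores. Note the swapped-in technique: rather than taking the second-ranked row, this version drops the trial store explicitly and takes the maximum, which gives the same answer whenever the trial store ranks first.

```python
import pandas as pd

trial_store = 77

# Made-up final scores, for illustration only.
score_control = pd.DataFrame({
    "Store1": trial_store,
    "Store2": [77, 1, 2, 3],
    "finalControlScore": [1.0, 0.75, 0.25, 0.6],
})

# Exclude the trial store itself, then take the best remaining match.
candidates = score_control[score_control["Store2"] != trial_store]
best_row = candidates["finalControlScore"].idxmax()
control_store = int(candidates.loc[best_row, "Store2"])
print(control_store)
```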

Now that we have found a control store, let's check visually if the drivers are
indeed similar in the period before the trial.
We'll look at total sales first.

measureOverTimeSales <- measureOverTime
pastSales <- measureOverTimeSales[, Store_type := ifelse(STORE_NBR == trial_store,
"Trial",
ifelse(STORE_NBR == control_store,
"Control", "Other stores"))
][, totSales := mean(totSales), by = c("YEARMONTH",
"Store_type")
][, TransactionMonth := as.Date(paste(YEARMONTH %/%
100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
][YEARMONTH < 201903 , ]
ggplot(pastSales, aes(TransactionMonth, totSales, color = Store_type)) +
geom_line() +
labs(x = "Month of operation", y = "Total sales", title = "Total sales by month")
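The data preparation behind this plot could be sketched in pandas as below; the resulting table can then be passed to `ggplot` from plotnine, which the notebook already imports. The toy numbers are made up, and control store 233 is taken from the conclusion of this document.

```python
import numpy as np
import pandas as pd

trial_store, control_store = 77, 233

# Toy monthly metrics with made-up numbers, for illustration only.
metrics = pd.DataFrame({
    "STORE_NBR": [77] * 3 + [233] * 3 + [1] * 3 + [2] * 3,
    "YEARMONTH": [201901, 201902, 201903] * 4,
    "TOT_SALES": [100.0, 110.0, 120.0, 90.0, 95.0, 99.0,
                  10.0, 20.0, 30.0, 30.0, 40.0, 50.0],
})

# Label each row as Trial, Control or Other stores.
metrics["Store_type"] = np.select(
    [metrics["STORE_NBR"] == trial_store, metrics["STORE_NBR"] == control_store],
    ["Trial", "Control"],
    default="Other stores",
)

# Mean sales per month and store type, restricted to months before March 2019.
past_sales = (
    metrics[metrics["YEARMONTH"] < 201903]
    .groupby(["YEARMONTH", "Store_type"], as_index=False)["TOT_SALES"]
    .mean()
)
# Turn the yyyymm integer into a date on the first of the month for plotting.
past_sales["TransactionMonth"] = pd.to_datetime(
    past_sales["YEARMONTH"].astype(str), format="%Y%m"
)
print(past_sales)
```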

Next, number of customers.
#### Over to you! Conduct visual checks on customer count trends by comparing the
trial store to the control store and other stores.
#### Hint: Look at the previous plot.
measureOverTimeCusts <- measureOverTime
pastCustomers <- measureOverTimeCusts[,
][,
][,
][, ]
ggplot(, aes(, , color = )) +
geom_line() +
labs(x = , y = , title = )

## Assessment of trial
The trial period goes from the start of February 2019 to April 2019. We now want to
see if there has been an uplift in overall chip sales.
We'll start by scaling the control store's sales to a level similar to the trial
store's, to control for any differences between the two stores outside of the trial period.

#### Scale pre-trial control sales to match pre-trial trial store sales
scalingFactorForControlSales <- preTrialMeasures[STORE_NBR == trial_store &
YEARMONTH < 201902, sum(totSales)]/preTrialMeasures[STORE_NBR == control_store &
YEARMONTH < 201902, sum(totSales)]

#### Apply the scaling factor
measureOverTimeSales <- measureOverTime
scaledControlSales <- measureOverTimeSales[STORE_NBR == control_store, ][ ,
controlSales := totSales * scalingFactorForControlSales]
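In pandas, the scaling step might look like the sketch below. The numbers are made up, and store 233 is used as the control store as stated in the conclusion.

```python
import pandas as pd

trial_store, control_store = 77, 233

# Toy monthly sales with made-up numbers, for illustration only.
metrics = pd.DataFrame({
    "STORE_NBR": [77] * 4 + [233] * 4,
    "YEARMONTH": [201812, 201901, 201902, 201903] * 2,
    "TOT_SALES": [200.0, 220.0, 260.0, 300.0, 100.0, 110.0, 105.0, 100.0],
})

# Ratio of pre-trial totals: trial store over control store.
pre = metrics[metrics["YEARMONTH"] < 201902]
trial_total = pre.loc[pre["STORE_NBR"] == trial_store, "TOT_SALES"].sum()
control_total = pre.loc[pre["STORE_NBR"] == control_store, "TOT_SALES"].sum()
scaling_factor = trial_total / control_total

# Apply the factor to every month of the control store.
scaled_control = metrics[metrics["STORE_NBR"] == control_store].copy()
scaled_control["controlSales"] = scaled_control["TOT_SALES"] * scaling_factor
print(scaling_factor)
print(scaled_control)
```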

Now that we have comparable sales figures for the control store, we can calculate
the percentage difference between the scaled control sales and the trial store's
sales during the trial period.

#### Over to you! Calculate the percentage difference between scaled control sales
and trial sales
percentageDiff <- merge(,
,
by =
)[, percentageDiff := ]
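A pandas sketch of the percentage difference follows, using made-up numbers. It uses the absolute difference relative to scaled control sales, as the hint later in this document suggests.

```python
import pandas as pd

# Made-up scaled control sales and trial store sales, for illustration only.
scaled_control = pd.DataFrame({
    "YEARMONTH": [201812, 201901, 201902, 201903],
    "controlSales": [200.0, 220.0, 210.0, 200.0],
})
trial = pd.DataFrame({
    "YEARMONTH": [201812, 201901, 201902, 201903],
    "TOT_SALES": [190.0, 231.0, 262.5, 300.0],
})

# Join on month, then take the absolute gap relative to scaled control sales.
percentage_diff = scaled_control.merge(trial, on="YEARMONTH")
percentage_diff["percentageDiff"] = (
    (percentage_diff["controlSales"] - percentage_diff["TOT_SALES"]).abs()
    / percentage_diff["controlSales"]
)
print(percentage_diff)
```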

Let's see if the difference is significant!

#### As our null hypothesis is that the trial period is the same as the pre-trial period, let's take the standard deviation based on the scaled percentage difference in the pre-trial period
stdDev <- sd(percentageDiff[YEARMONTH < 201902 , percentageDiff])

#### Note that there are 8 months in the pre-trial period
#### hence 8 - 1 = 7 degrees of freedom
degreesOfFreedom <- 7

#### We will test with a null hypothesis of there being 0 difference between trial
and control stores.
#### Over to you! Calculate the t-values for the trial months. After that, find the
#### 95th percentile of the t distribution with the appropriate degrees of freedom
#### to check whether the hypothesis is statistically significant.
#### Hint: The test statistic here is (x - u)/standard deviation
percentageDiff[, tValue :=
][, TransactionMonth :=
][, .()]
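The test statistic can be sketched in pandas as below, with made-up percentage differences. The critical value is hard-coded as an approximation of the 95th percentile of the t distribution with 7 degrees of freedom, as used in the text; with SciPy installed, `scipy.stats.t.ppf(0.95, 7)` would compute it.

```python
import pandas as pd

DEGREES_OF_FREEDOM = 7
T_CRIT_95 = 1.895  # approx. 95th percentile of t with 7 degrees of freedom

# Made-up percentage differences: 8 pre-trial months, then 3 trial months.
percentage_diff = pd.DataFrame({
    "YEARMONTH": [201806, 201807, 201808, 201809, 201810, 201811, 201812, 201901,
                  201902, 201903, 201904],
    "percentageDiff": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08,
                       0.03, 0.10, 0.20],
})

# Sample standard deviation (ddof=1, like R's sd()) over the pre-trial months.
pre = percentage_diff[percentage_diff["YEARMONTH"] < 201902]
std_dev = pre["percentageDiff"].std()

# t-value per trial month under the null hypothesis of zero difference.
in_trial = percentage_diff["YEARMONTH"].between(201902, 201904)
trial_months = percentage_diff[in_trial].copy()
trial_months["tValue"] = (trial_months["percentageDiff"] - 0) / std_dev
trial_months["significant"] = trial_months["tValue"] > T_CRIT_95
print(trial_months)
```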

We can observe that the t-value is much larger than the 95th percentile value of
the t-distribution for March and April - i.e. the increase in sales in the trial
store in March and April is statistically greater than in the control store.
Let's create a more visual version of this by plotting the sales of the control
store, the sales of the trial stores and the 95th percentile value of sales of the
control store.

measureOverTimeSales <- measureOverTime
#### Trial and control store total sales
#### Over to you! Create new variables Store_type, totSales and TransactionMonth in
the data table.
pastSales <- measureOverTimeSales[, Store_type :=
][, totSales :=
][, TransactionMonth :=
][Store_type %in% c("Trial", "Control"), ]
#### Control store 95th percentile
pastSales_Controls95 <- pastSales[Store_type == "Control",
][, totSales := totSales * (1 + stdDev * 2)
][, Store_type := "Control 95th % confidence interval"]
#### Control store 5th percentile
pastSales_Controls5 <- pastSales[Store_type == "Control",
][, totSales := totSales * (1 - stdDev * 2)
][, Store_type := "Control 5th % confidence interval"]
trialAssessment <- rbind(pastSales, pastSales_Controls95, pastSales_Controls5)
#### Plotting these in one nice graph
ggplot(trialAssessment, aes(TransactionMonth, totSales, color = Store_type)) +
geom_rect(data = trialAssessment[ YEARMONTH < 201905 & YEARMONTH > 201901 ,],
aes(xmin = min(TransactionMonth), xmax = max(TransactionMonth), ymin = 0 , ymax =
Inf, color = NULL), show.legend = FALSE) +
geom_line() +
labs(x = "Month of operation", y = "Total sales", title = "Total sales by month")
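The band construction used for this plot can be sketched in pandas as follows. The sales figures and `std_dev` are made up; the band is control sales scaled up and down by two standard deviations of the pre-trial percentage difference, as in the R code.

```python
import pandas as pd

std_dev = 0.05  # made-up pre-trial standard deviation of the percentage difference

# Made-up control store sales, for illustration only.
control = pd.DataFrame({
    "YEARMONTH": [201902, 201903],
    "TOT_SALES": [200.0, 300.0],
    "Store_type": "Control",
})

# Upper and lower bands: control sales scaled by +/- 2 standard deviations.
upper = control.assign(
    TOT_SALES=control["TOT_SALES"] * (1 + 2 * std_dev),
    Store_type="Control 95th % confidence interval",
)
lower = control.assign(
    TOT_SALES=control["TOT_SALES"] * (1 - 2 * std_dev),
    Store_type="Control 5th % confidence interval",
)

# Stack everything into one long table, ready for a line plot by Store_type.
trial_assessment = pd.concat([control, upper, lower], ignore_index=True)
print(trial_assessment)
```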

The results show that the trial in store 77 is significantly different to its
control store in the trial period as the trial store performance lies outside the
5% to 95% confidence interval of the control store in two of the three trial
months.
Let's have a look at assessing this for number of customers as well.
#### This would be a repeat of the steps before for total sales
#### Scale pre-trial control customers to match pre-trial trial store customers
#### Over to you! Compute a scaling factor to align control store customer counts
to our trial store.
#### Then, apply the scaling factor to control store customer counts.
#### Finally, calculate the percentage difference between scaled control store
customers and trial customers.
scalingFactorForControlCust <-
measureOverTimeCusts <- measureOverTime
scaledControlCustomers <- measureOverTimeCusts[,
][, controlCustomers :=
][, Store_type :=
]
percentageDiff <-

Let's again see if the difference is significant visually!

#### As our null hypothesis is that the trial period is the same as the pre-trial
period, let's take the standard deviation based on the scaled percentage difference
in the pre-trial period
stdDev <- sd(percentageDiff[YEARMONTH < 201902 , percentageDiff])
degreesOfFreedom <- 7

#### Trial and control store number of customers
pastCustomers <- measureOverTimeCusts[, nCusts := mean(nCustomers), by =
c("YEARMONTH", "Store_type")
][Store_type %in% c("Trial", "Control"), ]

#### Control store 95th percentile
pastCustomers_Controls95 <- pastCustomers[Store_type == "Control",
][, nCusts := nCusts * (1 + stdDev * 2)
][, Store_type := "Control 95th % confidence interval"]

#### Control store 5th percentile
pastCustomers_Controls5 <- pastCustomers[Store_type == "Control",
][, nCusts := nCusts * (1 - stdDev * 2)
][, Store_type := "Control 5th % confidence interval"]
trialAssessment <- rbind(pastCustomers, pastCustomers_Controls95,
pastCustomers_Controls5)

#### Over to you! Plot everything into one nice graph.
#### Hint: geom_rect creates a rectangle in the plot. Use this to highlight the
trial period in our graph.
ggplot() +
geom_rect(data = , aes(xmin = , xmax = , ymin = , ymax = , color = ),
show.legend = FALSE) +
geom_line() +
labs()

Let's repeat finding the control store and assessing the impact of the trial for
each of the other two trial stores.

## Trial store 86

#### Over to you! Calculate the metrics below as we did for the first trial store.
measureOverTime <- data[, .(totSales = ,
nCustomers = ,
nTxnPerCust = ,
nChipsPerTxn = ,
avgPricePerUnit =
)
, by = ][order(, )]

#### Over to you! Use the functions we created earlier to calculate correlations
and magnitude for each potential control store
trial_store <- 86
corr_nSales <-
corr_nCustomers <-
magnitude_nSales <-
magnitude_nCustomers <-

#### Now, create a combined score composed of correlation and magnitude
corr_weight <- 0.5
score_nSales <-
score_nCustomers <-

#### Finally, combine scores across the drivers using a simple average.
score_Control <-
score_Control[, finalControlScore := ]

#### Select control stores based on the highest matching store
#### (closest to 1 but not the store itself, i.e. the second ranked highest store)
#### Select control store for trial store 86
control_store <- score_Control[Store1 == trial_store,
][order(-finalControlScore)][2, Store2]
control_store

Looks like store 155 will be a control store for trial store 86.
Again, let's check visually if the drivers are indeed similar in the period before
the trial.
We'll look at total sales first.

#### Over to you! Conduct visual checks on trends based on the drivers
measureOverTimeSales <- measureOverTime
pastSales <- measureOverTimeSales[, Store_type :=
][, totSales := , by =
][, TransactionMonth :=
][YEARMONTH < 201903 , ]
ggplot() +
geom_line() +
labs()

Great, sales are trending in a similar way.
Next, number of customers.

#### Over to you again! Conduct visual checks on trends based on the drivers
measureOverTimeCusts <- measureOverTime
pastCustomers <- measureOverTimeCusts[, Store_type :=
][, numberCustomers := , by =
][, TransactionMonth :=
][YEARMONTH < 201903 , ]
ggplot() +
geom_line() +
labs()

Good, the trend in number of customers is also similar.
Let's now assess the impact of the trial on sales.

#### Scale pre-trial control sales to match pre-trial trial store sales
scalingFactorForControlSales <- preTrialMeasures[STORE_NBR == trial_store &
YEARMONTH < 201902, sum(totSales)]/preTrialMeasures[STORE_NBR == control_store &
YEARMONTH < 201902, sum(totSales)]

#### Apply the scaling factor
measureOverTimeSales <- measureOverTime
scaledControlSales <- measureOverTimeSales[STORE_NBR == control_store, ][ ,
controlSales := totSales * scalingFactorForControlSales]

#### Over to you! Calculate the percentage difference between scaled control sales
and trial sales
#### Hint: When calculating percentage difference, remember to use absolute
difference
percentageDiff <- merge(,
,
by = "YEARMONTH"
)[, percentageDiff := ]

#### As our null hypothesis is that the trial period is the same as the pre-trial
period, let's take the standard deviation based on the scaled percentage difference
in the pre-trial period
#### Over to you! Calculate the standard deviation of percentage differences during
the pre-trial period
stdDev <-
degreesOfFreedom <- 7

#### Trial and control store total sales
#### Over to you! Create a table with sales by store type and month.
#### Hint: We only need data for the trial and control store.
measureOverTimeSales <- measureOverTime
pastSales <- measureOverTimeSales[, Store_type :=
][, totSales := , by =
][, TransactionMonth :=
][, ]

#### Over to you! Calculate the 5th and 95th percentile for control store sales.
#### Hint: The 5th and 95th percentiles can be approximated by using two standard
deviations away from the mean.
#### Hint2: Recall that the variable stdDev earlier calculates standard deviation
in percentages, and not dollar sales.
pastSales_Controls95 <- pastSales[Store_type == ,
][, totSales :=
][, Store_type := "Control 95th % confidence interval"]
pastSales_Controls5 <- pastSales[Store_type == ,
][, totSales :=
][, Store_type := "Control 5th % confidence interval"]

#### Then, create a combined table with columns from pastSales,
pastSales_Controls95 and pastSales_Controls5
trialAssessment <-

#### Plotting these in one nice graph
ggplot(trialAssessment, aes(TransactionMonth, totSales, color = Store_type)) +
geom_rect(data = trialAssessment[ YEARMONTH < 201905 & YEARMONTH > 201901 ,],
aes(xmin = min(TransactionMonth), xmax = max(TransactionMonth), ymin = 0 , ymax =
Inf, color = NULL), show.legend = FALSE) +
geom_line(aes(linetype = Store_type)) +
labs(x = "Month of operation", y = "Total sales", title = "Total sales by month")

The results show that the trial in store 86 is not significantly different to its
control store in the trial period as the trial store performance lies inside the 5%
to 95% confidence interval of the control store in two of the three trial months.
Let's have a look at assessing this for the number of customers as well.

#### This would be a repeat of the steps before for total sales
#### Scale pre-trial control customers to match pre-trial trial store customers
scalingFactorForControlCust <- preTrialMeasures[STORE_NBR == trial_store &
YEARMONTH < 201902, sum(nCustomers)]/preTrialMeasures[STORE_NBR == control_store &
YEARMONTH < 201902, sum(nCustomers)]

#### Apply the scaling factor
measureOverTimeCusts <- measureOverTime
scaledControlCustomers <- measureOverTimeCusts[STORE_NBR == control_store,
][ , controlCustomers := nCustomers
* scalingFactorForControlCust
][, Store_type := ifelse(STORE_NBR
== trial_store, "Trial",
ifelse(STORE_NBR == control_store,
"Control", "Other stores"))
]

#### Calculate the percentage difference between scaled control customers and
#### trial customers
percentageDiff <- merge(scaledControlCustomers[, c("YEARMONTH",
"controlCustomers")],
measureOverTime[STORE_NBR == trial_store, c("nCustomers",
"YEARMONTH")],
by = "YEARMONTH"
)[, percentageDiff :=
abs(controlCustomers-nCustomers)/controlCustomers]

#### As our null hypothesis is that the trial period is the same as the pre-trial
period, let's take the standard deviation based on the scaled percentage difference
in the pre-trial period
stdDev <- sd(percentageDiff[YEARMONTH < 201902 , percentageDiff])
degreesOfFreedom <- 7

#### Trial and control store number of customers
pastCustomers <- measureOverTimeCusts[, nCusts := mean(nCustomers), by =
c("YEARMONTH", "Store_type")
][Store_type %in% c("Trial", "Control"), ]

#### Control store 95th percentile
pastCustomers_Controls95 <- pastCustomers[Store_type == "Control",
][, nCusts := nCusts * (1 + stdDev * 2)
][, Store_type := "Control 95th % confidence interval"]

#### Control store 5th percentile
pastCustomers_Controls5 <- pastCustomers[Store_type == "Control",
][, nCusts := nCusts * (1 - stdDev * 2)
][, Store_type := "Control 5th % confidence interval"]
trialAssessment <- rbind(pastCustomers, pastCustomers_Controls95,
pastCustomers_Controls5)

#### Plotting these in one nice graph
ggplot(trialAssessment, aes(TransactionMonth, nCusts, color = Store_type)) +
geom_rect(data = trialAssessment[ YEARMONTH < 201905 & YEARMONTH > 201901 ,],
aes(xmin = min(TransactionMonth), xmax = max(TransactionMonth), ymin = 0 , ymax =
Inf, color = NULL), show.legend = FALSE) +
geom_line() +
labs(x = "Month of operation", y = "Total number of customers",
title = "Total number of customers by month")

It looks like the number of customers is significantly higher in all three trial
months. This suggests that the trial had a significant impact on increasing
the number of customers in trial store 86 but, as we saw, sales were not
significantly higher. We should check with the Category Manager if there were
special deals in the trial store that may have resulted in lower prices,
impacting the results.

## Trial store 88

#### All over to you now! Your manager has left for a conference call, so you'll be
on your own this time.
#### Conduct the analysis on trial store 88.
measureOverTime <-

#### Use the functions from earlier to calculate the correlation of the sales and
number of customers of each potential control store to the trial store
trial_store <- 88
corr_nSales <-
corr_nCustomers <-

#### Use the functions from earlier to calculate the magnitude distance of the
sales and number of customers of each potential control store to the trial store
magnitude_nSales <-
magnitude_nCustomers <-

#### Create a combined score composed of correlation and magnitude by merging the
correlations table and the magnitudes table, for each driver.
corr_weight <- 0.5
score_nSales <-
score_nCustomers <-

#### Combine scores across the drivers by merging sales scores and customer scores,
and compute a final combined score.
score_Control <-
score_Control[, finalControlScore := ]

#### Select control stores based on the highest matching store
#### (closest to 1 but not the store itself, i.e. the second ranked highest store)
#### Select control store for trial store 88
control_store <-
control_store

We've now found store 237 to be a suitable control store for trial store 88.
Again, let's check visually if the drivers are indeed similar in the period before
the trial.
We'll look at total sales first.

#### Visual checks on trends based on the drivers
#### For the period before the trial, create a graph with total sales of the trial
store for each month, compared to the control store and other stores.
measureOverTimeSales <- measureOverTime
pastSales <-
ggplot() +
geom_line() +
labs()

Great, the trial and control stores have similar total sales.
Next, number of customers.

#### Visual checks on trends based on the drivers
#### For the period before the trial, create a graph with customer counts of the
trial store for each month, compared to the control store and other stores.
measureOverTimeCusts <- measureOverTime
pastCustomers <-
ggplot() +
geom_line() +
labs()

Total number of customers of the control and trial stores are also similar.
Let's now assess the impact of the trial on sales.

#### Scale pre-trial control store sales to match pre-trial trial store sales
scalingFactorForControlSales <-

#### Apply the scaling factor
measureOverTimeSales <- measureOverTime
scaledControlSales <-

#### Calculate the absolute percentage difference between scaled control sales and
trial sales
percentageDiff <-

#### As our null hypothesis is that the trial period is the same as the pre-trial
period, let's take the standard deviation based on the scaled percentage difference
in the pre-trial period
stdDev <-
degreesOfFreedom <- 7

#### Trial and control store total sales
measureOverTimeSales <- measureOverTime
pastSales <-

#### Control store 95th percentile
pastSales_Controls95 <-

#### Control store 5th percentile
pastSales_Controls5 <-

#### Combine the tables pastSales, pastSales_Controls95, pastSales_Controls5
trialAssessment <-

#### Plot these in one nice graph
ggplot() +
geom_rect() +
geom_line() +
labs()

The results show that the trial in store 88 is significantly different to its
control store in the trial period as the trial store performance lies outside of
the 5% to 95% confidence interval of the control store in two of the three trial
months.
Let's have a look at assessing this for number of customers as well.

#### This would be a repeat of the steps before for total sales
#### Scale pre-trial control store customers to match pre-trial trial store
customers
scalingFactorForControlCust <-

#### Apply the scaling factor
measureOverTimeCusts <- measureOverTime
scaledControlCustomers <-

#### Calculate the absolute percentage difference between scaled control customers
#### and trial customers
percentageDiff <-

#### As our null hypothesis is that the trial period is the same as the pre-trial
period, let's take the standard deviation based on the scaled percentage difference
in the pre-trial period
stdDev <-
#### Note that there are 8 months in the pre-trial period
#### hence 8 - 1 = 7 degrees of freedom
degreesOfFreedom <- 7

#### Trial and control store number of customers
pastCustomers <-

#### Control store 95th percentile
pastCustomers_Controls95 <-

#### Control store 5th percentile
pastCustomers_Controls5 <-

#### Combine the tables pastCustomers, pastCustomers_Controls95, pastCustomers_Controls5
trialAssessment <-

#### Plotting these in one nice graph
ggplot() +
geom_rect() +
geom_line() +
labs()

Total number of customers in the trial period for the trial store is significantly
higher than the control store for two out of three months, which indicates a
positive trial effect.

## Conclusion
Good work! We've found control stores 233, 155, 237 for trial stores 77, 86 and 88
respectively.
The results for trial stores 77 and 88 during the trial period show a significant
difference in at least two of the three trial months but this is not the case for
trial store 86. We can check with the client if the implementation of the trial was
different in trial store 86 but overall, the trial shows a significant increase in
sales. Now that we have finished our analysis, we can prepare our presentation to
the Category Manager.