# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) 2.05 - A/B Testing Lab

---

**Use $\alpha = 0.05$ for any hypothesis tests!**

Exercise 1: Describe the difference between a one-sample $t$-test and a two-sample $t$-test.

**Answer 1:** A one-sample $t$-test is appropriate when we want to compare one sample's mean to a baseline number and test whether that number is a likely value for the population mean. A two-sample $t$-test is appropriate when we want to compare two samples' means and test whether it is likely that the two population means could be the same number.
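
As a minimal sketch of this distinction, the two tests can be run side by side on synthetic data. The sample values and the baseline of 50 are assumptions invented purely for illustration; they are not part of the lab data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)

# Hypothetical samples, generated only for illustration.
sample_a = rng.normal(loc=52, scale=5, size=40)
sample_b = rng.normal(loc=50, scale=5, size=40)

# One-sample t-test: is 50 a plausible value for the population mean behind sample_a?
one_sample = stats.ttest_1samp(sample_a, popmean=50)

# Two-sample t-test: could the two populations share the same mean?
two_sample = stats.ttest_ind(sample_a, sample_b)

print("one-sample p-value:", one_sample.pvalue)
print("two-sample p-value:", two_sample.pvalue)
```

The one-sample test takes a single array plus a fixed number, while the two-sample test takes two arrays and no baseline.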

Exercise 2: Describe the difference between a two-sample $t$-test and a matched pairs $t$-test.

**Answer 2:** A two-sample $t$-test is appropriate when we want to compare two **independent** samples' means and test whether it is likely that the two populations could have the same mean. A matched pairs $t$-test is appropriate when we want to compare two **dependent** samples' means and test whether it is likely that the two populations could have the same mean.
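
One way to see what "dependent" means in practice: a matched pairs $t$-test is a one-sample $t$-test on the per-pair differences. The sketch below uses invented before/after scores for the same 30 hypothetical people; none of these numbers come from the lab data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Hypothetical pre-test/post-test scores for the same 30 people (dependent samples).
before = rng.normal(loc=70, scale=10, size=30)
after = before + rng.normal(loc=2, scale=3, size=30)

# Matched pairs t-test: uses the pairing between the two arrays.
paired = stats.ttest_rel(after, before)

# The same test written as a one-sample t-test on the differences against 0.
diff_test = stats.ttest_1samp(after - before, popmean=0)

# Two-sample t-test: treats the arrays as independent and ignores the pairing.
independent = stats.ttest_ind(after, before)

print("paired p-value:     ", paired.pvalue)
print("differences p-value:", diff_test.pvalue)
print("independent p-value:", independent.pvalue)
```

The first two p-values match, while the independent-samples test generally gives a different answer because it discards the pairing information.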

Exercise 3: Describe the difference between a two-sample $t$-test and an ANOVA test.

**Answer 3:** A two-sample $t$-test is appropriate when we want to compare **exactly two** independent samples' means and test whether it is likely that the two populations could have the same mean. An ANOVA test is appropriate when we want to compare **two or more** independent samples' means and test whether it is likely that all populations could have the same mean.
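
With exactly two groups, the two tests agree: the ANOVA $F$ statistic equals the square of the pooled-variance two-sample $t$ statistic, and the p-values are the same. A small sketch on synthetic data (the group values are assumptions made up for illustration):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)

# Two hypothetical independent groups.
group_1 = rng.normal(loc=10, scale=2, size=25)
group_2 = rng.normal(loc=11, scale=2, size=30)

# Two-sample t-test (pooled variance, the default for ttest_ind).
t_result = stats.ttest_ind(group_1, group_2)

# One-way ANOVA on the same two groups.
f_result = stats.f_oneway(group_1, group_2)

print("t squared:", t_result.statistic ** 2)
print("F:        ", f_result.statistic)
print("p-values: ", t_result.pvalue, f_result.pvalue)
```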

Exercise 4: In the `data` folder, we have a sample of Walmart stores data. Read in the `stores.csv` data. If I wanted to test whether the average size was different across types, what is the appropriate test? Why?

In [1]:
import pandas as pd

In [2]:
stores = pd.read_csv("./data/stores.csv")

In [3]:
stores.head()

|   | Store | Type | Size |
|---|---|---|---|
| 0 | 1 | A | 151315 |
| 1 | 2 | A | 202307 |
| 2 | 3 | B | 37392 |
| 3 | 4 | A | 205863 |
| 4 | 5 | B | 34875 |


In [4]:
stores['Type'].value_counts()

A    22
B    17
C     6
Name: Type, dtype: int64

**Answer 4:** Since there are three types of stores (A, B, and C), testing whether the average size differs across types means comparing the means of three independent samples. The appropriate test is therefore an ANOVA test.

Exercise 5: Conduct the hypothesis test from Exercise 4. Be sure to state your hypotheses, find your p-value, and make the appropriate conclusion.

In [5]:
import scipy.stats as stats

In [6]:
stats.f_oneway(stores[stores['Type'] == 'A']['Size'],
               stores[stores['Type'] == 'B']['Size'],
               stores[stores['Type'] == 'C']['Size'])

F_onewayResult(statistic=34.348221948830279, pvalue=1.4502004628034625e-09)

**Answer 5:**

- Hypotheses: $H_0: \mu_A = \mu_B = \mu_C$ vs. $H_A$: at least one $\mu_i \neq \mu_j$

- $p$-value: $1.45 \times 10^{-9}$

- Conclusion: $p < \alpha$, so we reject $H_0$ and can conclude that not all three population means are equal.
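
The same test can also be set up without writing one filter per store type. The sketch below is one possible approach, shown on a tiny made-up DataFrame named `toy` that only mimics the `Type` and `Size` columns; the values are hypothetical and are not the real store sizes.

```python
import pandas as pd
import scipy.stats as stats

# Hypothetical miniature stand-in for the stores data.
toy = pd.DataFrame({
    "Type": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Size": [150, 200, 205, 100, 120, 110, 40, 42, 39],
})

# Build one array of sizes per store type, then unpack them into f_oneway.
groups = [group["Size"].to_numpy() for _, group in toy.groupby("Type")]
result = stats.f_oneway(*groups)

# Compare the p-value to the significance level to make the decision.
alpha = 0.05
reject = result.pvalue < alpha
print(result.statistic, result.pvalue, reject)
```

Grouping this way keeps working if a new store type appears in the data, at the cost of being slightly less explicit than the three filters above.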

Exercise 6: Read in the `train.csv` data. Suppose I want to test whether average weekly sales differs among the types of stores. What is the appropriate test? Why?

In [7]:
train = pd.read_csv("./data/train.csv")

In [8]:
train.head()

|   | Store | Dept | Date | Weekly_Sales | IsHoliday |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 2010-02-05 | 24924.5 | False |
| 1 | 1 | 1 | 2010-02-12 | 46039.49 | True |
| 2 | 1 | 1 | 2010-02-19 | 41595.55 | False |
| 3 | 1 | 1 | 2010-02-26 | 19403.54 | False |
| 4 | 1 | 1 | 2010-03-05 | 21827.9 | False |


**Answer 6:** Since there are three types of stores, if we want to test whether the average weekly sales is different across types, we would have to do an ANOVA test. Nothing has really changed from the last time - we're still comparing three populations, just with different numbers.

Exercise 7: Conduct the hypothesis test from Exercise 6. Be sure to state your hypotheses, find your p-value, and make the appropriate conclusion. (Hint: Think about how you need to munge your data here. It'll likely take a few steps!)

In [9]:
combined = train.merge(stores, how = 'left', on = 'Store')

In [10]:
combined.head()

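|   | Store | Dept | Date | Weekly_Sales | IsHoliday | Type | Size |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2010-02-05 | 24924.5 | False | A | 151315 |
| 1 | 1 | 1 | 2010-02-12 | 46039.49 | True | A | 151315 |
| 2 | 1 | 1 | 2010-02-19 | 41595.55 | False | A | 151315 |
| 3 | 1 | 1 | 2010-02-26 | 19403.54 | False | A | 151315 |
| 4 | 1 | 1 | 2010-03-05 | 21827.9 | False | A | 151315 |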


In [11]:
stats.f_oneway(combined[combined['Type'] == 'A']['Weekly_Sales'],
               combined[combined['Type'] == 'B']['Weekly_Sales'],
               combined[combined['Type'] == 'C']['Weekly_Sales'])

F_onewayResult(statistic=7764.4262174492524, pvalue=0.0)

**Answer 7:**
- Hypotheses: $H_0: \mu_A = \mu_B = \mu_C$ vs. $H_A$: at least one $\mu_i \neq \mu_j$

- $p$-value: Approximately $0$. (It cannot be exactly zero, but it's so close to zero that the computer can't differentiate the true value from zero.)

- Conclusion: $p < \alpha$, so we reject $H_0$ and can conclude that not all three population means are equal.
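
The reported `pvalue=0.0` is a floating-point effect rather than a true zero. As a small illustration of how that can happen (this is a general demonstration of underflow, not a reconstruction of SciPy's internal computation), a positive number that is too small for a 64-bit float is stored as `0.0`:

```python
import sys

# Smallest positive normalized 64-bit float; values far below it cannot be stored.
print(sys.float_info.min)

# Mathematically this product is 1e-400, which is too small to represent.
tiny = 1e-200 * 1e-200
print(tiny == 0.0)
```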

Exercise 8: Suppose I want to test whether average weekly sales differs on holidays versus non-holidays. What is the appropriate test? Why? 

**Answer 8:** Because we're comparing two populations, we can rule out one-sample and ANOVA tests. According to the documentation for the independent-samples test [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) and the matched pairs test [here](https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.ttest_rel.html), a matched pairs $t$-test requires two identically-shaped arrays, because each observation in one sample must be paired with an observation in the other. The holiday and non-holiday observations have no such natural pairing, and this isn't a pre-test/post-test scenario like the ones for which we've used a matched pairs $t$-test before. As such, we should use a two-sample $t$-test with independent samples.

Exercise 9: Conduct the hypothesis test from Exercise 8. Be sure to state your hypotheses, find your p-value, and make the appropriate conclusion.

In [12]:
stats.ttest_ind(combined[combined['IsHoliday'] == True]['Weekly_Sales'],
                combined[combined['IsHoliday'] == False]['Weekly_Sales'])

Ttest_indResult(statistic=8.2947568539318937, pvalue=1.0912222677432844e-16)

**Answer 9:**
- Hypotheses: $H_0: \mu_{holiday} = \mu_{nonholiday}$ vs. $H_A: \mu_{holiday} \neq \mu_{nonholiday}$

- $p$-value: $1.09 \times 10^{-16}$

- Conclusion: $p < \alpha$, so we reject $H_0$ and can conclude that the weekly sales differ between holiday weeks and non-holiday weeks.
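
The call above uses `ttest_ind` with its default `equal_var=True`, which pools the two sample variances. When the two groups may have different variances, Welch's version (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic data; the group sizes, means, and spreads are assumptions invented for illustration and are not estimates from the Walmart data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)

# Hypothetical weekly sales: a small, more variable group and a large one.
holiday = rng.normal(loc=17000, scale=27000, size=300)
non_holiday = rng.normal(loc=15900, scale=22000, size=4000)

# Pooled-variance two-sample t-test (the default).
pooled = stats.ttest_ind(holiday, non_holiday)

# Welch's t-test: does not assume equal population variances.
welch = stats.ttest_ind(holiday, non_holiday, equal_var=False)

# Welch's statistic by hand: difference in means over the unpooled standard error.
se = np.sqrt(holiday.var(ddof=1) / holiday.size
             + non_holiday.var(ddof=1) / non_holiday.size)
t_manual = (holiday.mean() - non_holiday.mean()) / se

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
print("manual:", t_manual)
```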

Exercise 10: Generate and interpret a 95% confidence interval for the true average weekly sales among **holiday** weeks.

In [13]:
import numpy as np

In [14]:
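stats.t.interval(0.95,
                 # Degrees of freedom. The usual convention for a one-sample t interval is n - 1;
                 # n is passed here, which makes a negligible difference for large samples.
                 combined[combined['IsHoliday'] == True].shape[0],
                 # Center of the interval: the sample mean.
                 loc = combined[combined['IsHoliday'] == True]['Weekly_Sales'].mean(),
                 # Standard error of the mean: sample standard deviation / sqrt(n).
                 scale = (np.std(combined[combined['IsHoliday'] == True]['Weekly_Sales'], ddof = 1)) / combined[combined['IsHoliday'] == True].shape[0] ** 0.5)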

(16726.014954027953, 17345.631420673009)

**Answer 10:** I am 95% confident that the true average weekly sales among holiday weeks is between 16,726.01 and 17,345.63.
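
For reference, the interval has the form $\bar{x} \pm t^* \cdot s / \sqrt{n}$. The sketch below builds it two ways on synthetic data, using the conventional $n - 1$ degrees of freedom. The `sales` array is a hypothetical stand-in for the holiday weekly sales, not the real values.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(7)

# Hypothetical weekly sales figures, for illustration only.
sales = rng.normal(loc=17000, scale=5000, size=200)

n = sales.size
mean = sales.mean()
sem = stats.sem(sales)  # sample standard deviation (ddof=1) divided by sqrt(n)

# Using scipy's helper: confidence level, degrees of freedom, center, scale.
ci_scipy = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

# The same interval by hand: mean +/- critical value * standard error.
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_manual = (mean - t_crit * sem, mean + t_crit * sem)

print(ci_scipy)
print(ci_manual)
```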

Exercise 11: Generate and interpret a 95% confidence interval for the true average weekly sales among **non-holiday** weeks.

In [15]:
stats.t.interval(0.95,
                 combined[combined['IsHoliday'] == False].shape[0],
                 loc = combined[combined['IsHoliday'] == False]['Weekly_Sales'].mean(),
                 scale = (np.std(combined[combined['IsHoliday'] == False]['Weekly_Sales'], ddof = 1)) / combined[combined['IsHoliday'] == False].shape[0] ** 0.5)

(15831.531725246738, 15971.358412770294)

**Answer 11:** I am 95% confident that the true average weekly sales among non-holiday weeks is between 15,831.53 and 15,971.36.

Exercise 12: Compare your results from exercises 9, 10, and 11. What do you notice about these results? (Hint: Recall that a hypothesis test is "an inversion" of the confidence interval.)

**Answer 12:**
- In exercise 9, we can conclude that, at the 5% significance level, the average weekly sales on holiday weeks differs from the average weekly sales on non-holiday weeks.
- In exercises 10 and 11, we note that the 95% confidence interval for average weekly sales on holiday weeks does not overlap with the 95% confidence interval for average weekly sales on non-holiday weeks.
- Both the hypothesis test (with $\alpha = 5\%$) and the confidence intervals (with confidence level 95%) lead us to infer that the two means are different. These results are in sync!
    - In fact, this agreement is no accident. With the same data, the same distribution (in this case, both the CI and HT use the $t$-distribution), and a significance level $\alpha$ and confidence level that sum to 100%, a hypothesis test and a confidence interval for the same parameter will **always** give coinciding results. (Checking whether two separate intervals overlap is a rougher comparison than that exact correspondence, but here it points to the same conclusion.)
    - A confidence interval identifies a set of likely values for the parameter, and a hypothesis test checks whether one particular value, $\mu_0$, is a likely choice for the parameter.
    - The hypothesis test, constructed under the same conditions, will always fail to reject $H_0$ if the $\mu_0$ value is in the confidence interval and will always reject $H_0$ if the $\mu_0$ value is outside the confidence interval.
    - This is what we mean when we say "a hypothesis test is the inversion of a confidence interval!"
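
The inversion described above can be checked numerically for a one-sample $t$-test. In this sketch the sample is synthetic and the candidate $\mu_0$ values are chosen arbitrarily around the interval endpoints; all of them are assumptions made for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)

# Hypothetical sample, for illustration only.
sample = rng.normal(loc=10, scale=2, size=50)

alpha = 0.05
n = sample.size

# 95% confidence interval for the mean, using n - 1 degrees of freedom.
low, high = stats.t.interval(1 - alpha, n - 1,
                             loc=sample.mean(), scale=stats.sem(sample))

def rejects(mu_0):
    """Return True if the two-sided one-sample t-test rejects H0: mu = mu_0."""
    return stats.ttest_1samp(sample, popmean=mu_0).pvalue < alpha

# Candidate values just outside and just inside the interval.
candidates = [low - 0.5, low + 0.01, sample.mean(), high - 0.01, high + 0.5]
for mu_0 in candidates:
    inside = low <= mu_0 <= high
    print(round(mu_0, 3), "inside CI:", inside, "reject H0:", rejects(mu_0))
```

Each candidate inside the interval is not rejected, and each candidate outside it is rejected.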