## Statistical Inference (two populations)

In this demonstration, we will learn how to draw statistical inferences from two samples drawn from different populations.
We will be interested in:
- Determining whether the population means are different.
- Constructing confidence intervals for the difference in population means. 
---

## Demonstration Overview

- Problem Statement Discussion
- Data Preparation and Exploratory Data Analysis
- Hypothesis Testing for Difference in Means
    - Independent Samples
        - Equal variances
        - Unequal variances
    - Paired Samples
    
---

### Problem Statement

A restaurant chain wants to know whether a new storefront look and new employee uniforms can improve sales. For this experiment, two stores with similar locations and similar performance are chosen as the treatment and control groups. Here's an overview:

**Treatment Group**
- *Store Owner*: Andy
- *Changes made*: New Storefront look, employee uniforms

**Control Group**
- *Store Owner*: Bob
- *Changes made*: No changes

We use a 14-day testing period to estimate which storefront look does "*better*".

Now let's go ahead and analyze the dataset.

---

### Data Preparation and EDA

In [None]:
### Import the libraries


In [None]:
##Import the dataset
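
The dataset itself is not reproduced in this notebook text, so the following is only a minimal sketch that builds a stand-in DataFrame. The column names `Andy` and `Bob` are the ones referred to in later cells; the `Day` column name and every sales value are invented placeholders, not the real data.

```python
import pandas as pd

# Placeholder daily sales (in dollars) for the 14-day test period.
# These numbers are invented for illustration; they are NOT the notebook's data.
andy = [1200, 1350, 1100, 1500, 1650, 1900, 2100,
        1250, 1400, 1150, 1550, 1700, 1950, 2150]
bob = [1150, 1260, 1050, 1400, 1580, 1800, 2010,
       1180, 1310, 1060, 1480, 1600, 1880, 2050]

# 'Day' is a hypothetical column name; 'Andy' and 'Bob' match the columns
# mentioned elsewhere in this notebook.
df = pd.DataFrame({"Day": range(1, 15), "Andy": andy, "Bob": bob})

print(df.head())
```

With the real file, this step would typically be a single `pd.read_csv(...)` call on whatever path the dataset lives at.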


Let's prepare a few visualizations to analyze the data

In [None]:
## We can observe the day-wise trends for both Andy and Bob using a grouped bar chart


**Observations**:


Next, let's compare the sales distributions for both the stores

In [None]:
## We shall use a boxplot for this
## You can directly pass the DataFrame as an argument and the sns.boxplot function would only consider the numeric columns
## Which in this case are the columns 'Andy' and 'Bob'

We can see that the median sales in Andy's store are slightly higher than in Bob's.

In [None]:
#Mean difference calculation


Andy's average sales are about $81 higher than Bob's.

In [None]:
## We can also add a column that computes the pair-wise differences
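
As a rough sketch of the mean-difference and pair-wise difference steps, the block below uses invented placeholder sales figures (not the notebook's dataset) and assumes the difference is taken as Andy minus Bob. The column name `Difference` follows the name used later in this notebook.

```python
import pandas as pd

# Invented placeholder sales for 14 days; NOT the notebook's real data.
andy = [1200, 1350, 1100, 1500, 1650, 1900, 2100,
        1250, 1400, 1150, 1550, 1700, 1950, 2150]
bob = [1150, 1260, 1050, 1400, 1580, 1800, 2010,
       1180, 1310, 1060, 1480, 1600, 1880, 2050]
df = pd.DataFrame({"Andy": andy, "Bob": bob})

# Difference between the two sample means.
mean_diff = df["Andy"].mean() - df["Bob"].mean()

# Day-by-day (pair-wise) difference, assumed here to be Andy minus Bob.
df["Difference"] = df["Andy"] - df["Bob"]

print("Difference of means:", round(mean_diff, 2))
print("Mean of differences:", round(df["Difference"].mean(), 2))
```

Because both stores have one observation per day, the mean of the pair-wise differences equals the difference of the two means; what changes between the independent and paired analyses is the standard error, not this point estimate.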


In [None]:
#Check the DataFrame again


Let's obtain some more descriptive statistics for both the restaurants

In [None]:
#Obtain descriptive statistics
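
One way to obtain these statistics is pandas' `describe()`. The sketch below runs it on invented placeholder values (not the notebook's dataset) and adds the sample variance, which is relevant for the equal-variance discussion later.

```python
import pandas as pd

# Invented placeholder sales for 14 days; NOT the notebook's real data.
andy = [1200, 1350, 1100, 1500, 1650, 1900, 2100,
        1250, 1400, 1150, 1550, 1700, 1950, 2150]
bob = [1150, 1260, 1050, 1400, 1580, 1800, 2010,
       1180, 1310, 1060, 1480, 1600, 1880, 2050]
df = pd.DataFrame({"Andy": andy, "Bob": bob})

# count, mean, std, min, quartiles, and max for each store.
summary = df[["Andy", "Bob"]].describe()

# Sample variances (ddof=1 is the pandas default).
variances = df[["Andy", "Bob"]].var()

print(summary)
print(variances)
```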


**Summary**:


### Hypothesis Testing for Difference in Means

**Independent Samples**
- Equal variances

We shall start by conducting t-tests for the difference in means using *scipy*. This test assumes independent samples and equal variances across the two populations.

The hypotheses are as follows. 

> Let $\mu_1$ be the mean for Andy and $\mu_2$ the mean for Bob.

$$H_0: \mu_1-\mu_2 = 0$$
$$H_a: \mu_1-\mu_2 \neq 0$$

> This will be a two-sided test, with $\alpha$ = 0.05
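
For reference, the test statistic under the equal-variance assumption is the standard pooled two-sample t statistic. The symbols $\bar{x}_i$, $s_i^2$ and $n_i$ (sample mean, sample variance and sample size for store $i$) are notation introduced here, not taken from the notebook:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

Under $H_0$ this statistic follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom.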

In [None]:
##Import the necessary methods


In [None]:
##A quick look at the documentation of the ttest_ind method


In [None]:
## Run the t-test
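
A minimal sketch of this step with `scipy.stats.ttest_ind`, run on invented placeholder values rather than the notebook's dataset. Setting `equal_var=True` gives the pooled-variance test described above; the variable names are illustrative.

```python
from scipy import stats

# Invented placeholder sales for 14 days; NOT the notebook's real data.
andy = [1200, 1350, 1100, 1500, 1650, 1900, 2100,
        1250, 1400, 1150, 1550, 1700, 1950, 2150]
bob = [1150, 1260, 1050, 1400, 1580, 1800, 2010,
       1180, 1310, 1060, 1480, 1600, 1880, 2050]

alpha = 0.05

# Two-sided independent-samples t-test with pooled variance.
result = stats.ttest_ind(andy, bob, equal_var=True)
t_stat, p_value = result.statistic, result.pvalue

print("t statistic:", round(t_stat, 3))
print("p-value:", round(p_value, 3))
print("Reject H0 at alpha = 0.05:", p_value < alpha)
```

With these placeholder numbers the day-to-day spread in sales is much larger than the gap between the stores, so the independent-samples test does not reject $H_0$. The real data may of course behave differently.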


- Unequal variances

Next, we can run the test without the equal-variance assumption (Welch's t-test).

In [None]:
##Run the test for the unequal variance assumption
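
A sketch of the unequal-variance (Welch) version, again on invented placeholder values rather than the notebook's dataset. The only change from the pooled test is `equal_var=False`.

```python
from scipy import stats

# Invented placeholder sales for 14 days; NOT the notebook's real data.
andy = [1200, 1350, 1100, 1500, 1650, 1900, 2100,
        1250, 1400, 1150, 1550, 1700, 1950, 2150]
bob = [1150, 1260, 1050, 1400, 1580, 1800, 2010,
       1180, 1310, 1060, 1480, 1600, 1880, 2050]

# Welch's t-test: does not assume equal population variances.
welch = stats.ttest_ind(andy, bob, equal_var=False)

# Pooled-variance test, kept only for comparison.
pooled = stats.ttest_ind(andy, bob, equal_var=True)

print("Welch  t:", round(welch.statistic, 3), " p:", round(welch.pvalue, 3))
print("Pooled t:", round(pooled.statistic, 3), " p:", round(pooled.pvalue, 3))
```

When both samples have the same size, as here, the two t statistics coincide; only the degrees of freedom (and hence the p-value) can differ.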


In [None]:
### Additional Code
### To test equality of variances

##Import the necessary methods


# Bartlett's test of equality of variances
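
A possible sketch of this check using `scipy.stats.bartlett`, on invented placeholder values rather than the notebook's dataset. The null hypothesis of Bartlett's test is that the populations have equal variances.

```python
from scipy import stats

# Invented placeholder sales for 14 days; NOT the notebook's real data.
andy = [1200, 1350, 1100, 1500, 1650, 1900, 2100,
        1250, 1400, 1150, 1550, 1700, 1950, 2150]
bob = [1150, 1260, 1050, 1400, 1580, 1800, 2010,
       1180, 1310, 1060, 1480, 1600, 1880, 2050]

# H0: the two populations have equal variances.
bartlett_stat, bartlett_p = stats.bartlett(andy, bob)

print("Bartlett statistic:", round(bartlett_stat, 4))
print("p-value:", round(bartlett_p, 4))
print("Evidence of unequal variances at 0.05:", bartlett_p < 0.05)
```

A large p-value means there is no evidence against equal variances, which supports using the pooled test. Bartlett's test is sensitive to departures from normality, so it is best read alongside the boxplots rather than on its own.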

---

**Paired Samples**

In [None]:
##Reiterating the need for using paired samples
## Plot a scatterplot between Andy's sales and Bob's sales
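
As a numerical companion to the scatterplot, the sample correlation between the two stores' daily sales can be computed. This is a sketch on invented placeholder values (not the notebook's dataset); a strong positive correlation suggests that shared day-level effects drive both stores, which is the motivation for pairing.

```python
import pandas as pd

# Invented placeholder sales for 14 days; NOT the notebook's real data.
andy = [1200, 1350, 1100, 1500, 1650, 1900, 2100,
        1250, 1400, 1150, 1550, 1700, 1950, 2150]
bob = [1150, 1260, 1050, 1400, 1580, 1800, 2010,
       1180, 1310, 1060, 1480, 1600, 1880, 2050]
df = pd.DataFrame({"Andy": andy, "Bob": bob})

# Pearson correlation between the paired daily observations.
r = df["Andy"].corr(df["Bob"])

print("Correlation between Andy's and Bob's daily sales:", round(r, 3))
```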

**Observation**:

##### Using one-sample t-test on the Difference column

We have already computed the "Difference" column, so we can run the hypothesis test directly on it.

Our hypotheses change *a little bit*.

> Let $\mu_d$ denote the population mean of the day-wise differences in sales between Andy's and Bob's restaurants.

> The hypotheses shown previously change to

$$H_0: \mu_d = 0$$
$$H_a: \mu_d \neq 0$$

> This will still be a two-sided test, with $\alpha$ = 0.05
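
For reference, the paired test reduces to a one-sample t statistic on the differences. The symbols $\bar{d}$, $s_d$ and $n$ (sample mean of the differences, their sample standard deviation, and the number of pairs) are notation introduced here:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

Under $H_0$ this statistic follows a t distribution with $n - 1$ degrees of freedom.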

In [None]:
## We can perform a one-sample t-test on the 'Difference' column
## Import the necessary methods


In [None]:
##Let's check the documentation for ttest_1samp


In [None]:
## Run the one-sample t-test
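
A minimal sketch of this step with `scipy.stats.ttest_1samp`, together with a 95% confidence interval for the mean difference built from the t distribution. The data are invented placeholders (not the notebook's dataset), and the difference is assumed to be Andy minus Bob.

```python
import numpy as np
from scipy import stats

# Invented placeholder sales for 14 days; NOT the notebook's real data.
andy = np.array([1200, 1350, 1100, 1500, 1650, 1900, 2100,
                 1250, 1400, 1150, 1550, 1700, 1950, 2150])
bob = np.array([1150, 1260, 1050, 1400, 1580, 1800, 2010,
                1180, 1310, 1060, 1480, 1600, 1880, 2050])

# Pair-wise differences (assumed direction: Andy minus Bob).
diff = andy - bob

# One-sample t-test of H0: mean difference = 0 (two-sided).
one_sample = stats.ttest_1samp(diff, popmean=0)

# 95% confidence interval for the mean difference.
n = len(diff)
mean_d = diff.mean()
sem_d = stats.sem(diff)  # standard error, uses ddof=1
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=mean_d, scale=sem_d)

print("t statistic:", round(one_sample.statistic, 3))
print("p-value:", one_sample.pvalue)
print("95% CI for mean difference:", (round(ci_low, 2), round(ci_high, 2)))
```

If the confidence interval excludes zero, the two-sided test at $\alpha$ = 0.05 rejects $H_0$, and vice versa.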


##### Using two-sample t-test for paired samples

In [None]:
#Import the necessary methods


In [None]:
## A quick look at the ttest_rel method's documentation


In [None]:
# Paired sample t-test
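
A sketch of the paired test with `scipy.stats.ttest_rel`, on invented placeholder values rather than the notebook's dataset. It is compared against the one-sample test on the differences to show that the two approaches agree.

```python
import numpy as np
from scipy import stats

# Invented placeholder sales for 14 days; NOT the notebook's real data.
andy = np.array([1200, 1350, 1100, 1500, 1650, 1900, 2100,
                 1250, 1400, 1150, 1550, 1700, 1950, 2150])
bob = np.array([1150, 1260, 1050, 1400, 1580, 1800, 2010,
                1180, 1310, 1060, 1480, 1600, 1880, 2050])

# Paired-samples t-test: observations are matched by day.
paired = stats.ttest_rel(andy, bob)

# Equivalent one-sample test on the pair-wise differences.
one_sample = stats.ttest_1samp(andy - bob, popmean=0)

print("Paired     t:", round(paired.statistic, 3), " p:", paired.pvalue)
print("One-sample t:", round(one_sample.statistic, 3), " p:", one_sample.pvalue)
```

The two calls give the same statistic and p-value, since the paired t-test is defined as a one-sample test on the differences.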


##### Using paired t-test in Pingouin

*Pingouin* is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. By default, it reports somewhat more detailed statistical output than the methods in the *scipy.stats* package.

Read more: https://pingouin-stats.org/

In [None]:
##Install pingouin if it's not already there in the system
!pip install pingouin

In [None]:
##Import the library


In [None]:
## Check the documentation for pg.ttest

In [None]:
## We can first conduct an unpaired test
## To verify the results we got in the independent samples case


In [None]:
## Now we can go ahead and conduct the paired test
## Here, paired will be set to True
