### Types of Statistical Analysis

1. One Sample Analysis
2. Two Sample Analysis

## 1 Z-test

First, let's list the steps in hypothesis testing.

### 1.1 Steps in Hypothesis Testing

**STEP #1**: Setup Null-Hypothesis $H_0$ and Alternate-Hypothesis $H_a$  
**STEP #2**: Choose a Distribution, test-statistic and Significance-level $(\alpha)$  
**STEP #3**: Select Tail type: Left / Right / Two-tailed  
**STEP #4**: Compute $\text{p-value}$  
**STEP #5**: Compare $\text{p-value}$ with $\alpha$: reject $H_0$ if $\text{p-value} < \alpha$; otherwise, fail to reject $H_0$
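
For reference, the tail type chosen in STEP #3 determines how the $\text{p-value}$ in STEP #4 is computed. The notation here is an assumption added for illustration: $z_{\text{obs}}$ is the observed test statistic and $\Phi$ is the standard normal CDF.

$$
\begin{aligned}
\text{Right-tailed: } & \text{p-value} = 1 - \Phi(z_{\text{obs}}) \\
\text{Left-tailed: }  & \text{p-value} = \Phi(z_{\text{obs}}) \\
\text{Two-tailed: }   & \text{p-value} = 2\,\bigl(1 - \Phi(|z_{\text{obs}}|)\bigr)
\end{aligned}
$$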

### 1.2 Conditions for Z-test

1. Sample size $n$ should be greater than 30, so that the sample mean is approximately normally distributed.
2. The population standard deviation should be known.

### 1.3 Application of CLT in Hypothesis Testing
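
A minimal simulation sketch of why the CLT matters here: even for a skewed population, the means of repeated samples cluster around the population mean with a spread close to $\sigma / \sqrt{n}$, which is what lets a z-test treat the sample mean as normal. The exponential population, the seed, and the number of trials are all assumptions chosen for illustration; they are not taken from the case study.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

population_mean = 1800.0   # illustrative value only
n = 50                     # observations per sample
trials = 5000              # number of repeated samples

# Exponential population: heavily skewed, with standard deviation equal to its mean.
sample_means = [
    statistics.fmean(random.expovariate(1 / population_mean) for _ in range(n))
    for _ in range(trials)
]

mean_of_means = statistics.fmean(sample_means)
observed_se = statistics.stdev(sample_means)
expected_se = population_mean / n ** 0.5   # sigma / sqrt(n), with sigma == mean here

print(f"mean of sample means:    {mean_of_means:.1f}")
print(f"observed standard error: {observed_se:.1f}")
print(f"sigma / sqrt(n):         {expected_se:.1f}")
```

The last two printed values should be close to each other, which is the property the standard error formula relies on.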

### 1.4 Critical Value

The threshold value of the test statistic (or of the sample mean) beyond which $H_0$ is rejected at the chosen significance level $\alpha$.

> **Note**:
>
> Use the `ppf` function (e.g. `stats.norm.ppf`) to get the $\text{z-critical}$ value, which is then used to calculate the actual critical value.
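
As a rough sketch, the critical z-value depends on the tail type as well as on $\alpha$. The snippet below uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` (an assumption made so it runs without dependencies); its `inv_cdf` method plays the same role as `ppf`.

```python
from statistics import NormalDist

alpha = 0.01
std_normal = NormalDist()  # mean 0, standard deviation 1

z_right = std_normal.inv_cdf(1 - alpha)     # right-tailed: reject when z > z_right
z_left = std_normal.inv_cdf(alpha)          # left-tailed: reject when z < z_left
z_two = std_normal.inv_cdf(1 - alpha / 2)   # two-tailed: reject when |z| > z_two

print(round(z_right, 4), round(z_left, 4), round(z_two, 4))
# 2.3263 -2.3263 2.5758
```

The critical value on the original scale is then `mu + z_critical * se`, where `se` is the standard error.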

### 1.5 Test Statistic
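
For a one-sample z-test, the test statistic is the z-score of the observed sample mean. The formula below is a sketch that matches the `z = (x - mu) / se` computation used in the examples; the symbol $\bar{x}$ for the observed sample mean is an assumed notation (the code calls it `x`).

$$
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
$$

Here $\mu$ is the population mean under $H_0$, $\sigma$ is the population standard deviation, $n$ is the sample size, and $\sigma / \sqrt{n}$ is the standard error.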

## 2 Examples

### 2.1 Marketing Case study

Suppose there is a Retail Store Chain that sells Shampoo bottles:

This chain has **2000 stores** across India.  

The parameters for weekly sales of the shampoo bottle were reported as:
- Mean: 1800
- Standard deviation: 100

This was calculated by analyzing a lot of historical data.  
As a Manager / Owner / Data Scientist, you want to increase these sales, to generate more revenue.

**Q1. What are the techniques at your disposal?**

- Hire a marketing team

But there is an important factor to consider. These marketing teams/firms are not cheap, and would add a significant cost.  
It stands to reason that you would not straightaway hand over all 2000 stores to them.  
You would want an assurance that their work actually does impact the sales, and generate enough revenue that it is feasible to hire them.

**Q2. How would you get that assurance?**

Perhaps you can allot them a few stores, and analyze the sales parameters (Mean and Standard deviation).  
If the results are good after a couple of weeks, then hire them for all 2000 stores.

You decide to do this experiment with 2 competing marketing firms:

**Firm A**

- Worked on **50 stores**
- Sold an **average** **1850** bottles of shampoo

**Firm B**

- Worked on **5 stores**
- Sold an **average** **1900** bottles of shampoo

**Q3. Which firm gave better results?**

Firm B's average sales are higher, but it worked on far fewer stores than Firm A.  
With a population standard deviation of 100 and only 5 stores, Firm B's increase could plausibly be due to chance.

**Q4. How do we quantify this and determine if it is just by chance or if it is actually statistically significant?**

When we talk about statistical significance, the term *significance level* comes to mind.  
Since this is a big decision that would affect revenue, you want to be very sure (99% confidence) about your decision, i.e. $\alpha = 0.01$  
So, we need to employ the framework we saw and conduct hypothesis testing on each firm to see whether its results are statistically significant.

#### Hypothesis Testing on Firm A

##### STEP #1

$H_0 := \mu = 1800$  
$H_a := \mu > 1800$

##### STEP #2

Distribution: Normal Distribution  
Test Statistic: z-score  
Significance level: 0.01

In [1]:
alpha = 0.01

##### STEP #3

Since we have to find $P(x > 1850)$, we need to perform a right-tailed test.

##### STEP #4

Compute $\text{p-value}$

In [2]:
import numpy as np
from scipy import stats

In [3]:
mu = 1800
sigma = 100
n = 50

In [4]:
# Find: P(x > 1850)
x = 1850

In [5]:
se = sigma / np.sqrt(n)  # Standard error
se.round(4).item()

14.1421

In [6]:
# x = mu + (z * se)  =>  z = (x - mu) / se
# For the sampling distribution of the mean, the spread is the standard error.
z = (x - mu) / se
z.round(4).item()

3.5355

In [7]:
p_x_gt_1850 = 1 - stats.norm.cdf(z)
p_value = p_x_gt_1850.round(4).item()
print("p-value:", p_value)

p-value: 0.0002


##### STEP #5

Compare $\text{p-value}$ with $\alpha$

In [8]:
if p_value < alpha:
    print("Reject the null hypothesis in favor of the alternate hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

Reject the null hypothesis in favor of the alternate hypothesis.


The probability of observing an average at least this high, given that the null hypothesis is true, is very low, so we reject the null hypothesis in favor of the alternate hypothesis.

##### Critical Value

In [9]:
z_critical = stats.norm.ppf(0.99)
z_critical.round(4).item()

2.3263

In [10]:
x_critical = mu + (z_critical * se)
x_critical.round(0).item()

1833.0
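
The p-value route and the critical-value route for Firm A always lead to the same decision. The sketch below checks this with the Firm A figures; it uses the standard library's `statistics.NormalDist` instead of `scipy.stats.norm` (an assumption made to keep it dependency-free), and the variable names `reject_by_p_value` and `reject_by_critical_value` are introduced here for illustration.

```python
from statistics import NormalDist

mu, sigma, n, alpha = 1800, 100, 50, 0.01   # Firm A figures from the example
x = 1850                                     # observed average sales

se = sigma / n ** 0.5                        # standard error
std_normal = NormalDist()

# p-value approach (right-tailed)
z = (x - mu) / se
p_value = 1 - std_normal.cdf(z)

# Critical-value approach
x_critical = mu + std_normal.inv_cdf(1 - alpha) * se

reject_by_p_value = p_value < alpha
reject_by_critical_value = x > x_critical

print(reject_by_p_value, reject_by_critical_value)
# True True
```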

#### Business Insights

1. The very low p-value of 0.0002 indicates that the observed data are hard to reconcile with the null hypothesis.
2. If Marketing Firm A had no effect, there would be only about a 0.02% chance (well below the 1% threshold) of seeing average sales of 1850 or more across 50 stores.
3. At the 99% confidence level, we conclude that average sales rose above 1800 with Marketing Firm A. The p-value itself is not the probability that the improvement is real.

#### Hypothesis Testing on Firm B

##### STEP #1

$H_0 := \mu = 1800$  
$H_a := \mu > 1800$

##### STEP #2

Distribution: Normal Distribution  
Test Statistic: $\text{z-score}$  
Significance level: 0.01

In [11]:
alpha = 0.01

##### STEP #3

Since we have to find $P(x > 1900)$, we need to perform a right-tailed test.

##### STEP #4

Compute $\text{p-value}$

In [12]:
import numpy as np
from scipy import stats

In [13]:
mu = 1800
sigma = 100
n = 5

In [14]:
# Find: P(x > 1900)
x = 1900

In [15]:
se = sigma / np.sqrt(n)  # Standard error
se.round(4).item()

44.7214

In [16]:
# x = mu + (z * se)  =>  z = (x - mu) / se
# For the sampling distribution of the mean, the spread is the standard error.
z = (x - mu) / se
z.round(4).item()

2.2361

In [17]:
p_x_gt_1900 = 1 - stats.norm.cdf(z)
p_value = p_x_gt_1900.round(4).item()
print("p-value:", p_value)

p-value: 0.0127


##### STEP #5

Compare $\text{p-value}$ with $\alpha$

In [18]:
if p_value < alpha:
    print("Reject the null hypothesis in favor of the alternate hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

Fail to reject the null hypothesis.


In [19]:
0.0127 * 100, 100 - 0.0127 * 100

(1.27, 98.73)

There is a 1.27% chance of observing an average at least this high given that the null hypothesis is true, which is above the threshold of 1%, so we fail to reject the null hypothesis.

##### Critical Value

In [20]:
z_critical = stats.norm.ppf(0.99)
z_critical.round(4).item()

2.3263

In [21]:
x_critical = z_critical * se + mu
x_critical.round(0).item()

1904.0

#### Business Insights

1. The p-value of 0.0127 is above $\alpha = 0.01$, so there is not enough evidence to reject the null hypothesis at the 99% confidence level.
2. If Marketing Firm B had no effect, there would still be about a 1.27% chance of seeing average sales of 1900 or more across just 5 stores.
3. The result would be significant at the 95% confidence level (0.0127 < 0.05), but not at the 99% level chosen for this decision.
4. Had Firm B averaged about 1904 bottles or more, the result would have been significant at the 99% confidence level.
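
The critical-value reasoning can also be turned around: instead of asking how high the average must be for 5 stores, ask how many stores would be needed for an average of 1900 to be significant. The sketch below assumes the average would stay at 1900 in a larger trial, which nothing in the data guarantees; `n_min` is a name introduced here for illustration, and `statistics.NormalDist` stands in for `scipy.stats.norm`.

```python
import math
from statistics import NormalDist

mu, sigma, alpha = 1800, 100, 0.01
x = 1900                                   # Firm B's observed average

z_critical = NormalDist().inv_cdf(1 - alpha)

# Reject H0 when (x - mu) / (sigma / sqrt(n)) > z_critical,
# i.e. when n > (z_critical * sigma / (x - mu)) ** 2.
n_min = math.floor((z_critical * sigma / (x - mu)) ** 2) + 1

print(n_min)
# 6
```

So one additional store with the same average would have been enough to cross the 99% threshold.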

### Conclusion

1. Firm B: the p-value of 0.0127 is above $\alpha = 0.01$, so the increase from 1800 to 1900 is not statistically significant at the 99% confidence level.
2. Firm A: the p-value of 0.0002 is well below $\alpha = 0.01$, so the increase from 1800 to 1850 is statistically significant.
3. Because only Firm A's result meets our 99% confidence target for confirming a marketing  
   firm's effectiveness in improving average sales, we will proceed with Firm A rather than Firm B.
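
To see both firms side by side, the two tests can be rerun in one loop. This is a minimal sketch using the figures from the case study; the `firms` dictionary and the `results` structure are introduced here for illustration, and `statistics.NormalDist` is used in place of `scipy.stats.norm`.

```python
from statistics import NormalDist

mu, sigma, alpha = 1800, 100, 0.01
firms = {"A": (50, 1850), "B": (5, 1900)}   # (stores, average weekly sales)

std_normal = NormalDist()
results = {}
for name, (n, x) in firms.items():
    se = sigma / n ** 0.5                    # standard error for this sample size
    z = (x - mu) / se
    p_value = 1 - std_normal.cdf(z)          # right-tailed test
    results[name] = (round(z, 4), round(p_value, 4), p_value < alpha)
    print(f"Firm {name}: z={z:.4f}, p-value={p_value:.4f}, reject H0: {p_value < alpha}")

# Firm A: z=3.5355, p-value=0.0002, reject H0: True
# Firm B: z=2.2361, p-value=0.0127, reject H0: False
```

Firm B's larger increase is outweighed by its much larger standard error (44.72 versus 14.14), which is why its z-score is lower.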

## 3 z-test Template

##### STEP #1

Define Null-Hypothesis and Alternate-Hypothesis.

1. $H_0 := $
2. $H_a := $

##### STEP #2

Identify:

1. Distribution:
2. Test Statistic: z-score
3. Significance level:

In [22]:
# alpha =

##### STEP #3

Identify tail-type based on observed value.

In [23]:
# Observed value.
# x =

##### STEP #4

Compute $\text{p-value}$

In [24]:
# mu =
# sigma =
# n =

In [25]:
# se = sigma / np.sqrt(n)  # Standard error
# se.round(4).item()

In [26]:
# # x = mu + (z * se)  =>  z = (x - mu) / se
# # For the sampling distribution of the mean, the spread is the standard error.
# z = (x - mu) / se
# z.round(4).item()

In [27]:
# Compute p_value using stats.norm.cdf(z)

##### STEP #5

Compare $\text{p-value}$ with $\alpha$

In [28]:
# if p_value < alpha:
#     print("Reject the null hypothesis in favor of the alternate hypothesis.")
# else:
#     print("Fail to reject the null hypothesis.")
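
The template can also be folded into a single reusable function. The version below is a sketch under stated assumptions: the name `z_test`, its signature, and the `tail` argument are hypothetical, and it relies on the standard library's `statistics.NormalDist` rather than `scipy.stats.norm`.

```python
from statistics import NormalDist


def z_test(x, mu, sigma, n, alpha, tail="right"):
    """One-sample z-test; returns (z, p_value, reject_h0)."""
    se = sigma / n ** 0.5              # standard error of the sample mean
    z = (x - mu) / se
    cdf = NormalDist().cdf(z)
    if tail == "right":
        p_value = 1 - cdf
    elif tail == "left":
        p_value = cdf
    elif tail == "two":
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("tail must be 'right', 'left' or 'two'")
    return z, p_value, p_value < alpha


# Usage with the Firm A figures from the case study.
z, p_value, reject = z_test(x=1850, mu=1800, sigma=100, n=50, alpha=0.01)
print(round(z, 4), round(p_value, 4), reject)
# 3.5355 0.0002 True
```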

## 4 Quizzes

### Quiz #1

A French cake shop claims that the average number of pastries they can produce in a day exceeds 500.  
The average number of pastries produced per day over a 70 day period was found to be 530.  
Assume that the population standard deviation for the pastries produced per day is 125.  
Test the claim using a z-test with the critical z-value = 1.64 at the alpha (significance level) = 0.05, and state your interpretation.

##### STEP #1

Define Null-Hypothesis and Alternate-Hypothesis.

1. $H_0 := \mu \le 500$
2. $H_a := \mu > 500$

##### STEP #2

Identify:

1. Distribution: Normal Distribution
2. Test Statistic: z-score
3. Significance level: 0.05

In [29]:
alpha = 0.05

##### STEP #3

Identify tail-type.

Since we have to find $P(x > 530)$, we need to perform a right-tailed test.

In [30]:
# Observed value.
x = 530

##### STEP #4

Compute $\text{p-value}$

In [31]:
mu = 500
sigma = 125
n = 70

# Find: P(x > 530)
x = 530

In [32]:
se = sigma / np.sqrt(n)  # Standard error
se.round(4).item()

14.9404

In [33]:
# x = mu + (z * se)  =>  z = (x - mu) / se
# For the sampling distribution of the mean, the spread is the standard error.
z = (x - mu) / se
z.round(4).item()

2.008

In [34]:
p_x_gt_530 = 1 - stats.norm.cdf(z)
p_value = p_x_gt_530.round(4).item()

print("p-value:", p_value)

p-value: 0.0223


##### STEP #5

Compare $\text{p-value}$ with $\alpha$

In [35]:
if p_value < alpha:
    print("Reject the null hypothesis in favor of the alternate hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

Reject the null hypothesis in favor of the alternate hypothesis.


In [36]:
z_critical = 1.64  # Given in the question.

# Critical-value approach: compare the observed z with z_critical directly.
print("z:", z.round(4).item(), "z-critical:", z_critical)

z: 2.008 z-critical: 1.64


In [37]:
if z > z_critical:
    print("Reject the null hypothesis in favor of the alternate hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

Reject the null hypothesis in favor of the alternate hypothesis.


##### Critical value

In [38]:
z_critical = stats.norm.ppf(0.95)
z_critical.round(4).item()

1.6449

In [39]:
x_critical = mu + (z_critical * se)  # Use the standard error, not sigma.
x_critical.round(0).item()

525.0