# Hypothesis Tests

Sometimes we want to know if a difference between two groups is real, not just a random fluke. To figure this out, we use something called a hypothesis test.

A test like this tells us how likely results like ours would be if there were no real effect at all. The less likely they are, the more confident we can be that something real is going on. We talk about this using terms like **statistical significance** (is there an effect?) and **effect size** (how large is it?).

As always, everything starts with data!

## Loading and Understanding the Data
<font color='green'>**Let's load the data (`"02_Sales.csv"`) and get to know it a bit.**</font>

The data is synthetically generated for practicing statistical methods and is based on this [Kaggle dataset](https://www.kaggle.com/datasets/matinmahmoudi/sales-and-satisfaction/data?select=Sales_without_NaNs_v1.3.csv).

If the code below does not work for you, try running either of the following lines first:
- `!pip install matplotlib`
- `%pip install matplotlib`

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("02_Sales.csv")

In [None]:
df.head()

<font color='green'>**What kind of analyses could we perform with this dataset? Take a moment to think of a few ideas before moving on.**</font>

# Possible Analyses
## Correlations
- How much is customer satisfaction related to sales? → **Pearson's or Spearman's Correlation**
- How does the satisfaction or sales *before* relate to the satisfaction or sales *after*? → **Pearson's or Spearman's Correlation**

## Group Differences
- Are the sales or satisfaction of the Treatment group significantly different from the Control group? → **Independent Samples t-test**
- Did sales or satisfaction significantly change for a single group when you compare *before* and *after*? → **Paired Samples t-test**
- Did the Treatment group make significantly more purchases than the Control group? → **Bonus: Chi-Squared Test**

# Independent Samples t-test
## _When_ do we use it?
An **independent samples t-test** compares the means of **two separate groups**. It tells us how likely a difference at least as large as the one we observed would be if there were no true difference between the groups and only random chance were at work. This likelihood is the **p-value**.

If the p-value is very small (a common threshold is $p < 0.05$), we say the result is **statistically significant**. This means that, if there were no real effect, we would see a difference at least this large less than 5% of the time. Note that this is not the same as being 95% sure that the effect is real: the p-value describes how surprising our data would be without an effect, not the probability of the effect itself. With a small p-value, we reject the assumption of "no difference" and treat the difference as linked to what makes the groups different (e.g., Medicine vs. Placebo, Website Design A vs. B, etc.).
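
To make the 5% threshold more tangible, here is a small simulation sketch. It does not use our sales data: the two groups are made-up random numbers drawn from the same distribution, so any "significant" result is a false alarm by construction. The group sizes, mean and spread are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 2000   # how many imaginary experiments we run
n_per_group = 100      # observations per group in each experiment

# Both groups come from the same distribution, so there is no real effect.
group_a = rng.normal(loc=100, scale=15, size=(n_experiments, n_per_group))
group_b = rng.normal(loc=100, scale=15, size=(n_experiments, n_per_group))

# One t-test per experiment (one per row).
p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue

false_positive_rate = np.mean(p_values < 0.05)
print(f"Share of 'significant' results without any real effect: {false_positive_rate:.3f}")
```

The printed share should land close to 0.05: that is exactly what the threshold controls. It says how often we would be fooled if nothing were going on, not how likely it is that a given effect is real.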

## _How_ do we use it?

It's quite simple in Python! We just need to identify our grouping variable and our target variable.

In our case, we'll use `Group` (Control vs. Treatment) as our grouping variable and `Sales_After` as our numerical target variable.

**A quick note:**
- We could (and probably should) also check the `Sales_Before` variable. Why? To make sure there wasn't already a significant difference between the groups *before* our intervention. If the Treatment group was already full of high-spending customers by chance, our results would be biased.
- We could do the exact same analysis for the `Satisfaction` variable.
- The `Purchase_Made` variable is categorical (yes/no), not numerical, so it's **not** suitable for a t-test.
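
The first point above could be checked along these lines. This is only a sketch: `baseline_check` is a hypothetical helper, and `demo` is made-up stand-in data so that the cell runs on its own. In the notebook you would pass the real `df` instead.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up stand-in data; replace `demo` with the real `df` in the notebook.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Group": ["Control"] * 200 + ["Treatment"] * 200,
    "Sales_Before": rng.normal(loc=200, scale=50, size=400),
})

def baseline_check(data, column="Sales_Before"):
    """Return the t-test p-value comparing both groups on a pre-intervention column."""
    control = data.loc[data["Group"] == "Control", column]
    treatment = data.loc[data["Group"] == "Treatment", column]
    return stats.ttest_ind(control, treatment).pvalue

p_before = baseline_check(demo)
print(f"Baseline p-value for Sales_Before: {p_before:.3f}")
```

A large p-value here means we found no evidence that the groups already differed before the intervention. A small one would be a warning sign for everything that follows.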

In [None]:
from scipy import stats

In [None]:
# We filter our DataFrame for each group and then select our target variable.
# Swapping the groups only flips the sign of the t-statistic; the two-sided p-value stays the same.

t_stat, p_value = stats.ttest_ind(
    df[df.Group == "Control"].Sales_After,
    df[df.Group == "Treatment"].Sales_After
    )

print(f"p-value: {p_value}")

In this example, the p-value is extremely small (it's displayed as 0.0), far below our 0.05 threshold. We can therefore conclude that the difference in sales between the Treatment and Control groups is statistically significant. Whether we can credit the intervention itself depends on the groups being comparable in every other respect, which is why the `Sales_Before` check mentioned above matters.
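
A p-value tells us that there is a difference, but not how big it is. For that we need an **effect size**. One common choice for two groups is Cohen's d: the difference in means divided by the pooled standard deviation. The sketch below uses a hypothetical helper `cohens_d` and two tiny made-up lists. In the notebook you could pass the two `Sales_After` series instead.

```python
import numpy as np

def cohens_d(control, treatment):
    """Cohen's d: difference in means (treatment minus control) in pooled standard deviations."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    n_c, n_t = len(control), len(treatment)
    # Pooled variance: both sample variances, weighted by their degrees of freedom.
    pooled_var = (
        (n_c - 1) * control.var(ddof=1) + (n_t - 1) * treatment.var(ddof=1)
    ) / (n_c + n_t - 2)
    return (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

# Made-up numbers, just to show the call.
d = cohens_d([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(f"Cohen's d: {d:.2f}")
```

As a rough rule of thumb (Cohen's own convention), values around 0.2 count as small, 0.5 as medium and 0.8 as large. With thousands of data points, even a tiny effect can be statistically significant, so it is worth looking at both numbers.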

But don't celebrate just yet! A t-test, much like Pearson's correlation, has a few assumptions we should check to make sure our result is trustworthy.

## Assumptions of an Independent Samples t-test
A t-test is only truly reliable if the underlying data meets certain conditions:

- 1) The sample should be representative of the population we want to talk about: (✅) → We'll assume this is true for our data.

- 2) The target variable must be numerical: ✅

- 3) The observations should be independent: (✅) → We'll assume this is true.

- 4) No significant outliers: ❓ → We need to check this!

- 5) The target variable should be normally distributed within each group: ❓ → We need to check this!

- 6) Homogeneity of variances (the spread of the data in both groups should be similar): ❓ → We need to check this!

### 4. Checking for Outliers

We can use a boxplot to visualize outliers. It looks like we might have a few high and low outliers.

In [None]:
df.boxplot(column="Sales_After", by="Group")
plt.title("Sales After Intervention by Group")
plt.suptitle('') # Suppress the automatic title
plt.xlabel("Group")
plt.ylabel("Sales After")
plt.show()

However, if we look at these data points in the context of their spending *before* the intervention, these 'outliers' start to make sense. They seem to be customers who simply spend a lot (or a little) in general. So, it's not surprising that their spending is also higher or lower than the majority after the intervention.

For now, I see no strong reason to remove them. We could investigate the most extreme points individually if we wanted to be extra thorough.

In [None]:
color_map = {"Control": "grey", "Treatment": "red"}
colors = df["Group"].map(color_map)

df.plot.scatter(x="Sales_Before", y="Sales_After", c=colors)
plt.title("Spending Before vs. After")
plt.xlabel("Sales Before")
plt.ylabel("Sales After")
plt.show()
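
If you want to count the outliers rather than eyeball them, one option is the same rule the boxplot uses by default: a point is drawn individually if it lies more than 1.5 times the interquartile range (IQR) beyond the box. The sketch below uses a hypothetical helper `iqr_outlier_mask` and a made-up array. In the notebook you could apply it to `Sales_After` for each group.

```python
import numpy as np

def iqr_outlier_mask(values, whisker=1.5):
    """Return a boolean array marking points beyond whisker * IQR from the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return (values < lower) | (values > upper)

# Made-up numbers with one obvious outlier.
sample = np.array([1, 2, 3, 4, 5, 100])
mask = iqr_outlier_mask(sample)
print(f"Number of outliers: {mask.sum()}")
print(f"Outlier values: {sample[mask]}")
```

Keep in mind that this rule only flags unusual points. Whether they should be removed is still a judgment call, as discussed above.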

### 5. Checking for Normality

Just by looking at a histogram, both groups seem to be roughly bell-shaped (i.e., normally distributed). Since the t-test is pretty robust against slight deviations from normality, especially with larger datasets, this visual check is often good enough to start.

In [None]:
df[df.Group == "Control"].Sales_After.hist(alpha=0.7, label='Control')
df[df.Group == "Treatment"].Sales_After.hist(alpha=0.7, label='Treatment')
plt.legend()
plt.title("Distribution of Sales After")
plt.xlabel("Sales")
plt.ylabel("Frequency")
plt.show()
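
Another common visual check is a Q-Q plot, which compares the sorted data with what a perfect normal distribution would produce. SciPy's `probplot` computes this and also returns a correlation coefficient `r`: the closer to 1, the closer the points are to a straight line. The sketch below uses made-up samples and a hypothetical helper `qq_correlation`. In the notebook you could pass one group's `Sales_After`, and add `plot=plt` to the `probplot` call to actually draw the plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Made-up samples: one roughly normal, one clearly skewed.
normal_sample = rng.normal(loc=250, scale=40, size=1000)
skewed_sample = rng.exponential(scale=40, size=1000)

def qq_correlation(values):
    """Correlation between the sorted data and the matching normal quantiles."""
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r

print(f"Normal sample: r = {qq_correlation(normal_sample):.4f}")
print(f"Skewed sample: r = {qq_correlation(skewed_sample):.4f}")
```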

#### Statistical Tests for Normality
Normality can also be tested statistically, using the **Shapiro-Wilk Test** and the **Kolmogorov-Smirnov (KS)** Test, both named after their creators.

- **Shapiro-Wilk** is particularly suitable for small samples (fewer than 2000 data points, and even under 50). For more than 5,000 data points, SciPy notes that its p-value may no longer be accurate.
- **KS** requires larger data sets of at least 50 to yield meaningful results.

Let's first count how many data points we have to test:

In [None]:
print(df[df.Group == "Control"].Sales_After.shape[0])
print(df[df.Group == "Treatment"].Sales_After.shape[0])

We have around 5,000 data points in each group. Therefore, the KS test for normality is probably the more appropriate choice.

But I’ll show you both nonetheless:

In [None]:
from scipy.stats import shapiro

# Perform Shapiro-Wilk test
stat, p_value = shapiro(df[df.Group == "Control"].Sales_After)

print(f"Shapiro-Wilk Test Statistic: {stat}")
print(f"P-Value: {p_value}")

# Interpret result (H0: the data comes from a normal distribution)
if p_value > 0.05:
    print("No evidence against normality (fail to reject H0)")
else:
    print("Data deviates from normality (reject H0)")


In [None]:
from scipy.stats import kstest

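# Perform Kolmogorov-Smirnov test for normality.
# kstest(..., 'norm') on its own compares against a *standard* normal (mean 0, std 1),
# so we pass the sample's own mean and standard deviation via `args`.
# Note: estimating these from the same data makes the p-value only approximate.
control_sales = df[df.Group == "Control"].Sales_After
stat, p_value = kstest(control_sales, 'norm', args=(control_sales.mean(), control_sales.std()))

print(f"Kolmogorov-Smirnov Test Statistic: {stat}")
print(f"P-Value: {p_value}")

# Interpret result (H0: the data comes from a normal distribution)
if p_value > 0.05:
    print("No evidence against normality (fail to reject H0)")
else:
    print("Data deviates from normality (reject H0)")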


<font color='green'>**Try running the normality test for the other group we want to examine.**</font>

In [None]:
# Your code here

If the tests reject normality, the assumption is, strictly speaking, violated. Keep in mind, though, that with thousands of data points these tests flag even tiny deviations. The t-test is fairly robust against such violations, especially with large samples, and judging visually, the distribution does not look too far off. Let's proceed.

### 6. Checking for Homogeneity of Variances
The standard independent samples t-test assumes that the variances (the 'spread' of the data) of the two groups are roughly equal.

Our boxplot from earlier gave us a hint, but we can test this assumption more formally with a **Levene test**.

In [None]:
from scipy.stats import levene

stat, p = levene(df[df.Group == "Control"].Sales_After, df[df.Group == "Treatment"].Sales_After)
print(f"Levene’s test statistic: {stat:.4f}, p-value: {p:.4f}")

# Interpretation
if p > 0.05:
    print("The variances are likely equal (p > 0.05)")
else:
    print("The variances are likely not equal (p <= 0.05)")
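
To get a feeling for what the Levene test reacts to, here is a small sketch on made-up data: two groups with the same mean but clearly different spreads. The group names and numbers are invented for illustration. By default, SciPy's `levene` measures spread around each group's median (`center='median'`), which makes it less sensitive to non-normal data.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(11)
# Made-up groups: same mean, very different spread.
tight = rng.normal(loc=100, scale=10, size=500)
spread_out = rng.normal(loc=100, scale=30, size=500)

# A descriptive look first: the sample standard deviations.
print(f"Standard deviations: {tight.std(ddof=1):.1f} vs. {spread_out.std(ddof=1):.1f}")

stat_unequal, p_unequal = levene(tight, spread_out)
print(f"Levene statistic: {stat_unequal:.2f}, p-value: {p_unequal:.4g}")
```

Printing the standard deviations alongside the test is a useful habit: with large samples, Levene's test can flag differences in spread that are too small to matter in practice.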

### A Quick Correction!

The Levene test result tells us that the variances of our two groups are **not equal**. This violates one of the assumptions of the standard t-test.

Luckily, there's an easy fix! We can use a variation of the t-test, called **Welch's t-test**, which does not assume equal variances. It's the correct and more robust choice here.

To do this in Python, we simply run the same function as before, but add the parameter `equal_var=False`.

In [None]:
# Rerunning the t-test with Welch's correction for unequal variances
t_stat_welch, p_value_welch = stats.ttest_ind(
    df[df.Group == "Control"].Sales_After,
    df[df.Group == "Treatment"].Sales_After,
    equal_var=False # This is the important part
    )

print(f"Welch's t-test p-value: {p_value_welch}")

if p_value_welch < 0.05:
    print("\nOur conclusion remains the same: there is a significant difference between the groups!")
else:
    print(f"Welch's test shows there is no significant difference between the groups (p={p_value_welch})")
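
If you are curious what `equal_var=False` changes under the hood, here is a sketch that computes Welch's t-test by hand and compares it with SciPy. Instead of pooling the variances, each group keeps its own, and the degrees of freedom are adjusted with the Welch-Satterthwaite formula. The helper `welch_t_test` and the two made-up samples are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

def welch_t_test(a, b):
    """Welch's t-test by hand: returns (t statistic, degrees of freedom, two-sided p-value)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each group's mean, kept separate (no pooling).
    se2_a = a.var(ddof=1) / len(a)
    se2_b = b.var(ddof=1) / len(b)
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1))
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value

rng = np.random.default_rng(3)
# Made-up groups with different sizes and spreads.
narrow = rng.normal(loc=100, scale=10, size=80)
wide = rng.normal(loc=105, scale=30, size=120)

t_manual, dof, p_manual = welch_t_test(narrow, wide)
reference = stats.ttest_ind(narrow, wide, equal_var=False)

print(f"By hand: t = {t_manual:.4f}, p = {p_manual:.4f} (df = {dof:.1f})")
print(f"SciPy:   t = {reference.statistic:.4f}, p = {reference.pvalue:.4f}")
```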

# Your Turn!
<font color='green'>**Now, try to follow the same steps to test for a difference between the groups in `Customer_Satisfaction_After`.**</font>

1.  Run a t-test on `Customer_Satisfaction_After`.
2.  Check the assumptions (especially normality and homogeneity of variances).
3.  Decide if you need to use Welch's t-test.
4.  State your final conclusion!

# Bonus: Paired Samples T-Test
So far, we have analysed differences between two **independent groups**. Often, however, we measure the **same group at two points in time**. This is called a _paired sample_.
The execution and interpretation are similar to the independent samples t-test. The main difference in the assumptions is that the *differences* between the paired measurements should be roughly normally distributed.

<font color='green'>**If you're interested and have the time, try to run a paired samples t-test for two separate columns on the same group.**</font>

For the sake of brevity, we won't test all the assumptions here.

In [None]:
from scipy.stats import ttest_rel

# ttest_rel(..., ...)
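
If you'd like some intuition first: a paired t-test is the same thing as a one-sample t-test on the per-customer differences, checking whether their mean is zero. The sketch below shows this on made-up "before" and "after" values (not our sales data), so the exercise above is still yours to solve.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Made-up paired measurements: each "after" value belongs to the same unit as its "before" value.
before = rng.normal(loc=200, scale=50, size=300)
after = before + rng.normal(loc=5, scale=20, size=300)

paired = stats.ttest_rel(before, after)
one_sample = stats.ttest_1samp(before - after, popmean=0)

print(f"Paired t-test:              t = {paired.statistic:.4f}, p = {paired.pvalue:.4g}")
print(f"One-sample on differences:  t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4g}")
```

Both lines print the same numbers. This is also why only the differences need to be roughly normal, and why the order of the rows matters: shuffling one column would break the pairing.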

# Bonus: Chi-Squared Test
Want to dive a little deeper into statistics? Try using an AI assistant or other online resources to learn about the **Chi-Squared Test**.

You can use it to test whether the intervention was successful in getting more customers to make a purchase (`Purchase_Made`).

**Hint:** We need a Chi-Squared test here because our target variable (`Purchase_Made`) is categorical (True/False), not numerical.
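
To get you started, here is a minimal sketch of the mechanics using SciPy's `chi2_contingency`. The counts are completely made up and are not taken from our dataset. In the notebook, you could build the table of counts with something like `pd.crosstab(df["Group"], df["Purchase_Made"])`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts. Rows: Control, Treatment. Columns: no purchase, purchase.
observed = np.array([
    [60, 40],
    [45, 55],
])

chi2, p, dof, expected = chi2_contingency(observed)

print(f"Chi-squared statistic: {chi2:.3f}")
print(f"p-value: {p:.4f}")
print(f"Degrees of freedom: {dof}")
print("Counts we would expect if group and purchase were unrelated:")
print(expected)
```

The test compares the observed counts with the counts we would expect if the group made no difference. For 2x2 tables, SciPy applies Yates' continuity correction by default (`correction=True`).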