# Evaluating Treatment Effectiveness with Hypothesis Testing

Time estimate: **20** minutes

## Objectives

After completing this lab, you will be able to:

- Formulate null and alternative hypotheses for clinical comparisons.
- Choose and run appropriate statistical tests (t-test, chi-square, Mann–Whitney) based on data type and assumptions.
- Check test assumptions (normality, variance equality, expected counts) and apply alternatives when assumptions fail.
- Interpret p-values, confidence intervals, and effect sizes to make evidence-based treatment decisions.
- Report results clearly for clinicians, including limitations and practical significance.


## What you will do in this lab

- Simulate a randomized clinical trial dataset comparing two treatments, with one continuous and one binary outcome.
- Inspect data and check assumptions (normality, equal variances).
- Run two-sample t-test, Welch's t-test, Mann–Whitney U test, and Chi-square test for categorical outcomes.
- Compute confidence intervals and Cohen's d effect size.
- Consolidate findings into a clinician-friendly summary.
- Complete 6 consolidated exercises with hints & solutions at the end of the lab.


## Overview

Hypothesis testing is the backbone of evaluating whether observed differences between treatment groups are likely due to an actual effect or random sampling variability. This lab focuses on practical selection and interpretation of hypothesis tests used in clinical research: t-tests (standard and Welch), nonparametric alternatives, chi-square for categorical outcomes, confidence intervals, and effect-size measures.

We will simulate a dataset that mimics a randomized trial to illustrate assumption checks and interpretation. Understanding when to apply each test, and how to report the results, is essential for evidence-based clinical decisions.
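
As a rough illustration of the selection logic described above, the sketch below chains the assumption checks used later in this lab into a single helper. The function name `suggest_two_group_test`, the 0.05 cut-off, and the toy inputs are assumptions made for this example. Treat it as a teaching heuristic rather than a rule: choosing a test from preliminary tests on the same data has known drawbacks, and Welch's t-test is often a reasonable default.

```python
import numpy as np
from scipy import stats


def suggest_two_group_test(x, y, alpha=0.05):
    """Suggest a test for two independent groups (illustrative heuristic only)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Shapiro-Wilk on each group: a small p-value suggests non-normality.
    both_normal = (stats.shapiro(x).pvalue >= alpha) and (stats.shapiro(y).pvalue >= alpha)
    if not both_normal:
        return "Mann-Whitney U"

    # Levene's test: a small p-value suggests unequal variances.
    equal_var = stats.levene(x, y).pvalue >= alpha
    return "Student's t-test" if equal_var else "Welch's t-test"


# Deterministic toy inputs built from distribution quantiles (assumed, not trial data).
probs = (np.arange(1, 51) - 0.5) / 50
normal_like = stats.norm.ppf(probs)   # bell-shaped values
skewed = np.exp(1.5 * normal_like)    # strongly right-skewed values

print(suggest_two_group_test(normal_like, normal_like + 1.0))   # same spread, shifted
print(suggest_two_group_test(normal_like, 10.0 * normal_like))  # very different spread
print(suggest_two_group_test(skewed, normal_like))              # one skewed group
```

The three calls cover the three branches: similar spread, very different spread, and a clearly non-normal group.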


## About the dataset/environment

This lab simulates a randomized trial comparing Treatment A versus Treatment B with two outcomes:
- A continuous outcome (e.g., reduction in systolic blood pressure) for the t-tests and, as the nonparametric alternative, the Mann–Whitney U test.
- A binary outcome (e.g., event occurred / did not occur) for chi-square testing.

Tools: Python (pandas, numpy, scipy, statsmodels, matplotlib, seaborn).


## Setup

Run the following cell to set up the analysis environment. It detects whether the notebook is running in Google Colab and, if so, installs the required packages. It then imports the statistical and visualization libraries and sets a random seed and display options so that your results are reproducible.

In [None]:
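# Colab compatibility: detect Colab and install packages only there.
# Outside Colab, install the packages listed above with pip if they are missing.
try:
    import google.colab  # imported only to detect the Colab runtime
    IN_COLAB = True
except ImportError:
    IN_COLAB = False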

if IN_COLAB:
    !pip -q install numpy pandas scipy statsmodels seaborn matplotlib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.stats.api as sms

# Reproducibility
np.random.seed(42)
pd.set_option('display.max_columns', 50)
sns.set(style='whitegrid')
print('Setup complete. Running in Colab:', IN_COLAB)


Run the code to simulate a randomized two-group clinical trial dataset. It generates systolic blood pressure change values for two treatment groups, plus a binary event outcome whose probability differs by treatment. This lets you practice hypothesis testing (t-tests and chi-square tests) on realistic data with known, controlled effects.

In [None]:
def simulate_trial(n_per_group=80, effect_size=5.0, sd=12.0):
    # Treatment A baseline mean = 0, Treatment B mean = effect_size (difference)
    a = np.random.normal(loc=0.0, scale=sd, size=n_per_group)
    b = np.random.normal(loc=effect_size, scale=sd, size=n_per_group)
    df = pd.DataFrame({
        'treatment': ['A']*n_per_group + ['B']*n_per_group,
        'change_sbp': np.concatenate([a, b])
    })
    # Add a binary event outcome correlated with treatment for chi-square demo
    # Prob of event in A = 0.30, B = 0.18 (treatment reduces event)
    events_a = np.random.binomial(1, 0.30, size=n_per_group)
    events_b = np.random.binomial(1, 0.18, size=n_per_group)
    df['event'] = np.concatenate([events_a, events_b])
    return df

df = simulate_trial(n_per_group=90, effect_size=4.5, sd=13.0)
df.head()

## Step 1: Inspect the data

Basic inspection: shape, types, summary

This code cell gives a quick structural and statistical overview of the simulated trial dataset: its size, data types, and descriptive statistics. Use it to verify that the data were generated as expected before running hypothesis tests.

In [None]:
print('Rows, Columns:', df.shape)
print(df.dtypes)
df.describe().T
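
Because every test in this lab compares the two arms, a per-group summary is often more informative than the overall `describe()` output. The following is a minimal sketch on a small stand-in dataset. The name `demo`, the seed, and the group size of 50 are assumptions for illustration; the column names mirror those used in this lab.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, for illustration only

# Small stand-in dataset with the same columns as the lab's DataFrame.
demo = pd.DataFrame({
    "treatment": ["A"] * 50 + ["B"] * 50,
    "change_sbp": np.concatenate([rng.normal(0.0, 13.0, 50), rng.normal(4.5, 13.0, 50)]),
    "event": np.concatenate([rng.binomial(1, 0.30, 50), rng.binomial(1, 0.18, 50)]),
})

# One row per arm: sample size, mean and SD of the outcome, event rate, missing values.
summary = demo.groupby("treatment").agg(
    n=("change_sbp", "size"),
    mean_change=("change_sbp", "mean"),
    sd_change=("change_sbp", "std"),
    event_rate=("event", "mean"),
    n_missing=("change_sbp", lambda s: s.isna().sum()),
)
print(summary.round(2))
```

Comparing `sd_change` across arms gives an early, informal view of whether the equal-variance assumption looks plausible.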

## Step 2: Visualize distributions and check normality

Histograms, Q-Q plot, Shapiro test.

Run this code cell to assess whether the systolic blood pressure change values in each treatment group follow a normal distribution. This step includes visualizing the distributions, generating Q–Q plots, and performing Shapiro–Wilk tests to determine if using parametric tests like the t-test is appropriate.

In [None]:
# Distribution and normality checks
plt.figure(figsize=(8,4))
sns.histplot(data=df, x='change_sbp', hue='treatment', kde=True, bins=30)
plt.title('Distribution of Change in SBP by Treatment')
plt.show()

# Q-Q plots
import statsmodels.api as sm
sm.qqplot(df[df['treatment']=='A']['change_sbp'].dropna(), line='s')
plt.title('Q-Q plot Treatment A')
plt.show()
sm.qqplot(df[df['treatment']=='B']['change_sbp'].dropna(), line='s')
plt.title('Q-Q plot Treatment B')
plt.show()

# Shapiro-Wilk test (note: with large samples it flags trivial departures from normality; with small samples it has low power)
from scipy.stats import shapiro
print('Shapiro A:', shapiro(df[df['treatment']=='A']['change_sbp']).pvalue)
print('Shapiro B:', shapiro(df[df['treatment']=='B']['change_sbp']).pvalue)
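
To make the sample-size caveat concrete, the sketch below applies Shapiro–Wilk to two samples that share the same mildly skewed shape and differ only in size. The samples are evenly spaced quantiles of a gamma distribution, so the example is deterministic. The choice of distribution, the shape parameter, and the helper name `quantile_sample` are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def quantile_sample(n, shape=20.0):
    """Return n evenly spaced quantiles of a mildly right-skewed gamma distribution."""
    probs = (np.arange(1, n + 1) - 0.5) / n
    return stats.gamma.ppf(probs, a=shape)


small = quantile_sample(30)    # same underlying shape, few observations
large = quantile_sample(3000)  # same underlying shape, many observations

p_small = stats.shapiro(small).pvalue
p_large = stats.shapiro(large).pvalue
print(f"n=30:   skewness={stats.skew(small):.2f}, Shapiro p={p_small:.4f}")
print(f"n=3000: skewness={stats.skew(large):.2f}, Shapiro p={p_large:.2e}")
```

The small sample passes the test while the large one is rejected, although both come from the same distribution. This is why the Q–Q plot and the size of the departure matter more than the Shapiro p-value alone.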

## Step 3: Two-sample t-test and Welch's t-test

Perform tests and compare results.

Run this code cell to compare the mean systolic blood pressure changes between the two treatment groups using appropriate statistical tests. This step performs both the standard Student’s t-test and Welch’s t-test, and includes Levene’s test to check variance assumptions. Together, these results help you determine which test is most suitable and how to correctly interpret differences between the treatment groups.

In [None]:
# Standard Student's t-test (assume equal variances)
group_a = df[df['treatment']=='A']['change_sbp'].dropna()
group_b = df[df['treatment']=='B']['change_sbp'].dropna()

t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=True)
print("Student's t-test: t=%.3f, p=%.4f" % (t_stat, p_val))

# Welch's t-test (unequal variances)
t_stat_w, p_val_w = stats.ttest_ind(group_a, group_b, equal_var=False)
print("Welch's t-test: t=%.3f, p=%.4f" % (t_stat_w, p_val_w))

# Levene's test for equal variances
lev_stat, lev_p = stats.levene(group_a, group_b)
print('Levene test for equal variances: stat=%.3f, p=%.4f' % (lev_stat, lev_p))
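
To see what `equal_var=False` changes, it can help to compute Welch's statistic by hand: the standard error uses each group's own variance, and the degrees of freedom come from the Welch–Satterthwaite approximation. The sketch below does this on assumed toy data (unequal group sizes and spreads, arbitrary seed) and compares the result with SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # assumed seed, for illustration only
a = rng.normal(0.0, 10.0, 60)   # toy group A: smaller spread
b = rng.normal(4.0, 20.0, 40)   # toy group B: larger spread, fewer observations

# Squared standard error contributed by each group.
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic and Welch-Satterthwaite degrees of freedom.
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)
df_welch = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(f"manual: t={t_manual:.3f}, df={df_welch:.1f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.3f}, p={p_scipy:.4f}")
```

The degrees of freedom are usually not an integer and lie between the smaller group's `n - 1` and the pooled `n1 + n2 - 2`.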

## Step 4: Nonparametric alternative (Mann–Whitney U)

When normality fails.

Run this code cell to evaluate differences in systolic blood pressure change between the two treatment groups using a non-parametric approach. This step applies the Mann–Whitney U test, which compares the overall distributions of the two groups when normality assumptions are not met.

In [None]:
# Mann-Whitney U test
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print('Mann-Whitney U: U=%.3f, p=%.4f' % (u_stat, u_p))

# Note: Mann-Whitney tests whether values in one group tend to be larger than in the other
# (stochastic ordering). It is a test of medians only if both distributions have the same shape.
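
The simulated blood pressure data are drawn from normal distributions, so the sketch below uses a separate, clearly skewed outcome to show where the Mann–Whitney U test is most useful. The lognormal data, the seed, and the variable names are assumptions for illustration. The example also turns U into an effect size, assuming a SciPy version (1.7 or later) in which `mannwhitneyu` returns the U statistic of the first sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # assumed seed, for illustration only

# Hypothetical right-skewed outcome (for example, days to recovery) in two groups.
x = rng.lognormal(mean=2.0, sigma=0.8, size=70)
y = rng.lognormal(mean=2.4, sigma=0.8, size=70)

u_x, p = stats.mannwhitneyu(x, y, alternative="two-sided")

# U for the first sample counts the (x, y) pairs in which x is larger (ties count 0.5),
# so dividing by the number of pairs estimates P(X > Y).
prob_superiority = u_x / (x.size * y.size)
rank_biserial = 2 * prob_superiority - 1  # ranges from -1 to 1; 0 means no tendency

print(f"U={u_x:.1f}, p={p:.4f}")
print(f"Estimated P(X > Y)={prob_superiority:.3f}, rank-biserial r={rank_biserial:.3f}")
```

Reporting the estimated probability that a value from one group exceeds a value from the other gives clinicians a more direct reading than the U statistic itself.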


## Step 5: Categorical outcome - Chi-square test

Contingency table, expected counts, Fisher's exact if needed.

Run these code cells to examine whether event rates differ between treatment groups. You will first generate a contingency table to summarize event occurrences, then apply chi-square and Fisher’s exact tests to determine whether any observed differences are statistically significant.

In [None]:
# Contingency table for event by treatment
ct = pd.crosstab(df['treatment'], df['event'])
ct.columns = ['no_event', 'event']
ct


In [None]:
# Chi-square test (SciPy applies Yates' continuity correction to 2x2 tables by default)
chi2, chi_p, dof, expected = stats.chi2_contingency(ct)
print('Chi-square: chi2=%.3f, p=%.4f, dof=%d' % (chi2, chi_p, dof))
print('Expected counts:\n', expected)

# If any expected count is <5, prefer Fisher's exact test (only for 2x2); shown here for comparison
from scipy.stats import fisher_exact
oddsratio, fisher_p = fisher_exact(ct)
print('Fisher exact p=%.4f' % fisher_p)
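
A p-value alone does not tell a clinician how much the event risk changes. The sketch below picks the test from the expected counts and then computes the risk difference, risk ratio, and odds ratio from a 2x2 table. The counts are hypothetical, chosen to resemble the event probabilities used in the simulation; they are not output from this lab.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = treatment (A, B), columns = (no_event, event).
table = np.array([[63, 27],
                  [74, 16]])

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Use Fisher's exact test only when some expected count is below 5.
if (expected < 5).any():
    _, p_value = stats.fisher_exact(table)
    test_used = "Fisher's exact"
else:
    p_value = p_chi2
    test_used = "chi-square (Yates-corrected)"

risk_a = table[0, 1] / table[0].sum()
risk_b = table[1, 1] / table[1].sum()
risk_diff = risk_b - risk_a    # absolute change in event risk (B minus A)
risk_ratio = risk_b / risk_a   # relative risk (B versus A)
odds_ratio = (table[1, 1] * table[0, 0]) / (table[1, 0] * table[0, 1])

print(f"{test_used}: p={p_value:.4f}")
print(f"Risk A={risk_a:.3f}, Risk B={risk_b:.3f}")
print(f"Risk difference={risk_diff:.3f}, risk ratio={risk_ratio:.3f}, odds ratio={odds_ratio:.3f}")
```

The risk difference is usually the easiest of the three to communicate, because it is on the same scale as the event rates themselves.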

## Step 6: Effect sizes and confidence intervals

Cohen's d and CI for difference in means.

Run this code cell to quantify the magnitude and precision of the treatment effect. You will calculate Cohen’s d to measure the effect size between groups and generate a 95% confidence interval for the mean difference, helping you interpret not just whether the treatments differ, but by how much.

In [None]:
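# Cohen's d for independent groups (pooled SD); positive when mean(y) > mean(x).
# Expects inputs without missing values (drop NaNs first, as done for group_a and group_b).
def cohens_d(x, y):
    nx, ny = len(x), len(y)
    dof = nx + ny - 2
    pooled_sd = np.sqrt(((nx-1)*np.nanvar(x, ddof=1) + (ny-1)*np.nanvar(y, ddof=1)) / dof)
    return (np.nanmean(y) - np.nanmean(x)) / pooled_sd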

d = cohens_d(group_a, group_b)
print("Cohen's d (B vs A):", round(d,3))

# 95% CI for difference in means using statsmodels (usevar='unequal' gives the Welch interval)
cm = sms.CompareMeans(sms.DescrStatsW(group_b), sms.DescrStatsW(group_a))
ci_low, ci_upp = cm.tconfint_diff(usevar='unequal')
print('95% CI for mean difference (B-A):', (ci_low, ci_upp))
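
The cell above gives an interval for the mean difference but only a point estimate for Cohen's d. One simple way to express the uncertainty in d is a percentile bootstrap, sketched below. This is a technique swapped in for illustration, not part of the lab's own workflow. The toy data, the seed, and the choice of 2,000 resamples are assumptions.

```python
import numpy as np


def cohens_d(x, y):
    """Cohen's d with pooled SD; positive when mean(y) > mean(x)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(y) - np.mean(x)) / np.sqrt(pooled_var)


rng = np.random.default_rng(3)  # assumed seed, for illustration only
a = rng.normal(0.0, 13.0, 90)   # toy stand-in for Treatment A
b = rng.normal(4.5, 13.0, 90)   # toy stand-in for Treatment B

# Resample each group with replacement and recompute d each time.
n_boot = 2000
boot_d = np.empty(n_boot)
for i in range(n_boot):
    boot_a = rng.choice(a, size=a.size, replace=True)
    boot_b = rng.choice(b, size=b.size, replace=True)
    boot_d[i] = cohens_d(boot_a, boot_b)

# Percentile interval: the middle 95% of the bootstrap estimates.
d_low, d_high = np.percentile(boot_d, [2.5, 97.5])
print(f"Cohen's d = {cohens_d(a, b):.3f}, 95% bootstrap CI ({d_low:.3f}, {d_high:.3f})")
```

Resampling within each group separately preserves the two-arm design; pooling the groups before resampling would instead simulate the null hypothesis.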

## Step 7: Reporting and interpretation

Assemble clinician-ready summary.

Run this code cell to generate a clear, programmatic summary of your results. It reports group means, the mean difference, statistical significance, and effect size—giving you an immediate, interpretable snapshot of how the two treatments compare.

In [None]:
# Programmatic reporting
mean_a = group_a.mean()
mean_b = group_b.mean()
print(f"Mean change SBP - Treatment A: {mean_a:.2f}")
print(f"Mean change SBP - Treatment B: {mean_b:.2f}")
print(f"Mean difference (B-A): {mean_b-mean_a:.2f}")
print(f"p-value (Welch): {p_val_w:.4f}")
print(f"Cohen's d: {d:.3f}")

# Simple interpretation helper
if p_val_w < 0.05:
    print('\nInterpretation: Statistically significant difference between treatments (p<0.05).')
else:
    print('\nInterpretation: No statistically significant difference detected.')
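
Statistical significance does not by itself establish practical significance. One common way to connect the two is to compare the confidence interval with a minimal clinically important difference (MCID). The helper below is a sketch under stated assumptions: the function name `classify_effect`, the category labels, and the MCID of 5 mmHg are hypothetical placeholders rather than clinical guidance, and larger differences are assumed to mean more benefit.

```python
def classify_effect(ci_low, ci_upp, mcid):
    """Classify a 95% CI for a mean difference against an assumed positive MCID."""
    if ci_low > 0:
        # The whole interval is above zero: statistically significant benefit.
        if ci_low >= mcid:
            return "significant; clinically meaningful"
        if ci_upp < mcid:
            return "significant; below the clinical threshold"
        return "significant; clinical relevance uncertain"
    if ci_upp < 0:
        # The whole interval is below zero.
        return "significant; favors the comparator"
    # The interval includes zero.
    if ci_upp < mcid:
        return "not significant; meaningful benefit unlikely"
    return "inconclusive"


# Hypothetical intervals for a mean difference in mmHg, with an assumed MCID of 5.
for ci in [(6.0, 10.0), (0.7, 8.3), (-2.0, 9.0)]:
    print(ci, "->", classify_effect(*ci, mcid=5.0))
```

Any real threshold should be pre-specified and justified clinically before the data are analyzed.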

## Consolidated practice exercises

### Exercise 1: State null and alternative hypotheses for the continuous outcome (change in SBP) in this trial

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Write clear H0 and H1 strings for comparing mean change in SBP between Treatment A and Treatment B.

</details>

<details> <summary>Click here for solution</summary>

```python
print('H0: mean change in SBP (B) - mean change in SBP (A) = 0')
print('H1: mean change in SBP (B) - mean change in SBP (A) != 0')
```

</details>

### Exercise 2: Run Welch's t-test for difference in means and report t-statistic and p-value

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use `stats.ttest_ind(group_a, group_b, equal_var=False)` and print results.

</details>

<details> <summary>Click here for solution</summary>

```python
t_stat_w, p_val_w = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat_w, p_val_w)
```

</details>

### Exercise 3: If normality is violated, run a Mann–Whitney U test and report p-value

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use `stats.mannwhitneyu(group_a, group_b, alternative='two-sided')`.

</details>

<details> <summary>Click here for solution</summary>

```python
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print('U:', u_stat, 'p:', u_p)
```

</details>

### Exercise 4: Perform chi-square test for the binary event outcome and report p-value

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Create contingency table `pd.crosstab(df['treatment'], df['event'])` and run `stats.chi2_contingency`.

</details>

<details> <summary>Click here for solution</summary>

```python
ct = pd.crosstab(df['treatment'], df['event'])
chi2, chi_p, dof, expected = stats.chi2_contingency(ct)
print('p:', chi_p)
```

</details>

### Exercise 5: Compute Cohen's d for difference in means (B vs A)

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use the provided `cohens_d` function or compute manually using pooled SD.

</details>

<details> <summary>Click here for solution</summary>

```python
d = cohens_d(group_a, group_b)
print('Cohen d:', d)
```

</details>

### Exercise 6: Create a short clinician-friendly summary sentence containing: mean difference, 95% CI, p-value, and practical conclusion

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Format a sentence using values computed earlier (e.g., mean difference, ci_low/ci_upp, p_val_w).

</details>

<details> <summary>Click here for solution</summary>

```python
print(f"Mean difference (B-A): {mean_b-mean_a:.2f}, 95% CI: ({ci_low:.2f}, {ci_upp:.2f}), p={p_val_w:.4f}. Conclusion: ...")
```

</details>

## Final thoughts and best practices

- Check assumptions before selecting a test.  
- Report effect sizes and confidence intervals along with p-values.  
- For small samples or sparse categorical tables, prefer exact tests.  
- Be transparent about multiple testing and pre-specified endpoints.
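
To illustrate the point about multiple testing, the sketch below adjusts a set of p-values with the Bonferroni and Holm methods using plain NumPy. The three p-values and the endpoint names in the comment are hypothetical, and `holm_adjust` is a helper written for this example, not a library function.

```python
import numpy as np


def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Multiply the k-th smallest p-value by (m - k), then enforce monotonicity.
    stepped = (m - np.arange(m)) * p[order]
    adjusted_sorted = np.minimum(np.maximum.accumulate(stepped), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted


# Hypothetical raw p-values for three endpoints (e.g., SBP change, event rate, a secondary outcome).
raw_p = np.array([0.030, 0.200, 0.012])

bonferroni = np.minimum(raw_p * raw_p.size, 1.0)
holm = holm_adjust(raw_p)

print("Raw:       ", raw_p)
print("Bonferroni:", bonferroni)
print("Holm:      ", holm)
```

Holm's method controls the same family-wise error rate as Bonferroni and never gives larger adjusted p-values, which is why it is often preferred when a correction is needed.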

# Congratulations!

You have successfully completed this lab on **Evaluating Treatment Effectiveness with Hypothesis Testing**.

In this lab, you evaluated whether two treatments differ in effectiveness using a simulated clinical trial dataset. You inspected the data, checked normality and variance assumptions, and applied the right statistical tests—including t-tests, Welch’s test, Mann–Whitney U, chi-square, and Fisher’s exact.

You also calculated confidence intervals and Cohen’s d to understand the size and precision of the treatment effect, then generated clear, clinician-ready interpretations.

By the end of the lab, you practiced choosing appropriate tests, interpreting results, and summarizing findings in a way that supports evidence-based clinical decision-making.

## Authors

Ramesh Sannareddy

Copyright © 2025 SkillUp. All rights reserved.