# Solution: Hypothesis Testing in Healthcare: Drug Safety

This is a Jupyter notebook, i.e., it is intended to be run cell by cell.

Author: Eric Einspänner

First version: 17th of September 2024

Copyright 2023 Clinic of Neuroradiology, Magdeburg, Germany

License: Apache-2.0

The original dataset can be found [here](https://hbiostat.org/data/repo/safety.rda).

## Initial Set-Up for Google Colab
<u> Execute these code blocks only in Google Colab! </u>

In [None]:
!wget -q -O - https://github.com/University-Clinic-of-Neuroradiology/python-bootcamp/archive/refs/heads/main.tar.gz | tar -xzf - --strip-components=2 python-bootcamp-main/notebooks/projects

In [None]:
import os
import sys
from google.colab import output
output.enable_custom_widget_manager()

sys.path.insert(0,'projects')
os.chdir(sys.path[0])

In [None]:
%pip install -q ipympl numpy pandas statsmodels pingouin matplotlib seaborn

In [3]:
# Import packages
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
import pingouin
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
drug_safety = pd.read_csv("files/drug_safety.csv")

## --- Start notebook ---

In [None]:
drug_safety.head()

### 1. Two-sample proportions z-test:

In [None]:
# Count the adverse_effects column values for each trx group
adv_eff_by_trx = drug_safety.groupby("trx").adverse_effects.value_counts()

# Compute total rows in each group
adv_eff_by_trx_totals = adv_eff_by_trx.groupby("trx").sum()

print(adv_eff_by_trx)
print(adv_eff_by_trx_totals)

In [None]:
# Create a list of the "Yes" counts for each group (Drug first, then Placebo)
yeses = [adv_eff_by_trx["Drug"]["Yes"], adv_eff_by_trx["Placebo"]["Yes"]]

# Create a list of the total number of rows in each group, in the same order
n = [adv_eff_by_trx_totals["Drug"], adv_eff_by_trx_totals["Placebo"]]

# Perform a two-sided z-test on the two proportions
two_sample_results = proportions_ztest(yeses, n)

# Store the p-value
two_sample_p_value = two_sample_results[1]
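
To make the test above less of a black box, the following is a minimal sketch of a two-sided, two-sample proportions z-test with a pooled standard error, written with the standard library only. The helper name `two_proportion_ztest` and the counts are hypothetical and are not taken from `drug_safety.csv`; the sketch is meant to illustrate the formula, not to replace `proportions_ztest`.

```python
import math


def two_proportion_ztest(successes, totals):
    """Two-sided z-test for equal proportions using the pooled standard error."""
    (x1, x2), (n1, n2) = successes, totals
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the null hypothesis p1 == p2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (assumption): 30/100 vs. 20/100 adverse effects
z_stat, p_value = two_proportion_ztest([30, 20], [100, 100])
print(f"z = {z_stat:.3f}, p = {p_value:.3f}")
```

A p-value above the chosen significance level (commonly 0.05) means the null hypothesis of equal proportions is not rejected.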

### 2. Association between adverse effects and the groups:

In [None]:
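# Determine if num_effects and trx are independent
# chi2_independence returns a tuple: (expected, observed, stats)
num_effects_groups = pingouin.chi2_independence(
    data=drug_safety, x="num_effects", y="trx")

# Extract the p-value from the first row of the stats table (Pearson chi-square)
num_effects_p_value = num_effects_groups[2]["pval"][0]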
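
As an illustration of what a chi-square test of independence computes, here is a minimal sketch of the Pearson statistic for a contingency table, using the standard library only. The helper name `pearson_chi2` and the 2x2 table are hypothetical and not derived from the dataset. No continuity correction is applied, so for a 2x2 table the result may differ from library output that applies Yates' correction.

```python
import math


def pearson_chi2(observed):
    """Return the Pearson chi-square statistic, degrees of freedom, and expected counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected counts if the row and column variables were independent
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


# Hypothetical 2x2 table (assumption): rows are groups, columns are Yes/No
table = [[30, 70], [20, 80]]
chi2, dof, expected = pearson_chi2(table)

# Closed-form tail probability, valid only for one degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
```

For a 2x2 table without correction, this statistic equals the square of the pooled two-proportion z statistic, so both tests give the same two-sided p-value.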

### 3. Inspecting whether age is normally distributed:

In [None]:
# Create a histogram with Seaborn
sns.histplot(data=drug_safety, x="age", hue="trx")

#### Optional

In [None]:
# Check normality per group to choose between an unpaired t-test
# and a Wilcoxon-Mann-Whitney test
normality = pingouin.normality(
    data=drug_safety,
    dv='age',
    group='trx',
    method='shapiro', # the default
    alpha=0.05) # 0.05 is also the default

# Display the test result for each group
print(normality)

### 4. Significant difference between the ages of both groups:

In [None]:
# Select the age of the Drug group
age_trx = drug_safety.loc[drug_safety["trx"] == "Drug", "age"]

# Select the age of the Placebo group
age_placebo = drug_safety.loc[drug_safety["trx"] == "Placebo", "age"]

# The age distribution is not normal, so conduct a
# two-sided Mann-Whitney U test instead of an unpaired t-test
age_group_effects = pingouin.mwu(age_trx, age_placebo)

# Extract the p-value
age_group_effects_p_value = age_group_effects["p-val"]
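
To show what the Mann-Whitney U statistic measures, here is a minimal sketch that counts pairwise comparisons directly. The helper name `mann_whitney_u` and the ages are hypothetical and not taken from the dataset. The sketch computes only the statistic; the p-value additionally requires the null distribution of U, which is left to the library.

```python
def mann_whitney_u(x, y):
    """U statistic for x: number of (x, y) pairs where x exceeds y; ties count one half."""
    u = 0.0
    for xi in x:
        for yi in y:
            if xi > yi:
                u += 1.0
            elif xi == yi:
                u += 0.5
    return u


# Hypothetical ages (assumption), not taken from drug_safety.csv
drug_ages = [34, 51, 62, 47, 58]
placebo_ages = [29, 51, 40, 66]

u_drug = mann_whitney_u(drug_ages, placebo_ages)
u_placebo = mann_whitney_u(placebo_ages, drug_ages)

# The two U values always sum to the number of pairs, len(x) * len(y)
print(u_drug, u_placebo)

# Share of pairs in which the Drug value is larger (ties count one half)
print(u_drug / (len(drug_ages) * len(placebo_ages)))
```

The pairwise loop is quadratic in the sample size, so it is suitable for illustration only; rank-based implementations are used for larger samples.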