# Health Outcomes Analysis – Individual Assignment (Part 1+Part 2)

In this notebook I work as a junior data analyst at a research institute.  
The task is to analyse a health-related dataset and answer a set of basic statistical questions using Python.

The goals are to:

- describe the data using simple summary statistics,
- create visualisations of important variables,
- run a simple simulation related to disease status,
- compute a 95% confidence interval for the mean systolic blood pressure using a normal approximation,
- also compute a bootstrap confidence interval for the same mean and compare the two methods,
- test a hypothesis about smokers and non-smokers using a t-test,
- run a small simulation study to estimate the power of the hypothesis test.

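All analysis is run from this notebook; the `HealthAnalyzer` helper used for the descriptive statistics is imported from the `src.health_analysis` module.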


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from src.health_analysis import HealthAnalyzer

# Make plots appear inside the notebook
%matplotlib inline


In [None]:
# Load the dataset from the data folder
df = pd.read_csv("data/health_study_dataset.csv")

# Create the analyzer used for the descriptive statistics
analyzer = HealthAnalyzer(df)

# Show the first rows (last expression in the cell, so it is displayed)
df.head()


In [3]:
# Basic information about the dataset
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           800 non-null    int64  
 1   age          800 non-null    int64  
 2   sex          800 non-null    object 
 3   height       800 non-null    float64
 4   weight       800 non-null    float64
 5   systolic_bp  800 non-null    float64
 6   cholesterol  800 non-null    float64
 7   smoker       800 non-null    object 
 8   disease      800 non-null    int64  
dtypes: float64(4), int64(3), object(2)
memory usage: 56.4+ KB


In [4]:
# Summary statistics for numeric variables
df.describe()


| statistic | id | age | height | weight | systolic_bp | cholesterol | disease |
|---|---|---|---|---|---|---|---|
| count | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 |
| mean | 400.5 | 49.42625 | 171.84925 | 73.413 | 149.178625 | 4.92915 | 0.05875 |
| std | 231.0844 | 14.501118 | 9.804259 | 13.685059 | 12.79336 | 0.848413 | 0.235303 |
| min | 1.0 | 18.0 | 144.4 | 33.7 | 106.8 | 2.5 | 0.0 |
| 25% | 200.75 | 39.0 | 164.775 | 64.8 | 140.9 | 4.3275 | 0.0 |
| 50% | 400.5 | 50.0 | 171.35 | 73.2 | 149.4 | 4.97 | 0.0 |
| 75% | 600.25 | 59.0 | 178.925 | 82.6 | 157.6 | 5.4825 | 0.0 |
| max | 800.0 | 90.0 | 200.4 | 114.4 | 185.9 | 7.88 | 1.0 |


## Data overview

The dataset contains individual-level information about 800 participants in a health study.

The main variables are:

- `age` – age in years
- `sex` – sex (M/F)
- `height` – height in centimetres
- `weight` – weight in kilograms
- `systolic_bp` – systolic blood pressure (mmHg)
- `cholesterol` – cholesterol level (mmol/L)
- `smoker` – smoking status (Yes/No)
- `disease` – indicator for a certain disease (0 = no, 1 = yes)

The summary statistics above give a first impression of typical values and ranges for the numeric variables.
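
A quick way to confirm that the columns match this description is to check for missing values and list the categories actually present. The sketch below uses a small made-up DataFrame (`demo`, with hypothetical values) that has the same column names as the dataset, so it runs on its own without the CSV file.

```python
import pandas as pd

# Small made-up sample with the same column names as the dataset (values are illustrative only)
demo = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "smoker": ["Yes", "No", "No", "No"],
    "disease": [0, 0, 1, 0],
    "systolic_bp": [150.2, 141.0, 160.5, 148.3],
})

# Number of missing values per column
missing_per_column = demo.isna().sum()

# Categories present in the text and indicator columns
categories = {col: sorted(demo[col].unique().tolist())
              for col in ["sex", "smoker", "disease"]}

print(missing_per_column)
print(categories)
```

Running the same two checks on `df` would show whether `sex`, `smoker` and `disease` really only contain the codes listed above.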


## Descriptive statistics

Next, I calculate mean, median, minimum and maximum for a selection of key health variables.

In [5]:
desc_table = analyzer.basic_stats()
desc_table


| variable | mean | median | min | max |
|---|---|---|---|---|
| age | 49.42625 | 50.0 | 18.0 | 90.0 |
| height | 171.84925 | 171.35 | 144.4 | 200.4 |
| weight | 73.413 | 73.2 | 33.7 | 114.4 |
| systolic_bp | 149.178625 | 149.4 | 106.8 | 185.9 |
| cholesterol | 4.92915 | 4.97 | 2.5 | 7.88 |


The table above shows mean, median, minimum and maximum for age, height, weight, systolic blood pressure and cholesterol.
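
The source of `HealthAnalyzer.basic_stats()` is not shown in this notebook. A minimal sketch of a function that would produce a table of the same shape is given below, assuming the method simply aggregates the selected numeric columns with pandas. The function name `basic_stats_sketch` and the demo values are hypothetical.

```python
import pandas as pd

def basic_stats_sketch(data, columns):
    """Return mean, median, min and max for the given columns, one row per variable."""
    # agg() returns one row per statistic; transpose so each variable becomes a row
    return data[columns].agg(["mean", "median", "min", "max"]).T

# Made-up values, only to show the shape of the result
demo = pd.DataFrame({
    "age": [30, 50, 70],
    "weight": [60.0, 75.0, 90.0],
})

table = basic_stats_sketch(demo, ["age", "weight"])
print(table)
```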

## Visualisations

I now create three simple plots to explore the distributions and some relationships in the data.

### 1. Distribution of systolic blood pressure

In [None]:
plt.figure(figsize=(6, 4))
plt.hist(df["systolic_bp"].dropna(), bins=20, edgecolor="black")
plt.xlabel("Systolic blood pressure (mmHg)")
plt.ylabel("Frequency")
plt.title("Histogram of systolic blood pressure")
plt.tight_layout()
plt.show()


### 2. Weight by sex (boxplot)

In [None]:
# Prepare data for a simple boxplot by sex
weights_m = df.loc[df["sex"] == "M", "weight"].dropna()
weights_f = df.loc[df["sex"] == "F", "weight"].dropna()

plt.figure(figsize=(6, 4))
# tick_labels requires Matplotlib 3.9 or later (older versions use labels=)
plt.boxplot([weights_m, weights_f],
            tick_labels=["M", "F"],
            showmeans=True)
plt.xlabel("Sex")
plt.ylabel("Weight (kg)")
plt.title("Weight distribution by sex")
plt.tight_layout()
plt.show()


### 3. Proportion of smokers

In [None]:
# normalize=True returns proportions (summing to 1) rather than raw counts
smoker_counts = df["smoker"].value_counts(normalize=True)

plt.figure(figsize=(6, 4))
plt.bar(smoker_counts.index, smoker_counts.values)
plt.xlabel("Smoker status")
plt.ylabel("Proportion")
plt.title("Proportion of smokers vs non-smokers")
plt.tight_layout()
plt.show()

smoker_counts


The three plots give a simple visual overview of the distribution of blood pressure,
differences in weight between men and women, and the proportion of smokers.

## Simulation based on disease probability

In this step I use the observed proportion of disease in the dataset to simulate new individuals  
with the same probability of having the disease.


In [None]:
# Since disease is coded as 0/1, the mean is the proportion with disease
p_disease = df["disease"].mean()
p_disease


In [None]:
np.random.seed(42)  # for reproducibility

n_sim = 1000
simulated_disease = np.random.binomial(n=1, p=p_disease, size=n_sim)

simulated_proportion = simulated_disease.mean()
simulated_proportion


The observed proportion of disease in the original dataset is given by `p_disease` (0.05875, i.e. 47 of the 800 participants),  
and the simulated proportion among 1000 new individuals is given by `simulated_proportion`.

The two values are not exactly the same, but they are reasonably close.  
This is expected: with 1000 draws from the same probability, the simulated proportion varies from run to run with a standard error of about 0.007.
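
How close is "reasonably close"? For a proportion based on `n` independent draws with probability `p`, the standard error is `sqrt(p * (1 - p) / n)`. The sketch below repeats the simulation many times and compares the spread of the simulated proportions with that formula. It assumes `p = 0.05875` (the mean of `disease` in the summary table) and uses NumPy's `default_rng` generator instead of the global seed; the number of repetitions is an arbitrary choice.

```python
import numpy as np

p = 0.05875        # disease proportion taken from the summary statistics
n = 1000           # simulated individuals per run
n_repeats = 5000   # number of repeated simulations (arbitrary choice)

rng = np.random.default_rng(42)

# Each binomial draw is the number of diseased individuals among n;
# dividing by n turns it into a simulated proportion
sim_props = rng.binomial(n=n, p=p, size=n_repeats) / n

theoretical_se = np.sqrt(p * (1 - p) / n)
empirical_se = sim_props.std(ddof=1)

print(f"theoretical SE: {theoretical_se:.4f}")
print(f"empirical SE:   {empirical_se:.4f}")
```

A single simulated proportion will therefore usually land within roughly 0.015 (two standard errors) of the observed value.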


## Confidence interval for mean systolic blood pressure

Next, I construct a 95% confidence interval for the mean of `systolic_bp`  
using a normal approximation.
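
With sample mean $\bar{x}$, sample standard deviation $s$ and sample size $n$, the interval is

$$
\bar{x} \pm z_{0.975}\,\frac{s}{\sqrt{n}}
$$

As a rough hand check using the rounded values from the summary table ($\bar{x} \approx 149.18$, $s \approx 12.79$, $n = 800$):

$$
149.18 \pm 1.96 \cdot \frac{12.79}{\sqrt{800}} \approx 149.18 \pm 0.89
$$

which gives approximately 148.29 to 150.07 mmHg. The code cell computes the same quantities without rounding.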

In [None]:
bp = df["systolic_bp"].dropna()

mean_bp = bp.mean()
std_bp = bp.std(ddof=1)
n_bp = bp.shape[0]

alpha = 0.05
z_value = stats.norm.ppf(1 - alpha/2)  # ≈ 1.96 for 95% CI

margin_of_error = z_value * std_bp / np.sqrt(n_bp)

ci_lower = mean_bp - margin_of_error
ci_upper = mean_bp + margin_of_error

mean_bp, ci_lower, ci_upper


### Alternative confidence interval using bootstrap

To compare with the normal approximation, I also calculate a 95% confidence interval
for the mean of `systolic_bp` using a simple bootstrap method.

The idea is to resample the observed blood pressure values with replacement many times, calculate the mean of each resample, and then use the 2.5th and 97.5th percentiles of the bootstrap means as the limits of the interval.


In [None]:
# Bootstrap 95% confidence interval for mean systolic blood pressure

bp_values = df["systolic_bp"].dropna().values
n_bp = len(bp_values)

np.random.seed(42)   # for reproducibility
B = 2000             # number of bootstrap samples

bootstrap_means = []

for i in range(B):
    sample = np.random.choice(bp_values, size=n_bp, replace=True)
    bootstrap_means.append(sample.mean())

boot_lower = np.percentile(bootstrap_means, 2.5)
boot_upper = np.percentile(bootstrap_means, 97.5)

boot_lower, boot_upper


The bootstrap interval is based on the variation in the resampled means.

Now I have two 95% confidence intervals for the mean systolic blood pressure:

- Normal approximation: from `ci_lower` to `ci_upper`.
- Bootstrap interval: from `boot_lower` to `boot_upper`.

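In this dataset the two intervals are fairly similar, which suggests that the normal approximation works reasonably well here. This is what one would expect with 800 observations, since the sampling distribution of the mean is then close to normal.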
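
The comparison can also be reproduced on synthetic data. The sketch below is not the notebook's own computation: it draws 800 made-up values from a normal distribution (mean 149 and standard deviation 13, chosen only to resemble the summary statistics) and computes both intervals, with the bootstrap vectorised through an index matrix instead of a Python loop.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for the systolic_bp column (assumed parameters, not the real data)
values = rng.normal(loc=149, scale=13, size=800)
n = len(values)

# Normal-approximation interval
z = stats.norm.ppf(0.975)
half_width = z * values.std(ddof=1) / np.sqrt(n)
normal_ci = (values.mean() - half_width, values.mean() + half_width)

# Percentile bootstrap: each row of idx selects one resample of size n with replacement
B = 2000
idx = rng.integers(0, n, size=(B, n))
boot_means = values[idx].mean(axis=1)
boot_ci = tuple(np.percentile(boot_means, [2.5, 97.5]))

print("normal:   ", normal_ci)
print("bootstrap:", boot_ci)
```

The index-matrix approach trades memory for speed; for much larger samples the loop version may be the safer choice.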


## Hypothesis test: smokers vs non-smokers

I now test whether smokers have a higher mean systolic blood pressure than non-smokers.

- Null hypothesis (H0): Smokers and non-smokers have the same mean systolic blood pressure.
- Alternative hypothesis (H1): Smokers have a higher mean systolic blood pressure than non-smokers.

I use an independent samples t-test that does not assume equal variances in the two groups (Welch's t-test, `equal_var=False`).


In [None]:
bp_smokers = df.loc[df["smoker"] == "Yes", "systolic_bp"].dropna()
bp_nonsmokers = df.loc[df["smoker"] == "No", "systolic_bp"].dropna()

bp_smokers.mean(), bp_nonsmokers.mean()


In [None]:
t_stat, p_value_two_sided = stats.ttest_ind(
    bp_smokers,
    bp_nonsmokers,
    equal_var=False
)

t_stat, p_value_two_sided


In [None]:
# Convert to a one-sided p-value for the hypothesis "smokers > non-smokers"
if bp_smokers.mean() > bp_nonsmokers.mean():
    p_value_one_sided = p_value_two_sided / 2
else:
    p_value_one_sided = 1 - p_value_two_sided / 2

p_value_one_sided


The mean systolic blood pressure is calculated separately for smokers and non-smokers,  
and an independent samples t-test is used to compare the two groups.

The one-sided p-value (printed above) tells us how compatible the data are with the null hypothesis.

- If the p-value is **below 0.05**, there is evidence that smokers have a higher  
  mean systolic blood pressure than non-smokers in the population the sample represents.
- If the p-value is **above 0.05**, the data do not provide strong enough evidence  
  to claim a higher mean for smokers.

In this sample the one-sided p-value is about 0.33, which is clearly above 0.05.  
Therefore I do not reject the null hypothesis.

The data do not provide strong enough evidence to conclude that smokers have a higher
mean systolic blood pressure than non-smokers in this study, even if the sample means
may differ somewhat.
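
SciPy 1.6 and later accept an `alternative` argument in `stats.ttest_ind`, which returns the one-sided p-value directly. The sketch below checks on made-up data that this matches the manual conversion from the two-sided p-value. The group sizes and parameters are arbitrary assumptions, not values from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up groups (arbitrary sizes and parameters, for illustration only)
group_a = rng.normal(loc=150, scale=13, size=200)
group_b = rng.normal(loc=149, scale=13, size=600)

# Two-sided Welch test, then manual conversion to the alternative "group_a > group_b"
t_stat, p_two = stats.ttest_ind(group_a, group_b, equal_var=False)
p_manual = p_two / 2 if t_stat > 0 else 1 - p_two / 2

# One-sided Welch test in a single call
p_direct = stats.ttest_ind(group_a, group_b, equal_var=False,
                           alternative="greater").pvalue

print(p_manual, p_direct)
```

Both routes give the same p-value; the single call is shorter and avoids the risk of halving in the wrong direction.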


## Power simulation for the t-test

To see how often the t-test would be able to detect a real difference,
I run a small simulation (power analysis).

Idea:

- Use the observed means and standard deviations for smokers and non-smokers as an approximation of the "true" values.
- Simulate many new datasets with the same sample sizes as in this study.
- For each simulated dataset, run the same t-test as before.
- Count how often the one-sided p-value is below 0.05.


In [None]:
# Use the current sample statistics as "true" parameters for the simulation

bp_smokers = df.loc[df["smoker"] == "Yes", "systolic_bp"].dropna()
bp_nonsmokers = df.loc[df["smoker"] == "No", "systolic_bp"].dropna()

n_smokers = len(bp_smokers)
n_nonsmokers = len(bp_nonsmokers)

mean_smokers = bp_smokers.mean()
mean_nonsmokers = bp_nonsmokers.mean()

std_smokers = bp_smokers.std(ddof=1)
std_nonsmokers = bp_nonsmokers.std(ddof=1)

n_smokers, n_nonsmokers, mean_smokers, mean_nonsmokers


In [None]:
np.random.seed(123)

n_sim = 1000          # number of simulated "studies"
alpha = 0.05
count_significant = 0

for i in range(n_sim):
    # simulate new samples for smokers and non-smokers
    sim_smokers = np.random.normal(loc=mean_smokers,
                                   scale=std_smokers,
                                   size=n_smokers)
    sim_nonsmokers = np.random.normal(loc=mean_nonsmokers,
                                      scale=std_nonsmokers,
                                      size=n_nonsmokers)
    
    # two-sided t-test
    t_stat, p_two_sided = stats.ttest_ind(
        sim_smokers,
        sim_nonsmokers,
        equal_var=False
    )
    
    # one-sided p-value for "smokers > non-smokers"
    if sim_smokers.mean() > sim_nonsmokers.mean():
        p_one_sided = p_two_sided / 2
    else:
        p_one_sided = 1 - p_two_sided / 2
    
    if p_one_sided < alpha:
        count_significant += 1

power_estimate = count_significant / n_sim
power_estimate


In this case the estimated power is around 0.14 (14%).  
This means that if the true difference were as small as the one observed in this sample, and the same study were repeated many times with the same group sizes,
the test would detect the difference in mean blood pressure in only about 14% of the studies.
In other words, the test has quite low power to detect a difference of this size.
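
To see how power depends on the size of the true difference, the simulation can be wrapped in a function. The sketch below is an illustration under assumed values: the function name `simulate_power`, the group sizes (200 and 600) and the common standard deviation (13) are hypothetical and not taken from the dataset. It uses the `axis` and `alternative` arguments of `stats.ttest_ind` (SciPy 1.6 or later) to run all simulated studies at once.

```python
import numpy as np
from scipy import stats

def simulate_power(diff, sd, n_a, n_b, n_sim=2000, alpha=0.05, seed=123):
    """Estimate the power of a one-sided Welch t-test for a true mean difference `diff`."""
    rng = np.random.default_rng(seed)
    # One row per simulated study; group a has the higher mean when diff > 0
    sim_a = rng.normal(loc=diff, scale=sd, size=(n_sim, n_a))
    sim_b = rng.normal(loc=0.0, scale=sd, size=(n_sim, n_b))
    result = stats.ttest_ind(sim_a, sim_b, axis=1, equal_var=False,
                             alternative="greater")
    # Power estimate: share of simulated studies that reject H0
    return float(np.mean(result.pvalue < alpha))

# Hypothetical group sizes and standard deviation
for diff in [0.0, 1.0, 3.0, 5.0]:
    power = simulate_power(diff, sd=13, n_a=200, n_b=600)
    print(f"true difference {diff:.1f} mmHg -> estimated power {power:.2f}")
```

With a true difference of zero the rejection rate should be close to `alpha`, which is a useful sanity check on the simulation.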


## Summary

In this notebook I have:

- explored a health dataset with basic descriptive statistics,
- created three visualisations (distribution of blood pressure, weight by sex, proportion of smokers),
- performed a simple simulation using the observed disease proportion,
- constructed a 95% confidence interval for mean systolic blood pressure,
- computed a bootstrap confidence interval for the same mean and compared the two methods,
- tested whether smokers have a higher mean systolic blood pressure than non-smokers,
- carried out a small simulation study to estimate the power of this test under conditions similar to the dataset.

## Sources

- Course lecture notes and videos in statistics and Python.
- Bootstrap confidence intervals: https://acclab.github.io/bootstrap-confidence-intervals.html
- t-test guide (in Swedish): https://www.stathelp.se/sv/ttest_sv.html
- Official documentation for pandas, NumPy and Matplotlib.