# Calculation of association between blood-group variants and *Yersinia pestis* status

## Purpose of the notebook
This notebook evaluates associations between blood-group–related genetic variants and *Yersinia pestis* (plague) case–control status in Early Medieval individuals. It implements a reproducible workflow suitable for peer review and public release.

The analysis includes:

1. Loading blood-group genotype data and plague phenotype labels.
2. Computing allele and genotype frequencies in cases and controls.
3. Performing association tests (e.g., Fisher’s exact tests or logistic regression).
4. Estimating effect sizes, odds ratios, and confidence intervals.
5. Applying multiple-testing correction where applicable.
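
Step 2 can be illustrated with a minimal, standard-library sketch. The records below are made-up placeholders rather than study data, and `group_frequencies` is a hypothetical helper name, not a function defined in this notebook.

```python
from collections import Counter

# Illustrative records only (not study data): (plague status, ABO blood group).
records = [
    ("case", "O"), ("case", "A"), ("case", "O"),
    ("control", "O"), ("control", "A"), ("control", "B"), ("control", "O"),
]

def group_frequencies(records):
    """Return {status: {blood_group: (count, proportion within that status)}}."""
    counts = {}
    for status, group in records:
        counts.setdefault(status, Counter())[group] += 1
    summary = {}
    for status, counter in counts.items():
        total = sum(counter.values())
        summary[status] = {g: (n, n / total) for g, n in counter.items()}
    return summary

freqs = group_frequencies(records)
for status, groups in freqs.items():
    for group, (n, prop) in sorted(groups.items()):
        print(f"{status:8s} {group:2s} n={n} ({prop:.1%})")
```

Proportions are computed within each status group, so cases and controls can be compared directly even when group sizes differ.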


## System requirements

- Operating system: Linux (Ubuntu 20.04+), macOS 12+, or Windows 10+
- Programming language: **Python 3.11**
- Hardware: Standard 64-bit desktop or laptop; ≥ 8 GB RAM recommended

---

## Required Python packages

The notebook assumes the following Python packages are installed:

- `pandas` (data handling)
- `numpy` (numerical computations)
- `scipy` (statistical tests)
- `statsmodels` (regression / modelling)
- `matplotlib` (optional: plots and visualisation)
- `jupyterlab` or `notebook` (to run this file)

Exact versions and the full dependency list are provided in the project’s `requirements.txt` or `environment.yml` file.

---

## Expected inputs and outputs

### Input

A CSV file containing:

- `Sample_ID` – unique identifier per individual  
- A case–control indicator (e.g. `Plague_status` with values `"case"` / `"control"` or equivalent)  
- Blood-group variant columns, for example:  
  - `ABO_rs8176719`  
  - `ABO_rs8176746`  
  - `ABO_rs8176750`  
  - `RhD_variant` or equivalent  
  - Any other blood-group SNPs analysed in the study  

Genotypes may be coded as 0/1/2 (dosage of the effect allele) or as alleles (e.g. `GG`, `GA`, `AA`), as described in the Methods section of the manuscript.
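
The two codings are related by counting copies of the effect allele. The sketch below assumes two-character nucleotide calls for a biallelic single-nucleotide variant; `genotype_to_dosage` is a hypothetical helper, and insertion/deletion variants would need a different encoding.

```python
def genotype_to_dosage(genotype, effect_allele):
    """Count copies of `effect_allele` in a two-letter genotype such as 'GA'.

    Returns None for missing or malformed calls so they can be excluded later.
    """
    if genotype is None:
        return None
    call = genotype.strip().upper()
    # Accept only two-character calls made of A, C, G, T (assumption of this sketch).
    if len(call) != 2 or not set(call) <= set("ACGT"):
        return None
    return call.count(effect_allele.upper())

print([genotype_to_dosage(g, "A") for g in ["GG", "GA", "AA", "NA", ""]])
# prints [0, 1, 2, None, None]
```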

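The default path to the input file is set in the data-loading cell (`csv_path = Path('BloodGroup_input.csv')`) and can be modified as needed. The code cells in this notebook read the columns `ID`, `Plague` (`case` / `control`), and `BloodGroup` (ABO phenotype: `O`, `A`, `B`, or `AB`); rename the columns in the input file or in the data-loading cell if yours differ.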

### Outputs

The notebook produces:

- `BloodGroup_case_control_summary.csv` – allele/genotype counts and frequencies in cases vs. controls  
- `BloodGroup_association_results.csv` – effect sizes, odds ratios, confidence intervals, raw and adjusted *P*-values  

These files include:
- allele/genotype frequencies
- case/control counts
- odds ratios and 95% CI
- P-values and FDR-adjusted Q-values

---

## Statistical methods (summary)

Depending on the variant and data format, the notebook may use:

- Fisher’s exact tests
- 2×3 genotype contingency-table tests
- Logistic regression (optional)

All tests report effect estimates (odds ratios) and corresponding significance values.
Multiple-testing correction uses Benjamini–Hochberg FDR when multiple variants are tested.

Full details are given in the manuscript and Supplementary Methods.
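
As a minimal sketch of the Benjamini–Hochberg adjustment named above: each sorted *P*-value is multiplied by m / rank, and monotonicity is enforced from the largest value downward. The function name and the *P*-values are illustrative assumptions, not study results; in practice `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` provides the same adjustment.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as `p_values`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value never exceeds
    # the q-value of any larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

p_raw = [0.01, 0.04, 0.03, 0.005]  # illustrative values only
q_adj = benjamini_hochberg(p_raw)
print([round(q, 4) for q in q_adj])
# prints [0.02, 0.04, 0.04, 0.02]
```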

---

## How to run

1. **Set up the environment**

   Create and activate the analysis environment (example using conda):
  
   ```bash
   conda env create -f environment.yml
   conda activate abo-plague
   ```
   
   or install packages directly with pip:
   
   ```bash
   pip install -r requirements.txt
   ```

2. **Open the notebook**

    Launch Jupyter and open this notebook file (`.ipynb`):

    ```bash
    jupyter lab
    ```

    or

    ```bash
    jupyter notebook
    ```

3. **Configure the input file**

    In the data-loading cell, ensure the path to the input CSV is correct, for example `csv_path = Path('BloodGroup_input.csv')` (update if needed).

4. **Run the analysis**

    In the Jupyter interface, select:
    
    - 'Kernel' -> 'Restart & Run All'
    
    All cells will execute in sequence, and the output tables and any plots will be written to the working directory.
    
5. **Expected runtime**
    
    On a standard desktop or laptop, the complete analysis typically runs in < 1 minute for the datasets used in the manuscript.
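
Before running all cells, it can help to confirm that the input file has the columns and category labels the code cells read (`ID`, `Plague`, `BloodGroup`). The following is a hedged, standard-library sketch; `check_input` and the sample IDs are hypothetical and not part of the notebook.

```python
import csv
import io

REQUIRED_COLUMNS = {"ID", "Plague", "BloodGroup"}  # names used by the code cells
VALID_STATUS = {"case", "control"}
VALID_GROUPS = {"O", "A", "B", "AB"}

def check_input(handle):
    """Return a list of problems found in a CSV opened as a text handle."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        status = (row["Plague"] or "").strip().lower()
        group = (row["BloodGroup"] or "").strip().upper()
        if status not in VALID_STATUS:
            problems.append(f"line {line_no}: unexpected Plague value {status!r}")
        if group not in VALID_GROUPS:
            problems.append(f"line {line_no}: unexpected BloodGroup value {group!r}")
    return problems

# Illustrative in-memory file; for a real file use
# open("BloodGroup_input.csv", newline="") instead.
demo = io.StringIO("ID,Plague,BloodGroup\nX001,case,O\nX002,control,C\n")
print(check_input(demo))
# prints ["line 3: unexpected BloodGroup value 'C'"]
```

An empty list means the file passed these checks; it does not validate the genotype data itself.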



In [2]:
import pandas as pd
import numpy as np
from pathlib import Path

# Path to your CSV (adjust if needed)
csv_path = Path('BloodGroup_input.csv')

df = pd.read_csv(csv_path)

# Normalize fields
df['ID'] = df['ID'].astype(str)
df['Plague'] = df['Plague'].astype(str).str.strip().str.lower()   # 'case' / 'control'
df['BloodGroup'] = df['BloodGroup'].astype(str).str.strip().str.upper()  # 'O','A','B','AB'

display(df.head())
# Note: only the last expression in a cell is echoed, so the ID-prefix counts are not shown
df['ID'].str[:3].value_counts()
df['Plague'].value_counts(), df['BloodGroup'].value_counts()


|   | ID | Plague | BloodGroup |
|---|---|---|---|
| 0 | EDI102 | case | O |
| 1 | EDI001 | case | O |
| 2 | EDI060 | case | O |
| 3 | EDI062 | case | A |
| 4 | EDI064 | case | A |


(control    85
 case        9
 Name: Plague, dtype: int64,
 O     55
 A     33
 B      5
 AB     1
 Name: BloodGroup, dtype: int64)

In [3]:
from math import log, sqrt, exp
from scipy.stats import fisher_exact, chi2_contingency

def analyze_O_difference(df, cases_mask, controls_mask, label):
    """Builds a 2x2 for O vs non-O and performs Fisher exact (two-sided).
       Also reports Yates chi2 p (reference), OR with 95% CI (Woolf, Haldane-Anscombe for zeros)."""
    cases = df.loc[cases_mask]
    controls = df.loc[controls_mask]
    
    # 2x2 counts
    a = int((cases['BloodGroup'] == 'O').sum())     # cases with O
    b = int((cases['BloodGroup'] != 'O').sum())     # cases non-O
    c = int((controls['BloodGroup'] == 'O').sum())  # controls with O
    d = int((controls['BloodGroup'] != 'O').sum())  # controls non-O
    
    table = np.array([[a, b],[c, d]])
    
    # Fisher’s exact (two-sided)
    or_fisher, p_fisher = fisher_exact(table, alternative='two-sided')
    
    # Chi-squared with Yates’ correction (reference only)
    chi2, p_chi2, dof, expected = chi2_contingency(table, correction=True)
    
    # Haldane–Anscombe correction: if any cell is zero, add 0.5 to all four cells,
    # then use the corrected counts for the OR and the Woolf CI of log(OR)
    adj = 0.5 if 0 in (a, b, c, d) else 0.0
    a_, b_, c_, d_ = a + adj, b + adj, c + adj, d + adj
    or_hw = (a_ * d_) / (b_ * c_)
    
    se_log_or = sqrt(1/a_ + 1/b_ + 1/c_ + 1/d_)
    z = 1.96
    log_or = log(or_hw)
    ci_low, ci_high = exp(log_or - z*se_log_or), exp(log_or + z*se_log_or)
    
    cases_total, controls_total = a + b, c + d
    pct_cases_O = (a / cases_total * 100) if cases_total else np.nan
    pct_controls_O = (c / controls_total * 100) if controls_total else np.nan
    
    result = {
        "Comparison": label,
        "Cases_n": cases_total,
        "Controls_n": controls_total,
        "Cases_O": a,
        "Cases_nonO": b,
        "Controls_O": c,
        "Controls_nonO": d,
        "Cases_O_%": round(pct_cases_O, 1),
        "Controls_O_%": round(pct_controls_O, 1),
        "OddsRatio": or_hw,
        "OR_95%CI_low": ci_low,
        "OR_95%CI_high": ci_high,
        "Fisher_p_two_sided": p_fisher,
        "Chi2_Yates_p": p_chi2,
    }
    return result, table, expected
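
To make the two-sided Fisher test used above less of a black box, the sketch below recomputes it from hypergeometric probabilities with the standard library only. It sums the probabilities of all tables with the same margins that are no more probable than the observed one, which is the usual two-sided definition. The function name is a hypothetical stand-in, and the counts are those of the EDI comparison reported in this notebook's output.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Integer weights proportional to each table's hypergeometric probability;
    # working with integers keeps the "no more probable" comparison exact.
    weights = {k: comb(col1, k) * comb(n - col1, row1 - k) for k in range(lo, hi + 1)}
    observed = weights[a]
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / comb(n, row1)

p_edi = fisher_exact_two_sided(5, 4, 16, 9)   # EDI cases vs EDI controls
p_all = fisher_exact_two_sided(5, 4, 50, 35)  # all cases vs all controls
print(round(p_edi, 4), round(p_all, 4))
# prints 0.7041 1.0
```

Both values agree with the `Fisher_p_two_sided` column in the summary table.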


In [4]:
# Masks
mask_EDI = df['ID'].str.startswith('EDI')
mask_case = df['Plague'] == 'case'
mask_control = df['Plague'] == 'control'

# (1) Within EDI only: cases vs controls
res1, table1, expected1 = analyze_O_difference(df, mask_EDI & mask_case, mask_EDI & mask_control,
                                               "EDI samples: cases vs controls")

# (2) Overall: all cases vs all controls
res2, table2, expected2 = analyze_O_difference(df, mask_case, mask_control,
                                               "All samples: cases vs controls")

# (3) EDI cases vs ALL controls
res3, table3, expected3 = analyze_O_difference(df, mask_EDI & mask_case, mask_control,
                                               "EDI cases vs ALL controls")

summary_df = pd.DataFrame([res1, res2, res3])

# Round for display
round_cols = ["OddsRatio","OR_95%CI_low","OR_95%CI_high","Fisher_p_two_sided","Chi2_Yates_p"]
summary_df[round_cols] = summary_df[round_cols].astype(float).round(4)
summary_df


|   | Comparison | Cases_n | Controls_n | Cases_O | Cases_nonO | Controls_O | Controls_nonO | Cases_O_% | Controls_O_% | OddsRatio | OR_95%CI_low | OR_95%CI_high | Fisher_p_two_sided | Chi2_Yates_p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EDI samples: cases vs controls | 9 | 25 | 5 | 4 | 16 | 9 | 55.6 | 64.0 | 0.7031 | 0.1496 | 3.3055 | 0.7041 | 0.9625 |
| 1 | All samples: cases vs controls | 9 | 85 | 5 | 4 | 50 | 35 | 55.6 | 58.8 | 0.875 | 0.2193 | 3.4918 | 1.0 | 0.8678 |
| 2 | EDI cases vs ALL controls | 9 | 85 | 5 | 4 | 50 | 35 | 55.6 | 58.8 | 0.875 | 0.2193 | 3.4918 | 1.0 | 0.8678 |
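
The odds ratios and Woolf intervals above can be checked by hand: OR = (a·d)/(b·c), and the 95% CI is exp(ln OR ± 1.96·SE) with SE = sqrt(1/a + 1/b + 1/c + 1/d). The sketch below is a standalone re-computation for tables without zero cells; `woolf_odds_ratio` is a hypothetical helper name.

```python
from math import exp, log, sqrt

def woolf_odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (log-scale) confidence interval; all cells must be > 0."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = exp(log(odds_ratio) - z * se_log_or)
    high = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high

# Counts from the "EDI samples: cases vs controls" row:
# cases O / non-O = 5 / 4, controls O / non-O = 16 / 9.
or_edi, low_edi, high_edi = woolf_odds_ratio(5, 4, 16, 9)
print(round(or_edi, 4), round(low_edi, 4), round(high_edi, 4))
# prints 0.7031 0.1496 3.3055
```

With only nine cases the interval is wide and spans 1, which is consistent with the non-significant Fisher *P*-values.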


In [5]:
def show_tables(observed, expected, title):
    obs_df = pd.DataFrame(observed, index=["Cases","Controls"], columns=["O","non-O"])
    exp_df = pd.DataFrame(np.round(expected, 2), index=["Cases","Controls"], columns=["O","non-O"])
    display(pd.DataFrame({title+" — Observed": [" "]}))  # spacer title
    display(obs_df.astype(int))
    display(pd.DataFrame({title+" — Expected (chi2 model)": [" "]}))
    display(exp_df)

show_tables(table1, expected1, "EDI")
show_tables(table2, expected2, "All samples")
show_tables(table3, expected3, "EDI cases vs ALL controls")


Unnamed: 0,EDI — Observed
0,


Unnamed: 0,O,non-O
Cases,5,4
Controls,16,9


Unnamed: 0,EDI — Expected (chi2 model)
0,


Unnamed: 0,O,non-O
Cases,5.56,3.44
Controls,15.44,9.56


Unnamed: 0,All samples — Observed
0,


Unnamed: 0,O,non-O
Cases,5,4
Controls,50,35


Unnamed: 0,All samples — Expected (chi2 model)
0,


Unnamed: 0,O,non-O
Cases,5.27,3.73
Controls,49.73,35.27


Unnamed: 0,EDI cases vs ALL controls — Observed
0,


Unnamed: 0,O,non-O
Cases,5,4
Controls,50,35


Unnamed: 0,EDI cases vs ALL controls — Expected (chi2 model)
0,


Unnamed: 0,O,non-O
Cases,5.27,3.73
Controls,49.73,35.27
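
The expected counts shown above follow from the independence model: each cell is (row total × column total) / N. The sketch below recomputes them for the EDI table and applies the classical Yates-corrected chi-square formula with one degree of freedom. It is a standard-library illustration with hypothetical function names; some library implementations limit the 0.5 correction when |observed − expected| is smaller than 0.5, so results can differ for such tables.

```python
from math import erfc, sqrt

def expected_counts(table):
    """Expected counts under independence: row total x column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def yates_chi2_p(table):
    """Classical Yates-corrected chi-square p-value for a 2x2 table (1 d.f.)."""
    expected = expected_counts(table)
    chi2 = sum(
        (abs(o - e) - 0.5) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # Survival function of a chi-square variable with 1 d.f.:
    # P(X > x) = erfc(sqrt(x / 2))
    return erfc(sqrt(chi2 / 2))

edi = [[5, 4], [16, 9]]  # rows: cases, controls; columns: O, non-O
edi_expected = expected_counts(edi)
edi_p = yates_chi2_p(edi)
print([[round(e, 2) for e in row] for row in edi_expected])
print(round(edi_p, 4))
# prints [[5.56, 3.44], [15.44, 9.56]] and then 0.9625
```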


In [6]:
out_path = Path("bloodgroup_O_stats_summary.csv")
summary_df.to_csv(out_path, index=False)
print(f"Saved: {out_path}")


Saved: bloodgroup_O_stats_summary.csv
