# README

This directory has three main subfolders:
1. **EBV DNA Quantification**: Quantify EBV DNA reads from WGS, normalize to 30x WGS coverage, and determine EBV+ individuals (as a binary trait). Also includes code that formats the covariate input files used in the EBV DNA GWAS.
   - `01_Quantify_EBV_DNA.ipynb`: computes normalized EBV DNA quantities per person.
   - `02_Format_EBV_DNA_covariates.ipynb`: assembles ancestry, PCs, sex at birth, age, age^2, and EBV+ annotations per person.
2. **EBV DNA GWAS**: Run REGENIE on the EBV+ trait for individuals with EUR ancestry.
   - `01_Filter_SNPs.ipynb`: input SNP lists for REGENIE.
   - `02_Format_input_files.ipynb`: input covariate/phenotype files for REGENIE.
   - `03_Run_REGENIE.ipynb`: REGENIE analysis for chr1-22.
   - `04_Parse_REGENIE_results.ipynb`: parsed REGENIE output files.
3. **EBV DNA PheWAS**: Run Fisher tests for associations of EBV+ with ICD9/10 codes represented in individuals with EUR ancestry.
   - `01_Query_PheWAS_inputs.ipynb`: queries to obtain EHR data per person.
   - `02_Clean_ICD_annotations.ipynb`: cleaned version of ICD9/10 annotation files.
   - `03_Run_PheWAS.ipynb`: associations of each ICD code with EBV+, with ICD code annotations.
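
To make the flow from quantification to PheWAS concrete, here is a minimal, dependency-free sketch on toy data. It is not the code in the notebooks. It assumes that normalization is a linear rescaling of the EBV read count to 30x coverage, that EBV+ is called with a simple cutoff (`EBV_POSITIVE_THRESHOLD` is a hypothetical value), and that "Fisher tests" means a two-sided Fisher's exact test on a 2x2 table per ICD code. All function names and the record layout are illustrative.

```python
from math import comb

# Hypothetical cutoff for calling a person EBV+; the actual threshold is
# defined in the notebooks and is not stated in this README.
EBV_POSITIVE_THRESHOLD = 1.0


def normalize_to_30x(ebv_reads, mean_coverage, target_coverage=30.0):
    """Rescale an EBV read count to a common WGS depth (assumed linear scaling)."""
    if mean_coverage <= 0:
        raise ValueError("mean_coverage must be positive")
    return ebv_reads * target_coverage / mean_coverage


def contingency_table(records, threshold=EBV_POSITIVE_THRESHOLD):
    """Count people by EBV status and presence of one ICD code.

    Each record is (ebv_reads, mean_coverage, has_icd_code).
    Returns (a, b, c, d) = (EBV+ with code, EBV+ without code,
                            EBV- with code, EBV- without code).
    """
    a = b = c = d = 0
    for ebv_reads, mean_coverage, has_code in records:
        positive = normalize_to_30x(ebv_reads, mean_coverage) > threshold
        if positive and has_code:
            a += 1
        elif positive:
            b += 1
        elif has_code:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(p, 1.0)


# Toy data: (EBV reads, mean WGS coverage, carries the ICD code of interest)
toy = [(10, 15.0, True), (0, 30.0, False), (5, 30.0, True), (1, 60.0, False)]
table = contingency_table(toy)
p_value = fisher_exact_two_sided(*table)
```

In practice a library routine such as `scipy.stats.fisher_exact` would normally replace the hand-written test, and the test would be repeated for every ICD code with correction for multiple testing.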

In addition, we have included a fourth subfolder, `HLA_haplotype_construction`, containing instructions and scripts for reconstructing HLA haplotypes for all EUR individuals in AoU.

## Notes and updates
Since our initial analyses, AoU has migrated from the Google Life Sciences API to the Google Batch API (announced in May 2025: https://support.researchallofus.org/hc/en-us/articles/37137186102676-Migration-to-Google-Batch-API). To run the `dsub` and `dstat` commands in the **EBV DNA GWAS** section, `--provider` should now be set to `google-batch` instead of `google-cls-v2`. Also note that the disk sizes set in these notebooks are large enough for the AoU Controlled Tier V7 bgen files. The V8 bgen files are considerably larger, so the disk sizes would need to be scaled up accordingly when running on V8.
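
As a rough illustration of the provider change, the sketch below builds `dsub` and `dstat` argument lists with `--provider google-batch`. The helper names, the placeholder project, bucket, and image values, and the proportional disk-scaling rule are all assumptions made for illustration. The notebooks pass additional AoU-specific flags that are not reproduced here.

```python
import math


def dsub_args(command, project, logging_path, image, disk_size_gb,
              provider="google-batch"):
    """Build a dsub argument list (hypothetical helper; not run here)."""
    if provider == "google-cls-v2":
        # The Life Sciences provider is the one AoU has migrated away from.
        raise ValueError("use --provider google-batch instead of google-cls-v2")
    return [
        "dsub",
        "--provider", provider,
        "--project", project,
        "--logging", logging_path,
        "--image", image,
        "--disk-size", str(disk_size_gb),
        "--command", command,
    ]


def dstat_args(project, job_id, provider="google-batch"):
    """Build a dstat argument list to check one job in any status."""
    return [
        "dstat",
        "--provider", provider,
        "--project", project,
        "--jobs", job_id,
        "--status", "*",
    ]


def scaled_disk_gb(v7_disk_gb, v7_bgen_gb, v8_bgen_gb):
    """Scale a disk size in proportion to bgen file size (assumed rule)."""
    return math.ceil(v7_disk_gb * v8_bgen_gb / v7_bgen_gb)


# Placeholder values only; substitute your own workspace settings.
submit = dsub_args(
    command="echo hello",
    project="my-project",
    logging_path="gs://my-bucket/logs/",
    image="my-image",
    disk_size_gb=scaled_disk_gb(100, 50, 120),
)
status = dstat_args(project="my-project", job_id="my-job-id")
```

Either list can be passed to `subprocess.run` once the placeholders are replaced. Measure the actual V8 bgen sizes before choosing a disk size, since the proportional rule is only a starting point.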