# Format REGENIE input files



To run the EBV DNAemia GWAS in the All of Us (AoU) cohort of individuals with European ancestry (EUR), we followed the demo workspace for running GWAS in AoU: https://workbench.researchallofus.org/workspaces/aou-rw-5981f9dc/aouldlgwasregeniedsubctv6duplicate/analysis. In particular, we adapted the code from the `4.0_regenie_dsub_HP_TM` script to run REGENIE with our binary EBV DNA trait.

This script formats the covariate and phenotype files that are inputs to REGENIE, i.e., the files specified in the dsub command in `03_Run_REGENIE.ipynb`:
```bash
--input pheno_file="{my_bucket}/data/dsub/ebv_EUR_0018.tsv"
--input cov_file="{my_bucket}/data/dsub/ebv_EUR_covar.tsv" 
```
This follows the formats described in `3_pheno_reformatting_HC` of the demo workspace.
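
As a minimal sketch of the layout this notebook produces (a tab-separated file whose header starts with `FID` and `IID`, followed by the trait column), the toy example below writes a three-row table and prints the raw lines. The IDs, values, and the temporary path are made up for illustration; it uses base R only so it runs without extra packages.

```r
# Toy phenotype table; IDs and values are invented for illustration only.
toy_pheno <- data.frame(
  FID = c(1001, 1002, 1003),
  IID = c(1001, 1002, 1003),
  has_ebv = c(0, 1, NA)
)

# Write tab-separated text with no row names and no quotes around
# headers or values, mirroring the write.table calls in this notebook.
toy_path <- tempfile(fileext = ".tsv")
write.table(toy_pheno, file = toy_path, row.names = FALSE, sep = "\t", quote = FALSE)

# Inspect the raw lines as a downstream tool would read them.
toy_lines <- readLines(toy_path)
print(toy_lines)
```

Missing values are written as the literal string `NA`, which is the default behaviour of `write.table`.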

The covariates can be obtained as described in these two articles:
- https://support.researchallofus.org/hc/en-us/articles/4614687617556-How-the-All-of-Us-Genomic-data-are-organized-Archived-C2022Q4R13-CDRv7
- https://support.researchallofus.org/hc/en-us/articles/29475233432212-Controlled-CDR-Directory

Specifically, they come from these Google Cloud Storage (`gs://`) bucket paths:
- Genetic ancestry and PCs: gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/aux/ancestry/ancestry_preds.tsv
- Genomic metrics (sex_at_birth): gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/aux/qc/genomic_metrics.tsv 
- Demographics (?)
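
The principal components used later (`PC1` to `PC15`) have to be extracted from the ancestry file before they can be used as covariates. The sketch below shows one way this might be done, under two loud assumptions that this notebook does not confirm: that the file has a `research_id` column, and that the PCs are stored in a single `pca_features` column as a bracketed, comma-separated string. The toy data frame stands in for the real file, and only three PCs are shown.

```r
# Toy stand-in for ancestry_preds.tsv. The column names and the bracketed
# pca_features format are ASSUMPTIONS, not taken from this notebook.
toy_ancestry <- data.frame(
  research_id = c(1001, 1002),
  ancestry_pred = c("eur", "eur"),
  pca_features = c("[0.11, -0.02, 0.30]", "[0.09, 0.01, 0.28]"),
  stringsAsFactors = FALSE
)

# Strip the brackets, split on commas, trim whitespace, convert to numeric.
split_pcs <- function(x) {
  pieces <- strsplit(gsub("\\[|\\]", "", x), ",")[[1]]
  as.numeric(trimws(pieces))
}

# One row per person, one column per principal component.
pc_matrix <- do.call(rbind, lapply(toy_ancestry$pca_features, split_pcs))
colnames(pc_matrix) <- paste0("PC", seq_len(ncol(pc_matrix)))

# Use `person` as the ID column name to match the rest of this notebook.
toy_pcs <- data.frame(person = toy_ancestry$research_id, pc_matrix)
print(toy_pcs)
```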

In [None]:
library(data.table)
library(dplyr)

In [None]:
setwd("/home/jupyter/workspaces/ebvgwas")

In [None]:
pheno_df <- fread("EBV_GWAS_data/ebv_EUR_0018.csv")

In [None]:
pheno_df <- pheno_df %>%
  dplyr::mutate(
    FID = person,
    IID = FID) %>%
  dplyr::select(FID, IID, has_ebv)

In [None]:
# quote=FALSE is required: otherwise write.table wraps the column headers in
# double quotes and REGENIE errors out.
write.table(pheno_df, file = "EBV_GWAS_data/EUR/ebv_EUR_0018.tsv", row.names=FALSE, sep="\t", quote=FALSE)
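
Before handing a binary phenotype to REGENIE it can help to confirm its coding, since in binary-trait mode REGENIE by default expects 0 for controls, 1 for cases, and `NA` for missing. The helper below is a hypothetical sanity check (`check_binary_pheno` is not part of the demo workspace), shown on a toy table rather than the real data.

```r
# Hypothetical helper: validates a table shaped like pheno_df
# (FID, IID, then a binary trait column).
check_binary_pheno <- function(df, trait) {
  # The first two columns must be the sample identifiers.
  stopifnot(identical(names(df)[1:2], c("FID", "IID")))
  # Each individual should appear at most once.
  stopifnot(!anyDuplicated(df$IID))
  values <- df[[trait]]
  # The trait must be numeric and contain only 0, 1, or NA.
  stopifnot(is.numeric(values))
  stopifnot(all(is.na(values) | values %in% c(0, 1)))
  # Return the case/control/missing counts for a quick look.
  table(values, useNA = "ifany")
}

# Toy data for illustration only.
toy <- data.frame(FID = 1:4, IID = 1:4, has_ebv = c(0, 1, 1, NA))
counts <- check_binary_pheno(toy, "has_ebv")
print(counts)
```

On the real data the call would be `check_binary_pheno(pheno_df, "has_ebv")`; a failed check stops with an error naming the violated condition.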

In [None]:
covar_df <- fread("EBV_data/EUR/all_covariates_EUR.csv")

In [None]:
covar_df <- covar_df[covar_df$person %in% pheno_df$FID,]
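
The subset above keeps only covariate rows for phenotyped individuals, but it does not show whether every phenotyped individual actually has covariates. A minimal sketch of that complementary check, on toy data with made-up IDs and a single `age` covariate, might look like this:

```r
# Toy stand-ins for pheno_df$FID and covar_df; values are invented.
toy_pheno_ids <- c(1001, 1002, 1003, 1004)
toy_covar <- data.frame(
  person = c(1001, 1002, 1004, 1005),
  age = c(40, 55, NA, 61)
)

# Same subsetting step as in the notebook.
toy_covar <- toy_covar[toy_covar$person %in% toy_pheno_ids, ]

# Phenotyped individuals with no covariate row at all.
missing_ids <- setdiff(toy_pheno_ids, toy_covar$person)

# Retained individuals with at least one missing covariate value.
incomplete_ids <- toy_covar$person[!complete.cases(toy_covar)]

print(missing_ids)
print(incomplete_ids)
```

Both vectors being empty would indicate that the phenotype and covariate tables cover the same individuals with complete covariates.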

In [None]:
covar_df <- covar_df %>%
  dplyr::mutate(
    FID = person,
    IID = FID) %>%
  # Selecting the columns explicitly also drops `person`.
  dplyr::select(FID, IID,
                sex_at_birth, age, age2,
                PC1, PC2, PC3, PC4, PC5, PC6, PC7, PC8,
                PC9, PC10, PC11, PC12, PC13, PC14, PC15)

In [None]:
# quote=FALSE is required: otherwise write.table wraps the column headers in
# double quotes and REGENIE errors out.
write.table(covar_df, file = "EBV_GWAS_data/EUR/ebv_EUR_covar_0018.tsv", row.names=FALSE, sep="\t", quote=FALSE)
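
REGENIE treats covariates as numeric unless they are declared categorical with `--catCovarList`, so a quick type check before running it can catch a string-coded column such as `sex_at_birth`. The sketch below runs on toy data; how `sex_at_birth` is actually coded in `all_covariates_EUR.csv` is not shown in this notebook, so the `"F"`/`"M"` values and the recoding are assumptions for illustration.

```r
# Toy covariate table; the "F"/"M" coding is an ASSUMPTION for illustration.
toy_covar <- data.frame(
  FID = 1:3,
  IID = 1:3,
  sex_at_birth = c("F", "M", "F"),
  age = c(40, 55, 61),
  stringsAsFactors = FALSE
)

# List covariate columns (everything except the IDs) that are not numeric.
covariate_cols <- setdiff(names(toy_covar), c("FID", "IID"))
is_num <- vapply(covariate_cols, function(col) is.numeric(toy_covar[[col]]), logical(1))
non_numeric <- covariate_cols[!is_num]
print(non_numeric)

# One possible recoding to 0/1; which level maps to 1 is arbitrary here.
toy_covar$sex_at_birth <- as.integer(toy_covar$sex_at_birth == "M")
```

The alternative to recoding is to leave the column as text and name it in `--catCovarList` when invoking REGENIE.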

Running EBV DNA as a quantitative trait:

In [None]:
pheno_df <- fread("EBV_data/EUR/ebv_30x_df_EUR.csv")

In [None]:
# Keep only individuals whose ebv_q30_30x value exceeds the 0.0018 threshold
pheno_df <- pheno_df[pheno_df$ebv_q30_30x > 0.0018,] %>%
  dplyr::mutate(
    FID = person,
    IID = FID) %>%
  dplyr::select(FID, IID, ebv_q30_30x)

In [None]:
# quote=FALSE is required: otherwise write.table wraps the column headers in
# double quotes and REGENIE errors out.
write.table(pheno_df, file = "EBV_GWAS_data/EUR/ebv_EUR_30x_0018.tsv", row.names=FALSE, sep="\t", quote=FALSE)