# gwas_processing workflow

This notebook processes GWAS SNP data for use in scDRS analysis.

The workflow has three steps:
1. Munging GWAS results into a format that MAGMA can handle
2. Conducting SNP-gene annotation, which maps SNPs in the GWAS data to genes in a gene location file
3. Conducting gene analysis, which computes gene p-values for association with the trait studied in the GWAS

## Load required packages

In [1]:
from pathlib import Path

from src.utils import make_work_dir, move_output, setup
from src.gwas_processing.commands import annotate_variants, gene_analysis
from src.gwas_processing.utils import munge_gwas

## Set up the working environment

The `work_dir` directory will contain any intermediate files that are generated as a part of this process. The `output_dir` directory should contain any final outputs.

In [2]:
output_dir, tmp_dir = setup("src/gwas_processing/output")
work_dir = make_work_dir(tmp_dir)

directory gwas_processing/output already exists
directory gwas_processing/tmp already exists
making directory tmp/d74e4eac3d3cb0d8de6253272b4c93fb


## Munge the GWAS results into a format that MAGMA can annotate and analyze

To be MAGMA-compatible, the file must meet these criteria:

- Tab-delimited
- No p-values less than or equal to 1e-308
- SNP ID, chromosome, and base-pair location as the first three columns

Additionally, variants with missing SNP IDs need to be handled. For this workflow, we fill missing SNP IDs with a value of the form `[CHR]_[BP]_[REF ALLELE]_[EFFECT ALLELE]`.
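The internals of `munge_gwas` are not shown in this notebook. As a rough illustration of the criteria above, here is a minimal standard-library sketch. The helper names `munge_rows` and `to_tsv`, the column names `chromosome`, `base_pair_location`, `other_allele`, and `effect_allele`, and the choice to clamp small p-values rather than drop them are all assumptions made for the example.

```python
import csv
import io
import sys

# Smallest positive normal double (about 2.2e-308), used here as the p-value floor.
P_FLOOR = sys.float_info.min

# Assumed column names; adjust them to match the GWAS file at hand.
CHROM, POS, REF, EFFECT = "chromosome", "base_pair_location", "other_allele", "effect_allele"


def munge_rows(rows, variant_id="variant_id", pval="p_value"):
    """Return copies of `rows` (dicts) that satisfy the criteria listed above."""
    lead = [variant_id, CHROM, POS]
    munged = []
    for row in rows:
        row = dict(row)
        # Fill a missing SNP ID with CHR_BP_REF_EFFECT.
        if not row.get(variant_id):
            row[variant_id] = "_".join(str(row[c]) for c in (CHROM, POS, REF, EFFECT))
        # Clamp any p-value below the floor (this covers everything at or below 1e-308).
        row[pval] = max(float(row[pval]), P_FLOOR)
        # Put SNP ID, chromosome, and position first; keep the other columns in order.
        rest = [c for c in row if c not in lead]
        munged.append({c: row[c] for c in lead + rest})
    return munged


def to_tsv(rows):
    """Serialize munged rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Clamping keeps every variant in the file; dropping the affected rows would be an equally valid choice, depending on how the downstream analysis should treat extremely significant SNPs.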

In [3]:
gwas_path = Path("data/GCST90132222_buildGRCh37.tsv")

gwas = munge_gwas(gwas_path, variant_id="variant_id", pval="p_value")

munged_gwas = f"{work_dir}/munged_gwas.tsv"

# MAGMA expects whitespace-delimited input, so write the munged GWAS as a TSV
gwas.to_csv(munged_gwas, index=False, sep="\t")

## Map SNPs to genes

The mapping is based on genomic location: a SNP is assigned to a gene if the SNP’s location falls inside the region provided for that gene. Typically this region is defined by the transcription start and stop sites of the gene.

Alternatively, an annotation window can be provided as a tuple of integers:
- The first element is the number of kilobases by which the region is extended upstream of the transcription start site
- The second element is the number of kilobases by which the region is extended downstream of the transcription stop site

Together, the upstream/downstream values widen the region in which SNPs are mapped to a particular gene.
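To make the effect of the window concrete, here is a small sketch of the arithmetic. The helpers `widen_region` and `snp_in_gene` are hypothetical and are not part of this repository or of MAGMA. The sketch assumes a forward-strand gene, with the upstream extension applied before the start coordinate and the downstream extension applied after the stop coordinate; strand handling is left out.

```python
def widen_region(start, stop, window_kb=(0, 0)):
    """Extend a gene region by (upstream_kb, downstream_kb) kilobases."""
    upstream_kb, downstream_kb = window_kb
    # Coordinates are 1-based, so the widened start never goes below 1.
    new_start = max(1, start - upstream_kb * 1000)
    new_stop = stop + downstream_kb * 1000
    return new_start, new_stop


def snp_in_gene(bp, start, stop, window_kb=(0, 0)):
    """Return True if a SNP at position `bp` falls inside the (widened) region."""
    lo, hi = widen_region(start, stop, window_kb)
    return lo <= bp <= hi
```

For a gene spanning positions 500,000 to 520,000, a window of `(100, 20)` widens the region to 400,000 through 540,000, so a SNP at position 450,000 maps to the gene only when the window is applied.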

The gene location file used here is available from the [MAGMA website](https://ctg.cncr.nl/software/magma).

In [4]:
annotation_output, annotation_log = annotate_variants(
    gene_loc="data/gene_locations/NCBI37.3.gene.loc",
    snp_loc=munged_gwas,
    output_prefix=f"{work_dir}/annotated_variants",
    annotation_window=(100, 20),
)

## Conduct SNP-wise gene analysis using MAGMA

In [35]:
reference_output = "src/make_reference/output/merge"

gene_analysis_out, gene_analysis_raw, gene_analysis_log, gene_analysis_supplemental_log = gene_analysis(
    bfile=reference_output,
    gene_annot=str(annotation_output),
    gwas=munged_gwas,
    variant_id="variant_id",
    pval="p_value",
    n=276020,
    output_prefix=f"{work_dir}/gene_analysis",
)
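How `gene_analysis` invokes MAGMA is not shown in this notebook. Assuming it wraps the MAGMA command line, the arguments above would plausibly map to a call like the one built below. `build_gene_analysis_cmd` is a hypothetical helper written for illustration. The flags (`--bfile`, `--gene-annot`, `--pval` with its `use=` and `N=` modifiers, `--out`) follow the MAGMA manual, but the actual wrapper may assemble them differently.

```python
def build_gene_analysis_cmd(bfile, gene_annot, gwas, variant_id, pval, n, output_prefix):
    """Build (but do not run) a MAGMA gene analysis command as an argument list."""
    return [
        "magma",
        "--bfile", bfile,            # PLINK reference data prefix
        "--gene-annot", gene_annot,  # annotation file from the SNP-to-gene mapping step
        "--pval", gwas,              # munged GWAS summary statistics
        f"use={variant_id},{pval}",  # which columns hold the SNP ID and p-value
        f"N={n}",                    # GWAS sample size
        "--out", output_prefix,
    ]
```

Returning a list rather than a single string means the command can be passed directly to `subprocess.run` without shell quoting concerns.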

In [None]:
move_output(output_dir, annotation_output, gene_analysis_out)