# Genotype data preprocessing

This document performs genotype data quality control and preprocessing, as outlined in the yellow boxes of the flowchart below.

![Data_proc_flowchart](../../../../_images/data_preprocessing.png)

## Overview

### Analysis steps

1. Genotype data quality control (QC). See here for the [QC default settings](https://cumc.github.io/xqtl-pipeline/pipeline/data_preprocessing/genotype/GWAS_QC.html).
2. Principal component analysis (PCA) based QC, and PC computation for each sub-population available in the genotype data.
3. Genomic relationship matrix (GRM) computation.
4. Genotype data reformatting for downstream fine-mapping analysis.

### Input data requirement

1. Genotype data. See here for [format details](https://cumc.github.io/xqtl-pipeline/pipeline/data_preprocessing/genotype/genotype_formatting.html).
2. [Optional] A sample information file specifying population information, needed if external data such as HapMap or 1000 Genomes are to be integrated into the PCA to visualize and assess population structure in the genotype data. See here for [format details](https://cumc.github.io/xqtl-pipeline/pipeline/data_preprocessing/genotype/genotype_formatting.html).

## Genotype QC

Merge the per-chromosome PLINK files into a single data set:

In [None]:
sos run genotype_formatting.ipynb merge_plink \
    --genoFile data/genotype/chr1.bed data/genotype/chr6.bed \
    --cwd output/genotype \
    --name chr1_chr6 \
    --container container/bioinfo.sif

Identify related individuals and split the data into related and unrelated sets:

In [None]:
sos run GWAS_QC.ipynb king \
    --cwd output/genotype \
    --genoFile output/genotype/chr1_chr6.bed \
    --name 20220110 \
    --kinship 0.05 \
    --container container/bioinfo.sif
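
To make the role of the `--kinship` threshold concrete, here is a minimal Python sketch of a threshold-based split. It assumes that any sample appearing in a pair whose kinship coefficient exceeds the threshold is labelled "related"; the actual `king` workflow may select samples differently (for example, by keeping a maximal unrelated subset). The function name and toy data are hypothetical.

```python
def split_by_kinship(samples, kinship_pairs, threshold=0.05):
    """Split samples into (unrelated, related) lists.

    kinship_pairs: iterable of (id1, id2, kinship) tuples.
    Assumption of this sketch: a sample is 'related' if it appears in
    any pair whose kinship coefficient exceeds the threshold.
    """
    related = set()
    for id1, id2, kinship in kinship_pairs:
        if kinship > threshold:
            related.update((id1, id2))
    unrelated = [s for s in samples if s not in related]
    return unrelated, [s for s in samples if s in related]


# Toy example: S1 and S2 look like first-degree relatives.
samples = ["S1", "S2", "S3", "S4"]
pairs = [("S1", "S2", 0.25), ("S1", "S3", 0.01), ("S3", "S4", 0.002)]
unrelated, related = split_by_kinship(samples, pairs, threshold=0.05)
print(unrelated, related)  # ['S3', 'S4'] ['S1', 'S2']
```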

Variant-level and sample-level QC on unrelated individuals, in preparation for PCA:

In [None]:
sos run GWAS_QC.ipynb qc \
    --cwd output/genotype \
    --genoFile output/genotype/chr1_chr6.20220110.unrelated.bed \
    --maf_filter 0.5 \
    --geno_filter 0.2 \
    --mind_filter 0.1 \
    --hwe_filter 1e-6 \
    --window 50 \
    --shift 10 \
    --r2 0.5 \
    --name for_pca \
    --container container/bioinfo.sif
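
The following Python sketch illustrates what the MAF and missingness filters mean on a toy genotype matrix. It assumes PLINK-style semantics: variants are dropped when their minor allele frequency is below the MAF threshold or their missing rate is above the `geno` threshold, and samples are dropped when their missing rate is above the `mind` threshold. The function names, the toy data, and the MAF cutoff of 0.05 are illustrative only, and both filters are computed on the raw matrix rather than sequentially. HWE filtering and LD pruning are not shown.

```python
def variant_stats(genotypes):
    """Return (maf, missing_rate) for one variant.

    genotypes: alternate-allele counts (0, 1, 2), or None when missing.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return 0.0, missing_rate
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq), missing_rate


def filter_variants(matrix, maf_min, geno_max):
    """Indices of variants (rows) passing the MAF and missingness filters."""
    keep = []
    for i, row in enumerate(matrix):
        maf, miss = variant_stats(row)
        if maf >= maf_min and miss <= geno_max:
            keep.append(i)
    return keep


def filter_samples(matrix, mind_max):
    """Indices of samples (columns) whose missing rate is at most mind_max."""
    n_variants = len(matrix)
    keep = []
    for j in range(len(matrix[0])):
        miss = sum(row[j] is None for row in matrix) / n_variants
        if miss <= mind_max:
            keep.append(j)
    return keep


# Rows are variants, columns are samples.
toy = [
    [0, 1, 2, 1],        # MAF 0.5, no missing calls
    [0, 0, 0, 0],        # monomorphic, MAF 0
    [1, None, None, 0],  # half of the calls missing
    [0, 1, 0, 1],        # MAF 0.25, no missing calls
]
kept_variants = filter_variants(toy, maf_min=0.05, geno_max=0.2)
kept_samples = filter_samples(toy, mind_max=0.1)
print(kept_variants, kept_samples)  # [0, 3] [0, 3]
```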

Extract the previously selected variants from the related individuals in preparation for PCA, applying only the sample-level missingness filter:

In [None]:
sos run GWAS_QC.ipynb qc \
    --cwd output/genotype \
    --genoFile output/genotype/chr1_chr6.20220110.related.bed \
    --keep-variants output/genotype/chr1_chr6.20220110.unrelated.for_pca.filtered.prune.in \
    --maf-filter 0 --geno-filter 0 --mind-filter 0.1 --hwe-filter 0 --r2 0 \
    --name for_pca \
    --container container/bioinfo.sif

**FIXME: the contents below have not yet been reviewed. Please do not run them.**

## Principal component analysis

In [None]:
nohup sos run PCA.ipynb pca \
            --cwd PCA/ \
            --container_lmm "flashpcaR.sif" \
            --name "demo" \
            --unrelated_genotype merge.mergrd.unrelated.bed \
            --related_genotype merge.mergrd.related.bed \
            --phenoFile demo.for_pca.mol_phe.exp \
            --label_col "RACE" \
            --pop_col "RACE"  
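
The command takes separate unrelated and related genotype files. One common reason for such a split is to estimate PCs on unrelated individuals only and then project the related individuals onto them; whether `PCA.ipynb` does exactly this is an assumption here. The standard-library Python sketch below shows that idea for the first PC, using power iteration on a toy matrix. All names and data are hypothetical.

```python
import math


def fit_first_pc(unrelated, n_iter=100):
    """Estimate per-variant means and the first PC loading vector.

    unrelated: list of per-sample genotype vectors (one value per variant).
    Uses power iteration on X^T X, where X is the mean-centered matrix.
    """
    n_var = len(unrelated[0])
    means = [sum(s[i] for s in unrelated) / len(unrelated) for i in range(n_var)]
    centered = [[x - mu for x, mu in zip(s, means)] for s in unrelated]
    v = [float(i + 1) for i in range(n_var)]  # deterministic starting vector
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(row[i] * sc for row, sc in zip(centered, scores))
             for i in range(n_var)]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0:
            raise ValueError("no variance along the starting direction")
        v = [w / norm for w in v]
    return means, v


def project(sample, means, loading):
    """PC score of a sample, using means and loadings fit on unrelated samples."""
    return sum((x - mu) * w for x, mu, w in zip(sample, means, loading))


# Toy data: the first two variants vary together, the third is constant.
unrelated_toy = [[0, 0, 1], [2, 2, 1], [1, 1, 1], [1, 1, 1]]
means, loading = fit_first_pc(unrelated_toy)
related_score = project([2, 2, 1], means, loading)
print(loading, related_score)
```

Fitting on unrelated samples only keeps family structure from dominating the leading PCs, while projection still gives every related sample a score on the same axes.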

## Genomic relationship matrix

Input:

1. A list of PLINK file trios (bed/bim/fam), one per chromosome, output from Genotype QC.

Output:

1. A collection of ld.rds files suitable for mvsusie_rss and susie_rss.
2. One row of the LD recipe file for this particular theme, so that susie_rss can find the correct LD matrix.

In [None]:
nohup sos run GRM.ipynb GRM \
    --genotype_list demo.processed_genotype.plink_per_chrom.recipe \
    --wd GRM/ \
    --name "demo" \
    --container "base-bioinfo.sif" 
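
For readers unfamiliar with the GRM, the Python sketch below computes one commonly used definition: each variant is standardized by its allele frequency, and the GRM is the average over variants of the outer product of the standardized genotypes. This section does not state which estimator the `GRM` workflow uses, so treat this as an illustration of the concept, not of the workflow. The function name and toy data are hypothetical, and missing genotypes are not handled.

```python
def grm(matrix):
    """Genomic relationship matrix from a variants-by-samples 0/1/2 matrix.

    G[j][k] = (1/m) * sum_i (x_ij - 2 p_i)(x_ik - 2 p_i) / (2 p_i (1 - p_i)),
    where p_i is the allele frequency of variant i and m is the number of
    polymorphic variants. Monomorphic variants are skipped.
    """
    n = len(matrix[0])
    total = [[0.0] * n for _ in range(n)]
    m = 0
    for row in matrix:
        p = sum(row) / (2 * n)
        var = 2 * p * (1 - p)
        if var == 0:
            continue  # monomorphic variant carries no information
        m += 1
        z = [(x - 2 * p) / var ** 0.5 for x in row]
        for j in range(n):
            for k in range(n):
                total[j][k] += z[j] * z[k]
    if m == 0:
        raise ValueError("no polymorphic variants")
    return [[value / m for value in row] for row in total]


# Rows are variants, columns are samples.
toy_genotypes = [
    [0, 1, 2, 1],
    [2, 1, 0, 1],
    [0, 0, 1, 1],
]
G = grm(toy_genotypes)
for row in G:
    print([round(value, 3) for value in row])
```

Because each standardized variant sums to zero across samples, every row of the resulting matrix also sums to zero, which is a quick sanity check on the output.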

## Reformatting and partitioning genotype data by region

In [None]:

nohup sos run plink2vcf \
    --genoFile ac.mergrd.ac.filtered.prune.bed \
    --wd genotype_reformmating/ \
    --name "ac" \
    --region_list geneTpmResidualsAgeGenderAdj_rename_region_list.txt  \
    --container "base-bioinfo.sif" 

nohup sos run genotype_formatting.ipynb plink_by_gene \
    --genoFile ac.mergrd.ac.filtered.prune.bed \
    --wd genotype_reformmating/ \
    --name "dlpfc" \
    --region_list mwe_region  \
    --container "base-bioinfo.sif" 

nohup sos run genotype_formatting.ipynb plink_by_chrom \
    --genoFile ac.mergrd.ac.filtered.prune.bed \
    --wd genotype_reformmating/ \
    --name "pcc" \
    --region_list mwe_region  \
    --container "base-bioinfo.sif" 
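
The `plink_by_gene` step takes a region list and partitions the genotype data accordingly. The Python sketch below shows the underlying idea on toy data: each variant is assigned to every region on the same chromosome whose interval contains its position. The tuple layouts are hypothetical and are not the pipeline's region list file format.

```python
def partition_by_region(variants, regions):
    """Group variant IDs by region.

    variants: list of (variant_id, chrom, pos) tuples.
    regions:  list of (region_id, chrom, start, end) tuples, with
              inclusive coordinates (an assumption of this sketch).
    Returns a dict mapping region_id to a list of variant IDs.
    """
    result = {region_id: [] for region_id, _, _, _ in regions}
    for variant_id, v_chrom, pos in variants:
        for region_id, r_chrom, start, end in regions:
            if v_chrom == r_chrom and start <= pos <= end:
                result[region_id].append(variant_id)
    return result


variants = [("rs1", "chr1", 100), ("rs2", "chr1", 250), ("rs3", "chr6", 120)]
regions = [
    ("geneA", "chr1", 50, 200),
    ("geneB", "chr1", 150, 300),
    ("geneC", "chr6", 500, 900),
]
by_region = partition_by_region(variants, regions)
print(by_region)  # {'geneA': ['rs1'], 'geneB': ['rs2'], 'geneC': []}
```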