# Genotype Preprocessing

This notebook outlines the workflow for processing genotype files, from VCF input to chromosome-specific PLINK files.

#### Miniprotocol Timing
This is the total time for all steps of the miniprotocol. Timings for individual modules are listed on their respective pages and are included in this overall estimate.

Timing <30 minutes

## Overview

This miniprotocol shows how to use several modules to run genotype data quality control (QC), Principal Components Analysis (PCA) based QC, and principal component (PC) computation for each sub-population available in the genotype data. The modules are as follows:
1. `VCF_QC.ipynb` & `genotype_formatting.ipynb` (steps i-ii): Genotype VCF file QC
2. `GWAS_QC.ipynb` & `genotype_formatting.ipynb` (steps iii-iv): Genotype PLINK file QC
3. `GWAS_QC.ipynb` & `PCA.ipynb` (steps v-viii): Kinship and PCA


## Steps


### i. Perform genotype VCF file quality control (placeholder)

In [None]:
sos run pipeline/VCF_QC.ipynb qc \
    --genoFile vcf_qc/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt \
    --dbsnp-variants /mnt/vast/hpc/csg/snuc_pseudo_bulk/data/reference_data/00-All.add_chr.variants.gz \
    --reference-genome /mnt/vast/hpc/csg/snuc_pseudo_bulk/data/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --cwd vcf_qc/ --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
    -J 22 -q csg -c csg.yml --mem 120G

### ii. Merge separated bed files (placeholder)

In [None]:
sos run pipeline/genotype_formatting.ipynb vcf_to_plink \
    --genoFile `ls vcf_qc/*.leftnorm.bcftools_qc.vcf.gz` \
    --cwd Genotype/ \
    --keep_samples ./ROSMAP_sample_list.txt \
    --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
    -J 22 -q csg -c csg.yml --mem 120G

sos run xqtl-pipeline/pipeline/genotype_formatting.ipynb merge_plink \
    --genoFile `ls *.leftnorm.bcftools_qc.bed` \
    --name ROSMAP_NIA_WGS.leftnorm.bcftools_qc  \
    --cwd Genotype/ \
    --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
    -J 5 -q csg -c csg.yml --mem 300G
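
A PLINK binary fileset consists of a `.bed` file plus `.bim` and `.fam` files with the same prefix, so it can help to confirm that every `.bed` passed to `merge_plink` has both companions before a long merge job is submitted. The following is a minimal sketch of such a pre-flight check. The helper `check_plink_filesets` is hypothetical and is not part of the pipeline.

```shell
# Hypothetical helper (not part of the pipeline): report any .bed file in a
# directory that lacks its .bim or .fam companion.
# Usage: check_plink_filesets DIR
check_plink_filesets() {
    dir=$1
    missing=0
    for bed in "$dir"/*.bed; do
        [ -e "$bed" ] || continue          # the glob matched nothing
        prefix=${bed%.bed}                 # strip the .bed extension
        for ext in bim fam; do
            if [ ! -e "$prefix.$ext" ]; then
                echo "missing: $prefix.$ext"
                missing=$((missing + 1))
            fi
        done
    done
    # Exit status 0 only when every fileset is complete.
    [ "$missing" -eq 0 ]
}
```

For example, `check_plink_filesets Genotype && echo "all filesets complete"` prints the confirmation only when no companion file is missing.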

### iii. Genotype PLINK file quality control (placeholder)

In [None]:
sos run xqtl-pipeline/pipeline/GWAS_QC.ipynb qc_no_prune \
   --cwd Genotype \
   --genoFile Genotype/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.bed \
   --geno-filter 0.1 \
   --mind-filter 0.1 \
   --hwe-filter 1e-08   \
   --mac-filter 0 \
   --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
   -J 1 -q csg -c csg.yml --mem 150G

### iv. Genotype data partition by chromosome

In [None]:
sos run pipeline/genotype_formatting.ipynb genotype_by_chrom \
    --genoFile input/protocol_example.genotype.chr21_22.bed \
    --cwd output \
    --chrom `cut -f 1 input/protocol_example.genotype.chr21_22.bim | uniq | sed "s/chr//g"` \
    --container containers/bioinfo.sif 
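
The `--chrom` argument above is built by command substitution from the first column of the `.bim` file. The sketch below runs the same `cut | uniq | sed` chain on a toy `.bim` file to show what it emits. The file contents are hypothetical.

```shell
# Toy .bim with hypothetical variants: tab-separated, chromosome in column 1.
tmpdir=$(mktemp -d)
printf 'chr21\trs1\t0\t100\tA\tG\n'  > "$tmpdir/toy.bim"
printf 'chr21\trs2\t0\t200\tC\tT\n' >> "$tmpdir/toy.bim"
printf 'chr22\trs3\t0\t300\tG\tA\n' >> "$tmpdir/toy.bim"

# Same chain as in the --chrom argument: take column 1, collapse adjacent
# duplicates, and strip the "chr" prefix.
chroms=$(cut -f 1 "$tmpdir/toy.bim" | uniq | sed "s/chr//g")

# Left unquoted on purpose so the newline-separated list prints on one line.
echo $chroms   # prints: 21 22
rm -r "$tmpdir"
```

Note that `uniq` only collapses adjacent duplicate lines, so this relies on the `.bim` file being grouped by chromosome. If that is not guaranteed, `sort -u` in place of `uniq` would be a safer choice.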

### v. Sample match with genotype

In [None]:
sos run pipeline/GWAS_QC.ipynb genotype_phenotype_sample_overlap \
        --cwd output/sample_meta \
        --genoFile input/protocol_example.genotype.chr21_22.fam  \
        --phenoFile input/protocol_example.protein.csv \
        --container singularity/bioinfo.sif \
        --mem 5G
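
Conceptually, this step finds the samples present in both the genotype and the phenotype data. The sketch below illustrates that idea with standard Unix tools on toy files. It is not the module's implementation, and the one-ID-per-line phenotype sample list is an assumption made for illustration. The only format fact relied on is that the second column of a `.fam` file holds the within-family sample ID.

```shell
tmpdir=$(mktemp -d)
# Toy .fam with hypothetical samples: FID IID father mother sex phenotype
printf 'F1 S1 0 0 1 -9\nF2 S2 0 0 2 -9\nF3 S3 0 0 1 -9\n' > "$tmpdir/toy.fam"
# Hypothetical phenotype sample IDs, one per line (assumed layout)
printf 'S2\nS3\nS4\n' > "$tmpdir/pheno_ids.txt"

# comm requires sorted input, so sort both ID lists first.
awk '{print $2}' "$tmpdir/toy.fam" | sort > "$tmpdir/geno_ids.txt"
sort "$tmpdir/pheno_ids.txt" > "$tmpdir/pheno_sorted.txt"

# -12 suppresses lines unique to either file, leaving the shared IDs.
overlap=$(comm -12 "$tmpdir/geno_ids.txt" "$tmpdir/pheno_sorted.txt")
echo $overlap   # prints: S2 S3
rm -r "$tmpdir"
```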

### vi. Kinship Quality Control

In [None]:
sos run pipeline/GWAS_QC.ipynb king \
    --cwd output/kinship \
    --genoFile input/protocol_example.genotype.chr21_22.bed \
    --name pQTL \
    --keep-samples output/sample_meta/protocol_example.protein.sample_genotypes.txt \
    --container singularity/bioinfo.sif \
    --no-maximize-unrelated \
    --mem 40G

### vii. Prepare unrelated individuals data for PCA

In [None]:
sos run pipeline/GWAS_QC.ipynb qc \
   --cwd output/cache \
   --genoFile output/kinship/protocol_example.genotype.chr21_22.pQTL.unrelated.bed \
   --mac-filter 5 \
   --container singularity/bioinfo.sif \
   --mem 16G

sos run pipeline/GWAS_QC.ipynb qc \
   --cwd output/cache \
   --genoFile input/protocol_example.genotype.chr21_22.bed \
   --keep-samples output/sample_meta/protocol_example.protein.sample_genotypes.txt \
   --name pQTL \
   --mac-filter 5 \
   --container singularity/bioinfo.sif \
   --mem 40G

### viii. Run Principal Components Analysis on genotype

In [None]:
sos run pipeline/PCA.ipynb flashpca \
   --cwd output/genotype_pca \
   --genoFile output/cache/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.bed \
   --container singularity/flashpcaR.sif \
   --mem 16G

## Anticipated Results