# Genotype data preprocessing

This notebook outlines the workflow for processing genotype files, transitioning from VCF format to chromosome-specific PLINK files.

**Note**: to reuse this workflow for your own data, you may need to change the file paths in some of the steps.

## Methods overview

This workflow is an application of the genotype-related workflows from the xQTL project pipeline.

## Data Input 
- `Joint VCF files`: /mnt/vast/hpc/bvardarajan_lab/data/Family_WGS/vcfs/vcf_b38_with_rosmap_2022/joint_vcf
- `00-All.add_chr.variants.gz` 
- `GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta` reference files created via Reference_data_notebook

## Data Output

* QCed genotype:
  - `ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.bed`
  - `ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.bim`
  - `ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.fam`

## Steps in detail

### QC for VCF files
This step runs QC on the VCF files. `qc_1` and `qc_2` process about 14G of files per hour; `qc_3` summarizes the quality metrics for the VCF files.

In [None]:
# We only do this for autosomal variants (chr1 to chr22)
# Write one VCF path per line; create the output folder first if it does not exist
mkdir -p vcf_qc
for chrom in $(seq 1 22); do
    echo "./ZOD14598_AD_GRM_WGS_2021-04-29_chr${chrom}.recalibrated_variants.vcf.gz"
done > vcf_qc/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt

sos run pipeline/VCF_QC.ipynb qc \
    --genoFile vcf_qc/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt \
    --dbsnp-variants /mnt/vast/hpc/csg/snuc_pseudo_bulk/data/reference_data/00-All.add_chr.variants.gz \
    --reference-genome /mnt/vast/hpc/csg/snuc_pseudo_bulk/data/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --cwd vcf_qc/ --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
    -J 22 -q csg -c csg.yml --mem 120G
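Before submitting the QC job, it can help to confirm that the list file is well formed. The following is a minimal sketch and not part of the xQTL pipeline; the function name `check_vcf_list` is hypothetical. It counts the listed paths and flags empty lines or paths that do not exist.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): sanity-check a VCF list file.
# Prints the number of listed paths and the number of problems found, and
# returns non-zero if any line is empty or points to a missing file.
check_vcf_list() {
    local list="$1" n=0 bad=0 path
    while IFS= read -r path || [ -n "$path" ]; do
        if [ -z "$path" ]; then
            echo "empty line in $list" >&2
            bad=$((bad + 1))
            continue
        fi
        n=$((n + 1))
        if [ ! -f "$path" ]; then
            echo "missing file: $path" >&2
            bad=$((bad + 1))
        fi
    done < "$list"
    echo "$n paths listed, $bad problems"
    [ "$bad" -eq 0 ]
}

# Usage (22 autosomal VCFs are expected for this dataset):
# check_vcf_list vcf_qc/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt
```

Paths in the list are resolved relative to the directory the check is run from, so it should be run from the same directory as the `sos run` command.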

### Merge separated bed files into one

First, convert each VCF file to PLINK format, keeping only ROSMAP samples.

`ROSMAP_sample_list.txt` lists all ROSMAP samples needed for the analysis, in the format FID, IID. This file has been uploaded to ftp: `/ftp_fgc_xqtl/projects/WGS/ROSMAP`
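As a quick format check on the sample list, the sketch below verifies that every non-empty line has exactly two whitespace-separated fields (FID and IID). The function name `check_keep_file` is hypothetical, and the assumption that the file is whitespace-delimited with one sample per line is ours, not stated by the pipeline.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report lines of a keep-samples file that do not have
# exactly two fields (FID, IID). Returns non-zero if any such line is found.
check_keep_file() {
    awk 'BEGIN { bad = 0 }
         NF > 0 && NF != 2 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
         END { exit bad }' "$1"
}

# Usage:
# check_keep_file ROSMAP_sample_list.txt && echo "format looks OK"
```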

In [None]:
sos run pipeline/genotype_formatting.ipynb vcf_to_plink \
    --genoFile `ls vcf_qc/*.leftnorm.bcftools_qc.vcf.gz` \
    --cwd Genotype/ \
    --keep_samples ./ROSMAP_sample_list.txt \
    --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
    -J 22 -q csg -c csg.yml --mem 120G

This step merges all the files and may require about 300G of memory, because some variant IDs have 80+ characters. Only `plink` can perform the merge; `plink2` does not support it.

In [None]:
sos run xqtl-pipeline/pipeline/genotype_formatting.ipynb merge_plink \
    --genoFile `ls *.leftnorm.bcftools_qc.bed` \
    --name ROSMAP_NIA_WGS.leftnorm.bcftools_qc  \
    --cwd Genotype/ \
    --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
    -J 5 -q csg -c csg.yml --mem 300G
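To gauge how many long variant IDs are involved in the memory requirement mentioned above, the variant ID column (column 2) of each `.bim` file can be inspected. This is a minimal sketch; the function name `count_long_ids` is hypothetical and the default threshold of 80 characters simply mirrors the figure quoted above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count variant IDs (column 2 of a .bim file) whose
# length is at least a given number of characters (default 80).
count_long_ids() {
    local bim="$1" minlen="${2:-80}"
    awk -v m="$minlen" 'length($2) >= m { n++ } END { print n + 0 }' "$bim"
}

# Usage: report the count for each per-chromosome .bim file
# for f in Genotype/*.leftnorm.bcftools_qc.bim; do
#     echo "$f $(count_long_ids "$f")"
# done
```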

### QC for PLINK files

Using PLINK-based workflows we:

* Filter out variants and samples with more than 10% missing genotypes
* Set the HWE p-value cutoff to 1E-8
* Apply no minor allele filter

In [None]:
sos run xqtl-pipeline/pipeline/GWAS_QC.ipynb qc_no_prune \
   --cwd Genotype \
   --genoFile Genotype/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.bed \
   --geno-filter 0.1 \
   --mind-filter 0.1 \
   --hwe-filter 1e-08   \
   --mac-filter 0 \
   --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
   -J 1 -q csg -c csg.yml --mem 150G

The genotype files produced by this step have been uploaded to ftp: `/ftp_fgc_xqtl/projects/WGS/ROSMAP`
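To see how many variants and samples the PLINK QC filters removed, the line counts of the `.bim` and `.fam` files before and after QC can be compared. The sketch below is illustrative only; the function name `qc_retention` is hypothetical, and it takes PLINK file prefixes (paths without the extension).

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare variant (.bim) and sample (.fam) counts
# between two PLINK filesets given by their prefixes.
qc_retention() {
    local before="$1" after="$2"
    local v0 v1 s0 s1
    v0=$(wc -l < "$before.bim"); v1=$(wc -l < "$after.bim")
    s0=$(wc -l < "$before.fam"); s1=$(wc -l < "$after.fam")
    # Arithmetic expansion strips any padding that wc adds on some systems
    echo "variants: $((v0)) -> $((v1))"
    echo "samples: $((s0)) -> $((s1))"
}

# Usage:
# qc_retention Genotype/ROSMAP_NIA_WGS.leftnorm.bcftools_qc \
#              Genotype/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc
```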

### Genotype data partition by chromosome

This step is necessary for TensorQTL applications.

In [5]:
sos run pipeline/genotype_formatting.ipynb genotype_by_chrom \
    --genoFile protocol_example/protocol_example.genotype.chr21_22.bed \
    --cwd output \
    --chrom `cut -f 1 protocol_example/protocol_example.genotype.chr21_22.bim | uniq | sed "s/chr//g"` \
    --container containers/bioinfo.sif 

INFO: Running genotype_by_chrom_1: 
INFO: genotype_by_chrom_1 (index=0) is completed.
INFO: genotype_by_chrom_1 (index=1) is completed.
INFO: genotype_by_chrom_1 output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/genotype_by_chrom/protocol_example.genotype.chr21_22.21.bed /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/genotype_by_chrom/protocol_example.genotype.chr21_22.22.bed in 2 groups
INFO: Running genotype_by_chrom_2: 
INFO: genotype_by_chrom_2 is completed (pending nested workflow).
INFO: Running write_data_list: 
INFO: write_data_list is completed.
INFO: write_data_list output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/genotype_by_chrom/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt
INFO: genotype_by_chrom_2 ...
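The `--chrom` argument in the command above is built by a small shell pipeline that reads the chromosome column of the `.bim` file. The toy example below (the `.bim` content is made up for illustration) shows what that pipeline yields. Note that `uniq` only collapses adjacent duplicates, so this relies on the `.bim` file being grouped by chromosome.

```bash
#!/usr/bin/env bash
# Toy .bim file (tab-separated: chromosome, variant ID, cM, position, allele 1, allele 2)
tmpdir=$(mktemp -d)
printf 'chr21\tchr21:100:A:G\t0\t100\tA\tG\n'  > "$tmpdir/toy.bim"
printf 'chr21\tchr21:200:C:T\t0\t200\tC\tT\n' >> "$tmpdir/toy.bim"
printf 'chr22\tchr22:150:G:A\t0\t150\tG\tA\n' >> "$tmpdir/toy.bim"

# Same pipeline as in the command above: take column 1, collapse adjacent
# duplicates, and drop the "chr" prefix.
cut -f 1 "$tmpdir/toy.bim" | uniq | sed "s/chr//g"
# prints:
# 21
# 22
```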

## PCA on genotypes of selected samples

This workflow is an application of `PCA.ipynb` from the xQTL project pipeline.

### Data Input

- `protocol_example.genotype.chr21_22.bed`
- `protocol_example.genotype.chr21_22.bim`
- `protocol_example.genotype.chr21_22.fam`

### Data Output

- `ROSMAP_NIA_WGS.leftnorm.filtered.pQTL.unrelated.filtered.prune.pca.rds`

### Kinship QC only on proteomics samples

To estimate the genotype PCs accurately, we split participants into related and unrelated groups based on their kinship coefficients, estimated by KING.

#### Sample match with genotype 

In [1]:
sos run pipeline/GWAS_QC.ipynb genotype_phenotype_sample_overlap \
        --cwd output/sample_meta \
        --genoFile protocol_example/protocol_example.genotype.chr21_22.fam  \
        --phenoFile protocol_example/protocol_example.protein.csv \
        --container containers/bioinfo.sif \
        --mem 5G

INFO: Running genotype_phenotype_sample_overlap: This workflow extracts overlapping samples for genotype data with phenotype data, and output the filtered sample genotype list as well as sample phenotype list
INFO: genotype_phenotype_sample_overlap is completed.
INFO: genotype_phenotype_sample_overlap output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/samples/protocol_example.protein.sample_overlap.txt /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/samples/protocol_example.protein.sample_genotypes.txt
INFO: Workflow genotype_phenotype_sample_overlap (ID=wb6cc0b72d21c80e5) is executed successfully with 1 completed step.


#### Kinship

In [2]:
sos run pipeline/GWAS_QC.ipynb king \
    --cwd output/kinship \
    --genoFile protocol_example/protocol_example.genotype.chr21_22.bed \
    --name pQTL \
    --keep-samples output/sample_meta/protocol_example.protein.sample_genotypes.txt \
    --container containers/bioinfo.sif \
    --no-maximize-unrelated \
    --mem 40G

INFO: Running king_1: Inference of relationships in the sample to identify closely related individuals
INFO: king_1 is completed.
INFO: king_1 output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/kinship/protocol_example.genotype.chr21_22.pQTL.kin0
INFO: Running king_2: Select a list of unrelated individual with an attempt to maximize the unrelated individuals selected from the data
INFO: king_2 is completed.
INFO: king_2 output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/kinship/protocol_example.genotype.chr21_22.pQTL.related_id
INFO: Running king_3: Split genotype data into related and unrelated samples, if related individuals are detected
INFO: king_3 is completed.
INFO: king_3 output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/kinship
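The `.kin0` table written by `king_1` can be inspected directly to see which pairs drive the related/unrelated split. The sketch below is an illustration under stated assumptions: it assumes the table has a header row containing a column named `KINSHIP`, the function name `related_pairs` is hypothetical, and the default cutoff of 0.0625 is only an example value, not necessarily the threshold the pipeline uses.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print rows of a .kin0 table whose kinship coefficient
# exceeds a cutoff. The KINSHIP column is located by name from the header row
# (assumption: such a column exists). Default cutoff is illustrative only.
related_pairs() {
    local kin0="$1" cutoff="${2:-0.0625}"
    awk -v c="$cutoff" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "KINSHIP") k = i
            if (!k) { print "no KINSHIP column found" > "/dev/stderr"; exit 1 }
            next
        }
        $k + 0 > c + 0 { print }' "$kin0"
}

# Usage:
# related_pairs output/kinship/protocol_example.genotype.chr21_22.pQTL.kin0
```

An empty result is consistent with no pairs exceeding the chosen cutoff.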

1. Variant-level and sample-level QC on unrelated individuals, filtering on missingness > 10%, followed by LD pruning in preparation for PCA.
2. There are no related samples among these ROSMAP samples, so an additional step keeps only the samples in `rosmap_pheno.sample_genotypes.txt` for PCA.

**Be aware:**    

**If `king_2` reports `No related individuals detected from *.kin0`, no related individuals were detected among the samples in `--keep-samples`. In this case, this step produces no output for unrelated individuals.**

#### Prepare unrelated individuals data for PCA

Here we write data to the `cache` folder instead of `output` because this genotype data can be removed after PCA. We also filter out variants with minor allele count < 5.

**If `*.unrelated.bed` was generated for your data, related individuals were detected. In that case, use the output for unrelated individuals from the KING step.**

In [3]:
sos run pipeline/GWAS_QC.ipynb qc \
   --cwd output/cache \
   --genoFile output/kinship/protocol_example.genotype.chr21_22.pQTL.unrelated.bed \
   --mac-filter 5 \
   --container containers/bioinfo.sif \
   --mem 16G

INFO: Running basic QC filters: Filter SNPs and select individuals
INFO: basic QC filters is completed.
INFO: basic QC filters output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/cache/protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.bed
INFO: Running LD pruning: LD prunning and remove related individuals (both ind of a pair) Plink2 has multi-threaded calculation for LD prunning
INFO: LD pruning is completed.
INFO: LD pruning output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/cache/protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.bed /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/cache/protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.in
INFO: Workflow qc (ID=w3a34828bd2888342) is executed successfully with 2 completed steps.


**In other cases, e.g. the ROSMAP proteomics data, the message `No related individuals detected from *.kin0` occurs and no separate genotype data is generated for unrelated individuals. In this case, we work from the original genotype data and must pass `--keep-samples` to `qc` to extract the samples for PCA.** For example:

In [4]:
sos run pipeline/GWAS_QC.ipynb qc \
   --cwd output/cache \
   --genoFile protocol_example/protocol_example.genotype.chr21_22.bed \
   --keep-samples output/sample_meta/protocol_example.protein.sample_genotypes.txt \
   --name pQTL \
   --mac-filter 5 \
   --container containers/bioinfo.sif \
   --mem 40G

INFO: Running basic QC filters: Filter SNPs and select individuals
INFO: basic QC filters is completed.
INFO: basic QC filters output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/cache/protocol_example.genotype.chr21_22.pQTL.plink_qc.bed
INFO: Running LD pruning: LD prunning and remove related individuals (both ind of a pair) Plink2 has multi-threaded calculation for LD prunning
INFO: LD pruning is completed.
INFO: LD pruning output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/cache/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.bed /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/cache/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.in
INFO: Workflow qc (ID=wd519554233f99db1) is executed successfully with 2 completed steps.
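The choice between the two `qc` invocations in this section depends on whether the KING step wrote a `*.unrelated.bed` file. A minimal sketch of that decision follows; the function name `choose_pca_input` is hypothetical and the helper is not part of the pipeline.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide which genotype file the PCA-preparation QC
# should start from.
#   $1 = prefix of the KING output (path without ".unrelated.bed")
#   $2 = original genotype .bed file
choose_pca_input() {
    local king_prefix="$1" original_bed="$2"
    if [ -f "$king_prefix.unrelated.bed" ]; then
        # Related individuals were detected: use the unrelated subset from KING
        echo "$king_prefix.unrelated.bed"
    else
        # No related individuals: start from the original data and remember
        # to pass --keep-samples to the qc workflow
        echo "$original_bed"
    fi
}

# Usage:
# choose_pca_input output/kinship/protocol_example.genotype.chr21_22.pQTL \
#                  protocol_example/protocol_example.genotype.chr21_22.bed
```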


#### PCA on genotype
Note the outlier(s) in the PC1 vs. PC2 plot.

In [5]:
sos run pipeline/PCA.ipynb flashpca \
   --cwd output/genotype_pca \
   --genoFile output/cache/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.bed \
   --container containers/flashpcaR.sif \
   --mem 16G

INFO: Running flashpca_1: Run PCA analysis using flashpca
INFO: flashpca_1 is completed.
INFO: flashpca_1 output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.rds
INFO: Running flashpca_2: 
INFO: flashpca_2 is completed (pending nested workflow).
INFO: Running detect_outliers: Calculate Mahalanobis distance per population and report outliers
INFO: detect_outliers is completed.
INFO: detect_outliers output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.mahalanobis.rds /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.outliers... (5 items)
INFO: fla...

The PCA plot is in the figure folder. FIXME: please show the preview in this notebook as well.