# QTL association analysis

This notebook records the commands used to perform QTL association analysis.

## Data input
* `output/genotype_by_chrom/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt`: Generated from [genotype_preprocessing](https://github.com/cumc/xqtl-pipeline/tree/main/code/data_preprocessing/genotype_preprocessing.ipynb)
* `output/phenotype_by_chrom/protocol_example.protein.bed.phenotype_by_chrom_files.txt`: Generated from [phenotype_preprocessing](https://github.com/cumc/xqtl-pipeline/tree/main/code/data_preprocessing/phenotype_preprocessing.ipynb)
* `output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.pca.Marchenko_PC.gz`: Generated from [covariates_preprocessing](https://github.com/cumc/xqtl-pipeline/tree/main/code/data_preprocessing/covariate_processing.ipynb)
* `prototype_example/protocol_example/protocol_example.protein.enhanced_cis_chr21_chr22.bed`: A list of customized cis windows for the protein data, derived from the TADB list [`TADB_enhanced_cis.bed`](https://github.com/cumc/fungen-xqtl-analysis/blob/main/resource/TADB_enhanced_cis.bed). The code that generates it is in [create_protocol_example_data](https://github.com/cumc/fungen-xqtl-analysis/blob/main/analysis/Wang_Columbia/ROSMAP/MWE/create_protocol_example_data.ipynb). Note that every `molecular_trait_id` in the phenotype data is expected to have a corresponding customized cis window.
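
The requirement that each `molecular_trait_id` has its own cis window can be checked before running the association. The following is a minimal sketch on toy files, not part of the pipeline: it assumes the trait ID sits in column 4 of both the phenotype BED file and the cis window file, and that the phenotype file has a single header line. All file names in it are hypothetical stand-ins.

```shell
set -e

# Toy stand-ins for the real inputs; ID in column 4 is an assumption.
workdir=$(mktemp -d)
printf '#chr\tstart\tend\tID\tS1\nchr21\t100\t101\tENSG_A\t0.1\nchr22\t200\t201\tENSG_B\t0.2\nchr22\t300\t301\tENSG_C\t0.3\n' \
    | gzip > "$workdir/pheno.bed.gz"
printf 'chr21\t50\t500\tENSG_A\nchr22\t150\t700\tENSG_B\n' > "$workdir/cis_windows.bed"

# Trait IDs in the phenotype file (header line skipped), sorted and unique
gzip -dc "$workdir/pheno.bed.gz" | awk -F'\t' 'NR > 1 {print $4}' | sort -u > "$workdir/pheno_ids.txt"
# Trait IDs that have a customized cis window
awk -F'\t' '{print $4}' "$workdir/cis_windows.bed" | sort -u > "$workdir/window_ids.txt"

# comm -23 keeps lines that appear only in the first file: traits without a window
missing=$(comm -23 "$workdir/pheno_ids.txt" "$workdir/window_ids.txt")
echo "Traits without a cis window: ${missing:-none}"
```

Any ID reported here would need either a window added to the list or removal from the phenotype input.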

## Data output
- Empirical cis results: `/mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap`
- Standardized cis results: `/mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad/pQTL.#`

### Cis TensorQTL command 

In [None]:
sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file output/genotype_by_chrom/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt \
    --phenotype-file  output/phenotype_by_chrom/protocol_example.protein.bed.phenotype_by_chrom_files.txt \
    --covariate-file output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.pca.Marchenko_PC.gz \
    --customized_cis_windows prototype_example/protocol_example/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --cwd output/cis_association/ \
    --container containers/TensorQTL.sif --MAC 5 

### Trans TensorQTL command 
Some proteins are not in the customized cis windows list, so they must be removed from the analysis by creating a region list. Note that the region list needs to be an actual file; a process substitution such as `<()` is not accepted.

In [55]:
zcat output/protocol_example.protein.bed.gz | cut -f 1,2,3,4 | grep -v -e ENSG00000163554 \
    -e ENSG00000171564 -e ENSG00000171560 -e ENSG00000171557 > output/protocol_example.protein.region_list
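
The filter above hard-codes the IDs to drop. As a hedged alternative, the IDs could be kept in a plain text file and passed to `grep` with `-f`, followed by a check that the result is a regular, non-empty file. This sketch runs on toy data; `exclude_ids.txt` and the other file names are hypothetical, and the one-ID-per-line layout is an assumption.

```shell
set -e

workdir=$(mktemp -d)
# Toy phenotype file with a header line and three traits
printf '#chr\tstart\tend\tID\tS1\tS2\nchr21\t100\t101\tENSG_A\t0.1\t0.4\nchr22\t200\t201\tENSG_B\t0.2\t0.5\nchr22\t300\t301\tENSG_C\t0.3\t0.6\n' \
    | gzip > "$workdir/pheno.bed.gz"
# Hypothetical exclusion list: one trait ID per line
printf 'ENSG_C\n' > "$workdir/exclude_ids.txt"

region_list="$workdir/pheno.region_list"
# -F: fixed strings, -w: whole-word match, -f: read patterns from a file, -v: invert
gzip -dc "$workdir/pheno.bed.gz" | cut -f 1-4 \
    | grep -v -w -F -f "$workdir/exclude_ids.txt" > "$region_list"

# -f tests for a regular file, -s for a non-empty one
if [ -f "$region_list" ] && [ -s "$region_list" ]; then
    echo "region list written to $region_list"
fi
```

Writing to a named path rather than using `<()` yields a regular file on disk, which is what the region list requirement above calls for.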

The following command needs more than 180G of memory to run.

In [None]:
sos run xqtl-pipeline/pipeline/TensorQTL.ipynb trans \
    --genotype-file output/protocol_example.genotype.chr21_22.bed \
    --phenotype-file  output/protocol_example.protein.bed.gz \
    --region-list output/protocol_example.protein.region_list \
    --covariate-file output/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.pca.Marchenko_PC.gz \
    --customized-cis-windows output/protocol_example.protein.customized_cis.tsv \
    --cwd output/association/trans/ \
    --container containers/TensorQTL.sif --MAC 5 --numThreads 8 -J 1 -q csg --mem 240G -c /mnt/vast/hpc/csg/molecular_phenotype_calling/csg.yml 

INFO: Running trans_1: 
INFO: t3e846e705734907f restart from status failed
INFO: t3e846e705734907f submitted to csg with job id 6227759
INFO: Waiting for the completion of 1 task.

### Standardize the results (to be updated; can be safely skipped for now)

#### Generate yml

In [None]:
sos run xqtl-pipeline/pipeline/yml_generator.ipynb yml_list \
    --sumstat-list /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/TADB/TensorQTL.cis._recipe.tsv \
    --cwd /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad \
    --name pQTL \
    --container /mnt/vast/hpc/csg/containers/bioinfo.sif

#### Generate the target

In [None]:
sos run xqtl-pipeline/pipeline/summary_stats_standardizer.ipynb TARGET_generation \
    --fasta /mnt/vast/hpc/csg/molecular_phenotype_calling/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta  \
    --sumstat-list /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad/qced_sumstat_list.txt \
    --yml-list /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad/yml_list.txt  \
    --cwd /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad \
    --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
    --walltime 100h \
    --numThreads 20 \
    --mem 150G -J 50 -c /mnt/vast/hpc/csg/molecular_phenotype_calling/csg.yml -q csg

#### Standardize the sumstats

In [None]:
sos run xqtl-pipeline/pipeline/summary_stats_standardizer.ipynb sumstat_standardization \
    --sumstat-list /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad/qced_sumstat_list.txt  \
    --yml-list /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad/yml_list.txt \
    --TARGET_list /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad/TARGET.ref.list \
    --cwd /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad/ \
    --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
    --walltime 100h \
    --numThreads 20 \
    --mem 200G -J 50 -c /mnt/vast/hpc/csg/molecular_phenotype_calling/csg.yml -q csg

#### Convert sumstats to VCF

In [1]:
sos run xqtl-pipeline/pipeline/summary_stats_standardizer.ipynb sumstat_to_vcf \
    --sumstat-list /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad/qced_sumstat_list.txt  \
    --cwd /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad/ \
    --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
    --walltime 100h \
    --numThreads 20 \
    --mem 200G -J 50 -c /mnt/vast/hpc/csg/molecular_phenotype_calling/csg.yml -q csg

#### Association result processing

In [None]:
sos run xqtl-pipeline/pipeline/assoc_result_processing.ipynb genome \
    --vcf `ls /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad/*/*cis_long_table.vcf` \
    --padjust-method "bonferroni" \
    --container /mnt/vast/hpc/csg/containers/bioinfo.sif \
    --mem 200G -J 22 -q csg -c /mnt/vast/hpc/csg/molecular_phenotype_calling/csg.yml  
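
For readers unfamiliar with the `bonferroni` option, the textbook Bonferroni adjustment multiplies each p-value by the number of tests and caps the result at 1. The sketch below illustrates that rule on a toy table; it is not taken from `assoc_result_processing.ipynb`, and the file and column names are hypothetical.

```shell
set -e

workdir=$(mktemp -d)
# Toy table: one header line, then one p-value per variant
printf 'variant\tpvalue\nrs1\t0.001\nrs2\t0.02\nrs3\t0.5\n' > "$workdir/pvals.tsv"

# Number of tests = number of data rows
m=$(awk 'NR > 1' "$workdir/pvals.tsv" | wc -l | tr -d ' ')

# Adjusted p-value = min(1, p * m), appended as a new column
awk -F'\t' -v OFS='\t' -v m="$m" '
    NR == 1 {print $0, "p_bonferroni"; next}
    {adj = $2 * m; if (adj > 1) adj = 1; print $0, adj}
' "$workdir/pvals.tsv" > "$workdir/pvals.adjusted.tsv"

cat "$workdir/pvals.adjusted.tsv"
```

How the pipeline defines the number of tests (per gene, per chromosome, or genome-wide) is not stated here, so the choice of `m` in this sketch is an assumption.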

#### Summary of results (to be updated)

In [None]:
cat /mnt/vast/hpc/csg/molecular_phenotype_calling/pQTL_cis/rosmap_stad/pQTL.1/pheno_recipe_rosmap_pheno.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.filtered.pQTL.unrelated.filtered.prune.pca.resid.Marchenko_pc.1.n_sig.txt

| tissue | n_assoc | n_snp | n_gene | 
| --- | --- | --- | --- |
| pQTL | 290330 | 266394 | 3166 | 
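
Counts of this kind could be derived from a table of significant associations along the following lines. This is only a sketch under assumptions: it presumes that `n_assoc` is the number of rows, `n_snp` the number of distinct variants, and `n_gene` the number of distinct genes, and it uses a hypothetical two-column toy file rather than the pipeline's actual output format.

```shell
set -e

workdir=$(mktemp -d)
# Toy table of significant variant-gene pairs (hypothetical layout)
printf 'variant\tgene\nchr21:100\tG1\nchr21:100\tG2\nchr22:200\tG1\nchr22:300\tG3\n' > "$workdir/sig.tsv"

# Count rows, distinct variants (column 1) and distinct genes (column 2)
summary=$(awk -F'\t' '
    NR > 1 {
        n++
        if (!($1 in snp))  { snp[$1] = 1;  ns++ }
        if (!($2 in gene)) { gene[$2] = 1; ng++ }
    }
    END { printf "%d\t%d\t%d", n, ns, ng }
' "$workdir/sig.tsv")

printf 'n_assoc\tn_snp\tn_gene\n%s\n' "$summary"
```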