# Fine-mapping for Summary Statistics

## Generation of LD (Skippable)
If no LD matrices are provided, our pipeline generates in-sample LD matrices.

In [None]:
cd /mnt/vast/hpc/csg/molecular_phenotype_calling/LD

sos run pipeline/genotype_formatting.ipynb ld_by_region_plink \
    --region_list ~/1300_hg38_EUR_LD_blocks.tsv --cwd output  \
    --genoFile ../genotype/ROSMAP_NIA_WGS.leftnorm.filtered.filtered.bed \
    --container ../containers/bioinfo.sif

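To make concrete what an LD matrix holds, here is a minimal sketch that computes pairwise Pearson correlations between genotype dosage vectors. It is illustrative only: the pipeline step above appears to rely on PLINK (judging by its name, `ld_by_region_plink`), and the function names `pearson_r` and `ld_matrix` as well as the toy dosages are assumptions, not part of the pipeline.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A monomorphic variant has no variance, so r is undefined.
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def ld_matrix(genotypes):
    """genotypes: list of variants, each a list of per-sample dosages (0/1/2).

    Returns a symmetric matrix of signed correlations r (not r^2).
    The diagonal is set to 1.0 by convention.
    """
    m = len(genotypes)
    ld = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            r = pearson_r(genotypes[i], genotypes[j])
            ld[i][j] = r
            ld[j][i] = r
    return ld


# Toy example (made-up dosages): variants 0 and 1 are identical,
# variant 2 is their mirror image.
toy = [[0, 1, 2, 1], [0, 1, 2, 1], [2, 1, 0, 1]]
ld = ld_matrix(toy)
```

Signed correlations are kept here because summary-statistic fine-mapping needs the sign of the correlation to stay consistent with the allele coding of the sumstat.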
## Standardization of sumstat
We should standardize our sumstat and fix potential allele flip issues against the LD matrices. After downloading the ADGWAS data, some processing is needed. Please refer to this [notebook](https://github.com/cumc/fungen-xqtl-analysis/blob/main/analysis/Wang_Columbia/GWAS/AD_GWAS_processing.ipynb) for the processing steps that are not implemented in the pipeline and are needed to generate the yml list that guides the standardization of the sumstat.


In [None]:
cd /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS
sos run pipeline/yml_generator.ipynb yml_list \
    --sumstat-list ADGWAS2022.recipe  \
    --cwd  data_intergration_new/ADGWAS2022   &


The TARGET_generation step below creates reference data for the standardization based on our FASTA files. If an external LD panel is used, please manually extract the chr, pos, ref, and alt of the external LD in the same format as our TARGET file.

In [None]:
sos run  pipeline/summary_stats_standardizer.ipynb   TARGET_generation  \
      --sumstat-list /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/qced_sumstat_list.txt    \
      --yml-list /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/yml_list.txt     \
      --fasta /mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
      --cwd data_intergration/ADGWAS2022/  -J 22 -c /mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/csg.yml -q csg2 --mem 50G --walltime 48h &

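For an external LD panel, the manual extraction of chr, pos, ref, and alt might look like the sketch below. It assumes the panel ships with a PLINK `.bim` file (six whitespace-separated columns: chromosome, variant ID, genetic distance, position, allele 1, allele 2) and treats allele 2 as ref and allele 1 as alt, which PLINK does not guarantee. The exact layout of our TARGET file is not shown in this notebook, so the output column order and the function name `bim_to_target_rows` are assumptions.

```python
def bim_to_target_rows(bim_lines):
    """Turn PLINK .bim lines into (chr, pos, ref, alt) tuples.

    Assumption: allele 2 (column 6) is the reference allele and
    allele 1 (column 5) is the alternate allele. Verify this against
    the FASTA before using the result.
    """
    rows = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) != 6:
            # Skip blank or malformed lines instead of guessing.
            continue
        chrom, _variant_id, _cm, pos, allele1, allele2 = fields
        rows.append((chrom, int(pos), allele2, allele1))
    return rows


# Made-up example lines, not taken from any real panel.
example = ["1 rs0001 0 12345 G A\n", "1 rs0002 0 67890 T C\n"]
target_rows = bim_to_target_rows(example)
```

The rows can then be written out with whatever delimiter and header the TARGET file uses.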

Once the TARGET is available, the standardization can begin. This is the same as the sumstat_standardization step that generates our QTL results.

In [None]:
sos run  pipeline/summary_stats_standardizer.ipynb   sumstat_standardization  \
      --sumstat-list /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/qced_sumstat_list.txt    \
      --yml-list /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/yml_list.txt     \
      --fasta /mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
      --TARGET_list data_intergration/ADGWAS2022/TARGET.ref.list \
      --cwd data_intergration/ADGWAS2022/  -J 22 -c /mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/csg.yml -q csg2 --mem 50G --walltime 48h &

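The allele flip fix mentioned at the start of this section can be illustrated with a small sketch. It shows the general idea only: when a sumstat variant lists the same two alleles as the reference but in swapped order, the effect sign is inverted; when the alleles match only after reverse-complementing, the variant was reported on the other strand. The function names `revcomp` and `harmonize` are hypothetical, and the rules actually applied by `summary_stats_standardizer.ipynb` are not shown in this notebook.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def revcomp(allele):
    """Reverse complement of an allele made of A/C/G/T only."""
    return "".join(COMPLEMENT[base] for base in reversed(allele.upper()))


def harmonize(target_ref, target_alt, ref, alt, beta):
    """Align one sumstat effect to the reference (TARGET) allele coding.

    Returns the possibly sign-flipped beta, or None when the alleles
    cannot be matched. Palindromic SNPs (A/T, C/G) are ambiguous
    between an allele swap and a strand flip; this sketch resolves
    them as a same-strand match or swap, which may be wrong.
    """
    if (ref, alt) == (target_ref, target_alt):
        return beta                      # already aligned
    if (ref, alt) == (target_alt, target_ref):
        return -beta                     # allele flip: invert the sign
    flipped = (revcomp(ref), revcomp(alt))
    if flipped == (target_ref, target_alt):
        return beta                      # strand flip only
    if flipped == (target_alt, target_ref):
        return -beta                     # strand flip plus allele flip
    return None                          # alleles do not match


aligned = harmonize("A", "G", "G", "A", 0.5)
```

The same sign inversion applies to z-scores. Variants that return `None` would typically be dropped before fine-mapping.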
## SuSiE RSS
For both xQTL sumstat and GWAS sumstat, as long as they are the output of sumstat_standardization, they can be fed into the susie_rss module directly. The fine-mapping implemented at the moment is based on the LD npz files from the first step of this notebook. If external LD is used, it needs to be split in the same way.

In [None]:
cd /mnt/vast/hpc/csg/xqtl_workflow_testing/susie_rss
sos run pipeline/SuSiE_RSS.ipynb SuSiE_RSS \
    --LD_list test.ld.list \
    --sumstat_list /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/qced_sumstat_list.txt \
    --container containers/stephenslab.sif --impute --cwd output_impute_2 &

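If an external LD panel has to be split into the same regions as the in-sample LD, each variant first needs to be assigned to a region. A minimal sketch follows, assuming the region list provides (chr, start, end) triples, that regions on a chromosome do not overlap, and that intervals are half-open; the actual columns of the LD block file are not shown in this notebook, and `assign_blocks` is a hypothetical helper.

```python
from bisect import bisect_right


def assign_blocks(blocks, variants):
    """Map each (chr, pos) variant to the block that contains it.

    blocks:   list of (chr, start, end), treated as [start, end)
    variants: list of (chr, pos)
    Returns a dict {block: [variants]}; variants outside every
    block are left out.
    """
    by_chrom = {}
    for chrom, start, end in blocks:
        by_chrom.setdefault(chrom, []).append((start, end))
    starts = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts[chrom] = [s for s, _ in intervals]

    assigned = {}
    for chrom, pos in variants:
        if chrom not in by_chrom:
            continue
        # Index of the last block whose start is <= pos.
        i = bisect_right(starts[chrom], pos) - 1
        if i < 0:
            continue
        start, end = by_chrom[chrom][i]
        if start <= pos < end:
            assigned.setdefault((chrom, start, end), []).append((chrom, pos))
    return assigned


# Made-up coordinates for illustration.
blocks = [("chr1", 0, 100), ("chr1", 100, 200)]
variants = [("chr1", 50), ("chr1", 100), ("chr1", 250), ("chr2", 10)]
per_block = assign_blocks(blocks, variants)
```

Each block's variant list can then be used to subset the external LD matrix into one file per region.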
## SuSiE results post processing
The default output of susie or susie RSS is an rds object, which is hard to integrate. Therefore we have a utility module to extract and save the information into three types of tables. One of the tables, the lbf one, has all the information needed for coloc analysis. For eQTL, the region list is the one used for fine-mapping. For sumstat, the region list is the one used to call LD. When the input RDS files have chr{1:22} in their names, their LBF tables are rbind-ed into one lbf table for the full chromosome.

In [None]:
cd /mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl

sos run pipeline/SuSiE_post_processing.ipynb susie_to_tsv \
    --cwd output/test --rds_path `ls output/test/cache/*rds | head ` \
    --region-list <(head -50  ./dlpfc_region_list) --container containers/stephenslab.sif 

In [None]:
cd /mnt/vast/hpc/csg/xqtl_workflow_testing/susie_rss

sos run pipeline/SuSiE_post_processing.ipynb susie_to_tsv \
    --cwd output/ADGWAS_finemapping_extracted --rds_path `ls GWAS_Finemapping_Results/Bellenguez/ADG_*rds ` \
    --region-list ~/1300_hg38_EUR_LD_blocks_orig.tsv \
    --container containers/stephenslab.sif 

sos run pipeline/SuSiE_post_processing.ipynb susie_tsv_collapse \
    --cwd output/ADGWAS_finemapping_extracted --tsv_path `ls output/ADGWAS_finemapping_extracted/*lbf.tsv` \
    --container containers/stephenslab.sif 

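The per-chromosome merge described in this section can be pictured as follows: group the tables by the `chr{1:22}` tag in their file names, then stack the rows of each group under a single header. This is a sketch of the idea, not the code in `SuSiE_post_processing.ipynb`; the file names in the example and the helpers `group_by_chrom` and `rbind` are hypothetical.

```python
import re
from collections import defaultdict

# Matches chr1..chr22 and refuses longer numbers such as chr23 or chr170.
CHR_TAG = re.compile(r"chr([1-9]|1[0-9]|2[0-2])(?!\d)")


def group_by_chrom(paths):
    """Group file paths by the first chr{1:22} tag found in each path."""
    groups = defaultdict(list)
    for path in paths:
        match = CHR_TAG.search(path)
        if match:
            groups["chr" + match.group(1)].append(path)
    return dict(groups)


def rbind(tables):
    """Stack tables (lists of rows, header first) that share one header."""
    header = tables[0][0]
    merged = [header]
    for table in tables:
        if table[0] != header:
            raise ValueError("tables have different headers")
        merged.extend(table[1:])
    return merged


# Hypothetical file names, for illustration only.
paths = ["ADG_chr17_block1.lbf.tsv", "ADG_chr17_block2.lbf.tsv",
         "ADG_chr2_block1.lbf.tsv", "ADG_chr23_block1.lbf.tsv"]
groups = group_by_chrom(paths)

t1 = [["variant", "lbf"], ["v1", "0.1"]]
t2 = [["variant", "lbf"], ["v2", "3.2"]]
merged = rbind([t1, t2])
```

Files without a recognizable chromosome tag are left ungrouped.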
## Coloc analysis
The qtl_tsv are the lbf tsv files that are not rbind-ed into per-chromosome tables, while the sumstat_tsv are per chromosome and have their chromosome in their names. This requirement is crucial for padding.

In [None]:
sos run pipeline/coloc.ipynb coloc \
    --qtl_tsv `ls /mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad/*f.tsv`   \
    --sumstat_tsv `ls output/ADGWAS_finemapping_extracted/*chr17.unisusie_rss.lbf.tsv` \
    --region_list test.region_list 
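
To show why the lbf tables are sufficient for coloc, here is a minimal sketch of the standard approximate-Bayes-factor colocalization posterior computed from two vectors of per-variant log Bayes factors over the same variants. It assumes natural-log Bayes factors and uses the default priors of the `coloc` R package (`p1 = 1e-4`, `p2 = 1e-4`, `p12 = 1e-5`). Whether `pipeline/coloc.ipynb` computes exactly this is not shown here, and the function names are illustrative.

```python
import math


def logsum(xs):
    """log(sum(exp(x))) computed stably."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def logdiff(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference vanishes."""
    if b >= a:
        return float("-inf")
    return a + math.log1p(-math.exp(b - a))


def coloc_posteriors(lbf1, lbf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from two aligned LBF vectors.

    H3: both traits have a causal variant, but different ones.
    H4: both traits share one causal variant.
    """
    if len(lbf1) != len(lbf2):
        raise ValueError("LBF vectors must cover the same variants")
    lsum = [a + b for a, b in zip(lbf1, lbf2)]
    l_h0 = 0.0
    l_h1 = math.log(p1) + logsum(lbf1)
    l_h2 = math.log(p2) + logsum(lbf2)
    l_h3 = (math.log(p1) + math.log(p2)
            + logdiff(logsum(lbf1) + logsum(lbf2), logsum(lsum)))
    l_h4 = math.log(p12) + logsum(lsum)
    all_l = [l_h0, l_h1, l_h2, l_h3, l_h4]
    denom = logsum(all_l)
    return [math.exp(l - denom) for l in all_l]


# Made-up LBF values: the same lead variant vs. different lead variants.
shared = coloc_posteriors([10.0, 0.0, 0.0], [10.0, 0.0, 0.0])
distinct = coloc_posteriors([10.0, 0.0, 0.0], [0.0, 10.0, 0.0])
```

In the toy inputs, `shared` puts most of its mass on H4, while `distinct` favors H3 over H4. The two vectors must be aligned variant by variant before the call.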