## Tutorial Workflow for LDSC with DeepSea Integration:

This notebook runs the minimal working example for LDSC with DeepSea integration. It trains the DeepSea model on the set of features specified in the .yml file in the Google Drive folder, uses the trained model to generate predictions on the reference genome, and runs LDSC on those predictions to estimate enrichments.

To use a different set of features or change the training parameters, edit the provided .yml file; the rest of the workflow runs unchanged.

This is the command to run the Minimal Working Example:


sos run LDSC_DeepSea_Code.ipynb \
    --model /mnt/mfs/statgen/Anmol/training_files/tutorial/training_outputs/model \
    --feature_list /mnt/mfs/statgen/Anmol/training_files/tutorial/tutorial_features.txt \
    --output_tsv /mnt/mfs/statgen/Anmol/training_files/tutorial/testing \
    --tsv /mnt/mfs/statgen/Anmol/training_files/tutorial/testing \
    --annot_files /mnt/mfs/statgen/Anmol/training_files/tutorial/annot_files \
    --sumst /mnt/mfs/statgen/Anmol/polyfun/Dey/PGCALZ2sumstatsExcluding23andMe.txt \
    --output_sumst /mnt/mfs/statgen/Anmol/polyfun/Dey/2021.Updated.sumstats.gz \
    --signed True \
    --bim /mnt/mfs/statgen/Anmol/training_files/tutorial/plink_files \
    --num_features 7 \
    --ld_scores /mnt/mfs/statgen/Anmol/training_files/tutorial/annot_files \
    --ctrl_sumstats /mnt/mfs/statgen/Anmol/polyfun/Dey/AMD.sumstats.gz \
    --AD_sumstats /mnt/mfs/statgen/Anmol/polyfun/Dey/2021.Updated.sumstats.gz \
    --w_ld_ctrl /mnt/mfs/statgen/Anmol/training_files/testing/ldsc/AD_Variants/tutorial_data/weights_hm3_no_hla/weights. \
    --frq_file_ctrl /mnt/mfs/statgen/Anmol/training_files/testing/ldsc/AD_Variants/frq/1000G.EUR.QC. \
    --w_ld_AD /mnt/mfs/statgen/Anmol/training_files/testing/ldsc/AD_Variants/tutorial_data/weights_hm3_no_hla/weights.2021. \
    --frq_file_AD /mnt/mfs/statgen/Anmol/training_files/testing/ldsc/AD_Variants/frq/1000G.2021.EUR.QC. \
    --ref_ld annot_files \
    --pheno AMD
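
Because the command above carries many options, it can be easier to keep them in a dictionary and assemble the argument list in Python. The following is a minimal sketch, not part of the notebook: `build_sos_command` is a hypothetical helper, and the dictionary holds only a subset of the options shown above.

```python
import shlex


def build_sos_command(notebook, params):
    """Return a `sos run` command as an argument list.

    `params` maps workflow option names (without the leading `--`) to values.
    This helper is hypothetical; it only assembles strings and runs nothing.
    """
    cmd = ["sos", "run", notebook]
    for name, value in params.items():
        cmd += [f"--{name}", str(value)]
    return cmd


# A subset of the options from the tutorial command, for illustration only.
params = {
    "model": "/mnt/mfs/statgen/Anmol/training_files/tutorial/training_outputs/model",
    "num_features": 7,
    "signed": True,
    "pheno": "AMD",
}

cmd = build_sos_command("LDSC_DeepSea_Code.ipynb", params)
print(shlex.join(cmd))  # shell-quoted form, suitable for copy and paste
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with long paths.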

## Command Interface:

The help output below lists the available workflows and the options each one accepts.

In [18]:
!sos run LDSC_DeepSea_Code.ipynb -h


usage: sos run LDSC_DeepSea_Minimal_Example.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  train_model
  make_annot
  format_annot
  munge_sumstats_no_sign
  munge_sumstats_sign
  calc_ld_score
  convert_ld_snps
  calc_enrichment

Sections
  train_model:
  make_annot:
    Workflow Options:
      --feature-list VAL (as str, required)
                        path to feature list file
      --model VAL (as str, required)
                        path to trained model location
      --output-tsv VAL (as str, required)
                        path to output directory
  format_annot:
    Workflow Options:
      --tsv . (as path)
                        path to tsv files

## Train Model:

In [93]:

[train_model]

bash: container='/mnt/mfs/statgen/Anmol/deepsea_latest.sif'

    python3.7 /mnt/mfs/statgen/Anmol/training_files/tutorial/run_neuron_full_tutorial.py 

## Make Full Annotation File Based on Trained Model

In [None]:
# Get Predictions for Features based on Trained Model


[make_annot]

#path to feature list file
parameter: feature_list = str
#path to trained model location
parameter: model = str
#path to output directory
parameter: output_tsv = str
#number of features the model was trained on
parameter: num_features = int


python3: expand = "${ }", container='/mnt/mfs/statgen/Anmol/deepsea_latest.sif'

    from selene_sdk.predict import AnalyzeSequences
    from selene_sdk.sequences import Genome
    from selene_sdk.utils import load_features_list
    from selene_sdk.utils import NonStrandSpecific
    from selene_sdk.utils import DeeperDeepSEA

    distinct_features = load_features_list("${feature_list}")

    model_predict = AnalyzeSequences(
        NonStrandSpecific(DeeperDeepSEA(1000, ${num_features})),
        "${model}/best_model.pth.tar",
        sequence_length=1000,
        features=distinct_features,
        reference_sequence=Genome("/mnt/mfs/statgen/Anmol/training_files/male.hg19.fasta"),
        use_cuda=False  # set to True if CUDA is available on your machine
    )

    # Score the variants in each autosomal VCF (chromosomes 1-22)
    for i in range(1, 23):
        model_predict.variant_effect_prediction(
            "/mnt/mfs/statgen/Anmol/training_files/testing/1000G_chr_" + str(i) + ".vcf",
            save_data=["abs_diffs"],  # only want to save the absolute diff score data
            output_dir="${output_tsv}")
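
To make the `abs_diffs` output concrete: as far as I understand Selene's variant effect prediction, each saved score is the absolute difference between the model's prediction for the alternate-allele sequence and its prediction for the reference-allele sequence, computed per feature. The sketch below reproduces that arithmetic on toy numbers; `abs_diffs` here is a hypothetical stand-alone function, not Selene's implementation.

```python
def abs_diffs(ref_preds, alt_preds):
    """Per-feature |alt - ref| for one variant.

    `ref_preds` and `alt_preds` are the model outputs (one value per feature)
    for the reference and alternate sequences. Toy re-implementation only.
    """
    if len(ref_preds) != len(alt_preds):
        raise ValueError("ref and alt predictions must cover the same features")
    return [abs(alt - ref) for ref, alt in zip(ref_preds, alt_preds)]


# Toy predictions for two features (made-up values).
scores = abs_diffs([0.25, 0.5], [0.75, 0.5])
print(scores)
```

A feature whose prediction does not change between alleles gets a score of zero, so larger values mark variants the model considers more disruptive for that feature.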

## Format Annotation File

In [None]:
# Separate Annotation Files by Chromosome


[format_annot]

#path to tsv files directory
parameter: tsv = path()
#path to output file directory
parameter: annot_files = path()

R: expand = "${ }", container="/mnt/mfs/statgen/Anmol/r-packages.sif"
    library(data.table)
    library(tidyverse)
    data = fread(paste0("${tsv}","/tutorial_1000G_chr_",22,"_abs_diffs.tsv"))
    features = colnames(data)[9:ncol(data)]
    features = data.frame(features)
    features$encoding = paste0("feat_",seq(1,nrow(features)))
    fwrite(features,paste0("${annot_files}","/feature_encoding.txt"),quote=F,sep="\t",row.names=F,col.names=T)
    for (i in seq(1,22)){
      data = fread(paste0("${tsv}","/tutorial_1000G_chr_",i,"_abs_diffs.tsv"))
      # drop columns 4-8, keeping chromosome, position, SNP name, and the feature columns
      data_2 = select(data,-seq(4,8))
      base = data.frame(base=rep(1,nrow(data_2)))
      fwrite(base,paste0("${annot_files}","/base_chr_",i,".annot.gz"),quote=F,sep="\t",row.names=F,col.names=T)
      for (j in seq(4,ncol(data_2))){
        # feature columns start at column 4, so column j holds feature k = j - 3;
        # numbering from 1 matches feature_encoding.txt and the later steps
        k = j - 3
        data_3 = select(data_2,c(1,2,3,j))
        colnames(data_3) = c("CHR","BP","SNP",paste0("feat_",k))
        data_3 = setorder(data_3,BP)
        data_3 = select(data_3,-c("CHR","BP","SNP"))
        fwrite(data_3,paste0("${annot_files}","/feat_",k,"_chr_",i,".annot.gz"),quote=F,sep="\t",row.names=F,col.names=T)
      }
    }
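
The formatting step can also be sketched in Python to show the intended result: one single-column annotation per feature, numbered from 1 so the names line up with `feature_encoding.txt` and with the `seq 1 num_features` loops in later steps. This sketch assumes, as the R code does, that the first eight columns are metadata and the features start at column 9; the header names and values in the toy table are hypothetical.

```python
import csv
import io

META_COLUMNS = 8  # assumption taken from the R code: features start at column 9


def split_features(tsv_text):
    """Split an abs_diffs table into per-feature columns sorted by position.

    Returns (encoding, annots): `encoding` maps feat_k to the original feature
    name, and `annots` maps feat_k to its list of scores.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    rows = sorted(reader, key=lambda r: int(r[1]))  # column 2 is the position
    names = header[META_COLUMNS:]
    encoding = {f"feat_{k}": name for k, name in enumerate(names, start=1)}
    annots = {
        f"feat_{k}": [float(r[META_COLUMNS + k - 1]) for r in rows]
        for k in range(1, len(names) + 1)
    }
    return encoding, annots


# Toy table with two variants (out of order) and two features.
toy = (
    "chrom\tpos\tname\tref\talt\tstrand\tref_match\tcontains_unk\tfeatA\tfeatB\n"
    "22\t200\trs2\tA\tG\t+\tTrue\tFalse\t0.5\t0.25\n"
    "22\t100\trs1\tC\tT\t+\tTrue\tFalse\t0.125\t0.75\n"
)
encoding, annots = split_features(toy)
print(encoding)
print(annots)
```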

## Munge Summary Statistics (Option 1: No Signed Summary Statistic):

In [None]:
# Option when Summary Statistic File does not contain a Z or Beta Column (Signed Summary Statistic)

[munge_sumstats_no_sign]



#path to summary statistic file
parameter: sumst = str
#path to Hapmap3 SNPs file, keep all columns (SNP, A1, and A2) for the munge_sumstats program
parameter: alleles = "w_hm3.snplist"
#path to output file
parameter: output_sumst = str
#does summary statistic contain Z or Beta
parameter: signed = False

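bash: expand = '${ }'
   # Run only when the file has no signed summary statistic; --a1-inc treats A1 as the increasing allele
   if [ "${signed}" == "False" ]
       then
           python2 /mnt/mfs/statgen/Anmol/ldsc/munge_sumstats.py --sumstats ${sumst} --merge-alleles ${alleles} --out ${output_sumst} --a1-inc
       fi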

## Munge Summary Statistics (Option 2: Contains Signed Summary Statistic):

In [None]:
# This option is for when the summary statistic file does contain a signed summary statistic (Z or Beta)
[munge_sumstats_sign]



#path to summary statistic file
parameter: sumst = str
#path to Hapmap3 SNPs file, keep all columns (SNP, A1, and A2) for the munge_sumstats program
parameter: alleles = "w_hm3.snplist"
#path to output file
parameter: output_sumst_2 = str
#does summary statistic contain Z or Beta
parameter: signed = False

bash: expand = '${ }'
    # Run only when the file contains a signed summary statistic (Z or Beta)
    if [ "${signed}" == "True" ]
        then
            python2 /mnt/mfs/statgen/Anmol/ldsc/munge_sumstats.py --sumstats ${sumst} --merge-alleles ${alleles} --out ${output_sumst_2}
        fi
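
The only difference between the two munging options is the `--a1-inc` flag, which tells `munge_sumstats.py` to treat A1 as the increasing allele when no Z or Beta column is available. The choice can be sketched as a small function; `munge_args` is a hypothetical helper, and the file names in the example are placeholders.

```python
def munge_args(sumst, alleles, out, signed):
    """Build the munge_sumstats.py argument list for either option.

    `signed` is True when the file has a Z or Beta column. Hypothetical helper:
    it only assembles the arguments and does not run LDSC.
    """
    args = [
        "python2", "munge_sumstats.py",
        "--sumstats", sumst,
        "--merge-alleles", alleles,
        "--out", out,
    ]
    if not signed:
        # No signed statistic: declare A1 as the increasing allele instead.
        args.append("--a1-inc")
    return args


# Placeholder file names, for illustration only.
unsigned_cmd = munge_args("sumstats.txt", "w_hm3.snplist", "munged", signed=False)
signed_cmd = munge_args("sumstats.txt", "w_hm3.snplist", "munged", signed=True)
print(" ".join(unsigned_cmd))
print(" ".join(signed_cmd))
```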

## Calculate LD Scores:

**Delete the SNP, CHR, and BP columns from the annotation files if they are present; otherwise this code will not work. If the columns are present, sort the annotation file before deleting them.**

In [None]:

[calc_ld_score]

#Path to directory with bim files
parameter: bim = path()
#Path to directory with annotation files, output will appear here too. Make sure to remove the SNP, CHR, and BP columns from the annotation files if present before running.
parameter: annot_files = path()
#number of features
parameter: num_features = int

bash: expand = '${ }'
   # LD scores for each feature annotation on chromosomes 1-22 (4 parallel jobs);
   # the later steps read feature LD scores for every chromosome
   for chr in $(seq 1 22); do
       seq 1 ${num_features}| xargs -I j -P 4 python2 /mnt/mfs/statgen/Anmol/ldsc/ldsc.py --bfile ${bim}/1000G.EUR.QC.$chr --l2 --ld-wind-cm 1 --annot ${annot_files}/feat_j_chr_$chr.annot.gz --thin-annot --out ${annot_files}/feat_j_chr_$chr --print-snps /mnt/mfs/statgen/Anmol/ldsc/tutorial_data/w_hm3.snplist/snplist.txt
   done
   # LD scores for the base annotation on chromosomes 1-22
   seq 1 22| xargs -I j -P 4 python2 /mnt/mfs/statgen/Anmol/ldsc/ldsc.py --bfile ${bim}/1000G.EUR.QC.j --l2 --ld-wind-cm 1 --annot ${annot_files}/base_chr_j.annot.gz --thin-annot --out ${annot_files}/base_chr_j --print-snps /mnt/mfs/statgen/Anmol/ldsc/tutorial_data/w_hm3.snplist/snplist.txt
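
Since this step uses `--thin-annot`, a quick pre-flight check for leftover SNP, CHR, or BP columns can save a failed run. The sketch below reads only the header line of a gzipped annotation file; `id_columns_in_header` is a hypothetical helper, and the two files it inspects are toy files written to a temporary directory.

```python
import gzip
import os
import tempfile

ID_COLUMNS = {"CHR", "BP", "SNP"}  # columns the note above says to remove


def id_columns_in_header(path):
    """Return the identifier columns still present in a gzipped annot file."""
    with gzip.open(path, "rt") as fh:
        header = fh.readline().rstrip("\n").split("\t")
    return [col for col in header if col in ID_COLUMNS]


with tempfile.TemporaryDirectory() as tmp:
    # A thin annotation file: one feature column only.
    thin = os.path.join(tmp, "feat_1_chr_22.annot.gz")
    with gzip.open(thin, "wt") as fh:
        fh.write("feat_1\n0.5\n")

    # A file that still carries identifier columns.
    full = os.path.join(tmp, "full_chr_22.annot.gz")
    with gzip.open(full, "wt") as fh:
        fh.write("CHR\tBP\tSNP\tfeat_1\n22\t100\trs1\t0.5\n")

    thin_ids = id_columns_in_header(thin)
    full_ids = id_columns_in_header(full)

print(thin_ids)
print(full_ids)
```

An empty list means the file is ready for `--thin-annot`; any listed column should be removed after the file has been sorted.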

## Convert LD Score SNPs to AD Summary Statistic Format:

In [None]:
# Convert SNP format in LD Score Files to CHR:BP to match with AD Summary Statistic Format


[convert_ld_snps]

#Path to directory with ld score files AND annotation files
parameter: ld_scores = str

parameter: num_features = int


R: expand = "${ }", container="/mnt/mfs/statgen/Anmol/r-packages.sif"
    library(tidyverse)
    #library(R.utils)
    library(data.table)
    for (i in seq(1,22)){
      data = read.table(gzfile(paste0("${ld_scores}/base_chr_",i,".l2.ldscore.gz")),header=T)
      data_2 = fread(paste0("${ld_scores}/base_chr_",i,".l2.M_5_50"))
      data_3 = read.table(gzfile(paste0("${ld_scores}/base_chr_",i,".annot.gz")),header=T)
      data$SNP = paste0(data$CHR,":",data$BP)
      fwrite(data,paste0("${ld_scores}/AD_base_chr_",i,".l2.ldscore.gz"),quote=F,sep="\t",row.names=F,col.names=T)
      fwrite(data_2,paste0("${ld_scores}/AD_base_chr_",i,".l2.M_5_50"),quote=F,sep="\t",row.names=F,col.names=F)
      fwrite(data_3,paste0("${ld_scores}/AD_base_chr_",i,".annot.gz"),quote=F,sep="\t",row.names=F,col.names=T)
      for (j in seq(1,${num_features})){
      data = read.table(gzfile(paste0("${ld_scores}/feat_",j,"_chr_",i,".l2.ldscore.gz")),header=T)
      data_2 = fread(paste0("${ld_scores}/feat_",j,"_chr_",i,".l2.M_5_50"))
      data_3 = read.table(gzfile(paste0("${ld_scores}/feat_",j,"_chr_",i,".annot.gz")),header=T)
      data$SNP = paste0(data$CHR,":",data$BP)
      fwrite(data,paste0("${ld_scores}/AD_feat_",j,"_chr_",i,".l2.ldscore.gz"),quote=F,sep="\t",row.names=F,col.names=T)
      fwrite(data_2,paste0("${ld_scores}/AD_feat_",j,"_chr_",i,".l2.M_5_50"),quote=F,sep="\t",row.names=F,col.names=F)
      fwrite(data_3,paste0("${ld_scores}/AD_feat_",j,"_chr_",i,".annot.gz"),quote=F,sep="\t",row.names=F,col.names=T)
    }
    }
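
The core of the conversion is a single column rewrite: each SNP identifier is replaced with `CHR:BP` so it matches the identifiers in the AD summary statistics. A minimal Python sketch of that rewrite follows; `snp_to_chr_bp` is a hypothetical helper, and the toy table uses made-up column values.

```python
import csv
import io


def snp_to_chr_bp(ldscore_text):
    """Rewrite the SNP column of a tab-separated LD score table as CHR:BP."""
    reader = csv.DictReader(io.StringIO(ldscore_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row["SNP"] = f"{row['CHR']}:{row['BP']}"
        writer.writerow(row)
    return out.getvalue()


# Toy LD score table with one SNP (made-up values).
toy = "CHR\tSNP\tBP\tL2\n22\trs1\t16050075\t1.5\n"
converted = snp_to_chr_bp(toy)
print(converted)
```

All other columns pass through unchanged, which is why the `.M_5_50` and `.annot.gz` files are simply copied under the `AD_` prefix.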
  


## Calculate Functional Enrichment using Annotations:

In [None]:

[calc_enrichment]

#Path to Control Summary statistics File
parameter: ctrl_sumstats = str
#Path to AD Summary statistics File
parameter: AD_sumstats = str
#Path to Reference LD Scores File Directory 
parameter: ref_ld = str
#Path to LD Weight Files for Control Sumstats (Format like minimal working example)
parameter: w_ld_ctrl = str
#path to frequency files for Control Sumstats (Format like minimal working example)
parameter: frq_file_ctrl = str
#Path to LD Weight Files for AD Sumstats (Format like minimal working example)
parameter: w_ld_AD = str
#path to frequency files for AD Sumstats (Format like minimal working example)
parameter: frq_file_AD = str
#Number of Features
parameter: num_features = int 
#Control Phenotype, For Output
parameter: pheno = str

bash: expand = '${ }'
    seq 1 ${num_features}| xargs -n 1 -I j -P 4 python2 /mnt/mfs/statgen/Anmol/ldsc/ldsc.py --h2 ${ctrl_sumstats} --ref-ld-chr ${ref_ld}/base_chr_,${ref_ld}/feat_j_chr_ --w-ld-chr ${w_ld_ctrl} --overlap-annot --frqfile-chr ${frq_file_ctrl} --out ${ref_ld}/${pheno}_feat_j
    seq 1 ${num_features}| xargs -n 1 -I j -P 4 python2 /mnt/mfs/statgen/Anmol/ldsc/ldsc.py --h2 ${AD_sumstats} --ref-ld-chr ${ref_ld}/AD_base_chr_,${ref_ld}/AD_feat_j_chr_ --w-ld-chr ${w_ld_AD} --overlap-annot --frqfile-chr ${frq_file_AD} --out ${ref_ld}/AD_feat_j
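
Each `ldsc.py --h2` run above writes its output under the `--out` prefix. Assuming the partitioned results land in a tab-separated `.results` file with `Category`, `Enrichment`, and `Enrichment_p` columns (my understanding of LDSC's output; check it against your own files), the enrichments can be collected with a short parser. The table below is toy input with invented numbers, not real output.

```python
import csv
import io


def read_enrichment(results_text):
    """Map each category to (enrichment, p-value) from a .results table.

    Assumes tab-separated text with Category, Enrichment, and Enrichment_p
    columns; other columns are ignored.
    """
    reader = csv.DictReader(io.StringIO(results_text), delimiter="\t")
    return {
        row["Category"]: (float(row["Enrichment"]), float(row["Enrichment_p"]))
        for row in reader
    }


# Toy table: category names and all numbers are invented for illustration.
toy = (
    "Category\tEnrichment\tEnrichment_p\n"
    "baseL2_0\t1.0\t0.5\n"
    "feat_1L2_1\t2.0\t0.25\n"
)
enrichment = read_enrichment(toy)
print(enrichment)
```

For a real run, the text would come from reading a file such as the one written under `${ref_ld}/AD_feat_j` for feature `j`.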