Update examples

getian107 · Jun 13, 2023 · bdabe58 · bdabe58
1 parent 9bd2860
commit bdabe58
Show file tree

Hide file tree

Showing 10 changed files with 217 additions and 40 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # SuSiEx
 
-**SuSiEx** is a Python based command line tool that performs cross-ethnic fine-mapping using GWAS summary statistics and LD reference panels. The method is built on the Sum of Single Effects (SuSiE) model:
+**SuSiEx** is a C++ based command line tool that performs cross-ancestry fine-mapping using GWAS summary statistics and LD reference panels. The method is built on the Sum of Single Effects (SuSiE) model:
 
 Gao Wang, Abhishek Sarkar, Peter Carbonetto, Matthew Stephens. A simple new approach to variable selection in regression, with application to genetic fine-mapping. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, https://doi.org/10.1111/rssb.12388, 2020.
 
@@ -13,28 +13,62 @@ Gao Wang, Abhishek Sarkar, Peter Carbonetto, Matthew Stephens. A simple new appr
 
     Alternatively, download the source files from the github website (https://github.com/getian107/SuSiEx).
 
-- **SuSiEx** requires Python package **scipy** (https://www.scipy.org/) installed.
+- **SuSiEx** requires C++ compiler **GCC** (https://gcc.gnu.org/) and parallel programming library **OpenMP** (https://www.openmp.org/) installed.
 
-- Once Python and its dependencies have been installed, running
+- Compile SuSiEx using the following command:
+
+    `cd src`
+
+    `make all`
 
-    `./SuSiEx.py --help` or `./SuSiEx.py -h`
+- Once the compling is done, the executable SuSiEx will be found in the _bin_ folder. Type 
+
+    `../bin/SuSiEx --help` or `../bin/SuSiEx -h`
 
     will print a list of command-line options.
 
 
 ## Using SuSiEx
 
-`
-python SuSiEx.py --sst_file=SUM_STATS_FILE --n_gwas=GWAS_SAMPLE_SIZE --ref_file=REF_FILE --ld_file=LD_MATRIX_FILE --out_dir=OUTPUT_DIR --out_name=OUTPUT_FILENAME --chr=CHR --bp=BP --chr_col=CHR_COL --snp_col=SNP_COL --bp_col=BP_COL --a1_col=A1_COL --a2_col=A2_COL --eff_col=EFF_COL --se_col=SE_COL --pval_col=PVAL_COL --plink=PLINK [--keep-ambig=KEEP_AMBIGUOUS_SNPS --maf=MAF_THRESHOLD --n_sig=NUMBER_OF_SIGNALS --level=LEVEL --min_purity=MINIMUM_PURITY --mult-step=MULT_STEP_FITTING --full_out=FULL_OUTPUT --pval_thresh=MARGINAL_PVAL_THRESHOLD --max_iter=MAXIMUM_ITERATIONS --tol=TOLERANCE]
-`
+```
+SuSiEx \
+  --sst_file=SUM_STATS_FILE \ 
+  --n_gwas=GWAS_SAMPLE_SIZE \ 
+  --ref_file=REF_FILE \ 
+  --ld_file=LD_MATRIX_FILE \ 
+  --out_dir=OUTPUT_DIR \ 
+  --out_name=OUTPUT_FILENAME \ 
+  --chr=CHR \ 
+  --bp=BP \ 
+  --chr_col=CHR_COL \ 
+  --snp_col=SNP_COL \ 
+  --bp_col=BP_COL \ 
+  --a1_col=A1_COL \ 
+  --a2_col=A2_COL \ 
+  --eff_col=EFF_COL \ 
+  --se_col=SE_COL \ 
+  --pval_col=PVAL_COL \ 
+  --plink=PLINK \ 
+  [ --keep-ambig=KEEP_AMBIGUOUS_SNPS \ ]
+  [ --maf=MAF_THRESHOLD \ ]
+  [ --n_sig=NUMBER_OF_SIGNALS \ ]
+  [ --level=LEVEL \ ]
+  [ --min_purity=MINIMUM_PURITY \ ]
+  [ --mult-step=MULT_STEP_FITTING \ ]
+  [ --full_out=FULL_OUTPUT \ ]
+  [ --pval_thresh=MARGINAL_PVAL_THRESHOLD \ ]
+  [ --max_iter=MAXIMUM_ITERATIONS \ ]
+  [ --tol=TOLERANCE \ ]
+  [ --threads=N_THREADS ]
+```
 
 - SUM_STATS_FILE (required): Full path and filename of the GWAS summary statistics for each population, separated by comma. Each file must contain a header line. The column corresponding to the effect size estimates must have a header of BETA or OR, indicating whether the effect estimates are regression coefficients or odds ratios.
 
 - GWAS_SAMPLE_SIZE (required): Sample size of the GWAS for each population, in the order corresponding to the GWAS summary statistics files, separated by comma.
 
 - REF_FILE (optional): Full path and filename prefix of the reference panel for each population in PLINK binary format (.bed/.bim/.fam), in the order corresponding to the GWAS summary statistics files, separated by comma. The format for SNP IDs (e.g., rs IDs or chr:bp:a1:a2) and version of genome build must be consistent across GWAS summary statistics files and reference panels. A curated 1000 Genomes phase 3 reference panle in PLINK format can be download from the MAGMA website: https://ctg.cncr.nl/software/magma.
 
-- LD_MATRIX_FILE (required): Full path and filename prefix of the LD matrix to be computed from the reference panels for each population, in the order corresponding to the GWAS summary statistics files, separated by comma. If LD matrices with the specified filenames already exist in the specified directory, the calculation of LD will be skipped and existing LD files will be used directly. A script for precomputing LD matrices and allele frequency files is [available](https://www.dropbox.com/s/lyh7cwaabm8sxw7/SuSiEx_LD.py?dl=0 "available").
+- LD_MATRIX_FILE (required): Full path and filename prefix of the LD matrix to be computed from the reference panels for each population, in the order corresponding to the GWAS summary statistics files, separated by comma. If LD matrices with the specified filenames already exist in the specified directory, the calculation of LD will be skipped and existing LD files will be used directly. A script for precomputing LD matrices and allele frequency files is available in the utilities folder.
 
 - OUTPUT_DIR (required): Full path of the output directory.
 
@@ -54,7 +88,7 @@ python SuSiEx.py --sst_file=SUM_STATS_FILE --n_gwas=GWAS_SAMPLE_SIZE --ref_file=
 
 - A2_COL (required): The column number of the A2 allele (alternative allele) in each GWAS summary statistics file, separated by comma.
 
-- EFF_COL (required): The column number of the effect size estimate (beta or odds ratio) in each GWAS summary statistics file, separated by comma.
+- EFF_COL (required): The column number of the effect size estimate (beta or odds ratio) in each GWAS summary statistics file, separated by comma. SuSiEx will automatically recognize columns labeled with a prefix of "beta" as beta and a prefix of "or" as odds ratio (case-insensitive).
 
 - SE_COL (required): The column number of the standard error of the effect size estimate in each GWAS summary statistics file, separated by comma.
 
@@ -82,12 +116,12 @@ python SuSiEx.py --sst_file=SUM_STATS_FILE --n_gwas=GWAS_SAMPLE_SIZE --ref_file=
 
 - TOLERANCE (optional): Tolerance for the convergence of the variational algorithm. Default is 1e-04.
 
+- N_THREADS (optional): Number of threads for computation. Default is 1.
+
 
 ## Output
 
 A `.summary` file, a `.cs` file and a `.snp` file will be written to the specified output directory.
-The `.snp` file contains a list of SNPs that are used in the fine-mapping algorithm.
-These are the SNPs that are located in the specified fine-mapping region, available in the GWAS summary statistics and the reference panel for at least one population, and survived the minor allele frequency filtering.
 
 If the varitional algorithm did not converge, "FAIL" will be written to both the `.summary` file and the `.cs` file.
 If no credible set was identified at the specified coverage level after purity and marginal p-value filtering, "NULL" will be written to both the `.summary` file and the `.cs` file.
@@ -118,8 +152,9 @@ Otherwise, the `.summary` file contains credible set level information, which ha
 
 - MAX_PIP: Maximum posterior inclusion probability (PIP) in the credible set.
 
+- POST-HOC_PROB_POP${i}: _Post hoc_ probability credible set manifest causal in population ${i}.
 
-The `.cs` file contains information for all the SNPs included in credible sets (if `--full_out=False`) or all the SNPs included in fine-mapping (if `--full_out=True`), and has the following columns:
+The `.cs` file contains information for all the SNPs included in credible sets and has the following columns:
 
 - CS_ID: ID of the credible set.
 
@@ -143,35 +178,49 @@ The `.cs` file contains information for all the SNPs included in credible sets (
 
 - OVRL_PIP: Posterior inclusion probability (PIP) of the SNP in any of the credible set.
 
+The `.snp` file contains information for all the SNPs that are used in the fine-mapping algorithm.
+These are the SNPs that are located in the specified fine-mapping region, available in the GWAS summary statistics and the reference panel for at least one population, and survived the minor allele frequency filtering.
+It has the following columns:
+
+- BP: The base pair coordinate of the SNP.
+
+- SNP: SNP identifier.
+
+- LogBF(CS${i},Pop${j}): The Logarithm of Bayes factor for credible set ${i}, population ${j}. If the variational algorithm successfully converges and credible sets remain after the purity and marginal p-value filtering, the output will include column(s) of Logarithm of Bayes Factor for each population and credible set.
+
 
 ## Example
 
 ```
-python SuSiEx.py \
-    --sst_file=${sst_eur},${sst_afr} \
-    --n_gwas=${n_eur},${n_afr} \
-    --ref_file=${ref_eur},${ref_afr} \
-    --ld_file=${ld_eur},${ld_afr} \
-    --out_dir=${out_dir} \
-    --out_name=${out_name} \
-    --chr=1 \
-    --bp=94813205,95812998 \
-    --chr_col=1,1 \
-    --snp_col=2,2 \
-    --bp_col=3,3 \
-    --a1_col=4,4 \
-    --a2_col=5,5 \
-    --eff_col=8,8 \
-    --se_col=9,9 \
-    --pval_col=10,10 \
-    --keep-ambig=False \
-    --mult-step=True \
-    --maf=0.01 \
-    --plink=${plink_dir}/plink_v1.90
+cd eamples
+../bin/SuSiEx \
+	--sst_file=EUR.sumstats.txt,AFR.sumstats.txt \
+	--n_gwas=50000,50000 \
+	--ref_file=EUR,AFR \
+	--ld_file=EUR,AFR \
+	--out_dir=./ \
+	--out_name=SuSiEx.EUR.AFR.output.cs95 \
+	--level=0.95 \
+	--pval_thresh=1e-5 \
+	--maf=0.005
+	--chr=1 \
+	--bp=7314654,8314677 \
+	--snp_col=2,2 \
+	--chr_col=1,1 \
+	--bp_col=3,3 \
+	--a1_col=4,4 \
+	--a2_col=5,5 \
+	--eff_col=6,6 \
+	--se_col=7,7 \
+	--pval_col=9,9 \
+	--plink=../utilities/plink \
+	--mult-step=True \
+	--keep-ambig=True \
+	--threads=16
 ```
 
 
 ## Support
 
-Please direct questions or bug reports to Tian Ge (tge1@mgh.harvard.edu).
+Please direct questions or bug reports to Kai Yuan (kyuan@broadinstitute.org) and Tian Ge (tge1@mgh.harvard.edu).
 
diff --git a/examples/cal_ld.sh b/examples/cal_ld.sh
@@ -0,0 +1,15 @@
+python2 ../utilities/SuSiEx_LD.py \
+	--ref_file=EUR \
+	--ld_file=EUR \
+	--chr=1 \
+	--bp=7314654,8314677 \
+	--plink=../utilities/plink \
+	--maf=0.005
+python2 ../utilities/SuSiEx_LD.py \
+	--ref_file=AFR \
+	--ld_file=AFR \
+	--chr=1 \
+	--bp=7314654,8314677 \
+	--plink=../utilities/plink \
+	--maf=0.005
+
diff --git a/examples/run.sh b/examples/run.sh
@@ -17,7 +17,7 @@
 	--eff_col=6,6 \
 	--se_col=7,7 \
 	--pval_col=9,9 \
-	--plink=plink \
+	--plink=../utilities/plink \
 	--mult-step=True \
 	--keep-ambig=True \
 	--threads=16
diff --git a/src/main.cpp b/src/main.cpp
@@ -50,8 +50,8 @@ static struct option susiexOption[] =
 
 void susiexHelp()
 {
-	std::cout << "SuSiExCPP" << std::endl;
-	std::cout << "Usage: SuSiExCPP <[options]>" << std::endl;
+	std::cout << "SuSiEx" << std::endl;
+	std::cout << "Usage: SuSiEx <[options]>" << std::endl;
 	std::cout << "Options:" << std::endl;
 	std::cout << "\t--sst_file      /  -s       SUM_STATS_FILE (required): Full path and filename of the GWAS summary statistics for each population, separated by comma. Each file must contain a header line. The column corresponding to the effect size estimates must have a header of BETA or OR, indicating whether the effect estimates are regression coefficients or odds ratios." << std::endl;
 	std::cout << "\t--n_gwas        /  -n       GWAS_SAMPLE_SIZE (required): Sample size of the GWAS for each population, in the order corresponding to the GWAS summary statistics files, separated by comma." << std::endl;
@@ -79,7 +79,7 @@ void susiexHelp()
 	std::cout << "\t--pval_thresh   /  -a       MARGINAL_PVAL_THRESHOLD (optional): Filtering threshold for the marginal p-value. Credible sets containing no marginal p-value below this specified value will be filtered out. Default is 1e-05." << std::endl;
 	std::cout << "\t--max_iter      /  -I       MAXIMUM_ITERATIONS (optional): Maximum number of iterations allowed for the model fitting algorithm. Default is 100." << std::endl;
 	std::cout << "\t--tol           /  -t       TOLERANCE (optional): Tolerance for the convergence of the variational algorithm. Default is 1e-04." << std::endl;
-	std::cout << "\t--threads       /  -X       NUMBER_OF_THREADS (optional): Number of threads. Default is 1." << std::endl;
+	std::cout << "\t--threads       /  -X       N_THREADS (optional): Number of threads for computation. Default is 1." << std::endl;
 	std::cout << "\t--help          /  -h       Print this help." << std::endl;
 	exit(0);
 }

diff --git a/src/model.cpp b/src/model.cpp
@@ -460,7 +460,7 @@ int susiex::write_cs(int convergent)
 	}
 	else if(par.key_by == 1)
 	{
-		fpsnp << "Pos\tA1\tA2";
+		fpsnp << "BP\tA1\tA2";
 		for(int i = 1 ; i <= ncs; ++i)
 			for(int j = 1; j <= npop; ++j)
 				fpsnp << "\tLogBF(CS" << i << ",Pop" << j << ")";
@@ -483,7 +483,7 @@ int susiex::write_cs(int convergent)
 	}
 	else if(par.key_by == 2)
 	{
-        fpsnp << "Pos\tSNP";
+        fpsnp << "BP\tSNP";
         for(int i = 1 ; i <= ncs; ++i)
         	for(int j = 1; j <= npop; ++j)
         		fpsnp << "\tLogBF(CS" << i << ",Pop" << j << ")";

diff --git a/utilities/.DS_Store b/utilities/.DS_Store
diff --git a/utilities/SuSiEx_LD.py b/utilities/SuSiEx_LD.py
@@ -0,0 +1,113 @@
+#!/usr/bin/env python
+
+"""
+SuSiEx_LD: LD calculation for SuSiEx
+
+Usage:
+
+python SuSiEx.py --ref_file=REF_FILE --ld_file=LD_MATRIX_FILE --chr=CHR --bp=BP --plink=PLINK --maf=MAF_THRESHOLD
+
+"""
+
+
+import sys
+import getopt
+import subprocess
+
+
+def parse_param():
+    long_opts_list = ['ref_file=', 'ld_file=', 'chr=', 'bp=', 'plink=', 'maf=']
+
+    param_dict = {'ref_file': None, 'ld_file': None, 'chr': None, 'bp': None, 'plink': None, 'maf': 0.005}
+
+    print('\n')
+
+    if len(sys.argv) > 1:
+        try:
+            opts, args = getopt.getopt(sys.argv[1:], "h", long_opts_list)          
+        except:
+            print('Option not recognized.')
+            print('Use --help for usage information.\n')
+            sys.exit(2)
+
+        for opt, arg in opts:
+            if opt == "-h" or opt == "--help":
+                print(__doc__)
+                sys.exit(0)
+            elif opt == "--ref_file": param_dict['ref_file'] = arg
+            elif opt == "--ld_file": param_dict['ld_file'] = arg
+            elif opt == "--chr": param_dict['chr'] = int(arg)
+            elif opt == "--bp": param_dict['bp'] = map(int, arg.split(','))
+            elif opt == "--plink": param_dict['plink'] = arg
+            elif opt == "--maf": param_dict['maf'] = float(arg)
+    else:
+        print(__doc__)
+        sys.exit(0)
+
+
+    if param_dict['ref_file'] == None:
+        print('* Please provide the directory and filename prefix of the reference panel using --ref_file\n')
+        sys.exit(2)
+    elif param_dict['ld_file'] == None:
+        print('* Please provide the directory and filename prefix of the LD matrix to be computed using --ld_file\n')
+        sys.exit(2)
+    elif param_dict['chr'] == None:
+        print('* Please provide the chromosome code of the fine-mapping region using --chr\n')
+        sys.exit(2)
+    elif param_dict['bp'] == None or len(param_dict['bp']) != 2:
+        print('* Please provide the start and end base pair coordinate of the fine-mapping region using --bp\n')
+        sys.exit(2)
+    elif param_dict['plink'] == None:
+        print('* Please provide the directory and filename of PLINK using --plink\n')
+        sys.exit(2)
+
+    for key in param_dict:
+        print('--%s=%s' % (key, param_dict[key]))
+
+    print('\n')
+    return param_dict
+
+
+def main():
+    param_dict = parse_param()
+    print('... calculate LD matrix ...')
+
+    SNP = []
+    with open(param_dict['ref_file']+'.bim') as ff:
+        for line in ff:
+            ll = (line.strip()).split()
+            if int(ll[0]) == param_dict['chr'] and int(ll[3]) >= param_dict['bp'][0] and int(ll[3]) <= param_dict['bp'][1]:
+                SNP.append(ll[1])
+
+    with open(param_dict['ld_file']+'.snp', 'w') as ff:
+        for snp in SNP:
+            ff.write('%s\n' % snp)
+
+
+    cmd = '%s --bfile %s --keep-allele-order --chr %d --extract %s --maf %f --make-bed --out %s' \
+        % (param_dict['plink'], param_dict['ref_file'], param_dict['chr'], param_dict['ld_file']+'.snp', param_dict['maf'], param_dict['ld_file']+'_ref')
+    subprocess.check_output(cmd, shell=True)
+
+    cmd = '%s --bfile %s --keep-allele-order --r square bin4 --out %s' %(param_dict['plink'], param_dict['ld_file']+'_ref', param_dict['ld_file'])
+    subprocess.check_output(cmd, shell=True)
+
+    cmd = '%s --bfile %s --keep-allele-order --freq --out %s' %(param_dict['plink'], param_dict['ld_file']+'_ref', param_dict['ld_file']+'_frq')
+    subprocess.check_output(cmd, shell=True)
+
+    subprocess.check_output('rm '+param_dict['ld_file']+'.snp', shell=True)
+    subprocess.check_output('rm -rf '+param_dict['ld_file']+'.nosex', shell=True)
+    subprocess.check_output('rm -rf '+param_dict['ld_file']+'_*nosex', shell=True)
+    subprocess.check_output('rm '+param_dict['ld_file']+'_ref.bed', shell=True)
+    subprocess.check_output('rm '+param_dict['ld_file']+'_ref.fam', shell=True)
+    subprocess.check_output('rm '+param_dict['ld_file']+'_ref.log', shell=True)
+    subprocess.check_output('rm '+param_dict['ld_file']+'_frq.log', shell=True)
+    subprocess.check_output('rm '+param_dict['ld_file']+'.log', shell=True)
+
+    print('... Done ...\n')
+
+
+if __name__ == '__main__':
+    main()
+
+
+
diff --git a/utilities/plink b/utilities/plink
diff --git a/utilities/plink_linux_x86_64_20230116.zip b/utilities/plink_linux_x86_64_20230116.zip