move null model to separate script

Instead of running the null model as the first step of assoc.py, it is now run in a separate script null_model.py. The output of the null model script is then used as input to assoc.py.
UW-GAC · Jun 28, 2019 · a0da1df · a0da1df
1 parent 7a2a7ff
commit a0da1df
Show file tree

Hide file tree

Showing 30 changed files with 192 additions and 145 deletions.
diff --git a/README.md b/README.md
@@ -199,6 +199,8 @@ config parameter | default value | description
 
 ## Association testing
 
+### Null model
+
 Association tests are done with a mixed model if a kinship matrix (`pcrelate_file`) or GRM (`grm_file`) is given in the config file. If `pcrelate_file` and `grm_file` are both `NA` or missing, testing is done with a fixed effects model.
 
 When combining samples from groups with different variances for a trait (e.g., study or ancestry group), it is recommended to allow the null model to fit heterogeneous variances by group using the parameter `group_var`. The default pipeline options will then result in the following procedure:
@@ -214,16 +216,13 @@ When combining samples from groups with different variances for a trait (e.g., s
     - Include covariates and PCs as fixed effects
     - Include kinship as random effect
 
-The effect estimate is for the alternate alelle, and multiple alternate alelles for a single variant are treated separately.
-
-Association tests have an additional level of parallelization: by segment within chromosome. The R scripts take an optional `"--segment"` (or `"-s"`) argument. The python script `assoc.py` uses the environment variable `SGE_TASK_ID` to submit jobs by segment for each chromosome. By default each segment is 10 Mb in length, but this may be changed by using the arguments `"--segment_length"` or `"--n_segments"`. Note that `"--n_segments"` defines the number of segments for the entire genome, so using this argument with selected chromosomes may result in fewer segments than you expect (and the minimum is one segment per chromosome).
+`null_model.py`
 
-### Parameters common to all association tests
+1. `null_model.R`
 
 config parameter | default value | description
 --- | --- | ---
 `out_prefix` | | Prefix for files created by this script.
-`gds_file` | | GDS file. Include a space to insert chromosome.
 `pca_file` | `NA` | RData file with PCA results created by `pcair.py`.
 `pcrelate_file` | `NA` | RData file with 2*kinship created by `pcrelate.py`. 
 `grm_file` | `NA` | GDS file with GRM created by `grm.py`.
@@ -236,8 +235,24 @@ config parameter | default value | description
 `norm_bygroup` | `FALSE` | If `TRUE` and `group_var` is provided, the inverse normal transform is done on each group separately.
 `rescale_variance` | `marginal` | Applies only if `inverse_normal` is `TRUE`. Controls whether to rescale the variance after inverse-normal transform, restoring it to the original variance before the transform. Options are `marginal`, `varcomp`, or `none`.
 `n_pcs` | `0` | Number of PCs to include as covariates.
-`conditional_variant_file` | `NA` | RData file with data frame of of conditional variants. Columns should include `chromosome` and `variant.id`. The alternate allele dosage of these variants will be included as covariates in the analysis.
+`conditional_variant_file` | `NA` | RData file with data frame of of conditional variants. Columns should include `chromosome` and `variant.id`. The alternate allele dosage of these variants will be included as covariates in the analysis. 
+`gds_file` | `NA` | GDS file. Include a space to insert chromosome. Required if `conditional_variant_file` is specified.
 `sample_include_file` | `NA` | RData file with vector of sample.id to include. 
+
+
+### Parameters common to all association tests
+
+The effect estimate is for the alternate alelle, and multiple alternate alelles for a single variant are treated separately.
+
+Association tests have an additional level of parallelization: by segment within chromosome. The R scripts take an optional `"--segment"` (or `"-s"`) argument. The python script `assoc.py` uses the environment variable `SGE_TASK_ID` to submit jobs by segment for each chromosome. By default each segment is 10 Mb in length, but this may be changed by using the arguments `"--segment_length"` or `"--n_segments"`. Note that `"--n_segments"` defines the number of segments for the entire genome, so using this argument with selected chromosomes may result in fewer segments than you expect (and the minimum is one segment per chromosome).
+
+config parameter | default value | description
+--- | --- | ---
+`out_prefix` | | Prefix for files created by this script.
+`gds_file` | | GDS file. Include a space to insert chromosome.
+`null_model_file` | | RData file with null model from `null_model.py`.
+`null_model_params` | | Parameter file ending in `null_model.params` in the `report` directory from `null_model.py`.
+`phenotype_file` | | RData file with AnnotatedDataFrame of phenotypes. Use the output phenotype file from `null_model.py`.
 `variant_include_file` | `NA` | RData file with vector of variant.id to include. 
 `variant_block_size` | `1024` | Number of variants to read in a single block.
 `pass_only` | `TRUE` | `TRUE` to select only variants with FILTER=PASS.
@@ -324,13 +339,6 @@ The segment file created at the start of each association test contains the chro
 The script [`assoc.py`](assoc.py) submits a SGE array job for each chromosome, where the SGE task id is the row number of the segment in the segments file. If a segment has no requested variants, its job will exit without error. After all segments are complete, they are combined into a single file for each chromosome and the temporary per-segment output files are deleted.
 
 
-### Multiple tests with the same null model
-
-To run additional tests using the same null model as a previous test, add the config parameters `null_model_file` and `null_model_params`. `null_model_file` is the output file created by a previous association test run. `null_model_params` is the parameter file ending in `null_model.params` in the `report` directory for the previous association test. The parameter file is needed to generate the report for the new test.
-
-If the number of samples in the initial phenotype file was less than the total number of samples in the GDS file, also provide `phenotype_file` as the output phenotype file created along with the null model file.
-
-
 
 ## LocusZoom
 

diff --git a/assoc.py b/assoc.py
@@ -32,8 +32,6 @@
                     help="type of compute cluster environment [default %(default)s]")
 parser.add_argument("--cluster_file", default=None,
                     help="json file containing cluster options")
-parser.add_argument("-n", "--ncores", default="1-8",
-                    help="number of cores to use; either a number (e.g, 1) or a range of numbers (e.g., 1-4) [default %(default)s]")
 parser.add_argument("-e", "--email", default=None,
                     help="email address for job reporting")
 parser.add_argument("--print_only", action="store_true", default=False,
@@ -52,7 +50,6 @@
 n_segments = args.n_segments
 cluster_file = args.cluster_file
 cluster_type = args.cluster_type
-ncores = args.ncores
 email = args.email
 print_only = args.print_only
 verbose = args.verbose
@@ -73,30 +70,10 @@
 #     {jobname: [jobids]}
 hold_null_agg = []
 
-# null model
-job = "null_model"
-
-# if a null model file is given in the config, skip this step
-run_null_model = "null_model_file" not in configdict
-if run_null_model:
-
-    rscript = os.path.join(pipeline, "R", job + ".R")
-
-    config = deepcopy(configdict)
-    config["out_file"] = configdict["data_prefix"] + "_null_model.RData"
-    config["out_phenotype_file"] = configdict["data_prefix"] + "_phenotypes.RData"
-    configfile = configdict["config_prefix"] + "_" + job + ".config"
-    TopmedPipeline.writeConfig(config, configfile)
-
-    submitID = cluster.submitJob(job_name=job, cmd=driver, args=[rscript, configfile, version], request_cores=ncores, email=email, print_only=print_only)
-
-    hold_null_agg.append(submitID)
-else:
-    print("Using null model in " + configdict["null_model_file"])
-    # copy parameter file for report
-    if "null_model_params" in configdict:
-        paramfile = os.path.basename(configdict["config_prefix"]) + "_" + job + ".config.null_model.params"
-        copyfile(configdict["null_model_params"], paramfile)
+# copy parameter file for report
+if "null_model_params" in configdict:
+    paramfile = os.path.basename(configdict["config_prefix"]) + "_null_model.config.null_model.params"
+    copyfile(configdict["null_model_params"], paramfile)
 
 
 # for aggregate tests, generate variant list
@@ -144,11 +121,6 @@
 # set up config for association test
 config = deepcopy(configdict)
 config["assoc_type"] = assoc_type
-# if we just ran the null model, use output files as input for assoc test
-# otherwise, these parameters should already be in the config
-if run_null_model:
-    config["null_model_file"] = configdict["data_prefix"] + "_null_model.RData"
-    config["phenotype_file"] = configdict["data_prefix"] + "_phenotypes.RData"
 
 if assoc_type == "aggregate":
     config["aggregate_variant_file"] = configdict["data_prefix"] + "_aggregate_list_chr .RData"

diff --git a/null_model.py b/null_model.py
@@ -0,0 +1,83 @@
+#! /usr/bin/env python2.7
+
+"""Association tests"""
+
+import TopmedPipeline
+import sys
+import os
+import subprocess
+from time import localtime, strftime
+from argparse import ArgumentParser
+from copy import deepcopy
+from shutil import copyfile
+from datetime import datetime, timedelta
+
+description = """
+Association tests
+"""
+
+default_segment_length = "10000"
+
+parser = ArgumentParser(description=description)
+parser.add_argument("config_file", help="configuration file")
+parser.add_argument("--cluster_type", default="UW_Cluster",
+                    help="type of compute cluster environment [default %(default)s]")
+parser.add_argument("--cluster_file", default=None,
+                    help="json file containing cluster options")
+parser.add_argument("-n", "--ncores", default="12",
+                    help="number of cores to use; either a number (e.g, 1) or a range of numbers (e.g., 1-4) [default %(default)s]")
+parser.add_argument("-e", "--email", default=None,
+                    help="email address for job reporting")
+parser.add_argument("--print_only", action="store_true", default=False,
+                    help="print qsub commands without submitting")
+parser.add_argument("--verbose", action="store_true", default=False,
+                    help="enable verbose output to help debug")
+parser.add_argument("--version", action="version",
+                    version="TopmedPipeline "+TopmedPipeline.__version__,
+                    help="show the version number and exit")
+args = parser.parse_args()
+
+configfile = args.config_file
+cluster_file = args.cluster_file
+cluster_type = args.cluster_type
+ncores = args.ncores
+email = args.email
+print_only = args.print_only
+verbose = args.verbose
+
+version = "--version " + TopmedPipeline.__version__
+
+cluster = TopmedPipeline.ClusterFactory.createCluster(cluster_type, cluster_file, verbose)
+
+pipeline = cluster.getPipelinePath()
+driver = os.path.join(pipeline, "runRscript.sh")
+
+configdict = TopmedPipeline.readConfig(configfile)
+configdict = TopmedPipeline.directorySetup(configdict, subdirs=["config", "data", "log", "plots", "report"])
+
+# analysis init
+cluster.analysisInit(print_only=print_only)
+
+# null model
+job = "null_model"
+
+rscript = os.path.join(pipeline, "R", job + ".R")
+
+config = deepcopy(configdict)
+config["out_file"] = configdict["data_prefix"] + "_null_model.RData"
+config["out_phenotype_file"] = configdict["data_prefix"] + "_phenotypes.RData"
+configfile = configdict["config_prefix"] + "_" + job + ".config"
+TopmedPipeline.writeConfig(config, configfile)
+
+submitID = cluster.submitJob(job_name=job, cmd=driver, args=[rscript, configfile, version], request_cores=ncores, email=email, print_only=print_only)
+
+
+# post analysis
+job = "post_analysis"
+jobpy = job + ".py"
+pcmd=os.path.join(pipeline, jobpy)
+argList = [pcmd, "-a", cluster.getAnalysisName(), "-l", cluster.getAnalysisLog(),
+           "-s", cluster.getAnalysisStartSec()]
+pdriver=os.path.join(pipeline, "run_python.sh")
+cluster.submitJob(job_name=job, cmd=pdriver, args=argList,
+                  holdid=submitID, print_only=print_only)
diff --git a/testdata/assoc_aggregate_allele.config b/testdata/assoc_aggregate_allele.config
@@ -1,11 +1,8 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file "testdata/pcrelate_Matrix.RData"
-pca_file "testdata/pcair.RData"
-outcome status
-covars "sex"
-n_pcs 4
+null_model_file "testdata/null_model.RData"
+null_model_params "testdata/null_model.params"
 variant_group_file "testdata/coding_variants_chr .RData"
 aggregate_type "allele"
 test "burden"

diff --git a/testdata/assoc_aggregate_position.config b/testdata/assoc_aggregate_position.config
@@ -1,13 +1,9 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file "testdata/pcrelate_Matrix.RData"
-pca_file "testdata/pcair.RData"
-sample_include_file "testdata/sample_include.RData"
+null_model_file "testdata/null_model.RData"
+null_model_params "testdata/null_model.params"
 variant_include_file "testdata/variant_include_chr .RData"
-outcome outcome
-covars "sex"
-n_pcs 4
 alt_freq_max "0.1"
 variant_group_file "testdata/genes_chr .RData"
 aggregate_type "position"

diff --git a/testdata/assoc_aggregate_weights.config b/testdata/assoc_aggregate_weights.config
@@ -1,12 +1,8 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file "testdata/pcrelate_Matrix.RData"
-pca_file "testdata/pcair.RData"
-sample_include_file "testdata/sample_include.RData"
-outcome outcome
-covars "sex"
-n_pcs 4
+null_model_file "testdata/null_model.RData"
+null_model_params "testdata/null_model.params"
 alt_freq_max "0.1"
 variant_group_file "testdata/coding_variants_weights.RData"
 aggregate_type "allele"

diff --git a/testdata/assoc_single.config b/testdata/assoc_single.config
@@ -1,12 +1,8 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file "testdata/pcrelate_Matrix.RData"
-pca_file "testdata/pcair.RData"
-sample_include_file "testdata/sample_include.RData"
+null_model_file "testdata/null_model.RData"
+null_model_params "testdata/null_model.params"
 variant_include_file "testdata/variant_include_chr .RData"
-outcome outcome
-covars "sex Population"
-n_pcs 4
 mac_threshold 3
 genome_build "hg19"
diff --git a/testdata/assoc_single_wald.config b/testdata/assoc_single_wald.config
@@ -1,13 +1,9 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file "testdata/pcrelate_Matrix.RData"
-pca_file "testdata/pcair.RData"
-sample_include_file "testdata/sample_include.RData"
+null_model_file "testdata/null_model.RData"
+null_model_params "testdata/null_model.params"
 variant_include_file "testdata/variant_include_chr .RData"
-outcome outcome
-covars "sex"
-n_pcs 4
 maf_threshold 0.01
 test_type "wald"
 genome_build "hg19"
diff --git a/testdata/assoc_window_burden.config b/testdata/assoc_window_burden.config
@@ -1,13 +1,9 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file "testdata/pcrelate_Matrix.RData"
-pca_file "testdata/pcair.RData"
-sample_include_file "testdata/sample_include.RData"
+null_model_file "testdata/null_model.RData"
+null_model_params "testdata/null_model.params"
 variant_include_file "testdata/variant_include_chr .RData"
-outcome outcome
-covars "sex"
-n_pcs 4
 alt_freq_max "0.1"
 test "burden"
 test_type "score"

diff --git a/testdata/assoc_window_burden_binary.config b/testdata/assoc_window_burden_binary.config
@@ -1,12 +1,8 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file "testdata/pcrelate_Matrix.RData"
-pca_file "testdata/pcair.RData"
-outcome status
-binary TRUE
-covars "sex"
-n_pcs 4
+null_model_file "testdata/null_model_binary.RData"
+null_model_params "testdata/binary.null_model.params"
 alt_freq_max "0.1"
 test "burden"
 test_type "wald"

diff --git a/testdata/assoc_window_burden_wald.config b/testdata/assoc_window_burden_wald.config
@@ -1,13 +1,9 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file "testdata/pcrelate_Matrix.RData"
-pca_file "testdata/pcair.RData"
-sample_include_file "testdata/sample_include.RData"
+null_model_file "testdata/null_model.RData"
+null_model_params "testdata/null_model.params"
 variant_include_file "testdata/variant_include_chr .RData"
-outcome outcome
-covars "sex"
-n_pcs 4
 alt_freq_max "0.1"
 test "burden"
 test_type "wald"

diff --git a/testdata/assoc_window_skat.config b/testdata/assoc_window_skat.config
@@ -1,11 +1,8 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file "testdata/pcrelate_Matrix.RData"
-pca_file "testdata/pcair.RData"
-outcome outcome
-covars "sex"
-n_pcs 4
+null_model_file "testdata/null_model.RData"
+null_model_params "testdata/null_model.params"
 alt_freq_max "0.1"
 test "skat"
 weight_beta "1 25"

diff --git a/testdata/assoc_window_skato.config b/testdata/assoc_window_skato.config
@@ -1,11 +1,8 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file "testdata/pcrelate_Matrix.RData"
-pca_file "testdata/pcair.RData"
-outcome outcome
-covars "sex"
-n_pcs 4
+null_model_file "testdata/null_model.RData"
+null_model_params "testdata/null_model.params"
 alt_freq_max "0.1"
 test "skato"
 rho "0 0.5 1"

diff --git a/testdata/assoc_window_unrel.config b/testdata/assoc_window_unrel.config
@@ -1,13 +1,9 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file NA
-pca_file "testdata/pcair.RData"
-sample_include_file "testdata/unrelated.RData"
+null_model_file "testdata/null_model_unrel.RData"
+null_model_params "testdata/unrel.null_model.params"
 variant_include_file "testdata/variant_include_chr .RData"
-outcome outcome
-covars "sex"
-n_pcs 4
 alt_freq_max "0.1"
 test "burden"
 test_type "score"

diff --git a/testdata/assoc_window_weights.config b/testdata/assoc_window_weights.config
@@ -1,12 +1,8 @@
 out_prefix "test"
 gds_file "testdata/1KG_phase3_subset_chr .gds"
 phenotype_file "testdata/1KG_phase3_subset_annot.RData"
-pcrelate_file "testdata/pcrelate_Matrix.RData"
-pca_file "testdata/pcair.RData"
-sample_include_file "testdata/sample_include.RData"
-outcome outcome
-covars "sex"
-n_pcs 4
+null_model_file "testdata/null_model.RData"
+null_model_params "testdata/null_model.params"
 alt_freq_max "0.1"
 test "burden"
 test_type "score"