Skip to content

Commit

Permalink
move null model to separate script
Browse files Browse the repository at this point in the history
Instead of running the null model as the first step of assoc.py, it
is now run in a separate script null_model.py. The output of the
null model script is then used as input to assoc.py.
  • Loading branch information
smgogarten committed Jun 28, 2019
1 parent 7a2a7ff commit a0da1df
Show file tree
Hide file tree
Showing 30 changed files with 192 additions and 145 deletions.
34 changes: 21 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -199,6 +199,8 @@ config parameter | default value | description

## Association testing

### Null model

Association tests are done with a mixed model if a kinship matrix (`pcrelate_file`) or GRM (`grm_file`) is given in the config file. If `pcrelate_file` and `grm_file` are both `NA` or missing, testing is done with a fixed effects model.

When combining samples from groups with different variances for a trait (e.g., study or ancestry group), it is recommended to allow the null model to fit heterogeneous variances by group using the parameter `group_var`. The default pipeline options will then result in the following procedure:
Expand All @@ -214,16 +216,13 @@ When combining samples from groups with different variances for a trait (e.g., s
- Include covariates and PCs as fixed effects
- Include kinship as random effect

The effect estimate is for the alternate alelle, and multiple alternate alelles for a single variant are treated separately.

Association tests have an additional level of parallelization: by segment within chromosome. The R scripts take an optional `"--segment"` (or `"-s"`) argument. The python script `assoc.py` uses the environment variable `SGE_TASK_ID` to submit jobs by segment for each chromosome. By default each segment is 10 Mb in length, but this may be changed by using the arguments `"--segment_length"` or `"--n_segments"`. Note that `"--n_segments"` defines the number of segments for the entire genome, so using this argument with selected chromosomes may result in fewer segments than you expect (and the minimum is one segment per chromosome).
`null_model.py`

### Parameters common to all association tests
1. `null_model.R`

config parameter | default value | description
--- | --- | ---
`out_prefix` | | Prefix for files created by this script.
`gds_file` | | GDS file. Include a space to insert chromosome.
`pca_file` | `NA` | RData file with PCA results created by `pcair.py`.
`pcrelate_file` | `NA` | RData file with 2*kinship created by `pcrelate.py`.
`grm_file` | `NA` | GDS file with GRM created by `grm.py`.
Expand All @@ -236,8 +235,24 @@ config parameter | default value | description
`norm_bygroup` | `FALSE` | If `TRUE` and `group_var` is provided, the inverse normal transform is done on each group separately.
`rescale_variance` | `marginal` | Applies only if `inverse_normal` is `TRUE`. Controls whether to rescale the variance after inverse-normal transform, restoring it to the original variance before the transform. Options are `marginal`, `varcomp`, or `none`.
`n_pcs` | `0` | Number of PCs to include as covariates.
`conditional_variant_file` | `NA` | RData file with data frame of of conditional variants. Columns should include `chromosome` and `variant.id`. The alternate allele dosage of these variants will be included as covariates in the analysis.
`conditional_variant_file` | `NA` | RData file with data frame of of conditional variants. Columns should include `chromosome` and `variant.id`. The alternate allele dosage of these variants will be included as covariates in the analysis.
`gds_file` | `NA` | GDS file. Include a space to insert chromosome. Required if `conditional_variant_file` is specified.
`sample_include_file` | `NA` | RData file with vector of sample.id to include.


### Parameters common to all association tests

The effect estimate is for the alternate alelle, and multiple alternate alelles for a single variant are treated separately.

Association tests have an additional level of parallelization: by segment within chromosome. The R scripts take an optional `"--segment"` (or `"-s"`) argument. The python script `assoc.py` uses the environment variable `SGE_TASK_ID` to submit jobs by segment for each chromosome. By default each segment is 10 Mb in length, but this may be changed by using the arguments `"--segment_length"` or `"--n_segments"`. Note that `"--n_segments"` defines the number of segments for the entire genome, so using this argument with selected chromosomes may result in fewer segments than you expect (and the minimum is one segment per chromosome).

config parameter | default value | description
--- | --- | ---
`out_prefix` | | Prefix for files created by this script.
`gds_file` | | GDS file. Include a space to insert chromosome.
`null_model_file` | | RData file with null model from `null_model.py`.
`null_model_params` | | Parameter file ending in `null_model.params` in the `report` directory from `null_model.py`.
`phenotype_file` | | RData file with AnnotatedDataFrame of phenotypes. Use the output phenotype file from `null_model.py`.
`variant_include_file` | `NA` | RData file with vector of variant.id to include.
`variant_block_size` | `1024` | Number of variants to read in a single block.
`pass_only` | `TRUE` | `TRUE` to select only variants with FILTER=PASS.
Expand Down Expand Up @@ -324,13 +339,6 @@ The segment file created at the start of each association test contains the chro
The script [`assoc.py`](assoc.py) submits a SGE array job for each chromosome, where the SGE task id is the row number of the segment in the segments file. If a segment has no requested variants, its job will exit without error. After all segments are complete, they are combined into a single file for each chromosome and the temporary per-segment output files are deleted.


### Multiple tests with the same null model

To run additional tests using the same null model as a previous test, add the config parameters `null_model_file` and `null_model_params`. `null_model_file` is the output file created by a previous association test run. `null_model_params` is the parameter file ending in `null_model.params` in the `report` directory for the previous association test. The parameter file is needed to generate the report for the new test.

If the number of samples in the initial phenotype file was less than the total number of samples in the GDS file, also provide `phenotype_file` as the output phenotype file created along with the null model file.



## LocusZoom

Expand Down
36 changes: 4 additions & 32 deletions assoc.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,8 +32,6 @@
help="type of compute cluster environment [default %(default)s]")
parser.add_argument("--cluster_file", default=None,
help="json file containing cluster options")
parser.add_argument("-n", "--ncores", default="1-8",
help="number of cores to use; either a number (e.g, 1) or a range of numbers (e.g., 1-4) [default %(default)s]")
parser.add_argument("-e", "--email", default=None,
help="email address for job reporting")
parser.add_argument("--print_only", action="store_true", default=False,
Expand All @@ -52,7 +50,6 @@
n_segments = args.n_segments
cluster_file = args.cluster_file
cluster_type = args.cluster_type
ncores = args.ncores
email = args.email
print_only = args.print_only
verbose = args.verbose
Expand All @@ -73,30 +70,10 @@
# {jobname: [jobids]}
hold_null_agg = []

# null model
job = "null_model"

# if a null model file is given in the config, skip this step
run_null_model = "null_model_file" not in configdict
if run_null_model:

rscript = os.path.join(pipeline, "R", job + ".R")

config = deepcopy(configdict)
config["out_file"] = configdict["data_prefix"] + "_null_model.RData"
config["out_phenotype_file"] = configdict["data_prefix"] + "_phenotypes.RData"
configfile = configdict["config_prefix"] + "_" + job + ".config"
TopmedPipeline.writeConfig(config, configfile)

submitID = cluster.submitJob(job_name=job, cmd=driver, args=[rscript, configfile, version], request_cores=ncores, email=email, print_only=print_only)

hold_null_agg.append(submitID)
else:
print("Using null model in " + configdict["null_model_file"])
# copy parameter file for report
if "null_model_params" in configdict:
paramfile = os.path.basename(configdict["config_prefix"]) + "_" + job + ".config.null_model.params"
copyfile(configdict["null_model_params"], paramfile)
# copy parameter file for report
if "null_model_params" in configdict:
paramfile = os.path.basename(configdict["config_prefix"]) + "_null_model.config.null_model.params"
copyfile(configdict["null_model_params"], paramfile)


# for aggregate tests, generate variant list
Expand Down Expand Up @@ -144,11 +121,6 @@
# set up config for association test
config = deepcopy(configdict)
config["assoc_type"] = assoc_type
# if we just ran the null model, use output files as input for assoc test
# otherwise, these parameters should already be in the config
if run_null_model:
config["null_model_file"] = configdict["data_prefix"] + "_null_model.RData"
config["phenotype_file"] = configdict["data_prefix"] + "_phenotypes.RData"

if assoc_type == "aggregate":
config["aggregate_variant_file"] = configdict["data_prefix"] + "_aggregate_list_chr .RData"
Expand Down
83 changes: 83 additions & 0 deletions null_model.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
#! /usr/bin/env python2.7

"""Association tests"""

import TopmedPipeline
import sys
import os
import subprocess
from time import localtime, strftime
from argparse import ArgumentParser
from copy import deepcopy
from shutil import copyfile
from datetime import datetime, timedelta

description = """
Association tests
"""

default_segment_length = "10000"

parser = ArgumentParser(description=description)
parser.add_argument("config_file", help="configuration file")
parser.add_argument("--cluster_type", default="UW_Cluster",
help="type of compute cluster environment [default %(default)s]")
parser.add_argument("--cluster_file", default=None,
help="json file containing cluster options")
parser.add_argument("-n", "--ncores", default="12",
help="number of cores to use; either a number (e.g, 1) or a range of numbers (e.g., 1-4) [default %(default)s]")
parser.add_argument("-e", "--email", default=None,
help="email address for job reporting")
parser.add_argument("--print_only", action="store_true", default=False,
help="print qsub commands without submitting")
parser.add_argument("--verbose", action="store_true", default=False,
help="enable verbose output to help debug")
parser.add_argument("--version", action="version",
version="TopmedPipeline "+TopmedPipeline.__version__,
help="show the version number and exit")
args = parser.parse_args()

configfile = args.config_file
cluster_file = args.cluster_file
cluster_type = args.cluster_type
ncores = args.ncores
email = args.email
print_only = args.print_only
verbose = args.verbose

version = "--version " + TopmedPipeline.__version__

cluster = TopmedPipeline.ClusterFactory.createCluster(cluster_type, cluster_file, verbose)

pipeline = cluster.getPipelinePath()
driver = os.path.join(pipeline, "runRscript.sh")

configdict = TopmedPipeline.readConfig(configfile)
configdict = TopmedPipeline.directorySetup(configdict, subdirs=["config", "data", "log", "plots", "report"])

# analysis init
cluster.analysisInit(print_only=print_only)

# null model
job = "null_model"

rscript = os.path.join(pipeline, "R", job + ".R")

config = deepcopy(configdict)
config["out_file"] = configdict["data_prefix"] + "_null_model.RData"
config["out_phenotype_file"] = configdict["data_prefix"] + "_phenotypes.RData"
configfile = configdict["config_prefix"] + "_" + job + ".config"
TopmedPipeline.writeConfig(config, configfile)

submitID = cluster.submitJob(job_name=job, cmd=driver, args=[rscript, configfile, version], request_cores=ncores, email=email, print_only=print_only)


# post analysis
job = "post_analysis"
jobpy = job + ".py"
pcmd=os.path.join(pipeline, jobpy)
argList = [pcmd, "-a", cluster.getAnalysisName(), "-l", cluster.getAnalysisLog(),
"-s", cluster.getAnalysisStartSec()]
pdriver=os.path.join(pipeline, "run_python.sh")
cluster.submitJob(job_name=job, cmd=pdriver, args=argList,
holdid=submitID, print_only=print_only)
7 changes: 2 additions & 5 deletions testdata/assoc_aggregate_allele.config
Original file line number Diff line number Diff line change
@@ -1,11 +1,8 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file "testdata/pcrelate_Matrix.RData"
pca_file "testdata/pcair.RData"
outcome status
covars "sex"
n_pcs 4
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
variant_group_file "testdata/coding_variants_chr .RData"
aggregate_type "allele"
test "burden"
Expand Down
8 changes: 2 additions & 6 deletions testdata/assoc_aggregate_position.config
Original file line number Diff line number Diff line change
@@ -1,13 +1,9 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file "testdata/pcrelate_Matrix.RData"
pca_file "testdata/pcair.RData"
sample_include_file "testdata/sample_include.RData"
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
variant_include_file "testdata/variant_include_chr .RData"
outcome outcome
covars "sex"
n_pcs 4
alt_freq_max "0.1"
variant_group_file "testdata/genes_chr .RData"
aggregate_type "position"
Expand Down
8 changes: 2 additions & 6 deletions testdata/assoc_aggregate_weights.config
Original file line number Diff line number Diff line change
@@ -1,12 +1,8 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file "testdata/pcrelate_Matrix.RData"
pca_file "testdata/pcair.RData"
sample_include_file "testdata/sample_include.RData"
outcome outcome
covars "sex"
n_pcs 4
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
alt_freq_max "0.1"
variant_group_file "testdata/coding_variants_weights.RData"
aggregate_type "allele"
Expand Down
8 changes: 2 additions & 6 deletions testdata/assoc_single.config
Original file line number Diff line number Diff line change
@@ -1,12 +1,8 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file "testdata/pcrelate_Matrix.RData"
pca_file "testdata/pcair.RData"
sample_include_file "testdata/sample_include.RData"
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
variant_include_file "testdata/variant_include_chr .RData"
outcome outcome
covars "sex Population"
n_pcs 4
mac_threshold 3
genome_build "hg19"
8 changes: 2 additions & 6 deletions testdata/assoc_single_wald.config
Original file line number Diff line number Diff line change
@@ -1,13 +1,9 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file "testdata/pcrelate_Matrix.RData"
pca_file "testdata/pcair.RData"
sample_include_file "testdata/sample_include.RData"
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
variant_include_file "testdata/variant_include_chr .RData"
outcome outcome
covars "sex"
n_pcs 4
maf_threshold 0.01
test_type "wald"
genome_build "hg19"
8 changes: 2 additions & 6 deletions testdata/assoc_window_burden.config
Original file line number Diff line number Diff line change
@@ -1,13 +1,9 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file "testdata/pcrelate_Matrix.RData"
pca_file "testdata/pcair.RData"
sample_include_file "testdata/sample_include.RData"
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
variant_include_file "testdata/variant_include_chr .RData"
outcome outcome
covars "sex"
n_pcs 4
alt_freq_max "0.1"
test "burden"
test_type "score"
Expand Down
8 changes: 2 additions & 6 deletions testdata/assoc_window_burden_binary.config
Original file line number Diff line number Diff line change
@@ -1,12 +1,8 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file "testdata/pcrelate_Matrix.RData"
pca_file "testdata/pcair.RData"
outcome status
binary TRUE
covars "sex"
n_pcs 4
null_model_file "testdata/null_model_binary.RData"
null_model_params "testdata/binary.null_model.params"
alt_freq_max "0.1"
test "burden"
test_type "wald"
Expand Down
8 changes: 2 additions & 6 deletions testdata/assoc_window_burden_wald.config
Original file line number Diff line number Diff line change
@@ -1,13 +1,9 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file "testdata/pcrelate_Matrix.RData"
pca_file "testdata/pcair.RData"
sample_include_file "testdata/sample_include.RData"
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
variant_include_file "testdata/variant_include_chr .RData"
outcome outcome
covars "sex"
n_pcs 4
alt_freq_max "0.1"
test "burden"
test_type "wald"
Expand Down
7 changes: 2 additions & 5 deletions testdata/assoc_window_skat.config
Original file line number Diff line number Diff line change
@@ -1,11 +1,8 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file "testdata/pcrelate_Matrix.RData"
pca_file "testdata/pcair.RData"
outcome outcome
covars "sex"
n_pcs 4
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
alt_freq_max "0.1"
test "skat"
weight_beta "1 25"
Expand Down
7 changes: 2 additions & 5 deletions testdata/assoc_window_skato.config
Original file line number Diff line number Diff line change
@@ -1,11 +1,8 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file "testdata/pcrelate_Matrix.RData"
pca_file "testdata/pcair.RData"
outcome outcome
covars "sex"
n_pcs 4
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
alt_freq_max "0.1"
test "skato"
rho "0 0.5 1"
Expand Down
8 changes: 2 additions & 6 deletions testdata/assoc_window_unrel.config
Original file line number Diff line number Diff line change
@@ -1,13 +1,9 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file NA
pca_file "testdata/pcair.RData"
sample_include_file "testdata/unrelated.RData"
null_model_file "testdata/null_model_unrel.RData"
null_model_params "testdata/unrel.null_model.params"
variant_include_file "testdata/variant_include_chr .RData"
outcome outcome
covars "sex"
n_pcs 4
alt_freq_max "0.1"
test "burden"
test_type "score"
Expand Down
8 changes: 2 additions & 6 deletions testdata/assoc_window_weights.config
Original file line number Diff line number Diff line change
@@ -1,12 +1,8 @@
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
pcrelate_file "testdata/pcrelate_Matrix.RData"
pca_file "testdata/pcair.RData"
sample_include_file "testdata/sample_include.RData"
outcome outcome
covars "sex"
n_pcs 4
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
alt_freq_max "0.1"
test "burden"
test_type "score"
Expand Down
Loading

0 comments on commit a0da1df

Please sign in to comment.