Skip to content
francesccoll edited this page Apr 12, 2022 · 53 revisions

PowerBacGWAS: Power calculations for Bacterial GWAS

Usage

Overview

Input files, steps and scripts used to implement PowerBacGWAS pipeline: PowerBacGWAS scripts

Running PowerBacGWAS on an LSF cluster

If you have access to an LSF cluster, we recommend to run PowerBacGWAS Nextflow pipeline with the LSF executor for faster running times, wherein each process is submitted as a separate job by using the bsub command. This execution mode can be selected by adding the option ‘-profile lsf’ to the Nextflow command. You can use Singularity without Docker using the option "-profile farm" when executing your nextflow command. The option "-profile farm,lsf" will thus run the Nextflow command as a job using Singularity. You may need to edit the configuration file (nextflow.config) to create a new profile matching your system’s cluster configuration, see Nextflow’s configuration and executor for information on how to do this.

Data pre-processing steps

The following input files are needed to run the PowerBacGWAS pipeline:

  • a phylogenetic tree in Newick format
  • a multi-sample VCF file or a Roary-formatted pan-genome CSV file

These input files need to be pre-processed and re-formatted to make them compliant with PowerBacGWAS scripts.

The dataset efm_clade and associated files will be used to show how data pre-processing scripts are used.

Nextflow commands

The Nextflow implementation includes a pipeline that packages all data pre-processing steps in a single command. The option --run_prepare_vcf launches the pre-processings steps to prepare the VCF file for downstream analyses; whereas the option --run_prepare_pangenome will do the same for Roary's gene_presence_absence.Rtab file.

nextflow run main.nf --vcf efm_clade.vcf.gz --outprefix efm_clade.var --run_prepare_vcf --num_cores 8
nextflow run main.nf --pangenome efm_clade.gene_presence_absence.Rtab --outprefix efm_clade.pg --run_prepare_pangenome --num_cores 8

Individual commands

Alternatively to the Nextflow command, each script can be run individually. First, the format of the VCF file can be checked using the script prepare_vcf_file.py which will make sure multi-allelic sites are split, variant identifiers (ID VCF field) are included and genotypes (GT VCF field) are converted to haploid format.

python3 ./scripts/prepare_vcf_file.py --input_vcf ./data/efm_clade.vcf.gz --output_vcf ./data/efm_clade.reformatted.vcf

The internal nodes of the phylogenetic tree need to be annotated, i.e. labelled with unique identifiers, using the script annotate_nodes_newick.py.

python3 ./scripts/annotate_nodes_newick.py --input_tree ./data/efm_clade.tree.nwk --output_tree ./data/efm_clade.tree.annotated.nwk

Because of the use of PastML tool in the pipeline, VCF and Roary-formatted files need be converted to a CSV file compatible with PastML (https://pastml.pasteur.fr/help) using vcf_to_pastml_matrix.py and roary_to_pastml_matrix.py scripts, respectively.

python3 ./scripts/vcf_to_pastml_matrix.py --input_vcf ./data/efm_clade.reformatted.vcf.gz --output_table ./data/efm_clade.vcf.pastml.csv
gunzip ./data/efm_clade.gene_presence_absence.Rtab.gz
python3 ./scripts/roary_to_pastml_matrix.py --gene_presence_absence ./data/efm_clade.gene_presence_absence.Rtab --input_format 2 --output_table ./data/efm_clade.pg.pastml.csv

Because of the use of PLINK tool in the pipeline, VCF and Roary-formatted files need be converted to PLINK binary PED files (https://www.cog-genomics.org/plink/2.0/input#bed) using the scripts vcf_to_plink_files.py and roary_to_plink_files.py. PLINK input-formatted files are needed to run GCTA phenotype simulations.

python3 ./scripts/vcf_to_plink_files.py --input_vcf ./data/efm_clade.reformatted.vcf.gz --output_prefix ./data/efm_clade
python3 ./scripts/roary_to_plink_files.py --gene_presence_absence ./data/efm_clade.gene_presence_absence.Rtab --input_format 2 --output_prefix ./data/efm_clade.pg

Ancestral state reconstruction

One of the goals of the pipeline is to assess the power of detecting causal variants depending on their degree of homoplasy. Thus, the ancestral state of variants need to be reconstructed to count how many times they arose in the phylogeny. PowerBacGWAS uses PastML to infer the ancestral characters of VCF variants and genes using maximum parsimony. Because ancestral state reconstruction is computationally expensive but, at the same time it is applied independently to each variant, we sliced input variant files to parallelize this process using multi-processing.

Nextflow commands

The Nextflow implementation includes a pipeline to perform ancestral state reconstruction from a VCF (--run_vcf_ast, to be run after --run_prepare_vcf), a pan-genome file (--run_pangenome_ast, to be run after --run_prepare_pangenome) or to count homoplasies on regions (--run_vcf_regions_ast, to be run after --run_prepare_vcf and --run_vcf_ast).

nextflow run main.nf --outprefix efm_clade.var --tree efm_clade.tree.nwk --run_vcf_ast --num_cores 8
nextflow run main.nf --outprefix efm_clade.pg --tree efm_clade.tree.nwk --run_pangenome_ast --num_cores 8
nextflow run main.nf --outprefix efm_clade.var --tree efm_clade.tree.nwk --regions efm_genes.csv --run_vcf_regions_ast --num_cores 8

Individual commands

The wrapper scripts ancestral_state_reconstruction.py and ancestral_state_reconstruction_roary.py implement this step for VCF and Roary files, respectively.

python3 ./scripts/ancestral_state_reconstruction.py --input_vcf_table ./data/efm_clade.vcf.pastml.csv --input_tree ./data/efm_clade.tree.annotated.nwk --output_table ./data/efm_clade.vcf.ancestral.csv --output_steps ./data/efm_clade.vcf.ancestral_steps.csv --processes 8
python3 ./scripts/ancestral_state_reconstruction_roary.py --input_pastml_table ./data/efm_clade.pg.pastml.csv --input_tree ./data/efm_clade.tree.annotated.nwk --output_table ./data/efm_clade.pg.ancestral.csv --output_steps ./data/efm_clade.pg.ancestral_steps.csv --processes 8

NOTE: ancestral state reconstruction is a computationally expensive step, particularly for VCF files with large numbers of variants. The script ancestral_state_reconstruction.py is expected to run for several hours for large VCF files.

The output file ending with .ancestral.csv contains the ancestral state of each variant at all internal nodes in the phylogenetic tree, and the file ending with .ancestral_steps.csv the number of independent acquisitions (homoplasies).

If a burden test GWAS is to be run, the number of independent variant acquisitions (homoplasies) must be additionally calculated at the region level (e.g. gene) using the script calculate_changes_per_region.py. This script uses, in turn, PastML script calculate_changes.py.

Second, run the script calculate_changes_per_region.py. Note that the PastML output table with VCF variants and their ancestral states produced by script ancestral_state_reconstruction.py is used as one the required input files.

python3 ./scripts/calculate_changes_per_region.py --pastml_output_table ./data/efm_clade.vcf.ancestral.csv --input_tree ./data/efm_clade.tree.annotated.nwk --regions_file ./data/efm_genes.csv --processes 4 --calculate_changes_path ./pastml/pastml/utilities/calculate_changes.py --output_steps ./data/efm_clade.vcf.genes.ancestral_steps.csv --tmp_dir tmp_dir

Running power calculations: sub-sampling approach

The sub-sampling approach uses a known genotype-phenotype relationship to conduct power calculations and is implemented for binary phenotypes. The scripts prepare_gwas_runs_subsampling.py and prepare_gwas_runs_subsampling_roary.py implement this approach for VCF variants and the pan-genome, respectively.

The following input files are required:

  • A variant file (VCF or pan-genome)
  • A file with causal variants (a GCTA-formatted file with a list of variants IDs)
  • A Pyseer-formatted phenotype file, where the variants in the causal variants file are known to determine the trait in the phenotype file.
  • A parameters file specifying the range of allele frequencies, sample sizes and effect sizes to test.

Applied to the detection of a streptomycin resistance gene in Enterococcus faecium

We will apply the PowerBacGWAS sub-sampling approach to the detection of a streptomycin resistance gene in Enterococcus faecium.

Nextflow commands

The Nextflow command below will run the sub-sampling approach for a pan-genome GWAS in the efm_clade population. Three sub-sampling pipelines are implemented, depending on the type of input variant: --run_subsampling_vcf, --run_subsampling_pangenome and --run_subsampling_burden for individual VCF variants, pan-genome and burden testing GWAS, respectively. Note that a phenotype and causal variants files are needed. It is highly recommended to use a specific directory to store temporary GWAS files (--gwas_tmp_dir) for each power calculation analysis. The individual steps/scripts packaged in this pipeline are showed in the next section.

nextflow run main.nf --outprefix efm_clade.pg --tree efm_clade.tree.nwk --pangenome efm_clade.gene_presence_absence.Rtab --parameters efm_clade.subsampling_parameters.txt --causal_variants efm_clade.streptomycin_causal_loci.txt --phenotype efm_clade.streptomycin_phenotype.phen --run_subsampling_pangenome --num_cores 8 --gwas_tmp_dir tmp/gwas_tmp_efm_clade_pg_sub/

Individual commands

First, the PySeer script phylogeny_distance.py needs to be run to extract the patristic distances from the phylogeny that PySeer will use to run the Linear Mixed Model (LMM).

python3 ./pyseer/scripts/phylogeny_distance.py --calc-C ./data/efm_clade.tree.annotated.nwk > ./data/efm_clade.tree_distances.csv

As streptomycin resistance in Enterococcus faecium is caused by the acquisition of the ant(6)-Ia/aad(6) gene (annotated as aadE in the pan-genome file), the pan-genome branch of the pipeline (i.e. script prepare_gwas_runs_subsampling_roary.py) will be used:

python3 ./scripts/prepare_gwas_runs_subsampling_roary.py --roary_table ./data/efm_clade.gene_presence_absence.Rtab --input_format 2 --similarity ./data/efm_clade.tree_distances.csv --parameters_file ./data/efm_clade.subsampling_parameters.txt --causal-loci ./data/efm_clade.streptomycin_causal_loci.txt --phenotype ./data/efm_clade.streptomycin_phenotype.phen --code_directory ./scripts/ --pyseer_path ./pyseer/pyseer-runner.py --output_dir output_dir --output_prefix efm_clade.streptomycin

NOTE: the parameters file must be formatted like the example below:

allele_frequency: 0.01,0.025,0.05,0.10,0.25
sample_size: 100,200,300,400,500,600,700
case_control_ratio: 0.5
effect_size: 1.5,5,10,100
simulation_repetitions: 10

NOTES on editing this parameters file above:

The units of effect_size must be specified as odds ratio (where 0 to 1 indicate negative association, 1 no association, and > 1 positive association) when simulating a binary phenotype. The simulation of quantitative phenotypes is not supported using the sub-sampling approach.

The script prepare_gwas_runs_subsampling_roary.py will calculate the observed allele frequency and odds ratio of causal genes provided (--causal-loci), and will exit if the maximum allele frequency and effect sizes specified in the parameters file (--parameters_file) are greater than the observed ones.

Observed allele frequency: 0.3416557161629435
Causal variant odds ratio: 8986.25

Two output files will be obtained:

  • a bash script with GWAS runs (as many as sub-samples) ending with .gwas_runs.sh (e.g. efm_clade.streptomycin.gwas_runs.sh)
  • a CSV file ending with .gwas_runs.csv with all parameter combinations used (e.g. efm_clade.streptomycin.gwas_runs.csv)

The next step consists in running the bash script with GWAS runs:

bash efm_clade.streptomycin.gwas_runs.sh

NOTE: the option --job_submission_command in script prepare_gwas_runs_subsampling_roary.py allows you to include a job submission command (e.g. 'bsub -q normal -J job_id -o job_id.out -e job_id.err') in .gwas_runs.sh bash script to submit each GWAS run as a seperate job. Using the string 'job_id' in this argument (as shown in the example) will assign a unique identifier (combination_id) to each job.

Once GWAS runs have completed, the script process_gwas_runs.py is used to extract the LMM adjusted p-values of causal variants from Pyseer output files; and the R script plot_gwas_runs_subsampling.R to plot the results.

python3 ./scripts/process_gwas_runs.py --gwas_runs_in_table efm_clade.streptomycin.gwas_runs.csv --variant_type r --causal-loci ./data/efm_clade.streptomycin_causal_loci.txt --output_dir ./output_dir/ --gwas_runs_out_table efm_clade.streptomycin.gwas_runs.results.csv
Rscript ./scripts/plot_gwas_runs_subsampling.R --input_table efm_clade.streptomycin.gwas_runs.results.csv --parameters_file ./data/efm_clade.subsampling_parameters.txt --variants_tested 4868 --significance_level 0.05 --output_plot efm_clade.streptomycin.gwas_runs.results.plot.pdf

NOTE: the genome-wide significance threshold is calculated using a Bonferroni correction, dividing the chosen significance level (default 0.05) by the number of independent tests. The number of independent tests (number of unique gene presence/absence patterns) can be calculated using the command below:

cat ./data/efm_clade.gene_presence_absence.Rtab | cut -f2- | sed 1d | sort | uniq | wc -l

The resulting PowerBacGWAS plot can be seen below: PowerBacGWAS plot PowerBacGWAS plots like the one above are the final output of the pipeline. They show the sample sizes required to detect causal genes of different frequencies and effect sizes. The y-axis shows the power, calculated as the proportion of GWAS replicates in which the causal gene is above the Bonferroni-corrected genome-wide significance threshold. The black and dotted horizontal line marks 80% power. Sample sizes are represented in the x-axis. The colour of lines denotes different allele frequencies whereas point shapes represent different effect sizes (in odds ratio units).

Applied to the detection of an isoniazid resistance gene in Mycobacterium tuberculosis (burden testing)

We will apply the PowerBacGWAS sub-sampling approach to the detection of katG isoniazid resistance conferring mutation in Mycobacterium tuberculosis.

Nextflow commands

The Nextflow command below will run the sub-sampling approach for a burden GWAS in the mtb_clade population. Note that a phenotype and causal variants files are needed. It is highly recommended to use a specific directory to store temporary GWAS files (--gwas_tmp_dir) for each power calculation analysis. See the individual steps/scripts packaged in this pipeline in the next section.

nextflow run main.nf --outprefix mtb_clade.var --tree mtb_clade.tree.nwk --vcf mtb_clade.vcf.gz --parameters mtb_clade.subsampling_parameters.txt --causal_variants mtb_clade.isoniazid_causal_mutations.txt --causal_regions mtb_clade.isoniazid_causal_loci.txt --phenotype mtb_clade.isoniazid_phenotype.phen --regions mtb_genes.csv --run_subsampling_burden --num_cores 8 --gwas_tmp_dir tmp/gwas_tmp_mtb_clade_burden_sub/

NOTE: Burden tests are performed as implemented by PySeer, regardless of the annotation of VCF alleles (that is, their predicted functional effect). It is often recommended to group variants with the same predicted functional effect (e.g. loss of function mutations within a gene). In this case, you will need to filter the input VCF file as explained in PySeer’s manual.

Individual commands

As done before, the PySeer script phylogeny_distance.py needs to be run to extract the patristic distances from the phylogeny that PySeer will use to run the Linear Mixed Model (LMM).

python3 ./pyseer/scripts/phylogeny_distance.py --calc-C ./data/mtb_clade.tree.annotated.nwk > ./data/mtb_clade.tree_distances.csv

As drug resistance in Mycobacterium tuberculosis is caused by the acquisition of mutations, the VCF branch of the pipeline (i.e. script prepare_gwas_runs_subsampling.py) will be used. By specifying the option --burden, and providing a regions file (mtb_genes.csv), mutations will be grouped by region (gene) for burden testing. Thus, we will be testing the power of detecting mutated genes, as opposed to individual mutations.

python3 ./scripts/prepare_gwas_runs_subsampling.py --input_vcf ./data/mtb_clade.reformatted.vcf --similarity ./data/mtb_clade.tree_distances.csv --parameters_file ./data/mtb_clade.subsampling_parameters.txt --causal-loci ./data/mtb_clade.isoniazid_causal_mutations.txt --phenotype ./data/mtb_clade.isoniazid_phenotype.phen --burden ./data/mtb_genes.csv --code_directory ./scripts/ --pyseer_path ./pyseer/pyseer-runner.py --output_dir output_dir --output_prefix mtb_clade.isoniazid

NOTE: the parameters file (mtb_clade.subsampling_parameters.txt) is formatted like the example shown before, but a different range of parameters:

allele_frequency: 0.01,0.025,0.05,0.10,0.12
sample_size: 100,200,300,400,500,600,700,800,900,1000
case_control_ratio: 0.5
effect_size: 1.5,5,10,100,160
simulation_repetitions: 10

The script prepare_gwas_runs_subsampling.py will calculate the observed allele frequency and odds ratio of the causal gene (--causal-loci), and will exit if the maximum allele frequency and effect sizes specified in the parameters file (--parameters_file) are greater than the observed ones.

Observed allele frequency: 0.12906057945566285
Causal variant odds ratio: 163.60296296296298

Two output files will be obtained:

  • a bash script with GWAS runs (as many as sub-samples) ending with .gwas_runs.sh (e.g. mtb_clade.isoniazid.gwas_runs.sh)
  • a CSV file ending with .gwas_runs.csv with all parameter combinations used (e.g. mtb_clade.isoniazid.gwas_runs.csv)

The next step consists in running the bash script with GWAS runs:

bash mtb_clade.isoniazid.gwas_runs.sh

Once GWAS runs have completed, the script process_gwas_runs.py is used to extract the LMM adjusted p-values of causal variants from Pyseer output files; and the R script plot_gwas_runs_subsampling.R to plot the results.

python3 ./scripts/process_gwas_runs.py --gwas_runs_in_table mtb_clade.isoniazid.gwas_runs.csv --variant_type b --causal-loci ./data/mtb_clade.isoniazid_causal_loci.txt --output_dir ./output_dir/ --gwas_runs_out_table mtb_clade.isoniazid.gwas_runs.results.csv
Rscript ./scripts/plot_gwas_runs_subsampling.R --input_table mtb_clade.isoniazid.gwas_runs.results.csv --parameters_file ./data/mtb_clade.subsampling_parameters.txt --variants_tested 1958 --significance_level 0.05 --output_plot mtb_clade.isoniazid.gwas_runs.results.plot.pdf

NOTE: NOTE: The number of independent tests (number of unique gene presence/absence patterns), used to calculate the Bonferroni corrected genome-wide significance threshold, can be calculated using the command below:

cat ./data/mtb_clade.gene_presence_absence.Rtab | cut -f2- | sed 1d | sort | uniq | wc -l
    1958

The resulting PowerBacGWAS plot can be seen below: PowerBacGWAS plot

Applied to the detection of an isoniazid resistance mutation in Mycobacterium tuberculosis (VCF GWAS)

We will apply the PowerBacGWAS sub-sampling approach to the detection of a single katG isoniazid resistance conferring mutation in Mycobacterium tuberculosis.

Nextflow commands

The Nextflow command below will run the sub-sampling approach for a VCF GWAS in the mtb_clade population. Note that a phenotype and causal variants files are needed. It is highly recommended to use a specific directory to store temporary GWAS files (--gwas_tmp_dir) for each power calculation analysis.

nextflow run main.nf --outprefix mtb_clade.var --tree mtb_clade.tree.nwk --vcf mtb_clade.vcf.gz --parameters mtb_clade.subsampling_parameters.txt --causal_variants mtb_clade.isoniazid_causal_mutation.txt --phenotype mtb_clade.isoniazid_phenotype.phen --run_subsampling_vcf --num_cores 8 --gwas_tmp_dir tmp/gwas_tmp_mtb_clade_vcf_sub/

Running power calculations: phenotype simulation approach

The phenotype simulation approach to conduct power calculations is implemented in the PowerBacGWAS toolkit for when a genotype-phenotype relationship is not known in the population, for both binary and quantitative phenotypes. In brief, this approach consists in sampling existing variants from pan-genome or VCF file meeting a predefined allele frequency and degree of homoplasy, to then simulate phenotypes from these variants with a pre-defined effect size and sample size. The wrapper scripts prepare_gwas_runs.py and prepare_gwas_runs_roary.py implement this approach for VCF variants and the pan-genome, respectively.

We will use the dataset kpn_clade (a collection of 1,193 Klebsiella pneumoniae CC258 genomes) to show how PowerBacGWAS scripts should be applied.

Data pre-processing steps

As done for the sub-sampling approach, input files (a multi-sample VCF file, a Roary-formatted pan-genome CSV file and a phylogenetic tree in Newick format) must be pre-processed and prepared beforehand (See the section Data pre-processing steps for details). See commands for the dataset kpn_clade below:

Nextflow commands

nextflow run main.nf --outprefix kpn_clade.var --vcf kpn_clade.vcf.gz --run_prepare_vcf --num_cores 8
nextflow run main.nf --outprefix kpn_clade.pg --tree kpn_clade.tree.nwk --run_prepare_pangenome --pangenome kpn_clade.gene_presence_absence.Rtab --num_cores 8

Individual commands

python3 ./scripts/prepare_vcf_file.py --input_vcf ./data/kpn_clade.vcf.gz --output_vcf ./data/kpn_clade.reformatted.vcf
python3 ./scripts/annotate_nodes_newick.py --input_tree ./data/kpn_clade.tree.nwk --output_tree ./data/kpn_clade.tree.annotated.nwk
python3 ./scripts/vcf_to_pastml_matrix.py --input_vcf ./data/kpn_clade.reformatted.vcf.gz --output_table ./data/kpn_clade.vcf.pastml.csv
gunzip ./data/kpn_clade.gene_presence_absence.Rtab.gz
python3 ./scripts/roary_to_pastml_matrix.py --gene_presence_absence ./data/kpn_clade.gene_presence_absence.Rtab --input_format 2 --output_table ./data/kpn_clade.pg.pastml.csv
python3 ./scripts/vcf_to_plink_files.py --input_vcf ./data/kpn_clade.reformatted.vcf.gz --output_prefix ./data/kpn_clade
python3 ./scripts/roary_to_plink_files.py --gene_presence_absence ./data/kpn_clade.gene_presence_absence.Rtab --input_format 2 --output_prefix ./data/kpn_clade.pg

Ancestral state reconstruction

See the section Ancestral state reconstruction for details). See commands for the dataset kpn_clade below:

Nextflow commands

nextflow run main.nf --outprefix kpn_clade.var --tree kpn_clade.tree.nwk --run_vcf_ast --num_cores 8
nextflow run main.nf --outprefix kpn_clade.pg --tree kpn_clade.tree.nwk --run_pangenome_ast --num_cores 8
nextflow run main.nf --outprefix kpn_clade.var --tree kpn_clade.tree.nwk --regions kpn_genes.csv --run_vcf_regions_ast

Individual commands

python3 ./scripts/ancestral_state_reconstruction.py --input_vcf_table ./data/kpn_clade.vcf.pastml.csv --input_tree ./data/kpn_clade.tree.annotated.nwk --output_table ./data/kpn_clade.vcf.ancestral.csv --output_steps ./data/kpn_clade.vcf.ancestral_steps.csv --processes 8
python3 ./scripts/ancestral_state_reconstruction_roary.py --input_pastml_table ./data/kpn_clade.pg.pastml.csv --input_tree ./data/kpn_clade.tree.annotated.nwk --output_table ./data/kpn_clade.pg.ancestral.csv --output_steps ./data/kpn_clade.pg.ancestral_steps.csv --processes 8
python3 ./scripts/calculate_changes_per_region.py --pastml_output_table ./data/kpn_clade.vcf.ancestral.csv --input_tree ./data/kpn_clade.tree.annotated.nwk --regions_file ./data/kpn_genes.csv --processes 4 --calculate_changes_path ./pastml/pastml/utilities/calculate_changes.py --output_steps ./data/kpn_clade.vcf.genes.ancestral_steps.csv --tmp_dir tmp_dir
python3 ./pyseer/scripts/phylogeny_distance.py --calc-C ./data/kpn_clade.tree.annotated.nwk > ./data/kpn_clade.tree_distances.csv

Individual variants GWAS

The following files are needed to run PowerBacGWAS for VCF files:

  • An input VCF file (previously re-formatted with prepare_vcf_file.py)
  • Simularity matrix used by Pyseer (created by Pyseer phylogeny_distance.py script)
  • VCF PLINK-formatted files (created by vcf_to_plink_files.py)
  • A file with number of homoplasies per variant (created by ancestral_state_reconstruction.py)
  • And a parameters file

The format and content of the parameters file must be like the one below.

allele_frequency_from: 0.00,0.020,0.035,0.08,0.23
allele_frequency_to: 0.01,0.030,0.065,0.12,0.27
homoplasy_steps_from: 0
homoplasy_steps_to: 1000
number_causal_variants: 1
sampling_repetitions: 10

sample_size: 100,200,300,400,500,600,700,800,900,1000,1100,1200
case_control_ratio: 0.5
effect_size: 1.5,5,10,100
heritability: 0.5
prevalence: 0.5
simulation_repetitions: 5

NOTES on editing this parameters file above:

Select a single wide range of homoplasy steps (e.g. 'homoplasy_steps_from: 0' and 'homoplasy_steps_to: 1000' to disable selection of variants by degree of homoplasy.

Some parameters are indicated with ranges (defined from parameter_from to parameter_to) as it is more likely to find variants (e.g. SNPs) whose parameter values fall within a range (e.g. allele frequency) than having a specific value.

The pipeline allows to vary the effect sizes of causal variants or their heritability, but not both. If multiple values of effect_size and heritability are used, the script prepare_gwas_runs.py will exit with an error.

The units of effect_size must be specified as odds ratio (where 0 to 1 indicate negative association, 1 no association, and > 1 positive association) when simulating a binary phenotype, or in beta units when simulating a quantitative phenotype.

Although a variable named number_causal_variants is included in the parameters file, the pipeline does not currently support power calculations for phenotypes caused by multiple loci. This variable must be set to 1 as shown here.

Nextflow command

nextflow run main.nf --outprefix kpn_clade.var --tree kpn_clade.tree.nwk --vcf kpn_clade.vcf.gz --parameters kpn_clade.parameters.binary.efs.txt --run_phensim_vcf --num_cores 8 --gwas_tmp_dir tmp/gwas_tmp_kpn_clade_var_efs/

Individual commands

The wrapper script prepare_gwas_runs.py will reads a VCF file and randomly sample the chosen number of causal variants that meet the indicated combinations allele frequency and homoplasy level. The script outputs the list of variant IDs along with their chosen effect size in a GCTA-compliant format. The number of times this variant sampling step is repeated can be chosen in the parameters file (sampling_repetitions). The number of times phenotypes are simulated from sampled variants can also be chosen (simulation_repetitions). The output of the phenotype simulation step is a Pyseer-compliant phenotype file for each parameter combination. Finally, the wrapper script prepare_gwas_runs.py will output a bash script with as many GWAS runs, and a CSV file with as many lines, as parameter combinations.

python3 ./scripts/prepare_gwas_runs.py --input_vcf ./data/kpn_clade.reformatted.vcf --parameters_file ./data/kpn_clade.parameters.binary.efs.txt --code_directory ./scripts/ --pyseer_path ./pyseer/pyseer-runner.py --similarity ./data/kpn_clade.tree_distances.csv --plink_prefix ./data/kpn_clade --pastml_steps_file ./data/kpn_clade.vcf.ancestral_steps.csv --output_dir output_dir --output_prefix kpn_clade

NOTE: the option --job_submission_command in prepare_gwas_runs.py allows you to include a job submission command (e.g. 'bsub -q normal -J job_id -o job_id.out -e job_id.err') in .gwas_runs.sh bash script to submit each GWAS run as a seperate job. Using the string 'job_id' in this argument (as shown in the example) will assign a unique identifier (combination_id) to each job.

Next, run the output bash script:

bash kpn_clade.gwas_runs.sh

Once GWAS runs have completed, the script process_gwas_runs.py is used to extract the LMM adjusted p-values of causal variants from Pyseer output files.

python3 ./scripts/process_gwas_runs.py --gwas_runs_in_table kpn_clade.gwas_runs.csv --variant_type v --output_dir ./output_dir/ --gwas_runs_out_table kpn_clade.gwas_runs.results.csv

And finally, plot the results:

Rscript ./scripts/plot_gwas_runs.R --input_table kpn_clade.gwas_runs.results.csv --parameters_file ./data/kpn_clade.parameters.binary.efs.txt --plot_type 1 -v 6526 --output_plot kpn_clade.gwas_runs.results.plot.pdf

NOTE: the genome-wide significance threshold is calculated using a Bonferroni correction, dividing the chosen significance level (default 0.05) by the number of independent tests. The number of independent tests (number of unique allele patterns) can be calculated using the command below:

bcftools query -f '[%GT]\n' ./data/kpn_clade.reformatted.vcf | sort | uniq | wc -l
    6526

The resulting PowerBacGWAS plot can be seen below: PowerBacGWAS plot

Burden testing GWAS

PowerBacGWAS can also be used to conduct power calculations for burden test GWAS, where VCF variants are aggregated within chromosomal regions (e.g. genes), and regions tested for association. Thus, we will be testing the power of detecting mutated genes, as opposed to individual mutations.

The following input files are needed:

  • An input VCF file (previously re-formatted with prepare_vcf_file.py)
  • Simularity matrix used by Pyseer (created by Pyseer phylogeny_distance.py script)
  • VCF PLINK-formatted files (created by vcf_to_plink_files.py)
  • A regions file
  • A file with number of homoplasies per region (created by calculate_changes_per_region.py)
  • And a parameters file

The commands for the kpn_clade dataset would be:

Nextflow command

nextflow run main.nf --outprefix kpn_clade.var --tree kpn_clade.tree.nwk --vcf kpn_clade.vcf.gz --parameters kpn_clade.parameters.binary.efs.txt --regions kpn_genes.csv --run_phensim_burden --num_cores 8 --gwas_tmp_dir tmp/gwas_tmp_kpn_clade_burden_efs/

Individual commands

python3 ./scripts/prepare_gwas_runs.py --input_vcf ./data/kpn_clade.reformatted.vcf --burden ./data/kpn_genes.csv --parameters_file ./data/kpn_clade.parameters.binary.efs.txt --code_directory ./scripts/ --pyseer_path ./pyseer/pyseer-runner.py --similarity ./data/kpn_clade.tree_distances.csv --plink_prefix ./data/kpn_clade --pastml_steps_file ./data/kpn_clade.vcf.genes.ancestral_steps.csv --output_dir output_dir --output_prefix kpn_clade.burden
bash kpn_clade.burden.gwas_runs.sh
python3 ./scripts/process_gwas_runs.py --gwas_runs_in_table kpn_clade.burden.gwas_runs.csv --variant_type b --output_dir ./output_dir/ --gwas_runs_out_table kpn_clade.burden.gwas_runs.results.csv
Rscript ./scripts/plot_gwas_runs.R --input_table kpn_clade.burden.gwas_runs.results.csv --parameters_file ./data/kpn_clade.parameters.binary.efs.txt --plot_type 1 -v 5404 --output_plot kpn_clade.burden.gwas_runs.results.plot.pdf

The resulting PowerBacGWAS plot can be seen below: PowerBacGWAS plot

Pan-genome GWAS

PowerBacGWAS also supports power calculations for pan-genome GWAS. The following input files are needed:

  • A Roary presence/absence file
  • Simularity matrix used by Pyseer (created by Pyseer phylogeny_distance.py script)
  • A PLINK-formatted version of Roary table (created by roary_to_plink_files.py)
  • A file with number of homoplasies per gene (created by ancestral_state_reconstruction_roary.py)
  • And a parameters file

The commands for the kpn_clade dataset would be:

Nextflow command

nextflow run main.nf --outprefix kpn_clade.pg --tree kpn_clade.tree.nwk --pangenome kpn_clade.gene_presence_absence.Rtab --parameters kpn_clade.parameters.binary.efs.txt --run_phensim_pangenome --num_cores 8 --gwas_tmp_dir tmp/gwas_tmp_kpn_clade_pg_efs/

Individual commands

python3 ./scripts/prepare_gwas_runs_roary.py --roary_table ./data/kpn_clade.gene_presence_absence.Rtab --input_format 2 --parameters_file ./data/kpn_clade.parameters.binary.efs.txt --code_directory ./scripts/ --pyseer_path ./pyseer/pyseer-runner.py --similarity ./data/kpn_clade.tree_distances.csv --plink_prefix ./data/kpn_clade.pg --pastml_steps_file ./data/kpn_clade.pg.ancestral_steps.csv --output_dir output_dir --output_prefix kpn_clade.pg
bash kpn_clade.pg.gwas_runs.sh
python3 ./scripts/process_gwas_runs.py --gwas_runs_in_table kpn_clade.pg.gwas_runs.csv --variant_type r --output_dir ./output_dir/ --gwas_runs_out_table kpn_clade.pg.gwas_runs.results.csv
Rscript ./scripts/plot_gwas_runs.R --input_table kpn_clade.pg.gwas_runs.results.csv --parameters_file ./data/kpn_clade.parameters.binary.efs.txt --plot_type 1 -v 12034 --output_plot kpn_clade.pg.gwas_runs.results.plot.pdf

NOTE: The number of independent tests (number of unique gene presence/absence patterns) can be calculated using the command below:

cat ./data/kpn_clade.gene_presence_absence.Rtab | cut -f2- | sed 1d | sort | uniq | wc -l
   12034

The resulting PowerBacGWAS plot can be seen below: PowerBacGWAS plot