Active subnetwork oriented Enrichment Documentation
Overview of Active-subnetwork-oriented Enrichment Analysis

pathfindR leverages information from a protein-protein interaction network (PIN) to identify distinct active subnetworks and then performs enrichment analyses on these subnetworks. As illustrated above, after the input is processed, the input genes and their associated p values are mapped onto the PIN and active subnetwork search is performed. The resulting active subnetworks are then filtered based on their scores and the number of significant genes they contain. This filtered list of active subnetworks is then used for enrichment analyses: using the genes in each active subnetwork, the significantly enriched terms (biological pathways, gene ontology terms, transcription factor target genes, miRNA target genes etc.) are identified. Enriched terms with adjusted p values larger than the given threshold are discarded, and the lowest adjusted p value (over all active subnetworks) for each term is kept. This process of active subnetwork search and enrichment analyses is repeated for a selected number of iterations, performed in parallel. Over all iterations, the lowest and the highest adjusted p values, as well as the number of occurrences, are reported for each significantly enriched term in the resulting data frame. An HTML report containing the results is also provided.
While it is possible for the user to perform these steps manually, for convenience, we provide the wrapper function run_pathfindR() to be used for the active-subnetwork-oriented enrichment analysis.
Using the gene symbols and the associated p values, run_pathfindR() filters the input data frame (i.e. only keeping genes with p <= p_val_threshold), identifies and filters active subnetworks, performs enrichment analyses on the filtered subnetworks and summarizes the enrichment results.
To run with the default parameters, run the following command:
result <- run_pathfindR(input_df)
For a detailed description of all arguments of run_pathfindR(), see the next section.
The output of run_pathfindR() is a data frame containing 8 (or 9) columns:
- ID: ID of the enriched term
- Term_Description: Description of the enriched term
- Fold_Enrichment: Fold enrichment value for the enriched term (Calculated using ONLY the input genes)
- occurrence: the number of iterations in which the given term was found to be enriched
- lowest_p: the lowest adjusted-p value of the given term over all iterations
- highest_p: the highest adjusted-p value of the given term over all iterations
- non_Signif_Snw_Genes (OPTIONAL): the non-significant active subnetwork genes, comma-separated
- Up_regulated: the up-regulated genes (as determined by 'change value' > 0, if the 'change column' was provided) in the input involved in the given term's gene set, comma-separated. If the change column is not provided, all affected genes are listed here.
- Down_regulated: the down-regulated genes (as determined by 'change value' < 0, if the 'change column' was provided) in the input involved in the given term's gene set, comma-separated
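The occurrence, lowest_p and highest_p columns summarize each term over all iterations. The following is a minimal base-R sketch of such a summary, not pathfindR's internal code. It assumes a hypothetical long-format data frame `iter_res` with one row per iteration and term; the column names `iteration` and `adj_p` and all values are made up for illustration.

```r
# Hypothetical per-iteration enrichment results (one row per iteration x term).
iter_res <- data.frame(
  iteration = c(1, 1, 2, 2, 3),
  ID        = c("hsa04110", "hsa04115", "hsa04110", "hsa04115", "hsa04110"),
  adj_p     = c(0.010, 0.040, 0.002, 0.030, 0.020),
  stringsAsFactors = FALSE
)

# Group the adjusted p values by term ID.
by_term <- split(iter_res$adj_p, iter_res$ID)

# Summarize each term over all iterations.
summary_df <- data.frame(
  ID         = names(by_term),
  occurrence = vapply(by_term, length, integer(1)),
  lowest_p   = vapply(by_term, min, numeric(1)),
  highest_p  = vapply(by_term, max, numeric(1)),
  row.names  = NULL,
  stringsAsFactors = FALSE
)

# Order terms by their lowest adjusted p value.
summary_df <- summary_df[order(summary_df$lowest_p), ]
print(summary_df)
```

A term with a high occurrence and a narrow gap between lowest_p and highest_p was found consistently across iterations.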
All arguments of run_pathfindR()
Input-related Arguments
- input: the input data that pathfindR uses. The input must be a data frame with 3 (or 2) columns:
  - Gene Symbol: Gene symbols of genes of interest
  - Change value: Preferably log-fold-change for the given genes (OPTIONAL). (This is only used for visualization of input genes in enriched terms' gene sets)
  - P value: (preferably adjusted) p value associated with the test (e.g. differential expression, differential methylation)
- p_val_threshold: the adjusted-p value threshold to use when filtering the input data frame. Must be a numeric value between 0 and 1. (default = 0.05)
- convert2alias: boolean to indicate whether or not to convert gene symbols in the input that are not found in the PIN to an alias symbol found in the PIN (default = TRUE). IMPORTANT NOTE: the conversion uses human gene symbols/alias symbols.
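As a minimal sketch of the input layout and of the p value filtering step, the snippet below builds a toy 3-column data frame and keeps only the genes with p <= p_val_threshold. The column names and all values are made up for illustration, and the filter is written in plain base R rather than taken from pathfindR itself.

```r
# Toy input in the 3-column layout: gene symbol, change value, p value.
# Column names are illustrative; the values are made up.
input_df <- data.frame(
  Gene.symbol = c("TP53", "BRCA1", "EGFR", "MYC"),
  logFC       = c(1.8, -2.1, 0.4, -0.9),
  adj.P.Val   = c(0.001, 0.030, 0.200, 0.050),
  stringsAsFactors = FALSE
)

p_val_threshold <- 0.05

# Keep only genes with p <= p_val_threshold.
filtered_df <- input_df[input_df$adj.P.Val <= p_val_threshold, ]
print(filtered_df$Gene.symbol)
```

For the 2-column variant, the change value column would simply be left out.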
Active-subnetwork-search-related Arguments
- pin_name_path: Name of the chosen PIN or path/to/PIN.sif. If PIN name, must be one of c("Biogrid", "STRING", "GeneMania", "IntAct", "KEGG", "mmu_STRING"). If path/to/PIN.sif, the file must comply with the PIN specifications. Defaults to "Biogrid". The PIN is used for both active subnetwork identification and enrichment analyses (i.e. all of the genes in the PIN are used as background genes in ORAs)
- search_method: Algorithm to use when performing active subnetwork search. Options are greedy search (GR), simulated annealing (SA) or genetic algorithm (GA) for the search (default: GR). Also see [Selecting the Active Subnetwork Search Algorithm](https://github.com/egeulgen/pathfindR/wiki/Active-subnetwork-oriented-Enrichment-Documentation#selecting-the-active-subnetwork-search-algorithm).
- score_quan_thr: Active subnetwork score quantile threshold used in filtering active subnetworks (default = 0.80)
- sig_gene_thr: Threshold for the minimum proportion of significant genes used in filtering active subnetworks (default = 0.02)
- silent_option: Boolean value indicating whether to print the messages to the console (FALSE) or print to a file (TRUE) during active subnetwork search (default = TRUE). This argument was added because during parallel runs, the console messages get mixed up.
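The two filtering thresholds can be pictured with a small base-R sketch. This is an illustration under stated assumptions, not pathfindR's implementation: the subnetworks, scores and gene names are made up, and the exact way the proportion in sig_gene_thr is computed (here, relative to the number of significant input genes) is an assumed interpretation.

```r
# Hypothetical active subnetworks: each is a character vector of gene symbols.
subnetworks <- list(
  c("A", "B", "C", "X1"),
  c("D", "X2"),
  c("E", "F", "G", "H"),
  c("I", "X3", "X4"),
  c("J", "K")
)
scores    <- c(9.5, 1.2, 7.8, 0.9, 3.1)  # made-up subnetwork scores
sig_genes <- LETTERS[1:11]               # made-up significant input genes (A..K)

score_quan_thr <- 0.80
sig_gene_thr   <- 0.02

# Score filter: keep subnetworks at or above the chosen score quantile.
score_cutoff <- quantile(scores, probs = score_quan_thr, names = FALSE)
keep_score   <- scores >= score_cutoff

# Gene filter (assumed interpretation): the number of significant genes in a
# subnetwork must reach sig_gene_thr times the number of significant input genes.
n_sig      <- vapply(subnetworks, function(g) sum(g %in% sig_genes), integer(1))
keep_genes <- n_sig >= sig_gene_thr * length(sig_genes)

filtered <- subnetworks[keep_score & keep_genes]
print(length(filtered))
```

Raising score_quan_thr keeps fewer, higher-scoring subnetworks; raising sig_gene_thr removes subnetworks that contain few significant genes.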
Greedy Search Arguments
- grMaxDepth: Sets max depth in greedy search, 0 for no limit (default = 1)
- grSearchDepth: Search depth in greedy search (default = 1)
- grOverlap: Overlap threshold for results of greedy search (default = 0.5)
- grSubNum: Number of subnetworks to be presented in the results (default = 1000)
Simulated Annealing Arguments
- use_all_positives: if TRUE, in SA, initializes the candidate solution with all positive nodes (default = FALSE)
- saTemp0: Initial temperature for SA (default = 1.0)
- saTemp1: Final temperature for SA (default = 0.01)
- saIter: Iteration number for SA (default = 10000)
Genetic algorithm Arguments
- use_all_positives: if TRUE, in GA, adds an individual with all positive nodes (default = FALSE)
- gaPop: Population size for GA (default = 400)
- gaIter: Iteration number for GA (default = 200)
- gaThread: Number of threads to be used in GA (default = 5)
- gaCrossover: Applies crossover with the given probability in GA (default = 1, i.e. always perform crossover)
- gaMut: Applies mutation with given mutation rate in GA (default = 0, i.e. mutation off)
Enrichment-related Arguments
- gene_sets: Name of the gene sets to be used for enrichment analysis. Available gene sets are "KEGG", "Reactome", "BioCarta", "GO-All", "GO-BP", "GO-CC", "GO-MF", "cell_markers", "mmu_KEGG" or "Custom". If "Custom", the arguments custom_genes and custom_descriptions must be specified. (default = "KEGG")
- min_gset_size: Minimum number of genes a term must contain (default = 10)
- max_gset_size: Maximum number of genes a term may contain (default = 300)
- custom_genes: A list containing the genes involved in each custom term. Each element is a vector of gene symbols located in the given custom term. Names should correspond to the IDs of the custom terms. This argument MUST be used when gene_sets == "Custom" (default = NULL)
- custom_descriptions: A vector containing the descriptions for each custom term. Names of the vector should correspond to the IDs of the custom terms. This argument MUST be used when gene_sets == "Custom" (default = NULL)
- adj_method: Correction method to be used for adjusting p-values of enrichment results (default: 'bonferroni', see ?p.adjust)
- enrichment_threshold: Adjusted-p value threshold used when filtering enrichment results (default = 0.05)
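To see how adj_method and enrichment_threshold interact, the sketch below applies base R's p.adjust() to a handful of made-up raw p values and lists which terms would pass the threshold under two correction methods. The term names and p values are hypothetical; only p.adjust() and its "bonferroni" and "fdr" methods are real.

```r
# Made-up raw enrichment p values for five terms.
raw_p <- c(term1 = 0.0005, term2 = 0.004, term3 = 0.012, term4 = 0.03, term5 = 0.2)

enrichment_threshold <- 0.05

# Bonferroni multiplies each p value by the number of tests (capped at 1).
adj_bonf <- p.adjust(raw_p, method = "bonferroni")
# "fdr" is an alias for the Benjamini-Hochberg method in p.adjust().
adj_fdr  <- p.adjust(raw_p, method = "fdr")

# Terms that would pass the threshold under each method.
kept_bonf <- names(adj_bonf)[adj_bonf <= enrichment_threshold]
kept_fdr  <- names(adj_fdr)[adj_fdr <= enrichment_threshold]
print(kept_bonf)
print(kept_fdr)
```

Bonferroni is the more conservative choice, so switching to "fdr" typically retains more terms at the same threshold.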
Arguments for Multiple Iterations of Active Subnetwork Search + Enrichment Analyses
- iterations: Number of iterations for active subnetwork search and enrichment analyses (default = 10. Gets set to 1 for Genetic Algorithm)
- n_processes: Optional argument for specifying the number of processes used by foreach. If not specified, the function determines this automatically (default = NULL. Gets set to 1 for Genetic Algorithm)
- list_active_snw_genes: Boolean value indicating whether or not to report the non-DEG active subnetwork genes for the active subnetwork which was enriched for the given term with the lowest p value (default = FALSE)
Arguments for Visualization etc.
- visualize_enriched_terms: Boolean value to indicate whether or not to create diagrams for enriched terms (default = TRUE)
- plot_enrichment_chart: Boolean value. If TRUE, a bubble chart displaying the enrichment results is plotted (default = TRUE)
- output_dir: the directory to be created where the output and intermediate files are saved (default = "pathfindR_Results")
Specifying the Output Directory
By default, run_pathfindR() creates a directory named "pathfindR_Results" under the current working directory for writing the output files. To change the output directory, use output_dir:
output_df <- run_pathfindR(input_df, output_dir = "this_is_my_output_directory")
This creates "this_is_my_output_directory" under the current working directory. In essence, this argument is treated as a path, so it can be used to create the output directory anywhere. For example, to create the directory "my_dir" under "~/Desktop" and run the analysis there, you may run:
output_df <- run_pathfindR(input_df, output_dir = "~/Desktop/my_dir")
Note: If the output directory (e.g. "my_dir") already exists, run_pathfindR() creates and works under "my_dir(1)". If that also exists, it creates "my_dir(2)" and so on. This was intentionally implemented so that any previous pathfindR results are not overwritten.
Changing the Gene Sets Used for Enrichment Analyses
The active-subnetwork-oriented enrichment analyses can be performed on any gene sets (biological pathways, gene ontology terms, transcription factor target genes, miRNA target genes etc.). The available gene sets in pathfindR are "KEGG", "Reactome", "BioCarta", "GO-All", "GO-BP", "GO-CC" and "GO-MF" (all for Homo sapiens). For changing the default gene sets for enrichment analysis (hsa KEGG pathways), use the argument gene_sets:
output_df <- run_pathfindR(input_df, gene_sets = "GO-MF")
By default, run_pathfindR() filters the gene sets by including only the terms containing at least 10 and at most 300 genes. To change the default behaviour, you may change min_gset_size and max_gset_size:
## Including more terms for enrichment analysis
output_df <- run_pathfindR(input_df,
                           gene_sets = "GO-MF",
                           min_gset_size = 5,
                           max_gset_size = 500)
Note that increasing the number of terms for enrichment analysis will result in significantly longer run time.
If the user prefers to use another gene set source, the gene_sets argument should be set to "Custom" and the custom gene sets (list) and the custom gene set descriptions (named vector) should be supplied via the arguments custom_genes and custom_descriptions, respectively.
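A minimal sketch of what the two custom inputs could look like is given below. The term IDs, descriptions and gene memberships are hypothetical placeholders; only the argument names gene_sets, custom_genes and custom_descriptions come from the documentation above, and the call to run_pathfindR() is shown as a comment rather than executed.

```r
# Hypothetical custom terms: a named list of gene-symbol vectors ...
custom_genes <- list(
  SET1 = c("TP53", "MDM2", "CDKN1A"),
  SET2 = c("EGFR", "KRAS", "BRAF", "MAP2K1")
)
# ... and a named vector of descriptions with the same names (term IDs).
custom_descriptions <- c(
  SET1 = "Example term 1",
  SET2 = "Example term 2"
)

# Sanity checks before passing these to run_pathfindR().
stopifnot(
  is.list(custom_genes),
  !is.null(names(custom_genes)),
  setequal(names(custom_genes), names(custom_descriptions))
)

# Size of each term, relevant for min_gset_size / max_gset_size.
gset_sizes <- lengths(custom_genes)
print(gset_sizes)

# The call itself would then look like this (not run here):
# output_df <- run_pathfindR(input_df,
#                            gene_sets = "Custom",
#                            custom_genes = custom_genes,
#                            custom_descriptions = custom_descriptions)
```

The toy sets here are smaller than the default minimum of 10 genes per term, so real custom sets would need to be larger, or min_gset_size would need to be lowered.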
Filtering Enriched Terms by Adjusted-p Values
By default, run_pathfindR() adjusts the enrichment p values via the "bonferroni" method and filters the enriched terms by adjusted-p value < 0.05. To change this adjustment method and the threshold, set adj_method and enrichment_threshold, respectively:
output_df <- run_pathfindR(input_df,
                           adj_method = "fdr",
                           enrichment_threshold = 0.01)
Changing the Protein-protein Interaction Network
For the active subnetwork search process, a protein-protein interaction network (PIN) is used. run_pathfindR() maps the input genes onto this PIN and identifies active subnetworks, which are then used for enrichment analyses. To change the default PIN ("Biogrid"), use the pin_name_path argument:
output_df <- run_pathfindR(input_df, pin_name_path = "IntAct")
The pin_name_path argument can be one of "Biogrid", "STRING", "GeneMania", "IntAct", "KEGG" or "mmu_STRING", or it can be the path to a custom PIN file provided by the user.
# to use an external PIN of your choice
output_df <- run_pathfindR(input_df, pin_name_path = "/path/to/myPIN.sif")
NOTE: the PIN is also used for generating the background genes (in this case, all unique genes in the PIN) during hypergeometric-distribution-based tests in enrichment analyses. Therefore, a larger PIN will generally give better results.
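The effect of the background size on a hypergeometric test can be sketched with base R's phyper(). All counts below are made up, and holding the term size and the overlap fixed while only the background grows is a simplifying assumption; in practice, the number of term genes covered by the PIN changes with the PIN as well.

```r
# Made-up counts for a single term.
n_term_in_bg <- 50    # term genes present in the background (the PIN)
n_selected   <- 100   # genes in the active subnetwork being tested
n_overlap    <- 8     # selected genes that belong to the term

# Upper-tail hypergeometric p value: P(X >= n_overlap).
ora_p <- function(bg_size) {
  phyper(n_overlap - 1,
         m = n_term_in_bg,
         n = bg_size - n_term_in_bg,
         k = n_selected,
         lower.tail = FALSE)
}

p_small_bg <- ora_p(5000)    # smaller PIN as background
p_large_bg <- ora_p(20000)   # larger PIN as background
print(c(small = p_small_bg, large = p_large_bg))
```

With the same overlap, the larger background makes the observed overlap less likely by chance, so the p value decreases.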
Selecting the Active Subnetwork Search Algorithm
In pathfindR, we provide three algorithms that the user can choose from to search for active subnetworks: greedy algorithm, simulated annealing and genetic algorithm. All three algorithms have been part of key studies in this domain and are widely used.
The default search method in pathfindR is the greedy algorithm with a search depth of 1 and a maximum depth of 1. This method stands out for its simplicity and speed. This is also the local subnetwork approach used in the LEAN method proposed by Gwinner et al. [1]. As mentioned in that study, the number of subnetworks to be identified typically increases exponentially with the number of genes in the PIN, and the local subnetwork approach enables iterating over each local subnetwork and determining phenotype-related clusters. A greedy algorithm with search depth and maximum depth equal to 2 or more lets the search look further in the network for another significant gene to add to the cluster.
When choosing the depth of the greedy algorithm, it should be kept in mind that the KEGG, Biogrid, GeneMania and IntAct PINs have characteristic path lengths of 4.69, 3.26, 4.47 and 3.45, respectively. This means that with a depth of 3, the subnetwork to be checked for each seed gene will be too big, the search will take much longer, and a loss in interpretability is anticipated.
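How quickly the candidate neighbourhood of a seed gene grows with depth can be illustrated with a breadth-first search on a toy network. This is a generic sketch, not the greedy search implemented in pathfindR; the adjacency list, gene names and the helper `neighbourhood()` are hypothetical.

```r
# Toy undirected network as an adjacency list (made-up genes g1..g8).
adj <- list(
  g1 = c("g2", "g3"),
  g2 = c("g1", "g4"),
  g3 = c("g1", "g5"),
  g4 = c("g2", "g6"),
  g5 = c("g3", "g7"),
  g6 = c("g4", "g8"),
  g7 = c("g5"),
  g8 = c("g6")
)

# Breadth-first search: all nodes within `depth` steps of `seed`.
neighbourhood <- function(adj, seed, depth) {
  visited  <- seed
  frontier <- seed
  for (d in seq_len(depth)) {
    nxt      <- unique(unlist(adj[frontier], use.names = FALSE))
    frontier <- setdiff(nxt, visited)
    if (length(frontier) == 0) break
    visited <- c(visited, frontier)
  }
  visited
}

# Size of the candidate neighbourhood around g1 for increasing depth.
sizes <- vapply(1:3, function(d) length(neighbourhood(adj, "g1", d)), integer(1))
print(sizes)
```

In this sparse toy network the growth is modest, but in a real PIN with hub genes and a characteristic path length between 3 and 5, a depth of 3 can already reach a large fraction of the network.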
Simulated annealing and genetic algorithms are heuristic methods that do not make any assumptions about the active subnetwork model. They can include insignificant genes that lie between two clusters of significant genes, joining them into a single connected active subnetwork. Thus, these algorithms may produce one large highest-scoring active subnetwork while the remaining subnetworks become small and therefore uninformative. This tendency towards large subnetworks was attributed to a statistical bias prevalent in many tools [2].
In pathfindR, we use multiple subnetworks obtained via the chosen active subnetwork search algorithm. We then filter the subnetworks and perform enrichment on the genes of each of these subnetworks separately. The enrichment results are aggregated later on. In this approach, the default greedy algorithm is sufficient and fast. If the user decides to use the single highest-scoring active subnetwork for the enrichment process, they are encouraged to consider the greedy algorithm with greater depth, simulated annealing or the genetic algorithm.
[1] Gwinner F, Boulday G, Vandiedonck C, et al. Network-based analysis of omics data: the LEAN method. Bioinformatics. 2017;33(5):701-709.
[2] Nikolayeva I, Guitart Pla O, Schwikowski B. Network module identification-A widespread theoretical bias and best practices. Methods. 2018;132:19-25.