Toolbox - generic utilities for data processing (e.g., parsing, proximity, GUILD scoring, etc.)

toolbox

Toolbox is a repository of scripts used in my research on the analysis of disease- and drug-related biological data sets. It contains generic utilities for data processing (e.g., parsing, network-based analysis, proximity, etc.).

Background

The code here has been developed during the analysis of data in various projects such as

  • BIANA (@javigx2 was the lead developer): Biological data analysis and network integration
  • GUILD: Network-based disease-gene prioritization
  • Proximity: A method to calculate distances between two groups of nodes in the network while correcting for degree biases (e.g., incompleteness or study bias)

The package mainly consists of two types of files:

  • parse_{resource_name_to_be_parsed}.py
  • {type_of_data/software}_utilities.py

For instance, parse_drugbank.py contains methods to parse the DrugBank database (v.3) XML dump, and network_utilities.py contains methods related to network generation and analysis.

Parsers

Parsers are available for the APIs / files provided by a range of resources (note that they are specific to retrieving a certain type of information, often related to pharmacological analyses, and might not be up to date).

The parsers are provided "as is" and might not work due to updates to the data formats of these resources. Please contact me for suggestions, bug reports and enquiries.

External packages (Optional)

Some functions in toolbox rely on the following packages. The package will load properly without them, but certain functionality might not be available.

  • GenRev for the Steiner-tree algorithm implementation
  • negex for identifying negation in textual medical records
  • FuncAssoc client for connecting to the FuncAssociate server (incorporated in the repository)

Wrappers

wrappers.py provides an easy-to-use interface to various methods I commonly use. It is continuously under development. Currently it contains methods to

  • Map between UniProt IDs, ENTREZ IDs and gene symbols
  • Create a networkx network from a file
  • Calculate proximity
  • Calculate functional enrichment using the FuncAssociate API

GUILD

See below for the Python interface to run GUILD (assumes it is properly compiled and accessible at executable_path), using A and C as seeds on a toy network:

>>> from toolbox import wrappers
>>> file_name = "toy.sif"
>>> network = wrappers.get_network(file_name, only_lcc = True)
>>> nodes = set(network.nodes())
>>> seeds = ["A", "C"]
>>> node_to_score = dict((node, 1) for node in seeds)
>>> name = "sample_run"
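>>> output_dir = "./"
>>> executable_path = "/path/to/guild"  # placeholder: set this to the location of your compiled GUILD executable
>>> wrappers.run_guild(name, node_to_score, nodes, file_name, output_dir, executable_path)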

After this command, the input node score file "sample_run.node" and the output node score file "sample_run.ns" will be created in the current directory.

Proximity

Proximity analysis

To replicate the analysis in the paper, please refer to the proximity repository.

Proximity calculation

See the calculate_proximity method in wrappers.py for calculating proximity:

calculate_proximity(network, nodes_from, nodes_to, nodes_from_random=None, nodes_to_random=None, n_random=1000, min_bin_size=100, seed=452456)

For instance, to calculate the proximity from (A, C) to (B, D, E) in a toy network (given below):

>>> from toolbox import wrappers
>>> file_name = "toy.sif"
>>> network = wrappers.get_network(file_name, only_lcc = True)
>>> nodes_from = ["A", "C"]
>>> nodes_to = ["B", "D", "E"]
>>> d, z, (mean, sd) = wrappers.calculate_proximity(network, nodes_from, nodes_to, min_bin_size = 2)
>>> print (d, z, (mean, sd))
(1.0, 0.97823676194805476, (0.75549999999999995, 0.24993949267772786))
>>>
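
The returned values in the example are consistent with a standard z-score, in which the observed distance d is compared to the mean and standard deviation of the distances obtained for the random node sets. The following minimal check assumes that relationship (it is not taken from the toolbox code) and reuses the numbers printed above:

```python
# Values copied from the example output above.
d = 1.0
mean = 0.75549999999999995
sd = 0.24993949267772786

# Assumed relationship: z = (observed - random mean) / random standard deviation.
z = (d - mean) / sd
print(round(z, 6))  # 0.978237
```

Because the random node sets are sampled, mean, sd and z can change with the seed and n_random arguments.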

Toy network (toy.sif):

A 1 B
A 1 C
A 1 D
A 1 E
A 1 F
A 1 G
A 1 H
B 1 C
B 1 D
B 1 I
B 1 J
C 1 K
D 1 E
D 1 I
E 1 F
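
Each line of the toy file reads as "source interaction target". As an illustration of the format only, the sketch below parses it with the standard library into an undirected adjacency map. The helper name read_sif is hypothetical and this is not how wrappers.get_network is implemented; it assumes three whitespace-separated columns per line.

```python
from collections import defaultdict

# Contents of toy.sif, embedded so the sketch runs on its own.
TOY_SIF = """\
A 1 B
A 1 C
A 1 D
A 1 E
A 1 F
A 1 G
A 1 H
B 1 C
B 1 D
B 1 I
B 1 J
C 1 K
D 1 E
D 1 I
E 1 F
"""


def read_sif(lines):
    """Build an undirected adjacency map from 'source interaction target' lines."""
    neighbors = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        source, _interaction, target = fields
        neighbors[source].add(target)
        neighbors[target].add(source)
    return dict(neighbors)


toy = read_sif(TOY_SIF.splitlines())
n_edges = sum(len(adjacent) for adjacent in toy.values()) // 2
print(len(toy), "nodes;", n_edges, "edges")  # 11 nodes; 15 edges
```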

The inputs are the two groups of nodes and the network. The proximity is not symmetric (if nodes_from and nodes_to are swapped, the results would be different, see below for details). The nodes in the network are binned such that the nodes in the same bin have similar degrees. For real networks, use a larger min_bin_size (e.g., 10, 25, 50, 100, see below for choosing the bin size). The random nodes matching the number and the degree of the nodes in the node sets are chosen using these bins. The average distance from the nodes in one set to the other is then calculated and compared to the random expectation (the distances observed in random groups).
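
To make the distance part concrete, here is a minimal standard-library sketch. It assumes the "closest" measure (for each node in nodes_from, the shortest path length to the nearest node in nodes_to, averaged over nodes_from) and a connected network. The function names are hypothetical and this is not the toolbox implementation; the degree-matched randomization step is left out.

```python
from collections import deque

# Edges of the toy network (toy.sif), written as two-letter node pairs.
EDGES = "AB AC AD AE AF AG AH BC BD BI BJ CK DE DI EF".split()


def build_adjacency(edges):
    """Undirected adjacency map from (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    lengths = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in lengths:
                lengths[neighbor] = lengths[node] + 1
                queue.append(neighbor)
    return lengths


def closest_distance(adjacency, nodes_from, nodes_to):
    """Average, over nodes_from, of the distance to the nearest node in nodes_to.

    Assumes every node in nodes_to is reachable (connected network).
    """
    total = 0.0
    for node in nodes_from:
        lengths = shortest_path_lengths(adjacency, node)
        total += min(lengths[target] for target in nodes_to)
    return total / len(nodes_from)


toy = build_adjacency(EDGES)
d = closest_distance(toy, ["A", "C"], ["B", "D", "E"])
print(d)  # 1.0

# Swapping the two sets can change the value (asymmetry):
print(closest_distance(toy, ["K"], ["A", "G"]))  # 2.0
print(closest_distance(toy, ["A", "G"], ["K"]))  # 2.5
```

Under these assumptions the first value agrees with the d reported in the example above.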

Proximity calculation considerations

  • From/to nodes: Note that proximity, by definition, is not symmetric and the order of nodes_from and nodes_to makes a difference. If there is no intrinsic direction between the two sets of nodes (such as the proximity from drugs to diseases), you can use the node set that is smaller in size.

  • Degree binning: To randomly select nodes with degrees similar to those in the original node sets, proximity bins nodes of similar degree. This is because high-degree nodes are rare in the network, so without binning the randomization algorithm would always choose the same set of nodes. That being said, if the bins are too large, the nodes within a bin are no longer representative of the original nodes (the bin spans nodes with many different degrees). Accordingly, the bins should contain enough nodes to allow a representative random sampling. For each bin, nodes with the next higher degree (e.g., k+1, if k is the degree of the nodes in the current bin) are added iteratively until min_bin_size is reached. For instance, min_bin_size can be chosen to be at least twice as large as the maximum number of nodes in the node sets.
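
The binning described above can be sketched as follows on the toy network. This is an illustration under stated assumptions, not the toolbox code: the function name degree_bins is hypothetical, and merging leftover high-degree nodes into the last complete bin is an assumption about how the tail is handled.

```python
# Edges of the toy network (toy.sif), written as two-letter node pairs.
EDGES = "AB AC AD AE AF AG AH BC BD BI BJ CK DE DI EF".split()


def degree_bins(edges, min_bin_size):
    """Group nodes into (low_degree, high_degree, nodes) bins of at least min_bin_size nodes."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    by_degree = {}
    for node, k in degree.items():
        by_degree.setdefault(k, []).append(node)

    bins = []
    current = None  # bin under construction: [low, high, nodes]
    for k in sorted(by_degree):
        if current is None:
            current = [k, k, []]
        # Add the nodes with the next higher degree to the current bin.
        current[1] = k
        current[2].extend(sorted(by_degree[k]))
        if len(current[2]) >= min_bin_size:
            bins.append(tuple(current))
            current = None

    if current is not None:
        # Assumption: leftover high-degree nodes join the last complete bin.
        if bins:
            low, _, nodes = bins[-1]
            bins[-1] = (low, current[1], nodes + current[2])
        else:
            bins.append(tuple(current))
    return bins


for low, high, nodes in degree_bins(EDGES, min_bin_size=2):
    print(low, high, nodes)
# 1 1 ['G', 'H', 'J', 'K']
# 2 2 ['F', 'I']
# 3 3 ['C', 'E']
# 4 7 ['D', 'B', 'A']
```

With min_bin_size = 2, the single node of degree 7 (A) cannot fill a bin on its own, so it shares a bin with the nodes of degree 4 and 5. A random stand-in for A would then be drawn from that bin.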

Citation

  • If you use biomedical database parsers or proximity-related methods, please cite: Guney E, Menche J, Vidal M, Barabási AL. Network-based in silico drug efficacy screening. Nat. Commun. 7:10331 doi: 10.1038/ncomms10331 (2016).

  • If you use GUILD-related methods, please cite: Guney E, Oliva B. Exploiting Protein-Protein Interaction Networks for Genome-Wide Disease-Gene Prioritization. PLoS ONE 7(9): e43557 (2012).