TODOs for the project in the future:

On the table:

  • TEST: [FEATURE]: Translation tables
    • TEST: raise an exception if the translation base is too small
    • TEST: fold it into the translation to internal ids
    • TEST: In the mapping, preserve the weights column
  • TEST: [BUG]: forward the edge dropping into the construction routines
  • TODO: [PAPER]:
    • Replicability analysis:
    • ASK RONG for data (published):
      DONE: found in archives
      • Linhao paper for aggregates RNA-seq
      • NOPE: (too much already) Akshay p53 screens
    • ASK EWALD/BADER for data (published):
      DONE: found in archives
      • TWIST-1

        '/home/andrei/Dropbox/workspaces/JHU/Ewald Lab/TWIST1_ECAD/Hits.csv',

      • K14 (Veena?):

        '/home/andrei/Dropbox/workspaces/JHU/Ewald Lab/Veena data/both_HUM.csv', '/home/andrei/Dropbox/workspaces/JHU/Ewald Lab/TWIST1_ECAD/All_genes.csv'

      • Kp/Km

        '/home/andrei/Dropbox/workspaces/JHU/Ewald Lab/Kp_Km data/top_100_hum.csv',

      • Collagen vs Matrigel

        '/home/andrei/Dropbox/workspaces/JHU/Ewald Lab/Matrigel vs Collagen/Matrigel_vs_collagen-tumor.tsv'

    • OTHER VALIDATIONS:
      • Breast Cancer cell lines aneuploidy
      • Replicate the COVID19 patient fluids diff expression
  • TODO: [PAPER]: generate the plot to justify the Gumbel distribution choice as fitting the max value (see the sketch after this list)
  • TODO: [PAPER]:
    • INTEST: run the chr11 re-analysis
    • TODO: replicate the COVID19 patient fluids diff expression
    • NOPE: p53 in case of Akshay

      Data will be hard or impossible to find

    • INTEST: Veena networks
  • TODO: [PAPER]: Ablation study
    • DONE: Code to perform the ablation study comparison:
      • DONE: compare calls
      • DONE: compare call groups
      • DONE: generate ablations file to be compared
    • TEST: Hits degradation:
      • randomly remove 5%, 10%, 20% and 50% hits
      • randomly remove 5%, 10%, 20% and 50% of lowest hits
    • TEST: Random noise in hits:
      • replace 5%, 10%, 20% and 50% hits with random node sets
    • TODO: Size of background samples
      • perform a sampling with 5, 10, 20, 25, 50 and 100 background reads
    • TEST: Weighting:
      • Weighted vs unweighted
    • TODO: Network degradation
      • interactome: randomly remove 5%, 10%, 20% and 50% of edges
      • annotome: randomly remove 5%, 10% and 20% of annotation attachments on proteins
    • TODO: Resistance to poisoning (Baggerly-robustness)
      • take a random set of nodes
      • show absence of calls
      • sprinkle a test dataset (glycogen biosynthesis)
      • show that only that cluster pops up
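
  A minimal sketch of the Gumbel justification plot mentioned in the [PAPER] item above, assuming the background maxima come from the sampled flows stored in MongoDB (simulated here with a gamma draw; names and sizes are illustrative):

      import numpy as np
      from scipy import stats
      import matplotlib.pyplot as plt

      rng = np.random.default_rng(0)
      # stand-in for the real background: max flow over each of 1000 background
      # samples, each made of 500 simulated per-node flow values
      maxima = rng.gamma(shape=2.0, scale=1.0, size=(1000, 500)).max(axis=1)

      loc, scale = stats.gumbel_r.fit(maxima)
      grid = np.linspace(maxima.min(), maxima.max(), 200)

      plt.hist(maxima, bins=40, density=True, alpha=0.5, label="sample maxima")
      plt.plot(grid, stats.gumbel_r.pdf(grid, loc, scale), label="fitted Gumbel")
      plt.legend()
      plt.savefig("gumbel_fit_justification.png")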

Current refactoring:

  • TODO: [KNOWN BUG] [CRITICAL]: pool fails to restart on cholmod usage

    There seems to be an interference between the "multiprocessing" pool method and a third-party library (specifically cholmod). It looks like after a first pool spawn, cholmod fails to load a new solution again.
    - DONE: try re-loading the problematic module every time => fails
    - DONE: try to explicitly terminate the pool => it didn't work in the end
    - TODO: extract a minimal example and post it to StackOverflow
    - DONE: add an implicit switch between explicitly multi-process and implicitly single-threaded execution, depending on whether there is a single element to analyze or several
    - DONE: temporary patch applied and flagged as a known bug
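
  A minimal sketch of the temporary patch described above: dispatch to an explicit multiprocessing pool only when several elements are to be analyzed, and stay implicitly single-threaded otherwise, so cholmod is never re-loaded inside a re-spawned pool (function and argument names are illustrative, not the actual BioFlow API):

      from multiprocessing import Pool

      def run_sampling(payloads, worker, processes=4):
          if len(payloads) == 1:
              # implicitly single-threaded path: sidesteps the pool/cholmod interference
              return [worker(payloads[0])]
          # explicitly multi-process path: the pool is terminated when the block exits
          with Pool(processes=processes) as pool:
              return pool.map(worker, payloads)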

  • TODO: rebuild and upload the project to the PyPI
    Delayed until after the review
    • TODO: rebuild and upload to the testPyPI
    • TODO: test it inside a docker instance
    • TODO: rebuild and upload to the PyPI
    • TODO: test it inside a docker instance
  • TODO: [REFACTOR]: factor out the process spawning logic shared between knowledge and

    interactome analysis to a "utility" domain of BioFlow

  • TODO: [SANITY] [REFACTOR]: sanify the databases management
    • TODO: create the data_stores package and move everything relating to mongoDB and neo4j there
    • TODO: remove the GraphDeclarator.py, fold the logic directly into the cypher_drivers
    • TODO: rename db_io_routines to bionet_routines
    • TODO: import cypher_drivers as bionet_store
    • TODO: rename cypher_drivers to _neo4j_backend
    • TODO: import the mongodb.py as an alias with samples_storage
    • TODO: fold the laplacians .dump object storage in dumps as auxiliary_data_storage
    • TODO: put a type straight-jacket
    • TODO: move the internet_io to the data_stores package
  • TODO: [TESTING]: write integration test suite
    • TODO: implement docker testing
    • TODO: check computation speed
    • TODO: check integration test coverage
  • TODO: add the model_assumptions filter to the auto_analyze of the interactome_analysis as well

    as the annotome_analysis

  • TODO: [SANIFY][REFACTOR] Add a typing module with shared types
  • TODO: [FEATURE]: Factor out the structural analysis of the network properties to a module
    • TODO: basically eigenvalues + eigenvector for the largest one
    • TODO: tools used with Mehdi for the analysis of the network
    • TODO: create a bioflow.var folder and put scripts there
      • TODO: gene essentiality analysis => it's a different project and needs to be moved

        there with bioflow as a dependency

      • TODO:

<Environment registration>

  • TODO: build status.yaml in $BIOFLOWHOME/.internal
    • TODO: gets written to upon
      • databases downloads ['DOWNLOAD section']: name + date of download + hash
      • organism definition
      • neo4j filling: upon a neo4j "build"
      • Laplacians constructions
      • Translation of a dataset
    • TODO: on each addition to the stack, everything that is above a certain layer gets

      removed

    • TODO: on an addition to the stack, if a next level is to be added without the previous one

      existing, the level gets nuked

    • TODO: gets copied to the run folder upon each run
      • upon a run, the status.yaml gets copied into the base folder
      • and a commit # gets added to it
      • plus a flag noting whether any uncommitted changes are present in the tracked files inside .bioflow
  • TODO: [USABILITY] store a header of what was analyzed and where it was pulled from + env

    parameters in a text file in the beginning of a run.

  • TODO: define a persistent "environment_state" file in the $BIOFLOWHOME/.internal/
    • TODO: log the organism currently operating

      => check_organism() -> base, neo4j, laplacians => update_organism(base, neo4j, laplacians) -> None

    • TODO: log the organism loaded in the neo4j
    • TODO: log the organism loaded in laplacians
    • TODO: define a "check_org_state" function
    • TODO: define an "update_org_state" function
    • TODO: make sure that the organisms in operating/neo4j/laplacian are all synced

      => "check_sync()" (calls check_organism, raises if inconsistency)

    • TODO: make sure that the neo4j is erased before a new organism is loaded into it.

      => "check_neo4j_empty" (calls check_organism, checks that neo4j is "None")

    • TODO: make sure that the retrieved background set is still valid
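
  A minimal sketch of the persistent "environment_state" file and the check/update functions named above; the YAML layout and key names are assumptions for illustration:

      import os
      import yaml

      _STATE_FILE = os.path.join(
          os.environ.get("BIOFLOWHOME", os.path.expanduser("~/bioflow")),
          ".internal", "environment_state.yaml")

      def check_organism():
          """Return the (base, neo4j, laplacians) organisms, or Nones if unset."""
          if not os.path.isfile(_STATE_FILE):
              return None, None, None
          with open(_STATE_FILE) as state_file:
              state = yaml.safe_load(state_file) or {}
          return state.get("base"), state.get("neo4j"), state.get("laplacians")

      def update_organism(base, neo4j, laplacians):
          os.makedirs(os.path.dirname(_STATE_FILE), exist_ok=True)
          with open(_STATE_FILE, "w") as state_file:
              yaml.safe_dump({"base": base, "neo4j": neo4j,
                              "laplacians": laplacians}, state_file)

      def check_sync():
          """Raise if the operating/neo4j/laplacians organisms are out of sync."""
          base, neo4j, laplacians = check_organism()
          if not (base == neo4j == laplacians):
              raise RuntimeError("organism state is inconsistent: %s / %s / %s"
                                 % (base, neo4j, laplacians))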

<END Environment registration>

<Sanify BioKnowledge>

  • TODO: Develop a pluggable Informativity weighting function for the matrix assembly
  • TODO: Allow for a score for a physical entity term attachment to the ontology system
    • eg. GO attachment comes from UNIPROT
    • Reactome attachment comes from Reactome and can be assigned a linkage score.
  • DONE: inline the reach computation to remove excessively complex function

<Type hinting, typing and imports>

  • TODO: [SANITY][REFACTOR]: put all the imports under the umbrella making clear where they come from
  • TODO: move the models and types into a top-level file "typing", containing all the class models
  • TODO: [MAINTAINABILITY][REFACTOR]: put a straightjacket on the types of the tuples passed

    around and on function type signatures => Partially done, long-term project

  • TODO: [SANITY] convert the dicts to Type Aliases / NewType and perform proper type hinting

    => Partially done, long-term project

  • TODO: [SANITY][REFACTOR]: define appropriate types: => Partially done, long-term project
    • neo4j IDs
    • laplacian Matrix
    • current
    • potential

<Pretty progress>

  • TODO: [USABILITY] Improve the progress reporting

    Move the INFO output to a progress bar. The problem is that we are working with multiple threads in an async environment. This can be mitigated by using the aptbar library.

    • TODO: single sample loop to aptbar progress monitoring
    • TODO: outer loop (X samples) to aptbar progress monitoring
    • TODO: move parameters that are currently being printed in the main loop in INFO channel to

      DEBUG channel

    • TODO: provide progress bar binding for the importers as well
  • TODO: [USABILITY]: fold the current verbose state into a -v/--verbose argument

DONE SEPARATOR:

  • DONE: sort clusters by p-value
    • DONE: there seems to be a bug where most clusters don't get output correctly anymore

      => Nope, the behavior is correct - there were just no correct calls made

  • DONE: there is a problem with the sparse_sampling toggle being stuck on -1 even in the cases

    where it should not be.

  • DONE: there is a problem with trimming the length of sampled sets
    • current hypothesis is that it's due to duplicate neo4j ids that get eliminated during the

      translation to the matrix_ids

    • hypothesis is confirmed by the sampling engine not having the replace set to False in the

      np.random.choice

  • DONE: re-enable the env_skip flags in InteractomeInterface
  • NOFX: [FEATURE]: Bayesian re-weighting
    A possible implementation of this feature is to provide a mechanism that would sample the

    flow through the network based on provided pairs/groups and correct the resistances to make sure the generated flow is non significant (aka set things to 1 by dividing by the information flow).

    • TODO: sample a large set of nodes, non-normalized
    • TODO: calculate the resulting flows
    • NOFX: We are solving the same problem through statistics due to the difficulty of defining a

      proper prior and the amount of calculation needed to get there.

DONE: document the need to increase the Java heap of neo4j when operating the database on the human interactome knowledge.

DONE: [FEATURE]: Factor out the clustering analysis of the network to a different function in

the knowledge/interactome analyses

- DONE: write a significance analysis function,
  • taking in the UP2UP tension + UP2UP background tension
  • if the analysis was dense
  • hierarchically clustering the matrix
  • sorting the clusters by size and average flow
  • for each size, compare the flow intensity
  • use Gumbel to determine significance
  • DONE: replicate it for the knowledge analysis
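
  A minimal sketch of the clustering step described above, assuming `up2up_tension` is a dense, symmetric UP-to-UP tension/flow matrix (the names and the distance transform are illustrative, not the actual BioFlow code):

      import numpy as np
      from scipy.cluster.hierarchy import linkage, fcluster
      from scipy.spatial.distance import squareform

      def cluster_by_flow(up2up_tension: np.ndarray, n_clusters: int = 8):
          # higher flow = closer, so invert the flow to get a distance matrix
          distance = 1.0 / (up2up_tension + 1e-9)
          np.fill_diagonal(distance, 0.0)
          condensed = squareform(distance, checks=False)
          labels = fcluster(linkage(condensed, method="average"),
                            t=n_clusters, criterion="maxclust")
          clusters = []
          for label in np.unique(labels):
              members = np.where(labels == label)[0]
              mean_flow = up2up_tension[np.ix_(members, members)].mean()
              clusters.append((members, len(members), mean_flow))
          # sort the clusters by size, then by average internal flow
          return sorted(clusters, key=lambda c: (c[1], c[2]), reverse=True)
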
DONE: [SANITY][REFACTOR]: Integration test grid
  • Flat prim, weighted prim, flat prim/sec, weighted prim/sec
  • Knowledge & Interactome
  • No background, flat background, weighted background
DONE: write the expected environmental variables:
  • NEOPASS
  • BIOFLOWHOME

DONE: test docker deployment

DONE: rebuild and test for human deployment

DONE: it seems that the current architecture of neo4j is having trouble with very large transactions
  • DONE: implement the autobatching
    • we are adding a new parameter to the nodes being processed in batch: n.processing
    • it is cleared after a request goes through
    • we are limiting calls by the WITH n LIMIT XXX statement
    • where XXX is autobatching parameter
    • and performing a batching loop in self._driver.session, but around the session.write_transaction
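
  A minimal sketch of the autobatching loop described above, assuming a neo4j `self._driver` and an `n.processing` marker that is cleared as each batch goes through (the label, property name and batch size are illustrative):

      def _clear_batch(tx, batch):
          result = tx.run(
              "MATCH (n:Annotation) WHERE n.processing = true "
              "WITH n LIMIT $batch "
              "SET n.processing = false "
              "RETURN count(n) AS done", batch=batch)
          return result.single()["done"]

      def flush_processing(driver, batch=1000):
          with driver.session() as session:
              while True:
                  # the batching loop sits around write_transaction, inside the session
                  if session.write_transaction(_clear_batch, batch) == 0:
                      break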

TODO: document new features
DONE: CLI:
  • DONE: add support for secondary sets
  • DONE: add support for mail report of completion (import log as well and patch the log)
  • DONE: switch pure boolean options to flags; propagate to readme examples

DONE: todoc pass
DONE: Three levels of usage:
  • DONE: Basic >>> readme
    • DONE: rewrite an example for basic usage, from CLI and example line
    • DONE: mention the integration tests and the samples shipped for unittests
  • DONE: Advanced
    • DONE: secondary set
    • DONE: weighted set
    • DONE: weighted background
    • DONE: reweighting/specific nodes exclusion
  • DONE: Deep dive
    • DONE: adding to the main database
    • DONE: Weighting schema modification
    • DONE: Pair generation method
    • DONE: Statistical significance
    • DONE: Sampling method
    • DONE: GDF

- NOFX : change the active organism by simple list and then read from the main_configs

(NOFX: major refactor, interferes)

DONE: clean up dead code
  • DONE: Delete old code path
  • DONE: Clear deprecation markers
  • DONE: Follow up with the propagate from main configs markers
  • DONE: Follow up with the renaming of nodes
  • TODO: Clear the dangling currentpass and tracing/intest/todoc todos
  • DONE: check for doc inconsistency
  • DONE: switch pool spin-up to internal function (_)
  • NOFX: switch the calculation of the sparsity into the active samples loading function, drop

    elsewhere: calculated only in the auto_analyze so far

  • DONE: move debug prints/logs to the log.debug
  • DONE: modify so that the line is printed only into the debug log and not into the info.log

    (aka console log)

  • DONE: figure out wtf is wrong with the exception reporting through the SMTP
  • DONE: disable all sys.excepthooks in logging and smtp logging and insert explicit wrappers
  • DONE: correct all the sparse_sampling in the docs
  • NOFX: in knowledge interface, find instances of the coupled LegacyID and name and rename them

    (NOFX: they are uniprot-specific and are used in iterations. That would be a major refactor)

<Memoization of actual analysis runs>

  • NOFX: [FEATURE] currently, re-doing an analysis with an already analyzed set requires a complete

    re-computation of flow generated by the set. However, if we start saving the results into the mongoDB, we can just retrieve them, if the environment and the starting set are identical. (NOFX: interferes with exclusion-based re-analysis)

  • NOFX: [USABILITY] move the dumps into a mongo database instance to allow swaps between builds
    • wrt backgrounds and the neo4j states (NOFX: implemented otherwise)
  • DONE: [USABILITY] since 4.0 neo4j allows multi-database support that can be used in order to

    build organism-specific databases and then switch between them, without a need to rebuild

  • NOFX: [USABILITY]: allow a fast analysis re-run by storing actual UP groups analysis in a

    mongo database - aka a true memoization. (NOFX: interferes with exclusion-based re-analysis)

  • ????: [USABILITY] add the Laplacian nonzero elements to the shape one (????)

<Specific nodes/links exclusion>

  • DONE: provide a list of ids of the nodes to be excluded from the analysis
  • DONE: map the nodes to the concrete annotation/physical entity nodes
  • DONE: after loading the laplacian interface, find the affected nodes/node pairs
    • for nodes, null the corresponding row & column
    • for pairs of nodes, null the specific cell pairs indicating the connections
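
  A minimal sketch of the exclusion step described above on a scipy sparse laplacian; `matrix_index` maps a neo4j id to its matrix row/column (names are illustrative):

      from scipy.sparse import lil_matrix

      def exclude_from_laplacian(laplacian, matrix_index,
                                 excluded_nodes=(), excluded_pairs=()):
          editable = lil_matrix(laplacian)       # lil is cheap to edit element-wise
          for node_id in excluded_nodes:
              i = matrix_index[node_id]
              editable[i, :] = 0                 # null the corresponding row
              editable[:, i] = 0                 # ... and column
          for id_a, id_b in excluded_pairs:
              i, j = matrix_index[id_a], matrix_index[id_b]
              editable[i, j] = 0                 # null only the specific connection
              editable[j, i] = 0
          return editable.tocsc()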

<>

  • DONE: [USABILITY]: adjust the sampling spin-up according to how many "good" samples are

    already in the mongodb

  • NOFX: inline the mapping of the foreground/background IDs inside the auto-analyze set
    (NOFX: not an essential feature)
    • improves run state registration
    • removes an additional layer of logic of saving/retrieving
    • is not necessary now that background is used only for the sampling
  • NOFX: fold in the different policy functions into the internal properties of the Interface

    object and carry them through to avoid excessive arguments forwarding (NOFX: not an essential feature)

  • DONE: transplant those functions into the hash calculation

<new flow and sampling routines>

<Split sets>
  • DONE: provide infrastructure for the loading of split hits sets
    • The easiest thing to do will be to add a separation character to the loading dumps
    • From the UI/UX perspective, however, it is a pure nightmare
    • From the user logic, the first-class usage of the secondary set would be a nightmare as

      well - it is not a happy path, but rather an additional feature. To enable the secondary set analysis, we will then be using the

    • NOPE: the final decision is to add and document the secondary set start in hits with a

      special entry "TARGET SET" (see the sketch after this block)

    • PROBLEM: there is heavy interference with the parsing of weighted vs unweighted sets, which
      will be problematic.
      • DONE: we are splitting the hits_list into two in order to supply to the downstream tools
    • TEST:
      • DONE: Discovered that now there was an issue with unmapped values sticking around after

        the translation

      • DONE: Discovered an issue with floats not being properly parsed anymore
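
  A minimal sketch of the hits-file split on the special "TARGET SET" entry mentioned above: everything before the marker is the primary hits set, everything after it is the secondary/target set, and the optional weight column is parsed so the split does not interfere with weighted sets (the tab-separated layout is an assumption):

      def split_hits(lines, marker="TARGET SET"):
          primary, secondary = [], []
          current = primary
          for raw_line in lines:
              line = raw_line.strip()
              if not line:
                  continue
              if line == marker:
                  current = secondary      # everything below goes to the secondary set
                  continue
              fields = line.split("\t")
              # (id, weight) if a weight column is present, otherwise (id, None)
              weight = float(fields[1]) if len(fields) > 1 else None
              current.append((fields[0], weight))
          return primary, secondary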

<Weighting of the nodes>:

  • DONE: Define pairs in the sampling with a "charge" parameter if the parameters supplied by the
    • The problem is that there is no good rule for performing a weight sampling, given that there

      are now two distributions in the interplay

    • We however cannot ignore the problem, because we discretize a continuous distribution - something that is a VERY BAD PRACTICE (TM)
    • Basically, the problem is how to perform statistical tests to make sure not to make

      overconfident calls. => degree vs weight - based sampling?

  • DONE: allow for weighted sets and biparty sets to be computed:
    • DONE: modify the flow computation functions to allow for a current to be set
    • DONE: allow the current computation to happen on biparty sets and weighted sets
    • DONE: perform automated switch between current computation policies based on what is

      supplied to the method

  • DONE: allow for different background sampling processes
    • DONE: add two additional parameters to the mongoDB:
      • DONE: parameter of the sampling type (set, weighted set, biparty, weighted biparty)
      • DONE: parameter specific to the set type: (set size, set size + weight distribution,

        pairs, pairs + weights)

    • DONE: add a sampling policy transformer that takes in the arguments and returns a proper
      policy (see the sketch after this block):
      • DONE: set sampling
      • DONE: weighted set sampling
      • DONE: biparty sampling (~ set sampling)
      • DONE: weighted biparty sampling (~ weighted set sampling)
  • DONE: check that, if the background set is weighted, we can perform a sampling according to the
    weights indicated there
    • As of now, it is not used in sampling
    • DONE: check if it is parsed in the weighted version
    • DONE: check if it is propagated in the weighted version
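
  A minimal sketch of the sampling policy transformer referenced above: the two parameters stored alongside each sample in MongoDB (sampling type and its type-specific parameter) select the matching sampling routine; the policy bodies are illustrative placeholders:

      import numpy as np

      def matched_sampling_policy(sampling_type, type_specific_param):
          rng = np.random.default_rng()
          if sampling_type == "set":
              size = type_specific_param                         # set size
              return lambda pool: rng.choice(pool, size, replace=False)
          if sampling_type == "weighted_set":
              size, weight_distribution = type_specific_param    # size + weight distribution
              return lambda pool: list(zip(
                  rng.choice(pool, size, replace=False),
                  rng.choice(weight_distribution, size)))
          if sampling_type in ("biparty", "weighted_biparty"):
              # pair-based policies would be built analogously to the set-based ones
              raise NotImplementedError(sampling_type)
          raise ValueError("unknown sampling type: %r" % sampling_type)
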
<knowledge interface mirror>
  • DONE: Mirror the weight sampling modifications from InteractomeInterface/interactome_analysis to

    AnnotomeInterface/knowledge_analysis.

    - DONE: conduction routines (does not apply - the loop is accessed directly from the

    Knowledge loop due to filtering)

    - DONE: flow calculation methods
    • DONE: change the calculation of ops to the included method
    • DONE: change the decision to go sparse with
    • DONE: change the generation of the pairs to the one included in the flow calculation

      method

    - DONE: add support for split and weighted sets in knowledge_access_analysis
    • DONE: forwarding of the secondary set and hits set in the knowledge_access_analysis
    • DONE: forwarding of the weights in the knowledge_access_analysis
    - DONE: add support for split and weighted sets in the BioKnowledge interface:
    • DONE: - flow calculation/evaluation/reduction methods
    • DONE: - separation of active up_sample from the weighted sample
    • DONE: - switch active samples to private
    • NOFX: - explicit weight functions to be supplied upon a full rebuild
      • Add an explicit weight function transfer to allow the rebuild
    • NOFX: - fast_load: background logic needs to account for whether it is a list of ids or

      ids+weights (NOFX: currently does not work)

    • DONE: add active sample md5 hash that takes into account the flow and weights as well as

      sampling/flow calculation methods (see the hashing sketch after this block)

    • DONE: add self.set_flow sources, evaluate ops and reduce ops
    • DONE: sparse samples standardized to -1
    • DONE: random sampling is now a forward to the random sampling method in the "policy

      folder"

    • DONE: parse modification - propagate the options now
    - DONE: switch to sampling policy in KnowledgeInterface
    • DONE: revert the change to sparse_sampling in the text documentation of methods
    • DONE: pass the arguments down the pipeline
    • DONE: move self.entity_2_terms_neo4j_ids.keys() into self.known_UP_ids upon construction, then

      in all the references

    • DONE: upon debug discovered that the UNIPROT/GO parse is currently broken:
      The borked edges seem to be coming from the BioGRID database
      • DONE: a lot of uniprot node connections and "weak interactions" parse as GO terms
      • DONE: the proper GO terms are not loading - dangling legacy code
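
  A minimal sketch of the active-sample hash mentioned above: it folds in the hit ids, their weights and the names of the sampling/flow calculation methods, so that stored results are only re-used for an identical setup (field names are illustrative):

      import hashlib
      import json

      def active_sample_hash(hit_ids, weights, sampling_policy, flow_method):
          payload = {
              "hits": sorted((str(node_id), round(float(weight), 6))
                             for node_id, weight in zip(hit_ids, weights)),
              "sampling_policy": sampling_policy,
              "flow_method": flow_method,
          }
          blob = json.dumps(payload, sort_keys=True).encode("utf-8")
          return hashlib.md5(blob).hexdigest()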

- DONE: Remove the current defaults in the policies and allow the user to provide them explicitly upon module calls

  • DONE: perform the explicit background pass for BioKnowledge as for the Interactome
  • DONE: rename the 'meta_objects' to 'Reactome_base_object' in reactome_inserter.py

<Documentation>

  • DONE: [DOC] pass and APIdoc all the functions and modules
  • DONE: [DOC] document all the possible exceptions that can be raised
    • what will raise an exception
  • DONE: [SANITY] remove all old dangling variables and code (deprecated X)
  • DONE: [DOC] Document the proper boot cycle of the application
    • $BIOFLOWHOME check, use the default location (~/bioflow)
    • in case the user configs .yaml is not found, copy it from its own registry to $BIOFLOWHOME
    • use the $BIOFLOWHOME/config/main_config.yaml to populate variables in main_configs
    • use the information there to load the databases from the internet and set the parsing

      locations

    • be careful with edits - configs are read safely but are not checked, so you can get random

      deep python errors that will need to be debugged.

    • everything is logged to the run directory and the $BIOFLOWHOME/.internal/logs
  • DONE: [DOC] put an explanation of overall workflow of the library
  • DONE: [DOC] check that all the functions and modules are properly documented
  • DONE: [SANITY] move the additional from the "annotation_network" to somewhere saner =>

    Separate application importing BioFlow as a library

  • DONE: [DOC] document where the user-mapped folders live from Docker
  • DONE: [DOC] document for the user how to install and map to a local neo4j database
  • DONE: [REFACTOR] re-align the command line interface onto the example of an analysis pipeline
  • DONE: [SANITY] Docker:
    • Add outputs folder map to the host filesystem to the docker-compose
    • Remove ank as point of storage for miniconda in Docker

<node weights/context forwarding>

  • DONE: eliminate InitSet saving in KnowledgeInterface and Interactome interface
    • DONE: they become _background
    • DONE: _background can be set on init and is intersected with accessible nodes (that are saved)
      • on _init
      • on _set_sampling_background
  • DONE: perform the modification of the background selection and registration logic.
    DONE: First, it doesn't have to be integrated between the AnnotationAccess interface and

    ReactomeInterface

    DONE: Second, we can project the background into what can be sampled instead of

    re-defining the root of the sampling altogether.

    DONE: Background is no longer a parameter supplied upon construction, but only for the

    sampling, where it still gets saved with the sampling code.

    DONE: the transformation within the sampling is done

  • DONE: deal with the parse_type inconsistencies (likely remainders of a previous insertion
    that was off)
  • DONE: (physical_entity)-refines-(annotation)
  • DONE: (annotation)-refines-(annotation)
  • DONE: (annotation)-annotates-(annotation)
  • DONE: is_next_in_pathway still has custom_from and custom_to
  • DONE: change the way the connections between GOs and UPs are loaded into the KnowledgeInterface
  • DONE: [REFACTOR] The policy for the building of a laplacian relies on neo4j crawl (2 steps)
    and the matrix build:
    • neo4j crawl
      • A rule/routine to retrieve the seeds of the expansion
      • A rule/routine to expand from those seeds and insert nodes into the network
    • matrix build
      • creates the maps for the names, ids, legacy IDs and matrix indexes for the physical

        entities that will be in the interactome

      • connect the nodes with the links according to a weighting scheme
      • normalize the weights for the laplacian
  • DONE: STAGE 2/3:
    • DONE: parse the entire physical entity graph
    • DONE: convert the graph into a laplacian and an adjacency matrix
    • DONE: check for the giant component
    • DONE: write the giant component
    • DONE: re-parse the giant component only
    • DONE: re-convert the graph into a laplacian
  • TODO: ax the deprecated code and class variables in InteractomeInterface
  • TODO: ax the deprecated variables in the internal_configs
  • DONE: [REFACTOR] inline the neo4j classes deletion(the same way as self_diag)
  • DONE: [REFACTOR] On writing into the neo4j DB we need to separate the node types and edges:
    • node: physical entity nodes
    • edge: physical entity molecular interaction
    • edge: identity
    • node: annotation (GO + Reactome pathway + Reactome reaction)
    • edge: annotates
    • node: x-ref (currently the 'annotation' node)

    • edge: reference (currently the 'annotates' edge type)

    For compatibility with live code, those will initially be referred to via a parse_type property

  • DONE: [REFACTOR] add universal properties:
    • N+E: parse_type:
      • N:
        • physical_entity
        • annotation
        • xref
      • E:
        • physical_entity_molecular_interaction
        • annotates
        • annotation_relation
        • identity
        • reference
        • refines
    • N+E: source
    • N+E: source<property> (optional)
    • N: legacyID
    • N: displayName
  • DONE: [REFACTOR] check that the universal properties were added, raising an exception in
    • DB.link if parse_type or source is not defined
    • DB.create if parse_type not defined, source not defined, legacyID not defined or

      displayName not defined

  • DONE: [REFACTOR]
    • NOPE: Either add a routine that performs weight assignment to the nodes
    • DONE: Or crawl the nodes according to the parse_type tags, return a dict of nodes and a
      dict of relationships of the types:
      • NodeID > neo4j.Node
      • NodeID > [(NodeID, OtherNodeID), ] + {(NodeID, OtherNodeID): properties}
  • DONE: [DEBUGGING] write a tool that checks the nodes and edges for properties and numbers and
    then finds patterns
    • DONE: nodes
    • DONE: edges
    • DONE: patterns
    • DONE: formatting
  • DONE: PROBLEM:
    • 'Collections' are implicated in reactions, not necessarily proteins themselves.
    • Patch:
      • Either: link the 'part of collection' to all the 'molecular entity nodes'
      • Or: create 'abstract_interface'
      • Or: same
    • DONE: Due to a number of inclusions (Collection part of Collection, ....), we are going to

      introduce a "parse_type: refines"

Current rewriting logic would involve:

- DONE: Upon external insertion, insert as well the properties that might influence the weight computation for the laplacian construction
  - cross-link the Reactome nodes linked with a "reaction" so that it's a direct link in the database
- DONE: Change the neo4j crawl so that it uses the edge properties rather than node types
  - For now we will be proceeding with the "class" node properties as a filter
  - Crawl allowed to pass through edges with a set of qualitative properties
  - Crawl allowed to pass through nodes with a set of qualitative properties
  - Record the link properties {(node_id, node_id): link (neo4j object)}
  - Record the node properties {node_id: node (neo4j object)}
  - Let the crawl run along the edges until:
    - either the allowed number of steps to crawl is exhausted
    - or there are no more nodes to use as a seed
- DONE: change the weight calculation so that it uses the link properties that were recorded
  - use the properties of the link and the node pair to calculate the weights for both matrices

- NOPE: [FEATURE] [USABILITY] upon organism insertion and retrieval, use the 'organism' flag on the

proteins and relationships to allow for simultaneous loading of several organisms.
  - Superseded by the better way of doing it through multiple databases in a single neo4j instance

- DONE: record the origin of the nodes and relationships:
  • Reactome
  • UNIPROT
- DONE: define trust in the names of different databases and use it as a mask when pulling

relationships

- NOPE: [REFACTOR] remove the GraphDeclarator.py and re-point it directly into the cypher_drivers
  • It's already an abstract interface that can be easily re-implemented
- NOPE: [REFACTOR] wrap the cypher_drivers into the db_io_routines class
  • Nope, it's already an abstract interface
- DONE: [FUTUREPROOFING] [CODESMELL] get away from using _properties of the neo4j database

objects. => Basically, now this uses a Node[property_name] convention

- DONE: [PLANNED] implement the neo4j edge weight transfer into the Laplacian
  • DONE: trace the weights injection
  • DONE: define the weighting rules for neo4j
  • DONE: enable neo4j remote debugging on the remote lpdpc
  • DONE: change the neo4j password on remote lpdpc
  • DONE: add the meta-information for loading (eg organ, context, ...)
    • doable through a policy function injection

The other next step will be to register the context in the neo4j network in order to be able to perform loads of networks conditioned on things such as the protein abundance in an organ or the trust we have in the existence of a link.

  • neo4j database:
    • REQUIRE: add context data - basically determining the degree of confidence we want to

      have in the node. This has to be a property, because the annotations will be used as weights for edge matrix calculations and hence edges need to have them as well.

  • database parsing/insertion functions:
    • REQUIRE: add a parser to read the relevant information from the source files
    • REQUIRE: add an inserter to add the additional information from the parse files into

      the neo4j database

    • REQUIRE: an intermediate dict mapping refs to property lists that will be attached to

      the nodes or edges.

  • laplacian construction:
    • REQUIRE: a "strategy" for calculating weights from the data, that can reason on the in

      and out nodes and the edge. They can take in properties returned by a retrieval pass.

    • REQUIRE: the weighting strategy should be a function that can be plugged in by the end

      user, so of the form neo4j_node, neo4j_node, neo4j_edge > properties.

    • REQUIRE: the weighting strategy function should always return a positive float and be

      able to account for the missing data, even if it is raising an error as a response to it.

    • REQUIRE: the current weighting strategy will be encoded as a function using node types

      (or rather sources).
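
  A minimal sketch of the pluggable weighting strategy required above - a user-supplied callable of the form (neo4j_node, neo4j_node, neo4j_edge) -> positive float that raises on missing data; the property names and per-source trust values are assumptions for illustration:

      def source_trust_weighting(node_from, node_to, edge) -> float:
          # `edge` is assumed to expose its properties as a mapping
          source = edge.get("source")
          if source is None:
              raise ValueError("edge %r has no 'source' property, cannot weight it" % edge)
          base_trust = {"Reactome": 1.0, "UNIPROT": 0.8, "BioGRID": 0.5}.get(source, 0.25)
          # an optional per-edge confidence refines the per-source trust
          weight = base_trust * float(edge.get("confidence", 1.0))
          if weight <= 0:
              raise ValueError("a weighting strategy must return a positive float")
          return weight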

<DONE: CONFIGS sanity>

Current architecture:
  • user_configs.py, containing the defaults that can be overridden
  • XXX.yaml in $BIOFLOWHOME/configs that allow them to be overridden
  • the override is performed in the user_configs.py
  • main_configs imports them and performs the calculation of the relative import paths
  • all other modules import main_configs as configs and use variables as configs.var

- the injection is performed by an explicit if-else loop after having read the needed yamls in a sanitized way.

  • example_configs.yaml is deployed during the installation, for the user to modify.
  • the import will look for user_configs.yaml, that the user would have modified
  • the yaml file would contain a version of configs, se
  • DONE: how do we find where $BIOFLOWHOME is, to read the configs from?
    • Look up the environment variable. If none found, proceed with the default location

      (~/bioflow). Ask the user to register it explicitly upon installation.

  • DONE: define the copy to the user directory without overwrite
  • DONE: test that the edits did not break anything by defining a new $BIOFLOWHOME and pulling

    all online dbs again

  • DONE: move user_configs.py to the .yaml
A possible approach is to use the cfg_load library and the recommended good application practices:
  • define the configurations dictionary ()
  • load the defaults (stored in the configs file within the library)
  • load the $BIOFLOWHOME/configs/<> - potential overrides for the defaults
  • update the configurations with user's overrides
  • update the configurations with the environment variables (or alternatively read for the

    command line and inject into the environment)

- PROBLEM: we use typing/naming assist from the IDE

We can work around this issue by still assigning all the values retrieved from the config files to the variable names defined in the main_configs file. In this way, our default variable definitions are the "default", whereas the user's yaml file supplies potential overrides.
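
A minimal sketch of the override pattern described above, under the assumption that the defaults live as module-level variables (so the IDE typing/naming assist keeps working) and the user's yaml only overrides the keys it defines; file locations and variable names are illustrative:

    import os
    import yaml

    # module-level defaults: these are the "default" values, visible to the IDE
    neo4j_url = "bolt://localhost:7687"
    sampling_depth = 24

    _user_yaml = os.path.join(
        os.environ.get("BIOFLOWHOME", os.path.expanduser("~/bioflow")),
        "configs", "user_configs.yaml")

    if os.path.isfile(_user_yaml):
        with open(_user_yaml) as yaml_file:
            _overrides = yaml.safe_load(yaml_file) or {}
        # explicit, sanitized override of the known keys only
        neo4j_url = _overrides.get("neo4j_url", neo4j_url)
        sampling_depth = _overrides.get("sampling_depth", sampling_depth)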

  • PROBLEM: storing the build/background parameters for the Interactome_Analysis and the

    Annotome_Analysis classes

Intermediate problem: there is a loading problem for the BioKnowledgeInterface because the InitSet used at construction (~6721 nodes) is significantly bigger than the InitSet used to generate the version that is being put into storage.
- InitSet is loaded as InteractomeInterface.all_uniprots_neo4j_id_list. In case reduced_set_uniprot_node_ids is defined in the parameters, the source set from the InteractomeInterface is trimmed to only nodes that are present in the limiter list.
- InteractomeInterface is the one currently stored in a way that can be retrieved by a .fast_load() and has not changed since the last load. The .fast_load() call is performed

- It looks like the problem is in the fact that the rebuild uses a background limiter (all detectable genes), whereas the fast load doesn't.

=> Temporary fix:
  • DONE: define a paramset with background in the analysis_pipeline_example.
=> In the end, it is a problem of organization of the parameters and of the context.
  • TODO: Set a user flag to know if we are currently using a background.
  • TODO: Set checkers on the background loads to make sure we
  • TODO: Version the builds of the MolecularInterface and BioKnowledge interface
  • TODO: Provide a fast fail in case the environment parameters differ between the

    build and the fastload. Environment parameters are in the user_configs.py file.

  • TODO: this is all wrapped in the environment variables
  • TODO: for each run, this is saved as .env text file render.
  • DONE: [SANITY] Configs management:
    • DONE: move the organism to the '~/bioflow'
    • DONE: all the string '+' concatenations need to become os.path.join.
    - DONE: active organism is now the only thing that is saved. It is stored in a "shelve" file inside the ".internal" directory
    - NOIMP: fold in the sources for the databases into a single location, with a selector from "shelve" indicating which organism to load.
    - NOIMP: create a user interface command in order to set up the environment and a saving file that allows the configs to be saved between the users.

    - DONE: move the online_dbs.ini, mouse.ini, yeast.ini to the ~/bioflow/configs and add user_configs.ini to it to replace user_configs.py.

  • DONE: [SANITY]: move the location from which the base folder is read for it to be computed

    (for relative bioflow home insertion) (basically the servers.ini override)

The next step will be to register user configurations in a more sane way. Basically, it can either be a persistent dump that is loaded every time the user is spooling up the program or an .ini file in addition to the ones already existing

  • Persistent dump:
    • PLUS: Removes the need to perform reading in and out of .ini files
    • PLUS: Guarantees that the parameters will always be well formed
    • MINUS: is non-trivial to modify for the users
  • .ini files: => selected, but in the .yaml incarnation
    • PLUS: Works in a way that is familiar to most people
    • PLUS: Allows
    • MINUS: in case configurations are not properly defined, everything crashes
  • Both:
    • NOPE: a command line process to define the variables
    • NOPE: a command line to show all the active flags
    • DONE: the configs folder to be gutted of active configs and those moved to the

      ~/bioflow directory

    • DONE: a refactor to show transparently the override between the default parameters

      and the user-supplied parameters

    • DONE: a refactor to remove the conflicting definitions (such as deployment vs test

      server parameters)

  • DONE: [USABILITY] add a proper tabulation and limit float length in the final results print-out

    (tabulate: https://pypi.python.org/pypi/tabulate)
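
  A minimal sketch of the tabulate-based print-out mentioned above; the column names and rows are illustrative, not BioFlow's actual output schema:

      from tabulate import tabulate

      rows = [("UNIPROT:P04637", 12.431678, 0.00031),
              ("UNIPROT:Q9H9B1", 3.120004, 0.04200)]
      print(tabulate(rows,
                     headers=["node", "flow", "p-value"],
                     floatfmt=".3g",      # limits the float length in the print-out
                     tablefmt="simple"))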

  • DONE: [USABILITY] add limiters on the p_value elements that are printed out
  • DONE: [USABILITY] change colors of significant elements to red; all others to black (with alpha)

    - This modification is to be performed in the samples scatter and hist function in the interactome_analysis module

  • DONE: [FEATURE] [REFACTOR]:
    • add the selection of the degree width window for stat. significance calculation.
  • DONE: [FEATURE]:
    • add p-value and pp-value to the GO annotation export
  • DONE: Currently, performing an output re-piping. The output destinations are piped around thanks

    to a NewOutput class in main_configs, which can be initialized with a local output directory (and will be initialized in the auto-analyze function for both the interactome and the knowledge analysis network)

  • DONE: [DEBUG] add the interactome_network_stats.png to the run folder
  • DONE: [USABILITY]: fold the p-values into the GO_GDF export in the same way we do it for the

    interactome

  • NOIMP: [USABILITY] Add an option for the user to add the location for the output in the

    auto-analyse

  • DONE: [USABILITY] align the rendering of the conditions in annotations analysis with the

    interactome analysis

  • DONE: [DEBUG]: align BioKnowledgeInterface analysis on the InteractomeAnalysis:
    • Take the background list into account

    - Take into account the analytics UP list in the hashing (once background is taken into account)

    We are dealing with a problem of the annotation analysis network not loading the proper background (probably due to the wrong computation of the laplacian). At this point we need to align the Annotation analysis on the molecular analysis.
    - DONE: run git blame on the Molecular network interface, copy new modifications
    - DONE: run git blame on molecular network analysis, copy the new modifications

  • DONE: [SHOW-STOPPER]: debug why GO terms load as the same term
    • Not an issue - just similar GO terms of different types (eg entities)
  • TODO: [SANITY]: Feed the location of the output folders for logs with the main parameters
    • DONE: create a function to generate paths from a root location
    • DONE: define new "TODO"s : (TRACING, OPTIMIZE and CURRENT)
    • DONE: Move the "info" log outputs to the parameters
    • NOIMP: Allow the user to provide the names for the locations where the information will be

      stored

    • DONE: trace the pipings of the output / log locations
  • DONE: [USABILITY] add a general error log into the info files
  • DONE: [USABILITY] save the final table as a tsv into the run directory
  • DONE: [USABILITY] format the run folders with the list sent to the different methods
  • DONE: [USABILITY] add a catch-it-all for the logs
  • DONE: [OPTIMIZATION]: Profile the runtime in the main loop:
    • DONE: check for consistency of the sparse matrix types in the main execution loop
    • DONE: run a profiler to figure out the number of calls and time spent in each call. Common

      profilers include cProfile (packaged in the base python) and pycallgraph (although no updates since 2016). Alternatively, cProfile can be piped into gprof2dot to generate a call graph

    • DONE: biggest time sinks are:
      • csr_matrix.__binopt (likely binaries for csr matrices) (22056 calls, 81404 ms)
      • lil_matrix.__sub__ (7350/79 968)
      • lil_matrix.tocsr (7371/76 837)
      • sparse_abs (7350/58 700)
      • lil_matrix._sub_sparse (3675/48 192)
      • csr_matrix.dot/__mul__/_mul_sparse_matrix (~7350 / ~48 000)
      • triu (3675/47 592)
      • csr_matrix.tocsc (7353 / 47 253)
      • csc_matrix.__init__ (121332 / 47 253) (probably in-place multiplication is better)
    • DONE: first correction:
      • baseline: edge_current_iteration: 3675 calls, 362 662 ms, 40238 own time.
      • DONE: uniformize the matrix types towards csc
        • eliminated lil_matrix: performance dropped, lil_matrix still there)

          (3675 calls, 366 321ms, 41 525 own time)

        • eliminated a debug branch in get_current_through_nodes => No change

          (3675 / 367 066 / 40 238)

        • corrected all spmat.diag/triu calls to return csc matrices + all to csc => worse

          (3675 / 408 557 / 40 657)

        • tracked and imposed formats to all matrix calls inside the fast loop => 50% faster

          (3675 / 262 627 / 40 420) => csr_matrix still gets initialized a lot and a coo_matrix is somewhere. lil_matrix is gone now though

        • replaced all mat.csc() conversions by tocsc() calls

          => was more or less already done

    • DONE: profiled line per line execution.
      • sparsity changes are the slowest part, but seem unavoidable
      • followed by triu
      • followed by additions/multiplications
      • followed by cholesky
    • DONE: delay the triu until after the current accumulator is filled
      • baseline: 261 983
      • after: 197 888 => huge improvement
    • NOPE: perform in-place multiplication => impossible (no in-place dot/add/subtract

      versions)

    • TODO: clean-up:
      • DONE: remove the debug filters connectors
      • DONE: deal with the confusing logic of enabling the splu solver
        • problem => We run into the optimization of using a shared solver
      • DONE: test the splu and the non-shared solver branch
        • Non-sharing works, is slow AF
        • Splu switch works, but is slow AF
  • DONE: resolve the problem with the "memoization" naming convention. In our case it's remembering

    potential diffs. It appears that it also interacts with "fast load" in "build extended conduction system". Technically, by performing a memoization into a database, we could have a searchable DB of past runs, so that the comparison is more immediate. So far the usage is restricted to InteractomeInterface.compute_current_and_potentials, to enable a fast load behavior

  • DONE: [DEBUG] [SHOW-STOPPER]: connections between the nodes seem to have disappeared
    • DONE: check if this could have been related to memoization. Unlikely. The

      only place where the results of memoized were accessed was for voltages > it is.

    • DONE: run test on the glycogen set. No problem detected there
  • DONE: [SANITY] Logs:

    => DONE: Pull the logs and internal dumps into the ~/bioflow directory => IGNORE: Hide away the overly verbose info logs into debug logs.

  • PTCH: [SANITY] allow user to configure where to store intermediates and inputs/outputs
  • DONE: [SANITY] move configs somewhere saner: ~/bioflow/ directory seems to be a good start
  • PTCH: [CRITICAL] MATPLOTLIB DOES NOT WORK WITH CURRENT DOCKERFILE IF FIGURE IS CREATED =>

    figures are not created.

  • DONE: [CRITICAL] ascii in gdf export crashes (should be solved with Py3's utf8)
  • DONE: [DEBUG]/[SANITY]: MongoDB:
    • DONE: Create a mongoDB connection inside the fork for the pool
    • DONE: Move MongoDB interface from configs into a proper location and create DB-type-agnostic bindings
  • DONE: [SHOW-STOPPER] Memory leak debugging:
    • DONE: apply muppy. Muppy did not detect any object bloating > most likely comes from matrix

      domain

    • DONE: apply psutil-based object tracing. The bloat appears around summation + signature

      change operations in the main loop > sparse matrix summation and type change seem to be the origin.

    • DONE: try to have consistent matrix classes and avoid implicit conversions. Did not help

      with memory, but accelerated the main loop by 10x. Further optimization of the main loop might be desirable

    • DONE: try to explicitly destroy objects with _del and calls of gc. Did not mitigate the

      problem. At this point, the memory leak seems to be localized to C code for the summation/differentiation between csc_matrices.

    • DONE: disable multithreading to see if there is any interference there. Did not help
    • DONE: extract the summation to create a minimal example: did not help
    • DONE: build a flowchart to see all the steps in matrices to try to extract a minimal

      example. Noticed that memoization was capturing a complete sparse matrix. That's where the bloat was happening.

    • DONE: correct the memoization to remember the currents only.
  • DONE: Threading seem to be failing as well.

    The additional threads execute the first sampling, but never commit. Given that they freeze out somewhere in the middle, the most likely hypothesis is that they run out of RAM and only one thread - that keeps a lock on it - continues going forwards. In the end it is due to memory leak.

  • DONE: [DEBUG]: sampling pools seem to be sharing the random ID now and not be parallel. CPU

    usage however indicates spawned processes running properly

  • DONE: Random ID assignments to the threads seem to be not working as well
    • DONE: rename pool to thread
    • DONE: add the ID to the treads
    • DONE: debug why objects all share the same ID across threads (random seed behavior? - Nope).

      The final reason was that thread ID was called in Interactome_instance initialization and not

  • DONE: [USABILITY] change ops/sec from a constant to the average of the last run (was already

    the case)

  • DONE: debug the issue where the all_uniprots_id_list intersection with background leads
    to error-prone behavior. Errors:
    1. injection of non-uniprot IDs into the connected uniprots
    2. change of the signature of the Interface instance by changing:
    • analytics_uniprot_list in InteractomeInterface
    • analytics_up_list in BioKnowledgeInterface
    The issue seems to be stemming from the following variables:

    In `InteractomeInterface`:
    - self.all_uniprots_neo4j_id_list (which is a pointer to self.reached_uniprots_neo4j_id_list)
    - self.connected_uniprots
    - self.background
    - self.connected_uniprots and self.background are directly modified from the auto_analyze routine and then
    - The operation above is cancelled by random_sample specifically

    Which is probably the source of our problems. Now the issue is how to get rid of the problem with nodes that failed - the issue only emerges upon the sparse sampling branch firing.
    - self.entry_point_uniprots_neo4j_ids is used by auto_analyze to determine sampling depth, is set by the set_uniprot_source() method and is checked by the get_interactome_interface() method

    In `BioKnowledgeInterface`:
    - self.InitSet (which is all_uniprots_neo4j_id_list from the InteractomeInstance from which the conduction system is built)
    - self.UPs_without_GO

    The intermediate solution does not seem to be working that well for now: the sampling mechanism also tends to pull the nodes that are not connected to the giant component in the neo4j graph.

    - Tentatively patched by making the pull from which the IDs are sampled stricter. Seems to work well

  • DONE: [SHOW-STOPPER]: ReactomeParser does not work anymore, likely a node issue.

    The issue was with an automated renaming during a refactoring to extract some additional data

  • DONE: [FEATURE]: (done by defining a function that can be plugged to process any tags in neo4j)
    • In Reactome, parse the "Evidence" and "Source" tags in order to refine the laplacian weighting

Bigger refactors

  • TODO: [OPTIMIZATION]:
    • bulk-group the insertions and cross-linkings for the Reactome
  • TODO: [USABILITY] pull inlined updates printing from evoGANs project.

    => Currently the percentages are managed by log.info(calls) => Providing an in-line update would require a print(<log message>, end='\r') => Change log management so that the info gets logged into a file without rising to the surface and couple all the log.info with a "print"

  • NOPE: [FEATURE] add flow calculation for real samples saving to mongodb, +

buffering (if unchanged InteractomeInstance and other secondary formatting, just retrieve the flow from the database: we added support for matrix reweighting)

  • DONE: [REFACTOR] refactor the entire edge typing upon insertion, retrieval upon construction of

laplacian/adjacency matrix and setting of Laplacian weights

  • TODO: [FEATURE] Change the informativity of a connection between nodes from the max of their

informativities to the difference of their informativities. In this way, the total path is equivalent to the quantity of the information stored

  • TODO: [FEATURE] provide the interface for overlaying the molecular maps to check for

signatures/compare samples

  • TODO: [FEATURE] change the informativity computation so that the path between nodes through the

network is log of probability of being connected through that suite of GO terms.

  • DONE: [FEATURE] flow to a targeted set
  • DONE: [FEATURE] weighted targets flow
  • DONE: [FEATURE] modification of the Laplacian weights by the end user.
  • DONE: [FEATURE] import credence from the interaction databases
  • TODO: [FEATURE] add direct interactions for TFs in yeasts by combining

https://www.nature.com/articles/ng2012 and https://pubmed.ncbi.nlm.nih.gov/29036684/ for binary interaction filters

  • DONE: [DEV TOOL] performance evaluation run: compute compops for sampling a large pool of

genes in yeast

  • TODO: [TYPING] set the model for the database usage for the samples by performing

insertions/conversions inside the sample_storage databases wrappers (for mongodb: form the dict of query/payload)

  • DONE: neo4j and mongodb versioning based on the currently

active organism. For instance, all neo4j nodes and edges need to be marked with the organism tag.

  • DONE: [NOT THIS ITERATION] add a mail signalling to indicate the termination or crash of the

execution

Bulk Backlog:

Language of network alignment/explanation of net1 by net2: allows

to compare GO annotation with interactome, cell type specificity analysis or organ context. => Solved by the search of the average resistance between connected nodes of graph1 on graph2. Alternatively, the average flow intensity difference between a random sample of common nodes on graphs 1 and 2.

Problems uncovered with user while testing the Docker integration:
  • explain how directories on the OS are mapped to the directories on the Docker
  • Suggest that in case you are using Mac/OSX, you need to manually increase memory allocated to Docker to at least 16 GBs:
    • 2 GB for each database
      • ~7GB for each processor used to perform random sampling
Potential problems with pip installation:
  • the configs/dump files modified by the user

    => Move them to the ~/bioflow/ directory

Travis tester:
  • unittest
  • docker from Githubs
  • pip install
  • docker-compose
  • databases downloads

=> Build an export of the sampling current weights to figure out which nodes are offending. => Re-compute eigenvalues of the laplacian and add values to the network weighting nodes

=> Modify the insertion/retrieval pattern into the mongoDB to separate foreground from background runs
  • Start with the foreground run
  • continue with the backgrounds until satisfied
  • always possible to resume the sampling afterwards

Add a mention of what the parameters of the analysis launch were - what was built and where was the data loaded from?

We are using Interactome Interface for 5 independent reasons:
  • build the laplacian matrix
  • store the laplacian matrix
  • perform sampling on the laplacian matrix
  • calculate the stats on the sampling matrix => This is actually done in interactome analysis
  • (OK) export the rendering to the gdf => this actually is done by a separate object.
  • it might be a good idea to refactor it.

We can already factor out the two methods responsible for a laplacian matrix building.

(OK) Correct the HINT downloading and renaming

Switch to matrix instead of dict for a current/tension storing in a dense fashion

(OK) Implement output redirects - main_config Outputs patching does not seem to work - we need to create the object anew in case of need.

# correct the overestimation of flow gain for low-edge nodes in the network.

ADD to the documentation:
  • management of multicast to accession numbers and gene names.
    • random returns
    • cross-linked with 'is_likely_similar' links, that are imported to Laplacian with
Logging and CLI wrappers:
  • redirect logging to stderr
  • add version flag (version + commit #)
  • add autocomplete
  • dump gdf to stdout?
  • check option prompts
  • provide an interface to inform of the program completion (?)
  • add a spinner for slow processes
Documentation:
  • installation:
    • docker
    • Ubuntu
  • usage:
    • core command lines
    • core Python
    • post-processing for analysis
  • neo4j manual usage
    • Cypher
    • access to non-local neo4j instance
    • useful commands
    • what to do if the commands are slow (optimized for use case of the Bioflow, not necessarily best)

Bulk Backlog Done:

YEAST:

DISABLE TRRUST/human/mouse-specific imports: DONE

Next steps, in order:
  • (DONE) dump of indexed nodes Legacy Ids and a method to compare them (in the "DB inspection" realm)
  • (DONE) Delete dead branches, break dependency on bulbs
    • think if we could do testing for a neo4j build
  • (DONE) build a new docker image
There still seems to be a problem of regular convergence to the same paths in the network. Potential sources:
  • borked topology
  • current intensity between the interconnected nodes (potentially resolved)
  • tight clusters due to cross-linking that disperse the network

=> Solved through statistics computation refactoring

Functional:

  • (OK) Enforce p-value limitation to what is achievable with the background sampling size.
  • (OK) Re-inject the p_values following the comparison back into the output and export them as part of gdf schema
  • (OK - Won't fix) Normalize the Laplacian (Joel Bader)
  • (OK - Won't fix) Increase the fudge factor to ~ 10% of error (Joel Bader)
  • (OK) Integrate the amplification level into the analysis - relative amplitude of perturbation
    • Would require to fold the hits list into the interface object
    • Would require to always use the hits list as
    • Would require an explicit injection of voltage (rather a normalization)
  • For dense sampling, switch to a matrix instead of a dictionary to store current/tension for each pair of proteins.
  • Enable a calculation with explicit sinks and sources groups
    • Would require a split in hits into "sink" and "source" groups
    • Would need a revised null model (random sinks/sources? if single sink, random sources?)
  • (OK) Add signal/noise ratio - the flow we are getting in a given node compared to what we would have expected in a random node.

    • => More or less already done by the p_value; except the p_value also accounts for the node of average degree X
  • Wrap neo4j and laplacian files loading/offloading into top-level commands
    • check if this is compatible with long-term neo4j architecture => have to; multi-org systems
  • Flow system comparison
    • change in the Laplacian
    • comparison of specific flow patterns
    • separation between instance, actual flow and flow comparator.

Structural:

  • (DONE) Create a separate structure for performing the statistical analysis, that is independent from the wrapper
  • (DONE) Enforce the single source of the Interface Objects for sampling to simplify consistency enforcing
  • Return the control of the reach-limiter from the Knowledge interface to the current routines (Why?)
  • (DONE) Insert the effective sample into the storage DB to avoid re-running it in case more background is needed or the background needs to be interrupted. (Implemented otherwise)
  • (DONE) In the random reference generator, reformat so that different types of queries are matched with the same types of queries. This is required in order to provide support for statistics on multiple query types calls.
    • There are two levels of tagging: conduction system (eg background, ....) and query system (hits, circulation, etc...)
    • There is also a need for specific querying for the conduction system to avoid re-generating them
    • As well as "hit" analysis systems, so that they can be retrieved instead of needing to be re-generated.
  • (Won't fix) Move node annotation loading/offloading to an elasticsearch instance, always mapping to uniprots
  • (DONE) The alteration of the chain of statistics calculation is somewhat hardcore

Integration:

  • (DONE) Update to CYPHER and a more recent neo4j instance accessed through bolt
    • possibility of using a periodic Cypher update (each 500-1000) instead of atomic?
  • (Rejected) Consider removing neo4j altogether => Nope, it's a good persistence solution

Cosmetic:

  • Properly indent multi-line :param <parameter type> <parameter name>: descriptors
  • (DONE) Integrate compops/second estimation to the sources.ini
  • (DONE) Perform profiling by creating a dedicated set of loggers that would log an "execution time" flag set
  • Add the forwarding of the thread number to the progress report
    • Reformat the progress report as a progress bar.
  • (DONE) The explanation of the model and how it translates into the published data needs to be clear enough to explain why specific hits are or are not justified.

Feature wishlist:

  • Add protein abundance level for instantiation of the network
  • Add a coarseness feature on the interactome analysis affecting sampling behavior, so that precision is sacrificed in favor of computation speed.
  • Build an "inspector routine" that would allow us to see the nodes that route the most information => we need to recompute the most central nodes in the Interactome, because we still observe a heavy skew towards nodes with a high degree.
  • We always need to build the Interactome Interface before the BioKnowledge Interface, and in the end we need to have both of them built before we can run the automated analysis. A nice fix would be to raise flags when they are loaded, instead of relying on the loading behaviors. (Deprecated)
  • (WillNotFix) Event sourcing pattern for the graph assembly and modification from the base databases. (Far Fetched)
  • The execution entry points have to be the five canonical queries.
  • Write the flow groups so that it is possible to calculate the information circulation between two sets or a set and a single protein (application for p53 and PKD1 regulators)
  • Distinction between downstream and upstream targets can be implemented by translating the directed graph into an associated Markov transition matrix (a minimal sketch follows at the end of this list). This will allow us to:

    • explicitly weight the importance of sources/sinks of information (match the distribution shape with the quantile distribution normalization)
    • account only for the information propagating downstream of the pathways, not both ways as is the case now. A Markov matrix differential system solution is a good idea as well, of the type F(t) = A*(F(t-1)+B); dA = A*F(t-1)+A*B-F(t)
    • synchronous computation of the flow for all sources/sinks
  • Add citations to the online database files, which allows integration of different sources into a single database (long-term applicability enhancement).
  • Clustering algorithms going beyond spectral clustering:

    • Not needing a pre-defined number of clusters
    • Able to assign the same node to several clusters
    • Maybe iterative DBSCAN or agglomerative clustering with removal of detected clusters until we hit some threshold on the number of nodes - average circulation in cluster curve obtained from random nodes sampling
    • We can deduce that graph from the clustering of sets of random nodes v.s.
  • Graph exploration module:

    • Strongest eigenvectors / highest circulation in a random set of nodes
    • Randomized clustering
  • Introduce signal over noise ratio: the amount of current in the current configuration compared to what we would have expected for a random set of nodes. We could introduce this as a bootstrap on a random subset of nodes to figure out which ones are random and which ones aren't
  • add @jit() wrapper in order to compile the elements within the current calculation routines.
  • Single command to change the neo4j instance being used or copied
    • Copy a designated database
    • Cd into the designated database and execute neo4j start/stop
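
A minimal sketch of the Markov-transition-matrix idea from the downstream/upstream item above; the row-normalization scheme and function names are assumptions rather than the project's actual conventions::

    import numpy as np

    def transition_matrix(directed_adjacency):
        """Row-normalize a directed adjacency matrix into a Markov transition matrix.

        Rows with no outgoing edges are left as absorbing states (all zeros here).
        """
        A = np.asarray(directed_adjacency, dtype=float)
        out_degree = A.sum(axis=1, keepdims=True)
        with np.errstate(divide='ignore', invalid='ignore'):
            T = np.where(out_degree > 0, A / out_degree, 0.0)
        return T

    def propagate_downstream(T, initial_importance, steps=10):
        # push importance only along edge directions (downstream), step by step
        f = np.asarray(initial_importance, dtype=float)
        for _ in range(steps):
            f = f @ T
        return f

    # toy example: 0 -> 1 -> 2, with a side branch 0 -> 2
    A = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [0, 0, 0]])
    T = transition_matrix(A)
    print(propagate_downstream(T, [1.0, 0.0, 0.0], steps=2))

Row-normalizing the directed adjacency restricts propagation to edge directions, which is what distinguishes downstream targets from upstream ones.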

Structure-required refactoring:

  • (DONE) separate the envelopes for the GO and Reactome graphs retrieval from the envelope used to recover and compute over the graph.
    • remove the memoization of individual pairs during the flow within the group computation
  • transfer the annotation search to an ElasticSearch engine.
    • remove the overhead of loading all the annotation nodes to the neo4j instance
    • allow efficient filtering on the node types. Currently type detection and filtering is done upon enumeration. In practice, this is not critical, because DB Ids from different databases have low intersection
    • approximate matching capabilities for gene name mistypings
  • Inline the background for the InteractomeInstance into the __init__
  • Inline the undumping and dependent variables calculation into the __init__ of InteractomeInstance and BioKnowledgeInterface
  • change the element imports for modules from which too many functions/objects are imported to whole-module imports (e.g. import numpy as np)
  • Make methods running large systems of procedures dictionary-driven (Why?)
  • Factor out the traversals used in order to build the Laplacians
  • Refactor the flow calculation as a calculation between two protein sets (see the sketch after this list):

    • Dense calculation or sampling is a strategy
    • Self-set is just when the two sets are equal
    • Circulation with a single protein is a special case when one of the sets contains a single element.

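
A minimal sketch of the two-set flow refactoring referenced above; the grounded-node trick for the singular Laplacian and all function names are illustrative, not the actual BioFlow API::

    import numpy as np

    def flow_between_sets(laplacian, sources, sinks):
        """Approximate information flow between two node sets on a graph Laplacian.

        Unit current is injected uniformly into `sources` and drawn from `sinks`;
        one sink node is grounded so the (singular) Laplacian becomes invertible.
        """
        L = np.asarray(laplacian, dtype=float)
        n = L.shape[0]
        current = np.zeros(n)
        current[list(sources)] = 1.0 / len(sources)
        current[list(sinks)] -= 1.0 / len(sinks)

        ground = list(sinks)[0]                      # ground one node
        keep = [i for i in range(n) if i != ground]
        potentials = np.zeros(n)
        potentials[keep] = np.linalg.solve(L[np.ix_(keep, keep)], current[keep])

        # edge currents can then be derived from potential differences and conductances
        return potentials

    # toy path graph 0-1-2-3: flow from {0} to {3}
    L = np.array([[ 1, -1,  0,  0],
                  [-1,  2, -1,  0],
                  [ 0, -1,  2, -1],
                  [ 0,  0, -1,  1]], dtype=float)
    print(flow_between_sets(L, sources={0}, sinks={3}))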

Good-to-have; non-critical:

  • In all the DB calls, add a mock-able wrapper that would read the state of a project-wide variable and, if it is set to True (in unittests), switch to a mock instead of expecting the database to answer (see the sketch after this list)
  • (DONE) Bulk-insertion into the neo4j. => Requires taking over the bulbs engine
  • Add active state memoization for the import commander, so that when an exception happens, it prints it, terminates gracefully and upon restart offers an option to resume from the point of failure while managing all the support
  • modify the config generator code so that there is only one place where the default configurations are stored and can be modified from the command line interface, instead of a complex CofigsIO class management. We actually have several levels of configs:
    • Configs that are required to properly stitch the code that were introduced during the development
    • Configs managing the third-party services
    • Configs that are specific to a deploy:
      • Where to direct the flow of the loggers at every level
      • Where the datastores are located
      • How to connect to a database
    • Configs that allow switching between organisms
      • Re-filling the database
      • Re-building the intermediate representations
      • Re-building the mongoDB reference and average heatmap
  • (DONE) build a conda-compatible package that would be installable cross-platform and would contain pre-compiled binaries for C-extensions. (Failed - we are better off with a Docker given the complexity of the stack)
  • In case we are calling time-consuming parsers from several locations, we might want to insert a "singleton" module into the block that performs all the parsing only once per program run.
  • We need to dynamically update the values of main_config whenever the configs from the config files are modified, so that their modification does not require restarting the program. Alternatively, we say that the configs need to be modified before the rest of the program can be executed.
  • Transform all the matrices so that the first one is packed row-major and the second one column-major. This would allow the processor to prefetch memory more efficiently.
  • Docker container cp command to accelerate the database rebuild process?
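
A minimal sketch of the mock-able DB-call wrapper from the first item of the list above; the module-level flag and decorator are hypothetical names, not existing BioFlow identifiers::

    from functools import wraps
    from unittest.mock import MagicMock

    # project-wide switch; unit tests would flip this to True
    USE_MOCK_DB = False

    def mockable_db_call(func):
        """Return a MagicMock instead of hitting the database when USE_MOCK_DB is set."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            if USE_MOCK_DB:
                return MagicMock(name='mocked:%s' % func.__name__)
            return func(*args, **kwargs)
        return wrapper

    @mockable_db_call
    def get_node_by_legacy_id(legacy_id):
        # the real implementation would query the neo4j driver here
        raise RuntimeError("database not reachable in this sketch")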

Neo4j and Laplacian construction:

  • Use a dictionary-configurable parser to parse from a given file format into the neo4j database (a sketch of such a parser configuration follows this list).

    • The dictionary must show what identifiers have to be recovered from the file and to what nodes they should be matched in the neo4j database
    • The dictionary must show how the relationships should be inserted into the neo4j database
  • All the insertions are added without node or edge duplication. In case of multiple insertions, additional key:value pairs are added to the annotation of the node or the edge
  • The Laplacian construction interface takes a dictionary providing instructions on how to compute the Laplacian or adjacency matrix from the key:map values, both for nodes and edges

    • It allows easy instantiation from the values for the nodes, such as protein/metabolite abundance in a tissue/organ, or suppression of an interaction because of a mutation
    • It allows a single routine to be used in order to perform different types of computation, such as the reliability of information transmission, likelihood of randomness/jitter, etc.
    • It allows a high degree of customization by the end user, beyond what would be suggested by the initial user
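
A minimal sketch of what such a dictionary-driven parser configuration could look like; the keys, column layout and relationship type are illustrative assumptions, not an existing BioFlow format::

    import csv

    # hypothetical parsing instructions: which file columns map to which node
    # identifier types, and how the relationship between them should be inserted
    PARSE_CONFIG = {
        'source': {'column': 0, 'match_on': 'UNIPROT_AC'},
        'target': {'column': 1, 'match_on': 'GENE_NAME'},
        'relationship': {'type': 'is_interacting', 'weight_column': 2},
    }

    def parse_to_edges(tsv_path, config=PARSE_CONFIG):
        """Yield (source, target, properties) triples described by the config."""
        with open(tsv_path) as source_file:
            for row in csv.reader(source_file, delimiter='\t'):
                source = {'id': row[config['source']['column']],
                          'match_on': config['source']['match_on']}
                target = {'id': row[config['target']['column']],
                          'match_on': config['target']['match_on']}
                properties = {'type': config['relationship']['type'],
                              'weight': float(row[config['relationship']['weight_column']])}
                yield source, target, properties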

New Databases:

  • Protein abundance
  • (DONE) Transcription/translation regulation
  • Post-translational modifications
  • Isoforms
  • Interactions that store annotation datasets, such as IntAct

Documentation and description:

Description:

Following the interaction with Wahid when I was explaining to him what my methods were doing:

  • Explanation of what the current is and how it relates to biology
  • Where are the pathways?
  • Print out the twist ratio into the GDF: observed-to-expected ratio / p_value
  • show how to install with Docker and provide the script to perform the installation on Ubuntu
  • write a quickstart guide
  • add pictures of what network analysis looks like

- Validation of results with retrieval of Pamela Silver's paper and John Overington's 300 essential targets: high average information flow and low abundance.

- Generate figures showing the highly-connected nodes in the Laplacian matrix corresponding to the common chemical molecules (ADP, ATP, Pi, ...). Explain that mechanisms related to such molecules would be better described in terms of propositions on actual biological knowledge and that we would need to run the two analyses in parallel: both on the concepts and the molecular entities.

- Generate the figures showing that taking into account the background that is efficiently reachable by a given experimental technique is critical for proper annotation retrieval, especially for the low-informativity terms. Give an example of techniques relying on the abundance change for detection, and how they would behave if we randomly sample from the background without first setting the background.

Internals high-level doc:

  • Limitations: no physical-path toxicity (such as rising pH, changing the O2 content or depleting ATP/ADP). They are managed by appropriate GO annotations
  • Retrieving the giant connected component and operating on it only.
  • Filtering out GOs without enough UP attachment (less than 2) to avoid infinite informativity (entropy reduction to 0).

GO Terms analysis techniques

  • Perform the statistics on the flow amount and the relation between the flow, informativity and confusion potential
  • Perform the statistics on the flow amount and tension for the partitions of the initial set of proteins to analyse
  • Recover the analysis of the independent linear groups of the GO terms.
  • Mutual information about the flow and different characteristics, such as informativity and confusion potential (which are in fact bijective)

Size and memoization pattern of the GO current system:

The current decision is that for samples of the size of ~100 UniProts, we are better off unpickling from 4 and more by a factor of 2, and by a factor of 10 from 9. Previous experiments have shown that memoization with pickling incurred no noticeable delay on samples of up to 50 UPs, but that the storage limit on MongoDB was rapidly exceeded, leading us to create an allocated dump file.

Specific module improvements:

This section contains rather general improvements we would like to see in different modules to make them more independent.

Better data package management:

Organize the data repository retrieval according to the Python pip convention:

- use package_data and include_package_data to load the pointers to the git repositories containing data location.ini files (see the sketch after this list).

- issue a command to add a git repository mapping a data shortname and data location to a downloadable format

- let the user input where the data should be stored on their machine before any actual download happens

  • store configuration folders in a $HOME/.data_manager/ domain
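
A minimal setup.py sketch for the pip-convention idea above; the package name and the location of the pointer .ini files are assumptions::

    from setuptools import setup, find_packages

    setup(
        name='bioflow',                      # assumed package name
        packages=find_packages(),
        include_package_data=True,           # pick up files declared in MANIFEST.in
        package_data={
            # ship the pointer .ini files mapping data shortnames to git repositories
            'bioflow.configs': ['*.ini'],
        },
    )

The user-chosen download location would then be written into the $HOME/.data_manager/ folder proposed in the last item above.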

Better Reactome parser:

Overall, we want to have a more general and more sane .owl parser

  • Add the parsing of Unification X-REF tags in the Reactome.
  • Unify the parsing structure to the iterative parsing of the tags.

- Define functions of transformation that will assemble the elements of the owl parsing into the class elements. (Flattening the structure)

  • In order to do this, define reduction functions:

    • Inline child's load
    • Discard that attribute
  • The computation of an individual parameter is actually an inlining of a

Beyond something that I am actually needing, this is an excellent exercise in writing a functional rdf parser that would use a Maybe monad (in case a child/parameter/etc. is not found).
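
A minimal sketch of the iterative-tag-parsing direction for the .owl parser; the BioPAX-style tag name below is illustrative and the streaming helper is a stand-in, not the current parser::

    import xml.etree.ElementTree as ET

    def iter_owl_elements(owl_path, wanted_tags):
        """Stream over an .owl/.rdf file and yield elements with wanted local tag names."""
        for event, element in ET.iterparse(owl_path, events=('end',)):
            local_tag = element.tag.rsplit('}', 1)[-1]   # strip the XML namespace
            if local_tag in wanted_tags:
                yield local_tag, element
                element.clear()                          # keep memory usage flat

    # usage (tag name assumed): inspect BiochemicalReaction entries one by one
    # for tag, elem in iter_owl_elements('Reactome.owl', {'BiochemicalReaction'}):
    #     print(elem.attrib)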

Some of the ideas specific to the bioflow project:

  • perform parsing of unification x-refs in all the meta-types and reactions in order to retrieve joins with other databases.
  • return the connecting databases with the number of connections and the number of entities getting connected
  • collapse meta-types into a single type and use a type field to distinguish them

Better DB_IO management for annot nodes:

We want to transfer the load of the indexing to an elasticsearch engine. In order to do that, we will suppress the annotation nodes, with their payload and payload typing, and transfer them to elasticsearch, both with respect to insertion and retrieval. This will allow us to get smaller neo4j networks and faster load times.

Beyond that, we would be able to use the mechanism for batch queries on elasticsearch when we are retrieving lists, to get bulbs identifiers immediately.
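
A minimal sketch of the elasticsearch transfer, assuming the elasticsearch-py 7.x client API; the index name and document shape are made up::

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch()   # assumes a local instance on the default port

    def index_annotations(annotations):
        """Bulk-insert {annotation_text: internal_node_id} payloads."""
        actions = ({'_index': 'bioflow-annotations',
                    '_id': text,
                    '_source': {'text': text, 'node_id': node_id}}
                   for text, node_id in annotations.items())
        bulk(es, actions)

    def batch_lookup(texts):
        """Retrieve internal node ids for a list of annotation strings in one call."""
        response = es.mget(index='bioflow-annotations', body={'ids': texts})
        return [doc['_source']['node_id'] for doc in response['docs'] if doc.get('found')]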

Utils and general Utils:

Wrappers:

  • debug wrapper that logs to the debug channel. In case we are performing a graphical debug, we log it as a picture saved to the project root with the name of the calling function and the time of the call (see the sketch below)
  • visual debugger for the matrix operations that allows us to specify which input and which output matrices we would like to inspect (by index)
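
A minimal sketch of the debug wrapper above; the logger channel name and the matplotlib image-dump convention are assumptions::

    import logging
    import time
    from functools import wraps

    import numpy as np
    import matplotlib
    matplotlib.use('Agg')                 # headless backend, safe for batch runs
    from matplotlib import pyplot as plt

    debug_logger = logging.getLogger('bioflow.debug')    # assumed channel name

    def graphical_debug(func):
        """Log the call on the debug channel and dump a returned 2D array as a picture."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            debug_logger.debug('entering %s', func.__name__)
            result = func(*args, **kwargs)
            if isinstance(result, np.ndarray) and result.ndim == 2:
                fname = '%s_%d.png' % (func.__name__, int(time.time()))
                plt.imsave(fname, result)  # picture named after the caller and call time
                debug_logger.debug('dumped %s output to %s', func.__name__, fname)
            return result
        return wrapper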

Information flow computation:

Flow with ponderation

  • transform the computation to allow for different amounts of information to be assigned to different nodes.
  • as a rule of thumb, the main computation core does not change, but the rules of normalization change (a minimal sketch follows below).
  • FLOW_1_2_IMPORTANCE = (NODE_1_IMPORTANCE/TOTAL_IMPORTANCE) * (NODE_2_IMPORTANCE/TOTAL_IMPORTANCE) = NODE_1_IMPORTANCE*NODE_2_IMPORTANCE/TOTAL_IMPORTANCE**2
  • FLOW_STACK = SUM OF FLOW_I_J_IMPORTANCE * FLOW(I, J)
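
A minimal sketch of the ponderation rule above; pairwise_flow is a stand-in for the actual flow routine::

    import itertools

    def weighted_flow_stack(node_importance, pairwise_flow):
        """Sum pairwise flows weighted by normalized node-importance products.

        node_importance: {node: importance}
        pairwise_flow:   callable (node_i, node_j) -> flow value
        """
        total = sum(node_importance.values())
        stack = 0.0
        for (n1, w1), (n2, w2) in itertools.combinations(node_importance.items(), 2):
            pair_importance = (w1 / total) * (w2 / total)   # = w1*w2 / total**2
            stack += pair_importance * pairwise_flow(n1, n2)
        return stack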

Flow with signs

  • calculate potentials separately, then perform a summation of potentials. Once potentials have been summed, calculate the information flow. This however does not reflect much presentation
  • An alternative is to implement a pressure propagation with sign inversion to account for positive/negative relations. Even though technically relying on the same Laplacian, we will need to re-implement routines computing the regulations:
    • We need to separate reliability flow from the sign propagation flow
    • We would need rules that enforce sign propagation only one way: down
  • All in all, we are switching to temperature diffusion on a Laplacian network. With respect to that, we need a "diffusion" module and a separate description of how to use the method.

Overall Mathematics

  • Get rid of the Cholesky decomposition: it is not applicable in our case because of the presence of null eigenvalues. In fact, there are as many null eigenvalues as there are connected components in the graph. (Error is acceptable, LU is slower)
  • Removed: replace pickling by JSON wherever applicable => numpy objects are not JSON-serializable
  • DONE: add the clustering of proteins according to the GO annotation similarity
  • TODO: add the evaluations of Zipf-ittude for the proteins
  • DONE: add random matrix filtering-out for the "too noisy" conductions
  • DONE: for the computation of the relevant computational values, normalize the connections graph. Use a Laplacian instead of the default graph for the decorrelation
  • TODO: add derivatives to analyse scaling factors for element participation in a complex: is this element a limiting factor for this complex or not? In case of level variation, the derivative will be the measure for the amount of trafficked information, whereas in case of a substantial modification (a mutation silencing a catalytic factor), this will be the only available one.
  • TODO: add negative/positive potentials for the linkages to the GO terms for true Up/Down regulation
  • TODO: orient Zipf-central concepts for different environments (yeah, but this is a direct bias, isn't it?) => Better deduce your own Zipf distribution
  • TODO: analyse the sign-connectivity of the GO terms analysis tools
  • TODO: add an adaptor for Markov model-like analysis - Problem 1: if we operate on big graphs, we are likely to run out of memory - Problem 2: we cannot necessarily normalise all the vectors, since some proteins are affecting several proteins at the same time
  • TODO: Add the 95% confidence interval for a given precision rate for the depth of sampling. For instance, if we want the p_value at a given precision with 95% confidence, we need to run not 25 samples, but rather 30 or something in that range (see the sketch after this list).
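
A back-of-the-envelope sketch for the sampling-depth item above, using the standard normal approximation for a binomial proportion confidence interval; it is not the project's current calculation::

    from math import ceil

    def samples_for_p_value(p=0.05, half_width=0.05, z=1.96):
        """Normal-approximation sample size: n >= z^2 * p * (1 - p) / half_width^2."""
        return ceil(z * z * p * (1 - p) / (half_width * half_width))

    # e.g. estimating p ~ 0.05 to within +/- 0.05 with 95% confidence:
    print(samples_for_p_value())   # ~73 background samples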

Features that would be nice to have:

New analysis features:

  • Derivative of GO term flows with respect to a network disruption or protein disruption
  • Negative/positive pressure injection & diffusion in order to account for positive/negative regulation in regulatory networks
  • Replace diffusion and flow matrices by causality matrices (directed transitions), allowing to account for upstream/downstream propagation
  • We need to replace eigenmatrix clustering by agglomerative clustering, so that some nodes can belong and be important for several clusters instead of having to choose one to which they belong more.
  • Stochasticity of transmission: Once we get the abundances of different proteins in the network,

Add protein domain state switches:

This will allow us to represent the changes in protein function following a post-translational modification or an association in a complex that would be hard to represent otherwise.

More generally, it is switching the distribution of instances between classes that can be converted one to another.

Add additional databases:

  • Perform a recovery of post-translational modification sites in the normal proteins
  • Perform a recovery of a larger database of the RNAs, both as protein transcription elements and as regulatory elements
  • Import the DNA / epigenetic annotation ontology into the database to account for DNA (un)-availability and for DNA transport towards specific regions (activation or repression regions)
  • Cast protein abundances into the database so that it becomes a once-and-for-all import. Problem: what are we to do in case we want to use a specific organ and not a general database?
  • Add organ specificity levels of protein expression
  • Database sources cited in the differential network biology paper by {Ideker 2012}:
    • BioGRID
    • HPRD (Human Protein Reference Database - human only)
    • IntAct (a good idea to integrate given the quality and extensiveness of the data)