add ability to filter out promiscuous intermediate nodes #330

andrewsu · 2021-10-21T17:56:39Z

In #324, we implemented a performance / stability tweak that caps the overall number of entities being held in memory during query execution. This issue tracks a further potential enhancement to our performance / stability efforts.

Imagine a three-node, two-hop predict query that looks something like Fanconi anemia [disease] - [gene] - [gene] (see "Example FA query" below). The first hop includes many genes that are very specific to FA (including the canonical genes FANCA, FANCB, FANCC, etc.). But it also includes many "promiscuous" genes like TP53 that will explode into many results in the second hop. We expect that results that go through TP53 will be down-prioritized in the subsequent sorting and ranking step based on something like the Normalized Google Distance. But having to track these entities and all the entities they are linked to does affect performance and stability -- the query below does not complete on my local machine, possibly due to this issue. In addition to promiscuous genes, there are also many promiscuous diseases (e.g., "cancer"), promiscuous drugs (e.g., "acetaminophen"), promiscuous anatomical entities (e.g., "brain"), etc.

Here, I suggest we create the ability to optionally filter out promiscuous nodes in the course of query execution. I don't know the exact mechanism of implementing this feature, so this probably deserves some brainstorming. Naively, I propose two options:

1. comparing to some explicitly enumerated list of "excluded entities" (either centrally maintained or user-specified, with different pros and cons)
2. dynamically assessing promiscuity and removing nodes via node attribute filters (filter on node attributes #174); we could query PubMed or our SemMedDB API as data sources to score promiscuity
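
To make the two options concrete, here is a minimal sketch in Python of what filtering the first-hop results might look like. Everything in it is hypothetical: the function name `filter_promiscuous`, the `excluded` set, the `degree` lookup, and the `max_degree` threshold are illustrative assumptions, not existing BTE code. A plain per-node count stands in for a promiscuity score; how that score would actually be computed (PubMed, SemMedDB, or something else) is still the open question.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Edge = Tuple[str, str]  # (subject_id, object_id) for one first-hop result


def filter_promiscuous(
    first_hop: Iterable[Edge],
    degree: Dict[str, int],
    excluded: Set[str] = frozenset(),
    max_degree: Optional[int] = None,
) -> List[Edge]:
    """Drop first-hop edges whose object node looks promiscuous.

    Option 1: the node is in an explicit `excluded` set.
    Option 2: the node's score in `degree` (a hypothetical, precomputed
    promiscuity proxy) exceeds `max_degree`. Nodes missing from `degree`
    are treated as score 0 and therefore kept.
    """
    kept: List[Edge] = []
    for subject, obj in first_hop:
        if obj in excluded:
            continue  # option 1: enumerated exclusion list
        if max_degree is not None and degree.get(obj, 0) > max_degree:
            continue  # option 2: score-based filter
        kept.append((subject, obj))
    return kept


# Illustrative data only; the counts below are made up.
edges = [("MONDO:0019391", "FANCA"), ("MONDO:0019391", "TP53")]
made_up_degree = {"FANCA": 12, "TP53": 5000}

by_score = filter_promiscuous(edges, made_up_degree, max_degree=1000)
by_list = filter_promiscuous(edges, {}, excluded={"TP53"})
```

With both filters left at their defaults the function returns every edge unchanged, which matches the idea that this should be opt-in per query.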

In addition to the question of how we will calculate a promiscuity score, we also need to decide how the user intent can be expressed in a TRAPI query (since the use of this filter would likely be use-case dependent). Is there a place where optional parameters can be specified in a TRAPI query?

Example FA query

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n01",
                    "object": "n02"
                },
                "e02": {
                    "subject": "n02",
                    "object": "n03"
                }
            },
            "nodes": {
                "n01": {
                    "categories": [
                        "biolink:Disease"
                    ],
                    "ids": [
                        "MONDO:0019391"
                    ]
                },
                "n02": {
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "n03": {
                    "categories": [
                        "biolink:Gene"
                    ]
                }
            }
        }
    }
}


colleenXu · 2021-11-19T06:32:51Z

This would be an example: https://monarchinitiative.org/disease/MONDO:0000001

andrewsu · 2023-08-23T16:34:31Z

I posted results to the FA query in #493 (comment). Closing this issue as roughly a duplicate of that issue.

andrewsu added this to todo in Translator project management (old) Oct 21, 2021

marcodarko moved this from todo to Marco in Translator project management (old) Nov 5, 2021

colleenXu mentioned this issue Dec 3, 2021

Graceful exit before full execution for queries that would have very large responses #363

Closed

colleenXu mentioned this issue Dec 22, 2021

clear language / documentation of vocab + data structures #379

Closed

colleenXu mentioned this issue Jun 24, 2022

implement "creative/inferred mode" for "what drugs may treat disease X" query #449

Closed

colleenXu mentioned this issue Apr 26, 2023

Scoring overhaul #634

Closed

andrewsu mentioned this issue Aug 23, 2023

how to refine a two-hop query that explodes on the first edge? #493

Open

andrewsu closed this as not planned Aug 23, 2023
