add ability to filter out promiscuous intermediate nodes #330

Closed
andrewsu opened this issue Oct 21, 2021 · 2 comments

Comments

@andrewsu
Member

In #324, we implemented a performance/stability tweak that caps the total number of entities held in memory during query execution. This issue tracks a further potential enhancement along the same lines.

Imagine a three-node, two-hop predict query of the form fanconi anemia [disease] - [gene] - [gene] (see the Example FA query below). The first hop returns many genes that are highly specific to FA (including the canonical genes FANCA, FANCB, FANCC, etc.). But it also returns many "promiscuous" genes like TP53, which explode into many results in the second hop. We expect results that pass through TP53 to be down-prioritized in the subsequent sorting and ranking step, based on something like the Normalized Google Distance. But tracking these entities and everything they are linked to still hurts performance and stability: the query below does not complete on my local machine, possibly because of this issue. Beyond promiscuous genes, there are also many promiscuous diseases (e.g., "cancer"), promiscuous drugs (e.g., "acetaminophen"), anatomical entities (e.g., "brain"), etc.
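For reference, the Normalized Google Distance mentioned above can be sketched as follows. This is only an illustration of the standard formula, not the ranking code used in this project; the function name and the co-occurrence counts in the usage note are made up for the example.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Standard NGD from term counts.

    fx, fy: number of documents mentioning each term alone
    fxy:    number of documents mentioning both terms
    n:      total number of documents in the corpus
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never co-occur are treated as infinitely distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

With illustrative counts, a gene mentioned in 1,000 documents that co-occurs with a disease in 900 of them scores much closer than a gene mentioned in 100,000 documents with the same 900 co-occurrences, which is why results through a promiscuous gene would be expected to rank lower.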

Here, I suggest we create the ability to optionally filter out promiscuous nodes in the course of query execution. I don't know the exact mechanism of implementing this feature, so this probably deserves some brainstorming. Naively, I propose two options:

  • comparing against an explicitly enumerated list of "excluded entities" (either centrally maintained or user-specified, each with different pros and cons)
  • dynamically assessing promiscuity and removing nodes via node attribute filters (filter on node attributes #174); we could query PubMed or our SemMedDB API as data sources to score promiscuity
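A minimal sketch of how the two options could combine into one filtering step between hops. Everything here is hypothetical: the function name, the `degree` mapping (standing in for whatever promiscuity score ends up being used), the threshold, and the placeholder `EX:` identifiers are assumptions for illustration, not existing code or real CURIEs.

```python
def filter_intermediate_nodes(node_ids, excluded_ids=frozenset(),
                              degree=None, max_degree=None):
    """Drop promiscuous intermediate nodes before expanding the next hop.

    node_ids:     IDs returned by the previous hop
    excluded_ids: option 1, an explicit exclusion list
    degree:       option 2, a mapping of node ID to a promiscuity score
                  (here assumed to be a neighbor count)
    max_degree:   nodes scoring above this are dropped; None disables the check
    """
    kept = []
    for node_id in node_ids:
        if node_id in excluded_ids:
            continue
        if (degree is not None and max_degree is not None
                and degree.get(node_id, 0) > max_degree):
            continue
        kept.append(node_id)
    return kept


# Placeholder IDs, not real CURIEs.
first_hop = ["EX:FANCA", "EX:TP53", "EX:FANCC"]
print(filter_intermediate_nodes(first_hop, excluded_ids={"EX:TP53"}))
```

Both checks are optional, so the default call leaves the node list unchanged, which matches the idea that the filter should only apply when the user asks for it.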

In addition to the question of how we will calculate a promiscuity score, we also need to decide how the user intent can be expressed in a TRAPI query (since the use of this filter would likely be use-case dependent). Is there a place where optional parameters can be specified in a TRAPI query?

Example FA query

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n01",
                    "object": "n02"
                },
                "e02": {
                    "subject": "n02",
                    "object": "n03"
                }
            },
            "nodes": {
                "n01": {
                    "categories": [
                        "biolink:Disease"
                    ],
                    "ids": [
                        "MONDO:0019391"
                    ]
                },
                "n02": {
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "n03": {
                    "categories": [
                        "biolink:Gene"
                    ]
                }
            }
        }
    }
}
@colleenXu
Collaborator

This would be an example of a promiscuous disease node: https://monarchinitiative.org/disease/MONDO:0000001

@andrewsu
Member Author

I posted results to the FA query in #493 (comment). Closing this issue as roughly a duplicate of that issue.

@andrewsu closed this as not planned Aug 23, 2023