tune the use of AEOLUS indications from mychem.info #727

Closed
andrewsu opened this issue Sep 15, 2023 · 16 comments
Labels
data source On Test Related changes are deployed to Test server x-bte

Comments

@andrewsu
Member

AEOLUS is a standardized version of the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) data. According to https://www.fda.gov/drugs/surveillance/questions-and-answers-fdas-adverse-event-reporting-system-faers:

The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports, medication error reports and product quality complaints resulting in adverse events that were submitted to FDA.

So essentially it's a community-contributed database that has lots of good stuff, but it also has lots of junk. For example, here is the record for Escitalopram, a medication used to manage and treat major depressive and generalized anxiety disorders: https://mychem.info/v1/chem/WSEQXVZVJXJVFP-FQEVSTJZSA-N?fields=aeolus. Among the listed "indications" are

  "indications": [
    {
      "count": 765,
      "id": "36918942",
      "meddra_code": "10012378",
      "name": "Depression"
    },
    {
      "count": 219,
      "id": "36918858",
      "meddra_code": "10002855",
      "name": "Anxiety"
    },
    {
      "count": 106,
      "id": "42890454",
      "meddra_code": "10070592",
      "name": "Product used for unknown indication"
    },
    {
      "count": 71,
      "id": "36918945",
      "meddra_code": "10057840",
      "name": "Major depression"
    },
    {
      "count": 33,
      "id": "36918855",
      "meddra_code": "10018075",
      "name": "Generalised anxiety disorder"
    },
    ...
  ]

These generally look good, but lower down, we see this:

    {
      "count": 1,
      "id": "35205038",
      "meddra_code": "10013968",
      "name": "Dyspnoea"
    },
    {
      "count": 1,
      "id": "35306119",
      "meddra_code": "10036476",
      "name": "Prader-Willi syndrome"
    },
    {
      "count": 1,
      "id": "35406391",
      "meddra_code": "10043882",
      "name": "Tinnitus"
    },
    {
      "count": 1,
      "id": "35707962",
      "meddra_code": "10069049",
      "name": "Gastrointestinal viral infection"
    },
    {
      "count": 1,
      "id": "35708108",
      "meddra_code": "10021518",
      "name": "Impaired gastric emptying"
    }

These are probably extreme off-label uses at best, and data errors at worst.

Given that we have indications from multiple other sources through mychem.info (like ChEMBL and DrugCentral), we could probably remove these edges from the SmartAPI annotations without much loss in content to BTE. Alternatively, we could figure out an appropriate threshold on the count field (using a similar strategy to what we did in NCATSTranslator/Feedback#100). Eventually, this should also be assigned a relatively weak knowledge_level (#715) so our scoring can account for it appropriately...
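
A count threshold of this kind can be sketched in a few lines of Python. This is only an illustration, not BTE or MyChem code: the function name is hypothetical, the cutoff of 10 is arbitrary, and the records are copied from the Escitalopram example above.

```python
# Hypothetical helper: keep only AEOLUS indications reported more than
# `min_count` times. The cutoff used below is arbitrary.
def filter_indications(indications, min_count):
    return [rec for rec in indications if rec["count"] > min_count]

# Sample records copied from the Escitalopram example in this issue.
indications = [
    {"count": 765, "id": "36918942", "meddra_code": "10012378", "name": "Depression"},
    {"count": 219, "id": "36918858", "meddra_code": "10002855", "name": "Anxiety"},
    {"count": 106, "id": "42890454", "meddra_code": "10070592",
     "name": "Product used for unknown indication"},
    {"count": 1, "id": "35406391", "meddra_code": "10043882", "name": "Tinnitus"},
    {"count": 1, "id": "35306119", "meddra_code": "10036476", "name": "Prader-Willi syndrome"},
]

kept = filter_indications(indications, min_count=10)
print([rec["name"] for rec in kept])
```

Note that "Product used for unknown indication" has a high count, so a count threshold by itself does not remove it.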

@mbrush

mbrush commented Oct 23, 2023

Thanks for posting this Andrew - a closer look at AEOLUS has been on my list for a while.

From a quick review of their Nature Scientific Data paper, and looking at example records of AEOLUS data in mychem - I concluded that the 'indications' AEOLUS reports are based on FAERS self-reported data, and reflect what the person reporting the adverse event said they took the drug for. @andrewsu do you agree with this assessment?

If true, I would agree that AEOLUS is not the best source of 'treats' statements - given the existence of other more reliable sources you mention for this type of knowledge.

That said, it could be an interesting source of potential novel off-label usages of drugs - in cases where we see many patients self-reporting taking a drug for a particular non-indicated disease - so it may be worth keeping in Translator.

The key will be to clearly advertise the dubious nature of these claims, to ensure end users and reasoning/scoring tools are appropriately cautious when using this information. As you suggest, knowledge level/agent type tags will play a big role here - as may other 'at-a-glance' EPC properties we have proposed such as 'evidence type'. I think these types of statements would fall into the observation knowledge level bucket.

Finally, note that we have previously documented the AEOLUS use case as an example of how knowledge level and other EPC / AAG properties would work together to represent this information under the refactored approach to modeling treats relationships. The proposal in the screenshot below (and source document here) is worth a look, to see how we might ultimately choose to handle a source like AEOLUS.

[image: screenshot of the proposal]

@andrewsu
Member Author

super @mbrush, I think we are on the same page. And yes, we will definitely follow whatever is specified in the EPC modeling document you linked. Perhaps a suggestion on that... The Ranibizumab - treats - AMD example is helpful (1955 reports in AEOLUS), but just so people don't get tempted to over-trust AEOLUS, it might be useful to also add a poor AEOLUS "prediction" to that doc as well. Many examples to choose from in https://mychem.info/v1/query?q=ranibizumab&fields=aeolus.indications: Ranibizumab - treats - Thrombosis (9 reports) or Ranibizumab - treats - Type 2 diabetes mellitus (1 report) and Ranibizumab - treats - Phlebotomy (1 report)...

And now that we are out of code freeze, I do think we should implement a (hopefully) quick-to-implement stop-gap measure on CI/TEST. @colleenXu can you adjust the aeolus query to include a filter like this? https://mychem.info/v1/query?q=ranibizumab&fields=aeolus.indications&jmespath=aeolus.indications|[?count>`20`]
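
For anyone building this URL programmatically, here is a minimal Python sketch of how the jmespath parameter could be percent-encoded. The helper name is hypothetical; the endpoint, fields, and expression are the ones in the link above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://mychem.info/v1/query"

def build_query_url(term, min_count):
    # JMESPath writes JSON literals (here a number) between backticks.
    jmespath_expr = f"aeolus.indications|[?count>`{min_count}`]"
    params = {
        "q": term,
        "fields": "aeolus.indications",
        "jmespath": jmespath_expr,
    }
    # urlencode percent-encodes the pipe, brackets, and backticks.
    return f"{BASE_URL}?{urlencode(params)}"

url = build_query_url("ranibizumab", 20)
print(url)

# Decoding the query string gives back the original expression.
print(parse_qs(urlsplit(url).query)["jmespath"][0])
```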

@colleenXu
Collaborator

@andrewsu to confirm, you'd like the limit to be > 20?

@andrewsu
Member Author

yes, absent evidence to more confidently set that threshold, I think 20 will considerably improve the precision while not substantially degrading recall...

@colleenXu
Collaborator

colleenXu commented Oct 24, 2023

@andrewsu

I'm having trouble figuring out the reverse-operation "aeolus MEDDRA disease ID -(treated_by)-> chem". This matters because it's what BTE actually uses in creative-mode "treats", since creative-mode's starting ID is the disease.


@newgene Here are the details. Can you help?

(But I'm not sure if we can solve this. This is similar to a prior discussion on list_filter. Then, we decided that it wasn't really viable: one could do list_filter + JQ OR batch-query starting IDs, but not both)

This is the intended behavior

I want to take a query like this, and keep a hit (or its aeolus field) only when a nested object in aeolus.indications meets both criteria: (1) its meddra_code is one of the 3 listed (but there can be up to 1000 IDs in a batch), and (2) its count > 20.

curl --location 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii' \
--header 'Content-Type: application/json' \
--data '{
  "q": ["10018304", "10058990", "10038867"],
  "scopes": "aeolus.indications.meddra_code"
}'
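
The same batch request can be assembled in Python. This is a sketch using only the standard library; the helper name is hypothetical, and the request is built but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_batch_request(meddra_codes):
    # Query-string parameters and JSON body mirror the curl command above.
    params = {"size": 1000, "fields": "aeolus.indications,aeolus.unii"}
    body = {"q": meddra_codes, "scopes": "aeolus.indications.meddra_code"}
    return Request(
        url=f"https://mychem.info/v1/query?{urlencode(params)}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_batch_request(["10018304", "10058990", "10038867"])
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req) (not done here).
```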

For example, this hit for 10018304 (chemical is unii:F0P408N6V4) doesn't meet the criteria because the specific nested object with 10018304 has a count less than 20. So I'd like to remove this hit completely from the response (or at least the entire aeolus field for this hit).

    {
        "query": "10018304",
        "_id": "F0P408N6V4",
        "_score": 7.2257814,
        "aeolus": {
            "_license": "http://bit.ly/2DIxWwF",
            "indications": [
                {
                    "count": 19893,
                    "id": "43053715",
                    "meddra_code": "10035226",
                    "name": "Plasma cell myeloma"
                },
...
                {
                    "count": 1,
                    "id": "35606985",
                    "meddra_code": "10018304",
                    "name": "Glaucoma"
                },
...
            ],
            "unii": "F0P408N6V4"
        }
    },

What I tried, and how I know it isn't doing what I intend

First, I tried setting jmespath to aeolus.indications|[?count>`20`]

So the query would be:

curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii&jmespath=aeolus.indications%7C[%3Fcount%3E%6020%60]' \
--header 'Content-Type: application/json' \
--data '{
  "q": ["10018304", "10058990", "10038867"],
  "scopes": "aeolus.indications.meddra_code"
}'

But the example unii:F0P408N6V4 is still in the hits, even though its nested object that matched 10018304 is missing (it was filtered out because its count was less than 20).

click to see the unii:F0P408N6V4 hit

    {
        "query": "10018304",
        "_id": "F0P408N6V4",
        "_score": 7.2257814,
        "aeolus": {
            "_license": "http://bit.ly/2DIxWwF",
            "indications": [
                {
                    "count": 19893,
                    "id": "43053715",
                    "meddra_code": "10035226",
                    "name": "Plasma cell myeloma"
                },
                {
                    "count": 2306,
                    "id": "35104397",
                    "meddra_code": "10028533",
                    "name": "Myelodysplastic syndrome"
                },
                {
                    "count": 1123,
                    "id": "35104667",
                    "meddra_code": "10028228",
                    "name": "Multiple myeloma"
                },
                {
                    "count": 425,
                    "id": "35104364",
                    "meddra_code": "10008958",
                    "name": "Chronic lymphocytic leukaemia"
                },
                {
                    "count": 364,
                    "id": "35104461",
                    "meddra_code": "10025310",
                    "name": "Lymphoma"
                },
                {
                    "count": 348,
                    "id": "42890454",
                    "meddra_code": "10070592",
                    "name": "Product used for unknown indication"
                },
                {
                    "count": 201,
                    "id": "35104394",
                    "meddra_code": "10068532",
                    "name": "5q minus syndrome"
                },
                {
                    "count": 201,
                    "id": "35104532",
                    "meddra_code": "10061275",
                    "name": "Mantle cell lymphoma"
                },
                {
                    "count": 196,
                    "id": "35104351",
                    "meddra_code": "10000880",
                    "name": "Acute myeloid leukaemia"
                },
                {
                    "count": 186,
                    "id": "36009859",
                    "meddra_code": "10002022",
                    "name": "Amyloidosis"
                },
                {
                    "count": 146,
                    "id": "35104490",
                    "meddra_code": "10012818",
                    "name": "Diffuse large B-cell lymphoma"
                },
                {
                    "count": 142,
                    "id": "35104252",
                    "meddra_code": "10028537",
                    "name": "Myelofibrosis"
                },
                {
                    "count": 138,
                    "id": "35104643",
                    "meddra_code": "10029547",
                    "name": "Non-Hodgkin's lymphoma"
                },
                {
                    "count": 130,
                    "id": "35104465",
                    "meddra_code": "10003899",
                    "name": "B-cell lymphoma"
                },
                {
                    "count": 86,
                    "id": "35124300",
                    "meddra_code": "10068361",
                    "name": "MDS"
                },
                {
                    "count": 58,
                    "id": "35125677",
                    "meddra_code": "10028233",
                    "name": "Multiple myeloma without mention of remission"
                },
                {
                    "count": 56,
                    "id": "43053717",
                    "meddra_code": "10073133",
                    "name": "Plasma cell myeloma recurrent"
                },
                {
                    "count": 47,
                    "id": "35104405",
                    "meddra_code": "10020206",
                    "name": "Hodgkin's disease"
                },
                {
                    "count": 45,
                    "id": "37522153",
                    "meddra_code": "10057097",
                    "name": "Drug use for unknown indication"
                },
                {
                    "count": 38,
                    "id": "43053713",
                    "meddra_code": "10035222",
                    "name": "Plasma cell leukaemia"
                },
                {
                    "count": 34,
                    "id": "35125678",
                    "meddra_code": "10028566",
                    "name": "Myeloma"
                },
                {
                    "count": 33,
                    "id": "35104669",
                    "meddra_code": "10035484",
                    "name": "Plasmacytoma"
                },
                {
                    "count": 29,
                    "id": "35124041",
                    "meddra_code": "10009310",
                    "name": "CLL"
                },
                {
                    "count": 27,
                    "id": "36617702",
                    "meddra_code": "10060862",
                    "name": "Prostate cancer"
                },
                {
                    "count": 27,
                    "id": "42888924",
                    "meddra_code": "10060880",
                    "name": "Monoclonal gammopathy"
                },
                {
                    "count": 26,
                    "id": "35104567",
                    "meddra_code": "10047801",
                    "name": "Waldenstrom's macroglobulinaemia"
                },
                {
                    "count": 25,
                    "id": "35104382",
                    "meddra_code": "10025270",
                    "name": "Lymphocytic leukaemia"
                },
                {
                    "count": 23,
                    "id": "35123953",
                    "meddra_code": "10000886",
                    "name": "Acute myeloid leukemia"
                }
            ],
            "unii": "F0P408N6V4"
        }
    },

Trying the following didn't work either:

  • aeolus|[?indications.count>`20`] : then all the hits had aeolus: null which is incorrect since I know some hits met the criteria (like unii:1O6WQ6T7G3 for 10018304)
  • .|[?aeolus.indications.count>`20`] : then it seemed like the jmespath statement did nothing (no nested objects filtered out)
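
The result of the first attempt (aeolus.indications|[?count>`20`]) is what you would see if the expression were applied inside each hit rather than to the list of hits. The sketch below imitates that in plain Python; it does not call the jmespath library or MyChem, the function name is hypothetical, and the hit is an abbreviated copy of the F0P408N6V4 example.

```python
def transform_hit(hit, min_count):
    """Imitate aeolus.indications|[?count>`N`] applied to a single hit."""
    filtered = [rec for rec in hit["aeolus"]["indications"] if rec["count"] > min_count]
    new_hit = dict(hit)
    new_hit["aeolus"] = {**hit["aeolus"], "indications": filtered}
    return new_hit

hit = {
    "query": "10018304",
    "_id": "F0P408N6V4",
    "aeolus": {
        "indications": [
            {"count": 19893, "meddra_code": "10035226", "name": "Plasma cell myeloma"},
            {"count": 1, "meddra_code": "10018304", "name": "Glaucoma"},
        ],
        "unii": "F0P408N6V4",
    },
}

hits = [transform_hit(h, 20) for h in [hit]]
codes = [rec["meddra_code"] for rec in hits[0]["aeolus"]["indications"]]
# The hit is still present; only the low-count Glaucoma element is gone.
print(len(hits), codes)
```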

colleenXu added a commit to NCATS-Tangerine/translator-api-registry that referenced this issue Oct 27, 2023
issue with adding this constraint to the reverse operation, see biothings/biothings_explorer#727
@colleenXu
Collaborator

colleenXu commented Oct 27, 2023

Updates:

@andrewsu

I've implemented jmespath: aeolus.indications|[?count>`20`] for the aeolus-treats operation (chemical X -(treats)-> disease).

However, the reverse operation may be more important (as I said in the previous post). And while I'm making some progress (see below), I'm still not able to implement the count constraint for the reverse operation.

Query for testing: Escitalopram

Based on Andrew's first post on this issue

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["UNII:4O4S742ANY"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:treats"]
                }
            }
        }
    }
}

Got 110 results before, should now get 29. The low-count hits like Tinnitus (meddra code 10043882) should no longer be in the result set.

Query for testing: Ranibizumab

Based on Andrew's post above

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["UNII:ZL1R02VT79"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:treats"]
                }
            }
        }
    }
}

Got 120 results before, should now get 41. The low-count hits like thrombosis (meddra code 10043607) should no longer be in the result set.
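
Both test queries have the same shape and differ only in the starting UNII, so they can be generated from one small Python helper. This is a convenience sketch; the function name is hypothetical and it does not send anything to BTE.

```python
import json

def build_treats_query(unii, predicate="biolink:treats"):
    # One-hop TRAPI query graph: chemical n0 -(predicate)-> disease n1.
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [f"UNII:{unii}"], "categories": ["biolink:SmallMolecule"]},
                    "n1": {"categories": ["biolink:Disease"]},
                },
                "edges": {
                    "e1": {"subject": "n0", "object": "n1", "predicates": [predicate]},
                },
            }
        }
    }

escitalopram_query = build_treats_query("4O4S742ANY")
ranibizumab_query = build_treats_query("ZL1R02VT79")
print(json.dumps(ranibizumab_query, indent=4))
```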


@newgene

I still need your help, but I think I've made some progress:

  • I've found a way to only keep the elements in the aeolus.indications array that meet both criteria: (1) the meddra_code value is one of the ones I asked for, and (2) the count > 20
  • but I can't figure out how to remove the aeolus.unii field when the criteria above are not met (or remove the hit; either will work for BTE's purposes). This is the main thing left to figure out. Can you help?
click to see what I have

Setting jmespath to aeolus.indications|[?(count>`20`) && (meddra_code=='10018304' ||meddra_code=='10038867')] (using biothings/biothings.api@31898fa as reference)

The MyChem query is:

curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii&jmespath=aeolus.indications%7C[%3F(count%3E%6020%60)%20%26%26%20(meddra_code%3D%3D%2710018304%27%20%7C%7Cmeddra_code%3D%3D%2710038867%27)]' \
--header 'Content-Type: application/json' \
--data '{
  "q": ["10018304", "10038867"],
  "scopes": "aeolus.indications.meddra_code"
}'

Then the response looks like this for hits that fulfill the criteria:

    {
        "query": "10018304",
        "_id": "WSNODXPBBALQOF-VEJSHDCNSA-N",
        "_score": 7.2257814,
        "aeolus": {
            "_license": "http://bit.ly/2DIxWwF",
            "indications": [
                {
                    "count": 157,
                    "id": "35606985",
                    "meddra_code": "10018304",
                    "name": "Glaucoma"
                }
            ],
            "unii": "1O6WQ6T7G3"
        }
    },

    {
        "query": "10038867",
        "_id": "1RXS4UE564",
        "_score": 8.809106,
        "aeolus": {
            "_license": "http://bit.ly/2DIxWwF",
            "indications": [
                {
                    "count": 26,
                    "id": "35607414",
                    "meddra_code": "10038867",
                    "name": "Retinal haemorrhage"
                }
            ],
            "unii": "1RXS4UE564"
        }
    },

And like this for elements that don't fit the criteria (including the same F0P408N6V4 chemical I had in the last post):

    {
        "query": "10018304",
        "_id": "F0P408N6V4",
        "_score": 7.2257814,
        "aeolus": {
            "_license": "http://bit.ly/2DIxWwF",
            "indications": [],
            "unii": "F0P408N6V4"
        }
    },

    {
        "query": "10038867",
        "_id": "2S9ZZM9Q9V",
        "_score": 9.657343,
        "aeolus": {
            "_license": "http://bit.ly/2DIxWwF",
            "indications": [],
            "unii": "2S9ZZM9Q9V"
        }
    },

Notes for myself on generating queries like this with x-bte/BTE

  • I think doing this as non-batch is easier:
    • To add the input IDs: {{ queryInputs }} can be used in parameters (think external apis like biolink/monarch)
    • May involve some wrap, playing around with quotation marks and escaping \ to get the single-quotes
  • I'm less sure about being able to generate the batch-queries properly...even though batch-queries are theoretically possible (my example uses 2 meddra_code values)
    • how many unique values can this BioThings feature handle?
    • can I figure out how to get the multiple IDs formatted correctly? (wrap to generate a string, setting the delimiter to ||...)
    • batch-size-limit: caused by the url-character limit
      • and this'll be set for the whole-api, unless we implement something for individual operations (which may be a bit complicated by the deployment situation?)
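
To make the batch idea concrete, here is a Python sketch of generating the jmespath expression from a list of IDs, joined with ||. This is not how x-bte templating works; the function name is hypothetical and it only shows the string being built and how long it gets once URL-encoded.

```python
from urllib.parse import quote

def build_jmespath(meddra_codes, min_count):
    # One equality test per MedDRA code, joined with the JMESPath "||" operator.
    code_tests = " || ".join(f"meddra_code=='{code}'" for code in meddra_codes)
    return f"aeolus.indications|[?(count>`{min_count}`) && ({code_tests})]"

expr = build_jmespath(["10018304", "10038867"], 20)
print(expr)

# Rough feel for the URL-length concern: encoded size for a 1000-ID batch.
# The IDs here are made-up 8-digit placeholders.
big_batch = [str(10000000 + i) for i in range(1000)]
print(len(quote(build_jmespath(big_batch, 20))))
```

The second print shows why the URL character limit would force a smaller batch size for this operation.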

@newgene
Member

newgene commented Oct 27, 2023

@colleenXu jmespath does not add or remove hits; it only transforms hits given some criteria. If you want to change which hits are returned, you should modify your query. In your case above, you can include aeolus.indications.count:>20 in your query; then every hit should contain at least one item with count>20 in its indications array. This should serve the purpose if I understand correctly.

@colleenXu
Collaborator

@newgene I tried adding this two ways: using a "no-scopes" query and using post_filter. Neither seemed to work: the responses were basically the same as before.

The responses are basically the same as above

"no-scopes" query and response

curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii&jmespath=aeolus.indications%7C[%3F(count%3E%6020%60)%20%26%26%20(meddra_code%3D%3D%2710018304%27%20%7C%7Cmeddra_code%3D%3D%2710038867%27)]' \
--header 'Content-Type: application/json' \
--data '{
  "q": [
          "aeolus.indications.meddra_code:10018304 AND aeolus.indications.count:>20", 
          "aeolus.indications.meddra_code:10038867 AND aeolus.indications.count:>20"
        ],
  "scopes": []
}'

Response still has the hits that don't meet the criteria:

    {
        "query": "aeolus.indications.meddra_code:10018304 AND aeolus.indications.count:>20",
        "_id": "F0P408N6V4",
        "_score": 8.225781,
        "aeolus": {
            "_license": "http://bit.ly/2DIxWwF",
            "indications": [],
            "unii": "F0P408N6V4"
        }
    },

    {
        "query": "aeolus.indications.meddra_code:10018304 AND aeolus.indications.count:>20",
        "_id": "2S9ZZM9Q9V",
        "_score": 7.137364,
        "aeolus": {
            "_license": "http://bit.ly/2DIxWwF",
            "indications": [],
            "unii": "2S9ZZM9Q9V"
        }
    },

post-filter

Added post_filter parameter, set to aeolus.indications.count:>20

curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii&post_filter=aeolus.indications.count%3A%3E20&jmespath=aeolus.indications%7C[%3F(count%3E%6020%60)%20%26%26%20(meddra_code%3D%3D%2710018304%27%20%7C%7Cmeddra_code%3D%3D%2710038867%27)]' \
--header 'Content-Type: application/json' \
--data '{
  "q": ["10018304", "10038867"],
  "scopes": "aeolus.indications.meddra_code"
}'

Response still has the hits that don't meet the criteria:

    {
        "query": "10018304",
        "_id": "F0P408N6V4",
        "_score": 7.2257814,
        "aeolus": {
            "_license": "http://bit.ly/2DIxWwF",
            "indications": [],
            "unii": "F0P408N6V4"
        }
    },

    {
        "query": "10018304",
        "_id": "2S9ZZM9Q9V",
        "_score": 6.137364,
        "aeolus": {
            "_license": "http://bit.ly/2DIxWwF",
            "indications": [],
            "unii": "2S9ZZM9Q9V"
        }
    },

@newgene
Member

newgene commented Oct 31, 2023

@colleenXu you have additional filter criteria in jmespath, jmespath=aeolus.indications|[?(count>`20`) && (meddra_code=='10018304' ||meddra_code=='10038867')], so if indications comes back empty, it's due to these criteria, not the count:>20 condition, which your query already filters on.

@colleenXu
Collaborator

colleenXu commented Oct 31, 2023

@newgene

Okay....but I still can't figure out: if the hit's aeolus.indications is empty, how to remove the aeolus.unii field or remove the hit...

(ref: this earlier post)

@colleenXu
Collaborator

colleenXu commented Nov 1, 2023

(CC @newgene)

This is the info from our conversation:

  • the hits are tied to the q part of the query, so modifying that may be useful
  • But the logic in q works differently from the logic in jmespath:
    • we want logic like jmespath: a single aeolus.indications element should fulfill both criteria: (1) its meddra_code value is one of the ones I asked for, and (2) its count > 20.
    • But when using q, some hits are problematic: they don't have any aeolus.indications element that meets both criteria at once (some elements have the meddra_code and others have the count > 20).
  • I'm unsure on whether post_filter / filter would be helpful here. I know filter isn't live yet (upcoming biothings sdk update) and I dunno if post_filter is live...
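
The difference between the two kinds of logic can be shown with a small Python sketch. It does not reproduce how MyChem evaluates q; it only models the behavior described in the bullets above, with hypothetical function names and an abbreviated copy of the F0P408N6V4 record.

```python
def matches_document_level(indications, code, min_count):
    # q-style logic as described above: each condition may be satisfied
    # by a different element of the array.
    has_code = any(rec["meddra_code"] == code for rec in indications)
    has_count = any(rec["count"] > min_count for rec in indications)
    return has_code and has_count

def matches_element_level(indications, code, min_count):
    # jmespath-style logic: a single element must satisfy both conditions.
    return any(
        rec["meddra_code"] == code and rec["count"] > min_count
        for rec in indications
    )

indications = [
    {"count": 19893, "meddra_code": "10035226", "name": "Plasma cell myeloma"},
    {"count": 1, "meddra_code": "10018304", "name": "Glaucoma"},
]

doc_level = matches_document_level(indications, "10018304", 20)
elem_level = matches_element_level(indications, "10018304", 20)
print(doc_level, elem_level)  # prints: True False
```

This is why the hit is returned by the query but ends up with an empty indications array after the jmespath step.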

We tried setting the q field to be identical to the jmespath info, but it seemed to result in the same behavior as the previous tries.

click for info

So the jmespath parameter is: aeolus.indications|[?(count>`20`) && (meddra_code==`10018304`||meddra_code==`10038867`)]

And we set the request body to something very similar:

{
  "q": [
          "aeolus.indications.count:>20 AND (aeolus.indications.meddra_code:10018304 OR aeolus.indications.meddra_code:10038867)"
        ],
  "scopes": []
}

so the full query was:

curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii&jmespath=aeolus.indications%7C[%3F(count%3E%6020%60)%20%26%26%20(meddra_code%3D%3D%6010018304%60%7C%7Cmeddra_code%3D%3D%6010038867%60)]%20' \
--header 'Content-Type: application/json' \
--data '{
  "q": [
          "aeolus.indications.count:>20 AND (aeolus.indications.meddra_code:10018304 OR aeolus.indications.meddra_code:10038867)"
        ],
  "scopes": []
}'

And the responses have the same issue:

    {
        "query": "aeolus.indications.count:>20 AND (aeolus.indications.meddra_code:10018304 OR aeolus.indications.meddra_code:10038867)",
        "_id": "F0P408N6V4",
        "_score": 8.225781,
        "aeolus": {
            "_license": "http://bit.ly/2DIxWwF",
            "indications": [],
            "unii": "F0P408N6V4"
        }
    },

@colleenXu
Collaborator

colleenXu commented Apr 9, 2024

The MyChem-query-level limit (aeolus.indications.count > 20) is now implemented in the reverse direction too in Dev/CI!

Adding the new parameter jmespath_exclude_empty: true removed the hits that didn't match both criteria (count > 20 AND meddra field's value matches the input ID) - so BTE can parse the API response without issues. Commits:

Thanks to @newgene @DylanWelzel for the BioThings SDK/MyChem update
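
For readers following along, the effect of jmespath_exclude_empty: true as described above can be imitated in plain Python. This is only a model of the observed behavior, not the BioThings SDK implementation; the function name is hypothetical and the two hits are abbreviated from earlier comments.

```python
def apply_filter(hits, codes, min_count, exclude_empty):
    result = []
    for hit in hits:
        # Element-level filter: matching MedDRA code AND count above the threshold.
        kept = [
            rec for rec in hit["aeolus"]["indications"]
            if rec["count"] > min_count and rec["meddra_code"] in codes
        ]
        if exclude_empty and not kept:
            continue  # drop hits whose filtered array is empty
        result.append({**hit, "aeolus": {**hit["aeolus"], "indications": kept}})
    return result

hits = [
    {"_id": "WSNODXPBBALQOF-VEJSHDCNSA-N", "aeolus": {"unii": "1O6WQ6T7G3", "indications": [
        {"count": 157, "meddra_code": "10018304", "name": "Glaucoma"}]}},
    {"_id": "F0P408N6V4", "aeolus": {"unii": "F0P408N6V4", "indications": [
        {"count": 19893, "meddra_code": "10035226", "name": "Plasma cell myeloma"},
        {"count": 1, "meddra_code": "10018304", "name": "Glaucoma"}]}},
]

without_flag = apply_filter(hits, {"10018304"}, 20, exclude_empty=False)
with_flag = apply_filter(hits, {"10018304"}, 20, exclude_empty=True)
print([h["_id"] for h in without_flag])
print([h["_id"] for h in with_flag])
```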


So the current situation in Dev/CI:

  • BTE now retrieves aeolus.indications.count for aeolusTreats/aeolusTreats-rev operations (ref: commit). The x-bte annotation maps this field to a TRAPI edge-attribute biolink:evidence_count. The value of this edge-attribute is currently always an array of ints (noted in issue 1 of this comment)
  • BTE has a "hard-coded"/MyChem-query-level limit for those operations: aeolus.indications.count > 20.

@colleenXu
Collaborator

@tokebe @andrewsu

I know we've been discussing the aeolus edge-attribute format (flattening arrays into ints) in the edge-attribute constraint issue (part 1 here, and decision here). But I think it'd make sense to add it to this issue and track its deployment here.

What do you think?

@colleenXu
Collaborator

And a note - because the hard-coded limit of > 20 is for individual records, BTE won't return an edge for the following theoretical edge case:

  • individual record counts are <20
  • but BTE/NodeNorm would have merged records together and after the flattening/summation, the edge's count would have been > 20

I asked Andrew, and he said that this is fine for now.
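
A tiny Python sketch of that edge case, assuming two records that would be merged onto one edge. The counts are hypothetical and the functions only model the order of operations, not BTE code.

```python
MIN_COUNT = 20  # threshold used in this issue

def filter_then_sum(record_counts):
    # Current order: drop low-count records first, then merge what is left.
    kept = [c for c in record_counts if c > MIN_COUNT]
    return sum(kept) if kept else None  # None means no edge is returned

def sum_then_filter(record_counts):
    # Alternative order (not what was implemented): merge first, then threshold.
    total = sum(record_counts)
    return total if total > MIN_COUNT else None

counts = [15, 12]  # hypothetical counts for two records that would be merged
print(filter_then_sum(counts), sum_then_filter(counts))  # prints: None 27
```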

@colleenXu
Collaborator

colleenXu commented May 2, 2024

Addressed by this commit directly to main: biothings/bte_trapi_query_graph_handler@b0fc94d

I've confirmed that the flattening/summation works as-intended :)


Example based on the example in Part 1 here

Example query

Send to MyChem thru BTE: http://localhost:3000/v1/smartapi/8f08d1446e0bb9c2b323713ce83e2bd3/query

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["UNII:01K63SUP8D"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:applied_to_treat"]
                }
            }
        }
    }
}

Previously, we'd get edges from the aeolus operations that look like this:

                "dd9daae5b03bcad0698ff6669090f36b": {
                    "predicate": "biolink:applied_to_treat",
                    "subject": "PUBCHEM.COMPOUND:3386",
                    "object": "MEDDRA:10070592",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:evidence_count",
                            "value": [
                                875
                            ]
                        }
                    ],


                "1feea171db6394cfd9bcb20deae0ad9a": {
                    "predicate": "biolink:applied_to_treat",
                    "subject": "PUBCHEM.COMPOUND:3386",
                    "object": "MONDO:0002050",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:evidence_count",
                            "value": [
                                733,
                                42
                            ]
                        }
                    ],

After the commit, these edges look like this: the edge-attribute values are ints and sums if there were values from multiple records.

                "dd9daae5b03bcad0698ff6669090f36b": {
                    "predicate": "biolink:applied_to_treat",
                    "subject": "PUBCHEM.COMPOUND:3386",
                    "object": "MEDDRA:10070592",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:evidence_count",
                            "value": 875
                        },


                "1feea171db6394cfd9bcb20deae0ad9a": {
                    "predicate": "biolink:applied_to_treat",
                    "subject": "PUBCHEM.COMPOUND:3386",
                    "object": "MONDO:0002050",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:evidence_count",
                            "value": 775
                        },
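
The flattening rule itself is simple. Here is a Python sketch of it; the actual change is in the linked bte_trapi_query_graph_handler commit, and this function name is hypothetical.

```python
def flatten_evidence_count(value):
    # A list of per-record counts becomes their sum; a bare int passes through.
    if isinstance(value, list):
        return sum(value)
    return value

print(flatten_evidence_count([875]))      # prints: 875
print(flatten_evidence_count([733, 42]))  # prints: 775
```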

@colleenXu
Collaborator

colleenXu commented Jun 14, 2024

The flattening/summing code was deployed today to Prod as part of the Octopus release. I tested and it's live.

Summary of what was done in this issue:

  • aeolusTreats/aeolusTreats-rev operations (ref):
    • now include aeolus.indications.count field, mapped to biolink:evidence_count
    • only return documents/records with aeolus.indications.count > 20 (x-bte uses jmespath)
  • BTE updated to flatten the biolink:evidence_count value into an int (sum if multiple values). ref: described in part 1, decision, and implement/test comment directly above this one

Noting one edge case (pasted from above comment):

And a note - because the hard-coded limit of > 20 is for individual records, BTE won't return an edge for the following theoretical edge case:

  • individual record counts are <20
  • but BTE/NodeNorm would have merged records together and after the flattening/summation, the edge's count would have been > 20

I asked Andrew, and he said that this is fine for now.
