BTE isn't parsing meta_knowledge_graph specs correctly for Automat #269

Closed
colleenXu opened this issue Aug 26, 2021 · 12 comments
Labels
bug Something isn't working

Comments

@colleenXu
Collaborator

BTE seems to be switching the subject/object when parsing the Automat HGNC meta_knowledge_graph endpoint. BTE therefore sends incorrect queries to this API...

This may be an issue for other APIs as well (ingested through meta_knowledge_graph? ingested through x-bte annotation?)


After a cron job to update smartapi specs, I see this JSON snippet below in BTE's data/predicates.json file for Automat HGNC (trapi v-1.1.2). Both entries under predicates are incorrect...

For example, a direct query to the Automat HGNC meta_knowledge_graph endpoint confirms that they give us Gene part_of GeneFamily. But we have Gene has_part GeneFamily, which is incorrect...

{
  "predicates_path": "/meta_knowledge_graph",
  "predicates": {
    "biolink:Gene": {
      "biolink:GeneFamily": [
        "biolink:has_part"
      ]
    },
    "biolink:GeneFamily": {
      "biolink:Gene": [
        "biolink:part_of"
      ]
    }
  }
}

The cron job appears to be querying Automat HGNC's endpoint correctly to get the specs. The relevant log is below:

  bte:biothings-explorer-trapi:cron Successfully got /meta_knowledge_graph for https://automat.renci.org/hmdb/1.1 +1ms
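
To make the comparison above concrete, here is a minimal Python sketch (not BTE's actual code, which is JavaScript) of how a TRAPI meta_knowledge_graph edge list could be folded into the nested shape seen in predicates.json. It assumes the file nests the object first and the subject second, which is what later comments in this thread conclude. The helper name `to_predicates` is hypothetical, and the two sample edges are inferred from the snippet above rather than copied from the live endpoint.

```python
from collections import defaultdict


def to_predicates(edges):
    """Fold meta_knowledge_graph edges into {object: {subject: [predicates]}}."""
    nested = defaultdict(lambda: defaultdict(list))
    for edge in edges:
        preds = nested[edge["object"]][edge["subject"]]
        if edge["predicate"] not in preds:  # avoid duplicate predicates
            preds.append(edge["predicate"])
    # Convert to plain dicts so the result serializes cleanly as JSON.
    return {obj: dict(subjects) for obj, subjects in nested.items()}


# Assumed sample edges: Gene part_of GeneFamily, GeneFamily has_part Gene.
edges = [
    {"subject": "biolink:Gene", "predicate": "biolink:part_of",
     "object": "biolink:GeneFamily"},
    {"subject": "biolink:GeneFamily", "predicate": "biolink:has_part",
     "object": "biolink:Gene"},
]
predicates = to_predicates(edges)
```

Under that assumption, the result has the same nesting as the snippet above, so the stored file itself would be consistent with what Automat reports.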

@colleenXu
Collaborator Author

This likely helps to explain this situation: #262 (comment)

@colleenXu
Collaborator Author

This task likely also involves resaving some cached files, since these were made incorrectly...

@marcodarko
Contributor

marcodarko commented Sep 7, 2021

So far I've found the following:
The pattern that BTE uses to save the predicates to the local file seems to be OBJECT: {SUBJECT: [PREDICATES]}, as in this example from another API:
{"biolink:Disease":{"biolink:ChemicalEntity":["biolink:associated_with_decreased_risk_for","biolink:associated_with_increased_risk_for","biolink:contraindicated_for","biolink:treats"]}}

If that's correct, then the way BTE ingested that information seems to be correct as well (perhaps we interpreted the file differently from what it actually says), and the issue may not be there...
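
A small Python sketch of reading the file under that pattern, offered as an illustration rather than BTE's actual code: it flattens the nested mapping into (subject, predicate, object) triples, treating the outer key as the object. The name `iter_triples` is hypothetical, and the example is trimmed to two of the predicates shown above.

```python
def iter_triples(predicates):
    """Yield (subject, predicate, object) from {object: {subject: [predicates]}}."""
    for obj, subjects in predicates.items():
        for subj, preds in subjects.items():
            for pred in preds:
                yield (subj, pred, obj)


example = {
    "biolink:Disease": {
        "biolink:ChemicalEntity": [
            "biolink:contraindicated_for",
            "biolink:treats",
        ]
    }
}
triples = list(iter_triples(example))
```

Read this way, the entry means ChemicalEntity treats Disease, which is the direction one would expect for these predicates.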

@marcodarko
Contributor

marcodarko commented Sep 9, 2021

Another example to support this from Connections Hypothesis Provider API:
"predicates": {
  "biolink:Disease": {
    "biolink:Gene": [
      "biolink:gene_associated_with_condition",
      "biolink:has_real_world_evidence_of_association_with"
    ],
    "biolink:Drug": [
      "biolink:treats",
      "biolink:has_real_world_evidence_of_association_with"
    ]
  },
  "biolink:Drug": {
    "biolink:Gene": [
      "biolink:interacts_with",
      "biolink:has_real_world_evidence_of_association_with"
    ],
    "biolink:Disease": [
      "biolink:treated_by",
      "biolink:associated_with_real_world_evidence"
    ]
  },
  "biolink:Gene": {
    "biolink:Drug": [
      "biolink:interacts_with",
      "biolink:associated_with_real_world_evidence"
    ],
    "biolink:Gene": [
      "biolink:genetically_interacts_with",
      "biolink:has_real_world_evidence_of_association_with",
      "biolink:associated_with_real_world_evidence"
    ],
    "biolink:Disease": [
      "biolink:condition_associated_with_gene",
      "biolink:associated_with_real_world_evidence"
    ]
  }
}

@marcodarko
Contributor

Another small issue I found is that the metaKG cron job was outdated: it was only picking up 1.1 APIs and skipping all 1.2 APIs. Fixing that...

@colleenXu
Collaborator Author

@marcodarko This is an example of a query that's failing somewhere though...

It is going Gene ID -> GeneFamily, and you would expect it would use the "part_of" predicate from the Automat apis (HGNC, Robokop, covidkopkp)....

    "biolink:GeneFamily": {
      "biolink:Gene": [
        "biolink:part_of"
      ]

However, the BTE sub-queries use the wrong predicate (has_part) instead.

POST to http://localhost:3000/v1/query and look at the logs:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["NCBIGene:728882"]
                },
                "n1": {
                    "categories": ["biolink:GeneFamily"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}
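
For anyone reproducing this, a rough Python sketch of building and sending the same one-hop query. The helper names `one_hop_query` and `post_query` are hypothetical, and the URL is the local instance mentioned above; it only works if a BTE server is actually running there.

```python
import json
import urllib.request


def one_hop_query(subject_id, object_category):
    """Build a one-hop TRAPI query graph: n0 (an ID) -> n1 (a category)."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [subject_id]},
                    "n1": {"categories": [object_category]},
                },
                "edges": {"e01": {"subject": "n0", "object": "n1"}},
            }
        }
    }


def post_query(payload, url="http://localhost:3000/v1/query"):
    """POST the payload as JSON and return the decoded response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = one_hop_query("NCBIGene:728882", "biolink:GeneFamily")
```

Calling `post_query(payload)` against a running instance and watching the server logs should show which predicate the sub-queries use.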

@marcodarko
Contributor

marcodarko commented Sep 14, 2021

Solved: Looks like during the metaKG construction the object and subject are being read from the predicates.json file in the wrong order. The file pattern is OBJ: {SUB: [PREDs]}, but it is being read as SUB: {OBJ: [PREDs]}. That's why it was mixing up the predicates from Gene-GeneFamily and GeneFamily-Gene. With the fix, the query correctly gets "NCBIGene:728882-biolink:part_of-HGNC.FAMILY:1993". Fix addressed in biothings/smartapi-kg.js#39
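
The mix-up can be illustrated with a short Python sketch. This is not the code from smartapi-kg.js; `lookup_buggy` and `lookup_fixed` are hypothetical names, and the data is the Automat HGNC snippet from the top of this issue.

```python
predicates = {
    "biolink:Gene": {"biolink:GeneFamily": ["biolink:has_part"]},
    "biolink:GeneFamily": {"biolink:Gene": ["biolink:part_of"]},
}


def lookup_buggy(predicates, subject, obj):
    # Wrong read order: treats the outer key as the subject (SUB: {OBJ: [PREDs]}).
    return predicates.get(subject, {}).get(obj, [])


def lookup_fixed(predicates, subject, obj):
    # Correct read order: the outer key is the object (OBJ: {SUB: [PREDs]}).
    return predicates.get(obj, {}).get(subject, [])


buggy = lookup_buggy(predicates, "biolink:Gene", "biolink:GeneFamily")
fixed = lookup_fixed(predicates, "biolink:Gene", "biolink:GeneFamily")
```

For a Gene -> GeneFamily query, the buggy read order picks up has_part, while the corrected one picks up part_of.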

@marcodarko marcodarko added the bug Something isn't working label Sep 14, 2021
@colleenXu
Collaborator Author

@newgene please close this once the PR/main data is pushed to the prod instance...

@colleenXu
Collaborator Author

should be fixed now that this PR is merged: biothings/smartapi-kg.js#39

@colleenXu
Collaborator Author

colleenXu commented Sep 21, 2021

copied from biothings/smartapi-kg.js#39 (comment)

Re-review of BTE's ability to query Automat APIs is described (the previous review was here).

The APIs being called by these queries have changed:

CHEBI:3558 -> Gene uses covidkop, ctd, pharos
HP:0004382 -> Disease uses Biolink, Uberongraph, (and covidkop)
MONDO:0009747 -> Gene uses Robokop (and covidkop, Pharos, Uberongraph)
CHEBI:30830 -> Disease uses CORD19, hetio (and biolink, covidkop, ctd, robokop)
MONDO:0011565 -> Disease uses (biolink, cord19, covidkop, robokop, uberongraph)
UBERON:0001905 -> AnatomicalEntity uses (covidkop, robokop, uberongraph)

These queries now get results from the APIs:

NCBITaxon:105667 -> MolecularEntity uses FoodDB
SequenceVariant CAID:CA16727036 -> Gene uses GTEx
NCBIGene:728882 -> GeneFamily uses HGNC
CHEBI:6896 -> Pathway uses HMDB
GO:0006629 -> BiologicalProcess uses Human-GOA
Protein UniProtKB:Q8JPQ9 -> Protein uses Intact
UBERON:0005453 -> AnatomicalEntity uses ontological-hierarchy
AnatomicalEntity UBERON:0015048 -> AnatomicalEntity uses text-mining
Protein UniProtKB:A3KCJ9 -> OrganismTaxon uses viral proteome

Not getting results from example queries for:

Gtopdb
Panther

@colleenXu
Collaborator Author

Reviewing whether BTE correctly gets info from Automat DrugCentral, Gtopdb, and Panther (the APIs that I didn't get results from in the review above):

BTE is getting edges from the Automat APIs:

  • DrugCentral: SmallMolecule PUBCHEM.COMPOUND:10206 -> Disease
  • Gtopdb: SmallMolecule PUBCHEM.COMPOUND:70701426 -> Gene
  • Panther: GeneFamily PANTHER.FAMILY:PTHR11732 -> Gene
