
Multiple x-bte operations not working (records dropped during edge-management) #756

Closed
colleenXu opened this issue Oct 31, 2023 · 10 comments
Labels
bug Something isn't working On Test Related changes are deployed to Test server

Comments

@colleenXu
Collaborator

colleenXu commented Oct 31, 2023

New info: this is affecting existing operations. A very similar "records dropped during edge-management" problem is happening with multiple BioThings TTD operations.

The way I tested this was using a local copy of the yaml, commenting out all the operations in the list except the ones you're testing, and then using the operation's testExample info to write a test query.

I now suspect that this occurs when specific namespaces are used as input (HP, MONDO). This reminds me of #591. I also wonder if this happened with the most recent code changes (JQ, typescript/pnpm/restructuring) or from something older...


Previous info from Slack:

I'm also having issues locally with two new operations I'm trying to make. I think there's an issue with the edge-management or api-response-transform...

To recreate the issue:

  • Use an override for this MyVariant yaml (I suggest keeping a local copy and pointing the override to that). Only the two problematic operations should be live (not commented out): clinvar-gene-phenoHP-rev and clinvar-variant-phenoHP-rev
  • Test MyVariant through BTE (using the override).
queries

To test clinvar-gene-phenoHP-rev, run this query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["HP:0001639"],
                    "categories":["biolink:PhenotypicFeature"]
                },
                "n1": {
                    "categories":["biolink:Gene"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

To test clinvar-variant-phenoHP-rev, replace Gene with SequenceVariant:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["HP:0001639"],
                    "categories":["biolink:PhenotypicFeature"]
                },
                "n1": {
                    "categories":["biolink:SequenceVariant"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}
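
Since the two queries above differ only in the n1 category, here's a minimal sketch of a helper that generates them. The function name `build_one_hop_query` and its parameters are made up for illustration (they aren't part of BTE); the node/edge keys just mirror the JSON bodies above.

```python
import json


def build_one_hop_query(ids, subject_category, object_category, predicates=None):
    """Build a one-hop TRAPI query body: n0 (with IDs) -> n1 (category only).

    Hypothetical helper for illustration; it only mirrors the JSON shown above.
    """
    edge = {"subject": "n0", "object": "n1"}
    if predicates:
        # Only add "predicates" when the caller asked for specific ones.
        edge["predicates"] = list(predicates)
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": list(ids), "categories": [subject_category]},
                    "n1": {"categories": [object_category]},
                },
                "edges": {"e1": edge},
            }
        }
    }


# Query for clinvar-gene-phenoHP-rev
gene_query = build_one_hop_query(
    ["HP:0001639"], "biolink:PhenotypicFeature", "biolink:Gene"
)
# Query for clinvar-variant-phenoHP-rev: same input, different n1 category
variant_query = build_one_hop_query(
    ["HP:0001639"], "biolink:PhenotypicFeature", "biolink:SequenceVariant"
)

print(json.dumps(variant_query, indent=4))
```

The printed body can be pasted into whatever client you use to POST to your BTE instance.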

In both cases, it looks like something is going on with n0 during edge-management (no curies from records??) and all the records get dropped...

[Screenshot: Screen Shot 2023-10-30 at 11 27 05 AM]

@colleenXu colleenXu added the bug Something isn't working label Oct 31, 2023
@colleenXu
Collaborator Author

colleenXu commented Nov 1, 2023

This issue seems to be isolated to ci/dev/local instances.

In other words, behavior seems normal (unaffected) on prod/test.

Example query

based on BioThings TTD pubchem_treats_mondo-rev

Send this query to the instance you want to test. Should get 568 results if it works, and 0 if it doesn't work...

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["MONDO:0021094"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:SmallMolecule"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}
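
For anyone who wants to script this check, here's a rough sketch that counts the results in a TRAPI response and labels the outcome. The helper names (`count_results`, `post_query`, `describe`) are hypothetical, and the URL you pass in depends on which instance you're testing, so treat it as a placeholder. It assumes the response keeps its results under `message.results`, which may be missing or null when nothing is found.

```python
import json
import urllib.request


def count_results(trapi_response):
    """Return the number of results, treating a missing/null list as 0."""
    results = (trapi_response.get("message") or {}).get("results")
    return len(results or [])


def post_query(url, query, timeout=300):
    """POST a TRAPI query body to an instance and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def describe(count, expected=568):
    """Label a result count using the pass/fail description above."""
    if count == expected:
        return "ok"
    if count == 0:
        return "records dropped"
    return "unexpected count"


# Usage (not run here, since it needs a live instance):
#   response = post_query("<instance query URL>", query)
#   print(describe(count_results(response)))
```

The expected count of 568 is specific to this query at the time of writing, so it will probably drift as the underlying data changes.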

@colleenXu
Collaborator Author

colleenXu commented Nov 1, 2023

And another example using MyDisease: works fine on prod/test but has the problem on ci/dev/local
Starts with an HP ID...


querying only MyDisease through BTE instances

This should retrieve 4 results.

Based on pheno -> disease operation

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["HP:0000224"],
                    "categories":["biolink:PhenotypicFeature"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

@colleenXu
Collaborator Author

And my last bits of investigation for now!

  • this only seems to affect BioThings APIs: the example queries for Biolink/Monarch API work...
  • this can affect multiple kinds of operations, not just reverse ones (all examples before this have been reverse ones).
Biolink/Monarch API queries that work and use MONDO/HP as input

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Disease"],
                    "ids": ["MONDO:0011562"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:PhenotypicFeature"],
                    "ids": ["HP:0002944"]
                },
                "n1": {
                    "categories": ["biolink:SequenceVariant"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

Non-rev operations that still have this problem

"Forward" operation in MyDisease: should have 67 results

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["MONDO:0002494"],
                    "categories":["biolink:Disease"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

association-based operation (so forward/reverse isn't really important): should have 64 results

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["MONDO:0004056"],
                    "categories":["biolink:Disease"]
                },
                "n1": {
                    "categories":["biolink:Gene"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:condition_associated_with_gene"]
                }
            }
        }
    }
}

@colleenXu
Collaborator Author

This is also happening when DOID/MGI IDs are used as input. This is blocking me from testing the BioThings AGR work https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/AGR/agr.yaml for #260.

DOID Example

Run MyVariant through BTE: http://localhost:3000/v1/smartapi/09c8782d9f4027712e65b95424adba79/query

Based on MyVariant's operation


{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Disease"],
                    "ids": ["DOID:9256"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:affected_by"]
                }
            }
        }
    }
}

MGI ID example

Run MGIGene2Pheno through BTE: http://localhost:3000/v1/smartapi/77ed27f111262d0289ed4f4071faa619/query

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Gene"],
                    "ids": ["MGI:102539"]
                },
                "n1": {
                    "categories": ["biolink:PhenotypicFeature"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

@tokebe tokebe added the On Dev Related changes are deployed to Dev server label Nov 29, 2023
@colleenXu
Collaborator Author

colleenXu commented Nov 29, 2023

@tokebe

I'm testing my BioThings AGR work and I think there's another bug here (I don't know how old it is). Queries that end up with multiple starting IDs will only keep records for the 1st ID. I have 3 different scenarios where I noticed this behavior.

While I was mostly testing DOID/MGI input, I think this applies to any input namespace.


Test 0: starting with two IDs (shortest explanation!)


First, run a smartapi_sync so your local instance has the BioThings AGR registration info (just added).
Then run your local instance (no overrides needed) and send a POST request to BioThings AGR through BTE (http://localhost:3000/v1/smartapi/68f12100e74342ae0dd5013d5f453194/query) w/ the following request body:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Gene"],
                    "ids": ["MGI:109384", "MGI:109393"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:biomarker_for"]
                }
            }
        }
    }
}

Console logs (mgi_console_logs.txt) show that while there were two starting IDs, n0 only had 1 ID during edge management, so records were dropped:

  bte:biothings-explorer-trapi:edge-manager (5) Sending next edge 'e01' WITH entity count...(2) +0ms
  bte:biothings-explorer-trapi:edge-manager Checking entity max : (0)--(2) +0ms
  bte:biothings-explorer-trapi:edge-manager 'e01' : (2) --> (0) +0ms
  bte:biothings-explorer-trapi:edge-manager (5) Executing current edge >> "e01" +1ms

  bte:call-apis:query Successful POST https://biothings.ncats.io/agr (2 IDs): Gene > biomarker_for > Disease (obtained 7 records, took 191ms) +20ms

  bte:biothings-explorer-trapi:QNode Node "n0" intersecting (2)/(1) curies... +0ms
  bte:biothings-explorer-trapi:QNode Node "n0" kept (1) curies... +1ms
  bte:biothings-explorer-trapi:edge-manager 'e01' Reversed[false] (1)--(7) entities / (7) records. +979ms
  bte:biothings-explorer-trapi:edge-manager 'e01' dropped (5) records. +0ms

The response (mgi.txt) only has info from the first starting ID: MGI:109384 aka NCBIGene:11910 (Atf3).
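
To spot this pattern in longer console logs, here's a small sketch that pulls the relevant counts out of the log text. The regexes are written against the exact log lines quoted above, so they're an assumption about the log format in general and may need adjusting if the wording differs. The helper name `summarize` is made up, and it only reports the first match of each kind.

```python
import re

# Log lines copied from the Test 0 console output above.
LOG = """
bte:biothings-explorer-trapi:QNode Node "n0" intersecting (2)/(1) curies... +0ms
bte:biothings-explorer-trapi:QNode Node "n0" kept (1) curies... +1ms
bte:biothings-explorer-trapi:edge-manager 'e01' Reversed[false] (1)--(7) entities / (7) records. +979ms
bte:biothings-explorer-trapi:edge-manager 'e01' dropped (5) records. +0ms
"""

INTERSECT = re.compile(r'Node "([^"]+)" intersecting \((\d+)\)/\((\d+)\) curies')
KEPT = re.compile(r'Node "([^"]+)" kept \((\d+)\) curies')
EDGE = re.compile(
    r"'([^']+)' Reversed\[(?:true|false)\] \((\d+)\)--\((\d+)\) entities / \((\d+)\) records"
)
DROPPED = re.compile(r"'([^']+)' dropped \((\d+)\) records")


def summarize(log_text):
    """Collect the first intersect/kept/records/dropped counts found in the log."""
    summary = {}
    match = INTERSECT.search(log_text)
    if match:
        # The two numbers as printed in "(a)/(b) curies", in that order.
        summary["intersecting"] = (int(match.group(2)), int(match.group(3)))
    match = KEPT.search(log_text)
    if match:
        summary["kept"] = int(match.group(2))
    match = EDGE.search(log_text)
    if match:
        summary["records"] = int(match.group(4))
    match = DROPPED.search(log_text)
    if match:
        summary["dropped"] = int(match.group(2))
    return summary


print(summarize(LOG))
```

A non-zero "dropped" count next to a "kept" count smaller than the number of starting IDs is the symptom described here.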

(based off these two test examples: here and here)

EDIT: This is also happening with namespaces that aren't problematic. For example, with this query starting with 2 HGNC IDs...the results for the 2nd ID are missing.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Gene"],
                    "ids": ["HGNC:10615", "HGNC:1043"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:biomarker_for"]
                }
            }
        }
    }
}


Test 1: start with 1 DOID, then another ID is added by NodeNorm (focus on RGD output)


Same start as test 0 (make sure your local instance has the BioThings AGR registration info).

Example query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Disease"],
                    "ids": ["DOID:3587"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:has_biomarker"]
                }
            }
        }
    }
}

I'm missing my expected answer, RGD:1303018 (Mcm7) in my response: odd.txt. (based on this operation)

In the console logs (odd_response.txt), I see signs that records are dropped.

  bte:biothings-explorer-trapi:QNode Node "n0" intersecting (1)/(1) curies... +1ms
  bte:biothings-explorer-trapi:QNode Node "n0" kept (1) curies... +0ms
  bte:biothings-explorer-trapi:edge-manager 'e01' Reversed[false] (1)--(437) entities / (448) records. +5s
  bte:biothings-explorer-trapi:edge-manager 'e01' dropped (372) records. +2ms

My investigation:

  • the query starts with 1 ID: DOID:3587
  • but I think NodeNorm finds an equivalent: DOID:3498, so BTE uses 2 DOID IDs in its execution. The proof is:
    • the response's pancreatic ductal adenocarcinoma node has both DOIDs in "attribute_type_id": "biolink:xref"
    • the console logs show that 2 DOIDs were used in sub-queries. Ex: Successful POST https://biothings.ncats.io/agr (2 IDs): Disease > has_biomarker > Gene (obtained 111 records, took 306ms)
  • While my expected RGD answer RGD:1303018 (Mcm7) isn't in the response...there are 12 other RGD IDs (ctrl-f for RGD:). But these RGD IDs seem to be from the NodeNorm-added DOID:3498


Test 2: start with 1 DOID, then IDs are added during node/ID-expansion (focus on HGNC output)


Example query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Disease"],
                    "ids": ["DOID:0080322"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:has_biomarker"]
                }
            }
        }
    }
}

While I get my expected answer HGNC:1043 (BGLAP), I notice from the console logs (consolelogs2.txt) that records are being dropped:

  bte:biothings-explorer-trapi:QNode Node "n0" intersecting (8)/(1) curies... +1ms
  bte:biothings-explorer-trapi:QNode Node "n0" kept (1) curies... +0ms
  bte:biothings-explorer-trapi:edge-manager 'e01' Reversed[false] (1)--(308) entities / (331) records. +3s
  bte:biothings-explorer-trapi:edge-manager 'e01' dropped (219) records. +1ms

The console logs also show that the starting DOID:0080322 was expanded to include 7 descendants:

  bte:biothings-explorer-trapi:main Expanded ids for node n0: (1 ids -> 8 ids) +0ms

  bte:biothings-explorer-trapi:nodeUpdateHandler curies: {"Disease":["DOID:0080322","DOID:0110861","DOID:898","DOID:0080212","DOID:0080273","DOID:0110858","DOID:0110859","DOID:0110860"]} +0ms

But the response only seems to have HGNC IDs for the starting DOID, and not for the other 7 descendants. This is based on a quick comparison of:

  • the 18 HGNC IDs in the response (ctrl-f for HGNC: in hgnc-response.json)
  • VS the 18 HGNC IDs linked to DOID:0080322 in the expected sub-query response subq2.json

(based on this operation)
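
The ctrl-f comparison above can also be done with a few lines of code. This is just a sketch: it treats both files as plain text and collects anything that looks like `HGNC:<digits>`, so it doesn't know which node or record an ID came from. The helper names are hypothetical, and reading hgnc-response.json and subq2.json assumes you've saved those attachments locally.

```python
import re

HGNC_PATTERN = re.compile(r"HGNC:\d+")


def hgnc_ids(text):
    """Return the set of HGNC CURIEs that appear anywhere in the text."""
    return set(HGNC_PATTERN.findall(text))


def compare(response_text, subquery_text):
    """Split HGNC IDs into shared / response-only / sub-query-only sets."""
    in_response = hgnc_ids(response_text)
    in_subquery = hgnc_ids(subquery_text)
    return {
        "shared": in_response & in_subquery,
        "only_in_response": in_response - in_subquery,
        "only_in_subquery": in_subquery - in_response,
    }


# Usage (assumes both attachments were saved next to this script):
#   with open("hgnc-response.json") as f1, open("subq2.json") as f2:
#       report = compare(f1.read(), f2.read())
#   print({name: len(ids) for name, ids in report.items()})
```

If the response really only covers the starting DOID, "only_in_response" should come back empty when compared against that DOID's sub-query hits.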

@colleenXu
Collaborator Author

And here's a compilation of test queries for the original bug (MONDO, HP, DOID, MGI input). These don't require doing any overrides. They pass if there's >=1 result.

They don't test for the "multiple curies -> dropping records" behavior I described in the previous post.

MONDO input

query BioThings TTD through BTE (e481efd21f8e8c1deac05662439c2294). Based on this operation testExample

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["MONDO:0019056"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:SmallMolecule"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

HP input

query MyDisease through BTE (671b45c0301c8624abbd26ae78449ca2). Based on this operation testExample

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["HP:0000224"],
                    "categories":["biolink:PhenotypicFeature"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

DOID input

query MyVariant through BTE (09c8782d9f4027712e65b95424adba79). Based on this operation testExample

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Disease"],
                    "ids": ["DOID:9256"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:affected_by"]
                }
            }
        }
    }
}

MGI input

query MGIGene2Pheno through BTE (77ed27f111262d0289ed4f4071faa619). Based on this operation testExample

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Gene"],
                    "ids": ["MGI:102539"]
                },
                "n1": {
                    "categories": ["biolink:PhenotypicFeature"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}
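
If it helps, the four checks above can be organized as a small table plus a pass rule. This is a sketch under assumptions: the SmartAPI IDs are the ones listed above, the URL pattern follows the `http://localhost:3000/v1/smartapi/<id>/query` endpoints used earlier in this issue, and the names `CASES`, `endpoint`, and `passes` are made up for illustration. Sending the requests is left to whatever client you prefer.

```python
# Input namespace -> API queried through BTE and its SmartAPI ID (from the list above).
CASES = {
    "MONDO": {"api": "BioThings TTD", "smartapi_id": "e481efd21f8e8c1deac05662439c2294"},
    "HP": {"api": "MyDisease", "smartapi_id": "671b45c0301c8624abbd26ae78449ca2"},
    "DOID": {"api": "MyVariant", "smartapi_id": "09c8782d9f4027712e65b95424adba79"},
    "MGI": {"api": "MGIGene2Pheno", "smartapi_id": "77ed27f111262d0289ed4f4071faa619"},
}


def endpoint(base_url, smartapi_id):
    """Build the API-specific query URL for a BTE instance."""
    return f"{base_url.rstrip('/')}/v1/smartapi/{smartapi_id}/query"


def passes(trapi_response):
    """A check passes if the response has at least one result."""
    results = (trapi_response.get("message") or {}).get("results") or []
    return len(results) >= 1


for namespace, case in CASES.items():
    url = endpoint("http://localhost:3000", case["smartapi_id"])
    print(f"{namespace} input -> {case['api']}: {url}")
```

Swap the base URL to point the same checks at a different instance.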

@tokebe
Member

tokebe commented Nov 30, 2023

Looks like an error with the BioThings JQ, I'll see about fixing it.

@tokebe
Member

tokebe commented Dec 4, 2023

I've pushed a fix which should address this issue.

@tokebe tokebe added On CI Related changes are deployed to CI server and removed On Dev Related changes are deployed to Dev server labels Dec 5, 2023
@colleenXu
Collaborator Author

colleenXu commented Dec 6, 2023

Looks like the fix works!

I ran my tests with main branches (CI) and everything worked as intended.

I also successfully tested in dev.

@tokebe tokebe added On Test Related changes are deployed to Test server and removed On CI Related changes are deployed to CI server labels Dec 20, 2023
@colleenXu
Collaborator Author

Closing this issue since the changes have been deployed to Prod with the Feb 2024 release.

I've confirmed that I can query BTE prod with the example queries in Test 0 of #756 (comment) and in #756 (comment), and everything works as expected.
