Remove v2 post bulk/info/variable-group endpoint definition construction #6258

Merged
nick-nlb merged 8 commits into datacommonsorg:master from nick-nlb:v2-migration-definitions
May 5, 2026

Conversation

@nick-nlb nick-nlb commented May 1, 2026

Issue

b/507068100

Related PRs

Mixer: 1877

If the above PR is not yet merged into master, you should have the above branch active in your local mixer.

Description

With Spanner soon to support definitions directly from the v2/bulk/info/variable-group endpoint, the costly frontend definition calculations are now removed. Both v1 and v2 operate on the same path, under the assumption that the definitions will be provided.

This PR involves a significant reversion of Flask-side definition construction. However, major improvements introduced during the initial work on the v2 definitions have been kept, most notably query batching to remove an N+1 that had always existed in the original v1 path.

Notes

With the addition of the include_definitions parameter to the Spanner call (see 1877 linked above), we can now selectively determine which calls to the v2/bulk/info/variable-group endpoint return definitions. This is done for the sake of latency: calls that require definitions take a non-trivial amount of additional time and compute to resolve, particularly when the call also requests additional filtering.

We take the following approach:

  • Calls made in the NL search (which do not involve additional filtering, but do need definitions) request definitions.
  • Calls made directly to the Flask variable-group endpoint (used only by the hierarchy) never need definitions, and so always exclude them.
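
The split described in the two bullets above could be sketched as follows. This is only an illustration: the function name, the request-body shape (including the "nodes" key), and the example node IDs are assumptions; only the include_definitions parameter name comes from the description.

```python
from typing import Any, Dict, List


def build_variable_group_request(nodes: List[str],
                                 include_definitions: bool) -> Dict[str, Any]:
    """Builds a hypothetical request body for the variable-group endpoint.

    The "nodes" key and the overall shape are assumed for illustration;
    they are not taken from the Mixer API.
    """
    return {
        "nodes": list(nodes),
        "include_definitions": include_definitions,
    }


# NL search needs definitions, so it asks for them.
nl_request = build_variable_group_request(["dc/g/Demographics"], True)

# The hierarchy (via the Flask endpoint) never needs them, so it opts out.
hierarchy_request = build_variable_group_request(["dc/g/Root"], False)
```

Making the flag a required argument, rather than giving it a default, forces each call site to state whether it is willing to pay the extra latency.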

Testing

Explore searches should provide largely similar results through either Spanner- or BT-backed mixers.

Results may not be exactly the same. For example, in the following search, the first graph provides "Language Spoken at Home" as an explore more option from BT, but does not from Spanner (you can see this below using production for the BT version, but the same result will happen locally).

This is due to the backfilled Spanner definitions being different in some cases (i.e., in the case below, the base and comparison SVs differ by more than one constraint in the Spanner definitions). The root of this discrepancy is unknown, but it is not due to the functionality of this PR.

Local search: Population of the United States

Production search: Population of the United States
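
To make the "differ by more than one constraint" point above concrete, here is a small sketch. It assumes a definition can be viewed as a mapping from constraint property to value; that representation, the helper name, and the example values are hypothetical and are not taken from the Spanner or BT data.

```python
from typing import Dict


def count_differing_constraints(base: Dict[str, str],
                                other: Dict[str, str]) -> int:
    """Counts properties whose values differ, or that appear in only one definition."""
    keys = set(base) | set(other)
    return sum(1 for key in keys if base.get(key) != other.get(key))


# Hypothetical definitions, for illustration only.
base_def = {"populationType": "Person"}
one_extra = {"populationType": "Person", "languageSpokenAtHome": "Spanish"}
two_extra = {
    "populationType": "Person",
    "languageSpokenAtHome": "Spanish",
    "age": "Years5Onwards",
}

print(count_differing_constraints(base_def, one_extra))
print(count_differing_constraints(base_def, two_extra))
```

Under this reading, a comparison SV whose backfilled definition carries an additional constraint would no longer sit exactly one constraint away from the base SV, which is consistent with the explore-more option disappearing.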

Merging Notes

🚨 This PR should not be merged until after the Mixer PR 1877 linked above has been merged.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request streamlines the variable group information retrieval process by removing the client-side construction of variable definitions for the v2/bulk/info/variable-group endpoint. With the backend now providing these definitions directly, the frontend logic has been simplified to use a unified path for both v1 and v2, reducing complexity and removing redundant API calls.

Highlights

  • Removed V2 API dependency: Removed the conditional logic and helper functions that were previously used to handle V2 API-specific variable definition calculations, as these are now supported directly by the backend.
  • Simplified variable extension: Refactored the variable extension logic to unify V1 and V2 paths, removing the need for separate V2-specific traversal and definition fetching.
  • Cleaned up unused code: Removed obsolete constants, helper functions, and the associated unit test file that were specific to the now-deprecated V2 definition construction logic.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request removes V2-specific variable extension logic and associated feature flags, simplifying the extend_svs and open_svgs functions. The review feedback identifies a critical performance regression where batched network calls for indirect siblings were replaced by sequential queries inside a loop, creating an N+1 query problem. Furthermore, several instances of unsafe dictionary access were found that could lead to KeyError exceptions, and the reviewer recommends restoring and updating the deleted test suite to ensure the refactored logic remains correct.

I am having trouble creating individual review comments. Click here to see my feedback.

server/lib/nl/common/variable.py (180-188)

high

This refactoring introduces a significant performance regression by removing the batching logic for fetching indirect siblings. The previous implementation used a multi-pass approach with _fetch_indirect_siblings to perform network calls in batches for all candidate SVs. The current implementation performs three separate network calls (property_values twice and get_variable_group_info once) sequentially inside a loop for each SV requiring indirect expansion. This will lead to an N+1 query problem and significantly increase latency when multiple SVs are processed. Please consider restoring the batched fetching logic.
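
The difference between per-item calls and a batched pass that this comment describes can be sketched roughly as follows. FakeFetcher and both helper functions are hypothetical stand-ins that only count simulated round trips; they are not the repository's fetch module or its _fetch_indirect_siblings helper.

```python
from typing import Dict, List


class FakeFetcher:
    """Hypothetical stand-in for a network client; counts round trips."""

    def __init__(self, parents: Dict[str, List[str]]):
        self._parents = parents
        self.calls = 0

    def parents_of(self, nodes: List[str]) -> Dict[str, List[str]]:
        self.calls += 1  # one simulated network call, whatever the batch size
        return {n: self._parents[n] for n in nodes if n in self._parents}


def parents_per_item(fetcher: FakeFetcher,
                     svgs: List[str]) -> Dict[str, List[str]]:
    """N+1 style: one call per SVG inside the loop."""
    out = {}
    for svg in svgs:
        out[svg] = fetcher.parents_of([svg]).get(svg, [])
    return out


def parents_batched(fetcher: FakeFetcher,
                    svgs: List[str]) -> Dict[str, List[str]]:
    """Batched style: collect all candidates first, then make a single call."""
    response = fetcher.parents_of(svgs)
    return {svg: response.get(svg, []) for svg in svgs}
```

Both helpers return the same mapping; only the number of round trips differs, which is where the latency concern comes from.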

server/tests/lib/nl/variable_test.py (1-184)

high

Removing this test file is not recommended. Although the PR removes the frontend definition calculation logic, the core functionality of extend_svs (finding direct and indirect siblings) is still present and has been heavily refactored. These tests provide critical verification for the variable extension logic and should be updated to match the new implementation rather than being deleted.

server/lib/nl/common/variable.py (170)

medium

Using direct dictionary access svg2childsvs[svg] may raise a KeyError if the get_variable_group_info call (on line 154) did not return data for that specific SVG node. It is safer to use .get(svg, []), which was the pattern used in the previous implementation.

    for child_sv in svg2childsvs.get(svg, []):
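
As a standalone illustration of the difference, here is a minimal sketch. The dictionary contents and the two helper names are made up for the example; only the svg2childsvs name comes from the comment above.

```python
from typing import Dict, List

# Hypothetical lookup result: "svg/B" is absent, as if no data came back for it.
svg2childsvs: Dict[str, List[str]] = {"svg/A": ["sv/1", "sv/2"]}


def children_unsafe(svg: str) -> List[str]:
    """Direct indexing: raises KeyError when the key is missing."""
    return svg2childsvs[svg]


def children_safe(svg: str) -> List[str]:
    """dict.get with a default: falls back to an empty list."""
    return svg2childsvs.get(svg, [])
```

With the safe variant, a loop over children_safe("svg/B") simply does nothing instead of aborting the whole request.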

server/lib/nl/common/variable.py (180)

medium

Accessing the result of property_values directly with [svg] may raise a KeyError if the node is not found in the response. It's safer to use .get(svg, []).

      svg_parents = fetch.property_values([svg], "specializationOf", True).get(svg, [])

server/lib/nl/common/variable.py (184-185)

medium

Similar to line 180, accessing the result of property_values directly with [svg_parent] may raise a KeyError.

      svg_siblings = fetch.property_values([svg_parent], "specializationOf",
                                           False).get(svg_parent, [])

server/lib/nl/common/variable.py (189)

medium

It is safer to check if data exists in svg_siblings_info before iterating, similar to the check on line 155.
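
A guard of the kind suggested here might look like the sketch below. The "data" key is inferred from the wording of the comment, and the response shape and helper name are assumptions rather than the actual get_variable_group_info contract.

```python
from typing import Any, Dict, Iterator, Optional


def iter_group_entries(
        svg_info: Optional[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yields entries under "data", or nothing if the response is empty or lacks that key."""
    if not svg_info or not svg_info.get("data"):
        return
    for entry in svg_info["data"]:
        yield entry
```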

server/lib/nl/common/variable.py (196)

medium

Similar to line 170, using svg2childsvs.get(svg, []) is safer here to avoid a potential KeyError.

      for new_sv_info in svg2childsvs.get(svg, []):

@nick-nlb nick-nlb marked this pull request as ready for review May 1, 2026 22:50
@nick-nlb nick-nlb requested review from juliawu May 1, 2026 22:59

@juliawu juliawu left a comment

Thanks for the updates!

@nick-nlb nick-nlb merged commit 57daa02 into datacommonsorg:master May 5, 2026
13 checks passed