fix(compute-engine/local): Honor field_mapping on join keys in dedup + join nodes#6395
Merged
franciscojavierarceo merged 3 commits intoMay 13, 2026
Conversation
When a batch source defines a `field_mapping` that renames an entity join key (e.g. `USERID` -> `user_id`), the source-read node renames the columns on the pulled Arrow table to their mapped names. Downstream `LocalDedupNode` and `LocalJoinNode` then look up the *pre-mapping* names from `column_info.join_keys`, which raises `KeyError: Index(['USERID'])` during materialization (or returns an empty join). Add a `join_keys_columns` property on `ColumnInfo` that mirrors the existing `timestamp_column` / `created_timestamp_column` properties — returning join keys translated through `field_mapping` — and use it from the dedup and join nodes. Fixes feast-dev#5942. Signed-off-by: 1fanwang <1fannnw@gmail.com>
Collaborator
|
LGTM |
HaoXuAI
approved these changes
May 12, 2026
Signed-off-by: 1fanwang <1fannnw@gmail.com>
rpathade
pushed a commit
to rpathade/feast
that referenced
this pull request
May 21, 2026
…+ join nodes (feast-dev#6395) * fix: Apply field mapping to join keys in local compute engine nodes When a batch source defines a `field_mapping` that renames an entity join key (e.g. `USERID` -> `user_id`), the source-read node renames the columns on the pulled Arrow table to their mapped names. Downstream `LocalDedupNode` and `LocalJoinNode` then look up the *pre-mapping* names from `column_info.join_keys`, which raises `KeyError: Index(['USERID'])` during materialization (or returns an empty join). Add a `join_keys_columns` property on `ColumnInfo` that mirrors the existing `timestamp_column` / `created_timestamp_column` properties — returning join keys translated through `field_mapping` — and use it from the dedup and join nodes. Fixes feast-dev#5942. Signed-off-by: 1fanwang <1fannnw@gmail.com> * test: also cover LocalJoinNode field_mapping case Signed-off-by: 1fanwang <1fannnw@gmail.com> --------- Signed-off-by: 1fanwang <1fannnw@gmail.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com>
ntkathole
added a commit
that referenced
this pull request
May 23, 2026
* feat: Add enabled/disabled toggle for feature views Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * feat: Add demo noteboooks for users Signed-off-by: ntkathole <nikhilkathole2683@gmail.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * feat: Add CLI enable/disable commands and registry metadata support Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * Added features Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix(compute-engine/local): Honor field_mapping on join keys in dedup + join nodes (#6395) * fix: Apply field mapping to join keys in local compute engine nodes When a batch source defines a `field_mapping` that renames an entity join key (e.g. `USERID` -> `user_id`), the source-read node renames the columns on the pulled Arrow table to their mapped names. Downstream `LocalDedupNode` and `LocalJoinNode` then look up the *pre-mapping* names from `column_info.join_keys`, which raises `KeyError: Index(['USERID'])` during materialization (or returns an empty join). Add a `join_keys_columns` property on `ColumnInfo` that mirrors the existing `timestamp_column` / `created_timestamp_column` properties — returning join keys translated through `field_mapping` — and use it from the dedup and join nodes. Fixes #5942. Signed-off-by: 1fanwang <1fannnw@gmail.com> * test: also cover LocalJoinNode field_mapping case Signed-off-by: 1fanwang <1fannnw@gmail.com> --------- Signed-off-by: 1fanwang <1fannnw@gmail.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * feat: Add Prometheus gauges for FeatureStore installation telemetry (#6354) Signed-off-by: ntkathole <nikhilkathole2683@gmail.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * docs: Rename Atlas Vector Search to MongoDB Vector Search and fix code examples Signed-off-by: jvincent-mongodb <jeffrey.vincent@mongodb.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * feat(dynamodb): Use ProjectionExpression when requested_features is set The requested_features parameter was accepted by online_read and online_read_async but never used -- DynamoDB always fetched all features stored in the values map regardless. Add a ProjectionExpression to BatchGetItem requests when requested_features is provided, reducing data transfer, latency, and read costs. Fixes #6058 Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix(dynamodb): Fix mypy type for _build_projection_expression return The return dict contains both str and Dict[str, str] values, so the return type must be Dict[str, Any] not Dict[str, str]. Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix(bigquery): Enable list inference for parquet loads in offline_write_batch When pushing features with array/list types (e.g. STRING_LIST) to BigQuery via offline_write_batch, the data arrives as empty arrays because BigQuery's parquet loader does not infer list structure by default. Set parquet_options.enable_list_inference = True on the LoadJobConfig so array columns are written correctly. Fixes #5845 Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix(trino): Clean up temporary entity tables after retrieval (#6381) * fix(trino): Clean up temporary entity tables after retrieval TrinoOfflineStore.get_historical_features() creates a temporary table for the entity DataFrame but never drops it, leaking tables indefinitely. Apply the same context manager pattern used by BigQuery, Redshift, and Athena offline stores: wrap the query in a generator that issues DROP TABLE IF EXISTS in a finally block. Fixes #6306 Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> * fix: sort imports for ruff compliance Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> * fix: decouple temp table cleanup from query access Avoid dropping the temporary entity table on to_sql() calls. Previously, every method used a context manager that dropped the table on exit, so calling to_sql() before to_df() would destroy the table and cause subsequent queries to fail. Now the query is stored as a plain string and cleanup is handled by a dedicated _drop_temp_table() method called only after query execution (to_df, to_trino). A __del__ fallback ensures cleanup if execution methods are never called. The _cleaned_up flag makes the drop idempotent. Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> --------- Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * feat(bigquery): Support DATE-type event timestamp columns (#6362) * feat(bigquery): Support DATE-type event timestamp columns When the event_timestamp column in BigQuery is a DATE type, the generated SQL wraps comparison values in TIMESTAMP(), causing a type mismatch error. This adds a timestamp_field_type parameter to BigQuerySource that, when set to "DATE", generates DATE() comparisons instead. Closes #2530 (part 2) Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> * fix(bigquery): Use protobuf 4.25.x compatible generated code The proto files were regenerated with protobuf 6.31.1 / grpcio-tools 1.80.0, which imports runtime_version -- a module that does not exist in protobuf 4.25.x used by the project. Revert generated code to 4.25.1 format while keeping the new timestamp_field_type field. Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> * fix(bigquery): Add Literal type annotation for cast_style Mypy infers str from the ternary expression; annotate with the exact Literal union so the call to get_timestamp_filter_sql passes type checking. Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> * fix: Make timestamp_field_type default to None in FeatureViewQueryContext Callers that do not use DATE-typed timestamp fields (e.g. Spark offline store tests) should not be forced to pass timestamp_field_type. Adding a default keeps the new field backward-compatible. Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> * fix: Keep timestamp_field_type required in FeatureViewQueryContext A default value on timestamp_field_type breaks the SparkFeatureViewQueryContext subclass because its non-default fields (min_date_partition, max_date_partition) would follow a field with a default. Instead, keep it required and update the Spark test to pass it. Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> * fix: regenerate protos matching upstream mypy-protobuf style Reset all non-DataSource generated files to match master. Only DataSource_pb2.py and DataSource_pb2.pyi contain our timestamp_field_type additions (field 28). The .pyi stub is hand-edited to match the existing import style used on master. Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> --------- Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix: Fixes for ray source Signed-off-by: ntkathole <nikhilkathole2683@gmail.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * feat: Expose registry endpoints on feature server for MCP access Mount the existing REST registry routers under /registry on the feature server so that fastapi_mcp automatically exposes registry introspection (list/get for entities, feature views, data sources, feature services, permissions, projects, saved datasets, lineage, search) as MCP tools. The RegistryServer is created in-process from store.registry — no external registry server is required. Auth is enforced via inject_user_details on every mounted router. Made-with: Cursor Signed-off-by: Chaitany patel <patelchaitany93@gmail.com> Made-with: Cursor Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix: Revert state propagation to always update in _update_metadata_fields Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix: Recompile protos for protobuf 4.x compatibility and fix state machine to be opt-in Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * feat: Add unit tests for state machine and clean up lazy imports in registry Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix: Address review comments for feature view state management Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix: Resolve integration test failures in apply loop Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix: Resolve integration test failures in apply loop Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * Apply suggestion from @ntkathole Co-authored-by: Nikhil Kathole <nikhilkathole2683@gmail.com> Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix: Resolve review comments for feature_store Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix: Resolve review comments for feature_views.py Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * feat: Add FeatureStore methods and update describe for enabled/state Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix: Add type: ignore comments for mypy on BaseFeatureView attr access Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> * fix: Remove REST API endpoints for enable/disable/set-state (deferred to follow-up PR) Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> --------- Signed-off-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> Signed-off-by: ntkathole <nikhilkathole2683@gmail.com> Signed-off-by: 1fanwang <1fannnw@gmail.com> Signed-off-by: jvincent-mongodb <jeffrey.vincent@mongodb.com> Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com> Signed-off-by: Chaitany patel <patelchaitany93@gmail.com> Co-authored-by: RutujaPathade <73137503+RutujaPathade@users.noreply.github.com> Co-authored-by: ntkathole <nikhilkathole2683@gmail.com> Co-authored-by: Stefan Wang <1fannnw@gmail.com> Co-authored-by: jvincent-mongodb <jeffrey.vincent@mongodb.com> Co-authored-by: Jonathan Wrede <wrede.jonathan00@gmail.com> Co-authored-by: Jwrede <62910358+Jwrede@users.noreply.github.com> Co-authored-by: Chaitany patel <patelchaitany93@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Fixes #5942.
The bug isn't actually Snowflake-specific despite the issue title — it lives in the local compute engine's DAG nodes.
Path:
_get_column_names()returns the reverse-mapped join keys (e.g.["USERID"]).pull_latest_from_table_or_queryis called with those raw names — correct, the offline store needs them.LocalSourceReadNode.executerenames the result columns viafield_mapping.get(col, col)— the table now hasuser_id.LocalDedupNodeandLocalJoinNodethen look upcolumn_info.join_keys(still["USERID"]) on a df whose columns areuser_id. pandas raisesKeyError: Index(['USERID'], dtype='object')— the exact error in the issue.The existing
ColumnInfoclass already has mapped properties fortimestamp_columnandcreated_timestamp_column(precedent: PR #4886, which fixed the analogous bug forget_historical_features). This PR adds the missingjoin_keys_columnsmapped property and uses it inLocalDedupNodeandLocalJoinNode.How was this tested?
test_local_dedup_node_with_field_mapping_on_join_keyintests/unit/infra/compute_engines/local/test_nodes.py. Reproduces the exactKeyError(['USERID'])on master; passes with this PR.test_nodes.pystill pass.tests/unit/infra/compute_engines/still pass.ruff check,ruff format --check,mypyall clean.Scope
Local-engine only. The Spark, Ray, AWS Lambda, k8s, and Snowflake compute engines have the same
column_info.join_keyslookup pattern in their nodes and likely the same latent bug. Each is a separate <50 LoC follow-up — happy to file those if maintainers want them in this PR or as stacked.