reconstitute (Snowflake) DataSources without "duplicate" database declarations #3286

Closed
cburroughs opened this issue Oct 12, 2022 · 3 comments
Labels
kind/feature New feature or request priority/p1

Comments

@cburroughs
Contributor

Is your feature request related to a problem? Please describe.

I'm investigating improvements to the DataHub Feast Integration https://datahubproject.io/docs/generated/ingestion/sources/feast/. To complete a data lineage graph, the ingestion plugin needs to be able to associate a feature view with the "fully qualified" name of the table. That is, something like ${db}.${schema}.${table}.

Right now this only works by manually duplicating the database name from feature_store.yaml into the source definition, which feels, well... duplicative. Otherwise the properties of the class are not fully reconstituted from the registry.
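To make the requirement concrete, here is a minimal sketch of what the ingestion side needs to compute. The helper name `fully_qualified_name` is hypothetical (it is not part of Feast or DataHub), and the sketch does no Snowflake identifier quoting; it only shows that the name cannot be built when the database or schema is missing from the reconstituted source.

```python
from typing import Optional


def fully_qualified_name(
    database: Optional[str], schema: Optional[str], table: str
) -> Optional[str]:
    """Return "db.schema.table", or None when database or schema is unknown.

    Hypothetical helper for illustration only; not a Feast or DataHub API.
    """
    if not database or not schema:
        # Without both parts the table cannot be placed in a lineage graph.
        return None
    return f"{database}.{schema}.{table}"


# With the database declared on the source, all three parts are available.
with_db = fully_qualified_name(
    "EXPERIMENTS_CHRIS", "PUBLIC", "notable_hyena_feast_driver_hourly_stats"
)

# Without it, only the table name survives in the registry.
without_db = fully_qualified_name(
    None, None, "notable_hyena_feast_driver_hourly_stats"
)
```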

(From the init demo)

driver_stats_source = SnowflakeSource(
    # The Snowflake table where features can be found
    database=yaml.safe_load(open("feature_store.yaml"))["offline_store"]["database"],
    table=f"{project_name}_feast_driver_hourly_stats",
    # The event timestamp is used for point-in-time joins and for ensuring only
    # features within the TTL are returned
    timestamp_field="event_timestamp",
    # The (optional) created timestamp is used to ensure there are no duplicate
    # feature rows in the offline store or when building training datasets
    created_timestamp_column="created",
)

Results in

$ feast data-sources describe notable_hyena_feast_driver_hourly_stats
type: BATCH_SNOWFLAKE
timestampField: event_timestamp
createdTimestampColumn: created
snowflakeOptions:
  table: notable_hyena_feast_driver_hourly_stats
  schema: PUBLIC
  database: EXPERIMENTS_CHRIS
name: notable_hyena_feast_driver_hourly_stats

While

driver_stats_source = SnowflakeSource(
    # The Snowflake table where features can be found
    #database=yaml.safe_load(open("feature_store.yaml"))["offline_store"]["database"],
    table=f"{project_name}_feast_driver_hourly_stats",
    # The event timestamp is used for point-in-time joins and for ensuring only
    # features within the TTL are returned
    timestamp_field="event_timestamp",
    # The (optional) created timestamp is used to ensure there are no duplicate
    # feature rows in the offline store or when building training datasets
    created_timestamp_column="created",
)

only gives

$ feast data-sources describe notable_hyena_feast_driver_hourly_stats
type: BATCH_SNOWFLAKE
timestampField: event_timestamp
createdTimestampColumn: created
snowflakeOptions:
  table: notable_hyena_feast_driver_hourly_stats
name: notable_hyena_feast_driver_hourly_stats

Note that even the init example just loads from the yaml file.

Describe the solution you'd like

Since get_historical_features works in either case, I presume there must already be a code path that handles this. I would like the database and schema properties of a SnowflakeSource to always hold the same values that are used when get_historical_features is called.
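A rough sketch of the behavior I have in mind, under stated assumptions: `SourceOptions` and `OfflineStoreDefaults` are hypothetical stand-ins, not Feast classes, and I am assuming the offline store configuration can supply a schema as well as a database. I have not checked how Feast resolves these values internally.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SourceOptions:
    """Hypothetical stand-in for the options stored on a SnowflakeSource."""

    table: str
    database: Optional[str] = None
    schema: Optional[str] = None


@dataclass(frozen=True)
class OfflineStoreDefaults:
    """Hypothetical stand-in for the offline_store section of feature_store.yaml."""

    database: Optional[str] = None
    schema: Optional[str] = None  # assumption: the config may also carry a schema


def resolve_options(
    source: SourceOptions, defaults: OfflineStoreDefaults
) -> SourceOptions:
    """Fill in database/schema from the store defaults when the source omits them."""
    return replace(
        source,
        database=source.database or defaults.database,
        schema=source.schema or defaults.schema,
    )


resolved = resolve_options(
    SourceOptions(table="notable_hyena_feast_driver_hourly_stats"),
    OfflineStoreDefaults(database="EXPERIMENTS_CHRIS", schema="PUBLIC"),
)
```

Values set explicitly on the source win over the defaults, so existing definitions that already declare a database would keep their current behavior.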

Describe alternatives you've considered

The DataHub ingestion plugin could read feature_store.yaml and smoosh strings together. That would work for this particular case, but I suspect it would be brittle in the long run.

cburroughs added the kind/feature (New feature or request) label on Oct 12, 2022
sfc-gh-madkins self-assigned this on Oct 26, 2022
@sfc-gh-madkins
Collaborator

@cburroughs I am looking into this

@sfc-gh-madkins
Collaborator

@cburroughs I think I understand now... How does this work if you only include a table name and not a fully qualified name?

@sfc-gh-madkins
Collaborator

not likely that we will enforce fully qualified names here
