
Feast push (Redshift/DynamoDB) does not work with PushMode.ONLINE_AND_OFFLINE when there are more than 500 columns #3282

Closed
beubeu13220 opened this issue Oct 10, 2022 · 2 comments · Fixed by #3377

Comments

@beubeu13220
Contributor

Expected Behavior

Currently, we have a push source with a Redshift offline store and a DynamoDB online store.
We built our view with more than 500 columns (around 750).

We expect data to be ingested into both DynamoDB and Redshift when we run
fs.push("push_source", df, to=PushMode.ONLINE_AND_OFFLINE)

Current Behavior

The push command raises an error like [ERROR] ValueError: The input dataframe has columns ..
The error originates in the get_table_column_names_and_types method used by write_to_offline_store.
In write_to_offline_store, we check if set(input_columns) != set(source_columns) and raise the above error when they differ.

With more than 500 columns we get a spurious diff, because source_columns comes from the get_table_column_names_and_types result, which is truncated according to the MaxResults parameter.
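The failure mode can be illustrated with plain sets (the column names and the 500-column cap below are illustrative; the actual check lives in Feast's write_to_offline_store):

```python
# Illustration only: why the column check fails when the catalog query is truncated.
# The cap of 500 mirrors the describe_table result being limited by MaxResults.
input_columns = [f"field_{i}" for i in range(750)]   # columns in the pushed dataframe
source_columns = input_columns[:500]                 # truncated describe_table result

# This is the shape of the check in write_to_offline_store:
missing = set(input_columns) - set(source_columns)
if set(input_columns) != set(source_columns):
    # Feast raises ValueError here even though the table really has all 750 columns
    print(f"ValueError would be raised; {len(missing)} columns appear 'missing'")
```

So the table is fine; only the metadata query is incomplete.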

Steps to reproduce

entity = Entity(
    name="entity",
    join_keys=["entity_id"],
    value_type=ValueType.INT64,
)

push_source = PushSource(
    name="push_source",
    batch_source=RedshiftSource(
        table="fs_push_view",
        timestamp_field="datecreation",
        created_timestamp_column="created_at",
    ),
)

besoin_embedding_push_view = FeatureView(
    name="push_view",
    entities=[entity],
    schema=[Field(name=f"field_{dim}", dtype=types.Float64) for dim in range(768)],
    source=push_source,
)

fs.push("push_source", df, to=PushMode.ONLINE_AND_OFFLINE)

Specifications

  • Version: 0.25.0
  • Platform: AWS
  • Subsystem:

Possible Solution

In my mind, we have two solutions:

  • Set a higher MaxResults in the describe_table call
  • Use NextToken to iterate through all pages of results
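A minimal sketch of the NextToken approach. The helper name and the pagination keys are assumptions based on the boto3 redshift-data describe_table response shape (a ColumnList plus an optional NextToken); the real call also needs cluster/credential parameters omitted here:

```python
def get_all_columns(client, database, table):
    """Page through describe_table results until NextToken is exhausted.

    `client` is assumed to expose a boto3-style describe_table(...) that
    returns {"ColumnList": [...], "NextToken": "..."} when more pages remain.
    """
    columns = []
    next_token = None
    while True:
        kwargs = {"Database": database, "Table": table}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.describe_table(**kwargs)
        columns.extend(page["ColumnList"])
        next_token = page.get("NextToken")
        if not next_token:  # no more pages to fetch
            break
    return columns
```

With this loop, source_columns would contain all 750 columns regardless of the per-page MaxResults limit, so the set comparison in write_to_offline_store would pass.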
@achals
Member

achals commented Oct 12, 2022

Hi @beubeu13220, I think either of the two solutions is a good option. I'd prefer the NextToken approach, simply because it's probably the more stable one.

Would you like to make a PR to add this functionality? We'd be happy to review!

@achals achals added the good first issue Good for newcomers label Oct 12, 2022
@beubeu13220
Contributor Author

beubeu13220 commented Oct 26, 2022

Hi @achals,

Yes, I'll do that as soon as I have time.
For the moment, we use a custom write_to_offline_redshift function.
