Fixes large payload runtime exception in Datastore (issue 1633) #2181

Merged 2 commits on Jan 4, 2022

ptoman-pa
Contributor

What this PR does / why we need it:
Fixes runtime exception when feature values are larger than 1500 bytes in Datastore.

Datastore indexes values as well as keys, so by default it rejects large payloads (error: google.api_core.exceptions.InvalidArgument: 400 The value of property _ is longer than 1500 bytes.). This PR ensures that values are not indexed, which seems to be the intent of the original code.
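
As a rough illustration of the mechanism described above (not the actual diff in this PR), the sketch below finds string or bytes properties that exceed the 1500-byte indexed limit and builds an entity that excludes them from indexing. The helper names `oversized_properties` and `build_unindexed_entity` are hypothetical; `datastore.Entity(key=..., exclude_from_indexes=...)` is the google-cloud-datastore constructor, imported lazily so the size check runs without the library installed.

```python
# Indexed string/bytes properties in Datastore may not exceed this size.
INDEXED_VALUE_LIMIT_BYTES = 1500


def oversized_properties(properties, limit=INDEXED_VALUE_LIMIT_BYTES):
    """Return the sorted names of str/bytes properties larger than `limit` bytes.

    Hypothetical helper for illustration; other value types are ignored.
    """
    too_big = []
    for name, value in properties.items():
        if isinstance(value, str):
            raw = value.encode("utf-8")
        elif isinstance(value, (bytes, bytearray)):
            raw = bytes(value)
        else:
            continue
        if len(raw) > limit:
            too_big.append(name)
    return sorted(too_big)


def build_unindexed_entity(key, properties):
    """Build an entity whose oversized properties are excluded from indexes.

    Hypothetical helper; requires google-cloud-datastore at call time.
    """
    from google.cloud import datastore

    entity = datastore.Entity(
        key=key,
        exclude_from_indexes=tuple(oversized_properties(properties)),
    )
    entity.update(properties)
    return entity
```

Excluding only the oversized properties is one possible design; a store that never queries by value could just as well exclude every value property unconditionally.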

Which issue(s) this PR fixes:
Fixes #1633

Does this PR introduce a user-facing change?:

NONE

Testing
There is not yet a test harness for changes of this sort. The issue is demonstrable in master and fixed in this branch with the following steps (replace $MY_PATH and $MY_PROJECT with your own values):

echo '{"entity": "entity1", "event_timestamp": "2021-12-01T01:01:01.123456", "big_feature": ["A", "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC 
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"]}' > data.json
gsutil cp data.json gs://$MY_PATH/issue1633/long-list-data.json
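
The same kind of payload can be generated rather than typed by hand. The following is a minimal sketch, assuming the row shape used in the command above; the function name `make_repro_row` and its `chunk`/`chunks` parameters are hypothetical. It writes one newline-delimited JSON row whose list feature totals well over 1500 bytes.

```python
import json


def make_repro_row(entity="entity1", chunk=100, chunks=10):
    """Build one row whose `big_feature` list exceeds 1500 bytes in total."""

    def long_value(ch):
        # `chunks` runs of `chunk` repeated characters, separated by spaces.
        return " ".join([ch * chunk] * chunks)

    return {
        "entity": entity,
        "event_timestamp": "2021-12-01T01:01:01.123456",
        "big_feature": ["A", long_value("B"), long_value("C")],
    }


if __name__ == "__main__":
    # One JSON object per line, as expected by NEWLINE_DELIMITED_JSON loads.
    with open("data.json", "w") as f:
        f.write(json.dumps(make_repro_row()) + "\n")
```

The resulting data.json can then be copied to the bucket with the same gsutil cp command.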

from google.cloud import bigquery

# Replace $MY_PROJECT and $MY_PATH with your own project ID and bucket path.
client = bigquery.Client(project="$MY_PROJECT")

gcs_uri = 'gs://$MY_PATH/issue1633/long-list-data.json'

dataset = client.create_dataset('feast_development', exists_ok=True)
table = dataset.table('issue1633_example')

job_config = bigquery.job.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField('entity', 'STRING'),
    bigquery.SchemaField('big_feature', 'STRING', 'REPEATED'),
    bigquery.SchemaField('event_timestamp', 'TIMESTAMP'),
]
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.write_disposition = "WRITE_TRUNCATE"

load_job = client.load_table_from_uri(gcs_uri, table, job_config=job_config)
load_job.result()  # Wait for the load job to finish before materializing.

from feast import (
    BigQuerySource,
    Entity,
    Feature,
    FeatureStore,
    FeatureView,
    RepoConfig,
    ValueType,
)
from feast.repo_config import RegistryConfig
from feast.infra.online_stores.datastore import DatastoreOnlineStoreConfig
from feast.infra.offline_stores.bigquery import BigQueryOfflineStoreConfig

import datetime
from datetime import timedelta

repo_config = RepoConfig(
    project="new_project",
    registry="gs://$MY_PATH/issue1633/registry.db",
    provider="gcp",
    online_store=DatastoreOnlineStoreConfig(project_id="$MY_PROJECT"),
    offline_store=BigQueryOfflineStoreConfig(project_id="$MY_PROJECT")
)

entity = Entity(
    name="entity",
    value_type=ValueType.STRING
)
big_feature = FeatureView(
    name="big_feature",
    entities=["entity"],
    ttl=timedelta(days=180),
    features=[
        Feature(name="big_feature", dtype=ValueType.STRING_LIST)
    ],
    batch_source=BigQuerySource(
        table_ref="$MY_PROJECT.feast_development.issue1633_example",
        event_timestamp_column="event_timestamp"
    )
)

feature_store = FeatureStore(config=repo_config)
feature_store.registry._initialize_registry()
feature_store.apply([entity, big_feature])

start_time = datetime.datetime(2021, 11, 15, tzinfo=datetime.timezone.utc)
end_time = datetime.datetime(2021, 12, 31, tzinfo=datetime.timezone.utc)
feature_store.materialize(start_time, end_time) # WAS: google.api_core.exceptions.InvalidArgument: 400 The value of property "big_feature" is longer than 1500 bytes.

feature_store.get_online_features(
    features=["big_feature:big_feature"],
    entity_rows=[
        {"entity": "entity1"}
    ]
).to_df()

feature_store.teardown()

…s in Datastore.

Datastore indexes values as well as keys so large payloads are disallowed. This change clarifies that values should not be indexed. It avoids google.api_core.exceptions.InvalidArgument: 400 The value of property _ is longer than 1500 bytes.

Signed-off-by: Pamela Toman <ptoman@paloaltonetworks.com>
@ptoman-pa ptoman-pa requested a review from a team as a code owner January 4, 2022 01:19
@ptoman-pa ptoman-pa requested review from achals and removed request for a team January 4, 2022 01:19
@feast-ci-bot
Collaborator

Hi @ptoman-pa. Thanks for your PR.

I'm waiting for a feast-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@codecov-commenter

codecov-commenter commented Jan 4, 2022

Codecov Report

Merging #2181 (58e1978) into master (068389d) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2181      +/-   ##
==========================================
+ Coverage   84.59%   84.61%   +0.01%     
==========================================
  Files         102      102              
  Lines        8186     8201      +15     
==========================================
+ Hits         6925     6939      +14     
- Misses       1261     1262       +1     
Flag               Coverage Δ
integrationtests   74.30% <100.00%> (-0.25%) ⬇️
unittests          58.99% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                                      Coverage Δ
sdk/python/feast/infra/online_stores/datastore.py   80.79% <100.00%> (+0.67%) ⬆️
sdk/python/feast/type_map.py                        72.65% <0.00%> (-0.52%) ⬇️
sdk/python/feast/online_response.py                 87.71% <0.00%> (ø)
sdk/python/feast/feature_store.py                   91.39% <0.00%> (+0.03%) ⬆️
sdk/python/feast/infra/provider.py                  90.09% <0.00%> (+0.18%) ⬆️
sdk/python/feast/infra/utils/aws_utils.py           86.23% <0.00%> (+0.72%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 068389d...58e1978.

Signed-off-by: Pamela Toman <ptoman@paloaltonetworks.com>
Member

@achals achals left a comment

/lgtm

@feast-ci-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: achals, ptoman-pa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Development

Successfully merging this pull request may close these issues.

"The value of property is longer than 1500 bytes" error on BigQquery REPEATED STRING materialization