Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Redundant feature materialization and premature incremental materialization timestamp updates #3789

Merged
merged 10 commits into from
Oct 25, 2023

Conversation

james-crabtree-sp
Copy link
Contributor

What this PR does / why we need it:

This PR fixes 2 issues encountered while attempting to run very large materializations (Several billion rows as part of an initial run of online materialization).

  1. Every pod created materializes every file in the dataset. This is redundant and costs a lot of time and money. To fix this, I updated the dataflow code to select only the single file corresponding to the running pod's JOB_COMPLETION_INDEX as input.
  2. The last materialized date used for incremental materialization was being updated before materialization had even run. This meant that if materialization had failed for some reason, many updates could be ignored, because the next run would continue from beyond the end date of the failed run. I solved this by implementing a synchronous mode for bytewax materialization that will wait for materialization of a feature to complete successfully before returning and updating the last materialized date
  3. Bytewax jobs can run infinitely if node crashes occur often enough that the job has to continuously reprocess the same pods after they are lost. I mitigated this by adding a configurable activeDeadlineSeconds and by adding an option to synchronous materialization to allow these jobs to be split into smaller batches so fewer pods have to be rerun in the event of a node failure

Which issue(s) this PR fixes:
Fixes #3786
Fixes #3787
Fixes #3788

@james-crabtree-sp james-crabtree-sp changed the title fix: redundant feature materialization and premature incremental materialization timestamp updates fix: Redundant feature materialization and premature incremental materialization timestamp updates Oct 9, 2023
@sudohainguyen
Copy link
Collaborator

great, I have been working on the similar PR, but you did it first and way better 💯

@whoahbot
Copy link
Collaborator

@james-crabtree-sp Thanks again for putting this PR together. I think in order to merge, you'll need to sign your commits: https://docs.feast.dev/project/development-guide#signing-off-commits

@james-crabtree-sp
Copy link
Contributor Author

@james-crabtree-sp Thanks again for putting this PR together. I think in order to merge, you'll need to sign your commits: https://docs.feast.dev/project/development-guide#signing-off-commits

I think they are all signed off already. I see that the DCO task that checks this completed successfully https://github.com/feast-dev/feast/pull/3789/checks?check_run_id=17541354092

@whoahbot
Copy link
Collaborator

@james-crabtree-sp Thanks again for putting this PR together. I think in order to merge, you'll need to sign your commits: https://docs.feast.dev/project/development-guide#signing-off-commits

I think they are all signed off already. I see that the DCO task that checks this completed successfully https://github.com/feast-dev/feast/pull/3789/checks?check_run_id=17541354092

Sorry for the confusion, I was referring to the requirements to merge, and linked to the wrong documentation.

This is the requirement that I was referring to:

Merging is blocked
The base branch requires all commits to be signed. Learn more about signing commits.

@sudohainguyen
Copy link
Collaborator

hey @james-crabtree-sp this is a great PR, you need any hands to make it merged? I see there's a lot of commits in the past unverified

if you don't mind I can create another PR to mimic this one

james-crabtree-sp and others added 8 commits October 23, 2023 11:39
Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>
Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>
Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>
)

* SAASMLOPS-809 fix bytewax workers so they only process a single file

* SAASMLOPS-809 fix newlines

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>
* SAASMLOPS-833 add configurable job timeout

* SAASMLOPS-833 fix whitespace

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>
Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>
Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>
Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>
@james-crabtree-sp
Copy link
Contributor Author

@whoahbot @sudohainguyen Ahh, I understand now. Commits are now verified

@sudohainguyen
Copy link
Collaborator

@whoahbot @sudohainguyen Ahh, I understand now. Commits are now verified

exactly 🤣 looks good now but you need to run code format though

@james-crabtree-sp
Copy link
Contributor Author

@whoahbot @sudohainguyen Ahh, I understand now. Commits are now verified

exactly 🤣 looks good now but you need to run code format though

Noticed that too. Weird rebase issue. Should be fixed now

…error

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>
Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>
@sudohainguyen
Copy link
Collaborator

@james-crabtree-sp I've been adopting your changes on my side

so far do we need to wait the job to be done, like forever without timeout?

@james-crabtree-sp
Copy link
Contributor Author

@james-crabtree-sp I've been adopting your changes on my side

so far do we need to wait the job to be done, like forever without timeout?

Well, this PR adds the configurable active_deadline_seconds setting, which can be used to impose a timeout on the kubernetes jobs. Does that answer your question?

@sudohainguyen
Copy link
Collaborator

@james-crabtree-sp I've been adopting your changes on my side
so far do we need to wait the job to be done, like forever without timeout?

Well, this PR adds the configurable active_deadline_seconds setting, which can be used to impose a timeout on the kubernetes jobs. Does that answer your question?

cool clear

@achals achals merged commit 417b16b into feast-dev:master Oct 25, 2023
15 checks passed
woop pushed a commit that referenced this pull request Jan 13, 2024
# [0.35.0](v0.34.0...v0.35.0) (2024-01-13)

### Bug Fixes

* Add async refresh to prevent synchronous refresh in main thread ([#3812](#3812)) ([9583ed6](9583ed6))
* Adopt connection pooling for HBase ([#3793](#3793)) ([b3852bf](b3852bf))
* Bytewax engine create configmap from object ([#3821](#3821)) ([25e9775](25e9775))
* Fix warnings from deprecated paths and update default log level ([#3757](#3757)) ([68a8737](68a8737))
* improve parsing bytewax job status ([5983f40](5983f40))
* make bytewax settings unexposed ([ae1bb8b](ae1bb8b))
* Make generated temp table name escaped ([#3797](#3797)) ([175d796](175d796))
* Pin numpy version to avoid spammy deprecation messages ([774ed33](774ed33))
* Redundant feature materialization and premature incremental materialization timestamp updates ([#3789](#3789)) ([417b16b](417b16b)), closes [#6](#6) [#7](#7)
* Resolve hbase hotspot issue when materializing ([#3790](#3790)) ([7376db8](7376db8))
* Set keepalives_idle None by default ([#3756](#3756)) ([8717e9b](8717e9b))
* Set upper bound for bigquery client due to its breaking changes ([2151c39](2151c39))
* UI project cannot handle fallback routes ([#3766](#3766)) ([96ece0f](96ece0f))
* update dependencies versions due to conflicts ([5dc0b24](5dc0b24))
* Update jackson and remove unnecessary logging ([#3809](#3809)) ([018d0ea](018d0ea))
* upgrade the pyarrow to latest v14.0.1 for CVE-2023-47248. ([052182b](052182b))

### Features

* Add get online feature rpc to gprc server ([#3815](#3815)) ([01db8cc](01db8cc))
* Add materialize and materialize-incremental rest endpoints ([#3761](#3761)) ([fa600fe](fa600fe)), closes [#3760](#3760)
* add redis sentinel support ([3387a15](3387a15))
* add redis sentinel support ([4337c89](4337c89))
* add redis sentinel support format lint ([aad8718](aad8718))
* Add support for `table_create_disposition` in bigquery job for offline store ([#3762](#3762)) ([6a728fe](6a728fe))
* Add support for in_cluster config and additional labels for bytewax materialization ([#3754](#3754)) ([2192e65](2192e65))
* Apply cache to load proto registry for performance ([#3702](#3702)) ([709c709](709c709))
* Make bytewax job write as mini-batches ([#3777](#3777)) ([9b0e5ce](9b0e5ce))
* Optimize bytewax pod resource with zero-copy ([9cf9d96](9cf9d96))
* Support GCS filesystem for bytewax engine ([#3774](#3774)) ([fb6b807](fb6b807))
tokoko pushed a commit to tokoko/feast that referenced this pull request Feb 6, 2024
# [0.35.0](feast-dev/feast@v0.34.0...v0.35.0) (2024-01-13)

### Bug Fixes

* Add async refresh to prevent synchronous refresh in main thread ([feast-dev#3812](feast-dev#3812)) ([9583ed6](feast-dev@9583ed6))
* Adopt connection pooling for HBase ([feast-dev#3793](feast-dev#3793)) ([b3852bf](feast-dev@b3852bf))
* Bytewax engine create configmap from object ([feast-dev#3821](feast-dev#3821)) ([25e9775](feast-dev@25e9775))
* Fix warnings from deprecated paths and update default log level ([feast-dev#3757](feast-dev#3757)) ([68a8737](feast-dev@68a8737))
* improve parsing bytewax job status ([5983f40](feast-dev@5983f40))
* make bytewax settings unexposed ([ae1bb8b](feast-dev@ae1bb8b))
* Make generated temp table name escaped ([feast-dev#3797](feast-dev#3797)) ([175d796](feast-dev@175d796))
* Pin numpy version to avoid spammy deprecation messages ([774ed33](feast-dev@774ed33))
* Redundant feature materialization and premature incremental materialization timestamp updates ([feast-dev#3789](feast-dev#3789)) ([417b16b](feast-dev@417b16b)), closes [feast-dev#6](feast-dev#6) [feast-dev#7](feast-dev#7)
* Resolve hbase hotspot issue when materializing ([feast-dev#3790](feast-dev#3790)) ([7376db8](feast-dev@7376db8))
* Set keepalives_idle None by default ([feast-dev#3756](feast-dev#3756)) ([8717e9b](feast-dev@8717e9b))
* Set upper bound for bigquery client due to its breaking changes ([2151c39](feast-dev@2151c39))
* UI project cannot handle fallback routes ([feast-dev#3766](feast-dev#3766)) ([96ece0f](feast-dev@96ece0f))
* update dependencies versions due to conflicts ([5dc0b24](feast-dev@5dc0b24))
* Update jackson and remove unnecessary logging ([feast-dev#3809](feast-dev#3809)) ([018d0ea](feast-dev@018d0ea))
* upgrade the pyarrow to latest v14.0.1 for CVE-2023-47248. ([052182b](feast-dev@052182b))

### Features

* Add get online feature rpc to gprc server ([feast-dev#3815](feast-dev#3815)) ([01db8cc](feast-dev@01db8cc))
* Add materialize and materialize-incremental rest endpoints ([feast-dev#3761](feast-dev#3761)) ([fa600fe](feast-dev@fa600fe)), closes [feast-dev#3760](feast-dev#3760)
* add redis sentinel support ([3387a15](feast-dev@3387a15))
* add redis sentinel support ([4337c89](feast-dev@4337c89))
* add redis sentinel support format lint ([aad8718](feast-dev@aad8718))
* Add support for `table_create_disposition` in bigquery job for offline store ([feast-dev#3762](feast-dev#3762)) ([6a728fe](feast-dev@6a728fe))
* Add support for in_cluster config and additional labels for bytewax materialization ([feast-dev#3754](feast-dev#3754)) ([2192e65](feast-dev@2192e65))
* Apply cache to load proto registry for performance ([feast-dev#3702](feast-dev#3702)) ([709c709](feast-dev@709c709))
* Make bytewax job write as mini-batches ([feast-dev#3777](feast-dev#3777)) ([9b0e5ce](feast-dev@9b0e5ce))
* Optimize bytewax pod resource with zero-copy ([9cf9d96](feast-dev@9cf9d96))
* Support GCS filesystem for bytewax engine ([feast-dev#3774](feast-dev#3774)) ([fb6b807](feast-dev@fb6b807))

Signed-off-by: tokoko <togurg14@freeuni.edu.ge>
zseta pushed a commit to zseta/feast that referenced this pull request Feb 7, 2024
…rialization timestamp updates (feast-dev#3789)

* SAASMLOPS-767 wait for jobs to complete

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>

* SAASMLOPS-805 Stopgap change to fix duplicate materialization of data

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>

* SAASMLOPS-805 save BYTEWAX_REPLICAS=1

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>

* SAASMLOPS-809 fix bytewax workers so they only process a single file (feast-dev#6)

* SAASMLOPS-809 fix bytewax workers so they only process a single file

* SAASMLOPS-809 fix newlines

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>

* SAASMLOPS-833 add configurable job timeout (feast-dev#7)

* SAASMLOPS-833 add configurable job timeout

* SAASMLOPS-833 fix whitespace

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>

* develop Run large materializations in batches of pods

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>

* master Set job_batch_size at least equal to max_parallelism

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>

* master clarity max_parallelism description

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>

* master resolve bug that causes materialization to continue after job error

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>

* master resolve bug causing pod logs to not be printed

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>

---------

Signed-off-by: James Crabtree <james.crabtree@sailpoint.com>
Signed-off-by: Attila Toth <hello@attilatoth.dev>
zseta pushed a commit to zseta/feast that referenced this pull request Feb 7, 2024
# [0.35.0](feast-dev/feast@v0.34.0...v0.35.0) (2024-01-13)

### Bug Fixes

* Add async refresh to prevent synchronous refresh in main thread ([feast-dev#3812](feast-dev#3812)) ([9583ed6](feast-dev@9583ed6))
* Adopt connection pooling for HBase ([feast-dev#3793](feast-dev#3793)) ([b3852bf](feast-dev@b3852bf))
* Bytewax engine create configmap from object ([feast-dev#3821](feast-dev#3821)) ([25e9775](feast-dev@25e9775))
* Fix warnings from deprecated paths and update default log level ([feast-dev#3757](feast-dev#3757)) ([68a8737](feast-dev@68a8737))
* improve parsing bytewax job status ([5983f40](feast-dev@5983f40))
* make bytewax settings unexposed ([ae1bb8b](feast-dev@ae1bb8b))
* Make generated temp table name escaped ([feast-dev#3797](feast-dev#3797)) ([175d796](feast-dev@175d796))
* Pin numpy version to avoid spammy deprecation messages ([774ed33](feast-dev@774ed33))
* Redundant feature materialization and premature incremental materialization timestamp updates ([feast-dev#3789](feast-dev#3789)) ([417b16b](feast-dev@417b16b)), closes [feast-dev#6](feast-dev#6) [feast-dev#7](feast-dev#7)
* Resolve hbase hotspot issue when materializing ([feast-dev#3790](feast-dev#3790)) ([7376db8](feast-dev@7376db8))
* Set keepalives_idle None by default ([feast-dev#3756](feast-dev#3756)) ([8717e9b](feast-dev@8717e9b))
* Set upper bound for bigquery client due to its breaking changes ([2151c39](feast-dev@2151c39))
* UI project cannot handle fallback routes ([feast-dev#3766](feast-dev#3766)) ([96ece0f](feast-dev@96ece0f))
* update dependencies versions due to conflicts ([5dc0b24](feast-dev@5dc0b24))
* Update jackson and remove unnecessary logging ([feast-dev#3809](feast-dev#3809)) ([018d0ea](feast-dev@018d0ea))
* upgrade the pyarrow to latest v14.0.1 for CVE-2023-47248. ([052182b](feast-dev@052182b))

### Features

* Add get online feature rpc to gprc server ([feast-dev#3815](feast-dev#3815)) ([01db8cc](feast-dev@01db8cc))
* Add materialize and materialize-incremental rest endpoints ([feast-dev#3761](feast-dev#3761)) ([fa600fe](feast-dev@fa600fe)), closes [feast-dev#3760](feast-dev#3760)
* add redis sentinel support ([3387a15](feast-dev@3387a15))
* add redis sentinel support ([4337c89](feast-dev@4337c89))
* add redis sentinel support format lint ([aad8718](feast-dev@aad8718))
* Add support for `table_create_disposition` in bigquery job for offline store ([feast-dev#3762](feast-dev#3762)) ([6a728fe](feast-dev@6a728fe))
* Add support for in_cluster config and additional labels for bytewax materialization ([feast-dev#3754](feast-dev#3754)) ([2192e65](feast-dev@2192e65))
* Apply cache to load proto registry for performance ([feast-dev#3702](feast-dev#3702)) ([709c709](feast-dev@709c709))
* Make bytewax job write as mini-batches ([feast-dev#3777](feast-dev#3777)) ([9b0e5ce](feast-dev@9b0e5ce))
* Optimize bytewax pod resource with zero-copy ([9cf9d96](feast-dev@9cf9d96))
* Support GCS filesystem for bytewax engine ([feast-dev#3774](feast-dev#3774)) ([fb6b807](feast-dev@fb6b807))

Signed-off-by: Attila Toth <hello@attilatoth.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants