Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PIPE-512/513/514] Incremental Updates to Award Search #3986

Closed
wants to merge 134 commits into from

Conversation

collinwr
Copy link
Contributor

@collinwr collinwr commented Dec 28, 2023

PIPE-512

Description:
The main purpose of this PR is to allow changes to Award Search to be tracked using Databrick's Change Data Feed feature. To accomplish this, two main changes have been made:

  1. The create_delta_table.py command has been updated to add the ability to enable the Change Data Feed feature when creating tables with the --enable-cdf flag.
  2. The load_query_to_delta.py command has been updated to accept an --incremental flag that causes it to use a different set of SQL queries (if available for a given table_spec, in this case award_search) to incrementally update the table rather than blowing it away and recreating it. This allows per row tracking of changes in the table, rather than the CDF feature tracking every single record as a delete and re-insert.

Technical details:

  1. I've gone ahead and updated all the delta_table_create_sql strings to include the ability to create the tables with the CDF feature enabled when using the --enable-cdf flag for consistency.
  2. I tested the merge command using the update_date field and a record in the external_load_date table but realized this would not result in records merged when joined tables have been updated, so I went with an approach that compares all fields to determine whether a record should be updated or not. The performance of this has been okay so far. Takes about 30 minutes to run the job.
  3. I've updated the SQL query that populates award_search to sort all the array fields to ensure that the results are deterministic, so they can be compared between the source and target of a merge.
  4. The reason I added parameters to control a lot of the new behavior (enabling cdf and performing incremental loads) and didn't make them default is because these features are not supported with the open source version of Spark. They include some Databricks specific features. This way, any user without Databricks access can continue to run our code, and our unit tests are able to continue to run outside of Databricks.

PIPE-513

Description:
Large refactor of load_table_from_delta.py with the intention of adding the ability to copy changes to Postgres incrementally using Delta Lake's Change Data Feed feature.

A big aspect of this PR is separating some existing functionality into methods, so they can be called in different ways or in multiple ways:

  • Original path of copying full table to _temp table
  • New path of copying inserts and updates to _upsert table
  • New path of copying deletes to _deletes table

Technical Details

  • A big part of this was streamlining existing functionality by reducing edge case handling. If an edge case was not being exercised by any tables this code copying, the edge case was removed.
  • The partition_column elements of load_query_to_delta.py TABLE_SPEC definitions were not being used. Removed these and introduced a more clearly labeled unique_identifiers element.

Requirements for PR merge:

  1. Unit & integration tests updated - Code has been added to allow unit tests to run as they are. We are unable to test the new CDF and Incremental Update functionality because they rely on Databricks specific features
  2. N/A API documentation updated
  3. Necessary PR reviewers:
    • Backend
  4. N/A - Matview impact assessment completed
  5. N/A - Frontend impact assessment completed
  6. Data validation completed - Would like to do one more test on this, but so far, all data has matched up.
  7. Appropriate Operations ticket(s) created - None needed until this is ready for production
  8. Jira Ticket DEV-512:
    • Link to this Pull-Request
    • Performance evaluation of affected (API | Script | Download)
    • Before / After data comparison

Area for explaining above N/A when needed:

@collinwr collinwr added the do not merge [pr] This PR should not be merged label Dec 28, 2023
Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this work is good.

My biggest concern is that I don't see any tests for the new dependency, change data feed. I understand this feature isn't open source. I don't believe the lack of testability was something we discussed when we "green-lit" this implementation plan.

I'd suggest looking into patches/mocks to see if you can circumvent the non-open source features to allow us to better test the code that runs in production.

Desmond Porth and others added 25 commits March 26, 2024 14:28
…ad-sql

[PIPE-514]: Prototype Persisting Changes from Staging Tables into Live Table
…3-514-improve-migrations

PIPE-514; Migration Fix
@collinwr collinwr changed the title [PIPE-512/513] Incremental Updates to Award Search [PIPE-512/513/514] Incremental Updates to Award Search Apr 23, 2024
@ghost
Copy link

ghost commented Jul 21, 2024

@collinwr This PR has been open for a long time, what's the status of it?

@collinwr
Copy link
Contributor Author

@collinwr This PR has been open for a long time, what's the status of it?

This work is blocked until the Databricks Runtime version has been updated

@ghost ghost closed this Jul 22, 2024
@ghost
Copy link

ghost commented Jul 22, 2024

@collinwr This PR has been open for a long time, what's the status of it?

This work is blocked until the Databricks Runtime version has been updated

I'd like us to keep our PRs not stale. 7 months old is just too old. We can re-open this PR when we have an actual ability to merge it.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do not merge [pr] This PR should not be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants