[PIPE-512/513/514] Incremental Updates to Award Search #3986
…lta. Adds changes to support CDF on Award Search
… --incremental param to load_query_to_delta.py
…f postgres when merging
…y for creating upserts/deletes temp tables
… use of this library in load_table_from_delta
Overall, this work is good.
My biggest concern is that I don't see any tests for the new dependency, change data feed. I understand this feature isn't open source. I don't believe the lack of testability was something we discussed when we "green-lit" this implementation plan.
I'd suggest looking into patches/mocks to see if you can circumvent the non-open source features to allow us to better test the code that runs in production.
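As a rough sketch of what the patch/mock suggestion could look like, the example below injects a fake change-feed reader so the downstream logic can be tested without the real feature. The names `read_change_feed` and `count_changes` are hypothetical and are not taken from this repository; only the `_change_type` column name comes from Delta Lake's Change Data Feed.

```python
from unittest.mock import Mock


def read_change_feed(table_name, starting_version):
    """Hypothetical stand-in for the production call that reads a table's change feed."""
    raise RuntimeError("Change Data Feed is not available in this environment")


def count_changes(table_name, starting_version, reader=read_change_feed):
    """Hypothetical code under test: tally change rows by their _change_type."""
    counts = {}
    for row in reader(table_name, starting_version):
        change_type = row["_change_type"]
        counts[change_type] = counts.get(change_type, 0) + 1
    return counts


def test_count_changes_with_mocked_feed():
    # Fake change rows shaped like Change Data Feed output (assumed columns).
    fake_rows = [
        {"award_id": 1, "_change_type": "insert"},
        {"award_id": 2, "_change_type": "insert"},
        {"award_id": 3, "_change_type": "delete"},
    ]
    fake_reader = Mock(return_value=fake_rows)

    result = count_changes("award_search", 0, reader=fake_reader)

    assert result == {"insert": 2, "delete": 1}
    fake_reader.assert_called_once_with("award_search", 0)
```

Passing the reader in as a parameter keeps the test independent of import paths; `unittest.mock.patch` on the real module attribute would be the alternative if the production signature cannot change.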
…gtransparency/usaspending-api into ftr/PIPE-514-load-sql
…ad-sql [PIPE-514]: Prototype Persisting Changes from Staging Tables into Live Table
…rations that depend on it
…3-514-improve-migrations PIPE-514; Migration Fix
@collinwr This PR has been open for a long time, what's the status of it?
This work is blocked until the Databricks Runtime version has been updated
I'd like us to keep our PRs not stale. 7 months old is just too old. We can re-open this PR when we have an actual ability to merge it.
PIPE-512
Description:
The main purpose of this PR is to allow changes to Award Search to be tracked using Databricks' Change Data Feed (CDF) feature. To accomplish this, two main changes have been made:
- create_delta_table.py command has been updated to add the ability to enable the Change Data Feed feature when creating tables with the --enable-cdf flag.
- load_query_to_delta.py command has been updated to accept an --incremental flag that causes it to use a different set of SQL queries (if available for a given table_spec, in this case award_search) to incrementally update the table rather than blowing it away and recreating it. This allows per-row tracking of changes in the table, rather than the CDF feature tracking every single record as a delete and re-insert.
Technical details:
- … delta_table_create_sql strings to include the ability to create the tables with the CDF feature enabled when using the --enable-cdf flag for consistency.
- … update_date field and a record in the external_load_date table but realized this would not result in records being merged when joined tables have been updated, so I went with an approach that compares all fields to determine whether a record should be updated or not. The performance of this has been okay so far. The job takes about 30 minutes to run.
- … award_search to sort all the array fields to ensure that the results are deterministic, so they can be compared between the source and target of a merge.
PIPE-513
Description:
Large refactor of load_table_from_delta.py with the intention of adding the ability to copy changes to Postgres incrementally using Delta Lake's Change Data Feed feature.
A big aspect of this PR is separating some existing functionality into methods, so they can be called in different ways or in multiple ways:
- … _temp table
- … _upsert table
- … _deletes table
Technical Details
- partition_column elements of load_query_to_delta.py TABLE_SPEC definitions were not being used. Removed these and introduced a more clearly labeled unique_identifiers element.
Requirements for PR merge:
Area for explaining above N/A when needed:
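For illustration, the PIPE-513 idea of turning Change Data Feed rows into an _upsert set and a _deletes set keyed by unique_identifiers might look roughly like the sketch below. It is plain Python rather than the Spark/Postgres code in this PR; `split_changes` and the sample columns are hypothetical, and only `_change_type`, `_commit_version`, and `_commit_timestamp` are real CDF metadata columns.

```python
CDF_METADATA_COLUMNS = {"_change_type", "_commit_version", "_commit_timestamp"}


def split_changes(change_rows, unique_identifiers):
    """Collapse change rows to the latest change per key, then split into upserts and deletes.

    Assumption: rows arrive in commit order, so for equal commit versions
    the later row in the list wins.
    """
    latest = {}
    for row in change_rows:
        if row["_change_type"] == "update_preimage":
            continue  # the pre-update image carries no new state
        key = tuple(row[col] for col in unique_identifiers)
        current = latest.get(key)
        if current is None or row["_commit_version"] >= current["_commit_version"]:
            latest[key] = row

    upserts, deletes = [], []
    for key, row in latest.items():
        if row["_change_type"] == "delete":
            # Only the identifying columns are needed to delete downstream.
            deletes.append(dict(zip(unique_identifiers, key)))
        else:
            # Drop CDF metadata so the row matches the target table's columns.
            upserts.append({k: v for k, v in row.items() if k not in CDF_METADATA_COLUMNS})
    return upserts, deletes


changes = [
    {"award_id": 1, "amount": 10, "_change_type": "insert", "_commit_version": 1},
    {"award_id": 1, "amount": 10, "_change_type": "update_preimage", "_commit_version": 2},
    {"award_id": 1, "amount": 15, "_change_type": "update_postimage", "_commit_version": 2},
    {"award_id": 2, "amount": 5, "_change_type": "insert", "_commit_version": 1},
    {"award_id": 2, "amount": 5, "_change_type": "delete", "_commit_version": 3},
]
upserts, deletes = split_changes(changes, ["award_id"])
```

Keeping only the latest change per key means a record that was inserted and then deleted within the same window is applied as a single delete rather than an insert followed by a delete.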