use REPLACE INTO instead of INSERT INTO...UPDATE in covid_hosp acquisition #1356
Conversation
```python
if self.aggregate_key_cols:
    # TODO: restrict this to just UNIQUE columns from aggregate keys table?
    # create (empty) `some_temp_table` like `{self.table_name}_key`
    b = f"CREATE TABLE some_temp_table AS SELECT {self.aggregate_key_cols} FROM `{self.table_name}_key` WHERE FALSE"
    # save aggregate keys from what we are about to delete
    c = f"SELECT {self.aggregate_key_cols} INTO some_temp_table FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date} GROUP BY {self.aggregate_key_cols}"
    # TODO: combine two SQL queries above into one?
# delete from main data table where issue matches
d = f"DELETE FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date}"
if self.aggregate_key_cols:
    # delete from saved aggregate keys where the key still exists
    e = f"DELETE FROM some_temp_table JOIN `{self.table_name}` USING ({self.aggregate_key_cols})"
    # delete from aggregate key table anything left in saved keys (which should be aggregate keys that only existed in the issue we deleted)
    f = f"DELETE FROM `{self.table_name}_key` JOIN some_temp_table USING ({self.aggregate_key_cols})"
    g = "DROP TABLE some_temp_table"
```
Suggested change:

```python
if self.AGGREGATE_KEY_COLS:
    # TODO: restrict this to just UNIQUE columns from aggregate keys table?
    # create (empty) `some_temp_table` like `{self.table_name}_key`
    b = f"CREATE TABLE some_temp_table AS SELECT {self.AGGREGATE_KEY_COLS} FROM `{self.table_name}_key` WHERE FALSE"
    # save aggregate keys from what we are about to delete
    c = f"SELECT {self.AGGREGATE_KEY_COLS} INTO some_temp_table FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date} GROUP BY {self.AGGREGATE_KEY_COLS}"
    # TODO: combine two SQL queries above into one?
# delete from main data table where issue matches
d = f"DELETE FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date}"
if self.AGGREGATE_KEY_COLS:
    # delete from saved aggregate keys where the key still exists
    e = f"DELETE FROM some_temp_table JOIN `{self.table_name}` USING ({self.AGGREGATE_KEY_COLS})"
    # delete from aggregate key table anything left in saved keys (which should be aggregate keys that only existed in the issue we deleted)
    f = f"DELETE FROM `{self.table_name}_key` JOIN some_temp_table USING ({self.AGGREGATE_KEY_COLS})"
    g = "DROP TABLE some_temp_table"
```
Good suggestion, but this is just a work-in-progress for now... as you probably noticed, these are all just strings and the function doesn't actually accomplish anything. I'm going to remove the method and save it for later.
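A minimal sketch of how the stub above might translate into runnable, MySQL-flavored statements, for whenever it gets picked back up; the helper name, argument names, and the use of a DB-API cursor (e.g. pymysql) are assumptions, not part of this PR:

```python
def delete_issue(cursor, table_name, publication_col_name, issue_date, aggregate_key_cols):
    """Hypothetical sketch: remove one issue's rows and any aggregate keys orphaned by it."""
    keys = ", ".join(f"`{c}`" for c in aggregate_key_cols)
    # save the aggregate keys of the rows we are about to delete
    cursor.execute(f"CREATE TABLE tmp_keys AS SELECT DISTINCT {keys} "
                   f"FROM `{table_name}` WHERE `{publication_col_name}` = %s", (issue_date,))
    # delete the issue's rows from the main data table
    cursor.execute(f"DELETE FROM `{table_name}` WHERE `{publication_col_name}` = %s", (issue_date,))
    # discard saved keys that still appear in the main table (MySQL multi-table DELETE syntax)
    cursor.execute(f"DELETE tmp_keys FROM tmp_keys JOIN `{table_name}` USING ({keys})")
    # anything left in tmp_keys existed only in the deleted issue, so purge it from the key table
    cursor.execute(f"DELETE k FROM `{table_name}_key` AS k JOIN tmp_keys USING ({keys})")
    cursor.execute("DROP TABLE tmp_keys")
```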
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
No notes on the main code, I read through it line by line and it looks correct to me.
This indeed is/was a driver issue... It was corrected in a newer driver release, but the prod and staging "automation" servers currently run a much older version.

We should try to keep our production and development environments more in sync so that we don't run into inconsistent and confusing situations like this again. We have talked about this a number of times before for various reasons, but I don't think we ever came to a consensus. Some possibilities for addressing it include:
I will probably turn much of the text of this comment into a GH issue of its own, too. cc: @korlaxxalrok
The change from #1224 does not work on production, presumably because of the new superlong INSERT INTO...UPDATE statement that it introduced. With it, this `executemany()` call hangs indefinitely and pegs the CPU (this even happens with a batch size of 2). The covid_hosp tables have a large number of lengthy column names, and each gets repeated thrice in the INSERT INTO...UPDATE syntax, making for a ~20k-character SQL statement (and that doesn't even count the literal values that get inserted). It stands to reason that we didn't see this problem in our tests because we don't use an obnoxiously long list of sample columns there (perhaps we should).

I have a hunch that this could be a driver issue, so updates there might've remedied the situation, but I didn't want to crawl down that rabbit hole. We can potentially explore this with #1355.
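To make the size concrete, here is a rough sketch of how such an upsert statement is typically built and how quickly it grows; the table and column names are hypothetical, not the real covid_hosp schema or acquisition code:

```python
# Each column is named once in the column list and twice more in the
# ON DUPLICATE KEY UPDATE clause, so long column names inflate the statement fast.
columns = [f"previous_day_admission_adult_covid_confirmed_{i}" for i in range(150)]
col_list = ", ".join(f"`{c}`" for c in columns)
placeholders = ", ".join(["%s"] * len(columns))
update_clause = ", ".join(f"`{c}` = VALUES(`{c}`)" for c in columns)
sql = (f"INSERT INTO `some_table` ({col_list}) VALUES ({placeholders}) "
       f"ON DUPLICATE KEY UPDATE {update_clause}")
print(len(sql))  # tens of thousands of characters before any literal values are bound
```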
The fix uses MySQL's REPLACE INTO statement instead, which doesn't require enumerating column names. When there is a UNIQUE key collision, INSERT INTO...UPDATE updates the columns as defined in the UPDATE clause, while REPLACE INTO deletes the colliding row and inserts the new one.
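For comparison, a minimal sketch of the two statement shapes on a hypothetical three-column table `t` with `id` as the unique key (not the actual acquisition code):

```python
# Upsert form: every non-key column is named again (twice) in the UPDATE clause.
upsert = ("INSERT INTO `t` (`id`, `a`, `b`) VALUES (%s, %s, %s) "
          "ON DUPLICATE KEY UPDATE `a` = VALUES(`a`), `b` = VALUES(`b`)")
# REPLACE form: no column list needed when a value is supplied for every column in
# table order; a colliding row is deleted and the new row is inserted in its place.
replace = "REPLACE INTO `t` VALUES (%s, %s, %s)"
rows = [(1, "x", "y"), (2, "p", "q")]
# with a DB-API cursor (e.g. pymysql):  cursor.executemany(replace, rows)
```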
Also included in this PR is improved logging that lets us differentiate between newly inserted and updated rows, plus a stub for removing covid_hosp datasets from the db (which may be helpful for doing patches in the future).
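One possible way to tell new rows from replaced ones (an assumption for illustration, not necessarily what this PR's logging does) relies on MySQL reporting 1 affected row for a brand-new row and 2 when an existing row is replaced:

```python
def count_new_vs_replaced(cursor, replace_sql, rows):
    """Run REPLACE INTO row by row and count how many rows were new vs. replaced."""
    inserted = replaced = 0
    for row in rows:
        cursor.execute(replace_sql, row)
        # MySQL reports 1 affected row when the row was brand new, and 2 (or more, with
        # multiple unique keys) when an existing row was deleted and re-inserted.
        if cursor.rowcount == 1:
            inserted += 1
        else:
            replaced += 1
    return inserted, replaced
```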