
use REPLACE INTO instead of INSERT INTO...UPDATE in covid_hosp acquisition #1356

Merged: melange396 merged 6 commits into dev from replace_instead_of_upsert on Dec 6, 2023

Conversation

melange396 (Collaborator)

The change from #1224 does not work on production, presumably because of the super-long INSERT ... ON DUPLICATE KEY UPDATE statement that it introduced. With it, this executemany() call hangs indefinitely and pegs the CPU (even with a batch size of 2). The covid_hosp tables have a large number of lengthy column names, and each name appears three times in that syntax (once in the column list and twice in the update clause), making for a ~20k character SQL statement (and that doesn't even count the literal values that get inserted). It stands to reason that we didn't see this problem in our tests because we don't use an obnoxiously long list of sample columns there (perhaps we should).

I have a hunch that this could be a driver issue, so updates there might've remedied the situation, but I didn't want to crawl down that rabbit hole. We can potentially explore this with #1355.

The fix uses MySQL's REPLACE INTO statement instead, which doesn't require enumerating column names. On a UNIQUE key collision, INSERT ... ON DUPLICATE KEY UPDATE updates only the columns named in its UPDATE clause, while REPLACE INTO deletes the colliding row and inserts the new one in its place.
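To make the size difference concrete, here is a minimal sketch in Python (hypothetical table `ht` and column names, nothing from the real covid_hosp schema) of the two statement shapes:

# Hypothetical short column names; imagine ~100 much longer ones instead.
columns = ["state", "date", "inpatient_beds", "icu_beds"]
col_list = ", ".join(f"`{c}`" for c in columns)
placeholders = ", ".join(["%s"] * len(columns))
update_clause = ", ".join(f"`{c}` = VALUES(`{c}`)" for c in columns)

# Every column name appears three times: once in the column list,
# twice more in the ON DUPLICATE KEY UPDATE clause.
upsert_sql = (
    f"INSERT INTO `ht` ({col_list}) VALUES ({placeholders}) "
    f"ON DUPLICATE KEY UPDATE {update_clause}"
)

# REPLACE INTO names no columns at all (values must cover every column, in table order).
replace_sql = f"REPLACE INTO `ht` VALUES ({placeholders})"

print(len(upsert_sql), "vs", len(replace_sql))

With many genuinely long column names, the upsert string balloons toward the ~20k characters described above, while the REPLACE string only grows with the number of placeholders.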

Also included in this PR is improved logging which lets us differentiate between newly inserted rows and rows that replaced an existing one, and it adds a stub for removing covid_hosp datasets from the db (which may be helpful for doing patches in the future).
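One way that split can be recovered (a sketch, not necessarily how this PR's logging does it): MySQL reports 1 affected row for a plain insert and 2 for a REPLACE that had to delete a colliding row first. With hypothetical connection parameters and table `ht`:

import mysql.connector  # mysql-connector-python

# Hypothetical credentials/table for illustration only.
cnx = mysql.connector.connect(user="user", password="pw", database="epidata")
cur = cnx.cursor()

rows = [("ak", "2023-12-06", 10, 2), ("al", "2023-12-06", 20, 5)]
cur.executemany("REPLACE INTO `ht` VALUES (%s, %s, %s, %s)", rows)

# Assumes rowcount aggregates affected rows across the batch: each collision
# contributes 2 (delete + insert), each brand-new row contributes 1.
num_replaced = cur.rowcount - len(rows)
num_new = len(rows) - num_replaced
print(f"new={num_new}, replaced={num_replaced}")

cnx.commit()
cur.close()
cnx.close()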

Comment on lines 166 to 180
if self.aggregate_key_cols:
    # TODO: restrict this to just UNIQUE columns from aggregate keys table?
    # create (empty) `some_temp_table` like `{self.table_name}_key`
    b = f"CREATE TABLE some_temp_table AS SELECT {self.aggregate_key_cols} FROM `{self.table_name}_key` WHERE FALSE"
    # save aggregate keys from what we are about to delete
    c = f"SELECT {self.aggregate_key_cols} INTO some_temp_table FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date} GROUP BY {self.aggregate_key_cols}"
    # TODO: combine two SQL queries above into one?
# delete from main data table where issue matches
d = f"DELETE FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date}"
if self.aggregate_key_cols:
    # delete from saved aggregate keys where the key still exists
    e = f"DELETE FROM some_temp_table JOIN `{self.table_name}` USING ({self.aggregate_key_cols})"
    # delete from aggregate key table anything left in saved keys (which should be aggregate keys that only existed in the issue we deleted)
    f = f"DELETE FROM `{self.table_name}_key` JOIN some_temp_table USING ({self.aggregate_key_cols})"
    g = "DROP TABLE some_temp_table"
Contributor

Suggested change (use self.AGGREGATE_KEY_COLS in place of self.aggregate_key_cols throughout):

if self.AGGREGATE_KEY_COLS:
    # TODO: restrict this to just UNIQUE columns from aggregate keys table?
    # create (empty) `some_temp_table` like `{self.table_name}_key`
    b = f"CREATE TABLE some_temp_table AS SELECT {self.AGGREGATE_KEY_COLS} FROM `{self.table_name}_key` WHERE FALSE"
    # save aggregate keys from what we are about to delete
    c = f"SELECT {self.AGGREGATE_KEY_COLS} INTO some_temp_table FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date} GROUP BY {self.AGGREGATE_KEY_COLS}"
    # TODO: combine two SQL queries above into one?
# delete from main data table where issue matches
d = f"DELETE FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date}"
if self.AGGREGATE_KEY_COLS:
    # delete from saved aggregate keys where the key still exists
    e = f"DELETE FROM some_temp_table JOIN `{self.table_name}` USING ({self.AGGREGATE_KEY_COLS})"
    # delete from aggregate key table anything left in saved keys (which should be aggregate keys that only existed in the issue we deleted)
    f = f"DELETE FROM `{self.table_name}_key` JOIN some_temp_table USING ({self.AGGREGATE_KEY_COLS})"
    g = "DROP TABLE some_temp_table"

melange396 (Collaborator Author)

Good suggestion, but this is just a work-in-progress for now... as you probably noticed, these are all just strings and the function doesn't actually accomplish anything. I'm going to remove the method and save it for later.

sonarcloud bot commented Dec 6, 2023

Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs
  • 0 Vulnerabilities
  • 0 Security Hotspots
  • 0 Code Smells
  • No coverage information
  • 0.0% duplication

dshemetov (Contributor) left a comment

No notes on the main code, I read through it line by line and it looks correct to me.

(An outdated review thread on src/acquisition/covid_hosp/common/database.py was resolved.)
melange396 merged commit 7a579c6 into dev on Dec 6, 2023; 8 checks passed.
melange396 deleted the replace_instead_of_upsert branch on December 6, 2023 at 19:47.
melange396 mentioned this pull request on Jan 18, 2024.
melange396 (Collaborator Author)

This indeed is/was a driver issue... It was corrected in mysql-connector-python (a.k.a. mysql.connector in the Python code) version 8.0.33; see the relevant text in the release notes, the filed bug, and its fix.

The prod and staging "automation" servers currently run mysql-connector-python version 8.0.21 from 2020 (and they almost certainly did when #1224 went live and broke). That version may have been put in place by ansible when it evaluated something like this or this for these machines, or it could even have been installed or updated manually by someone. New versions of our own code are kept up-to-date on these machines by our old friend https://github.com/cmu-delphi/github-deploy-repo (🧄), which does not appear to have any support for maintaining dependencies (much less their versions).

The much older version 2.2.9 of the mysql-connector package (still a.k.a. mysql.connector in code) is pinned in our development requirements. That version does not seem to be affected by the executemany(INSERT ... ON DUPLICATE KEY UPDATE) bug, which added to the difficulty of isolating the problematic behavior.
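Until the environments are reconciled, a cheap guard (just a sketch, with a naive version parse) could flag a driver that predates the 8.0.33 fix:

import mysql.connector

# mysql.connector.__version__ is a string like "8.0.21"; naive parse for this sketch.
installed = tuple(int(part) for part in mysql.connector.__version__.split(".")[:3])

if installed < (8, 0, 33):
    print(f"warning: mysql-connector-python {mysql.connector.__version__} predates "
          "the 8.0.33 fix; executemany(INSERT ... ON DUPLICATE KEY UPDATE) may hang")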

We should try to keep our production and development environments more in sync so that we don't run into inconsistent and confusing situations like this again. We have talked about this a number of times before for various reasons, but I don't think we ever came to a consensus. Some possibilities for addressing it include:

  • Run our production automation jobs out of our prebuilt docker containers instead of on "bare hardware". They would then have the same versions of requirements/dependencies as in dev. This might be "the right way to do things", but it sounds complicated and is probably a longer-term project.
  • Add a pip step like this to the end of any github-deploy-repo runs (see the sketch after this list). This sounds easy(ish?) to do, though there could be unforeseen consequences. It is also a temporary band-aid, since g-d-r is already in the crosshairs in "Move acquisition deployment off of github-deploy-repo" #1042.
  • Maybe get ansible to do it?
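For the second bullet, the "pip step" sketch might look like the following (hypothetical requirements path; g-d-r has no such hook today):

import subprocess
import sys

# Hypothetical location of the deployed requirements file.
REQUIREMENTS = "/path/to/deployed/requirements.txt"

# Re-resolve dependencies after each deploy so the installed driver
# (and everything else) tracks the pinned versions.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--upgrade", "-r", REQUIREMENTS],
    check=True,
)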

I will probably turn much of the text of this comment into a GH issue of its own, too. cc: @korlaxxalrok
