
use REPLACE INTO instead of INSERT INTO...UPDATE in covid_hosp acquisition #1356

Merged: melange396 merged 6 commits into dev from replace_instead_of_upsert on Dec 6, 2023

Conversation

melange396 (Collaborator)

The change from #1224 does not work on production, presumably because of the super-long INSERT ... ON DUPLICATE KEY UPDATE statement that it introduced. With it, this executemany() call hangs indefinitely and pegs the CPU (even with a batch size of 2). The covid_hosp tables have a large number of lengthy column names, and each name appears three times in that syntax (once in the column list and twice in the update clause), making for a ~20k character SQL statement (and that doesn't even count the literal values that get inserted). It stands to reason that we didn't see this problem in our tests because we don't use an obnoxiously long list of sample columns there (perhaps we should).

I have a hunch that this could be a driver issue, so updates there might've remedied the situation, but I didn't want to crawl down that rabbit hole. We can potentially explore this with #1355.

The fix uses MySQL's REPLACE INTO statement instead, which doesn't require enumerating column names. On a UNIQUE key collision, INSERT ... ON DUPLICATE KEY UPDATE updates only the columns named in its UPDATE clause, while REPLACE INTO deletes the colliding row and inserts the new one in its place.
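To make the size difference concrete, here is a minimal sketch in Python (hypothetical table `ht` and column names, nothing from the real covid_hosp schema) of the two statement shapes:

# Hypothetical short column names; imagine ~100 much longer ones instead.
columns = ["state", "date", "inpatient_beds", "icu_beds"]
col_list = ", ".join(f"`{c}`" for c in columns)
placeholders = ", ".join(["%s"] * len(columns))
update_clause = ", ".join(f"`{c}` = VALUES(`{c}`)" for c in columns)

# Every column name appears three times: once in the column list,
# twice more in the ON DUPLICATE KEY UPDATE clause.
upsert_sql = (
    f"INSERT INTO `ht` ({col_list}) VALUES ({placeholders}) "
    f"ON DUPLICATE KEY UPDATE {update_clause}"
)

# REPLACE INTO names no columns at all (values must cover every column, in table order).
replace_sql = f"REPLACE INTO `ht` VALUES ({placeholders})"

print(len(upsert_sql), "vs", len(replace_sql))

With many genuinely long column names, the upsert string balloons toward the ~20k characters described above, while the REPLACE string only grows with the number of placeholders.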

Also included in this PR is improved logging which lets us differentiate between newly inserted rows and rows that replaced an existing one, and it adds a stub for removing covid_hosp datasets from the db (which may be helpful for doing patches in the future).
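One way that split can be recovered (a sketch, not necessarily how this PR's logging does it): MySQL reports 1 affected row for a plain insert and 2 for a REPLACE that had to delete a colliding row first. With hypothetical connection parameters and table `ht`:

import mysql.connector  # mysql-connector-python

# Hypothetical credentials/table for illustration only.
cnx = mysql.connector.connect(user="user", password="pw", database="epidata")
cur = cnx.cursor()

rows = [("ak", "2023-12-06", 10, 2), ("al", "2023-12-06", 20, 5)]
cur.executemany("REPLACE INTO `ht` VALUES (%s, %s, %s, %s)", rows)

# Assumes rowcount aggregates affected rows across the batch: each collision
# contributes 2 (delete + insert), each brand-new row contributes 1.
num_replaced = cur.rowcount - len(rows)
num_new = len(rows) - num_replaced
print(f"new={num_new}, replaced={num_replaced}")

cnx.commit()
cur.close()
cnx.close()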

Comment on lines 166 to 180
if self.aggregate_key_cols:
    # TODO: restrict this to just UNIQUE columns from aggregate keys table?
    # create (empty) `some_temp_table` like `{self.table_name}_key`
    b = f"CREATE TABLE some_temp_table AS SELECT {self.aggregate_key_cols} FROM `{self.table_name}_key` WHERE FALSE"
    # save aggregate keys from what we are about to delete
    c = f"SELECT {self.aggregate_key_cols} INTO some_temp_table FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date} GROUP BY {self.aggregate_key_cols}"
    # TODO: combine two SQL queries above into one?
# delete from main data table where issue matches
d = f"DELETE FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date}"
if self.aggregate_key_cols:
    # delete from saved aggregate keys where the key still exists
    e = f"DELETE FROM some_temp_table JOIN `{self.table_name}` USING ({self.aggregate_key_cols})"
    # delete from aggregate key table anything left in saved keys (which should be aggregate keys that only existed in the issue we deleted)
    f = f"DELETE FROM `{self.table_name}_key` JOIN some_temp_table USING ({self.aggregate_key_cols})"
    g = "DROP TABLE some_temp_table"
Contributor

Suggested change (use self.AGGREGATE_KEY_COLS in place of self.aggregate_key_cols throughout):

if self.AGGREGATE_KEY_COLS:
    # TODO: restrict this to just UNIQUE columns from aggregate keys table?
    # create (empty) `some_temp_table` like `{self.table_name}_key`
    b = f"CREATE TABLE some_temp_table AS SELECT {self.AGGREGATE_KEY_COLS} FROM `{self.table_name}_key` WHERE FALSE"
    # save aggregate keys from what we are about to delete
    c = f"SELECT {self.AGGREGATE_KEY_COLS} INTO some_temp_table FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date} GROUP BY {self.AGGREGATE_KEY_COLS}"
    # TODO: combine two SQL queries above into one?
# delete from main data table where issue matches
d = f"DELETE FROM `{self.table_name}` WHERE `{self.publication_col_name}`={issue_date}"
if self.AGGREGATE_KEY_COLS:
    # delete from saved aggregate keys where the key still exists
    e = f"DELETE FROM some_temp_table JOIN `{self.table_name}` USING ({self.AGGREGATE_KEY_COLS})"
    # delete from aggregate key table anything left in saved keys (which should be aggregate keys that only existed in the issue we deleted)
    f = f"DELETE FROM `{self.table_name}_key` JOIN some_temp_table USING ({self.AGGREGATE_KEY_COLS})"
    g = "DROP TABLE some_temp_table"

melange396 (Collaborator Author)

Good suggestion, but this is just a work-in-progress for now... as you probably noticed, these are all just strings and the function doesn't actually accomplish anything. I'm going to remove the method and save it for later.

sonarcloud bot commented Dec 6, 2023

Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs
  • 0 Vulnerabilities
  • 0 Security Hotspots
  • 0 Code Smells
  • No coverage information
  • 0.0% duplication

dshemetov (Contributor) left a comment

No notes on the main code, I read through it line by line and it looks correct to me.

(An outdated review thread on src/acquisition/covid_hosp/common/database.py was resolved.)
melange396 merged commit 7a579c6 into dev on Dec 6, 2023; 8 checks passed.
melange396 deleted the replace_instead_of_upsert branch on December 6, 2023 at 19:47.
melange396 mentioned this pull request on Jan 18, 2024.
melange396 (Collaborator Author)

This indeed is/was a driver issue... It was corrected in mysql-connector-python (a.k.a. mysql.connector in the Python code) version 8.0.33; see the relevant text in the release notes, the filed bug, and its fix.

The prod and staging "automation" servers currently run mysql-connector-python version 8.0.21 from 2020 (and they almost certainly did when #1224 went live and broke). That version may have been put in place by ansible when it evaluated something like this or this for these machines, or it could even have been installed or updated manually by someone. New versions of our own code are kept up-to-date on these machines by our old friend https://github.com/cmu-delphi/github-deploy-repo (🧄), which does not appear to have any support for maintaining dependencies (much less their versions).

The much older version 2.2.9 of the mysql-connector package (still a.k.a. mysql.connector in code) is pinned in our development requirements. That version does not seem to be affected by the executemany(INSERT ... ON DUPLICATE KEY UPDATE) bug, which added to the difficulty of isolating the problematic behavior.
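Until the environments are reconciled, a cheap guard (just a sketch, with a naive version parse) could flag a driver that predates the 8.0.33 fix:

import mysql.connector

# mysql.connector.__version__ is a string like "8.0.21"; naive parse for this sketch.
installed = tuple(int(part) for part in mysql.connector.__version__.split(".")[:3])

if installed < (8, 0, 33):
    print(f"warning: mysql-connector-python {mysql.connector.__version__} predates "
          "the 8.0.33 fix; executemany(INSERT ... ON DUPLICATE KEY UPDATE) may hang")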

We should try to keep our production and development environments more in sync so that we don't run into inconsistent and confusing situations like this again. We have talked about this a number of times before for various reasons, but I don't think we ever came to a consensus. Some possibilities for addressing it include:

  • Run our production automation jobs out of our prebuilt docker containers instead of on "bare hardware". They would then have the same versions of requirements/dependencies as in dev. This might be "the right way to do things", but it sounds complicated and is probably a longer-term project.
  • Add a pip step like this to the end of any github-deploy-repo runs (see the sketch after this list). This sounds easy(ish?) to do, though there could be unforeseen consequences. It is also a temporary band-aid, since g-d-r is already in the crosshairs in "Move acquisition deployment off of github-deploy-repo" #1042.
  • Maybe get ansible to do it?
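For the second bullet, the "pip step" sketch might look like the following (hypothetical requirements path; g-d-r has no such hook today):

import subprocess
import sys

# Hypothetical location of the deployed requirements file.
REQUIREMENTS = "/path/to/deployed/requirements.txt"

# Re-resolve dependencies after each deploy so the installed driver
# (and everything else) tracks the pinned versions.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--upgrade", "-r", REQUIREMENTS],
    check=True,
)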

I will probably turn much of the text of this comment into a GH issue of its own, too. cc: @korlaxxalrok
