Skip duplicate upserts. #137

mmontagna · 2021-09-14T00:05:30Z

When using the overwrite mode, pgsync does a blind upsert which ends up creating duplicate/new row versions for ALL rows involved in the sync regardless of whether or not any of the values in those rows have changed.

This is largely fine/functionally correct, but it's not ideal as it places additional write/transaction load on the receiving database which isn't strictly needed.

It might make sense to just make this behavior the default, but I haven't fully thought that through.

Any thoughts on the idea?

ankane · 2021-09-16T01:13:04Z

Hey @mmontagna, thanks for the PR! It's an interesting idea. Have you tried benchmarking it to see how much impact it has on performance?

mmontagna · 2021-09-16T02:54:02Z

@ankane I haven't run anything too scientific. I am happy to run more benchmarks, just let me know what you want to see. I'll outline roughly what I've done so far below.

When doing a sync from local postgres to local postgres it drops the time from ~8 seconds to overwrite a ~30MB table with 130k rows (~50 columns and 7 indices) from ~8 to ~2 seconds.

When doing remote to local syncs it doesn't matter that much since the time is largely dominated by network transfer speeds. All my tests here have come out a wash but I expect for users with faster internet connections there will likely be an improvement.

I've run some tests/monitored the target machines when syncing 2GB+ tables and while it doesn't save very much, if any immediate resource cost. It does prevent accumulating vacuum workload when running frequent syncs, which is really the motivating factor here.

eg

marcomontagna@Marco-C02D603CMD6R pgsync % time bundle exec pgsync something_locations --overwrite --from postgres://localhost/dev --to postgres://localhost/other_dev --overwrite-only-changed
From: dev on localhost:5432
To: other_dev on localhost:5432
⠇ something_locations
⠦ something_locations
✔ something_locations - 1.6s
Completed in 2.0s
bundle exec pgsync something_locations --overwrite --from  --to    0.67s user 0.25s system 38% cpu 2.395 total
marcomontagna@Marco-C02D603CMD6R pgsync % time bundle exec pgsync something_locations --overwrite --from postgres://localhost/dev --to postgres://localhost/other_dev
From: dev on localhost:5432
To: other_dev on localhost:5432
⠇ something_locations
⠹ something_locations
✔ something_locations - 8.4s
Completed in 8.9s
bundle exec pgsync something_locations --overwrite --from  --to   2.86s user 0.50s system 36% cpu 9.315 total

ankane · 2021-09-16T06:42:43Z

Great, just tried it with pgbench locally and also see a significant improvement.

pgbench -i -s 10 pgsync_bench1
pgbench -i -s 10 pgsync_bench2
pgsync pgbench_accounts --from pgsync_bench1 --to pgsync_bench2 --overwrite

Latest version: ~21s
This PR: ~8s

This shouldn't change the result of the sync, so let's add it without an option.

mmontagna · 2021-09-16T16:54:36Z

Alright. I've updated the PR to add it without an option.

mmontagna · 2021-09-16T20:49:42Z

Don't merge this yet. There might be a breakage here around json columns

mmontagna · 2021-09-16T22:18:45Z

Pushed a commit to fix that case.

ankane · 2021-09-20T19:57:43Z

Good catch. I think it'll be an issue with any column types that don't support the equality operator (including custom types). I also think there may be an issue with NULL values with the current query.

Due to this, I'm not sure it's worth the extra complexity to support.

mmontagna · 2021-09-20T23:11:34Z

Fair enough. I'm going to keep fiddling with this since it has the potential to vastly improve the performance/speed of some of our use-cases, but I won't expect you to merge anything upstream.

You're right about the null issue, that won't effect correctness, I think, but does mean this is writing more than it needs to.

ankane · 2021-09-21T20:14:18Z

Sounds good. fwiw, I think the null issue will affect correctness/cause it to sync less rows than desired, so you'll probably want to test different combinations of null in the source and destination.

Add option to skip duplicate upserts.

1279a5e

mmontagna added 2 commits September 16, 2021 09:49

Remove option.

72e88eb

remove debug stmt.

bf2e613

mmontagna changed the title ~~Add option to skip duplicate upserts.~~ Skip duplicate upserts. Sep 16, 2021

Fix json column comparison.

918d8e5

Fix null case

32b3773

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Skip duplicate upserts. #137

Skip duplicate upserts. #137

mmontagna commented Sep 14, 2021

ankane commented Sep 16, 2021

mmontagna commented Sep 16, 2021 •

edited

ankane commented Sep 16, 2021

mmontagna commented Sep 16, 2021

mmontagna commented Sep 16, 2021

mmontagna commented Sep 16, 2021

ankane commented Sep 20, 2021

mmontagna commented Sep 20, 2021

ankane commented Sep 21, 2021

Skip duplicate upserts. #137

Are you sure you want to change the base?

Skip duplicate upserts. #137

Conversation

mmontagna commented Sep 14, 2021

ankane commented Sep 16, 2021

mmontagna commented Sep 16, 2021 • edited

ankane commented Sep 16, 2021

mmontagna commented Sep 16, 2021

mmontagna commented Sep 16, 2021

mmontagna commented Sep 16, 2021

ankane commented Sep 20, 2021

mmontagna commented Sep 20, 2021

ankane commented Sep 21, 2021

mmontagna commented Sep 16, 2021 •

edited