Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip duplicate upserts. #137

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

mmontagna
Copy link

When using the overwrite mode, pgsync does a blind upsert which ends up creating duplicate/new row versions for ALL rows involved in the sync regardless of whether or not any of the values in those rows have changed.

This is largely fine/functionally correct, but it's not ideal as it places additional write/transaction load on the receiving database which isn't strictly needed.

It might make sense to just make this behavior the default, but I haven't fully thought that through.

Any thoughts on the idea?

@ankane
Copy link
Owner

ankane commented Sep 16, 2021

Hey @mmontagna, thanks for the PR! It's an interesting idea. Have you tried benchmarking it to see how much impact it has on performance?

@mmontagna
Copy link
Author

mmontagna commented Sep 16, 2021

@ankane I haven't run anything too scientific. I am happy to run more benchmarks, just let me know what you want to see. I'll outline roughly what I've done so far below.

When doing a sync from local postgres to local postgres it drops the time from ~8 seconds to overwrite a ~30MB table with 130k rows (~50 columns and 7 indices) from ~8 to ~2 seconds.

When doing remote to local syncs it doesn't matter that much since the time is largely dominated by network transfer speeds. All my tests here have come out a wash but I expect for users with faster internet connections there will likely be an improvement.

I've run some tests/monitored the target machines when syncing 2GB+ tables and while it doesn't save very much, if any immediate resource cost. It does prevent accumulating vacuum workload when running frequent syncs, which is really the motivating factor here.

eg

marcomontagna@Marco-C02D603CMD6R pgsync % time bundle exec pgsync something_locations --overwrite --from postgres://localhost/dev --to postgres://localhost/other_dev --overwrite-only-changed
From: dev on localhost:5432
To: other_dev on localhost:5432
⠇ something_locations
⠦ something_locations
✔ something_locations - 1.6s
Completed in 2.0s
bundle exec pgsync something_locations --overwrite --from  --to    0.67s user 0.25s system 38% cpu 2.395 total
marcomontagna@Marco-C02D603CMD6R pgsync % time bundle exec pgsync something_locations --overwrite --from postgres://localhost/dev --to postgres://localhost/other_dev
From: dev on localhost:5432
To: other_dev on localhost:5432
⠇ something_locations
⠹ something_locations
✔ something_locations - 8.4s
Completed in 8.9s
bundle exec pgsync something_locations --overwrite --from  --to   2.86s user 0.50s system 36% cpu 9.315 total

@ankane
Copy link
Owner

ankane commented Sep 16, 2021

Great, just tried it with pgbench locally and also see a significant improvement.

pgbench -i -s 10 pgsync_bench1
pgbench -i -s 10 pgsync_bench2
pgsync pgbench_accounts --from pgsync_bench1 --to pgsync_bench2 --overwrite

Latest version: ~21s
This PR: ~8s

This shouldn't change the result of the sync, so let's add it without an option.

@mmontagna mmontagna changed the title Add option to skip duplicate upserts. Skip duplicate upserts. Sep 16, 2021
@mmontagna
Copy link
Author

Alright. I've updated the PR to add it without an option.

@mmontagna
Copy link
Author

Don't merge this yet. There might be a breakage here around json columns

@mmontagna
Copy link
Author

Pushed a commit to fix that case.

@ankane
Copy link
Owner

ankane commented Sep 20, 2021

Good catch. I think it'll be an issue with any column types that don't support the equality operator (including custom types). I also think there may be an issue with NULL values with the current query.

Due to this, I'm not sure it's worth the extra complexity to support.

@mmontagna
Copy link
Author

Fair enough. I'm going to keep fiddling with this since it has the potential to vastly improve the performance/speed of some of our use-cases, but I won't expect you to merge anything upstream.

You're right about the null issue, that won't effect correctness, I think, but does mean this is writing more than it needs to.

@ankane
Copy link
Owner

ankane commented Sep 21, 2021

Sounds good. fwiw, I think the null issue will affect correctness/cause it to sync less rows than desired, so you'll probably want to test different combinations of null in the source and destination.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants