Dispatch by shared memory #339
Conversation
addr = dsm_segment_address(seg);
memcpy(serializedPlantree, addr, serializedPlantreelen);
memcpy(serializedQueryDispatchDesc, addr + serializedPlantreelen, serializedQueryDispatchDesclen);
Can we avoid this memcpy?
We can, but then we need to take all the logic related to entry->reference-- out of ReadSharedQueryPlan and make it a separate function, called after serializedPlantree and serializedQueryDispatchDesc are deserialized in exec_mpp_query. Do you think this is OK?
It requires another round of hash_search to find the entry and decrease the reference after the deserialization of serializedPlantree and serializedQueryDesc. I think two memcpy calls are still more efficient.
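For concreteness, a rough sketch of the two options being weighed; ReleaseSharedQueryPlan and key are hypothetical names, the rest is taken from the snippet under review:

```c
/* Option A (current patch): copy out of the DSM inside ReadSharedQueryPlan,
 * so entry->reference can be decremented immediately and no caller ever
 * holds a pointer into the segment. */
addr = dsm_segment_address(seg);
memcpy(serializedPlantree, addr, serializedPlantreelen);
memcpy(serializedQueryDispatchDesc, addr + serializedPlantreelen,
       serializedQueryDispatchDesclen);
entry->reference--;

/* Option B (no memcpy): hand back pointers into the mapped segment and defer
 * the decrement until exec_mpp_query has deserialized both buffers, which
 * costs a second hash_search to find the entry again. */
serializedPlantree = addr;
serializedQueryDispatchDesc = addr + serializedPlantreelen;
/* ... later, after deserialization in exec_mpp_query ... */
ReleaseSharedQueryPlan(key);    /* hypothetical helper: hash_search + entry->reference-- */
```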
Yes, I worry about performance degradation.
can reuse
Agree, we have a long way to go, but this PR is a good start! Thanks for your work and rich description.
The original implementation was to use … also, I can never figure out why they use an array to store slots discriminated by …
I found they have the same logic except for lifetime.
This PR makes serializedPlantree and serializedQueryDispatchDesc dispatched by shared memory instead of the interconnect, so that they are sent only once to the writer QE and then synced between reader QEs and the writer QE on a segment through DSM. It has been discussed in https://github.com/orgs/cloudberrydb/discussions/243.

Implementation Outline
Existing synchronization facilities (barrier, SharedLatch) cannot satisfy our requirement. The reference count is calculated on the writer QE, and the last reader QE to read the plan reclaims the DSM; this is the same approach as the parallel-insert path (GpInsertParallelDSMHash).
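For illustration, a minimal sketch of that reference-counting scheme, assuming a simplified entry layout (the real structure and locking in sharedqueryplan.c may differ):

```c
#include "postgres.h"
#include "storage/dsm.h"

/* Simplified entry in a shared hash table, one per dispatched plan. */
typedef struct SharedQueryPlanEntry
{
    uint32      key;        /* hypothetical key, e.g. derived from the command */
    dsm_handle  handle;     /* DSM segment holding serializedPlantree + ddesc */
    int         reference;  /* readers still expected; computed on the writer QE */
} SharedQueryPlanEntry;

/*
 * Reader side: attach, consume the plan, and let the last reader reclaim the
 * segment.  Locking around the entry is omitted for brevity.
 */
static void
read_shared_plan(SharedQueryPlanEntry *entry)
{
    dsm_segment *seg = dsm_attach(entry->handle);
    char        *addr = (char *) dsm_segment_address(seg);

    /* ... copy/deserialize serializedPlantree and serializedQueryDispatchDesc ... */
    (void) addr;

    if (--entry->reference == 0)
        dsm_unpin_segment(entry->handle);   /* last reader lets the DSM go away */

    dsm_detach(seg);
}
```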
Prerequisites
This feature is only enabled when: 1) the current query is not an extended query (cursors, Bind messages, etc.), and 2) there exists a gang in which all QEs are writer QEs (notably, writer QE and writer gang are two different concepts).
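As a sketch only, the two prerequisites could be restated as a check like the following; every name here is hypothetical:

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a gang: just whether each QE is a writer. */
typedef struct GangView
{
    int   num_qes;
    bool *qe_is_writer;
} GangView;

/* Restates the two prerequisites above; the function and types are made up. */
static bool
can_dispatch_plan_by_shmem(bool is_extended_query, const GangView *gangs, int ngangs)
{
    /* 1) extended queries (cursors, Bind messages, ...) never qualify. */
    if (is_extended_query)
        return false;

    /* 2) at least one gang must consist entirely of writer QEs. */
    for (int i = 0; i < ngangs; i++)
    {
        bool all_writers = true;

        for (int j = 0; j < gangs[i].num_qes; j++)
            if (!gangs[i].qe_is_writer[j])
                all_writers = false;

        if (all_writers)
            return true;
    }
    return false;
}
```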
Prerequisite 1

Prerequisite 1 is a hard limit due to the way extended queries (equery for short) work. During an equery, there is always a live writer gang in which every QE is a writer (gang W). First, a command set gp_write_shared_snapshot=true is dispatched to gang W to force a shared snapshot sync; then the actual gang is created in which every QE is a reader (gang R), and the actual query is dispatched to it. It follows that when the actual query is dispatched, no writer QE receives the plan (because all writer QEs are in gang W), so no one is responsible for shared query plan synchronization.
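Illustrative pseudo-C of that flow; the dispatch helpers and gang variables are hypothetical, only the SET command comes from the description above:

```c
/* Step 1: force a shared snapshot sync on the existing all-writer gang W. */
dispatch_command(gang_W, "set gp_write_shared_snapshot=true");

/* Step 2: create an all-reader gang R and dispatch the actual query to it.
 * No writer QE ever receives the plan, so nothing can own it in DSM. */
gang_R = create_reader_gang();
dispatch_query(gang_R, full_query_text);
```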
Prerequisite 2

Prerequisite 2 is a tradeoff. Consider the following query plan:

In this plan, seg0 in slice1 is a writer that should receive the full query text, but in slice2, seg0 is a reader that should receive slimQueryText (a slimQueryText is a query text without the query plan and ddesc), while seg1 and seg2 are writer QEs. This means that when dispatching to gang2, seg0 should receive the slim query text (because the full plan can be synced to it from seg0 in slice1, which is a writer), but seg1 and seg2 should receive the full query text. This poses a challenge because on the QD side, the current cdb dispatcher interface performs all dispatches on a per-gang basis (cdbdisp_dispatchToGang), and the plan cannot vary from segment to segment within a gang. For the same reason, the reference count of a DSM segment cannot be dispatched from the QD directly, because it may differ from QE to QE even within the same gang (consider a plan that has a singleton reader). We could surely work around this with a more thorough refactor of the cdb dispatcher interfaces, but I don't think that's worth it, though this is certainly debatable.
Other Caveats
Updatable Views
It may seem that the following invariant holds for any given query:
This is indeed true for many common queries, but unfortunately not all of them. Below is a counterexample:
InitPlan

If there's an InitPlan at the root of a plan, two sets of writer gangs could be created and two rounds of dispatching could happen for the same query.
This is why we limit the reference calculation to the "same root" (https://github.com/Ray-Eldath/cloudberrydb/blob/dispatch-by-shmem/organized-and-unlogged/src/backend/utils/time/sharedqueryplan.c#L119-L121). Note that an InitPlan doesn't necessarily have to be at the root; it could be deep down the plan tree as well.
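A hypothetical sketch of what the "same root" restriction means for the count (the real logic is in sharedqueryplan.c at the link above; the types and fields here are made up):

```c
/*
 * Only slices that hang off the given root contribute to the reference count
 * for one shared plan, so a root-level InitPlan (which triggers its own round
 * of dispatch) does not inflate the count.
 */
typedef struct SliceRefInfo
{
    int root_index;       /* which dispatch root this slice belongs to */
    int num_reader_qes;   /* reader QEs in this slice on one segment */
} SliceRefInfo;

static int
count_references_for_root(const SliceRefInfo *slices, int nslices, int root_index)
{
    int refs = 0;

    for (int i = 0; i < nslices; i++)
    {
        if (slices[i].root_index != root_index)
            continue;                   /* different root: a separate dispatch round */
        refs += slices[i].num_reader_qes;
    }
    return refs;
}
```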
doesn't necessarily have to be at the root. It could be deep down the plan tree as well:Possible Outcome
All in all, though many requirements need to be meet for this feature to take effects, it is still very much turned on in most common queries (see tests for example). This is good news. On the bad side, I doubt whether this PR can make any noticeable performance improvements at all. On qd, we cannot completely get rid of libpq connections for now, and query dispatch is already pipelined to hide interconnect cost anyway. In the long run, if we are to "decentralize" QE by reassigning tasks (such as creating reader QEs, keepalive, etc.) to writer QE, this feature is a very good pathfinder and also a mandatory requisite. But if that's not the case, I doubt whether this feature alone worths the risk.