Dispatch by shared memory #339

Open · Ray-Eldath wants to merge 1 commit into main from dispatch-by-shmem/organized-and-unlogged
Conversation


@Ray-Eldath (Contributor) commented Dec 18, 2023

This PR makes serializedPlantree and serializedQueryDispatchDesc dispatched by shared memory instead of the interconnect, so that they can be sent only once to the writer QE and then synced between the reader QEs and the writer QE on a segment through DSM. It has been discussed in https://github.com/orgs/cloudberrydb/discussions/243.

Implementation Outline

  • This PR mainly uses polling on the reader QE to wait for the shared plan to be dispatched from the writer QE. This is the same mechanism as shared snapshot synchronization. To avoid polling we would need something special (e.g., a signal), because the current synchronization mechanisms (Barrier, SharedLatch) don't seem to satisfy this requirement. A rough sketch of the flow follows this list.
  • To properly reclaim the DSM segment of a query, a reference count is calculated on the writer QE, and the last reader QE that reads the plan reclaims the DSM. This is the same approach as parallel (GpInsertParallelDSMHash).
  • To isolate state between user connections, a shmem HTAB is used, also the same as parallel.
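For concreteness, here is a minimal sketch of that flow. All names below (SharedPlanEntry, SharedPlanHash, SharedPlanLock, and the two functions) are hypothetical and only illustrate the idea; the real code lives in sharedqueryplan.c.

/* hypothetical shared state, created at shmem-init time */
static HTAB	   *SharedPlanHash;
static LWLock  *SharedPlanLock;

typedef struct SharedPlanEntry
{
	int			session_id;		/* hash key: one entry per user connection */
	int			command_count;	/* which query of the session this plan belongs to */
	dsm_handle	handle;			/* DSM segment holding the serialized plan */
	int			reference;		/* reader QEs that still have to consume it */
} SharedPlanEntry;

/* Writer QE: publish the plan once and record how many readers will need it. */
static void
PublishSharedQueryPlan(dsm_segment *seg, int nreaders)
{
	SharedPlanEntry *entry;
	bool		found;

	LWLockAcquire(SharedPlanLock, LW_EXCLUSIVE);
	entry = hash_search(SharedPlanHash, &gp_session_id, HASH_ENTER, &found);
	entry->command_count = gp_command_count;
	entry->handle = dsm_segment_handle(seg);
	entry->reference = nreaders;
	LWLockRelease(SharedPlanLock);
}

/*
 * Reader QE: poll until the writer has published the plan for this query,
 * then take a reference; the last reader is responsible for reclaiming the DSM.
 */
static dsm_handle
WaitForSharedQueryPlan(bool *last_reader)
{
	for (;;)
	{
		SharedPlanEntry *entry;
		dsm_handle	handle = DSM_HANDLE_INVALID;

		LWLockAcquire(SharedPlanLock, LW_EXCLUSIVE);
		entry = hash_search(SharedPlanHash, &gp_session_id, HASH_FIND, NULL);
		if (entry != NULL && entry->command_count == gp_command_count)
		{
			handle = entry->handle;
			*last_reader = (--entry->reference == 0);
		}
		LWLockRelease(SharedPlanLock);

		if (handle != DSM_HANDLE_INVALID)
			return handle;

		CHECK_FOR_INTERRUPTS();
		pg_usleep(1000L);		/* same poll-and-sleep style as shared snapshot sync */
	}
}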

Prerequisites

This feature is only enabled when: 1) the current query is not an extended query (cursors, Bind messages, etc.), and 2) there exists a gang in which all QEs are writer QEs (note that writer QE and writer gang are two different concepts).
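As a rough illustration only (the helper name is hypothetical and the PR's actual checks are more involved), the gating condition looks roughly like this, where the writer-QE information comes from the gangs the QD has already allocated:

static bool
CanDispatchPlanByShmem(List *allocatedGangs, bool is_extended_query)
{
	ListCell   *lc;

	/* Prerequisite 1: extended-query protocol (cursors, Bind messages, ...) is out */
	if (is_extended_query)
		return false;

	/* Prerequisite 2: some gang must consist entirely of writer QEs */
	foreach(lc, allocatedGangs)
	{
		Gang	   *gang = (Gang *) lfirst(lc);
		int			i;

		for (i = 0; i < gang->size; i++)
		{
			if (!gang->db_descriptors[i]->isWriter)
				break;
		}
		if (i == gang->size)
			return true;		/* every QE in this gang is a writer */
	}

	return false;
}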

Prerequisite 1

Prerequisite 1 is a hard limit due to the way extended queries (equery for short) work. During an equery there is always a live writer gang in which every QE is a writer (gang W). First, a command set gp_write_shared_snapshot=true is dispatched to gang W to force a shared snapshot sync; then the actual gang is created, in which every QE is a reader (gang R), and the actual query is dispatched to it. It follows immediately that when the actual query is dispatched, no writer QE receives the plan (because all writer QEs are in gang W), so no one is responsible for shared query plan synchronization.

Prerequisite 2

Prerequisite 2 is a tradeoff. Consider the following query plan:

[figure: example query plan with two slices; seg0 is a writer QE in slice1 but a reader QE in slice2, while seg1 and seg2 are writer QEs in slice2]

In this plan, seg0 in slice1 is a writer and should receive the full query text, but in slice2, seg0 is a reader that should receive the slimQueryText (a slimQueryText is a query text without the query plan and ddesc), while seg1 and seg2 are writer QEs. This means that when dispatching to the gang for slice2, seg0 should receive the slim query text (because the full plan can be synced to it from seg0 in slice1, which is a writer), but seg1 and seg2 should receive the full query text. This poses challenges because on the QD side, the current cdb dispatcher interface performs all dispatches on a per-gang basis (cdbdisp_dispatchToGang), and the plan cannot vary from segment to segment within a gang. For the same reason, the reference count of a DSM segment cannot be dispatched directly from the QD, because it may differ from QE to QE even within the same gang (consider a plan that has a singleton reader).

We could work around this with a more thorough refactor of the cdb dispatcher interfaces, but I don't think that's worth it, though this is certainly debatable.

Other Caveats

Updatable Views

It may seem that the following invariant holds for any given query:

For any segment that a plan touches, there is always a writer QE on that segment.

This is indeed true for many common queries, but unfortunately not all of them. Below is a counterexample:

[figure: counterexample query plan involving an updatable view, where a segment touched by the plan has no writer QE]

InitPlan

If there's an InitPlan at the root of a plan, two writer gangs could be created and two rounds of dispatching could happen for the same query:

[figure: query plan with an InitPlan at the root, producing two writer gangs and two rounds of dispatching]

This is why we limit the reference calculation to the "same root" (https://github.com/Ray-Eldath/cloudberrydb/blob/dispatch-by-shmem/organized-and-unlogged/src/backend/utils/time/sharedqueryplan.c#L119-121); a rough sketch of this restriction follows the figure below. Note that an InitPlan doesn't necessarily have to be at the root. It can be deep down the plan tree as well:

[figure: query plan with an InitPlan nested deep inside the plan tree]
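As a hedged illustration of that restriction (hypothetical code, not the actual sharedqueryplan.c logic): when counting how many slices will place a QE on a given segment, slices that hang off a different plan root (such as an InitPlan dispatched in its own round) are simply skipped. The real reference count additionally accounts for which of those QEs is the writer.

static int
CountSameRootSlicesOnSegment(SliceTable *sliceTable, int rootIndex, int segindex)
{
	int			count = 0;

	for (int i = 0; i < sliceTable->numSlices; i++)
	{
		ExecSlice  *slice = &sliceTable->slices[i];

		/*
		 * An InitPlan is dispatched in its own round with its own writer
		 * gang, so slices under a different plan root are not counted.
		 */
		if (slice->rootIndex != rootIndex)
			continue;

		/* count this slice if it places a QE on the given segment */
		if (list_member_int(slice->segments, segindex))
			count++;
	}

	return count;
}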

Possible Outcome

All in all, although many requirements must be met for this feature to take effect, it is still enabled for most common queries (see the tests for examples). That is the good news. On the bad side, I doubt whether this PR makes any noticeable performance improvement at all. On the QD we cannot completely get rid of libpq connections for now, and query dispatch is already pipelined to hide the interconnect cost anyway. In the long run, if we are to "decentralize" QEs by reassigning tasks (such as creating reader QEs, keepalive, etc.) to the writer QE, this feature is a very good pathfinder and also a mandatory prerequisite. But if that's not the case, I doubt whether this feature alone is worth the risk.

@Ray-Eldath marked this pull request as draft on December 18, 2023 04:32
@Ray-Eldath changed the title from Dispatch by shmem/organized and unlogged to Dispatch by shared memory on Dec 18, 2023
@Ray-Eldath force-pushed the dispatch-by-shmem/organized-and-unlogged branch from b95dc51 to 9f7d1db on December 18, 2023 07:37
@Ray-Eldath marked this pull request as ready for review on December 18, 2023 07:37

addr = dsm_segment_address(seg);
memcpy(serializedPlantree, addr, serializedPlantreelen);
memcpy(serializedQueryDispatchDesc, addr + serializedPlantreelen, serializedQueryDispatchDesclen);
Collaborator:

Can we avoid this memcpy?

Contributor (author):

We can, but then we need to take all the logic related to entry->reference-- out of ReadSharedQueryPlan and make it a separate function, to be called after serializedPlantree and serializedQueryDispatchDesc are deserialized in exec_mpp_query. Do you think that's OK?

Contributor (author):

It requires another round of hash_search to find the entry and decrease the reference count after serializedPlantree and serializedQueryDesc are deserialized. I think the two memcpys may still be more efficient.
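For illustration, a rough sketch of the zero-copy variant being discussed here (ReleaseSharedQueryPlan and the exact deserialization calls are hypothetical): the reader deserializes straight out of the DSM mapping and only afterwards comes back to drop its reference, trading the two memcpys for a second hash_search.

/* roughly, in exec_mpp_query(): deserialize directly from the DSM mapping */
addr = dsm_segment_address(seg);
plantree = (PlannedStmt *) deserializeNode(addr, serializedPlantreelen);
ddesc = (QueryDispatchDesc *) deserializeNode(addr + serializedPlantreelen,
											  serializedQueryDispatchDesclen);

/*
 * Only now drop the reference: a second hash_search finds the entry,
 * decrements entry->reference, and the last reader detaches/destroys the DSM.
 */
ReleaseSharedQueryPlan(seg);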

Collaborator:

Yes, I worry about performance degradation.

@yjhjstz (Collaborator) commented Dec 18, 2023

else if (Gp_role == GP_ROLE_EXECUTE)
	{
		if (Gp_is_writer)
		{
			addSharedSnapshot("Writer qExec", gp_session_id);
		}
		else
		{
			/*
			 * NOTE: This assumes that the Slot has already been
			 *       allocated by the writer.  Need to make sure we
			 *       always allocate the writer qExec first.
			 */			 			
			lookupSharedSnapshot("Reader qExec", "Writer qExec", gp_session_id);
		}
	}

Can we reuse the SharedSnapshot logic and code?

@avamingli (Collaborator):

On the bad side, I doubt whether this PR makes any noticeable performance improvement at all. On the QD we cannot completely get rid of libpq connections for now, and query dispatch is already pipelined to hide the interconnect cost anyway. In the long run, if we are to "decentralize" QEs by reassigning tasks (such as creating reader QEs, keepalive, etc.) to the writer QE, this feature is a very good pathfinder and also a mandatory prerequisite. But if that's not the case, I doubt whether this feature alone is worth the risk.

Agree, we have a long way to go, but this PR is a good start! Thanks for your work and rich description.

@Ray-Eldath force-pushed the dispatch-by-shmem/organized-and-unlogged branch 2 times, most recently from 439a8c0 to ef1970b on December 27, 2023 09:57
@Ray-Eldath (author) commented Dec 27, 2023

Can we reuse the SharedSnapshot logic and code?

The original implementation did use SharedSnapshot, but I quickly gave it up. SharedSnapshot serves a totally different purpose and its lifetime is tied to a transaction, not a query. When the dtx context is AUTO_COMMIT_IMPLICIT, the transaction lifetime equals the query lifetime, but when you explicitly use a transaction with a BEGIN statement etc., they differ. We could of course make this work, but I didn't go down that path. I can try again if you think that's better.

Also, I could never figure out why they use an array to store slots discriminated by gp_session_id. Using a shmem HTAB, as this PR and parallel do, leads to a far simpler implementation (see the sketch below). Do you have a theory on this? @yjhjstz
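For comparison, the HTAB flavor is roughly the following (hypothetical names, mirroring the sketch earlier in the description): entries are keyed by gp_session_id directly instead of scanning a fixed slot array.

static HTAB *SharedPlanHash;

void
SharedQueryPlanShmemInit(void)
{
	HASHCTL		info;

	/* key is the session id; one entry per user connection */
	info.keysize = sizeof(int);
	info.entrysize = sizeof(SharedPlanEntry);

	SharedPlanHash = ShmemInitHash("Shared Query Plan Hash",
								   MaxBackends, MaxBackends,
								   &info,
								   HASH_ELEM | HASH_BLOBS);
}

/* a lookup is then a single hash_search(SharedPlanHash, &gp_session_id, ...) */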

@yjhjstz (Collaborator) commented Dec 27, 2023


I found they have the same logic, except for the lifetime.

@Ray-Eldath force-pushed the dispatch-by-shmem/organized-and-unlogged branch 3 times, most recently from 4ab666b to c86a28f on January 2, 2024 08:48
This PR makes serializedPlantree and serializedQueryDispatchDesc
dispatched by shared memory instead of interconnect, so that
they can be sent only once to the writer QE, and synced between
reader QEs and the writer QE on a segment through DSM.
@Ray-Eldath force-pushed the dispatch-by-shmem/organized-and-unlogged branch from c86a28f to 18cace7 on January 3, 2024 03:09