Dispatch by shared memory #339

Open · Ray-Eldath wants to merge 1 commit into main from dispatch-by-shmem/organized-and-unlogged
Conversation


@Ray-Eldath (Contributor) commented Dec 18, 2023

This PR makes serializedPlantree and serializedQueryDispatchDesc dispatched by shared memory instead of the interconnect, so that they can be sent only once to the writer QE and then synced between the reader QEs and the writer QE on a segment through DSM. It has been discussed in https://github.com/orgs/cloudberrydb/discussions/243.

Implementation Outline

  • This PR mainly uses polling on the reader QE to wait for the shared plan to be dispatched from the writer QE. This is the same mechanism as shared snapshot synchronization. To avoid polling we would need something special (e.g., a signal), because the current synchronization mechanisms (Barrier, SharedLatch) don't seem to satisfy this requirement. A rough sketch of the flow follows this list.
  • To properly reclaim the DSM segment of a query, a reference count is calculated on the writer QE, and the last reader QE that reads the plan reclaims the DSM. This is the same approach as parallel (GpInsertParallelDSMHash).
  • To isolate state between user connections, a shmem HTAB is used, also the same as parallel.
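For concreteness, here is a minimal sketch of that flow. All names below (SharedPlanEntry, SharedPlanHash, SharedPlanLock, and the two functions) are hypothetical and only illustrate the idea; the real code lives in sharedqueryplan.c.

/* hypothetical shared state, created at shmem-init time */
static HTAB	   *SharedPlanHash;
static LWLock  *SharedPlanLock;

typedef struct SharedPlanEntry
{
	int			session_id;		/* hash key: one entry per user connection */
	int			command_count;	/* which query of the session this plan belongs to */
	dsm_handle	handle;			/* DSM segment holding the serialized plan */
	int			reference;		/* reader QEs that still have to consume it */
} SharedPlanEntry;

/* Writer QE: publish the plan once and record how many readers will need it. */
static void
PublishSharedQueryPlan(dsm_segment *seg, int nreaders)
{
	SharedPlanEntry *entry;
	bool		found;

	LWLockAcquire(SharedPlanLock, LW_EXCLUSIVE);
	entry = hash_search(SharedPlanHash, &gp_session_id, HASH_ENTER, &found);
	entry->command_count = gp_command_count;
	entry->handle = dsm_segment_handle(seg);
	entry->reference = nreaders;
	LWLockRelease(SharedPlanLock);
}

/*
 * Reader QE: poll until the writer has published the plan for this query,
 * then take a reference; the last reader is responsible for reclaiming the DSM.
 */
static dsm_handle
WaitForSharedQueryPlan(bool *last_reader)
{
	for (;;)
	{
		SharedPlanEntry *entry;
		dsm_handle	handle = DSM_HANDLE_INVALID;

		LWLockAcquire(SharedPlanLock, LW_EXCLUSIVE);
		entry = hash_search(SharedPlanHash, &gp_session_id, HASH_FIND, NULL);
		if (entry != NULL && entry->command_count == gp_command_count)
		{
			handle = entry->handle;
			*last_reader = (--entry->reference == 0);
		}
		LWLockRelease(SharedPlanLock);

		if (handle != DSM_HANDLE_INVALID)
			return handle;

		CHECK_FOR_INTERRUPTS();
		pg_usleep(1000L);		/* same poll-and-sleep style as shared snapshot sync */
	}
}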

Prerequisites

This feature is only enabled when: 1) the current query is not an extended query (cursors, Bind messages, etc.), and 2) there exists a gang in which all QEs are writer QEs (note that writer QE and writer gang are two different concepts).
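As a rough illustration only (the helper name is hypothetical and the PR's actual checks are more involved), the gating condition looks roughly like this, where the writer-QE information comes from the gangs the QD has already allocated:

static bool
CanDispatchPlanByShmem(List *allocatedGangs, bool is_extended_query)
{
	ListCell   *lc;

	/* Prerequisite 1: extended-query protocol (cursors, Bind messages, ...) is out */
	if (is_extended_query)
		return false;

	/* Prerequisite 2: some gang must consist entirely of writer QEs */
	foreach(lc, allocatedGangs)
	{
		Gang	   *gang = (Gang *) lfirst(lc);
		int			i;

		for (i = 0; i < gang->size; i++)
		{
			if (!gang->db_descriptors[i]->isWriter)
				break;
		}
		if (i == gang->size)
			return true;		/* every QE in this gang is a writer */
	}

	return false;
}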

Prerequisite 1

Prerequisite 1 is a hard limit due to the way extended queries (equery for short) work. During an equery there is always a live writer gang in which every QE is a writer (gang W). First, a command set gp_write_shared_snapshot=true is dispatched to gang W to force a shared snapshot sync; then the actual gang is created, in which every QE is a reader (gang R), and the actual query is dispatched to it. It follows immediately that when the actual query is dispatched, no writer QE receives the plan (because all writer QEs are in gang W), so no one is responsible for shared query plan synchronization.

Prerequisite 2

Prerequisite 2 is a tradeoff. Consider the following query plan:

[figure: example query plan with two slices; seg0 is a writer QE in slice1 but a reader QE in slice2, while seg1 and seg2 are writer QEs in slice2]

In this plan, seg0 in slice1 is a writer and should receive the full query text, but in slice2, seg0 is a reader that should receive the slimQueryText (a slimQueryText is a query text without the query plan and ddesc), while seg1 and seg2 are writer QEs. This means that when dispatching to the gang for slice2, seg0 should receive the slim query text (because the full plan can be synced to it from seg0 in slice1, which is a writer), but seg1 and seg2 should receive the full query text. This poses challenges because on the QD side, the current cdb dispatcher interface performs all dispatches on a per-gang basis (cdbdisp_dispatchToGang), and the plan cannot vary from segment to segment within a gang. For the same reason, the reference count of a DSM segment cannot be dispatched directly from the QD, because it may differ from QE to QE even within the same gang (consider a plan that has a singleton reader).

We could work around this with a more thorough refactor of the cdb dispatcher interfaces, but I don't think that's worth it, though this is certainly debatable.

Other Caveats

Updatable Views

It may seem that the following invariant holds for any given query:

For any segment that a plan touches, there is always a writer QE on that segment.

This is indeed true for many common queries, but unfortunately not all of them. Below is a counterexample:

[figure: counterexample query plan involving an updatable view, where a segment touched by the plan has no writer QE]

InitPlan

If there's an InitPlan at the root of a plan, two writer gangs could be created and two rounds of dispatching could happen for the same query:

[figure: query plan with an InitPlan at the root, producing two writer gangs and two rounds of dispatching]

This is why we limit the reference calculation to the "same root" (https://github.com/Ray-Eldath/cloudberrydb/blob/dispatch-by-shmem/organized-and-unlogged/src/backend/utils/time/sharedqueryplan.c#L119-121); a rough sketch of this restriction follows the figure below. Note that an InitPlan doesn't necessarily have to be at the root. It can be deep down the plan tree as well:

[figure: query plan with an InitPlan nested deep inside the plan tree]
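As a hedged illustration of that restriction (hypothetical code, not the actual sharedqueryplan.c logic): when counting how many slices will place a QE on a given segment, slices that hang off a different plan root (such as an InitPlan dispatched in its own round) are simply skipped. The real reference count additionally accounts for which of those QEs is the writer.

static int
CountSameRootSlicesOnSegment(SliceTable *sliceTable, int rootIndex, int segindex)
{
	int			count = 0;

	for (int i = 0; i < sliceTable->numSlices; i++)
	{
		ExecSlice  *slice = &sliceTable->slices[i];

		/*
		 * An InitPlan is dispatched in its own round with its own writer
		 * gang, so slices under a different plan root are not counted.
		 */
		if (slice->rootIndex != rootIndex)
			continue;

		/* count this slice if it places a QE on the given segment */
		if (list_member_int(slice->segments, segindex))
			count++;
	}

	return count;
}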

Possible Outcome

All in all, although many requirements must be met for this feature to take effect, it is still enabled for most common queries (see the tests for examples). That is the good news. On the bad side, I doubt whether this PR makes any noticeable performance improvement at all. On the QD we cannot completely get rid of libpq connections for now, and query dispatch is already pipelined to hide the interconnect cost anyway. In the long run, if we are to "decentralize" QEs by reassigning tasks (such as creating reader QEs, keepalive, etc.) to the writer QE, this feature is a very good pathfinder and also a mandatory prerequisite. But if that's not the case, I doubt whether this feature alone is worth the risk.

@Ray-Eldath marked this pull request as draft on December 18, 2023 04:32
@Ray-Eldath changed the title from Dispatch by shmem/organized and unlogged to Dispatch by shared memory on Dec 18, 2023
@Ray-Eldath force-pushed the dispatch-by-shmem/organized-and-unlogged branch from b95dc51 to 9f7d1db on December 18, 2023 07:37
@Ray-Eldath marked this pull request as ready for review on December 18, 2023 07:37

addr = dsm_segment_address(seg);
memcpy(serializedPlantree, addr, serializedPlantreelen);
memcpy(serializedQueryDispatchDesc, addr + serializedPlantreelen, serializedQueryDispatchDesclen);
Collaborator:

Can we avoid this memcpy?

Contributor (author):

We can, but then we need to take all the logic related to entry->reference-- out of ReadSharedQueryPlan and make it a separate function, to be called after serializedPlantree and serializedQueryDispatchDesc are deserialized in exec_mpp_query. Do you think that's OK?

Contributor (author):

It requires another round of hash_search to find the entry and decrease the reference count after serializedPlantree and serializedQueryDesc are deserialized. I think the two memcpys may still be more efficient.
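For illustration, a rough sketch of the zero-copy variant being discussed here (ReleaseSharedQueryPlan and the exact deserialization calls are hypothetical): the reader deserializes straight out of the DSM mapping and only afterwards comes back to drop its reference, trading the two memcpys for a second hash_search.

/* roughly, in exec_mpp_query(): deserialize directly from the DSM mapping */
addr = dsm_segment_address(seg);
plantree = (PlannedStmt *) deserializeNode(addr, serializedPlantreelen);
ddesc = (QueryDispatchDesc *) deserializeNode(addr + serializedPlantreelen,
											  serializedQueryDispatchDesclen);

/*
 * Only now drop the reference: a second hash_search finds the entry,
 * decrements entry->reference, and the last reader detaches/destroys the DSM.
 */
ReleaseSharedQueryPlan(seg);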

Collaborator:

Yes, I worry about performance degradation.

@yjhjstz (Collaborator) commented Dec 18, 2023

else if (Gp_role == GP_ROLE_EXECUTE)
	{
		if (Gp_is_writer)
		{
			addSharedSnapshot("Writer qExec", gp_session_id);
		}
		else
		{
			/*
			 * NOTE: This assumes that the Slot has already been
			 *       allocated by the writer.  Need to make sure we
			 *       always allocate the writer qExec first.
			 */			 			
			lookupSharedSnapshot("Reader qExec", "Writer qExec", gp_session_id);
		}
	}

Can we reuse the SharedSnapshot logic and code?

@avamingli (Collaborator):

On the bad side, I doubt whether this PR makes any noticeable performance improvement at all. On the QD we cannot completely get rid of libpq connections for now, and query dispatch is already pipelined to hide the interconnect cost anyway. In the long run, if we are to "decentralize" QEs by reassigning tasks (such as creating reader QEs, keepalive, etc.) to the writer QE, this feature is a very good pathfinder and also a mandatory prerequisite. But if that's not the case, I doubt whether this feature alone is worth the risk.

Agree, we have a long way to go, but this PR is a good start! Thanks for your work and rich description.

@Ray-Eldath force-pushed the dispatch-by-shmem/organized-and-unlogged branch 2 times, most recently from 439a8c0 to ef1970b on December 27, 2023 09:57
@Ray-Eldath (author) commented Dec 27, 2023

Can we reuse the SharedSnapshot logic and code?

The original implementation did use SharedSnapshot, but I quickly gave it up. SharedSnapshot serves a totally different purpose and its lifetime is tied to a transaction, not a query. When the dtx context is AUTO_COMMIT_IMPLICIT, the transaction lifetime equals the query lifetime, but when you explicitly use a transaction with a BEGIN statement etc., they differ. We could of course make this work, but I didn't go down that path. I can try again if you think that's better.

Also, I could never figure out why they use an array to store slots discriminated by gp_session_id. Using a shmem HTAB, as this PR and parallel do, leads to a far simpler implementation (see the sketch below). Do you have a theory on this? @yjhjstz
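For comparison, the HTAB flavor is roughly the following (hypothetical names, mirroring the sketch earlier in the description): entries are keyed by gp_session_id directly instead of scanning a fixed slot array.

static HTAB *SharedPlanHash;

void
SharedQueryPlanShmemInit(void)
{
	HASHCTL		info;

	/* key is the session id; one entry per user connection */
	info.keysize = sizeof(int);
	info.entrysize = sizeof(SharedPlanEntry);

	SharedPlanHash = ShmemInitHash("Shared Query Plan Hash",
								   MaxBackends, MaxBackends,
								   &info,
								   HASH_ELEM | HASH_BLOBS);
}

/* a lookup is then a single hash_search(SharedPlanHash, &gp_session_id, ...) */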

@yjhjstz (Collaborator) commented Dec 27, 2023


I found they have the same logic, except for the lifetime.

@Ray-Eldath force-pushed the dispatch-by-shmem/organized-and-unlogged branch 3 times, most recently from 4ab666b to c86a28f on January 2, 2024 08:48
This PR makes serializedPlantree and serializedQueryDispatchDesc
dispatched by shared memory instead of interconnect, so that
they can be sent only once to the writer QE, and synced between
reader QEs and the writer QE on a segment through DSM.
@Ray-Eldath force-pushed the dispatch-by-shmem/organized-and-unlogged branch from c86a28f to 18cace7 on January 3, 2024 03:09