Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg: Add and integrate BulkRowWriter processor into CTAS execution. #38374

Merged
merged 2 commits into from Jul 9, 2019

Conversation

@adityamaru27
Copy link
Collaborator

adityamaru27 commented Jun 24, 2019

This change introduces a BulkRowWriter processor which uses
AddSSTable to write rows from a RowSource to the target
table. This is in an attempt to make the CREATE TABLE ... AS
statement more scalable.

@adityamaru27 adityamaru27 requested a review from cockroachdb/cli-prs as a code owner Jun 24, 2019
@adityamaru27 adityamaru27 requested review from Jun 24, 2019
@adityamaru27 adityamaru27 requested a review from cockroachdb/sql-opt-prs as a code owner Jun 24, 2019
@adityamaru27 adityamaru27 requested review from Jun 24, 2019
@adityamaru27 adityamaru27 assigned dt and lucy-zhang and unassigned dt and lucy-zhang Jun 24, 2019
@cockroach-teamcity

This comment has been minimized.

Copy link
Member

cockroach-teamcity commented Jun 24, 2019

This change is Reviewable

@adityamaru27 adityamaru27 requested review from dt and lucy-zhang Jun 24, 2019
@adityamaru27 adityamaru27 force-pushed the adityamaru27:add-ctas-processor branch 5 times, most recently from 420c63f to 4467d36 Jun 24, 2019
@adityamaru27 adityamaru27 requested a review from Jun 27, 2019
@adityamaru27 adityamaru27 force-pushed the adityamaru27:add-ctas-processor branch 3 times, most recently from 832663f to ccd06b3 Jun 27, 2019
@@ -134,7 +134,7 @@ statement ok
SET tracing = on,kv,results; UPDATE t.kv2 SET v = v + 2; SET tracing = off

query TT
SELECT operation, regexp_replace(message, '(\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d\.)?\d\d\d\d\d+', '...PK...') as message

This comment has been minimized.

Copy link
@adityamaru27

adityamaru27 Jun 28, 2019

Author Collaborator

Using the row converter changes the way row_id's are generated for hidden PK's. They are generated using GenerateUniqueID which required this change in the logic test.

@adityamaru27 adityamaru27 force-pushed the adityamaru27:add-ctas-processor branch from ccd06b3 to 8c52887 Jul 1, 2019
@adityamaru27 adityamaru27 changed the title [WIP] pkg: Add and integrate BulkRowWriter processor into CTAS execution. pkg: Add and integrate BulkRowWriter processor into CTAS execution. Jul 1, 2019
Copy link
Member

dt left a comment

Reviewed 10 of 10 files at r1, 7 of 7 files at r2, 7 of 11 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru27, @dt, and @lucy-zhang)


pkg/sql/create_table.go, line 1652 at r3 (raw file):

}

func newBulkRowWriterProcessor(

for better or worse, most of the time we define processors in distsqlrun and only use the function-pointer injection route for the ones that have to be defined elsewhere due to hard rules about dependency edges, like OSS vs CCL. Is there anything motivating putting this in sql instead of distsqlrun? Anything it depends on in sql?

in either case, I'd pull it into own file.


pkg/sql/create_table.go, line 1744 at r3 (raw file):

		g = ctxgroup.WithContext(ctx)
		g.GoCtx(func(ctx context.Context) error {

is sp.out threadsafe?

2¢: I might find this easier to follow if we
a) pull the first half of this inline func into a private method on the processor which also inlines ingestKVs.
b) I'd probably have that method just set a BulkOpSummary field in the processor when it is done, instead of emitting directly.
c) I'd then serialize and emit the row part done in Run, after calling g.Wait (so you don't have to worry about a race).


pkg/sql/create_table.go, line 1778 at r3 (raw file):

		})

		for {

right now if the ingesting goroutine gets an error, I think this will keep pushing rows at it and might deadlock on the channel. I think you might want to also run this loop in a task on the error group so that its context is hooked up to cancel on error return (and the other way too -- if this errors, we should teardown the ingesting goroutine too).


pkg/sql/distsqlrun/processors.go, line 1244 at r2 (raw file):

var NewCSVWriterProcessor func(*FlowCtx, int32, distsqlpb.CSVWriterSpec, RowSource, RowReceiver) (Processor, error)

// NewBulkRowWriterProcessor is externally implemented.

This is usually done to get around CCL vs OSS, where we inject the proc via this hook and an init func. This is a pure OSS proc, so I don't think we need that song and dance.

@adityamaru27 adityamaru27 force-pushed the adityamaru27:add-ctas-processor branch from 8c52887 to 0c50c53 Jul 3, 2019
Copy link
Collaborator Author

adityamaru27 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru27, @dt, and @lucy-zhang)


pkg/sql/create_table.go, line 1744 at r3 (raw file):

Previously, dt (David Taylor) wrote…

is sp.out threadsafe?

2¢: I might find this easier to follow if we
a) pull the first half of this inline func into a private method on the processor which also inlines ingestKVs.
b) I'd probably have that method just set a BulkOpSummary field in the processor when it is done, instead of emitting directly.
c) I'd then serialize and emit the row part done in Run, after calling g.Wait (so you don't have to worry about a race).

Done.


pkg/sql/create_table.go, line 1778 at r3 (raw file):

Previously, dt (David Taylor) wrote…

right now if the ingesting goroutine gets an error, I think this will keep pushing rows at it and might deadlock on the channel. I think you might want to also run this loop in a task on the error group so that its context is hooked up to cancel on error return (and the other way too -- if this errors, we should teardown the ingesting goroutine too).

thats a cool catch, good to know. Done.

Copy link
Contributor

lucy-zhang left a comment

I'm not very familiar with this part of the code, but it LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru27, @dt, and @lucy-zhang)


pkg/sql/bulk_row_writer.go, line 82 at r4 (raw file):

	// ingestKvs drains kvs from the channel until it closes, ingesting them using
	// the BulkAdder. It handles the required buffering/sorting/etc.
	ingestKvs := func() error {

Is there a difference between defining this function and then calling it immediately afterward, and just replacing it by the function definition?

@adityamaru27

This comment has been minimized.

Copy link
Collaborator Author

adityamaru27 commented Jul 8, 2019

Is there a difference between defining this function and then calling it immediately afterward, and just replacing it by the function definition?

I don't think so, I just did it to modularize the kv ingestion logic. However, I just realized that this pattern is used for places where we reuse a method several times. Should I pull out the definition?

@dt
dt approved these changes Jul 8, 2019
Copy link
Member

dt left a comment

Reviewed 1 of 11 files at r3, 1 of 2 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru27)


pkg/sql/bulk_row_writer.go, line 70 at r4 (raw file):

}

func (sp *bulkRowWriter) doRun(ctx context.Context, kvCh chan []roachpb.KeyValue) error {

nit: I'd call this method ingestLoop or something descriptive since it doesn't really run the whole process.

That said, I think you're also on the right track going with the pattern of a doRun method that has an error return and can thus have early returns on error, since you can then just call it in Run and cleanup some of the named errors and conditionals deciding what to return.


pkg/sql/bulk_row_writer.go, line 118 at r4 (raw file):

	var kvCh chan []roachpb.KeyValue
	var g ctxgroup.Group
	err := func() error {

This is a lot of inline anon function along with the added layer below for the GoCtx call.

In my experience, big inline funcs, particularly closing over each other, can make it tricky to see what vars closed over vs local, what is shadowed, etc. and can sometimes lead to subtle bugs, particularly if anyone else tries to make changes later.

My 2¢: I'd move most of this, i.e. what you want run on another goroutine, to a second private method, and then yourRun (or doRun helper) looks like

<setup code>
g := ...
g.GoCtx(func(ctx) error { return ingestLoop(ctx, kvCh, ...) } )
g.GoCtx(func(ctx) error { return convertLoop(ctx, input, kvCh, ...)} )
g.Wait()

@adityamaru27 adityamaru27 force-pushed the adityamaru27:add-ctas-processor branch 8 times, most recently from 6125663 to 9774f70 Jul 8, 2019
This change introduces a BulkRowWriter processor which uses the
AddSSTable method to write rows from a RowSource to the target
table. The processor is required to make the CTAS statement
scalable.

Release note: None
This change integrates the BulkRowWriter into the distsql
PlanAndRun phase for CTAS statements.

Release Note: None
@adityamaru27 adityamaru27 force-pushed the adityamaru27:add-ctas-processor branch from 9774f70 to 3e1f7f7 Jul 8, 2019
Copy link
Collaborator Author

adityamaru27 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru27 and @dt)


pkg/sql/create_table.go, line 1652 at r3 (raw file):

Previously, dt (David Taylor) wrote…

for better or worse, most of the time we define processors in distsqlrun and only use the function-pointer injection route for the ones that have to be defined elsewhere due to hard rules about dependency edges, like OSS vs CCL. Is there anything motivating putting this in sql instead of distsqlrun? Anything it depends on in sql?

in either case, I'd pull it into own file.

Done.


pkg/sql/distsqlrun/processors.go, line 1244 at r2 (raw file):

Previously, dt (David Taylor) wrote…

This is usually done to get around CCL vs OSS, where we inject the proc via this hook and an init func. This is a pure OSS proc, so I don't think we need that song and dance.

Done.


pkg/sql/bulk_row_writer.go, line 70 at r4 (raw file):

Previously, dt (David Taylor) wrote…

nit: I'd call this method ingestLoop or something descriptive since it doesn't really run the whole process.

That said, I think you're also on the right track going with the pattern of a doRun method that has an error return and can thus have early returns on error, since you can then just call it in Run and cleanup some of the named errors and conditionals deciding what to return.

Done.


pkg/sql/bulk_row_writer.go, line 118 at r4 (raw file):

Previously, dt (David Taylor) wrote…

This is a lot of inline anon function along with the added layer below for the GoCtx call.

In my experience, big inline funcs, particularly closing over each other, can make it tricky to see what vars closed over vs local, what is shadowed, etc. and can sometimes lead to subtle bugs, particularly if anyone else tries to make changes later.

My 2¢: I'd move most of this, i.e. what you want run on another goroutine, to a second private method, and then yourRun (or doRun helper) looks like

<setup code>
g := ...
g.GoCtx(func(ctx) error { return ingestLoop(ctx, kvCh, ...) } )
g.GoCtx(func(ctx) error { return convertLoop(ctx, input, kvCh, ...)} )
g.Wait()

Done.

@adityamaru27

This comment has been minimized.

Copy link
Collaborator Author

adityamaru27 commented Jul 8, 2019

TFTR
bors r+

@adityamaru27

This comment has been minimized.

Copy link
Collaborator Author

adityamaru27 commented Jul 9, 2019

bors r+

craig bot pushed a commit that referenced this pull request Jul 9, 2019
38374: pkg: Add and integrate BulkRowWriter processor into CTAS execution. r=adityamaru27 a=adityamaru27

This change introduces a BulkRowWriter processor which uses
AddSSTable to write rows from a RowSource to the target
table. This is in an attempt to make the CREATE TABLE ... AS
statement more scalable.

Co-authored-by: Aditya Maru <adityamaru@cockroachlabs.com>
@craig

This comment has been minimized.

Copy link

craig bot commented Jul 9, 2019

Build succeeded

@craig craig bot merged commit 3e1f7f7 into cockroachdb:master Jul 9, 2019
3 checks passed
3 checks passed
GitHub CI (Cockroach) TeamCity build finished
Details
bors Build succeeded
Details
license/cla Contributor License Agreement is signed.
Details
@adityamaru27 adityamaru27 deleted the adityamaru27:add-ctas-processor branch Jul 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.