exec: add physical types; columnarizer and materializer #31353

Merged
merged 2 commits into from Oct 17, 2018

jordanlewis commented Oct 15, 2018

  • exec: add physical types enum
  • exec: add columnarizer and materializer

An exec physical type corresponds to a particular type's bytes
representation, from the perspective of exec. For example, OID
types and Integers are represented the same way - as int64s.
And Integers with bounded precisions may be represented as
int32, int16, or int8, depending on the precision bounds.
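
To illustrate the mapping described above, here is a minimal sketch, not the code from this PR. The names `T`, `Int8` through `Int64`, and `fromIntWidth` are hypothetical stand-ins, and treating a width of 0 as "unbounded" is an assumption made only for this example. The panic on an unknown width mirrors the snippet quoted later in the review.

```go
package main

import "fmt"

// T is a hypothetical physical type tag: it names the in-memory
// representation a column uses, not its SQL type.
type T int

const (
	Int8 T = iota
	Int16
	Int32
	Int64
)

// fromIntWidth maps an integer column's bit width to a physical type.
// Width 0 is treated as "no bound" here (an assumption of this sketch).
func fromIntWidth(width int) T {
	switch width {
	case 8:
		return Int8
	case 16:
		return Int16
	case 32:
		return Int32
	case 0, 64:
		return Int64
	}
	// Fail loudly rather than silently widening to Int64.
	panic(fmt.Sprintf("integer with unknown width %d", width))
}

func main() {
	fmt.Println(fromIntWidth(16) == Int16, fromIntWidth(0) == Int64)
}
```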

The columnarizer and materializer are the components that map
data between the EncDatumRow representation and the exec.ColBatch
representation.

Release note: None

exec: add physical types enum
An exec physical type corresponds to a particular type's bytes
representation, from the perspective of exec.

Release note: None

@jordanlewis jordanlewis requested review from cockroachdb/distsql-prs as code owners Oct 15, 2018

cockroach-teamcity commented Oct 15, 2018

This change is Reviewable

@solongordon

:lgtm: in general. I left a few thoughts.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/distsqlrun/columnarizer.go, line 192 at r2 (raw file):

// unsafeConvertStringToBytes converts a string to a byte array to be used with string encoding functions.
func unsafeConvertStringToBytes(s string) []byte {

Any reason not to share with util.encoding rather than duplicating?


pkg/sql/distsqlrun/columnarizer_test.go, line 51 at r2 (raw file):

			if bat.Length() == 0 {
				break
			}

Would be nice to add a simple check on expected number of batches here. (Lesson learned from the tablereader benchmark issue.)
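
A rough sketch of what such a check could look like, using a hypothetical `fakeSource` in place of the real columnarizer: drain the source, count the non-empty batches, and compare against the ceiling of rows divided by batch size.

```go
package main

import "fmt"

// fakeSource is a hypothetical stand-in for the columnarizer: each call to
// next returns the length of the next batch, and 0 once it is exhausted.
type fakeSource struct {
	remaining int
	batchSize int
}

func (s *fakeSource) next() int {
	n := s.batchSize
	if s.remaining < n {
		n = s.remaining
	}
	s.remaining -= n
	return n
}

// countBatches drains the source and returns how many non-empty batches it
// produced.
func countBatches(s *fakeSource) int {
	count := 0
	for s.next() != 0 {
		count++
	}
	return count
}

// expectedBatches is the ceiling of numRows / batchSize.
func expectedBatches(numRows, batchSize int) int {
	return (numRows + batchSize - 1) / batchSize
}

func main() {
	src := &fakeSource{remaining: 2500, batchSize: 1024}
	fmt.Println(countBatches(src) == expectedBatches(2500, 1024))
}
```

A benchmark that silently processes zero batches would fail this comparison instead of reporting a misleadingly fast result.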


pkg/sql/distsqlrun/materializer.go, line 117 at r2 (raw file):

		sel := m.batch.Selection()

		rowIdx := m.curIdx

Did you consider processing the entire batch here into an EncDatumRows buffer, rather than one row at a time? Seems like we could see a speed-up there since you could convert each column vector to EncDatums in a tight loop.
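
As a sketch of the batch-at-a-time idea, under heavy simplification: `datum` and `batch` below are hypothetical stand-ins for EncDatum and exec.ColBatch, only int64 columns are handled, and the selection vector is ignored. The point is the loop order, which walks one column vector at a time.

```go
package main

import "fmt"

// datum stands in for an EncDatum; the real type carries much more.
type datum struct{ val int64 }

// batch is a simplified stand-in for a column batch: one int64 slice per
// column, with n valid rows.
type batch struct {
	cols [][]int64
	n    int
}

// materializeBatch converts the whole batch into rows, one column at a time,
// so the inner loop reads a single contiguous vector. rows must already hold
// at least b.n rows of len(b.cols) datums each.
func materializeBatch(b batch, rows [][]datum) [][]datum {
	for colIdx, col := range b.cols {
		for rowIdx := 0; rowIdx < b.n; rowIdx++ {
			rows[rowIdx][colIdx] = datum{val: col[rowIdx]}
		}
	}
	return rows[:b.n]
}

func main() {
	b := batch{cols: [][]int64{{1, 2, 3}, {10, 20, 30}}, n: 3}
	rows := make([][]datum, 3)
	for i := range rows {
		rows[i] = make([]datum, len(b.cols))
	}
	fmt.Println(materializeBatch(b, rows))
}
```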

@jordanlewis

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/distsqlrun/columnarizer.go, line 192 at r2 (raw file):

Previously, solongordon (Solon) wrote…

Any reason not to share with util.encoding rather than duplicating?

Nope, I'll export from there, not sure why I didn't do that to start with.


pkg/sql/distsqlrun/columnarizer_test.go, line 51 at r2 (raw file):

Previously, solongordon (Solon) wrote…

Would be nice to add a simple check on expected number of batches here. (Lesson learned from the tablereader benchmark issue.)

Good idea.


pkg/sql/distsqlrun/materializer.go, line 117 at r2 (raw file):

Previously, solongordon (Solon) wrote…

Did you consider processing the entire batch here into an EncDatumRows buffer, rather than one row at a time? Seems like we could see a speed-up there since you could convert each column vector to EncDatums in a tight loop.

Fantastic idea. Do you want to take that on? I'll plan to merge as-is and we can upgrade it later.

@solongordon

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/distsqlrun/materializer.go, line 117 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Fantastic idea. Do you want to take that on? I'll plan to merge as-is and we can upgrade it later.

👍

@asubiotto

Reviewed 1 of 1 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/distsqlrun/columnarizer.go, line 47 at r2 (raw file):

		input: input,
	}
	c.InitWithEvalCtx(

If you're just passing in flowCtx.NewEvalCtx might as well use Init (maybe explicitly call ProcessorBase.Init to avoid confusion)


pkg/sql/distsqlrun/columnarizer.go, line 62 at r2 (raw file):

}

func (c *columnarizer) Init() {

I feel like this might take the col batch size as an argument at some point since it should probably be a function of the L1 cache size (and maybe the number of columns?). I'm sure there's some literature on this.


pkg/sql/distsqlrun/columnarizer.go, line 76 at r2 (raw file):

	nRows := uint16(0)
	columnTypes := c.OutputTypes()
	for ; nRows < exec.ColBatchSize; nRows++ {

I think that ColBatchSize should be number of bytes, not number of rows.


pkg/sql/distsqlrun/columnarizer.go, line 80 at r2 (raw file):

		if meta != nil {
			panic("metadata")

At some point we'll need to c.ProcessorBase.AppendTrailingMeta(meta) here, right? The materializer should probably have a reference to the upstream columnarizer.


pkg/sql/distsqlrun/columnarizer.go, line 85 at r2 (raw file):

			break
		}
		c.buffered[nRows] = row

Are we going to have to copy here?


pkg/sql/distsqlrun/columnarizer_test.go, line 48 at r2 (raw file):

	for i := 0; i < b.N; i++ {
		for {
			bat := c.Next()

nit: do we want to be using the term bat?


pkg/sql/distsqlrun/materializer_test.go, line 28 at r2 (raw file):

)

func TestColumnarizeMaterialize(t *testing.T) {

We should think about adding randomization.


pkg/sql/exec/types/types.go, line 25 at r1 (raw file):

// T represents an exec physical type - a bytes representation of a particular
// column type.
type T int

Have you considered making this a byte?


pkg/sql/exec/types/types.go, line 61 at r1 (raw file):

			return Int64
		}
		panic(fmt.Sprintf("integer with unknown width %d", ct.Width))

Do you think it might be better to just have Int64 be a catch-all as a default case?

@solongordon

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/distsqlrun/columnarizer.go, line 139 at r2 (raw file):

				col := vec.Int64()
				for i := uint16(0); i < nRows; i++ {
					if c.buffered[i][idx].Datum == nil {

Any reason you didn't use the Datum == nil optimization for every column type? Same question for avoiding the ed := assignment.

@jordanlewis

jordanlewis commented Oct 16, 2018

Any reason you didn't use the Datum == nil optimization for every column type? Same question for avoiding the ed := assignment.

Lack of templating. I was only playing with the int64 one; the implementation that's there for the int64 type was the fastest according to the benchmark.

@solongordon

Thanks, that's what I figured. I'll include those in the template.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@jordanlewis

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/distsqlrun/columnarizer.go, line 47 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

If you're just passing in flowCtx.NewEvalCtx might as well use Init (maybe explicitly call ProcessorBase.Init to avoid confusion)

Done.


pkg/sql/distsqlrun/columnarizer.go, line 62 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

I feel like this might take the col batch size as an argument at some point since it should probably be a function of the L1 cache size (and maybe the number of columns?). I'm sure there's some literature on this.

Agreed, though for now it'll be easier to just keep things parameterized on a global. We can tune this later as needed.
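
For illustration only, one way a batch size could be derived from a cache budget and the column layout. The function name `rowsPerBatch`, the 32 KiB figure, and the fixed-width-column assumption are all hypothetical and do not come from this PR.

```go
package main

import "fmt"

// rowsPerBatch returns how many rows fit in cacheBytes when each row costs
// the sum of colWidths bytes across all column vectors. It assumes
// fixed-width columns and always returns at least 1.
func rowsPerBatch(cacheBytes int, colWidths []int) int {
	rowBytes := 0
	for _, w := range colWidths {
		rowBytes += w
	}
	if rowBytes == 0 {
		return 1
	}
	n := cacheBytes / rowBytes
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	// 32 KiB is an assumed L1 data cache size, used only as an example.
	fmt.Println(rowsPerBatch(32*1024, []int{8, 8, 4}))
}
```

Because the result is a single row count shared by every column, each vector in a batch still has the same length.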


pkg/sql/distsqlrun/columnarizer.go, line 76 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

I think that ColBatchSize should be number of bytes, not number of rows.

That's going to be very tricky, as it will lead to a different number of rows per column vector. I think for now we can stick with number of rows. If you're motivated to do so, you should benchmark whether this makes a difference.


pkg/sql/distsqlrun/columnarizer.go, line 80 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

At some point we'll need to c.ProcessorBase.AppendTrailingMeta(meta) here, right? The materializer should probably have a reference to the upstream columnarizer.

Yes, we'll need that at some point.


pkg/sql/distsqlrun/columnarizer.go, line 85 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Are we going to have to copy here?

Great point, fixed. Now we preallocate rows of the right size in startup, and copy the EncDatums into their rows.
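
A minimal sketch of the preallocate-then-copy approach described in this reply, with hypothetical simplified types (`encDatum`, `columnarizer`) standing in for the real ones. The rows are carved out of one backing allocation at startup, and each incoming row is copied in so the buffer does not alias memory the upstream may reuse.

```go
package main

import "fmt"

// encDatum stands in for the real EncDatum type.
type encDatum struct{ val int64 }

// columnarizer here is a hypothetical, stripped-down version holding only
// the row buffer.
type columnarizer struct {
	buffered [][]encDatum
}

// init preallocates batchSize rows of numCols datums from one backing slice.
func (c *columnarizer) init(batchSize, numCols int) {
	backing := make([]encDatum, batchSize*numCols)
	c.buffered = make([][]encDatum, batchSize)
	for i := range c.buffered {
		lo, hi := i*numCols, (i+1)*numCols
		// The three-index slice caps capacity so rows cannot overlap.
		c.buffered[i] = backing[lo:hi:hi]
	}
}

// buffer copies row into slot i instead of retaining a reference to it.
func (c *columnarizer) buffer(i int, row []encDatum) {
	copy(c.buffered[i], row)
}

func main() {
	c := &columnarizer{}
	c.init(2, 2)
	row := []encDatum{{val: 1}, {val: 2}}
	c.buffer(0, row)
	row[0].val = 99 // simulate the upstream reusing its row memory
	fmt.Println(c.buffered[0][0].val)
}
```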


pkg/sql/distsqlrun/columnarizer.go, line 192 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Nope, I'll export from there, not sure why I didn't do that to start with.

Done.


pkg/sql/distsqlrun/columnarizer_test.go, line 48 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

nit: do we want to be using the term bat?

Just short for batch, I didn't mean to make it sound like BAT. Fixed.


pkg/sql/distsqlrun/materializer_test.go, line 28 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

We should think about adding randomization.

Agreed.


pkg/sql/exec/types/types.go, line 25 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Have you considered making this a byte?

Interesting idea - I think that unlike the other datatypes we were discussing this one is less critical to make small, since it's not touched in the hot path - should be either just during planning or once per batch.


pkg/sql/exec/types/types.go, line 61 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Do you think it might be better to just have Int64 be a catch-all as a default case?

I think for now I like the panic because it shows us that something is going very wrong. We shouldn't have a width that isn't one of the values above.

@asubiotto

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/distsqlrun/columnarizer.go, line 80 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Yes, we'll need that at some point.

Maybe add a TODO


pkg/sql/distsqlrun/columnarizer.go, line 85 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Great point, fixed. Now we preallocate rows of the right size in startup, and copy the EncDatums into their rows.

Now that we do need to copy, I wonder if it might be better to just immediately write out the column values and avoid the copy?


pkg/sql/distsqlrun/materializer_test.go, line 28 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Agreed.

Could you add a TODO?

@jordanlewis

TFTRs!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/distsqlrun/columnarizer.go, line 80 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Maybe add a TODO

Done.


pkg/sql/distsqlrun/columnarizer.go, line 85 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Now that we do need to copy, I wonder if it might be better to just immediately write out the column values and avoid the copy?

Added a todo.


pkg/sql/distsqlrun/materializer_test.go, line 28 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Could you add a TODO?

Done.

craig bot pushed a commit that referenced this pull request Oct 17, 2018

Merge #31353
31353: exec: add physical types; columnarizer and materializer r=jordanlewis a=jordanlewis

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
@jordanlewis

jordanlewis commented Oct 17, 2018

bors r-

Pushed the wrong rev.

@craig

craig bot commented Oct 17, 2018

Canceled

exec: add columnarizer and materializer
These are the components that map data between the EncDatumRow
representation and the exec.ColBatch representation.

Release note: None
@jordanlewis

jordanlewis commented Oct 17, 2018

bors r+

craig bot pushed a commit that referenced this pull request Oct 17, 2018

Merge #31353
31353: exec: add physical types; columnarizer and materializer r=jordanlewis a=jordanlewis

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
@craig

craig bot commented Oct 17, 2018

Build succeeded

@craig craig bot merged commit 6746fbd into cockroachdb:master Oct 17, 2018

3 checks passed

GitHub CI (Cockroach) TeamCity build finished
bors Build succeeded
license/cla Contributor License Agreement is signed.