Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upexec: add physical types; columnarizer and materializer #31353
Conversation
jordanlewis
requested review from
solongordon,
asubiotto and
changangela
Oct 15, 2018
jordanlewis
requested review from
cockroachdb/distsql-prs
as
code owners
Oct 15, 2018
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
solongordon
approved these changes
Oct 15, 2018
in general. I left a few thoughts.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/sql/distsqlrun/columnarizer.go, line 192 at r2 (raw file):
// unsafeConvertStringToBytes converts a string to a byte array to be used with string encoding functions. func unsafeConvertStringToBytes(s string) []byte {
Any reason not to share with util.encoding rather than duplicating?
pkg/sql/distsqlrun/columnarizer_test.go, line 51 at r2 (raw file):
if bat.Length() == 0 { break }
Would be nice to add a simple check on expected number of batches here. (Lesson learned from the tablereader benchmark issue.)
pkg/sql/distsqlrun/materializer.go, line 117 at r2 (raw file):
sel := m.batch.Selection() rowIdx := m.curIdx
Did you consider processing the entire batch here into an EncDatumRows buffer, rather than one row at a time? Seems like we could see a speed-up there since you could convert each column vector to EncDatums in a tight loop.
jordanlewis
reviewed
Oct 15, 2018
Reviewable status:
complete! 1 of 0 LGTMs obtained
pkg/sql/distsqlrun/columnarizer.go, line 192 at r2 (raw file):
Previously, solongordon (Solon) wrote…
Any reason not to share with
util.encodingrather than duplicating?
Nope, I'll export from there, not sure why I didn't do that to start with.
pkg/sql/distsqlrun/columnarizer_test.go, line 51 at r2 (raw file):
Previously, solongordon (Solon) wrote…
Would be nice to add a simple check on expected number of batches here. (Lesson learned from the tablereader benchmark issue.)
Good idea.
pkg/sql/distsqlrun/materializer.go, line 117 at r2 (raw file):
Previously, solongordon (Solon) wrote…
Did you consider processing the entire batch here into an
EncDatumRowsbuffer, rather than one row at a time? Seems like we could see a speed-up there since you could convert each column vector to EncDatums in a tight loop.
Fantastic idea. Do you want to take that on? I'll plan to merge as-is and we can upgrade it later.
solongordon
requested changes
Oct 15, 2018
Reviewable status:
complete! 1 of 0 LGTMs obtained
pkg/sql/distsqlrun/materializer.go, line 117 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Fantastic idea. Do you want to take that on? I'll plan to merge as-is and we can upgrade it later.
asubiotto
requested changes
Oct 15, 2018
Reviewed 1 of 1 files at r1, 6 of 6 files at r2.
Reviewable status:complete! 1 of 0 LGTMs obtained
pkg/sql/distsqlrun/columnarizer.go, line 47 at r2 (raw file):
input: input, } c.InitWithEvalCtx(
If you're just passing in flowCtx.NewEvalCtx might as well use Init (maybe explicitly call ProcessorBase.Init to avoid confusion)
pkg/sql/distsqlrun/columnarizer.go, line 62 at r2 (raw file):
} func (c *columnarizer) Init() {
I feel like this might take the col batch size as an argument at some point since it should probably be a function of the L1 cache size (and maybe the number of columns?). I'm sure there's some literature on this.
pkg/sql/distsqlrun/columnarizer.go, line 76 at r2 (raw file):
nRows := uint16(0) columnTypes := c.OutputTypes() for ; nRows < exec.ColBatchSize; nRows++ {
I think that ColBatchSize should be number of bytes, not number of rows.
pkg/sql/distsqlrun/columnarizer.go, line 80 at r2 (raw file):
if meta != nil { panic("metadata")
At some point we'll need to c.ProcessorBase.AppendTrailingMeta(meta) here, right? The materializer should probably have a reference to the upstream columnarizer.
pkg/sql/distsqlrun/columnarizer.go, line 85 at r2 (raw file):
break } c.buffered[nRows] = row
Are we going to have to copy here?
pkg/sql/distsqlrun/columnarizer_test.go, line 48 at r2 (raw file):
for i := 0; i < b.N; i++ { for { bat := c.Next()
nit: do we want to be using the term bat?
pkg/sql/distsqlrun/materializer_test.go, line 28 at r2 (raw file):
) func TestColumnarizeMaterialize(t *testing.T) {
We should think about adding randomization.
pkg/sql/exec/types/types.go, line 25 at r1 (raw file):
// T represents an exec physical type - a bytes representation of a particular // column type. type T int
Have you considered making this a byte?
pkg/sql/exec/types/types.go, line 61 at r1 (raw file):
return Int64 } panic(fmt.Sprintf("integer with unknown width %d", ct.Width))
Do you think it might be better to just have Int64 be a catch-all as a default case?
solongordon
reviewed
Oct 16, 2018
Reviewable status:
complete! 1 of 0 LGTMs obtained
pkg/sql/distsqlrun/columnarizer.go, line 139 at r2 (raw file):
col := vec.Int64() for i := uint16(0); i < nRows; i++ { if c.buffered[i][idx].Datum == nil {
Any reason you didn't use the Datum == nil optimization for every column type? Same question for avoiding the ed := assignment.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
jordanlewis
Oct 16, 2018
Member
Any reason you didn't use the Datum == nil optimization for every column type? Same question for avoiding the ed := assignment.
Lack of templating. I only was playing with the int64 one - the implementation that's there for the int64 type was the fastest according to the benchmark.
Lack of templating. I only was playing with the int64 one - the implementation that's there for the int64 type was the fastest according to the benchmark. |
solongordon
reviewed
Oct 16, 2018
Thanks, that's what I figured. I'll include those in the template.
Reviewable status:
complete! 1 of 0 LGTMs obtained
jordanlewis
reviewed
Oct 16, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/sql/distsqlrun/columnarizer.go, line 47 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
If you're just passing in
flowCtx.NewEvalCtxmight as well useInit(maybe explicitly callProcessorBase.Initto avoid confusion)
Done.
pkg/sql/distsqlrun/columnarizer.go, line 62 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
I feel like this might take the col batch size as an argument at some point since it should probably be a function of the L1 cache size (and maybe the number of columns?). I'm sure there's some literature on this.
Agreed, though for now it'll be easier to just keep things parameterized on a global. We can tune this later as needed.
pkg/sql/distsqlrun/columnarizer.go, line 76 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
I think that
ColBatchSizeshould be number of bytes, not number of rows.
That's going to be very tricky, as it will lead to a different number of rows per column vector. I think for now we can stick with number of rows. If you're motivated to do so, you should benchmark whether this makes a difference.
pkg/sql/distsqlrun/columnarizer.go, line 80 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
At some point we'll need to
c.ProcessorBase.AppendTrailingMeta(meta)here, right? The materializer should probably have a reference to the upstream columnarizer.
Yes, we'll need that at some point.
pkg/sql/distsqlrun/columnarizer.go, line 85 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Are we going to have to copy here?
Great point, fixed. Now we preallocate rows of the right size in startup, and copy the EncDatums into their rows.
pkg/sql/distsqlrun/columnarizer.go, line 192 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Nope, I'll export from there, not sure why I didn't do that to start with.
Done.
pkg/sql/distsqlrun/columnarizer_test.go, line 48 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
nit: do we want to be using the term
bat?
Just short for batch, I didn't mean to make it sound like BAT. Fixed.
pkg/sql/distsqlrun/materializer_test.go, line 28 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
We should think about adding randomization.
Agreed.
pkg/sql/exec/types/types.go, line 25 at r1 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Have you considered making this a
byte?
Interesting idea - I think that unlike the other datatypes we were discussing this one is less critical to make small, since it's not touched in the hot path - should be either just during planning or once per patch.
pkg/sql/exec/types/types.go, line 61 at r1 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Do you think it might be better to just have
Int64be a catch-all as adefaultcase?
I think for now I like the panic because it shows us that something is going very wrong. We shouldn't have a width that aren't those above values.
asubiotto
approved these changes
Oct 17, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/sql/distsqlrun/columnarizer.go, line 80 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Yes, we'll need that at some point.
Maybe add a TODO
pkg/sql/distsqlrun/columnarizer.go, line 85 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Great point, fixed. Now we preallocate rows of the right size in startup, and copy the EncDatums into their rows.
Now that we do need to copy, I wonder if it might be better to just immediately write out the column values and avoid the copy?
pkg/sql/distsqlrun/materializer_test.go, line 28 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Agreed.
Could you add a TODO?
jordanlewis
reviewed
Oct 17, 2018
TFTRs!
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/sql/distsqlrun/columnarizer.go, line 80 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Maybe add a TODO
Done.
pkg/sql/distsqlrun/columnarizer.go, line 85 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Now that we do need to copy, I wonder if it might be better to just immediately write out the column values and avoid the copy?
Added a todo.
pkg/sql/distsqlrun/materializer_test.go, line 28 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Could you add a TODO?
Done.
bot
pushed a commit
that referenced
this pull request
Oct 17, 2018
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
bors r- Pushed the wrong rev. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
craig
bot
commented
Oct 17, 2018
Canceled |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
bors r+ |
bot
pushed a commit
that referenced
this pull request
Oct 17, 2018
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
craig
bot
commented
Oct 17, 2018
Build succeeded |
jordanlewis commentedOct 15, 2018
An exec physical type corresponds to a particular type's bytes
representation, from the perspective of exec. For example, OID
types and Integers are represented the same way - as int64s.
And Integers with bounded precisions may be represented as
int32, int16, or int8, depending on the precision bounds.
The columnarizer and materializer are the the components that map
data between the EncDatumRow representation and the exec.ColBatch
representation.
Release note: None