Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/sem/builtins: add crdb_internal.datums_to_bytes, use in hash sharding #67865

Merged
merged 2 commits into from
Aug 21, 2021

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Jul 21, 2021

This new builtin function encodes datums using the key encoding. It is useful
in particular because it is a function with immutable volatility from a large
number of data types to bytes.

The function is then adopted as the default expression for hash sharded indexes if the version is new enough.

Fixes #67170.

Release note (sql change): Added a builtin function,
crdb_internal.datums_to_bytes, which can encode any data type which can be
used in an forward index key into bytes in an immutable way. This function
is now used in the expression for hash sharded indexes.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner
Copy link
Contributor Author

@RaduBerinde this PR has the changes. Ideally that big select in execbuilder/testdata/hash_sharded_index would be a point lookup.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/create_table.go, line 2716 at r2 (raw file):

}

func makeHashShardComputeExpr(colNames []string, buckets int) *string {

[nit] probably on your TODO list but this should have a comment with sample expressions


pkg/sql/create_table.go, line 2752 at r2 (raw file):

		}
	}
	// Construct an expression which is the sum of all of the casted and hashed

[nit] comment needs updating ("casted")


pkg/sql/create_table.go, line 2761 at r2 (raw file):

			expr = &tree.BinaryExpr{
				Left:     modBuckets(hashedColumnExpr(i)),
				Operator: tree.MakeBinaryOperator(tree.Plus),

Would it be easier to just do fnv32(encode_key(col1, col2, col3)) % buckets?


pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

		// TODO(ajwerner,RaduBerinde): Deal with the cases like
		// `crdb_internal.key_encode(f4+1)`.
		if funcExpr, ok := e.(*FunctionExpr); ok && funcExpr.Properties.CompositeInsensitive {

This can be simplified to just make sure that checkTypes is false below when funcExpr.Properties.CompositeInsensitive = true. The code below already checks that the inputs are not sensitive.


pkg/sql/sem/builtins/builtins.go, line 3548 at r1 (raw file):

			},
			Volatility: tree.VolatilityImmutable,
			Info: "Converts datums into key-encoded bytes. " +

Not sure if it should be here or in a comment above the built-in but we should mention the property key_encode(a) == key_encode(b) iff (a IS NOT DISTINCT FROM b)


pkg/sql/sem/builtins/key_encode_builtin_test.go, line 1 at r1 (raw file):

package builtins_test

[nit] missing header


pkg/sql/sem/builtins/key_encode_builtin_test.go, line 24 at r1 (raw file):

// can cross product them all and we get different values which align to the uniformity
// expectations.
func TestCrdbInternalKeyEncode(t *testing.T) {

I don't quite get what this test is trying to achieve, it seems overly complicated. Is it essentially just cross-checking the distinct count against the distinct key count? Wouldn't we get the same or better by generating a random values clause and doing something like

with v as (values (1),(2),(3))
  select (select count(distinct v.*) from v) - (select count(distinct key_encode(v.*)) from v);

pkg/sql/sem/builtins/key_encode_builtin_test.go, line 73 at r1 (raw file):

	// Generate a query to cross join all of the values of the two rows
	// with also the column ordinal between them.
	crossJoinQuery := func() string {

[nit] would help to show a sample query


pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

	HasSequenceArguments bool

	// CompositeInsensitive indicates that this function does not care about

CompositeInsensitive indicates that this function returns equal results when evaluated on equal inputs. This is a non-trivial property for composite types which can be equal but not identical (e.g. decimals 1.0 and 1.00). For example, converting a decimal to string is not CompositeInsensitive.

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, RaduBerinde wrote…

Would it be easier to just do fnv32(encode_key(col1, col2, col3)) % buckets?

I was looking at the distribution of values and startled by it which lead me down a small rabbit hole. Consider the test I wrote with the big cross join of two rows. All of the rows end up being unique and constructing unique keys using encode_key. What startled me is that it seemed that no matter how I ended up randomizing the inputs, the bucket counts were perfectly equal. That alarmed me. It got me thinking that fnv32 must have strong correlations that are not being taken into account.

To make this concrete, let's imagine the following for the one non-nil row

	tdb.Exec(t, `INSERT
	INTO t
	SELECT mod(i, 1<<15)::INT2,
		i::INT4,
		i::INT8,
		i::FLOAT4,
		i::FLOAT8,
		i::STRING,
		i::CHAR,
		i::STRING::BYTES,
		i::DECIMAL,
		i::INTERVAL,
		i::OID,
		i::TIMESTAMPTZ,
		i::TIMESTAMP,
		i::DATE,
		('127.0.0.' || mod(i, 256)::STRING)::INET,
		i::VARBIT,
		ARRAY[i::STRING],
		ARRAY[i]
	FROM (
		SELECT $1::INT AS i
	);`, rand.Intn(math.MaxInt32))

If I do this query with %s as the cross join (including or not! the column ordinals), I always find the following distribution

SELECT k, count(k) AS c
                FROM (
                        SELECTmod(fnv32(crdb_internal.key_encode(input.*)), 8) AS k
                          FROM (%s) AS input
                     )
            GROUP BY k
              ORDER BY k
        0, 32768
        1, 32768
        2, 32768
        3, 32768
        4, 32768
        5, 32768
        6, 32768
        7, 32768

That scared me, I'd expect it to be independent. In fact, most formulations that don't involve the previous step in the hash of the next step don't end up being independent (the current formula is just as broken). If you use fnv64, it's a bit better but still not perfect. If you use sha256 and then fnv32 then you get something that feels truly independent. Also, if you incorporate the result of the previous step in the current step, then it feels independent. I don't know the right tradeoff or the math here.

So, here's what I'm thinking for the two options:

SELECT mod(fnv32(sha256(crdb_internal.key_encode(a.i2, 0, b.i4, 1, c.i8, 2, d.f4, 3, e.f8, 4, f.s, 5, g.c, 6, h.b, 7, i.dc, 8, j.ival, 9, k.oid, 10, l.tstz, 11, m.ts, 12, n.da, 13, o.inet, 14, p.vb, 15, q.sa, 16, r.ia, 17))), 8) 
FROM t a, t b, t c, t d, t e, t f, t g, t h, t i, t j, t k, t l, t m, t n, t o, t p, t q, t r;
SELECT
  *
FROM
  (
    SELECT
      mod(
        fnv32(
          crdb_internal.key_encode(
            a.i2,
            0,
            fnv32(
              crdb_internal.key_encode(
                b.i4,
                1,
                fnv32(
                  crdb_internal.key_encode(
                    c.i8,
                    2,
                    fnv32(
                      crdb_internal.key_encode(
                        d.f4,
                        3,
                        fnv32(
                          crdb_internal.key_encode(
                            e.f8,
                            4,
                            fnv32(
                              crdb_internal.key_encode(
                                f.s,
                                5,
                                fnv32(
                                  crdb_internal.key_encode(
                                    g.c,
                                    6,
                                    fnv32(
                                      crdb_internal.key_encode(
                                        h.b,
                                        7,
                                        fnv32(
                                          crdb_internal.key_encode(
                                            i.dc,
                                            8,
                                            fnv32(
                                              crdb_internal.key_encode(
                                                j.ival,
                                                9,
                                                fnv32(
                                                  crdb_internal.key_encode(
                                                    k.oid,
                                                    10,
                                                    fnv32(
                                                      crdb_internal.key_encode(
                                                        l.tstz,
                                                        11,
                                                        fnv32(
                                                          crdb_internal.key_encode(
                                                            m.ts,
                                                            12,
                                                            fnv32(
                                                              crdb_internal.key_encode(
                                                                n.da,
                                                                13,
                                                                fnv32(
                                                                  crdb_internal.key_encode(
                                                                    o.inet,
                                                                    14,
                                                                    fnv32(crdb_internal.key_encode(p.vb, 15, fnv32(crdb_internal.key_encode(q.sa, 16, fnv32(crdb_internal.key_encode(r.ia, 17))))))
                                                                  )
                                                                )
                                                              )
                                                            )
                                                          )
                                                        )
                                                      )
                                                    )
                                                  )
                                                )
                                              )
                                            )
                                          )
                                        )
                                      )
                                    )
                                  )
                                )
                              )
                            )
                          )
                        )
                      )
                    )
                  )
                )
              )
            )
          )
        ),
        8
      )
    FROM
      t AS a, t AS b, t AS c, t AS d, t AS e, t AS f, t AS g, t AS h, t AS i, t AS j, t AS k, t AS l, t AS m, t AS n, t AS o, t AS p, t AS q, t AS r
  );

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, ajwerner wrote…

I was looking at the distribution of values and startled by it which lead me down a small rabbit hole. Consider the test I wrote with the big cross join of two rows. All of the rows end up being unique and constructing unique keys using encode_key. What startled me is that it seemed that no matter how I ended up randomizing the inputs, the bucket counts were perfectly equal. That alarmed me. It got me thinking that fnv32 must have strong correlations that are not being taken into account.

To make this concrete, let's imagine the following for the one non-nil row

	tdb.Exec(t, `INSERT
	INTO t
	SELECT mod(i, 1<<15)::INT2,
		i::INT4,
		i::INT8,
		i::FLOAT4,
		i::FLOAT8,
		i::STRING,
		i::CHAR,
		i::STRING::BYTES,
		i::DECIMAL,
		i::INTERVAL,
		i::OID,
		i::TIMESTAMPTZ,
		i::TIMESTAMP,
		i::DATE,
		('127.0.0.' || mod(i, 256)::STRING)::INET,
		i::VARBIT,
		ARRAY[i::STRING],
		ARRAY[i]
	FROM (
		SELECT $1::INT AS i
	);`, rand.Intn(math.MaxInt32))

If I do this query with %s as the cross join (including or not! the column ordinals), I always find the following distribution

SELECT k, count(k) AS c
                FROM (
                        SELECTmod(fnv32(crdb_internal.key_encode(input.*)), 8) AS k
                          FROM (%s) AS input
                     )
            GROUP BY k
              ORDER BY k
        0, 32768
        1, 32768
        2, 32768
        3, 32768
        4, 32768
        5, 32768
        6, 32768
        7, 32768

That scared me, I'd expect it to be independent. In fact, most formulations that don't involve the previous step in the hash of the next step don't end up being independent (the current formula is just as broken). If you use fnv64, it's a bit better but still not perfect. If you use sha256 and then fnv32 then you get something that feels truly independent. Also, if you incorporate the result of the previous step in the current step, then it feels independent. I don't know the right tradeoff or the math here.

So, here's what I'm thinking for the two options:

SELECT mod(fnv32(sha256(crdb_internal.key_encode(a.i2, 0, b.i4, 1, c.i8, 2, d.f4, 3, e.f8, 4, f.s, 5, g.c, 6, h.b, 7, i.dc, 8, j.ival, 9, k.oid, 10, l.tstz, 11, m.ts, 12, n.da, 13, o.inet, 14, p.vb, 15, q.sa, 16, r.ia, 17))), 8) 
FROM t a, t b, t c, t d, t e, t f, t g, t h, t i, t j, t k, t l, t m, t n, t o, t p, t q, t r;
SELECT
  *
FROM
  (
    SELECT
      mod(
        fnv32(
          crdb_internal.key_encode(
            a.i2,
            0,
            fnv32(
              crdb_internal.key_encode(
                b.i4,
                1,
                fnv32(
                  crdb_internal.key_encode(
                    c.i8,
                    2,
                    fnv32(
                      crdb_internal.key_encode(
                        d.f4,
                        3,
                        fnv32(
                          crdb_internal.key_encode(
                            e.f8,
                            4,
                            fnv32(
                              crdb_internal.key_encode(
                                f.s,
                                5,
                                fnv32(
                                  crdb_internal.key_encode(
                                    g.c,
                                    6,
                                    fnv32(
                                      crdb_internal.key_encode(
                                        h.b,
                                        7,
                                        fnv32(
                                          crdb_internal.key_encode(
                                            i.dc,
                                            8,
                                            fnv32(
                                              crdb_internal.key_encode(
                                                j.ival,
                                                9,
                                                fnv32(
                                                  crdb_internal.key_encode(
                                                    k.oid,
                                                    10,
                                                    fnv32(
                                                      crdb_internal.key_encode(
                                                        l.tstz,
                                                        11,
                                                        fnv32(
                                                          crdb_internal.key_encode(
                                                            m.ts,
                                                            12,
                                                            fnv32(
                                                              crdb_internal.key_encode(
                                                                n.da,
                                                                13,
                                                                fnv32(
                                                                  crdb_internal.key_encode(
                                                                    o.inet,
                                                                    14,
                                                                    fnv32(crdb_internal.key_encode(p.vb, 15, fnv32(crdb_internal.key_encode(q.sa, 16, fnv32(crdb_internal.key_encode(r.ia, 17))))))
                                                                  )
                                                                )
                                                              )
                                                            )
                                                          )
                                                        )
                                                      )
                                                    )
                                                  )
                                                )
                                              )
                                            )
                                          )
                                        )
                                      )
                                    )
                                  )
                                )
                              )
                            )
                          )
                        )
                      )
                    )
                  )
                )
              )
            )
          )
        ),
        8
      )
    FROM
      t AS a, t AS b, t AS c, t AS d, t AS e, t AS f, t AS g, t AS h, t AS i, t AS j, t AS k, t AS l, t AS m, t AS n, t AS o, t AS p, t AS q, t AS r
  );

In that example, we have two random rows in t (generated with that Exec statement) and we're mixing and matching all combinations?

When you say "That scared me, I'd expect it to be independent", what do you mean? Just that the counts shouldn't look so perfect?

What feels wrong is that we do mod 8, so we only look at the last few bits of the fnv32, which is not great for a non-cryptographic hash. Using a prime number for the buckets would fix that. I think we could also mod against a bigger prime number before modding down to 8, or shuffle the bits somehow.

I think this can be a problem even for a single column, though I can't find an example.

My concern with your first suggestion is that sha is fairly expensive to calculate. And the second suggestion feels like its obfuscating the underlying problem but not solving it entirely.

@RaduBerinde
Copy link
Member


pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, RaduBerinde wrote…

In that example, we have two random rows in t (generated with that Exec statement) and we're mixing and matching all combinations?

When you say "That scared me, I'd expect it to be independent", what do you mean? Just that the counts shouldn't look so perfect?

What feels wrong is that we do mod 8, so we only look at the last few bits of the fnv32, which is not great for a non-cryptographic hash. Using a prime number for the buckets would fix that. I think we could also mod against a bigger prime number before modding down to 8, or shuffle the bits somehow.

I think this can be a problem even for a single column, though I can't find an example.

My concern with your first suggestion is that sha is fairly expensive to calculate. And the second suggestion feels like its obfuscating the underlying problem but not solving it entirely.

I guess this example kind-of illustrates your finding:

root@127.185.179.142:26257/defaultdb> select *, crdb_internal.key_encode(input.*), fnv32(crdb_internal.key_encode(input.*)), fnv32(crdb_internal.key_encode(input.*)) % 8 from (select t1.a, t2.b, t3.c, t4.d from t t1, t
 t2, t t3, t t4) as input;
  a | b | c | d |                  crdb_internal.key_encode                   |   fnv32    | ?column?
----+---+---+---+-------------------------------------------------------------+------------+-----------
  1 | 1 | 1 | 1 | \211\005?\360\000\000\000\000\000\000\0221\000\001*\002\000 | 2651793876 |        4
  1 | 1 | 1 | 5 | \211\005?\360\000\000\000\000\000\000\0221\000\001*\012\000 | 2517572924 |        4
  1 | 1 | 5 | 1 | \211\005?\360\000\000\000\000\000\000\0225\000\001*\002\000 | 3622176784 |        0
  1 | 1 | 5 | 5 | \211\005?\360\000\000\000\000\000\000\0225\000\001*\012\000 | 3756397736 |        0
  1 | 5 | 1 | 1 | \211\005@\024\000\000\000\000\000\000\0221\000\001*\002\000 |   13520939 |        3
  1 | 5 | 1 | 5 | \211\005@\024\000\000\000\000\000\000\0221\000\001*\012\000 | 4174267283 |        3
  1 | 5 | 5 | 1 | \211\005@\024\000\000\000\000\000\000\0225\000\001*\002\000 |  983903847 |        7
  1 | 5 | 5 | 5 | \211\005@\024\000\000\000\000\000\000\0225\000\001*\012\000 |  849682895 |        7
  5 | 1 | 1 | 1 | \215\005?\360\000\000\000\000\000\000\0221\000\001*\002\000 | 3370022656 |        0
  5 | 1 | 1 | 5 | \215\005?\360\000\000\000\000\000\000\0221\000\001*\012\000 | 3504243608 |        0
  5 | 1 | 5 | 1 | \215\005?\360\000\000\000\000\000\000\0225\000\001*\002\000 | 2378851396 |        4
  5 | 1 | 5 | 5 | \215\005?\360\000\000\000\000\000\000\0225\000\001*\012\000 | 2244630444 |        4
  5 | 5 | 1 | 1 | \215\005@\024\000\000\000\000\000\000\0221\000\001*\002\000 | 3290441751 |        7
  5 | 5 | 1 | 5 | \215\005@\024\000\000\000\000\000\000\0221\000\001*\012\000 | 3156220799 |        7
  5 | 5 | 5 | 1 | \215\005@\024\000\000\000\000\000\000\0225\000\001*\002\000 | 2299270491 |        3
  5 | 5 | 5 | 5 | \215\005@\024\000\000\000\000\000\000\0225\000\001*\012\000 | 2165049539 |        3

Note how changing the last value doesn't change the bucket, even though the overall fnv value changes. I think it's probably possible that, given the very specific way in which these rows are constructed, for any row you can find 32768 other rows where the values that are different don't affect the lower 8 bits of the fnv. Or something along those lines.

Not sure how much of an alarm this whole thing is.. It would be more alarming if we could find data where the buckets are super skewed.

@RaduBerinde
Copy link
Member


pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, RaduBerinde wrote…

I guess this example kind-of illustrates your finding:

root@127.185.179.142:26257/defaultdb> select *, crdb_internal.key_encode(input.*), fnv32(crdb_internal.key_encode(input.*)), fnv32(crdb_internal.key_encode(input.*)) % 8 from (select t1.a, t2.b, t3.c, t4.d from t t1, t
 t2, t t3, t t4) as input;
  a | b | c | d |                  crdb_internal.key_encode                   |   fnv32    | ?column?
----+---+---+---+-------------------------------------------------------------+------------+-----------
  1 | 1 | 1 | 1 | \211\005?\360\000\000\000\000\000\000\0221\000\001*\002\000 | 2651793876 |        4
  1 | 1 | 1 | 5 | \211\005?\360\000\000\000\000\000\000\0221\000\001*\012\000 | 2517572924 |        4
  1 | 1 | 5 | 1 | \211\005?\360\000\000\000\000\000\000\0225\000\001*\002\000 | 3622176784 |        0
  1 | 1 | 5 | 5 | \211\005?\360\000\000\000\000\000\000\0225\000\001*\012\000 | 3756397736 |        0
  1 | 5 | 1 | 1 | \211\005@\024\000\000\000\000\000\000\0221\000\001*\002\000 |   13520939 |        3
  1 | 5 | 1 | 5 | \211\005@\024\000\000\000\000\000\000\0221\000\001*\012\000 | 4174267283 |        3
  1 | 5 | 5 | 1 | \211\005@\024\000\000\000\000\000\000\0225\000\001*\002\000 |  983903847 |        7
  1 | 5 | 5 | 5 | \211\005@\024\000\000\000\000\000\000\0225\000\001*\012\000 |  849682895 |        7
  5 | 1 | 1 | 1 | \215\005?\360\000\000\000\000\000\000\0221\000\001*\002\000 | 3370022656 |        0
  5 | 1 | 1 | 5 | \215\005?\360\000\000\000\000\000\000\0221\000\001*\012\000 | 3504243608 |        0
  5 | 1 | 5 | 1 | \215\005?\360\000\000\000\000\000\000\0225\000\001*\002\000 | 2378851396 |        4
  5 | 1 | 5 | 5 | \215\005?\360\000\000\000\000\000\000\0225\000\001*\012\000 | 2244630444 |        4
  5 | 5 | 1 | 1 | \215\005@\024\000\000\000\000\000\000\0221\000\001*\002\000 | 3290441751 |        7
  5 | 5 | 1 | 5 | \215\005@\024\000\000\000\000\000\000\0221\000\001*\012\000 | 3156220799 |        7
  5 | 5 | 5 | 1 | \215\005@\024\000\000\000\000\000\000\0225\000\001*\002\000 | 2299270491 |        3
  5 | 5 | 5 | 5 | \215\005@\024\000\000\000\000\000\000\0225\000\001*\012\000 | 2165049539 |        3

Note how changing the last value doesn't change the bucket, even though the overall fnv value changes. I think it's probably possible that, given the very specific way in which these rows are constructed, for any row you can find 32768 other rows where the values that are different don't affect the lower 8 bits of the fnv. Or something along those lines.

Not sure how much of an alarm this whole thing is.. It would be more alarming if we could find data where the buckets are super skewed.

*lower 3 bits of the fnv

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, RaduBerinde wrote…

*lower 3 bits of the fnv

Given all of this and efficiency considerations, I'm thinking about just adding a different builtin which takes the datums and hashes them but chains the hash values together to make things . Would such a function upset you? I can also be convinced to stop caring.

	var h fnv.New32()
        var buf []byte
	for i, datum := range args {
                buf = buf[:0]
                var err error
                buf, err = rowenc.EncodeTableKey(out, datum, encoding.Ascending)
                if err != nil { return err }
		_, err := h.Write(buf)
                if err != nil { ... }
                var scratch [4]byte
                binary.PutUvarint(scratch, i) // hash the index hash value to make sequences of the same value independent
                _, err := h.Write(scratch[:])
                if err != nil { ... }
                binary.PutUvarint(scratch, h.Sum32()) // hash the current hash value to increase the data independence
                _, err := h.Write(scratch[:])
                if err != nil { ... }
	}

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

This can be simplified to just make sure that checkTypes is false below when funcExpr.Properties.CompositeInsensitive = true. The code below already checks that the inputs are not sensitive.

I don't think that's quite true. The Child(0) of FunctionExpr is its arg list. I think I could do the following but something about it felt icky at the time.

checkTypes := !opt.IsCompositeInsensitiveOp(e)
if funcExpr, ok := e.(*FunctionExpr); ok && funcExpr.Properties.CompositeInsensitive {
    checkTypes = false
    e = funcExpr.Args
}

pkg/sql/sem/builtins/key_encode_builtin_test.go, line 24 at r1 (raw file):

Previously, RaduBerinde wrote…

I don't quite get what this test is trying to achieve, it seems overly complicated. Is it essentially just cross-checking the distinct count against the distinct key count? Wouldn't we get the same or better by generating a random values clause and doing something like

with v as (values (1),(2),(3))
  select (select count(distinct v.*) from v) - (select count(distinct key_encode(v.*)) from v);

Sure, I was mostly just testing that all the data types worked and that they NULLs worked and then got too clever. I'll take another pass.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, ajwerner wrote…

Given all of this and efficiency considerations, I'm thinking about just adding a different builtin which takes the datums and hashes them but chains the hash values together to make things . Would such a function upset you? I can also be convinced to stop caring.

	var h fnv.New32()
        var buf []byte
	for i, datum := range args {
                buf = buf[:0]
                var err error
                buf, err = rowenc.EncodeTableKey(out, datum, encoding.Ascending)
                if err != nil { return err }
		_, err := h.Write(buf)
                if err != nil { ... }
                var scratch [4]byte
                binary.PutUvarint(scratch, i) // hash the index hash value to make sequences of the same value independent
                _, err := h.Write(scratch[:])
                if err != nil { ... }
                binary.PutUvarint(scratch, h.Sum32()) // hash the current hash value to increase the data independence
                _, err := h.Write(scratch[:])
                if err != nil { ... }
	}

In general I'm opposed to coming up with this kind of one-off supposed improvements, unless it is following some standard practice or has clear math/reasoning behind it. It's a bit like trying to come up with your own security scheme.

Passing the current hash value through the hasher makes no sense to me. What "data independence" does it increase? It's mathematically equivalent to applying this obscure, static uint32->uint32 function to the current hash. At best, it's a bijective function and you effectively did nothing. At worst, it's collapsing different hashes into the same hash, making things worse.

Also don't see a benefit to hashing the index value (given that the concatenated key is such that you can split it into pieces without ambiguity). I don't know what "make sequences of the same value independent" means.

To me the legit concept is that we take what would be the key for a non-sharded index and hash that. It doesn't get cleaner than that conceptually - it's equivalent to having a hash-based KV store, which is what this is trying to simulate in the first place. I have no objection to a built-in that does that (to save on holding the entire key in memory).

@RaduBerinde
Copy link
Member


pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, ajwerner wrote…

I don't think that's quite true. The Child(0) of FunctionExpr is its arg list. I think I could do the following but something about it felt icky at the time.

checkTypes := !opt.IsCompositeInsensitiveOp(e)
if funcExpr, ok := e.(*FunctionExpr); ok && funcExpr.Properties.CompositeInsensitive {
    checkTypes = false
    e = funcExpr.Args
}

So? It's a recursive function..

@RaduBerinde
Copy link
Member


pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

So? It's a recursive function..

Ah, maybe ScalarListOp is not marked as insensitive, we should fix that

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, RaduBerinde wrote…

In general I'm opposed to coming up with this kind of one-off supposed improvements, unless it is following some standard practice or has clear math/reasoning behind it. It's a bit like trying to come up with your own security scheme.

Passing the current hash value through the hasher makes no sense to me. What "data independence" does it increase? It's mathematically equivalent to applying this obscure, static uint32->uint32 function to the current hash. At best, it's a bijective function and you effectively did nothing. At worst, it's collapsing different hashes into the same hash, making things worse.

Also don't see a benefit to hashing the index value (given that the concatenated key is such that you can split it into pieces without ambiguity). I don't know what "make sequences of the same value independent" means.

To me the legit concept is that we take what would be the key for a non-sharded index and hash that. It doesn't get cleaner than that conceptually - it's equivalent to having a hash-based KV store, which is what this is trying to simulate in the first place. I have no objection to a built-in that does that (to save on holding the entire key in memory).

I don't have a formalism about the "independence" so I'll stop worrying about it. Anecdotally I observe that if you feed the hash value back in that two rows seem to be no longer related. I've been convinced that I don't care about that. I also don't care much about the buffering. Thanks for all the engagement, I've been convinced to stop caring.


pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

Ah, maybe ScalarListOp is not marked as insensitive, we should fix that

The value of checkTypes is not passed in recursively (for good reason). The ScalarListExpr that is Args will have its types checked, right?

@RaduBerinde
Copy link
Member


pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, ajwerner wrote…

The value of checkTypes is not passed in recursively (for good reason). The ScalarListExpr that is Args will have its types checked, right?

Not if it (the ScalarListOp) is marked as an insensitive op. We are checking that all paths to variable leaves only pass through composite-insensitive ops.

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

Not if it (the ScalarListOp) is marked as an insensitive op. We are checking that all paths to variable leaves only pass through composite-insensitive ops.

If ScalarListOp were marked as an insensitive op then wouldn't that mean that function invocations be treated as insensitive? At that point anything related to the function properties would be irrelevant.

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, ajwerner wrote…

If ScalarListOp were marked as an insensitive op then wouldn't that mean that function invocations be treated as insensitive? At that point anything related to the function properties would be irrelevant.

An example of a function that, I think, should be composite sensitive is json_build_object. I think that that thing already has some problems:

demo@127.0.0.1:52608/movr> SELECT json_build_object('a', -0.0);
  json_build_object
---------------------
  {"a": 0.0}
(1 row)


Time: 0ms total (execution 0ms / network 0ms)

demo@127.0.0.1:52608/movr> SELECT json_build_object('a', -0);
  json_build_object
---------------------
  {"a": 0}
(1 row)


demo@127.0.0.1:52608/movr> SELECT (json_build_object('a', '-0.0'::JSON)->'a')::FLOAT;
  float8
----------
      -0
(1 row)

@RaduBerinde
Copy link
Member


pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, ajwerner wrote…

An example of a function that, I think, should be composite sensitive is json_build_object. I think that that thing already has some problems:

demo@127.0.0.1:52608/movr> SELECT json_build_object('a', -0.0);
  json_build_object
---------------------
  {"a": 0.0}
(1 row)


Time: 0ms total (execution 0ms / network 0ms)

demo@127.0.0.1:52608/movr> SELECT json_build_object('a', -0);
  json_build_object
---------------------
  {"a": 0}
(1 row)


demo@127.0.0.1:52608/movr> SELECT (json_build_object('a', '-0.0'::JSON)->'a')::FLOAT;
  float8
----------
      -0
(1 row)

ScalarListOp is "under" the function. The function would be checked separately.

I believe that what you're discovering above is that you can't simply use -0 in a statement to create a -0 float (and I believe that's consistent with pg).

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

ScalarListOp is "under" the function. The function would be checked separately.

I believe that what you're discovering above is that you can't simply use -0 in a statement to create a -0 float (and I believe that's consistent with pg).

Sorry for the confusion. What I was missing was that getOuterCols was also recursive. I think that that makes this check N^2 in the depth of the expression but yes, if we also mark ScalarListOp as IsCompositeInsensitiveOp then it all works out. Thanks!

@RaduBerinde
Copy link
Member


pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, ajwerner wrote…

Sorry for the confusion. What I was missing was that getOuterCols was also recursive. I think that that makes this check N^2 in the depth of the expression but yes, if we also mark ScalarListOp as IsCompositeInsensitiveOp then it all works out. Thanks!

Good point, maybe leave a TODO in there for the quadratic behavior.

@RaduBerinde
Copy link
Member

I noticed there already is a crdb_internal.encode_key. I don't think we can use that because it's meant to encode a full key, but we should rename the new one so there's no confusion between the two.

@ajwerner ajwerner force-pushed the ajwerner/key-encoding-builtin branch 3 times, most recently from 4279e15 to ca026e9 Compare August 19, 2021 15:29
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/sql/create_table.go, line 2569 at r3 (raw file):

// makeShardColumnDesc returns a new column descriptor for a hidden computed shard column
// based on all the `colNames`.
func makeShardColumnDesc(

[nit] this deserves a comment explaining the argument, maybe pointing to the two make*ComputeExpr for details.


pkg/sql/create_table.go, line 2625 at r3 (raw file):

			Exprs: tree.Exprs{
				expr,
				&tree.CastExpr{

Is this cast necessary? Maybe we can just use tree.AdjustValueToType instead? Though I think we should error out much earlier if the value is too large.

@ajwerner ajwerner force-pushed the ajwerner/key-encoding-builtin branch 2 times, most recently from b57a6bc to 94587d9 Compare August 19, 2021 20:04
@ajwerner ajwerner changed the title [DNM] sql,sem/builtins: add crdb_internal.key_encode, use for hash sharded indexes sql/sem/builtins: add crdb_internal.datums_to_bytes, use in hash sharding Aug 19, 2021
@ajwerner ajwerner marked this pull request as ready for review August 19, 2021 20:05
@ajwerner ajwerner requested a review from a team as a code owner August 19, 2021 20:05
@ajwerner ajwerner requested a review from a team August 19, 2021 20:05
@ajwerner ajwerner requested a review from a team as a code owner August 19, 2021 20:05
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)


pkg/sql/create_table.go, line 2716 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] probably on your TODO list but this should have a comment with sample expressions

Done.


pkg/sql/create_table.go, line 2569 at r3 (raw file):

Previously, RaduBerinde wrote…

[nit] this deserves a comment explaining the argument, maybe pointing to the two make*ComputeExpr for details.

Done.


pkg/sql/create_table.go, line 2625 at r3 (raw file):

Previously, RaduBerinde wrote…

Is this cast necessary? Maybe we can just use tree.AdjustValueToType instead? Though I think we should error out much earlier if the value is too large.

Removed. Also added a check and test.


pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

Good point, maybe leave a TODO in there for the quadratic behavior.

This is now fixed 🎉


pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

Previously, RaduBerinde wrote…

CompositeInsensitive indicates that this function returns equal results when evaluated on equal inputs. This is a non-trivial property for composite types which can be equal but not identical (e.g. decimals 1.0 and 1.00). For example, converting a decimal to string is not CompositeInsensitive.

Removed as it's not needed anymore.


pkg/sql/sem/builtins/key_encode_builtin_test.go, line 1 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] missing header

Done.


pkg/sql/sem/builtins/key_encode_builtin_test.go, line 24 at r1 (raw file):

Previously, ajwerner wrote…

Sure, I was mostly just testing that all the data types worked and that they NULLs worked and then got too clever. I'll take another pass.

Cleaned it up roughly as you suggested.


pkg/sql/sem/builtins/key_encode_builtin_test.go, line 73 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] would help to show a sample query

Done.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)


pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

Previously, ajwerner wrote…

Removed as it's not needed anymore.

Hm, how come? I don't think we can infer a hash value for composite types without it..

@RaduBerinde
Copy link
Member


pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

Previously, RaduBerinde wrote…

Hm, how come? I don't think we can infer a hash value for composite types without it..

I think you might have encountered a bug. You made the scalar list composite insensitive and we're probably hitting this code:

		if opt.IsCompositeInsensitiveOp(e) {
			// The operator is known to be composite-insensitive. If its result is a
			// non-composite type, it is also composite-identical.
			return true, !colinfo.HasCompositeKeyEncoding(e.(opt.ScalarExpr).DataType())
		}

The data type of scalar list is Any and it's not handled. There are others like Tuple that aren't handled. We should change HasCompositeKeyEncoding to CanHaveCompositeKeyEncoding and have the default return value for unhandled types be true.

@ajwerner ajwerner force-pushed the ajwerner/key-encoding-builtin branch 2 times, most recently from 4b9db56 to a6ca290 Compare August 19, 2021 22:19
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)


pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

Previously, RaduBerinde wrote…

I think you might have encountered a bug. You made the scalar list composite insensitive and we're probably hitting this code:

		if opt.IsCompositeInsensitiveOp(e) {
			// The operator is known to be composite-insensitive. If its result is a
			// non-composite type, it is also composite-identical.
			return true, !colinfo.HasCompositeKeyEncoding(e.(opt.ScalarExpr).DataType())
		}

The data type of scalar list is Any and it's not handled. There are others like Tuple that aren't handled. We should change HasCompositeKeyEncoding to CanHaveCompositeKeyEncoding and have the default return value for unhandled types be true.

Alrighty, it's back and it's now confirmed needed. See how you feel about the new logic for CanHaveCompositeKeyEncoding.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)


pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

Previously, ajwerner wrote…

Alrighty, it's back and it's now confirmed needed. See how you feel about the new logic for CanHaveCompositeKeyEncoding.

👍 👍 looks great!

@ajwerner ajwerner force-pushed the ajwerner/key-encoding-builtin branch 3 times, most recently from 5c08eab to e21e00e Compare August 20, 2021 03:39
@ajwerner
Copy link
Contributor Author

@RaduBerinde anything else you want on this?

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: Just wrap it with a red bow!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

@ajwerner ajwerner force-pushed the ajwerner/key-encoding-builtin branch from e21e00e to 7bdae14 Compare August 20, 2021 17:51
This patch also fixes the behavior to properly catagorize types like any
and unknown. It does so by adding all of the families and setting the default
to be true instead of the prior false.

Release note: None
This new builtin function encodes datums using the key encoding. It is useful
in particular because it is a function with immutable volatility from a large
number of data types to bytes.

Release note (sql change): Added a builtin function,
crdb_internal.datums_to_bytes, which can encode any data type which can be
used in an forward index key into bytes in an immutable way. This function
is now used in the expression for hash sharded indexes.
@ajwerner ajwerner force-pushed the ajwerner/key-encoding-builtin branch from 7bdae14 to f172756 Compare August 21, 2021 12:32
@ajwerner
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 21, 2021

Build succeeded:

@craig craig bot merged commit d18da6c into cockroachdb:master Aug 21, 2021
michae2 added a commit to michae2/cockroach that referenced this pull request Nov 2, 2022
We've discovered a case where a combination of non-random data and a
power-of-two number of buckets causes an uneven distribution of rows
in a hash-sharded index. While this specific case is very contrived, it
illustrates a small weakness in the current hash-shard calculation:
modulo by a power-of-two number of buckets only uses the last few bits
of the hash value.

Radu suggested this would be a problem in cockroachdb#67865 and also suggested a
fix: add an intermediate modulo by a larger prime before modulo by num
buckets, so we'll try that.

Fixes: cockroachdb#91109

Epic: None

Release note (performance improvement): This change updates the shard
calculation of newly-created hash-sharded indexes so that uneven
distributions of rows are less likely.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SELECT from a table with hash sharded PK has unexpected index join in query plan
3 participants