sql/sem/builtins: add crdb_internal.datums_to_bytes, use in hash sharding #67865

ajwerner · 2021-07-21T16:56:50Z

This new builtin function encodes datums using the key encoding. It is useful
in particular because it is a function with immutable volatility from a large
number of data types to bytes.

The function is then adopted as the default expression for hash sharded indexes if the version is new enough.

Fixes #67170.

Release note (sql change): Added a builtin function,
crdb_internal.datums_to_bytes, which can encode any data type which can be
used in an forward index key into bytes in an immutable way. This function
is now used in the expression for hash sharded indexes.

cockroach-teamcity · 2021-07-21T16:56:57Z

This change is

ajwerner · 2021-07-21T22:54:25Z

@RaduBerinde this PR has the changes. Ideally that big select in execbuilder/testdata/hash_sharded_index would be a point lookup.

RaduBerinde

Nice!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

pkg/sql/create_table.go, line 2716 at r2 (raw file):

}

func makeHashShardComputeExpr(colNames []string, buckets int) *string {

[nit] probably on your TODO list but this should have a comment with sample expressions

pkg/sql/create_table.go, line 2752 at r2 (raw file):

		}
	}
	// Construct an expression which is the sum of all of the casted and hashed

[nit] comment needs updating ("casted")

pkg/sql/create_table.go, line 2761 at r2 (raw file):

			expr = &tree.BinaryExpr{
				Left:     modBuckets(hashedColumnExpr(i)),
				Operator: tree.MakeBinaryOperator(tree.Plus),

Would it be easier to just do fnv32(encode_key(col1, col2, col3)) % buckets?

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

		// TODO(ajwerner,RaduBerinde): Deal with the cases like
		// `crdb_internal.key_encode(f4+1)`.
		if funcExpr, ok := e.(*FunctionExpr); ok && funcExpr.Properties.CompositeInsensitive {

This can be simplified to just make sure that checkTypes is false below when funcExpr.Properties.CompositeInsensitive = true. The code below already checks that the inputs are not sensitive.

pkg/sql/sem/builtins/builtins.go, line 3548 at r1 (raw file):

			},
			Volatility: tree.VolatilityImmutable,
			Info: "Converts datums into key-encoded bytes. " +

Not sure if it should be here or in a comment above the built-in but we should mention the property key_encode(a) == key_encode(b) iff (a IS NOT DISTINCT FROM b)

pkg/sql/sem/builtins/key_encode_builtin_test.go, line 1 at r1 (raw file):

package builtins_test

[nit] missing header

pkg/sql/sem/builtins/key_encode_builtin_test.go, line 24 at r1 (raw file):

// can cross product them all and we get different values which align to the uniformity
// expectations.
func TestCrdbInternalKeyEncode(t *testing.T) {

I don't quite get what this test is trying to achieve, it seems overly complicated. Is it essentially just cross-checking the distinct count against the distinct key count? Wouldn't we get the same or better by generating a random values clause and doing something like

with v as (values (1),(2),(3))
  select (select count(distinct v.*) from v) - (select count(distinct key_encode(v.*)) from v);

pkg/sql/sem/builtins/key_encode_builtin_test.go, line 73 at r1 (raw file):

	// Generate a query to cross join all of the values of the two rows
	// with also the column ordinal between them.
	crossJoinQuery := func() string {

[nit] would help to show a sample query

pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

	HasSequenceArguments bool

	// CompositeInsensitive indicates that this function does not care about

CompositeInsensitive indicates that this function returns equal results when evaluated on equal inputs. This is a non-trivial property for composite types which can be equal but not identical (e.g. decimals 1.0 and 1.00). For example, converting a decimal to string is not CompositeInsensitive.

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, RaduBerinde wrote…

Would it be easier to just do fnv32(encode_key(col1, col2, col3)) % buckets?

I was looking at the distribution of values and startled by it which lead me down a small rabbit hole. Consider the test I wrote with the big cross join of two rows. All of the rows end up being unique and constructing unique keys using encode_key. What startled me is that it seemed that no matter how I ended up randomizing the inputs, the bucket counts were perfectly equal. That alarmed me. It got me thinking that fnv32 must have strong correlations that are not being taken into account.

To make this concrete, let's imagine the following for the one non-nil row

	tdb.Exec(t, `INSERT
	INTO t
	SELECT mod(i, 1<<15)::INT2,
		i::INT4,
		i::INT8,
		i::FLOAT4,
		i::FLOAT8,
		i::STRING,
		i::CHAR,
		i::STRING::BYTES,
		i::DECIMAL,
		i::INTERVAL,
		i::OID,
		i::TIMESTAMPTZ,
		i::TIMESTAMP,
		i::DATE,
		('127.0.0.' || mod(i, 256)::STRING)::INET,
		i::VARBIT,
		ARRAY[i::STRING],
		ARRAY[i]
	FROM (
		SELECT $1::INT AS i
	);`, rand.Intn(math.MaxInt32))

If I do this query with %s as the cross join (including or not! the column ordinals), I always find the following distribution

SELECT k, count(k) AS c
                FROM (
                        SELECTmod(fnv32(crdb_internal.key_encode(input.*)), 8) AS k
                          FROM (%s) AS input
                     )
            GROUP BY k
              ORDER BY k

That scared me, I'd expect it to be independent. In fact, most formulations that don't involve the previous step in the hash of the next step don't end up being independent (the current formula is just as broken). If you use fnv64, it's a bit better but still not perfect. If you use sha256 and then fnv32 then you get something that feels truly independent. Also, if you incorporate the result of the previous step in the current step, then it feels independent. I don't know the right tradeoff or the math here.

So, here's what I'm thinking for the two options:

SELECT mod(fnv32(sha256(crdb_internal.key_encode(a.i2, 0, b.i4, 1, c.i8, 2, d.f4, 3, e.f8, 4, f.s, 5, g.c, 6, h.b, 7, i.dc, 8, j.ival, 9, k.oid, 10, l.tstz, 11, m.ts, 12, n.da, 13, o.inet, 14, p.vb, 15, q.sa, 16, r.ia, 17))), 8) 
FROM t a, t b, t c, t d, t e, t f, t g, t h, t i, t j, t k, t l, t m, t n, t o, t p, t q, t r;

SELECT
  *
FROM
  (
    SELECT
      mod(
        fnv32(
          crdb_internal.key_encode(
            a.i2,
            0,
            fnv32(
              crdb_internal.key_encode(
                b.i4,
                1,
                fnv32(
                  crdb_internal.key_encode(
                    c.i8,
                    2,
                    fnv32(
                      crdb_internal.key_encode(
                        d.f4,
                        3,
                        fnv32(
                          crdb_internal.key_encode(
                            e.f8,
                            4,
                            fnv32(
                              crdb_internal.key_encode(
                                f.s,
                                5,
                                fnv32(
                                  crdb_internal.key_encode(
                                    g.c,
                                    6,
                                    fnv32(
                                      crdb_internal.key_encode(
                                        h.b,
                                        7,
                                        fnv32(
                                          crdb_internal.key_encode(
                                            i.dc,
                                            8,
                                            fnv32(
                                              crdb_internal.key_encode(
                                                j.ival,
                                                9,
                                                fnv32(
                                                  crdb_internal.key_encode(
                                                    k.oid,
                                                    10,
                                                    fnv32(
                                                      crdb_internal.key_encode(
                                                        l.tstz,
                                                        11,
                                                        fnv32(
                                                          crdb_internal.key_encode(
                                                            m.ts,
                                                            12,
                                                            fnv32(
                                                              crdb_internal.key_encode(
                                                                n.da,
                                                                13,
                                                                fnv32(
                                                                  crdb_internal.key_encode(
                                                                    o.inet,
                                                                    14,
                                                                    fnv32(crdb_internal.key_encode(p.vb, 15, fnv32(crdb_internal.key_encode(q.sa, 16, fnv32(crdb_internal.key_encode(r.ia, 17))))))
                                                                  )
                                                                )
                                                              )
                                                            )
                                                          )
                                                        )
                                                      )
                                                    )
                                                  )
                                                )
                                              )
                                            )
                                          )
                                        )
                                      )
                                    )
                                  )
                                )
                              )
                            )
                          )
                        )
                      )
                    )
                  )
                )
              )
            )
          )
        ),
        8
      )
    FROM
      t AS a, t AS b, t AS c, t AS d, t AS e, t AS f, t AS g, t AS h, t AS i, t AS j, t AS k, t AS l, t AS m, t AS n, t AS o, t AS p, t AS q, t AS r
  );

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, ajwerner wrote…

I was looking at the distribution of values and startled by it which lead me down a small rabbit hole. Consider the test I wrote with the big cross join of two rows. All of the rows end up being unique and constructing unique keys using encode_key. What startled me is that it seemed that no matter how I ended up randomizing the inputs, the bucket counts were perfectly equal. That alarmed me. It got me thinking that fnv32 must have strong correlations that are not being taken into account.

To make this concrete, let's imagine the following for the one non-nil row

	tdb.Exec(t, `INSERT
	INTO t
	SELECT mod(i, 1<<15)::INT2,
		i::INT4,
		i::INT8,
		i::FLOAT4,
		i::FLOAT8,
		i::STRING,
		i::CHAR,
		i::STRING::BYTES,
		i::DECIMAL,
		i::INTERVAL,
		i::OID,
		i::TIMESTAMPTZ,
		i::TIMESTAMP,
		i::DATE,
		('127.0.0.' || mod(i, 256)::STRING)::INET,
		i::VARBIT,
		ARRAY[i::STRING],
		ARRAY[i]
	FROM (
		SELECT $1::INT AS i
	);`, rand.Intn(math.MaxInt32))

If I do this query with %s as the cross join (including or not! the column ordinals), I always find the following distribution

SELECT k, count(k) AS c
                FROM (
                        SELECTmod(fnv32(crdb_internal.key_encode(input.*)), 8) AS k
                          FROM (%s) AS input
                     )
            GROUP BY k
              ORDER BY k

That scared me, I'd expect it to be independent. In fact, most formulations that don't involve the previous step in the hash of the next step don't end up being independent (the current formula is just as broken). If you use fnv64, it's a bit better but still not perfect. If you use sha256 and then fnv32 then you get something that feels truly independent. Also, if you incorporate the result of the previous step in the current step, then it feels independent. I don't know the right tradeoff or the math here.

So, here's what I'm thinking for the two options:

SELECT mod(fnv32(sha256(crdb_internal.key_encode(a.i2, 0, b.i4, 1, c.i8, 2, d.f4, 3, e.f8, 4, f.s, 5, g.c, 6, h.b, 7, i.dc, 8, j.ival, 9, k.oid, 10, l.tstz, 11, m.ts, 12, n.da, 13, o.inet, 14, p.vb, 15, q.sa, 16, r.ia, 17))), 8) 
FROM t a, t b, t c, t d, t e, t f, t g, t h, t i, t j, t k, t l, t m, t n, t o, t p, t q, t r;

SELECT
  *
FROM
  (
    SELECT
      mod(
        fnv32(
          crdb_internal.key_encode(
            a.i2,
            0,
            fnv32(
              crdb_internal.key_encode(
                b.i4,
                1,
                fnv32(
                  crdb_internal.key_encode(
                    c.i8,
                    2,
                    fnv32(
                      crdb_internal.key_encode(
                        d.f4,
                        3,
                        fnv32(
                          crdb_internal.key_encode(
                            e.f8,
                            4,
                            fnv32(
                              crdb_internal.key_encode(
                                f.s,
                                5,
                                fnv32(
                                  crdb_internal.key_encode(
                                    g.c,
                                    6,
                                    fnv32(
                                      crdb_internal.key_encode(
                                        h.b,
                                        7,
                                        fnv32(
                                          crdb_internal.key_encode(
                                            i.dc,
                                            8,
                                            fnv32(
                                              crdb_internal.key_encode(
                                                j.ival,
                                                9,
                                                fnv32(
                                                  crdb_internal.key_encode(
                                                    k.oid,
                                                    10,
                                                    fnv32(
                                                      crdb_internal.key_encode(
                                                        l.tstz,
                                                        11,
                                                        fnv32(
                                                          crdb_internal.key_encode(
                                                            m.ts,
                                                            12,
                                                            fnv32(
                                                              crdb_internal.key_encode(
                                                                n.da,
                                                                13,
                                                                fnv32(
                                                                  crdb_internal.key_encode(
                                                                    o.inet,
                                                                    14,
                                                                    fnv32(crdb_internal.key_encode(p.vb, 15, fnv32(crdb_internal.key_encode(q.sa, 16, fnv32(crdb_internal.key_encode(r.ia, 17))))))
                                                                  )
                                                                )
                                                              )
                                                            )
                                                          )
                                                        )
                                                      )
                                                    )
                                                  )
                                                )
                                              )
                                            )
                                          )
                                        )
                                      )
                                    )
                                  )
                                )
                              )
                            )
                          )
                        )
                      )
                    )
                  )
                )
              )
            )
          )
        ),
        8
      )
    FROM
      t AS a, t AS b, t AS c, t AS d, t AS e, t AS f, t AS g, t AS h, t AS i, t AS j, t AS k, t AS l, t AS m, t AS n, t AS o, t AS p, t AS q, t AS r
  );

In that example, we have two random rows in t (generated with that Exec statement) and we're mixing and matching all combinations?

When you say "That scared me, I'd expect it to be independent", what do you mean? Just that the counts shouldn't look so perfect?

What feels wrong is that we do mod 8, so we only look at the last few bits of the fnv32, which is not great for a non-cryptographic hash. Using a prime number for the buckets would fix that. I think we could also mod against a bigger prime number before modding down to 8, or shuffle the bits somehow.

I think this can be a problem even for a single column, though I can't find an example.

My concern with your first suggestion is that sha is fairly expensive to calculate. And the second suggestion feels like its obfuscating the underlying problem but not solving it entirely.

RaduBerinde · 2021-07-22T21:20:58Z

pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, RaduBerinde wrote…

In that example, we have two random rows in t (generated with that Exec statement) and we're mixing and matching all combinations?

When you say "That scared me, I'd expect it to be independent", what do you mean? Just that the counts shouldn't look so perfect?

What feels wrong is that we do mod 8, so we only look at the last few bits of the fnv32, which is not great for a non-cryptographic hash. Using a prime number for the buckets would fix that. I think we could also mod against a bigger prime number before modding down to 8, or shuffle the bits somehow.

I think this can be a problem even for a single column, though I can't find an example.

My concern with your first suggestion is that sha is fairly expensive to calculate. And the second suggestion feels like its obfuscating the underlying problem but not solving it entirely.

I guess this example kind-of illustrates your finding:

root@127.185.179.142:26257/defaultdb> select *, crdb_internal.key_encode(input.*), fnv32(crdb_internal.key_encode(input.*)), fnv32(crdb_internal.key_encode(input.*)) % 8 from (select t1.a, t2.b, t3.c, t4.d from t t1, t
 t2, t t3, t t4) as input;
  a | b | c | d |                  crdb_internal.key_encode                   |   fnv32    | ?column?
----+---+---+---+-------------------------------------------------------------+------------+-----------
  1 | 1 | 1 | 1 | \211\005?\360\000\000\000\000\000\000\0221\000\001*\002\000 | 2651793876 |        4
  1 | 1 | 1 | 5 | \211\005?\360\000\000\000\000\000\000\0221\000\001*\012\000 | 2517572924 |        4
  1 | 1 | 5 | 1 | \211\005?\360\000\000\000\000\000\000\0225\000\001*\002\000 | 3622176784 |        0
  1 | 1 | 5 | 5 | \211\005?\360\000\000\000\000\000\000\0225\000\001*\012\000 | 3756397736 |        0
  1 | 5 | 1 | 1 | \211\005@\024\000\000\000\000\000\000\0221\000\001*\002\000 |   13520939 |        3
  1 | 5 | 1 | 5 | \211\005@\024\000\000\000\000\000\000\0221\000\001*\012\000 | 4174267283 |        3
  1 | 5 | 5 | 1 | \211\005@\024\000\000\000\000\000\000\0225\000\001*\002\000 |  983903847 |        7
  1 | 5 | 5 | 5 | \211\005@\024\000\000\000\000\000\000\0225\000\001*\012\000 |  849682895 |        7
  5 | 1 | 1 | 1 | \215\005?\360\000\000\000\000\000\000\0221\000\001*\002\000 | 3370022656 |        0
  5 | 1 | 1 | 5 | \215\005?\360\000\000\000\000\000\000\0221\000\001*\012\000 | 3504243608 |        0
  5 | 1 | 5 | 1 | \215\005?\360\000\000\000\000\000\000\0225\000\001*\002\000 | 2378851396 |        4
  5 | 1 | 5 | 5 | \215\005?\360\000\000\000\000\000\000\0225\000\001*\012\000 | 2244630444 |        4
  5 | 5 | 1 | 1 | \215\005@\024\000\000\000\000\000\000\0221\000\001*\002\000 | 3290441751 |        7
  5 | 5 | 1 | 5 | \215\005@\024\000\000\000\000\000\000\0221\000\001*\012\000 | 3156220799 |        7
  5 | 5 | 5 | 1 | \215\005@\024\000\000\000\000\000\000\0225\000\001*\002\000 | 2299270491 |        3
  5 | 5 | 5 | 5 | \215\005@\024\000\000\000\000\000\000\0225\000\001*\012\000 | 2165049539 |        3

Note how changing the last value doesn't change the bucket, even though the overall fnv value changes. I think it's probably possible that, given the very specific way in which these rows are constructed, for any row you can find 32768 other rows where the values that are different don't affect the lower 8 bits of the fnv. Or something along those lines.

Not sure how much of an alarm this whole thing is.. It would be more alarming if we could find data where the buckets are super skewed.

RaduBerinde · 2021-07-22T21:21:43Z

pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, RaduBerinde wrote…

I guess this example kind-of illustrates your finding:

root@127.185.179.142:26257/defaultdb> select *, crdb_internal.key_encode(input.*), fnv32(crdb_internal.key_encode(input.*)), fnv32(crdb_internal.key_encode(input.*)) % 8 from (select t1.a, t2.b, t3.c, t4.d from t t1, t
 t2, t t3, t t4) as input;
  a | b | c | d |                  crdb_internal.key_encode                   |   fnv32    | ?column?
----+---+---+---+-------------------------------------------------------------+------------+-----------
  1 | 1 | 1 | 1 | \211\005?\360\000\000\000\000\000\000\0221\000\001*\002\000 | 2651793876 |        4
  1 | 1 | 1 | 5 | \211\005?\360\000\000\000\000\000\000\0221\000\001*\012\000 | 2517572924 |        4
  1 | 1 | 5 | 1 | \211\005?\360\000\000\000\000\000\000\0225\000\001*\002\000 | 3622176784 |        0
  1 | 1 | 5 | 5 | \211\005?\360\000\000\000\000\000\000\0225\000\001*\012\000 | 3756397736 |        0
  1 | 5 | 1 | 1 | \211\005@\024\000\000\000\000\000\000\0221\000\001*\002\000 |   13520939 |        3
  1 | 5 | 1 | 5 | \211\005@\024\000\000\000\000\000\000\0221\000\001*\012\000 | 4174267283 |        3
  1 | 5 | 5 | 1 | \211\005@\024\000\000\000\000\000\000\0225\000\001*\002\000 |  983903847 |        7
  1 | 5 | 5 | 5 | \211\005@\024\000\000\000\000\000\000\0225\000\001*\012\000 |  849682895 |        7
  5 | 1 | 1 | 1 | \215\005?\360\000\000\000\000\000\000\0221\000\001*\002\000 | 3370022656 |        0
  5 | 1 | 1 | 5 | \215\005?\360\000\000\000\000\000\000\0221\000\001*\012\000 | 3504243608 |        0
  5 | 1 | 5 | 1 | \215\005?\360\000\000\000\000\000\000\0225\000\001*\002\000 | 2378851396 |        4
  5 | 1 | 5 | 5 | \215\005?\360\000\000\000\000\000\000\0225\000\001*\012\000 | 2244630444 |        4
  5 | 5 | 1 | 1 | \215\005@\024\000\000\000\000\000\000\0221\000\001*\002\000 | 3290441751 |        7
  5 | 5 | 1 | 5 | \215\005@\024\000\000\000\000\000\000\0221\000\001*\012\000 | 3156220799 |        7
  5 | 5 | 5 | 1 | \215\005@\024\000\000\000\000\000\000\0225\000\001*\002\000 | 2299270491 |        3
  5 | 5 | 5 | 5 | \215\005@\024\000\000\000\000\000\000\0225\000\001*\012\000 | 2165049539 |        3

Note how changing the last value doesn't change the bucket, even though the overall fnv value changes. I think it's probably possible that, given the very specific way in which these rows are constructed, for any row you can find 32768 other rows where the values that are different don't affect the lower 8 bits of the fnv. Or something along those lines.

Not sure how much of an alarm this whole thing is.. It would be more alarming if we could find data where the buckets are super skewed.

*lower 3 bits of the fnv

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, RaduBerinde wrote…

*lower 3 bits of the fnv

Given all of this and efficiency considerations, I'm thinking about just adding a different builtin which takes the datums and hashes them but chains the hash values together to make things . Would such a function upset you? I can also be convinced to stop caring.

	var h fnv.New32()
        var buf []byte
	for i, datum := range args {
                buf = buf[:0]
                var err error
                buf, err = rowenc.EncodeTableKey(out, datum, encoding.Ascending)
                if err != nil { return err }
		_, err := h.Write(buf)
                if err != nil { ... }
                var scratch [4]byte
                binary.PutUvarint(scratch, i) // hash the index hash value to make sequences of the same value independent
                _, err := h.Write(scratch[:])
                if err != nil { ... }
                binary.PutUvarint(scratch, h.Sum32()) // hash the current hash value to increase the data independence
                _, err := h.Write(scratch[:])
                if err != nil { ... }
	}

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

This can be simplified to just make sure that checkTypes is false below when funcExpr.Properties.CompositeInsensitive = true. The code below already checks that the inputs are not sensitive.

I don't think that's quite true. The Child(0) of FunctionExpr is its arg list. I think I could do the following but something about it felt icky at the time.

checkTypes := !opt.IsCompositeInsensitiveOp(e)
if funcExpr, ok := e.(*FunctionExpr); ok && funcExpr.Properties.CompositeInsensitive {
    checkTypes = false
    e = funcExpr.Args
}

pkg/sql/sem/builtins/key_encode_builtin_test.go, line 24 at r1 (raw file):

Previously, RaduBerinde wrote…

I don't quite get what this test is trying to achieve, it seems overly complicated. Is it essentially just cross-checking the distinct count against the distinct key count? Wouldn't we get the same or better by generating a random values clause and doing something like
with v as (values (1),(2),(3))
  select (select count(distinct v.*) from v) - (select count(distinct key_encode(v.*)) from v);

Sure, I was mostly just testing that all the data types worked and that they NULLs worked and then got too clever. I'll take another pass.

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, ajwerner wrote…

Given all of this and efficiency considerations, I'm thinking about just adding a different builtin which takes the datums and hashes them but chains the hash values together to make things . Would such a function upset you? I can also be convinced to stop caring.

	var h fnv.New32()
        var buf []byte
	for i, datum := range args {
                buf = buf[:0]
                var err error
                buf, err = rowenc.EncodeTableKey(out, datum, encoding.Ascending)
                if err != nil { return err }
		_, err := h.Write(buf)
                if err != nil { ... }
                var scratch [4]byte
                binary.PutUvarint(scratch, i) // hash the index hash value to make sequences of the same value independent
                _, err := h.Write(scratch[:])
                if err != nil { ... }
                binary.PutUvarint(scratch, h.Sum32()) // hash the current hash value to increase the data independence
                _, err := h.Write(scratch[:])
                if err != nil { ... }
	}

In general I'm opposed to coming up with this kind of one-off supposed improvements, unless it is following some standard practice or has clear math/reasoning behind it. It's a bit like trying to come up with your own security scheme.

Passing the current hash value through the hasher makes no sense to me. What "data independence" does it increase? It's mathematically equivalent to applying this obscure, static uint32->uint32 function to the current hash. At best, it's a bijective function and you effectively did nothing. At worst, it's collapsing different hashes into the same hash, making things worse.

Also don't see a benefit to hashing the index value (given that the concatenated key is such that you can split it into pieces without ambiguity). I don't know what "make sequences of the same value independent" means.

To me the legit concept is that we take what would be the key for a non-sharded index and hash that. It doesn't get cleaner than that conceptually - it's equivalent to having a hash-based KV store, which is what this is trying to simulate in the first place. I have no objection to a built-in that does that (to save on holding the entire key in memory).

RaduBerinde · 2021-07-22T22:42:53Z

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, ajwerner wrote…

I don't think that's quite true. The Child(0) of FunctionExpr is its arg list. I think I could do the following but something about it felt icky at the time.
checkTypes := !opt.IsCompositeInsensitiveOp(e)
if funcExpr, ok := e.(*FunctionExpr); ok && funcExpr.Properties.CompositeInsensitive {
    checkTypes = false
    e = funcExpr.Args
}

So? It's a recursive function..

RaduBerinde · 2021-07-22T23:07:51Z

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

So? It's a recursive function..

Ah, maybe ScalarListOp is not marked as insensitive, we should fix that

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

pkg/sql/create_table.go, line 2761 at r2 (raw file):

Previously, RaduBerinde wrote…

In general I'm opposed to coming up with this kind of one-off supposed improvements, unless it is following some standard practice or has clear math/reasoning behind it. It's a bit like trying to come up with your own security scheme.

Passing the current hash value through the hasher makes no sense to me. What "data independence" does it increase? It's mathematically equivalent to applying this obscure, static uint32->uint32 function to the current hash. At best, it's a bijective function and you effectively did nothing. At worst, it's collapsing different hashes into the same hash, making things worse.

Also don't see a benefit to hashing the index value (given that the concatenated key is such that you can split it into pieces without ambiguity). I don't know what "make sequences of the same value independent" means.

To me the legit concept is that we take what would be the key for a non-sharded index and hash that. It doesn't get cleaner than that conceptually - it's equivalent to having a hash-based KV store, which is what this is trying to simulate in the first place. I have no objection to a built-in that does that (to save on holding the entire key in memory).

I don't have a formalism about the "independence" so I'll stop worrying about it. Anecdotally I observe that if you feed the hash value back in that two rows seem to be no longer related. I've been convinced that I don't care about that. I also don't care much about the buffering. Thanks for all the engagement, I've been convinced to stop caring.

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

Ah, maybe ScalarListOp is not marked as insensitive, we should fix that

The value of checkTypes is not passed in recursively (for good reason). The ScalarListExpr that is Args will have its types checked, right?

RaduBerinde · 2021-07-22T23:40:13Z

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, ajwerner wrote…

The value of checkTypes is not passed in recursively (for good reason). The ScalarListExpr that is Args will have its types checked, right?

Not if it (the ScalarListOp) is marked as an insensitive op. We are checking that all paths to variable leaves only pass through composite-insensitive ops.

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

Not if it (the ScalarListOp) is marked as an insensitive op. We are checking that all paths to variable leaves only pass through composite-insensitive ops.

If ScalarListOp were marked as an insensitive op then wouldn't that mean that function invocations be treated as insensitive? At that point anything related to the function properties would be irrelevant.

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, ajwerner wrote…

If ScalarListOp were marked as an insensitive op then wouldn't that mean that function invocations be treated as insensitive? At that point anything related to the function properties would be irrelevant.

An example of a function that, I think, should be composite sensitive is json_build_object. I think that that thing already has some problems:

demo@127.0.0.1:52608/movr> SELECT json_build_object('a', -0.0);
  json_build_object
---------------------
  {"a": 0.0}
(1 row)


Time: 0ms total (execution 0ms / network 0ms)

demo@127.0.0.1:52608/movr> SELECT json_build_object('a', -0);
  json_build_object
---------------------
  {"a": 0}
(1 row)


demo@127.0.0.1:52608/movr> SELECT (json_build_object('a', '-0.0'::JSON)->'a')::FLOAT;
  float8
----------
      -0
(1 row)

RaduBerinde · 2021-07-23T00:00:11Z

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, ajwerner wrote…

An example of a function that, I think, should be composite sensitive is json_build_object. I think that that thing already has some problems:

demo@127.0.0.1:52608/movr> SELECT json_build_object('a', -0.0);
  json_build_object
---------------------
  {"a": 0.0}
(1 row)


Time: 0ms total (execution 0ms / network 0ms)

demo@127.0.0.1:52608/movr> SELECT json_build_object('a', -0);
  json_build_object
---------------------
  {"a": 0}
(1 row)


demo@127.0.0.1:52608/movr> SELECT (json_build_object('a', '-0.0'::JSON)->'a')::FLOAT;
  float8
----------
      -0
(1 row)

ScalarListOp is "under" the function. The function would be checked separately.

I believe that what you're discovering above is that you can't simply use -0 in a statement to create a -0 float (and I believe that's consistent with pg).

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

ScalarListOp is "under" the function. The function would be checked separately.

I believe that what you're discovering above is that you can't simply use -0 in a statement to create a -0 float (and I believe that's consistent with pg).

Sorry for the confusion. What I was missing was that getOuterCols was also recursive. I think that that makes this check N^2 in the depth of the expression but yes, if we also mark ScalarListOp as IsCompositeInsensitiveOp then it all works out. Thanks!

RaduBerinde · 2021-07-23T00:27:00Z

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, ajwerner wrote…

Sorry for the confusion. What I was missing was that getOuterCols was also recursive. I think that that makes this check N^2 in the depth of the expression but yes, if we also mark ScalarListOp as IsCompositeInsensitiveOp then it all works out. Thanks!

Good point, maybe leave a TODO in there for the quadratic behavior.

RaduBerinde · 2021-07-30T21:21:14Z

I noticed there already is a crdb_internal.encode_key. I don't think we can use that because it's meant to encode a full key, but we should rename the new one so there's no confusion between the two.

RaduBerinde

Looking good!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

pkg/sql/create_table.go, line 2569 at r3 (raw file):

// makeShardColumnDesc returns a new column descriptor for a hidden computed shard column
// based on all the `colNames`.
func makeShardColumnDesc(

[nit] this deserves a comment explaining the argument, maybe pointing to the two make*ComputeExpr for details.

pkg/sql/create_table.go, line 2625 at r3 (raw file):

			Exprs: tree.Exprs{
				expr,
				&tree.CastExpr{

Is this cast necessary? Maybe we can just use tree.AdjustValueToType instead? Though I think we should error out much earlier if the value is too large.

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)

pkg/sql/create_table.go, line 2716 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] probably on your TODO list but this should have a comment with sample expressions

Done.

pkg/sql/create_table.go, line 2569 at r3 (raw file):

Previously, RaduBerinde wrote…

[nit] this deserves a comment explaining the argument, maybe pointing to the two make*ComputeExpr for details.

Done.

pkg/sql/create_table.go, line 2625 at r3 (raw file):

Previously, RaduBerinde wrote…

Is this cast necessary? Maybe we can just use tree.AdjustValueToType instead? Though I think we should error out much earlier if the value is too large.

Removed. Also added a check and test.

pkg/sql/opt/memo/logical_props_builder.go, line 2644 at r2 (raw file):

Previously, RaduBerinde wrote…

Good point, maybe leave a TODO in there for the quadratic behavior.

This is now fixed 🎉

pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

Previously, RaduBerinde wrote…

CompositeInsensitive indicates that this function returns equal results when evaluated on equal inputs. This is a non-trivial property for composite types which can be equal but not identical (e.g. decimals 1.0 and 1.00). For example, converting a decimal to string is not CompositeInsensitive.

Removed as it's not needed anymore.

pkg/sql/sem/builtins/key_encode_builtin_test.go, line 1 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] missing header

Done.

pkg/sql/sem/builtins/key_encode_builtin_test.go, line 24 at r1 (raw file):

Previously, ajwerner wrote…

Sure, I was mostly just testing that all the data types worked and that they NULLs worked and then got too clever. I'll take another pass.

Cleaned it up roughly as you suggested.

pkg/sql/sem/builtins/key_encode_builtin_test.go, line 73 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] would help to show a sample query

Done.

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)

pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

Previously, ajwerner wrote…

Removed as it's not needed anymore.

Hm, how come? I don't think we can infer a hash value for composite types without it..

RaduBerinde · 2021-08-19T20:54:14Z

pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

Previously, RaduBerinde wrote…

Hm, how come? I don't think we can infer a hash value for composite types without it..

I think you might have encountered a bug. You made the scalar list composite insensitive and we're probably hitting this code:

		if opt.IsCompositeInsensitiveOp(e) {
			// The operator is known to be composite-insensitive. If its result is a
			// non-composite type, it is also composite-identical.
			return true, !colinfo.HasCompositeKeyEncoding(e.(opt.ScalarExpr).DataType())
		}

The data type of scalar list is Any and it's not handled. There are others like Tuple that aren't handled. We should change HasCompositeKeyEncoding to CanHaveCompositeKeyEncoding and have the default return value for unhandled types be true.

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)

pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

Previously, RaduBerinde wrote…

I think you might have encountered a bug. You made the scalar list composite insensitive and we're probably hitting this code:
		if opt.IsCompositeInsensitiveOp(e) {
			// The operator is known to be composite-insensitive. If its result is a
			// non-composite type, it is also composite-identical.
			return true, !colinfo.HasCompositeKeyEncoding(e.(opt.ScalarExpr).DataType())
		}
The data type of scalar list is Any and it's not handled. There are others like Tuple that aren't handled. We should change HasCompositeKeyEncoding to CanHaveCompositeKeyEncoding and have the default return value for unhandled types be true.

Alrighty, it's back and it's now confirmed needed. See how you feel about the new logic for CanHaveCompositeKeyEncoding.

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)

pkg/sql/sem/tree/function_definition.go, line 108 at r2 (raw file):

Previously, ajwerner wrote…

Alrighty, it's back and it's now confirmed needed. See how you feel about the new logic for CanHaveCompositeKeyEncoding.

👍 👍 looks great!

ajwerner · 2021-08-20T15:31:16Z

@RaduBerinde anything else you want on this?

RaduBerinde

Just wrap it with a red bow!

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

This patch also fixes the behavior to properly catagorize types like any and unknown. It does so by adding all of the families and setting the default to be true instead of the prior false. Release note: None

This new builtin function encodes datums using the key encoding. It is useful in particular because it is a function with immutable volatility from a large number of data types to bytes. Release note (sql change): Added a builtin function, crdb_internal.datums_to_bytes, which can encode any data type which can be used in an forward index key into bytes in an immutable way. This function is now used in the expression for hash sharded indexes.

ajwerner · 2021-08-21T21:02:06Z

bors r+

craig · 2021-08-21T21:53:51Z

Build succeeded:

GitHub CI (Cockroach)

We've discovered a case where a combination of non-random data and a power-of-two number of buckets causes an uneven distribution of rows in a hash-sharded index. While this specific case is very contrived, it illustrates a small weakness in the current hash-shard calculation: modulo by a power-of-two number of buckets only uses the last few bits of the hash value. Radu suggested this would be a problem in cockroachdb#67865 and also suggested a fix: add an intermediate modulo by a larger prime before modulo by num buckets, so we'll try that. Fixes: cockroachdb#91109 Epic: None Release note (performance improvement): This change updates the shard calculation of newly-created hash-sharded indexes so that uneven distributions of rows are less likely.

ajwerner mentioned this pull request Jul 21, 2021

SELECT from a table with hash sharded PK has unexpected index join in query plan #67170

Closed

ajwerner force-pushed the ajwerner/key-encoding-builtin branch 2 times, most recently from 649b2e4 to 667a412 Compare July 21, 2021 22:54

RaduBerinde reviewed Jul 22, 2021

View reviewed changes

ajwerner commented Jul 22, 2021

View reviewed changes

RaduBerinde reviewed Jul 22, 2021

View reviewed changes

ajwerner commented Jul 22, 2021

View reviewed changes

RaduBerinde reviewed Jul 22, 2021

View reviewed changes

ajwerner commented Jul 22, 2021

View reviewed changes

ajwerner commented Jul 23, 2021

View reviewed changes

ajwerner mentioned this pull request Aug 6, 2021

sql: monotonic id’s + recency bias leads to hot node #68560

Closed

ajwerner force-pushed the ajwerner/key-encoding-builtin branch 3 times, most recently from 4279e15 to ca026e9 Compare August 19, 2021 15:29

RaduBerinde reviewed Aug 19, 2021

View reviewed changes

ajwerner force-pushed the ajwerner/key-encoding-builtin branch 2 times, most recently from b57a6bc to 94587d9 Compare August 19, 2021 20:04

ajwerner changed the title ~~[DNM] sql,sem/builtins: add crdb_internal.key_encode, use for hash sharded indexes~~ sql/sem/builtins: add crdb_internal.datums_to_bytes, use in hash sharding Aug 19, 2021

ajwerner marked this pull request as ready for review August 19, 2021 20:05

ajwerner requested a review from a team as a code owner August 19, 2021 20:05

ajwerner requested a review from a team August 19, 2021 20:05

ajwerner requested a review from a team as a code owner August 19, 2021 20:05

ajwerner commented Aug 19, 2021

View reviewed changes

RaduBerinde reviewed Aug 19, 2021

View reviewed changes

ajwerner force-pushed the ajwerner/key-encoding-builtin branch 2 times, most recently from 4b9db56 to a6ca290 Compare August 19, 2021 22:19

ajwerner commented Aug 19, 2021

View reviewed changes

RaduBerinde reviewed Aug 19, 2021

View reviewed changes

ajwerner force-pushed the ajwerner/key-encoding-builtin branch 3 times, most recently from 5c08eab to e21e00e Compare August 20, 2021 03:39

RaduBerinde reviewed Aug 20, 2021

View reviewed changes

ajwerner force-pushed the ajwerner/key-encoding-builtin branch from e21e00e to 7bdae14 Compare August 20, 2021 17:51

ajwerner added 2 commits August 21, 2021 08:23

colinfo: rename HasCompositeKeyEncoding to CanHaveCompositeKeyEncoding

c1754fb

This patch also fixes the behavior to properly catagorize types like any and unknown. It does so by adding all of the families and setting the default to be true instead of the prior false. Release note: None

ajwerner force-pushed the ajwerner/key-encoding-builtin branch from 7bdae14 to f172756 Compare August 21, 2021 12:32

craig bot merged commit d18da6c into cockroachdb:master Aug 21, 2021

cockroach-teamcity mentioned this pull request Aug 21, 2021

sql/sem/builtins: add crdb_internal.datums_to_bytes, use in hash sharding cockroachdb/docs#11099

Closed

jseldess mentioned this pull request Sep 8, 2021

sql/sem/builtins: add crdb_internal.datums_to_bytes, use in hash sharding cockroachdb/docs#11445

Closed

ajwerner mentioned this pull request Oct 15, 2021

Using hash-sharding index on a floating point number leaves us with a CREATE TABLE statement that does not work #71611

Closed

ajwerner mentioned this pull request Nov 2, 2022

sql: non-uniform hash-sharded index with power-of-2 number of buckets #91109

Closed

michae2 mentioned this pull request Nov 2, 2022

sql: modulo by prime in shard calculation for hash-sharded indexes #91155

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql/sem/builtins: add crdb_internal.datums_to_bytes, use in hash sharding #67865

sql/sem/builtins: add crdb_internal.datums_to_bytes, use in hash sharding #67865

ajwerner commented Jul 21, 2021 •

edited

Loading

cockroach-teamcity commented Jul 21, 2021

ajwerner commented Jul 21, 2021

RaduBerinde left a comment

ajwerner left a comment

RaduBerinde left a comment

RaduBerinde commented Jul 22, 2021

RaduBerinde commented Jul 22, 2021

ajwerner left a comment

RaduBerinde left a comment

RaduBerinde commented Jul 22, 2021

RaduBerinde commented Jul 22, 2021

ajwerner left a comment

RaduBerinde commented Jul 22, 2021

ajwerner left a comment

ajwerner left a comment

RaduBerinde commented Jul 23, 2021

ajwerner left a comment

RaduBerinde commented Jul 23, 2021

RaduBerinde commented Jul 30, 2021

RaduBerinde left a comment

ajwerner left a comment

RaduBerinde left a comment

RaduBerinde commented Aug 19, 2021

ajwerner left a comment

RaduBerinde left a comment

ajwerner commented Aug 20, 2021

RaduBerinde left a comment

ajwerner commented Aug 21, 2021

craig bot commented Aug 21, 2021

sql/sem/builtins: add crdb_internal.datums_to_bytes, use in hash sharding #67865

sql/sem/builtins: add crdb_internal.datums_to_bytes, use in hash sharding #67865

Conversation

ajwerner commented Jul 21, 2021 • edited Loading

cockroach-teamcity commented Jul 21, 2021

ajwerner commented Jul 21, 2021

RaduBerinde left a comment

Choose a reason for hiding this comment

ajwerner left a comment

Choose a reason for hiding this comment

RaduBerinde left a comment

Choose a reason for hiding this comment

RaduBerinde commented Jul 22, 2021

RaduBerinde commented Jul 22, 2021

ajwerner left a comment

Choose a reason for hiding this comment

RaduBerinde left a comment

Choose a reason for hiding this comment

RaduBerinde commented Jul 22, 2021

RaduBerinde commented Jul 22, 2021

ajwerner left a comment

Choose a reason for hiding this comment

RaduBerinde commented Jul 22, 2021

ajwerner left a comment

Choose a reason for hiding this comment

ajwerner left a comment

Choose a reason for hiding this comment

RaduBerinde commented Jul 23, 2021

ajwerner left a comment

Choose a reason for hiding this comment

RaduBerinde commented Jul 23, 2021

RaduBerinde commented Jul 30, 2021

RaduBerinde left a comment

Choose a reason for hiding this comment

ajwerner left a comment

Choose a reason for hiding this comment

RaduBerinde left a comment

Choose a reason for hiding this comment

RaduBerinde commented Aug 19, 2021

ajwerner left a comment

Choose a reason for hiding this comment

RaduBerinde left a comment

Choose a reason for hiding this comment

ajwerner commented Aug 20, 2021

RaduBerinde left a comment

Choose a reason for hiding this comment

ajwerner commented Aug 21, 2021

craig bot commented Aug 21, 2021

ajwerner commented Jul 21, 2021 •

edited

Loading