-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle null values in Range Partition dimension distribution #11973
Handle null values in Range Partition dimension distribution #11973
Conversation
private static final boolean SKIP_NULL = true; | ||
// Do not skip nulls as StringDistribution can handle null values. | ||
// This behavior is different from hadoop indexing. | ||
private static final boolean SKIP_NULL = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be selectively turned on only when more than one dimension is being used? I don't know for certain what the impact of not skipping null will be but then that impact will be limited to new range partitioning only. or it can be based on a flag that you can pass via the context. thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be fine without the flag.
The following effects would be observed on single dim:
- Partitioning would now also work on a dimension column that is always null, although it will actually create just one partition.
- Estimation of partition boundaries will also take into account null values. So the algorithm would do a better job of estimating the size of the first partition (it would be closer to the target rows). The sizes of later partitions will not be affected (although the same data being ingested before and after this change could have different partition boundaries as the first partition boundary might shift and the others would shift with it)
With the addition of (multi dimension) range partitioning, single dim is inevitably being affected as it now goes through the multi dim flow itself. So this would only be another part of that overall effect.
...ache/druid/indexing/common/task/batch/parallel/distribution/ArrayOfStringsNullSafeSerde.java
Outdated
Show resolved
Hide resolved
...ache/druid/indexing/common/task/batch/parallel/distribution/ArrayOfStringsNullSafeSerde.java
Outdated
Show resolved
Hide resolved
.../druid/indexing/common/task/batch/parallel/distribution/ArrayOfStringsNullSafeSerdeTest.java
Show resolved
Hide resolved
Thank you @kfaraz. I have merged this change. I just realized that this new behaviour should be called out in the docs as well. will that be done in a follow-up PR? |
Thanks, @abhishekagarwal87 . The doc changes for multi-dim partitioning are being made in #11983 . I will create a follow up PR that documents the change in this PR. |
Description
This PR adds support for handling null dimension values while creating partition boundaries
in range partitioning.
This means that we can now have partition boundaries like
[null, "abc"]
or["abc", null, "def"]
.This fixes the following problems seen in production environments.
for all partition dimensions, partitioning just doesn't work. This is because all the rows
are ignored in the
PartialDimensionDistributionTask
andintervalToPartitions
turnsout to be empty.
end up in the first partition thus bloating up that partition. This can now be mitigated in
cases where the later dimensions have non-null values which vary over these rows.
The tuple of nulls
Please note that even though
[null, null]
is also a valid boundary now, it will never actuallybe encountered in practice. This is because lexicographically, the null tuple is
immediately followed the tuple of nulls. i.e.
null < [null, null]
Thus, if the start and end of a partition were
null
and[null, null]
respectively,that partition would contain zero rows (end is not inclusive). There would be no
point in having such a partition and so it never occurs in practice.
Changes
PartialDimensionDistributionTask
ArrayOfStringsSerde
Note:
The null-safe impl of
ArrayOfStringsSerde
can be removed once it is contributedback to apache-datasketches itself.
This PR has: