Field writers for numerical arrays#14900
Conversation
processing/src/main/java/org/apache/druid/frame/field/NumericArrayFieldReader.java
Fixed
Show fixed
Hide fixed
gianm
left a comment
There was a problem hiding this comment.
For tests, add the following:
- Test datasets in
FrameWriterTestDataforTEST_ARRAYS_LONG,TEST_ARRAYS_FLOAT, andTEST_ARRAYS_DOUBLE, and add those toDATASETS. We have one of these for each supported type. These test datasets should be designed to exercise cases like arrays sharing the same prefix, null arrays, arrays containing null, empty arrays, and single-element arrays. - Test cases in
FrameWriterTestfortest_arrayLong,test_arrayFloat, andtest_arrayDouble.
Because this patch doesn't include columnar readers and writers, you will need to add some code to skip these cases for columnar frames.
processing/src/main/java/org/apache/druid/frame/field/NumericArrayFieldReader.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/frame/field/NumericFieldReader.java
Outdated
Show resolved
Hide resolved
gianm
left a comment
There was a problem hiding this comment.
Generally looks good, had some comments on testing and javadocs. ty!
processing/src/main/java/org/apache/druid/frame/write/FrameWriterUtils.java
Outdated
Show resolved
Hide resolved
| import java.util.List; | ||
| import java.util.stream.Collectors; | ||
|
|
||
| public class DoubleArrayFieldReaderTest extends InitializedNullHandlingTest |
There was a problem hiding this comment.
Please include a test case with Double.NaN, and a test case with with a null mixed in with nonnull doubles. (I see one for null by itself, but I don't see a mixture of null and nonnull.) Only include the null if NullHandling.sqlCompatible() though.
There was a problem hiding this comment.
The test case should work in both compatible and incompatible mode, since arrays don't coerce elements to null. #14900 (comment)
| import java.util.List; | ||
| import java.util.stream.Collectors; | ||
|
|
||
| public class FloatArrayFieldReaderTest extends InitializedNullHandlingTest |
There was a problem hiding this comment.
Please include a test case with Float.NaN, and a test case with with a null mixed in with nonnull doubles. (I see one for null by itself, but I don't see a mixture of null and nonnull.) Only include the null if NullHandling.sqlCompatible() though.
| // read only the ROW_BASED frames. Rather, set the optional, and throw the appropriate error message when the reader | ||
| // tries to read COLUMNAR frame. This should go away once the COLUMNAR frames also start supporting the numeric | ||
| // array | ||
| if (columnType.getType() == ValueType.ARRAY |
There was a problem hiding this comment.
nit: the conditional would be more readable if the element type was extracted from the condition.
There was a problem hiding this comment.
Will remove the conditional altogether, and add a dummy column reader.
| * In short, this is a temporary measure till columnar frames support the numerical array types to punt the unsupported | ||
| * type check for the numerical arrays (for COLUMNAR frames only) at the usage time, rather than the creation time | ||
| */ | ||
| private final Optional<Pair<String, ColumnType>> unsupportedColumnAndType; |
There was a problem hiding this comment.
Instead of this structure, IMO it'd keep FrameReader cleaner if we create a dummy column reader for the numeric array columns, which throws exceptions on readColumn and readRACColumn. Then the errors would happen only when we attempt to access those columns.
There was a problem hiding this comment.
+1 to @gianm comment. I think frameReader should not track the incompatible column types. That logic should be pushed to the fieldReadersand they should just throw impl not found.
processing/src/main/java/org/apache/druid/frame/field/NumericFieldWriter.java
Show resolved
Hide resolved
| return true; | ||
| } | ||
|
|
||
| public abstract static class Selector<T extends Number> implements ColumnValueSelector |
There was a problem hiding this comment.
Please add some javadoc explaining the purpose of this class, including javadoc for the methods getIndividualValueAtMemory and getIndividualFieldSize. It's pretty twisty-turny to read through the code due to the inheritance stack, so additional javadoc helps the reader figure out what is going on.
There was a problem hiding this comment.
+1, also maybe a more descriptive name for the class, and maybe name for the generic parameter, TElement or something perhaps?
There was a problem hiding this comment.
Updated comments and docs
| * of the array instead. (0x00 is used for array end because it helps in preserving the byte comparison property of the | ||
| * numeric array field writers). | ||
| * | ||
| * Therefore, to preserve backward and forward compatibility, the individual element's writers were left unchanged, |
There was a problem hiding this comment.
Please provide and example of the above. This will help user reason about the choice of bytes.
There was a problem hiding this comment.
- There is format in the JavaDoc above
- There is an explanation about the choice of the bytes chosen here
- Added @see to the
NumericArrayFieldWriterwhich contains the info on how this field writer is used for array elements.
Is there anything specific that I am missing here, that would help with clarification?
| return true; | ||
| } | ||
|
|
||
| public abstract ColumnValueSelector<?> getColumnValueSelector( |
There was a problem hiding this comment.
Could you please java doc the abstract class methods.
|
|
||
| public abstract ValueType getValueType(); | ||
|
|
||
| public abstract static class Selector |
There was a problem hiding this comment.
Is there a reason why this is not a class of its own?
Where is this selector being used in the abstract class NumericFieldReader ?
There was a problem hiding this comment.
All of the previous implementations of the field readers kept the selector as a private class inside their code, therefore I kept the selectors inside the parent class. Since this one is getting inherited, I'll move this out.
There was a problem hiding this comment.
Keeping it as is, since this is used for nullity check only. I'll add the relevant comments though.
| * In short, this is a temporary measure till columnar frames support the numerical array types to punt the unsupported | ||
| * type check for the numerical arrays (for COLUMNAR frames only) at the usage time, rather than the creation time | ||
| */ | ||
| private final Optional<Pair<String, ColumnType>> unsupportedColumnAndType; |
There was a problem hiding this comment.
+1 to @gianm comment. I think frameReader should not track the incompatible column types. That logic should be pushed to the fieldReadersand they should just throw impl not found.
processing/src/main/java/org/apache/druid/frame/field/NumericArrayFieldWriter.java
Outdated
Show resolved
Hide resolved
| // even when NullHandling.replaceWithDefault() is true we need to write null as is, and not convert it to their | ||
| // default value when writing the array. Therefore, the check is `getObject() == null` ignoring the value of | ||
| // `NullHandling.replaceWithDefaul()`. | ||
| return getObject() == null; |
There was a problem hiding this comment.
NullHandling.replaceWithDefault is dead to me, but this is true that array columns and expressions do not coerce elements to zeros.
We could also consider only supporting arrays in sql compatible mode :p
processing/src/main/java/org/apache/druid/frame/field/NumericArrayFieldWriter.java
Show resolved
Hide resolved
|
|
||
| // Create a columnValueSelector to write the individual elements re-using the NumericFieldWriter | ||
| AtomicInteger index = new AtomicInteger(0); | ||
| ColumnValueSelector<Number> columnValueSelector = new ColumnValueSelector<Number>() |
There was a problem hiding this comment.
Can we push column value selector to another class or a subclass. That would make the code more readable.
| memory.putByte(position + offset, NON_NULL_ROW); | ||
| offset += Byte.BYTES; | ||
|
|
||
| for (; index.get() < list.size(); index.incrementAndGet()) { |
There was a problem hiding this comment.
Can we iterate the list here ?
There was a problem hiding this comment.
Yes, but we'll ignore the element of the list we will be iterating over, so I have kept it as is
for ( Number ignored : list) {
// processing
}
|
Please also enable the |
| private FieldWriter fieldWriter; | ||
|
|
||
| //CHECKSTYLE.OFF: Regexp | ||
| private static final Object[] FLOATS_ARRAY_1 = new Object[]{ |
There was a problem hiding this comment.
Can we add a nested array test case as well to all numeric types ?
There was a problem hiding this comment.
Do you mean something like [[1,2],[3,4]]? It'd make little sense to do that since the numeric fields cannot be arbitrarily nested.
Therefore the fact that we have created a FloatArrayFieldReader means that we are reading something written by FloatArrayFieldWriterwhich ensures that we get an array of type [1,2,3,4,....].
If we'd had nested array like [[1,2],[3,4]], we wouldn't have created the writer and the reader in the first place, since the signature would have been `Array<Array>, and we wouldn't be in the methods tested by this suite.
processing/src/main/java/org/apache/druid/frame/field/NumericArrayFieldReader.java
Outdated
Show resolved
Hide resolved
| return true; | ||
| } | ||
|
|
||
| public abstract static class Selector<T extends Number> implements ColumnValueSelector |
There was a problem hiding this comment.
+1, also maybe a more descriptive name for the class, and maybe name for the generic parameter, TElement or something perhaps?
| return new StringFieldReader(true); | ||
| switch (Preconditions.checkNotNull(columnType.getElementType().getType(), "array elementType")) { | ||
| case STRING: | ||
| return new StringFieldReader(true); |
There was a problem hiding this comment.
unrelated to this PR side note, i wish we split string array field reader out of string field reader, even if its only a vanity name for the same class
There was a problem hiding this comment.
I have added a new namesake reader for this.
So, to create new methods would be:
new StringFieldReader()-> for stringsnew StringArrayFieldReader()-> for string arrays
processing/src/main/java/org/apache/druid/frame/field/FieldWriters.java
Outdated
Show resolved
Hide resolved
| ColumnValueSelector<?> columnValueSelector = | ||
| DoubleFieldReader.forArray() | ||
| .makeColumnValueSelector(memory, new ConstantFieldPointer(position)); |
There was a problem hiding this comment.
this seems kind of expensive, like we make a column value selector and a constant pointer for every element of every row? surely there is a less garbage intense way to read the array elements, though doesn't necessarily need to be resolved in this PR...
There was a problem hiding this comment.
Lemme check if I can reduce these object creations post addressing the other review comments.. If it requires a minor refactor, I'll update it here.
There was a problem hiding this comment.
Refactored, now we are just changing the position, and creating each object (reader, column value selector, field pointer) exactly once per selector.
(which should probably be once per Frame I think, or slightly higher, if the frame was comprised of multiple memory locations)
| // even when NullHandling.replaceWithDefault() is true we need to write null as is, and not convert it to their | ||
| // default value when writing the array. Therefore, the check is `getObject() == null` ignoring the value of | ||
| // `NullHandling.replaceWithDefaul()`. | ||
| return getObject() == null; |
There was a problem hiding this comment.
NullHandling.replaceWithDefault is dead to me, but this is true that array columns and expressions do not coerce elements to zeros.
We could also consider only supporting arrays in sql compatible mode :p
| * Indicator byte denoting that the numeric value succeeding it is not null. This is used while writing the individual | ||
| * elements writers of an array | ||
| */ | ||
| public static final byte ARRAY_ELEMENT_NOT_NULL_BYTE = 0x02; |
There was a problem hiding this comment.
any reason not to make this the array terminator then can just use the same null/not null bytes for elements as for regular numbers?
There was a problem hiding this comment.
Byte-by-byte comparison won't work if we change the order.
Comparing 2 arrays [1] and [1, 2].
Semantically, [1] < [1, 2]
When we convert using the scheme you mentioned, we'd encode them as
[1]
0x01 (non-null array)
0x01 (non-null byte)
transform(1)
0x02 (array end) (*)
[1, 2]
0x01 (non-null array)
0x01 (non-null byte)
transform(1)
0x01 (non-null byte) (*)
transform(2)
0x02 (array end)
However, on comparing byte-by-byte, the second array < first array, and the first point of difference would be the (*) marked location.
Therefore the array terminator is the smallest marker of them all.
There was a problem hiding this comment.
| @Override | ||
| public Object getObject() | ||
| { | ||
| final List<ElementType> currentArray = computeCurrentArray(); |
There was a problem hiding this comment.
i suppose this is the only real disadvantage of not storing the array length after the null byte and wanting to spit out Object[]. Basically that since we don't know the length up front we have to make a list and then convert it to Object[] after (though realistically right now it seems like almost everything that handles arrays handles Object[], List, and some group by stuff, so its not strictly required, but is better to be consistent).
I'm not sure that the extra bytes per value (could use VByte.writeInt/VByte.readInt, well add methods that work on Memory instead of ByteBuffer to fit in a little as a byte) are worth the savings in reads, might be worth measuring someday, though this doesn't really seem like a blocker to me
There was a problem hiding this comment.
the array length after the null byte and wanting to spit out Object[].
So you are saying the format should be something like:
Begin array indicator -> Array Elements -> Array terminator -> Size of array
The size of the array will be helpful in preventing extra converts of List <-> Object while reading since we can directly allocate that much memory upfront. I never considered it, and it also doesn't mess with the comparison aspect of the fields, since we are going to end the comparison before that itself.
However, since the array end is not known, we'd need to have a two-pass, where we find the rowTerminator first, and then figure out the size, allocate an array, and then re-read the elements from the bytes.
So it's b/w - Two passes through the byte array, however without any List <-> Object[] conversion
Or Single pass, requiring List<-> Object[] conversion
| verifySteps.add(new VerifyResults(finalExecStep)); | ||
| // Don't verify the row signature when MSQ is running, since the broker receives the task id, and the signature | ||
| // would be {TASK:STRING} instead of the expected results signature | ||
| verifySteps.add(new VerifyResults(finalExecStep, !config.isRunningMSQ())); |
There was a problem hiding this comment.
it might be nice to have something that functions equivalently for msq based tests (if there isn't already) but this doesn't seem like a blocker.
There was a problem hiding this comment.
Yep, I wanted to separate out the row signature check and the results check, though chose not to keep it in this PR.
cryptoe
left a comment
There was a problem hiding this comment.
Changes LGTM!!.
Could you update the release notes for the end user.
On the lines of
earlier queries like "select array[1,2]` would not work in msq but now they would.
| 1672531200000L, | ||
| Arrays.asList("d", "e"), | ||
| Arrays.asList("b", "b"), | ||
| new Object[]{1L, 4L}, |
There was a problem hiding this comment.
Since this method will run multiple times due to parametrized tests, we can consider writing this file only once. Maybe in a static method but that can be done in a follow up PR.
Thank you for adding this test though.
|
Verified that it works on a Druid cluster. The IT failures look unrelated Thanks for the reviews @clintropolis @cryptoe @gianm |
|
There are some static check failures, that I was about to miss. Fixing them |
HA Integration tests are failing, which this PR doesn't touch upon and seems unrelated. Also, it passed in the previous run. |
Row-based frames, and by extension, MSQ now supports numeric array types. This means that all queries consuming or producing arrays would also work with MSQ. Numeric arrays can also be ingested via MSQ. Post this patch, queries like, SELECT [1, 2] would work with MSQ since they consume a numeric array, instead of failing with an unsupported column type exception.
Row-based frames, and by extension, MSQ now supports numeric array types. This means that all queries consuming or producing arrays would also work with MSQ. Numeric arrays can also be ingested via MSQ. Post this patch, queries like, SELECT [1, 2] would work with MSQ since they consume a numeric array, instead of failing with an unsupported column type exception.

Description
This PR introduces the
FieldWritersto write numerical arrays into Frames. This will allow the arrays to be written into row-based frames, which are currently being used by the MSQ engine.The end goal is to extend this to
FrameColumnWritersas well in follow-up patches so that the broker that can use Frame (column-based) to materialize the subquery results can also write and read numerical array columns.The PR refactors the existing
long,double, andfloat-FieldWritersandFieldReaders- to subclass fromNumericalFieldWriterandNumericalFieldReader.The new
NumericalArrayFieldWriterandNumericalArrayFieldReaderreuse these classes to write the numerical array.Format
The format of the array value is written into the field as:
For example:
nullNo end of array is needed in case of null array
[][5L, null, 6L]Testing
Test cases added
<Type>ArrayFieldWriterTesthave been added.CalciteArraysQueryTestsas well. I have not committed to those changes yet since the UNNEST on MSQ Unnest now works on MSQ #14886 also creates the same file, and I wanted to avoid conflict since there's a high likelihood that PR might go in first. In case the reviews on this PR are completed, I'll update the PR before the merge with the changes.Compatibility with older versions
The changes don't modify the original behavior of the frames, but add support for the newer types. Therefore the queries that were working before would continue to work as usual in case of a rolling upgrade with the newer version.
The queries that weren't working before might or might not fail in case of a rolling upgrade (when there are multiple versions of Druid in the same cluster) depending on the nodes that the workers are running on, whereas they would have definitely failed in the older cluster.
Release note
Row-based frames, and by extension, MSQ now supports numeric array types. This means that all queries consuming or producing arrays would also work with MSQ. Numeric arrays can also be ingested via MSQ. Post this patch, queries like,
SELECT [1, 2]would work with MSQ since they consume a numeric array, instead of failing with an unsupported column type exception.Key changed/added classes in this PR
MyFooOurBarTheirBazThis PR has: