Skip to content

Comments

Field writers for numerical arrays#14900

Merged
LakshSingla merged 28 commits intoapache:masterfrom
LakshSingla:frames-numerical-arrays
Oct 4, 2023
Merged

Field writers for numerical arrays#14900
LakshSingla merged 28 commits intoapache:masterfrom
LakshSingla:frames-numerical-arrays

Conversation

@LakshSingla
Copy link
Contributor

@LakshSingla LakshSingla commented Aug 23, 2023

Description

This PR introduces the FieldWriters to write numerical arrays into Frames. This will allow the arrays to be written into row-based frames, which are currently being used by the MSQ engine.

The end goal is to extend this to FrameColumnWriters as well in follow-up patches so that the broker that can use Frame (column-based) to materialize the subquery results can also write and read numerical array columns.

The PR refactors the existing long, double, and float - FieldWriters and FieldReaders - to subclass from NumericalFieldWriter and NumericalFieldReader.
The new NumericalArrayFieldWriter and NumericalArrayFieldReader reuse these classes to write the numerical array.

Format

The format of the array value is written into the field as:

  1. 1st byte is reserved for whether the array is null or not.
  2. If the array is not null, we write the individual values into the memory consecutively using the preexisting field writers. There's one caveat, the null value is denoted with a 0x01 flag instead of 0x00 (which is used in normal numeric field writers), and the non-null value is denoted with a 0x02 flag instead of 0x01 (which is used in normal numeric field writers). This is done because it allows the field comparison per byte.
  3. We terminate the array with 0x00.

For example:

  1. The getObject() returns null
Bytes Value Interpretation
1 0x00 Denotes that the array null

No end of array is needed in case of null array

  1. The getObject returns []
Bytes Value Interpretation
1 0x01 Denotes that the array is not null
1 0x00 End of the array
  1. The getObject returns [5L, null, 6L]
Bytes Value Interpretation
1 0x01 Denotes that the array is not null
2 0x02 Denotes that the next element is not null
3-10 transform(5) Representation of 5
11 0x01 Denotes that the next element is null
12-19 transform(0) Representation of 0 (default value, the reader will ignore it if SqlCompatible mode is on
20 0x02 Denotes that the next element is not null
21-28 transform(6) Representation of 6
29 0x00 End of array

Testing

Test cases added

  1. Individual test cases for the field writers and readers by the name of <Type>ArrayFieldWriterTest have been added.
  2. Added the Datasets to work with the unit tests that test out the frame creation and sorting E2E.
  3. (Not committed) Verified that the tests work with the CalciteArraysQueryTests as well. I have not committed to those changes yet since the UNNEST on MSQ Unnest now works on MSQ #14886 also creates the same file, and I wanted to avoid conflict since there's a high likelihood that PR might go in first. In case the reviews on this PR are completed, I'll update the PR before the merge with the changes.

Compatibility with older versions

The changes don't modify the original behavior of the frames, but add support for the newer types. Therefore the queries that were working before would continue to work as usual in case of a rolling upgrade with the newer version.
The queries that weren't working before might or might not fail in case of a rolling upgrade (when there are multiple versions of Druid in the same cluster) depending on the nodes that the workers are running on, whereas they would have definitely failed in the older cluster.

Release note

Row-based frames, and by extension, MSQ now supports numeric array types. This means that all queries consuming or producing arrays would also work with MSQ. Numeric arrays can also be ingested via MSQ. Post this patch, queries like, SELECT [1, 2] would work with MSQ since they consume a numeric array, instead of failing with an unsupported column type exception.


Key changed/added classes in this PR
  • MyFoo
  • OurBar
  • TheirBaz

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@LakshSingla LakshSingla added Release Notes Improvement Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Aug 24, 2023
Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For tests, add the following:

  • Test datasets in FrameWriterTestData for TEST_ARRAYS_LONG, TEST_ARRAYS_FLOAT, and TEST_ARRAYS_DOUBLE, and add those to DATASETS. We have one of these for each supported type. These test datasets should be designed to exercise cases like arrays sharing the same prefix, null arrays, arrays containing null, empty arrays, and single-element arrays.
  • Test cases in FrameWriterTest for test_arrayLong, test_arrayFloat, and test_arrayDouble.

Because this patch doesn't include columnar readers and writers, you will need to add some code to skip these cases for columnar frames.

@LakshSingla LakshSingla requested review from cryptoe and gianm September 7, 2023 22:09
Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good, had some comments on testing and javadocs. ty!

import java.util.List;
import java.util.stream.Collectors;

public class DoubleArrayFieldReaderTest extends InitializedNullHandlingTest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a test case with Double.NaN, and a test case with with a null mixed in with nonnull doubles. (I see one for null by itself, but I don't see a mixture of null and nonnull.) Only include the null if NullHandling.sqlCompatible() though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case should work in both compatible and incompatible mode, since arrays don't coerce elements to null. #14900 (comment)

import java.util.List;
import java.util.stream.Collectors;

public class FloatArrayFieldReaderTest extends InitializedNullHandlingTest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a test case with Float.NaN, and a test case with with a null mixed in with nonnull doubles. (I see one for null by itself, but I don't see a mixture of null and nonnull.) Only include the null if NullHandling.sqlCompatible() though.

// read only the ROW_BASED frames. Rather, set the optional, and throw the appropriate error message when the reader
// tries to read COLUMNAR frame. This should go away once the COLUMNAR frames also start supporting the numeric
// array
if (columnType.getType() == ValueType.ARRAY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the conditional would be more readable if the element type was extracted from the condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove the conditional altogether, and add a dummy column reader.

* In short, this is a temporary measure till columnar frames support the numerical array types to punt the unsupported
* type check for the numerical arrays (for COLUMNAR frames only) at the usage time, rather than the creation time
*/
private final Optional<Pair<String, ColumnType>> unsupportedColumnAndType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this structure, IMO it'd keep FrameReader cleaner if we create a dummy column reader for the numeric array columns, which throws exceptions on readColumn and readRACColumn. Then the errors would happen only when we attempt to access those columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @gianm comment. I think frameReader should not track the incompatible column types. That logic should be pushed to the fieldReadersand they should just throw impl not found.

return true;
}

public abstract static class Selector<T extends Number> implements ColumnValueSelector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some javadoc explaining the purpose of this class, including javadoc for the methods getIndividualValueAtMemory and getIndividualFieldSize. It's pretty twisty-turny to read through the code due to the inheritance stack, so additional javadoc helps the reader figure out what is going on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, also maybe a more descriptive name for the class, and maybe name for the generic parameter, TElement or something perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated comments and docs

* of the array instead. (0x00 is used for array end because it helps in preserving the byte comparison property of the
* numeric array field writers).
*
* Therefore, to preserve backward and forward compatibility, the individual element's writers were left unchanged,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please provide and example of the above. This will help user reason about the choice of bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. There is format in the JavaDoc above
  2. There is an explanation about the choice of the bytes chosen here
  3. Added @see to the NumericArrayFieldWriter which contains the info on how this field writer is used for array elements.

Is there anything specific that I am missing here, that would help with clarification?

return true;
}

public abstract ColumnValueSelector<?> getColumnValueSelector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please java doc the abstract class methods.


public abstract ValueType getValueType();

public abstract static class Selector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why this is not a class of its own?
Where is this selector being used in the abstract class NumericFieldReader ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the previous implementations of the field readers kept the selector as a private class inside their code, therefore I kept the selectors inside the parent class. Since this one is getting inherited, I'll move this out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping it as is, since this is used for nullity check only. I'll add the relevant comments though.

* In short, this is a temporary measure till columnar frames support the numerical array types to punt the unsupported
* type check for the numerical arrays (for COLUMNAR frames only) at the usage time, rather than the creation time
*/
private final Optional<Pair<String, ColumnType>> unsupportedColumnAndType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @gianm comment. I think frameReader should not track the incompatible column types. That logic should be pushed to the fieldReadersand they should just throw impl not found.

// even when NullHandling.replaceWithDefault() is true we need to write null as is, and not convert it to their
// default value when writing the array. Therefore, the check is `getObject() == null` ignoring the value of
// `NullHandling.replaceWithDefaul()`.
return getObject() == null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clintropolis Could you please review this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullHandling.replaceWithDefault is dead to me, but this is true that array columns and expressions do not coerce elements to zeros.

We could also consider only supporting arrays in sql compatible mode :p


// Create a columnValueSelector to write the individual elements re-using the NumericFieldWriter
AtomicInteger index = new AtomicInteger(0);
ColumnValueSelector<Number> columnValueSelector = new ColumnValueSelector<Number>()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we push column value selector to another class or a subclass. That would make the code more readable.

memory.putByte(position + offset, NON_NULL_ROW);
offset += Byte.BYTES;

for (; index.get() < list.size(); index.incrementAndGet()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we iterate the list here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we'll ignore the element of the list we will be iterating over, so I have kept it as is

for ( Number ignored : list) {
 // processing
}

@cryptoe
Copy link
Contributor

cryptoe commented Oct 3, 2023

Please also enable the CalciteArraysQueryTests with MSQ with this PR.

private FieldWriter fieldWriter;

//CHECKSTYLE.OFF: Regexp
private static final Object[] FLOATS_ARRAY_1 = new Object[]{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a nested array test case as well to all numeric types ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean something like [[1,2],[3,4]]? It'd make little sense to do that since the numeric fields cannot be arbitrarily nested.
Therefore the fact that we have created a FloatArrayFieldReader means that we are reading something written by FloatArrayFieldWriterwhich ensures that we get an array of type [1,2,3,4,....].

If we'd had nested array like [[1,2],[3,4]], we wouldn't have created the writer and the reader in the first place, since the signature would have been `Array<Array>, and we wouldn't be in the methods tested by this suite.

return true;
}

public abstract static class Selector<T extends Number> implements ColumnValueSelector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, also maybe a more descriptive name for the class, and maybe name for the generic parameter, TElement or something perhaps?

return new StringFieldReader(true);
switch (Preconditions.checkNotNull(columnType.getElementType().getType(), "array elementType")) {
case STRING:
return new StringFieldReader(true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated to this PR side note, i wish we split string array field reader out of string field reader, even if its only a vanity name for the same class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a new namesake reader for this.
So, to create new methods would be:

  1. new StringFieldReader() -> for strings
  2. new StringArrayFieldReader() -> for string arrays

Comment on lines 45 to 47
ColumnValueSelector<?> columnValueSelector =
DoubleFieldReader.forArray()
.makeColumnValueSelector(memory, new ConstantFieldPointer(position));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems kind of expensive, like we make a column value selector and a constant pointer for every element of every row? surely there is a less garbage intense way to read the array elements, though doesn't necessarily need to be resolved in this PR...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lemme check if I can reduce these object creations post addressing the other review comments.. If it requires a minor refactor, I'll update it here.

Copy link
Contributor Author

@LakshSingla LakshSingla Oct 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored, now we are just changing the position, and creating each object (reader, column value selector, field pointer) exactly once per selector.
(which should probably be once per Frame I think, or slightly higher, if the frame was comprised of multiple memory locations)

// even when NullHandling.replaceWithDefault() is true we need to write null as is, and not convert it to their
// default value when writing the array. Therefore, the check is `getObject() == null` ignoring the value of
// `NullHandling.replaceWithDefaul()`.
return getObject() == null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullHandling.replaceWithDefault is dead to me, but this is true that array columns and expressions do not coerce elements to zeros.

We could also consider only supporting arrays in sql compatible mode :p

* Indicator byte denoting that the numeric value succeeding it is not null. This is used while writing the individual
* elements writers of an array
*/
public static final byte ARRAY_ELEMENT_NOT_NULL_BYTE = 0x02;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to make this the array terminator then can just use the same null/not null bytes for elements as for regular numbers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Byte-by-byte comparison won't work if we change the order.
Comparing 2 arrays [1] and [1, 2].
Semantically, [1] < [1, 2]
When we convert using the scheme you mentioned, we'd encode them as

[1] 
0x01 (non-null array) 
0x01 (non-null byte) 
transform(1) 
0x02 (array end) (*)
[1, 2] 
0x01 (non-null array) 
0x01 (non-null byte) 
transform(1) 
0x01 (non-null byte) (*)
transform(2)
0x02 (array end)

However, on comparing byte-by-byte, the second array < first array, and the first point of difference would be the (*) marked location.

Therefore the array terminator is the smallest marker of them all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions github-actions bot added Area - Querying Area - Segment Format and Ser/De and removed Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Oct 3, 2023
@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Oct 3, 2023
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 3, 2023
Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤘

@Override
public Object getObject()
{
final List<ElementType> currentArray = computeCurrentArray();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suppose this is the only real disadvantage of not storing the array length after the null byte and wanting to spit out Object[]. Basically that since we don't know the length up front we have to make a list and then convert it to Object[] after (though realistically right now it seems like almost everything that handles arrays handles Object[], List, and some group by stuff, so its not strictly required, but is better to be consistent).

I'm not sure that the extra bytes per value (could use VByte.writeInt/VByte.readInt, well add methods that work on Memory instead of ByteBuffer to fit in a little as a byte) are worth the savings in reads, might be worth measuring someday, though this doesn't really seem like a blocker to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the array length after the null byte and wanting to spit out Object[].

So you are saying the format should be something like:

Begin array indicator -> Array Elements -> Array terminator -> Size of array

The size of the array will be helpful in preventing extra converts of List <-> Object while reading since we can directly allocate that much memory upfront. I never considered it, and it also doesn't mess with the comparison aspect of the fields, since we are going to end the comparison before that itself.

However, since the array end is not known, we'd need to have a two-pass, where we find the rowTerminator first, and then figure out the size, allocate an array, and then re-read the elements from the bytes.
So it's b/w - Two passes through the byte array, however without any List <-> Object[] conversion
Or Single pass, requiring List<-> Object[] conversion

verifySteps.add(new VerifyResults(finalExecStep));
// Don't verify the row signature when MSQ is running, since the broker receives the task id, and the signature
// would be {TASK:STRING} instead of the expected results signature
verifySteps.add(new VerifyResults(finalExecStep, !config.isRunningMSQ()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be nice to have something that functions equivalently for msq based tests (if there isn't already) but this doesn't seem like a blocker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I wanted to separate out the row signature check and the results check, though chose not to keep it in this PR.

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM!!.

Could you update the release notes for the end user.

On the lines of
earlier queries like "select array[1,2]` would not work in msq but now they would.

1672531200000L,
Arrays.asList("d", "e"),
Arrays.asList("b", "b"),
new Object[]{1L, 4L},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this method will run multiple times due to parametrized tests, we can consider writing this file only once. Maybe in a static method but that can be done in a follow up PR.

Thank you for adding this test though.

@LakshSingla
Copy link
Contributor Author

Verified that it works on a Druid cluster. The IT failures look unrelated

Thanks for the reviews @clintropolis @cryptoe @gianm
Screenshot 2023-10-04 at 7 52 31 PM

@LakshSingla
Copy link
Contributor Author

There are some static check failures, that I was about to miss. Fixing them

@LakshSingla
Copy link
Contributor Author

Unit & Integration tests CI / revised-its / it (8, HighAvailability, middleManager) / HighAvailability integration test (Compile=jdk8, Run=jdk8, Indexer=middleManager) (pull_request) Failing after 65m

HA Integration tests are failing, which this PR doesn't touch upon and seems unrelated. Also, it passed in the previous run.
Going ahead with the merge.

@LakshSingla LakshSingla merged commit 30cf76d into apache:master Oct 4, 2023
@somu-imply somu-imply mentioned this pull request Oct 5, 2023
10 tasks
ektravel pushed a commit to ektravel/druid that referenced this pull request Oct 16, 2023
Row-based frames, and by extension, MSQ now supports numeric array types. This means that all queries consuming or producing arrays would also work with MSQ. Numeric arrays can also be ingested via MSQ. Post this patch, queries like, SELECT [1, 2] would work with MSQ since they consume a numeric array, instead of failing with an unsupported column type exception.
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
Row-based frames, and by extension, MSQ now supports numeric array types. This means that all queries consuming or producing arrays would also work with MSQ. Numeric arrays can also be ingested via MSQ. Post this patch, queries like, SELECT [1, 2] would work with MSQ since they consume a numeric array, instead of failing with an unsupported column type exception.
@LakshSingla LakshSingla deleted the frames-numerical-arrays branch May 8, 2024 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants