Write null byte when indexing numeric dimensions with Hadoop #7020

ferristseng · 2019-02-06T21:52:45Z

I noticed a couple of comments that hadn't been addressed in the Hadoop Indexing project regarding serializing and deserializing null numeric values, so I figured I would try to tackle it. I'm not super familiar with the internals of Druid, so let me know if I need to change code elsewhere.

Also, I ran the existing tests in the Hadoop Indexing project with -Ddruid.generic.useDefaultValueForNull=false, and they still passed. Let me know if I need to add additional ones!

clintropolis

Thanks for the contribution! (and apologies it has taken so long for a review)

I think this is reasonable; it's the same approach being used for the metrics columns.

I think it would be nice to add a test in InputRowSerdeTest to cover this. All of the tests in Travis are run with and without SQL null compatibility, so you can probably just write one test that asserts that null-valued input columns are serialized as either the null byte or zero, depending on NullHandling.replaceWithDefault().

ferristseng · 2019-02-28T18:10:11Z

Thanks for taking a look at this!

I added a test to InputRowSerdeTest that just makes sure the default value or the null byte is written in the expected position in the serialized array.
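
To illustrate what such a test checks, here is a minimal, self-contained sketch. It does not use Druid's classes: `serializeLong`, `deserializeLong`, the `replaceWithDefault` parameter, and the two marker byte values are hypothetical stand-ins chosen for this example, and the exact byte layout produced by InputRowSerde may differ.

```java
import java.nio.ByteBuffer;

final class NullByteSerdeSketch
{
  // Hypothetical marker values for this sketch; Druid keeps its own constants in NullHandling.
  static final byte IS_NULL_BYTE = 1;
  static final byte IS_NOT_NULL_BYTE = 0;

  // Writes a one-byte null marker, followed by the 8-byte value when one is present.
  // With replaceWithDefault = true, a null input is replaced by zero before writing.
  static byte[] serializeLong(Long value, boolean replaceWithDefault)
  {
    Long ret = value;
    if (ret == null && replaceWithDefault) {
      ret = 0L;
    }
    // Write the null byte only if the value is still null after defaulting.
    if (ret == null) {
      return new byte[]{IS_NULL_BYTE};
    }
    ByteBuffer buf = ByteBuffer.allocate(1 + Long.BYTES);
    buf.put(IS_NOT_NULL_BYTE);
    buf.putLong(ret);
    return buf.array();
  }

  // Reads the marker first and only reads the value when the marker says it is there.
  static Long deserializeLong(byte[] data)
  {
    ByteBuffer in = ByteBuffer.wrap(data);
    return in.get() == IS_NULL_BYTE ? null : in.getLong();
  }

  public static void main(String[] args)
  {
    byte[] sqlNullMode = serializeLong(null, false);
    byte[] defaultMode = serializeLong(null, true);
    System.out.println("null mode length: " + sqlNullMode.length);
    System.out.println("default mode length: " + defaultMode.length);
    System.out.println("round trip (null mode): " + deserializeLong(sqlNullMode));
    System.out.println("round trip (default mode): " + deserializeLong(defaultMode));
  }
}
```

A single test can then branch on the null-handling mode and assert on either the marker byte or the zero value, which is the shape suggested in the review above.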

clintropolis

LGTM, thanks for adding a test 👍

asdf2014 · 2019-03-06T04:11:43Z

indexing-hadoop/src/main/java/org/apache/druid/indexer/InputRowSerde.java

+    // Write the null byte only if the default numeric value is still null.
+    if (ret == null) {
+      out.writeByte(NullHandling.IS_NULL_BYTE);
+


Please remove the extra blank line.

asdf2014 · 2019-03-06T04:24:48Z

indexing-hadoop/src/main/java/org/apache/druid/indexer/InputRowSerde.java

@@ -190,7 +215,7 @@ public void serialize(ByteArrayDataOutput out, Object value)
    @Override
    public Long deserialize(ByteArrayDataInput in)
    {
-      return in.readLong();
+      return isNullByteSet(in) ? null : in.readLong();


Perhaps it would be better to use a functional programming style here.

return Optional.ofNullable(in)
               .filter(InputRowSerde::isNotNullByteSet)
               .map(ByteArrayDataInput::readLong)
               .get();

Eh, I sort of prefer it the way it currently is; it seems clearer to me. Is there any reason it would be better other than preference?

Alright, I prefer the functional style because it makes the code more readable. If we don't use Optional, then we need to add the @Nullable annotation to this method. It's up to you. 😅

I think I prefer the non-functional style. Also, maybe I'm misunderstanding, but wouldn't the get() cause the code to throw if the null byte is set?

I'll add @Nullable annotations to these deserialize methods

"but wouldn't the get() cause the code to throw if the null byte is set?"

@ferristseng If the null byte is set, then get will return a null value. What you describe should be the orElseThrow function. Thanks for your contribution.
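
A note on the exchange above: in `java.util.Optional`, `get()` on an empty Optional throws `NoSuchElementException`, while `orElse(null)` returns null. The sketch below shows both on a chain shaped like the suggested one. It is only an illustration: it reads from a `java.nio.ByteBuffer` rather than Guava's `ByteArrayDataInput`, and `isNotNullByteSet`, `readWithGet`, `readWithOrElseNull`, and the marker value are hypothetical names introduced here, not code from this PR.

```java
import java.nio.ByteBuffer;
import java.util.NoSuchElementException;
import java.util.Optional;

final class OptionalGetSketch
{
  // Hypothetical marker value for this sketch.
  static final byte IS_NULL_BYTE = 1;

  // Hypothetical stand-in for the predicate named in the suggestion: consumes the marker byte.
  static boolean isNotNullByteSet(ByteBuffer in)
  {
    return in.get() != IS_NULL_BYTE;
  }

  // Same shape as the suggested chain. If the filter rejects the input,
  // the Optional is empty and get() throws NoSuchElementException.
  static Long readWithGet(ByteBuffer in)
  {
    return Optional.ofNullable(in)
                   .filter(OptionalGetSketch::isNotNullByteSet)
                   .map(buf -> buf.getLong())
                   .get();
  }

  // Variant that returns null for an empty Optional instead of throwing.
  static Long readWithOrElseNull(ByteBuffer in)
  {
    return Optional.ofNullable(in)
                   .filter(OptionalGetSketch::isNotNullByteSet)
                   .map(buf -> buf.getLong())
                   .orElse(null);
  }

  public static void main(String[] args)
  {
    byte[] nullMarkerOnly = {IS_NULL_BYTE};
    try {
      readWithGet(ByteBuffer.wrap(nullMarkerOnly));
    }
    catch (NoSuchElementException e) {
      System.out.println("get() threw: " + e.getClass().getSimpleName());
    }
    System.out.println("orElse(null) returned: " + readWithOrElseNull(ByteBuffer.wrap(nullMarkerOnly)));
  }
}
```

So a functional version would need `orElse(null)` to match the behavior of the ternary in the diff, and the method would be nullable either way.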

asdf2014 · 2019-03-06T04:25:08Z

indexing-hadoop/src/main/java/org/apache/druid/indexer/InputRowSerde.java

@@ -229,7 +249,7 @@ public void serialize(ByteArrayDataOutput out, Object value)
    @Override
    public Float deserialize(ByteArrayDataInput in)
    {
-      return in.readFloat();
+      return isNullByteSet(in) ? null : in.readFloat();


Same.

return Optional.ofNullable(in)
               .filter(InputRowSerde::isNotNullByteSet)
               .map(ByteArrayDataInput::readFloat)
               .get();

asdf2014 · 2019-03-06T04:25:28Z

indexing-hadoop/src/main/java/org/apache/druid/indexer/InputRowSerde.java

@@ -268,7 +283,7 @@ public void serialize(ByteArrayDataOutput out, Object value)
    @Override
    public Double deserialize(ByteArrayDataInput in)
    {
-      return in.readDouble();
+      return isNullByteSet(in) ? null : in.readDouble();


Same.

return Optional.ofNullable(in)
               .filter(InputRowSerde::isNotNullByteSet)
               .map(ByteArrayDataInput::readDouble)
               .get();

asdf2014

Overall LGTM 👍 Also I left a few suggestions.

gianm · 2019-03-12T01:02:00Z

This has two approvals -- merging it.

* write null byte in hadoop indexing for numeric dimensions
* Add test case to check output serializing null numeric dimensions
* Remove extra line
* Add @nullable annotations

ferristseng mentioned this pull request Feb 11, 2019

[ERROR] Issue with Hadoop Indexer and null numeric dimensions #7050

Closed

clintropolis added Area - Batch Ingestion Area - Null Handling labels Feb 28, 2019

clintropolis requested changes Feb 28, 2019

write null byte in hadoop indexing for numeric dimensions

7ad1d91

ferristseng force-pushed the feature-hadoop-index-numeric-dim branch from 392f3ae to 7ad1d91 Compare February 28, 2019 15:08

Add test case to check output serializing null numeric dimensions

2d15e7d

clintropolis approved these changes Feb 28, 2019

jon-wei added the Bug label Feb 28, 2019

jon-wei modified the milestone: 0.14.0 Feb 28, 2019

asdf2014 reviewed Mar 6, 2019

ferristseng added 2 commits March 7, 2019 18:49

Remove extra line

65673cb

Add @nullable annotations

68eaded

asdf2014 approved these changes Mar 8, 2019

gianm merged commit c503ba9 into apache:master Mar 12, 2019

clintropolis added this to the 0.14.1 milestone Apr 24, 2019

clintropolis mentioned this pull request Apr 25, 2019

0.14.1-incubating release notes #7553

Closed
