
Fix code to correctly extract value of multi-value column from avro file #5746

Closed
wants to merge 4 commits

Conversation

jackjlli
Contributor

@jackjlli jackjlli commented Jul 24, 2020

Description

This PR fixes the issue introduced by PR #5238.

Basically, the value of a multi-value column should be read at the first level instead of digging into the deepest level. The bug causes segment creation to fail because the correct value cannot be extracted from the Avro file.
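As a minimal illustration of the failure mode described above (the class and field names here are hypothetical, not Pinot's): the buggy extraction hands back each multi-value entry as a single-entry map wrapping the primitive, so the later cast to Number throws.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiValueBugSketch {
  // Build the shape the buggy extraction produced: each array element is a
  // single-entry map wrapping the primitive instead of the primitive itself.
  public static Object[] extractedValues() {
    List<Object> values = new ArrayList<>();
    Map<String, Object> wrapped = new HashMap<>();
    wrapped.put("item_id", 42); // the real value, buried one level deep
    values.add(wrapped);        // should have been the plain 42
    return values.toArray();
  }

  public static void main(String[] args) {
    Object first = extractedValues()[0];
    // This is the kind of cast that fails inside PinotDataType.toInteger():
    try {
      Number n = (Number) first;
      System.out.println("unexpected: " + n);
    } catch (ClassCastException e) {
      System.out.println("ClassCastException: a HashMap is not a Number");
    }
  }
}
```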

Tests Done

Before the change:

java.lang.ClassCastException: java.util.HashMap cannot be cast to java.lang.Number
	at org.apache.pinot.core.data.recordtransformer.PinotDataType$11.toInteger(PinotDataType.java:392) ~[classes/:?]
	at org.apache.pinot.core.data.recordtransformer.PinotDataType.toIntegerArray(PinotDataType.java:512) ~[classes/:?]
	at org.apache.pinot.core.data.recordtransformer.PinotDataType$13.convert(PinotDataType.java:441) ~[classes/:?]
	at org.apache.pinot.core.data.recordtransformer.PinotDataType$13.convert(PinotDataType.java:438) ~[classes/:?]
	at org.apache.pinot.core.data.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:100) ~[classes/:?]
	at org.apache.pinot.core.data.recordtransformer.CompositeTransformer.transform(CompositeTransformer.java:74) ~[classes/:?]
	at org.apache.pinot.core.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:70) ~[classes/:?]
	at org.apache.pinot.core.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:38) ~[classes/:?]
	at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:146) ~[classes/:?]
	at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:131) ~[classes/:?]
	at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:93) ~[classes/:?]
	at org.apache.pinot.tools.admin.command.CreateSegmentCommand.lambda$execute$0(CreateSegmentCommand.java:238) ~[classes/:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_172]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]

After the change:

2020/07/23 17:02:15.195 INFO [CreateSegmentCommand] [pool-2-thread-1] Successfully created segment: segmentName_20200722_20200722_0 at directory: ~/sampleData/output/segmentName_20200722_20200722_0
2020/07/23 17:02:15.196 INFO [CreateSegmentCommand] [main] Successfully created 1 segments from data files: [~/sampleData/input/test.avro]

Contributor

@mayankshriv mayankshriv left a comment

Could you please add a unit test?

convertedMap.put(fieldName, convert(record.get(fieldName)));
}
return convertedMap;
return record.get(0);
Contributor

Is record guaranteed to be non-empty?

Contributor Author

Yes, otherwise the if statement (value instanceof GenericData.Record) would have failed.

Contributor

This is going to break the existing functionality of extracting a struct within a record

convertedMap.put(fieldName, convert(record.get(fieldName)));
}
return convertedMap;
return record.get(0);
Contributor

I'm not sure if this is a valid fix. According to the instructions in the comment:

Converts the value to either a single value (string, number, bytebuffer), multi value (Object[]) or a Map
Natively Pinot only understands single values and multi values.
Map is useful only if some ingestion transform functions operates on it in the transformation layer.

Don't we need to follow the above conversion in case of GenericData.Record?

Contributor Author

This is following the conversion here. GenericData.Record is for fetching the value from a single value column.

Contributor

If GenericData.Record is supposed to imply SV, then what is the distinction between the last two branches of this code?

public static Object convert(Object value) {
  if (value == null) {
    return null;
  }
  Object convertedValue;
  if (value instanceof Collection) {
    convertedValue = handleMultiValue((Collection) value);
  } else if (value instanceof Map) {
    convertedValue = handleMap((Map) value);
  } else if (value instanceof GenericData.Record) {
    convertedValue = handleGenericRecord((GenericData.Record) value);
  } else {
    convertedValue = RecordReaderUtils.convertSingleValue(value);
  }
  return convertedValue;
}

Contributor

Also, if the bug is in the extraction of a multi-value column, I don't quite follow how this fix is handling that. Shouldn't we be looking at handleMultiValue()?

Contributor Author

This is the original code to handle SV:

  public static Object handleSingleValue(@Nullable Object value) {
    if (value == null) {
      return null;
    }
    if (value instanceof GenericData.Record) {
      return handleSingleValue(((GenericData.Record) value).get(0));
    }
    return value;
  }

Whether a column is SV or MV is determined from the Pinot schema.

And handleMultiValue() also calls the convert() method for every sub-value of an MV column.
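The recursion being described can be sketched as follows; this loosely mirrors the handleMultiValue()/convert() pair discussed above and is not the actual Pinot code:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class MultiValueSketch {
  // Convert each sub-value of a multi-value column, recursing as needed.
  public static Object[] handleMultiValue(Collection<?> values) {
    List<Object> converted = new ArrayList<>(values.size());
    for (Object value : values) {
      converted.add(convert(value)); // convert() is called per sub-value
    }
    return converted.toArray();
  }

  static Object convert(Object value) {
    if (value instanceof Collection) {
      return handleMultiValue((Collection<?>) value);
    }
    return value; // primitives pass through unchanged in this sketch
  }
}
```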

@npawar
Copy link
Contributor

npawar commented Jul 24, 2020

@jackjlli this doesn't look like the right fix for the issue. For GenericRecord, we should be looking at each field recursively.

Also the AvroRecordExtractorComplexTypesTest will fail with this change.

The code you're introducing makes the assumption that for a field of type GenericRecord in the raw Avro data, we should always only look at the first field. That's not a good assumption, right?

Contributor

@sajjad-moradi sajjad-moradi left a comment

Please add a unit test which fails without your fix. This way you show the validity of the fix.

@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
* Handles the conversion of every field of the GenericRecord
*/
private static Object handleGenericRecord(GenericData.Record record) {
List<Field> fields = record.getSchema().getFields();
Contributor

It seems that the current implementation of this function is doing what it is supposed to.

The caller chain is something like convert(value) -> handleGenericRecord(record). The latter is invoked by the former only if it determines there is nestedness (i.e., the value passed to convert isn't an SV, MV, or Map); it is essentially a struct. So for that we need to recurse and call convert for each field inside GenericData.Record. The fix seems to remove all of that code.

Contributor

@npawar's comment on #5746 also aligns with my observation.

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

The fix should be in the DataTypeTransformer instead of here.
In DataTypeTransformer we should be able to handle the map properly.

@jackjlli
Copy link
Contributor Author

The fix should be in the DataTypeTransformer instead of here.
In DataTypeTransformer we should be able to handle map properly

I've thought about that, but DataTypeTransformer uses a generic class for all kinds of data types.

@@ -87,6 +90,8 @@ public GenericRow transform(GenericRow record) {
source = MULTI_VALUE_TYPE_MAP.get(values[0].getClass());
if (source == null) {
source = PinotDataType.OBJECT_ARRAY;
} else if (source == PinotDataType.HASHMAP) {
Contributor

This should be done conditionally. In other words, consider two columns:

col1 - primitive column defined as MV int/float/double/string (simple array of primitive types that Pinot supports). Something like "CompaniesWorkedAt".

[
     "val1",
     "val2",
     "val3"
]

col2 - complex column defined as MV. Something like addresses, which is an array of struct or an array of map.

[
    {
         "k1" : "v1",
         "k2":  "v2",
         "k3": "v3"
   }
]

The second is a complex column, whereas the first is a standard primitive one. Now, the AvroRecordExtractor and AvroUtils.convert() would have returned the second as an array of Map/HashMap for our sample data:

"dimension_***" : [ {
    "item_id" : {
      "string" : "some data"
    }
  }, {
    "item_id" : {
      "string" : "some data"
    }
  } ]

Now, since the schema for the "dimension_***" column indicates it is an array of primitives, we need to extract the actual value from each HashMap. In such cases the map size happens to be 1, but that alone should not be the condition, since the map size could be 1 even when the object is actually a complex one. So we should convert from Map to primitive only if the schema says so.
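The conditional conversion asked for here could be sketched as follows; the boolean flag is a stand-in for the actual Pinot schema lookup, and the names are illustrative:

```java
import java.util.Map;

public class SchemaAwareUnwrap {
  // Unwrap a single-entry map to its value only when the schema says the
  // column is a primitive multi-value column.
  public static Object maybeUnwrap(Object value, boolean columnIsPrimitiveMV) {
    if (columnIsPrimitiveMV && value instanceof Map) {
      Map<?, ?> map = (Map<?, ?>) value;
      if (map.size() == 1) {
        return map.values().iterator().next(); // the wrapped primitive
      }
    }
    return value; // complex column, or not a map: leave untouched
  }
}
```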

Contributor

See this comment on #5746 for what the algorithm should be.

@@ -470,6 +474,43 @@ public String toString(Object value) {
}
},

HASHMAP,
Contributor

We actually don't have to introduce this type now (in the future we should, when we have a Map type); for now there is no need.

In the transform() method of DataTypeTransformer, we should do:

if (values[0] instanceof Map) {
  // If the schema says the column is primitive, get the primitive type
  // from the Map using the function you have already implemented,
  // getPinotDataTypeFromHashMap(), then call dest.convert().
}

@@ -632,4 +713,22 @@ public static PinotDataType getPinotDataType(FieldSpec fieldSpec) {
"Unsupported data type: " + dataType + " in field: " + fieldSpec.getName());
}
}

public static PinotDataType getPinotDataTypeFromHashMap(Map<Object, Object> map) {
Contributor

Suggest calling it getPrimitivePinotDataTypeFromMap.

HashMap is an implementation detail, so let's not use that. Map is better since it is an interface and suitable for representing an abstract complex type.

@jackjlli jackjlli force-pushed the fix-multi-value-column-from-avro branch from be4be9a to 297d5c5 Compare July 25, 2020 06:28
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

It seems we don't need to handle nested maps for now (the old behavior does not handle them either). So a simple fix could be: before checking value instanceof List, check value instanceof Map, and if it is a single-entry map, replace the value with the entry value.
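The simple fix suggested above could look roughly like this; the class and method names are illustrative, not the actual DataTypeTransformer code:

```java
import java.util.List;
import java.util.Map;

public class SingleEntryMapFix {
  // Before the existing instanceof List check, unwrap a single-entry Map
  // into its entry value, as suggested in the review comment above.
  public static Object normalize(Object value) {
    if (value instanceof Map) {
      Map<?, ?> map = (Map<?, ?>) value;
      if (map.size() == 1) {
        value = map.values().iterator().next(); // replace with entry value
      }
    }
    if (value instanceof List) {
      return ((List<?>) value).toArray(); // existing multi-value path
    }
    return value;
  }
}
```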

@@ -56,6 +56,7 @@
MULTI_VALUE_TYPE_MAP.put(Float.class, PinotDataType.FLOAT_ARRAY);
MULTI_VALUE_TYPE_MAP.put(Double.class, PinotDataType.DOUBLE_ARRAY);
MULTI_VALUE_TYPE_MAP.put(String.class, PinotDataType.STRING_ARRAY);
MULTI_VALUE_TYPE_MAP.put(HashMap.class, PinotDataType.MAP);
Contributor

HashMap is not good enough; we should use instanceof Map instead.

} else if (obj instanceof String) {
return PinotDataType.STRING_MAP;
} else {
throw new IllegalStateException(String.format("'%s' isn't supported in the hash map.", obj.getClass()));
Contributor

How about nested Map? How about single-entry map with multi-valued value (e.g. Object[])?

Contributor Author

Since we don't fully support complex map cases like nested maps and single-entry maps with multi-valued values, we can add the logic once they are supported. We can add a TODO here.

@@ -470,6 +472,43 @@ public String toString(Object value) {
}
},

MAP,

INTEGER_MAP {
Contributor

Let's consider renaming this to INTEGER_VALUE_MAP, and the same for the remaining types.

PinotDataType singleValueType = getSingleValueType();
for (int i = 0; i < length; i++) {
integerArray[i] = singleValueType.toInteger(valueArray[i]);
if (valueArray[0] instanceof Map) {
Contributor

The call will be something like this:

DataTypeTransformer will determine that source = PinotDataType.Map and dest = PinotDataType.STRING_ARRAY

We then call dest.convert(source), which enters the following:

STRING_ARRAY {
    @Override
    public String[] convert(Object value, PinotDataType sourceType) {
      return sourceType.toStringArray(value);
    }
  },

So now all we need are toStringArray(), toIntegerArray(), etc. on the source type. Instead of changing these functions for every type, let's just override them for the new types you have introduced: INT_VALUE_MAP, LONG_VALUE_MAP, etc.
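The override pattern being suggested can be sketched as follows; the enum and method names are modeled loosely on PinotDataType and are not the actual implementation:

```java
import java.util.Map;

// Each map-backed source type overrides only the conversions it needs,
// instead of every existing type gaining new branches.
public enum SourceTypeSketch {
  STRING_VALUE_MAP {
    @Override
    public String[] toStringArray(Object value) {
      Object[] maps = (Object[]) value;
      String[] out = new String[maps.length];
      for (int i = 0; i < maps.length; i++) {
        // Each element is a single-entry map wrapping the real value.
        out[i] = ((Map<?, ?>) maps[i]).values().iterator().next().toString();
      }
      return out;
    }
  };

  public abstract String[] toStringArray(Object value);
}
```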

Contributor Author

This can be another way, but it will increase the number of overridden methods for EACH of the new PinotDataTypes.
E.g. if the source type is STRING_VALUE_MAP and the destination type is INTEGER_ARRAY, the source data type has to have an overridden toIntegerArray() method. I have a unit test on that in DataTypeTransformerTest.

@siddharthteotia
Copy link
Contributor

Discussed with @jackjlli and I feel the alternate solution implemented in #5756 is better.

For the current solution:

  • Firstly, I would like to avoid introducing a type now and defer it until Map type support is actually implemented correctly. The alternative PR follows this, and the code change is much cleaner.
  • Secondly, if we want to go ahead with this solution, then we should change the HashMap type to Map, since the latter is more abstract, and make a couple of other changes as suggested in the comments.

@jackjlli
Copy link
Contributor Author

Seems we don't need to handle the nested map for now (the old behavior does not handle that as well). So a simple fix could be before checking value instanceof List, check value instanceof Map and then check if it is single-entry, replace value with the entry value.

The value itself is basically an Object[], where each element is a map. That's why I added the logic under the if (value instanceof Object[]) statement.

@jackjlli
Copy link
Contributor Author

HashMap is not good enough, should use instanceof Map instead

I've tried putting MULTI_VALUE_TYPE_MAP.put(Map.class, PinotDataType.MAP); but it couldn't return the expected value. Maybe using (value instanceof Map) would be better than adding an entry to MULTI_VALUE_TYPE_MAP.

@Jackie-Jiang
Copy link
Contributor

Please take a look at #5760 which is a general fix for this issue.

@jackjlli jackjlli closed this Jul 29, 2020
@jackjlli jackjlli deleted the fix-multi-value-column-from-avro branch July 29, 2020 17:04