
Fix code to correctly extract value of multi-value column from avro file #5746

Closed
wants to merge 4 commits

Conversation

jackjlli
Contributor

@jackjlli jackjlli commented Jul 24, 2020

Description

This PR fixes the issue introduced by PR #5238.

Basically, the value of a multi-value column should be read at the first level instead of digging into the deepest level. The bug causes segment creation to fail because the correct value cannot be extracted from the Avro file.
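As a minimal illustration of the failure mode described above (the class and field names here are hypothetical, not Pinot's): the buggy extraction hands back each multi-value entry as a single-entry map wrapping the primitive, so the later cast to Number throws.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiValueBugSketch {
  // Build the shape the buggy extraction produced: each array element is a
  // single-entry map wrapping the primitive instead of the primitive itself.
  public static Object[] extractedValues() {
    List<Object> values = new ArrayList<>();
    Map<String, Object> wrapped = new HashMap<>();
    wrapped.put("item_id", 42); // the real value, buried one level deep
    values.add(wrapped);        // should have been the plain 42
    return values.toArray();
  }

  public static void main(String[] args) {
    Object first = extractedValues()[0];
    // This is the kind of cast that fails inside PinotDataType.toInteger():
    try {
      Number n = (Number) first;
      System.out.println("unexpected: " + n);
    } catch (ClassCastException e) {
      System.out.println("ClassCastException: a HashMap is not a Number");
    }
  }
}
```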

Tests Done

Before the change:

java.lang.ClassCastException: java.util.HashMap cannot be cast to java.lang.Number
	at org.apache.pinot.core.data.recordtransformer.PinotDataType$11.toInteger(PinotDataType.java:392) ~[classes/:?]
	at org.apache.pinot.core.data.recordtransformer.PinotDataType.toIntegerArray(PinotDataType.java:512) ~[classes/:?]
	at org.apache.pinot.core.data.recordtransformer.PinotDataType$13.convert(PinotDataType.java:441) ~[classes/:?]
	at org.apache.pinot.core.data.recordtransformer.PinotDataType$13.convert(PinotDataType.java:438) ~[classes/:?]
	at org.apache.pinot.core.data.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:100) ~[classes/:?]
	at org.apache.pinot.core.data.recordtransformer.CompositeTransformer.transform(CompositeTransformer.java:74) ~[classes/:?]
	at org.apache.pinot.core.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:70) ~[classes/:?]
	at org.apache.pinot.core.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:38) ~[classes/:?]
	at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:146) ~[classes/:?]
	at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:131) ~[classes/:?]
	at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:93) ~[classes/:?]
	at org.apache.pinot.tools.admin.command.CreateSegmentCommand.lambda$execute$0(CreateSegmentCommand.java:238) ~[classes/:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_172]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]

After the change:

2020/07/23 17:02:15.195 INFO [CreateSegmentCommand] [pool-2-thread-1] Successfully created segment: segmentName_20200722_20200722_0 at directory: ~/sampleData/output/segmentName_20200722_20200722_0
2020/07/23 17:02:15.196 INFO [CreateSegmentCommand] [main] Successfully created 1 segments from data files: [~/sampleData/input/test.avro]

Contributor

@mayankshriv mayankshriv left a comment

Could you please add a unit test?

convertedMap.put(fieldName, convert(record.get(fieldName)));
}
return convertedMap;
return record.get(0);
Contributor

Is record guaranteed to be non-empty?

Contributor Author

Yes, otherwise the if statement (value instanceof GenericData.Record) would have failed.

Contributor

This is going to break the existing functionality of extracting a struct within a record

convertedMap.put(fieldName, convert(record.get(fieldName)));
}
return convertedMap;
return record.get(0);
Contributor

I'm not sure if this is a valid fix. According to the instructions in the comment:

Converts the value to either a single value (string, number, bytebuffer), multi value (Object[]) or a Map
Natively Pinot only understands single values and multi values.
Map is useful only if some ingestion transform functions operates on it in the transformation layer.

Don't we need to follow the above conversion in case of GenericData.Record?

Contributor Author

This is following the conversion here. GenericData.Record is for fetching the value from a single value column.

Contributor

If GenericData.Record is supposed to imply SV, then what is the distinction between the last two branches of this code?

public static Object convert(Object value) {
  if (value == null) {
    return null;
  }
  Object convertedValue;
  if (value instanceof Collection) {
    convertedValue = handleMultiValue((Collection) value);
  } else if (value instanceof Map) {
    convertedValue = handleMap((Map) value);
  } else if (value instanceof GenericData.Record) {
    convertedValue = handleGenericRecord((GenericData.Record) value);
  } else {
    convertedValue = RecordReaderUtils.convertSingleValue(value);
  }
  return convertedValue;
}

Contributor

Also, if the bug is in the extraction of a multi-value column, I don't quite follow how this fix is handling that. Shouldn't we be looking at handleMultiValue()?

Contributor Author

This is the original code to handle SV:

  public static Object handleSingleValue(@Nullable Object value) {
    if (value == null) {
      return null;
    }
    if (value instanceof GenericData.Record) {
      return handleSingleValue(((GenericData.Record) value).get(0));
    }
    return value;
  }

Whether a column is SV or MV is determined from the Pinot schema.

And handleMultiValue() also calls the convert() method for every sub-value of an MV column.
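The recursion being described can be sketched as follows; this loosely mirrors the handleMultiValue()/convert() pair discussed above and is not the actual Pinot code:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class MultiValueSketch {
  // Convert each sub-value of a multi-value column, recursing as needed.
  public static Object[] handleMultiValue(Collection<?> values) {
    List<Object> converted = new ArrayList<>(values.size());
    for (Object value : values) {
      converted.add(convert(value)); // convert() is called per sub-value
    }
    return converted.toArray();
  }

  static Object convert(Object value) {
    if (value instanceof Collection) {
      return handleMultiValue((Collection<?>) value);
    }
    return value; // primitives pass through unchanged in this sketch
  }
}
```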

@npawar
Copy link
Contributor

npawar commented Jul 24, 2020

@jackjlli this doesn't look like the right fix for the issue. For GenericRecord, we should be looking at each field recursively.

Also the AvroRecordExtractorComplexTypesTest will fail with this change.

The code you're introducing makes the assumption that for a field of type GenericRecord in the raw Avro data, we should always only look at the first field. That's not a good assumption, right?

Contributor

@sajjad-moradi sajjad-moradi left a comment

Please add a unit test which fails without your fix. This way you show the validity of the fix.

@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
* Handles the conversion of every field of the GenericRecord
*/
private static Object handleGenericRecord(GenericData.Record record) {
List<Field> fields = record.getSchema().getFields();
Contributor

It seems that the current implementation of this function is doing what it is supposed to.

The caller chain is something like convert(value) -> handleGenericRecord(record). The latter is invoked by the former only if it determines there is nestedness (i.e., the value passed to convert isn't an SV, MV, or Map); it is essentially a struct. So for that we need to recurse and call convert for each field inside GenericData.Record. The fix seems to remove all of that code.

Contributor

@npawar's comment on #5746 also aligns with my observation.

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

The fix should be in the DataTypeTransformer instead of here.
In DataTypeTransformer we should be able to handle the map properly.

@jackjlli
Copy link
Contributor Author

The fix should be in the DataTypeTransformer instead of here.
In DataTypeTransformer we should be able to handle map properly

I've thought about that, but DataTypeTransformer uses a generic class for all kinds of data types.

@@ -87,6 +90,8 @@ public GenericRow transform(GenericRow record) {
source = MULTI_VALUE_TYPE_MAP.get(values[0].getClass());
if (source == null) {
source = PinotDataType.OBJECT_ARRAY;
} else if (source == PinotDataType.HASHMAP) {
Contributor

This should be done conditionally. In other words, consider two columns:

col1 - primitive column defined as MV int/float/double/string (simple array of primitive types that Pinot supports). Something like "CompaniesWorkedAt".

[
     "val1",
     "val2",
     "val3"
]

col2 - complex column defined as MV. Something like addresses, which is an array of struct or an array of map.

[
    {
         "k1" : "v1",
         "k2":  "v2",
         "k3": "v3"
   }
]

The second is a complex column, whereas the first is a standard primitive one. Now, the AvroRecordExtractor and AvroUtils.convert() would have returned the second as an array of Map/HashMap for our sample data:

"dimension_***" : [ {
    "item_id" : {
      "string" : "some data"
    }
  }, {
    "item_id" : {
      "string" : "some data"
    }
  } ]

Now, since the schema for the "dimension_***" column indicates it is an array of primitives, we need to extract the actual value from each HashMap. In such cases the map size happens to be 1, but that alone should not be the condition, since the map size could be 1 even when the object is actually a complex one. So we should convert from Map to primitive only if the schema says so.
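The conditional conversion asked for here could be sketched as follows; the boolean flag is a stand-in for the actual Pinot schema lookup, and the names are illustrative:

```java
import java.util.Map;

public class SchemaAwareUnwrap {
  // Unwrap a single-entry map to its value only when the schema says the
  // column is a primitive multi-value column.
  public static Object maybeUnwrap(Object value, boolean columnIsPrimitiveMV) {
    if (columnIsPrimitiveMV && value instanceof Map) {
      Map<?, ?> map = (Map<?, ?>) value;
      if (map.size() == 1) {
        return map.values().iterator().next(); // the wrapped primitive
      }
    }
    return value; // complex column, or not a map: leave untouched
  }
}
```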

Contributor

See this comment on #5746 for what the algorithm should be.

@@ -470,6 +474,43 @@ public String toString(Object value) {
}
},

HASHMAP,
Contributor

We actually don't have to introduce this type now (in the future we should, when we have a Map type); for now there is no need.

In the transform() method of DataTypeTransformer, we should do:

if (values[0] instanceof Map) {
  // If the schema says the column is primitive, get the primitive type
  // from the Map using the function you have already implemented,
  // getPinotDataTypeFromHashMap(), then call dest.convert().
}

@@ -632,4 +713,22 @@ public static PinotDataType getPinotDataType(FieldSpec fieldSpec) {
"Unsupported data type: " + dataType + " in field: " + fieldSpec.getName());
}
}

public static PinotDataType getPinotDataTypeFromHashMap(Map<Object, Object> map) {
Contributor

Suggest calling it getPrimitivePinotDataTypeFromMap.

HashMap is an implementation detail, so let's not use that. Map is better since it is an interface and suitable for representing an abstract complex type.

@jackjlli jackjlli force-pushed the fix-multi-value-column-from-avro branch from be4be9a to 297d5c5 Compare July 25, 2020 06:28
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

It seems we don't need to handle nested maps for now (the old behavior does not handle them either). So a simple fix could be: before checking value instanceof List, check value instanceof Map, and if it is a single-entry map, replace the value with the entry value.
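The simple fix suggested above could look roughly like this; the class and method names are illustrative, not the actual DataTypeTransformer code:

```java
import java.util.List;
import java.util.Map;

public class SingleEntryMapFix {
  // Before the existing instanceof List check, unwrap a single-entry Map
  // into its entry value, as suggested in the review comment above.
  public static Object normalize(Object value) {
    if (value instanceof Map) {
      Map<?, ?> map = (Map<?, ?>) value;
      if (map.size() == 1) {
        value = map.values().iterator().next(); // replace with entry value
      }
    }
    if (value instanceof List) {
      return ((List<?>) value).toArray(); // existing multi-value path
    }
    return value;
  }
}
```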

@@ -56,6 +56,7 @@
MULTI_VALUE_TYPE_MAP.put(Float.class, PinotDataType.FLOAT_ARRAY);
MULTI_VALUE_TYPE_MAP.put(Double.class, PinotDataType.DOUBLE_ARRAY);
MULTI_VALUE_TYPE_MAP.put(String.class, PinotDataType.STRING_ARRAY);
MULTI_VALUE_TYPE_MAP.put(HashMap.class, PinotDataType.MAP);
Contributor

HashMap is not good enough; we should use instanceof Map instead.

} else if (obj instanceof String) {
return PinotDataType.STRING_MAP;
} else {
throw new IllegalStateException(String.format("'%s' isn't supported in the hash map.", obj.getClass()));
Contributor

How about nested Map? How about single-entry map with multi-valued value (e.g. Object[])?

Contributor Author

Since we don't fully support complex map cases like nested maps and single-entry maps with multi-valued values, we can add the logic once they are supported. We can add a TODO here.

@@ -470,6 +472,43 @@ public String toString(Object value) {
}
},

MAP,

INTEGER_MAP {
Contributor

Let's consider renaming this to INTEGER_VALUE_MAP, and the same for the remaining types.

PinotDataType singleValueType = getSingleValueType();
for (int i = 0; i < length; i++) {
integerArray[i] = singleValueType.toInteger(valueArray[i]);
if (valueArray[0] instanceof Map) {
Contributor

The call will be something like this:

DataTypeTransformer will determine that source = PinotDataType.Map and dest = PinotDataType.STRING_ARRAY

We then call dest.convert(source), which enters the following:

STRING_ARRAY {
    @Override
    public String[] convert(Object value, PinotDataType sourceType) {
      return sourceType.toStringArray(value);
    }
  },

So now all we need are toStringArray(), toIntegerArray(), etc. on the source type. Instead of changing these functions for every type, let's just override them for the new types you have introduced: INT_VALUE_MAP, LONG_VALUE_MAP, etc.
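The override pattern being suggested can be sketched as follows; the enum and method names are modeled loosely on PinotDataType and are not the actual implementation:

```java
import java.util.Map;

// Each map-backed source type overrides only the conversions it needs,
// instead of every existing type gaining new branches.
public enum SourceTypeSketch {
  STRING_VALUE_MAP {
    @Override
    public String[] toStringArray(Object value) {
      Object[] maps = (Object[]) value;
      String[] out = new String[maps.length];
      for (int i = 0; i < maps.length; i++) {
        // Each element is a single-entry map wrapping the real value.
        out[i] = ((Map<?, ?>) maps[i]).values().iterator().next().toString();
      }
      return out;
    }
  };

  public abstract String[] toStringArray(Object value);
}
```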

Contributor Author

This can be another way, but it will increase the number of overridden methods for EACH of the new PinotDataTypes.
E.g. if the source type is STRING_VALUE_MAP and the destination type is INTEGER_ARRAY, the source data type has to have an overridden toIntegerArray() method. I have a unit test on that in DataTypeTransformerTest.

@siddharthteotia
Copy link
Contributor

Discussed with @jackjlli and I feel the alternate solution implemented in #5756 is better.

For the current solution:

  • Firstly, I would like to avoid introducing a type now and defer it until Map type support is actually implemented correctly. The alternative PR follows this, and the code change is much cleaner.
  • Secondly, if we want to go ahead with this solution, then we should change the HashMap type to Map, since the latter is more abstract, and make a couple of other changes as suggested in the comments.

@jackjlli
Copy link
Contributor Author

Seems we don't need to handle the nested map for now (the old behavior does not handle that as well). So a simple fix could be before checking value instanceof List, check value instanceof Map and then check if it is single-entry, replace value with the entry value.

The value itself is basically an Object[], where each element is a map. That's why I added the logic under the if (value instanceof Object[]) statement.

@jackjlli
Copy link
Contributor Author

HashMap is not good enough, should use instanceof Map instead

I've tried putting MULTI_VALUE_TYPE_MAP.put(Map.class, PinotDataType.MAP); but it couldn't return the expected value. Maybe using (value instanceof Map) would be better than adding an entry to MULTI_VALUE_TYPE_MAP.

@Jackie-Jiang
Copy link
Contributor

Please take a look at #5760 which is a general fix for this issue.

@jackjlli jackjlli closed this Jul 29, 2020
@jackjlli jackjlli deleted the fix-multi-value-column-from-avro branch July 29, 2020 17:04