Fix code to correctly extract value of multi-value column from avro file #5746
Conversation
Could you please add a unit test?
convertedMap.put(fieldName, convert(record.get(fieldName)));
}
return convertedMap;
return record.get(0);
Is record guaranteed to be non-empty?
Yes, otherwise the if statement (value instanceof GenericData.Record) would have failed.
This is going to break the existing functionality of extracting a struct within a record
convertedMap.put(fieldName, convert(record.get(fieldName)));
}
return convertedMap;
return record.get(0);
I'm not sure if this is a valid fix. According to the comment instruction,
Converts the value to either a single value (string, number, bytebuffer), multi value (Object[]) or a Map
Natively Pinot only understands single values and multi values.
Map is useful only if some ingestion transform function operates on it in the transformation layer.
Don't we need to follow the above conversion in the case of GenericData.Record?
This is following the conversion here. GenericData.Record is for fetching the value from a single-value column.
If GenericData.Record is supposed to imply SV, then what is the distinction between the last two branches of this code?
public static Object convert(Object value) {
if (value == null) {
return null;
}
Object convertedValue;
if (value instanceof Collection) {
convertedValue = handleMultiValue((Collection) value);
} else if (value instanceof Map) {
convertedValue = handleMap((Map) value);
} else if (value instanceof GenericData.Record) {
convertedValue = handleGenericRecord((GenericData.Record) value);
} else {
convertedValue = RecordReaderUtils.convertSingleValue(value);
}
return convertedValue;
}
Also, if the bug is the extraction of multi-value column, I don't quite follow how this fix is handling that. Shouldn't we be looking at handleMultiValue()?
This is the original code to handle SV:
public static Object handleSingleValue(@Nullable Object value) {
if (value == null) {
return null;
}
if (value instanceof GenericData.Record) {
return handleSingleValue(((GenericData.Record) value).get(0));
}
return value;
}
Whether a column is SV or MV is determined from the Pinot schema. And the handleMultiValue() method would also call the convert() method on every sub-value of an MV column.
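As a minimal sketch of that delegation, with plain Java collections standing in for the Avro types (convert, handleMultiValue, and convertSingleValue here are simplified stand-ins, not the actual Pinot code):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class MultiValueSketch {
    // Simplified stand-in for RecordReaderUtils.convertSingleValue
    static Object convertSingleValue(Object value) {
        return value instanceof Number ? ((Number) value).intValue() : value.toString();
    }

    // convert() dispatches: collections go to handleMultiValue, everything else is SV
    static Object convert(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof Collection) {
            return handleMultiValue((Collection<?>) value);
        }
        return convertSingleValue(value);
    }

    // handleMultiValue() calls convert() on every sub-value, as described above
    static Object[] handleMultiValue(Collection<?> values) {
        Object[] converted = new Object[values.size()];
        int i = 0;
        for (Object v : values) {
            converted[i++] = convert(v);
        }
        return converted;
    }

    public static void main(String[] args) {
        Object[] result = (Object[]) convert(List.of("a", "b"));
        System.out.println(Arrays.toString(result)); // [a, b]
    }
}
```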
@jackjlli this doesn't look like the right fix for the issue. For GenericRecord, we should be looking at each field recursively. Also, the AvroRecordExtractorComplexTypesTest will fail with this change. The code you're introducing assumes that for a field of type GenericRecord in the Avro raw data we should always look only at the first field. That's not a good assumption, right?
Please add a unit test which fails without your fix. This way you show the validity of the fix.
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
* Handles the conversion of every field of the GenericRecord
*/
private static Object handleGenericRecord(GenericData.Record record) {
List<Field> fields = record.getSchema().getFields();
It seems that the current implementation of this function is doing what it is supposed to. The caller chain is something like convert(value) -> handleGenericRecord(record). The latter is invoked by the former only if it determines there is nestedness (i.e., the value passed to convert isn't an SV, MV, or Map; it is essentially a struct). So for that we need to recurse and call convert for each field inside GenericData.Record. The fix seems to be removing all of that code.
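The recursion described above can be sketched like this (the Record class here is a hypothetical stand-in for GenericData.Record, not the Avro class itself):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class StructSketch {
    // Hypothetical stand-in for GenericData.Record: an ordered set of named fields
    static class Record {
        final Map<String, Object> fields = new LinkedHashMap<>();
        Record put(String name, Object value) {
            fields.put(name, value);
            return this;
        }
    }

    // A struct becomes a Map, recursing into convert() for each field
    static Object convert(Object value) {
        if (value instanceof Record) {
            Map<String, Object> converted = new HashMap<>();
            for (Map.Entry<String, Object> e : ((Record) value).fields.entrySet()) {
                converted.put(e.getKey(), convert(e.getValue()));
            }
            return converted;
        }
        return value; // SV passthrough in this sketch
    }

    public static void main(String[] args) {
        System.out.println(convert(new Record().put("a", 1))); // {a=1}
    }
}
```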
The fix should be in the DataTypeTransformer instead of here. In DataTypeTransformer we should be able to handle the map properly.
I've thought about that. But
@@ -87,6 +90,8 @@ public GenericRow transform(GenericRow record) {
source = MULTI_VALUE_TYPE_MAP.get(values[0].getClass());
if (source == null) {
source = PinotDataType.OBJECT_ARRAY;
} else if (source == PinotDataType.HASHMAP) {
This should be done conditionally. In other words, consider two columns:
col1 - primitive column defined as MV int/float/double/string (simple array of primitive types that Pinot supports). Something like "CompaniesWorkedAt".
[
"val1",
"val2",
"val3"
]
col2 - complex column defined as MV; something like addresses, which is an array of structs or an array of maps.
[
{
"k1" : "v1",
"k2": "v2",
"k3": "v3"
}
]
The second is a complex column whereas the first is a standard primitive. Now the AvroRecordExtractor and AvroUtils.convert() would have returned the second as an array of Map/HashMap for our sample data:
"dimension_***" : [ {
"item_id" : {
"string" : "some data"
}
}, {
"item_id" : {
"string" : "some data"
}
}
]
Now since schema for "dimension_***" column indicates it is an array of primitives, we need to extract the actual value from each HashMap. In such cases, it is also true that map size would be 1 but that alone should not be the condition since map size could be 1 even if the object is actually a complex one. So we should convert from Map to primitive only if the schema says so.
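The schema-driven unwrapping described above could look like the following sketch (unwrapIfPrimitive and the boolean flag are hypothetical names; in the real code the decision would come from the Pinot FieldSpec):

```java
import java.util.Map;

public class SchemaDrivenUnwrap {
    // Hypothetical helper: unwrap each single-entry map in an MV column only when
    // the Pinot schema declares the column as an array of primitives
    static Object[] unwrapIfPrimitive(Object[] values, boolean schemaSaysPrimitive) {
        if (!schemaSaysPrimitive) {
            return values; // complex MV column: leave the maps intact
        }
        Object[] unwrapped = new Object[values.length];
        for (int i = 0; i < values.length; i++) {
            Object v = values[i];
            if (v instanceof Map && ((Map<?, ?>) v).size() == 1) {
                // single-entry map wrapping a primitive: take the entry value
                unwrapped[i] = ((Map<?, ?>) v).values().iterator().next();
            } else {
                unwrapped[i] = v;
            }
        }
        return unwrapped;
    }

    public static void main(String[] args) {
        Object[] out = unwrapIfPrimitive(new Object[]{Map.of("string", "some data")}, true);
        System.out.println(out[0]); // some data
    }
}
```

Note that the size-1 check alone is deliberately not the trigger: the schema flag is what distinguishes a wrapped primitive from a genuinely complex single-entry map.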
See this comment on what should be the algorithm here
@@ -470,6 +474,43 @@ public String toString(Object value) {
}
},

HASHMAP,
We actually don't have to introduce this type (in the future we should, when we have a Map type), but for now there is no need.
In the transform() method of DataTypeTransformer, we should do:
if (values[0] instanceof Map) {
// and if the schema says the column is primitive,
// then get the primitive type from the Map using the function you have
// already implemented, getPinotDataTypeFromHashMap, and call dest.convert()
}
@@ -632,4 +713,22 @@ public static PinotDataType getPinotDataType(FieldSpec fieldSpec) {
"Unsupported data type: " + dataType + " in field: " + fieldSpec.getName());
}
}

public static PinotDataType getPinotDataTypeFromHashMap(Map<Object, Object> map) {
Suggest calling it getPrimitivePinotDataTypeFromMap. HashMap is an implementation detail, so let's not use that. Map is better since it is also an interface and suitable for representing an abstract complex type.
Seems we don't need to handle the nested map for now (the old behavior does not handle that either). So a simple fix could be: before checking value instanceof List, check value instanceof Map, and if it is single-entry, replace value with the entry value.
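A minimal sketch of that simple fix (unwrap is a hypothetical helper name; nested maps are intentionally left untouched, matching the old behavior):

```java
import java.util.Map;

public class SingleEntryUnwrap {
    // Before the List check, replace a single-entry Map with its entry value
    static Object unwrap(Object value) {
        if (value instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) value;
            if (map.size() == 1) {
                return map.values().iterator().next();
            }
        }
        return value; // non-maps and multi-entry maps pass through unchanged
    }

    public static void main(String[] args) {
        System.out.println(unwrap(Map.of("string", "v"))); // v
    }
}
```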
@@ -56,6 +56,7 @@
MULTI_VALUE_TYPE_MAP.put(Float.class, PinotDataType.FLOAT_ARRAY);
MULTI_VALUE_TYPE_MAP.put(Double.class, PinotDataType.DOUBLE_ARRAY);
MULTI_VALUE_TYPE_MAP.put(String.class, PinotDataType.STRING_ARRAY);
MULTI_VALUE_TYPE_MAP.put(HashMap.class, PinotDataType.MAP);
HashMap is not good enough; we should use instanceof Map instead.
} else if (obj instanceof String) {
return PinotDataType.STRING_MAP;
} else {
throw new IllegalStateException(String.format("'%s' isn't supported in the hash map.", obj.getClass()));
How about a nested Map? How about a single-entry map with a multi-valued value (e.g. Object[])?
Since we don't completely support complex map cases like nested map and single-entry map with multi-valued value, we can add the logic to it once it's supported. We can add a TODO here.
@@ -470,6 +472,43 @@ public String toString(Object value) {
}
},

MAP,

INTEGER_MAP {
Let's consider renaming this to INTEGER_VALUE_MAP, and the same for the remaining ones.
PinotDataType singleValueType = getSingleValueType();
for (int i = 0; i < length; i++) {
integerArray[i] = singleValueType.toInteger(valueArray[i]);
if (valueArray[0] instanceof Map) {
The call will be something like this:
DataTypeTransformer will determine that source = PinotDataType.MAP and dest = PinotDataType.STRING_ARRAY.
We then call dest.convert(source), which enters the following:
STRING_ARRAY {
@Override
public String[] convert(Object value, PinotDataType sourceType) {
return sourceType.toStringArray(value);
}
},
So now all we need is toStringArray, toIntegerArray etc on source. So instead of changing these functions for everyone, let's just override them for the new type you have introduced - INT_VALUE_MAP, LONG_VALUE_MAP etc.
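An override along those lines could look like this sketch (the class and method names are stand-ins for the proposed map-valued PinotDataType, not actual Pinot code; each element is assumed to be a single-entry map wrapping the real value):

```java
import java.util.Map;

public class MapValueConversion {
    // Hypothetical toStringArray body for a STRING_VALUE_MAP-style type:
    // each element of the incoming Object[] is a single-entry Map whose
    // value is the actual string
    static String[] toStringArray(Object value) {
        Object[] valueArray = (Object[]) value;
        String[] stringArray = new String[valueArray.length];
        for (int i = 0; i < valueArray.length; i++) {
            Map<?, ?> map = (Map<?, ?>) valueArray[i];
            stringArray[i] = map.values().iterator().next().toString();
        }
        return stringArray;
    }

    public static void main(String[] args) {
        String[] out = toStringArray(new Object[]{Map.of("string", "some data")});
        System.out.println(out[0]); // some data
    }
}
```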
This can be another way, but it'll increase the number of overridden methods for EACH of the new PinotDataTypes. E.g. if the source type is STRING_VALUE_MAP and the destination type is INTEGER_ARRAY, the source data type has to have an overridden toIntegerArray() method. I have a unit test on that, which is DataTypeTransformerTest.
Discussed with @jackjlli and I feel the alternate solution implemented in #5756 is better. For the current solution:
The value itself basically is an Object[], where each object is a map. That's why I added the logic under
I've tried to put
Please take a look at #5760 which is a general fix for this issue.
Description
This PR fixes an issue introduced by PR #5238.
Basically, the value of a multi-value column should be taken at the first level instead of digging into the deepest level. The issue causes segment creation to fail to extract the correct value from the Avro file.
Tests Done
Before the change:
After the change: