NIFI-6295 fix deserialization issues with NiFiRecordSerDe for hive3streaming #3509

Closed
gideonkorir wants to merge 1 commit into apache:master from gideonkorir:NIFI-6295

@gideonkorir gideonkorir commented May 31, 2019

Thank you for submitting a contribution to Apache NiFi.

Please provide a short description of the PR here:

Description of PR

Fixes deserialization of records in NiFiRecordSerDe and adds the capability to:

  1. Deserialize nested records
  2. Deserialize array elements including records
  3. Deserialize map elements

@gideonkorir gideonkorir changed the title fix deserialization issues with NiFiRecordSerDe for hive3streaming NIFI-6295 fix deserialization issues with NiFiRecordSerDe for hive3streaming May 31, 2019
case BYTE:
Integer bIntValue = record.getAsInt(fieldName);
val = bIntValue == null ? null : bIntValue.byteValue();
Integer bIntValue = DataTypeUtils.toInteger(fieldValue, field.getDataType().getFormat());
Contributor

I see that the direct call to DataTypeUtils is pretty much equivalent to the record.getAs calls (because of the implementation of MapRecord for example), but I personally prefer the use of the Record interface since we're working with Record objects and it's possible (however unlikely) that the underlying implementation will change. Unfortunately that may also mean that getAs returns null (although it apparently doesn't for primitive types) which is why the extra checks are in there. I can appreciate the slight performance increase but I'd prefer we keep the current use of Record methods.

Contributor Author

AvroTypeUtil does seem to use DataTypeUtils rather than the Record interface directly, so I was following that and not assuming the MapRecord implementation.

}
Map<String, Integer> fieldPositionMap = null;
try {
fieldPositionMap = populateFieldPositionMap(record.getSchema(), structTypeInfo, log);
Contributor

This was previously done in initialize() because it only needed to be done once per flowfile, rather than doing the same work every time a record is deserialized. Is there a reason it needs to be moved into deserialize()?

Contributor Author

The problem was that deserialize could call itself if the field value was a nested record, an array/list of records, or a map of key->Record. This bothered me quite a bit, but I didn't know exactly how to handle it. My initial thought was to have a cache (basically a map) that I populate as I encounter new schema/struct type info. Something to consider: we can have two identical schemas with different field positions (imagine an account containing a list of accounts where the column names were re-ordered in the child collection), so the cache will need to handle that.
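
A minimal sketch of the cache idea described above, not code from this PR. It assumes the cache key can be the ordered record field names together with the ordered struct column names, so that two schemas with the same fields in a different order get separate entries. `FieldPositionCache`, `positionsFor`, and `computations` are hypothetical names, and plain string lists stand in for NiFi's RecordSchema and Hive's StructTypeInfo.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: caches field-position maps per (record fields, struct columns) pair.
final class FieldPositionCache {
    // Key is a pair of ordered name lists; List.equals is order-sensitive,
    // so re-ordered columns produce a different key.
    private final Map<List<List<String>>, Map<String, Integer>> cache = new HashMap<>();
    private int computations = 0;

    Map<String, Integer> positionsFor(List<String> recordFields, List<String> structFields) {
        List<List<String>> key = new ArrayList<>();
        key.add(new ArrayList<>(recordFields));
        key.add(new ArrayList<>(structFields));

        Map<String, Integer> cached = cache.get(key);
        if (cached == null) {
            cached = compute(recordFields, structFields);
            cache.put(key, cached);
            computations++;
        }
        return cached;
    }

    // Maps each record field name to the index of the matching struct column
    // (case-insensitive match is an assumption of this sketch).
    private static Map<String, Integer> compute(List<String> recordFields, List<String> structFields) {
        Map<String, Integer> positions = new HashMap<>();
        for (int i = 0; i < structFields.size(); i++) {
            for (String recordField : recordFields) {
                if (recordField.equalsIgnoreCase(structFields.get(i))) {
                    positions.put(recordField, i);
                    break;
                }
            }
        }
        return positions;
    }

    // Number of times a position map was actually computed (cache misses).
    int computations() {
        return computations;
    }
}
```

With a cache along these lines, a recursive deserialize call on a nested record would pay for the position lookup once per distinct schema/struct pairing rather than once per record.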

Contributor

Looks like I over-simplified the NiFiRecordSerDe code while porting from JsonSerDe. The latter had broken up the tasks of getting field names, extracting field values, and possibly recursively calling one or both of those methods for nested structures. I think we should look at JsonSerDe and structure NiFiRecordSerDe likewise. It can result in that fieldPositionMap being computed multiple times per flow file, but like you said, I think we need to do it. However, they have a pretty good modular approach to which methods do which things; all we should have to do is map our Record methods to JsonSerDe's JSON parsing methods (nextToken() and such). Thoughts?
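
A rough sketch of the modular, recursive shape described above: one method walks a struct's fields, another extracts a single value and recurses for nested structs, lists, and maps. This is not JsonSerDe's or NiFiRecordSerDe's actual code. `TypeDesc` is a hypothetical, simplified stand-in for Hive's TypeInfo hierarchy, and nested records are modeled as `Map<String, Object>` so the example compiles without NiFi or Hive on the classpath.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified type descriptor (stand-in for Hive's TypeInfo classes).
final class TypeDesc {
    enum Kind { PRIMITIVE, STRUCT, LIST, MAP }

    final Kind kind;
    final List<String> fieldNames; // STRUCT only: column names in Hive order
    final List<TypeDesc> children; // STRUCT: one per field; LIST: element type; MAP: value type

    private TypeDesc(Kind kind, List<String> fieldNames, List<TypeDesc> children) {
        this.kind = kind;
        this.fieldNames = fieldNames;
        this.children = children;
    }

    static TypeDesc primitive() {
        return new TypeDesc(Kind.PRIMITIVE, new ArrayList<String>(), new ArrayList<TypeDesc>());
    }

    static TypeDesc struct(List<String> names, List<TypeDesc> types) {
        return new TypeDesc(Kind.STRUCT, names, types);
    }

    static TypeDesc list(TypeDesc element) {
        List<TypeDesc> child = new ArrayList<>();
        child.add(element);
        return new TypeDesc(Kind.LIST, new ArrayList<String>(), child);
    }

    static TypeDesc map(TypeDesc value) {
        List<TypeDesc> child = new ArrayList<>();
        child.add(value);
        return new TypeDesc(Kind.MAP, new ArrayList<String>(), child);
    }
}

final class RecursiveExtractor {
    // Extracts one value, recursing into nested structures as needed.
    static Object extract(Object value, TypeDesc type) {
        if (value == null) {
            return null;
        }
        switch (type.kind) {
            case STRUCT:
                return extractStruct((Map<?, ?>) value, type);
            case LIST: {
                List<Object> out = new ArrayList<>();
                for (Object element : (List<?>) value) {
                    out.add(extract(element, type.children.get(0)));
                }
                return out;
            }
            case MAP: {
                Map<String, Object> out = new LinkedHashMap<>();
                for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                    out.put(String.valueOf(e.getKey()), extract(e.getValue(), type.children.get(0)));
                }
                return out;
            }
            default:
                return value; // primitives pass through unchanged in this sketch
        }
    }

    // Builds a row ordered by the struct's column names; missing fields become null.
    static List<Object> extractStruct(Map<?, ?> record, TypeDesc structType) {
        List<Object> row = new ArrayList<>();
        for (int i = 0; i < structType.fieldNames.size(); i++) {
            Object fieldValue = record.get(structType.fieldNames.get(i));
            row.add(extract(fieldValue, structType.children.get(i)));
        }
        return row;
    }
}
```

In the real SerDe the primitive branch would presumably do the per-type conversions (BYTE, and so on) seen in the diff, which this sketch leaves out.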

Contributor Author

gideonkorir, Aug 20, 2019

I've checked out their code and will refactor to match what they've done. It looks cleaner than what I've got.

if (array == null) {
return null;
}
Object[] array = DataTypeUtils.toArray(fieldValue, field.getFieldName(), field.getDataType());
Contributor

Do we need to get the array element type rather than passing in field.getDataType()? The other structured data types (List, e.g.) below do that.

Contributor Author

Thanks, will fix that.

Contributor Author

Actually, based on Hive Binary Design and Hive Types, does it make sense to only support ByteArrayRef, byte[] and String via getBytes()?

assertEquals("test", fields.get(7));
assertEquals("test2", fields.get(8));
assertEquals("c", fields.get(9));
//assertEquals(AvroTypeUtil.convertByteArray(new Object[]{ (byte)1 }).array(), (byte[])fields.get(10));
Contributor

Should this be included or removed?

Contributor Author

I think it should probably be included; it would have caught the bug you saw.

break;
case MAP:
val = record.getValue(fieldName);
//in nifi all maps are <String,?> so use that
Contributor

Actually when running with a live NiFi instance, the maps are being represented as MapRecords not Map<String,Object> but I think we should handle both just in case. We had to fix this in #3424 for the other utilities, so we may need to check if it's an instance of Record or Map and handle them separately.

I took the liberty of starting a branch using this PR as a base and updating NiFiRecordSerDe to use the JsonSerDe approach to recursion as we discussed. I'll post the branch/PR when I'm finished, but I wasn't sure whether you are also actively working on it. If so, perhaps we could collaborate or bring in each other's commits?
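
The Record-or-Map check suggested in this comment could look roughly like the sketch below. It is illustrative only: `SimpleRecord` is a hypothetical stand-in for NiFi's Record interface (so the example compiles on its own), and `MapValues.toStringKeyedMap` is an invented helper name, not something from this PR or from #3424.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for NiFi's Record interface.
interface SimpleRecord {
    List<String> fieldNames();

    Object getValue(String fieldName);
}

final class MapValues {
    // Normalizes a MAP field value to a String-keyed map, whether it arrives
    // as a record-like object or as a plain java.util.Map.
    static Map<String, Object> toStringKeyedMap(Object value) {
        if (value == null) {
            return null;
        }
        Map<String, Object> result = new LinkedHashMap<>();
        if (value instanceof SimpleRecord) {
            SimpleRecord record = (SimpleRecord) value;
            for (String name : record.fieldNames()) {
                result.put(name, record.getValue(name));
            }
            return result;
        }
        if (value instanceof Map) {
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                // Keys are stringified, following the "all maps are <String,?>" note in the diff.
                result.put(String.valueOf(entry.getKey()), entry.getValue());
            }
            return result;
        }
        throw new IllegalArgumentException(
                "Expected a record or a map but got " + value.getClass().getName());
    }
}
```

Failing loudly on any other type is a design choice of this sketch; returning null or the raw value would be equally plausible depending on how the SerDe reports errors.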

@github-actions

We're marking this PR as stale due to lack of updates in the past few months. If the stale label has not been removed after another couple of weeks, this PR will be closed. This stale marker and eventual auto-close do not indicate a judgement of the PR, just a lack of reviewer bandwidth, and they help us keep the PR queue more manageable. If you would like this PR re-opened, you can do so and a committer can remove the stale tag, or you can open a new PR. Try to help review other PRs to increase PR review bandwidth, which in turn helps yours.

@github-actions github-actions bot added the Stale label Apr 25, 2021
@github-actions github-actions bot closed this May 11, 2021