NIFI-6295 fix deserialization issues with NiFiRecordSerDe for hive3streaming #3509
gideonkorir wants to merge 1 commit into apache:master from
Conversation
case BYTE:
-    Integer bIntValue = record.getAsInt(fieldName);
-    val = bIntValue == null ? null : bIntValue.byteValue();
+    Integer bIntValue = DataTypeUtils.toInteger(fieldValue, field.getDataType().getFormat());
I see that the direct call to DataTypeUtils is pretty much equivalent to the record.getAs calls (because of the implementation of MapRecord for example), but I personally prefer the use of the Record interface since we're working with Record objects and it's possible (however unlikely) that the underlying implementation will change. Unfortunately that may also mean that getAs returns null (although it apparently doesn't for primitive types) which is why the extra checks are in there. I can appreciate the slight performance increase but I'd prefer we keep the current use of Record methods.
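The null-guarded accessor pattern the reviewer prefers could be sketched like this (a minimal sketch; `Record` below is a hypothetical stand-in for NiFi's Record interface, not the real API):

```java
// Hypothetical stand-in for NiFi's Record interface: only the single
// accessor needed for this example is modeled.
interface Record {
    Integer getAsInt(String fieldName);
}

class ByteFieldExample {
    // getAsInt may return null, so guard before narrowing to byte
    static Byte toByteValue(Record record, String fieldName) {
        Integer bIntValue = record.getAsInt(fieldName);
        return bIntValue == null ? null : bIntValue.byteValue();
    }
}
```

The design point is that coding against the interface keeps the SerDe correct even if the underlying Record implementation changes, at the cost of an explicit null check per field.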
AvroTypeUtil does seem to use DataTypeUtil rather than use the interface directly so was following that and not assuming the MapRecord implementation.
}
Map<String, Integer> fieldPositionMap = null;
try {
    fieldPositionMap = populateFieldPositionMap(record.getSchema(), structTypeInfo, log);
This was previously done in initialize() because it only needed to be done once per flowfile, rather than doing the same work every time a record is deserialized. Is there a reason it needs to be moved into deserialize()?
The problem was that deserialize could call itself if the field value was a nested record, an array/list of records, or a map of key->Record. This bothered me quite a bit, but I didn't know exactly how to handle it. My initial thought was to have a cache (a map, basically) that I populate as I encounter new schema/struct type info. Something to consider: we can have two identical schemas but different positions (imagine an account containing a list of accounts where the column names were re-ordered in the child collection), so the cache will need to handle that.
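The cache idea could be sketched like this (hypothetical names throughout; the `String`/`List` parameters stand in for NiFi's RecordSchema and Hive's StructTypeInfo). The key includes the target struct's column order, so two identical schemas mapped to differently ordered structs get separate entries:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a per-(schema, struct) field-position cache.
class FieldPositionCache {
    private final Map<String, Map<String, Integer>> cache = new HashMap<>();

    Map<String, Integer> positionsFor(String schemaText, List<String> structColumns) {
        // Key on both the schema and the struct's column order, so identical
        // schemas whose target columns are re-ordered do not collide.
        String key = schemaText + "|" + structColumns;
        return cache.computeIfAbsent(key, k -> {
            Map<String, Integer> positions = new HashMap<>();
            for (int i = 0; i < structColumns.size(); i++) {
                positions.put(structColumns.get(i), i);
            }
            return positions;
        });
    }
}
```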
Looks like I over-simplified the NiFiRecordSerDe code while porting from JsonSerDe. The latter had broken up the tasks of getting field names, extracting field values, and possibly recursively calling one or both of those methods for nested structures. I think we should look at JsonSerDe and structure NiFiRecordSerDe likewise. It can result in the fieldPositionMap being built multiple times per flowfile, but like you said I think we need to do it. However, they have a pretty good modular approach to which methods do which things; all we should have to do is map our Record methods to JsonSerDe's JSON parsing methods (nextToken() and such). Thoughts?
I've checked out their code, will refactor to match what they've done. Looks cleaner than what I've got
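The modular, recursive layout being discussed could look roughly like this (a loose sketch of the JsonSerDe-style breakdown, not NiFi's or Hive's actual code; all names are hypothetical): one method per structural case, with nested structures recursing back into the dispatcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: dispatch on the value's shape, recursing for nested
// structs and lists, loosely mirroring JsonSerDe's task breakdown.
class RecursiveDeserializeSketch {
    static Object deserializeValue(Object value) {
        if (value instanceof Map) {
            return deserializeStruct((Map<?, ?>) value);
        }
        if (value instanceof List) {
            return deserializeList((List<?>) value);
        }
        return value;  // primitive: pass through unchanged
    }

    static List<Object> deserializeStruct(Map<?, ?> struct) {
        List<Object> fields = new ArrayList<>();
        for (Object v : struct.values()) {
            fields.add(deserializeValue(v));   // recurse per field
        }
        return fields;
    }

    static List<Object> deserializeList(List<?> list) {
        List<Object> out = new ArrayList<>();
        for (Object v : list) {
            out.add(deserializeValue(v));      // recurse per element
        }
        return out;
    }
}
```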
if (array == null) {
    return null;
}
Object[] array = DataTypeUtils.toArray(fieldValue, field.getFieldName(), field.getDataType());
Do we need to get the array element type rather than passing in field.getDataType()? The other structured data types (List, e.g.) below do that.
Thanks will fix that
Actually based on Hive Binary Design and Hive Types does it make sense to only support ByteArrayRef, byte[] and String via getBytes()?
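Restricting BINARY coercion along those lines could be sketched as follows (a hedged sketch of the idea only, with hypothetical names; Hive's ByteArrayRef is omitted since it can't be stubbed meaningfully here). Anything other than byte[] or String is rejected rather than guessed at:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: coerce a BINARY field value from byte[] or String only.
class BinaryFieldExample {
    static byte[] toBinary(Object fieldValue) {
        if (fieldValue == null) {
            return null;
        }
        if (fieldValue instanceof byte[]) {
            return (byte[]) fieldValue;
        }
        if (fieldValue instanceof String) {
            // assumption: UTF-8 is the intended encoding for String-typed binary fields
            return ((String) fieldValue).getBytes(StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException(
            "Cannot coerce " + fieldValue.getClass().getName() + " to BINARY");
    }
}
```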
assertEquals("test", fields.get(7));
assertEquals("test2", fields.get(8));
assertEquals("c", fields.get(9));
//assertEquals(AvroTypeUtil.convertByteArray(new Object[]{ (byte)1 }).array(), (byte[])fields.get(10));
Should this be included or removed?
I'm thinking it probably should be included; it would have caught the bug you saw.
break;
case MAP:
    val = record.getValue(fieldName);
    // in NiFi all maps are <String, ?> so use that
Actually when running with a live NiFi instance, the maps are being represented as MapRecords not Map<String,Object> but I think we should handle both just in case. We had to fix this in #3424 for the other utilities, so we may need to check if it's an instance of Record or Map and handle them separately.
I took the liberty of starting a branch using this PR as a base and updating NiFiRecordSerDe to use the JsonSerDe approach to recursion as we discussed. I'll post the branch/PR when I'm finished, but wasn't sure if you are also actively working it or not. If so, perhaps we could collaborate or bring in each other's commits or something?
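The Record-or-Map check described above could be sketched like this (a minimal sketch; `NiFiRecord` is a hypothetical stand-in for NiFi's Record/MapRecord, not the real interface):

```java
import java.util.Map;

class MapFieldExample {
    // Hypothetical stand-in for NiFi's Record, which can expose its fields as a Map.
    interface NiFiRecord {
        Map<String, Object> toMap();
    }

    @SuppressWarnings("unchecked")
    static Map<String, Object> asMap(Object val) {
        if (val == null) {
            return null;
        }
        if (val instanceof NiFiRecord) {
            return ((NiFiRecord) val).toMap();   // MapRecord-style value
        }
        if (val instanceof Map) {
            return (Map<String, Object>) val;    // already a Map<String, ?>
        }
        throw new IllegalArgumentException(
            "Unsupported MAP value: " + val.getClass().getName());
    }
}
```

Handling both shapes mirrors the fix referenced from #3424: live flows may deliver MapRecords where unit tests deliver plain Maps, so the SerDe should accept either.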
We're marking this PR as stale due to lack of updates in the past few months. If after another couple of weeks the stale label has not been removed, this PR will be closed. This stale marker and eventual auto-close does not indicate a judgement of the PR, just lack of reviewer bandwidth, and helps us keep the PR queue more manageable. If you would like this PR re-opened, you can do so and a committer can remove the stale tag. Or you can open a new PR. Try to help review other PRs to increase PR review bandwidth, which in turn helps yours.
Thank you for submitting a contribution to Apache NiFi.
Please provide a short description of the PR here:
Description of PR
Fixes deserialization of records in NiFiRecordSerDe and adds the capability to: