NIFI-10888: When inferring a schema using a Record Reader, buffer up … by markap14 · Pull Request #6725 · apache/nifi

markap14 · 2022-11-28T18:40:06Z

…to 1 MB of FlowFile content for the schema inference so that when we read the contents to obtain records we can use the buffered data. This helps in cases of small FlowFiles by not having to seek back to the beginning of the FlowFile every time.
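
As a rough illustration of the idea described above, the following sketch marks a buffered stream, reads the start of the content for "inference", then resets so the record-reading pass reuses the buffered bytes instead of reopening the content. This is not the code from this pull request: the class name, the helper methods, and the `MAX_BUFFER_BYTES` constant are hypothetical, and a plain `java.io.BufferedInputStream` stands in for whatever NiFi uses internally.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names are illustrative and not taken from NiFi.
class BufferedInferenceSketch {
    // Assumed limit, mirroring the 1 MB figure in the description.
    static final int MAX_BUFFER_BYTES = 1024 * 1024;

    // Stand-in for schema inference: reads the first line (the CSV header).
    static String readHeader(InputStream in) throws IOException {
        StringBuilder header = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            header.append((char) b);
        }
        return header.toString();
    }

    // Stand-in for record reading: skips the header, then counts non-empty lines.
    static int countDataLines(InputStream in) throws IOException {
        readHeader(in);
        int count = 0;
        boolean sawChar = false;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                if (sawChar) {
                    count++;
                }
                sawChar = false;
            } else {
                sawChar = true;
            }
        }
        return sawChar ? count + 1 : count;
    }

    static String describe(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(MAX_BUFFER_BYTES);        // remember the start of the content
        String header = readHeader(in);   // "inference" pass
        in.reset();                       // rewind inside the buffer, no re-open or seek
        int records = countDataLines(in); // "record reading" pass over the same bytes
        return header + " -> " + records + " records";
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "name, age\nAa, 1\nBb, 2\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(describe(new ByteArrayInputStream(content)));
    }
}
```

In this sketch, `reset()` throws an `IOException` if the first pass has read past the mark limit, so content larger than the limit would still need a fallback such as reopening the stream.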

Summary

NIFI-00000

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Apache NiFi Jira issue created

Pull Request Tracking

Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

Pull Request based on current revision of the main branch
Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

Build completed using mvn clean install -P contrib-check
- JDK 8
- JDK 11
- JDK 17

Licensing

New dependencies are compatible with the Apache License 2.0 according to the License Policy
New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

Documentation formatting appears as expected in rendered files

NissimShiman · 2022-12-09T19:16:46Z

This is a really nice enhancement.

Tested converting CSV files to JSON files before and after the fix,
using the setup from the Jira (i.e. GenerateFlowFile -> UpdateRecord [with 4 concurrent threads]):
for CSV files of 21 bytes each, using 10000 files in the queue, there was a 25-30% speedup in processing (19-20 seconds down to 15 seconds)
for CSV files of 63 bytes each, using 10000 files, there was a 25-30% speedup in processing (19-20 seconds down to 15 seconds)
for CSV files of 549 bytes each, using 10000 files, I saw a 33%+ speedup (37 seconds down to 23 seconds)

For larger files, I set up GetFile -> UpdateRecord:
for CSV files of 2.3 MB each, using 100 files, it took 33 seconds (both before and after)
for CSV files of 23 MB each, using 20 files, it took 64 seconds (both before and after)
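
For readers who want to turn the timings above into percentages, here is a small sketch of the arithmetic. A "speedup" can be read either as a reduction in elapsed time or as a gain in throughput, and the two give different numbers. The comment does not say which convention it used, so both are computed here. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper for turning before/after timings into percentages.
class SpeedupSketch {
    // Percentage by which elapsed time shrank: (before - after) / before.
    static double timeReductionPercent(double beforeSeconds, double afterSeconds) {
        return (beforeSeconds - afterSeconds) / beforeSeconds * 100.0;
    }

    // Percentage by which throughput grew: before / after - 1.
    static double throughputGainPercent(double beforeSeconds, double afterSeconds) {
        return (beforeSeconds / afterSeconds - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Before/after timings in seconds, taken from the measurements above.
        double[][] timings = { {19, 15}, {20, 15}, {37, 23} };
        for (double[] t : timings) {
            System.out.println(String.format(Locale.ROOT,
                    "%.0f s -> %.0f s: time reduced by %.1f%%, throughput up by %.1f%%",
                    t[0], t[1],
                    timeReductionPercent(t[0], t[1]),
                    throughputGainPercent(t[0], t[1])));
        }
    }
}
```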

This was compiled and run with OpenJDK 1.8.0_332
on Linux kernel 3.10.0-1160.

The CSV files had this format:
name, age
Aa, 1
Bb, 2
etc.

All referenced processors and controller services used their defaults.
UpdateRecord had the following properties and values:
Record Reader: CSVReader
Record Writer: JsonRecordSetWriter
Replacement Value Strategy: Literal Value
/name: NewName

mattyb149 · 2022-12-13T15:32:24Z

Reviewing...

mattyb149 · 2022-12-14T17:10:01Z

+1 LGTM, thanks for the improvement! Merging to main

…to 1 MB of FlowFile content for the schema inference so that when we read the contents to obtain records we can use the buffered data. This helps in cases of small FlowFiles by not having to seek back to the beginning of the FlowFile every time.

Signed-off-by: Matthew Burgess <mattyb149@apache.org>

This closes apache#6725

mattyb149 closed this in 78be613 Dec 14, 2022

markap14 wants to merge 1 commit into apache:main from markap14:NIFI-10888
