
NIFI-10888: When inferring a schema using a Record Reader, buffer up …#6725

Closed
markap14 wants to merge 1 commit intoapache:mainfrom
markap14:NIFI-10888

Conversation

@markap14
Contributor

…to 1 MB of FlowFile content for the schema inference so that when we read the contents to obtain records we can use the buffered data. This helps in cases of small FlowFiles by not having to seek back to the beginning of the FlowFile every time.
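The buffering idea described above can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: it uses the JDK's `BufferedInputStream.mark`/`reset` as a stand-in for the buffering, and the class and method names (`BufferedInferenceSketch`, `inferFieldNames`, `inferThenRead`) are hypothetical. "Schema inference" is reduced to reading a CSV header line, and content larger than the buffer is not handled.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the NiFi code base.
class BufferedInferenceSketch {
    // Assumed buffer limit, mirroring the "up to 1 MB" in the description.
    static final int MAX_BUFFER_BYTES = 1024 * 1024;

    // Stand-in for schema inference: treat the first CSV line as field names.
    static List<String> inferFieldNames(InputStream in) throws IOException {
        ByteArrayOutputStream header = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            header.write(b);
        }
        List<String> names = new ArrayList<>();
        for (String name : header.toString("UTF-8").split(",")) {
            names.add(name.trim());
        }
        return names;
    }

    // Infer the field names, then rewind within the buffer and read records
    // from the same stream instead of reopening the content.
    static List<Map<String, String>> inferThenRead(InputStream content) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(content);
        buffered.mark(MAX_BUFFER_BYTES);   // remember the start of the content
        List<String> fieldNames = inferFieldNames(buffered);
        buffered.reset();                  // rewind; throws if the mark was invalidated

        BufferedReader reader =
                new BufferedReader(new InputStreamReader(buffered, StandardCharsets.UTF_8));
        reader.readLine();                 // skip the header line
        List<Map<String, String>> records = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;                  // ignore blank lines
            }
            String[] values = line.split(",");
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < fieldNames.size() && i < values.length; i++) {
                record.put(fieldNames.get(i), values[i].trim());
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = "name, age\nAa, 1\nBb, 2\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(inferThenRead(new ByteArrayInputStream(csv)));
    }
}
```

In the sketch, the underlying stream is read only once; the second pass is served from memory, which is where the benefit for small FlowFiles would be expected to come from.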

Summary

NIFI-00000

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 8
    • JDK 11
    • JDK 17

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@NissimShiman
Contributor

NissimShiman commented Dec 9, 2022

This is a really nice enhancement.

Tested converting CSV files to JSON before and after the fix,
using the setup from the Jira (i.e. GenerateFlowFile -> UpdateRecord [with 4 concurrent threads]):
for CSV files of 21 bytes, with 10000 files in the queue, there was a 25-30% speedup in processing (19-20 seconds down to 15 seconds)
for CSV files of 63 bytes, with 10000 files, there was a 25-30% speedup in processing (19-20 seconds down to 15 seconds)
for CSV files of 549 bytes each, with 10000 files, I saw a 33%+ speedup (37 seconds down to 23 seconds)

For larger files, I set up GetFile -> UpdateRecord:
for CSV files of 2.3 MB each, with 100 files, it took 33 seconds (both before and after)
for CSV files of 23 MB each, with 20 files, it took 64 seconds (both before and after)

This was compiled and run with OpenJDK 1.8.0_332
on Linux kernel 3.10.0-1160

The CSV files were in the format:
name, age
Aa, 1
Bb, 2
etc.

All referenced processors/controller services used their defaults.
UpdateRecord had the following properties/values:
Record Reader/CSVReader
Record Writer/JsonRecordSetWriter
Replacement Value Strategy/Literal Value
/name/NewName

@mattyb149
Contributor

Reviewing...

@mattyb149
Contributor

+1 LGTM, thanks for the improvement! Merging to main

@mattyb149 mattyb149 closed this in 78be613 Dec 14, 2022
lizhizhou pushed a commit to lizhizhou/nifi that referenced this pull request Jan 2, 2023
…to 1 MB of FlowFile content for the schema inference so that when we read the contents to obtain records we can use the buffered data. This helps in cases of small FlowFiles by not having to seek back to the beginning of the FlowFile every time.

Signed-off-by: Matthew Burgess <mattyb149@apache.org>

This closes apache#6725