
NIFI-10442: Create PutIceberg processor #6368

Closed · wants to merge 8 commits

Conversation

@mark-bathori (Contributor) commented Sep 6, 2022

Summary

NIFI-10442

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 8
    • JDK 11
    • JDK 17

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@turcsanyip (Contributor) left a comment:

@mark-bathori Thanks for implementing the Iceberg integration processor for NiFi! It looks interesting!

I managed to set up the processor quite easily to load data into a simple table. So the configuration seems to be straightforward. However, I found some improvement points like using Title Case for properties, fixing typos in descriptions, etc.

I also looked into the poms and added a couple of comments about provided vs compile dependencies, scope inherited from dependency management, and also found some unused or duplicated dependencies. Please see my comments below.

Will continue with testing and reviewing the advanced cases: data conversions, data types, partitions, etc.

@nandorsoma (Contributor) left a comment:

Hi @mark-bathori!
Thank you for working this feature out! I'm sure you put a lot of work into it!

To be honest, I had issues with the testing, but it is probably due to my lack of knowledge of the Hive world. When I tried to put data into a simple table, as @turcsanyip mentioned, I got an error that the table is not an Iceberg table:

Component is invalid: 'Component' is invalid because Failed to perform validation due to org.apache.iceberg.exceptions.NoSuchIcebergTableException: Not an iceberg table: hive-catalog.nifitest.icebergtable (type=null)

When I tried to create an Iceberg table with CREATE TABLE nifitest.ib1 (id int, name string) STORED BY ICEBERG; I got the following exception from Hive:

Error: Error while compiling statement: FAILED: ParseException line 1:65 mismatched input 'ICEBERG' expecting StringLiteral near 'BY' in table file format specification (state=42000,code=40000)

As I said, I think this is a configuration issue on my side, and I don't think it is related to the code. I will continue to figure that out. Nevertheless, I've found a few things in the code that I'd like to share with you below.

Update:
Using STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' instead of just STORED BY ICEBERG solves my issue.

@turcsanyip (Contributor) left a comment:

@mark-bathori I added some more comments. Please also rebase your branch to main because there are merge conflicts between the branches.

@mark-bathori (Contributor, author) commented:

Thanks @turcsanyip and @nandorsoma for the review comments. I tried to fix all of the mentioned issues in my current commit. I’ve also added a couple of new elements to my PR:

  • added file format and target file size properties to the Put processor
  • added the option to provide a Hadoop configuration file in the Catalog Services
  • removed the V2 table format restriction since it felt unnecessary after looking into it more

@turcsanyip (Contributor) commented:

@mark-bathori Thanks for the latest changes! I marked my review comments as resolved and also tested the file format property and V1 tables.

I looked into the data conversion code, and currently there are 3 separate implementations for Avro, Parquet and ORC conversions using the "low level" Iceberg API to write these data files.
However, the Iceberg API also has a GenericRecord implementation that could be used for the conversion. So we would convert NiFi's Record object to Iceberg's GenericRecord object once, and then Iceberg would convert the GenericRecord to Avro, Parquet or ORC, because it already knows how to do that.
It would mean a more robust and maintainable code on our side.
Do you think it makes sense?

@mark-bathori force-pushed the NIFI-10442-iceberg branch 2 times, most recently from 968cb29 to 34d974f on September 27, 2022 14:30
@nandorsoma (Contributor) left a comment:

Hey @mark-bathori! Thank you for the additional changes. I've found a few things in the new code and also commented on the outdated ones where I thought they would need your attention.

@nandorsoma (Contributor) left a comment:

Somehow I forgot to add this one to my previous set of review comments.

@nandorsoma (Contributor) left a comment:

Hey @mark-bathori!
Thanks for the additional changes. I've found a few minor issues, but I'm still uncertain about handling the primitive types in the IcebergRecordConverter. Please see my inline comments.

@nandorsoma (Contributor) left a comment:

Thank you, @mark-bathori! PR looks good to me! Once CI jobs pass, I'm fine with merging.

@turcsanyip (Contributor) left a comment:

@mark-bathori Thanks for adding this new processor!
@nandorsoma Thanks for your review!

+1 LGTM
Merging to main.

@asfgit closed this in e87bced on Oct 11, 2022
emiliosetiadarma pushed a commit to emiliosetiadarma/nifi that referenced this pull request Oct 11, 2022
This closes apache#6368.

Signed-off-by: Peter Turcsanyi <turcsanyi@apache.org>
emiliosetiadarma and p-kimberley pushed further commits to their forks referencing this pull request on Oct 13 and Oct 15, 2022.
@abdelrahim-ahmad commented:

Hi guys, thanks for this great feature. I've installed NiFi 1.20.0 but am not able to find this new processor, PutIceberg. Why is that? Is there anything I should consider?

@pvillard31 (Contributor) commented:

Those NARs are not part of the convenience binary due to ASF rules around the binary size. You can download the NARs from the Maven Central repository and drop them into NiFi:
https://mvnrepository.com/artifact/org.apache.nifi/nifi-iceberg-processors-nar
https://mvnrepository.com/artifact/org.apache.nifi/nifi-iceberg-services-nar
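
Fetching the NARs can be scripted. A minimal sketch follows; the version number, extensions directory, and Maven Central path layout are assumptions to adjust for your deployment (it only prints the download commands so you can review them before running):

```shell
# Sketch: construct download commands for the Iceberg NARs from Maven Central.
# NIFI_VERSION and EXT_DIR are assumptions -- match them to your NiFi release
# and extensions directory, then run the printed curl commands.
NIFI_VERSION=1.20.0
EXT_DIR=/opt/nifi/nifi-current/extensions
BASE_URL=https://repo1.maven.org/maven2/org/apache/nifi
for nar in nifi-iceberg-processors-nar nifi-iceberg-services-api-nar nifi-iceberg-services-nar; do
  echo "curl -fSL -o ${EXT_DIR}/${nar}-${NIFI_VERSION}.nar ${BASE_URL}/${nar}/${NIFI_VERSION}/${nar}-${NIFI_VERSION}.nar"
done
```

NiFi picks up NARs dropped into the extensions directory (a restart is the safe option if autoloading does not kick in).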

@abdelrahim-ahmad commented:

Thank you very much, I was not aware of the NARs. Is there any chance for this to be included in upcoming versions by default, or will it stay as a separate NAR?

@pvillard31 (Contributor) commented:

It'll very likely remain like this. Many components are just provided as NARs via Maven Central. We have to keep the binary size as small as possible.

@AbdelrahimKA commented Apr 10, 2023:

Hi @pvillard31, thanks for your help!
Thanks @mark-bathori for this processor.
I've added this processor and all is good, but when I configure the processor I get this error:


PutIceberg[id=018710cc-a9b6-1854-2954-e2178e328ca7] Failed to load table from catalog: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
- Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found


Files added to the extensions directory (/srv/nifi/extensions, mounted at /opt/nifi/nifi-current/extensions):
nifi-iceberg-processors-nar-1.20.0.nar
nifi-iceberg-services-api-nar-1.20.0.nar
nifi-iceberg-services-nar-1.20.0.nar

NiFi version: 1.20
S3 file system: MinIO
Hive Catalog service configured and enabled.
core-site.xml:

<configuration>
<property>
	<name>hive.metastore.warehouse.dir</name>
	<value>s3a://icebergdb/</value>
</property>    
<property>
	<name>fs.s3a.connection.ssl.enabled</name>
	<value>false</value>
	<description>Enables or disables SSL connections to S3.</description>
</property>

<property>
	<name>fs.s3a.aws.credentials.provider</name>
	<value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>

<property>
	<name>fs.s3a.endpoint</name>
	<value>http://x.x.x.x:9900</value>
</property>

<property>
	<name>fs.s3a.access.key</name>
	<value>xxxxxxxxxxxxxxxx</value>
</property>

<property>
	<name>fs.s3a.secret.key</name>
	<value>xxxxxxxxxxxxxxxxxxxx</value>
</property>

<property>
	<name>fs.s3a.path.style.access</name>
	<value>true</value>
</property>

</configuration>

Do I need to add an additional NAR file to work with the S3 file system?

Thanks and Best regards
Abdelrahim Ahmad

@mark-bathori (Contributor, author) commented:

@AbdelrahimKA The base NAR files don't contain any cloud-specific dependencies due to their size.
You need to make a custom build of nifi-iceberg-bundle using either the include-hadoop-aws or the include-hadoop-cloud-storage profile to be able to use the processor with S3. The include-hadoop-aws profile contains only S3-specific dependencies, while the include-hadoop-cloud-storage profile will additionally include Azure- and GCP-related dependencies.

@MohamedAdelHsn commented May 11, 2023:

Hi @AbdelrahimKA,
I hope you are doing well. Did you find any solution for that issue?

@AbdelrahimKA commented May 17, 2023:

Hi @MohamedAdelHsn,
Unfortunately, there is no way right now to send data directly to the open table formats (Delta Lake, Hudi or Iceberg) from NiFi.
I've tried this with other processors like PutDatabaseRecord, but PutDatabaseRecord has its auto-commit feature disabled, so you cannot even use it with Trino or Dremio.
The only way, which is not recommended, is to use the PutSQL processor, which needs some pre-processing steps before sending the data to Trino or Dremio.
I have opened a Jira ticket (11449); hopefully someone will consider it.

Hope this helps you.
Regards

@quydx commented Sep 7, 2023:

(quoting @mark-bathori's reply above about building with the include-hadoop-aws or include-hadoop-cloud-storage profiles)

@mark-bathori Could you please give more details about this? How can we build the NAR file from nifi-iceberg-bundle using the include-hadoop-aws profile? Thanks

@Andy1510 commented Jun 4, 2024:

@quydx I managed to do so by cloning the nifi repo, then running the following Maven build command:

mvn clean install -Pinclude-hadoop-aws -pl nifi-nar-bundles/nifi-iceberg-bundle/nifi-iceberg-processors-nar -am -T2C

9 participants