
NIFI-10442: Create PutIceberg processor #6368

Closed · wants to merge 8 commits

Conversation

@mark-bathori (Contributor) commented Sep 6, 2022

Summary

NIFI-10442

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 8
    • JDK 11
    • JDK 17

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@turcsanyip (Contributor) left a comment:

@mark-bathori Thanks for implementing the Iceberg integration processor for NiFi! It looks interesting!

I managed to set up the processor quite easily to load data into a simple table. So the configuration seems to be straightforward. However, I found some improvement points like using Title Case for properties, fixing typos in descriptions, etc.

I also looked into the poms and added a couple of comments about provided vs compile dependencies, scope inherited from dependency management, and also found some unused or duplicated dependencies. Please see my comments below.

Will continue with testing and reviewing the advanced cases: data conversions, data types, partitions, etc.

@nandorsoma (Contributor) left a comment:

Hi @mark-bathori!
Thank you for working this feature out! I'm sure you put a lot of work into it!

To be honest, I had issues with the testing, but it is probably due to my lack of knowledge of the Hive world. When I tried to put data into a simple table, as @turcsanyip mentioned, I got an error that the table is not an Iceberg table:

Component is invalid: 'Component' is invalid because Failed to perform validation due to org.apache.iceberg.exceptions.NoSuchIcebergTableException: Not an iceberg table: hive-catalog.nifitest.icebergtable (type=null)

When I tried to create an Iceberg table with CREATE TABLE nifitest.ib1 (id int, name string) STORED BY ICEBERG; I got the following exception from Hive:

Error: Error while compiling statement: FAILED: ParseException line 1:65 mismatched input 'ICEBERG' expecting StringLiteral near 'BY' in table file format specification (state=42000,code=40000)

As I said, I think this is a configuration issue on my side, and I don't think it is related to the code. I will continue to figure that out. Nevertheless, I've found a few things in the code that I'd like to share with you below.

Update:
Using STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' instead of just STORED BY ICEBERG solves my issue.

@turcsanyip (Contributor) left a comment:

@mark-bathori I added some more comments. Please also rebase your branch to main because there are merge conflicts between the branches.

@mark-bathori (Contributor, author) commented:

Thanks @turcsanyip and @nandorsoma for the review comments. I tried to fix all of the mentioned issues in my current commit. I’ve also added a couple of new elements to my PR:

  • added file format and target file size properties to the Put processor
  • added the option to provide a Hadoop configuration file in the Catalog Services
  • removed the V2 table format restriction since it felt unnecessary after looking into it more

@turcsanyip (Contributor) commented:

@mark-bathori Thanks for the latest changes! I marked my review comments as resolved and also tested the file format property and V1 tables.

I looked into the data conversion code, and currently there are 3 separate implementations for Avro, Parquet and ORC conversions using the "low level" Iceberg API to write these data files.
However, the Iceberg API also has a GenericRecord implementation that could be used for the conversion. So we would convert NiFi's Record object to Iceberg's GenericRecord object once, and then Iceberg would convert the GenericRecord to Avro, Parquet or ORC, because it already knows how to do that.
It would mean a more robust and maintainable code on our side.
Do you think it makes sense?

@mark-bathori force-pushed the NIFI-10442-iceberg branch 2 times, most recently from 968cb29 to 34d974f on September 27, 2022 14:30
@nandorsoma (Contributor) left a comment:

Hey @mark-bathori! Thank you for the additional changes. I've found a few things in the new code and also commented on the outdated ones where I thought they would need your attention.

@nandorsoma (Contributor) left a comment:

Somehow I forgot to add this one to my previous set of review comments.

@nandorsoma (Contributor) left a comment:

Hey @mark-bathori!
Thanks for the additional changes. I've found a few minor issues, but I'm still uncertain about handling the primitive types in the IcebergRecordConverter. Please see my inline comments.

@nandorsoma (Contributor) left a comment:

Thank you, @mark-bathori! PR looks good to me! Once CI jobs pass, I'm fine with merging.

@turcsanyip (Contributor) left a comment:

@mark-bathori Thanks for adding this new processor!
@nandorsoma Thanks for your review!

+1 LGTM
Merging to main.

@asfgit closed this in e87bced on Oct 11, 2022
emiliosetiadarma pushed a commit to emiliosetiadarma/nifi that referenced this pull request Oct 11, 2022
This closes apache#6368.

Signed-off-by: Peter Turcsanyi <turcsanyi@apache.org>
emiliosetiadarma and p-kimberley pushed further commits to their forks referencing this pull request on Oct 13 and Oct 15, 2022.
@abdelrahim-ahmad commented:

Hi guys, thanks for this great feature. I've installed NiFi 1.20.0 but am not able to find this new processor, PutIceberg. Why is that? Is there anything I should consider?

@pvillard31 (Contributor) commented:

Those NARs are not part of the convenience binary due to ASF rules around the binary size. You can download the NARs from the Maven Central repository and drop them into NiFi:
https://mvnrepository.com/artifact/org.apache.nifi/nifi-iceberg-processors-nar
https://mvnrepository.com/artifact/org.apache.nifi/nifi-iceberg-services-nar
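
Fetching the NARs can be scripted. A minimal sketch follows; the version number, extensions directory, and Maven Central path layout are assumptions to adjust for your deployment (it only prints the download commands so you can review them before running):

```shell
# Sketch: construct download commands for the Iceberg NARs from Maven Central.
# NIFI_VERSION and EXT_DIR are assumptions -- match them to your NiFi release
# and extensions directory, then run the printed curl commands.
NIFI_VERSION=1.20.0
EXT_DIR=/opt/nifi/nifi-current/extensions
BASE_URL=https://repo1.maven.org/maven2/org/apache/nifi
for nar in nifi-iceberg-processors-nar nifi-iceberg-services-api-nar nifi-iceberg-services-nar; do
  echo "curl -fSL -o ${EXT_DIR}/${nar}-${NIFI_VERSION}.nar ${BASE_URL}/${nar}/${NIFI_VERSION}/${nar}-${NIFI_VERSION}.nar"
done
```

NiFi picks up NARs dropped into the extensions directory (a restart is the safe option if autoloading does not kick in).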

@abdelrahim-ahmad commented:

Thank you very much, I was not aware of the NARs. Is there any chance for this to be included in upcoming versions by default, or will it stay as a separate NAR?

@pvillard31 (Contributor) commented:

It'll very likely remain like this. Many components are just provided as NARs via Maven Central. We have to keep the binary size as small as possible.

@AbdelrahimKA commented Apr 10, 2023:

Hi @pvillard31, thanks for your help!
Thanks @mark-bathori for this processor.
I've added this processor and all is good, but when I configure the processor I get this error:


PutIceberg[id=018710cc-a9b6-1854-2954-e2178e328ca7] Failed to load table from catalog: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
- Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found


Files added to the extensions directory (/srv/nifi/extensions, mounted at /opt/nifi/nifi-current/extensions):
nifi-iceberg-processors-nar-1.20.0.nar
nifi-iceberg-services-api-nar-1.20.0.nar
nifi-iceberg-services-nar-1.20.0.nar

NiFi version: 1.20
S3 file system: MinIO
Hive Catalog service configured and enabled.
core-site.xml:

<configuration>
<property>
	<name>hive.metastore.warehouse.dir</name>
	<value>s3a://icebergdb/</value>
</property>    
<property>
	<name>fs.s3a.connection.ssl.enabled</name>
	<value>false</value>
	<description>Enables or disables SSL connections to S3.</description>
</property>

<property>
	<name>fs.s3a.aws.credentials.provider</name>
	<value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>

<property>
	<name>fs.s3a.endpoint</name>
	<value>http://x.x.x.x:9900</value>
</property>

<property>
	<name>fs.s3a.access.key</name>
	<value>xxxxxxxxxxxxxxxx</value>
</property>

<property>
	<name>fs.s3a.secret.key</name>
	<value>xxxxxxxxxxxxxxxxxxxx</value>
</property>

<property>
	<name>fs.s3a.path.style.access</name>
	<value>true</value>
</property>

</configuration>

Do I need to add an additional NAR file to work with the S3 file system?

Thanks and Best regards
Abdelrahim Ahmad

@mark-bathori (Contributor, author) commented:

@AbdelrahimKA The base NAR files don't contain any cloud-specific dependencies due to their size.
You need to make a custom build of nifi-iceberg-bundle using either the include-hadoop-aws or the include-hadoop-cloud-storage profile to be able to use the processor with S3. The include-hadoop-aws profile contains only S3-specific dependencies, while the include-hadoop-cloud-storage profile will additionally include Azure- and GCP-related dependencies.

@MohamedAdelHsn commented May 11, 2023:

Hi @AbdelrahimKA,
I hope you are doing well. Did you find any solution for that issue?

@AbdelrahimKA commented May 17, 2023:

Hi @MohamedAdelHsn,
Unfortunately, there is no way right now to send data directly to the open table formats (Delta Lake, Hudi or Iceberg) from NiFi.
I've tried this with other processors like PutDatabaseRecord, but PutDatabaseRecord has its auto-commit feature disabled, so you cannot even use it with Trino or Dremio.
The only way, which is not recommended, is to use the PutSQL processor, which needs some pre-processing steps before sending the data to Trino or Dremio.
I have opened a Jira ticket (11449); hopefully someone will consider it.

Hope this helps you.
Regards

@quydx commented Sep 7, 2023:

(quoting @mark-bathori's reply above about building with the include-hadoop-aws or include-hadoop-cloud-storage profiles)

@mark-bathori Could you please give more details about this? How can we build the NAR file from nifi-iceberg-bundle using the include-hadoop-aws profile? Thanks

@Andy1510 commented Jun 4, 2024:

@quydx I managed to do so by cloning the nifi repo, then running the following Maven build command:

mvn clean install -Pinclude-hadoop-aws -pl nifi-nar-bundles/nifi-iceberg-bundle/nifi-iceberg-processors-nar -am -T2C

9 participants