Extension to read and ingest iceberg data files #14329
Conversation
Left a partial review. Will finish the detailed review by this week.
Super cool stuff 🚀.
This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.

Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata on metadata files, it is still dependent on a metastore for managing a certain amount of metadata.
This might need rephrasing. Did you mean a metadata store here?
No, I'm referring to a metastore, also known as an Iceberg metadata catalog or just an Iceberg catalog. I've slightly reworded this, let me know if it helps.
### Hive Metastore catalog

For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath.
Where are they needed? I am assuming they are only needed on the peons?
Yes, only on the peons; fixed it in the docs.
</parent>
<modelVersion>4.0.0</modelVersion>

<properties>
Nit: do we require an empty block here?
<properties>
</properties>
<dependencies>
<dependency>
I do not see hadoop 2 / hadoop 3 profiles. For reference, you can have a look here: https://github.com/apache/druid/blob/master/extensions-core/hdfs-storage/pom.xml#L142
In the limitations of the extension, it is specified that Hadoop 2.x support is not tested. Do we still need a hadoop2 profile?
Not required. Though can we avoid adding this module at all if the hadoop2 profile is activated? Assuming such a thing is possible.
We can probably remove it from the distribution pom.xml under the hadoop2 profile, if needed.
I think it's fine since we will remove the Hadoop 2 support very soon anyway.
/*
 * Druid wrapper for an iceberg catalog.
 * The configured catalog is used to load the specified iceberg table and retrieve the underlying live data files up to the latest snapshot.
 * This does not perform any projections on the table yet, therefore all the underlying columns will be retrieved from the data files.
Where is the `icebergFilter` expression filtering happening? Does the filtering happen while pruning the list of the data files that need to be fetched?
Yes, we create an Iceberg table scan and feed it the set of filters before the plan files are identified. Therefore, while the files are being planned, it can prune the list based on the filters provided.
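To make that concrete, here is a minimal sketch of the pattern using the Iceberg scan API (the class and variable names are illustrative, not the extension's actual code):

```java
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.io.CloseableIterable;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FilteredDataFileLister
{
  /**
   * Returns the paths of live data files for a table. The filter is handed to the
   * table scan before planFiles() runs, so Iceberg prunes files while planning
   * rather than after the full file list has been materialized.
   */
  public static List<String> listDataFilePaths(Table table, Expression filter) throws IOException
  {
    TableScan scan = table.newScan();
    if (filter != null) {
      scan = scan.filter(filter);
    }
    List<String> dataFilePaths = new ArrayList<>();
    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      for (FileScanTask task : tasks) {
        dataFilePaths.add(task.file().path().toString());
      }
    }
    return dataFilePaths;
  }
}
```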
.forEach(dataFile -> dataFilePaths.add(dataFile.path().toString()));

long duration = System.currentTimeMillis() - start;
log.info("Data file scan and fetch took %d ms", duration);
You could also log the number of dataFilePaths here.
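A hedged sketch of what that extra logging could look like, assuming `dataFilePaths` is a `List<String>` and `log` is the class's Druid `Logger`:

```java
// Log the elapsed time along with how many data files survived filter-based pruning.
long duration = System.currentTimeMillis() - start;
log.info("Data file scan and fetch took [%d] ms for [%d] data file paths", duration, dataFilePaths.size());
```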
private void authenticate()
{
  String principal = catalogProperties.getOrDefault("principal", null);
Are there other types of authentication methods, or do we only have support for krb5 in the initial version?
In any case, we should document this explicitly.
Added a line in the doc.
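For context, Kerberos (krb5) login from a principal and keytab usually goes through Hadoop's `UserGroupInformation`. The sketch below only illustrates that pattern; the `principal`/`keytab` property keys are assumptions and may not match the extension's exact keys:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

import java.io.IOException;
import java.util.Map;

public class KerberosLoginHelper
{
  /**
   * Logs in to Kerberos if a principal and keytab are supplied in the catalog properties.
   * If either is missing, the catalog falls back to the current (simple) authentication.
   */
  public static void authenticate(Map<String, String> catalogProperties, Configuration conf) throws IOException
  {
    String principal = catalogProperties.getOrDefault("principal", null);
    String keytab = catalogProperties.getOrDefault("keytab", null);
    if (principal != null && keytab != null) {
      conf.set("hadoop.security.authentication", "kerberos");
      UserGroupInformation.setConfiguration(conf);
      UserGroupInformation.loginUserFromKeytab(principal, keytab);
    }
  }
}
```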
private HiveCatalog setupCatalog()
{
  HiveCatalog catalog = new HiveCatalog();
  authenticate();
Do we need to handle remote HTTP/RPC related exceptions here?
String tableIdentifier = tableNamespace + "." + tableName;

Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
TableIdentifier icebergTableIdentifier = catalog.listTables(namespace).stream()
I think this call needs special error handling to let the user know that there is a connectivity issue or that a bad configuration was passed.
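One possible shape for that error handling, sketched with Druid's `ISE`; the wording and exception choice here are assumptions, not the code that was merged:

```java
import org.apache.druid.java.util.common.ISE;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;

import java.util.List;

public class CatalogCalls
{
  /**
   * Wraps the catalog call so that connectivity or configuration problems surface
   * as a clear, actionable error instead of a raw stack trace.
   */
  public static List<TableIdentifier> listTablesSafely(Catalog catalog, Namespace namespace)
  {
    try {
      return catalog.listTables(namespace);
    }
    catch (RuntimeException e) {
      throw new ISE(
          e,
          "Unable to list tables in namespace [%s]. Verify that the catalog is reachable and the catalog properties are correct.",
          namespace
      );
    }
  }
}
```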
Support for AWS Glue and REST based catalogs are not available yet.

For a given catalog, iceberg table name and filters, The IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
Suggested change:
For a given catalog, iceberg table name and filters, The IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
For a given catalog, iceberg table name, and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters, and extracting all the underlying live data files up to the latest snapshot.
"fs.s3a.endpoint" : "S3_API_ENDPOINT" | ||
} | ||
``` | ||
Since the AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`. |
Suggested change:
Since the AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.
Since the Hadoop AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.
```json
"catalogProperties": {
  "fs.s3a.access.key" : "S3_ACCESS_KEY",
```
Can you confirm that these get masked when we log these properties or when someone looks at the ingestion spec?
It wasn't masked earlier; I've added support for the DynamicConfigProvider now, so it should be good.
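For reference, Druid's `DynamicConfigProvider` exposes secrets as a map at runtime, so the catalog can merge resolved secrets into its properties instead of carrying plain-text keys in the spec. A minimal sketch of that merging step; names are illustrative, not the extension's actual code:

```java
import org.apache.druid.metadata.DynamicConfigProvider;

import java.util.HashMap;
import java.util.Map;

public class CatalogPropertiesResolver
{
  /**
   * Combines the static catalog properties from the ingestion spec with any secrets
   * resolved by a DynamicConfigProvider, so credentials such as fs.s3a.access.key
   * never need to appear in plain text in the spec.
   */
  public static Map<String, String> resolve(
      Map<String, String> catalogProperties,
      DynamicConfigProvider<String> secretsProvider
  )
  {
    Map<String, String> resolved = new HashMap<>(catalogProperties);
    if (secretsProvider != null) {
      resolved.putAll(secretsProvider.getConfig());
    }
    return resolved;
  }
}
```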
return inputSource;
}

private static class EmptyInputSource implements SplittableInputSource
You should add a note here that this class exists because some underlying input sources might not accept an empty list of input sources, while an empty list is possible when working with Iceberg.
Added docs, thanks.
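A sketch of what such a note could say (wording assumed, not the javadoc that was actually added):

```java
/**
 * An input source that yields no splits and no rows.
 *
 * This class exists because an Iceberg table scan can legitimately resolve to an
 * empty list of live data files (for example, when the supplied filters prune
 * everything away), while some delegate input sources, such as the S3 or HDFS
 * input sources, do not accept an empty list of input paths.
 */
```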
import java.util.stream.Stream;

/**
 * A wrapper on top of {@link SplittableInputSource} that handles input source creation.
I think this class could be called LazyInputSourceBuilder or something to that effect, since it doesn't seem like an adapter. Its primary responsibility is lazy, on-demand instantiation of input sources.
In fact, this class could be split into one concrete class that does memoization and one interface that has a `build` method. The extensions can just implement the interface.
Thinking a bit more about it, memoization doesn't require a class of its own at all. That's something IcebergInputSource can do itself. So all we require is the ability to generate an input source dynamically, and a single-method interface is good enough to achieve that. We can call it FileInputSourceBuilder or FileInputSourceGenerator.
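A minimal sketch of the suggested single-method interface; the name is taken from the comment above, and the exact signature is an assumption rather than the merged API:

```java
import org.apache.druid.data.input.impl.SplittableInputSource;

import java.util.List;

/**
 * Generates a concrete splittable input source (for example, S3 or HDFS) from the
 * data file paths that the Iceberg catalog resolved. Implementations decide which
 * warehouse-specific input source to build; memoization, if needed, is left to the caller.
 */
public interface FileInputSourceGenerator
{
  SplittableInputSource<?> generate(List<String> dataFilePaths);
}
```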
Just a few minor comments @a2l007 - Looks good to me otherwise.
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
I started a discussion on the #dev channel. It will be preferable to use the shaded jars to avoid dependency conflicts in the future. Is this jar (hadoop-mapreduce-client-core) shaded? If you are not seeing any conflicts, it's fine for now.
@JsonProperty("filterValue") String filterValue
)
{
  Preconditions.checkNotNull(filterColumn, "filterColumn can not be null");
Can the error message be adjusted similar to how you have done in IcebergIntervalFilter?
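Purely for illustration, the adjusted checks could read along these lines; the exact wording used in IcebergIntervalFilter is not shown here, so this phrasing is an assumption:

```java
// Hypothetical phrasing; align the message with whatever style IcebergIntervalFilter uses.
Preconditions.checkNotNull(filterColumn, "IcebergEqualsFilter requires a non-null 'filterColumn' property");
Preconditions.checkNotNull(filterValue, "IcebergEqualsFilter requires a non-null 'filterValue' property");
```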
List<Expression> expressions = new ArrayList<>();
for (Interval filterInterval : intervals) {
  Long dateStart = (long) Literal.of(filterInterval.getStart().toString())
      .to(Types.TimestampType.withZone())
can you please add this bit as a doc here?
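For background on what this snippet is doing: each ISO interval endpoint is converted to an Iceberg timestamp-with-zone literal (a long count of microseconds since the epoch), and one inclusive-start/exclusive-end range expression is built per interval. A condensed sketch of that idea, assuming a `filterColumn` name and Joda-Time `Interval`s as in the surrounding code:

```java
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;
import org.joda.time.Interval;

import java.util.ArrayList;
import java.util.List;

public class IntervalExpressionSketch
{
  /**
   * Builds one inclusive-start / exclusive-end range expression per interval and ORs them together.
   * Iceberg timestamp-with-zone literals are long values in microseconds since the epoch.
   */
  public static Expression toIcebergExpression(String filterColumn, List<Interval> intervals)
  {
    List<Expression> expressions = new ArrayList<>();
    for (Interval filterInterval : intervals) {
      Long dateStart = (Long) Literal.of(filterInterval.getStart().toString())
                              .to(Types.TimestampType.withZone())
                              .value();
      Long dateEnd = (Long) Literal.of(filterInterval.getEnd().toString())
                            .to(Types.TimestampType.withZone())
                            .value();
      expressions.add(
          Expressions.and(
              Expressions.greaterThanOrEqual(filterColumn, dateStart),
              Expressions.lessThan(filterColumn, dateEnd)
          )
      );
    }
    // OR the per-interval ranges together; alwaysFalse() is the identity element for OR.
    Expression combined = Expressions.alwaysFalse();
    for (Expression expression : expressions) {
      combined = Expressions.or(combined, expression);
    }
    return combined;
  }
}
```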
@a2l007 - Looks good to me. Thank you. We have a release branch already cut. I was thinking that maybe you can backport just the core changes. That way, anyone can build the extension and try it on a production release.
@a2l007 - I built this locally and the size of the extension directory is 431 MB. Half of that is coming from
Some dependencies are not required in the extension since they are already present in the core lib, e.g. guava (2.1 MB).
~ under the License.
-->

## Iceberg Ingest Extension
Suggested change:
## Iceberg Ingest Extension
## Iceberg Ingest extension
I don't think this heading is necessary. If you delete this heading, you can move the other headings up a level.
Since the topic is about Iceberg ingestion, consider introducing the feature first and then talk about the extension as a means of enabling the feature. For example:
Apache Iceberg is an open table format for huge analytic datasets.
The Iceberg input source lets you ingest data stored in the Iceberg table format into Apache Druid. To enable the Iceberg input source, add `druid-iceberg-extensions` to the list of extensions. See Loading extensions for more information.
Iceberg manages most of its metadata in metadata files in the object storage. In some cases, it uses a metastore to manage a certain amount of metadata.
This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.

Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
Suggested change:
Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
Apache Iceberg is an open table format for huge analytic datasets. Iceberg manages most of its metadata in metadata files in the object storage. In some cases, it uses a metastore to manage a certain amount of metadata.
See comment on line 25.
This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.

Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
Suggested change:
These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
Iceberg refers to these metastores as catalogs. The Iceberg extension lets you connect to the following Iceberg catalog types:
## Iceberg Ingest Extension

This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
See comment on line 25.
The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.

### Load the Iceberg Ingest extension
You can completely omit this section if you roll it into the introduction.
|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive.|yes|

`and` Filter:
`and`, `or`, and `not` filters all have the same properties. Consider not using tables to present this information, or combine `and`, `or`, and `not` into one table.
Also, the definition for the `filters` property is confusing. What exactly do we pass into that property? A column name, a filter name, etc.?
The `not` filter accepts a single filter, whereas `and` and `or` accept a list of iceberg filters.
The `filters` property accepts any of the other iceberg filters mentioned in this section.
I've reviewed this PR from the docs perspective and left some suggestions.
docs/ingestion/input-sources.md
|--------|-----------|---------|
|type|Set this value to `equals`.|yes|
|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
|filterValue|The value to filter on|yes|
Suggested change:
|filterValue|The value to filter on|yes|
|`filterValue`|The value to filter on.|Yes|
|Property|Description|Required|
|--------|-----------|---------|
|type|Set this value to `equals`.|yes|
Suggested change:
|type|Set this value to `equals`.|yes|
|`type`|Set this value to `equals`.|Yes|
|Property|Description|Required|
|--------|-----------|---------|
|type|Set this value to `interval`.|yes|
Suggested change:
|type|Set this value to `interval`.|yes|
|`type`|Set this value to `interval`.|Yes|
@@ -0,0 +1,121 @@
---
id: iceberg
title: "Iceberg"
title: "Iceberg" | |
title: "Iceberg extension" |
@ektravel Thank you for your review, I've addressed most of your comments. I haven't code formatted the input source properties as I'm following the same format as the other input sources described on that page. Let me know what you think.
@abhishekagarwal87 Good catch! I've excluded the aws-java-sdk-bundle and changed the scope for a few of the other dependencies.
<exclusions>
<exclusion>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-bundle</artifactId>
Don't you need any aws dependency? For example in hdfs-storage, where we excluded this, we also added the below:
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-s3</artifactId>
  <version>${aws.sdk.version}</version>
  <scope>runtime</scope>
</dependency>
The metastore only needs the hadoop-aws jar, which provides the `org.apache.hadoop.fs.s3a.S3AFileSystem` class to resolve the `s3a` client. The s3 druid extension (which has the aws-java-sdk-s3 dependency) takes care of operations on the objects retrieved by the metastore.
@a2l007 - PR looks good to me. I will let you merge it.
Though I think there is still a lot of scope for reducing the number of dependencies that this extension has. It has jars for curator, jetty, jersey, protobuf, orc, mysql. There is an iceberg spark runtime jar that I can't figure out how it will be used. This will become an issue for the release manager, as all these extra dependencies are going to have CVEs that require investigation before being suppressed.
distribution/pom.xml
@@ -258,6 +258,8 @@
<argument>-c</argument>
<argument>org.apache.druid.extensions:druid-kubernetes-extensions</argument>
<argument>-c</argument>
<argument>org.apache.druid.extensions:druid-iceberg-extensions</argument>
This is a contrib extension, so we shouldn't be shipping it in the distribution bundle.
@ektravel Do the doc changes look good to you? @abhishekagarwal87 I agree that the dependencies need pruning and this is something that I'm working on. A bunch of the pruning work is going to be on the transitive deps for
Sounds good. I just merged your PR.
@a2l007 - do you want to backport the core changes to 27 so folks can try out the extension with the 27 release?
@a2l007 Thank you for making the requested changes. They look good to me.
@abhishekagarwal87 Sure, raised #14608
This adds a new contrib extension: druid-iceberg-extensions which can be used to ingest data stored in Apache Iceberg format. It adds a new input source of type iceberg that connects to a catalog and retrieves the data files associated with an iceberg table and provides these data file paths to either an S3 or HDFS input source depending on the warehouse location. Two important dependencies associated with Apache Iceberg tables are: Catalog : This extension supports reading from either a Hive Metastore catalog or a Local file-based catalog. Support for AWS Glue is not available yet. Warehouse : This extension supports reading data files from either HDFS or S3. Adapters for other cloud object locations should be easy to add by extending the AbstractInputSourceAdapter.
@a2l007 Does the icebergFilter work if filtering on a column that is not a partition column? My understanding is that if we are filtering on a column that is not a partition column, then a data file returned from the scan may have rows that do not satisfy the filter (basically residuals). How do we deal with this, since we are just passing the list of file paths to Druid to ingest?
@maytasm AFAIK, dynamic filtering cannot be performed on non-partitioned columns. As a workaround for this, we filter based on partitioned columns in the iceberg input source spec and add another filter in the transformSpec for the non-partitioned columns.
@a2l007 Thanks for getting back to me. My understanding is that if we pass filters on non-partitioned columns, we would still get a list of files that may have values not matching our filters. If that is the case, then we should either call that out in the docs or fail the ingestion job, unless we convert the iceberg filters into Druid filters (in the transformSpec) and push the iceberg filters down into the Druid ingestion job (after we get the list of files).
@maytasm Yeah, we should definitely call that out in the docs and also print some warning messages. I'll make sure to include these in my next PR.
Fixes #13923.
Description
This adds a new contrib extension: `druid-iceberg-extensions` which can be used to ingest data stored in Apache Iceberg format. It adds a new input source of type `iceberg` that connects to a catalog and retrieves the data files associated with an iceberg table and provides these data file paths to either an S3 or HDFS input source depending on the warehouse location.
Two important dependencies associated with Apache Iceberg tables are:
- Catalog: This extension supports reading from either a Hive Metastore catalog or a Local file-based catalog. Support for AWS Glue is not available yet.
- Warehouse: This extension supports reading data files from either HDFS or S3. Adapters for other cloud object locations should be easy to add by extending the `AbstractInputSourceAdapter`.

Sample ingestion spec:
Release note
Enhanced ingestion capabilities to support ingestion of Apache Iceberg data.
Key changed/added classes in this PR
- IcebergCatalog.java
- IcebergInputSource.java
This PR has: