GCP: Add Iceberg Catalog for GCP BigLake Metastore #7412

Closed
wants to merge 43 commits into from

Conversation


@coufon coufon commented Apr 23, 2023

Add the basic implementation of a new Iceberg catalog for GCP BigLake Metastore.

BigLake Metastore (BLMS) is a serverless metastore for Dataproc and BigQuery on GCP. BLMS provides an HMS-style API for Iceberg tables. Iceberg tables stored in BLMS are queryable in BigQuery (https://cloud.google.com/bigquery/docs/iceberg-tables).

BLMS API reference: https://cloud.google.com/bigquery/docs/reference/biglake/rest
BLMS API clients: https://github.com/googleapis/google-cloud-java/tree/main/java-biglake

@coufon coufon changed the title Add Iceberg Catalog for GCP BigLake Metastore GCP: Add Iceberg Catalog for GCP BigLake Metastore Apr 23, 2023
@djouallah

Sorry for posting here (asking from a user perspective):
Does this catalog support reading and writing Iceberg tables using third-party tools?
Is it compatible with the Iceberg REST catalog, and can I use it with PyIceberg?

@coufon
Author

coufon commented Apr 25, 2023

Sorry for posting here (asking from a user perspective): does this catalog support reading and writing Iceberg tables using third-party tools? Is it compatible with the Iceberg REST catalog, and can I use it with PyIceberg?

It provides the same functionality that an Iceberg custom catalog supports (https://iceberg.apache.org/docs/latest/custom-catalog/), like the existing HiveCatalog. It supports read/write with Spark and Flink. It does not work with Trino yet (that needs a Trino integration here: https://github.com/trinodb/trino/tree/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/catalog).

The BigLake Metastore API is not the same as the Iceberg REST catalog API spec, but they should be compatible and convertible via a proxy. We are happy to explore how to make it work with the Iceberg REST client. Please let me know if you have any use cases.

BigLake Metastore works with PyIceberg; we have a Python client: https://cloud.google.com/python/docs/reference/biglake/latest. We would need to contribute some code to PyIceberg for the integration.

BTW, we already released this catalog as a JAR: gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.0-with-dependencies.jar, with a user guide here: https://cloud.google.com/bigquery/docs/iceberg-tables. Please give it a try :)
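For anyone who wants to try it, here is a minimal Spark setup sketch. The property keys (gcp_project, gcp_location, blms_catalog, warehouse), the catalog name "blms", and the BigLakeCatalog class path are assumptions based on this PR and the user guide above, not confirmed names; check the guide for the exact configuration.

import org.apache.spark.sql.SparkSession;

// Illustrative sketch only: registers an Iceberg SparkCatalog backed by the BigLake catalog impl.
// All values below are placeholders.
SparkSession spark =
    SparkSession.builder()
        .config("spark.sql.catalog.blms", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.blms.catalog-impl", "org.apache.iceberg.gcp.biglake.BigLakeCatalog")
        .config("spark.sql.catalog.blms.gcp_project", "my-gcp-project")
        .config("spark.sql.catalog.blms.gcp_location", "us")
        .config("spark.sql.catalog.blms.blms_catalog", "my_blms_catalog")
        .config("spark.sql.catalog.blms.warehouse", "gs://my-bucket/warehouse")
        .getOrCreate();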

Contributor

@danielcweeks danielcweeks left a comment


A few initial comments on a quick first pass.

// The endpoint of the BigLake API. Optional; defaults to DEFAULT_BIGLAKE_SERVICE_ENDPOINT.
public static final String PROPERTIES_KEY_BIGLAKE_ENDPOINT = "blms_endpoint";
// The GCP project ID. Required.
public static final String PROPERTIES_KEY_GCP_PROJECT = "gcp_project";
Contributor

These properties appear to be duplicates of values in GCPProperties. Please use that class instead for defining and accessing properties.

Author

@coufon coufon Apr 28, 2023

It makes sense. I added a TODO to use GCPProperties in a follow-up PR. I changed the config names to follow the existing style (e.g., biglake.project-id) in this PR. The issue with simply using GCPProperties in this PR is that the "gcs.project-id" key is GCS-specific, but its class field is named just "projectId". I hope to rename it to gcsProjectId and add a bigLakeProjectId; since that touches other classes, I prefer to separate it out.

I actually feel "gcs.project-id" is not necessary: it would be great if customers could use buckets from any project (and from different projects) in the same catalog handler. Most GCP customers today use the GCS connector (https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage), an adapter between HDFS and GCS. A URL like "gs://bucket/folder" is handled by this connector, and the project ID is detected from the bucket. It is open source and pre-installed in Dataproc. This makes me believe that specifying a project ID for GCS is not needed.

Contributor

I don't think we should defer the property updates to a separate PR. It doesn't make sense to introduce them here just to move them later. They are also public fields which means that if they end up in a release, they will need to go through a deprecation cycle. We can add the additional properties to the GCPProperties and make sure that they are namespaced appropriately to work with other properties.

If we can determine the project id from the bucket itself, that would be great, but one of the major points of the GCSFileIO is to remove dependencies on hadoop/hdfs, so using that connector is not a solution for this.
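A sketch of what that could look like (constant and key names below are assumptions for illustration; only the "biglake.project-id" style is mentioned in this thread):

// Hypothetical additions to GCPProperties, namespaced per service alongside the existing "gcs.*" keys.
public static final String BIGLAKE_ENDPOINT = "biglake.endpoint";
public static final String BIGLAKE_PROJECT_ID = "biglake.project-id";
public static final String BIGLAKE_LOCATION = "biglake.location";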


@Override
public void createNamespace(Namespace namespace, Map<String, String> metadata) {
if (namespace.levels().length == 0) {
Contributor

I'm not sure this will map correctly to the Spark catalog namespacing. What is the namespacing for the catalog?

Is it <spark-catalog>.<biglake-catalog>.<database/schema>.<table>?

Author

@coufon coufon Apr 28, 2023

Sorry for the confusion. There are two options: (1) link a <spark-catalog> to a physical <biglake-catalog>, so the full identifier of a table is just <spark-catalog>.<database/schema>.<table>; CREATE/DROP of the <spark-catalog> is supported and creates/deletes the <biglake-catalog> via the API. (2) Use <spark-catalog>.<biglake-catalog>.<database/schema>.<table>.

We chose (1) to avoid the long table identifier in (2) (the linking is done by a config, biglake.catalog). The limitation is that customers can't use two <biglake-catalog>s in the same <spark-catalog>; they have to set up two <spark-catalog>s instead. We think this is OK, because tables that are used together are usually in the same catalog.

My concern is that I am not sure whether (1) violates any design pattern of Spark namespaces. Please let me know if it does.
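For illustration, this is roughly how option (1) looks from Spark SQL, assuming a SparkSession configured with the biglake.catalog linking property; the catalog, database, and table names are made up.

// With option (1) the Spark catalog "blms" is linked to a single BigLake catalog,
// so table identifiers stay at <spark-catalog>.<database>.<table>.
spark.sql("CREATE NAMESPACE IF NOT EXISTS blms.db1");
spark.sql("CREATE TABLE blms.db1.events (id BIGINT, ts TIMESTAMP) USING iceberg");
spark.sql("SELECT count(*) FROM blms.db1.events").show();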

private Table makeNewTable(TableMetadata metadata, String metadataFileLocation) {
Table.Builder builder = Table.newBuilder().setType(Table.Type.HIVE);
builder
.getHiveOptionsBuilder()
Contributor

Is this necessary? These options don't apply to Iceberg tables.

Author

Another way to use BigLake Metastore for Iceberg tables is to install an HMS proxy that exposes a local 9083 Thrift port. This proxy reads from the Metastore API and returns HMS tables. The benefit of this approach is that it works for all data engines that support HMS (without having to write a catalog plugin). We populate these fields because the Hive Iceberg catalog populates them; storing them in BigLake Metastore lets the HMS proxy path return exactly the same table as HMS does.

We have received requests to develop more Iceberg catalog plugins (e.g., for Trino) for BigLake Metastore. Once we have those catalog plugins, maybe we won't need to keep this HMS path any more. If it is OK, I'd like to keep the code, and we can remove it in the future.

@danielcweeks
Contributor

@coufon can you help explain and document how the atomic update works? I assume this is somehow related to the etag and updateTable call, but it's not entirely clear how the atomic swap is enforced.

@github-actions github-actions bot removed the API label Apr 28, 2023
@coufon
Author

coufon commented Apr 28, 2023

@coufon can you help explain and document how the atomic update works? I assume this is somehow related to the etag and updateTable call, but it's not entirely clear how the atomic swap is enforced.

Thank you so much for the review. I added a comment in the code to explain the atomic update:

  // Updating a BLMS table with an etag. The BLMS server transactionally (1) checks that the
  // etag of the table on the server is the same as the etag provided by the client, and
  // (2) updates the table (and its etag). The server returns an error containing the message
  // "etag mismatch" if the etag on the server has changed.
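To make the flow concrete, here is a minimal sketch of that commit path. BlmsClient and BlmsTable are simplified stand-ins, not the real generated BigLake client or protos; only the etag check-and-swap pattern is the point.

import java.util.ConcurrentModificationException;
import org.apache.iceberg.exceptions.CommitFailedException;

class EtagCommitSketch {
  // Hypothetical, simplified client: updateTable applies the change only if the etag
  // carried by the table still matches the one stored on the server.
  interface BlmsClient {
    BlmsTable getTable(String name);
    void updateTable(BlmsTable table); // throws ConcurrentModificationException on etag mismatch
  }

  static final class BlmsTable {
    final String name;
    final String etag;
    final String metadataLocation;

    BlmsTable(String name, String etag, String metadataLocation) {
      this.name = name;
      this.etag = etag;
      this.metadataLocation = metadataLocation;
    }
  }

  static void commit(BlmsClient client, String tableName, String newMetadataLocation) {
    BlmsTable current = client.getTable(tableName); // server copy, carries the current etag
    // Reuse the etag we just read; the server only applies the update if it still matches.
    BlmsTable updated = new BlmsTable(current.name, current.etag, newMetadataLocation);
    try {
      client.updateTable(updated);
    } catch (ConcurrentModificationException e) {
      // "etag mismatch": another writer committed first; surface it as a retryable commit failure.
      throw new CommitFailedException(e, "Concurrent modification of table %s", tableName);
    }
  }
}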

import java.util.Map;

/** A client interface of Google BigLake service. */
interface BigLakeClient {
Contributor

What is the value of this interface? Do you expect someone to swap out the implementation for some reason? And if so, why not just swap out the entire catalog instead? That would make more sense to me. Then you wouldn't need a second interface that basically duplicates the public Catalog API.

Author

The interface is for creating a fake client implementation. In the previous commit I used mocked clients, so this interface was not used. But now that I have switched to CatalogTests as you suggested, mocking the client per test case doesn't work any more, so I added a FakeBigLakeClient implementing this interface for the tests.

Contributor

I'm still struggling to see how this interface makes sense. The implementation only wraps MetastoreServiceClient so it seems like you should be able to mock or spy that instead of creating one here?

Contributor

This approach also doesn't test any of the code in BigLakeClientImpl, like convertException. I think this should be removed and replaced by testing against a different MetastoreServiceClient.

return false;
}
} catch (NoSuchNamespaceException e) {
LOG.warn("Failed to drop namespace", e);
Contributor

This log message is inaccurate. If the database didn't exist, then there was no failure. This method is idempotent.

Author

But we can't tell whether it is a not-found or a permission-denied error. For permission denied, we do want to tell the user to double-check. Maybe the BigLake client should not convert 403 to 404, and should instead convert 403 to Iceberg's NotAuthorizedException? Then downstream code would never treat these errors as not-found.

Author

I tried returning NotAuthorizedException from the client instead of 403. The problem is that the tests in CatalogTests expect NoSuchTableException and NoSuchNamespaceException, so more refactoring would be required. I feel that converting 403 to 404 and making it explicit to users is the best option for now.

} else if (namespace.levels().length == 1) {
String dbId = databaseId(namespace);
validateDatabaseId(dbId, namespace);
return loadDatabase(dbId).getHiveOptions().getParametersMap();
Contributor

Is this parameter map immutable or unmodifiable?

Author

public Catalog catalog(CatalogName name) {
try {
return stub.getCatalog(GetCatalogRequest.newBuilder().setName(name.toString()).build());
} catch (PermissionDeniedException e) {
Contributor

I think it would make sense if everything returned 404, but returning 401/403 and translating that to 404 doesn't make any sense to me.

When we last talked, the argument for this behavior was that returning 401/403 leaks the fact that the object exists. But this is actually doing the opposite and hiding the fact that the object doesn't exist by throwing NoSuchNamespaceException. I guess that in the end, the service is returning a single response for all 401/403/404 cases, but it's strange to use a permission error rather than a not-exists error.

Can you confirm that the service will never return a 404?

Also, if the service guarantees that 401, 403, and 404 will result in PermissionDeniedException, then that should be documented in this class somewhere, probably in class-level Javadoc. Then maybe we don't need to add the confusing "(or permission denied)" to all of the messages?

Author

Yes, the service never returns 404; it is always 403. The response looks like this:

{
  "error": {
    "code": 403,
    "message": "Permission 'biglake.databases.get' denied on resource '//biglake.googleapis.com/projects/myproj/locations/us/catalogs/mycat/databases/notexist' (or it may not exist).",
    "status": "PERMISSION_DENIED"
  }
}

This is the unified response format in newer GCP APIs. The error message notes that it could be "permission denied" or "not exist". Here is another example: https://stackoverflow.com/questions/75357894/permission-logging-logentries-create-denied-on-resource-or-it-may-not-exist

Author

Added a description to the class Javadoc.

.setDatabaseId(name.getDatabase())
.setDatabase(db)
.build());
} catch (com.google.api.gax.rpc.AlreadyExistsException e) {
Contributor

Shouldn't this also handle PermissionDeniedException since that may be thrown by any of the calls? Here it should probably be translated to Iceberg's ForbiddenException or NotAuthorizedException. Or is this an appropriate place to return NoSuchNamespaceException to indicate that the catalog doesn't exist?

Author

The wrapper convertException handles PermissionDeniedException and converts it to Iceberg's NotAuthorizedException. It is extracted into a wrapper to reduce boilerplate code.
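For clarity, a small sketch of that wrapper pattern, under the assumption that it simply maps gax exceptions to Iceberg exceptions around each RPC; the PR's actual convertException likely covers more cases.

import com.google.api.gax.rpc.PermissionDeniedException;
import java.util.function.Supplier;
import org.apache.iceberg.exceptions.NotAuthorizedException;

class ConvertExceptionSketch {
  // Wrap every RPC so the exception mapping lives in one place instead of in each catch block.
  static <T> T convertException(Supplier<T> call) {
    try {
      return call.get();
    } catch (PermissionDeniedException e) {
      // Unified GCP behavior: a 403 may mean "denied" or "resource does not exist".
      throw new NotAuthorizedException("Permission denied (or it may not exist): %s", e.getMessage());
    }
  }
}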

Author

@coufon coufon Oct 3, 2023

Yes, it is a good point. We should check whether it failed because the parent was not found or because permission was denied. That requires parsing the error message. I added a TODO to do it in a follow-up PR, to avoid adding more new code to this PR, which has been open for a long time.

The error message looks like this (this example creates a database): "Permission 'biglake.databases.create' denied on resource '//biglake.googleapis.com/projects/myproj/locations/us/catalogs/mycat' (or it may not exist).". We need to determine whether the error is on this resource (the table) or its parent (the database or catalog).

@coufon coufon requested a review from rdblue October 3, 2023 18:34
@coufon
Author

coufon commented Oct 6, 2023

Great work, this feature is exactly what my team needs. Are there any updates? @coufon @rdblue

We released this code here (https://cloud.google.com/bigquery/docs/manage-open-source-metadata#connect-dataproc-vm):

Iceberg 1.2.0: gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.1-with-dependencies.jar
Iceberg 0.14.0: gs://spark-lib/biglake/biglake-catalog-iceberg0.14.0-0.1.1-with-dependencies.jar

Feel free to try these before this PR is merged.

@emkornfield
Contributor

@rdblue just wanted to check if you had any remaining concerns here?

@dchristle
Contributor

We released this code here (https://cloud.google.com/bigquery/docs/manage-open-source-metadata#connect-dataproc-vm):

Iceberg 1.2.0: gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.1-with-dependencies.jar Iceberg 0.14.0: gs://spark-lib/biglake/biglake-catalog-iceberg0.14.0-0.1.1-with-dependencies.jar

Feel free to try these before this PR is merged.

@coufon @emkornfield Are there instructions on how to build the equivalent artifacts using this pull request, including dependencies (with-dependencies) & any shading? It would be nice to use a jar built with 1.4.x as a dependency when using Iceberg 1.4.x.

@devorbit

devorbit commented Nov 3, 2023

Hi @dchristle @coufon,
When can we expect this to be available with the standard iceberg-spark-runtime?
We have some use cases for using it with the Iceberg Kafka Connector, and I am not able to fetch this artifact from Maven.

Please guide me. Thanks.

@istreeter

I just came across this PR -- it looks like this would be a hugely valuable addition to Iceberg! My company would certainly benefit from this.

It seems to have got stuck though. Is there anything I can do to help get this over the finish line? I'd be willing to help out if there's any chance this could get released.

@linchun3

+1 would love to see this move forward :)

@nastra
Contributor

nastra commented Apr 22, 2024

@coufon sorry for the delay here, this fell off the radar unfortunately. Could you rebase the PR please?

@emkornfield
Contributor

@coufon I think we can close this for now, as we will be investing our effort in the BigQuery Metastore catalog we announced at Next (https://www.youtube.com/watch?v=LIMnhzJWmLQ&t=1s).

@coufon coufon closed this by deleting the head repository Jul 24, 2024