Support GCS for delta sharing Server #81

Merged: 5 commits merged into delta-io:main on Jan 7, 2022

Conversation

kohei-tosshy (Contributor)

This is a PR for #20.

I implemented GCS support using the Google Cloud Client Library (https://cloud.google.com/storage/docs/reference/libraries) and the Google Cloud Storage connector (https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage).
The Google Cloud Storage connector (GCS connector) has a dependency conflict with Apache Spark related to Google Guava, so I added this library's shaded jar in server/lib (as an SBT unmanaged dependency, rather than a managed dependency in build.sbt).
This is a rather heavy-handed workaround; is it acceptable?

To use GCS with the Delta Sharing server, you need two steps.

  1. In your terminal, run export GOOGLE_APPLICATION_CREDENTIALS=</path/to/your_gcp_service_account_key.json>
  2. In the server config (e.g. delta-sharing-server.yaml), add a setting for GCS:
shares:
- name: "share_gcp"
  schemas:
  - name: "schema_gcp"
    tables:
    - name: "table_gcp"
      location: "gs://delta-start/delta/"

If you want to connect to this server from Apache Spark, run spark-shell or spark-submit as shown below.

spark-shell \
--packages io.delta:delta-core_2.12:1.0.0,io.delta:delta-contribs_2.12:1.0.0,io.delta:delta-sharing-spark_2.12:0.2.0 \
--jars </path/to/gcs-connector-latest-hadoop2.jar> \
--conf spark.delta.logStore.gs.impl=io.delta.storage.GCSLogStore \
--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
--conf spark.hadoop.google.cloud.auth.service.account.enable=true \
--conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=</path/to/your_gcp_service_account_key.json>
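
Once the shell is up, the shared table defined above can be read through the Delta Sharing Spark connector. A minimal sketch, assuming a client profile file at /path/to/profile.share that points at this server (the path is a placeholder):

// Read the GCS-backed shared table declared in the server config above.
// share_gcp.schema_gcp.table_gcp matches the names in delta-sharing-server.yaml;
// the profile file path is a placeholder for a profile pointing at this server.
val df = spark.read
  .format("deltaSharing")
  .load("/path/to/profile.share#share_gcp.schema_gcp.table_gcp")
df.show()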

I haven't implemented an integration test for the GCS support yet.
If I understand correctly, what I should do is add almost the same test as the Azure support (https://github.com/delta-io/delta-sharing/blob/main/server/src/test/scala/io/delta/sharing/server/DeltaSharingServiceSuite.scala#L510).
Is this correct?
If so, I'll add the integration test later.

@zsxwing (Member) commented Nov 30, 2021

Thanks for the contribution. This is awesome!

so I added this library's shaded jar in server/lib (as an SBT unmanaged dependency, rather than a managed dependency in build.sbt).

Is Guava the only issue? If so, we can just exclude Guava from the other dependencies, like this: a6bd550. This is better than adding a shaded jar to the project.
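
For illustration, the exclusion approach in build.sbt could look roughly like this (a sketch; the artifact coordinates and versions here are examples, not necessarily the ones in a6bd550):

// Sketch: keep the GCS connector as a managed dependency and exclude the conflicting
// Guava from the other dependency that pulls in an incompatible version.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client" % "2.10.1" exclude("com.google.guava", "guava"),
  "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.2.4"
)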

In your terminal, run export GOOGLE_APPLICATION_CREDENTIALS=</path/to/your_gcp_service_account_key.json>

Does the GOOGLE_APPLICATION_CREDENTIALS env work for reading files using GoogleHadoopFileSystem? I saw the GCS connector asks users to set a Hadoop configuration property, google.cloud.auth.service.account.json.keyfile (https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md#configuring-hadoop). Since this is the recommended way, can we just read the key file from the Hadoop configuration? This would be similar to https://github.com/delta-io/delta-sharing#azure-data-lake-storage-gen2
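
Reading the key file from the Hadoop configuration could be as simple as the following sketch; the property name comes from the GCS connector docs, while the helper itself (and the fallback to the env var) is only illustrative:

import org.apache.hadoop.conf.Configuration

// Sketch: prefer the GCS connector's documented Hadoop property and fall back to the
// GOOGLE_APPLICATION_CREDENTIALS environment variable if it is not set.
def resolveKeyFile(conf: Configuration): Option[String] =
  Option(conf.get("google.cloud.auth.service.account.json.keyfile"))
    .orElse(sys.env.get("GOOGLE_APPLICATION_CREDENTIALS"))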

If I understand correctly, what I should do is add almost the same test as the Azure support

Correct. I will do some manual testing with your change for now. We can add the real integration tests later.

@zsxwing (Member) commented Dec 7, 2021

@kohei-tosshy Any thoughts on my above questions?

@kohei-tosshy (Contributor, Author) commented Dec 11, 2021

@zsxwing

Thank you for your comment, and sorry for the late reply.

Is Guava the only issue? If so, we can just exclude Guava from the other dependencies, like this: a6bd550. This is better than adding a shaded jar to the project.

I see. I'll try.

Does the GOOGLE_APPLICATION_CREDENTIALS env work for reading files using GoogleHadoopFileSystem? I saw the GCS connector asks users to set a Hadoop configuration property, google.cloud.auth.service.account.json.keyfile (https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md#configuring-hadoop). Since this is the recommended way, can we just read the key file from the Hadoop configuration? This would be similar to https://github.com/delta-io/delta-sharing#azure-data-lake-storage-gen2

To get a pre-signed URL, I used the Google Cloud Client Library.
I used GoogleHadoopFileSystem only for handling the path.
GOOGLE_APPLICATION_CREDENTIALS is needed to use the Google Cloud Client Library.
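
For reference, a minimal sketch of producing a pre-signed URL with the Google Cloud Client Library (credentials are resolved from GOOGLE_APPLICATION_CREDENTIALS; the method name and shape here are illustrative, not the exact PR code):

import java.util.concurrent.TimeUnit
import com.google.cloud.storage.{BlobInfo, Storage, StorageOptions}

// Sketch: sign gs://<bucket>/<object> for a limited time using the default credentials
// (picked up from GOOGLE_APPLICATION_CREDENTIALS by the client library).
def presign(bucketName: String, objectName: String, timeoutSeconds: Long): String = {
  val storage: Storage = StorageOptions.newBuilder.build.getService
  val blobInfo = BlobInfo.newBuilder(bucketName, objectName).build()
  storage.signUrl(blobInfo, timeoutSeconds, TimeUnit.SECONDS,
    Storage.SignUrlOption.withV4Signature()).toString
}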

Correct. I will do some manual testing with your change for now. We can add the real integration tests later.

I see. I'll try.

Signed-off-by: Kohei Toshimitsu <k.tosshy.20@gmail.com>
@kohei-tosshy (Contributor, Author)

@zsxwing

Sorry for the delay in revising.
I fixed the Google Guava dependency problem and added an integration test for GCS support.

@zhuansunxt (Collaborator) left a comment:

The code change LGTM.

@zsxwing Do you want to have another look? It'll be nice to include this in the upcoming release.

.gitignore Outdated
@@ -21,7 +21,6 @@
 .pydevproject
 .scala_dependencies
 .settings
-/lib/
Collaborator:

Revert this change?

@kohei-tosshy (Contributor, Author), Jan 6, 2022:

I reverted this.

override def sign(path: Path): String = {
  val absPath = path.toUri
  val bucketName = absPath.getHost
  val objectName = absPath.getPath.stripPrefix("/")
Collaborator:

It would be great to have a check like assert(objectName.nonEmpty, s"cannot get object key from $path"), similar to the AWS and Azure signers.

@kohei-tosshy (Contributor, Author), Jan 6, 2022:

I added this.

@zsxwing (Member) commented Jan 4, 2022

To get a pre-signed URL, I used the Google Cloud Client Library.
I used GoogleHadoopFileSystem only for handling the path.
GOOGLE_APPLICATION_CREDENTIALS is needed to use the Google Cloud Client Library.

The Google Cloud Client Library also supports other ways to configure credentials, but it seems hard to mimic everything GoogleHadoopFileSystem does to support the various ways of configuring them. The GOOGLE_APPLICATION_CREDENTIALS env var is a good start for now. We can support other ways in the future if people ask for them.
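
For example, an explicit key file could also be passed to the client library directly, without the env var; a sketch (the path is a placeholder):

import java.io.FileInputStream
import com.google.auth.oauth2.ServiceAccountCredentials
import com.google.cloud.storage.StorageOptions

// Sketch: build a Storage client from an explicit service-account key file instead of
// relying on the GOOGLE_APPLICATION_CREDENTIALS environment variable.
val storage = StorageOptions.newBuilder()
  .setCredentials(ServiceAccountCredentials.fromStream(
    new FileInputStream("/path/to/your_gcp_service_account_key.json")))
  .build()
  .getService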

Left some minor comments. Otherwise LGTM.

class GCSFileSigner(
    name: URI,
    conf: Configuration,
    preSignedUrlTimeoutSeconds: Long) extends CloudFileSigner{
Member:

nit:

Suggested change:
-    preSignedUrlTimeoutSeconds: Long) extends CloudFileSigner{
+    preSignedUrlTimeoutSeconds: Long) extends CloudFileSigner {

Contributor (Author):

I revised this.

    val absPath = path.toUri
    val bucketName = absPath.getHost
    val objectName = absPath.getPath.stripPrefix("/")
    val storage = StorageOptions.newBuilder.build.getService
Member:

Is storage thread-safe? If so, we can save it as a val in GCSFileSigner to avoid loading the credentials for each URL.

Contributor (Author):

I think storage is thread-safe.
One instance of GCSFileSigner is created for each client request, so each request thread has a dedicated GCSFileSigner instance and won't touch other instances.
Also, storage is only used for reading, so it should be fine if several threads create pre-signed URLs in parallel.
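
If that holds, the client could also be hoisted into a member so the credentials are loaded once per signer; a sketch, not necessarily the merged code (CloudFileSigner is the project's signer trait, assumed to be in scope):

import java.net.URI
import java.util.concurrent.TimeUnit
import com.google.cloud.storage.{BlobInfo, Storage, StorageOptions}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch: build the Storage client once per GCSFileSigner rather than inside sign(),
// assuming (per the discussion above) that it is safe to reuse across calls.
class GCSFileSigner(
    name: URI,
    conf: Configuration,
    preSignedUrlTimeoutSeconds: Long) extends CloudFileSigner {

  private val storage: Storage = StorageOptions.newBuilder.build.getService

  override def sign(path: Path): String = {
    val absPath = path.toUri
    val bucketName = absPath.getHost
    val objectName = absPath.getPath.stripPrefix("/")
    assert(objectName.nonEmpty, s"cannot get object key from $path")
    storage.signUrl(BlobInfo.newBuilder(bucketName, objectName).build(),
      preSignedUrlTimeoutSeconds, TimeUnit.SECONDS,
      Storage.SignUrlOption.withV4Signature()).toString
  }
}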

@@ -551,4 +551,34 @@ class DeltaSharingServiceSuite extends FunSuite with BeforeAndAfterAll {
}
assert(e.getMessage.contains("Server returned HTTP response code: 403")) // 403 Forbidden
}

integrationTest("gcs support") {
Member:

nit: Could you use ignore for this test for now? We will set up the credentials and enable it later.

Contributor (Author):

I used ignore instead of integrationTest.

Signed-off-by: Kohei Toshimitsu <k.tosshy.20@gmail.com>
@zsxwing (Member) left a comment:

LGTM! Thanks!

@zsxwing zsxwing merged commit 1d27064 into delta-io:main Jan 7, 2022