-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support GCS for delta sharing Server #81
Conversation
Signed-off-by: Kohei Toshimitsu <k.tosshy.20@gmail.com>
Thanks for the contribution. This is awesome!
Is Guava the only issue? If so, we can just exclude guava from other dependencies like this a6bd550 This is better than adding a shaded jar to the project.
Does the GOOGLE_APPLICATION_CREDENTIALS env work for reading files using
Correct. I will do some manual test with your change for now. We can add the real integration tests later. |
@kohei-tosshy Any thoughts on my above questions? |
Thank you for your comment.
I see. I'll try.
To get a presigned URL, I used Google Google Cloud Client Library.
I see. I'll try. |
Signed-off-by: Kohei Toshimitsu <k.tosshy.20@gmail.com>
Sorry for being late for revising. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code change LGTM.
@zsxwing Do you want to have another look? It'll be nice to include this in the upcoming release.
.gitignore
Outdated
@@ -21,7 +21,6 @@ | |||
.pydevproject | |||
.scala_dependencies | |||
.settings | |||
/lib/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted this.
override def sign(path: Path): String = { | ||
val absPath = path.toUri | ||
val bucketName = absPath.getHost | ||
val objectName = absPath.getPath.stripPrefix("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to have a check assert(objectName.nonEmpty, s"cannot get object key from $path")
similar to AWS and Azure signer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this.
Google Google Cloud Client Library also supports other ways to config credentials. But it seems hard to mimic what GoogleHadoopFileSystem does to support various of ways to config credentials. Left some minor comments. Otherwise LGTM. |
class GCSFileSigner( | ||
name: URI, | ||
conf: Configuration, | ||
preSignedUrlTimeoutSeconds: Long) extends CloudFileSigner{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
preSignedUrlTimeoutSeconds: Long) extends CloudFileSigner{ | |
preSignedUrlTimeoutSeconds: Long) extends CloudFileSigner { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revised this.
val absPath = path.toUri | ||
val bucketName = absPath.getHost | ||
val objectName = absPath.getPath.stripPrefix("/") | ||
val storage = StorageOptions.newBuilder.build.getService |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is storage
thread-safe? If so, we can save it as a val in GCSFileSigner
to avoid loading the credentials for each url.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think storage
is thread-safe.
One instance of GCSFileSigner
is created in each request for client, so each thread for a request has a dedicated instance of GCSFileSigner
and won't touch other GCSFileSigner
instances.
And, storage
is used only for reading, so it will be OK if some threads for creating a presignied-URL are processed in paralle.
@@ -551,4 +551,34 @@ class DeltaSharingServiceSuite extends FunSuite with BeforeAndAfterAll { | |||
} | |||
assert(e.getMessage.contains("Server returned HTTP response code: 403")) // 403 Forbidden | |||
} | |||
|
|||
integrationTest("gcs support") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could you use ignore
for this test for now? We will set up the credentials and enable it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used ignore
instead of integrationTest
.
Signed-off-by: Kohei Toshimitsu <k.tosshy.20@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks!
This is a PR for #20.
I implemented GCS support by using Google Cloud Client Library (https://cloud.google.com/storage/docs/reference/libraries) and Google Cloud Storage connector (https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage).
Google Cloud Storage connector (GCS connector) has a dependency problem with Apache Spark releated with Google Guava, so I added this library's shaded jar in server/lib (SBT Unmanaged dependencies, not Managed dependencies in build.sbt).
I implemented this very forcefully, is this no problem?
To use GCP for delta sharing server, you have to do 2 steps.
export GOOGLE_APPLICATION_CREDENTIALS=</path/to/your_gcp_service_acount_key.json>
If you want to coneect this server from Apache Spark, execute spark-shell or spark-submit like below.
I haven't implement integration test on GCP support yet.
If I understand correctly, what I should do is to add the almost same test with Azure support (https://github.com/delta-io/delta-sharing/blob/main/server/src/test/scala/io/delta/sharing/server/DeltaSharingServiceSuite.scala#L510).
Is this correct?
If my understanding is correct, I'll add this integreation test later.