refactor: gcs storage driver #4120
Conversation
Have you seen the `(storage.Writer).ChunkSize` field? It sounds like all the buffering, chunking and retry code in the driver could be replaced with setting `wc.ChunkSize = d.chunkSize`!
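To make the suggestion concrete, here is a minimal sketch of the surrounding logic: the driver's configured chunk size would be aligned to the 256 KiB granularity the `Writer.ChunkSize` docs describe, then assigned directly. `roundUpChunkSize` is a hypothetical helper for illustration, not code from the driver:

```go
package main

import "fmt"

// minChunkSize mirrors the 256 KiB granularity that GCS resumable
// uploads require for every chunk except the last.
const minChunkSize = 256 * 1024

// roundUpChunkSize rounds a configured chunk size up to the nearest
// multiple of minChunkSize, so the result can be assigned directly to
// (storage.Writer).ChunkSize and let the SDK do the chunking.
func roundUpChunkSize(n int) int {
	if n <= minChunkSize {
		return minChunkSize
	}
	if r := n % minChunkSize; r != 0 {
		n += minChunkSize - r
	}
	return n
}

func main() {
	fmt.Println(roundUpChunkSize(300_000))          // rounds up to 2 * 256 KiB = 524288
	fmt.Println(roundUpChunkSize(16 * 1024 * 1024)) // already aligned: 16777216
}
```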
closed bool
cancelled bool
committed bool
This smells like an ad-hoc state machine. Are all eight combinations of these flags valid states? Are all transitions valid? You might want to look into whether this code could benefit from one of the state machine design patterns. (The state-space is small enough that importing an FSM or state-chart library would be massive overkill.) Even just sketching out a state diagram is really helpful, in my experience.
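For illustration, one way the three booleans could collapse into a single explicit state, with the legal transitions spelled out — a sketch only, with hypothetical state names, not the driver's actual types:

```go
package main

import "fmt"

// writerState replaces the closed/cancelled/committed booleans with a
// single enum, so impossible flag combinations cannot be represented.
type writerState int

const (
	stateOpen writerState = iota
	stateClosed
	stateCommitted
	stateCancelled
)

// validTransitions encodes the legal moves; any transition not listed
// here is a bug. Closed, committed and cancelled are terminal.
var validTransitions = map[writerState][]writerState{
	stateOpen: {stateClosed, stateCommitted, stateCancelled},
}

func (s writerState) canTransitionTo(next writerState) bool {
	for _, t := range validTransitions[s] {
		if t == next {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(stateOpen.canTransitionTo(stateCommitted))   // true
	fmt.Println(stateClosed.canTransitionTo(stateCommitted)) // false
}
```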
Yeah, agree. Right now this basically follows the same FSM established in the S3 driver. I'd like to zoom in on that at some point, too.
I've fixed the first batch of bugs and can report I successfully tested this on my personal GCS bucket 😄
Now I see that the chunking built into the Google Cloud SDK does not coalesce short writes. It only breaks up long writes, but that still covers half of what our own buffering is useful for. And we don't have to opt in as the
I don't think so. The GCS driver has always been marked as experimental historically — that's also the reason why building it requires explicitly opting in by specifying a build tag. The code was written by someone in the community and was never cleaned up; I'm attempting to do it, but it's a slow burn because I'm not yet very familiar with the GCS SDK and can only test it against my personal GCP account 😅 But at least things are starting to work now and I can finally iterate on them. Way more to be done here.
Running the GCS tests with code from this branch yields hundreds of "already closed" errors; here's the full output (down below I paste the output of a run on master as well, for reference).
Not all the tests are passing on the main branch either — for reference, here's the output of running the tests on
I need to look into them 😬 I've been testing this mostly by simple I have changed some code around how the driver is marked as closed, so that needs looking into. @flavianmissi what do you use for these tests? Do you just point them at an actual GCS bucket, or do you use some nice tool that lets you run them locally?
If it helps, we maintain a fork here: https://github.com/openshift/docker-distribution/blob/master/registry/storage/driver/testsuites/testsuites.go
I'd just create a mock GCP client. Another tool I've seen used in OSS is mockery: https://vektra.github.io/mockery/latest/#why-mockery
Mocks are cute and handy for unit tests, but not a replacement for integration tests 😄 I'm interested in the latter.
@flavianmissi the close errors should now be addressed. @corhere the Google Cloud Go SDK still doesn't seem to provide "native" support for resumable uploads. I was searching their GH repo and came across googleapis/google-cloud-go#1224, but it's 4 years old and as good as dead. This is as far as I could clean it up using the SDK. PTAL.
@milosgajdos the SDK uses resumable uploads internally to power chunking and retrying. That doesn't help when the upload needs to be
I was skimming through the docs, and on a first and very short look I couldn't tell. I'd need to spend a bit more time with it. Worth exploring, indeed.
Thinking about it some more, we'd have to fundamentally redesign how the GCS driver implements append-writes in order to use compose operations. That's way bigger than "just" a refactor. Maybe it's even a big enough change to justify forking the driver. Regardless, it would be massive scope creep for this PR, so I'm going to shut up and review this PR as is.
I've accepted a bunch of suggestions on the first look but will give this a proper look tomorrow. I've also realized that GCS
Some fantastic optimisation gains, @corhere. There is still the question of that
Alright, PTAL @corhere
The `(driver).gcs` field is only ever used for one thing: `d.gcs.Bucket(d.bucket)`. And with two exceptions, the `(driver).bucket` field is only used in the aforementioned expression. A `*BucketHandle` field on the driver could take the place of the `gcs` and `bucket` fields, much like what we did with the writer and `*ObjectHandle`! See inline comments for my suggestion to elide one of the `d.bucket` exceptions. The other exception, in `URLFor`, can be elided by switching to `func (*BucketHandle) SignedURL`. Also, by only constructing the `BucketHandle` once per driver instance, we would only have to apply the retryer config in one code location!
Anything else you want me to address on this PR, @corhere?
@milosgajdos just this:
The Google Cloud SDK will not retry those requests as uploads without preconditions are not idempotent and the default retry config only retries idempotent requests. |
Aaaaah, right. Sorry, I'm in a different timezone now and it's coming back to me 🙃
Force-pushed from 45c3fb7 to 9c813bc.
@milosgajdos what's your plan for merging this PR? Are you planning to squash it all down to a single commit, or clean up the history with rebase? |
Yes, gonna squash today. I have been leaving it unsquashed both because I was lazy but also because I was hoping more people would chip in. I'll squash and chase folks.
Force-pushed from 9c813bc to f85ebeb.
Ok, squashed. PTAL @thaJeztah @Jamstah @squizzi @wy65701436 🙏
@@ -49,11 +48,13 @@ const (
	driverName     = "gcs"
	dummyProjectID = "<unknown>"

	minChunkSize     = 256 * 1024
	defaultChunkSize = 16 * 1024 * 1024
This default was changed, was that intentional? (if so, that may warrant it to be put in a separate commit, outlining why it was changed, and what the new value was based on).
IIRC we set it to the default `ChunkSize` as defined in the docs — https://pkg.go.dev/cloud.google.com/go/storage#Writer (see the `ChunkSize` comment).
I can rip this out into a separate commit, but I'm not sure it's worth it. We're doing a major overhaul in this PR that changes things almost from the ground up — by the same token we'd have to separate a lot of changes into dedicated commits on a branch that's not been released yet 🤷♂️
registry/storage/driver/gcs/gcs.go
Outdated
res.Body.Close()
var status *googleapi.Error
if errors.As(err, &status) {
	if status.Code == http.StatusNotFound {
Bit of a nit, but while we're at it, perhaps change this block to a `switch` instead of multiple `if` statements:
switch status.Code {
case http.StatusNotFound:
// ... etc
Done: d461472
res.Body.Close()
obj, err := d.storageStatObject(ctx, path)
if status.Code == http.StatusRequestedRangeNotSatisfiable {
	attrs, err := obj.Attrs(ctx)
	if err != nil {
		return nil, err
Some of the other errors are converted to a specific error type (`storagedriver.InvalidOffsetError`, `storagedriver.PathNotFoundError`); is any conversion needed for this case? (Is a specific error type expected here, or is that already handled by `obj.Attrs`?)
This is a generic error that can happen when we fail to unpack object attributes: https://pkg.go.dev/cloud.google.com/go/storage#ObjectHandle.Attrs. It can return `ErrObjectNotExist`, which we could return as `PathNotFoundError`, though that'd have been handled in the earlier check via the `http.StatusNotFound` check 🤔
registry/storage/driver/gcs/gcs.go
Outdated
if res.Header.Get("Content-Type") == uploadSessionContentType {
	defer res.Body.Close()
if r.Attrs.ContentType == uploadSessionContentType {
	defer r.Close()
This was existing code, so perhaps no need to change, but it looks like we don't need a `defer` here, because we're returning immediately, so this could be just `r.Close()`?
Fair, but mind you, the return statement is right below it 😄
There ya go: 591271b
@@ -596,7 +577,7 @@ func retry(req request) error {
	}

	status, ok := err.(*googleapi.Error)
Not new code, so perhaps for a follow-up, but maybe we should rewrite this to use `errors.As`.
registry/storage/driver/gcs/gcs.go
Outdated
}
req.Header.Set("Content-Type", blobContentType)
if from == to+1 {
	req.Header.Set("Content-Range", fmt.Sprintf("bytes */%v", size))
`size` is a string; we could save some cycles by using a straight concat here:
req.Header.Set("Content-Range", "bytes */" + size)
Or otherwise make it explicit that it's a string:
req.Header.Set("Content-Range", fmt.Sprintf("bytes */%s", size))
done: 9b3043c
registry/storage/driver/gcs/gcs.go
Outdated
if from == to+1 {
	req.Header.Set("Content-Range", fmt.Sprintf("bytes */%v", size))
} else {
	req.Header.Set("Content-Range", fmt.Sprintf("bytes %v-%v/%v", from, to, size))
Same here;
req.Header.Set("Content-Range", fmt.Sprintf("bytes %d-%d/%s", from, to, size))
Done: 9b3043c
resp, err := client.Do(req)
bytesPut := int64(0)
Perhaps;
var bytesPut int64
I actually prefer explicit assignment to the default Go init values 😄
if totalSize < 0 && resp.StatusCode == 308 {
if totalSize < 0 && resp.StatusCode == http.StatusPermanentRedirect {
Looks like this was added in 7162cb1 (#1438), but I actually had to Google to see what the difference is between a `308` and a `301`:

> The main difference between the 301 and 308 redirects is that when a 308 redirect code is specified, the client must repeat the exact same request (POST or GET) on the target location. For 301 redirect, the client may not necessarily follow the exact same request.

So the difference here would be that a `308` MUST preserve the request type (i.e. `POST` or `GET`), but "exact same request" (my interpretation) would also mean "preserve all headers", which MAY include authentication headers, correct?
Wondering if (in a follow-up) we should explicitly call out why a 308 is used here.
So the difference here would be that a 308 MUST preserve the request type (i.e. POST or GET), but "exact same request" (my interpretation) would also mean "preserve all headers", which MAY include authentication headers, correct?
That's my assumption, too.
registry/storage/driver/gcs/gcs.go
Outdated
if err == iterator.Done {
	break
}
if err != nil {
	return nil, err
}
Also see my other comment; we should be consistent and pick one approach for these:
if err != nil {
if err == iterator.Done {
break
}
return nil, err
}
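The suggested shape, shown end-to-end against a fake iterator; `errDone` below stands in for `iterator.Done` from `google.golang.org/api/iterator`, and `fakeIter` is an illustrative substitute for a `*storage.ObjectIterator`:

```go
package main

import (
	"errors"
	"fmt"
)

// errDone stands in for iterator.Done.
var errDone = errors.New("no more items in iterator")

// fakeIter yields names until exhausted, then returns errDone, the
// way SDK iterators signal completion.
type fakeIter struct{ names []string }

func (it *fakeIter) Next() (string, error) {
	if len(it.names) == 0 {
		return "", errDone
	}
	n := it.names[0]
	it.names = it.names[1:]
	return n, nil
}

// collect shows the consistent error-handling shape: a single
// err != nil block, with the Done check nested inside it.
func collect(it *fakeIter) ([]string, error) {
	var out []string
	for {
		name, err := it.Next()
		if err != nil {
			if errors.Is(err, errDone) {
				break
			}
			return nil, err
		}
		out = append(out, name)
	}
	return out, nil
}

func main() {
	got, _ := collect(&fakeIter{names: []string{"a", "b"}})
	fmt.Println(got) // [a b]
}
```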
Yah, nice catch 77624e7
Think I've addressed all your comments, @thaJeztah — PTAL. Seems like GH Actions were having a bad time yesterday; I re-ran the failing actions and we're back to good now. With the latest update, we're on par with what's currently on main.

Test output:
$ GOOGLE_APPLICATION_CREDENTIALS="$HOME/.config/application_default_credentials.json" REGISTRY_STORAGE_GCS_BUCKET="distributiontest" go test -timeout=10m -tags=include_gcs -v ./registry/storage/driver/gcs/...
=== RUN Test
----------------------------------------------------------------------
FAIL: /home/milosgajdos/distribution/registry/storage/driver/testsuites/testsuites.go:535: DriverSuite.TestWriteZeroByteContentThenAppend
/home/milosgajdos/distribution/registry/storage/driver/testsuites/testsuites.go:557:
c.Assert(err, check.IsNil)
... value driver.PathNotFoundError = driver.PathNotFoundError{Path:"tmp/driver-2184697195/n-r-3/rww9h/z-78n/p-ntxaec_k99y", DriverName:"gcs"} ("gcs: Path not found: tmp/driver-2184697195/n-r-3/rww9h/z-78n/p-ntxaec_k99y")
----------------------------------------------------------------------
FAIL: /home/milosgajdos83/distribution/registry/storage/driver/testsuites/testsuites.go:477: DriverSuite.TestWriteZeroByteStreamThenAppend
/home/milosgajdos/distribution/registry/storage/driver/testsuites/testsuites.go:508:
c.Assert(err, check.IsNil)
... value driver.PathNotFoundError = driver.PathNotFoundError{Path:"tmp/driver-2184697195/u9-0xnjsdq5w9/ujo/h/bpnzw4/x_71", DriverName:"gcs"} ("gcs: Path not found: tmp/driver-2184697195/u9-0xnjsdq5w9/ujo/h/bpnzw4/x_71")
OOPS: 34 passed, 2 FAILED
--- FAIL: Test (152.24s)
=== RUN TestCommitEmpty
--- PASS: TestCommitEmpty (0.09s)
=== RUN TestCommit
--- PASS: TestCommit (0.47s)
=== RUN TestRetry
--- PASS: TestRetry (28.00s)
=== RUN TestEmptyRootList
--- PASS: TestEmptyRootList (0.12s)
=== RUN TestMoveDirectory
--- PASS: TestMoveDirectory (0.12s)
FAIL
FAIL github.com/distribution/distribution/v3/registry/storage/driver/gcs 181.535s
FAIL

I have some ideas about why the two tests fail, but I'd have to verify that theory first. For now, we're good. PTAL @thaJeztah @Jamstah
Ok, I figured out the cause of the failing tests 🎉 CC: @flavianmissi — all the tests are now passing on this branch 😄
Does the current code work against GCS? If not, then I'll happily approve this. If it does, then I think testing is the big thing here. If we haven't tested against GCS, then we shouldn't merge.
Although it looks like you did sign up for a bucket here?
Yeah, it's tested on my personal GCP account. This is a proper integration test result. No LARP 😁
Code looks squeaky clean, haven't tested myself but confident in the testing performed.
Would be nice to figure out integration tests across all major cloud providers. I'll open a ticket with CNCF; maybe they have some cloud credits we could (ab)use.
Going to squash and then merge.
This commit refactors the GCS storage driver from the ground up and makes it more consistent with the rest of the storage drivers. We are also fixing GCS authentication using default app credentials: When the default application credentials are used we don't initialize the GCS storage client which then panics. Co-authored-by: Cory Snider <corhere@gmail.com> Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
Force-pushed from a46f442 to e001fad.
post-merge LGTM
sorry was on PTO for a few days, and it was hard to look from my phone 😂
This refactors the GCS storage driver from the ground up and makes it a bit more consistent with the rest of the storage drivers.

Notably, the `writer` now contains a reference to the `driver`, in a similar fashion to how the S3 writer is implemented. Some of the "helper" funcs were removed and others were made methods on specific GCS driver objects. This simplifies the code quite a bit and gets it more aligned with the other storage driver implementations in this repository.

NOTE: There is more work to be done, but I figured this could be a first step, hence the PR is marked as a DRAFT at the moment.

CC: @flavianmissi @jmontleon @kaovilai since you were the OG authors of the recent GCS driver update. Would you mind having a look? If you have a way to test it somewhere in GCS, that'd be grand — I've only done some local "simulation" tests.