
Remove subdir path from GCP bucket url #1117

Merged
merged 1 commit into from Oct 17, 2023

Conversation

materight
Contributor

Patch Description

When running a cloned task remotely with Google Storage (GS) as the storage backend, calling task.logger.set_default_upload_destination usually fails with an error like:

GET https://storage.googleapis.com/storage/v1/b/clearml/project_a/.pipelines/Main/task_1.d7691550c955409db4622fd74496f51c/artifacts/result_TRAIN_config/result_TRAIN_config.pkl/iam/testPermissions?permissions=storage.objects.get&permissions=storage.objects.update&prettyPrint=false: Not Found

This is because the permissions are being tested on a single file instead of a GS bucket. Removing the subdir from the bucket URL should fix it.
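For illustration, a minimal sketch of the idea behind the fix (the helper name and parsing are my own, not the actual ClearML change):

```python
from urllib.parse import urlparse

def bucket_root(url: str) -> str:
    """Hypothetical helper: keep only the bucket from a gs:// URL,
    dropping any subdir path before testing IAM permissions."""
    parsed = urlparse(url)
    # gs://clearml/project_a/.pipelines/Main -> netloc is "clearml"
    return "gs://" + parsed.netloc

assert bucket_root("gs://clearml/project_a/.pipelines/Main") == "gs://clearml"
```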

Testing Instructions

  • Run a pipeline with a single task that calls task.logger.set_default_upload_destination (a minimal sketch is given below)
  • Clone the task and enqueue it again for execution
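A rough reproduction sketch, assuming the clearml PipelineDecorator API; project, step, and bucket names are placeholders:

```python
from clearml import Task
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(cache=False)
def train_step():
    # Point uploads at the bucket root (no subdir in the URL).
    Task.current_task().get_logger().set_default_upload_destination("gs://clearml-bucket")
    return {"lr": 0.01}  # returned objects are pickled and uploaded as artifacts

@PipelineDecorator.pipeline(name="Main", project="project_a", version="1.0.0")
def main():
    train_step()

if __name__ == "__main__":
    main()  # afterwards, clone the component's task in the UI and re-enqueue it
```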

@jkhenning
Member

Hi @materight,

If the bucket has a specific subdir defined in the configuration, wouldn't it make sense to check these permissions on the subdir? Or are you saying this is not supported in Google Storage?

@materight
Contributor Author

Hi @jkhenning. The second one: this method works only on buckets: https://cloud.google.com/storage/docs/json_api/v1/buckets/testIamPermissions
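For reference, in the google-cloud-storage Python client this is Bucket.test_iam_permissions, which takes a bare bucket (names below are illustrative):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("clearml")  # bucket name only; a subdir path is not valid here
# Returns the subset of the requested permissions that the caller actually holds.
granted = bucket.test_iam_permissions(
    ["storage.objects.get", "storage.objects.update"]
)
print(granted)
```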

@jkhenning
Member

@materight got it 👍

@jkhenning
Member

In that specific case, how did you set the default output destination? I'm not sure I get how stripping out the subdir will prevent that, since the line that basically causes the file to be mentioned in the request is the one setting the blob to the test_obj, and when setting a default output_uri you do not specify a file - so how did the file end up there?

@materight
Contributor Author

materight commented Sep 19, 2023

This is something I couldn't figure out; I just call set_output_destination('gs://my-bucket'). It only happens when I run a task remotely after cloning it, so it's hard to debug.

But the main problem here is that the subdir is passed as bucket_url, so the resulting bucket object is invalid anyway. So blob.exists() is false because the bucket is invalid, then test_obj == bucket, and then the request fails because the bucket URI also contains the subdir.
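A rough paraphrase of the failing flow as described above (illustrative, not the exact ClearML source):

```python
from google.cloud import storage

client = storage.Client()
# The subdir has leaked into the bucket url, so the bucket object is invalid:
bucket = client.bucket("clearml/project_a/.pipelines/Main")
blob = bucket.blob("result_TRAIN_config.pkl")
# exists() comes back False because no such bucket can exist...
test_obj = blob if blob.exists() else bucket
# ...so permissions are tested on the bucket object, whose name still embeds
# the subdir, and the testIamPermissions request fails with Not Found.
test_obj.test_iam_permissions(["storage.objects.get", "storage.objects.update"])
```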

@jkhenning
Member

That makes sense, but it wouldn't yield the same error 😕

@materight
Contributor Author

Sorry, what do you mean?

@jkhenning
Member

If blob.exists() is false because the bucket is invalid, and test_obj == bucket, then the resulting object does not reference the file, and the error you mentioned can't be raised - so I'm worried about the flow that generates this error (even if the fix you added is indeed required)

@materight
Contributor Author

Ah, got it. If you can give me a hint on how/where this config.subdir is set, I can take a deeper look. I'm able to reproduce it, just not locally.

@jkhenning
Member

It can basically be set in the configuration file's google.storage credentials section (see here) when configuring the credentials for the bucket. It's strange that you have it configured when running remotely but not when running locally - can you share how the agent running this remotely is configured (especially in these storage-related sections)?

@materight
Contributor Author

materight commented Sep 20, 2023

I have this in the config file:

sdk {
    google.storage {
        project: "myproject"
        credentials_json: ${GOOGLE_APPLICATION_CREDENTIALS}
        pool_connections: 512
        pool_maxsize: 1024
    }
    development {
        default_output_uri: "gs://clearml-bucket"
    }
}

Also in the agent's logs I only see this:

sdk.google.storage.project = myproject
sdk.google.storage.credentials_json = /service_account_key.json
sdk.google.storage.pool_connections = 512
sdk.google.storage.pool_maxsize = 1024
sdk.development.default_output_uri = gs://clearml-bucket

Btw, I'm using pipelines; the result_TRAIN_config.pkl artifact it tries to access is a parameter of the pipeline component.

@jkhenning
Member

And can you reproduce that error message you shared?

@materight
Contributor Author

Yes, if I clone the failed task and re-enqueue it I get the same error. But if I run a new task it doesn't happen all the time.

@jkhenning
Member

The config file you attached is from your own workstation? If so, what's the config file used by the agent?

@materight
Contributor Author

It's exactly the same for both.

@jkhenning
Member

And the task log in the remote run where it happens? Can you share?

@materight
Contributor Author

Sure, here: task_ff8c8917072641bca89f27ddd488d74c.log

@materight
Contributor Author

Hi @jkhenning, any update on this? Would it be possible to merge it and dig in more if the issue reappears?

@jkhenning jkhenning merged commit a8ec107 into allegroai:master Oct 17, 2023