Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support gcs:// and r2:// URLs to read data from GCS and R2 #9388

Merged
merged 4 commits into from Nov 6, 2023

Conversation

chrisiou
Copy link
Contributor

@chrisiou chrisiou commented Oct 18, 2023

This PR adds two new schemes for the s3 filesystem:

  • The default region of all services is changed to us-east-1. Relevant links for: s3, gcs and r2
  • The gcs and s3 services have their corresponding endpoints as defaults.
  • The r2 gets s3.amazonaws.com as default because otherwise (for ther2.cloudflarestorage.com endpoint), it needs additionally an account_id (this will be addressed in another PR)
  • The gcs and r2 prefixes have been added wherever needed.
  • The url_encode.test is modified to additionally test gcs and r2 like s3 url.

@chrisiou chrisiou marked this pull request as draft October 18, 2023 15:16
@samansmink samansmink self-requested a review October 18, 2023 15:20
Catalog Error: Table with name gcs://test-bucket/whatever does not exist!

statement error
SELECT * FROM 'r2://test-bucket/whatever';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if these tests are also executed on Windows, if they are then specifying the full path (including slashes) in the expected error might break when tested on Windows

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tishj I think this should work on windows? the slashes in http urls are / regardless of platform?

@chrisiou I have another comment with these tests though. The idea behind this test was to confirm that the generated http urls are correct. So for example when using the settings from source ./scripts/set_s3_test_server_variables.sh, when should be able to test this:

SELECT * FROM 'gcs://test-bucket/whatever.parquet';
Error: IO Error: Unable to connect to URL "http://test-bucket.duckdb-minio.com:9000/whatever.parquet": 404 (Not Found)

Note that the endpoint is now set to duckdb-minio.com:9000 as defined by the env variable from the script. Now when I run the following:

D set s3_endpoint='s3.some.random.endpoint.com';
D SELECT * FROM 'gcs://test-bucket/whatever.parquet';
Error: IO Error: Connection error for HTTP HEAD to 'http://test-bucket.s3.some.random.endpoint.com/whatever.parquet'

You can see that the error message can show us we're constructing the correct urls here. Could you rewrite the test in such a way that this test confirm we are using the right endpoint default with each prefix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samansmink you're right, I'm thinking of local filesystem paths, my bad!

@chrisiou chrisiou marked this pull request as ready for review October 19, 2023 08:18
Copy link
Contributor

@samansmink samansmink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @chrisiou, thanks for the PR, looks good! I have added 2 small comments

@@ -193,7 +193,13 @@ S3AuthParams S3AuthParams::ReadFrom(FileOpener *opener, FileOpenerInfo &info) {
if (FileOpener::TryGetCurrentSetting(opener, "s3_endpoint", value, info)) {
endpoint = value.ToString();
} else {
endpoint = "s3.amazonaws.com";
if (StringUtil::StartsWith(info.file_path, "gcs://")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of this change is that when a user queries somethings like:

SELECT * FROM 'gcs://test-bucket/some/file';

We can automatically set the endpoint as you are doing here. However I think this codepath is currently never hit, because the TryGetCurrentSetting now always returns true, setting the endpoint to an empty string in this case. Could you rewrite this in such a way that when the s3_endpoint does not exist or is empty, we set the default based on the prefix?

Catalog Error: Table with name gcs://test-bucket/whatever does not exist!

statement error
SELECT * FROM 'r2://test-bucket/whatever';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tishj I think this should work on windows? the slashes in http urls are / regardless of platform?

@chrisiou I have another comment with these tests though. The idea behind this test was to confirm that the generated http urls are correct. So for example when using the settings from source ./scripts/set_s3_test_server_variables.sh, when should be able to test this:

SELECT * FROM 'gcs://test-bucket/whatever.parquet';
Error: IO Error: Unable to connect to URL "http://test-bucket.duckdb-minio.com:9000/whatever.parquet": 404 (Not Found)

Note that the endpoint is now set to duckdb-minio.com:9000 as defined by the env variable from the script. Now when I run the following:

D set s3_endpoint='s3.some.random.endpoint.com';
D SELECT * FROM 'gcs://test-bucket/whatever.parquet';
Error: IO Error: Connection error for HTTP HEAD to 'http://test-bucket.s3.some.random.endpoint.com/whatever.parquet'

You can see that the error message can show us we're constructing the correct urls here. Could you rewrite the test in such a way that this test confirm we are using the right endpoint default with each prefix?

@chrisiou chrisiou closed this Oct 23, 2023
@samansmink samansmink reopened this Oct 23, 2023
@github-actions github-actions bot marked this pull request as draft October 23, 2023 09:25
@chrisiou chrisiou force-pushed the issue217 branch 2 times, most recently from 0ec975c to e51c7dd Compare October 23, 2023 14:32
@chrisiou chrisiou marked this pull request as ready for review October 24, 2023 08:51
extension/httpfs/s3fs.cpp Outdated Show resolved Hide resolved
@github-actions github-actions bot marked this pull request as draft October 25, 2023 13:47
@samansmink
Copy link
Contributor

Hey @chrisiou! Could you mark the PR as ready for review? this will trigger the CI to run

@chrisiou chrisiou marked this pull request as ready for review October 26, 2023 08:18
@chrisiou chrisiou closed this Oct 26, 2023
@chrisiou chrisiou reopened this Oct 26, 2023
@github-actions github-actions bot marked this pull request as draft November 3, 2023 12:34
@chrisiou chrisiou marked this pull request as ready for review November 3, 2023 12:34
@samansmink
Copy link
Contributor

@Mytherin This PR is good to go!

@Mytherin Mytherin merged commit 66b583b into duckdb:feature Nov 6, 2023
45 checks passed
@Mytherin
Copy link
Collaborator

Mytherin commented Nov 6, 2023

Thanks!

@chrisiou chrisiou deleted the issue217 branch November 6, 2023 13:11
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 11, 2023
Merge pull request duckdb/duckdb#9164 from Mause/feature/jdbc-uuid-param
Merge pull request duckdb/duckdb#9185 from pdet/adbc_07
Merge pull request duckdb/duckdb#9126 from Maxxen/parquet-kv-metadata
Merge pull request duckdb/duckdb#9123 from lnkuiper/parquet_schema
Merge pull request duckdb/duckdb#9086 from lnkuiper/json_inconsistent_structure
Merge pull request duckdb/duckdb#8977 from Tishj/python_readcsv_multi_v2
Merge pull request duckdb/duckdb#9279 from hawkfish/nsdate-cast
Merge pull request duckdb/duckdb#8851 from taniabogatsch/binary_lambdas
Merge pull request duckdb/duckdb#8983 from Maxxen/types/fixedsizelist
Merge pull request duckdb/duckdb#9318 from Maxxen/fix-unused
Merge pull request duckdb/duckdb#9220 from hawkfish/exclude
Merge pull request duckdb/duckdb#9230 from Maxxen/json-plan-serialization
Merge pull request duckdb/duckdb#9011 from Tmonster/add_create_statement_support_to_fuzzer
Merge pull request duckdb/duckdb#9400 from Maxxen/array-fixes
Merge pull request duckdb/duckdb#8741 from Tishj/python_import_cache_upgrade
Merge fixes
Merge pull request duckdb/duckdb#9395 from taniabogatsch/lambda-performance
Merge pull request duckdb/duckdb#9427 from Tishj/python_table_support_replacement_scan
Merge pull request duckdb/duckdb#9516 from carlopi/fixformat
Merge pull request duckdb/duckdb#9485 from Maxxen/fix-parquet-serialization
Merge pull request duckdb/duckdb#9388 from chrisiou/issue217
Merge pull request duckdb/duckdb#9565 from Maxxen/fix-array-vector-sizes
Merge pull request duckdb/duckdb#9583 from carlopi/feature
Merge pull request duckdb/duckdb#8907 from cryoEncryp/new-list-functions
Merge pull request duckdb/duckdb#8642 from Virgiel/capi-streaming-arrow
Merge pull request duckdb/duckdb#8658 from Tishj/pytype_optional
Merge pull request duckdb/duckdb#9040 from Light-City/feature/set_mg
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants