feat(catalog): add initial rest catalog impl #58

Merged
merged 20 commits into apache:main on Feb 14, 2024

zeroshade
Member

Adding an initial implementation and unit tests for the Rest catalog.

@zeroshade
Member Author

@github-actions github-actions bot added the INFRA label Jan 31, 2024
Contributor

@wolfeidau wolfeidau left a comment

Great effort getting this in, just some small suggestions.

@@ -47,19 +52,136 @@ func WithAwsConfig(cfg aws.Config) Option {
}
}

func WithCredential(cred string) Option {
Contributor

@wolfeidau wolfeidau Feb 1, 2024

It feels like these options are being overloaded with probably only minimal overlap. Do you think we should move these catalog implementations into their own packages, each with its own options?

I got a feeling this would happen with "common options", interested in your thoughts.

Member Author

Yea I was thinking the same. Though my question is how many of these options will overlap with the other catalog implementations. Several of them probably could overlap. (This is probably why all the options in pyiceberg are all passed via property mappings)

Contributor

@zeroshade yeah I think it is important to remember that a little duplication is OK, as long as the user of the API gets a neat typed interface.

Sure, internally there may be some overlap, but that typically erodes over time as corner cases evolve for each implementation.

As a user of the API I am using Glue + S3; am I likely to switch to Hive + REST?

Personally I prefer a typed interface in Go APIs, over a bag of properties.

Member Author

@wolfeidau take a look at the most recent commit: I've used generics to make the Options typed, so that we can declare options that apply to only one catalog or the other without a ton of duplication. Let me know what you think.
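
For readers following the thread, here is a minimal sketch of one way options can be typed per catalog with generics. It is not the code from the commit: `glueCatalog`, `restCatalog`, `WithRegion`, and the `options` fields are hypothetical stand-ins chosen for illustration. The idea is a phantom type parameter on the option type, so that the compiler rejects an option passed to the wrong constructor while the settings still live in one shared struct.

```go
package main

import "fmt"

// glueCatalog and restCatalog are hypothetical stand-ins for two catalog
// implementations; they are not the types from this PR.
type glueCatalog struct{ region string }

type restCatalog struct{ uri, credential string }

// options holds every setting in one unexported struct shared by all catalogs.
type options struct {
	region     string
	credential string
}

// Option carries a phantom type parameter naming the catalog it belongs to.
// T is never used at runtime; it only lets the compiler reject an option
// handed to the wrong constructor.
type Option[T glueCatalog | restCatalog] func(*options)

// WithRegion applies only to the Glue-style catalog.
func WithRegion(region string) Option[glueCatalog] {
	return func(o *options) { o.region = region }
}

// WithCredential applies only to the REST-style catalog.
func WithCredential(cred string) Option[restCatalog] {
	return func(o *options) { o.credential = cred }
}

// NewGlueCatalog accepts only options tagged for glueCatalog.
func NewGlueCatalog(opts ...Option[glueCatalog]) *glueCatalog {
	var o options
	for _, apply := range opts {
		apply(&o)
	}
	return &glueCatalog{region: o.region}
}

// NewRestCatalog accepts only options tagged for restCatalog.
func NewRestCatalog(uri string, opts ...Option[restCatalog]) *restCatalog {
	var o options
	for _, apply := range opts {
		apply(&o)
	}
	return &restCatalog{uri: uri, credential: o.credential}
}

func main() {
	rest := NewRestCatalog("http://localhost:8181", WithCredential("client:secret"))
	glue := NewGlueCatalog(WithRegion("us-east-1"))
	fmt.Println(rest.uri, rest.credential, glue.region)

	// The next line would not compile, because WithRegion returns
	// Option[glueCatalog] rather than Option[restCatalog]:
	// NewRestCatalog("http://localhost:8181", WithRegion("us-east-1"))
}
```

An option meant for several catalogs would need a generic constructor in this scheme, which is the trade-off against the small amount of duplication discussed above.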

@@ -91,7 +92,7 @@ func (c *GlueCatalog) ListTables(ctx context.Context, namespace table.Identifier
// LoadTable loads a table from the catalog table details.
//
// The identifier should contain the Glue database name, then glue table name.
-func (c *GlueCatalog) LoadTable(ctx context.Context, identifier table.Identifier, props map[string]string) (*table.Table, error) {
+func (c *GlueCatalog) LoadTable(ctx context.Context, identifier table.Identifier, props iceberg.Properties) (*table.Table, error) {
database, tableName, err := identifierToGlueTable(identifier)
Contributor

we need a better name than identifierToGlueTable eh.

dev/Dockerfile Outdated
@@ -0,0 +1,66 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
Contributor

where is this dockerfile being used?

Member Author

It's used by the spark-iceberg container in the docker-compose.yml file. It uses build: . to tell it to use this Dockerfile

Contributor

ah I missed that. In that case we should rename the image name, because I was assuming it's using tabulario/spark-iceberg. Also do we even need the image customization at this point?

Member Author

Honestly, I was just following the configuration that iceberg-python is using (the Dockerfile is from that repo). I'll see if it works without the customization and report back.

const usage = `iceberg.

Usage:
iceberg list [options] [PARENT]
Contributor

it would be good to have proper help text, similar to https://py.iceberg.apache.org/cli/ but this can be done in a follow-up PR

Member Author

If you use -h or --help it'll print out this whole usage string

Contributor

Yes I saw that, but

Usage:
  iceberg list [options] [PARENT]
  iceberg describe [options] [namespace | table] IDENTIFIER
  iceberg (schema | spec | uuid | location) [options] TABLE_ID
  iceberg drop [options] (namespace | table) IDENTIFIER
  iceberg files [options] TABLE_ID [--history]
  iceberg rename [options] <from> <to>
  iceberg properties [options] get (namespace | table) IDENTIFIER [PROPNAME]
  iceberg properties [options] set (namespace | table) IDENTIFIER PROPNAME VALUE
  iceberg properties [options] remove (namespace | table) IDENTIFIER PROPNAME
  iceberg -h | --help | --version

Arguments:
  PARENT         Catalog parent namespace
  IDENTIFIER     fully qualified namespace or table
  TABLE_ID       full path to a table
  PROPNAME       name of a property
  VALUE          value to set

Options:
  -h --help          show this help message and exit
  --catalog TEXT     specify the catalog type [default: rest]
  --uri TEXT         specify the catalog URI
  --output TYPE      output type (json/text) [default: text]
  --credential TEXT  specify credentials for the catalog

isn't super informative. What I meant is having some sort of better description of the different commands, similar to

Commands:
describe    Describes a namespace xor table
drop        Operations to drop a namespace or table
list        Lists tables or namespaces
location    Returns the location of the table
properties  Properties on tables/namespaces
rename      Renames a table
schema      Gets the schema of the table
spec        Returns the partition spec of the table
uuid        Returns the UUID of the table

Again, this isn't in the scope of this PR and would be nice to improve eventually

cmd/iceberg/main.go Outdated
})
})

cat, err := catalog.NewRestCatalog("rest", r.srv.URL,
Contributor

@nastra nastra Feb 12, 2024

I believe this needs to make sure that the correct authorization header is set, similar to https://github.com/apache/iceberg-python/blob/main/tests/catalog/test_rest.py#L116

Member Author

It's verified in the TestListTablesPrefixed200 test, it uses the token check and then makes a request afterwards which confirms the authorization header is sent with the subsequent request.

Contributor

given that we're testing a different endpoint here, I would say it's better to also make sure that the correct Authorization header is set

Member Author

added test
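
As an illustration of the per-endpoint check requested here, the following is a minimal sketch using only the standard library, not the test that was added to the PR. The helper names `requireBearer`, `newMockCatalog`, and `statusFor`, the token value, and the two paths are assumptions made for the example. Each mocked endpoint is wrapped separately, so a request that reaches any of them without the expected Authorization header fails visibly.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireBearer wraps an endpoint handler and answers 401 unless the request
// carries the expected bearer token.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer "+token {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newMockCatalog registers two mocked endpoints, each guarded separately.
// The paths are placeholders for this sketch.
func newMockCatalog(token string) http.Handler {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux := http.NewServeMux()
	mux.Handle("/v1/namespaces", requireBearer(token, ok))
	mux.Handle("/v1/namespaces/examples/tables", requireBearer(token, ok))
	return mux
}

// statusFor sends an in-memory GET to the handler and returns the status code.
func statusFor(h http.Handler, path, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mock := newMockCatalog("some_jwt_token")
	for _, path := range []string{"/v1/namespaces", "/v1/namespaces/examples/tables"} {
		withToken := statusFor(mock, path, "Bearer some_jwt_token")
		withoutToken := statusFor(mock, path, "")
		fmt.Println(path, withToken, withoutToken)
	}
}
```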


authUri, err := url.Parse(r.srv.URL)
r.Require().NoError(err)
cat, err := catalog.NewRestCatalog("rest", r.srv.URL,
Contributor

same as above, this needs to verify the authorization header

Member Author

Since we test the authorization header in TestListTablesPrefixed200, I didn't think it necessary to add that same test in these, particularly given that we don't expose the client/session from the catalog object. I can add a test that isn't in the _test package so that I can verify the header, though, if you think we still need it.

Contributor

we're calling a different endpoint here, so I think we should still make sure that the authorization header is properly configured

Member Author

added test

cat, err := catalog.NewRestCatalog("rest", r.srv.URL, catalog.WithOAuthToken(TestToken))
r.Require().NoError(err)

r.Require().NoError(cat.CreateNamespace(context.Background(), catalog.ToRestIdentifier("leden"), iceberg.Properties{"foo": "bar", "super": "duper"}))
Contributor

can we also verify that the properties have actually been set on the namespace? Just verifying that there wasn't an error could be misleading, and failures could silently pass

Member Author

This is all mocked in this test; there isn't any namespace to verify properties for. The r.Equal validations inside the handler verify that the request was sent correctly and contained the properties as expected (see line 428)
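
To make the point about handler-side validation concrete, here is a minimal standalone sketch, not the test from this PR. It assumes a request body with `namespace` and `properties` fields, and the names `createNamespaceRequest`, `expectBody`, and `postJSON` are hypothetical. The mock handler decodes what the client sent and compares it with the expected value, so a request with missing properties fails even though no real namespace exists.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"reflect"
	"strings"
)

// createNamespaceRequest mirrors the JSON body assumed for this sketch.
type createNamespaceRequest struct {
	Namespace  []string          `json:"namespace"`
	Properties map[string]string `json:"properties"`
}

// expectBody returns a mock handler that answers 200 only when the decoded
// request body equals want, and 400 otherwise.
func expectBody(want createNamespaceRequest) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var got createNamespaceRequest
		if err := json.NewDecoder(r.Body).Decode(&got); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		if !reflect.DeepEqual(got, want) {
			http.Error(w, "unexpected request body", http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

// postJSON sends body to the handler in memory and returns the status code.
func postJSON(h http.Handler, body string) int {
	req := httptest.NewRequest(http.MethodPost, "/v1/namespaces", strings.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := expectBody(createNamespaceRequest{
		Namespace:  []string{"leden"},
		Properties: map[string]string{"foo": "bar", "super": "duper"},
	})
	// A body carrying the expected properties, then one that drops them.
	fmt.Println(postJSON(h, `{"namespace":["leden"],"properties":{"foo":"bar","super":"duper"}}`))
	fmt.Println(postJSON(h, `{"namespace":["leden"],"properties":{}}`))
}
```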

catalog/rest_test.go Outdated
tlsConfig *tls.Config
credential string
oauthToken string
warehouseLocation string
Contributor

how can these be configured from the cmd line? I would assume we'd have a config file similar to what pyiceberg has?

Member Author

currently only credential is configurable on the CLI.

I can add the warehouse and token easily as options. Unless we think that the config file would be better in general

Contributor

I think a config file makes more sense, because then configs can be specific to catalogs. Otherwise you'd have to maintain all the available config options as CLI options
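
A rough sketch of what per-catalog configuration could look like follows. Everything in it is hypothetical: the JSON layout, the field names, and the helpers `parseConfig` and `lookup` are assumptions made for illustration, and JSON is used only to stay within the standard library. No file format was decided in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// catalogConfig lists the settings discussed in this thread. The field names
// and JSON layout are hypothetical.
type catalogConfig struct {
	Type       string `json:"type"`
	URI        string `json:"uri"`
	Credential string `json:"credential,omitempty"`
	Token      string `json:"token,omitempty"`
	Warehouse  string `json:"warehouse,omitempty"`
}

// config maps a catalog name to its own settings.
type config struct {
	Catalogs map[string]catalogConfig `json:"catalogs"`
}

// parseConfig decodes the raw file contents.
func parseConfig(data []byte) (config, error) {
	var cfg config
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

// lookup returns the settings for one named catalog.
func lookup(cfg config, name string) (catalogConfig, error) {
	c, ok := cfg.Catalogs[name]
	if !ok {
		return catalogConfig{}, fmt.Errorf("catalog %q is not configured", name)
	}
	return c, nil
}

func main() {
	raw := []byte(`{
	  "catalogs": {
	    "default": {"type": "rest", "uri": "http://localhost:8181", "warehouse": "example"},
	    "prod":    {"type": "rest", "uri": "https://catalog.example.com", "credential": "id:secret"}
	  }
	}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		panic(err)
	}
	c, err := lookup(cfg, "prod")
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Type, c.URI)
}
```

With a layout like this, the CLI would only need a flag selecting the catalog name, and each catalog could carry settings that the others do not have.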

output.Text("Renamed table from " + cfg.RenameFrom + " to " + cfg.RenameTo)
case cfg.Drop:
switch {
case cfg.Namespace:
Contributor

@nastra nastra Feb 13, 2024

given that we can drop a namespace, should there be an option to create a namespace? This can be done in a follow-up

@nastra
Contributor

nastra commented Feb 13, 2024

I think what's currently missing is having a way to configure the warehouse (which I hardcoded for testing) but also handling the signing part of requests against S3, similar to https://github.com/apache/iceberg-python/blob/f66e3652fdf9720d6c63a6fcec7bcd08d5bb186c/pyiceberg/io/fsspec.py#L70-L95

Listing files via go run ./cmd/iceberg files iceberg124.foobar --catalog rest --uri https://api.dev.tabular.io/ws/ --credential <creds> will fail with

2024/02/13 10:24:07 could not open manifest file: operation error S3: GetObject, https response error StatusCode: 403, RequestID: 066G7WZD23KHZCBJ, HostID: d4V0iCd2uzvp9gZJWDOWmljaREgSaL9Iro0XxOFsv38ECJpdCd/JHWG8Y6/i7oSal8cONZ87Tis=, api error AccessDenied: Access Denied
exit status 1

I believe this is because FileIO isn't configured with the TOKEN in the authorization header that's coming back from the config inside tblResponse here. Reading all other table metadata works via the CLI, but only because those commands never use FileIO; only files does that atm.

@zeroshade
Member Author

@nastra

Hmm. So, setting the env vars AWS_REGION, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY should all work and get picked up by the FileIO. But I haven't tried testing with https://api.dev.tabular.io/ws/ before.

There is the ability to set a session token via the s3.session-token property but you're right that I don't think it gets propagated. Is there any special configuration I need to set up in order to try testing out the api.dev.tabular.io/ws/ uri myself?

@zeroshade
Member Author

@nastra So I've figured out the issue:

The properties are correctly being propagated to the FileIO object, however it looks like the tabular api doesn't like the Go Iceberg user-agent.

I loaded up pyiceberg to see what it does differently, and saw that the response to its table request included a series of S3 properties in the config, including an access-key-id, session-token, and secret-access-key. When I looked at the same request from the Go CLI, those properties weren't there. If I hardcode the User-Agent that the Go CLI sends to PyIceberg/0.5.1, those properties are suddenly returned and loading the manifests works just fine. So the problem is definitely that the tabular REST catalog doesn't recognize the Go User-Agent and therefore doesn't send the S3 key properties.

Anything we can do on the tabular side?
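
For anyone who wants to reproduce the User-Agent experiment described above, one standard-library way to override the header on every request is a wrapping `http.RoundTripper`. This is a sketch only, not how the Go CLI sets its header: `userAgentTransport`, `recordingTransport`, `sentUserAgent`, the placeholder URL, and the example header value are all hypothetical. The recording transport stands in for the network so the example runs offline.

```go
package main

import (
	"fmt"
	"net/http"
)

// userAgentTransport sets a fixed User-Agent on every outgoing request.
type userAgentTransport struct {
	base      http.RoundTripper
	userAgent string
}

func (t userAgentTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// A RoundTripper must not modify the caller's request, so work on a clone.
	clone := req.Clone(req.Context())
	clone.Header.Set("User-Agent", t.userAgent)
	return t.base.RoundTrip(clone)
}

// recordingTransport replaces the network for this sketch: it remembers the
// User-Agent it received and returns an empty 200 response.
type recordingTransport struct{ seen string }

func (r *recordingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	r.seen = req.Header.Get("User-Agent")
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       http.NoBody,
		Request:    req,
	}, nil
}

// sentUserAgent issues one GET through the wrapper and reports the header
// value that the underlying transport observed.
func sentUserAgent(userAgent string) (string, error) {
	rec := &recordingTransport{}
	client := &http.Client{Transport: userAgentTransport{base: rec, userAgent: userAgent}}
	resp, err := client.Get("http://catalog.invalid/v1/config") // placeholder URL, never dialed
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return rec.seen, nil
}

func main() {
	ua, err := sentUserAgent("ExampleClient/0.1")
	if err != nil {
		panic(err)
	}
	fmt.Println(ua)
}
```

In a real client, `base` would be `http.DefaultTransport` or whatever transport the catalog already uses.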
During RestCatalog.LoadTable

catalog/catalog.go Outdated
dev/spark-defaults.conf Outdated
dev/run-minio.sh Outdated
dev/provision.py Outdated
dev/entrypoint.sh Outdated
Contributor

@nastra nastra left a comment

LGTM once the files previously used by the custom Dockerfile have been removed.

@zeroshade could you also please open new issues for:

  • improving help text
  • config handling
  • implementing remaining catalog operations for REST / Glue / ...
  • ... (whatever else you think needs to be improved/done)

Having open issues increases the visibility and would also give other people in the community the chance to contribute by seeing what work needs to be done

@nastra nastra merged commit d209a3f into apache:main Feb 14, 2024
5 checks passed
@zeroshade zeroshade deleted the add-rest-catalog branch February 14, 2024 16:16
@zeroshade
Member Author

@nastra Added several issues as suggested


3 participants