New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(catalog): add initial rest catalog impl #58
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great effort getting this in, just some small suggestions.
catalog/catalog.go
Outdated
@@ -47,19 +52,136 @@ func WithAwsConfig(cfg aws.Config) Option { | |||
} | |||
} | |||
|
|||
func WithCredential(cred string) Option { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like these options are being overloaded with probably only minimal overlap, do you think we should move these catalog implementations into their own packages, each with their own options?
I got a feeling this would happen with "common options", interested in your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I was thinking the same. Though my question is how many of these options will overlap with the other catalog implementations. Several of them probably could overlap. (This is probably why all the options in pyiceberg are all passed via property mappings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zeroshade yeah I think it is important to remember that a little duplication is OK, as long as the user of the API gets a neat typed interface.
Sure internally there maybe some overlap, but that typically erodes over time as corner cases evolve for each implementation.
As a user of the API I am using Glue + S3, am i likely to switch to hive + REST?
Personally I prefer a typed interface in Go APIs, over a bag of properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wolfeidau take a look at the way I've set up some interesting little generic stuff in the most recent commit making the Options
typed so that we can specify options which are explicit for one catalog or the other without having to do a ton of duplication. Let me know what you think.
@@ -91,7 +92,7 @@ func (c *GlueCatalog) ListTables(ctx context.Context, namespace table.Identifier | |||
// LoadTable loads a table from the catalog table details. | |||
// | |||
// The identifier should contain the Glue database name, then glue table name. | |||
func (c *GlueCatalog) LoadTable(ctx context.Context, identifier table.Identifier, props map[string]string) (*table.Table, error) { | |||
func (c *GlueCatalog) LoadTable(ctx context.Context, identifier table.Identifier, props iceberg.Properties) (*table.Table, error) { | |||
database, tableName, err := identifierToGlueTable(identifier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need a better name than identifierToGlueTable
eh.
dev/Dockerfile
Outdated
@@ -0,0 +1,66 @@ | |||
# Licensed to the Apache Software Foundation (ASF) under one or more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this dockerfile being used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used by the spark-iceberg
container in the docker-compose.yml file. It uses build: .
to tell it to use this Dockerfile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I missed that. In that case we should rename the image name, because I was assuming it's using tabulario/spark-iceberg
. Also do we even need the image customization at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
honestly i was just following the configuration that iceberg-python is using (the Dockerfile is from that repo) I'll see if it works without the customization and report back.
const usage = `iceberg. | ||
|
||
Usage: | ||
iceberg list [options] [PARENT] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be good to have proper help text, similar to https://py.iceberg.apache.org/cli/ but this can be done in a follow-up PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use -h
or --help
it'll print out this whole usage string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I saw that, but
Usage:
iceberg list [options] [PARENT]
iceberg describe [options] [namespace | table] IDENTIFIER
iceberg (schema | spec | uuid | location) [options] TABLE_ID
iceberg drop [options] (namespace | table) IDENTIFIER
iceberg files [options] TABLE_ID [--history]
iceberg rename [options] <from> <to>
iceberg properties [options] get (namespace | table) IDENTIFIER [PROPNAME]
iceberg properties [options] set (namespace | table) IDENTIFIER PROPNAME VALUE
iceberg properties [options] remove (namespace | table) IDENTIFIER PROPNAME
iceberg -h | --help | --version
Arguments:
PARENT Catalog parent namespace
IDENTIFIER fully qualified namespace or table
TABLE_ID full path to a table
PROPNAME name of a property
VALUE value to set
Options:
-h --help show this helpe messages and exit
--catalog TEXT specify the catalog type [default: rest]
--uri TEXT specify the catalog URI
--output TYPE output type (json/text) [default: text]
--credential TEXT specify credentials for the catalog
isn't super informative. What I meant is having some sort of better description of the different commands, similar to
Commands:
describe Describes a namespace xor table
drop Operations to drop a namespace or table
list Lists tables or namespaces
location Returns the location of the table
properties Properties on tables/namespaces
rename Renames a table
schema Gets the schema of the table
spec Returns the partition spec of the table
uuid Returns the UUID of the table
Again, this isn't in the scope of this PR and would be nice to improve eventually
}) | ||
}) | ||
|
||
cat, err := catalog.NewRestCatalog("rest", r.srv.URL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this needs to make sure that the correct authorization header is set, similar to https://github.com/apache/iceberg-python/blob/main/tests/catalog/test_rest.py#L116
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's verified in the TestListTablesPrefixed200
test, it uses the token check and then makes a request afterwards which confirms the authorization header is sent with the subsequent request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that we're testing a different endpoint here, I would say it's better to also make sure that the correct Authorization header is set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added test
|
||
authUri, err := url.Parse(r.srv.URL) | ||
r.Require().NoError(err) | ||
cat, err := catalog.NewRestCatalog("rest", r.srv.URL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, this needs to verify the authorization header
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we test the authorization header in the TestListTablesPrefixed200
I didn't think it necessary to add that same test in these. particularly given that we don't expose the client/session from the catalog object. I can add a test that isn't in the _test
package so that I can verify the header though if you think we still need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're calling a different endpoint here, so I think we should still make sure that the authorization header is properly configured
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added test
cat, err := catalog.NewRestCatalog("rest", r.srv.URL, catalog.WithOAuthToken(TestToken)) | ||
r.Require().NoError(err) | ||
|
||
r.Require().NoError(cat.CreateNamespace(context.Background(), catalog.ToRestIdentifier("leden"), iceberg.Properties{"foo": "bar", "super": "duper"})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also verify that the properties have actually be set on the namespace? Just verifying that there wasn't an error could be misleading and stuff could just silently pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is all mocked in this test, there isn't any namespace to verify properties for. The r.Equal
validations inside the handler verifies that the request was sent correctly and contained the properties as expected (see line 428)
tlsConfig *tls.Config | ||
credential string | ||
oauthToken string | ||
warehouseLocation string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can these be configured from the cmd line? I would assume we'd have a config file similar to what pyiceberg has?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently only credential
is configurable on the CLI.
I can add the warehouse and token easily as options. Unless we think that the config file would be better in general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a config file makes more sense, because then configs can be specifig to catalogs. Otherwise you'd have to maintain all the available config options as CLI options
output.Text("Renamed table from " + cfg.RenameFrom + " to " + cfg.RenameTo) | ||
case cfg.Drop: | ||
switch { | ||
case cfg.Namespace: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that we can drop a namespace, should there be an option to create a namespace? This can be done in a follow-up
I think what's currently missing is having a way to configure the Listing files via
I believe this is because |
Hmm. So, setting the env vars There is the ability to set a session token via the |
@nastra So I've figured out the issue: The properties are correctly being propagated to the FileIO object, however it looks like the tabular api doesn't like the Go Iceberg user-agent. I loaded up Anything we can do on the tabular side? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once the files have been removed that were previously used in the custom Dockerfile
.
@zeroshade could you also please open new issues for:
- improving help text
- config handling
- implementing remaining catalog operations for REST / Glue / ...
- ... (whatever else you think needs to be improved/done)
Having open issues increases the visibility and would also give other people in the community the chance to contribute by seeing what work needs to be done
@nastra Added several issues as suggested |
Adding an initial implementation and unit tests for the Rest catalog.