
feat: CREATE CONNECTION #13489

Open
BohuTANG opened this issue Oct 30, 2023 · 23 comments
Assignees
Labels
C-feature Category: feature

Comments

@BohuTANG
Member

BohuTANG commented Oct 30, 2023

Summary

A connection object is a named set of connection parameters, and we can apply roles to it in the future.
Syntax:

CREATE CONNECTION [IF NOT EXISTS] <name>
    ENDPOINT_URL = '...',
    ACCESS_KEY_ID = '...',
    SECRET_ACCESS_KEY = '...'
...

We can use it in COPY or in other places where we need an external storage connection, for example:

CREATE CONNECTION my_conn ENDPOINT_URL = '...', ACCESS_KEY_ID = '...', SECRET_ACCESS_KEY = '...';
CREATE TABLE external_table CONNECTION = (CONNECTION_NAME='my_conn');

More statements:

CREATE CONNECTION ...
SHOW CONNECTIONS;
SHOW CREATE CONNECTION xx;
DROP CONNECTION ...
@BohuTANG BohuTANG added the C-feature Category: feature label Oct 30, 2023
@BohuTANG BohuTANG changed the title feat: CREATE STORAGE INTEGRATION feat: CREATE CONNECTION Oct 30, 2023
@youngsofun youngsofun self-assigned this Oct 31, 2023
@youngsofun
Member

youngsofun commented Nov 1, 2023

@BohuTANG

It is better to store the connection as StorageParams, but parsing it requires both (URL, CONNECTION): first, the protocol from the URL is needed; some schemes also need the hostname part, e.g. URL = 'azure://account.blob.core.windows.net/container[/path/]'. The CONNECTION part is then parsed according to the protocol.

Otherwise, CONNECTION is only an unchecked map<string, string>.

Two solutions:

  1. Do not create a connection object, but create location = (url, connection). However, the URL includes the variable path, so this is not a good choice.
  2. Provide the protocol when creating the CONNECTION, then call StorageParams::apply_url(uri) -> StorageParams when using the named connection (as the base StorageParams).
    • CREATE CONNECTION my_conn S3 ...
    • CREATE CONNECTION my_conn protocol = s3 ...
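Solution 2 can be sketched roughly as follows. This is only an illustrative sketch under assumptions: `StorageParams` here is a simplified stand-in, not Databend's real type, and `apply_url` is the proposed method, not an existing API.

```rust
// Simplified stand-in for StorageParams (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
struct StorageParams {
    scheme: String,       // fixed when the connection is created, e.g. "s3"
    endpoint_url: String, // from the stored connection
    bucket: String,       // empty in the stored base params
    root: String,         // "/" in the stored base params
}

impl StorageParams {
    // Proposed apply_url: merge the stored base connection with a concrete
    // URL such as "s3://my-bucket/data/" to obtain complete storage params.
    fn apply_url(&self, url: &str) -> Result<StorageParams, String> {
        let (scheme, rest) = url
            .split_once("://")
            .ok_or_else(|| format!("missing scheme in '{url}'"))?;
        if scheme != self.scheme {
            return Err(format!("expected scheme '{}', got '{scheme}'", self.scheme));
        }
        let (bucket, root) = match rest.split_once('/') {
            Some((b, p)) => (b.to_string(), format!("/{p}")),
            None => (rest.to_string(), "/".to_string()),
        };
        Ok(StorageParams { bucket, root, ..self.clone() })
    }
}

fn main() {
    // Base params stored by CREATE CONNECTION my_conn protocol = s3 ...
    let base = StorageParams {
        scheme: "s3".to_string(),
        endpoint_url: "https://s3.amazonaws.com".to_string(),
        bucket: String::new(),
        root: "/".to_string(),
    };
    println!("{:?}", base.apply_url("s3://my-bucket/data/"));
}
```

With this shape, the protocol is checked once at CREATE time, and each use site only supplies the URL.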

@youngsofun
Member

cc @Xuanwo

@BohuTANG
Member Author

BohuTANG commented Nov 1, 2023

I think CONNECTION is more like an alias, so the checks we do now would not change with the new CONNECTION object.
When using CONNECTION = (CONNECTION_NAME='my_conn'), it is more like expanding it to CONNECTION = (ENDPOINT_URL='' ...) internally, so the check is the same as what we do now?

@youngsofun
Member

youngsofun commented Nov 1, 2023

CREATE CONNECTION my_conn storage_type = 's3' ...
We can consider storage_provider as part of the CONNECTION options, like type in file_format,
but when there is a URL the user does not have to provide it.

A base StorageParams with hostname = "" and root = '/' will be stored.
When used with a URL, call StorageParams::apply_url(uri) -> StorageParams to get the StorageParams needed.

@youngsofun
Member

By the way, logically, most of the work of parsing a connection (a map<string, string>) could be brought forward to the parse stage (it is now done in bind).
The price is that the ast crate will depend on opendal to use Scheme.
I think it is not worth it.

@Xuanwo
Member

Xuanwo commented Nov 1, 2023

The price is that the ast crate will depend on opendal to use Scheme.

Databend doesn't support all the services that opendal supports. Do you think it's a good idea to maintain a service scheme enum in ast?

@youngsofun
Member

youngsofun commented Nov 1, 2023

The price is that the ast crate will depend on opendal to use Scheme.

Databend doesn't support all the services that opendal supports. Do you think it's a good idea to maintain a service scheme enum in ast?

Makes sense! Let me do it!

@youngsofun
Member

@BohuTANG @Xuanwo

As I am working on this issue, I need some help.

First of all, I am wondering whether it is necessary to add this feature.


connection = (...) is used in:

  1. DDLs: "create external table" and "create external stage".
  2. DML & query: stage-related operations (copy, list, remove, infer_schema, select from stage ...), so (URI, CONNECTION) can be used like a stage.

The motivation for adding a user-defined connection:

  1. Connections may contain keys, and a user may want to limit them to Admin. It is better practice to use them only in DDLs, not in DML.
  2. Saving the user from entering connection details each time, especially for DML & query.

But we do not need a new feature to achieve these two points:

  • For external tables, the connection is only used in DDL creation.
  • For external stages, the user (admin) can create the external stage.

So for now the user only has to input connection details in DDL.

In other words, the connection object is only necessary when creating multiple external tables/stages that share one connection, or for adding another level of security.

For stage-related DML & query, the user would then have two choices, create a connection or create an external stage, which is not a good thing.


We do not have many types of objects for now, so I wonder if we should add this one.
However, maybe there is still reason enough. Let us discuss some implementation details.


For now: URI(bucket, path) + CONNECTION = StorageConfig + path.

The definition of StorageConfig is clear: it provides a root for the path. StorageConfig and path are independent.
A connection without a URI is not such an independent object; a connection is just a piece of grammar: params to make the URI work. The URI is more fundamental. Specifically, we cannot use just any URI with any CONNECTION, since some params are only valid for the region/bucket specified in the URI (I am not very sure about this, cc @Xuanwo).

We have two ways of defining/using it:

First: never include the URI part (the connection is always used with a URI).
Internally there are two ways to implement this:

  1. Add a new enum Connection (I prefer this).
  2. Reuse StorageConfig, but always incomplete.

Second: the connection may or may not include the URI part; let the connection correspond to a StorageConfig (maybe incomplete).
We already support url = protocol://<bucket>/<path> connection = ().
Maybe we can also support:

url = protocol:///<path>   connection = (bucket=...)
url = /<path>   connection = (storage_type = s3 ... bucket=<bucket1> ...) # here connection == StorageConfig

create connection url=protocol://<bucket>/<path> ... # here the path will be used as root/prefix in StorageConfig

The drawback is that there may be an incomplete StorageConfig.
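The completeness concern in the second approach can be made concrete with a small sketch (hypothetical field and function names, not Databend code): after merging the URL-derived fields with the stored connection options, the use site must check that the resulting StorageConfig is complete.

```rust
use std::collections::HashMap;

// Hypothetical, simplified StorageConfig: any field may be missing because
// it can come either from the URL or from the stored connection options.
#[derive(Default, Debug)]
struct StorageConfig {
    storage_type: Option<String>,
    bucket: Option<String>,
    root: Option<String>,
}

impl StorageConfig {
    // Fill fields from the connection option map, keeping values already
    // set (e.g. those parsed from the URL) as the higher-priority source.
    fn merge_options(mut self, opts: &HashMap<String, String>) -> Self {
        self.storage_type = self.storage_type.or_else(|| opts.get("storage_type").cloned());
        self.bucket = self.bucket.or_else(|| opts.get("bucket").cloned());
        self.root = self.root.or_else(|| opts.get("root").cloned());
        self
    }

    // Report which required fields are still missing after merging.
    fn missing_fields(&self) -> Vec<&'static str> {
        let mut missing = vec![];
        if self.storage_type.is_none() { missing.push("storage_type"); }
        if self.bucket.is_none() { missing.push("bucket"); }
        missing
    }
}

fn main() {
    let opts = HashMap::from([
        ("storage_type".to_string(), "s3".to_string()),
        ("bucket".to_string(), "bucket1".to_string()),
    ]);
    // url = /<path>: only the root comes from the URL, the rest from the connection.
    let cfg = StorageConfig { root: Some("/data/".to_string()), ..Default::default() }
        .merge_options(&opts);
    println!("missing: {:?}", cfg.missing_fields()); // empty: config is complete
}
```

Reporting `missing_fields()` as an error at bind time is what would keep the "maybe incomplete StorageConfig" tolerable.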

@youngsofun
Member

youngsofun commented Nov 7, 2023

Just to make sure:

If we create a stage with a named connection and the named connection changes (e.g. the key changed), the stage should use the changed CONNECTION object, i.e. the contents of the connection are not applied only once when the stage is created.

@Xuanwo
Member

Xuanwo commented Nov 7, 2023

if my_conn changes, the stage should use the changed CONNECTION object.

This behavior seems likely to surprise users. For instance, user A creates multiple stages with connection X, but user B accidentally alters connection X. Consequently, user A discovers that all of his stages are broken.

CONNECTION to me is more like a set of args that could be reused.

@youngsofun
Member

By the way, logically, most of the work of parsing a connection (a map<string, string>) could be brought forward to the parse stage (now in bind).

I tried; it turns out it is not suitable for the parser to do such dynamic parsing (parse the URI first, then parse the connection accordingly). It is also not easy to get good error messages, and there are no big gains.

@youngsofun
Member

youngsofun commented Nov 7, 2023

@Xuanwo

So it is OK with you to add a type of object for a set of args that could be reused?

So I should store a new enum Connection (a checked map<string, string>)?

I guess this was also the original idea, @BohuTANG.

@youngsofun
Member

youngsofun commented Nov 7, 2023


if my_conn changes, the stage should use the changed CONNECTION object.

This behavior seems likely to surprise users. For instance, user A creates multiple stages with connection X, but user B accidentally alters connection X. Consequently, user A discovers that all of his stages are broken.

CONNECTION to me is more like a set of args that could be reused.

But then the user would have to recreate all the external stages/tables created with this named connection, based on my discussion earlier.
This situation (one connection to many tables/stages) is one of the motivations for introducing this feature.

@Xuanwo
Member

Xuanwo commented Nov 7, 2023

This situation (one connection to many tables/stages) is one of the motivations for introducing this feature.

I got it.

So we will persist the CONNECTION in meta and fetch its contents every time we create stage operators. And altering a CONNECTION will change all stages created from it.

@BohuTANG
Member Author

BohuTANG commented Nov 7, 2023

The original idea was to make connection an alias: if we have many external tables with the same connection, we would otherwise have to define it everywhere.

The other idea is to make connection an object: if user B uses a connection created by A, B may not be able to see the ak/sk in the connection, and the connection object can have privileges (show/use and others) applied to it.

@youngsofun
Member

This situation (one connection to many tables/stages) is one of the motivations for introducing this feature.

I got it.

So we will persist the CONNECTION in meta and fetch its contents every time we create stage operators. And altering a CONNECTION will change all stages created from it.

Yes, I think this is the default behavior of most DB objects?

@Xuanwo
Member

Xuanwo commented Nov 7, 2023

Yes, I think this is the default behavior of most DB objects?

Great, makes sense.

@youngsofun
Member

The original idea was to make connection an alias: if we have many external tables with the same connection, we would otherwise have to define it everywhere.

The other idea is to make connection an object: if user B uses a connection created by A, B may not be able to see the ak/sk in the connection, and the connection object can have privileges (show/use and others) applied to it.

Got it.

@BohuTANG
Member Author

BohuTANG commented Nov 7, 2023

A connection object will be stored in meta; when a stage/external table uses it, we will read the connection in the binding phase.
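The bind-phase lookup can be sketched like this (hypothetical names; the real meta service API is different): the stage persists only the connection name, and the options are fetched fresh at bind time, so an altered connection is picked up by every stage that references it.

```rust
use std::collections::HashMap;

// Hypothetical connection body: a checked map of connection options.
type ConnectionOptions = HashMap<String, String>;

// Stand-in for the meta service that persists connection objects.
struct MetaStore {
    connections: HashMap<String, ConnectionOptions>,
}

// A stage persists only the connection *name*, not a snapshot of its options.
struct Stage {
    url: String,
    connection_name: String,
}

impl MetaStore {
    // Bind-time resolution: fetch the current contents of the named
    // connection, so an altered connection is visible to dependent stages.
    fn resolve(&self, stage: &Stage) -> Option<&ConnectionOptions> {
        self.connections.get(&stage.connection_name)
    }
}

fn main() {
    let mut meta = MetaStore { connections: HashMap::new() };
    meta.connections.insert(
        "my_conn".to_string(),
        HashMap::from([("access_key_id".to_string(), "old-key".to_string())]),
    );
    let stage = Stage {
        url: "s3://my-bucket/data/".to_string(),
        connection_name: "my_conn".to_string(),
    };
    // Altering the connection is seen on the next bind; no stage is recreated.
    meta.connections.get_mut("my_conn").unwrap()
        .insert("access_key_id".to_string(), "new-key".to_string());
    println!("{} -> {:?}", stage.url, meta.resolve(&stage));
}
```

This matches the behavior agreed above: one connection, many stages, one place to rotate a key.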

@Xuanwo
Member

Xuanwo commented Nov 7, 2023

OpenDAL is currently developing a feature that lets users interact directly with config structs, bypassing the builder: apache/opendal#3240. I believe this will simplify your work.

@hantmac
Member

hantmac commented Nov 21, 2023

We are developing pipe in cloudcontrol. The current requirement for the integration is to enable file listing/notification using our own roles, like Snowflake's storage integration:

CREATE STORAGE INTEGRATION <integration_name>
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = '<iam_role>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket>/<path>/', 's3://<bucket>/<path>/')
  [ STORAGE_BLOCKED_LOCATIONS = ('s3://<bucket>/<path>/', 's3://<bucket>/<path>/') ]

Can Databend's Connection achieve a similar effect?

@BohuTANG
Member Author

We are developing pipe in cloudcontrol. The current requirement for the integration is to enable file listing/notification using our own roles, like Snowflake's storage integration:

CREATE STORAGE INTEGRATION <integration_name>
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = '<iam_role>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket>/<path>/', 's3://<bucket>/<path>/')
  [ STORAGE_BLOCKED_LOCATIONS = ('s3://<bucket>/<path>/', 's3://<bucket>/<path>/') ]

Can Databend's Connection achieve a similar effect?

Yes, a connection is now an object like a table; we can apply roles to it. cc @TCeason @flaneur2020

@TCeason
Collaborator

TCeason commented Nov 21, 2023

We are developing pipe in cloudcontrol. The current requirement for the integration is to enable file listing/notification using our own roles, like Snowflake's storage integration:

CREATE STORAGE INTEGRATION <integration_name>
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = '<iam_role>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket>/<path>/', 's3://<bucket>/<path>/')
  [ STORAGE_BLOCKED_LOCATIONS = ('s3://<bucket>/<path>/', 's3://<bucket>/<path>/') ]

Can Databend's Connection achieve a similar effect?

Yes, a connection is now an object like a table; we can apply roles to it. cc @TCeason @flaneur2020

Yes. It's already in the plan.
