
Make it possible to interact with external data sources #13

Closed · nils-braun opened this issue Aug 27, 2020 · 6 comments

@nils-braun (Collaborator)

Currently, all dataframes need to be registered before they can be used in dask-sql.
However, it would also be interesting to read dataframes directly from S3 (or any other storage, such as HDFS), from the Hive metastore, or to create temporary views. In the background, one could create dask dataframes and use the normal registration process (see the sketch below).

For this, we first need to come up with a good SQL syntax (one that Calcite supports) and/or a Python API.
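
For reference, a minimal sketch of the registration flow mentioned above, assuming the dask_sql.Context API of newer dask-sql versions; the table name and the S3 path are made up for illustration:

import dask.dataframe as dd
from dask_sql import Context

c = Context()

# Today, a dask dataframe has to be created and registered explicitly ...
df = dd.read_csv("s3://some-bucket/data-*.csv")  # hypothetical bucket/path
c.create_table("my_table", df)

# ... before it can be queried with SQL (the result is again a dask dataframe).
result = c.sql("SELECT COUNT(*) FROM my_table")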

@nils-braun (Collaborator, Author)

It is now (after #55) possible to create new tables by reading in data, e.g. with:

CREATE TABLE
        "nyc"
    WITH (
        format = 'csv',
        location = 'https://support.staffbase.com/hc/en-us/article_attachments/360009197031/username.csv',
        sep = ';'
    )

However, it is not yet clear whether this covers all needs (e.g. S3). Additionally, an integration with Hive would be interesting.
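
For reference, the statement above can also be issued from Python through the Context (a minimal sketch, assuming the dask_sql.Context API):

from dask_sql import Context

c = Context()

# Run the CREATE TABLE statement; dask-sql hands the location and the
# extra parameters (here: sep) to the matching dask read_* function.
c.sql("""
    CREATE TABLE "nyc" WITH (
        format = 'csv',
        location = 'https://support.staffbase.com/hc/en-us/article_attachments/360009197031/username.csv',
        sep = ';'
    )
""")

df = c.sql('SELECT * FROM "nyc"')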

@raybellwaves

BlazingDB/blazingsql#937 may also be relevant

@nils-braun (Collaborator, Author)

Just a small update: interaction with Hive is implemented as an experimental feature in 0.2.0 (#63).
And as fsspec/adlfs#111 has been implemented, it should also be possible to use az:// as input (although untested).
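
A minimal sketch of what the experimental Hive interaction looks like, assuming a pyhive connection; host, port, and table names are placeholders:

from dask_sql import Context
from pyhive.hive import connect

# Connect to a running Hive server (placeholder host/port).
cursor = connect("localhost", 10000).cursor()

c = Context()
# dask-sql reads the table metadata and storage location from the Hive
# metastore via the cursor and creates a dask dataframe behind the scenes.
c.create_table("my_table", cursor, hive_table_name="the_table_in_hive")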

@raybellwaves

> it should also be possible to use az:// as input (although untested)

It'll be worth testing with a public dataset:

import dask.dataframe as dd

storage_options = {'account_name': 'azureopendatastorage'}
ddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet',
                      storage_options=storage_options)
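
If that read succeeds, the resulting dataframe should register with dask-sql like any other table (an untested sketch, same Context assumptions as above):

from dask_sql import Context

c = Context()
c.create_table("nyctlc", ddf)  # ddf from the read_parquet call above
print(c.sql("SELECT COUNT(*) FROM nyctlc").compute())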

@raybellwaves

see #84

@nils-braun (Collaborator, Author)

Closing this issue now, as we already have integrations included; for specific new types of integration, it is better to create a new issue (e.g. #83).
