COPY-3586: Support COPY from external location(S3) [PATCH-1] #4170

Merged
merged 41 commits into datafuselabs:main on Feb 24, 2022

Conversation

BohuTANG
Member

@BohuTANG BohuTANG commented Feb 16, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

This PR:

  1. Parses STAGE/COPY statements into a plan
  2. Reads files from an external stage
  3. Writes the data to a table
  4. Adds a source (CSV/Parquet) builder for ease of use

Removed:

  1. streams/source/source_values, which is no longer used
  2. stage (will be backported under the new design in another PR)

Syntax:

COPY INTO [<database>.]<table_name>
     FROM { externalLocation }
[ FILE_FORMAT = ( TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] ) ]
[ VALIDATION_MODE = RETURN_<n>_ROWS | RETURN_ERRORS ]
[ copyOptions ]

Where

externalLocation (for Amazon S3) ::=
  's3://<bucket>[/<path>]'
  [ CREDENTIALS = ( AWS_KEY_ID = '<string>' AWS_SECRET_KEY = '<string>' ) ]
  [ ENCRYPTION = ( [ MASTER_KEY = '<string>' ] ) ]
formatTypeOptions ::=
-- If FILE_FORMAT = ( TYPE = CSV ... )
     RECORD_DELIMITER = '<character>' 
     FIELD_DELIMITER = '<character>' 
     SKIP_HEADER = <integer>
copyOptions ::=
     ON_ERROR = { CONTINUE | SKIP_FILE | SKIP_FILE_<num> | ABORT_STATEMENT }
     SIZE_LIMIT = <num>

For example:

Create table: ontime/create_table.sql
copy into ontime from 's3://repo.databend.rs/dataset/stateful/ontime.csv' FILE_FORMAT = (type = 'CSV' field_delimiter = ','  record_delimiter = '\n' skip_header = 1)
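
As an illustration of the grammar above, here is a minimal Rust sketch of how the `s3://<bucket>[/<path>]` location and the `ON_ERROR` copy option might be modeled. The names `parse_s3_location`, `OnError`, and `parse_on_error` are hypothetical and are not taken from this PR's code; the real parser may be structured quite differently.

```rust
/// Split an `s3://<bucket>[/<path>]` location into (bucket, path).
/// Hypothetical helper for illustration; not taken from the PR's code.
fn parse_s3_location(location: &str) -> Option<(&str, &str)> {
    let rest = location.strip_prefix("s3://")?;
    // The bucket is everything up to the first '/'; the path is the remainder.
    let (bucket, path) = match rest.split_once('/') {
        Some((bucket, path)) => (bucket, path),
        None => (rest, ""),
    };
    if bucket.is_empty() {
        return None;
    }
    Some((bucket, path))
}

/// Hypothetical model of the ON_ERROR copy option.
#[derive(Debug, PartialEq)]
enum OnError {
    Continue,
    SkipFile,
    /// SKIP_FILE_<num>, carrying the parsed <num>.
    SkipFileNum(u64),
    AbortStatement,
}

/// Parse an ON_ERROR value case-insensitively; returns None for unknown input.
fn parse_on_error(value: &str) -> Option<OnError> {
    let upper = value.trim().to_ascii_uppercase();
    match upper.as_str() {
        "CONTINUE" => Some(OnError::Continue),
        "SKIP_FILE" => Some(OnError::SkipFile),
        "ABORT_STATEMENT" => Some(OnError::AbortStatement),
        // Anything else must be SKIP_FILE_ followed by a number.
        other => other
            .strip_prefix("SKIP_FILE_")
            .and_then(|n| n.parse::<u64>().ok())
            .map(OnError::SkipFileNum),
    }
}
```

With these helpers, the location in the example statement above would split into the bucket `repo.databend.rs` and the path `dataset/stateful/ontime.csv`.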

Changelog

  • New Feature

Related Issues

Fixes #3586

Test Plan

Unit Tests

Stateless Tests

@vercel

vercel bot commented Feb 16, 2022

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/databend/databend/4uPkvAjbfdYGD3qzxeUgc9aikWBJ
✅ Preview: https://databend-git-fork-bohutang-dev-copy-external-3586-databend.vercel.app

[Deployment for be63b03 canceled]

@mergify
Contributor

mergify bot commented Feb 16, 2022

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label Feb 16, 2022
Comment on lines 113 to 116
let table = self
    .ctx
    .get_table(&self.plan.db_name, &self.plan.tbl_name)
    .await?;

Maybe we can verify db_name and tbl_name before accessing the file in S3? If a wrong db_name/tbl_name is given, we can fail early.
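
A minimal sketch of the ordering suggested in this comment: check that the target table exists first, and touch remote storage only afterwards. `Catalog`, `has_table`, and `copy_into` are hypothetical stand-ins, not Databend's actual context or interpreter API, and the remote read is passed in as a closure so the sketch stays self-contained.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a catalog lookup; Databend's real context API differs.
struct Catalog {
    tables: HashSet<(String, String)>,
}

impl Catalog {
    fn has_table(&self, db: &str, tbl: &str) -> bool {
        self.tables.contains(&(db.to_string(), tbl.to_string()))
    }
}

/// Validate the target table first; call `read_remote` only if it exists.
/// Returns the number of rows read on success.
fn copy_into<F>(catalog: &Catalog, db: &str, tbl: &str, read_remote: F) -> Result<usize, String>
where
    F: FnOnce() -> Result<Vec<String>, String>,
{
    // Fail early: no remote access happens for an unknown table.
    if !catalog.has_table(db, tbl) {
        return Err(format!("unknown table {}.{}", db, tbl));
    }
    let rows = read_remote()?;
    Ok(rows.len())
}
```

Because the existence check returns before `read_remote` is invoked, a mistyped database or table name costs no S3 request in this arrangement.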

By the way, I'm new to this project (Databend), so if I post some naive comments, I don't mean to offend anyone. Feel free to correct me.

Member Author

@BohuTANG BohuTANG Feb 23, 2022

This PR is still in Draft status. The database and table privilege checks aren't finished yet; I will add them soon.
Thank you.

Do you have a near-term deadline for this feature? If not, I'd like to write some code for it.

Member Author

I will finish the COPY INTO TABLE FROM S3 external location feature this week in this PR :)
In the meantime, I would recommend some of the issues here:
https://github.com/datafuselabs/databend/issues?q=is%3Aissue+is%3Aopen+label%3A%22C-good+first+issue%22

@BohuTANG
Member Author

This PR should merge with #4203, but we can review it now.

@BohuTANG BohuTANG marked this pull request as ready for review February 23, 2022 11:36
@Xuanwo
Member

Xuanwo commented Feb 24, 2022

Ping @BohuTANG, please give me write permission to your fork so I can resolve the conflicts. 😆

@BohuTANG BohuTANG changed the title COPY-3586: Support COPY from external location(S3) COPY-3586: Support COPY from external location(S3) [PATCH-1] Feb 24, 2022
@BohuTANG
Member Author

BohuTANG commented Feb 24, 2022

Hello,

This feature is not available at the moment (it only works with MinIO); it needs to wait for the following OpenDAL work to finish:
apache/opendal#54
apache/opendal#56

Since there are too many files involved, I will merge this first as PATCH-1.
I will make it work with AWS S3 in another PR, PATCH-2, after OpenDAL#54 and OpenDAL#56 are finished.

@BohuTANG BohuTANG merged commit 20648c3 into datafuselabs:main Feb 24, 2022
]);

let local = Operator::new(
fs::Backend::build()
Member

Self-reminder: Implement apache/opendal#30 so that users can test more easily.

StageType::External => {
    match plan.stage_info.file_format_options.format {
        // CSV.
        StageFileFormatType::Csv => {
Member

It seems we support parquet too?

Member Author

Yes, I will add Parquet support once CSV works well :D
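
To illustrate the exchange above, here is a rough sketch of a format dispatch with a Parquet arm next to the CSV one and an explicit error for everything else. The enum below is a simplified stand-in that only borrows the name seen in the diff excerpt, and `source_kind` is a hypothetical function; neither reflects the actual Databend definitions.

```rust
/// Simplified stand-in for the format enum matched in the diff excerpt.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StageFileFormatType {
    Csv,
    Parquet,
    Json,
}

/// Pick a source kind for a format, or report the format as unsupported.
/// Hypothetical function; a real implementation would build a reader here.
fn source_kind(format: StageFileFormatType) -> Result<&'static str, String> {
    match format {
        StageFileFormatType::Csv => Ok("csv"),
        StageFileFormatType::Parquet => Ok("parquet"),
        // Unsupported formats return an error instead of silently doing nothing.
        other => Err(format!("unsupported file format: {:?}", other)),
    }
}
```

Returning an error from the catch-all arm means that formats accepted by the grammar but not yet implemented are rejected visibly.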

Labels
need-review pr-feature this PR introduces a new feature to the codebase
Development

Successfully merging this pull request may close these issues.

Copy INTO <table> from external location
5 participants