rfc: Idempotent Copy #7541

Merged
merged 9 commits into from
Sep 10, 2022

Conversation

lichuang
Contributor

@lichuang lichuang commented Sep 9, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

docs: add Idempotent-Copy rfc

Fixes #issue

@vercel

vercel bot commented Sep 9, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
databend ✅ Ready (Inspect) Visit Preview Sep 10, 2022 at 1:19AM (UTC)

@mergify mergify bot added the pr-doc this PR needs/changes the documents or websites label Sep 9, 2022
Member

@drmingdrmer drmingdrmer left a comment

Crystal clear explanation.

One question I have is about this sentence:
copy the stage file and up-insert into the table stage file meta
These are two RPC operations to databend-meta, right?

BTW, the words in this font are not very easy to read, I'm afraid...

image

@BohuTANG
Member

BohuTANG commented Sep 9, 2022

Clear 👍
One question:
does the copy file meta in the meta service include the SQL digest?

COPY into t1 select c1 from @stage;

is not the same as:

COPY into t1 select parse_json(c2) from @stage;

@BohuTANG
Member

BohuTANG commented Sep 9, 2022

Another question:
If we truncate the table and COPY from a stage/location again, all the files will be skipped.

@Xuanwo Xuanwo changed the title docs: add avoid-duplicate-when-copy-into-table rfc rfc: add avoid-duplicate-when-copy-into-table rfc Sep 9, 2022
@mergify mergify bot added the pr-rfc label Sep 9, 2022
@lichuang
Contributor Author

lichuang commented Sep 9, 2022

Crystal clear explanation.

One question I have is about this sentence:
copy the stage file and up-insert into the table stage file meta
These are two RPC operations to databend-meta, right?

BTW, the words in this font are not very easy to read, I'm afraid...

image

"Copy the stage file" means copying the stage file into the table. It is not an operation on the meta service; like inserting data into a table, it happens on the object store (such as S3) in Databend.

"Up-insert into the table stage file meta" means upserting the stage file meta into the meta service, so that the next COPY can compare this metadata with the current stage file meta and skip the stage files that have not been modified since they were last copied.
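
The two steps described in this reply can be sketched roughly as follows. This is only an illustration under stated assumptions, not Databend's implementation: `FileMeta`, `idempotent_copy`, `copied_meta`, and `load_file` are hypothetical names, a plain dict stands in for the meta service, and the fingerprint fields are borrowed from the key proposal later in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical fingerprint of a stage file."""
    content_length: int
    etag: str
    last_modified: str


def idempotent_copy(stage_files, copied_meta, load_file):
    """Copy only the stage files whose meta differs from the recorded meta.

    stage_files: dict of file name -> FileMeta currently in the stage
    copied_meta: dict standing in for the meta service (file name -> FileMeta)
    load_file:   callable that writes one file's data into the table,
                 i.e. the object-store side of the copy
    """
    copied = []
    for name, meta in stage_files.items():
        if copied_meta.get(name) == meta:
            # Unchanged since the last copy: skip it.
            continue
        load_file(name)            # step 1: copy the data into the table
        copied_meta[name] = meta   # step 2: upsert the file meta
        copied.append(name)
    return copied
```

Running the function twice over the same stage copies every file the first time and nothing the second time; modifying a file changes its meta, so only that file is copied again.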

@lichuang
Contributor Author

lichuang commented Sep 9, 2022

Another question:
If we truncate the table and COPY from a stage/location again, all the files will be skipped.

emm, yes, this situation has not been considered before...
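
The truncate scenario can be reproduced with a small sketch. It assumes, hypothetically, that the copied-file meta is kept apart from the table data and that truncating the table does not clear it; `Table`, `copy_into`, and `copied_meta` are illustrative names, not Databend APIs.

```python
class Table:
    """Toy table whose copied-file meta survives a truncate (assumption)."""

    def __init__(self):
        self.rows = []
        # Stand-in for the meta service: file name -> recorded file meta.
        self.copied_meta = {}

    def copy_into(self, stage_files):
        """stage_files: dict of file name -> (meta, rows). Returns copied names."""
        copied = []
        for name, (meta, rows) in stage_files.items():
            if self.copied_meta.get(name) == meta:
                continue  # treated as already copied
            self.rows.extend(rows)
            self.copied_meta[name] = meta
            copied.append(name)
        return copied

    def truncate(self):
        # Only the data is removed; the copied-file meta is left untouched.
        self.rows.clear()
```

After `copy_into`, `truncate`, and a second `copy_into` over the same stage, the second copy skips every file and the table stays empty, which is the problem described above.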

@lichuang
Contributor Author

lichuang commented Sep 9, 2022

Clear 👍 One question: does the copy file meta in the meta service include the SQL digest?

COPY into t1 select c1 from @stage;

is not the same as:

COPY into t1 select parse_json(c2) from @stage;

No. The key of the copied stage file meta is the combination (tenant, database, table, file_name), whose components have already been resolved by the SQL parser.
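
A minimal sketch of what this answer implies, assuming the key is exactly the tuple named above. `copied_file_key` and the tenant, database, and file names are hypothetical values chosen for illustration.

```python
def copied_file_key(tenant, database, table, file_name):
    """Key of a copied stage file; the SELECT list of the COPY is not part of it."""
    return (tenant, database, table, file_name)


# Both statements from the question target table t1 and read the same stage
# file, so under this assumption they resolve to the same key.
key_c1 = copied_file_key("tenant1", "default", "t1", "data.json")    # COPY into t1 select c1 from @stage
key_json = copied_file_key("tenant1", "default", "t1", "data.json")  # COPY into t1 select parse_json(c2) from @stage
same_key = key_c1 == key_json
```

With such a key, the second statement would find the file already recorded and skip it, even though it projects a different column.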

@Xuanwo
Member

Xuanwo commented Sep 9, 2022

StageFile is an internal data type we don't want to persist.

Can we just construct the key in a specific way instead of storing StageFile in the meta?

For example:

{prefix}/{tenant}/{db_name}/{table_name}/{file_name}/{content_length}/{etag}/{last_modified}

We just need to check whether this key exists or not without reading its content.

@lichuang lichuang changed the title rfc: add avoid-duplicate-when-copy-into-table rfc rfc: add Idempotent-Copy rfc Sep 9, 2022
@Xuanwo Xuanwo changed the title rfc: add Idempotent-Copy rfc rfc: Idempotent Copy Sep 9, 2022
@lichuang
Contributor Author

lichuang commented Sep 9, 2022

StageFile is an internal data type we don't want to persist.

Can we just construct the key in a specific way instead of storing StageFile in the meta?

For example:

{prefix}/{tenant}/{db_name}/{table_name}/{file_name}/{content_length}/{etag}/{last_modified}

We just need to check whether this key exists or not without reading its content.

Since the meta service currently has no mechanism to recycle expired data, using {prefix}/{tenant}/{db_name}/{table_name}/{file_name}/{content_length}/{etag}/{last_modified} as the key would create more than one key per file whenever the file changes. I prefer {prefix}/{tenant}/{db_name}/{table_name}/{file_name} -> {content_length}/{etag}/{last_modified} instead.
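
The trade-off between the two layouts can be sketched with a plain dict standing in for the meta service. The function names, the `copied` prefix, and the sample values are assumptions made for illustration only.

```python
def record_in_key(store, prefix, tenant, db, table, file_name,
                  content_length, etag, last_modified):
    """Layout 1: the whole fingerprint is part of the key; the value is empty."""
    key = (f"{prefix}/{tenant}/{db}/{table}/{file_name}/"
           f"{content_length}/{etag}/{last_modified}")
    store[key] = ""
    return key


def record_in_value(store, prefix, tenant, db, table, file_name,
                    content_length, etag, last_modified):
    """Layout 2: the key stops at the file name; the fingerprint is the value."""
    key = f"{prefix}/{tenant}/{db}/{table}/{file_name}"
    store[key] = f"{content_length}/{etag}/{last_modified}"
    return key


# The same file is copied, modified in the stage, and copied again.
versions = [(100, "etag-v1", "2022-09-09"), (120, "etag-v2", "2022-09-10")]
by_key, by_value = {}, {}
for content_length, etag, last_modified in versions:
    record_in_key(by_key, "copied", "tenant1", "db1", "t1", "a.csv",
                  content_length, etag, last_modified)
    record_in_value(by_value, "copied", "tenant1", "db1", "t1", "a.csv",
                    content_length, etag, last_modified)
```

Layout 1 only needs an existence check but leaves one key behind per file version, while layout 2 overwrites a single key per file and has to read the value to compare it.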

@Xuanwo
Member

Xuanwo commented Sep 9, 2022

In an edge case, what will happen if users copy from different stages:

copy from @gcs_stage into mytable;
copy from @aws_stage into mytable;

@lichuang
Contributor Author

lichuang commented Sep 9, 2022

In an edge case, what will happen if users copy from different stages:

copy from @gcs_stage into mytable;
copy from @aws_stage into mytable;

It depends on list_files in interpreter_copy_v2.rs: list_files generates the names of all the stage files for the COPY.
If two stages have the same file name, the file will not be copied again if the meta matches.
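
A rough sketch of this behaviour, assuming the key contains the table and the file name but not the stage. `copy_from_stage` and the sample file meta are hypothetical, and real object stores may well report different etags for identical content.

```python
def copy_from_stage(copied_meta, table_key, stage_files):
    """Copy files from one stage into the table identified by table_key.

    copied_meta: dict standing in for the meta service
    table_key:   (tenant, database, table); note that the stage is not part of it
    stage_files: dict of file name -> meta as listed from the stage
    """
    copied = []
    for file_name, meta in stage_files.items():
        key = table_key + (file_name,)
        if copied_meta.get(key) == meta:
            continue  # same name and same meta: skipped, whatever the stage
        copied_meta[key] = meta
        copied.append(file_name)
    return copied


table_key = ("tenant1", "db1", "mytable")
copied_meta = {}
gcs_stage = {"data.csv": (100, "etag-a", "2022-09-09")}
aws_same = {"data.csv": (100, "etag-a", "2022-09-09")}
aws_other = {"data.csv": (200, "etag-b", "2022-09-10")}

first = copy_from_stage(copied_meta, table_key, gcs_stage)
second = copy_from_stage(copied_meta, table_key, aws_same)
third = copy_from_stage(copied_meta, table_key, aws_other)
```

Here `second` is empty because the name and meta match the earlier copy from the other stage, while `third` copies the file again and overwrites the recorded meta.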

@Xuanwo
Member

Xuanwo commented Sep 9, 2022

Vercel build failed:

image

Signed-off-by: Xuanwo <github@xuanwo.io>
@mergify mergify bot merged commit 0c878d2 into datafuselabs:main Sep 10, 2022
@lichuang lichuang deleted the table_stage_file_duplicate_rfc branch September 20, 2022 02:34