Feature: support deduplication on stage attachment api #11710

Closed
ZhiHanZ opened this issue Jun 9, 2023 · 4 comments · Fixed by #11787

Comments

@ZhiHanZ
Collaborator

ZhiHanZ commented Jun 9, 2023

Summary
To make data ingestion idempotent, Databend already supports deduplicating DML statements through a deduplication label:
https://databend.rs/doc/sql-commands/setting-cmds/set-var

For cross-language driver integration, we could add a REST API field that carries the label.

ZhiHanZ added the C-feature and good first issue labels on Jun 9, 2023
@akoshchiy
Contributor

@ZhiHanZ Hi! Can I try to fix this? As I understand it, we should extend HttpQueryRequest with a new field and then pass it to the QueryContext settings when the HTTP query is created.
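For readers following along, the approach proposed here could look roughly like the sketch below. Everything in it is a stand-in: `HttpQueryRequest` and `QuerySettings` are trimmed-down hypothetical types rather than Databend's real definitions, and the field name and the `deduplicate_label` setting key are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down stand-in for the HTTP query request body.
pub struct HttpQueryRequest {
    pub sql: String,
    /// Proposed new optional field carrying the deduplication label.
    pub deduplicate_label: Option<String>,
}

/// Hypothetical stand-in for the per-query settings held by the query context.
#[derive(Default)]
pub struct QuerySettings {
    values: HashMap<String, String>,
}

impl QuerySettings {
    pub fn set(&mut self, key: &str, value: &str) {
        self.values.insert(key.to_string(), value.to_string());
    }

    pub fn get(&self, key: &str) -> Option<&str> {
        self.values.get(key).map(String::as_str)
    }
}

/// Builds the settings for a new query, copying the label over when present.
/// The setting key "deduplicate_label" is an assumption of this sketch.
pub fn settings_for(request: &HttpQueryRequest) -> QuerySettings {
    let mut settings = QuerySettings::default();
    if let Some(label) = &request.deduplicate_label {
        settings.set("deduplicate_label", label);
    }
    settings
}
```

Making the field optional keeps requests from existing clients valid, since a missing label simply leaves the setting unset.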

@ZhiHanZ
Collaborator Author

ZhiHanZ commented Jun 15, 2023

That is perfect. I don't think we need to add an additional field to the HTTP request; we could reuse the query ID header

const HEADER_QUERY_ID: &str = "X-DATABEND-QUERY-ID";

for deduplication, as mentioned in the previous issue #11591.
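As a rough illustration of the header-based idea, the sketch below pulls the query ID out of a set of request headers. The helper name `query_id_from_headers` and the use of a plain `HashMap` for headers are assumptions of this example, not Databend's actual handler code.

```rust
use std::collections::HashMap;

/// Header name quoted in the comment above.
const HEADER_QUERY_ID: &str = "X-DATABEND-QUERY-ID";

/// Hypothetical helper: returns the client-supplied query ID, if any.
/// HTTP header names are case-insensitive, so the lookup ignores ASCII case.
/// Surrounding whitespace in the value is trimmed, and an empty value
/// is treated as "no ID provided".
pub fn query_id_from_headers(headers: &HashMap<String, String>) -> Option<String> {
    headers
        .iter()
        .find(|(name, _)| name.eq_ignore_ascii_case(HEADER_QUERY_ID))
        .map(|(_, value)| value.trim().to_string())
        .filter(|value| !value.is_empty())
}
```

Trimming matters for requests that put extra spaces after the colon in the header line.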

Expected Behavior:

CREATE TABLE sample
(
    Id      INT,
    City    VARCHAR,
    Score   INT
);

sample.csv

1,'Los Angeles',100
2,'Irvine',80
3,'San Diego',60
4,'Palo alto',70
5,'San Jose',55
6,'Milipitas',99

curl -s -u root: -XPOST "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/query" --header 'Content-Type: application/json'  --header 'X-DATABEND-QUERY-ID:  insert1' -d '{"sql": "insert into sample (Id, City, Score) values (?,?,?)", "stage_attachment": {"location": "@s1/sample.csv", "copy_options": {"purge": "true"}}}' | jq -r '.stats.scan_progress.bytes, .error'

1,'Los Angeles',100
2,'Irvine',80
3,'San Diego',60
4,'Palo alto',70
5,'San Jose',55
6,'Milipitas',99

curl -s -u root: -XPOST "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/query" --header 'Content-Type: application/json'  --header 'X-DATABEND-QUERY-ID:  insert1' -d '{"sql": "insert into sample (Id, City, Score) values (?,?,?)", "stage_attachment": {"location": "@s1/sample.csv", "copy_options": {"purge": "true"}}}' | jq -r '.stats.scan_progress.bytes, .error'

No more rows are inserted, because the second request is deduplicated based on the query ID insert1:
1,'Los Angeles',100
2,'Irvine',80
3,'San Diego',60
4,'Palo alto',70
5,'San Jose',55
6,'Milipitas',99
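The expected behavior above can be modeled with a small in-memory sketch: the first request with a given query ID loads the rows, and a repeat with the same ID loads nothing. `DedupRegistry` is a hypothetical stand-in for whatever record of used IDs the server keeps; how that record is stored and expired is not covered by this issue.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for the server-side record of query IDs
/// that have already loaded data.
#[derive(Default)]
pub struct DedupRegistry {
    seen: HashSet<String>,
}

impl DedupRegistry {
    /// Appends `rows` to `table` unless `query_id` was already used.
    /// Returns the number of rows actually inserted.
    pub fn insert_once(
        &mut self,
        query_id: &str,
        rows: &[&str],
        table: &mut Vec<String>,
    ) -> usize {
        // `HashSet::insert` returns false when the value was already present.
        if !self.seen.insert(query_id.to_string()) {
            return 0;
        }
        table.extend(rows.iter().map(|row| row.to_string()));
        rows.len()
    }
}
```

Calling `insert_once` twice with the ID `insert1` leaves the table unchanged after the second call, while a different ID loads the rows again.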

ZhiHanZ changed the title from "Feature: support to bring deduplication label on stage attachment api" to "Feature: support deduplication on stage attachment api" on Jun 15, 2023
@akoshchiy
Contributor

Does this mean that we should also use the provided X-DATABEND-QUERY-ID as the query_id instead of generating one?

@ZhiHanZ
Collaborator Author

ZhiHanZ commented Jun 16, 2023

> Does this mean that we should also use the provided X-DATABEND-QUERY-ID as the query_id instead of generating one?

Exactly.
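A minimal sketch of that rule: prefer the ID supplied by the client and fall back to a generated one. The function names are hypothetical, and the counter-based generator is only a placeholder that keeps the example free of external crates; a real server would more likely generate a UUID.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counter backing the placeholder ID generator.
static NEXT_ID: AtomicU64 = AtomicU64::new(1);

/// Hypothetical fallback generator (placeholder for something like a UUID).
fn generate_query_id() -> String {
    format!("generated-{}", NEXT_ID.fetch_add(1, Ordering::Relaxed))
}

/// Uses the client-provided ID when it is present and non-empty,
/// otherwise generates a fresh one.
pub fn resolve_query_id(provided: Option<&str>) -> String {
    match provided.map(str::trim) {
        Some(id) if !id.is_empty() => id.to_string(),
        _ => generate_query_id(),
    }
}
```

With this shape, clients that never send the header see no change in behavior, while clients that do send it get a stable ID that can double as the deduplication key.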
