allow executing COPY INTO in a cluster #6395

flaneur2020 · 2022-07-01T13:31:16Z

Summary

I'm running a COPY INTO in my cluster with 8 replicas, but it seems to use only one replica to execute the statement.

It could be up to 8x faster if the COPY INTO statement could use the other instances in the cluster.

sundy-li · 2022-07-01T13:40:40Z

Is it a single file?

PsiACE · 2022-07-01T13:51:15Z

> Is it a single file?

One hundred files, obtained by splitting the ontime dataset.

sundy-li · 2022-07-01T13:53:48Z

OK, it seems parallel copy only works in single-node mode, that is, when no cluster is configured.

https://github.com/datafuselabs/databend/blob/f152cbe7edc96fe5850982ef40d9c04c57ecc94e/query/src/interpreters/interpreter_copy.rs

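        // The new-pipeline (parallel) path is taken only when the cluster is empty,
        // i.e. when the query runs on a single node.
        if ctx.get_settings().get_enable_new_processor_framework()? != 0
            && self.ctx.get_cluster().is_empty()
        {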
            table.append2(ctx.clone(), &mut pipeline)?;
            pipeline.set_max_threads(settings.get_max_threads()? as usize);

            let async_runtime = ctx.get_storage_runtime();
            let query_need_abort = ctx.query_need_abort();
            let executor =
                PipelineCompleteExecutor::try_create(async_runtime, query_need_abort, pipeline)?;

            executor.execute()?;
            return Ok(ctx.consume_precommit_blocks());
        }

zhang2014 · 2022-07-01T14:41:09Z

Distributed COPY INTO needs #6253 (exchanging precommit blocks between cluster nodes).
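
To illustrate why the precommit blocks have to be exchanged: each node only knows about the data it wrote itself, so one node has to collect every node's blocks before a single commit can cover the whole statement. The sketch below is only a toy model of that gathering step, assuming threads stand in for cluster nodes and a channel stands in for the network exchange. `PrecommitBlock` and `gather_precommit_blocks` are hypothetical names, not Databend types or APIs.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the metadata a node produces after writing its data.
#[derive(Debug, Clone, PartialEq)]
struct PrecommitBlock {
    node: usize,
    rows: u64,
}

// Each worker thread plays the role of a cluster node: it "loads" its share and
// sends its precommit block to the coordinator, which gathers all of them.
fn gather_precommit_blocks(rows_per_node: &[u64]) -> Vec<PrecommitBlock> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = rows_per_node
        .iter()
        .copied()
        .enumerate()
        .map(|(node, rows)| {
            let tx = tx.clone();
            thread::spawn(move || {
                tx.send(PrecommitBlock { node, rows })
                    .expect("coordinator hung up");
            })
        })
        .collect();
    // Drop the original sender so the receiver ends once all workers finish.
    drop(tx);
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    let mut blocks: Vec<PrecommitBlock> = rx.iter().collect();
    blocks.sort_by_key(|b| b.node);
    blocks
}

fn main() {
    let blocks = gather_precommit_blocks(&[10, 20, 30]);
    let total: u64 = blocks.iter().map(|b| b.rows).sum();
    println!("coordinator commits {} blocks, {} rows", blocks.len(), total);
}
```

In this model the coordinator is the only place where a commit would happen, which is the property the exchange is meant to provide.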

sundy-li · 2022-09-05T23:52:20Z

@RinChanNOWWW you can take this issue; it's ready to be worked on now.

BohuTANG · 2022-09-06T00:30:28Z

I think we can treat the <internal/external-stage, remote location> as a special storage engine. Then we can get the file list as a table source and distribute the files across the cluster. This would also be the foundation for #7228 and #7211.

I would ping @dantengsky, who is working on a similar storage engine (pre-sign). If some code needs refactoring, please let us know :)
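
A rough sketch of the "file list as table source" idea: once the stage's files are known up front, they can be partitioned across the nodes before execution. The round-robin policy, the `assign_files` helper, and the file names below are assumptions for illustration only, not what Databend does.

```rust
/// Hypothetical helper: split a stage's file list across `nodes` round-robin.
fn assign_files(files: &[String], nodes: usize) -> Vec<Vec<String>> {
    assert!(nodes > 0, "need at least one node");
    let mut plan: Vec<Vec<String>> = vec![Vec::new(); nodes];
    for (i, file) in files.iter().enumerate() {
        // File i goes to node i mod nodes, so shares differ by at most one file.
        plan[i % nodes].push(file.clone());
    }
    plan
}

fn main() {
    // Made-up file names mirroring the "one hundred files" case from this thread.
    let files: Vec<String> = (0..100)
        .map(|i| format!("ontime_{:03}.csv", i))
        .collect();
    let plan = assign_files(&files, 8);
    for (node, part) in plan.iter().enumerate() {
        println!("node {}: {} files", node, part.len());
    }
}
```

With 100 files and 8 nodes, the first four nodes receive 13 files and the remaining four receive 12. A real planner would probably balance by file size rather than file count.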

RinChanNOWWW · 2022-09-06T08:10:08Z

> I think we can make the <internal/external-stage, remote location> as a special storage engine, then we can get the file list as table source

Then we can convert "COPY INTO t FROM @stage" into "INSERT INTO t SELECT * FROM @stage" and achieve distributed COPY INTO via #7501.

BohuTANG · 2022-09-06T08:40:14Z

@RinChanNOWWW

Please take a look at #7502.
We are going to make the catalog meet these requirements; the work is in progress by @dantengsky.
If you are interested, you can ping and talk with @dantengsky :)

PsiACE · 2023-07-17T08:25:07Z

#11943

sundy-li self-assigned this Jul 1, 2022

sundy-li added the A-query Area: databend query label Jul 1, 2022

zhang2014 assigned zhang2014 and unassigned sundy-li Jul 1, 2022

sundy-li mentioned this issue Jul 1, 2022

chore(query): enable new pipeline for copy in cluster mode #6396

Merged

This was referenced Sep 2, 2022

Release proposal: Nightly v0.9 #7052

Closed

Tracking: Load very large dataset into databend #7444

Closed

sundy-li assigned RinChanNOWWW Sep 5, 2022

BohuTANG mentioned this issue Sep 6, 2022

Support COPY INTO ... FROM a Kafka topic #7499

Closed

BohuTANG mentioned this issue Sep 6, 2022

Tracking: catalog for stage/location special data source #7502

Open

6 tasks

BohuTANG mentioned this issue Sep 29, 2022

feat: COPY INTO returns more status #7730

Closed

This was referenced Oct 11, 2022

tracing file Formats #7732

Open

combine streaming and distributed in copy into #8128

Closed

Xuanwo mentioned this issue Nov 2, 2022

Feature: Distributed COPY INTO #8594

Closed

BohuTANG mentioned this issue Nov 2, 2022

Tracking: Large dataset insert and read #7823

Closed

50 tasks

RinChanNOWWW mentioned this issue Jun 15, 2023

Feature: distributed execution for copy into .... #11752

Closed

JackTan25 self-assigned this Jun 16, 2023

JackTan25 mentioned this issue Jun 22, 2023

feat: Distributed copy into V1 #11840

Merged

JackTan25 mentioned this issue Jul 3, 2023

feat: distributed copy into table from query #11943

Merged

PsiACE closed this as completed Jul 17, 2023
