
importccl: create tables before ingesting data #37338

Merged
merged 3 commits into from May 8, 2019

@dt (Member) commented May 6, 2019

importccl: create tables before ingesting in IMPORT

This changes the order in which IMPORT creates tables and ingests data.

Previously, IMPORT computed the tables it would create, but kept them only in the job record during ingestion.
Only after ingestion completed successfully, in the successful completion hook of the IMPORT job, did it then write those table descriptors and create namespace mappings.

This change switches IMPORT to create the tables in a non-public INGESTING state first, then run the ingestion job, and then simply move them from ingesting to public when it finishes.

This has some advantages and disadvantages. Creating the descriptors early has some nice side-effects: it avoids a potential race where another table could be created, imported, or restored with a conflicting name during ingestion. It also means that things that try to map KVs to tables can actually see the table, so special cases like enforced splits between tables should work as expected during ingestion, fixing the issue where an importing table could be merged into another table after the ingestion job (which suspends merges) finished but before the descriptors were written. Additionally, the admin UI or other places that surface tables can show them if they so choose, which could, e.g., help explain where disk space is being used by not-yet-public tables.

Unfortunately it also introduces some disadvantages, most notably that writing the descriptors adds more ways by which traffic might end up in the ingesting range, and thus more code paths that need to be audited and must include the proper checks for non-public states. Additionally, writing any incomplete state before the job is finished adds more that needs to be properly cleaned up on failure.

Overall though, importing into an existing table will need to solve the same downsides in any case, so making imports of new tables behave the same as imports into existing tables -- by importing into a just-created existing table -- will simplify the overall code.

Release note: none.

There's one special case where IMPORT never creates tables: when running with the `transform` option that writes a BACKUP instead. This mode was easy to maintain in the initial version of IMPORT, which internally essentially made a BACKUP anyway and then restored it. As IMPORT has diverged from that and become more focused on directly ingesting data, maintaining this functionality has become more expensive. Given that simply backing up after ingesting can generate the same backup, maintaining this extra overhead in the IMPORT code no longer makes sense.

Release note (sql change): remove 'transform' option from IMPORT.

@dt dt requested review from mjibson and lucy-zhang May 6, 2019

@dt dt requested review from cockroachdb/distsql-prs as code owners May 6, 2019


@dt (Member Author) commented May 6, 2019

I split the workload part of this into its own change over in #37343 and I'll rebase this one on that after it merges.

@dt dt force-pushed the dt:import-create-first branch from 42b9ae4 to 9253ba5 May 7, 2019

pkg/ccl/importccl/import_stmt.go (outdated):
```go
log.Infof(ctx, "staging tbl via cput %v \nnil\n -> \n%v\n",
	sqlbase.MakeDescMetadataKey(tableDescs[i].ID), sqlbase.WrapDescriptor(tableDescs[i]))
}
// TODO(dt): we should be creating the job with this txn too. Once a job
```

@mjibson (Member) commented May 7, 2019
how hard is this? seems kinda important.

@dt (Author, Member) replied May 7, 2019

Agreed, it is important. At first glance it'll be pretty tricky and will require refactoring the jobs code a bit. I looked at it for a bit but decided it was going to involve enough moving things around that I'd rather not do it here and instead leave it to a self-contained change later, though I'd certainly consider it something that needs to be addressed before release.

@@ -696,67 +736,66 @@ func (r *importResumer) Resume(

```go
// stuff to delete the keys in the background.
func (r *importResumer) OnFailOrCancel(ctx context.Context, txn *client.Txn) error {
	details := r.job.Details().(jobspb.ImportDetails)
	if details.BackupPath != "" {
```

@mjibson (Member) commented May 7, 2019

can BackupPath be removed from the .proto?

@dt (Author, Member) replied May 7, 2019

Eh, I realized that for now I'd like to keep it and assert that it is empty, to avoid adopting a transform job on a node that doesn't support transform.

```go
// and so we don't need to preserve MVCC semantics.
tableDesc.DropTime = 1
b.CPut(sqlbase.MakeNameMetadataKey(tableDesc.ParentID, tableDesc.Name), nil, tableDesc.ID)
} else {
```

@mjibson (Member) commented May 7, 2019

So this else case is for the not-yet-written import-into-existing-tables feature?

@dt (Author, Member) replied May 7, 2019

Correct
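The branch under discussion can be illustrated with a minimal sketch. The types and function names here (`tableDesc`, `onFailOrCancel`) are simplified, hypothetical stand-ins for the real `importccl` code: tables that this IMPORT created are rolled back by setting a non-zero `DropTime` (the data can be cleared without preserving MVCC history, mirroring the snippet above), while rollback for pre-existing tables — the not-yet-written import-into-existing feature — is left unimplemented.

```go
package main

import (
	"errors"
	"fmt"
)

// tableDesc is an illustrative stand-in for the real descriptor type.
type tableDesc struct {
	Name         string
	DropTime     int64
	CreatedByJob bool // true if this IMPORT wrote the descriptor itself
}

var errNotImplemented = errors.New("rollback of import into existing table not yet written")

// onFailOrCancel sketches the failure-cleanup branching: tables the job
// created never existed before, so they are simply marked dropped; for a
// pre-existing table, only the ingested keys would need to be rolled
// back, which this PR does not implement.
func onFailOrCancel(descs []*tableDesc) error {
	for _, d := range descs {
		if d.CreatedByJob {
			// DropTime = 1: the data was never visible, so it can be
			// cleared without preserving MVCC semantics.
			d.DropTime = 1
		} else {
			return errNotImplemented
		}
	}
	return nil
}

func main() {
	descs := []*tableDesc{{Name: "orders", CreatedByJob: true}}
	fmt.Println(onFailOrCancel(descs) == nil) // prints "true"
}
```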

@dt dt force-pushed the dt:import-create-first branch from 9253ba5 to 7efa54a May 7, 2019

dt added some commits May 3, 2019

importccl: remove 'transform' option from IMPORT
This mode allowed using 'IMPORT' as a distributed pure computation system that didn't do any actual ingestion.
In this mode, it would generate a BACKUP format version of the input data, writing it to the passed URI.

However this is not the main use-case for import -- ingesting data is -- and maintaining it complicates the
code, particularly as it assumes that IMPORT has non-overlapping SSTs to put into a BACKUP, or that IMPORT
prepares data before deciding where to put it. With direct ingestion or import into existing tables, these
can no longer be assumed, so maintaining this extra feature becomes more costly.

Release note (sql change): remove 'transform' option from IMPORT.
importccl: create tables before ingesting in IMPORT

@dt dt force-pushed the dt:import-create-first branch from 7efa54a to f78b947 May 7, 2019

@mjibson approved these changes May 7, 2019

@dt (Member Author) commented May 8, 2019

bors r+

TFTR!

craig bot pushed a commit that referenced this pull request May 8, 2019

Merge #37338
37338: importccl: create tables before ingesting data r=dt a=dt

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig (bot) commented May 8, 2019

Build succeeded

@craig craig bot merged commit f78b947 into cockroachdb:master May 8, 2019

2 checks passed:
- GitHub CI (Cockroach): TeamCity build finished
- license/cla: Contributor License Agreement is signed

@dt dt deleted the dt:import-create-first branch May 8, 2019
