
util/parquet: support required repetitions #104279

Closed
jayshrivastava opened this issue Jun 2, 2023 · 3 comments
Assignees
Labels
A-cdc Change Data Capture C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-cdc

Comments

@jayshrivastava
Contributor

jayshrivastava commented Jun 2, 2023

By default, the library treats all columns as nullable (parquet repetition type `OPTIONAL`), even when they are declared NOT NULL. Ideally, a column should be written as not nullable (`REQUIRED`) when the caller specifies that it is.
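For context, every column in a parquet schema carries a repetition type. A minimal sketch of the requested behavior, with hypothetical names that are illustrative only and not CockroachDB's `util/parquet` API:

```python
# Hypothetical sketch: mapping column nullability to a parquet
# repetition type. Names are illustrative, not util/parquet's API.

OPTIONAL = "OPTIONAL"  # value may be NULL
REQUIRED = "REQUIRED"  # value must always be present

def repetition_for(nullable: bool) -> str:
    """Choose the repetition type for a column.

    Today the library effectively always returns OPTIONAL; supporting
    required repetitions would mean honoring the nullable flag.
    """
    return OPTIONAL if nullable else REQUIRED

# A NOT NULL column would be declared REQUIRED in the file schema:
print(repetition_for(False))  # REQUIRED
print(repetition_for(True))   # OPTIONAL
```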

Jira issue: CRDB-28443
Epic CRDB-27372

@jayshrivastava jayshrivastava added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-cdc labels Jun 2, 2023
@blathers-crl

blathers-crl bot commented Jun 2, 2023

cc @cockroachdb/cdc

@blathers-crl blathers-crl bot added the A-cdc Change Data Capture label Jun 2, 2023
@jayshrivastava jayshrivastava self-assigned this Jun 2, 2023
@jayshrivastava jayshrivastava changed the title util/parquet: support REPITIONS_REQUIRED util/parquet: support required repetitions Jun 2, 2023
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 2, 2023
Previously, `EXPORT INTO PARQUET` used a library which had
very poor memory usage. This change updates this command to
use the new parquet writer in `util/parquet`, which is newer
and does not use excessive memory.

To ensure backward compatibility, as much test code as
possible is left untouched so that the old `EXPORT INTO PARQUET`
behavior can be verified to be the same when using the new
library. Some of the old production code and dependencies are
also left untouched because they are used in the test code.

This commit leaves some TODOs to clean up the tests and production
code once we have confidence in the new library.

Informs: cockroachdb#104278
Informs: cockroachdb#104279
Closes: cockroachdb#103317

Release note (general change): `EXPORT INTO PARQUET` will now
use a new internal implementation for writing parquet files
using parquet spec version 2.6. There should be no significant
impact to the structure of files being written. There is one
minor change:
- all columns written to parquet files will be nullable
  (ie. the parquet repetition type is `OPTIONAL`)
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 2, 2023
@miretskiy
Contributor

I'm not sure how important it is to support this feature. On the one hand, yes: if the database says NOT NULL, it would be nice for the file schema to say so too. However, database schemas can change and constraints can be dropped. Having all fields nullable in parquet (i.e. `OPTIONAL`) is actually a good thing -- it's similar to how protocol buffers should evolve over time: make all fields optional, and if a field is logically required, enforce that in code.
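The "enforce in code" approach above can be sketched as an application-side check. This is a hypothetical helper for illustration, not anything in the codebase: the file schema declares every column `OPTIONAL`, but the consumer still rejects NULLs in columns it considers required.

```python
# Hypothetical sketch: application-side enforcement of required
# columns when the parquet schema itself marks everything OPTIONAL.

def enforce_required(rows, required_cols):
    """Raise if any required column is NULL (None) in any row."""
    for i, row in enumerate(rows):
        for col in required_cols:
            if row.get(col) is None:
                raise ValueError(
                    f"row {i}: column {col!r} is NULL but required")
    return rows

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": None}]
enforce_required(rows, ["id"])       # passes: id is never NULL
# enforce_required(rows, ["name"])   # would raise ValueError
```

The upside of this design is that dropping a NOT NULL constraint later never invalidates previously written files; only the in-code check needs to change.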

jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 5, 2023
craig bot pushed a commit that referenced this issue Jun 5, 2023
104234: importer: use new parquet writer for EXPORT INTO PARQUET r=miretskiy a=jayshrivastava

### roachtest: add test for EXPORT INTO PARQUET
This change adds a roachtest which sets up a 3-node CRDB cluster
on 8vCPU machines initialized with the TPC-C database containing
250 warehouses. Then, it executes 30 `EXPORT INTO PARQUET` statements
concurrently, repeatedly for 10 minutes. The main purpose of this
test is to use it for benchmarking. It sets up grafana as well
so metrics can be observed during the test.

This change also adds a daily roachtest which runs exports
on a 100 warehouse TPCC database, without setting up grafana.

Informs: #103317
Release note: None

---

### importer: use new parquet writer for EXPORT INTO PARQUET

Previously, `EXPORT INTO PARQUET` used a library which had
very poor memory usage. This change updates this command to
use the new parquet writer in `util/parquet`, which is newer
and does not use excessive memory.

To ensure backward compatibility, as much test code as
possible is left untouched so that the old `EXPORT INTO PARQUET`
behavior can be verified to be the same when using the new
library. Some of the old production code and dependencies are
also left untouched because they are used in the test code.

This commit leaves some TODOs to clean up the tests and production
code once we have confidence in the new library.

Informs: #104329
Informs: #104278
Informs: #104279
Closes: #103317

Release note (general change): `EXPORT INTO PARQUET` will now
use a new internal implementation for writing parquet files
using parquet spec version 2.6. There should be no significant
impact to the structure of files being written. There is one
minor change:
- all columns written to parquet files will be nullable
  (ie. the parquet repetition type is `OPTIONAL`)

Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
@amruss
Contributor

amruss commented Jun 7, 2023

Closing this. What we do with Avro is to treat every field as optional; this allows flexibility over time.

@amruss amruss closed this as completed Jun 7, 2023