Search before asking
Motivation
In production environments, bulk data import and export is a common and critical operation — loading datasets from external storage into Paimon tables, or exporting table data to files for downstream consumption. The COPY INTO statement provides a declarative SQL interface for these operations without requiring users to write custom ETL pipelines.
The initial COPY INTO implementation was recently introduced with basic CSV support. This umbrella issue tracks the effort to extend it with multi-format support and advanced options, aligning with the commonly adopted capabilities in the industry.
Planned Features
We plan to implement the following 7 capabilities, which are the most commonly adopted features in the industry, as separate PRs:
Format Support
Error Handling
Column Matching
Export Enhancement
Load Management
Proposed SQL Syntax
Import
COPY INTO table_name [(col1, col2, ...)]
FROM 'source_path'
FILE_FORMAT = (TYPE = CSV | JSON | PARQUET [, option = value, ...])
[PATTERN = 'regex']
[FORCE = TRUE | FALSE]
[ON_ERROR = ABORT_STATEMENT | CONTINUE | SKIP_FILE]
[MATCH_BY_COLUMN_NAME = NONE | CASE_SENSITIVE | CASE_INSENSITIVE]
[PURGE = TRUE | FALSE]
[VALIDATION_MODE = RETURN_n_ROWS | RETURN_ERRORS | RETURN_ALL_ERRORS]
Export
COPY INTO 'target_path'
FROM table_name | ('SELECT ...')
FILE_FORMAT = (TYPE = CSV | JSON | PARQUET [, option = value, ...])
[OVERWRITE = TRUE | FALSE]
Anything else?
No response
Are you willing to submit a PR?
Search before asking
Motivation
In production environments, bulk data import and export is a common and critical operation — loading datasets from external storage into Paimon tables, or exporting table data to files for downstream consumption. The
COPY INTOstatement provides a declarative SQL interface for these operations without requiring users to write custom ETL pipelines.The initial
COPY INTOimplementation was recently introduced with basic CSV support. This umbrella issue tracks the effort to extend it with multi-format support and advanced options, aligning with the commonly adopted capabilities in the industry.Planned Features
We plan to implement the following 7 capabilities, which are the most commonly adopted features in the industry, as separate PRs:
Format Support
FILE_FORMAT = (TYPE = JSON)for both import and export. Options:MULTI_LINE,COMPRESSION,NULL_IF,EMPTY_FIELD_AS_NULL. JSON import uses column-name matching by default (no positional dependency).FILE_FORMAT = (TYPE = PARQUET)for both import and export. Leverages Spark's native schema inference. Options:COMPRESSION.Error Handling
ABORT_STATEMENTis supported.CONTINUEskips bad rows and reports error counts per file.SKIP_FILEskips entire files that fail and reports per-file LOADED/LOAD_FAILED status.Column Matching
NONE(default, positional),CASE_SENSITIVE, andCASE_INSENSITIVEmodes. Only applicable to structured formats (JSON/Parquet), not CSV. Mutually exclusive with explicit column lists.Export Enhancement
FROM table_name. This adds support for arbitrary SQL queries as the data source forCOPY INTO <location>.Load Management
PURGE = TRUE, automatically delete source files after successful loading. Best-effort deletion (failures are silently ignored). InSKIP_FILEmode, only successfully loaded files are purged.RETURN_<n>_ROWS(preview n rows),RETURN_ERRORS(show first error per file), andRETURN_ALL_ERRORS(show all errors across all files).Proposed SQL Syntax
Import
Export
Anything else?
No response
Are you willing to submit a PR?