Skip to content

[feature] Enhance COPY INTO with multi-format support and advanced options #8005

@JunRuiLee

Description

@JunRuiLee

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

In production environments, bulk data import and export is a common and critical operation — loading datasets from external storage into Paimon tables, or exporting table data to files for downstream consumption. The COPY INTO statement provides a declarative SQL interface for these operations without requiring users to write custom ETL pipelines.

The initial COPY INTO implementation was recently introduced with basic CSV support. This umbrella issue tracks the effort to extend it with multi-format support and advanced options, aligning with the commonly adopted capabilities in the industry.

Planned Features

We plan to implement the following 7 capabilities, which are the most commonly adopted features in the industry, as separate PRs:

Format Support

  • JSON format — Support FILE_FORMAT = (TYPE = JSON) for both import and export. Options: MULTI_LINE, COMPRESSION, NULL_IF, EMPTY_FIELD_AS_NULL. JSON import uses column-name matching by default (no positional dependency).
  • Parquet format — Support FILE_FORMAT = (TYPE = PARQUET) for both import and export. Leverages Spark's native schema inference. Options: COMPRESSION.

Error Handling

  • ON_ERROR = CONTINUE / SKIP_FILE — Currently only ABORT_STATEMENT is supported. CONTINUE skips bad rows and reports error counts per file. SKIP_FILE skips entire files that fail and reports per-file LOADED/LOAD_FAILED status.

Column Matching

  • MATCH_BY_COLUMN_NAME — Support NONE (default, positional), CASE_SENSITIVE, and CASE_INSENSITIVE modes. Only applicable to structured formats (JSON/Parquet), not CSV. Mutually exclusive with explicit column lists.

Export Enhancement

  • FROM (SELECT ...) query export — Currently export only supports FROM table_name. This adds support for arbitrary SQL queries as the data source for COPY INTO <location>.

Load Management

  • PURGE — When PURGE = TRUE, automatically delete source files after successful loading. Best-effort deletion (failures are silently ignored). In SKIP_FILE mode, only successfully loaded files are purged.
  • VALIDATION_MODE — Validate data without actually loading it. Supports RETURN_<n>_ROWS (preview n rows), RETURN_ERRORS (show first error per file), and RETURN_ALL_ERRORS (show all errors across all files).

Proposed SQL Syntax

Import

COPY INTO table_name [(col1, col2, ...)]
FROM 'source_path'
FILE_FORMAT = (TYPE = CSV | JSON | PARQUET [, option = value, ...])
[PATTERN = 'regex']
[FORCE = TRUE | FALSE]
[ON_ERROR = ABORT_STATEMENT | CONTINUE | SKIP_FILE]
[MATCH_BY_COLUMN_NAME = NONE | CASE_SENSITIVE | CASE_INSENSITIVE]
[PURGE = TRUE | FALSE]
[VALIDATION_MODE = RETURN_n_ROWS | RETURN_ERRORS | RETURN_ALL_ERRORS]

Export

COPY INTO 'target_path'
FROM table_name | ('SELECT ...')
FILE_FORMAT = (TYPE = CSV | JSON | PARQUET [, option = value, ...])
[OVERWRITE = TRUE | FALSE]

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions