diff --git a/docs/en/sql-reference/10-sql-commands/10-dml/dml-copy-into-table.md b/docs/en/sql-reference/10-sql-commands/10-dml/dml-copy-into-table.md
index 5bc761c724..a4223c0381 100644
--- a/docs/en/sql-reference/10-sql-commands/10-dml/dml-copy-into-table.md
+++ b/docs/en/sql-reference/10-sql-commands/10-dml/dml-copy-into-table.md
@@ -64,7 +64,8 @@ externalLocation ::=
   /* Amazon S3-like Storage */
   's3://<bucket>[<path>]'
   CONNECTION = (
-    [ ENDPOINT_URL = '<your-endpoint-url>' ]
+    [ CONNECTION_NAME = '<your-connection-name>' ]
+    | [ ENDPOINT_URL = '<your-endpoint-url>' ]
     [ ACCESS_KEY_ID = '<your-access-key-id>' ]
     [ SECRET_ACCESS_KEY = '<your-secret-access-key>' ]
     [ ENABLE_VIRTUAL_HOST_STYLE = TRUE | FALSE ]
@@ -78,7 +79,8 @@ externalLocation ::=
   /* Azure Blob Storage */
 | 'azblob://<container>[<path>]'
   CONNECTION = (
-    ENDPOINT_URL = '<your-endpoint-url>'
+    [ CONNECTION_NAME = '<your-connection-name>' ]
+    | ENDPOINT_URL = '<your-endpoint-url>'
     ACCOUNT_NAME = '<your-account-name>'
     ACCOUNT_KEY = '<your-account-key>'
   )
@@ -86,13 +88,15 @@ externalLocation ::=
   /* Google Cloud Storage */
 | 'gcs://<bucket>[<path>]'
   CONNECTION = (
-    CREDENTIAL = '<your-base64-encoded-credential>'
+    [ CONNECTION_NAME = '<your-connection-name>' ]
+    | CREDENTIAL = '<your-base64-encoded-credential>'
   )
 
   /* Alibaba Cloud OSS */
 | 'oss://<bucket>[<path>]'
   CONNECTION = (
-    ACCESS_KEY_ID = '<your-access-key-id>'
+    [ CONNECTION_NAME = '<your-connection-name>' ]
+    | ACCESS_KEY_ID = '<your-access-key-id>'
     ACCESS_KEY_SECRET = '<your-access-key-secret>'
     ENDPOINT_URL = '<your-endpoint-url>'
     [ PRESIGN_ENDPOINT_URL = '<your-presign-endpoint-url>' ]
@@ -101,7 +105,8 @@ externalLocation ::=
   /* Tencent Cloud Object Storage */
 | 'cos://<bucket>[<path>]'
   CONNECTION = (
-    SECRET_ID = '<your-secret-id>'
+    [ CONNECTION_NAME = '<your-connection-name>' ]
+    | SECRET_ID = '<your-secret-id>'
     SECRET_KEY = '<your-secret-key>'
     ENDPOINT_URL = '<your-endpoint-url>'
   )
@@ -183,13 +188,18 @@ For remote files, you can use glob patterns to specify multiple files. For examp
 
 The `FILE_FORMAT` parameter supports different file types, each with specific formatting options. Below are the available options for each supported file format:
 
-### Common Options for All Formats
+<Tabs>
+<TabItem value="common" label="Common Options (All Formats)">
+
+These options are available for all file formats:
 
 | Option | Description | Values | Default |
 |--------|-------------|--------|--------|
 | COMPRESSION | Compression algorithm for data files | AUTO, GZIP, BZ2, BROTLI, ZSTD, DEFLATE, RAW_DEFLATE, XZ, NONE | AUTO |
 
-### TYPE = CSV
+</TabItem>
+
+<TabItem value="csv" label="TYPE = CSV">
 
 | Option | Description | Default |
 |--------|-------------|--------|
@@ -204,14 +214,18 @@ The `FILE_FORMAT` parameter supports different file types, each with specific fo
 | EMPTY_FIELD_AS | How to handle empty fields | null |
 | BINARY_FORMAT | Encoding format(HEX or BASE64) for binary data | HEX |
 
-### TYPE = TSV
+</TabItem>
+
+<TabItem value="tsv" label="TYPE = TSV">
 
 | Option | Description | Default |
 |--------|-------------|--------|
 | RECORD_DELIMITER | Character(s) separating records | newline |
 | FIELD_DELIMITER | Character(s) separating fields | tab (\t) |
 
-### TYPE = NDJSON
+</TabItem>
+
+<TabItem value="ndjson" label="TYPE = NDJSON">
 
 | Option | Description | Default |
 |--------|-------------|--------|
@@ -219,24 +233,33 @@ The `FILE_FORMAT` parameter supports different file types, each with specific fo
 | MISSING_FIELD_AS | How to handle missing fields | ERROR |
 | ALLOW_DUPLICATE_KEYS | Allow duplicate object keys | FALSE |
 
-### TYPE = PARQUET
+</TabItem>
+
+<TabItem value="parquet" label="TYPE = PARQUET">
 
 | Option | Description | Default |
 |--------|-------------|--------|
 | MISSING_FIELD_AS | How to handle missing fields | ERROR |
 
-### TYPE = ORC
+</TabItem>
+
+<TabItem value="orc" label="TYPE = ORC">
 
 | Option | Description | Default |
 |--------|-------------|--------|
 | MISSING_FIELD_AS | How to handle missing fields | ERROR |
 
-### TYPE = AVRO
+</TabItem>
+
+<TabItem value="avro" label="TYPE = AVRO">
 
 | Option | Description | Default |
 |--------|-------------|--------|
 | MISSING_FIELD_AS | How to handle missing fields | ERROR |
 
+</TabItem>
+</Tabs>
+
 ## Copy Options
 
 | Parameter | Description | Default |
@@ -270,6 +293,10 @@ If `RETURN_FAILED_ONLY` is set to `true`, the output will only contain the files
 ## Examples
 
+:::tip Best Practice
+For external storage sources, it's recommended to use pre-created connections with the `CONNECTION_NAME` parameter instead of specifying credentials directly in the COPY statement. This approach provides better security, maintainability, and reusability. See [CREATE CONNECTION](../00-ddl/13-connection/create-connection.md) for details on creating connections.
+:::
+
 ### Example 1: Loading from Stages
 
 These examples showcase data loading into Databend from various types of stages:
 
@@ -314,16 +341,19 @@ These examples showcase data loading into Databend from various types of externa
 
 
 
-This example establishes a connection to Amazon S3 using AWS access keys and secrets, and it loads 10 rows from a CSV file:
+This example uses a pre-created connection to load data from Amazon S3:
 
 ```sql
--- Authenticated by AWS access keys and secrets.
+-- First create a connection (you only need to do this once)
+CREATE CONNECTION my_s3_conn
+  STORAGE_TYPE = 's3'
+  ACCESS_KEY_ID = '<your-access-key-id>'
+  SECRET_ACCESS_KEY = '<your-secret-access-key>';
+
+-- Use the connection to load data
 COPY INTO mytable
   FROM 's3://mybucket/data.csv'
-  CONNECTION = (
-    ACCESS_KEY_ID = '<your-access-key-id>',
-    SECRET_ACCESS_KEY = '<your-secret-access-key>'
-  )
+  CONNECTION = (CONNECTION_NAME = 'my_s3_conn')
   FILE_FORMAT = (
     TYPE = CSV,
     FIELD_DELIMITER = ',',
@@ -333,19 +363,20 @@ COPY INTO mytable
   SIZE_LIMIT = 10;
 ```
 
-This example connects to Amazon S3 using AWS IAM role authentication with an external ID and loads CSV files matching the specified pattern from 'mybucket':
+**Using IAM Role (Recommended for Production)**
 
 ```sql
--- Authenticated by AWS IAM role and external ID.
+-- Create connection using IAM role (more secure, recommended for production)
+CREATE CONNECTION my_iam_conn
+  STORAGE_TYPE = 's3'
+  ROLE_ARN = 'arn:aws:iam::123456789012:role/my_iam_role';
+
+-- Load CSV files using the IAM role connection
 COPY INTO mytable
   FROM 's3://mybucket/'
-  CONNECTION = (
-    ENDPOINT_URL = 'https://<endpoint-URL>',
-    ROLE_ARN = 'arn:aws:iam::123456789012:role/my_iam_role',
-    EXTERNAL_ID = '123456'
-  )
+  CONNECTION = (CONNECTION_NAME = 'my_iam_conn')
   PATTERN = '.*[.]csv'
- FILE_FORMAT = (
+  FILE_FORMAT = (
     TYPE = CSV,
     FIELD_DELIMITER = ',',
     RECORD_DELIMITER = '\n',
@@ -360,18 +391,46 @@ COPY INTO mytable
 This example connects to Azure Blob Storage and loads data from 'data.csv' into Databend:
 
 ```sql
+-- Create connection for Azure Blob Storage
+CREATE CONNECTION my_azure_conn
+  STORAGE_TYPE = 'azblob'
+  ENDPOINT_URL = 'https://<account_name>.blob.core.windows.net'
+  ACCOUNT_NAME = '<account_name>'
+  ACCOUNT_KEY = '<account_key>';
+
+-- Use the connection to load data
 COPY INTO mytable
   FROM 'azblob://mybucket/data.csv'
-  CONNECTION = (
-    ENDPOINT_URL = 'https://<account_name>.blob.core.windows.net',
-    ACCOUNT_NAME = '<account_name>',
-    ACCOUNT_KEY = '<account_key>'
-  )
+  CONNECTION = (CONNECTION_NAME = 'my_azure_conn')
   FILE_FORMAT = (type = CSV);
 ```
 
 </TabItem>
 
+<TabItem value="gcs" label="Google Cloud Storage">
+
+This example connects to Google Cloud Storage and loads data:
+
+```sql
+-- Create connection for Google Cloud Storage
+CREATE CONNECTION my_gcs_conn
+  STORAGE_TYPE = 'gcs'
+  CREDENTIAL = '<your-base64-encoded-credential>';
+
+-- Use the connection to load data
+COPY INTO mytable
+  FROM 'gcs://mybucket/data.csv'
+  CONNECTION = (CONNECTION_NAME = 'my_gcs_conn')
+  FILE_FORMAT = (
+    TYPE = CSV,
+    FIELD_DELIMITER = ',',
+    RECORD_DELIMITER = '\n',
+    SKIP_HEADER = 1
+  );
+```
+
+</TabItem>
+
 <TabItem value="remote" label="Remote Files">
 
 This example loads data from three remote CSV files and skips a file in case of errors.
@@ -411,13 +470,16 @@ COPY INTO mytable
 This example loads a GZIP-compressed CSV file on Amazon S3 into Databend:
 
 ```sql
+-- Create connection for compressed data loading
+CREATE CONNECTION compressed_s3_conn
+  STORAGE_TYPE = 's3'
+  ACCESS_KEY_ID = '<your-access-key-id>'
+  SECRET_ACCESS_KEY = '<your-secret-access-key>';
+
+-- Load GZIP-compressed CSV file using the connection
 COPY INTO mytable
   FROM 's3://mybucket/data.csv.gz'
-  CONNECTION = (
-    ENDPOINT_URL = 'https://<endpoint-URL>',
-    ACCESS_KEY_ID = '<your-access-key-id>',
-    SECRET_ACCESS_KEY = '<your-secret-access-key>'
-  )
+  CONNECTION = (CONNECTION_NAME = 'compressed_s3_conn')
   FILE_FORMAT = (
     TYPE = CSV,
     FIELD_DELIMITER = ',',
@@ -432,8 +494,16 @@ COPY INTO mytable
 This example demonstrates how to load CSV files from Amazon S3 using pattern matching with the PATTERN parameter. It filters files with 'sales' in their names and '.csv' extensions:
 
 ```sql
+-- Create connection for pattern-based file loading
+CREATE CONNECTION pattern_s3_conn
+  STORAGE_TYPE = 's3'
+  ACCESS_KEY_ID = '<your-access-key-id>'
+  SECRET_ACCESS_KEY = '<your-secret-access-key>';
+
+-- Load CSV files with 'sales' in their names using pattern matching
 COPY INTO mytable
   FROM 's3://mybucket/'
+  CONNECTION = (CONNECTION_NAME = 'pattern_s3_conn')
   PATTERN = '.*sales.*[.]csv'
   FILE_FORMAT = (
     TYPE = CSV,
@@ -445,11 +515,12 @@ COPY INTO mytable
 
 Where `.*` is interpreted as zero or more occurrences of any character. The square brackets escape the period character `.` that precedes a file extension.
 
-To load from all the CSV files:
+To load from all the CSV files using a connection:
 
 ```sql
 COPY INTO mytable
   FROM 's3://mybucket/'
+  CONNECTION = (CONNECTION_NAME = 'pattern_s3_conn')
   PATTERN = '.*[.]csv'
   FILE_FORMAT = (
     TYPE = CSV,
@@ -457,7 +528,6 @@ COPY INTO mytable
     RECORD_DELIMITER = '\n',
     SKIP_HEADER = 1
   );
-
 ```
 
 When specifying the pattern for a file path including multiple folders, consider your matching criteria:
@@ -605,7 +675,7 @@ DESC t2;
 An error would occur when attempting to load the data into a table:
 
 ```sql
-root@localhost:8000/default> COPY INTO t2 FROM @~/invalid_json_string.parquet FILE_FORMAT = (TYPE = PARQUET) ON_ERROR = CONTINUE;
+COPY INTO t2 FROM @~/invalid_json_string.parquet FILE_FORMAT = (TYPE = PARQUET) ON_ERROR = CONTINUE;
 
 error: APIError: ResponseError with 1006: EOF while parsing a value, pos 3 while evaluating function `parse_json('[1,')`
 ```
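For quick reference, the end-to-end workflow this patch documents looks roughly like the sketch below. It is a minimal example, not part of the patch: the connection name, bucket, table, and credentials are hypothetical placeholders, and `SHOW CONNECTIONS` is assumed to be available in your Databend version for verifying the result (see the connection DDL pages linked from the tip in the patch).

```sql
-- Create the named connection once; credentials live here, not in each COPY statement.
-- 'my_s3_conn', 'mybucket', and 'mytable' are hypothetical names.
CREATE CONNECTION my_s3_conn
  STORAGE_TYPE = 's3'
  ACCESS_KEY_ID = '<your-access-key-id>'
  SECRET_ACCESS_KEY = '<your-secret-access-key>';

-- Optional sanity check (assumes SHOW CONNECTIONS is available in your version).
SHOW CONNECTIONS;

-- Reuse the connection by name wherever data is loaded from that bucket.
COPY INTO mytable
  FROM 's3://mybucket/data.csv'
  CONNECTION = (CONNECTION_NAME = 'my_s3_conn')
  FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1);
```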