8 changes: 8 additions & 0 deletions docs/data-operate/import/import-way/broker-load-manual.md
@@ -24,6 +24,12 @@ Supported data sources:
- HDFS protocol
- Custom protocol (requires a broker process)

Supported file path patterns:

- Wildcards: `*`, `?`, `[abc]`, `[a-z]`
- Range expansion: `{1..10}`, `{a,b,c}`
- See [File Path Pattern](../../../sql-manual/basic-element/file-path-pattern) for complete syntax

Supported data types:

- CSV
@@ -558,6 +564,8 @@ Different Broker types and access methods require different authentication information

### Importing data from HDFS using wildcards to match two batches of files and importing them into two separate tables

Broker Load supports wildcards (`*`, `?`, `[...]`) and range patterns (`{1..10}`) in file paths. For detailed syntax, see [File Path Pattern](../../../sql-manual/basic-element/file-path-pattern).

```sql
LOAD LABEL example_db.label2
(
```
4 changes: 3 additions & 1 deletion docs/data-operate/import/import-way/insert-into-manual.md
@@ -312,7 +312,9 @@ The INSERT command is a synchronous command. If it returns a result, that indicates

## Ingest data by TVF

- Doris can directly query and analyze files stored in object storage or HDFS as tables through the Table Value Functions (TVFs), which supports automatic column type inference. For detailed information, please refer to the [Lakehouse/TVF documentation](https://doris.apache.org/docs/3.0/lakehouse/file-analysis).
+ Doris can directly query and analyze files stored in object storage or HDFS as tables through Table Value Functions (TVFs), which support automatic column type inference. For detailed information, please refer to the [Lakehouse/TVF documentation](../../../lakehouse/file-analysis).

+ TVFs support wildcards (`*`, `?`, `[...]`) and range patterns (`{1..10}`) in file paths. For complete syntax, see [File Path Pattern](../../../sql-manual/basic-element/file-path-pattern).
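
As a quick illustration, a range pattern can pull a known sequence of files in a single statement. The following is a minimal sketch modeled on the S3 TVF examples in these docs; the table `sales`, the bucket, and the file names are hypothetical:

```sql
-- Hypothetical table and bucket; credentials elided with placeholders.
INSERT INTO sales
SELECT * FROM S3(
    "uri" = "s3://my-bucket/exports/part_{1..3}.csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv"
);
```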

### Automatic column type inference

20 changes: 7 additions & 13 deletions docs/lakehouse/file-analysis.md
@@ -39,21 +39,15 @@ The attributes of a TVF include the file path to be analyzed, file format, connection information

### Multiple File Import

- When importing, the file path (URI) supports wildcards for matching. Doris file path matching uses the [Glob matching pattern](https://en.wikipedia.org/wiki/Glob_(programming)), and has been extended on this basis to support more flexible file selection methods.
+ The file path (URI) supports wildcards and range patterns for matching multiple files:

- - `file_{1..3}`: Matches files `file_1`, `file_2`, `file_3`
- - `file_{1,3}_{1,2}`: Matches files `file_1_1`, `file_1_2`, `file_3_1`, `file_3_2` (supports mixing with `{n..m}` notation, separated by commas)
- - `file_*`: Matches all files starting with `file_`
- - `*.parquet`: Matches all files with the `.parquet` suffix
- - `tvf_test/*`: Matches all files in the `tvf_test` directory
- - `*test*`: Matches files containing `test` in the filename
+ | Pattern | Example | Matches |
+ |---------|---------|---------|
+ | `*` | `file_*` | All files starting with `file_` |
+ | `{n..m}` | `file_{1..3}` | `file_1`, `file_2`, `file_3` |
+ | `{a,b,c}` | `file_{a,b}` | `file_a`, `file_b` |

- **Notes**
-
- - In the `{1..3}` notation, the order can be reversed, `{3..1}` is also valid.
- - Notations like `file_{-1..2}` and `file_{a..4}` are not supported, as negative numbers or letters cannot be used as enumeration endpoints. However, `file_{1..3,11,a}` is allowed and will match files `file_1`, `file_2`, `file_3`, `file_11`, and `file_a`.
- - Doris tries to import as many files as possible. For paths like `file_{a..b,-1..3,4..5}` that contain incorrect notation, we will match files `file_4` and `file_5`.
- - When using commas with `{1..4,5}`, only numbers are allowed. Expressions like `{1..4,a}` are not supported; in this case, `{a}` will be ignored.
+ For complete syntax including all supported wildcards, range expansion rules, and usage examples, see [File Path Pattern](../sql-manual/basic-element/file-path-pattern).
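
For instance, the range pattern from the table above can be used directly in an S3 TVF URI. A minimal sketch; the bucket and file names are hypothetical:

```sql
-- Matches tvf_test/file_1.csv, file_2.csv, and file_3.csv
SELECT * FROM S3(
    "uri" = "s3://my-bucket/tvf_test/file_{1..3}.csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv"
);
```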


### Automatic Inference of File Column Types
309 changes: 309 additions & 0 deletions docs/sql-manual/basic-element/file-path-pattern.md
@@ -0,0 +1,309 @@
---
{
"title": "File Path Pattern",
"language": "en",
"description": "File path patterns and wildcards supported by Doris for accessing files in remote storage systems like S3, HDFS, and other object storage."
}
---

## Description

When accessing files from remote storage systems (S3, HDFS, and other S3-compatible object storage), Doris supports flexible file path patterns including wildcards and range expressions. This document describes the supported path formats and pattern matching syntax.

These path patterns are supported by:
- [S3 TVF](../sql-functions/table-valued-functions/s3)
- [HDFS TVF](../sql-functions/table-valued-functions/hdfs)
- [Broker Load](../../data-operate/import/import-way/broker-load-manual)
- INSERT INTO SELECT from TVF

## Supported URI Formats

### S3-Style URIs

| Style | Format | Example |
|-------|--------|---------|
| AWS Client Style (Hadoop S3) | `s3://bucket/path/to/file` | `s3://my-bucket/data/file.csv` |
| S3A Style | `s3a://bucket/path/to/file` | `s3a://my-bucket/data/file.csv` |
| S3N Style | `s3n://bucket/path/to/file` | `s3n://my-bucket/data/file.csv` |
| Virtual Host Style | `https://bucket.endpoint/path/to/file` | `https://my-bucket.s3.us-west-1.amazonaws.com/data/file.csv` |
| Path Style | `https://endpoint/bucket/path/to/file` | `https://s3.us-west-1.amazonaws.com/my-bucket/data/file.csv` |
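
Whether path style or virtual-host style is used when connecting is controlled by the S3 TVF's `use_path_style` property (see the S3 TVF parameter table). Below is a minimal sketch for an S3-compatible service; the endpoint and bucket names are placeholders:

```sql
-- Path-style access against a hypothetical S3-compatible endpoint (e.g., MinIO).
SELECT * FROM S3(
    "uri" = "https://s3.example.com/my-bucket/data/file.csv",
    "s3.endpoint" = "s3.example.com",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true"
);
```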

### Other Cloud Storage URIs

| Provider | Scheme | Example |
|----------|--------|---------|
| Alibaba Cloud OSS | `oss://` | `oss://my-bucket/data/file.csv` |
| Tencent Cloud COS | `cos://`, `cosn://` | `cos://my-bucket/data/file.csv` |
| Baidu Cloud BOS | `bos://` | `bos://my-bucket/data/file.csv` |
| Huawei Cloud OBS | `obs://` | `obs://my-bucket/data/file.csv` |
| Google Cloud Storage | `gs://` | `gs://my-bucket/data/file.csv` |
| Azure Blob Storage | `azure://` | `azure://container/data/file.csv` |

### HDFS URIs

| Style | Format | Example |
|-------|--------|---------|
| Standard | `hdfs://namenode:port/path/to/file` | `hdfs://namenode:8020/user/data/file.csv` |
| HA Mode | `hdfs://nameservice/path/to/file` | `hdfs://my-ha-cluster/user/data/file.csv` |
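
In HA mode, the nameservice is resolved through client-side failover properties rather than a single NameNode address. Below is a hedged sketch using the HDFS TVF with standard Hadoop HA settings; the nameservice `my-ha-cluster` and the host names are hypothetical:

```sql
SELECT * FROM HDFS(
    "uri" = "hdfs://my-ha-cluster/user/data/file_*.csv",
    "fs.defaultFS" = "hdfs://my-ha-cluster",
    "hadoop.username" = "user",
    "dfs.nameservices" = "my-ha-cluster",
    "dfs.ha.namenodes.my-ha-cluster" = "nn1,nn2",
    "dfs.namenode.rpc-address.my-ha-cluster.nn1" = "nn1-host:8020",
    "dfs.namenode.rpc-address.my-ha-cluster.nn2" = "nn2-host:8020",
    "dfs.client.failover.proxy.provider.my-ha-cluster" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "format" = "csv"
);
```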

## Wildcard Patterns

Doris uses glob-style pattern matching for file paths. The following wildcards are supported:

### Basic Wildcards

| Pattern | Description | Example | Matches |
|---------|-------------|---------|---------|
| `*` | Matches zero or more characters within a path segment | `*.csv` | `file.csv`, `data.csv`, `a.csv` |
| `?` | Matches exactly one character | `file?.csv` | `file1.csv`, `fileA.csv`, but not `file10.csv` |
| `[abc]` | Matches any single character in brackets | `file[123].csv` | `file1.csv`, `file2.csv`, `file3.csv` |
| `[a-z]` | Matches any single character in the range | `file[a-c].csv` | `filea.csv`, `fileb.csv`, `filec.csv` |
| `[!abc]` | Matches any single character NOT in brackets | `file[!0-9].csv` | `filea.csv`, `fileb.csv`, but not `file1.csv` |
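
Character classes are useful where `*` would be too greedy, such as selecting only single-digit file splits. A minimal sketch with a hypothetical bucket:

```sql
-- Matches file_0.csv through file_9.csv, but not file_10.csv
SELECT * FROM S3(
    "uri" = "s3://my-bucket/data/file_[0-9].csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv"
);
```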

### Range Expansion (Brace Patterns)

Doris supports numeric range expansion using brace patterns `{start..end}`:

| Pattern | Expansion | Matches |
|---------|-----------|---------|
| `{1..3}` | `{1,2,3}` | `1`, `2`, `3` |
| `{01..05}` | `{1,2,3,4,5}` | `1`, `2`, `3`, `4`, `5` (leading zeros are NOT preserved) |
| `{3..1}` | `{1,2,3}` | `1`, `2`, `3` (reverse ranges supported) |
| `{a,b,c}` | `{a,b,c}` | `a`, `b`, `c` (enumeration) |
| `{1..3,5,7..9}` | `{1,2,3,5,7,8,9}` | Mixed ranges and values |

:::caution Note
- Doris tries to match as many files as possible. Invalid parts in brace expressions are silently skipped, and valid parts are still expanded. For example, `file_{a..b,-1..3,4..5}` will match `file_4` and `file_5` (the invalid `a..b` and negative range `-1..3` are skipped, but `4..5` is expanded normally).
- If the entire range is negative (e.g., `{-1..2}`), the range is skipped. If mixed with valid ranges (e.g., `{-1..2,1..3}`), only the valid range `1..3` is expanded.
- When using comma-separated values with ranges, only numbers are allowed. For example, in `{1..4,a}`, the non-numeric `a` will be ignored, resulting in `{1,2,3,4}`.
- Pure enumeration patterns like `{a,b,c}` (without `..` ranges) are passed directly to glob matching and work as expected.
:::

### Combining Patterns

Multiple patterns can be combined in a single path:

```
s3://bucket/data_{1..3}/file_*.csv
```

This matches:
- `s3://bucket/data_1/file_a.csv`
- `s3://bucket/data_1/file_b.csv`
- `s3://bucket/data_2/file_a.csv`
- ... and so on

## Examples

### S3 TVF Examples

**Match all CSV files in a directory:**

```sql
SELECT * FROM S3(
"uri" = "s3://my-bucket/data/*.csv",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "csv"
);
```

**Match files with numeric range:**

```sql
SELECT * FROM S3(
"uri" = "s3://my-bucket/logs/data_{1..10}.csv",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "csv"
);
```

**Match files in date-partitioned directories:**

```sql
SELECT * FROM S3(
"uri" = "s3://my-bucket/logs/year=2024/month=*/day=*/data.parquet",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "parquet"
);
```

:::caution Zero-Padded Directories
For zero-padded directory names like `month=01`, `month=02`, use wildcards (`*`) instead of range patterns. The pattern `{01..12}` expands to `{1,2,...,12}` which won't match `month=01`.
:::

**Match numbered file splits (e.g., Spark output):**

```sql
SELECT * FROM S3(
"uri" = "s3://my-bucket/output/part-{00000..00099}.csv",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "csv"
);
```

### Broker Load Examples

**Load all CSV files matching a pattern:**

```sql
LOAD LABEL db.label_wildcard
(
DATA INFILE("s3://my-bucket/data/file_*.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH S3 (
"provider" = "S3",
"AWS_ENDPOINT" = "s3.us-west-2.amazonaws.com",
"AWS_ACCESS_KEY" = "xxx",
"AWS_SECRET_KEY" = "xxx",
"AWS_REGION" = "us-west-2"
);
```

**Load files using numeric range expansion:**

```sql
LOAD LABEL db.label_range
(
DATA INFILE("s3://my-bucket/exports/data_{1..5}.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH S3 (
"provider" = "S3",
"AWS_ENDPOINT" = "s3.us-west-2.amazonaws.com",
"AWS_ACCESS_KEY" = "xxx",
"AWS_SECRET_KEY" = "xxx",
"AWS_REGION" = "us-west-2"
);
```

**Load from HDFS with wildcards:**

```sql
LOAD LABEL db.label_hdfs_wildcard
(
DATA INFILE("hdfs://namenode:8020/user/data/2024-*/*.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH HDFS (
"fs.defaultFS" = "hdfs://namenode:8020",
"hadoop.username" = "user"
);
```

**Load from HDFS with numeric range:**

```sql
LOAD LABEL db.label_hdfs_range
(
DATA INFILE("hdfs://namenode:8020/data/file_{1..3,5,7..9}.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH HDFS (
"fs.defaultFS" = "hdfs://namenode:8020",
"hadoop.username" = "user"
);
```

### INSERT INTO SELECT Examples

**Insert from S3 with wildcards:**

```sql
INSERT INTO my_table (col1, col2, col3)
SELECT * FROM S3(
"uri" = "s3://my-bucket/data/part-*.parquet",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "parquet"
);
```

## Performance Considerations

### Use Specific Prefixes

Doris extracts the longest non-wildcard prefix from your path pattern to optimize S3/HDFS listing operations. More specific prefixes result in faster file discovery.

```sql
-- Good: specific prefix reduces listing scope
"uri" = "s3://bucket/data/2024/01/15/*.csv"

-- Less optimal: broad wildcard at early path segment
"uri" = "s3://bucket/data/**/file.csv"
```

### Prefer Range Patterns for Known Sequences

When you know the exact file numbering, use range patterns instead of wildcards:

```sql
-- Better: explicit range
"uri" = "s3://bucket/data/part-{0001..0100}.csv"

-- Less optimal: wildcard matches unknown files
"uri" = "s3://bucket/data/part-*.csv"
```

### Avoid Deep Recursive Wildcards

Deep recursive patterns like `**` can cause slow file listing on large buckets:

```sql
-- Avoid when possible
"uri" = "s3://bucket/**/*.csv"

-- Prefer explicit path structure
"uri" = "s3://bucket/data/year=*/month=*/day=*/*.csv"
```

## Troubleshooting

| Issue | Cause | Solution |
|-------|-------|----------|
| No files found | Pattern doesn't match any files | Verify the path and pattern syntax; test with a single file first |
| Slow file listing | Wildcard too broad or too many files | Use more specific prefix; limit wildcard scope |
| Invalid URI error | Malformed path syntax | Check URI scheme and bucket name format |
| Access denied | Credentials or permissions issue | Verify S3/HDFS credentials and bucket policies |

### Testing Path Patterns

Before running a large load job, test your pattern with a limited query:

```sql
-- Test if files exist and match pattern
SELECT * FROM S3(
"uri" = "s3://bucket/your/pattern/*.csv",
...
) LIMIT 1;
```

Use `DESC FUNCTION` to verify the schema of matched files:

```sql
DESC FUNCTION S3(
"uri" = "s3://bucket/your/pattern/*.csv",
...
);
```
4 changes: 2 additions & 2 deletions docs/sql-manual/sql-functions/table-valued-functions/s3.md
@@ -31,7 +31,7 @@ S3(

| Parameter | Description |
|-----------------|---------------------------------------------------------------------------------------------------------------------------------------|
- | `uri` | URI for accessing S3. The function will use either Path Style or Virtual-hosted Style based on the `use_path_style` parameter |
+ | `uri` | URI for accessing S3. Supports wildcards and range patterns; see [File Path Pattern](../../basic-element/file-path-pattern) for details. The function will use either Path Style or Virtual-hosted Style based on the `use_path_style` parameter |
| `s3.access_key` | Access key for S3 |
| `s3.secret_key` | Secret key for S3 |
| `s3.region` | S3 region |
@@ -530,7 +530,7 @@

- **URI with Wildcards**

- URI can use wildcards to read multiple files. Note: When using wildcards, ensure all files have the same format (especially `csv`, `csv_with_names`, `csv_with_names_and_types` are different formats). S3 TVF will use the first file to parse Table Schema.
+ URI can use wildcards and range patterns to read multiple files. For detailed syntax including `*`, `?`, `[...]`, and `{1..10}` range expansion, see [File Path Pattern](../../basic-element/file-path-pattern). Note: When using wildcards, ensure all files share the same format (in particular, `csv`, `csv_with_names`, and `csv_with_names_and_types` are different formats). S3 TVF will use the first file to parse the table schema.

With the following two CSV files:
