8 changes: 8 additions & 0 deletions docs/data-operate/import/import-way/broker-load-manual.md
@@ -24,6 +24,12 @@ Supported data sources:
- HDFS protocol
- Custom protocol (requires a broker process)

Supported file path patterns:

- Wildcards: `*`, `?`, `[abc]`, `[a-z]`
- Range expansion: `{1..10}`, `{a,b,c}`
- See [File Path Pattern](../../../sql-manual/basic-element/file-path-pattern) for complete syntax

Supported data types:

- CSV
@@ -558,6 +564,8 @@ Different Broker types and access methods require different authentication information

### Importing data from HDFS using wildcards to match two batches of files and importing them into two separate tables

Broker Load supports wildcards (`*`, `?`, `[...]`) and range patterns (`{1..10}`) in file paths. For detailed syntax, see [File Path Pattern](../../../sql-manual/basic-element/file-path-pattern).

```sql
LOAD LABEL example_db.label2
(
```
4 changes: 3 additions & 1 deletion docs/data-operate/import/import-way/insert-into-manual.md
@@ -312,7 +312,9 @@ The INSERT command is a synchronous command. If it returns a result, that indicates

## Ingest data by TVF

- Doris can directly query and analyze files stored in object storage or HDFS as tables through the Table Value Functions (TVFs), which supports automatic column type inference. For detailed information, please refer to the [Lakehouse/TVF documentation](https://doris.apache.org/docs/3.0/lakehouse/file-analysis).
+ Doris can directly query and analyze files stored in object storage or HDFS as tables through Table Value Functions (TVFs), which support automatic column type inference. For detailed information, please refer to the [Lakehouse/TVF documentation](../../../lakehouse/file-analysis).

+ TVFs support wildcards (`*`, `?`, `[...]`) and range patterns (`{1..10}`) in file paths. For complete syntax, see [File Path Pattern](../../../sql-manual/basic-element/file-path-pattern).
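
As a quick illustration, a range pattern can pull a known sequence of files in a single statement. The following is a minimal sketch modeled on the S3 TVF examples in these docs; the table `sales`, the bucket, and the file names are hypothetical:

```sql
-- Hypothetical table and bucket; credentials elided with placeholders.
INSERT INTO sales
SELECT * FROM S3(
    "uri" = "s3://my-bucket/exports/part_{1..3}.csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv"
);
```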

### Automatic column type inference

20 changes: 7 additions & 13 deletions docs/lakehouse/file-analysis.md
@@ -39,21 +39,15 @@ The attributes of a TVF include the file path to be analyzed, file format, connection information

### Multiple File Import

- When importing, the file path (URI) supports wildcards for matching. Doris file path matching uses the [Glob matching pattern](https://en.wikipedia.org/wiki/Glob_(programming)), and has been extended on this basis to support more flexible file selection methods.
+ The file path (URI) supports wildcards and range patterns for matching multiple files:

- - `file_{1..3}`: Matches files `file_1`, `file_2`, `file_3`
- - `file_{1,3}_{1,2}`: Matches files `file_1_1`, `file_1_2`, `file_3_1`, `file_3_2` (supports mixing with `{n..m}` notation, separated by commas)
- - `file_*`: Matches all files starting with `file_`
- - `*.parquet`: Matches all files with the `.parquet` suffix
- - `tvf_test/*`: Matches all files in the `tvf_test` directory
- - `*test*`: Matches files containing `test` in the filename
+ | Pattern | Example | Matches |
+ |---------|---------|---------|
+ | `*` | `file_*` | All files starting with `file_` |
+ | `{n..m}` | `file_{1..3}` | `file_1`, `file_2`, `file_3` |
+ | `{a,b,c}` | `file_{a,b}` | `file_a`, `file_b` |

- **Notes**
-
- - In the `{1..3}` notation, the order can be reversed, `{3..1}` is also valid.
- - Notations like `file_{-1..2}` and `file_{a..4}` are not supported, as negative numbers or letters cannot be used as enumeration endpoints. However, `file_{1..3,11,a}` is allowed and will match files `file_1`, `file_2`, `file_3`, `file_11`, and `file_a`.
- - Doris tries to import as many files as possible. For paths like `file_{a..b,-1..3,4..5}` that contain incorrect notation, we will match files `file_4` and `file_5`.
- - When using commas with `{1..4,5}`, only numbers are allowed. Expressions like `{1..4,a}` are not supported; in this case, `{a}` will be ignored.
+ For complete syntax including all supported wildcards, range expansion rules, and usage examples, see [File Path Pattern](../sql-manual/basic-element/file-path-pattern).
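
For instance, the range pattern from the table above can be used directly in an S3 TVF URI. A minimal sketch; the bucket and file names are hypothetical:

```sql
-- Matches tvf_test/file_1.csv, file_2.csv, and file_3.csv
SELECT * FROM S3(
    "uri" = "s3://my-bucket/tvf_test/file_{1..3}.csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv"
);
```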


### Automatic Inference of File Column Types
309 changes: 309 additions & 0 deletions docs/sql-manual/basic-element/file-path-pattern.md
@@ -0,0 +1,309 @@
---
{
"title": "File Path Pattern",
"language": "en",
"description": "File path patterns and wildcards supported by Doris for accessing files in remote storage systems like S3, HDFS, and other object storage."
}
---

## Description

When accessing files from remote storage systems (S3, HDFS, and other S3-compatible object storage), Doris supports flexible file path patterns including wildcards and range expressions. This document describes the supported path formats and pattern matching syntax.

These path patterns are supported by:
- [S3 TVF](../sql-functions/table-valued-functions/s3)
- [HDFS TVF](../sql-functions/table-valued-functions/hdfs)
- [Broker Load](../../data-operate/import/import-way/broker-load-manual)
- INSERT INTO SELECT from TVF

## Supported URI Formats

### S3-Style URIs

| Style | Format | Example |
|-------|--------|---------|
| AWS Client Style (Hadoop S3) | `s3://bucket/path/to/file` | `s3://my-bucket/data/file.csv` |
| S3A Style | `s3a://bucket/path/to/file` | `s3a://my-bucket/data/file.csv` |
| S3N Style | `s3n://bucket/path/to/file` | `s3n://my-bucket/data/file.csv` |
| Virtual Host Style | `https://bucket.endpoint/path/to/file` | `https://my-bucket.s3.us-west-1.amazonaws.com/data/file.csv` |
| Path Style | `https://endpoint/bucket/path/to/file` | `https://s3.us-west-1.amazonaws.com/my-bucket/data/file.csv` |
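
Whether path style or virtual-host style is used when connecting is controlled by the S3 TVF's `use_path_style` property (see the S3 TVF parameter table). Below is a minimal sketch for an S3-compatible service; the endpoint and bucket names are placeholders:

```sql
-- Path-style access against a hypothetical S3-compatible endpoint (e.g., MinIO).
SELECT * FROM S3(
    "uri" = "https://s3.example.com/my-bucket/data/file.csv",
    "s3.endpoint" = "s3.example.com",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true"
);
```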

### Other Cloud Storage URIs

| Provider | Scheme | Example |
|----------|--------|---------|
| Alibaba Cloud OSS | `oss://` | `oss://my-bucket/data/file.csv` |
| Tencent Cloud COS | `cos://`, `cosn://` | `cos://my-bucket/data/file.csv` |
| Baidu Cloud BOS | `bos://` | `bos://my-bucket/data/file.csv` |
| Huawei Cloud OBS | `obs://` | `obs://my-bucket/data/file.csv` |
| Google Cloud Storage | `gs://` | `gs://my-bucket/data/file.csv` |
| Azure Blob Storage | `azure://` | `azure://container/data/file.csv` |

### HDFS URIs

| Style | Format | Example |
|-------|--------|---------|
| Standard | `hdfs://namenode:port/path/to/file` | `hdfs://namenode:8020/user/data/file.csv` |
| HA Mode | `hdfs://nameservice/path/to/file` | `hdfs://my-ha-cluster/user/data/file.csv` |
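
In HA mode, the nameservice is resolved through client-side failover properties rather than a single NameNode address. Below is a hedged sketch using the HDFS TVF with standard Hadoop HA settings; the nameservice `my-ha-cluster` and the host names are hypothetical:

```sql
SELECT * FROM HDFS(
    "uri" = "hdfs://my-ha-cluster/user/data/file_*.csv",
    "fs.defaultFS" = "hdfs://my-ha-cluster",
    "hadoop.username" = "user",
    "dfs.nameservices" = "my-ha-cluster",
    "dfs.ha.namenodes.my-ha-cluster" = "nn1,nn2",
    "dfs.namenode.rpc-address.my-ha-cluster.nn1" = "nn1-host:8020",
    "dfs.namenode.rpc-address.my-ha-cluster.nn2" = "nn2-host:8020",
    "dfs.client.failover.proxy.provider.my-ha-cluster" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "format" = "csv"
);
```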

## Wildcard Patterns

Doris uses glob-style pattern matching for file paths. The following wildcards are supported:

### Basic Wildcards

| Pattern | Description | Example | Matches |
|---------|-------------|---------|---------|
| `*` | Matches zero or more characters within a path segment | `*.csv` | `file.csv`, `data.csv`, `a.csv` |
| `?` | Matches exactly one character | `file?.csv` | `file1.csv`, `fileA.csv`, but not `file10.csv` |
| `[abc]` | Matches any single character in brackets | `file[123].csv` | `file1.csv`, `file2.csv`, `file3.csv` |
| `[a-z]` | Matches any single character in the range | `file[a-c].csv` | `filea.csv`, `fileb.csv`, `filec.csv` |
| `[!abc]` | Matches any single character NOT in brackets | `file[!0-9].csv` | `filea.csv`, `fileb.csv`, but not `file1.csv` |
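
Character classes are useful where `*` would be too greedy, such as selecting only single-digit file splits. A minimal sketch with a hypothetical bucket:

```sql
-- Matches file_0.csv through file_9.csv, but not file_10.csv
SELECT * FROM S3(
    "uri" = "s3://my-bucket/data/file_[0-9].csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv"
);
```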

### Range Expansion (Brace Patterns)

Doris supports numeric range expansion using brace patterns `{start..end}`:

| Pattern | Expansion | Matches |
|---------|-----------|---------|
| `{1..3}` | `{1,2,3}` | `1`, `2`, `3` |
| `{01..05}` | `{1,2,3,4,5}` | `1`, `2`, `3`, `4`, `5` (leading zeros are NOT preserved) |
| `{3..1}` | `{1,2,3}` | `1`, `2`, `3` (reverse ranges supported) |
| `{a,b,c}` | `{a,b,c}` | `a`, `b`, `c` (enumeration) |
| `{1..3,5,7..9}` | `{1,2,3,5,7,8,9}` | Mixed ranges and values |

:::caution Note
- Doris tries to match as many files as possible. Invalid parts in brace expressions are silently skipped, and valid parts are still expanded. For example, `file_{a..b,-1..3,4..5}` will match `file_4` and `file_5` (the invalid `a..b` and negative range `-1..3` are skipped, but `4..5` is expanded normally).
- If the entire range is negative (e.g., `{-1..2}`), the range is skipped. If mixed with valid ranges (e.g., `{-1..2,1..3}`), only the valid range `1..3` is expanded.
- When using comma-separated values with ranges, only numbers are allowed. For example, in `{1..4,a}`, the non-numeric `a` will be ignored, resulting in `{1,2,3,4}`.
- Pure enumeration patterns like `{a,b,c}` (without `..` ranges) are passed directly to glob matching and work as expected.
:::

### Combining Patterns

Multiple patterns can be combined in a single path:

```
s3://bucket/data_{1..3}/file_*.csv
```

This matches:
- `s3://bucket/data_1/file_a.csv`
- `s3://bucket/data_1/file_b.csv`
- `s3://bucket/data_2/file_a.csv`
- ... and so on

## Examples

### S3 TVF Examples

**Match all CSV files in a directory:**

```sql
SELECT * FROM S3(
"uri" = "s3://my-bucket/data/*.csv",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "csv"
);
```

**Match files with numeric range:**

```sql
SELECT * FROM S3(
"uri" = "s3://my-bucket/logs/data_{1..10}.csv",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "csv"
);
```

**Match files in date-partitioned directories:**

```sql
SELECT * FROM S3(
"uri" = "s3://my-bucket/logs/year=2024/month=*/day=*/data.parquet",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "parquet"
);
```

:::caution Zero-Padded Directories
For zero-padded directory names like `month=01`, `month=02`, use wildcards (`*`) instead of range patterns. The pattern `{01..12}` expands to `{1,2,...,12}` which won't match `month=01`.
:::

**Match numbered file splits (e.g., Spark output):**

```sql
SELECT * FROM S3(
"uri" = "s3://my-bucket/output/part-{00000..00099}.csv",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "csv"
);
```

### Broker Load Examples

**Load all CSV files matching a pattern:**

```sql
LOAD LABEL db.label_wildcard
(
DATA INFILE("s3://my-bucket/data/file_*.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH S3 (
"provider" = "S3",
"AWS_ENDPOINT" = "s3.us-west-2.amazonaws.com",
"AWS_ACCESS_KEY" = "xxx",
"AWS_SECRET_KEY" = "xxx",
"AWS_REGION" = "us-west-2"
);
```

**Load files using numeric range expansion:**

```sql
LOAD LABEL db.label_range
(
DATA INFILE("s3://my-bucket/exports/data_{1..5}.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH S3 (
"provider" = "S3",
"AWS_ENDPOINT" = "s3.us-west-2.amazonaws.com",
"AWS_ACCESS_KEY" = "xxx",
"AWS_SECRET_KEY" = "xxx",
"AWS_REGION" = "us-west-2"
);
```

**Load from HDFS with wildcards:**

```sql
LOAD LABEL db.label_hdfs_wildcard
(
DATA INFILE("hdfs://namenode:8020/user/data/2024-*/*.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH HDFS (
"fs.defaultFS" = "hdfs://namenode:8020",
"hadoop.username" = "user"
);
```

**Load from HDFS with numeric range:**

```sql
LOAD LABEL db.label_hdfs_range
(
DATA INFILE("hdfs://namenode:8020/data/file_{1..3,5,7..9}.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH HDFS (
"fs.defaultFS" = "hdfs://namenode:8020",
"hadoop.username" = "user"
);
```

### INSERT INTO SELECT Examples

**Insert from S3 with wildcards:**

```sql
INSERT INTO my_table (col1, col2, col3)
SELECT * FROM S3(
"uri" = "s3://my-bucket/data/part-*.parquet",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "parquet"
);
```

## Performance Considerations

### Use Specific Prefixes

Doris extracts the longest non-wildcard prefix from your path pattern to optimize S3/HDFS listing operations. More specific prefixes result in faster file discovery.

```sql
-- Good: specific prefix reduces listing scope
"uri" = "s3://bucket/data/2024/01/15/*.csv"

-- Less optimal: broad wildcard at early path segment
"uri" = "s3://bucket/data/**/file.csv"
```

### Prefer Range Patterns for Known Sequences

When you know the exact file numbering, use range patterns instead of wildcards:

```sql
-- Better: explicit range
"uri" = "s3://bucket/data/part-{0001..0100}.csv"

-- Less optimal: wildcard matches unknown files
"uri" = "s3://bucket/data/part-*.csv"
```

### Avoid Deep Recursive Wildcards

Deep recursive patterns like `**` can cause slow file listing on large buckets:

```sql
-- Avoid when possible
"uri" = "s3://bucket/**/*.csv"

-- Prefer explicit path structure
"uri" = "s3://bucket/data/year=*/month=*/day=*/*.csv"
```

## Troubleshooting

| Issue | Cause | Solution |
|-------|-------|----------|
| No files found | Pattern doesn't match any files | Verify the path and pattern syntax; test with a single file first |
| Slow file listing | Wildcard too broad or too many files | Use more specific prefix; limit wildcard scope |
| Invalid URI error | Malformed path syntax | Check URI scheme and bucket name format |
| Access denied | Credentials or permissions issue | Verify S3/HDFS credentials and bucket policies |

### Testing Path Patterns

Before running a large load job, test your pattern with a limited query:

```sql
-- Test if files exist and match pattern
SELECT * FROM S3(
"uri" = "s3://bucket/your/pattern/*.csv",
...
) LIMIT 1;
```

Use `DESC FUNCTION` to verify the schema of matched files:

```sql
DESC FUNCTION S3(
"uri" = "s3://bucket/your/pattern/*.csv",
...
);
```
4 changes: 2 additions & 2 deletions docs/sql-manual/sql-functions/table-valued-functions/s3.md
@@ -31,7 +31,7 @@ S3(

| Parameter | Description |
|-----------------|---------------------------------------------------------------------------------------------------------------------------------------|
- | `uri` | URI for accessing S3. The function will use either Path Style or Virtual-hosted Style based on the `use_path_style` parameter |
+ | `uri` | URI for accessing S3. Supports wildcards and range patterns; see [File Path Pattern](../../basic-element/file-path-pattern) for details. The function will use either Path Style or Virtual-hosted Style based on the `use_path_style` parameter |
| `s3.access_key` | Access key for S3 |
| `s3.secret_key` | Secret key for S3 |
| `s3.region` | S3 region |
@@ -530,7 +530,7 @@

- **URI with Wildcards**

- URI can use wildcards to read multiple files. Note: When using wildcards, ensure all files have the same format (especially `csv`, `csv_with_names`, `csv_with_names_and_types` are different formats). S3 TVF will use the first file to parse Table Schema.
+ URI can use wildcards and range patterns to read multiple files. For detailed syntax including `*`, `?`, `[...]`, and `{1..10}` range expansion, see [File Path Pattern](../../basic-element/file-path-pattern). Note: When using wildcards, ensure all files share the same format (in particular, `csv`, `csv_with_names`, and `csv_with_names_and_types` are different formats). S3 TVF will use the first file to parse the table schema.

With the following two CSV files:
