Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
52 changes: 49 additions & 3 deletions docs/en/guides/40-load-data/04-transform/03-querying-ndjson.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,24 @@ title: Querying NDJSON Files in Stage
sidebar_label: NDJSON
---

In Databend, you can directly query NDJSON files stored in stages without first loading the data into tables. This approach is particularly useful for data exploration, ETL processing, and ad-hoc analysis scenarios.

## What is NDJSON?

NDJSON (Newline Delimited JSON) is a JSON-based file format where each line contains a complete and valid JSON object. This format is especially well-suited for streaming data processing and big data analytics.

**Example NDJSON file content:**
```json
{"id": 1, "title": "Database Fundamentals", "author": "John Doe", "price": 45.50, "category": "Technology"}
{"id": 2, "title": "Machine Learning in Practice", "author": "Jane Smith", "price": 68.00, "category": "AI"}
{"id": 3, "title": "Web Development Guide", "author": "Mike Johnson", "price": 52.30, "category": "Frontend"}
```

**Advantages of NDJSON:**
- **Stream-friendly**: Can be parsed line by line without loading entire file into memory
- **Big data compatible**: Widely used in log files, data exports, and ETL pipelines
- **Easy to process**: Each line is an independent JSON object, enabling parallel processing

## Query NDJSON Files in Stage

Syntax:
Expand Down Expand Up @@ -57,6 +75,8 @@ CREATE FILE FORMAT ndjson_query_format

### Step 3. Query NDJSON Files

Now you can query the NDJSON files directly from the stage. This example extracts the `title` and `author` fields from each JSON object:

```sql
SELECT $1:title, $1:author
FROM @ndjson_query_stage
Expand All @@ -66,7 +86,15 @@ FROM @ndjson_query_stage
);
```

If the NDJSON files are compressed with gzip, we can use the following query:
**Explanation:**
- `$1:title` and `$1:author`: Extract specific fields from the JSON object. The `$1` represents the entire JSON object as a variant, and `:field_name` accesses individual fields
- `@ndjson_query_stage`: References the external stage created in Step 1
- `FILE_FORMAT => 'ndjson_query_format'`: Uses the custom file format defined in Step 2
- `PATTERN => '.*[.]ndjson'`: Regex pattern that matches all files ending with `.ndjson`

### Querying Compressed Files

If the NDJSON files are compressed with gzip, modify the pattern to match compressed files:

```sql
SELECT $1:title, $1:author
Expand All @@ -76,9 +104,11 @@ FROM @ndjson_query_stage
PATTERN => '.*[.]ndjson[.]gz'
);
```

**Key difference:** The pattern `.*[.]ndjson[.]gz` matches files ending with `.ndjson.gz`. Databend automatically decompresses gzip files during query execution thanks to the `COMPRESSION = AUTO` setting in the file format.
### Query with Metadata

Query NDJSON files directly from a stage, including metadata columns like `METADATA$FILENAME` and `METADATA$FILE_ROW_NUMBER`:
You can also include file metadata in your queries, which is useful for tracking data lineage and debugging:

```sql
SELECT
Expand All @@ -90,4 +120,20 @@ FROM @ndjson_query_stage
FILE_FORMAT => 'ndjson_query_format',
PATTERN => '.*[.]ndjson'
);
```
```

**Metadata columns explained:**
- `METADATA$FILENAME`: Shows which file each row came from - helpful when querying multiple files
- `METADATA$FILE_ROW_NUMBER`: Shows the line number within the source file - useful for tracking specific records

**Use cases:**
- **Data lineage**: Track which source file contributed each record
- **Debugging**: Identify problematic records by file and line number
- **Incremental processing**: Process only specific files or ranges within files

## Related Documentation

- [Loading NDJSON Files](../03-load-semistructured/03-load-ndjson.md) - How to load NDJSON data into tables
- [NDJSON File Format Options](/sql/sql-reference/file-format-options#ndjson-options) - Complete NDJSON format configuration
- [CREATE STAGE](/sql/sql-commands/ddl/stage/ddl-create-stage) - Managing external and internal stages
- [Querying Metadata](./04-querying-metadata.md) - More details about metadata columns
Loading