
feat: Add comprehensive Parquet export support with complex data types#241

Open
BingqingLyu wants to merge 17 commits into alibaba:main from BingqingLyu:parquet-export

Conversation


@BingqingLyu commented Apr 16, 2026

Related Issues

fix #194

What does this PR do?

This PR implements complete Parquet export functionality for NeuG, supporting major data types including complex graph types (Vertex, Edge, Path) with native Arrow StructArray conversion. It includes comprehensive Python test coverage and resolves multiple type conversion bugs.

What changes in this PR?

Core Features

1. Parquet Export Infrastructure (M001)

  • Implemented ArrowParquetExportWriter with writeTable() method
  • Integrated with NeuG's COPY TO command infrastructure
  • Supports streaming export with configurable row group sizes

2. Compression and Performance Options (M002)

  • Added support for multiple compression codecs: SNAPPY, GZIP, ZSTD, LZ4, BROTLI, NONE
  • Configurable row_group_size for memory optimization
  • dictionary_encoding toggle for string column optimization
  • Centralized options parsing in parquet_options.h

3. Complete Type Support (M003)

Primitive Types:

  • INT32, INT64, UINT32, UINT64
  • FLOAT, DOUBLE, BOOLEAN
  • STRING (large_utf8), DATE (Date64), TIMESTAMP (microseconds)
  • INTERVAL (stored as string for Parquet compatibility)

Complex Types:

  • List: Native Arrow ListArray with proper offsets semantics
  • Struct: Native Arrow StructArray for nested data
  • Vertex, Edge, and Path: serialized as JSON strings to avoid schema conflicts when exporting mixed-type graph objects

Key Implementation Details

Type Conversion Architecture

  • Proto-to-Arrow dispatch: Routes based on protobuf Array type (not Arrow type)
  • JSON-to-Arrow conversion: Dispatches on JSON value type (not expected Arrow type)
    • Eliminates complex type conversion logic
    • Prevents type mismatch errors (e.g., Uint64 out of int64 range)
    • Simplified from 114 lines to 63 lines

Buffer Management

  • Proper ownership semantics for Arrow buffers
  • ListArray offsets: AllocateBuffer + memcpy instead of Buffer::Wrap
  • StructArray validity: Using arrow::Buffer::Copy for safety

Python Tests

Added comprehensive test coverage in test_export.py:


- Add feature specification for Parquet Export Support
- Add implementation plan with technical design
- Add tasks organized by 3 modules (P1/P2/P3)

Spec: 006-parquet-export
Modules:
  - M001: Core Parquet Export (P1, 8 tasks)
  - M002: Compression & Performance Options (P2, 6 tasks)
  - M003: Complex Data Type Support (P3, 7 tasks)
- Add ArrowParquetExportWriter class for streaming Parquet writes
- Implement ExportParquetFunction (COPY_PARQUET)
- Register export function in parquet_extension.cpp
- Support default SNAPPY compression and 1M row groups
- Type mapping for INT64, INT32, DOUBLE, STRING, BOOLEAN, DATE, TIMESTAMP

Tasks: F006-T101 to F006-T108
- Fix LogicalTypeID::BOOLEAN -> LogicalTypeID::BOOL
- Fix TimeUnit::MICROSECOND -> TimeUnit::MICRO
- Fix QueryResponse API: columns_size() -> arrays_size(), columns() -> arrays()
- Fix WriterProperties Builder: use pointer syntax instead of dot syntax
- Fix ARROW_ASSIGN_OR_RAISE type mismatch for OutputStream
- Update column name extraction from entry_schema_

Fixes compilation errors reported in CI
- Implement ArrowParquetExportWriter::writeTable() for Parquet file export
- Support type inference from protobuf Array structure (oneof typed_array)
- Handle all primitive types: int32/64, uint32/64, float, double, bool, string, date, timestamp
- Use schema inference from QueryResponse when entry_schema_->columnTypes is unavailable
- Single-call pattern: open file → convert data → write → close (matches JSON/CSV export pattern)
- Add C++ unit tests for basic export, nulls, multiple types, and large datasets
- Add Python integration test for COPY TO Parquet
Implement Date and Timestamp type conversion for Parquet export.
Add type inference framework for List, Vertex, Edge, Path types (placeholder).
@BingqingLyu requested a review from shirly121 April 20, 2026 05:57

shirly121 commented Apr 22, 2026

Also, I see that the current copy_to_parquet implementation goes neug pb -> arrow -> parquet. Lei's PR #270 also implements neug pb -> arrow; could the conversion part be reused?

…ctArray

- Change Vertex/Edge/Path export from StructArray to JSON string format
- Avoids schema conflicts when exporting mixed-type graph objects
- Adds detailed comments explaining the design decision
- Fixes test failures: 'Column expected length X but got length 0'


Development

Successfully merging this pull request may close these issues.

feat: Add Parquet Export Support

2 participants