
feat: Add comprehensive Parquet export support with complex data types#241

Open
BingqingLyu wants to merge 17 commits into alibaba:main from BingqingLyu:parquet-export

Conversation


@BingqingLyu commented Apr 16, 2026

Related Issues

fix #194

What does this PR do?

This PR implements complete Parquet export functionality for NeuG, supporting major data types including complex graph types (Vertex, Edge, Path) with native Arrow StructArray conversion. It includes comprehensive Python test coverage and resolves multiple type conversion bugs.

What changes in this PR?

Core Features

1. Parquet Export Infrastructure (M001)

  • Implemented ArrowParquetExportWriter with writeTable() method
  • Integrated with NeuG's COPY TO command infrastructure
  • Supports streaming export with configurable row group sizes

2. Compression and Performance Options (M002)

  • Added support for multiple compression codecs: SNAPPY, GZIP, ZSTD, LZ4, BROTLI, NONE
  • Configurable row_group_size for memory optimization
  • dictionary_encoding toggle for string column optimization
  • Centralized options parsing in parquet_options.h

3. Complete Type Support (M003)

Primitive Types:

  • INT32, INT64, UINT32, UINT64
  • FLOAT, DOUBLE, BOOLEAN
  • STRING (large_utf8), DATE (Date64), TIMESTAMP (microseconds)
  • INTERVAL (stored as string for Parquet compatibility)

Complex Types:

  • List: Native Arrow ListArray with proper offsets semantics
  • Struct: Native Arrow StructArray for nested data
  • Vertex, Edge, and Path: serialized as JSON strings to avoid schema conflicts when exporting mixed-type graph objects

Key Implementation Details

Type Conversion Architecture

  • Proto-to-Arrow dispatch: Routes based on protobuf Array type (not Arrow type)
  • JSON-to-Arrow conversion: Dispatches on JSON value type (not expected Arrow type)
    • Eliminates complex type conversion logic
    • Prevents type mismatch errors (e.g., Uint64 out of int64 range)
    • Simplified from 114 lines to 63 lines

Buffer Management

  • Proper ownership semantics for Arrow buffers
  • ListArray offsets: AllocateBuffer + memcpy instead of Buffer::Wrap
  • StructArray validity: Using arrow::Buffer::Copy for safety

Python Tests

Added comprehensive test coverage in test_export.py:


- Add feature specification for Parquet Export Support
- Add implementation plan with technical design
- Add tasks organized by 3 modules (P1/P2/P3)

Spec: 006-parquet-export
Modules:
  - M001: Core Parquet Export (P1, 8 tasks)
  - M002: Compression & Performance Options (P2, 6 tasks)
  - M003: Complex Data Type Support (P3, 7 tasks)
- Add ArrowParquetExportWriter class for streaming Parquet writes
- Implement ExportParquetFunction (COPY_PARQUET)
- Register export function in parquet_extension.cpp
- Support default SNAPPY compression and 1M row groups
- Type mapping for INT64, INT32, DOUBLE, STRING, BOOLEAN, DATE, TIMESTAMP

Tasks: F006-T101 to F006-T108
- Fix LogicalTypeID::BOOLEAN -> LogicalTypeID::BOOL
- Fix TimeUnit::MICROSECOND -> TimeUnit::MICRO
- Fix QueryResponse API: columns_size() -> arrays_size(), columns() -> arrays()
- Fix WriterProperties Builder: use pointer syntax instead of dot syntax
- Fix ARROW_ASSIGN_OR_RAISE type mismatch for OutputStream
- Update column name extraction from entry_schema_

Fixes compilation errors reported in CI
- Implement ArrowParquetExportWriter::writeTable() for Parquet file export
- Support type inference from protobuf Array structure (oneof typed_array)
- Handle all primitive types: int32/64, uint32/64, float, double, bool, string, date, timestamp
- Use schema inference from QueryResponse when entry_schema_->columnTypes is unavailable
- Single-call pattern: open file → convert data → write → close (matches JSON/CSV export pattern)
- Add C++ unit tests for basic export, nulls, multiple types, and large datasets
- Add Python integration test for COPY TO Parquet
Implement Date and Timestamp type conversion for Parquet export.
Add type inference framework for List, Vertex, Edge, Path types (placeholder).
@BingqingLyu requested a review from shirly121 April 20, 2026 05:57

shirly121 commented Apr 22, 2026

Also, I see that the current copy_to_parquet implementation goes neug pb -> arrow -> parquet. Lei's PR #270 also implements neug pb -> arrow; could the conversion part be reused?

…ctArray

- Change Vertex/Edge/Path export from StructArray to JSON string format
- Avoids schema conflicts when exporting mixed-type graph objects
- Adds detailed comments explaining the design decision
- Fixes test failures: 'Column expected length X but got length 0'


Development

Successfully merging this pull request may close these issues.

feat: Add Parquet Export Support

2 participants