feat: Add comprehensive Parquet export support with complex data types #241
Open
BingqingLyu wants to merge 17 commits into alibaba:main from
598bdbb to c91751e
- Add feature specification for Parquet Export Support
- Add implementation plan with technical design
- Add tasks organized by 3 modules (P1/P2/P3)

Spec: 006-parquet-export
Modules:
- M001: Core Parquet Export (P1, 8 tasks)
- M002: Compression & Performance Options (P2, 6 tasks)
- M003: Complex Data Type Support (P3, 7 tasks)

- Add ArrowParquetExportWriter class for streaming Parquet writes
- Implement ExportParquetFunction (COPY_PARQUET)
- Register export function in parquet_extension.cpp
- Support default SNAPPY compression and 1M row groups
- Type mapping for INT64, INT32, DOUBLE, STRING, BOOLEAN, DATE, TIMESTAMP

Tasks: F006-T101 to F006-T108
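To make the "1M row groups" default concrete, here is a minimal sketch of how a streaming writer might split a result set into row groups. It is illustrative only: `plan_row_groups` and `DEFAULT_ROW_GROUP_SIZE` are hypothetical names, not identifiers from this PR, and the actual writer delegates row-group handling to Arrow's Parquet writer.

```python
from typing import Iterator, Tuple

# Assumption: mirrors the commit's stated default of 1M rows per row group.
DEFAULT_ROW_GROUP_SIZE = 1_000_000


def plan_row_groups(
    total_rows: int, row_group_size: int = DEFAULT_ROW_GROUP_SIZE
) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) slices that cover total_rows in order.

    Every slice holds row_group_size rows except possibly the last one.
    """
    if row_group_size <= 0:
        raise ValueError("row_group_size must be positive")
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    offset = 0
    while offset < total_rows:
        length = min(row_group_size, total_rows - offset)
        yield offset, length
        offset += length
```

A smaller row group size bounds how many rows are buffered before a flush, at the cost of more row-group metadata in the file.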
- Fix LogicalTypeID::BOOLEAN -> LogicalTypeID::BOOL
- Fix TimeUnit::MICROSECOND -> TimeUnit::MICRO
- Fix QueryResponse API: columns_size() -> arrays_size(), columns() -> arrays()
- Fix WriterProperties Builder: use pointer syntax instead of dot syntax
- Fix ARROW_ASSIGN_OR_RAISE type mismatch for OutputStream
- Update column name extraction from entry_schema_

Fixes compilation errors reported in CI
- Implement ArrowParquetExportWriter::writeTable() for Parquet file export
- Support type inference from protobuf Array structure (oneof typed_array)
- Handle all primitive types: int32/64, uint32/64, float, double, bool, string, date, timestamp
- Use schema inference from QueryResponse when entry_schema_->columnTypes is unavailable
- Single-call pattern: open file → convert data → write → close (matches JSON/CSV export pattern)
- Add C++ unit tests for basic export, nulls, multiple types, and large datasets
- Add Python integration test for COPY TO Parquet
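The schema-inference fallback described above can be sketched as follows. This is a simplified stand-in, not the PR's C++ code: `FakeArray`, its field names, and the type labels are all hypothetical, and a plain dataclass plays the role of the protobuf Array message whose `oneof typed_array` has exactly one member set.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class FakeArray:
    """Hypothetical stand-in for the protobuf Array message.

    Exactly one field is expected to be set, imitating a oneof.
    A None inside a value list represents a null cell.
    """
    int64_values: Optional[List[Optional[int]]] = None
    double_values: Optional[List[Optional[float]]] = None
    bool_values: Optional[List[Optional[bool]]] = None
    string_values: Optional[List[Optional[str]]] = None


# Assumed mapping from the populated oneof member to an export type label.
_FIELD_TO_TYPE = {
    "int64_values": "INT64",
    "double_values": "DOUBLE",
    "bool_values": "BOOL",
    "string_values": "STRING",
}


def infer_column_type(arr: FakeArray) -> str:
    """Return the type label for whichever oneof member is populated."""
    populated = [name for name in _FIELD_TO_TYPE if getattr(arr, name) is not None]
    if len(populated) != 1:
        raise ValueError(f"expected exactly one typed array, found {len(populated)}")
    return _FIELD_TO_TYPE[populated[0]]


def infer_schema(names: Sequence[str], arrays: Sequence[FakeArray]) -> List[Tuple[str, str]]:
    """Pair each column name with its inferred type, used when no declared types exist."""
    if len(names) != len(arrays):
        raise ValueError("column name count does not match array count")
    return [(name, infer_column_type(arr)) for name, arr in zip(names, arrays)]
```

With real protobuf messages in Python, the populated member would normally be read with `WhichOneof("typed_array")` rather than by probing attributes.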
…ng export options
Implement Date and Timestamp type conversion for Parquet export. Add type inference framework for List, Vertex, Edge, Path types (placeholder).
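As a rough illustration of the Date and Timestamp conversion, Parquet's DATE logical type stores days since the Unix epoch as a 32-bit integer, and a microsecond TIMESTAMP stores microseconds since the epoch as a 64-bit integer. The helper names below are hypothetical, and treating naive datetimes as UTC is an assumption of this sketch, not something the PR states.

```python
from datetime import date, datetime, timezone

_EPOCH_DATE = date(1970, 1, 1)
_EPOCH_DATETIME = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_days(d: date) -> int:
    """Days since 1970-01-01; negative for earlier dates."""
    return (d - _EPOCH_DATE).days


def timestamp_to_micros(ts: datetime) -> int:
    """Microseconds since the Unix epoch, computed with integer arithmetic."""
    if ts.tzinfo is None:
        # Assumption for this sketch: naive datetimes are interpreted as UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    delta = ts - _EPOCH_DATETIME
    # Avoid total_seconds(), which goes through float and can lose precision.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
```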
15feb8c to 6aab51a
Collaborator
Also, I see that copy_to_parquet is currently implemented as neug pb -> arrow -> parquet. Lei's PR #270 also implements neug pb -> arrow. Could the conversion part be reused?
…ctArray
- Change Vertex/Edge/Path export from StructArray to JSON string format
- Avoids schema conflicts when exporting mixed-type graph objects
- Adds detailed comments explaining the design decision
- Fixes test failures: 'Column expected length X but got length 0'
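The design choice in this commit can be sketched in a few lines: instead of forcing every vertex, edge, or path into one shared struct schema, each object is serialized to a JSON string, so objects with different property sets fit in a single string column. The function names and the dict layout (`id`, `label`, `properties`) are hypothetical, chosen for illustration only.

```python
import json
from typing import Any, Dict, List, Optional


def graph_object_to_json(obj: Optional[Dict[str, Any]]) -> Optional[str]:
    """Serialize one graph object to a compact JSON string; None stays null."""
    if obj is None:
        return None
    # sort_keys gives a stable key order regardless of how the dict was built.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def column_to_json_strings(column: List[Optional[Dict[str, Any]]]) -> List[Optional[str]]:
    """Convert a column of graph objects into a same-length column of strings."""
    return [graph_object_to_json(obj) for obj in column]
```

Because the output always has one entry per input row, including nulls, the column length matches the row count, which is the invariant behind the 'Column expected length X but got length 0' failure mentioned above.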
Related Issues
fix #194
What does this PR do?
This PR implements complete Parquet export functionality for NeuG, supporting major data types including complex graph types (Vertex, Edge, Path) with native Arrow StructArray conversion. It includes comprehensive Python test coverage and resolves multiple type conversion bugs.
What changes in this PR?
Core Features
1. Parquet Export Infrastructure (M001)
- ArrowParquetExportWriter with writeTable() method
- COPY TO command infrastructure
2. Compression and Performance Options (M002)
- row_group_size for memory optimization
- dictionary_encoding toggle for string column optimization
- parquet_options.h
3. Complete Type Support (M003)
Primitive Types:
Complex Types:
Key Implementation Details
Type Conversion Architecture
Buffer Management
- AllocateBuffer + memcpy instead of Buffer::Wrap
- arrow::Buffer::Copy for safety
Python Tests
Added comprehensive test coverage in test_export.py: