Adds Apache.Arrow.Serialization with a Roslyn incremental source generator that emits compile-time Arrow schema derivation, serialization, and deserialization for types marked with [ArrowSerializable].
- Runtime library (Apache.Arrow.Serialization): attributes, helpers, IPC extension methods, reflection-based RecordBatchBuilder
- Source generator (Apache.Arrow.Serialization.Generator): code emission for 31+ type mappings, polymorphism, custom converters, callbacks
- Test suite: 197 tests covering all supported types and features
- Integrated into solution, central package management, Apache 2.0 headers
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Align with upstream CI which uses .NET 8.0 SDK. All 197 tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove src/test solution folders from sln so serialization projects appear at root level like all other projects. Change serialization library target from net8.0;net10.0 to net8.0 for CI compatibility.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sorry, I created a merge conflict by checking in the Parquet variant projects :(.
Pull request overview
This PR adds a new POCO serialization subsystem for Apache Arrow in .NET, resolving issue #186. It provides two serialization paths: a source-generator-based AOT-safe approach using [ArrowSerializable] attributes and a reflection-based RecordBatchBuilder for anonymous types/prototyping.
Changes:
- New `Apache.Arrow.Serialization` runtime library with attributes, interfaces (`IArrowSerializer<T>`, `IArrowConverter<T>`), helper classes, reflection-based `RecordBatchBuilder`, and extension methods for Arrow IPC serialization
- New `Apache.Arrow.Serialization.Generator` Roslyn incremental source generator that emits schema derivation, serialization, and deserialization code, including polymorphic type support and JSON schema emission
- Comprehensive test suite covering primitives, collections, nested types, enums, polymorphism, custom converters, callbacks, datetime types, diagnostics, and the reflection-based builder
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `src/Apache.Arrow.Serialization/Attributes.cs` | Defines all serialization attributes (ArrowSerializable, ArrowField, ArrowType, ArrowIgnore, ArrowMetadata, ArrowPolymorphic, ArrowDerivedType) and callback interface |
| `src/Apache.Arrow.Serialization/IArrowSerializer.cs` | `IArrowSerializer<T>` interface with static abstract members and `IArrowConverter<T>` for custom converters |
| `src/Apache.Arrow.Serialization/ArrowArrayHelper.cs` | Utility methods for building null arrays, Guid/TimeOnly/TimeSpan/Decimal arrays, and DateTime normalization |
| `src/Apache.Arrow.Serialization/ArrowSerializerExtensions.cs` | Extension methods for IPC byte/stream serialization and collection convenience methods |
| `src/Apache.Arrow.Serialization/RecordBatchBuilder.cs` | Reflection-based serializer for anonymous types and non-attributed objects |
| `src/Apache.Arrow.Serialization/README.md` | Comprehensive documentation covering all features |
| `src/Apache.Arrow.Serialization/Apache.Arrow.Serialization.csproj` | Runtime library project (net8.0) |
| `src/Apache.Arrow.Serialization.Generator/ArrowSerializerGenerator.cs` | Main incremental generator: type analysis, diagnostics, and orchestration |
| `src/Apache.Arrow.Serialization.Generator/CodeEmitter.cs` | Emits serialization/deserialization code for [ArrowSerializable] types |
| `src/Apache.Arrow.Serialization.Generator/PolymorphicCodeEmitter.cs` | Emits code for [ArrowPolymorphic] type hierarchies |
| `src/Apache.Arrow.Serialization.Generator/JsonSchemaEmitter.cs` | Emits optional JSON schema descriptors |
| `src/Apache.Arrow.Serialization.Generator/Models.cs` | Internal model classes for the generator pipeline |
| `src/Apache.Arrow.Serialization.Generator/Apache.Arrow.Serialization.Generator.csproj` | Generator project (netstandard2.0) |
| `test/Apache.Arrow.Serialization.Tests/SerializationTests.cs` | Tests for source-generated serialization round-trips |
| `test/Apache.Arrow.Serialization.Tests/RecordBatchBuilderTests.cs` | Tests for reflection-based builder |
| `test/Apache.Arrow.Serialization.Tests/DiagnosticTests.cs` | Tests for generator diagnostic reporting |
| `test/Apache.Arrow.Serialization.Tests/TestTypes.cs` | Shared test type definitions |
| `test/Apache.Arrow.Serialization.Tests/Apache.Arrow.Serialization.Tests.csproj` | Test project |
| `Directory.Packages.props` | Adds CodeAnalysis package versions |
| `Apache.Arrow.sln` | Adds new projects to solution |
<PackageVersion Include="Microsoft.CodeAnalysis.Analyzers" Version="3.3.4" />
<PackageVersion Include="Microsoft.CodeAnalysis.CSharp" Version="4.11.0" />
<PackageVersion Include="Microsoft.Bcl.AsyncInterfaces" Version="8.0.0" />

public void Append(object? value)
{
if (value is null) _b.AppendNull();
else _b.Append(new DateTimeOffset((DateTime)value, TimeSpan.Zero));

// For null slots, we need a stand-in value (first non-null item)
object? standIn = _items.FirstOrDefault(v => v is not null);
foreach (var item in _items)
typedList.Add(item ?? standIn!);

Line($"if ({access} is {{ }} v_{index}) {{ bld_{index}_idx.Append((short)bld_{index}_dict.Count); bld_{index}_dict.Add(v_{index}.ToString()); }} else bld_{index}_idx.AppendNull();");
else
{
Line($"bld_{index}_idx.Append((short)bld_{index}_dict.Count);");
Line($"bld_{index}_dict.Add({access}.ToString());");
}

// fall back to AppendNull for now; these are less common in polymorphic scenarios
Line($"bld_{index}.AppendNull(); // TODO: complex type {prop.Type.Kind}");

break;
}
default:
Line($"object? prop_{propIndex} = null; // TODO: unsupported type {prop.Type.Kind}");

// Remove trailing newline, add comma
sb.Length -= sb.ToString().EndsWith("\r\n") ? 2 : 1;

<TargetFramework>net8.0</TargetFramework>
<Nullable>enable</Nullable>
<ImplicitUsings>enable</ImplicitUsings>
<Description>Source-generated Apache Arrow serialization for .NET. Provides [ArrowSerializable] attribute and IArrowSerializer<T> interface for compile-time Arrow schema derivation, serialization, and deserialization.</Description>
I think we don't need to worry about net6.0 as it's out of support and we'll probably remove it as a build target after the next release. The inability to use with net472 or netstandard2.0 is a greater loss and it might be worth a quick test to see how hard it would be to add support for those.
Thanks for the review! I'll take a look at your comments over the next few days and follow up.
Thanks, this is great and has long been missing.
Can you please fix the merge conflict and the white space that our linter doesn't like? I think the Documentation task failure can be addressed by editing ci/scripts/docs.sh and doing something like
pushd "${source_dir}/src/Apache.Arrow.Serialization"
dotnet build -c Release
popd
before trying to build the documentation.
We should probably also add validation for the release by adding something to dev/release/verify_rc.sh like
reference_package "Apache.Arrow.Serialization" "Apache.Arrow.Serialization.Tests"
but I'm not sure the reference_package will handle the PrivateAssets="all" in the project file. If it doesn't, we could consider figuring that out later after the bulk of the change is checked in.
namespace Apache.Arrow.Serialization.Tests;

public class DiagnosticTests

Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
Debug|x64 = Debug|x64
It would be nice to avoid adding all these targets. Is there something bitness-specific in these changes?
{
var list = items as IReadOnlyList<T> ?? items.ToList();
if (list.Count == 0)
throw new ArgumentException("Cannot infer schema from empty collection.", nameof(items));
Is this really true though? We use the type to infer the schema, not the data. It would be annoying for someone to have to special case an empty list if they want to serialize it.
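To illustrate the point, here is a minimal sketch of deriving a schema from the element type alone. It is not the PR's `RecordBatchBuilder` code: the `Point` record, the `TypeDrivenSchema` class, and the three-type mapping in `ToArrowType` are hypothetical names invented for this example. Only `Schema.Builder`, `Field`, and the `*Type.Default` singletons come from Apache.Arrow.

```csharp
using System;
using System.Reflection;
using Apache.Arrow;
using Apache.Arrow.Types;

// Hypothetical row type used only for this illustration.
public sealed record Point(int X, double Y, string Label);

public static class TypeDrivenSchema
{
    // Hypothetical helper: maps a handful of CLR types to Arrow types.
    private static IArrowType ToArrowType(Type clrType)
    {
        if (clrType == typeof(int)) return Int32Type.Default;
        if (clrType == typeof(double)) return DoubleType.Default;
        if (clrType == typeof(string)) return StringType.Default;
        throw new NotSupportedException($"No Arrow mapping sketched for {clrType}.");
    }

    // Builds the schema from typeof(T) only; no instance of T is inspected,
    // so an empty input collection does not need to be a special case.
    public static Schema Infer<T>()
    {
        var schema = new Schema.Builder();
        foreach (PropertyInfo prop in
                 typeof(T).GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            // Assumption for this sketch: reference types are nullable, value types are not.
            bool nullable = !prop.PropertyType.IsValueType;
            schema.Field(new Field(prop.Name, ToArrowType(prop.PropertyType), nullable));
        }
        return schema.Build();
    }
}
```

Calling `TypeDrivenSchema.Infer<Point>()` yields a three-field schema without ever constructing a `Point`. The same holds for anonymous types, since `T` is still inferred by the compiler when an empty `IEnumerable<T>` is passed in.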
builders.Add(CreateColumnBuilder(propType, arrowType));
}

var schema = new Schema.Builder();
Consider moving schema above the foreach and adding the fields directly into the schema builder instead of a temporary list.
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
<TargetFramework>net8.0</TargetFramework>
What would it take to make this work for .NET 4.7.2? Is that even plausible?
| `float` | `Float32` | |
| `double` | `Float64` | |
| `Half` | `Float16` | |
| `decimal` | `Decimal128(38, 18)` | Configurable via `[ArrowType("decimal128(28, 10)")]` |
It might be worth pointing out in the documentation that a CLR decimal is not a perfect match for an Arrow decimal.
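The mismatch can be made concrete. A CLR `decimal` is a 96-bit integer with a per-value scale of 0 to 28, whereas an Arrow decimal column fixes one precision and one scale for every value. The sketch below checks whether a given `decimal` fits a given Arrow precision and scale without rounding or overflow. `DecimalFit.Fits` is a hypothetical helper written for this example, not part of the PR or of Apache.Arrow.

```csharp
using System;
using System.Numerics;

public static class DecimalFit
{
    // Returns true when `value` can be stored exactly in an Arrow
    // decimal(precision, scale) column. Hypothetical helper for illustration.
    public static bool Fits(decimal value, int precision, int scale)
    {
        int[] bits = decimal.GetBits(value);
        int clrScale = (bits[3] >> 16) & 0xFF;   // bits 16-23 hold the CLR scale

        // Reassemble the unsigned 96-bit magnitude (sign is irrelevant here).
        BigInteger unscaled =
            (new BigInteger(unchecked((uint)bits[2])) << 64) |
            (new BigInteger(unchecked((uint)bits[1])) << 32) |
            new BigInteger(unchecked((uint)bits[0]));

        if (clrScale > scale)
        {
            // Dropping fractional digits is only exact if they are all zero.
            BigInteger divisor = BigInteger.Pow(10, clrScale - scale);
            if (unscaled % divisor != 0) return false;
            unscaled /= divisor;
        }
        else
        {
            // Rescale up to the column's fixed scale.
            unscaled *= BigInteger.Pow(10, scale - clrScale);
        }

        // The rescaled magnitude must have at most `precision` digits.
        return unscaled < BigInteger.Pow(10, precision);
    }
}
```

With the default mapping of `Decimal128(38, 18)`, `Fits(1.5m, 38, 18)` is true, but `Fits(decimal.MaxValue, 38, 18)` is false because 29 integer digits plus 18 fractional digits exceed 38, and `Fits(1e-28m, 38, 18)` is false because the value needs 28 fractional digits. `Fits(decimal.MaxValue, 38, 0)` is true.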
namespace Apache.Arrow.Serialization.Generator
{
internal enum TypeKind2
Consider a more descriptive name. What about ArrowTypeKind?
Resolves #186.
I needed Arrow POCO serialization for an internal cross-language interop project (C# ↔ vgi-rpc-python). I took inspiration from System.Text.Json's source generator and MessagePack-CSharp's attribute model, iterated on it with Claude as a coding assistant, and arrived at this implementation.
Figured it might be useful upstream. Please take a look and let me know what you think.
See README.md for full documentation and examples.