Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: JSON serializer doesn't produce valid json (table) #409

Closed
OneCyrus opened this issue Sep 27, 2023 · 7 comments
Closed

[BUG]: JSON serializer doesn't produce valid json (table) #409

OneCyrus opened this issue Sep 27, 2023 · 7 comments
Labels
Row API Row API development is frozen, issues marked will be updated when alternatives are stable.

Comments

@OneCyrus
Copy link

Library Version

4.16.4

OS

Windows 11

OS Architecture

64 bit

How to reproduce?

  1. Deserialize multiple rows
using var parquetReader = await ParquetReader.CreateAsync(stream);
var table = await parquetReader.ReadAsTableAsync();

var json = table.ToString("j");
  1. the JSON returned is basically multiple JSON on each line whioch is not valid. The serializer should return an array with comma delimited rows.

instead of this

{ "a": 1, "b": "test" }
{ "a": 2, "b": "test" }
{ "a": 3, "b": "test" }

we should get this

[
  { "a": 1, "b": "test" },
  { "a": 2, "b": "test" },
  { "a": 3, "b": "test" }
]

Failing test

using var parquetReader = await ParquetReader.CreateAsync(stream);
var table = await parquetReader.ReadAsTableAsync();

var json = table.ToString("j");

var myTable = JsonSerializer.Deserialize<dynamic>(json);
@aloneguid
Copy link
Owner

It's actually a feature :) It produces output in json-lines format, more commonly used in big data platforms like Apache Spark. Having one document with massive array does not allow for easy streaming of JSON.

@OneCyrus
Copy link
Author

isn't that supposed to be the serialization of rows? when i serialize a table I would expect to have a representation of the table and not multiple rows.
would be great to have at least an option to switch between the modes.

@aloneguid
Copy link
Owner

You can call ToString() for each row and append those with comma separator.

@OneCyrus
Copy link
Author

that's what we did. basically we did it the same way like you did in the ToString() method but removed the limitation with the level filter.

unfortunately that was just a workaround for deserializing parquet to c# classes. as the parquet deserializer had too many limitations we convert the parquet first to json and use the system.text.json deserializer to get our c# objects.

@aloneguid
Copy link
Owner

The performance of this will be terrible :( What kind of limitations did you encounter with class serializer by the way?

@OneCyrus
Copy link
Author

the most important issues we have are:

  • deserializing unknown struct columns to a dynamic type like JsonElement, object, ... didn't work. It always expects fully typed classes
  • case sensitivity for column to property mapping

@aloneguid
Copy link
Owner

Fair enough, thanks. Regarding the first, it will change in future. I'm planning to unify code for row api and class serializer, which will make row api more stable, but also allow declaring properties of object or List<object> or Dictionary<string, object> types for exactly this use case - mixing strong typing with quack typing.
As for the second, you can mark properties with [JsonPropertyName] attribute. I may add custom resolvers for case-insensitive match, underscore removal and so on.
No promises on timing though.

@aloneguid aloneguid added the Row API Row API development is frozen, issues marked will be updated when alternatives are stable. label Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Row API Row API development is frozen, issues marked will be updated when alternatives are stable.
Projects
None yet
Development

No branches or pull requests

2 participants