Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-33209: [C++] Support for reading JSON Datasets #33732

Merged
merged 12 commits into from
Feb 16, 2023

Conversation

benibus
Copy link
Collaborator

@benibus benibus commented Jan 17, 2023

This adds initial support the JSON file format to the Dataset library. Since there's currently no public API for writing JSON files, this only deals with the reader-side facilities.

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #33209 has been automatically assigned in GitHub to PR creator.

@benibus
Copy link
Collaborator Author

benibus commented Feb 7, 2023

I attempted to implement the newer ScanNode methods. However, I ran into some issues that have prevented feature parity with the legacy API. Primarily, the methods that use ScanOptions are able to prevent loading top-level fields if none of its children are expected to be materialized. However, I haven't figured out a good way to do this using FragmentScanRequest since it only provides FieldPaths (relative to the inspected fragment).

I suppose there's a possibility that the tests in FileFormatScanNodeMixin are expecting too much of the format implementation, though. From what I can tell, CSV is the only other format that implements the new API and even then it doesn't support nested fields - so the relevant tests in that fixture haven't been used anywhere before.

@benibus
Copy link
Collaborator Author

benibus commented Feb 7, 2023

For reference, here is where that test would fail (in FileFormatScanNodeMixin::TestScanProjectedNested), whereas the equivalent test in FileFormatScanMixin currently passes.

@benibus benibus marked this pull request as ready for review February 7, 2023 23:32
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exciting. Sorry you took this on at a time when we have two mirror paths in the file format. Hopefully this will all be getting simplified soon.

cpp/src/arrow/dataset/file_json_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/file_json_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/file_json.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/file_json.cc Show resolved Hide resolved
cpp/src/arrow/dataset/file_json.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/file_json.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/file_json.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/file_json.cc Show resolved Hide resolved
cpp/src/arrow/dataset/file_json.cc Outdated Show resolved Hide resolved
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor nits from a tidy/iwyu pass but mostly clean, thank you.

cpp/src/arrow/dataset/CMakeLists.txt Show resolved Hide resolved
cpp/src/arrow/dataset/file_json.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/file_json.h Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/file_json.cc Show resolved Hide resolved
cpp/src/arrow/dataset/file_json.cc Outdated Show resolved Hide resolved
TEST_P(TestJsonFormatScanNode, ScanProjectedMissingColumns) {
TestScanProjectedMissingCols();
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just scanning through the CSV tests I wonder if there are a few JSON-specific tests that might be nice to have:

  • Confirm that alternate read/parse options (e.g. alternate block size or newlines allowed) get passed in correctly
  • Does the JSON reader handle the the presence of a unicode BOM well? Maybe not as big an issue here where we have to explicitly handle it for the CSV reader.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reader doesn't explicitly handle BOM, but I suspect that rapidjson does, as it's at least referenced in their docs. Either way, I added a test for it (which currently passes).

@benibus
Copy link
Collaborator Author

benibus commented Feb 16, 2023

@felipecrv Feel free to take a look if you find the time.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the addition. A very cool feature.

@westonpace westonpace merged commit 45918a9 into apache:master Feb 16, 2023
@felipecrv
Copy link
Contributor

@benibus sorry, I only noticed this today. Feel free to mention my handle as soon as PR is out of draft status.

@ursabot
Copy link

ursabot commented Feb 17, 2023

Benchmark runs are scheduled for baseline = 9033573 and contender = 45918a9. 45918a9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.28% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.29% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 45918a90 ec2-t3-xlarge-us-east-2
[Finished] 45918a90 test-mac-arm
[Finished] 45918a90 ursa-i9-9960x
[Finished] 45918a90 ursa-thinkcentre-m75q
[Finished] 9033573a ec2-t3-xlarge-us-east-2
[Failed] 9033573a test-mac-arm
[Finished] 9033573a ursa-i9-9960x
[Finished] 9033573a ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@benibus
Copy link
Collaborator Author

benibus commented Feb 17, 2023

@felipecrv No worries! I'll let you in earlier on the next one.

gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023
This adds initial support the JSON file format to the Dataset library. Since there's currently no public API for writing JSON files, this only deals with the reader-side facilities.
* Closes: apache#33209

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
This adds initial support the JSON file format to the Dataset library. Since there's currently no public API for writing JSON files, this only deals with the reader-side facilities.
* Closes: apache#33209

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
kou added a commit that referenced this pull request Jul 3, 2023
### Rationale for this change

Enable Support for reading JSON Datasets #33732 on Java side

### What changes are included in this PR?

Support for reading JSON Datasets

### Are these changes tested?

Unit test added

### Are there any user-facing changes?

No
* Closes: #36421

Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
westonpace pushed a commit to westonpace/arrow that referenced this pull request Jul 7, 2023
…he#36422)

### Rationale for this change

Enable Support for reading JSON Datasets apache#33732 on Java side

### What changes are included in this PR?

Support for reading JSON Datasets

### Are these changes tested?

Unit test added

### Are there any user-facing changes?

No
* Closes: apache#36421

Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Bind the JSON RecordBatchReader to Dataset
4 participants