ARROW-4708: [C++] refactoring JSON parser to prepare for multithreaded impl #4148

Closed
wants to merge 14 commits into from

Member

@bkietz bkietz commented Apr 12, 2019

  • don't use in-situ parsing
  • remove UnsafeStringBuilder (which was only useful when parsing in-situ)
  • don't require null termination of parsed buffers
  • resize parser's scalar storage when it might overflow
  • rewrite chunker to use Buffer instead of string_view
    to represent memory ranges
  • iwyu + lint cleanup
  • add test for parsing JSON with a partial schema
  • add test for inferring timestamp type in an unexpected
    field of strings
  • refactor rapidjson defines into a single header
  • correct SSE detection
  • disable MSVC conversion errors (to match GCC and Clang options)
  • allow ArrayFromJSON to parse timestamps from strings
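To illustrate two of the points above (representing memory ranges explicitly and not relying on null termination), here is a minimal sketch of splitting a block of newline-delimited JSON at its last newline. `ByteRange`, `ChunkResult`, and `SplitAtLastNewline` are hypothetical names for illustration, not the types or functions in this PR, and the sketch assumes each JSON value sits on a single line.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a buffer slice: a pointer plus an explicit length.
// Nothing here assumes a trailing '\0'.
struct ByteRange {
  const char* data;
  std::size_t size;
};

// Result of chunking one block: `complete` holds whole lines, `partial` holds
// the trailing bytes that would be prepended to the next block.
struct ChunkResult {
  ByteRange complete;
  ByteRange partial;
};

// Split `block` just after its last newline.
// Assumption: newlines only occur between JSON values, never inside one.
inline ChunkResult SplitAtLastNewline(ByteRange block) {
  assert(block.data != nullptr || block.size == 0);
  std::size_t end = block.size;
  // Scan backwards using only the explicit length; data[size] is never read.
  while (end > 0 && block.data[end - 1] != '\n') {
    --end;
  }
  return {{block.data, end}, {block.data + end, block.size - end}};
}
```

Because the range carries its own length, the same function works on a slice in the middle of a larger allocation, which is the situation a multithreaded reader would be in when handing blocks to workers.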

Member Author

bkietz commented Apr 12, 2019

@pitrou I have removed the converter/reader changes; this PR now only affects the chunker, parser, and utilities.

@bkietz bkietz closed this Apr 12, 2019
@bkietz bkietz reopened this Apr 12, 2019

codecov-io commented Apr 13, 2019

Codecov Report

Merging #4148 into master will increase coverage by 1.31%.
The diff coverage is 87.76%.

@@            Coverage Diff             @@
##           master    #4148      +/-   ##
==========================================
+ Coverage   87.85%   89.16%   +1.31%     
==========================================
  Files         748      617     -131     
  Lines       91842    81857    -9985     
  Branches     1251        0    -1251     
==========================================
- Hits        80687    72988    -7699     
+ Misses      11036     8869    -2167     
+ Partials      119        0     -119
Impacted Files                           Coverage Δ
cpp/src/arrow/json/chunker.h             100% <ø> (ø) ⬆️
cpp/src/arrow/type_traits.h              92.95% <ø> (ø) ⬆️
cpp/src/arrow/util/sse-util.h            100% <ø> (ø) ⬆️
cpp/src/arrow/json/options.h             50% <ø> (ø) ⬆️
cpp/src/arrow/json/reader.cc             0.64% <0%> (-71.62%) ⬇️
cpp/src/arrow/util/parsing.h             95.91% <100%> (+0.24%) ⬆️
cpp/src/arrow/ipc/json-simple-test.cc    100% <100%> (ø) ⬆️
cpp/src/arrow/json/parser-test.cc        98.16% <100%> (-0.44%) ⬇️
cpp/src/arrow/buffer.h                   100% <100%> (ø) ⬆️
cpp/src/arrow/json/test-common.h         100% <100%> (ø) ⬆️
... and 180 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8537420...673ae34.

Member

@pitrou pitrou left a comment

Thanks for doing this. This looks good in principle. A number of small comments below.

cpp/src/arrow/ipc/json-simple-test.cc (resolved)
cpp/src/arrow/ipc/json-simple.cc (outdated, resolved)
cpp/src/arrow/json/reader.h (outdated, resolved)
cpp/src/arrow/json/parser.cc (outdated, resolved)
cpp/src/arrow/json/parser.cc (resolved)
cpp/src/arrow/json/parser.cc (outdated, resolved)
cpp/src/arrow/json/chunker.h (outdated, resolved)
cpp/src/arrow/json/chunker.cc (outdated, resolved)
cpp/src/arrow/json/chunker-test.cc (outdated, resolved)
cpp/src/arrow/json/chunker-test.cc (resolved)
Member Author

bkietz commented Apr 16, 2019

@pitrou how's this?

Member

@pitrou pitrou left a comment

LGTM. Just two nits.

cpp/src/arrow/util/sse-util.h (resolved)
}

// enable SIMD whitespace skipping, if available
#if defined(__SSE4_2__)
Member

Should use ARROW_HAVE_SSE4_2 from sse-util.h instead.
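A rough sketch of the pattern this comment asks for: detect the instruction set once, behind a single project-level macro, and have other headers test that macro rather than the compiler's own. `EXAMPLE_HAVE_SSE4_2` and `kSimdWhitespaceSkipping` are hypothetical names, and the detection rule shown is an assumption, not the contents of sse-util.h. `RAPIDJSON_SSE42` is the macro rapidjson checks to enable its SSE4.2 code paths.

```cpp
// Central detection header (hypothetical stand-in for what sse-util.h does).
// Assumption: only the GCC/Clang macro is checked here; a real header may
// also need compiler-specific logic.
#if defined(__SSE4_2__)
#define EXAMPLE_HAVE_SSE4_2 1
#endif

// Consumer header: test the project macro, not the compiler macro, so that
// any fix to detection is picked up everywhere at once.
// enable SIMD whitespace skipping, if available
#if defined(EXAMPLE_HAVE_SSE4_2)
#define RAPIDJSON_SSE42
#endif

// Compile-time flag recording which path was selected (illustration only).
constexpr bool kSimdWhitespaceSkipping =
#if defined(RAPIDJSON_SSE42)
    true;
#else
    false;
#endif
```

The benefit is that a detection bug only has to be corrected in one place.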

cpp/src/arrow/json/rapidjson-defs.h (outdated, resolved)
Member Author

bkietz commented Apr 17, 2019

@pitrou done

Member

pitrou commented Apr 17, 2019

Thank you @bkietz !

@pitrou pitrou closed this in b496913 Apr 17, 2019
@bkietz bkietz deleted the 4708-Add-multithreaded-JSON-reader branch February 25, 2021 16:10
3 participants