Description
Simple calls to read_json can crash with an exception like the one below. The crash can be non-deterministic, depending on the input file.
```
Traceback (most recent call last):
  File "test_arrow.py", line 11, in <module>
    data = json.read_json(f, json.ReadOptions(use_threads=True))
  File "pyarrow/_json.pyx", line 193, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: JSON conversion to struct<continent: timestamp[s], subcontinent: timestamp[s], country: timestamp[s]> is not supported
```
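A minimal sketch of the failing call from the traceback, assuming an ndjson input file like the sample record below; "input.ndjson" is a placeholder path, not from the original report:

```python
from pyarrow import json

# Sketch of the call shown in the traceback above; the file is a
# placeholder for the actual multi-thousand-line ndjson input.
with open("input.ndjson", "rb") as f:
    data = json.read_json(f, json.ReadOptions(use_threads=True))
```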
The input file is several thousand lines of ndjson, where each record looks similar to:
```
{
  "title": "Black Friday 2019: Our Tips for Finding the Best Deals",
  "text": ".... <bunch of text with arbitrary length>",
  <bunch of other string and integer fields with arbitrary length>
  "geoLocations": [
    {
      "continent": "Americas",
      "subcontinent": "Northern America",
      "country": "United States"
    }
  ]
}
```
Any particular record may have an empty geoLocations array.
Workarounds include:
- shuffling the input file (not guaranteed to work)
- partitioning the input file into separate pieces (not guaranteed to work)
- disabling threaded reading (always works; see the sketch after this list)
- changing the block size (not guaranteed to work)
Other things that stop the crash include:
- deleting fields from the input records
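A minimal sketch of the always-working workaround, disabling threaded reading via ReadOptions; the block_size value is an example tweak, and "input.ndjson" is a placeholder path:

```python
from pyarrow import json

# Workaround: single-threaded read, reported above as always avoiding
# the crash. block_size is an optional tweak from the list above; the
# value here is illustrative, not from the original report.
table = json.read_json(
    "input.ndjson",
    read_options=json.ReadOptions(use_threads=False, block_size=1 << 20),
)
```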
My guess is that anything that changes the data partitioning and/or multi-threading affects the automatic schema inference, which is the source of the conflict. Supplying an explicit schema may also be a workaround (see the sketch below).
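A sketch of that untested workaround, passing an explicit schema through ParseOptions so type inference is bypassed for the declared fields; the field names and types here are assumptions based on the sample record above, and the real file has more fields:

```python
import pyarrow as pa
from pyarrow import json

# Assumed schema derived from the sample record; extend with the other
# string and integer fields present in the real file.
schema = pa.schema([
    ("title", pa.string()),
    ("text", pa.string()),
    ("geoLocations", pa.list_(pa.struct([
        ("continent", pa.string()),
        ("subcontinent", pa.string()),
        ("country", pa.string()),
    ]))),
])

# With an explicit schema, the reader no longer has to infer column
# types per block, which is what the guess above points to as the
# source of the conflict.
table = json.read_json(
    "input.ndjson",
    parse_options=json.ParseOptions(explicit_schema=schema),
)
```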
It's arguable whether this is a bug at all, but updating the API docs with a warning would be very helpful.
Environment: Debian in Docker; reproduced with both Python 2 and Python 3.
Reporter: xtaje
Note: This issue was originally created as ARROW-9547. Please see the migration documentation for further details.