Description
Simple calls to read_json can crash with an exception like the one below. The crash can be non-deterministic, depending on the input file.
```
Traceback (most recent call last):
  File "test_arrow.py", line 11, in <module>
    data = json.read_json(f, json.ReadOptions(use_threads=True))
  File "pyarrow/_json.pyx", line 193, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: JSON conversion to struct<continent: timestamp[s], subcontinent: timestamp[s], country: timestamp[s]> is not supported
```
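A minimal sketch of the failing call from the traceback, assuming an ndjson input file like the sample record below; "input.ndjson" is a placeholder path, not from the original report:

```python
from pyarrow import json

# Sketch of the call shown in the traceback above; the file is a
# placeholder for the actual multi-thousand-line ndjson input.
with open("input.ndjson", "rb") as f:
    data = json.read_json(f, json.ReadOptions(use_threads=True))
```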
The input file is several thousand lines of ndjson, where each record looks similar to:
```
{
  "title": "Black Friday 2019: Our Tips for Finding the Best Deals",
  "text": ".... <bunch of text with arbitrary length>",
  <bunch of other string and integer fields with arbitrary length>
  "geoLocations": [
    {
      "continent": "Americas",
      "subcontinent": "Northern America",
      "country": "United States"
    }
  ]
}
```
Any particular record may have an empty geoLocations array.
Workarounds include:
- shuffling the input file (not guaranteed to work)
- partitioning the input file into separate pieces (not guaranteed to work)
- disabling threaded reading (always works; see the sketch after this list)
- changing the block size (not guaranteed to work)
Other things that stop the crash include:
- deleting fields from the input records
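A minimal sketch of the always-working workaround, disabling threaded reading via ReadOptions; the block_size value is an example tweak, and "input.ndjson" is a placeholder path:

```python
from pyarrow import json

# Workaround: single-threaded read, reported above as always avoiding
# the crash. block_size is an optional tweak from the list above; the
# value here is illustrative, not from the original report.
table = json.read_json(
    "input.ndjson",
    read_options=json.ReadOptions(use_threads=False, block_size=1 << 20),
)
```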
My guess is that anything that changes the data partitioning and/or multi-threading affects the automatic schema inference, which is the source of the conflict. Supplying an explicit schema may also be a workaround (see the sketch below).
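A sketch of that untested workaround, passing an explicit schema through ParseOptions so type inference is bypassed for the declared fields; the field names and types here are assumptions based on the sample record above, and the real file has more fields:

```python
import pyarrow as pa
from pyarrow import json

# Assumed schema derived from the sample record; extend with the other
# string and integer fields present in the real file.
schema = pa.schema([
    ("title", pa.string()),
    ("text", pa.string()),
    ("geoLocations", pa.list_(pa.struct([
        ("continent", pa.string()),
        ("subcontinent", pa.string()),
        ("country", pa.string()),
    ]))),
])

# With an explicit schema, the reader no longer has to infer column
# types per block, which is what the guess above points to as the
# source of the conflict.
table = json.read_json(
    "input.ndjson",
    parse_options=json.ParseOptions(explicit_schema=schema),
)
```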
It's arguable whether this is a bug at all, but updating the API docs with a warning would be very helpful.
Environment: Debian in Docker; reproduced with both Python 2 and Python 3.
Reporter: xtaje
Note: This issue was originally created as ARROW-9547. Please see the migration documentation for further details.