json.read_json crashes due to possible race #25613

@asfimport

Simple calls to read_json fail with an exception like the one below. The failure is non-deterministic and depends on the input file.

Traceback (most recent call last):
File "test_arrow.py", line 11, in
data = json.read_json(f, json.ReadOptions(use_threads=True))
File "pyarrow/_json.pyx", line 193, in pyarrow._json.read_json
File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status

pyarrow.lib.ArrowNotImplementedError: JSON conversion to struct<continent: timestamp[s], subcontinent: timestamp[s], country: timestamp[s]> is not supported

The input file is several thousand lines of ndjson, where each record looks similar to:


{
  "title": "Black Friday 2019: Our Tips for Finding the Best Deals",
  "text": ".... <bunch of text with arbitrary length>"
  <bunch of other string and integer fields with arbitrary length>
  "geoLocations": [
    {
      "continent": "Americas",
      "subcontinent": "Northern America",
      "country": "United States"
    }
  ]
}

Any particular record may have an empty array for geoLocations.
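For anyone who wants input of a similar shape to experiment with, here is a minimal sketch that generates synthetic ndjson using only the standard library. The helper names (`make_record`, `iter_ndjson_lines`, `write_ndjson`), the 30% share of empty arrays, and the text lengths are all made up for illustration; the real file has more fields, and there is no claim that this data triggers the error.

```python
import json
import random


def make_record(rng, index):
    """Build one synthetic record shaped like the example above (hypothetical data)."""
    geo = []
    # Assumption: roughly 30% of records carry an empty geoLocations array.
    if rng.random() >= 0.3:
        geo = [{
            "continent": "Americas",
            "subcontinent": "Northern America",
            "country": "United States",
        }]
    return {
        "title": "Article {}".format(index),
        "text": "x" * rng.randint(10, 2000),  # arbitrary-length text field
        "geoLocations": geo,
    }


def iter_ndjson_lines(num_records, seed=0):
    """Yield one JSON document per line, without trailing newlines."""
    rng = random.Random(seed)
    for index in range(num_records):
        yield json.dumps(make_record(rng, index))


def write_ndjson(path, num_records=5000, seed=0):
    """Write the synthetic records to `path`, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in iter_ndjson_lines(num_records, seed):
            handle.write(line + "\n")
```

Calling `write_ndjson("sample.ndjson")` (a hypothetical file name) produces several thousand lines that can then be passed to read_json.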

Workarounds include:

  • shuffling the input file (not guaranteed to work)

  • partitioning the input file into separate pieces (not guaranteed to work)

  • disabling threaded reading (always works)

  • changing block size (not guaranteed to work)

Other things that stop the crash include:

  • deleting fields from the input records

My guess is that anything that changes how the data is partitioned into blocks, or how those blocks are spread across threads, affects the automatic schema inference, which is the source of the conflict. Supplying an explicit schema may also be a workaround.

It's arguable that this is not a bug, but updating the API docs with a warning would be very helpful.

Environment: Debian in Docker. Python 2 and Python 3
Reporter: xtaje

Note: This issue was originally created as ARROW-9547. Please see the migration documentation for further details.
