Describe the bug, including details regarding any error messages, version, and platform.
Summary
PyArrow cannot read a newline-delimited JSON file with inconsistent column types, even if parse_options specifies an explicit schema.
This happens on PyArrow version 23.0.0 (current release) on Python 3.10.
Details
Consider a newline-delimited JSON file, failure_poc.json, consisting of the following lines:
{"_type": "part", "_id": "152934", "_op_type": "delete", "_index": "my_index"}
{"_type": "part", "_id": 152934, "_op_type": "delete", "_index": "my_index"}
Note that the value of "_id" is quoted (a string) in the first line and unquoted (a number) in the second.
Then this code:
import pyarrow
from pyarrow import json as pjson
names = ["_type", "_id", "_op_type", "_source", "_index"]
src_schema = pyarrow.schema([(x, pyarrow.string()) for x in names])
parse_options = pjson.ParseOptions(
    explicit_schema=src_schema,
    newlines_in_values=False,
    # Accepted values are the lowercase strings "ignore", "error" and "infer"
    unexpected_field_behavior="ignore",
)
blob = pjson.read_json('failure_poc.json', parse_options=parse_options)
generates this error:
pyarrow.lib.ArrowInvalid: JSON parse error: Column(/_id) changed from string to number in row 1
Expected behaviour would be for PyArrow to read the entire file, casting "_id" to string as required.
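To make the expected result concrete, here is a minimal sketch that uses only the standard library json module, not PyArrow. It is not how PyArrow works internally. The helper name read_ndjson_as_strings is hypothetical, and re-serialising non-string values with json.dumps is an assumption about what "casting to string" should mean.

```python
import json

# Field names taken from the schema in this report.
NAMES = ["_type", "_id", "_op_type", "_source", "_index"]


def read_ndjson_as_strings(lines, names=NAMES):
    """Parse NDJSON lines, keep only `names`, and coerce every value to str.

    Hypothetical helper for illustration only.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        row = {}
        for name in names:
            value = record.get(name)
            if value is None:
                row[name] = None  # missing or null field stays null
            elif isinstance(value, str):
                row[name] = value
            else:
                # Assumption: non-string values become their JSON text
                row[name] = json.dumps(value)
        rows.append(row)
    return rows


SAMPLE = [
    '{"_type": "part", "_id": "152934", "_op_type": "delete", "_index": "my_index"}',
    '{"_type": "part", "_id": 152934, "_op_type": "delete", "_index": "my_index"}',
]

rows = read_ndjson_as_strings(SAMPLE)
print([row["_id"] for row in rows])  # ['152934', '152934']
```

Both rows end up with "_id" as the string "152934", which is the result I would expect from read_json with the explicit schema. As a possible workaround, rows in this shape could presumably be passed to pyarrow.Table.from_pylist together with src_schema, though I have not verified that against this release.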
Component(s)
Python