PyArrow cannot read from a newline-delimited JSON file with inconsistent column types, even if parse_options specifies a schema #49158

@cdl-altium

Description

Describe the bug, including details regarding any error messages, version, and platform.

Summary

PyArrow cannot read from a newline-delimited JSON file with inconsistent column types, even if parse_options specifies a schema.

This happens on PyArrow 23.0.0 (the current release) with Python 3.10.

Details

Consider a newline-delimited JSON file consisting of the following lines:

{"_type": "part", "_id": "152934", "_op_type": "delete", "_index": "my_index"}
{"_type": "part", "_id": 152934, "_op_type": "delete", "_index": "my_index"}

Note that "_id" is a quoted string in the first row and a bare number in the second.

Then this code:

import pyarrow
from pyarrow import json as pjson

# Every column, including "_id", is declared as a string.
names = ["_type", "_id", "_op_type", "_source", "_index"]
src_schema = pyarrow.schema([(x, pyarrow.string()) for x in names])
parse_options = pjson.ParseOptions(
    explicit_schema=src_schema,
    newlines_in_values=False,
    # Documented values are lowercase: "ignore", "error", "infer".
    unexpected_field_behavior="ignore",
)

blob = pjson.read_json("failure_poc.json", parse_options=parse_options)

generates this error:

pyarrow.lib.ArrowInvalid: JSON parse error: Column(/_id) changed from string to number in row 1

Expected behaviour would be for PyArrow to read the entire file, casting "_id" to string as required.
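For illustration, the result described above can be approximated today by normalizing the rows before PyArrow sees them. The following is a minimal sketch using only the standard library, not anything PyArrow does itself; the helper name `coerce_fields_to_str` is hypothetical, and it assumes the affected fields sit at the top level of each record.

```python
import json


def coerce_fields_to_str(lines, names):
    """Yield NDJSON lines in which the named top-level fields are strings.

    Hypothetical helper for illustration; not part of PyArrow.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for key in names:
            if key not in record:
                continue
            value = record[key]
            # Leave nulls and existing strings alone; re-encode anything
            # else (numbers, booleans, nested values) as its JSON text.
            if value is not None and not isinstance(value, str):
                record[key] = json.dumps(value)
        yield json.dumps(record)


# The two rows from the report above.
sample = [
    '{"_type": "part", "_id": "152934", "_op_type": "delete", "_index": "my_index"}',
    '{"_type": "part", "_id": 152934, "_op_type": "delete", "_index": "my_index"}',
]
names = ["_type", "_id", "_op_type", "_source", "_index"]
normalized = list(coerce_fields_to_str(sample, names))
```

The normalized lines could then be joined with newlines, encoded to bytes, and passed to `pjson.read_json` through a file-like object such as `io.BytesIO`, using the same `parse_options`. The cost is that every row is parsed twice, which is why handling the cast inside the reader would be preferable.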

Component(s)

Python
