JSON input barfs on {"emptylist":[]} #1036

Closed
chadbrewbaker opened this issue Dec 13, 2021 · 6 comments · Fixed by #1481
Labels: bug, good first issue

Comments

chadbrewbaker commented Dec 13, 2021

Describe the bug
JSON input barfs on {"emptylist":[]}

stack backtrace:
   0: rust_begin_unwind
             at /rustc/0b42deaccc2cbe17a68067aa5fdb76104369e1fd/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/0b42deaccc2cbe17a68067aa5fdb76104369e1fd/library/core/src/panicking.rs:107:14
   2: parquet::arrow::levels::LevelInfo::filter_array_indices
   3: parquet::arrow::arrow_writer::write_leaf
   4: parquet::arrow::arrow_writer::write_leaves
   5: parquet::arrow::arrow_writer::write_leaves
   6: parquet::arrow::arrow_writer::ArrowWriter<W>::write
   7: json2parquet::main

I was driving the library with https://github.com/domoritz/json2parquet

thread 'main' panicked at 'Cannot filter indices on a non-primitive array, found List(true)', /PATH/parquet-6.1.0/src/arrow/levels.rs:757:18

To Reproduce
{"emptylist":[]}

Expected behavior
Same behavior as pyarrow: the Parquet writer should not panic on this input.

pyarrow.Table
emptylist: list<item: null>
  child 0, item: null
----
emptylist: [[0 nulls]]

Additional context

@chadbrewbaker chadbrewbaker changed the title JSON reader barfs on {"emptylist":[]} JSON input barfs on {"emptylist":[]} Dec 13, 2021
Contributor

alamb commented Dec 14, 2021

Thanks for the report @chadbrewbaker --

@alamb alamb added the good first issue Good for newcomers label Dec 14, 2021
Contributor

nevi-me commented Dec 14, 2021

These parquet bugs are mostly/all my fault. I was working on fixing them a few months ago, but there have been significant changes in my time, and I left them hanging. I really apologise for that.

@novemberkilo
Contributor

I am interested in picking this up please // @alamb

novemberkilo added a commit to novemberkilo/arrow-rs that referenced this issue Dec 20, 2021
Contributor

novemberkilo commented Dec 20, 2021

@nevi-me @alamb I started with json2parquet and found the shape of the RecordBatch that corresponded to {"emptylist": []} (see below). This then guided me to writing the test that I've committed for now. I get the same panic and error message so I think I am on the right track. Any suggestions for where the actual fix might be? I'm spelunking around but if either of you (or anyone else familiar with the code here) can help orient me, that would help.

I ran json2parquet on {"emptylist": []} and placed a dbg! on what is sent to the writer:

[src/main.rs:182] &batch = Ok(
    RecordBatch {
        schema: Schema {
            fields: [
                Field {
                    name: "emptylist",
                    data_type: List(
                        Field {
                            name: "item",
                            data_type: Null,
                            nullable: true,
                            dict_id: 0,
                            dict_is_ordered: false,
                            metadata: None,
                        },
                    ),
                    nullable: true,
                    dict_id: 0,
                    dict_is_ordered: false,
                    metadata: None,
                },
            ],
            metadata: {},
        },
        columns: [
            ListArray
            [
              NullArray(0),
            ],
        ],
    },
)
thread 'main' panicked at 'Cannot filter indices on a non-primitive array, found List(true)', /home/navin/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.3.0/src/arrow/levels.rs:757:18
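
For readers wondering why the item type in the dump above is `Null`: an empty JSON array has no elements from which a type could be inferred, so the list's child type falls back to null (pyarrow reports the same shape, `list<item: null>`). The following is only a minimal std-only sketch of that idea, not the arrow-rs JSON reader's code; `JsonValue`, `InferredType`, and `infer_type` are hypothetical names made up for illustration.

```rust
// Std-only sketch with hypothetical types; this is NOT the arrow-rs API.

/// A tiny stand-in for a parsed JSON value.
#[derive(Debug, Clone, PartialEq)]
enum JsonValue {
    Null,
    Number(f64),
    Str(String),
    Array(Vec<JsonValue>),
}

/// A tiny stand-in for an inferred column type.
#[derive(Debug, Clone, PartialEq)]
enum InferredType {
    Null,
    Float64,
    Utf8,
    List(Box<InferredType>),
}

/// Infer a type for one JSON value. An empty array has no elements to
/// inspect, so its item type falls back to `Null`.
fn infer_type(value: &JsonValue) -> InferredType {
    match value {
        JsonValue::Null => InferredType::Null,
        JsonValue::Number(_) => InferredType::Float64,
        JsonValue::Str(_) => InferredType::Utf8,
        JsonValue::Array(items) => {
            // Take the type of the first element that is not null, if any.
            let item = items
                .iter()
                .map(infer_type)
                .find(|t| *t != InferredType::Null)
                .unwrap_or(InferredType::Null);
            InferredType::List(Box::new(item))
        }
    }
}

fn main() {
    // The value of "emptylist" in {"emptylist": []}.
    let empty = JsonValue::Array(vec![]);
    println!("{:?}", infer_type(&empty)); // prints List(Null)

    let nums = JsonValue::Array(vec![JsonValue::Number(1.0)]);
    println!("{:?}", infer_type(&nums)); // prints List(Float64)

    let mixed = JsonValue::Array(vec![JsonValue::Null, JsonValue::Str("a".to_string())]);
    println!("{:?}", infer_type(&mixed)); // prints List(Utf8)
}
```

The point of the sketch is only that a list-of-null column is a legitimate result of inference, so a writer has to handle it rather than panic.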

Contributor

alamb commented Dec 21, 2021

I am not an expert in this code @novemberkilo -- I think @nevi-me is currently focused on other things, so I am not sure he will have time to answer. Perhaps looking at the "blame" (or history) of the relevant code might point to others you could ask?

I also think @tustvold has been looking at this code recently

@novemberkilo
Contributor

This comment on #1063 is relevant to this issue.
