
Panic when writing Parquet from non-nullable ListArray #385
Closed · hohav opened this issue May 31, 2021 · 7 comments · Labels: bug

hohav commented May 31, 2021

Possibly related: #282, #270.

Minimal reproducing code here.

Trying to write a Parquet file containing a variable-length array with non-nullable items results in this panic:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`', .../parquet-4.1.0/src/util/bit_util.rs:332:9
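
(For reference, the shape of the repro is roughly: build a ListArray whose item field is declared non-nullable and hand it to ArrowWriter. The sketch below is not the linked repro itself, and it is written against a more recent arrow-rs API than the 4.1.0 in the panic above, so constructor details such as DataType::List taking an Arc<Field> differ from the code the thread was using.)

use std::fs::File;
use std::sync::Arc;

use arrow::array::{Array, ArrayData, ArrayRef, Int32Array, ListArray};
use arrow::buffer::Buffer;
use arrow::datatypes::{DataType, Field};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // [[1], [2]] with a *non-nullable* item field.
    let values = Int32Array::from(vec![1, 2]);
    let offsets = Buffer::from_slice_ref(&[0i32, 1, 2]);
    let item_field = Arc::new(Field::new("item", DataType::Int32, false));

    let list_data = ArrayData::builder(DataType::List(item_field))
        .len(2)
        .add_buffer(offsets)
        .add_child_data(values.into_data())
        .build()?;
    let list: ArrayRef = Arc::new(ListArray::from(list_data));

    let batch = RecordBatch::try_from_iter(vec![("values", list)])?;

    let file = File::create("test.parquet")?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?; // the assertion in bit_util.rs fired here on parquet 4.1.0
    writer.close()?;
    Ok(())
}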
@hohav hohav added the bug label May 31, 2021
@hohav hohav changed the title Crash when writing Parquet with non-nullable ListArray Panic when writing Parquet from non-nullable ListArray May 31, 2021

hohav commented Jun 28, 2021

I think there may be a more fundamental issue with ListArray. I created a new version of my repro here, with a very simple ListArray: [[1], [], [2]]. I can successfully write this to a Parquet file using ArrowWriter, but parquet meta then shows incorrect information:

$ parquet meta test.parquet 

File path:  test.parquet
Created by: parquet-rs version 5.0.0-SNAPSHOT (build de62168a4f428e3c334e1cfa5c5db23272f313d7)
Properties:
  ARROW:schema: /////7gAAAAQAAAAAAAKAA4ADAALAAQACgAAABQAAAAAAAABBAAKAAwAAAAIAAQACgAAAAgAAAAIAAAAAAAAAAEAAAAEAAAA3P///xwAAAAMAAAAAAABDFwAAAABAAAAHAAAAAQABAAEAAAAEAAUABAADgAPAAQAAAAIABAAAAAYAAAAIAAAAAAAAQIcAAAACAAMAAQACwAIAAAAIAAAAAAAAAEAAAAABAAAAGl0ZW0AAAAABgAAAHZhbHVlcwAA
Schema:
message arrow_schema {
  optional group values (LIST) {
    repeated group list {
      optional int32 item;
    }
  }
}


Row group 0:  count: 3  23.67 B records  start: 4  total: 71 B
--------------------------------------------------------------------------------
                  type      encodings count     avg size   nulls   min / max
values.list.item  INT32     _ RR_     3         23.67 B    1       "1" / "2"

Notice the nulls count of 1, which AFAICT is incorrect: there are no null items, only one empty list. And parquet cat fails entirely:

$ parquet cat test.parquet 
Unknown error
java.lang.RuntimeException: Failed on record 0
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
	at org.apache.parquet.cli.Main.run(Main.java:155)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.parquet.cli.Main.main(Main.java:185)
Caused by: java.lang.ClassCastException: optional int32 item is not a group
	at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
	at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:539)
	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:489)
	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:137)
	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:91)
	at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
	at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:185)
	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
	at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:344)
	at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
	... 3 more


nevi-me commented Jun 28, 2021

Hi @hohav, I missed this; thanks for looking further. I'll take a look at this.


nevi-me commented Jun 28, 2021

#270 fixed the initial behaviour that you observed with the panics, so we correctly round-trip even though the file is technically incorrect. We can do this because we count the nulls independently from the definition levels, instead of relying on what the metadata says.

The issue is with the column writer at https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer.rs#L471.

It effectively says "if a value is not populated, then it's null", which is incorrect in the empty-list case.
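
(To spell that out against the [[1], [], [2]] file above. This is my own bookkeeping of the repetition/definition levels under the schema that was actually written, not code from the column writer: the empty list lands at a definition level below the maximum, and counting every below-max slot as a null item is exactly what produces the "nulls 1" shown by parquet meta.)

fn main() {
    // (repetition, definition, value) triples for [[1], [], [2]] under the schema above:
    //   optional group values (LIST) { repeated group list { optional int32 item; } }
    // max_def = 3: def 3 = item present, def 2 = null item,
    //              def 1 = list present but empty, def 0 = null list.
    let levels: &[(i16, i16, Option<i32>)] = &[
        (0, 3, Some(1)), // [1]
        (0, 1, None),    // []  -- an empty list, not a null item
        (0, 3, Some(2)), // [2]
    ];

    // Treating every slot below max_def as a null item counts 1 "null",
    // matching the `nulls 1` reported by parquet meta, even though a true
    // null item (def 2) never occurs in this data.
    let counted_as_null = levels.iter().filter(|(_, def, _)| *def < 3).count();
    assert_eq!(counted_as_null, 1);
}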


hohav commented Jun 28, 2021

Thanks for taking a look. I'm still seeing the initial panic when I update to the latest master of arrow-rs, so I don't think #270 fixed it, unfortunately.

But I think there's something else going on, because I get the same crash from parquet cat even when I remove the empty list. And if I pass false to try_from_iter_with_nullable, then parquet meta tells me every element is null, even for a list like [[1], [2]] (and parquet cat still crashes). Repro code here.
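
(Concretely, the try_from_iter_with_nullable variant described above is roughly the following. This is a sketch against a recent arrow-rs API, not the linked repro; builder constructors differ from the 4.x/5.x versions the thread was using, and ListBuilder's default item field is nullable, so only the outer column's nullability flag is exercised here.)

use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Builder, ListBuilder};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build [[1], [2]] -- no empty list -- just to exercise the column-level flag.
    let mut builder = ListBuilder::new(Int32Builder::new());
    builder.values().append_value(1);
    builder.append(true);
    builder.values().append_value(2);
    builder.append(true);
    let list: ArrayRef = Arc::new(builder.finish());

    // The `false` marks the "values" column itself as non-nullable, which is the
    // case where parquet meta reported every element as null.
    let batch = RecordBatch::try_from_iter_with_nullable(vec![("values", list, false)])?;
    println!("{:#?}", batch.schema());
    Ok(())
}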


alamb commented Jul 29, 2022

I wonder if this is still an issue after the recent work from @tustvold and others to clean up nested struct / null handling?


nevi-me commented Jul 29, 2022

I'll check this too

@tustvold

Closed by #1746
