Support nested dictionaries in ipc serialization & deserialization #846

Closed
helgikrs opened this issue Oct 22, 2021 · 3 comments · Fixed by #923
Labels
enhancement: Any new improvement worthy of an entry in the changelog

Comments

@helgikrs
Contributor

IPC serialization and deserialization of a column containing a dictionary nested within a struct does not work as expected right now. Here's a small code snippet that reproduces the behavior.

use arrow::{
    array::{self, ArrayRef, StructArray},
    datatypes::{self, Field, Schema},
    ipc::{reader::FileReader, writer::FileWriter},
    record_batch::RecordBatch,
};
use std::sync::Arc;

fn main() {
    // This works: the struct child is a plain string array.
    // let inner: array::StringArray = vec![Some("a"), Some("b"), Some("a")].into_iter().collect();

    // This does not: the struct child is a dictionary-encoded array.
    let inner: array::DictionaryArray<datatypes::Int32Type> =
        vec!["a", "b", "a"].into_iter().collect();

    let array = Arc::new(inner) as ArrayRef;

    let dctfield = Field::new("dict", array.data_type().clone(), false);

    let s = StructArray::from(vec![(dctfield, array)]);
    let struct_array = Arc::new(s) as ArrayRef;

    let schema = Arc::new(Schema::new(vec![Field::new(
        "struct",
        struct_array.data_type().clone(),
        false,
    )]));

    let batch = RecordBatch::try_new(schema.clone(), vec![struct_array]).unwrap();

    let mut buf = Vec::new();
    let mut writer = FileWriter::try_new(&mut buf, &schema).unwrap();
    writer.write(&batch).unwrap();
    writer.finish().unwrap();
    drop(writer);

    let reader = FileReader::try_new(std::io::Cursor::new(buf)).unwrap();
    let batch2: Result<Vec<_>, _> = reader.collect();

    assert_eq!(batch, batch2.unwrap()[0]);
}

The code panics when the struct field is a dictionary, but works as expected for other types.

@helgikrs added the enhancement label Oct 22, 2021
@alamb
Contributor

alamb commented Nov 4, 2021

FWIW when I tried the reproducer on current apache/master I still see an error:

    Finished dev [unoptimized + debuginfo] target(s) in 18.38s
     Running `target/debug/rust_arrow_playground`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("dictionary id not found in schema")', src/main.rs:39:65
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@helgikrs do you know what is left for this PR?

@helgikrs
Contributor Author

helgikrs commented Nov 4, 2021

@alamb Currently the deserializer only considers dictionaries at the top level of the schema, so when loading encounters a dictionary id that belongs to a nested field, you get the panic above. I have a working fix and will submit a PR that closes this, probably tomorrow.
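
For readers following along, here is a rough sketch of the difference between a top-level-only lookup and a recursive one. It uses a hypothetical, simplified schema model (`DataType`, `Field`, `find_top_level`, `find_nested`, and `example_schema` are all invented for illustration); it is not the arrow crate's code and not necessarily how the eventual fix is written.

```rust
// Hypothetical, simplified schema model. These are NOT the arrow crate's types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int32,
    Struct(Vec<Field>),
    List(Box<Field>),
    // (key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    // Set only for dictionary-encoded fields.
    dict_id: Option<i64>,
}

/// Top-level-only lookup: mirrors the limitation described in the comment above.
fn find_top_level(fields: &[Field], id: i64) -> Option<&Field> {
    fields.iter().find(|f| f.dict_id == Some(id))
}

/// Recursive lookup: also descends into struct and list children.
fn find_nested(fields: &[Field], id: i64) -> Option<&Field> {
    for field in fields {
        if field.dict_id == Some(id) {
            return Some(field);
        }
        let found = match &field.data_type {
            DataType::Struct(children) => find_nested(children, id),
            DataType::List(child) => find_nested(std::slice::from_ref(&**child), id),
            _ => None,
        };
        if found.is_some() {
            return found;
        }
    }
    None
}

/// A schema shaped like the reproducer: one struct column with one dictionary child.
fn example_schema() -> Vec<Field> {
    let dict = Field {
        name: "dict".to_string(),
        data_type: DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        dict_id: Some(0),
    };
    vec![Field {
        name: "struct".to_string(),
        data_type: DataType::Struct(vec![dict]),
        dict_id: None,
    }]
}
```

With `example_schema()`, `find_top_level(&schema, 0)` returns `None`, which corresponds to the "dictionary id not found in schema" situation, while `find_nested(&schema, 0)` returns the nested `dict` field.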

@alamb
Contributor

alamb commented Nov 5, 2021

Thank you @helgikrs
