Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust is not interoperability with C++ for IPC schemas with dictionaries #1694

Closed
pierrebelzile opened this issue May 12, 2022 · 13 comments
Closed
Labels
arrow Changes to the arrow crate bug

Comments

@pierrebelzile
Copy link

In my application, I created in Rust RecordBatch's (arrow 12) that contain multiple dictionary fields. I serialized using streaming IPC and attempted to read with C++ (4.0 and 8.0). The server failed to recover the batches. It seems to get confused with associating the proper dictionary to the correct field. Seems to only see the last dictionary?

I should also point out that I attempted the opposite: Reading in Rust what was created in C++ and that also failed.

I reduced this to a batch with two dictionary fields and uploaded that to: https://github.com/pierrebelzile/arrow_ipc_test. This includes files with the binaries.

I'm hoping there is an expert IPC-person here that quickly knows what is wrong...

@viirya
Copy link
Member

viirya commented May 12, 2022

Hi @pierrebelzile , thanks for reporting the issue. I recently fixed some issues (e.g. #1636) on integrating with C++ for dictionaries. These fixes will be released soon in 14.0.0 (#1693). Can you re-check with 14.0.0 once it is released? Thank you.

@pierrebelzile
Copy link
Author

Great. I will try as soon as it comes out. Any chance that you can test with your code if you are able to read the cpp_out.bin file in this repo?

@viirya
Copy link
Member

viirya commented May 12, 2022

Yea, I just downloaded cpp_out.bin file and ran a test with it. I can read the file by using IPC StreamReader:

let reader = StreamReader::try_new(file, None).unwrap();

reader.for_each(|batch| {
  let batch = batch.unwrap();
  println!("batch: {:?}", batch);
});
batch: RecordBatch { schema: Schema { fields: [Field { name: "as", data_type: Dictionary(Int8, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "bs", data_type: Dictionary(Int8, Utf8), null
able: true, dict_id: 1, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [DictionaryArray {keys: PrimitiveArray<Int8>                                                                                               
[                                                                                                                  
  0,                                                     
  1,                                                                                                               
] values: StringArray                                    
[                                                                                                                  
  "a1",                                                  
  "a2",                                                                                                            
]}                                                       
, DictionaryArray {keys: PrimitiveArray<Int8>                                                                      
[                                                        
  0,                                                                                                               
  1,                                                     
] values: StringArray                                                                                              
[                                                        
  "b1",                                                                                                            
  "b2",                                                  
]}                                                                                                                 
], row_count: 2 }                                        

@viirya
Copy link
Member

viirya commented May 12, 2022

I think we can close this issue as it proves the file can be read now.

@pierrebelzile
Copy link
Author

Closing given that almost certainly fixed. Will re-open if necessary.

@viirya
Copy link
Member

viirya commented May 12, 2022

Thank you @pierrebelzile. Yea, you can open this again if you find any issue on integrating this crate with C++.

@pierrebelzile pierrebelzile reopened this May 25, 2022
@pierrebelzile
Copy link
Author

I'm reopening because the problem is still there in the direction rust -> c++: An arrow IPC data with dictionaries written in Rust is not readable from Arrow C++ (version 8). I was able to confirm that the other direction (c++ -> rust) is working as expected with dictionaries.

@viirya
Copy link
Member

viirya commented May 25, 2022

@pierrebelzile Can you let me know how I can test the case (Rust -> C++)? Do you have reproducible example?

@pierrebelzile
Copy link
Author

I updated the github repo to include instructions to build but I can't connect to it from my work to correctly finish setting up. In the meanwhile, I've single stepped the code in C++ and the issue is that the dictionary id seen in C++ is always 0 in cpp/src/arrow/ipc/reader.cc:786 (arrow 7.0). This causes the dictionaries to "overwrite" each other and that's why we only see one of them. I have no idea why it works in Rust...

@viirya
Copy link
Member

viirya commented May 25, 2022

Oh, that is because in your build_batch method, you create Field by:

fields.push(Field::new("as", array.data_type().clone(), true));
...
fields.push(Field::new("bs", array.data_type().clone(), true));

dict_id in Field is default to be 0.

So you duplicate dictionary ids.

I changed the code to the follows, and ran the C++ program without any issue:

fields.push(Field::new_dict("as", array.data_type().clone(), true, 0, false));
...
fields.push(Field::new_dict("bs", array.data_type().clone(), true, 1, false));

@pierrebelzile
Copy link
Author

Thanks for finding that. I'm not used to this brecause in C++ the id is automatically assigned.

What will happen when (if not yet) the data type is a map<string dict key, string value>? How does one set the dict id then?

@viirya
Copy link
Member

viirya commented May 25, 2022

The dict_id is at Field. For Map, it has a Field which is a Struct field which has two fields for key and value. The dict id should be set at the nested field for the key.

@pierrebelzile
Copy link
Author

Closing because issue was mine.

@alamb alamb added the arrow Changes to the arrow crate label May 26, 2022
@alamb alamb changed the title Interoperability with C++ for IPC schemas with dictionaries Rust is not interoperability with C++ for IPC schemas with dictionaries May 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

No branches or pull requests

3 participants