[C++] Unify dictionaries when writing IPC file in a single shot #26388
Comments
Antoine Pitrou / @pitrou: @nealrichardson: no need for an example, this is a format implementation question.
Neal Richardson / @nealrichardson: I still see the error message in the code, so I don't think this is resolved: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L1055-L1060
Neal Richardson / @nealrichardson: Here's a trivial way to reproduce it from R, using a tiny CSV and setting a small block size in the read options:

```r
> library(arrow)
> df <- data.frame(chr=c(rep("a", 3), rep("b", 3)), int=1:6)
> write.csv(df, "test.csv", row.names=FALSE)
> system("cat test.csv")
"chr","int"
"a",1
"a",2
"a",3
"b",4
"b",5
"b",6
> tab <- read_csv_arrow("test.csv", read_options=CsvReadOptions$create(block_size=16L), as_data_frame=FALSE, schema=schema(chr=dictionary(), int=int32()))
> tab
Table
6 rows x 2 columns
$chr <dictionary<values=string, indices=int32>>
$int <int32>
> tab$chr
ChunkedArray
[
  -- dictionary:
    []
  -- indices:
    [],
  -- dictionary:
    [
      "a"
    ]
  -- indices:
    [
      0,
      0,
      0
    ],
  -- dictionary:
    [
      "b"
    ]
  -- indices:
    [
      0,
      0,
      0
    ]
]
> write_feather(tab, tempfile())
Error: Invalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.
In /Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/ipc/writer.cc, line 983, code: WriteDictionaries(batch)
In /Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/ipc/writer.cc, line 939, code: WriteRecordBatch(*batch)
In /Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/ipc/feather.cc, line 804, code: writer->WriteTable(table, properties.chunksize)
```
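The unification the issue asks the writer to perform can be illustrated with a small stand-alone sketch (plain Python, not Arrow's actual API; `unify_chunks` and its `(dictionary, indices)` chunk representation are hypothetical): build one unified dictionary across all chunks and remap each chunk's indices through a per-chunk transpose table.

```python
def unify_chunks(chunks):
    """Unify per-chunk dictionaries for a dictionary-encoded column.

    chunks: list of (dictionary, indices) pairs, e.g. (["a"], [0, 0, 0]).
    Returns (unified_dictionary, list_of_remapped_indices).
    """
    unified = []   # unified dictionary, in first-seen order
    seen = {}      # value -> position in the unified dictionary
    remapped = []
    for dictionary, indices in chunks:
        # Transpose table: chunk-local slot -> unified slot.
        transpose = []
        for value in dictionary:
            if value not in seen:
                seen[value] = len(unified)
                unified.append(value)
            transpose.append(seen[value])
        remapped.append([transpose[i] for i in indices])
    return unified, remapped

# The three chunks from the R session above: [], ["a"] x3, ["b"] x3.
chunks = [([], []), (["a"], [0, 0, 0]), (["b"], [0, 0, 0])]
unified, remapped = unify_chunks(chunks)
print(unified)   # ['a', 'b']
print(remapped)  # [[], [0, 0, 0], [1, 1, 1]]
```

With a single unified dictionary, the file writer would emit one dictionary batch per field and the error above would not arise.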
Antoine Pitrou / @pitrou: IMO, it's an issue with how dictionary mapping has been defined in the IPC protocol. If each dictionary batch had its own unique id (instead of putting dictionary ids in the schema), it would probably be easy.
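The constraint @pitrou describes can be modeled in a few lines (a toy model, not Arrow code; `FileWriterModel` is hypothetical): because the schema fixes one dictionary id per field, a second, different dictionary batch for the same id is a "replacement", which the IPC file format rejects (the stream format allows it).

```python
class FileWriterModel:
    """Toy model of the IPC *file* writer's one-dictionary-per-id rule."""

    def __init__(self):
        self.written = {}  # dictionary id -> dictionary already written

    def write_dictionary_batch(self, dict_id, values):
        # Re-sending the identical dictionary is harmless; a different one
        # for the same id is a replacement, which the file format forbids.
        if dict_id in self.written and self.written[dict_id] != values:
            raise ValueError(
                "Dictionary replacement detected when writing IPC file format")
        self.written[dict_id] = values

w = FileWriterModel()
w.write_dictionary_batch(0, ["a"])      # first chunk's dictionary: OK
try:
    w.write_dictionary_batch(0, ["b"])  # second chunk: replacement -> error
except ValueError as e:
    print(e)
```

If each dictionary batch carried its own unique id, as suggested, the writer would never hit this collision; alternatively, unifying dictionaries up front sidesteps it.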
Antoine Pitrou / @pitrou: @wesm Can you give an opinion on this?
Joris Van den Bossche / @jorisvandenbossche: (that's also what we e.g. do on conversion to pandas in
I read a big (taxi) CSV file and specified that I wanted to dictionary-encode some columns. The resulting Table has ChunkedArrays with 1604 chunks. When I go to write this Table to the IPC file format (write_feather), I get the error shown above.
I can write this Table to Parquet and read it back, and the data round-trips correctly. We should be able to do the same with IPC.
Reporter: Neal Richardson / @nealrichardson
Assignee: Antoine Pitrou / @pitrou
PRs and other links:
Note: This issue was originally created as ARROW-10406. Please see the migration documentation for further details.