
[C++] Unify dictionaries when writing IPC file in a single shot #26388

Closed
asfimport opened this issue Oct 27, 2020 · 15 comments

Comments

I read a big (taxi) csv file and specified that I wanted to dictionary-encode some columns. The resulting Table has ChunkedArrays with 1604 chunks. When I go to write this Table to the IPC file format (write_feather), I get an error:

  Invalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.

I can write this to Parquet and read it back in, and the roundtrip of the data is correct. We should be able to do this in IPC too.

Reporter: Neal Richardson / @nealrichardson
Assignee: Antoine Pitrou / @pitrou

PRs and other links:

Note: This issue was originally created as ARROW-10406. Please see the migration documentation for further details.
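
For reference, here is a minimal pyarrow sketch of the same situation; the two hand-built chunks are a hypothetical stand-in for the per-chunk dictionaries the chunked CSV reader produced from the taxi file (a fuller R reproduction appears later in this thread):

import pyarrow as pa
import pyarrow.feather as feather

# Two chunks whose dictionaries differ ("a" vs "b"), mimicking what the
# chunked CSV reader produces when each block sees different values.
chunks = [
    pa.array(["a", "a", "a"]).dictionary_encode(),
    pa.array(["b", "b", "b"]).dictionary_encode(),
]
tab = pa.table({
    "chr": pa.chunked_array(chunks),
    "int": pa.array(range(1, 7), type=pa.int32()),
})

# Before this issue was resolved, the next line raised:
#   ArrowInvalid: Dictionary replacement detected when writing IPC file format. ...
feather.write_feather(tab, "test.feather")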


Antoine Pitrou / @pitrou:
cc @wesm


Wes McKinney / @wesm:
I thought that we could support deltas, or is that not yet an option?


Neal Richardson / @nealrichardson:
Just reporting the error I hit. Haven't tried to reduce it to a minimal reproducible example yet.


Antoine Pitrou / @pitrou:
The file format spec allows deltas, I just don't think we implement them.

@nealrichardson: no need for an example, this is a format implementation question.
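
For context, delta support did land for the stream writer (ARROW-6883, discussed below). A minimal pyarrow sketch, assuming the emit_dictionary_deltas option on IpcWriteOptions and chunks whose second dictionary extends the first; deltas only cover that growing case, not wholesale replacement:

import io
import pyarrow as pa

# Delta batches only apply when a later dictionary extends an earlier one:
# the second chunk's dictionary ["a", "b"] starts with the first's ["a"].
chunks = [
    pa.array(["a", "a"]).dictionary_encode(),
    pa.DictionaryArray.from_arrays(pa.array([0, 1], type=pa.int32()),
                                   pa.array(["a", "b"])),
]
schema = pa.schema({"chr": chunks[0].type})
opts = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)

# With the option enabled, the stream writer emits a delta dictionary batch
# for the second chunk instead of a full replacement.
with pa.ipc.new_stream(io.BytesIO(), schema, options=opts) as writer:
    for chunk in chunks:
        writer.write_batch(pa.record_batch([chunk], schema=schema))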


Antoine Pitrou / @pitrou:
I think this was done in ARROW-6883 (including unit tests). @nealrichardson Can you confirm this is ok for you?


Neal Richardson / @nealrichardson:
Shame on the reporter for not providing steps to reproduce.

I still see the error message in the code, so I don't think this is resolved: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L1055-L1060


Antoine Pitrou / @pitrou:
Ah! I thought you were talking about "IPC files" in the general case (i.e. IPC streams saved in a file) :-)
Well, the problem is, the IPC file format doesn't support dictionary replacement because it's meant to be readable in random order.
I'm not sure there's an incentive to change it. @wesm


Neal Richardson / @nealrichardson:
The problem here was that, due to the way the chunked CSV reader handles dictionary encoding, I couldn't read a CSV and write it to Feather/Parquet, which seems like a critical failing.


Neal Richardson / @nealrichardson:
I'm not sure I agree with the reclassification of this as a Format issue, though I guess that's one way to solve it. I contend that, if the format doesn't support this, then the C++ CSV reader has a bug/misfeature because it shouldn't read data that we can't write.

Here's a trivial way to reproduce it from R, using a tiny CSV and setting a small block size in the read options:

> library(arrow)
> df <- data.frame(chr=c(rep("a", 3), rep("b", 3)), int=1:6)
> write.csv(df, "test.csv", row.names=FALSE)
> system("cat test.csv")
"chr","int"
"a",1
"a",2
"a",3
"b",4
"b",5
"b",6
> tab <- read_csv_arrow("test.csv", read_options=CsvReadOptions$create(block_size=16L), as_data_frame=FALSE, schema=schema(chr=dictionary(), int=int32()))
> tab
Table
6 rows x 2 columns
$chr <dictionary<values=string, indices=int32>>
$int <int32>
> tab$chr
ChunkedArray
[

  -- dictionary:
    []
  -- indices:
    [],

  -- dictionary:
    [
      "a"
    ]
  -- indices:
    [
      0,
      0,
      0
    ],

  -- dictionary:
    [
      "b"
    ]
  -- indices:
    [
      0,
      0,
      0
    ]
]
> write_feather(tab, tempfile())
Error: Invalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.
In /Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/ipc/writer.cc, line 983, code: WriteDictionaries(batch)
In /Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/ipc/writer.cc, line 939, code: WriteRecordBatch(*batch)
In /Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/ipc/feather.cc, line 804, code: writer->WriteTable(table, properties.chunksize)


Antoine Pitrou / @pitrou:
Hmm, what does this have to do with the CSV reader? You can perfectly well write that data to an IPC stream (even as an on-disk file in the more general meaning) or to a Parquet file. You just can't write it to an IPC file (as in "IPC file format").

IMO, it's an issue with how dictionary mapping has been defined in the IPC protocol. If each dictionary batch had its own unique id (instead of putting dictionary ids in the schema), it would probably be easy.
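
A sketch of that distinction in pyarrow, using a hypothetical two-chunk column standing in for the CSV result: the same batches pass through the stream writer, where replacement is legal, but the file writer rejected them with the error above.

import io
import pyarrow as pa

chunks = [pa.array(["a", "a"]).dictionary_encode(),
          pa.array(["b", "b"]).dictionary_encode()]
schema = pa.schema({"chr": chunks[0].type})
batches = [pa.record_batch([c], schema=schema) for c in chunks]

# IPC stream format: dictionary replacement is allowed, so this succeeds.
with pa.ipc.new_stream(io.BytesIO(), schema) as writer:
    for b in batches:
        writer.write_batch(b)

# IPC file format: the footer maps each field to a single dictionary, so the
# second batch triggered the "Dictionary replacement detected" error.
with pa.ipc.new_file(io.BytesIO(), schema) as writer:
    for b in batches:
        writer.write_batch(b)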


Neal Richardson / @nealrichardson:
As a user, it appears to be a bug that the library generates data that can't be saved to its own file format. If it's valid data everywhere else, then sure, it's not a problem in the CSV reader, but the IPC file writer should be able to write it, even if that means it has to combine/recode the dictionaries to be valid.


Antoine Pitrou / @pitrou:
Ah, unifying the dictionaries would be doable... at least if you write the whole table at once (which I suppose is a common case, especially from R or Python). It would also waste some space and CPU time, however.

@wesm Can you give an opinion on this?
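
For completeness, that unification can already be done by hand before writing. A sketch assuming pyarrow's Table.unify_dictionaries() helper, available in recent releases (the C++ counterpart is arrow::DictionaryUnifier), which rewrites every chunk against one shared dictionary:

import pyarrow as pa
import pyarrow.feather as feather

chunks = [pa.array(["a", "a", "a"]).dictionary_encode(),
          pa.array(["b", "b", "b"]).dictionary_encode()]
tab = pa.table({"chr": pa.chunked_array(chunks)})

# Recode every chunk against the shared dictionary ["a", "b"]; after this
# the IPC file writer sees no dictionary replacement.
unified = tab.unify_dictionaries()
feather.write_feather(unified, "unified.feather")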


Joris Van den Bossche / @jorisvandenbossche:
I agree with @nealrichardson that this is a typical use case that should work out of the box. If the IPC file format does not support different dictionaries, I would expect that we unify the dictionaries before writing.

(That's also what we do, e.g., on conversion to pandas in to_pandas(), and probably in R as well when converting to a DataFrame with factors?)


Wes McKinney / @wesm:
I think it's reasonable for it to fail when writing the file format in a streaming fashion, but when we have the entire table up front, it seems reasonable (though certainly tedious...) to scan the dictionaries in each chunk and if there are any differences, to do a unification. I reckon there would be some refactoring necessary, but if it is not too gory this seems like it would be worth doing in the coming months.


Ben Kietzman / @bkietz:
Issue resolved by pull request #9348
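
The change lets the file writer unify per-chunk dictionaries itself when it is handed the whole table at once. A sketch assuming the unify_dictionaries flag that the pull request added to IpcWriteOptions (the flag name and its default may differ between releases):

import io
import pyarrow as pa

chunks = [pa.array(["a", "a", "a"]).dictionary_encode(),
          pa.array(["b", "b", "b"]).dictionary_encode()]
tab = pa.table({"chr": pa.chunked_array(chunks)})

# Ask the IPC file writer to unify dictionaries across the whole table.
opts = pa.ipc.IpcWriteOptions(unify_dictionaries=True)
with pa.ipc.new_file(io.BytesIO(), tab.schema, options=opts) as writer:
    writer.write_table(tab)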

asfimport added this to the 4.0.0 milestone Jan 11, 2023