
[C++] Unify dictionaries when writing IPC file in a single shot #26388

Closed
asfimport opened this issue Oct 27, 2020 · 15 comments

Comments

I read a big (taxi) csv file and specified that I wanted to dictionary-encode some columns. The resulting Table has ChunkedArrays with 1604 chunks. When I go to write this Table to the IPC file format (write_feather), I get an error:

  Invalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.

I can write this to Parquet and read it back in, and the roundtrip of the data is correct. We should be able to do this in IPC too.

Reporter: Neal Richardson / @nealrichardson
Assignee: Antoine Pitrou / @pitrou

PRs and other links:

Note: This issue was originally created as ARROW-10406. Please see the migration documentation for further details.
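
For reference, here is a minimal pyarrow sketch of the same situation; the two hand-built chunks are a hypothetical stand-in for the per-chunk dictionaries the chunked CSV reader produced from the taxi file (a fuller R reproduction appears later in this thread):

import pyarrow as pa
import pyarrow.feather as feather

# Two chunks whose dictionaries differ ("a" vs "b"), mimicking what the
# chunked CSV reader produces when each block sees different values.
chunks = [
    pa.array(["a", "a", "a"]).dictionary_encode(),
    pa.array(["b", "b", "b"]).dictionary_encode(),
]
tab = pa.table({
    "chr": pa.chunked_array(chunks),
    "int": pa.array(range(1, 7), type=pa.int32()),
})

# Before this issue was resolved, the next line raised:
#   ArrowInvalid: Dictionary replacement detected when writing IPC file format. ...
feather.write_feather(tab, "test.feather")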


Antoine Pitrou / @pitrou:
cc @wesm


Wes McKinney / @wesm:
I thought that we could support deltas, or is that not yet an option?


Neal Richardson / @nealrichardson:
Just reporting the error I hit. Haven't tried to reduce it to a minimal reproducible example yet.


Antoine Pitrou / @pitrou:
The file format spec allows deltas, I just don't think we implement them.

@nealrichardson: no need for an example, this is a format implementation question.
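
For context, delta support did land for the stream writer (ARROW-6883, discussed below). A minimal pyarrow sketch, assuming the emit_dictionary_deltas option on IpcWriteOptions and chunks whose second dictionary extends the first; deltas only cover that growing case, not wholesale replacement:

import io
import pyarrow as pa

# Delta batches only apply when a later dictionary extends an earlier one:
# the second chunk's dictionary ["a", "b"] starts with the first's ["a"].
chunks = [
    pa.array(["a", "a"]).dictionary_encode(),
    pa.DictionaryArray.from_arrays(pa.array([0, 1], type=pa.int32()),
                                   pa.array(["a", "b"])),
]
schema = pa.schema({"chr": chunks[0].type})
opts = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)

# With the option enabled, the stream writer emits a delta dictionary batch
# for the second chunk instead of a full replacement.
with pa.ipc.new_stream(io.BytesIO(), schema, options=opts) as writer:
    for chunk in chunks:
        writer.write_batch(pa.record_batch([chunk], schema=schema))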


Antoine Pitrou / @pitrou:
I think this was done in ARROW-6883 (including unit tests). @nealrichardson Can you confirm this is ok for you?


Neal Richardson / @nealrichardson:
Shame on the reporter for not providing steps to reproduce.

I still see the error message in the code, so I don't think this is resolved: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L1055-L1060


Antoine Pitrou / @pitrou:
Ah! I thought you were talking about "IPC files" in the general case (i.e. IPC streams saved in a file) :-)
Well, the problem is, the IPC file format doesn't support dictionary replacement because it's meant to be readable in random order.
I'm not sure there's an incentive to change it. @wesm


Neal Richardson / @nealrichardson:
The problem here was that, due to the way the chunked CSV reader handles dictionary encoding, I couldn't read a CSV and write it to Feather/Parquet, which seems like a critical failing.


Neal Richardson / @nealrichardson:
I'm not sure I agree with the reclassification of this as a Format issue, though I guess that's one way to solve it. I contend that, if the format doesn't support this, then the C++ CSV reader has a bug/misfeature because it shouldn't read data that we can't write.

Here's a trivial way to reproduce it from R, using a tiny CSV and setting a small block size in the read options:

> library(arrow)
> df <- data.frame(chr=c(rep("a", 3), rep("b", 3)), int=1:6)
> write.csv(df, "test.csv", row.names=FALSE)
> system("cat test.csv")
"chr","int"
"a",1
"a",2
"a",3
"b",4
"b",5
"b",6
> tab <- read_csv_arrow("test.csv", read_options=CsvReadOptions$create(block_size=16L), as_data_frame=FALSE, schema=schema(chr=dictionary(), int=int32()))
> tab
Table
6 rows x 2 columns
$chr <dictionary<values=string, indices=int32>>
$int <int32>
> tab$chr
ChunkedArray
[

  -- dictionary:
    []
  -- indices:
    [],

  -- dictionary:
    [
      "a"
    ]
  -- indices:
    [
      0,
      0,
      0
    ],

  -- dictionary:
    [
      "b"
    ]
  -- indices:
    [
      0,
      0,
      0
    ]
]
> write_feather(tab, tempfile())
Error: Invalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.
In /Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/ipc/writer.cc, line 983, code: WriteDictionaries(batch)
In /Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/ipc/writer.cc, line 939, code: WriteRecordBatch(*batch)
In /Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/ipc/feather.cc, line 804, code: writer->WriteTable(table, properties.chunksize)


Antoine Pitrou / @pitrou:
Hmm, what does this have to do with the CSV reader? You can perfectly well write that data to an IPC stream (even as an on-disk file in the more general meaning) or to a Parquet file. You just can't write it to an IPC file (as in "IPC file format").

IMO, it's an issue with how dictionary mapping has been defined in the IPC protocol. If each dictionary batch had its own unique id (instead of putting dictionary ids in the schema), it would probably be easy.
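
A sketch of that distinction in pyarrow, using a hypothetical two-chunk column standing in for the CSV result: the same batches pass through the stream writer, where replacement is legal, but the file writer rejected them with the error above.

import io
import pyarrow as pa

chunks = [pa.array(["a", "a"]).dictionary_encode(),
          pa.array(["b", "b"]).dictionary_encode()]
schema = pa.schema({"chr": chunks[0].type})
batches = [pa.record_batch([c], schema=schema) for c in chunks]

# IPC stream format: dictionary replacement is allowed, so this succeeds.
with pa.ipc.new_stream(io.BytesIO(), schema) as writer:
    for b in batches:
        writer.write_batch(b)

# IPC file format: the footer maps each field to a single dictionary, so the
# second batch triggered the "Dictionary replacement detected" error.
with pa.ipc.new_file(io.BytesIO(), schema) as writer:
    for b in batches:
        writer.write_batch(b)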


Neal Richardson / @nealrichardson:
As a user, it appears to be a bug that the library generates data that can't be saved to its own file format. If it's valid data everywhere else, then sure, it's not a problem in the CSV reader, but the IPC file writer should be able to write it, even if that means it has to combine/recode the dictionaries to be valid.


Antoine Pitrou / @pitrou:
Ah, unifying the dictionaries would be doable... at least if you write the whole table at once (which I suppose is a common case, especially from R or Python). It would also waste some space and CPU time, however.

@wesm Can you give an opinion on this?
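
For completeness, that unification can already be done by hand before writing. A sketch assuming pyarrow's Table.unify_dictionaries() helper, available in recent releases (the C++ counterpart is arrow::DictionaryUnifier), which rewrites every chunk against one shared dictionary:

import pyarrow as pa
import pyarrow.feather as feather

chunks = [pa.array(["a", "a", "a"]).dictionary_encode(),
          pa.array(["b", "b", "b"]).dictionary_encode()]
tab = pa.table({"chr": pa.chunked_array(chunks)})

# Recode every chunk against the shared dictionary ["a", "b"]; after this
# the IPC file writer sees no dictionary replacement.
unified = tab.unify_dictionaries()
feather.write_feather(unified, "unified.feather")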


Joris Van den Bossche / @jorisvandenbossche:
I agree with @nealrichardson that this is a typical use case that should work out of the box. If the IPC file format does not support different dictionaries, I would expect that we unify the dictionaries before writing.

(That's also what we do, e.g., on conversion to pandas in to_pandas(), and probably in R as well when converting to a DataFrame with factors?)


Wes McKinney / @wesm:
I think it's reasonable for it to fail when writing the file format in a streaming fashion, but when we have the entire table up front, it seems reasonable (though certainly tedious...) to scan the dictionaries in each chunk and if there are any differences, to do a unification. I reckon there would be some refactoring necessary, but if it is not too gory this seems like it would be worth doing in the coming months.


Ben Kietzman / @bkietz:
Issue resolved by pull request #9348
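
The change lets the file writer unify per-chunk dictionaries itself when it is handed the whole table at once. A sketch assuming the unify_dictionaries flag that the pull request added to IpcWriteOptions (the flag name and its default may differ between releases):

import io
import pyarrow as pa

chunks = [pa.array(["a", "a", "a"]).dictionary_encode(),
          pa.array(["b", "b", "b"]).dictionary_encode()]
tab = pa.table({"chr": pa.chunked_array(chunks)})

# Ask the IPC file writer to unify dictionaries across the whole table.
opts = pa.ipc.IpcWriteOptions(unify_dictionaries=True)
with pa.ipc.new_file(io.BytesIO(), tab.schema, options=opts) as writer:
    writer.write_table(tab)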

asfimport added this to the 4.0.0 milestone Jan 11, 2023