
[Python] Open CSVs with different encodings #23544

Closed
asfimport opened this issue Nov 25, 2019 · 18 comments

asfimport commented Nov 25, 2019

I would like to open UTF-16 encoded CSVs (among others) without preprocessing them in, say, pandas. Is there maybe a way to do this already?

Reporter: Sascha Hofmann / @saschahofmann


Note: This issue was originally created as ARROW-7251. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
The only way to do this is to convert the CSV upfront to UTF-8.
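
A minimal sketch of what that upfront conversion could look like for a small file (the file name and the UTF-16 source encoding are assumptions for illustration; pyarrow.csv.read_csv accepts file-like objects):

import io
from pyarrow import csv

# Decode the whole file from UTF-16 and re-encode it as UTF-8 in memory.
with open("data_utf16.csv", "rb") as f:  # hypothetical input path
    utf8_bytes = f.read().decode("utf-16").encode("utf-8")

# Hand the UTF-8 bytes to the Arrow CSV reader via a file-like object.
table = csv.read_csv(io.BytesIO(utf8_bytes))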


Sascha Hofmann / @saschahofmann:
Do you know whether there are any plans to implement an encoding option on ParseOptions, like pandas has?


Antoine Pitrou / @pitrou:
We don't have any plans currently. We'd rather avoid adding character set conversion machinery to Arrow. Do you know why you have UTF-16 encoded CSVs?


Sascha Hofmann / @saschahofmann:
I think it's a Kaggle dataset. In general, we'd like to ingest all kinds of legacy CSVs. Do you by any chance know how pyarrow's read_csv compares to pandas'?

 

In the meantime, we might simply check for a BOM and fall back to the pandas CSV reader if we find a non-UTF-8 one.
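
A rough sketch of that BOM check and fallback (the helper name and file path are illustrative, and only the UTF-16 BOMs are handled here):

import codecs
import pandas as pd
import pyarrow as pa
from pyarrow import csv

def read_csv_with_fallback(path):
    # Peek at the first bytes to see whether a non-UTF-8 BOM is present.
    with open(path, "rb") as f:
        head = f.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        # Non-UTF-8 BOM: fall back to pandas with an explicit encoding,
        # then convert the resulting DataFrame to an Arrow table.
        return pa.Table.from_pandas(pd.read_csv(path, encoding="utf-16"))
    return csv.read_csv(path)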


Antoine Pitrou / @pitrou:

Do you by any chance know how pyarrow's read_csv compares to pandas'?

That's a vague question. Which aspect are you interested in?

 


Sascha Hofmann / @saschahofmann:
Basically, I could imagine simply switching everything over to use pandas.read_csv and then getting the pyarrow table from there (pa.Table.from_pandas).

The question would be something like: how much time and memory overhead would that add compared to a pure pa.csv.read_csv approach (when possible)? I thought maybe someone had benchmarked that, but I couldn't find anything.
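
For reference, the two code paths being compared would look roughly like this (the file path is illustrative):

import pandas as pd
import pyarrow as pa
from pyarrow import csv

path = "data.csv"  # hypothetical file

# pandas first, then convert the DataFrame to an Arrow table
table_via_pandas = pa.Table.from_pandas(pd.read_csv(path))

# pure pyarrow
table_direct = csv.read_csv(path)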


Antoine Pitrou / @pitrou:
Well, using Arrow directly should generally be much faster than going through Pandas. I'd be curious about your experience.

If disk space is not an issue, you might convert your CSV files to UTF-8 upfront (for example using iconv or even Python).
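
A sketch of that upfront recoding in Python, streaming in chunks so the file does not have to fit in memory (paths are illustrative; iconv -f UTF-16 -t UTF-8 would achieve the same from the shell):

# Stream-convert a UTF-16 CSV into a UTF-8 copy on disk.
with open("data_utf16.csv", "r", encoding="utf-16", newline="") as src, \
     open("data_utf8.csv", "w", encoding="utf-8", newline="") as dst:
    for chunk in iter(lambda: src.read(1 << 20), ""):  # ~1M characters per chunk
        dst.write(chunk)  # line endings are passed through unchanged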


Sascha Hofmann / @saschahofmann:
I just ran a simple timeit notebook for a 30 MB CSV with 7 string columns, one bool and one int column. The differences are quite massive.

Pandas to pyarrow:

314 ms ± 20 ms 

Only pyarrow:

38.7 ms ± 3.15 ms

 

We are trying to hide as much complexity as possible from the user, who might throw in arbitrarily old CSVs. That's why converting upfront is not really an option, but I guess having the pandas reader as a fallback should be sufficient.


Antoine Pitrou / @pitrou:
Thank you for the numbers. Which encodings do you usually encounter?


Sascha Hofmann / @saschahofmann:
As mentioned above, today I had a UTF-16 file. Previously, I encountered a latin-1 encoded CSV, but that was actually fine because only one of the columns had the offending bytes, so only that column was read as binary instead of string (you actually answered me on that one here: https://issues.apache.org/jira/browse/ARROW-6934).


Sascha Hofmann / @saschahofmann:
Back then I thought it might be useful to have different string-encoded column types.


Sascha Hofmann / @saschahofmann:
I have another CSV-related question. How does the CSV reader decide how many chunks the created Arrow table will have? I know I can fix it by giving a block_size in the read options, but the default is None. I observed that sometimes the reader creates a lot of chunks, but it doesn't necessarily seem to scale with file size.


Antoine Pitrou / @pitrou:
The default is not None:

>>> from pyarrow import csv
>>> opts = csv.ReadOptions()
>>> opts.block_size
1048576

By changing this value you can control the size of the chunks. But do note this has an impact on performance, especially in parallel mode.
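
For illustration, a sketch of how block_size affects the chunking of the resulting table (the file path is hypothetical):

from pyarrow import csv

# A larger block size yields fewer, larger chunks in the resulting table.
opts = csv.ReadOptions(block_size=8 * 1024 * 1024)  # 8 MB blocks instead of the 1 MB default
table = csv.read_csv("data.csv", read_options=opts)
print(table.column(0).num_chunks)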


Sascha Hofmann / @saschahofmann:
Ah, good to know. 

Maybe it would be worth noting in the docs
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions
that providing None uses the default of 1 MB.


Wes McKinney / @wesm:
Any further thoughts about this? Perhaps at minimum we should document the expectation about non-UTF-8 CSV files. FTR, I'm not sure many SQL databases will ingest non-UTF-8 CSV files.


Sascha Hofmann / @saschahofmann:
For us, having support for different string encodings would be amazing. That being said, I admit other encodings are rare/dying out, but we stumble upon them once in a while. Of those, I don't know how many use a BOM to identify their encoding. We haven't actually tried it, but we might use pandas as mentioned above in cases where a file has a BOM other than UTF-8's (see comment above). I am not sure how you did the CSV reading in pandas, but I assume it might not be worth going through it again. In the end, it might be best to force people to use UTF-8.


Antoine Pitrou / @pitrou:
cc @saschahofmann Is there anything that prevents you from recoding the CSV file before opening it with Arrow?
(what are your constraints? performance? file size?)

With some care, you could even implement a file-like object in Python that recodes data to UTF-8 on the fly. It should be accepted by csv.read_csv.
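
A rough sketch of that on-the-fly recoding using codecs.StreamRecoder from the standard library (the path and source encoding are assumptions; depending on the pyarrow version, the wrapper may need a few more file-object methods):

import codecs
from pyarrow import csv

# Wrap the raw UTF-16 file so that read() returns UTF-8 bytes instead.
with open("data_utf16.csv", "rb") as raw:
    recoded = codecs.StreamRecoder(
        raw,
        encode=codecs.getencoder("utf-8"),   # what callers of read() receive
        decode=codecs.getdecoder("utf-8"),
        Reader=codecs.getreader("utf-16"),   # how the bytes in the file are decoded
        Writer=codecs.getwriter("utf-16"),
    )
    table = csv.read_csv(recoded)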


Sascha Hofmann / @saschahofmann:
Our current setup allows users to upload CSV files, which we then parse to Arrow. Right now, we are not doing any preprocessing of the CSV file, so we can receive arbitrarily weird files. I will propose your "recode on the fly" suggestion.

 
