
[Python] Open CSVs with different encodings #23544

Closed
asfimport opened this issue Nov 25, 2019 · 18 comments

asfimport commented Nov 25, 2019

I would like to open UTF-16 encoded CSVs (among others) without preprocessing them in, say, pandas. Is there maybe a way to do this already?

Reporter: Sascha Hofmann / @saschahofmann


Note: This issue was originally created as ARROW-7251. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
The only way to do this is to convert the CSV upfront to UTF-8.
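
A minimal sketch of what that upfront conversion could look like for a small file (the file name and the UTF-16 source encoding are assumptions for illustration; pyarrow.csv.read_csv accepts file-like objects):

import io
from pyarrow import csv

# Decode the whole file from UTF-16 and re-encode it as UTF-8 in memory.
with open("data_utf16.csv", "rb") as f:  # hypothetical input path
    utf8_bytes = f.read().decode("utf-16").encode("utf-8")

# Hand the UTF-8 bytes to the Arrow CSV reader via a file-like object.
table = csv.read_csv(io.BytesIO(utf8_bytes))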


Sascha Hofmann / @saschahofmann:
Do you know whether there are any plans to implement an encoding option on ParseOptions, like pandas has?


Antoine Pitrou / @pitrou:
We don't have any plans currently. We'd rather avoid adding character set conversion machinery to Arrow. Do you know why you have UTF-16 encoded CSVs?


Sascha Hofmann / @saschahofmann:
I think it's a Kaggle dataset. In general, we'd like to ingest all kinds of legacy CSVs. Do you by any chance know how pyarrow's read_csv compares to pandas'?

 

In the meantime, we might simply check for a BOM and fall back to the pandas CSV reader if we find a non-UTF-8 one.
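
A rough sketch of that BOM check and fallback (the helper name and file path are illustrative, and only the UTF-16 BOMs are handled here):

import codecs
import pandas as pd
import pyarrow as pa
from pyarrow import csv

def read_csv_with_fallback(path):
    # Peek at the first bytes to see whether a non-UTF-8 BOM is present.
    with open(path, "rb") as f:
        head = f.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        # Non-UTF-8 BOM: fall back to pandas with an explicit encoding,
        # then convert the resulting DataFrame to an Arrow table.
        return pa.Table.from_pandas(pd.read_csv(path, encoding="utf-16"))
    return csv.read_csv(path)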


Antoine Pitrou / @pitrou:

Do you by any chance know how pyarrow's read_csv compares to pandas'?

That's a vague question. Which aspect are you interested in?

 


Sascha Hofmann / @saschahofmann:
Basically, I could imagine simply switching everything over to use pandas.read_csv and then getting the pyarrow table from there (pa.Table.from_pandas).

The question would be something like: how much time and memory overhead would that add compared to a pure pa.csv.read_csv approach (when possible)? I thought maybe someone had benchmarked that, but I couldn't find anything.
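
For reference, the two code paths being compared would look roughly like this (the file path is illustrative):

import pandas as pd
import pyarrow as pa
from pyarrow import csv

path = "data.csv"  # hypothetical file

# pandas first, then convert the DataFrame to an Arrow table
table_via_pandas = pa.Table.from_pandas(pd.read_csv(path))

# pure pyarrow
table_direct = csv.read_csv(path)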


Antoine Pitrou / @pitrou:
Well, using Arrow directly should generally be much faster than going through Pandas. I'd be curious about your experience.

If disk space is not an issue, you might convert your CSV files to UTF-8 upfront (for example using iconv or even Python).
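
A sketch of that upfront recoding in Python, streaming in chunks so the file does not have to fit in memory (paths are illustrative; iconv -f UTF-16 -t UTF-8 would achieve the same from the shell):

# Stream-convert a UTF-16 CSV into a UTF-8 copy on disk.
with open("data_utf16.csv", "r", encoding="utf-16", newline="") as src, \
     open("data_utf8.csv", "w", encoding="utf-8", newline="") as dst:
    for chunk in iter(lambda: src.read(1 << 20), ""):  # ~1M characters per chunk
        dst.write(chunk)  # line endings are passed through unchanged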


Sascha Hofmann / @saschahofmann:
I just ran a simple timeit notebook for a 30 MB CSV with 7 string columns, one bool and one int column. The differences are quite massive.

Pandas to pyarrow:

314 ms ± 20 ms 

Only pyarrow:

38.7 ms ± 3.15 ms

 

We are trying to hide as much complexity as possible from the user, who might throw in arbitrarily old CSVs. That's why converting upfront is not really an option, but I guess having the pandas reader as a fallback should be sufficient.


Antoine Pitrou / @pitrou:
Thank you for the numbers. Which encodings do you usually encounter?


Sascha Hofmann / @saschahofmann:
As mentioned above, today I had a UTF-16 file. Previously, I encountered a latin-1 encoded CSV, but that was actually fine because only one of the columns had the offending bytes, so only that column was read as binary instead of string (you actually answered me on that one here: https://issues.apache.org/jira/browse/ARROW-6934).


Sascha Hofmann / @saschahofmann:
Back then I thought it might be useful to have different string-encoded column types.


Sascha Hofmann / @saschahofmann:
I have another CSV-related question. How does the CSV reader decide how many chunks the created Arrow table will have? I know I can fix it by giving a block_size in the read options, but the default is None. I observed that sometimes the reader creates a lot of chunks, but it doesn't necessarily seem to scale with file size.


Antoine Pitrou / @pitrou:
The default is not None:

>>> from pyarrow import csv
>>> opts = csv.ReadOptions()
>>> opts.block_size
1048576

By changing this value you can control the size of the chunks. But do note this has an impact on performance, especially in parallel mode.
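
For illustration, a sketch of how block_size affects the chunking of the resulting table (the file path is hypothetical):

from pyarrow import csv

# A larger block size yields fewer, larger chunks in the resulting table.
opts = csv.ReadOptions(block_size=8 * 1024 * 1024)  # 8 MB blocks instead of the 1 MB default
table = csv.read_csv("data.csv", read_options=opts)
print(table.column(0).num_chunks)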


Sascha Hofmann / @saschahofmann:
Ah, good to know. 

Maybe it would be worth noting in the docs
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions
that providing None uses the default of 1 MB.


Wes McKinney / @wesm:
Any further thoughts about this? Perhaps at minimum we should document the expectation about non-UTF-8 CSV files. FTR, I'm not sure many SQL databases will ingest non-UTF-8 CSV files.


Sascha Hofmann / @saschahofmann:
For us, having support for different string encodings would be amazing. That being said, I admit other encodings are rare/dying out, but we stumble upon them once in a while. Of those, I don't know how many use a BOM to identify their encoding. We haven't actually tried it, but we might use pandas as mentioned above in cases where a file has a BOM other than UTF-8's (see comment above). I am not sure how you did the CSV reading in pandas, but I assume it might not be worth going through it again. In the end, it might be best to force people to use UTF-8.


Antoine Pitrou / @pitrou:
cc @saschahofmann Is there anything that prevents you from recoding the CSV file before opening it with Arrow?
(what are your constraints? performance? file size?)

With some care, you could even implement a file-like object in Python that recodes data to UTF-8 on the fly. It should be accepted by csv.read_csv.
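
A rough sketch of that on-the-fly recoding using codecs.StreamRecoder from the standard library (the path and source encoding are assumptions; depending on the pyarrow version, the wrapper may need a few more file-object methods):

import codecs
from pyarrow import csv

# Wrap the raw UTF-16 file so that read() returns UTF-8 bytes instead.
with open("data_utf16.csv", "rb") as raw:
    recoded = codecs.StreamRecoder(
        raw,
        encode=codecs.getencoder("utf-8"),   # what callers of read() receive
        decode=codecs.getdecoder("utf-8"),
        Reader=codecs.getreader("utf-16"),   # how the bytes in the file are decoded
        Writer=codecs.getwriter("utf-16"),
    )
    table = csv.read_csv(recoded)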


Sascha Hofmann / @saschahofmann:
Our current setup allows users to upload CSV files, which we then parse to Arrow. Right now, we are not doing any preprocessing of the CSV file, so we can receive arbitrarily weird files. I will propose your "recode on the fly" suggestion.

 
