[Python] Open CSVs with different encodings #23544
Comments
Antoine Pitrou / @pitrou: |
Sascha Hofmann / @saschahofmann: |
Antoine Pitrou / @pitrou: |
Sascha Hofmann / @saschahofmann:
In the meantime, we might simply check for a BOM and fall back to the pandas CSV reader if we find a non-UTF-8 one. |
That's a vague question. Which aspect are you interested in?
|
Sascha Hofmann / @saschahofmann: The question would be something like: how much time and memory overhead would that add compared to a pure pa.csv.read_csv approach (when possible)? I thought someone might have benchmarked that, but I couldn't find anything. |
Antoine Pitrou / @pitrou: If disk space is not an issue, you might convert your CSV files to UTF-8 upfront (for example using iconv or even Python). |
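The upfront conversion suggested above can be sketched in plain Python. This is only a minimal illustration using the standard library: the function name `transcode_to_utf8` and its parameters are hypothetical, and the source encoding is assumed to be known in advance.

```python
import shutil


def transcode_to_utf8(src_path, dst_path, src_encoding, chunk_chars=1 << 20):
    """Rewrite a text file as UTF-8, streaming so large files are not loaded whole."""
    # newline="" disables newline translation, so line endings are copied unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # In text mode the chunk length is counted in characters, not bytes.
        shutil.copyfileobj(src, dst, chunk_chars)
```

The converted file costs extra disk space, but it can then be read by any reader that expects UTF-8 input.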
Sascha Hofmann / @saschahofmann:
Pandas to pyarrow: 314 ms ± 20 ms
Only pyarrow: 38.7 ms ± 3.15 ms
We are trying to hide as much complexity as possible from the user, who might throw in arbitrarily old CSVs. That's why converting upfront is not really an option, but I guess having the pandas reader as a fallback should be sufficient. |
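The BOM check mentioned earlier in the thread could look roughly like the sketch below. It uses only the standard `codecs` module; the helper names `sniff_bom` and `needs_fallback` are hypothetical, and a file without any BOM is assumed here to be UTF-8, which will not hold for every legacy CSV.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head: bytes):
    """Return a Python codec name if `head` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def needs_fallback(head: bytes) -> bool:
    """True when the BOM indicates an encoding other than UTF-8."""
    return sniff_bom(head) not in (None, "utf-8-sig")
```

Passing the first four bytes of the file to `needs_fallback` is enough; when it returns True, the caller would hand the file to the fallback reader together with the codec name from `sniff_bom`.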
Antoine Pitrou / @pitrou: |
Sascha Hofmann / @saschahofmann: |
Sascha Hofmann / @saschahofmann: |
Sascha Hofmann / @saschahofmann: |
Antoine Pitrou / @pitrou: >>> from pyarrow import csv
>>> opts = csv.ReadOptions()
>>> opts.block_size
1048576
By changing this value you control the size of the chunks. Note, however, that this has an impact on performance, especially in parallel mode. |
Sascha Hofmann / @saschahofmann: It might be worth noting in the docs that passing None uses the default of 1 MB. |
Wes McKinney / @wesm: |
Sascha Hofmann / @saschahofmann: |
Antoine Pitrou / @pitrou: With some care, you could even implement a file-like object in Python that recodes data to UTF-8 on the fly. It should be accepted by |
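A file-like object of the kind described above might be sketched as follows. This is an illustration only, built on the standard `io` and `codecs` modules; the class name `Utf8Recoder` is hypothetical, and whether a given CSV reader accepts such an object has not been verified here.

```python
import codecs
import io


class Utf8Recoder(io.RawIOBase):
    """Read-only binary stream that yields UTF-8 bytes from a differently encoded source."""

    def __init__(self, raw, encoding, chunk_size=1 << 16):
        super().__init__()
        self._raw = raw
        # An incremental decoder copes with characters split across chunk boundaries.
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self._chunk_size = chunk_size
        self._pending = b""
        self._eof = False

    def readable(self):
        return True

    def readinto(self, buffer):
        # Refill until there is output to hand over or the source is exhausted.
        while not self._pending and not self._eof:
            chunk = self._raw.read(self._chunk_size)
            self._eof = not chunk
            text = self._decoder.decode(chunk, final=self._eof)
            self._pending = text.encode("utf-8")
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n  # 0 signals end of stream
```

Wrapping it as `io.BufferedReader(Utf8Recoder(open(path, "rb"), "utf-16"))` gives a buffered binary stream of UTF-8 data. Because the recoding runs in Python, it adds overhead compared with reading a file that is already UTF-8.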
Sascha Hofmann / @saschahofmann:
|
I would like to open UTF-16 encoded CSVs (among others) without preprocessing them in, let's say, pandas. Is there already a way to do this?
Reporter: Sascha Hofmann / @saschahofmann
Related issues:
Note: This issue was originally created as ARROW-7251. Please see the migration documentation for further details.