partitioning and CSV delimiters #1284
Your assessment seems correct to me. Dask.dataframe is just seeking to the ... To be honest I'm not sure how to do this well while still respecting all of ...

On Mon, Jun 13, 2016 at 2:22 PM, Michael Heilman notifications@github.com wrote:
Hmm... yeah, it seems that, in the worst case, one would need to parse the whole CSV file to get this sort of thing right. Perhaps the best thing would be to just add a note to ...

Also, maybe the exception could be caught and a more verbose error message could be provided, but it looks a bit tricky to do that because it's raised in a thread or process (and by pandas, not dask).
A note in the docstring definitely sounds like a good idea. We can also raise errors at runtime. The function ...
Parsing CSV blocks can fail if a line terminator appears in a quoted value. This just makes that a little clearer to users since there's probably not an easy solution. dask#1284

* added note and verbose exception about CSV parsing errors
* removed CSV parsing try-except
I have a csv file:

df = pd.read_csv('/home/guyos/raxel/data/drivers_150/5624_points.csv.gz', encoding='cp1251')
df = dd.read_csv('/home/guyos/raxel/data/drivers_150/5624_points.csv.gz', encoding='cp1251')

and get this message:

File "", line 1, in
File "/home/guyos/anaconda3/lib/python3.5/site-packages/dask/dataframe/csv.py", line 236, in read_csv
File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 645, in parser_f
File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 400, in _read
File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 938, in read
File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 1505, in read
File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9884)
File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10142)
File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10870)
File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10741)
File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25878)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 2
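The failure mode this thread diagnoses can be illustrated without dask or any particular file (a sketch with made-up data, not the reporter's dataset): pandas handles a quoted line terminator when it sees the whole stream, but a byte block cut at that newline is not valid CSV on its own.

```python
import io

import pandas as pd

# A file whose first data value contains a quoted newline.
csv_text = 'foo\n"a\nb"\n"c"\n'

# Parsing the whole thing works: the newline inside quotes is preserved.
whole = pd.read_csv(io.StringIO(csv_text))
print(list(whole["foo"]))  # ['a\nb', 'c']

# Now pretend a block boundary fell at the newline inside the quotes,
# the way a byte-oriented splitter would place it.
block = csv_text[:7]  # 'foo\n"a\n' -- the quote is never closed
try:
    pd.read_csv(io.StringIO(block))
except ValueError as e:
    print("block alone fails:", type(e).__name__)
```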
It looks like the logic around partitioning CSV files into blocks based on line delimiters can fail when a text column includes (escaped) line delimiters.
For example, if I have a file "foo.csv" like this...
...then running
dd.read_csv("foo.csv", blocksize=8)['foo'].map(lambda x: len(x)).mean().compute()
results in this:

I can't figure out how to see exactly what the pandas C code is failing on, but I'm pretty sure it's a block where there is something like "...\n with no closing ".

I think it's the case that the read_block function here just looks for the nearest delimiter character, which might be in a quoted string. Working around this might be pretty tricky, since CSV-specific logic would be needed instead of the general-purpose functionality used right now (i.e., read_block).

Miscellany: pd.read_csv parses the small example and my actual dataset just fine.