
partitioning and CSV delimiters #1284

Closed
mheilman opened this issue Jun 13, 2016 · 4 comments

Comments

@mheilman
Contributor

It looks like the logic around partitioning CSV files into blocks based on line delimiters can fail when a text column includes (escaped) line delimiters.

For example, if I have a file "foo.csv" like this...

foo,bar
"a",1
"b",2
"c
d
e
",3
"h",4

...then running dd.read_csv("foo.csv", blocksize=8)['foo'].map(lambda x: len(x)).mean().compute() results in this:

/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/base.py in compute(self, **kwargs)
     35 
     36     def compute(self, **kwargs):
---> 37         return compute(self, **kwargs)[0]
     38 
     39     @classmethod

/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/base.py in compute(*args, **kwargs)
    108                 for opt, val in groups.items()])
    109     keys = [var._keys() for var in variables]
--> 110     results = get(dsk, keys, **kwargs)
    111 
    112     results_iter = iter(results)

/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
     55     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     56                         cache=cache, queue=queue, get_id=_thread_get_id,
---> 57                         **kwargs)
     58 
     59     return results

/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
    486                 _execute_task(task, data)  # Re-execute locally
    487             else:
--> 488                 raise(remote_exception(res, tb))
    489         state['cache'][key] = res
    490         finish_task(dsk, key, state, results, keyorder.get)

CParserError: Error tokenizing data. C error: EOF inside string starting at line 1

Traceback
---------
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/dataframe/csv.py", line 41, in bytes_read_csv
    df = pd.read_csv(bio, **kwargs)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 1213, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 520, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5129)
  File "pandas/parser.pyx", line 671, in pandas.parser.TextReader._get_header (pandas/parser.c:7259)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
  File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)

I can't figure out how to see exactly what the pandas C code is failing on, but I'm pretty sure it's a block that contains something like "...\n with no closing ".

I think the read_block function here just looks for the nearest delimiter character, which might fall inside a quoted string. Working around this might be pretty tricky, since CSV-specific logic would be needed instead of the general-purpose functionality used right now (i.e., read_block).
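To make the failure mode concrete, here is a hypothetical miniature of delimiter-based partitioning (illustrative only, not dask's actual read_block code). It extends each block to the next newline with no awareness of CSV quoting, and the resulting block ends mid-field:

```python
# Hypothetical miniature of delimiter-based partitioning (NOT dask's
# actual read_block implementation): each block is extended to the
# next b'\n', with no awareness of CSV quoting.
data = b'foo,bar\n"a",1\n"b",2\n"c\nd\ne\n",3\n"h",4\n'

def read_block(data, offset, length, delimiter=b'\n'):
    # Start just past the first delimiter at/after `offset`
    # (offset 0 starts at the beginning of the file).
    start = 0 if offset == 0 else data.index(delimiter, offset) + 1
    # Extend the block to include the delimiter after offset + length.
    try:
        end = data.index(delimiter, offset + length) + 1
    except ValueError:
        end = len(data)
    return data[start:end]

block = read_block(data, 8, 13)
print(block)  # b'"b",2\n"c\n'
# The boundary fell inside the quoted field "c\nd\ne\n", so the block
# contains an odd number of quote characters -- an unterminated string,
# which is exactly what the pandas C parser chokes on.
print(block.count(b'"') % 2)  # 1
```

The first block (offset 0) parses fine; only the block whose boundary lands inside the quoted field is broken, which matches the error appearing in one task rather than at graph-construction time.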

Miscellany:

  • I ran into this trying to process a much larger text dataset with some newlines in one of the columns.
  • pd.read_csv parses the small example and my actual dataset just fine.
  • versions: dask 0.9.0, python 3.5, pandas 0.18.1, OS X 10.11.5
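For completeness, the second bullet can be checked with pandas alone (this snippet assumes only that pandas is installed; dask is needed to reproduce the failure itself):

```python
import io
import pandas as pd

# The same small example file as above, with quoted newlines in `foo`.
csv_text = 'foo,bar\n"a",1\n"b",2\n"c\nd\ne\n",3\n"h",4\n'

# pandas handles the quoted line terminators fine when it sees the
# whole file at once, because its tokenizer tracks quoting state.
df = pd.read_csv(io.StringIO(csv_text))
print(len(df))             # 4
print(repr(df['foo'][2]))  # 'c\nd\ne\n'
```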
@mrocklin
Member

Your assessment seems correct to me. Dask.dataframe is just seeking to the
nearest endline without paying attention to quotes.

To be honest I'm not sure how to do this well while still respecting all of
the intricacies of fully complex CSV.


@mheilman
Contributor Author

Hmm... yeah, it seems that, in the worst case, one would have to parse the whole CSV file to get this sort of thing right.

Perhaps the best thing would be to just add a note to read_csv around here... something like, "Note that this function may fail if a CSV file includes quoted strings that contain lineterminator." A sentence like that might have helped me figure out the issue a bit quicker.

Also, maybe the exception could be caught and a more verbose error message could be provided, but it looks a bit tricky to do that because it's raised in a thread or process---and by pandas, not dask.

@mrocklin
Member

A note in the docstring definitely sounds like a good idea. We can also raise errors at runtime. The function dask.dataframe.csv.bytes_read_csv has a pd.read_csv call that we can try-except against. I can get to this later this week. Alternatively if you have a moment either of these would be a very welcome addition to the project.
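A sketch of the try-except idea (a hypothetical wrapper; the names here are illustrative, not dask's actual bytes_read_csv internals):

```python
def parse_block(parse, block, **kwargs):
    # Hypothetical wrapper for the per-block parse call (in dask this
    # would sit around the pd.read_csv call inside bytes_read_csv).
    # It re-raises parser errors with a hint about quoted line
    # terminators, so users aren't left staring at a bare CParserError.
    try:
        return parse(block, **kwargs)
    except Exception as e:
        raise ValueError(
            "Failed to parse a block of the CSV file. This often means "
            "a quoted field contains the line terminator used to "
            "partition the file; try pandas.read_csv directly, or a "
            "larger blocksize. Original error: %s" % e)
```

On Python 3 the original traceback could be preserved with `raise ValueError(...) from e`; since the exception is raised inside a worker thread or process, attaching the hint to the message itself is the part that reliably reaches the user.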

mheilman added a commit to mheilman/dask that referenced this issue Jun 14, 2016
Parsing CSV blocks can fail if a line terminator appears in a quoted value.
This just makes that a little clearer to users since there's probably not an easy solution.

dask#1284
mrocklin pushed a commit that referenced this issue Jun 14, 2016
* added note and verbose exception about CSV parsing errors

Parsing CSV blocks can fail if a line terminator appears in a quoted value.
This just makes that a little clearer to users since there's probably not an easy solution.

#1284

* removed CSV parsing try-except
@kkonevets

kkonevets commented Oct 19, 2016

I have a csv file:
"TrackId","TotalMeters","Speed","PointDate"
310717,0,0,2016-06-20 12:21:58
310717,0,0,2016-06-20 12:22:05
310717,0,0,2016-06-20 12:22:05
310717,0,0,2016-06-20 12:22:05
310717,0,0,2016-06-20 12:22:05
310717,0,0,2016-06-20 12:21:58
310717,0,0,2016-06-20 12:21:58
310717,0,0,2016-06-20 12:21:58

and get this message:
df = pd.read_csv('/home/guyos/raxel/data/drivers_150/5624_points.csv.gz', encoding='cp1251')

df = dd.read_csv('/home/guyos/raxel/data/drivers_150/5624_points.csv.gz', encoding='cp1251')
Traceback (most recent call last):
  File "", line 1, in
    df = dd.read_csv('/home/guyos/raxel/data/drivers_150/5624_points.csv.gz', encoding='cp1251')
  File "/home/guyos/anaconda3/lib/python3.5/site-packages/dask/dataframe/csv.py", line 236, in read_csv
    head = pd.read_csv(BytesIO(sample), **kwargs)
  File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 1505, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9884)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10142)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10870)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10741)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25878)

CParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 2
But my quoted strings do not contain the line terminator.
