
partitioning and CSV delimiters #1284

Closed
mheilman opened this issue Jun 13, 2016 · 4 comments

Comments

@mheilman
Contributor

It looks like the logic around partitioning CSV files into blocks based on line delimiters can fail when a text column includes (escaped) line delimiters.

For example, if I have a file "foo.csv" like this...

foo,bar
"a",1
"b",2
"c
d
e
",3
"h",4

...then running dd.read_csv("foo.csv", blocksize=8)['foo'].map(lambda x: len(x)).mean().compute() results in this:

/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/base.py in compute(self, **kwargs)
     35 
     36     def compute(self, **kwargs):
---> 37         return compute(self, **kwargs)[0]
     38 
     39     @classmethod

/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/base.py in compute(*args, **kwargs)
    108                 for opt, val in groups.items()])
    109     keys = [var._keys() for var in variables]
--> 110     results = get(dsk, keys, **kwargs)
    111 
    112     results_iter = iter(results)

/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
     55     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     56                         cache=cache, queue=queue, get_id=_thread_get_id,
---> 57                         **kwargs)
     58 
     59     return results

/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
    486                 _execute_task(task, data)  # Re-execute locally
    487             else:
--> 488                 raise(remote_exception(res, tb))
    489         state['cache'][key] = res
    490         finish_task(dsk, key, state, results, keyorder.get)

CParserError: Error tokenizing data. C error: EOF inside string starting at line 1

Traceback
---------
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/dask/dataframe/csv.py", line 41, in bytes_read_csv
    df = pd.read_csv(bio, **kwargs)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/civisemployee/miniconda/envs/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 1213, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 520, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5129)
  File "pandas/parser.pyx", line 671, in pandas.parser.TextReader._get_header (pandas/parser.c:7259)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
  File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)

I can't figure out how to see exactly what the pandas C code is failing on, but I'm pretty sure it's a block that contains something like "...\n with no closing ".

I think the read_block function here just looks for the nearest delimiter character, which might fall inside a quoted string. Working around this might be pretty tricky, since CSV-specific logic would be needed instead of the general-purpose functionality used right now (i.e., read_block).
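To make the failure mode concrete, here is a hypothetical miniature of delimiter-based partitioning (illustrative only, not dask's actual read_block code). It extends each block to the next newline with no awareness of CSV quoting, and the resulting block ends mid-field:

```python
# Hypothetical miniature of delimiter-based partitioning (NOT dask's
# actual read_block implementation): each block is extended to the
# next b'\n', with no awareness of CSV quoting.
data = b'foo,bar\n"a",1\n"b",2\n"c\nd\ne\n",3\n"h",4\n'

def read_block(data, offset, length, delimiter=b'\n'):
    # Start just past the first delimiter at/after `offset`
    # (offset 0 starts at the beginning of the file).
    start = 0 if offset == 0 else data.index(delimiter, offset) + 1
    # Extend the block to include the delimiter after offset + length.
    try:
        end = data.index(delimiter, offset + length) + 1
    except ValueError:
        end = len(data)
    return data[start:end]

block = read_block(data, 8, 13)
print(block)  # b'"b",2\n"c\n'
# The boundary fell inside the quoted field "c\nd\ne\n", so the block
# contains an odd number of quote characters -- an unterminated string,
# which is exactly what the pandas C parser chokes on.
print(block.count(b'"') % 2)  # 1
```

The first block (offset 0) parses fine; only the block whose boundary lands inside the quoted field is broken, which matches the error appearing in one task rather than at graph-construction time.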

Miscellany:

  • I ran into this trying to process a much larger text dataset with some newlines in one of the columns.
  • pd.read_csv parses the small example and my actual dataset just fine.
  • versions: dask 0.9.0, python 3.5, pandas 0.18.1, OS X 10.11.5
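For completeness, the second bullet can be checked with pandas alone (this snippet assumes only that pandas is installed; dask is needed to reproduce the failure itself):

```python
import io
import pandas as pd

# The same small example file as above, with quoted newlines in `foo`.
csv_text = 'foo,bar\n"a",1\n"b",2\n"c\nd\ne\n",3\n"h",4\n'

# pandas handles the quoted line terminators fine when it sees the
# whole file at once, because its tokenizer tracks quoting state.
df = pd.read_csv(io.StringIO(csv_text))
print(len(df))             # 4
print(repr(df['foo'][2]))  # 'c\nd\ne\n'
```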
@mrocklin
Member

Your assessment seems correct to me. Dask.dataframe is just seeking to the
nearest endline without paying attention to quotes.

To be honest I'm not sure how to do this well while still respecting all of
the intricacies of fully complex CSV.


@mheilman
Contributor Author

Hmm... yeah, it seems that, in the worst case, one would have to parse the whole CSV file to get this sort of thing right.

Perhaps the best thing would be to just add a note to read_csv around here... something like, "Note that this function may fail if a CSV file includes quoted strings that contain lineterminator." A sentence like that might have helped me figure out the issue a bit quicker.

Also, maybe the exception could be caught and a more verbose error message could be provided, but it looks a bit tricky to do that because it's raised in a thread or process---and by pandas, not dask.

@mrocklin
Member

A note in the docstring definitely sounds like a good idea. We can also raise errors at runtime. The function dask.dataframe.csv.bytes_read_csv has a pd.read_csv call that we can try-except against. I can get to this later this week. Alternatively if you have a moment either of these would be a very welcome addition to the project.
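A sketch of the try-except idea (a hypothetical wrapper; the names here are illustrative, not dask's actual bytes_read_csv internals):

```python
def parse_block(parse, block, **kwargs):
    # Hypothetical wrapper for the per-block parse call (in dask this
    # would sit around the pd.read_csv call inside bytes_read_csv).
    # It re-raises parser errors with a hint about quoted line
    # terminators, so users aren't left staring at a bare CParserError.
    try:
        return parse(block, **kwargs)
    except Exception as e:
        raise ValueError(
            "Failed to parse a block of the CSV file. This often means "
            "a quoted field contains the line terminator used to "
            "partition the file; try pandas.read_csv directly, or a "
            "larger blocksize. Original error: %s" % e)
```

On Python 3 the original traceback could be preserved with `raise ValueError(...) from e`; since the exception is raised inside a worker thread or process, attaching the hint to the message itself is the part that reliably reaches the user.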

mheilman added a commit to mheilman/dask that referenced this issue Jun 14, 2016
Parsing CSV blocks can fail if a line terminator appears in a quoted value.
This just makes that a little clearer to users since there's probably not an easy solution.

dask#1284
mrocklin pushed a commit that referenced this issue Jun 14, 2016
* added note and verbose exception about CSV parsing errors

Parsing CSV blocks can fail if a line terminator appears in a quoted value.
This just makes that a little clearer to users since there's probably not an easy solution.

#1284

* removed CSV parsing try-except
@kkonevets

kkonevets commented Oct 19, 2016

I have a csv file:
"TrackId","TotalMeters","Speed","PointDate"
310717,0,0,2016-06-20 12:21:58
310717,0,0,2016-06-20 12:22:05
310717,0,0,2016-06-20 12:22:05
310717,0,0,2016-06-20 12:22:05
310717,0,0,2016-06-20 12:22:05
310717,0,0,2016-06-20 12:21:58
310717,0,0,2016-06-20 12:21:58
310717,0,0,2016-06-20 12:21:58

and get this message:
df = pd.read_csv('/home/guyos/raxel/data/drivers_150/5624_points.csv.gz', encoding='cp1251')

df = dd.read_csv('/home/guyos/raxel/data/drivers_150/5624_points.csv.gz', encoding='cp1251')
Traceback (most recent call last):
  File "", line 1, in
    df = dd.read_csv('/home/guyos/raxel/data/drivers_150/5624_points.csv.gz', encoding='cp1251')
  File "/home/guyos/anaconda3/lib/python3.5/site-packages/dask/dataframe/csv.py", line 236, in read_csv
    head = pd.read_csv(BytesIO(sample), **kwargs)
  File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File "/home/guyos/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 1505, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9884)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10142)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10870)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10741)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25878)

CParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 2
But my quoted strings do not contain the line terminator.
