Unexpected ParserError when loading data with dask.dataframe.read_csv() #7680
Comments
Thanks for raising an issue @paigem! It looks like the issue is arising when we pass a sample of the data to pandas so we can infer dtypes.
Ah, after some prodding, I get it: the sample terminates on a newline, but inside a quoted string. This should be fixable by setting …, and the size of the preamble is being included in the apparent size. So there are two issues.
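The failure mode described above can be reproduced with pandas alone. This is a minimal sketch using made-up data (not the reporter's file): truncating a CSV part-way through a quoted field, the way a fixed-size byte sample can, makes the pandas C parser raise the same "EOF inside string" class of error seen in the tracebacks below.

```python
import io
import pandas as pd

# Illustrative CSV with a newline embedded inside a quoted field.
data = b'a,b\n1,"hello\nworld"\n2,"ok"\n'

# A byte-based sample that happens to stop inside the quoted field,
# mimicking how a fixed-size sample can cut a row in half.
sample = data[:12]  # b'a,b\n1,"hello'

try:
    pd.read_csv(io.BytesIO(sample))
except pd.errors.ParserError as e:
    print("ParserError:", e)  # same class of error as in the tracebacks
```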
Hello everyone. I'm getting the same error as @paigem's while trying to export a … . I can't provide the sample file as it has sensitive data, but I can generate an example later if needed. I already tried @martindurant's tip of setting … .

import dask.dataframe as dd
df = dd.read_csv(filename)
df.to_parquet('test.parquet')

And here is the ParserError:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
in
      1 df = dd.read_csv(filename)
----> 2 df.to_parquet('test.parquet')
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, **kwargs)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/base.py in compute(self, **kwargs)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/local.py in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/local.py in reraise(exc, tb)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/optimization.py in call(self, *args)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/core.py in get(dsk, out, cache)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/dataframe/io/csv.py in call(self, part)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/dask/dataframe/io/csv.py in pandas_read_text(reader, b, header, kwargs, dtypes, columns, write_header, enforce, path)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/pandas/io/parsers.py in read(self, nrows)
~/.virtualenvs/bigdata/lib/python3.8/site-packages/pandas/io/parsers.py in read(self, nrows)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: EOF inside string starting at row 193429

Environment info
I wonder if, instead of getting a …
Hmm, that sounds sensible. Would that require a more recent version of pandas that uses …?
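The two comments above are truncated in this capture, but they appear to discuss sampling by rows rather than by bytes (an assumption on my part). As a sketch of why that sidesteps the problem, pandas' long-standing nrows parameter reads a fixed number of rows while handling quoting as it goes, so an embedded newline cannot split the sample mid-field:

```python
import io
import pandas as pd

data = b'a,b\n1,"hello\nworld"\n2,"ok"\n'

# Sampling a number of rows instead of a number of bytes: pandas
# tokenizes quoted fields as it reads, so the embedded "\n" stays
# inside the field rather than truncating the sample.
head = pd.read_csv(io.BytesIO(data), nrows=1)
print(head)  # one complete, correctly quoted row
```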
What happened:
Got an unexpected ParserError when loading a csv file via dask.dataframe.read_csv(). However, loading the file directly with pandas.read_csv() and then converting to a dask DataFrame via dask.dataframe.from_pandas() runs successfully.

What you expected to happen:
I expect dask.dataframe.read_csv() to successfully load the data if pandas.read_csv() is able to.

Minimal Complete Verifiable Example:
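The example itself did not survive in this capture. As a hedged sketch of the workaround the reporter describes (pandas first, dask second), the snippet below uses illustrative in-memory data, not the reporter's file; pandas parses the whole input at once, so quoted newlines are never broken. The dask conversion is shown commented out so the snippet depends only on pandas.

```python
import io
import pandas as pd

# Illustrative data with a newline inside a quoted field.
csv_bytes = b'a,b\n1,"hello\nworld"\n2,"ok"\n'

# pandas reads the whole input in one pass, so quoting is handled:
pdf = pd.read_csv(io.BytesIO(csv_bytes))
print(pdf.shape)  # (2, 2)

# The workaround then converts to a dask DataFrame
# (assumes dask is installed):
# import dask.dataframe as dd
# ddf = dd.from_pandas(pdf, npartitions=1)
```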
Anything else we need to know?:
This bug was found while running a dask notebook tutorial in Pangeo Tutorial Gallery, which runs on Pangeo Binder. This issue was originally reported here.
The error can be found below:
ParserError
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
in
6
7 # blocksize=None means use a single partition
----> 8 df = dd.read_csv(server+query, blocksize=None)
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
578 storage_options=storage_options,
579 include_path_column=include_path_column,
--> 580 **kwargs,
581 )
582
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
444
445 # Use sample to infer dtypes and check for presence of include_path_column
--> 446 head = reader(BytesIO(b_sample), **kwargs)
447 if include_path_column and (include_path_column in head.columns):
448 raise ValueError(
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.name = name
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
452
453 try:
--> 454 data = parser.read(nrows)
455 finally:
456 parser.close()
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
1131 def read(self, nrows=None):
1132 nrows = _validate_integer("nrows", nrows)
-> 1133 ret = self._engine.read(nrows)
1134
1135 # May alter columns / col_dict
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
2035 def read(self, nrows=None):
2036 try:
-> 2037 data = self._reader.read(nrows)
2038 except StopIteration:
2039 if self._first_chunk:
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: EOF inside string starting at row 172
Environment: