Bug writing to parquet #3660
Thanks for the bug report, @henriqueribeiro. I suspect it will be easier for people to resolve if you can provide a minimal reproducible example: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
Hey @mrocklin, here is a minimal reproducible example:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'time': pd.date_range('1980-01-01', periods=20, freq='1min'),
                   'data1': np.random.normal(size=20),
                   'data2': np.random.normal(size=20),
                   })
df = df.set_index('time')

ddf = dd.from_pandas(df, npartitions=2)
ddf.to_parquet('dummy_df')

ddf2 = dd.read_parquet('dummy_df')
ddf2 = ddf2.resample('1min').mean()
ddf2.to_parquet('new_df')
```

I get a ValueError when writing `ddf2.to_parquet('new_df')`:

```
ValueError                                Traceback (most recent call last)
<ipython-input-15-beb91111f173> in <module>()
1 ddf2 = dd.read_parquet('dummy_df')
2 ddf2 = ddf2.resample('1min').mean()
----> 3 ddf2.to_parquet('new_df')
/opt/conda/lib/python3.6/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)
1084 """ See dd.to_parquet docstring for more information """
1085 from .io import to_parquet
-> 1086 return to_parquet(self, path, *args, **kwargs)
1087
1088 def to_csv(self, filename, **kwargs):
/opt/conda/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in to_parquet(df, path, engine, compression, write_index, append, ignore_divisions, partition_on, storage_options, compute, **kwargs)
1076
1077 if compute:
-> 1078 out.compute()
1079 return None
1080 return out
/opt/conda/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
154 dask.base.compute
155 """
--> 156 (result,) = compute(self, traverse=False, **kwargs)
157 return result
158
/opt/conda/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
400 keys = [x.__dask_keys__() for x in collections]
401 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 402 results = schedule(dsk, keys, **kwargs)
403 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
404
/opt/conda/lib/python3.6/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
73 results = get_async(pool.apply_async, len(pool._pool), dsk, result,
74 cache=cache, get_id=_thread_get_id,
---> 75 pack_exception=pack_exception, **kwargs)
76
77 # Cleanup pools associated to dead threads
/opt/conda/lib/python3.6/site-packages/dask/local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
519 _execute_task(task, data) # Re-execute locally
520 else:
--> 521 raise_exception(exc, tb)
522 res, worker_id = loads(res_info)
523 state['cache'][key] = res
/opt/conda/lib/python3.6/site-packages/dask/compatibility.py in reraise(exc, tb)
67 if exc.__traceback__ is not tb:
68 raise exc.with_traceback(tb)
---> 69 raise exc
70
71 else:
/opt/conda/lib/python3.6/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
288 try:
289 task, data = loads(task_info)
--> 290 result = _execute_task(task, data)
291 id = get_id()
292 result = dumps((result, id))
/opt/conda/lib/python3.6/site-packages/dask/local.py in _execute_task(arg, cache, dsk)
269 func, args = arg[0], arg[1:]
270 args2 = [_execute_task(a, cache) for a in args]
--> 271 return func(*args2)
272 elif not ishashable(arg):
273 return arg
/opt/conda/lib/python3.6/site-packages/dask/dataframe/core.py in apply_and_enforce(func, args, kwargs, meta)
3561 if not np.array_equal(np.nan_to_num(meta.columns),
3562 np.nan_to_num(df.columns)):
-> 3563 raise ValueError("The columns in the computed data do not match"
3564 " the columns in the provided metadata")
3565 else:
ValueError: The columns in the computed data do not match the columns in the provided metadata
```

Versions: I used dask 0.18.0 and, as said before, this doesn't happen when using dask version 0.17.5.
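Until this is fixed, one possible user-side workaround is to restore the index name on every partition so the computed data matches dask's metadata again. This is an untested sketch built on the example above (where 'time' is the index name), not a fix proposed in this thread:

```python
import dask.dataframe as dd

ddf2 = dd.read_parquet('dummy_df')
ddf2 = ddf2.resample('1min').mean()
# rename_axis returns a copy with the index name set; map_partitions applies
# it to every partition and to the metadata, so the two stay consistent.
ddf2 = ddf2.map_partitions(lambda df: df.rename_axis('time'))
ddf2.to_parquet('new_df')
```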
This does not appear to have anything to do with parquet. Somehow, in the dask version, the index loses its name.
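A quick way to observe the dropped name without involving parquet at all (a sketch added for illustration, assuming dask 0.18.0 as in the report):

```python
import pandas as pd
import dask.dataframe as dd

idx = pd.date_range('1980-01-01', periods=4, freq='1min', name='time')
df = pd.DataFrame({'x': range(4)}, index=idx)
ddf = dd.from_pandas(df, npartitions=2)

print(df.resample('1min').mean().index.name)             # 'time' in pandas
print(ddf.resample('1min').mean().compute().index.name)  # None under dask 0.18.0
```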
@TomAugspurger, perhaps?
Maybe a pandas bug with `reindex`:

```python
In [12]: idx = pd.date_range('1980-01-01T00:00:00', '1980-01-01T00:10:00', freq='2T', name='name')

In [13]: idx
Out[13]:
DatetimeIndex(['1980-01-01 00:00:00', '1980-01-01 00:02:00',
               '1980-01-01 00:04:00', '1980-01-01 00:06:00',
               '1980-01-01 00:08:00', '1980-01-01 00:10:00'],
              dtype='datetime64[ns]', name='name', freq='2T')

In [14]: idx.reindex(pd.date_range('1980-01-01T00:00:00', '1980-01-01T00:10:00', freq='T'))[0].name
# None
```
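If that is indeed the culprit, the name can be restored explicitly after the reindex. A small illustrative sketch of the pattern (not the actual dask fix), reusing `idx` from the session above:

```python
# Index.reindex returns (new_index, indexer); the new index comes back
# without a name, so rename it back from the original.
new_idx, _ = idx.reindex(pd.date_range('1980-01-01T00:00:00',
                                       '1980-01-01T00:10:00', freq='T'))
new_idx = new_idx.rename(idx.name)
print(new_idx.name)  # 'name'
```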
pandas issue: pandas-dev/pandas#9885. In short, it's not clear whether pandas will change that behavior. @henriqueribeiro, you're hitting dask/dask/dataframe/tseries/resample.py, lines 37 to 42 (at commit a95be8d).
Sure! I will work on it. Give me a few days.
Hello,

Running the code above on dask version 0.17.5, everything works fine, but running it with dask version 0.18.0 fails with `ValueError: The columns in the computed data do not match the columns in the provided metadata`.

Edit: See the third comment for a minimal reproducible example.