DataFrame reset_index() not working with DataFrames created with from_delayed or read_csv methods #3788

CermakM · 2018-07-20T10:26:13Z

When creating Dask DataFrames from ordinary pandas DataFrames, resetting the index works without problems:

import pandas as pd
import dask.dataframe as dd

a = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]}), chunksize=1000)
b = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]}), chunksize=1000)
c = dd.concat([a, b], interleave_partitions=True)
c = c.reset_index(drop=True)
c.compute()

But the following code produces the same index as the original:

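import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

foo = lambda a, m: pd.DataFrame(np.array([[a**m, a**(2*m), a**(3*m)]] * 3), columns=['m', '2m', '3m'])

# one delayed task per partition, so ddf ends up with three partitions
tasks = [dask.delayed(foo)(2, m) for m in range(3)]

ddf = dd.from_delayed(tasks)
ddf = ddf.reset_index()  # drop=True does not work either; left as False here so the result shows the old index
ddf.compute()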

yields:

	index	m	2m	3m
0	0	1	1	1
1	1	1	1	1
2	2	1	1	1
0	0	2	4	8
1	1	2	4	8
2	2	2	4	8
0	0	4	16	64
1	1	4	16	64
2	2	4	16	64


CermakM · 2018-07-23T08:08:10Z

Same behavior with read_csv when reading from multiple files:

fpaths: list = [...]
ddf = dd.read_csv(fpaths)
ddf = ddf.reset_index(drop=True)  # also no effect

quasiben · 2019-04-30T16:31:45Z

@CermakM I think this is working correctly. From the docstring:

Note that unlike in pandas, the reset dask.dataframe index will
not be monotonically increasing from 0. Instead, it will restart at 0
for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]).
This is due to the inability to statically know the full length of the
index.

ddf has three partitions, so reset_index builds a separate [0, 1, 2] index for each of them.

I'm going to close this for now since it is quite old, but feel free to reopen if you are still having trouble.

CermakM changed the title ~~DataFrame reset_index() not working with DataFrames created with from_delayed method~~ DataFrame reset_index() not working with DataFrames created with from_delayed or read_csv methods Jul 23, 2018

quasiben closed this as completed Apr 30, 2019
