DataFrame reset_index() not working with DataFrames created with from_delayed or read_csv methods #3788

Closed
CermakM opened this issue Jul 20, 2018 · 2 comments


CermakM commented Jul 20, 2018

When a Dask DataFrame is created from a regular pandas DataFrame with dd.from_pandas, resetting the index works without problems:

import dask.dataframe as dd
import pandas as pd

a = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]}), chunksize=1000)
b = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]}), chunksize=1000)
c = dd.concat([a, b], interleave_partitions=True)
c = c.reset_index(drop=True)
c.compute()

But with a DataFrame built by dd.from_delayed, the following code produces the same index as the original one:

import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

foo = lambda a, m: pd.DataFrame(np.array([[a**m, a**(2*m), a**(3*m)]] * 3), columns=['m', '2m', '3m'])

tasks = [dask.delayed(foo)(2, m) for m in range(3)]

ddf = dd.from_delayed(tasks)
ddf = ddf.reset_index()  # drop=True does not work either; left as False here so the result shows the old index
ddf.compute()

yields:

   index  m  2m  3m
0      0  1   1   1
1      1  1   1   1
2      2  1   1   1
0      0  2   4   8
1      1  2   4   8
2      2  2   4   8
0      0  4  16  64
1      1  4  16  64
2      2  4  16  64

CermakM commented Jul 23, 2018

Same behavior with read_csv when reading from multiple files:

fpaths: list = [...]
ddf = dd.read_csv(fpaths)
ddf = ddf.reset_index(drop=True)  # also no effect

CermakM changed the title from "DataFrame reset_index() not working with DataFrames created with from_delayed method" to "DataFrame reset_index() not working with DataFrames created with from_delayed or read_csv methods" on Jul 23, 2018

quasiben (Member) commented

@CermakM I think this is working correctly. From the docstring:

Note that unlike in pandas, the reset dask.dataframe index will
not be monotonically increasing from 0. Instead, it will restart at 0
for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]).
This is due to the inability to statically know the full length of the
index.

ddf has three partitions, so reset_index builds a separate [0, 1, 2] index for each partition, three in total.

I'm going to close this for now since it is quite old, but feel free to reopen if you are still having trouble.
