Revise divisions logic in `from_pandas` (#9221)
Conversation
Yikes - It looks like a lot of tests start to fail when …
It looks like the next steps are the following:
More about Step 1: The current version of `sorted_division_locations` includes the following docstring examples:

```python
"""
Examples
--------
>>> L = ['A', 'B', 'C', 'D', 'E', 'F']
>>> sorted_division_locations(L, chunksize=2)
(['A', 'C', 'E', 'F'], [0, 2, 4, 6])
>>> sorted_division_locations(L, chunksize=3)
(['A', 'D', 'F'], [0, 3, 6])
>>> L = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C']
>>> sorted_division_locations(L, chunksize=3)
(['A', 'B', 'C'], [0, 4, 8])
>>> sorted_division_locations(L, chunksize=2)
(['A', 'B', 'C'], [0, 4, 8])
>>> sorted_division_locations(['A'], chunksize=2)
(['A', 'A'], [0, 1])
"""
```

While the first two examples seem correct to me, the second two do not: the locations `[0, 4, 8]` imply partitions of length 4 whether `chunksize` is 2 or 3. For example, the existing logic will break …
More about Steps 2 & 3: This PR changes the behavior of many tests so that DataFrame collections being generated with …
This is looking great! I haven't been able to make it do anything that seems wrong in playing around.
I noticed that …
@jrbourbeau @jsignell @ian-r-rose - I think this PR is ready for a final review. The takeaway is that … My personal opinion is that a minor performance sacrifice is worth having "correct" behavior.
ian-r-rose left a comment:
Thank you for your patience on this @rjzamora!
I took a bit of a stochastic approach to trying this out, generating 10,000 random indexes with varying levels of duplication, and I wasn't able to make anything break:
```python
import string
import random

import pandas as pd

import dask.dataframe as dd
from dask.dataframe.core import check_divisions


def check_from_pandas(chunk: bool):
    idx = []
    for _ in range(100):
        idx.extend(random.randint(0, 10) * [random.choice(string.ascii_uppercase)])
    df = pd.DataFrame({"value": range(len(idx))})  # , index=idx)
    if chunk:
        ddf = dd.from_pandas(df, chunksize=random.randint(1, 100))
    else:
        npartitions = random.randint(1, len(df.index.unique()))
        ddf = dd.from_pandas(df, npartitions=npartitions)
        if ddf.npartitions != npartitions:
            print(f"Failed to match partition count {npartitions}, {ddf.npartitions}")
    try:
        check_divisions(ddf.divisions)
    except ValueError as e:
        print(e)


for i in range(10_000):
    check_from_pandas(False)
for i in range(100_000):
    check_from_pandas(True)
```

The implementation seems sound, if a bit complex (but it's trying to do a complex thing, so).
This PR seems to have broken a large number of tests in ibis: https://github.com/ibis-project/ibis/runs/7707205507?check_suite_focus=true
The behavior of … I'm probably just a naive Dask user, but I would expect that to be an array of all …
After looking at the docstring, it could also be that we're depending on broken behavior in our test suite/dask backend implementation.
Thanks for reporting @cpcloud. Yeah, as you mentioned, …
Thank you for the report @cpcloud. I think most of what is going on here is that, before this PR, … In the …

To elaborate a bit more on @jrbourbeau's comment -- yes, …

It looks like the vast majority of the failures are due to the above, but perhaps not all? I particularly see things like …
```diff
 df = pd.DataFrame(
     {"x": [1, 2, 3, 4, 5, 6] * 10, "y": list("abdabd") * 10},
-    index=pd.Series([1, 2, 3, 4, 5, 6] * 10, dtype=dtype),
+    index=pd.Series([10, 20, 30, 40, 50, 60] * 10, dtype=dtype),
```
This broke tests on aarch64, ppc64le, s390x again, cf #4561
Thanks for the report @QuLogic, is the problem the same? It seems pretty unsatisfying to have to change the specific numbers here in order to support those architectures.
Thanks for pointing this out @QuLogic - I didn't realize that this was changed from `[10, 20, 30, 40, 50, 60]` to `[1, 2, 3, 4, 5, 6]` in the past. Unfortunately, using `[1, 2, 3, 4, 5, 6]` no longer works when `from_pandas` returns the correct number of partitions, so I cannot simply revert this change.
Can you share what parameters/assertions are failing? Does it help if we remove the `assert all(map(len, parts))` line (since it's technically okay to have empty partitions if we are increasing the number of partitions)?
The same ones as before: `test_repartition_npartitions[<lambda>0-float-5-1-True]` and `test_repartition_npartitions[<lambda>1-float-5-1-True]`. See build.log on the other arches here.
I assume it would be okay to remove that line as it's the last line of the test and the one that's failing.
Ping? This is still broken in the latest version. Should I open a new issue?
Thanks for the ping @QuLogic. A new issue would be good. Do you have a traceback?
`pre-commit run --all-files`

This PR was originally meant to be a small bug-fix for #9218. However, I struggled to work with the existing `sorted_division_locations` function, and eventually decided to rewrite it in a slightly different way.

Side Notes:

- `sorted_division_locations` is expected to prioritize the uniqueness of `divisions` elements over the output satisfying `npartitions` or `chunksize`. Although I can understand the motivation for having unique divisions, it is not clear to me that uniqueness is always more important than the partition size and/or count. Perhaps it makes sense to add something like a `unique_partitions` argument to `from_pandas`?
- The `npartitions=` argument is not always satisfied. This result seems a bit surprising to me.