Fixed flaky test-rearrange #5977

Merged: 17 commits into dask:master, Mar 31, 2020

Conversation

TomAugspurger (Member)

Puts file cleanup in the task graph, rather than relying on Partd to do it for us. I'm not really happy with my approach, but wanted to try this on CI a few times.

Closes #5867

@TomAugspurger TomAugspurger changed the title Fixed flaky test-rearrange [WIP]: Fixed flaky test-rearrange Mar 4, 2020
TomAugspurger (Member, Author)

@mrocklin you mentioned that you have concerns about this approach. Can you share them when you get a chance, before I put more time into finishing it off / finding an alternative?

mrocklin (Member) left a comment

OK, so now I see that we're still using the existing cleanup mechanism in partd, so presumably this is unlikely to be less robust. I guess I'm coming around. I do have a few comments below though.

        p.append(d, fsync=True)
    except Exception:
        try:
            p.drop()
mrocklin (Member)

Why do we want to drop this?

TomAugspurger (Member, Author)

I think this matches the behavior on master in the case where something in shuffle_group_3 raises an exception. Previously, p would go out of scope and be garbage collected, which includes a call to partd.File.drop.

Because we're explicitly providing a path to create the partd.File now, we're responsible for cleaning up.
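For reference, here is a minimal sketch of the pattern being described; the names are illustrative and the real function is shuffle_group_3 in dask/dataframe/shuffle.py, so treat this as an approximation rather than the PR's exact code.

# Illustrative sketch only; the actual code lives in dask/dataframe/shuffle.py.
def append_groups(df, col, p):
    # Split `df` by `col` and append each group to the partd store `p`.
    g = df.groupby(col)
    d = {i: g.get_group(i) for i in g.groups}
    try:
        p.append(d, fsync=True)
    except Exception:
        # We created the on-disk directory at an explicit path, so on failure
        # we remove it ourselves instead of relying on garbage collection,
        # then re-raise the original exception.
        try:
            p.drop()
        except Exception:
            pass
        raise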

dask/dataframe/tests/test_shuffle.py (outdated review thread, resolved)
path = None

if path:
    shutil.rmtree(path, ignore_errors=True)
mrocklin (Member)

It's low priority, but some folks might create their own PartD objects here. In the future we might want a more robust solution to finding a File object.
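As a rough illustration of what "finding a File object" could look like, here is a hedged sketch. It assumes each wrapper exposes its inner store as a .partd attribute (as partd.Encode and partd.PandasBlocks do), and it deliberately ignores wrappers such as partd.Buffer that hold two inner stores, which is part of why a genuinely robust solution is harder.

# Hedged sketch, not the PR's code: walk partd wrappers looking for a File.
import shutil

import partd

def find_file(p):
    # Follow `.partd` attributes until we reach the object that owns a path.
    while not isinstance(p, partd.File):
        inner = getattr(p, "partd", None)
        if inner is None:
            return None  # e.g. a user-supplied store with no File underneath
        p = inner
    return p

store = partd.PandasBlocks(partd.File("example-partd-dir"))  # creates the directory
f = find_file(store)
print(f.path if f is not None else "no File found")
shutil.rmtree("example-partd-dir", ignore_errors=True)  # same cleanup call the PR uses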

dask/dataframe/shuffle.py (outdated review thread, resolved)
}
cleanup_token = "cleanup-" + always_new_token
dsk5 = {cleanup_token: (cleanup_partd_files, p, list(dsk4))}
mrocklin (Member)

I think that you're bringing back all of the results into a single task with list(dsk4).

If we take this approach, we might need an intermediate set of tasks that take in the results of dsk4 and return None for each. That way we maintain the dependencies but don't move the data around.

TomAugspurger (Member, Author)

Mmm OK. Do you know, does dsk3 at https://github.com/dask/dask/pull/5977/files#diff-83ae5352ddc87bd80c831102addd9b1eL407 have a similar problem? Perhaps that result isn't as large.

mrocklin (Member)

I don't think so; my guess is that the output tasks in dsk2 write things to disk and then return None.

The output tasks of dsk4 read from disk and return dataframes, so this is more of a concern.

Instead, I think that the solution is to have a dsk4b that maps lambda x: None across the outputs of dsk4.
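To make that concrete, here is a small self-contained sketch of the barrier idea with made-up key names (the real keys and cleanup_partd_files live in dask/dataframe/shuffle.py): each dsk4-style output gets a tiny companion task that returns None, and the cleanup task depends on the companions, so it waits for every reader without gathering their dataframes.

# Illustrative graph only; the key names and functions are made up for this sketch.
from dask.threaded import get

def read_partition(i):
    # Stand-in for a dsk4 task: reads a (potentially large) partition from disk.
    return list(range(i * 1000, (i + 1) * 1000))

def barrier(_):
    # Depends on a partition task but returns None, so no data is carried along.
    return None

def cleanup(barriers):
    # Stand-in for cleanup_partd_files: runs only after all barrier tasks finish.
    return "cleaned up after %d partitions" % len(barriers)

dsk4 = {("collect", i): (read_partition, i) for i in range(3)}
dsk4b = {("barrier", i): (barrier, ("collect", i)) for i in range(3)}
dsk5 = {"cleanup": (cleanup, list(dsk4b))}

graph = {**dsk4, **dsk4b, **dsk5}
print(get(graph, "cleanup"))  # the cleanup task only ever receives Nones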

TomAugspurger (Member, Author) left a comment

Just for reference, here's the dask graph for the set_index test.

[image: mydask, the task graph visualization]


dask/dataframe/tests/test_shuffle.py (outdated review thread)
@@ -779,34 +793,6 @@ def test_compute_divisions():
    compute_divisions(c)


# TODO: Fix sporadic failure on Python 3.8 and remove this xfail mark
TomAugspurger (Member, Author)

I've removed this test, but I can restore it if needed.

It's asserting that there are some files left around, but I'm not sure that we want that / can reliably assert that. My understanding was that we wanted them to be cleaned up automatically.

mrocklin (Member)

Because we're explicitly providing a path to create the partd.File now, we're responsible for cleaning up.

This concerns me. We will often fail at this. What if some exception happens during computation and we never get to the cleanup step?

@TomAugspurger TomAugspurger changed the title [WIP]: Fixed flaky test-rearrange Fixed flaky test-rearrange Mar 26, 2020
TomAugspurger (Member, Author) commented Mar 26, 2020

This concerns me. We will often fail at this. What if some exception happens during computation and we never get to the cleanup step?

1b704d7 attempts to make this clearer. The idea is to wrap calls using the partd.File object in try / except blocks. If there is an exception, then we drop the partd file and re-raise the original exception. 1b704d7 has a test that should illustrate this.
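As a rough illustration of the kind of test being described (not the actual test added in 1b704d7), here is a hedged pytest sketch against a fake partd-like store, checking that a failed append leaves no directory behind.

# Hedged sketch with a fake store; the real test exercises dask's shuffle code.
import os
import shutil

import pytest

class FakePartd:
    # Minimal stand-in for a partd.File created at an explicit path.
    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def append(self, data, fsync=False):
        raise OSError("simulated failure while writing to disk")

    def drop(self):
        shutil.rmtree(self.path, ignore_errors=True)

def append_or_drop(p, data):
    # The wrapping pattern described above: drop the store on failure, re-raise.
    try:
        p.append(data, fsync=True)
    except Exception:
        try:
            p.drop()
        except Exception:
            pass
        raise

def test_failed_append_cleans_up(tmp_path):
    p = FakePartd(str(tmp_path / "partd-dir"))
    with pytest.raises(OSError):
        append_or_drop(p, {"x": b"data"})
    assert not os.path.exists(p.path)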

TomAugspurger (Member, Author) left a comment

Gave this another look and I'm sufficiently happy with where it's at.

@TomAugspurger TomAugspurger merged commit 3c14766 into dask:master Mar 31, 2020
@TomAugspurger TomAugspurger deleted the 5867-debub-test-rearrange branch March 31, 2020 19:34
jrbourbeau (Member)

Thanks for your work on this @TomAugspurger

Closes: Flaky test test_rearrange
Participants: 3