
Add compute kwargs to methods that write data to disk#6056

Merged
jrbourbeau merged 20 commits into dask:master from KrishanBhasin:feature/compute-kwargs
May 4, 2020

Conversation

@KrishanBhasin (Contributor) commented Apr 2, 2020

  • Tests added / passed
  • Passes black dask / flake8 dask

Fixes #6026

@KrishanBhasin (Contributor, Author)

to_hdf() doesn't call .compute() anywhere; instead it calls compute_as_if_collection().

I can't see where to use the compute_kwargs in compute_as_if_collection():

def compute_as_if_collection(cls, dsk, keys, scheduler=None, get=None, **kwargs):
    """Compute a graph as if it were of type cls.

    Allows for applying the same optimizations and default scheduler."""
    schedule = get_scheduler(scheduler=scheduler, cls=cls, get=get)
    dsk2 = optimization_function(cls)(ensure_dict(dsk), keys, **kwargs)
    return schedule(dsk2, keys, **kwargs)
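For illustration only, here is a pure-Python stand-in (not dask's actual code; all names below are hypothetical) showing the forwarding pattern under discussion: since compute_as_if_collection already accepts **kwargs and passes them on to the schedule call, a writer could in principle unpack compute_kwargs straight into it.

```python
def schedule(dsk, keys, **kwargs):
    # Stand-in for the scheduler returned by get_scheduler();
    # records the kwargs it receives so they can be inspected.
    return {"keys": keys, "received": kwargs}

def compute_as_if_collection_stub(dsk, keys, scheduler=None, **kwargs):
    # Simplified stand-in for compute_as_if_collection: extra keyword
    # arguments already flow through to schedule().
    return schedule(dsk, keys, scheduler=scheduler, **kwargs)

def to_hdf_stub(dsk, keys, compute_kwargs=None):
    # A hypothetical writer could forward compute_kwargs by unpacking it here.
    compute_kwargs = compute_kwargs or {}
    return compute_as_if_collection_stub(dsk, keys, **compute_kwargs)

result = to_hdf_stub({}, ["k"], compute_kwargs={"num_workers": 2})
# result["received"] shows that num_workers reached the schedule call
```

Whether this is appropriate for the real compute_as_if_collection is exactly the open question in this thread, since its **kwargs are also passed to the optimization function.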

@gforsyth (Contributor) left a comment

Hey @KrishanBhasin ! I know this is a draft, but I have a few comments here (also will help with the test failures).

@KrishanBhasin (Contributor, Author)

@gforsyth do you have any thoughts on how I can keep to_hdf() consistent with the rest of the methods I've modified?

@martindurant (Member)

I see you have got to the linting stage; is this PR ready for review? @gforsyth, did you have an interest?

@gforsyth (Contributor) left a comment

Hey @KrishanBhasin -- this is coming along! I haven't had a chance to look at to_hdf yet, I'll try to do that this week.

@martindurant may have thoughts on how to handle that

@martindurant (Member)

to_hdf isn't one of mine; I would have to do some digging, and I'm not sure when I would find the time.

@mrocklin (Member) commented Apr 8, 2020 via email

@KrishanBhasin (Contributor, Author)

@gforsyth I've cleaned up the bits you mentioned!

The hdf thing bothers me, though; I'll try to have another crawl through its code when I get the time.

@KrishanBhasin KrishanBhasin marked this pull request as ready for review April 11, 2020 09:05
@gforsyth (Contributor)

Hey @KrishanBhasin -- sorry for my delay in getting to this. I think we should push this in and open a separate issue to tackle the hdf stuff.

Could you add tests for compute_kwargs for parquet and json, as well?

@KrishanBhasin (Contributor, Author)

I'll get to this over the weekend, sorry for the delay!

@gforsyth (Contributor)

> I'll get to this over the weekend, sorry for the delay!

No worries! This is otherwise ready to go in.

@KrishanBhasin (Contributor, Author)

I've added a test for the json case, but I have a funny feeling I'm missing the point with it. Do you just want a test that passes in a compute_kwargs value and handles it correctly, like this test:

def test_to_csv_with_get():
    from dask.multiprocessing import get as mp_get

    flag = [False]

    def my_get(*args, **kwargs):
        flag[0] = True
        return mp_get(*args, **kwargs)

    df = pd.DataFrame({"x": ["a", "b", "c", "d"], "y": [1, 2, 3, 4]})
    ddf = dd.from_pandas(df, npartitions=2)

    with tmpdir() as dn:
        ddf.to_csv(dn, index=False, compute_kwargs={"scheduler": my_get})
        assert flag[0]
        result = dd.read_csv(os.path.join(dn, "*")).compute().reset_index(drop=True)
    assert_eq(result, df)

?

@gforsyth (Contributor)

> I've added a test for the json case, but I have a funny feeling I'm missing the point with it. Do you just want a test that passes in a compute_kwargs value and handles it correctly, like this test:

I think there should be tests for each to_* method that show that compute_kwargs are being correctly passed through to the compute call, and also tests to show that the expected warnings/errors come up if a user uses a deprecated argument (or specifies two schedulers, one using scheduler and another using compute_kwargs)
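A minimal sketch of the kind of warning test being suggested here, written against a hypothetical to_csv_stub rather than dask's real writer (the function, its signature, and the warning message are all illustrative assumptions):

```python
import warnings

def to_csv_stub(path, scheduler=None, compute_kwargs=None):
    # Illustrative writer: warns when the deprecated `scheduler` keyword
    # is used, and folds it into compute_kwargs if no scheduler is set there.
    compute_kwargs = compute_kwargs or {}
    if scheduler is not None:
        warnings.warn(
            "The 'scheduler' keyword is deprecated; pass it via compute_kwargs",
            FutureWarning,
        )
        compute_kwargs.setdefault("scheduler", scheduler)
    return compute_kwargs

# The test: using the deprecated argument should emit a FutureWarning,
# and the scheduler should still end up in compute_kwargs.
with warnings.catch_warnings(record=True) as record:
    warnings.simplefilter("always")
    kwargs = to_csv_stub("out.csv", scheduler="synchronous")

assert any(issubclass(w.category, FutureWarning) for w in record)
assert kwargs == {"scheduler": "synchronous"}
```

In a real dask test suite this would more likely use pytest.warns around the actual to_* call.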

@KrishanBhasin (Contributor, Author)

> I've added a test for the json case, but I have a funny feeling I'm missing the point with it. Do you just want a test that passes in a compute_kwargs value and handles it correctly, like this test:
>
> I think there should be tests for each to_* method that show that compute_kwargs are being correctly passed through to the compute call, and also tests to show that the expected warnings/errors come up if a user uses a deprecated argument (or specifies two schedulers, one using scheduler and another using compute_kwargs)

The test...with_get() tests show that the scheduler arguments are being passed appropriately, since the flag value is asserted to have changed.

Checking whether to_csv() raises warnings/errors appropriately, it looks like we still need that 3-line if statement. Please let me know if I'm overlooking something (the problem with contributing to open source in your free time is that you tend to do it in the evening when a little tired).
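The "3-line if statement" itself isn't shown in the thread; a hypothetical version of such a guard (names and message are illustrative, not dask's actual code) might look like:

```python
def check_scheduler_conflict(scheduler=None, compute_kwargs=None):
    # Hypothetical guard: raise if a scheduler is passed both directly
    # and via compute_kwargs with a different value.
    compute_kwargs = compute_kwargs or {}
    if scheduler is not None and compute_kwargs.get("scheduler", scheduler) != scheduler:
        raise ValueError(
            "Differing values for 'scheduler' were given directly and in compute_kwargs"
        )

# The same scheduler in both places is accepted; conflicting values raise.
check_scheduler_conflict("sync", {"scheduler": "sync"})
try:
    check_scheduler_conflict("sync", {"scheduler": "processes"})
    raised = False
except ValueError:
    raised = True
assert raised
```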

@KrishanBhasin KrishanBhasin requested a review from gforsyth April 28, 2020 21:08
@gforsyth (Contributor) left a comment

This looks great! Thanks for putting this in @KrishanBhasin!

@jrbourbeau this is good to go

@jrbourbeau (Member) left a comment

Thanks for this contribution @KrishanBhasin, I'm looking forward to seeing this merged. Overall things here look really good; I just left a few small comments.

KrishanBhasin and others added 2 commits May 3, 2020 14:45
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
@gforsyth (Contributor) commented May 4, 2020

Hey @KrishanBhasin -- looks like black needs to be run on dask/dataframe/io/tests/test_csv.py; that's what's failing CI on Travis.

@KrishanBhasin (Contributor, Author)

Test added and black has been run, @gforsyth @jrbourbeau.

@jrbourbeau (Member) left a comment

This looks great, thank you for working on this @KrishanBhasin! Thanks @gforsyth for reviewing!

@jrbourbeau jrbourbeau merged commit 96d0853 into dask:master May 4, 2020
@KrishanBhasin KrishanBhasin deleted the feature/compute-kwargs branch May 5, 2020 07:29

Linked issue (closed by this PR): Question/Feature Request/PR offer - passing **kwargs to compute() within to_parquet() (#6026)