Skip to content

couple bugfixes + comment nits#6578

Merged
mrocklin merged 5 commits intodask:masterfrom
celsiustx:nits
Aug 30, 2020
Merged

couple bugfixes + comment nits#6578
mrocklin merged 5 commits intodask:masterfrom
celsiustx:nits

Conversation

@ryan-williams
Copy link
Copy Markdown
Contributor

@ryan-williams ryan-williams commented Aug 30, 2020

  • Tests added / passed
  • Passes black dask / flake8 dask

  • a length check related to passing -1s to array.reshape was incorrect; fixed + added test
  • cryptic unused ddf.iloc expression was intended to fail, but should probably return; fixed + documented (spoiler: I've implemented ddf.iloc separately, for some cases; PR incoming…)
  • a pytest was parameterized with [False, False] when [False, True] was presumably intended; fixed + adjusted test to work in both cases
  • misc comment clarifications / typo fixes

(I previously pushed a commit here where I'd run an older version of black and introduced some incorrect formatting changes; force-pushed to correct that, as well as add a few more fixes).

doesn't matter, since this statement always throws a NotImplementedError, but good for future-proofing
@mrocklin
Copy link
Copy Markdown
Member

This looks good to me. Thank you for your attention to detail here @ryan-williams .

I'm curious, what were you up to that lead you to these issues?

Regardless. Merging!

@mrocklin mrocklin merged commit 5b79971 into dask:master Aug 30, 2020
@ryan-williams
Copy link
Copy Markdown
Contributor Author

ryan-williams commented Sep 1, 2020

Thanks @mrocklin!

@sakoht and I have been working in celsiustx/dask for several months. Our goal is basically to dask-ify Anndata (in the spirit of earlier approaches, e.g. @tomwhite's scverse/scanpy#921).

There are a few meaty changes about ready for upstreaming to Dask:

DataFrame.iloc implementation

  • backed by a new partition_sizes tuple on DataFrames, which is populated and propagated on a best-effort basis
  • pre-PR comparison; some parquet tests are failing; I'm still debugging them

sum() on Arrays w np.matrix or spmatrix blocks

  • this crashes today bc Dask passes a keepdims kwarg to each block, but these special array types' sum methods don't accept keepdims
  • I fixed it by forking numpy, and scipy, to add keepdims to those sum methods (as well as a few changes in Dask)
  • juggling all the forks is a bit unwieldy
    • we are developing in a Docker image with our forked numpy, scipy, and dask libraries, and have spent almost as much time on Docker+Git scripting as on the changes themselves; it has built a lot of character 😉
    • upstreaming everything may take a while, but we are intent on doing it if relevant maintainers agree

DataFrame.to_sql (#6038)

For completeness: this earlier PR came from the project at Celsius, though we've walked back from a heady SQL-maximalism approach to try to get other, more attainable filesystem-based things working first.

Definitely interested in your / other Dasks' thoughts about all of this, but you are more than welcome to wait for a clean presentation + proper PRs!

@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Sep 2, 2020

Thanks for the context @ryan-williams . Some thoughts:

  1. I strongly encourage you to raise an issue about iloc early. I think there is already an issue somewhere about alternate partitioning information. You'll find good company and a lot of this conversation already hashed out.
  2. A fork of numpy/scipy/dask: oof. Have fun. I'm going to keep away from this one :)
  3. to_sql. Thank you for contributing your work upstream generally. It's great to see contributions come out of practical use cases

You all may also want to be aware of and maybe get involved in https://github.com/nils-braun/dask-sql

@sakoht
Copy link
Copy Markdown

sakoht commented Sep 2, 2020

A fork of numpy/scipy/dask: oof. Have fun. I'm going to keep away from this one :)

FWIW it sounds more crazy than it is. Some neglected corners of the codebase around sparse matrices were brought to match the rest of the codebase. (A PITA to fix and test, which is probably why it wasn't done before.)

@ryan-williams ryan-williams deleted the nits branch September 10, 2020 14:28
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants