[WIP] Add large svd draft #38

mrocklin · 2019-07-05T10:00:47Z

I need to put in actual code samples and probably do a screencast, but in the mean time if you'd be willing to look through this post for the general structure I'd welcome the collaboration.

alimanfoo · 2019-07-05T10:09:59Z

Cool, thanks @mrocklin. I took a scan through and looks good, I'll try to follow up with more detailed feedback.

mrocklin · 2019-07-05T10:42:08Z

If you want to give it a try, see this notebook https://gist.github.com/mrocklin/3af8a428e18a3047ab501161ffe17cf9

It should run fine on the single-node GPU machine to which you have access today (sending the address by e-mail)

pentschev · 2019-07-05T11:05:54Z

No %%time? :(

mrocklin · 2020-03-25T15:40:27Z

@quasiben @jakirkham @pentschev I suspect that you all are busy, but it would be fun to try this experiment again, now with the new and improved UCX integration.

quasiben · 2020-04-22T03:13:18Z

I gave this a whirl on a DGX2 (16GPUs) with NVLink enabled. Here's a performance report:

https://gistcdn.githack.com/quasiben/78033ef8130afb7b4c5f5f2def36d388/raw/59eb22d47bef5462bdb433ebfe3ae099cc64ccf1/dask-performance-large-svd.html

In general things are looking ok -- I am a bit concerned with the NumPy transfers (don't know where they are coming from yet) and there is a fairly large gap in the middle of the workflow. But things worked!

pentschev · 2020-04-22T08:57:25Z

I am a bit concerned with the NumPy transfers (don't know where they are coming from yet)

Probably metadata in UCX send/recv, which is mostly being addressed in dask/distributed#3732 .

mrocklin · 2020-04-22T14:48:49Z

I am a bit concerned with the NumPy transfers (don't know where they are coming from yet)

It would be nice to know how much was transferred (this is mostly a mental note to me to maybe improve diagnostics)

mrocklin · 2020-04-22T14:50:41Z

Can I ask that you maybe also call v.compute(). That might slightly change things.

quasiben · 2020-04-22T14:56:42Z

Can I ask that you maybe also call v.compute(). That might slightly change things.

Do you mean u.compute() ? I called v

mrocklin · 2020-04-22T15:01:51Z

From the performance report it looks like we're spending essentially all of our time generating random numbers.

There are a few things that we might experiment with here to improve the focus on SVD, and also give this example a bit more reality. cc'ing @alimanfoo, @eric-czech, and @jakirkham who I think might all care about this problem.

We can mimic a genomics problem I think by using random integers uniformly distributed between 0 and 3, stored as uint8.
We might consider storing these somewhere. We could try something small that fits in GPU memory (it might be easier given that we're storing these somewhat compactly). So we would persist the input dask array fill up memory (we probably have to reduce size a bit) and then and then call svd, hoping that the working memory that it needs wasn't too much greater.
For something larger, we could move to disk. I suspect that reading from disk will be more expensive than regenerating the data, but this is still a useful thing to highlight and will make a nice section. <puts on NVIIDA hat> It's also a great opportunity to mention NVIDIA IO technologies </takes off NVIDIA hat>
If we wanted to experiment a bit, we could play with compressing the data in memory just before we persist it. This would require us to have functions that converted a cupy array to a compressed form, and then decompressed that cupy array. Do we have such a function?
We could fit everything into memory if we just added more machines :)

Anyway, those are some ideas. I'm actually pretty excited about trying them all.

mrocklin · 2020-04-22T15:02:28Z

@quasiben I'm somewhat concerned about the 2GB/s bandwidth for cupy arrays. This seems low?

I hope that once we remove the random number generation that this becomes our next bottleneck to squash.

mrocklin · 2020-04-22T15:22:20Z

Oh indeed you did. My apologies. v is good. u is probably huge.

…

On Wed, Apr 22, 2020 at 7:56 AM Benjamin Zaitlen ***@***.***> wrote: Can I ask that you maybe also call v.compute(). That might slightly change things. Do you mean u.compute() ? I called v — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#38 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTFNO3ELCCMCEJ2NJ4DRN4ATVANCNFSM4H6J3EXQ> .

alimanfoo · 2020-04-24T15:18:38Z

We can mimic a genomics problem I think by using random integers uniformly distributed between 0 and 3, stored as uint8.

FWIW in a real genomics problem the integers can take values 0, 1 or 2 (if working with a diploid organism like humans or mosquitoes). They are also not uniformly distributed. These may or may not make a different, depending on what you're testing. Biggest different I expect would be compressibility (real data will be much more compressible than uniform random). Give me a shout if you'd like some real data to play with, I can point you at some zarrs on GCS.

mrocklin · 2020-04-25T03:57:59Z

Some zarr datasets on gcs would be great. We could probably download locally.

…

On Fri, Apr 24, 2020, 8:18 AM Alistair Miles ***@***.***> wrote: 1. We can mimic a genomics problem I think by using random integers uniformly distributed between 0 and 3, stored as uint8. FWIW in a real genomics problem the integers can take values 0, 1 or 2 (if working with a diploid organism like humans or mosquitoes). They are also not uniformly distributed. These may or may not make a different, depending on what you're testing. Biggest different I expect would be compressibility (real data will be much more compressible than uniform random). Give me a shout if you'd like some real data to play with, I can point you at some zarrs on GCS. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#38 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTCLYRVKJMBZ6QUJFQTROGUV3ANCNFSM4H6J3EXQ> .

mrocklin · 2020-04-27T00:29:45Z

Here is an example using Zarr with dask array together to compress in-memory data.

https://gist.github.com/mrocklin/ae18a1b51ccf90a2fa9872de23144ffa

alimanfoo · 2020-04-27T09:29:25Z

Here's a gist showing location of some genome variation data in zarr on GCS, and also the couple of preprocessing steps required before can be run through SVD:

https://gist.github.com/alimanfoo/ef6d9e70750b5273f95cbf681d162de8

Hth :-)

alimanfoo · 2020-04-27T09:52:29Z

Here's a gist showing location of some genome variation data in zarr on GCS, and also the couple of preprocessing steps required before can be run through SVD:

https://gist.github.com/alimanfoo/ef6d9e70750b5273f95cbf681d162de8

Hth :-)

Btw this example is "tall and skinny" with ~10 million features and ~1,000 individuals, but the data are growing to become more square, e.g., biobank datasets are more like ~10 million features and ~1 million individuals. So don't optimise for the tall and skinny case. At some point would probably be useful to simulate a biobank-scale dataset.

alimanfoo · 2020-04-27T09:53:32Z

Here is an example using Zarr with dask array together to compress in-memory data.

https://gist.github.com/mrocklin/ae18a1b51ccf90a2fa9872de23144ffa

That's cunning :-)

mrocklin · 2020-04-27T14:16:17Z

That's cunning :-)

Yeah, I then followed it up by actually swapping out zarr with functions that do bit-twiddling

def compress(x: np.ndarray) -> np.ndarray:
   out = np.zeros_like(x, shape=(x.shape[0], x.shape[1] // 4))
   out += x[:, 0::4]
   out += x[:, 1::4] << 2
   out += x[:, 2::4] << 4
   out += x[:, 3::4] << 6
   return out

def decompress(out: np.ndarray) -> np.ndarray:
   back = np.zeros_like(out, shape=(out.shape[0], out.shape[1] * 4))
   back[:, 0::4] = out & 0b00000011
   back[:, 1::4] = (out & 0b00001100) >> 2
   back[:, 2::4] = (out & 0b00110000) >> 4
   back[:, 3::4] = (out & 0b11000000) >> 6
   return back

This is a bit faster, and it also works wtih CuPy (we don't have good compression algorithms with Zarr on GPU yet). This lets us store a dataset pretty compactly, and expand it on-demand. It didn't quite work, I suspect because of a known memory pressure issue with the dask scheduler (a few other groups have run into something similar, so I hope that there is good pressure to solve it).

Speaking of CuPy, @quasiben and I were playing with a DGX-2 (16GPUs, half terrabyte of device memory) last night and solved something like a 10,000,000 by 20,000 svd in 10 seconds or so (if I recall correctly). That's after it was in RAM though. Probably the main cost will be collecting it from wherever it's stored.

Btw this example is "tall and skinny" with ~10 million features and ~1,000 individuals, but the data are growing to become more square, e.g., biobank datasets are more like ~10 million features and ~1 million individuals. So don't optimise for the tall and skinny case. At some point would probably be useful to simulate a biobank-scale dataset

That's good to know.

So far it looks like we're not overly sensitive to chunking on the y-axis, assuming that we're comfortable with approximate results.

eric-czech · 2020-04-27T16:11:50Z

swapping out zarr with functions that do bit-twiddling

@mrocklin do you have any thoughts on trying to do bitpacking like that with the dask API rather than as map_blocks functions (or zarr filters, even assuming they worked with cupy)? Looking at that example, I'm immediately seeing some ways to do useful math on the packed representations so if applying the compression filters at that level instead and relying on bitwise da.Array functions rather than custom numpy functions isn't a bad idea, I'd like to explore that some more.

@alimanfoo mentioned experimenting with something like that in the past a bit but I'd love to know if you think doing lots of bit-fiddling at the dask level would be a bad practice.

Apologies for a somewhat off-topic post but your example shook some thoughts loose so thanks for sharing it!

mrocklin · 2020-04-27T16:28:57Z

do you have any thoughts on trying to do bitpacking like that with the dask API rather than as map_blocks functions

In principle, yes, this could be made to work. Dask array isn't as smooth on in-place operations, so to make the packing process here as fast then we might want something like dask/dask#6123 . If you want to pack with map_blocks and then operate on the packed representation with the Dask array API though then yeah, I imagine that that would be easier.

mrocklin · 2020-04-27T16:48:15Z

In general, the experiment with @quasiben was interesting in order to see how much of this data we could effectively operate on in a fixed amount of device memory. Things are very fast if we can stay in that RAM, but there are a lot of interesting complications to doing so. This was more for us to play with technology than to solve an actual science problem.

This leads to a larger question of "what are the actual science problems here?" For example, you all may not even care about doing this quickly, but instead just care about being able to do it at all. I encourage folks to speak up about what would be motivating.

alimanfoo · 2020-04-28T08:59:12Z

This leads to a larger question of "what are the actual science problems here?" For example, you all may not even care about doing this quickly, but instead just care about being able to do it at all. I encourage folks to speak up about what would be motivating.

I'd say for genomics the main driver is to be able to do it at all. If it can then be done in an interactive environment, even better, where "interactive time" is anything less than the time it takes to make a cup of tea. I.e., you don't need to leave your normal working environment (Jupyter notebook) or worry about any long-running tasks. Also probably relevant as analysis moves to the cloud is cost, i.e., it is not an expensive computation.

Sorry this is a bit vague, but hopefully useful. Main point is we are not running SVDs over and over again all day, so this is not something where there is strong pressure to run as fast and cheaply as possible. But making it possible and straightforward to setup and run large SVDs is valuable.

quasiben · 2020-04-28T14:05:39Z

@alimanfoo can you make tea in 20 seconds ?

In my last experiment I was able to push to 400GB of in memory data (10_000_000, 30_000). SVD on this size was also not greatly impacted. I measured a compute time of 19.3 seconds. This is probably close to the limit (as of now) of the amount of data we can naively perform an SVD on.

To my snarky tea comment before, and as Matt said earlier, reading from disk will be somewhat burdensome. It took roughly a minute to generate the data in memory -- this will mostly like be longer when reading from disk and push us into tea construction time.

There are two things which could be done here to alleviate performance and memory constraints:

Optimized GPU readers from disk which integrate seamlessly with Zarr or other formats
Increase single GPU RAM (don't know how hard this is) or move to a multinode setup (I am working on this now)

Another interesting thing to note is that after loading data into memory, SVD is a communication heavy process. I didn't expect that. A distributed SVD spends more time transferring than computing. Using a box like a DGX2 is hugely impactful here as we can leverage NVLink for communication between GPUs (NVLink has a 50GB/s bandwidth)

Breakdown of actions for an SVD

defaultdict(int,
            {'compute': 55.52175450325012,
             'transfer': 141.12734627723694,
             'disk-read': 0.00934290885925293})

I also wanted to note that working on these problems is also of general interest to NVIDIA.
ClaraGenomics would be an example of NVIDIA in the genetics space: https://github.com/clara-genomics/Compute4COVID

mrocklin · 2020-04-28T14:46:14Z

It would be fun, I think, to do this on a practical problem (like the zarr dataset that @alimanfoo pointed to earlier) on both a consumer laptop and a DGX-2, and quantify all of the costs involved (reading from disk, moving to GPU, transfer time, compute time). That might give us a sense about what a range of performance looks like, and where technology efforts would need to focus in the future to improve things.

…

On Tue, Apr 28, 2020 at 7:05 AM Benjamin Zaitlen ***@***.***> wrote: @alimanfoo <https://github.com/alimanfoo> can you make tea in 20sec ? In my last experiment I was able to push to 400GB of in memory data (10_000_000, 30_000). SVD on this size was also not greatly impacted. I measured a compute time of 19.3 seconds. This is probably close to the limit (as of now) of the amount of data we can naively perform an SVD on. To my snarky tea comment before, and as Matt said earlier, reading from disk will be somewhat burdensome. It took roughly a minute to generate the data in memory -- this will mostly like be longer when reading from disk and push us into tea construction time. There are two things which *could* be done here to alleviate performance and memory constraints: - Optimized GPU readers from disk <https://devblogs.nvidia.com/gpudirect-storage/> which integrate seamlessly with Zarr or other formats - Increase GPU RAM (don't know how hard this is) Another interesting thing to note is that after loading data into memory, SVD is a communication heavy process. I didn't expect that. A distributed SVD spends more time transferring than computing. Using a box like a DGX2 is hugely impactful here as we can leverage NVLink for communication between GPUs (NVLink has a 50GB/s bandwidth) Breakdown of actions for an SVD defaultdict(int, {'compute': 55.52175450325012, 'transfer': 141.12734627723694, 'disk-read': 0.00934290885925293}) I also wanted to note that working on these problems is also of general interest to NVIDIA. ClaraGenomics would be an example of NVIDIA in the genetics space: https://github.com/clara-genomics/Compute4COVID — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#38 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTEAFD4YM7TFC7EHFKLRO3PEJANCNFSM4H6J3EXQ> .

alimanfoo · 2020-04-28T15:10:56Z

@alimanfoo can you make tea in 20sec ?

Nope :-) Just to elaborate a little, obviously interactive is ideal, but tea break would be OK too.

Another way to think about it might be, imagine a researcher who has access to a pangeo-like or saturncloud-like research computing environment, and can fire up notebook servers and dask-kubernetes clusters as needed based on standard node types offered by public cloud providers. They need to run an SVD over a large dataset, and ideally it's (a) convenient to fire up the necessary compute resources and setup the computation, (2) doesn't take too long to run, and (iii) cloud costs are small enough to be negligible, i.e., don't require any special accounting.

Hth.

mrocklin · 2020-04-28T21:01:49Z

Another way to think about it might be, imagine a researcher who has access to a pangeo-like or saturncloud-like research computing environment, and can fire up notebook servers and dask-kubernetes clusters as needed based on standard node types offered by public cloud providers. They need to run an SVD over a large dataset, and ideally it's (a) convenient to fire up the necessary compute resources and setup the computation, (2) doesn't take too long to run, and (iii) cloud costs are small enough to be negligible, i.e., don't require any special accounting.

Like this?

Notebook: https://nbviewer.jupyter.org/urls/gist.githubusercontent.com/mrocklin/0c4e97b9164c55d85ce457fce0b87d41/raw/90a07e2f6fd1563aa65eb83f190e4c853494a4dd/coiled-compressed-svd.ipynb
Screencast: https://www.youtube.com/watch?v=qaJcAvhgLy4

alimanfoo · 2020-05-05T16:00:24Z

Added quasiben#1 with a bit more explanation of the genomics.

quasiben · 2020-05-05T16:05:43Z

Thanks @alimanfoo ! I added them directly here. Matt gave me write access and thought it best to centralize now that we can

_posts/2019-07-02-very-large-svds.md

Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>

quasiben · 2020-05-05T16:28:08Z

@mrocklin when you have some time can you take a pass over this ?

mrocklin · 2020-05-06T00:02:23Z

_posts/2019-07-02-very-large-svds.md

+On A DGX2 we can calculate an SVD on a 200GB Dask array between 10 and15 seconds: [SVD Multi-GPU Notebook](https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4)
+
+To see this run, we recommend viewing
+[the attached screencast](https://youtu.be/4X5yky2lvEw)


My computer may be borked? Can someone else check if this has audio?

mrocklin · 2020-05-06T00:09:44Z

Thanks @quasiben I took a look. Some high level thoughts:

Can you check the screencast and verify that it has audio. It didn't play on my machine, but my audio has also been strange recently. It didn't play on my phone though too, which has me concerned
The GPU part of this post assumes knowledge of terms like DGX2 and host memory that we might not be able to rely on with the majority of our readership. It might also be good to somehow bracket that section in "now let's enter GPU land (oh and as a disclaimer, one of the authors works for NVIDIA)"
I'd like to also include the bits on compression, why it's useful and why it didn't work
I'd also like to include bits on moving to the cloud (with appropriate disclaimer there as well)
We should probably shift language from "I" to "we". I did this in a few places as I was going through but I probably missed some. I also added you to the author list, and decided to sort alphabetically by first name.

quasiben · 2020-05-06T00:14:31Z

@mrocklin thanks for the review. I just recently upgraded to Catalina and I think that changed various app permissions. I'll re-record and upload.

I can also update the post with your suggestions as well. Thanks

alimanfoo · 2020-05-06T11:45:26Z

_posts/2019-07-02-very-large-svds.md

+layout: post
+title: Very Large SVDs
+tagline: Dask + CuPy + Zarr + Genomics
+author: Alistair Miles (Oxford Big Data Institute), Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled)


Suggested change

author: Alistair Miles (Oxford Big Data Institute), Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled)

author: Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled), Alistair Miles (Oxford University)

I don't deserve to be first author here :)

I think we should do it alphabetically by first name. I believe that happens to coincide with the current ordering of names

That would feel a bit strange for me, my contribution to this nice work is very minor.

changed the author order

_posts/2019-07-02-very-large-svds.md

quasiben · 2020-05-12T13:15:28Z

This is looking really good. @mrocklin left a few small comments on the latest commits

Co-authored-by: Benjamin Zaitlen <quasiben@users.noreply.github.com>

quasiben · 2020-05-12T14:55:59Z

Read through the post again and I'm good with posting. @alimanfoo ?

alimanfoo · 2020-05-12T15:33:38Z

LGTM :-)

mrocklin · 2020-05-12T17:40:24Z

I will plan to merge this and tweet about it tomorrow morning US Eastern time

add large svd draft

2177cd7

mrocklin changed the title ~~Add large svd draft~~ [WIP] Add large svd draft Jul 5, 2019

eric-czech mentioned this pull request Apr 27, 2020

Build PyData prototype for GWAS analysis related-sciences/gwas-analysis#20

Open

12 tasks

add Alistair's genomic background

86d127b

cleaned up previous commit

f6259de

alimanfoo reviewed May 5, 2020

View reviewed changes

_posts/2019-07-02-very-large-svds.md Outdated Show resolved Hide resolved

quasiben and others added 4 commits May 5, 2020 12:20

Update _posts/2019-07-02-very-large-svds.md

8526d1a

Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>

checkpoint

38ca9dc

Merge branch 'svd' of github.com:mrocklin/dask-blog into svd

b524f06

flush out svd with zarr and conclusion

0c3944d

mrocklin commented May 6, 2020

View reviewed changes

some minor edits

8ff0ce5

update gpu section -- add disclaimer

7abdf3a

alimanfoo reviewed May 6, 2020

View reviewed changes

quasiben and others added 3 commits May 7, 2020 16:20

change author order

1016ae7

add cloud, compression, and final thoughts to post

edc8994

Merge branch 'svd' of github.com:mrocklin/dask-blog into svd

5eb91ee

quasiben reviewed May 12, 2020

View reviewed changes

_posts/2019-07-02-very-large-svds.md Outdated Show resolved Hide resolved

quasiben reviewed May 12, 2020

View reviewed changes

_posts/2019-07-02-very-large-svds.md Outdated Show resolved Hide resolved

Apply suggestions from code review

3fb4248

Co-authored-by: Benjamin Zaitlen <quasiben@users.noreply.github.com>

Move to today's date

b6fbb51

push

049a965

mrocklin merged commit 79ac035 into dask:gh-pages May 13, 2020

mrocklin deleted the svd branch May 13, 2020 14:08

	author: Alistair Miles (Oxford Big Data Institute), Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled)
	author: Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled), Alistair Miles (Oxford University)

[WIP] Add large svd draft #38

[WIP] Add large svd draft #38

Conversation

mrocklin commented Jul 5, 2019

alimanfoo commented Jul 5, 2019

mrocklin commented Jul 5, 2019

pentschev commented Jul 5, 2019

mrocklin commented Mar 25, 2020

quasiben commented Apr 22, 2020

pentschev commented Apr 22, 2020

mrocklin commented Apr 22, 2020

mrocklin commented Apr 22, 2020

quasiben commented Apr 22, 2020

mrocklin commented Apr 22, 2020

mrocklin commented Apr 22, 2020

mrocklin commented Apr 22, 2020 via email

alimanfoo commented Apr 24, 2020

mrocklin commented Apr 25, 2020 via email

mrocklin commented Apr 27, 2020

alimanfoo commented Apr 27, 2020

alimanfoo commented Apr 27, 2020

alimanfoo commented Apr 27, 2020

mrocklin commented Apr 27, 2020

eric-czech commented Apr 27, 2020

mrocklin commented Apr 27, 2020

mrocklin commented Apr 27, 2020

alimanfoo commented Apr 28, 2020

quasiben commented Apr 28, 2020 • edited Loading

mrocklin commented Apr 28, 2020 via email

alimanfoo commented Apr 28, 2020

mrocklin commented Apr 28, 2020 • edited Loading

alimanfoo commented May 5, 2020

quasiben commented May 5, 2020

quasiben commented May 5, 2020

mrocklin May 6, 2020

Choose a reason for hiding this comment

mrocklin commented May 6, 2020

quasiben commented May 6, 2020

alimanfoo May 6, 2020

Choose a reason for hiding this comment

quasiben May 6, 2020

Choose a reason for hiding this comment

alimanfoo May 6, 2020

Choose a reason for hiding this comment

quasiben May 7, 2020

Choose a reason for hiding this comment

quasiben commented May 12, 2020

quasiben commented May 12, 2020

alimanfoo commented May 12, 2020

mrocklin commented May 12, 2020

quasiben commented Apr 28, 2020 •

edited

Loading

mrocklin commented Apr 28, 2020 •

edited

Loading