
Useful plugins for JupyterHub / JupyterLab #13

Closed
pibion opened this issue Apr 2, 2020 · 22 comments
pibion (Member) commented Apr 2, 2020

This is in no way a priority, but I wanted to record the idea somewhere before I forget it:

It might be nice to have the Theia IDE available in the JupyterHub environment. It looks like there is some support for this: https://jupyter-server-proxy.readthedocs.io/en/latest/convenience/packages/theia.html.
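For reference, jupyter-server-proxy exposes an arbitrary local server through the Hub via its `c.ServerProxy.servers` config. A minimal sketch of such an entry, assuming a `theia` launcher is already installed on PATH in the single-user image (the exact Theia command line is hypothetical and depends on how Theia is packaged):

```python
# jupyter_server_config.py -- sketch of a jupyter-server-proxy entry for Theia.
# The `theia` command below is an assumption; adjust to however Theia is
# packaged in the user image.
c.ServerProxy.servers = {
    "theia": {
        # {port} is substituted by jupyter-server-proxy at launch time
        "command": ["theia", "start", "--hostname=127.0.0.1", "--port={port}"],
        "launcher_entry": {"title": "Theia IDE"},  # adds a tile to the JupyterLab launcher
    }
}
```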

@zonca zonca added this to To do in CDMS JupyterHub on XSEDE Apr 6, 2020
@zonca zonca changed the title Really nice IDE Useful plugins for JupyterHub / JupyterLab Apr 6, 2020
zonca (Collaborator) commented Apr 6, 2020

I also think the JupyterLab system monitor would be great, so users can check how much memory they are using.

https://github.com/jtpio/jupyterlab-system-monitor

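For what it's worth, that extension reads its numbers from the jupyter-resource-usage (formerly nbresuse) server extension, which can be told what limit to advertise. A sketch, assuming that backend package is installed alongside the labextension; the 4 GB figure is just an example value:

```python
# jupyter_notebook_config.py -- sketch of the jupyterlab-system-monitor backend
# config (jupyter-resource-usage / nbresuse). 4 GB is an example, not a recommendation.
c.ResourceUseDisplay.mem_limit = 4 * 1024**3   # bytes; shown as the memory ceiling in the top bar
c.ResourceUseDisplay.track_cpu_percent = True  # also report CPU usage
```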

pibion (Member, Author) commented Apr 6, 2020

@zonca a system monitor would be extremely helpful. In general we'd love to have more monitoring both for users and the cluster so we know what to request on allocations.

zonca (Collaborator) commented Apr 6, 2020

Server-side we can install Prometheus and Grafana: https://zonca.dev/2019/04/kubernetes-monitoring-prometheus-grafana.html

pibion (Member, Author) commented Apr 6, 2020

@zonca Prometheus and Grafana look like they'd be perfect. The integration with Slack is especially nice.

pibion (Member, Author) commented May 19, 2020

@zonca would it be possible to integrate JupyterHub on Jetstream with XSEDE batch resources?

Right now one of our analyzers is going back and forth between interactive analysis and submitting jobs that use on the order of 600 CPU hours. I think this would be a simple allocation to get, but making it easy to submit jobs from the Jupyter environment would be awesome and would significantly increase the likelihood that people use those resources.

zonca (Collaborator) commented May 19, 2020

Which supercomputers are they using? How much data (order of magnitude) needs to move between the interactive and batch sides as input and output?

pibion (Member, Author) commented May 19, 2020

@ziqinghong could you comment?

The computing currently used for this is the SLAC cluster. I think the input data is on the order of 100 GB and the output data is an order of magnitude less.

ziqinghong commented:

A month's worth of small-detector data is 5-10 TB. (Amy, these are continuous, non-triggered data, so they're bigger than the usual numbers we quote for SNOLAB.)
A simple model is that each batch job processes one dataset (an hour of data). The input is ~30 GB; it gets turned into <1 GB of output.

zonca (Collaborator) commented May 19, 2020

Thanks @ziqinghong. Is the processing using MPI? How many nodes, and how long does a typical job take? Is the software multithreaded?

pibion (Member, Author) commented May 19, 2020

@zonca I don't believe the software is multithreaded or using MPI.

I'll let @ziqinghong comment on how many nodes and how long a typical job takes.

ziqinghong commented:

We don't usually use MPI. Our jobs are parallelized by splitting up datasets and running identical processing on each of them.

How long a typical job takes depends on how many nodes/jobs we spread the task across. It takes O(500) CPU-hours to process ~10 TB of data. If we spread that among 200 cores (which is consistent with our typical usage), it gets done in about 3 hours.
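The back-of-envelope above can be sanity-checked in a couple of lines (the 500 CPU-hour and 200-core figures come straight from this comment):

```python
# Wall-time estimate for an embarrassingly parallel batch campaign.
cpu_hours = 500   # O(500) CPU-hours to process ~10 TB
cores = 200       # typical concurrent core count
wall_hours = cpu_hours / cores
print(wall_hours)  # 2.5 -- i.e. roughly 3 hours once scheduling overhead is included
```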

zonca (Collaborator) commented May 19, 2020

Do you run single- or multi-threaded code? Is it Python?

pibion (Member, Author) commented May 19, 2020

@ziqinghong my understanding is that the code is Python and that it is single-threaded.

zonca (Collaborator) commented May 19, 2020 via email

ziqinghong commented:

Single-threaded. Not sure if I know the difference between 200 nodes and 200 cores... 200 x single thread.

zonca (Collaborator) commented May 19, 2020

Thanks. For a workload like this one that doesn't use MPI and is not too large, I think we could execute it directly on Jetstream, inside Kubernetes, with dask.

Can you prepare a dataset plus the code for this data-processing stage, with some documentation on how to execute it, ideally in a dedicated GitHub repository (the code, with pointers to the data)? Then I can check whether we can execute it with dask.
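Since each dataset is independent and the per-dataset script is single-threaded, the dask side reduces to a fan-out map over dataset IDs. A minimal stdlib sketch of that pattern (dask's `client.map` / `client.gather` have the same shape); the `subprocess` call to the real processing script is hypothetical and left commented out so the sketch stays self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def process_dataset(series_id):
    """Run the single-threaded processing for one dataset (an hour of data)."""
    # In the real setup this would shell out to the existing script, e.g.:
    # subprocess.run(["python", "AnimalDataAnalysiscont_blocks.py",
    #                 series_id, series_id], check=True)
    return series_id  # placeholder result so the sketch runs as-is

def process_all(series_ids, max_workers=8):
    # Threads suffice here because the real work would happen in child
    # processes; with dask.distributed this line becomes
    # client.gather(client.map(process_dataset, series_ids)).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_dataset, series_ids))
```

With a dask cluster on Kubernetes, the same map fans out across worker pods instead of local threads.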

ziqinghong commented:

I copied a little bit of raw data over, and a bunch of scripts. I could start an interactive data processing run in the Jupyter terminal in the browser by doing:

```shell
source /cvmfs/cdms.opensciencegrid.org/setup_cdms.sh V03-01
cd /cvmfs/data/tf/AnimalData/processing/
python AnimalDataAnalysiscont_blocks.py 20190718102534 20190718102534
```

The code is nasty, as it has a bunch of hard-wired locations... It also needs an existing directory at /home/jovyan/work/blocks... If the work directory gets reset, it'll error out.

@zonca If you could give it a try, let me know if you encounter more errors. I'm still running it... Seems like it'll take O(10 minutes).

pibion (Member, Author) commented May 20, 2020

@ziqinghong if you point me to the data you're using, we can make sure we get that into the data catalog. Once we've got the code updated that is :)

zonca (Collaborator) commented May 21, 2020

Thanks @ziqinghong, yes, it runs fine. I'll try whether I can use dask to run multiple instances of it in parallel.

zonca (Collaborator) commented May 21, 2020

Actually, it would be best if you gave me a set of 10 (or better, 100) different inputs, so I can try running them in parallel.

ziqinghong commented:

AWESOME!!!!

zonca (Collaborator) commented Sep 3, 2020

There are too many different things in this issue. @pibion, if you are still interested in any of the above, please open a dedicated issue.

@zonca zonca closed this as completed Sep 3, 2020
CDMS JupyterHub on XSEDE automation moved this from To do to Done Sep 3, 2020