
Jobs fail due to memory usage #88

Open · dhimmel opened this issue Aug 15, 2017 · 8 comments

dhimmel (Member) commented Aug 15, 2017

When fitting models on all disease types, it's common for the job to exceed its memory allotment and fail.

We can increase the instance size as a first step. If that becomes cost prohibitive, we can consider changes to our dask-searchcv configuration. Currently, we use the default cache_cv=True:

> Whether to extract each train/test subset at most once in each worker process, or every time that subset is needed. Caching the splits can speedup computation at the cost of increased memory usage per worker process. If True, worst case memory usage is (n_splits + 1) * (X.nbytes + y.nbytes) per worker. If False, worst case memory usage is (n_threads_per_worker + 1) * (X.nbytes + y.nbytes) per worker.

This really speeds things up, so setting cache_cv=False would not be ideal.
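
For reference, here is a minimal sketch of where that knob lives; the estimator, parameter grid, and cv value are placeholders rather than our actual pipeline:

```python
# Sketch only: illustrates the cache_cv trade-off quoted above.
# The estimator, parameter grid, and cv value are placeholders, not our pipeline.
from dask_searchcv import GridSearchCV
from sklearn.linear_model import SGDClassifier

param_grid = {'alpha': [1e-4, 1e-3, 1e-2], 'l1_ratio': [0.0, 0.15, 1.0]}

# Default: cache each train/test split at most once per worker process.
# Worst-case memory per worker: (n_splits + 1) * (X.nbytes + y.nbytes)
search = GridSearchCV(SGDClassifier(), param_grid, cv=3, cache_cv=True)

# Lower-memory alternative: re-extract splits whenever they are needed.
# Worst-case memory per worker: (n_threads_per_worker + 1) * (X.nbytes + y.nbytes)
search_low_mem = GridSearchCV(SGDClassifier(), param_grid, cv=3, cache_cv=False)
```

As a rough illustration of the formula, a hypothetical 500 MB expression matrix with cv=3 would put the cached worst case at about (3 + 1) * 500 MB = 2 GB per worker.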

@rdvelazquez (Member)

@dcgoss or @dhimmel: How much RAM does the AWS instance currently have, and how much could we potentially increase it by? This will be useful for @wisygig or whoever else looks into the memory issue, and also for selecting the parameter grid to use in GridSearchCV.

wisygig commented Aug 30, 2017

@rdvelazquez I've been looking through dask/dask-searchcv#33 and playing a bit with memory_profiler. Aiming to put up a WIP with some monitoring tools by the end of the week.
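
To make that concrete, here is a rough sketch of the kind of memory_profiler probe in question; fit_models is a hypothetical stand-in for the actual grid-search fit we want to measure:

```python
# Rough sketch of a memory_profiler probe; fit_models is a hypothetical
# stand-in for the actual grid-search fit we want to measure.
from memory_profiler import memory_usage

def fit_models():
    # ... run the dask-searchcv grid search here ...
    pass

# Sample this process's resident memory (in MiB) every 0.5 seconds
# while fit_models runs, then report the peak.
samples = memory_usage((fit_models, (), {}), interval=0.5)
print('peak memory: {:.1f} MiB'.format(max(samples)))
```

memory_profiler also ships an mprof command (mprof run followed by mprof plot) that records the same measurements as a graph over time.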

dhimmel (Member, Author) commented Sep 6, 2017

> How much RAM does the AWS instance currently have and how much could we potentially increase that by?

I think currently the instances are 2 GB. What about increasing them to 8 GB? Paging @kurtwheeler who I think can make the change.

@kurtwheeler (Member)

We're actually at 8 GB already, for both instances. How much do we think we'll need? I can bump them up to 16 GB if really needed, although that's starting to get a bit expensive.

A cheaper alternative could be to switch to a single large ml-worker instance, plus a second, smaller instance running another copy of the core-service for high availability.

Do the ml-workers only run one job at a time? Is it possible that one of them is picking up more than one job at a time and this is causing the memory issues?

@rdvelazquez (Member)

I can run the version of notebook 2 from the machine-learning repo on my PC, which has 8 GB of RAM, without issue (and no swapping). I wonder what the available RAM on the AWS instance is; is there an easy way to find that out? Also, I think the AWS instance uses Docker to install requirements.txt with pip rather than conda (I'm not sure whether that affects memory overhead or classifier memory usage). Another potential difference may be how the data is downloaded in notebook 1 on AWS.

> Do the ml-workers only run one job at a time? Is it possible that one of them is picking up more than one job at a time and this is causing the memory issues?

The ml-workers should only be running one job at a time, even when multiple requests come in at once. I also think we've reproduced this problem enough times that it's unlikely multiple requests were occurring at once on each occasion.

@kurtwheeler (Member)

I would think that pip vs. conda shouldn't make any difference in memory overhead, and even if it did, it should be negligible. I'm not 100% sure of that, but it probably shouldn't be the first thing we investigate.

I double-checked the available RAM on the AWS instance, and it does in fact appear to be 8 GB:

```
[ec2-user@ip-172-31-3-175 ~]$ cat /proc/meminfo
MemTotal:        8178428 kB
MemFree:         7043448 kB
MemAvailable:    7661728 kB
```

That memory is shared between the core-service, ml-workers, and nginx containers, so perhaps the failures happen when both the core-service and the ml-workers hit a memory peak at the same time? I'd be curious what a graph of the core-service's memory usage looks like, but I don't think we have anything in place to track that.
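
If we just want a crude view of that, something like the psutil logger below could run on the host. This is only a sketch (nothing like it is deployed today), and it records overall host memory, so spikes would still need to be matched to the core-service versus ml-worker containers by timestamp:

```python
# Illustrative host-level memory logger; not something we currently run.
# Records overall host memory so spikes can be lined up against job timestamps;
# per-container attribution would still need something like `docker stats`.
import csv
import time

import psutil

with open('memory_log.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['timestamp', 'used_mb', 'available_mb', 'percent'])
    while True:
        mem = psutil.virtual_memory()
        writer.writerow([int(time.time()), mem.used // 2**20,
                         mem.available // 2**20, mem.percent])
        f.flush()
        time.sleep(5)
```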

rdvelazquez (Member) commented Oct 2, 2017

Any updates on the memory issue?

@patrick-miller is planning to migrate the revised machine-learning version of notebook 2 to the ml-workers repo (tagging #110). Maybe once the revised notebook is in production we can try a query with all samples to check whether it mitigates the memory issue and, if not, evaluate increasing the RAM to 16 GB, at least while we look for another solution?

cgreene (Member) commented May 20, 2018

Did we increase the RAM? Is this still an issue?
