
Jobs fail due to memory usage #88

Open · dhimmel opened this issue Aug 15, 2017 · 8 comments

dhimmel (Member) commented Aug 15, 2017

When fitting models on all disease types, it's common for the job to exceed its memory allotment and fail.

We can increase the instance size as a first step. If that becomes cost prohibitive, we can consider changes to our dask-searchcv configuration. Currently, we use the default cache_cv=True:

> Whether to extract each train/test subset at most once in each worker process, or every time that subset is needed. Caching the splits can speedup computation at the cost of increased memory usage per worker process. If True, worst case memory usage is (n_splits + 1) * (X.nbytes + y.nbytes) per worker. If False, worst case memory usage is (n_threads_per_worker + 1) * (X.nbytes + y.nbytes) per worker.

This really speeds things up, so setting cache_cv=False would not be ideal.
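
For reference, here is a minimal sketch of where that knob lives; the estimator, parameter grid, and cv value are placeholders rather than our actual pipeline:

```python
# Sketch only: illustrates the cache_cv trade-off quoted above.
# The estimator, parameter grid, and cv value are placeholders, not our pipeline.
from dask_searchcv import GridSearchCV
from sklearn.linear_model import SGDClassifier

param_grid = {'alpha': [1e-4, 1e-3, 1e-2], 'l1_ratio': [0.0, 0.15, 1.0]}

# Default: cache each train/test split at most once per worker process.
# Worst-case memory per worker: (n_splits + 1) * (X.nbytes + y.nbytes)
search = GridSearchCV(SGDClassifier(), param_grid, cv=3, cache_cv=True)

# Lower-memory alternative: re-extract splits whenever they are needed.
# Worst-case memory per worker: (n_threads_per_worker + 1) * (X.nbytes + y.nbytes)
search_low_mem = GridSearchCV(SGDClassifier(), param_grid, cv=3, cache_cv=False)
```

As a rough illustration of the formula, a hypothetical 500 MB expression matrix with cv=3 would put the cached worst case at about (3 + 1) * 500 MB = 2 GB per worker.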

@rdvelazquez (Member)

@dcgoss or @dhimmel: How much RAM does the AWS instance currently have, and how much could we potentially increase it by? This will be useful for @wisygig or whoever else looks into the memory issue, and also for selecting the parameter grid to use in GridSearchCV.

wisygig commented Aug 30, 2017

@rdvelazquez I've been looking through dask/dask-searchcv#33 and playing a bit with memory_profiler. Aiming to put up a WIP with some monitoring tools by the end of the week.
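
To make that concrete, here is a rough sketch of the kind of memory_profiler probe in question; fit_models is a hypothetical stand-in for the actual grid-search fit we want to measure:

```python
# Rough sketch of a memory_profiler probe; fit_models is a hypothetical
# stand-in for the actual grid-search fit we want to measure.
from memory_profiler import memory_usage

def fit_models():
    # ... run the dask-searchcv grid search here ...
    pass

# Sample this process's resident memory (in MiB) every 0.5 seconds
# while fit_models runs, then report the peak.
samples = memory_usage((fit_models, (), {}), interval=0.5)
print('peak memory: {:.1f} MiB'.format(max(samples)))
```

memory_profiler also ships an mprof command (mprof run followed by mprof plot) that records the same measurements as a graph over time.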

dhimmel (Member, Author) commented Sep 6, 2017

> How much RAM does the AWS instance currently have and how much could we potentially increase that by?

I think currently the instances are 2 GB. What about increasing them to 8 GB? Paging @kurtwheeler who I think can make the change.

@kurtwheeler (Member)

We're actually at 8 GB already, for both instances. How much do we think we'll need? I can bump them up to 16 GB if really needed, although that's starting to get a bit expensive.

A cheaper alternative could be to switch to a single large ml-worker instance, plus a second, smaller instance running another copy of the core-service for high availability.

Do the ml-workers only run one job at a time? Is it possible that one of them is picking up more than one job at a time and this is causing the memory issues?

@rdvelazquez (Member)

I can run the version of notebook 2 from the machine-learning repo on my PC, which has 8 GB of RAM, without issue (and no swapping). I wonder what the available RAM on the AWS instance is; is there an easy way to find that out? Also, I think the AWS instance uses Docker to install requirements.txt with pip rather than conda (I'm not sure whether that affects memory overhead or classifier memory usage). Another potential difference may be how the data is downloaded in notebook 1 on AWS.

> Do the ml-workers only run one job at a time? Is it possible that one of them is picking up more than one job at a time and this is causing the memory issues?

The ml-workers should only be running one job at a time, even when multiple requests come in at once. I also think we've reproduced this problem enough times that it's unlikely multiple requests were occurring at once on each occasion.

@kurtwheeler (Member)

I would think that pip vs. conda shouldn't make any difference in memory overhead, and even if it did, it should be negligible. I'm not 100% sure of that, but it probably shouldn't be the first thing we investigate.

I double-checked the available RAM on the AWS instance, and it does in fact appear to be 8 GB:

```
[ec2-user@ip-172-31-3-175 ~]$ cat /proc/meminfo
MemTotal:        8178428 kB
MemFree:         7043448 kB
MemAvailable:    7661728 kB
```

That memory is shared between the core-service, ml-workers, and nginx containers, so perhaps the failures happen when both the core-service and the ml-workers hit a memory peak at the same time? I'd be curious what a graph of the core-service's memory usage looks like, but I don't think we have anything in place to track that.
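
If we just want a crude view of that, something like the psutil logger below could run on the host. This is only a sketch (nothing like it is deployed today), and it records overall host memory, so spikes would still need to be matched to the core-service versus ml-worker containers by timestamp:

```python
# Illustrative host-level memory logger; not something we currently run.
# Records overall host memory so spikes can be lined up against job timestamps;
# per-container attribution would still need something like `docker stats`.
import csv
import time

import psutil

with open('memory_log.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['timestamp', 'used_mb', 'available_mb', 'percent'])
    while True:
        mem = psutil.virtual_memory()
        writer.writerow([int(time.time()), mem.used // 2**20,
                         mem.available // 2**20, mem.percent])
        f.flush()
        time.sleep(5)
```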

rdvelazquez (Member) commented Oct 2, 2017

Any updates on the memory issue?

@patrick-miller is planning to migrate the revised machine-learning version of notebook 2 to the ml-workers repo (tagging #110). Maybe once the revised notebook is in production we can try a query with all samples to check whether it mitigates the memory issue and, if not, evaluate increasing the RAM to 16 GB, at least while we look for another solution?

cgreene (Member) commented May 20, 2018

Did we increase the RAM? Is this still an issue?
