Machine Learning Punch List for Launch #110

Closed
5 tasks done
rdvelazquez opened this issue Aug 16, 2017 · 19 comments

Comments

@rdvelazquez
Member

rdvelazquez commented Aug 16, 2017

Here's the general punch list we discussed at tonight's meetup for getting the machine learning part of cognoma launch-ready.

To be completed at a later date: templating for Jupyter notebooks (@wisygig)

@patrick-miller
Member

Once we are happy with the final notebook, we should do a final cleanup (there is still a warning in the import section) and then I will open a PR into mlworkers.

@patrick-miller
Member

@rdvelazquez @dhimmel @wisygig

Because the launch party is coming up, I was going to open a PR to push the notebook to production this week/weekend. I think Ryan has made very good progress, and our focus should probably shift to making sure that his changes work on our AWS instances.

@dhimmel
Member

dhimmel commented Oct 2, 2017

Because the launch party is coming up, I was going to open a PR to push the notebook to production this week/weekend

I agree & fully support!

@rdvelazquez
Member Author

I was going to open a PR to push the notebook to production this week/weekend.

Sounds good to me!

@patrick-miller
Member

Using the updated notebook and a small sample set, I got reasonable results back in production. We are getting the following warning for a pandas import:

[screenshot: pandas import warning]

I'm going to run a larger query to see if we run into memory issues.

@patrick-miller
Member

The larger query for TP53 as the gene and all diseases selected runs into a memory error.

@dhimmel @rdvelazquez

@dhimmel
Member

dhimmel commented Oct 4, 2017

The larger query for TP53 as the gene and all diseases selected runs into a memory error.

Okay @kurtwheeler and I will discuss our options tomorrow and let you know what we're thinking.

@patrick-miller
Member

Any updates on this? Should we reduce the hyperparameter search space? We could probably cut it roughly in half by stepping alpha every 0.2 instead of every 0.1.
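For reference, the kind of change I mean would look something like this (a sketch; the actual alpha grid and parameter names in the notebook may differ):

    import numpy as np

    # Hypothetical alpha grid: log-spaced, with the exponent stepped by 0.1.
    alphas_fine = 10.0 ** np.arange(-3, 1, 0.1)    # 40 candidate values
    # Stepping the exponent by 0.2 instead cuts the grid roughly in half.
    alphas_coarse = 10.0 ** np.arange(-3, 1, 0.2)  # 20 candidate values

    param_grid = {'classify__alpha': alphas_coarse}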

@dhimmel
Member

dhimmel commented Oct 9, 2017

Any updates on this?

I'm hoping to get to this this afternoon. What I'm thinking is to run the pathological query (TP53 as the gene and all diseases selected) locally and see how much memory is consumed. I would use the technique that @yl565 previously implemented.

This will let us know whether our AWS instance size is too small or whether there is another issue where memory isn't being fully allocated.
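Something along these lines with memory_profiler would work (a sketch; run_query is a hypothetical stand-in for whatever loads the data and fits the classifier, and this may differ from the exact technique @yl565 used):

    from memory_profiler import memory_usage

    def run_query():
        # Hypothetical stand-in: load the expression matrix and fit the
        # classifier for TP53 with all diseases selected.
        ...

    # Sample memory (in MiB) every 0.1 s while run_query executes.
    samples = memory_usage((run_query, (), {}), interval=0.1)
    print('peak memory: %.1f MiB' % max(samples))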

Should we reduce the hyperparameter search space?

Does this affect memory usage now that we're using dask-searchcv?
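For context, the grid search with dask-searchcv is wired up roughly like this (a sketch; the estimator and grid values are illustrative, not the production notebook's). The grid still determines how many fit tasks exist, but how many run concurrently depends on the scheduler:

    from dask_searchcv import GridSearchCV  # pip install dask-searchcv
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(loss='log', penalty='elasticnet', random_state=0)
    param_grid = {
        'alpha': [10 ** x for x in (-3, -2, -1, 0)],
        'l1_ratio': [0.0, 0.15, 0.3],
    }

    # dask-searchcv builds a single task graph for all candidates and merges
    # shared work; peak memory depends on how many fits the scheduler runs
    # at once, not just on the grid size.
    cv_search = GridSearchCV(clf, param_grid, cv=3)
    # cv_search.fit(X_train, y_train)  # X_train / y_train come from the notebook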

@patrick-miller
Member

patrick-miller commented Oct 9, 2017

Does this affect memory usage now that we're using dask-searchcv?

Oh, I'm not sure; I just assumed that's where the memory burden still was. I think @rdvelazquez would have a better answer.

@rdvelazquez
Member Author

I assume that the hyperparameter space affects memory usage. I'm not positive, but I don't see how it wouldn't. I also assume that the number of PCA components has a much greater effect on memory than alpha_range, so we may also need to evaluate the number of PCA components.

Once we have the notebook code set up to be profiled, we can easily adjust the hyperparameter space and re-profile to quantify how changes to the hyperparameter space affect memory (if at all).
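As a concrete sketch of the knobs in question, assuming the notebook uses a PCA → SGDClassifier pipeline (the names and grid values here are illustrative):

    from sklearn.decomposition import PCA
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([
        ('pca', PCA()),
        ('classify', SGDClassifier(loss='log', penalty='elasticnet')),
    ])

    # Illustrative grid: pca__n_components controls the size of every
    # transformed dataset held in memory, so trimming it (or its number of
    # candidates) likely matters more than trimming alpha.
    param_grid = {
        'pca__n_components': [30, 60, 100],
        'classify__alpha': [10 ** x for x in (-3, -2, -1, 0)],
    }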

@dhimmel
Member

dhimmel commented Oct 9, 2017

So I installed memory_profiler (pip install memory_profiler) and then used the %%memit notebook magic. Here's the HTML export of the notebook: 2.mutation-classifier-1-job.html.txt. Reading the files consumed 4.5 GB, which increased to 6.5 GB after making the training / testing dataset. Fitting the default models peaked at 11.7 GB (assuming a 1000:1 mebibyte-to-gigabyte conversion, which is slightly off). So @kurtwheeler, it looks like we must upgrade the AWS instance size and limit them to 1 job at a time.
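For anyone reproducing this, the setup is roughly the two notebook cells below. %%memit has to be the first line of the cell it profiles, and it prints the peak memory and the increment for that cell; cv_pipeline, X_train, and y_train are placeholders for the notebook's actual objects.

    %load_ext memory_profiler

    %%memit
    cv_pipeline.fit(X=X_train, y=y_train)  # placeholder for the fitting cell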

Increasing to n_jobs=4 increased peak memory to ~16 GB and gave this repeated warning:

/home/dhimmel/anaconda3/envs/cognoma-machine-learning/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1
  **self._backend_args)

So we can keep n_jobs=1 for now.

@kurtwheeler
Member

I've changed the cognoma EC2 instances from m4.large to r4.large, which increased the RAM from 8 GiB to 15.25 GiB.
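For the record, the change is roughly equivalent to the boto3 sketch below (purely illustrative; the instance ID is a placeholder, and I don't know how the cognoma infrastructure actually manages its instances). The instance has to be stopped before its type can be changed:

    import boto3

    ec2 = boto3.client('ec2')
    instance_id = 'i-0123456789abcdef0'  # hypothetical instance ID

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter('instance_stopped').wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={'Value': 'r4.large'},  # was m4.large
    )
    ec2.start_instances(InstanceIds=[instance_id])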

@rdvelazquez
Member Author

FYI - I'm getting an error on cognoma.org when I try to search for diseases: "Failed to load diseases." appears in a pink bar across the top.
[screenshot: "Failed to load diseases." error banner]

@dhimmel
Member

dhimmel commented Oct 10, 2017

Me too! https://api.cognoma.org/diseases/ returns a 503 code.

Failed to load resource: the server responded with a status of 503 (Service Unavailable: Back-end server is at capacity)
disease-type:1 Failed to load https://api.cognoma.org/diseases/: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://cognoma.org' is therefore not allowed access. The response had HTTP status code 503.

@kurtwheeler and I will look into what failed tomorrow!
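A quick way to reproduce the failure from Python while debugging (a sketch; the Origin header mimics the cross-origin request the frontend makes):

    import requests

    response = requests.get(
        'https://api.cognoma.org/diseases/',
        headers={'Origin': 'http://cognoma.org'},
    )
    print(response.status_code)                                 # 503 right now
    print(response.headers.get('Access-Control-Allow-Origin'))  # missing right now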

@rdvelazquez
Member Author

Sounds good. If the issue seems difficult to track down or fix, you could consider reverting #9 and just changing the EC2 size for now. The other changes from #9 could then be troubleshot after the launch party... with a little less pressure 😉

@dhimmel
Member

dhimmel commented Oct 10, 2017

@rdvelazquez https://api.cognoma.org should now be back up. @kurtwheeler fixed it this morning. We had changed the instance type, but had not destroyed and recreated the instances (which ECS apparently requires).

@patrick-miller
Member

Awesome! I can confirm that it works on TP53 with all diseases included:

[screenshot: classifier results for TP53 with all diseases selected]

@rdvelazquez
Member Author

It's pretty fast too!
