Machine Learning Punch List for Launch #110

Closed
5 tasks done
rdvelazquez opened this issue Aug 16, 2017 · 19 comments

Comments

@rdvelazquez
Member

rdvelazquez commented Aug 16, 2017

Here's the general punch list we discussed at tonight's meetup for getting the machine learning part of cognoma launch-ready.

To be completed at a later date: templating for Jupyter notebooks (@wisygig)

@patrick-miller
Member

Once we are happy with the final notebook, we should do a final cleanup (there is still a warning in the import section) and then I will open a PR into mlworkers.

@patrick-miller
Member

@rdvelazquez @dhimmel @wisygig

Because the launch party is coming up, I was going to open a PR to push the notebook to production this week/weekend. I think Ryan has made very good progress, and our focus should probably shift to making sure that his changes work on our AWS instances.

@dhimmel
Member

dhimmel commented Oct 2, 2017

Because the launch party is coming up, I was going to open a PR to push the notebook to production this week/weekend

I agree & fully support!

@rdvelazquez
Member Author

I was going to open a PR to push the notebook to production this week/weekend.

Sounds good to me!

@patrick-miller
Member

Using the updated notebook and a small sample set, I got reasonable results back in production. We are getting the following warning for a pandas import:

[screenshot: pandas import warning]

I'm going to run a larger query to see if we run into memory issues.

@patrick-miller
Member

The larger query for TP53 as the gene and all diseases selected runs into a memory error.

@dhimmel @rdvelazquez

@dhimmel
Member

dhimmel commented Oct 4, 2017

The larger query for TP53 as the gene and all diseases selected runs into a memory error.

Okay @kurtwheeler and I will discuss our options tomorrow and let you know what we're thinking.

@patrick-miller
Member

Any updates on this? Should we reduce the hyperparameter search space? We could probably cut it roughly in half by stepping alpha every 0.2 instead of every 0.1.
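For reference, the kind of change I mean would look something like this (a sketch; the actual alpha grid and parameter names in the notebook may differ):

    import numpy as np

    # Hypothetical alpha grid: log-spaced, with the exponent stepped by 0.1.
    alphas_fine = 10.0 ** np.arange(-3, 1, 0.1)    # 40 candidate values
    # Stepping the exponent by 0.2 instead cuts the grid roughly in half.
    alphas_coarse = 10.0 ** np.arange(-3, 1, 0.2)  # 20 candidate values

    param_grid = {'classify__alpha': alphas_coarse}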

@dhimmel
Member

dhimmel commented Oct 9, 2017

Any updates on this?

I'm hoping to get to this this afternoon. What I'm thinking is to run the pathological query (TP53 as the gene and all diseases selected) locally and see how much memory is consumed. I would use the technique that @yl565 previously implemented.

This will let us know whether our AWS instance size is too small or whether there is another issue where memory isn't being fully allocated.
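Something along these lines with memory_profiler would work (a sketch; run_query is a hypothetical stand-in for whatever loads the data and fits the classifier, and this may differ from the exact technique @yl565 used):

    from memory_profiler import memory_usage

    def run_query():
        # Hypothetical stand-in: load the expression matrix and fit the
        # classifier for TP53 with all diseases selected.
        ...

    # Sample memory (in MiB) every 0.1 s while run_query executes.
    samples = memory_usage((run_query, (), {}), interval=0.1)
    print('peak memory: %.1f MiB' % max(samples))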

Should we reduce the hyperparameter search space?

Does this affect memory usage now that we're using dask-searchcv?
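For context, the grid search with dask-searchcv is wired up roughly like this (a sketch; the estimator and grid values are illustrative, not the production notebook's). The grid still determines how many fit tasks exist, but how many run concurrently depends on the scheduler:

    from dask_searchcv import GridSearchCV  # pip install dask-searchcv
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(loss='log', penalty='elasticnet', random_state=0)
    param_grid = {
        'alpha': [10 ** x for x in (-3, -2, -1, 0)],
        'l1_ratio': [0.0, 0.15, 0.3],
    }

    # dask-searchcv builds a single task graph for all candidates and merges
    # shared work; peak memory depends on how many fits the scheduler runs
    # at once, not just on the grid size.
    cv_search = GridSearchCV(clf, param_grid, cv=3)
    # cv_search.fit(X_train, y_train)  # X_train / y_train come from the notebook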

@patrick-miller
Member

patrick-miller commented Oct 9, 2017

Does this affect memory usage now that we're using dask-searchcv?

Oh, I'm not sure; I just assumed that's where the memory burden still was. I think @rdvelazquez would have a better answer.

@rdvelazquez
Member Author

I assume that the hyperparameter space affects memory usage. I'm not positive, but I don't see how it wouldn't. I also assume that the number of PCA components has a much greater effect on memory than alpha_range, so we may also need to evaluate the number of PCA components.

Once we have the notebook code set up to be profiled, we can easily adjust the hyperparameter space and re-profile to quantify how changes to the hyperparameter space affect memory (if at all).
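As a concrete sketch of the knobs in question, assuming the notebook uses a PCA → SGDClassifier pipeline (the names and grid values here are illustrative):

    from sklearn.decomposition import PCA
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([
        ('pca', PCA()),
        ('classify', SGDClassifier(loss='log', penalty='elasticnet')),
    ])

    # Illustrative grid: pca__n_components controls the size of every
    # transformed dataset held in memory, so trimming it (or its number of
    # candidates) likely matters more than trimming alpha.
    param_grid = {
        'pca__n_components': [30, 60, 100],
        'classify__alpha': [10 ** x for x in (-3, -2, -1, 0)],
    }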

@dhimmel
Member

dhimmel commented Oct 9, 2017

So I installed memory_profiler (pip install memory_profiler) and then used the %%memit notebook magic. Here's the HTML export of the notebook: 2.mutation-classifier-1-job.html.txt. Reading the files consumed 4.5 GB, which increased to 6.5 GB after making the training / testing dataset. Fitting the default models peaked at 11.7 GB (assuming a 1000:1 mebibyte-to-gigabyte conversion, which is slightly off). So @kurtwheeler, it looks like we must upgrade the AWS instance size and limit them to 1 job at a time.
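For anyone reproducing this, the setup is roughly the two notebook cells below. %%memit has to be the first line of the cell it profiles, and it prints the peak memory and the increment for that cell; cv_pipeline, X_train, and y_train are placeholders for the notebook's actual objects.

    %load_ext memory_profiler

    %%memit
    cv_pipeline.fit(X=X_train, y=y_train)  # placeholder for the fitting cell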

Increasing to n_jobs=4 increased peak memory to ~16 GB and gave this repeated warning:

/home/dhimmel/anaconda3/envs/cognoma-machine-learning/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1
  **self._backend_args)

So we can keep n_jobs=1 for now.

@kurtwheeler
Member

I've changed the cognoma EC2 instances from m4.large to r4.large, which increased the RAM from 8 GiB to 15.25 GiB.
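For the record, the change is roughly equivalent to the boto3 sketch below (purely illustrative; the instance ID is a placeholder, and I don't know how the cognoma infrastructure actually manages its instances). The instance has to be stopped before its type can be changed:

    import boto3

    ec2 = boto3.client('ec2')
    instance_id = 'i-0123456789abcdef0'  # hypothetical instance ID

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter('instance_stopped').wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={'Value': 'r4.large'},  # was m4.large
    )
    ec2.start_instances(InstanceIds=[instance_id])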

@rdvelazquez
Member Author

FYI - I'm getting an error on cognoma.org when I try to search for diseases: "Failed to load diseases." appears in a pink bar across the top.
[screenshot: "Failed to load diseases." error banner]

@dhimmel
Member

dhimmel commented Oct 10, 2017

Me too! https://api.cognoma.org/diseases/ returns a 503 code.

Failed to load resource: the server responded with a status of 503 (Service Unavailable: Back-end server is at capacity)
disease-type:1 Failed to load https://api.cognoma.org/diseases/: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://cognoma.org' is therefore not allowed access. The response had HTTP status code 503.

@kurtwheeler and I will look into what failed tomorrow!
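A quick way to reproduce the failure from Python while debugging (a sketch; the Origin header mimics the cross-origin request the frontend makes):

    import requests

    response = requests.get(
        'https://api.cognoma.org/diseases/',
        headers={'Origin': 'http://cognoma.org'},
    )
    print(response.status_code)                                 # 503 right now
    print(response.headers.get('Access-Control-Allow-Origin'))  # missing right now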

@rdvelazquez
Member Author

Sounds good. If the issue seems difficult to track down or fix, you could consider reverting #9 and just changing the EC2 size for now. The other changes from #9 could then be troubleshot after the launch party... with a little less pressure 😉

@dhimmel
Member

dhimmel commented Oct 10, 2017

@rdvelazquez https://api.cognoma.org should now be back up. @kurtwheeler fixed it this morning. We had changed the instance type, but had not destroyed and recreated the instances (which ECS apparently requires).

@patrick-miller
Member

Awesome! I can confirm that it works on TP53 with all diseases included:

[screenshot: classifier results for TP53 with all diseases selected]

@rdvelazquez
Member Author

It's pretty fast too!
