Add first version of Vega ROC plots #77

Merged
12 commits merged on Feb 1, 2017

Conversation

patrick-miller
Member

This introduces a Vega based ROC plot. Without the interactivity it looks like the following:

image

Currently, it takes a CSV file/data stream, but we can use a JSON one instead depending on how the backend team wants to serve it. The inputs are the false positive rate, the true positive rate, and the curve type (train, test, CV). I plan on adding the ability to specify the data set used (or model) so that we can split out the full feature model and the covariates only model.

Let me know if you have any questions/comments.
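
As an illustration of the input format described above, here is a minimal Python sketch that groups CSV rows into one series per curve. The header matches the sample file in this PR's diff (`false_positive,true_positive,curve,data`); the data values, the `load_curves` helper, and the reading of `data` as the feature set/model are assumptions made for the example, not part of the PR.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; the header is the one from the PR's sample CSV.
SAMPLE_CSV = """false_positive,true_positive,curve,data
0.0,0.0,train,full
0.1,0.6,train,full
1.0,1.0,train,full
0.0,0.0,test,full
0.2,0.5,test,full
1.0,1.0,test,full
"""


def load_curves(text):
    """Group (fpr, tpr) points by (curve, data) so each key is one plotted line."""
    curves = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["curve"], row["data"])
        point = (float(row["false_positive"]), float(row["true_positive"]))
        curves[key].append(point)
    return dict(curves)
```

Each key of the returned dictionary would correspond to one line in the plot.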

@dhimmel dhimmel left a comment

That was fast! Nice pull request.

Did you try vega-lite? The specification is higher level and gets compiled to vega. If vega-lite isn't lacking a necessary feature, I think that would be preferred. I'm impressed that you tackled the vega!


<!-- TODO -->
<!-- Install with npm install vega -->
<script src="http://vega.github.io/vega-editor/vendor/d3.min.js" charset="utf-8"></script>
Member

Let's switch to versioned includes, so we don't have any surprises at deployment time. From https://github.com/vega/vega-lite-demo/issues/1#issuecomment-271972536:

<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.17/d3.min.js" charset="utf-8"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/vega/2.6.5/vega.min.js" charset="utf-8"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/vega-lite/1.3.1/vega-lite.min.js" charset="utf-8"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/vega-embed/2.2.0/vega-embed.min.js" charset="utf-8"></script>

Member

Check to see if you want vega 2 or 3.

Vega 2 uses D3 v3. Vega 3 uses D3 v4.

@dhimmel

dhimmel commented Jan 14, 2017

@patrick-miller where do you think we should note the AUROC for each curve? Either as additional text in the legend or on hover?

@dhimmel

dhimmel commented Jan 14, 2017

I'm starting to think being able to compute the TPRs and FPRs to create an ROC in javascript would be killer. There are up to 33 different cancers that can be selected -- users may be interested in selecting certain cancers, which will filter to a subset of samples (observations). Thus the ROC curve would change.

We could always have the backend recalculate, if doing this on the frontend is too burdensome. Not that this decision or implementation should be part of this PR. Just wanted to jot down my thoughts and get your opinion.
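
To make the recalculation concrete, here is a minimal sketch of computing the ROC points after filtering to a subset of cancers. It is written in Python (the backend option mentioned above); the same logic could be ported to JavaScript. The row fields `disease`, `status`, and `score` and both function names are hypothetical and not taken from the project.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists with one point per distinct score threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Rank samples from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all samples tied at this score are counted.
        is_last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_tie:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def roc_for_subset(rows, selected_diseases):
    """Filter rows to the selected cancers, then recompute the ROC curve."""
    subset = [r for r in rows if r["disease"] in selected_diseases]
    labels = [r["status"] for r in subset]
    scores = [r["score"] for r in subset]
    return roc_points(labels, scores)
```

Calling `roc_for_subset(rows, {"A"})` would recompute the curve using only the samples whose `disease` is `"A"`, which is the behavior a cancer-type filter would need.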

@patrick-miller
Member Author

I'll put some thought into it, though I doubt I will have any strong opinions between the versions. I think vega 3 is still in development. As for vega vs. vega-lite, you can definitely do more with vega. I'm not sure whether vega-lite supports any interactivity (I have only used vega in the past).

There are a few different places we could put the AUROC. We can put it in the legend like you have been doing in Python. We can put it on hover (would switch to keeping hover on permanently). We can put it to the right of the lines. I'll play around with adding it in some different places in a separate pull request.
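
Whichever placement is chosen, the AUROC value itself can be derived from the same FPR/TPR points that feed the plot. The following Python sketch uses the trapezoidal rule; the function names and the legend text format are assumptions for illustration, not the PR's implementation.

```python
def auroc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule.

    Assumes fpr is sorted in non-decreasing order and pairs up with tpr.
    """
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area


def legend_label(name, fpr, tpr):
    """Build a legend entry such as 'test (AUROC = 0.750)'."""
    return f"{name} (AUROC = {auroc(fpr, tpr):.3f})"
```

The same string could be shown in a legend, in a hover tooltip, or next to the end of each line.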

In terms of the way the data is going to be served...anytime a user filters to a subset of cancers we would need to make a server side call to the data set, correct? Or are you imagining storing all of the prediction data in the frontend? We can certainly move a step to the frontend, I'm just not sure if this will really speed things up that much if you have the data cached in Redis on the backend anyway. Correct me if I'm wrong, but isn't the difference just IO?

@dhimmel

dhimmel commented Jan 14, 2017

> Correct me if I'm wrong, but isn't the difference just IO?

IO and programming language. The javascript method could be done entirely client side. Otherwise, we can use python via the backend to compute the ROC curve.

> In terms of the way the data is going to be served...anytime a user filters to a subset of cancers we would need to make a server side call to the data set, correct?

Unless we load the entire prediction table into the browser. This table is at most 8,000 rows, so it's a possibility.

Let's defer any decisions here until we have a better idea of the results viewer.

@dhimmel dhimmel left a comment

Everything looking good.

> I plan on adding in the ability to specify the data set used (or model) so that we can split out the full feature model and the covariates only model.

Once you get this implemented, I'll run locally and play with the viz.

@patrick-miller

Here is how the visualization looks now. We can play with how the interactivity works once I start putting together the AUROC for each curve.

image

@dhimmel dhimmel left a comment

Coming along nicely.

I think it makes most sense to map partition to color and feature_set to linetype (e.g. solid for all features, dashed for covariates only). How difficult would that be to implement?

@@ -0,0 +1,25 @@
false_positive,true_positive,curve,data
Member

May want to add the points (0, 0) and (1, 1) to each curve to more closely represent the ROC curves on real data.
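
A small Python sketch of that suggestion, assuming each curve is a list of `(false_positive, true_positive)` pairs; `with_endpoints` is a hypothetical helper, not code from this PR.

```python
def with_endpoints(points):
    """Return the curve sorted by FPR with (0, 0) and (1, 1) added if missing."""
    pts = sorted(points)
    if not pts or pts[0] != (0.0, 0.0):
        pts.insert(0, (0.0, 0.0))
    if pts[-1] != (1.0, 1.0):
        pts.append((1.0, 1.0))
    return pts
```

Curves that already start at the origin and end at (1, 1) are returned unchanged apart from sorting.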

In order to run these sample files, you should first start up a simple HTTP server such as:

```sh
python -m SimpleHTTPServer 8000
Member

Note that this repository's environment uses Python 3.

In Python 3, I think this should be:

python -m http.server 8000

Feel free to include both commands, if you'd like.

Member Author

Ah, I'll fix that.

@@ -0,0 +1,25 @@
false_positive,true_positive,curve,data
Member

Can we make the column names more descriptive:

  • false_positive_rate
  • true_positive_rate
  • partition
  • feature_set

python -m SimpleHTTPServer 8000
```

Then navigate to that instance (localhost:8000) and click on the file that you wish to view.
Member

Replacing localhost:8000 with http://localhost:8000/ will make the link clickable -- just worried that some ML devs will be confused by what localhost is.

Member Author

Will do.
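
For reference, the Python 3 `http.server` module discussed in this thread can also be started from a script. The sketch below is one possible way to do that, assuming Python 3.7 or later (for the handler's `directory` argument); `make_server` and `server_url` are hypothetical helper names.

```python
import functools
import http.server


def make_server(directory=".", port=8000):
    """Create (but do not start) a local static file server for `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Bind to the loopback interface only, so the files are not exposed externally.
    return http.server.HTTPServer(("127.0.0.1", port), handler)


def server_url(server):
    """Return the clickable URL for a server created by make_server."""
    port = server.server_address[1]
    return f"http://localhost:{port}/"
```

Calling `make_server().serve_forever()` would then serve the sample files until interrupted, and `server_url(...)` gives the address to open in a browser.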

@patrick-miller

Made the small tweaks and switched to dashed lines for the covariates. It wasn't exactly straightforward, so there may be an easier way that I couldn't find to do it. Latest update:

image

@dhimmel

dhimmel commented Jan 16, 2017

@patrick-miller nice. I'm thinking we want to remove the dots (and keep just the lines), since there can be thousands of actual points in some of our ROC curves.

For the "feature set" legend, is it possible to use a line rather than a point to show the difference between solid and dashed? No big deal if this is too difficult.

Also, how hard is it to add some transparency/alpha to the lines? I'm thinking we may have overlapping ROCs.

@dhimmel

dhimmel commented Jan 16, 2017

Would love to get you some real data to plug in.

@patrick-miller

Agreed on removing the dots. They are placeholders for now for the interactive portion; I'm still considering how best to display it (thoughts are very welcome!).

I'll switch the legend to a line. I'm pretty sure it should be possible.

Transparency should be easy; I'll play around with some values. I'll do a data dump from one of the notebooks so that I can work out which values will be better.

@patrick-miller

patrick-miller commented Jan 27, 2017

I added 'real' data for the ROC plot (comes from the 2.TCGA-MLexample notebook) -- for the covariates only model I fabricated the data. I took out the dots to make the rendering faster, but we will probably want to sample from the full ROC data that sklearn outputs (too many FPR and TPR breaks).

Things left to decide on: interactivity and where to put the AUROC for each feature set/partition split.

image

@dhimmel

dhimmel commented Jan 27, 2017

@patrick-miller, looks great and thanks for creating the more realistic data.

> I took out the dots to make the rendering faster, but we will probably want to sample from the full ROC data that sklearn outputs

Since most points in our ROC curve lie on the line and are not actually inflection points, we can prune many of the points without any change to the curve! Here is an R implementation of this method. It shouldn't be hard for us to implement this in python.
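
A rough Python sketch of the pruning idea: drop every interior point that lies on the straight line between its neighbors, so the drawn curve is unchanged. This is an assumption about how such pruning could work, not a port of the R implementation mentioned above, and `prune_collinear` is a hypothetical name. It assumes the points are ordered along a monotone ROC curve.

```python
def prune_collinear(fpr, tpr, tol=1e-12):
    """Remove interior ROC points that do not change the shape of the curve."""
    points = list(zip(fpr, tpr))
    if len(points) <= 2:
        return points
    kept = [points[0]]
    for candidate, following in zip(points[1:], points[2:]):
        x0, y0 = kept[-1]
        x1, y1 = candidate
        x2, y2 = following
        # The cross product is zero when the three points are collinear.
        cross = (x1 - x0) * (y2 - y0) - (y1 - y0) * (x2 - x0)
        if abs(cross) > tol:
            kept.append(candidate)
    kept.append(points[-1])
    return kept
```

For a step-shaped curve, only the corner points survive, which can cut the number of rows sent to the browser considerably.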

> Things left to decide on: interactivity and where to put the AUROC for each feature set/partition split.

For the AUROC, I think the two options are in the tooltip that appears on hover or in an additional legend. The additional legend could just contain the linetypes and the AUROC%.

@patrick-miller

I got some interactivity working. It isn't perfect, but it is definitely a start.

image

@dhimmel

dhimmel commented Feb 1, 2017

> I got some interactivity working. It isn't perfect, but it is definitely a start.

Looks great. My only suggestion would be to show AUC as a percentage and to give the TPR, FPR, and AUC percentages one decimal place of precision, like TPR 88.1%.

@patrick-miller

Ok, I formatted the interactive legend to have 1 decimal point and all three figures are %s.
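
The requested formatting (one decimal place, shown as a percentage) can be illustrated in Python as follows. The actual plot presumably does this inside the Vega specification; the helper names and the tooltip layout here are assumptions for the example.

```python
def as_percent(value, digits=1):
    """Format a rate in [0, 1] as a percentage string, e.g. 0.881 -> '88.1%'."""
    return f"{value:.{digits}%}"


def tooltip_text(fpr, tpr, auc):
    """Combine the three figures into a single tooltip line."""
    return f"FPR {as_percent(fpr)}, TPR {as_percent(tpr)}, AUC {as_percent(auc)}"
```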

@dhimmel

dhimmel commented Feb 1, 2017

Great. I got the visualization up and running locally. See

vega-roc

I noticed the box overlaps with the AUC percentage sign. Is there an easy fix? If not, I'm happy to merge as is! Thanks for seeing this PR through. Can't wait till we deploy it.

@patrick-miller

Yep, it is very easy. Right now, a lot of those parameters are hard coded, so I'm going to look at changing that in the future.

@dhimmel dhimmel merged commit 40a02f7 into cognoma:master Feb 1, 2017
@patrick-miller patrick-miller deleted the feature-vega_ROC branch February 1, 2017 21:28