Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to plot embeddings that NaN values #1559

Closed
mattcai opened this issue Jun 11, 2020 · 17 comments · Fixed by #1667
Closed

Ability to plot embeddings that NaN values #1559

mattcai opened this issue Jun 11, 2020 · 17 comments · Fixed by #1667
Assignees

Comments

@mattcai
Copy link
Contributor

mattcai commented Jun 11, 2020

In working with Kun.Leng@ucsf.edu to remix his neuronal AD dataset for Corpora, we came upon the issue of visualizing subcluster embeddings that only have coordinate values for a subpopulation of cells. The solution we would like is to fill the other cell coordinates with NaNs and cellxgene would plot only the cells with numerical coordinates for each embedding. Currently, cellxgene shows a blank plot if there are any np.nan or np.inf in the embedding.

Previous issue identifying the "bug": #1118


Update: cellxgene will now accept embeddings containing a NaN value. +/- Inf continue to be rejected. This issue is left open as we are still working out how to improve the UX experience for embeddings with NaN values.

@bkmartinjr
Copy link
Contributor

@colinmegill - if you can flesh out this user story & design, we can likely implement. It is definitely true that the current implementation ignores any embedding with non-finite numbers as a coordinate.

@mattcai - for a work-around, could you assign all of the not-in-this-cluster cells a finite coordinate such as [-1,-1]? They will just stack up on that point, and likely not disrupt the plots too badly.

@colinmegill
Copy link
Contributor

@mattcai thanks for creating a detailed issue! This is definitely an edge case we should handle gracefully.

@bkmartinjr I would expect finite coordinates to interfere with selections and break expectations.

The origin coordinate here is quite busy, for instance:

image

How difficult is it to ignore these? Would we handle this in graph in regl? https://github.com/chanzuckerberg/cellxgene/blob/master/client/src/components/graph/graph.js#L697

@bkmartinjr
Copy link
Contributor

I was not proposing the [-1,-1] hack as anything other than a temporary work-around for the current version of cellxgene. Pick a point that is not going to conflict with the embedding.

For a real fix/support, I think we can do this. Some work, but definitely possible. I am interested in seeing the UI design fleshed out a bit more before implementation.

Example: if one of these subset embeddings is being displayed, and he user lasso-selects part of it, are the cells with a NaN/Inf coordinate selected, or not selected? This will affect how they display on related UI, such as scatterplot, or how they interact with the subset button.

@mattcai
Copy link
Contributor Author

mattcai commented Jun 13, 2020

@bkmartinjr, yes using a dummy value is a possible solution but not very elegant imo. We can go with that until cellxgene supports NaN values.

if one of these subset embeddings is being displayed, and he user lasso-selects part of it, are the cells with a NaN/Inf coordinate selected, or not selected? This will affect how they display on related UI, such as scatterplot, or how they interact with the subset button.

For the current user's application, we were thinking the relevant cell types would be selected and subset before displaying the subset embeddings. The cells with NaN should not be able to be selected.

Some context: another solution to the user's problem was to take each subpopulation of cells used for subclustering and save it as a separate dataset, that way subset embeddings would have all finite values for each dataset. This reflects their analysis methods but would lead to multiple datasets in Corpora with overlapping data.

@bkmartinjr bkmartinjr self-assigned this Jun 13, 2020
@bkmartinjr
Copy link
Contributor

Thanks @mattcai - sounds reasonable. Another design question: when displaying an embedding with a subset of cells having a non-finite coordinate in that embedding, do all cells still show in the categorical and continuous metadata?

It seems mechanically simplest if they do - ie, the only effect is that some cells are "missing" from the embedding, but all still display in the other views. Just not sure that is sensible from a UI standpoint. @colinmegill ?

@colinmegill
Copy link
Contributor

@bkmartinjr The pattern that comes to mind is the tooltip we currently show on hover of any label. This could include extra text, ie., Lung — 14 cells have no x y coordinate and are not shown in graph (but shorter :)

image

@colinmegill
Copy link
Contributor

If this was not discoverable enough, and depending on how mission critical this was, we could escalate to adding an icon after the label to indicate that there are cells not shown in graph.

@bkmartinjr
Copy link
Contributor

@colinmegill - I am not sure what you are proposing, as the NaN is the value of the embedding coordinate, not the category label. The "missing" cells (ie, those with no coordinate) could be scattered across any number of categories/labels.

I think the simplest thing is to show everything, everywhere, but "hide" those with NaN coordinates in the graph. We already do the same thing in the continuous dimensions, where cells with non-finite values are silently omitted from the histogram.

Does this seem reasonable?

@colinmegill
Copy link
Contributor

I should have used a different example. Arbitrary label. All tooltips would be modified with counts of cells without coords.

image

@colinmegill
Copy link
Contributor

I agree that we should show everything, everywhere, as we do already.

I do believe we should surface the lack of presence of cells to the user via tooltip to avoid confusion / hint at unexpected behavior (clicking a checkbox with a large amount of labels and a small amount of cells toggling on and off). In the case of missing continuous data, we grey out the cell, a hint, but a path not available lacking coords.

@mattcai
Copy link
Contributor Author

mattcai commented Jun 15, 2020

Another design question: when displaying an embedding with a subset of cells having a non-finite coordinate in that embedding, do all cells still show in the categorical and continuous metadata?

Yes, I think that makes the most sense and gives the user the most flexibility for analysis. Just to reiterate the desired functionality, since I am not as familiar with cellxgene: Choosing an embedding with some NaN coordinates should not select or subset data in any way. If the user does want to subset data to only cells with finite values in embedding, they must either know which cell label to select (e.g. checking box for patient 1 and clicking subset) or they can use lasso to select all cells in embedding and clicking subset.

All tooltips would be modified with counts of cells without coords.

A tooltip might be too subtle, imo. Would it be too cluttered to have a second greyed out number indicating counts of cells with coords? This would only be shown when the embedding has non-finite coords, which would alert the user that the embedding they switched to is missing cells. I think it's debatable whether it's more useful to show number of cells with coords or without.

@colinmegill
Copy link
Contributor

colinmegill commented Jun 15, 2020

I was afraid of that :)

There are a number of disadvantages to doubling the number on the label, primarily:

  • Variable amount of width, potentially quite a bit of space
  • Confusion with the concept of 'currently selected' and 'subset', we don't want to confuse the user with what fraction we're displaying — the idea of 'out of x' is quite nuanced as the user might be looking at x cells (current selection) out of y subset out of z total.

I was thinking of something like this after the label itself, designating that there were missing values, that would provide more context on hover. I believe this should be yellow, for 'warning', and toggleable off to avoid distraction if unwanted.

image

@mattcai
Copy link
Contributor Author

mattcai commented Jun 16, 2020

I agree with the reasons against adding another number to the labels. The icon is also intuitive and I like it. As long as the user can find the number of cells missing/included, such as by hovering over the icon, I think it is useful.

Just to throw out another idea, what if the icon was a pie chart that conveyed the fraction of cells missing? Like a yellow wedge proportional to missing values? That way the user can tell at a glance the severity of the 'warning' for each label. This might be useful in cases where there are many categories like cell types, and it would be time consuming to hover over each label to check how many are missing.

@colinmegill
Copy link
Contributor

Interesting idea! We do already use the pie chart to signify 'subset', but we don't signify how much.

image

It's quite an interesting idea to have a dynamically updating chart on that button that updates as the current selection changes, and as the subset changes.

This might imply that initially, both buttons are circles, and disabled. As current selections are made, these are highlighted (in a bright color) as a wedge on the pie on the button. Click it, and that becomes the new wedge.

If this pattern was established on the subset button, it could be used other places — ie., what fraction of overall cells is a given label, and what fraction of that fraction is selected, and what fraction has coords. This last step is significantly more involved, but there will be user surprise if the only place we're showing a fraction is 'not displayed on graph', given how many other fractions we ask them to keep in their head as they use the tool.

@mattcai
Copy link
Contributor Author

mattcai commented Jun 17, 2020

In a discussion with GFPS (comp bios + app scis), the consensus was that the data should be subset when selecting an embedding that has np.nan values. This would reflect the analysis process that created the embedding. This implementation would remove some flexibility but is also clear and straightforward from the user perspective. What do you think @colinmegill?

@bkmartinjr
Copy link
Contributor

@mattcai - if I understand correctly, you are asking that the user selection of an embedding implicitly perform a subset UI operation if the embedding contains non-finite numbers?

I don't have any comments on UI impact (ie, hidden state, good or bad), but this will make the implementation significantly more complex. That said, it is doable.

@colinmegill - it is worth considering this in the context of the experimental reembedding feature, which also interacts (implicitly) with subsetting/unsubsetting. In this particular case we also elected to tie an embedding to a subset, as it was very confusing to show it outside of this view for the reasons you discuss above. I actually think Matt's proposal is a generalization of what we did earlier, and would improve our existing model.

@mattcai 's proposal leads to a number of questions about the interaction between manual subsetting (select & subset) vs the subsetting implicit in switching embeddings:

  • if the user switches to multiple "subset embeddings", subsetting each time, and then back to a whole-dataset embedding, are the subsets reverted, or do they accumulate, or? (I would assume they simply revert)
  • how does this interact with the subset UI feature, where the user controls the subset?
    1. if the user already has a manual subset, which does not intersect an embedding, is the embedding viewable or disabled?
    2. if the user already has a manual subset that overlaps the embedding's cells, what is the effect of switching to that embedding? Implicitly subset the intersection?
    3. if the user is viewing an embedding on a subset of cells, and manually subsets, then "resets" the subset, what happens?
      etc....

This all seems resolvable, but there is some explicit state we need to manage. If you prefer, a flow diagram might make this easier to work through.

My suggested approach:

  • each embedding has an implicit subset
  • the currently visible cells are the intersection of the user's manual subset and the embedding's subset
  • the "reset subset" only controls the user's manual subset
  • the embedding view only controls the embedding's implicit subset

This has some very sharp edges if the user is doing manual subsetting while looking at an embedding that doesn't contain all cells. But honestly, that might also be a useful feature...

Thoughts?

@mattcai
Copy link
Contributor Author

mattcai commented Jul 1, 2020

Ok, so I started by trying to make a flowchart that covered all the possible combinations and couldn't think of an approach that wasn't confusing for users. My proposal, like @bkmartinjr's suggested approach, is to treat the "subset embeddings" like the separate dataset from which it was computed. Switching to an embedding with a different subset would be akin to opening a different file, which is one way this analysis is handled already.

Background: Every embedding has an implicit subset of cells with finite embedding coordinates. I'll refer to this as the cell-set of the embedding. Multiple embeddings can share the same cell-set. Cell-sets can also overlap with each other, partially or completely. There is at least one embedding whose cell-set contains every cell in the dataset (likely always true, but I don't think is a requirement of this approach).

How to handle changing embeddings:

  • Changing to an embedding with a different cell-set will prompt a confirmation dialog box that warns "all current subsets and DE analysis will be lost". Then the operation performs an "undo all (manual and implicit) subsets", "subsets to new embedding cell-set", and then changes embedding. Other selections like color labels are unchanged. This occurs even if the current subset is completely overlapping with the new implicit subset.

  • Changing to an embedding that shares the same cell-set as the current embedding will result in no warning or undoing of subsets. Any manual subsets are kept, DE analysis is kept, etc. This is like what cellxgene currently does when embeddings are changed.

Manual subsetting and undoing subsets:

  • Subsetting behavior is unchanged. You can only select cells in the embedding's cell-set.

  • Undoing subset will only revert the manual subsets, returning to the complete cell-set of the current embedding.

This has some very sharp edges if the user is doing manual subsetting while looking at an embedding that doesn't contain all cells. But honestly, that might also be a useful feature...

I agree, this will likely be viewed as useful or at least expected.

Possible pitfalls:

  • Users annoyed they are unable to carry over a selected subset from one embedding to another, if the selected subset is known to be in both cell-sets
    • possible solution: have a way to annotate those cells that persists
  • Users may want to reset to the default cell-set without understanding they have to change to the default embedding
    • possible solution: provide clear UI cues when they initially leave the default cell-set
  • Users may become frustrated with trying to change to embeddings that keep warning them they'll lose their selected subset
    • possible solution: label embeddings to indicate which ones share cell-sets and which ones will change cell-set

What do you think @colinmegill? Does this sound like an idea that can be implemented?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants