New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ability to plot embeddings that NaN values #1559
Comments
@colinmegill - if you can flesh out this user story & design, we can likely implement. It is definitely true that the current implementation ignores any embedding with non-finite numbers as a coordinate. @mattcai - for a work-around, could you assign all of the not-in-this-cluster cells a finite coordinate such as [-1,-1]? They will just stack up on that point, and likely not disrupt the plots too badly. |
@mattcai thanks for creating a detailed issue! This is definitely an edge case we should handle gracefully. @bkmartinjr I would expect finite coordinates to interfere with selections and break expectations. The origin coordinate here is quite busy, for instance: How difficult is it to ignore these? Would we handle this in graph in |
I was not proposing the [-1,-1] hack as anything other than a temporary work-around for the current version of cellxgene. Pick a point that is not going to conflict with the embedding. For a real fix/support, I think we can do this. Some work, but definitely possible. I am interested in seeing the UI design fleshed out a bit more before implementation. Example: if one of these subset embeddings is being displayed, and he user lasso-selects part of it, are the cells with a NaN/Inf coordinate selected, or not selected? This will affect how they display on related UI, such as scatterplot, or how they interact with the subset button. |
@bkmartinjr, yes using a dummy value is a possible solution but not very elegant imo. We can go with that until cellxgene supports NaN values.
For the current user's application, we were thinking the relevant cell types would be selected and subset before displaying the subset embeddings. The cells with NaN should not be able to be selected. Some context: another solution to the user's problem was to take each subpopulation of cells used for subclustering and save it as a separate dataset, that way subset embeddings would have all finite values for each dataset. This reflects their analysis methods but would lead to multiple datasets in Corpora with overlapping data. |
Thanks @mattcai - sounds reasonable. Another design question: when displaying an embedding with a subset of cells having a non-finite coordinate in that embedding, do all cells still show in the categorical and continuous metadata? It seems mechanically simplest if they do - ie, the only effect is that some cells are "missing" from the embedding, but all still display in the other views. Just not sure that is sensible from a UI standpoint. @colinmegill ? |
@bkmartinjr The pattern that comes to mind is the tooltip we currently show on hover of any label. This could include extra text, ie., |
If this was not discoverable enough, and depending on how mission critical this was, we could escalate to adding an icon after the label to indicate that there are cells not shown in graph. |
@colinmegill - I am not sure what you are proposing, as the NaN is the value of the embedding coordinate, not the category label. The "missing" cells (ie, those with no coordinate) could be scattered across any number of categories/labels. I think the simplest thing is to show everything, everywhere, but "hide" those with NaN coordinates in the graph. We already do the same thing in the continuous dimensions, where cells with non-finite values are silently omitted from the histogram. Does this seem reasonable? |
I agree that we should show everything, everywhere, as we do already. I do believe we should surface the lack of presence of cells to the user via tooltip to avoid confusion / hint at unexpected behavior (clicking a checkbox with a large amount of labels and a small amount of cells toggling on and off). In the case of missing continuous data, we grey out the cell, a hint, but a path not available lacking coords. |
Yes, I think that makes the most sense and gives the user the most flexibility for analysis. Just to reiterate the desired functionality, since I am not as familiar with cellxgene: Choosing an embedding with some NaN coordinates should not select or subset data in any way. If the user does want to subset data to only cells with finite values in embedding, they must either know which cell label to select (e.g. checking box for patient 1 and clicking subset) or they can use lasso to select all cells in embedding and clicking subset.
A tooltip might be too subtle, imo. Would it be too cluttered to have a second greyed out number indicating counts of cells with coords? This would only be shown when the embedding has non-finite coords, which would alert the user that the embedding they switched to is missing cells. I think it's debatable whether it's more useful to show number of cells with coords or without. |
I was afraid of that :) There are a number of disadvantages to doubling the number on the label, primarily:
I was thinking of something like this after the label itself, designating that there were missing values, that would provide more context on hover. I believe this should be yellow, for 'warning', and toggleable off to avoid distraction if unwanted. |
I agree with the reasons against adding another number to the labels. The icon is also intuitive and I like it. As long as the user can find the number of cells missing/included, such as by hovering over the icon, I think it is useful. Just to throw out another idea, what if the icon was a pie chart that conveyed the fraction of cells missing? Like a yellow wedge proportional to missing values? That way the user can tell at a glance the severity of the 'warning' for each label. This might be useful in cases where there are many categories like cell types, and it would be time consuming to hover over each label to check how many are missing. |
Interesting idea! We do already use the pie chart to signify 'subset', but we don't signify how much. It's quite an interesting idea to have a dynamically updating chart on that button that updates as the current selection changes, and as the subset changes. This might imply that initially, both buttons are circles, and disabled. As current selections are made, these are highlighted (in a bright color) as a wedge on the pie on the button. Click it, and that becomes the new wedge. If this pattern was established on the subset button, it could be used other places — ie., what fraction of overall cells is a given label, and what fraction of that fraction is selected, and what fraction has coords. This last step is significantly more involved, but there will be user surprise if the only place we're showing a fraction is 'not displayed on graph', given how many other fractions we ask them to keep in their head as they use the tool. |
In a discussion with GFPS (comp bios + app scis), the consensus was that the data should be subset when selecting an embedding that has |
@mattcai - if I understand correctly, you are asking that the user selection of an embedding implicitly perform a subset UI operation if the embedding contains non-finite numbers? I don't have any comments on UI impact (ie, hidden state, good or bad), but this will make the implementation significantly more complex. That said, it is doable. @colinmegill - it is worth considering this in the context of the experimental reembedding feature, which also interacts (implicitly) with subsetting/unsubsetting. In this particular case we also elected to tie an embedding to a subset, as it was very confusing to show it outside of this view for the reasons you discuss above. I actually think Matt's proposal is a generalization of what we did earlier, and would improve our existing model. @mattcai 's proposal leads to a number of questions about the interaction between manual subsetting (select & subset) vs the subsetting implicit in switching embeddings:
This all seems resolvable, but there is some explicit state we need to manage. If you prefer, a flow diagram might make this easier to work through. My suggested approach:
This has some very sharp edges if the user is doing manual subsetting while looking at an embedding that doesn't contain all cells. But honestly, that might also be a useful feature... Thoughts? |
Ok, so I started by trying to make a flowchart that covered all the possible combinations and couldn't think of an approach that wasn't confusing for users. My proposal, like @bkmartinjr's suggested approach, is to treat the "subset embeddings" like the separate dataset from which it was computed. Switching to an embedding with a different subset would be akin to opening a different file, which is one way this analysis is handled already. Background: Every embedding has an implicit subset of cells with finite embedding coordinates. I'll refer to this as the cell-set of the embedding. Multiple embeddings can share the same cell-set. Cell-sets can also overlap with each other, partially or completely. There is at least one embedding whose cell-set contains every cell in the dataset (likely always true, but I don't think is a requirement of this approach). How to handle changing embeddings:
Manual subsetting and undoing subsets:
I agree, this will likely be viewed as useful or at least expected. Possible pitfalls:
What do you think @colinmegill? Does this sound like an idea that can be implemented? |
In working with Kun.Leng@ucsf.edu to remix his neuronal AD dataset for Corpora, we came upon the issue of visualizing subcluster embeddings that only have coordinate values for a subpopulation of cells. The solution we would like is to fill the other cell coordinates with NaNs and cellxgene would plot only the cells with numerical coordinates for each embedding. Currently, cellxgene shows a blank plot if there are any
np.nan
ornp.inf
in the embedding.Previous issue identifying the "bug": #1118
Update: cellxgene will now accept embeddings containing a NaN value. +/- Inf continue to be rejected. This issue is left open as we are still working out how to improve the UX experience for embeddings with NaN values.
The text was updated successfully, but these errors were encountered: