
Regraph/Reset user story #280

Closed
bkmartinjr opened this issue Sep 27, 2018 · 16 comments
Labels
P4 Priority 4
@bkmartinjr
Contributor

We need to nail down user stories for Regraph/Reset in the first release.

Issues to consider:

  • should there be async / batch compute triggered by front-end (web UI), or should it all be done on the CLI with the cellxgene wrapper script?
  • what are the core use cases for regraph/reset and what workflow do we want for those use cases?
  • how much persistent state do we mandate for re-layout (vs. on-demand compute)?

Scanpy re-layout is fast when NN are already computed, and very slow when not. Should the web UI drive both cases, or assume that some state (e.g., the NN graph) already exists and is persisted?

^ @freeman-lab - please add your thoughts.

@freeman-lab
Contributor

freeman-lab commented Oct 16, 2018

a few thoughts here:

  • the primary use case is: select a subset of cells via the UI, and then recompute and view a low-dimensional embedding layout just for those cells. why would someone do this? here's an intuition: when collapsing into low-dimensional projections we are always losing information. these algorithms strive to represent as much information about relative distances in the high-dimensional data as possible. the less structure there is to capture, the more room you have to work with in 2d. thus, the 2d layout for a smaller subset of the dataset can in principle reveal structure that is obfuscated when laying out the entire thing at once. it's not the same as just zooming in.

  • it is indeed faster to do the layout computations if NN are already computed, but it still requires computation. it is of course impossible to precompute layouts for all possible subselections, and in general the same subselection is unlikely to be reused multiple times, so caching any particular subset does not seem useful.

  • we are going to ensure that NN are precomputed for all data during CLI launch (and compute them via a separate script if necessary)

  • to me, this all argues for a regraph button that triggers an async recomputation of the layout. depending on the subselection size and algorithm, it could be anywhere from a couple seconds to a few minutes.

  • for the sake of terminological consistency, i'd rename this relayout, especially as graph is an overloaded term in this domain :)

  • i don't see this as a super critical feature, so could probably drop for the first release (differential expression is a much more common "on-demand" computation)

  • i do see this as a pattern we could duplicate almost identically for other recomputations that are fast enough to expose in the UI (e.g. recluster)

  • i'm not sure what reset is supposed to do, but a button that returns the zoom and pan and selection to the initial view of the entire dataset seems useful

@colinmegill @csweaver

@bkmartinjr
Contributor Author

I'd like to add additional (future) issues we need to factor into this design:

  • the work with Starfish/SpaceTx is going to introduce an additional "layout" view - in this case, it won't be subject to regraphing, but will be a layout the user wants to choose. There may be additional "views" coming (e.g., @sidneymbell's proposal for trajectories). Some of these views may require parameterization or other user input to perform their computes
  • The raison d'être of cellxgene is interactive exploration of the data. It is a biggish (but doable) step to move into async compute job management (queuing/web push, param management, etc.).
  • the upcoming Scanpy release is going to add Dask support, enabling a wide range of additional compute in a "reasonable" timeframe (maybe not interactive speed, but cup-o-coffee speed)

Putting all this together, I am in favor of pushing this to a post-MVP release, and giving us time to think about the UX more. I think I am an advocate for keeping the f/e super simple:

  • very fast loading & interactive web UX that displays the data
  • CLI-driven batch/async, including most of the parameterization of those computes
  • and zero async support in the front-end.

Random thinking: as an alternative UX, why not have the user "save" named layouts using the CLI (including all precompute), and then just use the front-end to visualize these "named layouts" (the front-end would only know the name & coordinates)? This is easily done for cell subsets that are defined computationally (eg, select by metadata field), but harder for lasso selected sets. We could solve for the latter by implementing the "save selection" feature (which drops a selection list).
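One possible shape for that registry, sketched with plain numpy (all names here are made up; in AnnData terms each saved layout could simply be an obsm entry):

```python
# "Named layouts": the CLI saves (name -> selected cells + 2-D coords);
# the front-end only ever sees the name and the coordinates.
import numpy as np

layouts = {}  # name -> (cell_indices, coordinates)

def save_layout(name, cell_indices, coords):
    """Called from the CLI after the (possibly slow) precompute."""
    coords = np.asarray(coords, dtype=float)
    cell_indices = np.asarray(cell_indices)
    assert coords.shape == (len(cell_indices), 2), "expect one (x, y) per cell"
    layouts[name] = (cell_indices, coords)

def get_layout(name):
    """What the front-end would request: just name -> cells and coordinates."""
    return layouts[name]

rng = np.random.default_rng(0)
save_layout("t_cells", [3, 7, 9], rng.random((3, 2)))
idx, xy = get_layout("t_cells")
print(list(idx), xy.shape)  # [3, 7, 9] (3, 2)
```

A lasso selection would need the "save selection" feature mentioned above to produce the `cell_indices` list.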

@laserson

I should also mention that in addition to Dask support, we've been experimenting with Pywren/serverless, which could give lots of parallelism and also minimal configuration on the user's end. We're also thinking about distributed impl of the kNN computation to make it scalable and hopefully fast.

@freeman-lab
Contributor

Those future directions are super cool @laserson !

Totally on board with pushing this to post-MVP and just dropping the current regraph button.

I still lean towards supporting a few targeted forms of async compute eventually, especially if more speed ups are coming.

One issue with the saving of named layouts is that the combinatorics just blow up. For example, even for metadata (e.g. cluster) based selections, it'll often be lots of different combinations of different clusters. I.e. the exploratory nature of the visualization is exactly what gives rise to a wide variety of subsequent computations that are hard to precompute.

But this should really be driven by user feedback -- what additional computations do people want to do while using the tool in its current simple form? Can address that post MVP.

@freeman-lab
Contributor

also, a clarifying comment / question for @bkmartinjr -- when posing the question as "should there be async / batch compute triggered by front-end (web UI)", we are already doing this when computing differential expression, right? so we've already gone down this road? it's simply a question of speed, and when the expected speed does or does not justify exposing the functionality on the front-end

@bkmartinjr
Contributor Author

@freeman-lab - we treat differential expression as a more-or-less synchronous operation (not literally, but in the sense that we expect a server response quickly enough that the UI does not need to have explicitly async workflow). It boils down to speed - if it is "interactive" (ie, ~<1sec) response, then we don't need to build in explicit async management. If it could take minutes to hours, then we need to provide the user with some signal that compute is in progress, notification when done, etc.

@colinmegill
Contributor

colinmegill commented Oct 16, 2018 via email

@freeman-lab
Contributor

awesome, that all matches my perspective, just wanted to sanity check

@bkmartinjr
Contributor Author

I should add, I'm not opposed to adding async support if we do it right. We just aren't there - for example, the REST API has no concept of async requests (ie, no ability to know when an async request is completed). This isn't just a UI issue.
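To make that concrete, the missing piece is roughly a job-handle pattern like the sketch below (pure stdlib; `submit_job` / `get_status` are hypothetical names, not part of cellxgene's actual REST API):

```python
# Minimal async-job pattern: the client starts a job, gets back an id,
# and polls a status endpoint instead of blocking on the response.
import threading
import uuid

jobs = {}  # job_id -> {"status", "result", "thread"}

def submit_job(compute_fn, *args):
    """Run compute_fn in the background; return an id the client can poll."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "running", "result": None}

    def run():
        try:
            jobs[job_id]["result"] = compute_fn(*args)
            jobs[job_id]["status"] = "done"
        except Exception:
            jobs[job_id]["status"] = "failed"

    thread = threading.Thread(target=run)
    jobs[job_id]["thread"] = thread
    thread.start()
    return job_id

def get_status(job_id):
    """What a GET /jobs/<id> endpoint would return."""
    return jobs[job_id]["status"]

job = submit_job(lambda x: x * 2, 21)
jobs[job]["thread"].join()  # the demo waits here; a web client would poll get_status
print(get_status(job), jobs[job]["result"])  # done 42
```

A real server would additionally need job expiry, cancellation, and a push or polling channel back to the UI, which is exactly the scope question raised above.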

@sidneymbell
Contributor

sidneymbell commented Feb 5, 2019

Regraphing and reclustering were among the "most valuable features" identified in today's user feedback session with the Humphrey's lab.

I agree that we need to be careful about what's feasible at interactive speed vs. async computation.

FWIW, though, I think that re-embedding at reasonable (not quite interactive, I was mistaken) speed is possible with scanpy if the subselection is contiguous. I've detailed this more here, but this basically happens when selections are based on:

  • cluster assignment, and/or
  • rectangular or lasso selection in UMAP space

I believe (worth following up) that this would satisfy at least most user needs. What's your impression on this one, @neuromusic ?

@bkmartinjr
Contributor Author

@sidneymbell - what does subselection is contiguous mean? Do you mean in-memory continuity in the underlying dataframe, or something else (sorry, naive question I know). And what is the primary determinant of re-layout speed?

@sidneymbell
Contributor

@bkmartinjr - Great question! I should have elaborated.

The subselection is contiguous when the neighbor graph connecting all the selected cells doesn't have any breaks in it. I.e., there's a way to "walk" directly between each pair of cells in the subselection (along edges of the neighbor graph).

So, for example, in this notebook I made two different mini datasets representing different ways of subselecting.

The granulocytes are all closely related cells that are close to one another in the (contiguous) neighbor graph. In this case, it doesn't make a big difference whether we recompute the neighbor graph ('clean') or subset the existing neighbor graph from universe ('subset').

As a counterexample, I also made a dataset where I randomly subselected every 5th cell from the entire dataset. Here, the cells are from all over the neighbor graph, and many of the cells between those selected cells got dropped. The result is a ton of tiny neighbor graphs that are disconnected from one another. Here, we see that it makes a big difference whether we recompute the neighbor graph ('clean') or just subset the universal one ('subset').
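The contiguity test described here can be sketched with scipy alone (the 6-cell graph below is made up; in practice the neighbor graph lives on the AnnData object, e.g. `adata.obsp["connectivities"]` in current Scanpy):

```python
# Toy version of the "contiguous subselection" check: a selection is
# contiguous if the induced subgraph of the neighbor graph is connected.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical 6-cell kNN graph: cells 0-1-2-3-4-5 form a single chain.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
rows, cols = zip(*edges)
graph = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(6, 6))

def is_contiguous(graph, selection):
    """True if the selected cells stay connected after subsetting the graph."""
    sub = graph[selection][:, selection]  # induced subgraph on the selection
    n_components, _ = connected_components(sub, directed=False)
    return n_components == 1

print(is_contiguous(graph, [0, 1, 2]))  # True: a connected clump
print(is_contiguous(graph, [0, 2, 4]))  # False: the bridging cells were dropped
```

The second call mirrors the every-5th-cell counterexample: dropping the cells in between shatters the subgraph into disconnected pieces.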

Does that help? Happy to stop by and chat.

@aopisco

aopisco commented Feb 12, 2020

A couple of thoughts here:

  1. The most common use case will be the user wanting to subset a group of cells.
  2. The user selects the cells they're interested in and clicks to subset. That should trigger re-embedding by running sc.tl.umap(adata_subset).
  3. If the user is doing annotations using cellxgene, then the user can select a smaller group within the subset, annotate those, and so on.
  4. At the moment, because there is no re-embedding after subsetting, it's impossible in many applications (e.g., when there are T cells and NK cells) to get a finer look at the data, which limits the potential of the subsetting function.

@ambrosejcarr
Contributor

ambrosejcarr commented Sep 23, 2020

Spyros Darmanis mentioned that his group needs support for trajectory analyses, and that re-embedding + reanalysis of trajectories will be needed to drill into smaller developmental scales. He noted that the trajectory algorithms need significant parameter tuning and that there isn't a clear "default" we could execute automatically.

@signechambers1 signechambers1 added the P4 Priority 4 label Oct 1, 2020
@brianraymor

@ambrosejcarr reports that this issue is addressed. Closing during triage.

@colinmegill
Contributor

Agree — this issue is resolved and implemented. If we return to trajectories, it's a new issue.
