
Regraph/Reset user story #280

Closed
bkmartinjr opened this issue Sep 27, 2018 · 16 comments
Labels
P4 Priority 4
@bkmartinjr
Contributor

We need to nail down user stories for Regraph/Reset in the first release.

Issues to consider:

  • should there be async / batch compute triggered by front-end (web UI), or should it all be done on the CLI with the cellxgene wrapper script?
  • what are the core use cases for regraph/reset and what workflow do we want for those use cases?
  • how much persistent state do we mandate for re-layout (vs. on-demand compute)?

Scanpy re-layout is fast when NN are already computed, and very slow when not. Should the web UI drive both cases, or assume that some state (e.g., the NN graph) already exists and is persisted?

^ @freeman-lab - please add your thoughts.

@freeman-lab
Contributor

freeman-lab commented Oct 16, 2018

a few thoughts here:

  • the primary use case is: select a subset of cells via the UI, and then recompute and view a low-dimensional embedding layout just for those cells. why would someone do this? here's an intuition: when collapsing into low-dimensional projections we are always losing information. these algorithms strive to represent as much information about relative distances in the high-dimensional data as possible. the less structure there is to capture, the more room you have to work with in 2d. thus, the 2d layout for a smaller subset of the dataset can in principle reveal structure that is obfuscated when laying out the entire thing at once. it's not the same as just zooming in.

  • it is indeed faster to do the layout computations if NN are already computed, but it still requires computation. it is of course impossible to precompute layouts for all possible subselections, and in general the same subselection is unlikely to be reused multiple times, so caching any particular subset does not seem useful.

  • we are going to ensure that NN are precomputed for all data during CLI launch (and compute them via a separate script if necessary)

  • to me, this all argues for a regraph button that triggers an async recomputation of the layout. depending on the subselection size and algorithm, it could be anywhere from a couple seconds to a few minutes.

  • for the sake of terminological consistency, i'd rename this relayout, especially as graph is an overloaded term in this domain :)

  • i don't see this as a super critical feature, so could probably drop for the first release (differential expression is a much more common "on-demand" computation)

  • i do see this as a pattern we could duplicate almost identically for other recomputations that are fast enough to expose in the UI (e.g. recluster)

  • i'm not sure what reset is supposed to do, but a button that returns the zoom and pan and selection to the initial view of the entire dataset seems useful

@colinmegill @csweaver

@bkmartinjr
Contributor Author

I'd like to add additional (future) issues we need to factor into this design:

  • the work with Starfish/SpaceTx is going to introduce an additional "layout" view - in this case, it won't be subject to regraphing, but will be a layout the user wants to choose. There may be additional "views" coming (e.g., @sidneymbell's proposal for trajectories). Some of these views may require parameterization or other user input to perform their computes
  • The raison d'être of cellxgene is interactive exploration of the data. It is a biggish (but doable) step to move into async compute job management (queuing/web push, param management, etc.).
  • the upcoming Scanpy release is going to add Dask support, enabling a wide range of additional compute in a "reasonable" timeframe (maybe not interactive speed, but cup-o-coffee speed)

Putting all this together, I am in favor of pushing this to a post-MVP release, and giving us time to think about the UX more. I think I am an advocate for keeping the f/e super simple:

  • very fast loading & interactive web UX that displays the data
  • CLI-driven batch/async, including most of the parameterization of those computes
  • and zero async support in the front-end.

Random thinking: as an alternative UX, why not have the user "save" named layouts using the CLI (including all precompute), and then just use the front-end to visualize these "named layouts" (the front-end would only know the name & coordinates)? This is easily done for cell subsets that are defined computationally (eg, select by metadata field), but harder for lasso selected sets. We could solve for the latter by implementing the "save selection" feature (which drops a selection list).
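One possible shape for that registry, sketched with plain numpy (all names here are made up; in AnnData terms each saved layout could simply be an obsm entry):

```python
# "Named layouts": the CLI saves (name -> selected cells + 2-D coords);
# the front-end only ever sees the name and the coordinates.
import numpy as np

layouts = {}  # name -> (cell_indices, coordinates)

def save_layout(name, cell_indices, coords):
    """Called from the CLI after the (possibly slow) precompute."""
    coords = np.asarray(coords, dtype=float)
    cell_indices = np.asarray(cell_indices)
    assert coords.shape == (len(cell_indices), 2), "expect one (x, y) per cell"
    layouts[name] = (cell_indices, coords)

def get_layout(name):
    """What the front-end would request: just name -> cells and coordinates."""
    return layouts[name]

rng = np.random.default_rng(0)
save_layout("t_cells", [3, 7, 9], rng.random((3, 2)))
idx, xy = get_layout("t_cells")
print(list(idx), xy.shape)  # [3, 7, 9] (3, 2)
```

A lasso selection would need the "save selection" feature mentioned above to produce the `cell_indices` list.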

@laserson

I should also mention that in addition to Dask support, we've been experimenting with Pywren/serverless, which could give lots of parallelism and also minimal configuration on the user's end. We're also thinking about distributed impl of the kNN computation to make it scalable and hopefully fast.

@freeman-lab
Contributor

Those future directions are super cool @laserson !

Totally on board with pushing this to post-MVP and just dropping the current regraph button.

I still lean towards supporting a few targeted forms of async compute eventually, especially if more speed ups are coming.

One issue with the saving of named layouts is that the combinatorics just blow up. For example, even for metadata (e.g. cluster) based selections, it'll often be lots of different combinations of different clusters. I.e. the exploratory nature of the visualization is exactly what gives rise to a wide variety of subsequent computations that are hard to precompute.

But this should really be driven by user feedback -- what additional computations do people want to do while using the tool in its current simple form? Can address that post MVP.

@freeman-lab
Contributor

also, a clarifying comment / question for @bkmartinjr -- when posing the question as "should there be async / batch compute triggered by front-end (web UI)", we are already doing this when computing differential expression, right? so we've already gone down this road? it's simply a question of speed, and when the expected speed does or does not justify exposing the functionality on the front-end

@bkmartinjr
Contributor Author

@freeman-lab - we treat differential expression as a more-or-less synchronous operation (not literally, but in the sense that we expect a server response quickly enough that the UI does not need to have explicitly async workflow). It boils down to speed - if it is "interactive" (ie, ~<1sec) response, then we don't need to build in explicit async management. If it could take minutes to hours, then we need to provide the user with some signal that compute is in progress, notification when done, etc.

@colinmegill
Contributor

colinmegill commented Oct 16, 2018 via email

@freeman-lab
Contributor

awesome, that all matches my perspective, just wanted to sanity check

@bkmartinjr
Contributor Author

I should add, I'm not opposed to adding async support if we do it right. We just aren't there - for example, the REST API has no concept of async requests (ie, no ability to know when an async request is completed). This isn't just a UI issue.
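To make that concrete, the missing piece is roughly a job-handle pattern like the sketch below (pure stdlib; `submit_job` / `get_status` are hypothetical names, not part of cellxgene's actual REST API):

```python
# Minimal async-job pattern: the client starts a job, gets back an id,
# and polls a status endpoint instead of blocking on the response.
import threading
import uuid

jobs = {}  # job_id -> {"status", "result", "thread"}

def submit_job(compute_fn, *args):
    """Run compute_fn in the background; return an id the client can poll."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "running", "result": None}

    def run():
        try:
            jobs[job_id]["result"] = compute_fn(*args)
            jobs[job_id]["status"] = "done"
        except Exception:
            jobs[job_id]["status"] = "failed"

    thread = threading.Thread(target=run)
    jobs[job_id]["thread"] = thread
    thread.start()
    return job_id

def get_status(job_id):
    """What a GET /jobs/<id> endpoint would return."""
    return jobs[job_id]["status"]

job = submit_job(lambda x: x * 2, 21)
jobs[job]["thread"].join()  # the demo waits here; a web client would poll get_status
print(get_status(job), jobs[job]["result"])  # done 42
```

A real server would additionally need job expiry, cancellation, and a push or polling channel back to the UI, which is exactly the scope question raised above.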

@sidneymbell
Contributor

sidneymbell commented Feb 5, 2019

Regraphing and reclustering were among the "most valuable features" identified in today's user feedback session with the Humphrey's lab.

I agree that we need to be careful about what's feasible at interactive speed vs. async computation.

FWIW, though, I think that re-embedding at reasonable (not quite interactive, I was mistaken) speed is possible with scanpy if the subselection is contiguous. I've detailed this more here, but this basically happens when selections are based on:

  • cluster assignment, and/or
  • rectangular or lasso selection in UMAP space

I believe (worth following up) that this would satisfy at least most user needs. What's your impression on this one, @neuromusic ?

@bkmartinjr
Contributor Author

@sidneymbell - what does subselection is contiguous mean? Do you mean in-memory continuity in the underlying dataframe, or something else (sorry, naive question I know). And what is the primary determinant of re-layout speed?

@sidneymbell
Contributor

@bkmartinjr - Great question! I should have elaborated.

The subselection is contiguous when the neighbor graph connecting all the selected cells doesn't have any breaks in it. I.e., there's a way to "walk" directly between each pair of cells in the subselection (along edges of the neighbor graph).

So, for example, in this notebook I made two different mini datasets representing different ways of subselecting.

The granulocytes are all closely related cells that are close to one another in the (contiguous) neighbor graph. In this case, it doesn't make a big difference whether we recompute the neighbor graph ('clean') or subset the existing neighbor graph from universe ('subset').

As a counterexample, I also made a dataset where I randomly subselected every 5th cell from the entire dataset. Here, the cells are from all over the neighbor graph, and many of the cells between those selected cells got dropped. The result is a ton of tiny neighbor graphs that are disconnected from one another. Here, we see that it makes a big difference whether we recompute the neighbor graph ('clean') or just subset the universal one ('subset').
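The contiguity test described here can be sketched with scipy alone (the 6-cell graph below is made up; in practice the neighbor graph lives on the AnnData object, e.g. `adata.obsp["connectivities"]` in current Scanpy):

```python
# Toy version of the "contiguous subselection" check: a selection is
# contiguous if the induced subgraph of the neighbor graph is connected.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical 6-cell kNN graph: cells 0-1-2-3-4-5 form a single chain.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
rows, cols = zip(*edges)
graph = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(6, 6))

def is_contiguous(graph, selection):
    """True if the selected cells stay connected after subsetting the graph."""
    sub = graph[selection][:, selection]  # induced subgraph on the selection
    n_components, _ = connected_components(sub, directed=False)
    return n_components == 1

print(is_contiguous(graph, [0, 1, 2]))  # True: a connected clump
print(is_contiguous(graph, [0, 2, 4]))  # False: the bridging cells were dropped
```

The second call mirrors the every-5th-cell counterexample: dropping the cells in between shatters the subgraph into disconnected pieces.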

Does that help? Happy to stop by and chat.

@aopisco

aopisco commented Feb 12, 2020

A couple of thoughts here:

  1. The most common use case will be the user wanting to subset a group of cells.
  2. The user selects the cells they're interested in and clicks to subset. That should trigger re-embedding by running sc.tl.umap(adata_subset).
  3. If the user is doing annotations using cellxgene, then the user can select a smaller group within the subset, annotate those, and so on.
  4. At the moment, because there is no re-embedding after subsetting, it's impossible in many applications (e.g., when there are T cells and NK cells) to get a finer look at the data, which limits the potential of the subsetting function.

@ambrosejcarr
Contributor

ambrosejcarr commented Sep 23, 2020

Spyros Darmanis mentioned that his group needs support for trajectory analyses, and that re-embedding + reanalysis of trajectories will be needed to drill into smaller developmental scales. He noted that the trajectory algorithms need significant parameter tuning and that there isn't a clear "default" we could execute automatically.

@signechambers1 signechambers1 added the P4 Priority 4 label Oct 1, 2020
@brianraymor

@ambrosejcarr reports that this issue is addressed. Closing during triage.

@colinmegill
Contributor

Agree — this issue is resolved and implemented. If we return to trajectories, it's a new issue.
