Toolform support for selecting datasets from within collections #740

Closed
hexylena opened this issue Sep 17, 2015 · 41 comments

Comments

@hexylena
Member

My users are finding dataset collections to be not so user-friendly. They generate collections (e.g. sequencing data), then do the map step (assembly), and then are stuck not being able to access the data within collections. They want to do manual analysis of different files within that collection.

Within the "select single/multiple datasets" UI, it would be nice if collections were listed alongside (maybe bold) and then datasets within collections listed below the header and indented. Much like how timezones are used as headers here https://select2.github.io/examples.html

@hexylena
Member Author

@guerler anything I can do to help here? This feature is getting more important as I'm updating more tools to use collections... I don't want my users to be "locked in" to collections and unable to do the analyses they need.

@guerler
Contributor

guerler commented Oct 25, 2015

@erasche I think it's a great idea. Does this require any backend modifications? Ping @jmchilton. If not, we should be able to access the HDA IDs of collection components and then populate the single dataset select field at https://github.com/galaxyproject/galaxy/blob/dev/client/galaxy/scripts/mvc/form/form-select-content.js as you mentioned above. We should be careful regarding scalability, since some collections might contain several thousand components.

@jmchilton
Member

@guerler I'm fairly certain the tool API will run just fine if given an individual HDA ID from within a collection. I can add a test case if you wish.
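As a rough illustration of what "given an individual HDA ID" could look like, here is a minimal sketch of a request body for the tool API. It assumes the usual `{"src": "hda", "id": ...}` form for dataset references; the history ID, tool ID, parameter name, and dataset ID below are placeholders, not values from this thread, and the exact payload should be checked against your Galaxy version.

```python
import json


def build_tool_payload(history_id, tool_id, param_name, hda_id):
    """Build a body for POST /api/tools that points one input at a single HDA.

    The HDA may be a (hidden) element of a collection; the payload only
    carries its ID, so nothing here is collection-specific.
    """
    return {
        "history_id": history_id,
        "tool_id": tool_id,
        "inputs": {param_name: {"src": "hda", "id": hda_id}},
    }


# Placeholder IDs, for illustration only.
payload = build_tool_payload("hist123", "cat1", "input1", "hda456")
print(json.dumps(payload, indent=2))
```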

I would discourage doing this unless we do indeed fetch the collection elements on the fly and not with the initial request. Collections keep potentially huge histories small so the tool form for instance still works fine - it would be a real step backward to break those histories in order to populate the collection contents in this form.

As an aside - the workaround that people use, I think, is to un-hide the individual HDA that corresponds to the collection element they want to run a tool with. In an abstract way I do like that because it is declaring that you are indeed interested in treating this dataset as a stand-alone thing - so it will be present, for instance, when this analysis is extracted from the history into a workflow. Without unhiding that dataset, this is just going to be a dangling input. When we get too loose with how we treat the contents of a collection, there is some conceptual traceability or reproducibility we are losing, in my opinion - having the user declare inputs as inputs is a slight balance to that.

It is not to say we shouldn't do this - it is a high priority thing to me - I'd just place it at number 6 on the "collection" priority list after upload, re-running, naming issues, deletion, and improved state representation handling.

@hexylena
Member Author

I like collections because, as you rightly mention, they keep histories small.

As an aside - the workaround that people use, I think, is to un-hide the individual HDA that corresponds to the collection element they want to run a tool with. In an abstract way I do like that because it is declaring that you are indeed interested in treating this dataset as a stand-alone thing - so it will be present, for instance, when this analysis is extracted from the history into a workflow.

Asking them to unhide things... they'll just ask me why I forced them to use this cumbersome new feature if they're just going to have to unhide things. And I'll end up back where I started, with non-collection-enabled tools because of what my users see as a UX issue. Or I make collection-enabled tools and my users complain because of a) the changes, and b) having to run an "explode collection" tool or unhide datasets and delete the collection.

In an abstract way, yes, I agree, I also like users declaring "I am pulling this out of a collection".

The specific use case I have in mind is the entrez tools. I think everyone benefits from those being a collection output since 95% of people want to treat them as a giant blob.

  • Some want to merge them into one file
  • Some want to batch their analyses over the collection of files and speed up processing, and then merge
  • My boss uses the outputs from that tool to review genomes, as a one-by-one process, running different tools on each genome based on what claims are made in the papers. In that case collections would help him keep his history tidy, but not if he can't run tools on individual items without exploding the collection.

@guerler
Contributor

guerler commented Mar 27, 2017

I agree with @jmchilton. I like the process of unhiding too, although we might want to rename it to something like 'extract' to make it more apparent. On the other hand, I understand @erasche's concerns. It makes working with collections less straightforward. However, just adding all HDAs of all collections to the data selection list will likely lead to severe performance issues, see: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tools/parameters/basic.py#L1778.

@hexylena
Member Author

hexylena commented Mar 27, 2017 via email

@mvdbeek
Member

mvdbeek commented Mar 28, 2017

If perf is a concern, could the select UI behave like history? Select from this list or (click on to) enter a collection and see sub elements.

That would be a great solution. A small indicator that the item is a collection, and then a click on it will expand the collection to show and be able to select individual items.
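To make the "expand on click" idea concrete, here is a minimal sketch of a picker that lists only top-level history items and fetches a collection's elements the first time it is expanded. The class name, the `fetch_elements` callback, and the item dictionaries are all hypothetical; this is not how the Galaxy tool form is implemented, only an illustration of deferring the expensive lookup.

```python
from typing import Callable, Dict, List


class LazyCollectionPicker:
    """Lists history items up front; loads collection elements only on expand."""

    def __init__(self, top_level: List[dict], fetch_elements: Callable[[str], List[dict]]):
        self.top_level = top_level
        self._fetch_elements = fetch_elements
        self._cache: Dict[str, List[dict]] = {}

    def options(self) -> List[dict]:
        # The initial list stays small: collections appear as single entries.
        return list(self.top_level)

    def expand(self, collection_id: str) -> List[dict]:
        # Fetch on first expansion only; later clicks reuse the cached result.
        if collection_id not in self._cache:
            self._cache[collection_id] = self._fetch_elements(collection_id)
        return self._cache[collection_id]


fetch_calls = []


def fake_fetch(collection_id):
    """Stand-in for a server request; records each call so we can count them."""
    fetch_calls.append(collection_id)
    return [{"id": f"{collection_id}-{i}", "name": f"sample_{i}"} for i in range(3)]


picker = LazyCollectionPicker(
    [{"id": "d1", "type": "dataset"}, {"id": "c1", "type": "collection"}],
    fake_fetch,
)
first = picker.expand("c1")
second = picker.expand("c1")
print(len(fetch_calls), [e["name"] for e in first])
```

With this shape, a history holding thousands of hidden collection elements costs nothing until a user actually opens one collection.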

@jmchilton
Member

I agree with @guerler that implementing this in the tool form is tricky - I think it should be done, but the tool form hasn't been set up to do this easily.

We keep talking about dragging and dropping datasets into the tool form - this would probably be easier to implement (right @guerler?) and something we definitely want to do anyway since the history filtering and such on the side is very powerful already.

If people could drag and drop collection elements into this form what percent of the UX concerns would be addressed by that workaround - only 5%, 50%, 85%?

@jmchilton
Member

  • Some want to merge them into one file
  • Some want to batch their analyses over the collection of files and speed up processing, and then merge
  • My boss uses the outputs from that tool to review genomes, as a one-by-one process, running different tools on each genome based on what claims are made in the papers. In that case collections would help him keep his history tidy, but not if he can't run tools on individual items without exploding the collection.

I understand that people want to deal with things in different ways for sure. Might it make sense to have different tools or different workflows? Ones that give users in the first two scenarios collections, and ones that produce individual datasets for the third use case. I think one of the strengths (or maybe just distinctions) of Galaxy's approach to tools, compared to, say, that of CWL, is that we are aiming to produce little individual useful applications almost - not aiming to model a command-line tool and every configuration of its output. So if one tool produces different configurations of outputs, or even if users may want to consume its outputs in different ways - it makes sense IMO to have different Galaxy tools for the same command-line.

@MoHeydarian

MoHeydarian commented Mar 28, 2017

I think a dataset within a collection should be unhidden, or "extracted" as @guerler suggested, prior to running a tool on it. Running the tool on a hidden dataset reduces traceability and transparency, and breaks reproducibility if trying to build a workflow from that history.

The current way to expose a single dataset from a collection is to unhide all datasets in the history, select the dataset to expose, and unhide it. This can be cumbersome and frustrating (to have your browser grind to a halt while the history refreshes) when your history contains thousands of items hidden behind a collection. As @mvdbeek suggested, having a selectable box when one clicks into a collection and being able to "extract" individual datasets would be my preferred way of unhiding/exposing single datasets.

I'm not sure exposing single datasets from a collection is good for reproducibility. When I need to interrogate single datasets from a collection, I use the Collapse Collection tool (appending the file/sample name to each line) to generate a single file representing a whole collection. From here I can use the filter tool to wrangle data from specific samples. These steps can all be done in a workflow.
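The collapse-then-filter strategy described above can be sketched roughly as follows. This is only an illustration of the idea on in-memory text, not the Collapse Collection tool's actual implementation; the function names and the sample data are made up, and it assumes tab-separated text files with the sample name appended as the last column.

```python
def collapse_collection(collection):
    """Merge all text elements into one list of lines, tagging each with its sample name."""
    lines = []
    for sample_name, text in collection.items():
        for line in text.splitlines():
            lines.append(f"{line}\t{sample_name}")
    return lines


def filter_by_sample(lines, sample_name):
    """Keep only the lines whose last tab-separated column matches the sample name."""
    return [line for line in lines if line.rsplit("\t", 1)[-1] == sample_name]


# Hypothetical expression tables keyed by element identifier.
collection = {
    "sampleA": "geneX\t10\ngeneY\t0",
    "sampleB": "geneX\t7",
}
collapsed = collapse_collection(collection)
only_b = filter_by_sample(collapsed, "sampleB")
print(collapsed)
print(only_b)
```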

@hexylena
Member Author

Lots of great points, thanks for having this discussion y'all. I know there are other, higher priority items on collections work, this is just the most visible one to me right now.

@jmchilton @guerler

We keep talking about dragging and dropping datasets into the tool form - this would probably be easier to implement (right @guerler?) and something we definitely want to do anyway since the history filtering and such on the side is very powerful already.

I initially thought "ugh, gross, DnD in the browser + moving my mouse all the way over when I'm vertically, linearly scanning the tool form." After some more consideration, I can see the logic in this, the history filtering is really good, maybe this makes sense to do. (I wonder about accessibility but we have bigger things to attack first on that topic.)

If people could drag and drop collection elements into this form what percent of the UX concerns would be addressed by that workaround - only 5%, 50%, 85%?

For me, that would solve my use cases completely.

I think one of the strengths (or maybe just distinctions) of Galaxy's approach to tools, compared to, say, that of CWL, is that we are aiming to produce little individual useful applications almost - not aiming to model a command-line tool and every configuration of its output.

I am so completely in agreement with this, you have no idea. For non-bioinformatician users, the tools would ideally be useful and abstracted from the underlying implementation. They don't want to learn "Oh, I have to use a tool named Bowtie for mapping reads", I feel that they have that question and just want to see "map reads to genome" as a tool to make their foray into bioinformatics more obvious.

I have experimented with doing this (for my specific case, again), and it works OK, but I fear that it doesn't scale well since I have to do it on a per-tool/per-functionality case. Yes, I don't have to re-write the tool, but I now have two tools in the tool panel and my boss wonders which he should use and doesn't always use the right one.

@MoHeydarian

I think a dataset within a collection should be unhidden, or "extracted" as @guerler suggested, prior to running a tool on it. Running the tool on a hidden dataset reduces traceability and transparency, and breaks reproducibility if trying to build a workflow from that history.

Maybe it is just my reading, but it sounds like you think this should be an explicit action that a user takes, ahead of running the tool? Is that correct? If not: great, agreed, that's fine. (If so: Why does this have to be an explicit, additional step? If this is automatic / implicit, I'm fine with it. If it's explicit, then it's a cumbersome UX issue that requires user training, whereas if it's just "here's a folder (i.e. collection), look in there for datasets" then it's fine.)

I'm not sure exposing single datasets from a collection is good for reproducibility. When I need to interrogate single datasets from a collection, I use the Collapse Collection tool (appending the file/sample name to each line) to generate a single file representing a whole collection. From here I can use the filter tool to wrangle data from specific samples. These steps can all be done in a workflow.

That sounds like a workaround for the underlying issue. We have implicitly launched "convert" tools, surely we should treat the dataset extraction the same: launch a tool that extracts the specified collection element into its own dataset (or however that could happen without creating a duplicate file), as part of the described DnD setup?

Sounds like collapse collection tool only works on text files? (I'm trying to explore, but I don't use main and I seem to be far down the queue.)

@MoHeydarian

@erasche

Maybe it is just my reading, but it sounds like you think this should be an explicit action that a user takes, ahead of running the tool? Is that correct? If not: great, agreed, that's fine. (If so: Why does this have to be an explicit, additional step? If this is automatic / implicit, I'm fine with it. If it's explicit, then it's a cumbersome UX issue that requires user training, whereas if it's just "here's a folder (i.e. collection), look in there for datasets" then it's fine.)

If a dataset within a collection can be chosen on a tool form and upon execution that hidden dataset is exposed/extracted/visible in the history, I think that would be great. I just think that the input should be visible after it has been used to allow traceability.

Yes, the Collapse Collection tool only works on text files (for now and hopefully not too long), so I suppose the strategy I mentioned is kind of a workaround, but in the case of working with single cell *-seq data it works great to operate on lots of expression tables (all text format).

@hexylena
Member Author

I just think that the input should be visible after it has been used to allow traceability.

Sure, this is fine! Glad I was mis-reading that.

@jxtx
Contributor

jxtx commented Mar 28, 2017

I've long thought we need some kind of advanced dataset picker in the tool form.

The default select list for a data parameter would only show datasets in the current history. Keep it small and simple.

Next to the select box would be a button that pops over a dataset picker that lets users browse in a more advanced fashion, including:

  • Digging into collections
  • Selecting datasets from other histories
  • Selecting datasets from data libraries

I also don't think the data needs to be added to the current history in any of these cases. We can provide ways to navigate the provenance graph.

Drag and drop should also happen of course. But doesn't solve important cases like libraries.

@jgoecks
Contributor

jgoecks commented Mar 28, 2017

+1 @jxtx's comments. IMO, there is no need to deal with unhiding/exporting, we just need a more intelligent way to navigate and select datasets within collections and across histories. But:

I also don't think the data needs to be added to the current history in any of these cases. We can provide ways to navigate the provenance graph.

I disagree here. If we don't add datasets used to the current history, we are changing a fundamental aspect of Galaxy: the current history contains all provenance for an analysis. Histories in this case become much less self-contained and more difficult to understand.

@hexylena
Member Author

hexylena commented Aug 8, 2017

Was talking to @bebatut about this issue today, she's using bioblend for this but that isn't a solution for a lot of users.
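For readers curious what a scripted workaround might involve, here is a minimal sketch of pulling one element's HDA ID out of a collection description and turning it into a tool input. The dictionary below mimics the `elements` / `element_identifier` / `object` layout that the Galaxy API returns for a collection as I understand it; the IDs and names are invented, and the BioBlend calls mentioned in the comments are not executed here and should be checked against the BioBlend documentation.

```python
def find_element_hda_id(collection_details, element_identifier):
    """Return the dataset ID of the named element in a flat (list-type) collection."""
    for element in collection_details.get("elements", []):
        if element.get("element_identifier") == element_identifier:
            if element.get("element_type") != "hda":
                # Nested collections would need another level of lookup.
                raise ValueError(f"{element_identifier!r} is not a single dataset")
            return element["object"]["id"]
    raise KeyError(element_identifier)


# In a real script this dict would come from the server, for example via
# BioBlend's gi.histories.show_dataset_collection(history_id, collection_id).
# The values below are invented for illustration.
collection_details = {
    "collection_type": "list",
    "elements": [
        {"element_identifier": "genome_1", "element_type": "hda", "object": {"id": "aaa111"}},
        {"element_identifier": "genome_2", "element_type": "hda", "object": {"id": "bbb222"}},
    ],
}

hda_id = find_element_hda_id(collection_details, "genome_2")

# Shape of the inputs one would then hand to something like
# gi.tools.run_tool(history_id, tool_id, tool_inputs); "input1" is a placeholder
# parameter name.
tool_inputs = {"input1": {"src": "hda", "id": hda_id}}
print(tool_inputs)
```

This also shows why it is not a solution for most users: it requires knowing the element identifier and writing code against the API.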

@mvdbeek
Member

mvdbeek commented Jan 25, 2018

I have been doing RNAseq experiments on a semi-regular basis with collections and subworkflows for at least the last 1.5 years now, and I have to say that with the filter collection tools I think you can do anything that is necessary for RNAseq experiments (while of course it could be improved).

[Screenshot, 2018-01-25 09:46:04]

I also have a variant of this with salmon. So personally what would be highest on my wish-list would be the ability to more tightly integrate the collection filtering tools with the UI, so that you don't have to think in advance about the structure of your collection or upload a text file to be used in the filter collection tool.

@mvdbeek
Member

mvdbeek commented Jan 25, 2018

(I'll try to write up something on how one can figure out a good "structure" for an analysis workflow).
There's also https://usegalaxy.org/u/marius/w/parent-workflow-chipseq that implements a similar pattern for ChIPseq, so I think this way of designing your workflow and inputs should apply to more cases.

@jmchilton
Member

@mvdbeek

So personally what would be highest on my wish-list would be the ability to more tightly integrate the collection filtering tools with the UI, so that you don't have to think in advance about the structure of your collection or upload a text file to be used in the filter collection tool.

This is exciting to hear. Given that I've been building a hammer lately, everything looks like a nail to me, so my initial proposal for this would be merging #5365 and then implementing the third bullet item on #5381 (Apply Collection Builder to Collections). My initial thinking in that issue was that it would be a good way to re-organize collections, but it would be just as good at filtering right away, I think, given what has already been implemented in #5365. I think this is a cool approach, but I'll admit it isn't obvious what I'm trying to say without me having a prototype ready to demonstrate. Want to check it out and let me know if you can imagine it being a good approach? Or, if I'm not clear or you'd like to see something else, can you sketch out a new issue describing what you would like to see and what functionalities it should have?

@mvdbeek
Member

mvdbeek commented Jan 25, 2018

I can absolutely see that from the PR description / screenshot, yes. I was going to ask if we can apply this to existing collections as well, so that's cool!

@lparsons
Contributor

Thanks for the workflow example @mvdbeek. That will work well for simple experiments, but most of the real ones I've come across have additional factors to include in the DESeq2 model (e.g. batch, individual, etc.), which requires the user to select a different grouping of the samples that isn't reflected in the collection organization. Thus the need for this issue.

@mblue9
Contributor

mblue9 commented Jan 31, 2018

Totally agree with @lparsons. I think @mvdbeek your suggestion is good in theory and for some situations. But the user may also need to be a workflow master like yourself, as that workflow looks a bit scary to me. Are those subworkflows you've got in there? I haven't even got a simple one to work fully yet with names! I just tried to use collections from the beginning of a workflow and have still ended up in this mess below, and it is just making me want to cry right now.

[Screenshot, 2018-02-01 9:03:09 am]

@mvdbeek
Member

mvdbeek commented Feb 1, 2018

That will work well for simple experiments, but most of the real ones I've come across have additional factors to include in the DESeq2 model (e.g. batch, individual, etc.)

So my example includes the batch effect, you can see that if you trace the connections for factor 2. So that's treatment and control (factor 1), with a A/B, C/D pairing, where sample-prep for A and B were prepared at the same time. Individual pairing would be possible as well, but you'd need to split up your collection accordingly (that for example is not as straightforward as it could be).
I haven't done time-course analysis yet, (happens to be something I'll do today), so that may actually not be possible, but then that would be a limitation of the DESeq2 wrapper.

I do real analyses here, and the fact that I'm able to do it does of course not mean it is as simple as it could be.

which requires the user to select a different grouping of the samples that isn't reflected in the collection organization.

I touched on this above, but I doubt dragging from a collection will work reliably for a multi-factor analysis with multiple replicates. That is going to be very error-prone, and also not generalizable to a workflow. But yes, that is something to work on.

@mvdbeek
Member

mvdbeek commented Feb 1, 2018

@mblue9 the issue is now that you can't identify what the collection represents? Tagging them is a good start (that should work in 17.09), and then @jmchilton also fixed the rename output operations for collections in the workflow, so that will hopefully be a breeze in 18.01.

@lparsons
Contributor

lparsons commented Feb 1, 2018

@mvdbeek My apologies, I see now. I guess the issue is that you have to create a collection for every combination of factor levels, which isn't too practical for a lot of experiments and makes the workflow almost more trouble than it's worth (esp. when it comes to having to add rename actions, etc.). However, perhaps some of the changes made in 18.01 will help?

I doubt dragging from a collection will work reliably for a multi-factor analysis with multiple replicates

It seems to me that hashtags are great for handling factor levels. If there was a tag for each level, and I could somehow tell the workflow to use things from the collection with a specific tag...

In the meantime, being able to select from within collections would be a manual workaround. I just don't see people setting up a workflow first, then running it for something like this. Instead, people create a single align and count workflow, run it on every dataset in a collection, and then manually run DESeq2, picking the datasets and factors they want. The workflow seems MUCH more complicated and difficult to set up for a one-time use.

@mvdbeek
Member

mvdbeek commented Feb 1, 2018

The workflow was just a graphic way to demonstrate what you need to do. You can also do this without separating the elements up front. So how about another tool that splits collections by tags, would that help? (We've had that request before, I think.)
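As a rough picture of what "splitting a collection by tags" could mean, here is a minimal sketch that groups element identifiers by the tags attached to them. The element dictionaries and tag names are hypothetical and do not reflect an existing Galaxy tool or data model; an element carrying several tags lands in several groups.

```python
from collections import defaultdict


def split_by_tag(elements):
    """Group element identifiers by tag; untagged elements are left out."""
    groups = defaultdict(list)
    for element in elements:
        for tag in element.get("tags", []):
            groups[tag].append(element["element_identifier"])
    return dict(groups)


# Hypothetical samples tagged with a factor level and a batch.
elements = [
    {"element_identifier": "A", "tags": ["treatment", "batch1"]},
    {"element_identifier": "B", "tags": ["control", "batch1"]},
    {"element_identifier": "C", "tags": ["treatment", "batch2"]},
    {"element_identifier": "D", "tags": ["control", "batch2"]},
]

groups = split_by_tag(elements)
print(groups["treatment"], groups["batch1"])
```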

@lparsons
Contributor

lparsons commented Feb 1, 2018

Seems unnecessary to create all these additional collections when one could simply specify which subset of a collection should be used for a specific input. It makes working with collections very cumbersome. Why the reluctance to allow users to treat collections like folders?

@jmchilton
Member

Why the reluctance to allow users to treat collections like folders?

If researchers are pulling stuff out of collections and filling in boxes by hand - there is some metadata they are leveraging to do that - maybe in the name, maybe in a sample sheet. If the researcher knows how to access that metadata - Galaxy should make it possible and easy for the researcher to convey that information to the collection, and should make it intuitive and easy to use that to map that set of files to the tool in an abstract way that is extractable and trackable and reproducible. Missing the modeling of that metadata means an important part of the analysis is not being captured by Galaxy, and the analysis is missing important stuff in terms of reproducibility and accessibility. I understand the nitty gritty is difficult and the user experience of collections is rough in many ways currently - but these are the lofty goals.

If you are detecting reluctance to treat collections as folders - it is because they weren't meant to be used that way, it skirts the problem I was hoping collections would solve, and I ultimately think people will be unhappy if they use collections this way - even if we make it super slick. Collections are terribly rough in so many ways - but I'd rather be working on solving the problems they were meant to solve than building a folder structure into histories. This may be a mistake - it may be that capturing that metadata is too hard, or that building a UI for bridging that metadata from the researcher to Galaxy and then from Galaxy to the tool form and job structure is too hard, but the reluctance comes down to that being the goal. That is what I, at least, am trying to do.

I hope that is understandable - I also hope you understand the reluctance is not an unwillingness. No one has ever rejected an enhancement in that direction and I even opened this PR for you.

@mblue9
Contributor

mblue9 commented Feb 2, 2018

@mblue9 the issue is now that you can't identify what the collection represents? Tagging them is a good start (that should work in 17.09)

Yes that would be the current issue, I've now no idea what's inside each collection thanks to those cryptic "x on y" names.

How are you tagging? I just tried this workaround and added a tag to each collection of fastqs that I have (12 collections), but then when I went to run the workflow just now, I ended up with a history for each collection! So 12 histories!! Is that expected? I would have much preferred just one! I'm working with multiple types of data at the moment and for multiple users, so 12 histories for just one dataset is way too much imo. Do they have to split on the tag?

@mvdbeek
Member

mvdbeek commented Feb 2, 2018 via email

@mblue9
Contributor

mblue9 commented Feb 2, 2018

Yes, I did Send to a New History as I already had a history full of those x on y. And I had just realised that was the cause.

So looks like Sending to a New History is a big NoNo if you have tags on your collection, if you don't want them in a separate history for each collection. Sending to a New History worked differently without the tags so yes, this to me is unexpected non-obvious behaviour.

@mvdbeek
Member

mvdbeek commented Feb 2, 2018

So looks like Sending to a New History is a big NoNo if you have tags on your collection

Been frustrated by this as well, but that has been this way when you select multiple inputs to a workflow. Clearly that's not ideal, but this is independent of tags

@mblue9
Contributor

mblue9 commented Feb 2, 2018

Ah ok, yes I think I had not been Sending to New History when not using tags.
I don't want to speak too soon...but... this looks like it might work! Or at least be a big improvement on what I had. This is what I've done that's looking promising:

  • split samples into collections e.g. per group
  • add a tag (sample label) to each collection at the beginning (e.g. fastq stage)
  • run workflow (but DO NOT Send to New History unless you actually want a history per collection)

@mvdbeek
Member

mvdbeek commented Feb 2, 2018

That's one way to do it, yes!
I'm now checking if we can also use nested collections just until the point where individual collections are needed, e.g.
[Screenshot, 2018-02-02 09:45:25]
That would probably be easier to understand when you look at the history, instead of having parallel collections

@mvdbeek
Member

mvdbeek commented Mar 20, 2018

Alright, it is now possible to drop datasets from collections into the tool form with #5657 being merged.

@hexylena
Member Author

This is finally resolved with #7553! 🎉🎉🎉🎉
