Commit fea51fc

Improved collections docs.
1 parent e68902e commit fea51fc

1 file changed

+66
-27
lines changed

docs/_writing_collections.rst

Lines changed: 66 additions & 27 deletions
@@ -24,6 +24,10 @@ Composite types include for instance the ``list:paired`` collection type -
2424
which represents a list of dataset pairs. In this case, instead of each
2525
dataset having a list identifier, each pair of datasets does.
2626

27+
-------------------------------
28+
Consuming Collections
29+
-------------------------------
30+
2731
Many Galaxy tools can be used in conjunction with collections without
2832
modification. Galaxy users can take a collection and `map over` any tool that
2933
consumes individual datasets. For instance, early in typical bioinformatics
@@ -44,11 +48,14 @@ consuming lists, and consuming arbitrary collections.
4448
are likely doing something wrong. Just process a pair or a single dataset
4549
and allow the user to map over the collection.
4650

47-
Dataset collections are in their infancy - so for tools which process datasets
48-
the recommended best practice is to allow users to either supply paired
49-
collections or two individual datasets. Furthermore, many tools which process
50-
pairs of datasets can also process single datasets. The following
51-
``conditional`` captures this idiom.
51+
Processing Pairs
52+
-------------------------------
53+
54+
Dataset collections are not extensively used by typical Galaxy users yet - so
55+
for tools which process paired datasets the recommended best practice is to
56+
allow users to either supply paired collections or two individual datasets.
57+
Furthermore, many tools which process pairs of datasets can also process
58+
single datasets. The following ``conditional`` captures this idiom.
5259

5360
::
5461

@@ -90,7 +97,6 @@ Some example tools which consume paired datasets include:
9097
- `BWA MEM <https://github.com/galaxyproject/tools-devteam/blob/master/tools/bwa/bwa-mem.xml>`__
9198
- `Tophat <https://github.com/galaxyproject/tools-devteam/blob/master/tools/tophat2/tophat2_wrapper.xml>`__
9299

93-
-------------------------------
94100
Processing Lists (Reductions)
95101
-------------------------------
96102

@@ -133,10 +139,19 @@ the idiom:
133139
--input "${",".join(map(str, $inputs))}"
134140

135141

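The join idiom above is ordinary Python evaluated inside the Cheetah template: each dataset is converted to a string and the results are joined with commas. The following is a minimal sketch outside Galaxy. It assumes that ``str()`` on a dataset yields its file path, and ``FakeDataset`` is a hypothetical stand-in, not Galaxy's real dataset wrapper.

```python
class FakeDataset:
    """Hypothetical stand-in for a Galaxy dataset; str() yields its path."""

    def __init__(self, path):
        self.path = path

    def __str__(self):
        return self.path


# Plays the role of the multiple ``data`` parameter ``$inputs``.
inputs = [FakeDataset("/data/a.bed"), FakeDataset("/data/b.bed")]

# Same expression as in the template, minus the Cheetah ``$`` prefix.
joined = ",".join(map(str, inputs))
print('--input "%s"' % joined)
```

The tool receives a single comma-separated argument regardless of how many datasets the user selected.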
142+
Some example tools which consume multiple datasets (including lists) include:
143+
144+
- `multi_data_param <https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/multi_data_param.xml>`__ (small test tool in Galaxy test suite)
145+
- `cuffmerge <https://github.com/galaxyproject/tools-devteam/blob/master/tool_collections/cufflinks/cuffmerge/cuffmerge_wrapper.xml>`__
146+
- `unionBedGraphs <https://github.com/galaxyproject/tools-iuc/blob/master/tools/bedtools/unionBedGraphs.xml>`__
147+
148+
Also see the tools-devteam repository `Pull Request #20 <https://github.com/galaxyproject/tools-devteam/pull/20>`__ modifying the cufflinks suite of tools for collection compatible reductions.
149+
150+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
136151
Identifiers
137-
-------------------------------
152+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
138153

139-
As mentioned previously sample identifiers are preserved through mapping
154+
As mentioned previously, sample identifiers are preserved through mapping
140155
steps; during reduction steps one will likely want to use these for
141156
reporting, comparisons, etc. When using these multiple ``data`` parameters
142157
the dataset objects expose a field called ``element_identifier``. When these
@@ -155,22 +170,21 @@ derived from using a little fictitious program called ``merge_rows``.
155170
merge_rows --name "${re.sub('[^\w\-_]', '_', $input.element_identifier)}" --file "$input" --to $output;
156171
#end for
157172

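The ``re.sub`` call in the template above can be tried on its own to see which characters are replaced. This is a rough sketch: ``safe_name`` and the sample identifiers are hypothetical names chosen for illustration, and only the regular expression comes from the example.

```python
import re


def safe_name(element_identifier):
    """Replace every character that is not a word character, hyphen,
    or underscore with an underscore."""
    return re.sub(r"[^\w\-_]", "_", element_identifier)


# Hypothetical identifiers a user might have given to list elements.
identifiers = ["sample 1/treated", "ctrl-A.fastq", "ok_name"]
cleaned = [safe_name(i) for i in identifiers]
print(cleaned)
```

Spaces, slashes, dots, and shell metacharacters all collapse to underscores, so two distinct identifiers can map to the same cleaned name. A tool that needs unique names should account for that.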
173+
Some example tools which utilize ``element_identifier`` include:
174+
175+
- `identifier_multiple <https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/identifier_multiple.xml>`_
176+
- `identifier_single <https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/identifier_single.xml>`_
177+
- `vcftools_merge <https://github.com/galaxyproject/tools-devteam/blob/master/tool_collections/vcftools/vcftools_merge/vcftools_merge.xml>`_
178+
- `jbrowse <https://github.com/galaxyproject/tools-iuc/blob/master/tools/jbrowse/jbrowse.xml>`_
179+
180+
.. TODO: https://github.com/galaxyproject/tools-devteam/pull/363/files
158181
159182
.. note:: Here we are rewriting the element identifiers to ensure everything is safe to
160183
put on the command-line. In the future collections will not be able to contain
161184
keys that are potentially harmful and this won't be necessary.
162185

163-
Some example tools which consume collections include:
164-
165-
- `multi_data_param <https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/multi_data_param.xml>`__ (small test tool in Galaxy test suite)
166-
- `cuffmerge <https://github.com/galaxyproject/tools-devteam/blob/master/tool_collections/cufflinks/cuffmerge/cuffmerge_wrapper.xml>`__
167-
- `unionBedGraphs <https://github.com/galaxyproject/tools-iuc/blob/master/tools/bedtools/unionBedGraphs.xml>`__
168-
169-
Also see the tools-devteam repository `Pull Request #20 <https://github.com/galaxyproject/tools-devteam/pull/20>`__ modifying the cufflinks suite of tools for collection compatible reductions.
170-
171-
-------------------------------
172-
Processing Collections
173-
-------------------------------
186+
More on ``data_collection`` parameters
187+
----------------------------------------------
174188

175189
The above three cases (users mapping over single tools, consuming pairs, and
176190
consuming lists using `multiple` ``data`` parameters) are hopefully the most
@@ -218,18 +232,30 @@ Some example tools which consume collections include:
218232

219233

220234
-------------------------------
221-
Collection as an Output
235+
Creating Collections
222236
-------------------------------
223237

224-
Whenever possible simpler operations that produce datasets should be implicitly "mapped over" to produce collections - but there are a variety of situations for which this idiom is insufficient.
238+
Whenever possible simpler operations that produce datasets should be
239+
implicitly "mapped over" to produce collections as described above - but there
240+
are a variety of situations for which this idiom is insufficient.
241+
242+
Progressively more complex syntax elements exist for the increasingly complex
243+
scenarios. Broadly speaking - the three scenarios covered are the tool
244+
produces...
225245

226-
Progressively more complex syntax elements exist for the increasingly complex scenarios. Broadly speaking - the three scenarios covered are the tool produces...
246+
1. a collection with a static number of elements (mostly for ``paired``
247+
collections, but if a tool does say fixed binning it might make sense to create a list this way as well)
248+
2. a ``list`` with the same number of elements as an input list
249+
(this would be a common pattern for normalization applications for
250+
instance).
251+
3. a ``list`` where the number of elements is not knowable until the job is
252+
complete.
227253

228-
- a collection with a static number of elements (mostly for paired, but if a tool does say fixed binning it might make sense to create a list this way as well)
229-
- a list with the same number of elements as an input (common pattern for normalization applications for instance).
230-
- a list where the number of elements is not knowable until the job is complete.
254+
1. Static Element Count
255+
-----------------------------------------------
231256

232-
For the first case - the tool can simply declare standard data elements below an output collection element in the outputs tag of the tool definition.
257+
For this first case - the tool can simply declare standard data elements
258+
below an output collection element in the outputs tag of the tool definition.
233259

234260
::
235261

@@ -239,7 +265,8 @@ For the first case - the tool can simply declare standard data elements below an
239265
</collection>
240266

241267

242-
Templates (e.g. the ``command`` tag) can then reference ``$forward`` and ``$reverse`` or whatever ``name`` the corresponding ``data`` elements are given - as demonstrated in ``test/functional/tools/collection_creates_pair.xml``.
268+
Templates (e.g. the ``command`` tag) can then reference ``$forward`` and ``$reverse`` or whatever ``name`` the corresponding ``data`` elements are given,
269+
as demonstrated in ``test/functional/tools/collection_creates_pair.xml``.
243270

244271
The tool should describe the collection type via the type attribute on the collection element. Data elements can define ``format``, ``format_source``, ``metadata_source``, ``from_work_dir``, and ``name``.
245272

@@ -252,6 +279,9 @@ The above syntax would also work for the corner case of static lists. For paired
252279

253280
In this case the command template could then just reference ``${paired_output.forward}`` and ``${paired_output.reverse}`` as demonstrated in ``test/functional/tools/collection_creates_pair_from_type.xml``.
254281

282+
2. Computable Element Count
283+
-----------------------------------------------
284+
255285
For the second case - where the structure of the output is based on the structure of an input - a ``structured_like`` attribute can be defined on the collection tag.
256286

257287
::
@@ -262,6 +292,9 @@ Templates can then loop over ``input1`` or ``list_output`` when building up comm
262292

263293
``format``, ``format_source``, and ``metadata_source`` can be defined for such collections if the format and metadata are fixed or based on a single input dataset. If instead the format or metadata depends on the formats of the collection it is structured like - ``inherit_format="true"`` and/or ``inherit_metadata="true"`` should be used instead - which will handle corner cases where there are for instance subtle format or metadata differences between the elements of the incoming list.
264294

295+
3. Dynamic Element Count
296+
-----------------------------------------------
297+
265298
The third and most general case is when the number of elements in a list cannot be determined until runtime, for instance when splitting up files by various dynamic criteria.
266299

267300
In this case a collection may define one or more ``discover_datasets`` elements. As an example of one such tool that splits a tabular file out into multiple tabular files based on the first column see ``test/functional/tools/collection_split_on_column.xml`` - which includes the following output definition:
@@ -272,6 +305,12 @@ In this case a collection may define one or more discover_datasets elements. As a
272305
<discover_datasets pattern="__name_and_ext__" directory="outputs" />
273306
</collection>
274307

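To build intuition for how files in the ``outputs`` directory could become list elements, here is a rough Python sketch of pattern-based discovery. The regular expression is an assumed simplification of what ``__name_and_ext__`` stands for (a name, a dot, then an extension), and ``discover`` is a hypothetical helper. Neither is Galaxy's actual implementation.

```python
import os
import re
import tempfile

# Assumed simplification of what ``__name_and_ext__`` stands for; the
# regular expression Galaxy actually uses may differ.
NAME_AND_EXT = re.compile(r"^(?P<name>.*)\.(?P<ext>[^.]+)$")


def discover(directory):
    """Map element identifier -> extension for each matching file."""
    found = {}
    for filename in sorted(os.listdir(directory)):
        match = NAME_AND_EXT.match(filename)
        if match:
            found[match.group("name")] = match.group("ext")
    return found


with tempfile.TemporaryDirectory() as workdir:
    # Simulate a job working directory with an ``outputs`` subdirectory.
    outputs = os.path.join(workdir, "outputs")
    os.mkdir(outputs)
    for filename in ("chr1.tabular", "chr2.tabular", "README"):
        open(os.path.join(outputs, filename), "w").close()
    discovered = discover(outputs)

print(discovered)
```

In this sketch, files without an extension (such as ``README``) do not match and are skipped. The part before the last dot becomes the element identifier and the part after it becomes the format.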
308+
Nested Collections
309+
-----------------------------------------------
310+
311+
Galaxy `Pull Request #538 <https://github.com/galaxyproject/galaxy/pull/538>`__
312+
implemented the ability to define nested output collections. See the pull
313+
request and included example tools for more details.
275314

276315
----------------------
277316
Further Reading
