New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BamNative datatype #5180

Merged
merged 28 commits into from Jan 19, 2018

Conversation

Projects
None yet
6 participants
@bgruening
Member

bgruening commented Dec 10, 2017

  • add a new BamNative datatype
    • BamNative will not sort or index the Bam file
  • Bam now subclasses from BamNative
  • new test tool
@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 10, 2017

Can we make this queryname sorted instead/additionally ? If we can't assume reads are queryname-sorted (i.e as produced by aligners) many tools (duplicate marking, mapq correctors, split read extactors come to mind) would still need to either implement a check to see if the output is readname sorted or worse, sort them regardless of datatype.

@bgruening

This comment has been minimized.

Member

bgruening commented Dec 10, 2017

Mh, this datatype was meant to not touch the BAM file, take it as it is produced and do not sort at all (and not index), hence the name.

If tools define bam as input the converter comes into play and convert bam_native files to bam, right?

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 10, 2017

I was assuming the point of this datatype is to avoid unnecessary conversions ?!
I.e paired-end fastq -> bwa-mem (which does coordinate sorting at the end) -> samblaster (which does queryname sorting for the input and coordinate sorting for the output). Instead we could directly go paired-end fastq -> bwa-mem -> samblaster and then, if necessary, coordinate sorting.

@bgruening

This comment has been minimized.

Member

bgruening commented Dec 10, 2017

@mvdbeek it should address the issue in: https://github.com/galaxyproject/tools-iuc/pull/1590/files

--reorder will take the order as it is in the input files, so no guarantee that it is name sorted afaik.
But given this PR we probably can add a third type or add a metadata element that indicates the sorting.

@galaxybot galaxybot added the triage label Dec 10, 2017

@galaxybot galaxybot added this to the 18.01 milestone Dec 10, 2017

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 10, 2017

--reorder will take the order as it is in the input files, so no guarantee that it is name sorted afaik.

Right, which is the default for most aligners, I believe, but you're right in that this isn't strictly 'SO:queryname'. Maybe some additional metadata is the way to go that indicates that at least pairs of reads are adjacent ? Because that would be good enough for some important tools, but with the BamNative datatype as it is now this could be totally random. So basically only tools that absolutely do not care at all about sort order can consume this datatype, but I'm afraid people could start treating it as a queryname sorted file (because it could be queryname sorted).

@bgruening

This comment has been minimized.

Member

bgruening commented Dec 18, 2017

@mvdbeek I just tried the following.

       <actions>
            <action name="sort_order" type="metadata" default="name_sorted_bag" />
       </actions>

With some hacks we could use this and skip/adjust the set_meta function in a way that the index is not created/used. We will probably miss some metadata, but this would enable us to use just one BAM file.
Disadvantage, is that we could still not upload name-sorted BAM files afaik.

ping also @blankenberg to gather more comments

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 20, 2017

With some hacks we could use this and skip/adjust the set_meta function in a way that the index is not created/used. We will probably miss some metadata, but this would enable us to use just one BAM file.

Yeah, that does look like a good idea. If you know that a tool places reads with the same queryname next to each other (which isn't necessarily what SO:queryname implies) you could use the action to set the metadata accordingly. So in a first instance tool wrappers could handle this logic in the command section, and eventually we could have a syntax to request this format from the framework.

Maybe something like

<param argument="--input" type="data" format="bam,sam" metadata_requirement="adjacent_queryname"/>

?

@bgruening

This comment has been minimized.

Member

bgruening commented Dec 21, 2017

@mvdbeek what about the upload problem? Such "unordered" files could not be uploaded afaik.

Your example would then be already possible by using something like this I think:

<param argument="--input" type="data" format="bam,sam" >
     <options options_filter_attribute="metadata.sorted_by" >
          <filter type="add_value" value="adjacent_queryname" />
     </options>
     <validator type="expression" message="">value is not None and value.metadata.sorted == "adjacent_queryname"</validator>
</param>
@blankenberg

This comment has been minimized.

Member

blankenberg commented Jan 2, 2018

I like this approach. The default action is to still sniff then sort and index bam files when uploaded by a user, but they can manually upload an un/differently-sorted bam.

@bgruening

This comment has been minimized.

Member

bgruening commented Jan 2, 2018

@blankenberg do you prefer this more explicit way or the alternative of having one BAM format with metadata indicating the sorting. The latter can probably not uploaded.

@bgruening

This comment has been minimized.

Member

bgruening commented Jan 2, 2018

@blankenberg or to be more concrete, do you like this PR or the way I sketched out the metadata approach?

@blankenberg

This comment has been minimized.

Member

blankenberg commented Jan 2, 2018

I like this PR, and is what I would have done myself to support unsorted bam. I think the best thing we can probably do at this point is to have the 2+ datatypes. So we can upload un/diff-sorted bam for tools that will need them this way and also don't need to worry about the many existing tools that expect 'bam' to be sorted and indexed and any potential (silent) errors that may be caused.

The use of the metadata field values for limiting inputs can be useful and powerful, but I think they can be a bit clunky to use when developing tools especially in cases when things can be represented by the hierarchical structure of datatypes:

bam_native --> [ bam_sorted --> ] sortedBamA(e.g. current BAM),sortedBamB,sortedBamC

The hierarchy approach can fall apart a bit when the key/important attribute is greater than one in number. But if sort_order is really the primary defining metadata value, it makes sense to use multiple datatypes.

@bgruening

This comment has been minimized.

Member

bgruening commented Jan 2, 2018

Ok, thats fine. Than this PR is ready for review. @mvdbeek any objections? Should I add a third datatype as well in this PR?

@@ -210,75 +207,19 @@ class Bam(Binary):
@staticmethod
def merge(split_files, output_file):
"""
Merges BAM files
Merges Bam files

This comment has been minimized.

@mvdbeek

mvdbeek Jan 3, 2018

Member

Can we leave this at BAM files ? xref #5037 (comment)

This comment has been minimized.

@bgruening
# fail metadata to end in the error state
pass
def sniff(self, filename):

This comment has been minimized.

@mvdbeek

mvdbeek Jan 3, 2018

Member

I'd move that into the base class, and use return super(Bam, self).sniff(filename) and self.dataset_content_needs_grooming(filename) (and maybe change filename to file_name for consistency ?)

This comment has been minimized.

@bgruening

bgruening Jan 3, 2018

Member

Make sense, but it should be return super(Bam, self).sniff(filename) and __not__ self.dataset_content_needs_grooming(filename) right?

]]>
</command>
<inputs>
<param name="input1" type="data" format="bam" label="Concatenate Dataset"/>

This comment has been minimized.

@mvdbeek

mvdbeek Jan 3, 2018

Member

Wouldn't this test use the sam_to_bam converter ?
Maybe change the input format to bam_native ?
If we add this to test/functional/tools/samples_tool_conf.xml it should be run during the functional tests.

This comment has been minimized.

@bgruening

bgruening Jan 3, 2018

Member

I fixed this test, it should test the conversion from SAM to BAM (unsorted).
Thanks @mvdbeek for the review.

@@ -327,7 +257,11 @@ def to_archive(self, trans, dataset, name=""):
def get_chunk(self, trans, dataset, offset=0, ck_size=None):
index_file = dataset.metadata.bam_index
with pysam.AlignmentFile(dataset.file_name, "rb", index_filename=index_file.file_name) as bamfile:
if index_file:

This comment has been minimized.

@mvdbeek

mvdbeek Jan 3, 2018

Member

We don't actually need the index at all for get_chunk, so we should probably drop lines 259-263.

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Jan 13, 2018

With bgruening#39 the tests should pass, but we'll need to rebase after the merge. I think we can and should do this without the additional sorting for now.

Otherwise we'd probably have to refine the caching of the results, because the first time we ask for conversions we store the order. Something like functools.lru_cache instead of caching the result in self._converters_by_datatype[ext] would work, but I'm not convinced we need the added complexity.

@bgruening

This comment has been minimized.

Member

bgruening commented Jan 14, 2018

@mvdbeek I think I found the difference, try this:

        <converter file="sam_to_bam.xml" target_datatype="bam"/>
        <converter file="sam_to_bam_native.xml" target_datatype="bam_native"/>
        <converter file="sam_to_bigwig_converter.xml" target_datatype="bigwig"/>

With your PR this will not work. Can you confirm?

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Jan 14, 2018

Right, but why would you do this ?
The datatypes in general are sensitive to the order in which they are defined / listed in the datatypes_conf.xml, so at least that is consistent.

@bgruening

This comment has been minimized.

Member

bgruening commented Jan 14, 2018

The datatypes in general are sensitive to the order in which they are defined / listed in the datatypes_conf.xml, so at least that is consistent.

afaik only the sniffers are sensible to the order not the datatypes. I think it is odd if such a functionality is depended on the order if it does not need to. But its your call. Happy to merge it.

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Jan 15, 2018

I'd be more comfortable not changing the way the converters are picked right now. I agree that if you ask for an extension (bam_native in this case) you should get that extension and not the extension of a subclass. I'd prefer to work on this in a separate PR if that doesn't bother you too much.

bgruening added some commits Jan 15, 2018

Merge pull request #39 from mvdbeek/bam_unsorted
Fix tests and revert additional sorting of converters
@bgruening

This comment has been minimized.

Member

bgruening commented Jan 15, 2018

Sure, if you prefer to keep it separate. I have merged your PR and resolved the conflict hopefully.

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Jan 15, 2018

Hmm, those failing tests are passing when I test them locally. I wonder if this is because Jenkins runs an older version of samtools ?

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Jan 15, 2018

Arg, I was thinking we could get around the samtools version problem by using pysam.view, but the implementation is only partial and doesn't support writing to a file. So what now ? Upgrading samtools in the testing container would be my next suggestion ...

@nsoranzo

This comment has been minimized.

Member

nsoranzo commented Jan 17, 2018

Maybe the failing framework tests can be fixed by auto-installing the samtools requirements with conda_auto_install = True option in config/galaxy.ini?

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Jan 17, 2018

We should be eating our own dogfood, right ? :D

@martenson

This comment has been minimized.

Member

martenson commented Jan 17, 2018

I wonder how much slower it will make the tests.

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Jan 17, 2018

@@ -851,6 +856,8 @@ def setup(self, config_object=None):
datatypes_conf=datatypes_conf_override,
prefer_template_database=getattr(config_object, "prefer_template_database", False),
log_format=log_format,
conda_auto_init=self.conda_auto_init,
conda_auto_install=self.conda_auto_install,

This comment has been minimized.

@nsoranzo

nsoranzo Jan 17, 2018

Member

Should these have value getattr(config_object, 'conda_auto_init', False) and getattr(config_object, 'conda_auto_install', False) respectively?
If yes, then probably the new lines 799-800 are not needed.

This comment has been minimized.

@mvdbeek

mvdbeek Jan 17, 2018

Member

Yes, that should work!

@bgruening

This comment has been minimized.

Member

bgruening commented Jan 19, 2018

Cool, thanks @mvdbeek seems to work!

@mvdbeek

mvdbeek approved these changes Jan 19, 2018 edited

Looks good to me! Keep in mind that BamNative makes no guarantees to sort oder at all.

We should (in another PR) implement some variants of this for the cases where reads are in the order that has been spit out by the aligner (BamReadsAdjacent ?) and where reads are ordered by queryname (BamQueryNameSorted ?) and by tag (BamTagSored ?).

@mvdbeek mvdbeek merged commit 8f65cd4 into galaxyproject:dev Jan 19, 2018

6 checks passed

api test Build finished. 343 tests run, 5 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 171 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 67 tests run, 0 skipped, 0 failed.
Details
selenium test Build finished. 118 tests run, 3 skipped, 0 failed.
Details
toolshed test Build finished. 577 tests run, 0 skipped, 0 failed.
Details
@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Jan 19, 2018

Thanks a lot @bgruening !

@bgruening bgruening deleted the bgruening:bam_unsorted branch Jan 19, 2018

@bgruening

This comment has been minimized.

Member

bgruening commented Jan 19, 2018

Thanks to you @mvdbeek!
Do you want me to get this ordering of datatypes in or not?

@martenson

This comment has been minimized.

Member

martenson commented Jan 19, 2018

We should (in another PR) implement some variants of this for the cases where reads are in the order that has been spit out by the aligner (BamReadsAdjacent ?) and where reads are ordered by queryname (BamQueryNameSorted ?) and by tag (BamTagSored ?).

This warranties an issue I think

mvdbeek added a commit to mvdbeek/galaxy that referenced this pull request Feb 7, 2018

Fix recursion bug in BamNative datatype
We introduced this in
galaxyproject#5180 where we moved a lot
of code form the Bam class into the BamNative class.

Reported by @natefoo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment