galaxyproject / galaxy Public
Samtools to pysam #5037
|
It looks like there are some problems with running pysam.index() within the doctests on linux. The tests pass fine on OSX, but I can reproduce the failure with a docker image. Constructing a simple doctest that runs only |
|
Doctests are terrible anyway for the most part IMO - I'd just convert these to unit tests if you think that would help. Let me know if you'd like me to work on that - I'd be happy to, this is important work. Edit: upon re-reading your comments I'm realizing I probably misinterpreted them as being about doctest-specific string handling. Oops. |
That was one thing I did think about after seeing pysam-developers/pysam#245 (comment), so maybe that was shining through somehow :D. I have been playing around with the pysam sources, and I can make the failing test pass by changing the option parsing, but I haven't understood at all what I'm doing. |
10ed942
to
afbb7f3
|
Alright, with the switch to unit tests the error has gone away! |
|
Meh, looks like the gff_to_tabix, bed_to_tabix and interval_to_tabix converters have never worked, since they require 2 input datasets, which will cause the converters to fail with |
|
I have this in my git stash that may be helpful for debugging converters: https://gist.github.com/nsoranzo/3618bbc0699ed43aa3e58a065d38e981 |
|
@mvdbeek These have been used in trackster for a long time, no? After swapping the one I did, it seemed to work in trackster, for me, at least. I think the second dataset is just a converted step, and the first one is kept as an input for tracking purposes? |
|
so it's bed->bgzip -> tabix ? That's kinda wasteful given that creating a tabix |
|
it's actually interval -> bigwig -> tabix, and then fails when accessing locations in the converted files with: |
|
and the tabix format we are producing isn't actually a tabix file, it's a tabix index. This is a bit messy :/ |
|
So it looks like |
|
I think pysam-developers/pysam#586 should fix this issue. |
|
@mvdbeek Hrmm. This is going to mean we have to wait for another release :/ |
|
Or we copy the index to a temporary location with a .TBI extension, that seems to work |
|
Yeah, good call, I'd prefer that over having to delay again. |
|
Just noticed that the coordinates that trackster requests through the API are completely off, so it's hard for me to verify our changes are working :/. |
…ts in the object store
Until pysam-developers/pysam#586 is merged and a new release is out, we create a symlink to the tbi file, which is required for creating TabixFile instances. Since we want to clean up the symlinks, I turned `get_data_file` into a contextmanager. Along the way I also changed many open()/close() calls to `with` statements.
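A minimal sketch of the symlink-plus-cleanup idea described in this commit message, not the code from the PR itself. The helper name `tbi_symlink` and the temporary-directory layout are assumptions made for illustration; only the standard library is used.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def tbi_symlink(index_path):
    """Yield a temporary path ending in '.tbi' that points at index_path.

    Hypothetical helper: the link lives in its own temporary directory,
    which is removed again when the ``with`` block exits.
    """
    tmp_dir = tempfile.mkdtemp()
    link = os.path.join(tmp_dir, os.path.basename(index_path) + ".tbi")
    try:
        os.symlink(os.path.abspath(index_path), link)
        yield link
    finally:
        # rmtree unlinks the symlink itself; the index file is untouched.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Inside the `with` block the yielded path could be passed as the `index=` argument when opening the bgzipped file with pysam, as in the diff quoted in this thread. Wrapping it in a context manager keeps the cleanup next to the creation, so callers cannot forget it.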
| return pysam.Tabixfile(self.dependencies['bgzip'].file_name, | ||
| index=self.converted_dataset.file_name) | ||
| # We create a symlnk to the index file. This is | ||
| # required until https://github.com/pysam-developers/pysam/pull/586 is merged. |
Alright, this is already merged :).
|
Mehh, I just didn't migrate properly when I was developing #4690 |
|
Putting this back in review, I think this is OK now. There are still some things that aren't great, like converters that work only when triggered in trackster, but I think that's for another PR. |
This renames some variables to make it clearer what files they reflect. Also adds a very basic test that this works as intended.
|
Huge |
lib/galaxy/datatypes/binary.py
Outdated
| @staticmethod | ||
| def merge(split_files, output_file): | ||
| """ | ||
| Merges Bam files |
s/Bam/BAM/ here and below.
lib/galaxy/datatypes/binary.py
Outdated
| with open(os.devnull, 'w') as devnull: | ||
| subprocess.check_call(cmd, stderr=devnull, shell=False) | ||
| needs_sorting = False | ||
| except Exception: |
s/Exception/subprocess.CalledProcessError/
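To illustrate why the narrower exception is suggested, here is a small sketch assuming a hypothetical helper `runs_cleanly` (not a function from Galaxy). It mirrors the `check_call` pattern in the quoted diff but only treats a non-zero exit status as the expected failure.

```python
import os
import subprocess
import sys


def runs_cleanly(cmd):
    """Return True if cmd exits with status 0, False if it exits non-zero."""
    try:
        with open(os.devnull, "w") as devnull:
            subprocess.check_call(cmd, stderr=devnull, shell=False)
        return True
    except subprocess.CalledProcessError:
        # Only a non-zero exit status lands here. A missing executable
        # (OSError) or a programming error still propagates to the caller.
        return False


# Usage with the current interpreter standing in for the real command:
ok = runs_cleanly([sys.executable, "-c", "pass"])
failed = runs_cleanly([sys.executable, "-c", "import sys; sys.exit(3)"])
```

With a bare `except Exception`, a typo in the command list or a missing binary would be silently interpreted as "the check failed", which is usually not what is intended.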
lib/galaxy/datatypes/binary.py
Outdated
| raise Exception("Error Grooming BAM file contents: %s" % stderr) | ||
| else: | ||
| print(stderr) | ||
| sorted_file_name = "%s.bam" % tmp_sorted_dataset_file_name_prefix # samtools accepts a prefix, not a filename, it always adds .bam to the prefix |
The comment here is outdated, I think.
lib/galaxy/datatypes/tabular.py
Outdated
| @@ -15,6 +15,9 @@ | |||
| from cgi import escape | |||
| from json import dumps | |||
|
|
|||
| import pysam | |||
| import pysam.bcftools | |||
This import seems unused.
No, that's needed because this isn't imported by default:
In [1]: import pysam
In [2]: pysam.bcftools
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-cb47258bfbc9> in <module>()
----> 1 pysam.bcftools
AttributeError: 'module' object has no attribute 'bcftools'
In [3]: import pysam.bcftools
In [4]: pysam.bcftools
Out[4]: <module 'pysam.bcftools' from '/Users/mvandenb/.venv/lib/python2.7/site-packages/pysam/bcftools.pyc'>
I mean that pysam.bcftools is used in binary.py , but not in this file.
oh you're right, I initially used this to replace bcftools concat but that didn't work!
No worries, thanks for fixing it! There are 3 small review comments left you may have missed, then it should be ready to merge.
Forgot to add them. Thanks again.
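The submodule behaviour shown in the IPython session in this thread is general Python import semantics rather than something specific to pysam. A rough sketch, using a throwaway package written to a temporary directory (the names `demo_pkg_5037` and `tools` are made up for the demo) so that it runs without pysam installed:

```python
import importlib
import os
import sys
import tempfile

# Build a tiny package on disk: demo_pkg_5037/__init__.py and tools.py.
pkg_root = tempfile.mkdtemp()
pkg_dir = os.path.join(pkg_root, "demo_pkg_5037")
os.mkdir(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()
with open(os.path.join(pkg_dir, "tools.py"), "w") as fh:
    fh.write("NAME = 'tools'\n")
sys.path.insert(0, pkg_root)
importlib.invalidate_caches()

# Importing the package alone does not import its submodules, because
# the (empty) __init__.py never imports them.
pkg = importlib.import_module("demo_pkg_5037")
before = hasattr(pkg, "tools")

# Importing the submodule binds it as an attribute on the parent package.
importlib.import_module("demo_pkg_5037.tools")
after = hasattr(pkg, "tools")
```

This also explains the review comment: once any module in the process has imported the submodule, the attribute exists everywhere, so a file that does not use it itself does not need its own import.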
|
|
lib/galaxy/datatypes/tabular.py
Outdated
| def sniff(self, filename): | ||
| if not is_gzip(filename): | ||
| return False | ||
| return BaseVcf.sniff(self, filename) |
BaseVcf.sniff(self, filename) -> super(VcfGz, self).sniff(filename)
lib/galaxy/datatypes/tabular.py
Outdated
| def sniff(self, filename): | ||
| if is_gzip(filename): | ||
| return False | ||
| return BaseVcf.sniff(self, filename) |
BaseVcf.sniff(self, filename) -> super(Vcf, self).sniff(filename)
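The two `super()` suggestions above can be sketched as follows. The classes below are simplified stand-ins that only borrow the names from the quoted diff; the `is_gzip` and `BaseVcf.sniff` bodies are assumptions for illustration, not Galaxy's implementations.

```python
import gzip


def is_gzip(filename):
    """Return True if the file starts with the gzip magic bytes."""
    with open(filename, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"


class BaseVcf(object):
    def sniff(self, filename):
        # Simplified check: the first line must be a VCF fileformat header.
        opener = gzip.open if is_gzip(filename) else open
        with opener(filename, "rb") as fh:
            return fh.readline().startswith(b"##fileformat=VCF")


class Vcf(BaseVcf):
    def sniff(self, filename):
        if is_gzip(filename):
            return False
        # Delegates via the MRO instead of naming the base class directly.
        return super(Vcf, self).sniff(filename)


class VcfGz(BaseVcf):
    def sniff(self, filename):
        if not is_gzip(filename):
            return False
        return super(VcfGz, self).sniff(filename)
```

Using `super()` keeps the subclasses correct if the class hierarchy changes later, for example if a mixin is inserted between `BaseVcf` and its subclasses.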
| def open_data_file(self): | ||
| return pysam.Tabixfile(self.dependencies['bgzip'].file_name, | ||
| index=self.converted_dataset.file_name) | ||
| # We create a symlnk to the index file. This is |
s/symlnk/symlink/
test/unit/datatypes/util.py
Outdated
|
|
||
|
|
||
| @contextmanager | ||
| def get_dataset(file, index_attr='bam_index', dataset_id=1, has_data=True): |
`file` shadows a Python 2 builtin, can you use a different variable name? Also in get_input_files().
|
Fantastic, thanks @mvdbeek! |
|
Splendid! Thanks @mvdbeek !!! |
This PR changes all uses of samtools within the datatypes (except for the DataProviders) to pysam, which is a galaxy dependency and therefore doesn't need to be satisfied by conda. This allows us to drop the samtools requirement from the upload and set_metadata tools.
samtools is still required for trackster, which I believe makes use of DataProviders.
bcftools is still required because:
In principle pysam could also generate `.csi` indexes for Bcf files, but it seems that this isn't exposed in the pysam wrapper.
`pysam.bcftools.concat` outright crashes and exits the python interpreter.
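As a rough illustration of what replacing a samtools subprocess call with pysam can look like (a sketch under assumptions, not the code in this PR): `pysam.sort` and `pysam.index` take the same arguments as the corresponding samtools subcommands. The helper names `sort_args` and `sort_and_index` are hypothetical, and pysam is imported lazily so that the argument-building part can be run without it.

```python
def sort_args(input_bam, output_bam, by_name=False):
    """Build the argument list for a samtools-style sort call."""
    args = ["-o", output_bam, input_bam]
    if by_name:
        # -n sorts by read name instead of by coordinate.
        args.insert(0, "-n")
    return args


def sort_and_index(input_bam, output_bam):
    """Coordinate-sort a BAM file and index it in-process via pysam."""
    import pysam  # lazy import: only needed when this function is called

    pysam.sort(*sort_args(input_bam, output_bam))
    pysam.index(output_bam)
    # By default the index is written next to the BAM file with a .bai suffix.
    return output_bam + ".bai"
```

Because the calls run in-process, no samtools binary has to be resolved on the PATH, which is the point the description makes about dropping the requirement from the upload and set_metadata tools.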