Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Samtools to pysam #5037

Merged
merged 22 commits into from Dec 10, 2017
Merged

Samtools to pysam #5037

merged 22 commits into from Dec 10, 2017

Conversation

@mvdbeek
Copy link
Member

@mvdbeek mvdbeek commented Nov 19, 2017

This PR changes all uses of samtools within the datatypes (except for the DataProviders) to pysam, which is a galaxy dependency and therefore doesn't need to be satisfied by conda. This allows us to drop the samtools requirement from the upload and set_metadata tools.

samtools is still required for trackster, which I believe makes use of DataProviders. In principle pysam could also generate .csi indexes for Bcf files, but it seems that this isn't exposed in the pysam wrapper. bcftools is still required because pysam.bcftools.concat outright crashes and exits the python interpreter.

@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Nov 19, 2017

It looks like there are some problems with running pysam.index() within the doctests on linux. The tests pass fine on OSX, but I can reproduce the failure with a docker image. Constructing a simple doctest that runs only pysam.index() works fine, so I am starting to think this is something more complex, perhaps with things that we import. Sometimes I see index: invalid option -- '8' or index: invalid option -- '[' or similar, so perhaps something is going wrong when converting python strings to c strings?

Loading

@jmchilton
Copy link
Member

@jmchilton jmchilton commented Nov 20, 2017

doctests are terrible anyway for the most part IMO - I'd just convert these to unit tests if you think that would help. Let me know if you'd like me to work on that - I'd be happy to this is important work.

Edit: upon re-reading your comments I'm realizing I probably misinterpreted them as being about doctest specific string handling. Opps. 😅

Loading

@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Nov 20, 2017

Edit: upon re-reading your comments I'm realizing I probably misinterpreted them as being about doctest specific string handling. Opps

That was one thing I did think about after seeing pysam-developers/pysam#245 (comment), so maybe that was shining through somehow :D.

I have been playing around with the pysam sources, and I can make the failing test pass by changing the option parsing, but I haven't understood at all what I'm doing.
I'll make the tests unit tests, that's better anyway.

Loading

@mvdbeek mvdbeek force-pushed the samtools_to_pysam branch 3 times, most recently from 10ed942 to afbb7f3 Dec 3, 2017
@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Dec 4, 2017

Alright, with the switch to unit tests the error has gone away!

Loading

@mvdbeek mvdbeek changed the title WIP: Samtools to pysam Samtools to pysam Dec 5, 2017
@mvdbeek mvdbeek mentioned this pull request Dec 5, 2017
@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Dec 6, 2017

Meh, looks like the gff_to_tabix, bed_to_tabix and interval_to_tabix converters have never worked, since they require 2 input datasets, which will cause the converters to fail with

Specify a dataset of the required format / build.

Loading

@nsoranzo
Copy link
Member

@nsoranzo nsoranzo commented Dec 6, 2017

I have this in my git stash that may be helpful for debugging converters: https://gist.github.com/nsoranzo/3618bbc0699ed43aa3e58a065d38e981

Loading

@dannon
Copy link
Member

@dannon dannon commented Dec 6, 2017

@mvdbeek These have been used in trackster for a long time, no? After swapping the one I did, it seemed to work in trackster, for me, at least. I think the second dataset is just a converted step, and the first one is kept as an input for for tracking purposes?

Loading

@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Dec 6, 2017

so it's bed->bgzip -> tabix ? That's kinda wasteful given that creating a tabix file index (seems to) always create a compressed variant of the input. In any case they don't work in the UI, which isn't good either. I'll have to take a closer look.

Loading

@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Dec 6, 2017

it' actually interval -> bigwig -> tabix, and then fails when accessing locations in the converted files with:

galaxy.webapps.galaxy.api.datasets ERROR 2017-12-06 14:31:56,424 Error in dataset API at listing contents: could not open index for `/Users/mvandenb/src/galaxy/database/files/000/dataset_510.dat`: could not open index for `/Users/mvandenb/src/galaxy/database/files/000/dataset_510.dat`
Traceback (most recent call last):
  File "/Users/mvandenb/src/galaxy/lib/galaxy/webapps/galaxy/api/datasets.py", line 73, in show
    rval = self._data(trans, dataset, **kwd)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/webapps/galaxy/api/datasets.py", line 242, in _data
    ref_seq=region, mean_depth=mean_depth, **kwargs)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/visualization/data_providers/genome.py", line 189, in get_data
    data_file = self.open_data_file()
  File "/Users/mvandenb/src/galaxy/lib/galaxy/visualization/data_providers/genome.py", line 327, in open_data_file
    index=self.converted_dataset.file_name)
  File "pysam/libctabix.pyx", line 343, in pysam.libctabix.TabixFile.__cinit__
  File "pysam/libctabix.pyx", line 391, in pysam.libctabix.TabixFile._open
IOError: could not open index for `/Users/mvandenb/src/galaxy/database/files/000/dataset_510.dat`

Loading

@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Dec 6, 2017

and the tabix format we are producing isn't acutally a tabix file, it's a tabix index. This is a bit messy :/

Loading

@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Dec 6, 2017

So it looks like pysam.TabixFile accepts the index parameters but the index extension still needs to end in .tbi:

In [33]: pysam.TabixFile(filename='/Users/mvandenb/src/galaxy/database/files/000/dataset_527.dat', index='/tmp/index')
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-33-8b1d4a08ebcd> in <module>()
----> 1 pysam.TabixFile(filename='/Users/mvandenb/src/galaxy/database/files/000/dataset_527.dat', index='/tmp/index')

pysam/libctabix.pyx in pysam.libctabix.TabixFile.__cinit__()

pysam/libctabix.pyx in pysam.libctabix.TabixFile._open()

IOError: could not open index for `/Users/mvandenb/src/galaxy/database/files/000/dataset_527.dat`

In [35]: !cp /Users/mvandenb/src/galaxy/database/files/000/dataset_527.dat.tbi /tmp/index.tbi

In [36]: pysam.TabixFile(filename='/Users/mvandenb/src/galaxy/database/files/000/dataset_527.dat', index='/tmp/index.tbi')
Out[36]: <pysam.libctabix.TabixFile at 0x10f6bc290>

Loading

@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Dec 6, 2017

I think pysam-developers/pysam#586 should fix this issue.

Loading

@dannon
Copy link
Member

@dannon dannon commented Dec 7, 2017

@mvdbeek Hrmm. This is going to mean we have to wait for another release :/

Loading

@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Dec 7, 2017

Or we copy the index to a temporary location with a .TBI extension, that seems to work

Loading

@dannon
Copy link
Member

@dannon dannon commented Dec 7, 2017

Yeah, good call, I'd prefer that over having to delay again.

Loading

@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Dec 8, 2017

Just noticed that the coodinates that trackster requests through the API are completely off, so it's hard for me to verify our changes are working :/.
Try going to a specific location that you know has a feature, this just doesn't work, if you look at the API requests you can see that the coordinates are wrong.

Loading

mvdbeek added 6 commits Dec 8, 2017
Before pysam-developers/pysam#586 is merged and a new release is out
we create a symlink to the tbi file, which is required for creating TabixFile
instances. Since we want to cleanup the symlinks I turned `get_data_file` into a contextmanager.
Along the way I also changed many open()/close() calls to `with` statements.
return pysam.Tabixfile(self.dependencies['bgzip'].file_name,
index=self.converted_dataset.file_name)
# We create a symlnk to the index file. This is
# required until https://github.com/pysam-developers/pysam/pull/586 is merged.
Copy link
Member Author

@mvdbeek mvdbeek Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, this is already merged :).

Loading

@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Dec 8, 2017

This does fail on dev (and presumably other galaxy versions) too, fwiw.

Mehh, I just didn't migrate properly when I was developing #4690

Loading

@mvdbeek
Copy link
Member Author

@mvdbeek mvdbeek commented Dec 8, 2017

Putting this back in review, I think this is OK now. There are still some things that aren't great, like converters that work only when triggered in trackster, but I think that's for another PR.

Loading

This renames some variables to make it clearer what files they reflect.
Also adds a very basic test that this works as intended.
@mvdbeek mvdbeek force-pushed the samtools_to_pysam branch from b53fcfb to 46371d1 Dec 8, 2017
@jmchilton
Copy link
Member

@jmchilton jmchilton commented Dec 8, 2017

Huge 👍 from me - thanks so much @mvdbeek.

Loading

@staticmethod
def merge(split_files, output_file):
"""
Merges Bam files
Copy link
Member

@nsoranzo nsoranzo Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Bam/BAM/ here and below.

Loading

with open(os.devnull, 'w') as devnull:
subprocess.check_call(cmd, stderr=devnull, shell=False)
needs_sorting = False
except Exception:
Copy link
Member

@nsoranzo nsoranzo Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Exception/subprocess.CalledProcessError/

Loading

raise Exception("Error Grooming BAM file contents: %s" % stderr)
else:
print(stderr)
sorted_file_name = "%s.bam" % tmp_sorted_dataset_file_name_prefix # samtools accepts a prefix, not a filename, it always adds .bam to the prefix
Copy link
Member

@nsoranzo nsoranzo Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment here is outdated, I think.

Loading

@@ -15,6 +15,9 @@
from cgi import escape
from json import dumps

import pysam
import pysam.bcftools
Copy link
Member

@nsoranzo nsoranzo Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import seems unused.

Loading

Copy link
Member Author

@mvdbeek mvdbeek Dec 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's needed because this isn't imported by default:

In [1]: import pysam

In [2]: pysam.bcftools
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-cb47258bfbc9> in <module>()
----> 1 pysam.bcftools

AttributeError: 'module' object has no attribute 'bcftools'

In [3]: import pysam.bcftools

In [4]: pysam.bcftools
Out[4]: <module 'pysam.bcftools' from '/Users/mvandenb/.venv/lib/python2.7/site-packages/pysam/bcftools.pyc'>

Loading

Copy link
Member

@nsoranzo nsoranzo Dec 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that pysam.bcftools is used in binary.py , but not in this file.

Loading

Copy link
Member Author

@mvdbeek mvdbeek Dec 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh you're right, I initially used this to replace bcftools concat but that didn't work!

Loading

Copy link
Member

@nsoranzo nsoranzo Dec 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, thanks for fixing it! There are 3 small review comments left you may have missed, then it should be ready to merge.

Loading

Copy link
Member Author

@mvdbeek mvdbeek Dec 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to add them. Thanks again.

Loading

@nsoranzo
Copy link
Member

@nsoranzo nsoranzo commented Dec 8, 2017

lib/galaxy/datatypes/test/1.unsorted.bam seems to be unused now that _is_coordinate_sorted() has been removed.

Loading

def sniff(self, filename):
if not is_gzip(filename):
return False
return BaseVcf.sniff(self, filename)
Copy link
Member

@nsoranzo nsoranzo Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BaseVcf.sniff(self, filename) -> super(VcfGz, self).sniff(filename)

Loading

def sniff(self, filename):
if is_gzip(filename):
return False
return BaseVcf.sniff(self, filename)
Copy link
Member

@nsoranzo nsoranzo Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BaseVcf.sniff(self, filename) -> super(Vcf, self).sniff(filename)

Loading

def open_data_file(self):
return pysam.Tabixfile(self.dependencies['bgzip'].file_name,
index=self.converted_dataset.file_name)
# We create a symlnk to the index file. This is
Copy link
Member

@nsoranzo nsoranzo Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/symlnk/symlink/

Loading



@contextmanager
def get_dataset(file, index_attr='bam_index', dataset_id=1, has_data=True):
Copy link
Member

@nsoranzo nsoranzo Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file is a Python reserved word, can you use a different variable name? Also in get_input_files().

Loading

@mvdbeek mvdbeek force-pushed the samtools_to_pysam branch from bf1327d to 2805ee0 Dec 10, 2017
@mvdbeek mvdbeek force-pushed the samtools_to_pysam branch from 2805ee0 to 309b717 Dec 10, 2017
@nsoranzo nsoranzo merged commit e198dd7 into galaxyproject:dev Dec 10, 2017
6 checks passed
Loading
@nsoranzo
Copy link
Member

@nsoranzo nsoranzo commented Dec 10, 2017

Fantastic, thanks @mvdbeek!

Loading

@martenson
Copy link
Member

@martenson martenson commented Dec 11, 2017

Splendid! Thanks @mvdbeek !!! 💯

Loading

@mvdbeek mvdbeek mentioned this pull request Jan 3, 2018
@mvdbeek mvdbeek deleted the samtools_to_pysam branch Jun 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants