New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Samtools to pysam #5037

Merged
merged 22 commits into from Dec 10, 2017

Conversation

Projects
None yet
5 participants
@mvdbeek
Member

mvdbeek commented Nov 19, 2017

This PR changes all uses of samtools within the datatypes (except for the DataProviders) to pysam, which is a galaxy dependency and therefore doesn't need to be satisfied by conda. This allows us to drop the samtools requirement from the upload and set_metadata tools.

samtools is still required for trackster, which I believe makes use of DataProviders. In principle pysam could also generate .csi indexes for Bcf files, but it seems that this isn't exposed in the pysam wrapper. bcftools is still required because pysam.bcftools.concat outright crashes and exits the python interpreter.

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Nov 19, 2017

It looks like there are some problems with running pysam.index() within the doctests on linux. The tests pass fine on OSX, but I can reproduce the failure with a docker image. Constructing a simple doctest that runs only pysam.index() works fine, so I am starting to think this is something more complex, perhaps with things that we import. Sometimes I see index: invalid option -- '8' or index: invalid option -- '[' or similar, so perhaps something is going wrong when converting python strings to c strings?

@jmchilton

This comment has been minimized.

Member

jmchilton commented Nov 20, 2017

doctests are terrible anyway for the most part IMO - I'd just convert these to unit tests if you think that would help. Let me know if you'd like me to work on that - I'd be happy to this is important work.

Edit: upon re-reading your comments I'm realizing I probably misinterpreted them as being about doctest specific string handling. Opps. 😅

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Nov 20, 2017

Edit: upon re-reading your comments I'm realizing I probably misinterpreted them as being about doctest specific string handling. Opps

That was one thing I did think about after seeing pysam-developers/pysam#245 (comment), so maybe that was shining through somehow :D.

I have been playing around with the pysam sources, and I can make the failing test pass by changing the option parsing, but I haven't understood at all what I'm doing.
I'll make the tests unit tests, that's better anyway.

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 4, 2017

Alright, with the switch to unit tests the error has gone away!

@mvdbeek mvdbeek changed the title from WIP: Samtools to pysam to Samtools to pysam Dec 5, 2017

@mvdbeek mvdbeek added status/review and removed status/WIP labels Dec 5, 2017

@mvdbeek mvdbeek referenced this pull request Dec 5, 2017

Closed

Tabix indexing fixes. #5131

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 6, 2017

Meh, looks like the gff_to_tabix, bed_to_tabix and interval_to_tabix converters have never worked, since they require 2 input datasets, which will cause the converters to fail with

Specify a dataset of the required format / build.
@nsoranzo

This comment has been minimized.

Member

nsoranzo commented Dec 6, 2017

I have this in my git stash that may be helpful for debugging converters: https://gist.github.com/nsoranzo/3618bbc0699ed43aa3e58a065d38e981

@dannon

This comment has been minimized.

Member

dannon commented Dec 6, 2017

@mvdbeek These have been used in trackster for a long time, no? After swapping the one I did, it seemed to work in trackster, for me, at least. I think the second dataset is just a converted step, and the first one is kept as an input for for tracking purposes?

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 6, 2017

so it's bed->bgzip -> tabix ? That's kinda wasteful given that creating a tabix file index (seems to) always create a compressed variant of the input. In any case they don't work in the UI, which isn't good either. I'll have to take a closer look.

@mvdbeek mvdbeek added status/WIP and removed status/review labels Dec 6, 2017

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 6, 2017

it' actually interval -> bigwig -> tabix, and then fails when accessing locations in the converted files with:

galaxy.webapps.galaxy.api.datasets ERROR 2017-12-06 14:31:56,424 Error in dataset API at listing contents: could not open index for `/Users/mvandenb/src/galaxy/database/files/000/dataset_510.dat`: could not open index for `/Users/mvandenb/src/galaxy/database/files/000/dataset_510.dat`
Traceback (most recent call last):
  File "/Users/mvandenb/src/galaxy/lib/galaxy/webapps/galaxy/api/datasets.py", line 73, in show
    rval = self._data(trans, dataset, **kwd)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/webapps/galaxy/api/datasets.py", line 242, in _data
    ref_seq=region, mean_depth=mean_depth, **kwargs)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/visualization/data_providers/genome.py", line 189, in get_data
    data_file = self.open_data_file()
  File "/Users/mvandenb/src/galaxy/lib/galaxy/visualization/data_providers/genome.py", line 327, in open_data_file
    index=self.converted_dataset.file_name)
  File "pysam/libctabix.pyx", line 343, in pysam.libctabix.TabixFile.__cinit__
  File "pysam/libctabix.pyx", line 391, in pysam.libctabix.TabixFile._open
IOError: could not open index for `/Users/mvandenb/src/galaxy/database/files/000/dataset_510.dat`
@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 6, 2017

and the tabix format we are producing isn't acutally a tabix file, it's a tabix index. This is a bit messy :/

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 6, 2017

So it looks like pysam.TabixFile accepts the index parameters but the index extension still needs to end in .tbi:

In [33]: pysam.TabixFile(filename='/Users/mvandenb/src/galaxy/database/files/000/dataset_527.dat', index='/tmp/index')
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-33-8b1d4a08ebcd> in <module>()
----> 1 pysam.TabixFile(filename='/Users/mvandenb/src/galaxy/database/files/000/dataset_527.dat', index='/tmp/index')

pysam/libctabix.pyx in pysam.libctabix.TabixFile.__cinit__()

pysam/libctabix.pyx in pysam.libctabix.TabixFile._open()

IOError: could not open index for `/Users/mvandenb/src/galaxy/database/files/000/dataset_527.dat`

In [35]: !cp /Users/mvandenb/src/galaxy/database/files/000/dataset_527.dat.tbi /tmp/index.tbi

In [36]: pysam.TabixFile(filename='/Users/mvandenb/src/galaxy/database/files/000/dataset_527.dat', index='/tmp/index.tbi')
Out[36]: <pysam.libctabix.TabixFile at 0x10f6bc290>
@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 6, 2017

I think pysam-developers/pysam#586 should fix this issue.

@dannon

This comment has been minimized.

Member

dannon commented Dec 7, 2017

@mvdbeek Hrmm. This is going to mean we have to wait for another release :/

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 7, 2017

Or we copy the index to a temporary location with a .TBI extension, that seems to work

@dannon

This comment has been minimized.

Member

dannon commented Dec 7, 2017

Yeah, good call, I'd prefer that over having to delay again.

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 8, 2017

Just noticed that the coodinates that trackster requests through the API are completely off, so it's hard for me to verify our changes are working :/.
Try going to a specific location that you know has a feature, this just doesn't work, if you look at the API requests you can see that the coordinates are wrong.

mvdbeek added some commits Nov 13, 2017

Use pysam for metadata setting and grooming
There are also some noteworthy changes here:
  - We do always respect the sort-order specified in the header
  - If the sort order is not mentioned in the header or no header
    exists we coordinate-sort the file.
  - We do not use indexing to determine if a file is coordinate sorted,
    because this does not work reliably with samtools/pysam > 1.X,
    since arbitrarily sorted files can be indexed now.

This also fixes advanced metadata setting (sort_order, bam_version and more),
which appears to have been broken. This probably went by unnoticed because of the catch-all
try-except-pass.
The downside to fixing this is that I had to (temporarilly, hopefully)
comment out the reference_names, reference_lengths, bam_header and readgroups attributes,
because they led to the following error:

```
galaxy.model.metadata DEBUG 2017-11-19 14:47:30,540 loading metadata from file for: HistoryDatasetAssociation 582
galaxy.jobs.runners.local ERROR 2017-11-19 14:47:30,636 Job wrapper finish method failed
Traceback (most recent call last):
  File "/Users/mvandenb/src/galaxy/lib/galaxy/jobs/runners/local.py", line 130, in queue_job
    job_wrapper.finish(stdout, stderr, exit_code)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/jobs/__init__.py", line 1357, in finish
    self.sa_session.flush()
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/orm/scoping.py", line 157, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2019, in flush
    self._flush(objects)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2137, in _flush
    transaction.rollback(_capture_exception=True)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2101, in _flush
    flush_context.execute()
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute
    rec.execute(self)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute
    uow
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 170, in save_obj
    mapper, table, update)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 706, in _emit_update_statements
    execute(statement, multiparams)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
    return meth(self, multiparams, params)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
    context)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
    exc_info
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 202, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
    context)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
    cursor.execute(statement, parameters)
OperationalError: (psycopg2.OperationalError) index row size 7208 exceeds maximum 2712 for index "ix_history_dataset_association_metadata"
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
 [SQL: 'UPDATE history_dataset_association SET update_time=%(update_time)s, blurb=%(blurb)s, peek=%(peek)s, metadata=%(_metadata)s WHERE history_dataset_association.id = %(history_dataset_association_id)s'] [parameters: {'_metadata': <psycopg2.extensions.Binary object at 0x119ed8990>, 'update_time': datetime.datetime(2017, 11, 19, 13, 47, 30, 598058), 'history_dataset_association_id': 582, 'blurb': '3.5 KB', 'peek': 'Binary bam alignments file'}]
```

mvdbeek added some commits Dec 5, 2017

Symlink tbi index to work around pysam limitation
Before pysam-developers/pysam#586 is merged and a new release is out
we create a symlink to the tbi file, which is required for creating TabixFile
instances. Since we want to cleanup the symlinks I turned `get_data_file` into a contextmanager.
Along the way I also changed many open()/close() calls to `with` statements.
return pysam.Tabixfile(self.dependencies['bgzip'].file_name,
index=self.converted_dataset.file_name)
# We create a symlnk to the index file. This is
# required until https://github.com/pysam-developers/pysam/pull/586 is merged.

This comment has been minimized.

@mvdbeek

mvdbeek Dec 8, 2017

Member

Alright, this is already merged :).

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 8, 2017

This does fail on dev (and presumably other galaxy versions) too, fwiw.

Mehh, I just didn't migrate properly when I was developing #4690

@mvdbeek mvdbeek added status/review and removed status/WIP labels Dec 8, 2017

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Dec 8, 2017

Putting this back in review, I think this is OK now. There are still some things that aren't great, like converters that work only when triggered in trackster, but I think that's for another PR.

Decompose interval_to_tabix converter script
This renames some variables to make it clearer what files they reflect.
Also adds a very basic test that this works as intended.
@jmchilton

This comment has been minimized.

Member

jmchilton commented Dec 8, 2017

Huge 👍 from me - thanks so much @mvdbeek.

@staticmethod
def merge(split_files, output_file):
"""
Merges Bam files

This comment has been minimized.

@nsoranzo

nsoranzo Dec 8, 2017

Member

s/Bam/BAM/ here and below.

with open(os.devnull, 'w') as devnull:
subprocess.check_call(cmd, stderr=devnull, shell=False)
needs_sorting = False
except Exception:

This comment has been minimized.

@nsoranzo

nsoranzo Dec 8, 2017

Member

s/Exception/subprocess.CalledProcessError/

raise Exception("Error Grooming BAM file contents: %s" % stderr)
else:
print(stderr)
sorted_file_name = "%s.bam" % tmp_sorted_dataset_file_name_prefix # samtools accepts a prefix, not a filename, it always adds .bam to the prefix

This comment has been minimized.

@nsoranzo

nsoranzo Dec 8, 2017

Member

The comment here is outdated, I think.

@@ -15,6 +15,9 @@
from cgi import escape
from json import dumps
import pysam
import pysam.bcftools

This comment has been minimized.

@nsoranzo

nsoranzo Dec 8, 2017

Member

This import seems unused.

This comment has been minimized.

@mvdbeek

mvdbeek Dec 9, 2017

Member

No, that's needed because this isn't imported by default:

In [1]: import pysam

In [2]: pysam.bcftools
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-cb47258bfbc9> in <module>()
----> 1 pysam.bcftools

AttributeError: 'module' object has no attribute 'bcftools'

In [3]: import pysam.bcftools

In [4]: pysam.bcftools
Out[4]: <module 'pysam.bcftools' from '/Users/mvandenb/.venv/lib/python2.7/site-packages/pysam/bcftools.pyc'>

This comment has been minimized.

@nsoranzo

nsoranzo Dec 10, 2017

Member

I mean that pysam.bcftools is used in binary.py , but not in this file.

This comment has been minimized.

@mvdbeek

mvdbeek Dec 10, 2017

Member

oh you're right, I initially used this to replace bcftools concat but that didn't work!

This comment has been minimized.

@nsoranzo

nsoranzo Dec 10, 2017

Member

No worries, thanks for fixing it! There are 3 small review comments left you may have missed, then it should be ready to merge.

This comment has been minimized.

@mvdbeek

mvdbeek Dec 10, 2017

Member

Forgot to add them. Thanks again.

@nsoranzo

This comment has been minimized.

Member

nsoranzo commented Dec 8, 2017

lib/galaxy/datatypes/test/1.unsorted.bam seems to be unused now that _is_coordinate_sorted() has been removed.

def sniff(self, filename):
if not is_gzip(filename):
return False
return BaseVcf.sniff(self, filename)

This comment has been minimized.

@nsoranzo

nsoranzo Dec 8, 2017

Member

BaseVcf.sniff(self, filename) -> super(VcfGz, self).sniff(filename)

def sniff(self, filename):
if is_gzip(filename):
return False
return BaseVcf.sniff(self, filename)

This comment has been minimized.

@nsoranzo

nsoranzo Dec 8, 2017

Member

BaseVcf.sniff(self, filename) -> super(Vcf, self).sniff(filename)

def open_data_file(self):
return pysam.Tabixfile(self.dependencies['bgzip'].file_name,
index=self.converted_dataset.file_name)
# We create a symlnk to the index file. This is

This comment has been minimized.

@nsoranzo

nsoranzo Dec 8, 2017

Member

s/symlnk/symlink/

@contextmanager
def get_dataset(file, index_attr='bam_index', dataset_id=1, has_data=True):

This comment has been minimized.

@nsoranzo

nsoranzo Dec 8, 2017

Member

file is a Python reserved word, can you use a different variable name? Also in get_input_files().

@nsoranzo nsoranzo merged commit e198dd7 into galaxyproject:dev Dec 10, 2017

6 checks passed

api test Build finished. 334 tests run, 4 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 165 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 60 tests run, 0 skipped, 0 failed.
Details
selenium test Build finished. 117 tests run, 2 skipped, 0 failed.
Details
toolshed test Build finished. 577 tests run, 0 skipped, 0 failed.
Details
@nsoranzo

This comment has been minimized.

Member

nsoranzo commented Dec 10, 2017

Fantastic, thanks @mvdbeek!

@martenson

This comment has been minimized.

Member

martenson commented Dec 11, 2017

Splendid! Thanks @mvdbeek !!! 💯

@mvdbeek mvdbeek referenced this pull request Jan 3, 2018

Merged

Add BamNative datatype #5180

@mvdbeek mvdbeek deleted the mvdbeek:samtools_to_pysam branch Jun 12, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment