Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Display preview feature for Bam datatype #4279

Merged
merged 19 commits into from Aug 2, 2017

Conversation

Projects
None yet
7 participants
@ashvark
Copy link
Contributor

commented Jul 5, 2017

This PR has the implementation of display_data function for BAM dataype. It enables the users to view the content of the Bam file with in the galaxy itself.

@mvdbeek
Copy link
Member

left a comment

This is a promising start and could be quite useful for some applications!

ck_data = bamfile.text
for f in bamfile.fetch(until_eof=True):
lineNumber+=1
if (lineNumber > offset and lineNumber <= (offset + ck_size)):

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

The time/cpu spent to reach the lines increases with every time the users displays more lines.
You can replace the line number approach for specifying the offset with bamfile.seek() and bamfile.tell() methods (which weirdly enough is not documented in http://pysam.readthedocs.io/en/latest/api.html). So after fetching the number of alignments that you want (maybe 1000 is more reasonable than 100?), set offset to bamfile.tell() and before iterating over alignments jump to the correct offset (if it is not 0) using bamfile.seek(offset)

@@ -236,6 +239,10 @@ class Bam( Binary ):
MetadataElement( name="reference_lengths", default=[], desc="Chromosome Lengths", param=MetadataParameter, readonly=True, visible=False, optional=True, no_value=[] )
MetadataElement( name="bam_header", default={}, desc="Dictionary of BAM Headers", param=MetadataParameter, readonly=True, visible=False, optional=True, no_value={} )

def __init__(self, **kwd):
"""Initialize taxonomy datatype"""

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

That is not the correct docstring.


def get_chunk(self, trans, dataset, offset=0, ck_size=None):
bamfile = pysam.AlignmentFile(dataset.file_name, "rb")
ck_size = 100 # 100 lines

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

You need to have 2 spaces before inlines comments

def get_chunk(self, trans, dataset, offset=0, ck_size=None):
bamfile = pysam.AlignmentFile(dataset.file_name, "rb")
ck_size = 100 # 100 lines
ck_data=""

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

You need single spaces around operators.

bamfile = pysam.AlignmentFile(dataset.file_name, "rb")
ck_size = 100 # 100 lines
ck_data=""
lineNumber=0

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

We are not using CamelCase for variables, this should probably be line_number = 0

if(offset == 0):
ck_data = bamfile.text
for f in bamfile.fetch(until_eof=True):
lineNumber+=1

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

Since it's probably better to use the offset you can do for line_number, f in enumerate(bamfile):.
Also f is frequently use to indicate a file handle, in this situation it would probably be better to rename f to read, alignment or segment.

@@ -462,6 +469,53 @@ def to_archive(self, trans, dataset, name=""):
file_paths.append(dataset.metadata.bam_index.file_name)
return zip(file_paths, rel_paths)


def get_chunk(self, trans, dataset, offset=0, ck_size=None):
bamfile = pysam.AlignmentFile(dataset.file_name, "rb")

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

you probably want to use with pysam.AlignmentFile(dataset.file_name, "rb") as bamfile: here.

ck_data=ck_data +"\n" + bamlineModified
elif (lineNumber > (offset + ck_size)):
break
last_read = offset + ck_size

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

here you can set a new offset using bamfile.tell()

elif to_ext or not preview:
return super( Bam, self ).display_data( trans, dataset, preview, filename, to_ext, **kwd )
else:
column_names = 'null'

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

I think '' would be better as a fallback.

column_types = []
column_number = dataset.metadata.columns
if column_number is None:
column_number = 'null'

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

it may be more accurate to set this to 1?

@galaxybot galaxybot added the triage label Jul 5, 2017

@galaxybot galaxybot added this to the 17.09 milestone Jul 5, 2017

@ashvark

This comment has been minimized.

Copy link
Contributor Author

commented Jul 5, 2017

@mvdbeek Thank you very much for the review. I managed to do all the changes except the bamfile.seek() functionality. I could not iIterate the bamfile after i use seek() whereas I could do without seek() .

Below is the snapshot. Can you help me ?

image

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Jul 5, 2017

You really need to use the offset that you get from .tell(), just specifying 3 is not likely to work.
See this as an example:

In [28]: with pysam.AlignmentFile('/Users/mvandenb/src/readtagger/test/data/cornercase.bam') as f:
    ...:     first_offset = f.tell()
    ...:     print(first_offset)
    ...:     r = f.next()
    ...:     print(r.query_name)
    ...:     print(f.tell())
    ...:     next_read = f.next()
    ...:     print(next_read.query_name)
    ...:     print(f.tell())
    ...:     f.seek(first_offset)
    ...:     first_read_seek = f.next()
    ...:     print(first_read_seek.query_name == r.query_name)
    ...:
1442119680
HWI-D00405:129:C6KNAANXX:3:2210:9430:87369
1442120130
HWI-D00405:129:C6KNAANXX:4:2103:18945:51370
1474232320
True
if offset == 0:
ck_data = bamfile.text
for line_number, alignment, in enumerate(bamfile):
if line_number > offset and line_number <= (offset + ck_size):

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

you probably want to change this to if line_number < ck_size:

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

Actually, we can switch this around and break here if the line_number exceeds ck_size:

                if line_number > ck_size:
                    break
line_number = 0
if offset == 0:
ck_data = bamfile.text
for line_number, alignment, in enumerate(bamfile):

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

you can drop the , after alignment

# Below code will remove spaces between each tag.
bamline_modified = ('\t').join(bamline.split()[:11] + [('').join(bamline.split()[11:])])
ck_data = ck_data +"\n" + bamline_modified
elif line_number > (offset + ck_size):

This comment has been minimized.

Copy link
@mvdbeek
ck_data = ck_data +"\n" + bamline_modified
elif line_number > (offset + ck_size):
break
last_read = offset + ck_size

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

I'd change this to offset = bamfile.tell()

break
last_read = offset + ck_size
return dumps( { 'ck_data': util.unicodify( ck_data ),
'offset': last_read } )

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

and then change last_read to offset

@@ -478,7 +477,7 @@ def handle_uploaded_dataset_file( filename, datatypes_registry, ext='auto', is_m
ext = guess_ext( filename, sniff_order=datatypes_registry.sniff_order, is_multi_byte=is_multi_byte )

if check_binary( filename ):
if not Binary.is_ext_unsniffable(ext) and not datatypes_registry.get_datatype_by_extension( ext ).sniff( filename ):
if not galaxy.datatypes.binary.is_ext_unsniffable(ext) and not datatypes_registry.get_datatype_by_extension( ext ).sniff( filename ):

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 5, 2017

Member

Is this change necessary?

This comment has been minimized.

Copy link
@ashvark

ashvark Jul 6, 2017

Author Contributor

Compilation breaks when I have both 'from galaxy.datatypes.binary import Binary' in sniff.py and 'from galaxy.datatypes.tabular import Sam' in binary.py . I donot understand the reason behind that. Code is compiling properly only when i remove 'from galaxy.datatypes.binary import Binary' from sniff.py.

@@ -487,7 +487,7 @@ def handle_uploaded_dataset_file( filename, datatypes_registry, ext='auto', is_m
ext = guess_ext( filename, sniff_order=datatypes_registry.sniff_order, is_multi_byte=is_multi_byte )

if check_binary( filename ):
if not Binary.is_ext_unsniffable(ext) and not datatypes_registry.get_datatype_by_extension( ext ).sniff( filename ):
if not datatypes.binary.Binary.is_ext_unsniffable(ext) and not datatypes_registry.get_datatype_by_extension( ext ).sniff( filename ):

This comment has been minimized.

Copy link
@ashvark

ashvark Jul 6, 2017

Author Contributor

I would like to know whether datatypes.binary.Binary.is_ext_unsniffable(ext) is the right way to call this function ?

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 6, 2017

Member

No, I don't think you should change sniff.py.
I think you need very few things from the Sam class.
In fact I think you only need the column names, so you could just initialize them in the Bam class.

This comment has been minimized.

Copy link
@ashvark

ashvark Jul 6, 2017

Author Contributor

@mvdbeek I already tried initializing the column_names with out importing Sam class. But column names are not displayed in the galaxy app. Then when I inherited Sam for Bam, colmun names started populating. I donot understand the reason behind.. i think initializing column names alone in Bam class is not enough.

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Jul 7, 2017

@galaxybot test this

MetadataElement( name="columns", default=0, desc="Number of columns", readonly=True, visible=False, no_value=0 )
MetadataElement( name="column_types", default=[], desc="Column types", param=metadata.ColumnTypesParameter, readonly=True, visible=False, no_value=[] )
MetadataElement( name="column_names", default=[], desc="Column names", readonly=True, visible=False, optional=True, no_value=[] )

This comment has been minimized.

Copy link
@ashvark

ashvark Jul 7, 2017

Author Contributor

It seems that above three lines are mandatory inorder to show the column names in galaxy

@@ -423,6 +432,8 @@ def set_meta( self, dataset, overwrite=True, **kwd ):
dataset.metadata.read_groups = [ read_group['ID'] for read_group in dataset.metadata.bam_header.get( 'RG', [] ) if 'ID' in read_group ]
dataset.metadata.sort_order = dataset.metadata.bam_header.get( 'HD', {} ).get( 'SO', None )
dataset.metadata.bam_version = dataset.metadata.bam_header.get( 'HD', {} ).get( 'VN', None )
dataset.metadata.columns = 12
dataset.metadata.column_types = ['str', 'int', 'str', 'int', 'int', 'str', 'str', 'int', 'int', 'str', 'str', 'str']

This comment has been minimized.

Copy link
@ashvark

ashvark Jul 7, 2017

Author Contributor

Could not understand the importance of above two lines. Column names are displayed even if i removing these two lines.

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

I think (I could be wrong) the main purpose is for accessing this sort of metadata in the tool wrappers without needing to call str() or int(). But since those are always the same I would prefer setting this directly as the default in the MetadataElement.

last_read = 0
header_line_count = 0
if offset == 0:
ck_data = bamfile.text

This comment has been minimized.

Copy link
@ashvark

ashvark Jul 7, 2017

Author Contributor

@mvdbeek Like you mentioned already, What happens when bam file has several lines of header. Do I need to trim if it exceeds certain limit ?

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

I think for now we should cut them off as you're doing now. In the future we could work with the comment_lines MetadataElement and be a little smarter about this, i.e let the user choose to see the full header. I don't think this is too important for the start.

return super( Bam, self ).display_data( trans, dataset, preview, filename, to_ext, **kwd )
else:
column_names = ''
if dataset.metadata.column_names:

This comment has been minimized.

Copy link
@ashvark

ashvark Jul 7, 2017

Author Contributor

dataset.metadata.column_names will be always empty because I am not setting anything in the set_meta() function. Even in the Sam class I could see the same thing.

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

yes, I think you could set the column_names MetadataElement directly to the correct column_names, since they are not going to change for this datatype. Same for the column_number and column_types, they're always going to be the same, so you can just fix them in the MetadataElement.

return super( Bam, self ).display_data( trans, dataset, preview, filename, to_ext, **kwd )
else:
column_names = ''
if dataset.metadata.column_names:

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

yes, I think you could set the column_names MetadataElement directly to the correct column_names, since they are not going to change for this datatype. Same for the column_number and column_types, they're always going to be the same, so you can just fix them in the MetadataElement.

last_read = 0
header_line_count = 0
if offset == 0:
ck_data = bamfile.text

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

I think for now we should cut them off as you're doing now. In the future we could work with the comment_lines MetadataElement and be a little smarter about this, i.e let the user choose to see the full header. I don't think this is too important for the start.

# Below code will remove spaces between each tag.
bamline_modified = ('\t').join( bamline.split()[:11] + [ ('').join(bamline.split()[11:]) ] )
ck_data = ck_data + "\n" + bamline_modified
last_read = bamfile.tell()

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

you can pull this out of the for loop, you only need to set this once, plus this prevents an undefined variable if you break out due to too many header lines. Also, could you change the variable name to offset ?

This comment has been minimized.

Copy link
@ashvark

ashvark Jul 7, 2017

Author Contributor

if i put the last_read = bamfile.tell() , I am missing one line form bamfile for every chunk. Because, I am breaking the loop only after enumerating.

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

Ah, right. Then you should make this the first line of the loop, otherwise last_read will be undefined (or if you rename last_read to offset offset would always be 0) if the header is larger than ck_size.

@@ -423,6 +432,8 @@ def set_meta( self, dataset, overwrite=True, **kwd ):
dataset.metadata.read_groups = [ read_group['ID'] for read_group in dataset.metadata.bam_header.get( 'RG', [] ) if 'ID' in read_group ]
dataset.metadata.sort_order = dataset.metadata.bam_header.get( 'HD', {} ).get( 'SO', None )
dataset.metadata.bam_version = dataset.metadata.bam_header.get( 'HD', {} ).get( 'VN', None )
dataset.metadata.columns = 12
dataset.metadata.column_types = ['str', 'int', 'str', 'int', 'int', 'str', 'str', 'int', 'int', 'str', 'str', 'str']

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

I think (I could be wrong) the main purpose is for accessing this sort of metadata in the tool wrappers without needing to call str() or int(). But since those are always the same I would prefer setting this directly as the default in the MetadataElement.

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Jul 7, 2017

@galaxybot test this

for line_number, alignment in enumerate( bamfile ) :
# return only Header lines if 'header_line_count' exceeds 'ck_size'
# FIXME: Can be problematic if bam has million lines of header
last_read = bamfile.tell()

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

Can you rename this to offset ? last_read is a little confusing here.

# Galaxy display each tag as separate column because 'tostring()' funcition put spaces in between each tag of tags column.
# Below code will remove spaces between each tag.
bamline_modified = ('\t').join( bamline.split()[:11] + [ ('').join(bamline.split()[11:]) ] )
ck_data = ck_data + "\n" + bamline_modified

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

I'd prefer "%s\n%s" % (ck_data, bamline_modified), this is how we usually handle string concatenation in galaxy.

return super( Bam, self ).display_data( trans, dataset, preview, filename, to_ext, **kwd )
else:
column_names = dataset.metadata.column_names
if not column_names:

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

I don't think you need to do this fallback, dataset.metadata.column_names should always be set (I haven't verified that).

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

Yep, this is what MetadataElement(name='column_names') does. I think it's clearer if we remove the if not ... statements.

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Jul 7, 2017

@galaxybot test this

ck_data = ""
header_line_count = 0
if offset == 0:
ck_data = bamfile.text

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Jul 7, 2017

Member

what do you think about doing ck_data = bamfile.text.replace('\t', ' ') ? That would prevent the header lines to stretch the columns too far (It is a dirty hack ... It would be better to set and render the comment lines in a single cell)

This comment has been minimized.

Copy link
@ashvark

ashvark Jul 10, 2017

Author Contributor

yes, i think it is good idea. I will update my code.

ashvark added some commits Jul 10, 2017

@ashvark

This comment has been minimized.

Copy link
Contributor Author

commented Jul 12, 2017

image

@mvdbeek I could not understand why 'continuous-integration/travis-ci/pr' failed. can you help me ?

@nsoranzo

This comment has been minimized.

Copy link
Member

commented Jul 12, 2017

@ashvark I suppose that was a temporary issue, I've restarted the failed builds.

@ashvark

This comment has been minimized.

Copy link
Contributor Author

commented Jul 12, 2017

@nsoranzo thank you very much

@jmchilton

This comment has been minimized.

Copy link
Member

commented Jul 13, 2017

@galaxybot test this

@ashvark

This comment has been minimized.

Copy link
Contributor Author

commented Jul 19, 2017

I could not understand why toolshed test has failed status eventhough all the tests are passed.

@bgruening

This comment has been minimized.

Copy link
Member

commented Aug 2, 2017

@ashvark everything looks ok now, the tests pass.

@martenson

This comment has been minimized.

Copy link
Member

commented Aug 2, 2017

@mvdbeek would you mind re-visiting your review?

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Aug 2, 2017

Yeah, it's OK.

@mvdbeek mvdbeek merged commit ade8844 into galaxyproject:dev Aug 2, 2017

5 checks passed

api test Build finished. 279 tests run, 0 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 150 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 37 tests run, 0 skipped, 0 failed.
Details
toolshed test Build finished. 579 tests run, 0 skipped, 0 failed.
Details
@mvdbeek

This comment has been minimized.

Copy link
Member

commented Aug 2, 2017

Thanks a lot @ashvark, this is great!

@ashvark

This comment has been minimized.

Copy link
Contributor Author

commented Aug 3, 2017

Thank you all for your help in implementing this feature.

@@ -462,6 +467,55 @@ def to_archive(self, trans, dataset, name=""):
file_paths.append(dataset.metadata.bam_index.file_name)
return zip(file_paths, rel_paths)

def get_chunk( self, trans, dataset, offset=0, ck_size=None ):
index_file = dataset.metadata.bam_index

This comment has been minimized.

Copy link
@jmchilton

jmchilton Oct 19, 2017

Member

@mvdbeek Do you happen to know if this function has the potential to consume unbounded memory if the bam file is problematic or is it going to always be pretty managable despite the bam file?

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Oct 20, 2017

Member

Yes, bam lines (reads) can become huge. If someone did a genome-against-genome alignment I think this could be problematic.

This comment has been minimized.

Copy link
@jmchilton

jmchilton Oct 24, 2017

Member

We verified the problem on main wasn't data loaded into peak attributes - sorry for the false alarm. Maybe we should create an issue for known points in Galaxy where the amount of data loaded is unbounded, I don't know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.