Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable skipping comment lines while sniffing files #4239

Merged
merged 4 commits into from Jun 28, 2017

Conversation

Projects
None yet
4 participants
@dpryan79
Copy link
Contributor

commented Jun 27, 2017

This is related to #3148.

For some data types, sniffing is done after skipping comment lines. That's problematic in cases of files with many comment lines at the beginning, where false assignments can be done (as in #3148). This PR adds a commentDesignator option to get_headers(), which, if set, causes comment lines starting with the desired character/string to be skipped. This won't solve every issue related to this (e.g., possible confusion of GFF and VCF files if they don't have the file type indicated in the header), but it'll solve at least some of them.

As an aside, this causes the most recent dm6 GFF to sniff as GFF rather than BED :)

dpryan79 added some commits Jun 27, 2017

@galaxybot galaxybot added the triage label Jun 27, 2017

@galaxybot galaxybot added this to the 17.09 milestone Jun 27, 2017

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Jun 27, 2017

As an aside, this causes the most recent dm6 GFF to sniff as GFF rather than BED :)

Finally someone that cares! :)

@mvdbeek
Copy link
Member

left a comment

That looks like a solid improvement!

@dpryan79

This comment has been minimized.

Copy link
Contributor Author

commented Jun 27, 2017

@mvdbeek I work with a LOT of fly people :)

@nsoranzo

This comment has been minimized.

Copy link
Member

commented Jun 27, 2017

@dpryan79 Is that an alien race? 😆

@@ -198,24 +198,34 @@ def convert_newlines_sep2tabs( fname, in_place=True, patt="\\s+", tmp_dir=None,
return ( i + 1, temp_name )


def get_headers( fname, sep, count=60, is_multi_byte=False ):
def get_headers( fname, sep, count=60, is_multi_byte=False, commentDesignator=None ):

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jun 27, 2017

Member

It's really a nitpick, but can you use comment_designator for the parameter name instead? We don't use mixedCase.

This comment has been minimized.

Copy link
@dpryan79

dpryan79 Jun 27, 2017

Author Contributor

Camel case removed.

line = line.rstrip('\n\r')
if is_multi_byte:
# TODO: fix this - sep is never found in line
line = unicodify( line, 'utf-8' )
sep = sep.encode( 'utf-8' )
if comment_designator is not None:
comment_designator = comment_designator.encode( 'utf-8' )
if comment_designator is not None and line.startswith( comment_designator ):

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jun 27, 2017

Member
if comment_designator and line.startswith( comment_designator ):

would work also when comment_designator is the empty string.

This comment has been minimized.

Copy link
@dpryan79

dpryan79 Jun 28, 2017

Author Contributor

True, fixed.

try:
if not headers:
return False
for hdr in headers:
if (hdr[0] == '' or hdr[0].startswith( '#' )):
if (hdr[0] == ''):

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jun 27, 2017

Member

The parentheses are not needed.

This comment has been minimized.

Copy link
@dpryan79

dpryan79 Jun 28, 2017

Author Contributor

Removed

@nsoranzo nsoranzo merged commit 40cd2df into galaxyproject:dev Jun 28, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@nsoranzo

This comment has been minimized.

Copy link
Member

commented Jun 28, 2017

Thanks @dpryan79!

@nsoranzo nsoranzo referenced this pull request Feb 12, 2018

Closed

GFF file sniffed as BED #3148

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.