Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[20.09] Fix sniffers of interval datatypes for files starting with many comments #11416

Conversation

nsoranzo
Copy link
Member

@nsoranzo nsoranzo commented Feb 19, 2021

For files starting with >60 comment lines, the sniff_prefix() method of Bed, Gff, Gff3 and Gtf classes was returning True when actually all headers lines were skipped.

Also:

  • Don't skip lines having more than one column just because the first column is empty.
  • Properly skip comment lines in is_column_based()
  • Add test file which would have been incorrectly sniffed as gtf before. It still isn't sniffed as gff3, but that's fine because it has too many comment lines for the default header prefix.
  • Set Vcf metadata also for VcfGz files.

Reported on Gitter by @FredericBGA.

@nsoranzo nsoranzo added this to the 21.01 milestone Feb 19, 2021
@nsoranzo nsoranzo force-pushed the release_20.09_fix_sniffers_for_too_many_comments branch from 588901d to 0442638 Compare February 19, 2021 05:51
@mvdbeek
Copy link
Member

mvdbeek commented Feb 19, 2021

  • Add test file which would have been incorrectly sniffed as gtf before. It still isn't sniffed as vcf, but that's fine because it has too many comment lines for the default header prefix.

I was worried about that at first, but we check for https://github.com/mvdbeek/galaxy/blob/e82e23d08f2487a4eba387c89be474b9f15ad941/lib/galaxy/datatypes/tabular.py#L739 on the first line, which VCFs should do. The test data isn't actually valid vcf, is it ?
The specs say:

A single ‘fileformat’ field is always required, must be the first line in the file, and details the VCF format version
number. For example, for VCF version 4.2, this line should read:
##fileformat=VCFv4.2

@mvdbeek
Copy link
Member

mvdbeek commented Feb 19, 2021

maybe change the name of the test data file to header_invalid_vcf.tabular ?

@nsoranzo nsoranzo force-pushed the release_20.09_fix_sniffers_for_too_many_comments branch from 0442638 to ebf996f Compare February 19, 2021 11:50
For files starting with >60 comment lines, the `sniff_prefix()` method of
Bed, Gff, Gff3 and Gtf classes was returning `True` when actually all
headers lines were skipped.

Also:
- Don't skip lines having more than one column just because the first
column is empty.
- Properly skip comment lines in `is_column_based()`
- Add test file which would have been incorrectly sniffed as `gtf` before.
  It still isn't sniffed as `gff3`, but that's fine because it has too many
  comment lines for the default header prefix.
@nsoranzo
Copy link
Member Author

nsoranzo commented Feb 19, 2021

@mvdbeek Good point! I've replaced the non-comment lines in the new test file to make it a GFF3 file instead of VCF, since our Gff3 filter does not strictly require an initial comment line. So now if some comment lines are removed from the test file, it is correctly sniffed as GFF3.

@mvdbeek mvdbeek merged commit e43819f into galaxyproject:release_20.09 Feb 19, 2021
@mvdbeek
Copy link
Member

mvdbeek commented Feb 19, 2021

Thanks @nsoranzo and @FredericBGA!

@nsoranzo nsoranzo deleted the release_20.09_fix_sniffers_for_too_many_comments branch February 19, 2021 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants