Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

document config for sniffing compressed files, enable by default #6641

Merged

Conversation

@martenson
Copy link
Member

martenson commented Aug 27, 2018

Me and @natefoo think that it is time to switch the default of this config option to True. At least for Main.

In addition I suspect there is a bug in data_source tools that prevent/avoid uncompressing uploaded archives and thus causing issues like #6334.

ping @jmchilton

@natefoo

This comment has been minimized.

Copy link
Member

natefoo commented Aug 27, 2018

Just to walk it back slightly - I think we should do this, but I think we ought to check the proportion of jobs over time running tools with fastq inputs and see how many of those are tools that support native fastqsanger.gz inputs first. If the majority of such jobs are still using non-fastqsanger.gz-supporting tools, this would probably be a net loss for usegalaxy.org (but we should work quickly to get any remaining tools updated to supporting fastqsanger.gz).

@martenson

This comment has been minimized.

Copy link
Member Author

martenson commented Aug 27, 2018

It could technically be a short run net loss (more space consumed because of the compressed archive presence in history after implicit conversion) but it is the path forward we want to promote I believe.

It will motivate tool authors to write tools that accept compressed datatypes as first-class citizens and also improve our storage efficiency in the long run.

If the majority of such jobs are still using non-fastqsanger.gz-supporting tools

I wouldn't give this ratio too much importance, since we are now being inefficient (on Main) by unpacking all compressed formats during upload that are not specifically set to compressed datatype (and setting compressed datatype is rarely used I believe). By stopping this practice (with enabling sniffing of compressed datatypes) we will save space in cases that will never run implicit conversion afterwards. My guess is that this saved space outweighs the extra space consumed by preservation of the compressed copies after implicit conversion (for the analyses that use it).

@@ -141,7 +141,7 @@ def __import_module(full_path, datatype_module, datatype_class_name):
auto_compressed_types = galaxy.util.listify(elem.get('auto_compressed_types', ''))
sniff_compressed_types = galaxy.util.string_as_bool_or_none(elem.get("sniff_compressed_types", "None"))
if sniff_compressed_types is None:
sniff_compressed_types = getattr(self.config, "sniff_compressed_dynamic_datatypes_default", True)
sniff_compressed_types = getattr(self.config, "sniff_compressed_dynamic_datatypes", True)

This comment has been minimized.

Copy link
@jmchilton

jmchilton Aug 28, 2018

Member

This can be configured/overridden on a per-datatype basis in the datatypes_conf.xml - so I think the old config value name is more appropriate.

This comment has been minimized.

Copy link
@martenson

martenson Aug 28, 2018

Author Member

I removed the renaming and added the note about configuration to the description. Thanks for pointing this out.

@martenson martenson force-pushed the martenson:rename-and-document-config-value branch from cdcd4cf to 90bc73e Aug 28, 2018
…clude in sample
@martenson martenson force-pushed the martenson:rename-and-document-config-value branch from 90bc73e to e0d1208 Aug 28, 2018
@jmchilton

This comment has been minimized.

Copy link
Member

jmchilton commented Aug 28, 2018

Did you mess up your gitfu here - I don't see the original changes anymore. I'd use git reflog to find the old head if you no longer have them in a named branch.

@martenson

This comment has been minimized.

Copy link
Member Author

martenson commented Aug 28, 2018

@jmchilton this is now a documentation PR only. I see it fine I think.

@martenson

This comment has been minimized.

Copy link
Member Author

martenson commented Aug 28, 2018

I have added a commit that switches the default for sniffing compressed to True.

@jmchilton

This comment has been minimized.

Copy link
Member

jmchilton commented Aug 28, 2018

I was super confused and remember the original PR being bigger and switching the default, I'm... old I guess.

Thanks for switching the default - I think all contemporary tools have been written with compressed files in mind and this will definitely save us space - but regardless you points about the ratio and being future facing are absolutely true regardless of that debate.

@nsoranzo

This comment has been minimized.

Copy link
Member

nsoranzo commented Aug 28, 2018

@martenson API test failures seem legit.

@martenson martenson changed the title Rename and document config value for sniffing compressed files document config for sniffing compressed files, enable by default Aug 29, 2018
so it tests the new defaults
@martenson

This comment has been minimized.

Copy link
Member Author

martenson commented Aug 29, 2018

I have added commit with adjustments to the compressed upload API tests. I see strange results locally so let's see what jenkins sees.

mvdbeek and others added 5 commits Aug 29, 2018
…fig-value

Use uses_test_history for selected upload tests
compressed beds are not part of the datatypes_conf.xml.sample
…on/galaxy into rename-and-document-config-value
I expect the datatypes_conf.xml bed compress default will change in near future
@martenson

This comment has been minimized.

Copy link
Member Author

martenson commented Aug 30, 2018

Tests now pass for me locally. Hopefully they make sense.

@nsoranzo nsoranzo self-assigned this Sep 6, 2018
@jennaj

This comment has been minimized.

Copy link
Member

jennaj commented Sep 6, 2018

This change will help for issues with fastqsanger data, however, if the data is fastqcssanger, probably not. That type of data is being sniffed incorrectly in two ways (.gz compression not recognized and base fastq type is wrong) -- not sure at what step that is done.

In short, EBI SRA "Send to Galaxy" data cannot be assumed to be fastqsanger. Example and details here: #6334 (comment)

@martenson

This comment has been minimized.

Copy link
Member Author

martenson commented Sep 7, 2018

@jennaj I don't think this affects the current PR. Please provide examples of these wrongly sniffed fastqcssanger files and create a new issue for it.

@nsoranzo nsoranzo merged commit 64bb339 into galaxyproject:dev Sep 7, 2018
6 checks passed
6 checks passed
api test Build finished. 417 tests run, 1 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 187 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 117 tests run, 5 skipped, 0 failed.
Details
selenium test Build finished. No test results found.
Details
toolshed test Build finished. 577 tests run, 0 skipped, 0 failed.
Details
@nsoranzo nsoranzo deleted the martenson:rename-and-document-config-value branch Sep 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.