adding auto-annotate and inventory compilation scripts, plus new annotation files #52

nbokulich · 2016-12-09T23:20:28Z

@gregcaporaso would you mind reviewing?

Some big changes here — I generated two draft scripts, autoannotate-taxonomy.py and inventory-compiler.py, which automatically generate the expected-taxonomy.tsv files, database-identifiers.tsv, and inventory.tsv.

Then I did just that — all bacterial communities have new greengens and silva 97% and 99% OTU expected-taxonomy.tsv files and database-identifiers.tsv files. Others coming next.

This is a complete fix for #22 and #46 and a partial fix for #24

Probably need some clean up and inventory-compiler.py needs some work to automate whenever specific files are added (if it's possible — see notes).

…tation files

gregcaporaso

Hey @nbokulich,
Some high-level comments...

The auto-annotate functionality looks extremely handy, but I think it's beyond the scope of this specific repo to house this utility code. I can see people wanting to use the code, and then you having to support it, and that's not really what we designed this repo to be.

A few motivations for putting the code in another repo:

It will help keep mockrobiota focused as data resource. It wasn't intended to be a software package, and there are different things that you need to do to manage a software package versus a data resource (e.g., releasing on package managers like pypi is important for software, but not for data resources). The only code that is currently in this repo is to check for data integrity, and I think we should stick with that.
As of now there isn't any test code for this, so you really need to be careful with it. If you put it in this repo, which we point users at, we're sort of endorsing use of the code and people will start using it. We can't let that happen unless the code has been fully unit tested.

So, I think you should remove the code directory from this PR, and create a new repository with this code, for example github.com/nbokulich/mockrobiota-utilities. That doesn't mean that we could never re-visit this decision, or that you shouldn't use it to help with management of mockrobiota, but it would clearly distinguish the stable data resource from the ultility code that you use to help generate the data files. Once you make that transition, I could give more feedback on the code if you'd like (though transitioning from getopt to click as I mention in my inline comments would be helpful to do first, since that will reduce the amount of code by a lot).

Same goes for the other utility scripts that you've included in this pull request. After unit testing, supporting fuzzy matching, etc, I could see the auto-annotate script being publishable as an applications note if we wanted to go that route. So I think you should put these in another repo, use them for a while, and then we should discuss whether it's worth trying to make them into a tool that we'd publicly support/publish (in which case it might be time to put them into mockrobiota, though that's a decision we'd want to weigh at the time) or whether we just want them to remain utilities that are not support for use by others.

Other high-level comments:

could you rename the otu directories to use dashes instead of underscores (so 99-otus rather than 99_otus) for consistency with the other directory names in the repo? It's a pain to mix both as you can never remember/guess which will be used at any given time. (We like dashes because you don't have to hit shift... yep, programmers are that lazy...)
remove data/mock-12/.DS_Store, and any other .DS_store files that are hiding around the repo. If you add .DS_store to the .gitignore file, git won't try to track them anymore. Here's an example.
I think you should rename data/mock-12/source/ExtremeRefSeqs.fasta to data/mock-12/source/expected-sequences.fasta, for consistency with expected-taxonomy.tsv. What do you think of that? Simiarly, HMP_MOCK.fasta should become mock.1.fasta (and in general, the names should be .fasta).

The additional data sets are fantastic, and it's awesome that we have sequence data for some now!

One other comment: it's generally a good idea to keep your pull requests more modular. In this case, I would have made the changes to all of the data one PR, the addition of the new data sets another PR, and the addition of the code into a third PR. This makes them easier to review, and in this case would mean less work because we could just not merge the PR that adds the code directory, rather than having to revert the addtion in this PR.

I can review again when you've made these changes. Thanks for the work on this!

gregcaporaso · 2017-01-11T22:45:32Z

code/autoannotate.py

+from os import makedirs
+import re
+import sys
+import getopt


We generally use click for our command line interfaces. You might want to look into it, as I think it would reduce a lot of the boilerplate code here, and it would make it easier for others to help with maintenance of this. I think we should transition this from getopt to click, but it doesn't necessarily have to happen in this PR. If you want to do it in a different PR, could you create an issue for that?

gregcaporaso · 2017-01-11T22:52:21Z

code/autoannotate.py

+import getopt
+
+
+def add_lists(l1, l2):


As a convention, you should rename all of your functions to begin with an _ (so this would become _add_lists), unless you want them to be public (meaning that you support other Python programs importing these). If functions are public, then you have to think about backward compatibility, not changing interfaces, etc, so you generally want to make as little public as you can get away with. In this case, since your users would access this functionality though the CLI that you're creating, I highly recommend having no public functions. Users can still import functions that begin with _, but it's a general rule that they shouldn't (so it's their fault if they do and their workflows then break).

gregcaporaso · 2017-01-11T23:00:50Z

code/autoannotate.py

+            return lineage
+
+
+usage = '''usage: autoannotate-taxa.py -i <source_fp>


A lot of this is what you wouldn't have to write with click.

gregcaporaso · 2017-01-11T23:02:21Z

code/autoannotate.py

+            sp = arg
+
+    # generate dict of {name: (genus, species, abundances)}
+    with open(source_fp, "r") as source:


If source_fp doesn't exist, the user is going to see a traceback error. You should detect these types of issues and present more useful error messages.

gregcaporaso · 2017-01-11T23:06:23Z

code/autoannotate.py

+
+        # search for match at genus, then species level
+        for full, partial in ref_taxa.items():
+            if t[0][0] in partial[0]:


This exact matching is really strict - I wonder if you want to use fuzzy matching instead, to help you identify slight differences in spelling or spelling errors. Fuzzywuzzy is one package that does this. This isn't essential, but is a "nice to have" feature that would probably save you some time.

gregcaporaso · 2017-01-11T23:17:16Z

code/autoannotate.py

+    if not exists(destination_dir):
+        makedirs(destination_dir)
+
+    with open(join(destination_dir, 'expected-taxonomy.tsv'), "w") as dest:


This will overwrite the file if it already exists - is that what you want to do? There's not a right or wrong way to handle this, it's just worth thinking about whether you would want to error if the file already exists to avoid losing work if the user doesn't realize they're about to overwrite a file, or if you just want to overwrite. This decision generally depends on how hard it would be to regenerate the overwritten file if the user didn't really want to overwrite it.

gregcaporaso · 2017-01-11T23:27:00Z

data/mock-12/README.md

@@ -0,0 +1,11 @@
+#mock-12
+
+Composed of 27 bacterial strains amplified with 515f/806r primers. Intended to include more fine-scale variation than the other mock communities, the members of which were chosen in part for their well-separated 16S rRNA gene sequences. The Extreme strains are all distinguishable over the sequenced region of the 16S rRNA gene, but some pairs of strains differ by as little as one nucleotide. Generated by Benjamin Callahan at Stanford University in 2015.


more fine-scale variation -> more closely related taxa

The Extreme strains -> These strains, because you haven't mentioned the name extreme yet.

nbokulich · 2017-01-12T02:08:25Z

Thanks @gregcaporaso ! I have pushed all requested changes (with the exception of code edits, as these are now moved to a new repo).

gregcaporaso

This all looks good, but I noticed one more thing:

mockrobiota/data/example-1/greengenes/13_8/97/expected-taxonomy.tsv have 97-otus as the lowest level directory (ie., not just 97?

Once that change is made this is ready for merge.

nbokulich · 2017-01-12T15:01:16Z

Thanks @gregcaporaso ! Have made that fix and pushed the changes.

gregcaporaso · 2017-01-12T18:07:28Z

Thanks @nbokulich, looks good!

nbokulich added 8 commits December 9, 2016 16:51

adding auto-annotate and inventory compilation scripts, plus new anno…

9616c72

…tation files

updated auto-annotate code

36fee84

updated integrity tests

ca2135c

updated example dir structure

b8ce872

updated integrity checks

e1f87d4

cleaning up old data files

46343b5

cleaning up old files

cde399c

inventory compiler

862b778

gregcaporaso self-requested a review January 11, 2017 22:42

gregcaporaso requested changes Jan 11, 2017

View reviewed changes

nbokulich mentioned this pull request Jan 12, 2017

Sundry code revisions nbokulich/mockrobiota-utilities#1

Open

modifying directory path names

20b176c

gregcaporaso requested changes Jan 12, 2017

View reviewed changes

fix example file dir name

6cffa9b

gregcaporaso approved these changes Jan 12, 2017

View reviewed changes

gregcaporaso merged commit 4384b59 into master Jan 12, 2017

nbokulich mentioned this pull request Jan 12, 2017

additional mock communities to include #26

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adding auto-annotate and inventory compilation scripts, plus new annotation files #52

adding auto-annotate and inventory compilation scripts, plus new annotation files #52

nbokulich commented Dec 9, 2016

gregcaporaso left a comment •

edited

gregcaporaso Jan 11, 2017

gregcaporaso Jan 11, 2017

gregcaporaso Jan 11, 2017

gregcaporaso Jan 11, 2017

gregcaporaso Jan 11, 2017

gregcaporaso Jan 11, 2017

gregcaporaso Jan 11, 2017

gregcaporaso Jan 11, 2017

nbokulich commented Jan 12, 2017

gregcaporaso left a comment

nbokulich commented Jan 12, 2017

gregcaporaso commented Jan 12, 2017

		return lineage


		usage = '''usage: autoannotate-taxa.py -i <source_fp>

		@@ -0,0 +1,11 @@
		#mock-12

		Composed of 27 bacterial strains amplified with 515f/806r primers. Intended to include more fine-scale variation than the other mock communities, the members of which were chosen in part for their well-separated 16S rRNA gene sequences. The Extreme strains are all distinguishable over the sequenced region of the 16S rRNA gene, but some pairs of strains differ by as little as one nucleotide. Generated by Benjamin Callahan at Stanford University in 2015.

adding auto-annotate and inventory compilation scripts, plus new annotation files #52

adding auto-annotate and inventory compilation scripts, plus new annotation files #52

Conversation

nbokulich commented Dec 9, 2016

gregcaporaso left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nbokulich commented Jan 12, 2017

gregcaporaso left a comment

Choose a reason for hiding this comment

nbokulich commented Jan 12, 2017

gregcaporaso commented Jan 12, 2017

gregcaporaso left a comment •

edited