New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

R: Added R-code directory, w/ installable "biom" package #27

Closed
wants to merge 34 commits into
base: master
from

Conversation

Projects
None yet
4 participants
@joey711
Contributor

joey711 commented Jun 22, 2012

The package itself should be submitted to CRAN when we think it is ready. It is already built-out enough and documented enough for testing. It contains some unit tests and examples. It lacks a vignette, any integration with trees (is that Newick-string idea incorporated into the file format yet? Examples of it?), and also lacks any package-internal data stored as RData binaries in the "data" top-level directory.

All currently-included functions/classes are documented.

Other features needed:

  • write-to-biom Now supported
  • installation instructions in the absence of it sitting in an R repo or top-level GitHub repo Now included
  • A vignette and example data were the two biggest criteria holding back submission to CRAN. Now included

joey711 added some commits Jun 19, 2012

An initial incomplete skeleton for R-code
Need to fix sparse-parsing to not pass through dense matrix format.
Need to re-build documentation for class definitions, accessors, etc.
Need to test on examples. Need to add tests. Need to add print/show
methods. Need to add examples. And so on.
Basic stuff in place. Passes R CMD check without warning
No attempt to read trees yet. No tests yet. Copied example .biom files
for testing, in inst/extdata directory.
Added some tests. Need to get testthat-package to work
Some tests included, know they work. Need to get testthat-package to
work during R CMD check still. Tests fail until testthat-package works
as a testing dependency.
unit tests included, other basic tools
Can probably merge now…
make BIOM and RBIOM all lower case
including class, constructor, etc.
mod/add install/check scripts
Not sure which will be wrapped into the full biom-format package update
tests. Probably the unit_only test or something. Still useful to see
examples for installing the rbiom package or the Rscript call if you
haven't done R scripting before.
@gregcaporaso

View changes

R-code/rbiom/R/.Rapp.history Outdated
@@ -0,0 +1,685 @@

This comment has been minimized.

@gregcaporaso

gregcaporaso Jun 22, 2012

Member

Should this file be in .gitignore? Looks like a file we don't want in the repo (but correct me if I'm wrong).

This comment has been minimized.

@joey711

joey711 Jun 22, 2012

Contributor

Yes definitely. Sorry, I didn't notice it. There is not consequence to removing it. Purge and ignore for sure.

This comment has been minimized.

@joey711

joey711 Jun 23, 2012

Contributor

.gitignore has been updated in this pull request, and .Rapp.history was removed. See commits below

@gregcaporaso

View changes

R-code/rbiom/R/IO-methods.R Outdated
#' @keywords internal
parseGreenGenesPrefix <- function(char.vec){
# Define the meaning of each prefix according to GreenGenes (and RDP?) taxonomy
Tranks <- c(k="Kingdom", p="Phylum", c="Class", o="Order", f="Family", g="Genus", s="Species")

This comment has been minimized.

@gregcaporaso

gregcaporaso Jun 22, 2012

Member

I don't think RDP uses these same abbreviations.

This comment has been minimized.

@joey711

joey711 Jun 22, 2012

Contributor

The default is to not use the greengenes prefixes at all and include the entire header. There are likely other conventions, too. If there's a better way to organize this parsing I'd be happy to try and implement it. This actually touches on a bigger issue that is, what about when there is taxa data, not just (essentially) vectors of taxonomic ranks? The format is supposed to be able to handle that, but no working examples in our tiny test files.

This comment has been minimized.

@gregcaporaso

gregcaporaso Jun 24, 2012

Member

What type of taxa data are you thinking of specifically? The format currently doesn't specify what those should look like, although the vector of taxonomic ranks is convenient for use with QIIME. It might be worth having a page on the website with recommended conventions that we could consider for integration into the standard.

This comment has been minimized.

@joey711

joey711 Jul 13, 2012

Contributor

I don't have a specific example in hand, but it probably isn't long before users will want to be able to include additional data from taxa in their dataset. For example, data derived from reference genomes that match their OTU sequence exactly. In the case of shotgun metagenomes, this could mean including whatever information can be gleaned from the bits of genome adjacent to the phylogenetic marker on the contig.

This is vague, and I was mostly curious if there were any plans for addressing this. Seems perfectly reasonable to postpone any effort until/unless users start asking for it, have some actual scientific reason to want to do it.

This comment has been minimized.

@gregcaporaso

gregcaporaso Jul 13, 2012

Member

Yeah, let's postpone until we have specific examples, and we can start a page of recommendations at that point.

joey711 added some commits Jun 23, 2012

rm .Rapp.history, added to .gitignore
minore change to ignore unneeded support files for R IDEs
@jairideout

This comment has been minimized.

Member

jairideout commented Sep 25, 2012

@joey711, what is the status of this pull request? We have several potential users of the BIOM format who expressed interest in having an R API in addition to the Python one currently in place (this interest was expressed at GSC 14 last week where I presented the BIOM format).

Just wanted to check in and see how close you think this is to being merged. @gregcaporaso and I talked and thought it would be great if this could go into the next biom-format release (we don't have a scheduled date for the release though).

@joey711

This comment has been minimized.

Contributor

joey711 commented Sep 25, 2012

I would be happy to get this included. I didn't receive very much feedback and so let it sit idle. I haven't looked at it since (for a few months), but I believe it was pretty close to being inclusion-ready. We will need to add a few lines in the main testing directory to initiate the R-specific tests. There were a couple of features that the R-side can and should support, but don't yet:

(1) trees. The first of those being the phylogenetic tree support, which was a separate issue discussion. I think we agreed on adding a Newick string somewhere, and I should be able to make that supported on the R-side very quickly.

(2) chunking. The second might be a RAM-efficient option to page/stream the .biom file in chunks into a sparse matrix object. The current R-code first uses a relatively fast, but RAM-hungry, full-file read that converts the biom-file into an R-structured list.

(3) better tests. A third "feature" was better testing that is fully consistent with the python-side unit tests. If I remember, I didn't attempt to cover all tests from the python side, yet.

"How much of these three things needs to get done before merging?" ... is a question I want to pose to the group. The answer will affect how much time until we can do the merge.

Also, does anyone else dabble (or more) in R and want to contribute, or am I flying solo on this part?

@jairideout

This comment has been minimized.

Member

jairideout commented Sep 25, 2012

Great, thanks for getting back to us so quickly on this, @joey711!

My opinions on the remaining issues (curious to hear from the rest of the group too):

(1) This should not keep the merge from happening because the current BIOM format (and Python API) does not support this. Once it does, the R code would need to be updated.

(2) Same as (1): we are currently running into efficiency issues with the Python API on extremely large tables, and are going to be spending time optimizing the code. I don't think this should prohibit your changes from being merged, though it might need to be updated in the future to be more efficient.

(3) I think it is important to fully test the code, and if possible, mimic the unit tests that are currently in place. This would need to be completed before the merge.

I've read some R code and done some very simple stuff with the language, but I'm hesitant to even call myself an R novice. I'd be happy to read over the code and provide whatever feedback I can though.

@joey711

This comment has been minimized.

Contributor

joey711 commented Sep 25, 2012

Got it. I'll do what I can to bring the unit tests up to par as soon as I can. Any feedback is welcome, btw, however "novice".

@gregcaporaso

This comment has been minimized.

Member

gregcaporaso commented Sep 27, 2012

Hi all,
I agree with @jrrideout regarding the importance of the three different features that @joey711 describes.

@gregcaporaso

This comment has been minimized.

Member

gregcaporaso commented Nov 18, 2012

@joey711, we're shooting for a new release of biom within the next few weeks to month. Do you think it'd be possible to get your tests in within that time frame? I would love to include this in the next BIOM release.

@joey711

This comment has been minimized.

Contributor

joey711 commented Nov 18, 2012

Hi Greg,

Yes. I'm very sorry for the delay. I had some personal matters that - when added to some bigger-than-anticipated teaching obligations - took me out of any development for several weeks.

Things have settled on the personal side, and teaching is winding down. I should be able to handle that time frame.

Thanks for keeping me in the loop, and sorry again for going MIA.

@gregcaporaso

This comment has been minimized.

Member

gregcaporaso commented Nov 18, 2012

No problem - it'll be great to get this in place!

@ghost

This comment has been minimized.

ghost commented Mar 21, 2013

Can one of the admins verify this pull request?

@jairideout

This comment has been minimized.

Member

jairideout commented Mar 21, 2013

add to whitelist

@ghost

This comment has been minimized.

ghost commented Mar 21, 2013

Build results will soon be (or already are) available at: http://ci.qiime.org/job/biom-format-github-pr/2/

@joey711

This comment has been minimized.

Contributor

joey711 commented Mar 21, 2013

Sorry, I went missing again. The plan was for me to throw some more unit tests into the R package so we're comfortable making it official. Also not clear exactly what we want the feature set on the R side to be. I'm happy to do more work on it. I need help from a python guru on translating the python-side unit tests that are relevant. Don't need them translated to R, I can do that, I need English language high-level explanation of the tests that are relevant to the subset that we want the R library to be able to do. I can do the rest.

I should have said something, earlier. Sorry. I do respond well to being electronic nudges, though. (See: time-lag)

@gregcaporaso

This comment has been minimized.

Member

gregcaporaso commented Mar 22, 2013

Thanks for the update @joey711. @jrrideout, would you have time next week maybe to get on a brief skype call with @joey711 to discuss what we're doing in the python unit tests? That is probably more efficient than lots of emails.

@jairideout

This comment has been minimized.

Member

jairideout commented Mar 22, 2013

Sure! It may also be useful to have you or @wasade on the call, though if you guys can't make it that's okay.

@joey711, those comments that were added yesterday were from our new continuous integration system. We have something in place now to automatically test pull requests. Sorry for the confusion!

Let's schedule a time to chat via Skype next week if you're available. Feel free to send me an email at jai.rideout@gmail.com with times that you're available and we can schedule from there.

@joey711

This comment has been minimized.

Contributor

joey711 commented Mar 22, 2013

I'm in Mexico all next week. Not sure yet what my schedule or internet access will be. Maybe someone can post a doodle for the following week? I completely agree that a brief skype call makes sense.

@ghost

This comment has been minimized.

ghost commented Apr 17, 2013

Build results will soon be (or already are) available at: http://ci.qiime.org/job/biom-format-github-pr/19/

@jairideout

This comment has been minimized.

Member

jairideout commented Apr 17, 2013

I don't think fixing the merge conflict will cause issues with git, even if we end up going with a rename of some stuff (since your changes are all isolated from the rest of the biom-format project). I'd like to keep this all in the same pull request if possible.

I haven't heard back from @gregcaporaso or @wasade about the name changes, so I'm going to do a detailed review in the meantime.

@joey711

This comment has been minimized.

Contributor

joey711 commented Apr 17, 2013

No, I'm not worried at all about that one merge conflict in the .md file. Very easy to resolve. I'm just pointing out that once I start changing the name of a bunch of files/directories to reflect the new package name, git will treat these as deletes and file adds, breaking the file history anyway.

Either way. I agree the merge should probably work fine without creating a new branch.

@jairideout

View changes

R-code/rbiom/R/BIOM-class.R Outdated
#' print(x)
#' header(x)
#' biom_table(x)
#' observ_meta(x)

This comment has been minimized.

@jairideout

jairideout Apr 17, 2013

Member

Can you change this to observation_metadata and sample_metadata to more closely match the Python API?

@jairideout

View changes

R-code/rbiom/R/BIOM-class.R Outdated
#' @aliases header
#' @aliases biomclass
#' @rdname accessor-functions
biomclass = function(x){

This comment has been minimized.

@jairideout

jairideout Apr 17, 2013

Member

I think the biomclass name is a little confusing. Maybe rename to something like matrix_element_type?

This comment has been minimized.

@wasade

wasade Apr 18, 2013

Contributor

+1

This comment has been minimized.

@joey711

joey711 Apr 18, 2013

Contributor

K.

@jairideout

View changes

R-code/rbiom/R/BIOM-class.R Outdated
#' Retrieve and organize main data from \code{\link{biom-class}},
#' represented as a matrix with index names.
#'
#' @usage biom_table(x, rows, columns, parallel=FALSE)

This comment has been minimized.

@jairideout

jairideout Apr 17, 2013

Member

Can you rename this to biom_data_matrix, or some name that indicates it is returning the actual data contained in the biom table? In the Python API, we try to make the distinction between a BIOM table (which consists of a data matrix, sample/observation IDs, and metadata) and the data matrix itself.

This comment has been minimized.

@joey711

joey711 Apr 18, 2013

Contributor

Sure. Makes sense.

This comment has been minimized.

@joey711

joey711 Apr 18, 2013

Contributor

What is it called in the Python API?

This comment has been minimized.

@jairideout

jairideout Apr 18, 2013

Member

We don't directly expose the data matrix in the Python API, so the "private" member is named _data. I'm not sure what to call it here, though biom_data_matrix makes sense to me. @gregcaporaso @wasade thoughts?

This comment has been minimized.

@gregcaporaso

gregcaporaso Apr 23, 2013

Member

What about biom_data or even data, and the latter might match up better with the Python API. Those also seem like they would match up with the names for observation_metadata and sample_metadata (i.e., we have our data and we have our metadata). My vote is for data, but let me know if I'm missing some reason why that's a bad idea.

This comment has been minimized.

@jairideout

jairideout Apr 23, 2013

Member

👍 for data

This comment has been minimized.

@joey711

joey711 Apr 23, 2013

Contributor

@gregcaporaso @wasade @jrrideout
Woops, I read this too fast the first time. The name biom_data would work without conflicts. data by itself would be strange and confusing to R users, because data is already a special core function that loads example data into the workspace. It takes strings or symbols as its first argument, and then goes and grabs and loads the corresponding data from available loaded packages. Try data() in an R session to see what I mean...

Semantically I can see why data makes sense, just won't play well in the R environment.

Seems like either biom_data or observation_data would both work fine semantically. I tend to like shorter names, so my vote would be for biom_data.

Is that alright? I'm glad I double-checked. Was about to start renaming...

This comment has been minimized.

@jairideout

jairideout Apr 23, 2013

Member

Good catch- let's go with biom_data. Thanks!

@jairideout

This comment has been minimized.

Member

jairideout commented Apr 17, 2013

Let's keep this all within a single pull request (unless we run into issues). To see the full history of a file (including renames), use --follow:

git log --follow <filename>
@jairideout

View changes

R-code/rbiom/R/validity-methods.R Outdated
return(TRUE)
}
################################################################################
## assign the function as the validity method for the otuTable class

This comment has been minimized.

@jairideout

jairideout Apr 18, 2013

Member

otuTable -> biom

This comment has been minimized.

@joey711

joey711 Apr 18, 2013

Contributor

whoops. Typo. will rm

@jairideout

View changes

R-code/rbiom_unit_only.sh Outdated
@@ -0,0 +1,4 @@
# This simply runs the the testthat-package based unit tests in rbiom.

This comment has been minimized.

@jairideout

jairideout Apr 18, 2013

Member

the the -> the

@jairideout

This comment has been minimized.

Member

jairideout commented Apr 18, 2013

I've finished my review, with comments added inline. All comments are very minor, most having to do with name changes.

One general comment: I notice that the public functions/methods follow two naming schemes, one with underscores separating words, and the other without underscores. For example, sample_meta and biomshape. I think you should choose one style and go with it (whatever makes more sense following R's conventions). This will help keep things consistent and easier to predict/remember for users of this library.

Thanks for all of the hard work you've put into this! I think this is very close, though I'd like to wait on merging until @gregcaporaso and @wasade have a chance to weigh in on the package/class naming (and my most recent comments have been addressed).

@wasade

View changes

R-code/rbiom/inst/extdata/min_dense_otu_table.biom Outdated
@@ -0,0 +1,32 @@
{

This comment has been minimized.

@wasade

wasade Apr 18, 2013

Contributor

These look like copies from the biom-format examples directory. Do these need to be replicated? I'm not very familiar with the R install methods, so I apologize if I'm overlooking something.

This comment has been minimized.

@joey711

joey711 Apr 18, 2013

Contributor

I don't like replicating them, either. Not very DRY. Perhaps we can automate updating them when they change in the biom-format examples directory? I'm open to suggestions.

To answer your question, though, yes, it is recommended to include example data in the package itself since we'll be submitting this to CRAN. On the one hand, I imagine their package check servers are internet-facing and so could instead grab the files directly from the biom-format examples directory. The problem with that solution is that you could then remotely break the R package unit-tests and examples/documentation and have no idea. The success of our package checks on CRAN would then be fragile to GitHub changing the way those files are exposed in a URL (I think this happened at least once, recently), or if you guys change those files for some otherwise perfectly-legitimate reason. You can see the dilemma. My thought was possibly just throwing a warning within the larger biom-format project if the corresponding example files don't match. Then we would know to do something about it and update the corresponding unit-tests at the same time, but without ever putting the current stable release/CRAN version at risk.

Happy to entertain alternatives. This seemed easy, stable, and the files are small.

This comment has been minimized.

@jairideout

jairideout Apr 23, 2013

Member

I think for now we should just leave this as-is, though we'll need to come up with a better way to handle this in the future. @joey711 can you create an issue for this on the issue tracker so that we don't forget?

@wasade

This comment has been minimized.

Contributor

wasade commented Apr 18, 2013

Thanks @jrrideout and @joey711, this is a pretty substantial pull request and I'm excited to see it get incorporated into BIOM.

v0.3.4 name changes, method improvements
Did not rename package name or main class. That is still TBD. Need to
add at least a few unit tests for the new accessors: colnames,
rownames, nrow, ncol
@joey711

This comment has been minimized.

Contributor

joey711 commented Apr 19, 2013

Just a few minor unit tests to add, and waiting to hear a final word about the overal package name change...

@ghost

This comment has been minimized.

ghost commented Apr 19, 2013

Build results will soon be (or already are) available at: http://ci.qiime.org/job/biom-format-github-pr/20/

@ghost

This comment has been minimized.

ghost commented Apr 23, 2013

Build results will soon be (or already are) available at: http://ci.qiime.org/job/biom-format-github-pr/22/

@gregcaporaso

This comment has been minimized.

Member

gregcaporaso commented Apr 23, 2013

Hi @joey711,
I'm going to have some time to do a quick review today (won't go too in-depth since that's already done). @jrrideout and I can then chat about any remaining things to do before merge, which I'm guessing would be minimal at this point.

@gregcaporaso

This comment has been minimized.

Member

gregcaporaso commented Apr 23, 2013

OK, I responded to the specific cases where there were questions for me. @jrrideout, I think you should merge this when you and @joey711 think it's ready.

This is really exciting! I think for the time-being we should bill the project as having beta support for R. We'll go through one release cycle with beta support, and then update to full support. This will give us time to allow users to catch any issues, make any non-backward-compatible API changes that we decide are necessary once we start really stress testing, etc. Does that sound right to you guys?

@jairideout

This comment has been minimized.

Member

jairideout commented Apr 23, 2013

Thanks @gregcaporaso for reviewing! That sounds good to me re: beta support.

@joey711 I think there's three outstanding issues before merge:

  • change rbiom package name to biom
  • change biom_table method name to data
  • fix merge conflicts

Sound good?

@joey711

This comment has been minimized.

Contributor

joey711 commented Apr 23, 2013

@gregcaporaso initial round of beta support is fine with me. I'm curious to see how non-OTU files behave, and I'd also like to do some performance tests with much larger example files. I might have something laying around, but if you guys have a big .biom file laying around that you use for testing speed and don't mind sharing, let me know.

@jrrideout Those changes work for me. I'll try to get those in quickly. I think we deal with the merge conflict after merging, right? I thought I already changed that .md file back to the way it was originally... Can just merge then go delete the joey711-originated line from that file.

@jairideout

This comment has been minimized.

Member

jairideout commented Apr 23, 2013

I can't automatically merge your pull request because there's a conflict somewhere. I suggest making the other changes first, and then fixing the conflict last (see my previous comment for instructions).

In general we usually don't merge pull requests with conflicts, and I don't want to merge yours in case the conflicts are farther-reaching than the README.md file.

@joey711

This comment has been minimized.

Contributor

joey711 commented Apr 23, 2013

Ah, right, sorry, now I remember those instructions. Should be soon...

Actually, please see my comment above about why data won't work for us as an accessor method name. This is a special function in core R already...

v0.3.6 rename pkg: rbiom -> biom… and biom_table() -> biom_data()
Basically every file in R-code changed to reflect these two name
changes... All tests passed, and vignette appears to build properly.
@ghost

This comment has been minimized.

ghost commented Apr 23, 2013

Build results will soon be (or already are) available at: http://ci.qiime.org/job/biom-format-github-pr/23/

@joey711

This comment has been minimized.

Contributor

joey711 commented Apr 24, 2013

Alright, done! Should be able to close this and merge Pull 130 instead.

@joey711

This comment has been minimized.

Contributor

joey711 commented Apr 24, 2013

Also, with a final confirmation, and once this is merged, I'll go ahead and submit the package itself to CRAN.

@jairideout jairideout closed this Apr 24, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment