
Contingency Table API #848

Open
ebolyen opened this Issue Feb 17, 2015 · 25 comments

ebolyen (Member) commented Feb 17, 2015

This will be a place to discuss the specific features that a contingency table should be able to support and what that API will look like.

ebolyen referenced this issue Feb 17, 2015: Add Contingency Table Object #821

ebolyen added the on deck label Feb 17, 2015

mortonjt (Contributor) commented Feb 18, 2015

This may be a stupid question about the skbio.io.biom registry mentioned before.

But isn't BIOM backwards compatible? It seems to me that it would be more straightforward to use the BIOM parsers and create the Contingency table from the biom.Table object.

mortonjt (Contributor) commented Feb 18, 2015

Also I remember there was some discussion about whether to use Pandas or the biom.Table object as the core data type. Here are the pros and cons that I see

pandas.DataFrame
Pros

  • Dense matrix storage (could be easier for users to manipulate)
  • Nice text visualization
  • Has an apply() function

Cons

  • No collapse() functionality (this could be added, but it could get messy)
  • No support for 2D metadata (can be remedied with 3 pandas objects)

biom.Table
Pros

  • Sparse matrix storage. Lenient on memory. Could give a speed boost via sparse matrix manipulations
  • Nice support for 2D metadata

Cons

  • No apply() function (would need to be built from scratch)
  • Visualization of raw data can overflow the screen buffer

I'm still in favor of having the BIOM table as the underlying datatype. I don't think it would be too difficult to add in an apply() function. And it already has many similar functions supported by Pandas. But I don't have particularly strong opinions - I think both solutions are reasonable.

If we do want to go the Pandas route, converting from a biom.Table to a pandas.DataFrame is fairly straightforward.

It's just:

from biom import load_table
import pandas as pd

table = load_table("some_table.biom")
# matrix_data is the public scipy sparse representation (avoids the
# private _get_sparse_data() accessor); rows are observations, so
# transpose to get samples x observations
mat = table.matrix_data.toarray().T
df = pd.DataFrame(mat,
                  index=table.ids(axis='sample'),
                  columns=table.ids(axis='observation'))
Jorge-C (Member) commented Feb 18, 2015

mortonjt (Contributor) commented Feb 18, 2015

Hmm. I have never used that feature. Good catch @Jorge-C !

gregcaporaso (Member) commented Feb 18, 2015

@mortonjt, note that Pandas does have GroupBy, which I think gets the collapse functionality that you mention.

Can you give an example of what you mean by 2D metadata?

mortonjt (Contributor) commented Feb 18, 2015

Pardon my sloppy notation. By 2D metadata I just meant the observation and sample metadata found in biom (e.g., zip codes and taxonomy information).

Yes, I am familiar with the pandas group by operation. I actually used it to perform the collapse function before I became familiar with biom.

My major concern with the pandas approach is figuring out how to manage the metadata. I think merging all of the metadata and abundance data into a single dataframe will cause much heartache later.

gregcaporaso (Member) commented Feb 18, 2015

Ok, got it. It's definitely not immediately obvious what the right option is. I think multiple dataframes is one option (as you point out). Using a multi-index might work out too, but I do think multiple dataframes is probably better.

Re: groupby/collapse, if we went the DataFrame-under-the-hood route, I think we'd want to write our own collapse that wraps GroupBy. That would simplify the interface for the operations we need regularly, while still giving us the power of the more advanced pandas functionality.
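As a sketch of what such a collapse wrapper over GroupBy could look like (the function name, signature, and aggregation default are all hypothetical, not anything that exists in scikit-bio or biom):

```python
import pandas as pd

def collapse(df, mapping, axis=0, how="sum"):
    """Hypothetical collapse wrapper over pandas GroupBy.

    Groups the ids on the chosen axis through `mapping` (e.g. an OTU
    id -> a phylum name) and aggregates the counts with `how`.
    axis=0 collapses rows (observations), axis=1 collapses columns
    (samples).
    """
    if axis == 0:
        return df.groupby(df.index.map(mapping.get)).agg(how)
    # collapse columns by transposing, grouping, and transposing back
    return df.T.groupby(df.T.index.map(mapping.get)).agg(how).T
```

Callers would see only collapse(); the GroupBy machinery stays an implementation detail, which is the simplification being proposed.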

mortonjt (Contributor) commented Feb 18, 2015

Having a multi-index is not a bad idea. However, doing so would enforce the following design constraints

  • DataFrames with multiple different metadata fields for OTUs would be tricky to handle.
  • DataFrames can only have a specific biological data type (cannot merge OTUs with KEGGs). This could be a major roadblock for statistical tools. I can think of scenarios where I would want to simultaneously analyze different biological datasets and having them in a single contingency table would be useful.

I know that the collapse function is very doable using pandas. So that shouldn't be an issue.

ebolyen (Member, Author) commented Feb 18, 2015

I appreciate all the interest in this topic, but if we could keep the discussion focused on what API features would be useful, and less about what certain implementation decisions make convenient/difficult, that would be great! The topic seems to be trending in that direction anyways, so thank you all.

To directly address why we wouldn't just import the biom package:
There is no need (for us) to maintain a core bioinformatics object external to scikit-bio. It would be easier to maintain API compatibility and ensure our deprecation process is viable if the concept of biom.table was migrated instead of wrapped. This also allows the biom-format to focus more on what needs to exist in the file format and to create some useful command line tools with scikit-bio without needing to worry about API management of the biom.table object (these considerations would be routine for scikit-bio anyways). Basically, we won't be duplicating code so much as deprecating biom.table in favor of scikit-bio's contingency table (there is likely to be a fair amount of logic and API migrated in this endeavour)

But before we do any of that, we need to figure out what a contingency table can do, what those operations look like, and why those are the best operations to provide. Let's keep that the focus of this discussion from here on out. Feel free to open an issue to discuss implementation, but I feel that is premature at this moment (next week would be a better time to worry about it). Discussions about the pandas and biom.table API are of course completely encouraged as they can serve as a model for our contingency table.

mortonjt (Contributor) commented Feb 18, 2015

That makes sense.

I do agree with @ebolyen . It would be wise not to rush through this.

At this point, I'm willing to buy that pandas may be the way to go.

As far as API functions go, here are a few functions that I think would be key to implement. I'm going to try to rank them by importance

  • read()
  • write()
  • apply( func, axis )
  • filter(func, axis)
  • groupby( id, axis )
  • merge(table)
  • collapse( func, axis )
  • sort( axis, index [optional] )
  • __str__()
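To make the list above concrete, here is one way those operations might compose around a DataFrame. Everything here is hypothetical (class name, method signatures, axis conventions) and is just a sketch of the shape of the API, not a proposed implementation:

```python
import pandas as pd

class ContingencyTable:
    """Hypothetical sketch: a thin wrapper holding counts as a
    DataFrame with observations on rows and samples on columns."""

    def __init__(self, data):
        self._df = pd.DataFrame(data)

    def filter(self, func, axis=0):
        """Keep rows (axis=0) or columns (axis=1) where
        func(id, values) returns True."""
        if axis == 0:
            keep = [i for i, row in self._df.iterrows() if func(i, row)]
            return ContingencyTable(self._df.loc[keep])
        keep = [c for c in self._df.columns if func(c, self._df[c])]
        return ContingencyTable(self._df[keep])

    def apply(self, func, axis=0):
        """Apply func per observation (axis=0) or per sample (axis=1).
        Note this is inverted relative to pandas' apply convention."""
        return ContingencyTable(self._df.apply(func, axis=1 - axis))

    def __str__(self):
        return str(self._df)
```

The point of a sketch like this is that read()/write() and merge()/collapse() would hang off the same object, so the ranking above maps directly onto a method list.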
mortonjt (Contributor) commented Feb 18, 2015

@ebolyen , I thought a little more about what you said.

We could potentially be shooting ourselves in the foot if we move ahead and try to rewrite all of the biom parsers for each biom version. For every future release of biom, we would have to write a new set of parsers. To me, this seems to be a waste of developer time.

Instead, I think it would be more manageable to import BIOM, but have a single conversion method that converts the BIOM object to our new contingency table object. That way, we could take full advantage of BIOM's backwards compatibility while keeping the APIs relatively separate.

ebolyen (Member, Author) commented Feb 18, 2015

@mortonjt BIOM is honestly less backwards compatible than you might think. Additionally, it is largely the same development team either way, so those parsers have to be written one way or another if the format is updated again. What scikit-bio does provide is a very robust IO mechanism, which should actually make identifying and parsing different variations of BIOM much simpler than it is right now.

josenavas (Member) commented Feb 18, 2015

👍 to @ebolyen

mortonjt (Contributor) commented Feb 18, 2015

That makes sense. Thanks for clarifying @ebolyen !

mortonjt (Contributor) commented Feb 22, 2015

At this point, it seems that we agree on the following

  • The Contingency table should be completely separate from the BIOM API
  • BIOM parsers will need to be built
  • A pandas.DataFrame should be used as the core data type

I thought a bit about @gregcaporaso's comment about the pandas MultiIndex and I actually think it is a really good idea - it naturally allows for an intuitive display/organization of taxonomies.

However, if a MultiIndex is to be used, the following should be considered

  • Do we care about metadata for OTUs other than taxonomies? For example, would we want to store information about gram positive vs. gram negative for each OTU?
  • Do we care about storing different data types (e.g. store metabolites, genes and OTUs) all in the same table?

I don't think these issues are insurmountable when designing the table - but they are things to consider during implementation.

The other option is to use multiple DataFrames to store the metadata, or a combination of the two approaches.
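The multiple-DataFrame layout mentioned above could be as simple as three aligned frames, with the wrapper object enforcing that the metadata indexes match the count table's axes. This is only an illustration; all the variable names and example values are made up:

```python
import pandas as pd

# counts: observations x samples
counts = pd.DataFrame(
    [[10, 0], [3, 7]],
    index=["otu1", "otu2"], columns=["sampleA", "sampleB"])

# per-observation metadata, indexed by the same observation ids
obs_md = pd.DataFrame(
    {"taxonomy": ["k__Bacteria; p__Firmicutes",
                  "k__Bacteria; p__Bacteroidetes"]},
    index=counts.index)

# per-sample metadata, indexed by the same sample ids
sample_md = pd.DataFrame(
    {"zip_code": ["86001", "92093"]}, index=counts.columns)

# the alignment invariant the wrapper would enforce:
assert obs_md.index.equals(counts.index)
assert sample_md.index.equals(counts.columns)
```

Keeping the metadata out of the count frame sidesteps the mixed-dtype problems of a single merged DataFrame, at the cost of maintaining the alignment invariant on every filter/collapse.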

ebolyen (Member, Author) commented Feb 22, 2015

Thanks for bringing up the interesting point about observations being of different types, @mortonjt. Also, keep in mind that we aren't really worried yet about how to do something. It may be true that a pandas dataframe is convenient, but we may also find a numpy array is all we need. Either way, that is a decision to make after we decide what we want to accomplish, rather than constraining ourselves off the bat to what a particular implementation makes easy or hard.

Back to what you brought up regarding different data types:

I think the ability to have polymorphic observations within a table may throw off some of the statistics, which assume observations are equally weighted. That being said, I don't think ContingencyTable should ever particularly care what an observation is. Its primary focus is the contingency itself between observations and samples, not necessarily what either of them represent.

Another concept we are starting to toy with is the idea that per-sample metadata doesn't always belong to an object so much as to an ID. In an effort to reduce duplication of ID validation, it occurred to us that a discrete ID and an IDCollection/Map/Whatever class may be useful. We don't know if this is a user-facing abstraction yet, but it would seem that metadata about an observation and a sample could fit well with this. Something I think might be even more powerful is if the ID represented an enforceable relationship. So instead of storing taxonomy as metadata on an observation, you would instead associate a taxonomy with an ID. IDs may even be able to track collapsing and renaming (like sequence demultiplexing).
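One way to picture the ID-with-associated-data idea, including the provenance tracking through collapsing (all class and function names here are purely hypothetical; nothing like this exists in scikit-bio):

```python
class ObservationID:
    """Hypothetical: an id that carries its own associations, so
    taxonomy lives with the id rather than as table metadata."""

    def __init__(self, id_, taxonomy=None, derived_from=None):
        self.id = id_
        self.taxonomy = taxonomy or []
        # provenance: the ids this one was collapsed/renamed from
        self.derived_from = derived_from or []

def collapse_ids(ids, level):
    """Group ids by a taxonomy prefix of length `level`, returning
    new ids that remember which originals they came from."""
    groups = {}
    for i in ids:
        key = tuple(i.taxonomy[:level])
        groups.setdefault(key, []).append(i)
    return [ObservationID("; ".join(key), taxonomy=list(key),
                          derived_from=members)
            for key, members in groups.items()]
```

Under this model the table itself never inspects taxonomy; collapsing is a transformation on the ids, and the table just follows the id mapping.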

mortonjt (Contributor) commented Feb 25, 2015

I like the idea of associating taxonomy with an ID.
Here are a couple of questions:

  • Would we want to enforce symmetry between sample metadata and observation metadata? For example, it makes sense to collapse observations based on taxonomy. But would we be interested in collapsing samples based on location (e.g., collapsing samples by state or country)?
  • Would we want to have different types of metadata? Since there are metadata fields that cannot be associated with an ID, would it make sense to separate these fields into two groups - fields associated with an ID and other descriptor fields?
wasade (Contributor) commented Feb 25, 2015

Yes, in the case of sample replicates or even sample taxonomy (in the case of comparative genomics).

mortonjt (Contributor) commented Feb 27, 2015

Ok. That would complicate things a bit. It would imply that multiple metadata fields could serve as an index for observations/samples. In addition, how would samples be combined? That would imply that there is some normalization scheme allowing samples to be compared to each other. As far as I can tell, that is an open-ended problem.

Let me take a step back and pose the following question.

  • What key features are missing in the BIOM API that we think are crucial in the Contingency table?

I imagine there must be several key features missing; otherwise we could have just merged/ported/moved the entire BIOM API into skbio.

Jorge-C (Member) commented Feb 27, 2015

👍 being able to answer that question will get us much closer to a proposal that is both concrete and improves on the state of the art.

wasade (Contributor) commented Feb 27, 2015

@gregcaporaso, what repo did the code for handling sample replicates go into? Can't find it right now

@mortonjt, looking through, potentially just apply, although it is nearly the same as transform with the exception that you might want to apply to the whole table, not just per sample or observation.

To be honest, we have spent a lot of time over the last few years refining the API, in particular, this past spring. The biggest thing I can see is perhaps changing how we manage metadata, particularly group metadata (e.g., a phylogenetic tree). But, it isn't clear to me if that necessitates API changes and I don't think that should be a blocker for getting the existing API in place with maybe some method name changes, and of course, pulling out IO operations.

Jorge-C (Member) commented Feb 28, 2015

I agree with @wasade's comments. What are the drawbacks of porting biom's table with its current API / reasons to start from scratch?

gregcaporaso (Member) commented Mar 2, 2015

I agree that we should discuss this. Could this be a discussion topic for next week's sprint?

antgonza (Contributor) commented Mar 2, 2015

gregcaporaso (Member) commented Mar 5, 2015

Thanks for the input everyone, we've been discussing this and are thinking that we should hold off on this for now as there are other higher priority issues. We can discuss further next week if we want more discussion at this time.

@gregcaporaso gregcaporaso removed the on deck label Mar 5, 2015

@jairideout jairideout added this to the 0.4.1: The Contingency Plan milestone Mar 9, 2015

@jairideout jairideout removed this from the 0.4.1: The Contingency Plan milestone Jun 15, 2015

@ebolyen ebolyen removed discussion labels Aug 7, 2015
