
metadata_to_dataframe order matters #855

Open
guillaume-gricourt opened this issue Feb 10, 2021 · 3 comments


guillaume-gricourt commented Feb 10, 2021

Hi,
With this file:
good.txt
it works.
When the order of the metadata is different:
bad.txt
you get:
ValueError: 2 columns passed, passed data had 6 columns
Maybe the parser could take the maximum number of values into account before parsing them?
biom-format v2.1.10

Member

wasade commented Feb 10, 2021

Hi @guillaume-gricourt, that parser was designed to support classic OTU tables from QIIME1, where the lineages were assured to be balanced, with placeholders for unidentified names. TSVs are not BIOM-Format and are unstructured, which creates a wide range of edge cases.

As a workaround, you could parse the counts without metadata, parse the taxonomy separately, and add it in with biom.Table.add_metadata?

@guillaume-gricourt
Author

Yeah, it's a good workaround.
I create biom files from TSV to load data into the phyloseq package. This file is also my entry point for other analyses.
From now on, when I create this biom file, I'll check the order of the metadata in my TSV file.
Since it's possible to create this kind of biom file, it seems to me that this would be a feature worth implementing?

Member

wasade commented Feb 10, 2021

I'd greatly welcome a pull request to resolve this feature request; otherwise, I'm not sure when I'll be able to get to it. A possible workaround is below.

$ biom convert -i bad.txt -o bad.biom --to-hdf5
$ python
Python 3.6.11 | packaged by conda-forge | (default, Aug  5 2020, 20:19:23)
[GCC Clang 10.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import biom
>>> df = pd.read_csv('bad.txt', sep='\t')
>>> df.set_index('#OTU ID', inplace=True)
>>> t = biom.load_table('bad.biom')
>>> formatted = {k: {'taxonomy': v.split(';')} for k, v in df['taxonomy'].items()}
>>> t.add_metadata(formatted, axis='observation')
>>> with biom.util.biom_open('okay.biom', 'w') as fp:
...   t.to_hdf5(fp, 'converted')
... 
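
For unbalanced lineages, one option in the spirit of the QIIME1 placeholders mentioned earlier in the thread is to pad every lineage to the depth of the longest one before adding it as metadata. The following is only a minimal sketch, not part of biom-format: the helper name `pad_lineages`, the `'Unclassified'` placeholder, and the example lineages are all assumptions made for illustration.

```python
def pad_lineages(taxonomy_by_id, sep=';', placeholder='Unclassified'):
    """Split lineage strings and pad each one to the depth of the longest.

    Hypothetical helper (not a biom-format function). `placeholder` is an
    assumed filler for missing ranks; choose whatever suits your data.
    """
    # Split each lineage string into a list of stripped rank names.
    split = {
        obs_id: [level.strip() for level in lineage.split(sep)]
        for obs_id, lineage in taxonomy_by_id.items()
    }
    # The deepest lineage decides how many levels every entry gets.
    depth = max((len(levels) for levels in split.values()), default=0)
    # Same {id: {'taxonomy': [...]}} shape as `formatted` in the session above.
    return {
        obs_id: {'taxonomy': levels + [placeholder] * (depth - len(levels))}
        for obs_id, levels in split.items()
    }


# Made-up example data: the shorter lineage comes first.
example = {
    'OTU1': 'k__Bacteria; p__Firmicutes',
    'OTU2': 'k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria',
}
formatted = pad_lineages(example)
```

The result has the same shape as `formatted` in the session above, so it could be passed to `t.add_metadata(formatted, axis='observation')` in the same way. Every observation then carries the same number of taxonomy levels regardless of row order. Whether that is enough to avoid the ValueError in metadata_to_dataframe has not been verified here.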
