to_hdf5 cannot handle None's in metadata #609

Open
wasade opened this issue Mar 7, 2015 · 8 comments

@wasade
Member

wasade commented Mar 7, 2015

This table:

> t.metadata()
Out[5]:
(defaultdict(<function <lambda> at 0x10d595848>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5958c0>, {u'pH': 8.0}),
 defaultdict(<function <lambda> at 0x10d595938>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5959b0>, {}))

This will cause to_hdf5 to raise an exception. The same table is valid for to_json.

@wasade wasade added the bug label Mar 7, 2015
wasade added a commit to wasade/picrust that referenced this issue Mar 7, 2015
…efined pH for one of the samples, which is invalid for the hdf5 formatter, issue biocore/biom-format#609
@ElDeveloper
Member

👍 just found this in a table I'm working on, it is forever to be stored in 1.0 😟

@wasade
Member Author

wasade commented Mar 31, 2015

:(

@ElDeveloper
Member

In the meantime I've %s/null/""/g ... not ideal but works.
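
For anyone who would rather not run a blanket text substitution, here is a rough Python sketch of the same workaround. It parses the JSON first, so it only touches real null values and leaves the text "null" inside strings alone. The function name `replace_nulls` and the sample fragment are made up for illustration; this is not part of biom-format.

```python
import json


def replace_nulls(obj, placeholder=""):
    """Recursively replace JSON nulls (None in Python) with a placeholder."""
    if obj is None:
        return placeholder
    if isinstance(obj, dict):
        return {key: replace_nulls(value, placeholder) for key, value in obj.items()}
    if isinstance(obj, list):
        return [replace_nulls(value, placeholder) for value in obj]
    return obj


# Hypothetical fragment shaped like per-sample metadata in a JSON table.
raw = (
    '{"columns": ['
    '{"id": "S1", "metadata": {"pH": 7.0}}, '
    '{"id": "S2", "metadata": {"pH": null}}]}'
)

cleaned = replace_nulls(json.loads(raw))
print(json.dumps(cleaned))
```

Like the substitution above, this changes the type of the field (a float column now contains an empty string), so it is a stopgap rather than a fix.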


@Jorge-C
Member

Jorge-C commented Mar 31, 2015

Unfortunately this is a big question: how to serialize missing data.

We can't serialize None (a Python object) through hdf5, so we need to choose a value that represents missing data in a way that round trips safely. nan could potentially work for float fields, but not for integer- or string-fields. Another option that comes to mind is to save another array that marks whether the corresponding value is missing or not (à la masked arrays from numpy).
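
To make the trade-off concrete, here is a small sketch using only the standard library. The `array` module stands in for a typed on-disk dataset (an assumption for illustration; this is not h5py or the biom writer), and `to_masked` / `from_masked` are hypothetical helpers showing the "values plus mask" idea.

```python
from array import array
import math

ph = [7.0, 8.0, 7.0, None]

# Float field: NaN can stand in for the missing value in-band.
ph_nan = array("d", [float("nan") if v is None else v for v in ph])

# Integer field: a typed integer array has no NaN and rejects None outright.
depth = [10, 12, None]
try:
    array("q", depth)
    int_accepts_none = True
except TypeError:
    int_accepts_none = False


def to_masked(values, fill):
    """Split values into (data, mask); mask[i] is True where the value is missing."""
    mask = [v is None for v in values]
    data = [fill if v is None else v for v in values]
    return data, mask


def from_masked(data, mask):
    """Rebuild the original list, restoring None wherever the mask is set."""
    return [None if missing else v for v, missing in zip(data, mask)]


depth_data, depth_mask = to_masked(depth, fill=0)
typed_depth = array("q", depth_data)  # now storable as a typed integer array
restored = from_masked(typed_depth.tolist(), depth_mask)
```

The mask round trips for any field type, at the cost of storing a second array per metadata field.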

@Jorge-C
Member

Jorge-C commented Mar 31, 2015

Actually, I don't think we're using masked arrays in any of our projects, maybe we should look deeper into them.

@wasade
Member Author

wasade commented Nov 5, 2016

We need better enforcement surrounding metadata. One option is a reserved word indicating a null entry for HDF5, but it would need to be defined in the spec itself, which would trigger a change to format version 2.1.1, which is not ideal. Masked arrays would really need to trigger a 2.2.0 format, as that would mean defining a separate dataset.

None of the paths are fun. I think the best direction is to detect nulls like this at write time, so that in the original example the data are implicitly transformed to:

> t.metadata()
Out[5]:
(defaultdict(<function <lambda> at 0x10d595848>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5958c0>, {u'pH': 8.0}),
 defaultdict(<function <lambda> at 0x10d595938>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5959b0>, {u'pH': None}))

...which I believe would fly for the format and spec.
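
A minimal sketch of that write-time normalization, assuming the metadata arrives as a sequence of dict-like objects as in the output above. `fill_missing_keys` is a hypothetical helper, not an existing biom-format function, and how the writer then encodes the explicit None is outside this sketch.

```python
from collections import defaultdict


def fill_missing_keys(metadata):
    """Return one plain dict per entry, with every key seen anywhere present in all.

    Keys missing from an entry are filled with an explicit None.
    """
    all_keys = set()
    for md in metadata:
        all_keys.update(md)

    # dict.get does not trigger a defaultdict's factory, so the inputs are not mutated.
    return [{key: md.get(key) for key in sorted(all_keys)} for md in metadata]


# The table from the original report: the last sample has no pH entry.
metadata = (
    defaultdict(lambda: None, {"pH": 7.0}),
    defaultdict(lambda: None, {"pH": 8.0}),
    defaultdict(lambda: None, {"pH": 7.0}),
    defaultdict(lambda: None, {}),
)

filled = fill_missing_keys(metadata)
```

After this step every sample carries the same set of keys, so the writer sees a uniform set of fields and only has to decide how to represent None.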

@wasade wasade added this to the 2.1.6 milestone Nov 5, 2016
@josenavas
Member

I think that works @wasade and shouldn't need a new format/spec

@wasade
Member Author

wasade commented Feb 4, 2017

Deferring to 2.2 as this is also lumped into the grand ol' refactor of the formatters and parsers. It would be nice to defer the type detection as well back to pandas as this could get nasty fast.

@wasade wasade modified the milestones: BIOM 2.2, 2.1.6 Feb 4, 2017