to_hdf5 cannot handle None's in metadata #609

wasade · 2015-03-07T18:37:03Z

This table:

> t.metadata()
Out[5]:
(defaultdict(<function <lambda> at 0x10d595848>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5958c0>, {u'pH': 8.0}),
 defaultdict(<function <lambda> at 0x10d595938>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5959b0>, {}))

will cause to_hdf5 to raise an exception. The same table is valid for to_json.


ElDeveloper · 2015-03-31T00:53:36Z

👍 just found this in a table I'm working on, it is forever to be stored in 1.0 😟

wasade · 2015-03-31T01:04:54Z

:(

ElDeveloper · 2015-03-31T01:19:51Z

In the meantime I've run %s/null/""/g on the file ... not ideal, but it works.

Jorge-C · 2015-03-31T01:34:46Z

Unfortunately this is a big question: how to serialize missing data.

We can't serialize None (a Python object) through hdf5, so we need to choose a value that represents missing data in a way that round-trips safely. NaN could potentially work for float fields, but not for integer or string fields. Another option that comes to mind is to save a second array that marks whether each corresponding value is missing (à la masked arrays from numpy).
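
The second option can be illustrated without any HDF5 machinery. What follows is a minimal sketch in plain Python, not how biom-format stores metadata. The helper names `encode_with_mask` and `decode_with_mask` and the fill values are assumptions made up for the example. Values go into one array with a placeholder fill, and a parallel boolean array records which entries were missing.

```python
def encode_with_mask(values, fill):
    """Split values into (data, mask); mask[i] is True where values[i] is None."""
    mask = [v is None for v in values]
    data = [fill if v is None else v for v in values]
    return data, mask


def decode_with_mask(data, mask):
    """Rebuild the original values, restoring None wherever the mask is set."""
    return [None if m else v for v, m in zip(data, mask)]


# The same approach covers float, integer and string fields,
# which a NaN sentinel cannot do.
ph_data, ph_mask = encode_with_mask([7.0, 8.0, 7.0, None], fill=0.0)
depth_data, depth_mask = encode_with_mask([10, None, 30], fill=0)
site_data, site_mask = encode_with_mask(["gut", None], fill="")
```

The fill value never leaks back out, because decoding consults the mask first. The cost is one extra array per metadata field.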

Jorge-C · 2015-03-31T01:38:01Z

Actually, I don't think we're using masked arrays in any of our projects. Maybe we should look deeper into them.

wasade · 2016-11-05T20:56:05Z

We need better enforcement surrounding metadata. One option is a reserved word for indicating a null entry for HDF5, but it would need to be defined in the spec itself, which would trigger a change to format version 2.1.1, which is not ideal. The masked arrays would really need to trigger a 2.2.0 format, as that would be defining a separate dataset.

None of these paths are fun. I think the best direction is to detect nulls like this at write time, so that in the original example the data are implicitly transformed to:

> t.metadata()
Out[5]:
(defaultdict(<function <lambda> at 0x10d595848>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5958c0>, {u'pH': 8.0}),
 defaultdict(<function <lambda> at 0x10d595938>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5959b0>, {u'pH': None}))

...which I believe would fly for the format and spec.
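
A minimal sketch of that write-time normalization follows. It assumes a hypothetical helper, `fill_missing_keys`, which is not part of biom-format. The helper takes the per-sample metadata dicts and returns plain-dict copies in which every key seen in any sample is present, with None standing in for missing values.

```python
def fill_missing_keys(metadata):
    """Return plain-dict copies of metadata where every entry has the same keys.

    Keys absent from an entry are added with a value of None.
    """
    # Collect the union of keys across all samples, keeping first-seen order.
    all_keys = []
    for md in metadata:
        for key in md:
            if key not in all_keys:
                all_keys.append(key)

    # dict.get returns None for absent keys and does not mutate the input,
    # even when the input is a defaultdict.
    return tuple({key: md.get(key) for key in all_keys} for md in metadata)


original = ({'pH': 7.0}, {'pH': 8.0}, {'pH': 7.0}, {})
normalized = fill_missing_keys(original)
```

The sketch only makes the key sets uniform. How the writer then encodes the explicit None is a separate question that it does not address.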

josenavas · 2016-11-09T01:20:22Z

I think that works @wasade and shouldn't need a new format/spec

wasade · 2017-02-04T08:04:25Z

Deferring to 2.2, as this is also lumped into the grand ol' refactor of the formatters and parsers. It would also be nice to defer the type detection to pandas, as this could get nasty fast.

wasade added the bug label Mar 7, 2015

wasade added a commit to wasade/picrust that referenced this issue Mar 7, 2015

TST: the two tests that can be run pass now. The OTU table had an und…

f3e2571

…efined pH for one of the samples, which is invalid for the hdf5 formatter, issue biocore/biom-format#609

wasade added this to the 2.1.6 milestone Nov 5, 2016

wasade modified the milestones: BIOM 2.2, 2.1.6 Feb 4, 2017
