# Datasets 101

**Source:** *Python and HDF5* by Andrew Collette, O'Reilly 2013.

In this section, we take a closer look at HDF5 datasets. We'll learn the basic **C**reate, **R**ead, **U**pdate, and **D**elete (CRUD) operations.

In [1]:
import numpy as np, h5py

In [2]:
f = h5py.File("weather.h5", "r")

## Dataset = (Element)Type + (Logical)Shape + Value

In [3]:
dataset = f["/15/temperature"]

In [4]:
dataset.dtype

dtype('float64')

In [5]:
dataset.shape

(1024,)

In [6]:
dataset[()]  # read the whole dataset (h5py 3 removed the old dataset.value)

array([ 0.08903496,  0.97260236,  0.20384326, ...,  0.07204098,
        0.89709195,  0.12329886])

Sometimes people refer to datasets as *array variables*. Like any variable, a dataset has a type and a value; being an array, it also has a shape.
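To make the "type + shape + value" triple concrete without needing `weather.h5`, here is a minimal sketch. It assumes an in-memory file; the file name, the dataset name, and the data are made up for illustration.

```python
import numpy as np
import h5py

# In-memory file: nothing is written to disk. All names here are hypothetical.
with h5py.File("type_shape_value_demo.h5", "w", driver="core", backing_store=False) as f:
    dataset = f.create_dataset("temperature", data=np.linspace(0.0, 1.0, 1024))

    elem_type = dataset.dtype   # element type
    shape = dataset.shape       # logical shape
    values = dataset[()]        # the value, read into a NumPy array

print(elem_type, shape, type(values).__name__)
```

Indexing with `[()]` or `[...]` reads the whole dataset at once, so for large datasets a slice is usually the better choice.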

## Create

In Python, we can create HDF5 datasets directly from NumPy arrays or from the ground up (*greater control!*).

### From a NumPy Array

In [10]:
f = h5py.File("testfile.hdf5", "w", libver="latest", driver="core")

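(With the `driver="core"` keyword argument we instruct `h5py` to keep the whole HDF5 file in a memory buffer instead of working on disk. By default the buffer is still saved to `testfile.hdf5` when the file is flushed or closed.)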
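If the file should never reach the disk at all, the core driver also accepts `backing_store=False`. The sketch below assumes that option and uses a made-up file name; it only checks that no file appears in the working directory.

```python
import os
import numpy as np
import h5py

name = "in_memory_only.hdf5"  # hypothetical file name

# backing_store=False keeps the "file" purely in memory; nothing is saved on close.
with h5py.File(name, "w", driver="core", backing_store=False) as mem_file:
    mem_file["ones"] = np.ones((5, 2))
    stored_shape = mem_file["ones"].shape

exists_on_disk = os.path.exists(name)
print(stored_shape, exists_on_disk)
```

This is handy for experiments and tests, since there is nothing to clean up afterwards.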

In [11]:
arr = np.ones((5,2))

In [12]:
f["my dataset"] = arr

In [13]:
dset = f["my dataset"]

In [14]:
dset[()]  # read the whole dataset (formerly dset.value)

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

In [15]:
dset

<HDF5 dataset "my dataset": shape (5, 2), type "<f8">

### From Scratch

In [16]:
dset = f.create_dataset("test1", (10, 10))

In [17]:
dset

<HDF5 dataset "test1": shape (10, 10), type "<f4">

In [18]:
dset2 = f.create_dataset("test34", (10, 10), dtype=np.float64)

There are several other keyword arguments, for example, to control the dataset layout in the file, compression, etc. Check out the `h5py` documentation or Andrew's book for details.
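As an illustration of those keyword arguments, the following sketch requests a chunked layout with gzip compression. The dataset name, chunk shape, and compression level are arbitrary choices for the example, not recommendations.

```python
import numpy as np
import h5py

with h5py.File("layout_demo.hdf5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "compressed",
        shape=(1000, 1000),
        dtype=np.float32,
        chunks=(100, 100),     # chunked layout: data is stored in 100x100 tiles
        compression="gzip",    # gzip (deflate) filter applied per chunk
        compression_opts=4,    # gzip level, 0 (fastest) to 9 (smallest)
    )
    dset[0:100, 0:100] = 1.0   # fills exactly one chunk
    layout = (dset.chunks, dset.compression, dset.compression_opts)
    corner_sum = float(dset[0:100, 0:100].sum())

print(layout)
```

Reading and writing look the same as for an uncompressed dataset; the filter is applied transparently.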

In [19]:
dset2

<HDF5 dataset "test34": shape (10, 10), type "<f8">

## Create large/empty dataset

In [24]:
dset_large = f.create_dataset("big_dset", (1024**1,), dtype=np.float64)

In [25]:
dset_large[0:819] = np.arange(819)
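Only part of the dataset has been written at this point. A small sketch of what the rest looks like, using a 1024-element dataset in a throwaway in-memory file: elements that were never written read back as the fill value, which is 0 unless `fillvalue` is given. The second dataset name is made up.

```python
import numpy as np
import h5py

with h5py.File("partial_write_demo.hdf5", "w", driver="core", backing_store=False) as f:
    big = f.create_dataset("big_dset", (1024,), dtype=np.float64)
    big[0:819] = np.arange(819)

    last_written = big[818]     # last element that was assigned
    never_written = big[819:]   # the untouched tail reads as the fill value

    # An explicit fill value for a dataset that is never written at all.
    filled = f.create_dataset("filled", (4,), dtype=np.float64, fillvalue=-1.0)
    filled_values = filled[...]

print(last_written, never_written[:3], filled_values)
```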

In [26]:
list(f.keys())

[u'my dataset', u'test1', u'test34', u'big_dset']

In [28]:
f.flush()  # write to disk; until now the data lived only in RAM

In [29]:
!ls -lrt testfile.hdf5  

-rw-r--r--  1 Yak52  staff  65536 24 Nov 11:01 testfile.hdf5


## Working with resizable datasets 

When you create a dataset, in addition to setting its shape, you can make it resizable up to a certain maximum set of dimensions. In h5py this maximum is called `maxshape`.

Like `shape`, `maxshape` is specified when the dataset is created, but unlike `shape` it can't be changed afterwards. If you don't explicitly choose a `maxshape`, HDF5 creates a non-resizable dataset and sets `maxshape = shape`.
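The default can be checked directly. In this sketch (in-memory file, made-up dataset names) a dataset created without `maxshape` reports its own shape as the maximum, while one created with `maxshape` is stored in chunks, which HDF5 requires for resizable datasets.

```python
import numpy as np
import h5py

with h5py.File("maxshape_default_demo.hdf5", "w", driver="core", backing_store=False) as f:
    # No maxshape given: the dataset is fixed-size and stored contiguously.
    fixed = f.create_dataset("fixed", (2, 2))
    fixed_info = (fixed.shape, fixed.maxshape, fixed.chunks)

    # maxshape given: h5py picks a chunk shape automatically.
    growable = f.create_dataset("growable", (2, 2), maxshape=(4, 4))
    growable_maxshape = growable.maxshape
    is_chunked = growable.chunks is not None

print(fixed_info, growable_maxshape, is_chunked)
```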

In [113]:
dset2 = f.create_dataset('resizable_set', (2,2), maxshape=(2,2))

In [114]:
dset2.shape

(2, 2)

In [115]:
dset2.maxshape

(2, 2)

Let us try to resize...

In [117]:
dset2.resize((1,1))

In [118]:
dset2.shape

(1, 1)

In [119]:
dset2.resize((3,3))

ValueError: Unable to set extend dataset (Dimension cannot exceed the existing maximal size (new: 3 max: 2))

What if we do not know the maximum shape in advance? Passing `None` for an axis of `maxshape` makes that axis unlimited.

In [120]:
dset = f.create_dataset('sizetest', (2,2), dtype=np.int32, maxshape=(None, None))
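A common use of an unlimited axis is appending data as it arrives. The sketch below assumes a one-dimensional dataset that starts empty; the names and the two batches are invented for the example.

```python
import numpy as np
import h5py

with h5py.File("append_demo.hdf5", "w", driver="core", backing_store=False) as f:
    log = f.create_dataset("log", shape=(0,), maxshape=(None,), dtype=np.int32)

    for batch in (np.arange(3), np.arange(3, 8)):
        start = log.shape[0]
        log.resize((start + len(batch),))   # grow along the unlimited axis
        log[start:] = batch                 # fill the newly added region

    final_shape = log.shape
    contents = log[...]

print(final_shape, contents)
```

Resizing in larger steps than one batch at a time is usually cheaper, because every call to `resize` updates metadata in the file.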

## Read

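Reading uses NumPy-style slicing and dicing. Judging by the cell numbers, these cells ran before the resizable examples above, so here `dset` and `dset2` still refer to the (10, 10) datasets `test1` and `test34` from the From Scratch section.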

In [70]:
out = dset2[...]

In [71]:
out

array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [72]:
type(out)

numpy.ndarray

In [74]:
dset[1:3,2:4]

array([[ 0.,  0.],
       [ 0.,  0.]], dtype=float32)
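Because a freshly created dataset is all zeros, the slices above are hard to tell apart. The following sketch repeats the same kinds of reads on a dataset filled with distinct numbers; the file and dataset names are made up. It also shows `read_direct`, which reads a selection into an existing array instead of allocating a new one.

```python
import numpy as np
import h5py

with h5py.File("read_demo.hdf5", "w", driver="core", backing_store=False) as f:
    grid = f.create_dataset("grid", data=np.arange(100, dtype=np.float32).reshape(10, 10))

    everything = grid[...]           # whole dataset as a NumPy array
    block = grid[1:3, 2:4]           # rows 1-2, columns 2-3
    every_other_row = grid[::2, 0]   # strided selection from column 0

    # Read the same block into a preallocated array.
    out = np.empty((2, 2), dtype=np.float32)
    grid.read_direct(out, source_sel=np.s_[1:3, 2:4])

print(block)
print(every_other_row)
```

Each slice is read from the file at the moment it is requested, and the result is an ordinary `numpy.ndarray` that is independent of the file.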

## Update

Updating uses the same NumPy-style slicing, this time on the left-hand side of an assignment.

In [76]:
dset[1:4,1] = 2.0

In [77]:
dset[:,1]

array([ 0.,  2.,  2.,  2.,  0.,  0.,  0.,  0.,  0.,  0.], dtype=float32)
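One point that is easy to miss: a slice read from a dataset is a copy, so modifying it does not touch the file until it is assigned back. A minimal sketch under the same assumptions as before (in-memory file, made-up names):

```python
import numpy as np
import h5py

with h5py.File("update_demo.hdf5", "w", driver="core", backing_store=False) as f:
    grid = f.create_dataset("grid", (10, 10), dtype=np.float32)

    grid[1:4, 1] = 2.0            # a scalar is broadcast over the selection
    grid[0, :] = np.arange(10)    # an array is written to a whole row
    column = grid[:, 1]

    row = grid[0, :]              # this is a copy in memory
    row *= 10                     # the dataset is unaffected so far
    before_write_back = float(grid[0, 9])
    grid[0, :] = row              # write the modified copy back
    after_write_back = float(grid[0, 9])

print(column, before_write_back, after_write_back)
```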

## Delete

The objects in an HDF5 file (groups, datasets, datatype objects) are interlinked. Deleting an object means first and foremost to *unlink* the object. The storage of the underlying object *may or may not* be freed as a result. We'll return to this point later.

In [78]:
list(f.keys())

[u'my dataset', u'test1', u'test34', u'big_dset']

In [80]:
del f['test1']

In [81]:
list(f.keys())

[u'my dataset', u'test34', u'big_dset']
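The difference between unlinking and destroying can be seen with a second hard link. In this sketch the names `data` and `alias` are invented; deleting one name leaves the dataset reachable through the other.

```python
import numpy as np
import h5py

with h5py.File("unlink_demo.hdf5", "w", driver="core", backing_store=False) as f:
    f["data"] = np.arange(5)
    f["alias"] = f["data"]     # second hard link to the same dataset

    del f["data"]              # removes only the link named "data"

    remaining = sorted(f.keys())
    still_there = f["alias"][...]

print(remaining, still_there)
```

Only when the last link to an object is removed does its storage become eligible for reuse, and even then the file does not necessarily shrink.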

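Once the file has been closed (the `f.close()` cell is not shown here), its representation changes:

In [89]:
f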

<Closed HDF5 file>

In [90]:
!ls -lrt *hdf5

-rw-r--r--  1 cozzini  staff         800 Nov 23 09:26 old.hdf5
-rw-r--r--  1 cozzini  staff      167384 Nov 23 09:27 attrsdemo.hdf5
-rw-r--r--  1 cozzini  staff           0 Nov 23 10:42 groups.hdf5
-rw-r--r--  1 cozzini  staff           0 Nov 23 10:43 linksdemo.hdf5
-rw-r--r--  1 cozzini  staff    80348797 Nov 23 11:44 imagetest.hdf5
-rw-r--r--  1 cozzini  staff  8589936768 Nov 23 12:51 testfile.hdf5


### Little endian vs Big endian

HDF5 is designed to preserve data in whatever format you choose. Let us play with endianness, which describes how multibyte numbers are represented. A floating-point number can be stored in memory with the least significant byte first (little-endian) or with the most significant byte first (big-endian).

Modern Intel-style x86 chips use the little-endian format, but data can be stored in HDF5 in either fashion.
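As a quick check of that last claim, the sketch below stores one little-endian and one big-endian array and reads back the dataset types. The file and dataset names are made up, and the file lives only in memory.

```python
import numpy as np
import h5py

with h5py.File("endian_demo.hdf5", "w", driver="core", backing_store=False) as f:
    f["little"] = np.ones(4, dtype="<f4")
    f["big"] = np.ones(4, dtype=">f4")

    # The byte order of the source array is kept as part of the dataset's type.
    stored = (f["little"].dtype.str, f["big"].dtype.str)
    big_values = f["big"][...]

print(stored, big_values)
```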

In [83]:
a = np.ones((1000, 1000), dtype='<f4')  # little-endian 4-byte float

In [84]:
b = np.ones((1000, 1000), dtype='>f4')  # big-endian 4-byte float

In [128]:
from timeit import timeit

In [130]:
timeit(a.mean, number=1000)

0.4468519687652588

In [131]:
timeit(b.mean, number=1000)

1.0125041007995605

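In [41]:
c = b.view("float32")  # reinterpret b's buffer as native-order float32 (no copy)

In [42]:
c[:] = b  # overwrite the buffer with b's values, now in native byte order

In [43]:
b = c  # from here on, b is the native-order array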

In [44]:
timeit(b.mean, number=1000)

0.4452519416809082

In [45]:
d = np.ones((1000, 1000), dtype='f4')  # standard approach: native byte order

In [46]:
timeit(d.mean, number=1000)

0.4459209442138672
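The timings suggest that NumPy pays for byte swapping whenever it computes on non-native data, and that the cost disappears once the array is in native order. The sketch below uses plain NumPy to show the mirrored byte layout and a simpler conversion than the view trick above: `astype` with a native-order dtype, which makes a converted copy rather than converting in place.

```python
import numpy as np

little = np.array([1.0], dtype="<f4")
big = np.array([1.0], dtype=">f4")

# Same value, mirrored byte layout in memory.
little_bytes = little.tobytes()
big_bytes = big.tobytes()

# "=" means native byte order; astype copies and swaps bytes only if needed.
b = np.ones((1000, 1000), dtype=">f4")
native = b.astype("=f4")

print(little_bytes, big_bytes)
print(native.dtype.isnative, float(native.mean()))
```

The copy costs extra memory, so for arrays that barely fit in RAM the in-place approach shown earlier may still be preferable.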