# Datasets 101

**Source:** *Python and HDF5* by Andrew Collette, O'Reilly 2013.

In this section, we take a closer look at HDF5 datasets. We'll learn the basic **C**reate, **R**ead, **U**pdate, and **D**elete (CRUD) operations.

In [1]:
import numpy as np, h5py

In [2]:
f = h5py.File("weather.h5", "r")

## Dataset = (Element)Type + (Logical)Shape + Value

In [3]:
dataset = f["/15/temperature"]

In [4]:
dataset.dtype

dtype('float64')

In [5]:
dataset.shape

(1024,)

In [6]:
dataset[...]  # reads the whole dataset; the older dataset.value attribute was removed in h5py 3.0

array([ 0.62721836,  0.78295604,  0.05988774, ...,  0.34764036,
        0.50080988,  0.61436234])

Sometimes people refer to datasets as *array variables*. Every variable has a type and value. What are the type and value of an HDF5 dataset?
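To make the type + shape + value picture concrete, here is a minimal sketch that builds a stand-in for the temperature dataset and reads back all three parts. The file name `sketch_type_shape.h5` and the generated data are assumptions for illustration only; they are not the contents of `weather.h5`.

```python
import numpy as np
import h5py

# In-memory stand-in file (assumption: nothing is written to disk).
with h5py.File("sketch_type_shape.h5", "w", driver="core", backing_store=False) as f:
    # Hypothetical data; the real weather.h5 values are not reproduced here.
    f.create_dataset("15/temperature", data=np.linspace(0.0, 1.0, 1024))
    dataset = f["/15/temperature"]

    elem_type = dataset.dtype   # element type
    shape = dataset.shape       # logical shape
    value = dataset[...]        # value, read into a NumPy array

print(elem_type, shape, type(value).__name__)
```

The type and shape are metadata that can be inspected without reading any data; only the last line inside the `with` block actually transfers the elements into memory.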

## Create

In Python, we can create HDF5 datasets directly from NumPy arrays or from the ground up, which gives *greater control*.

### From a NumPy Array

In [7]:
f = h5py.File("testfile.hdf5", "w", libver="latest", driver="core")

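(With the `driver="core"` keyword argument we instruct `h5py` to keep the HDF5 file in a memory buffer while it is open. Note that `h5py` defaults to `backing_store=True` for this driver, so the buffer is still written to `testfile.hdf5` on close unless `backing_store=False` is passed.)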
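As a small sketch of a purely in-memory file, the variant below adds `backing_store=False`, so that the buffer is discarded on close. The file name `scratch_in_memory.h5` and the variable names are hypothetical and separate from the cells in this notebook.

```python
import os

import numpy as np
import h5py

# backing_store=False: keep everything in memory and discard it on close.
mem = h5py.File("scratch_in_memory.h5", "w", driver="core", backing_store=False)
mem["ones"] = np.ones((5, 2))
total = float(mem["ones"][...].sum())  # 5 * 2 ones
mem.close()

# Check whether a file with that name appeared on disk.
exists_on_disk = os.path.exists("scratch_in_memory.h5")
print(total, exists_on_disk)
```

This pattern is handy for experiments and tests where leaving files behind would be a nuisance.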

In [8]:
arr = np.ones((5,2))

In [9]:
f["my dataset"] = arr

In [10]:
dset = f["my dataset"]

In [13]:
dset

<HDF5 dataset "my dataset": shape (5, 2), type "<f8">

### From Scratch

In [14]:
dset = f.create_dataset("test1", (10, 10))

In [15]:
dset

<HDF5 dataset "test1": shape (10, 10), type "<f4">

In [16]:
dset = f.create_dataset("test2", (10, 10), dtype=np.complex64)

There are several other keyword arguments, for example, to control the dataset layout in the file, compression, etc. Check out the `h5py` documentation or Andrew's book for details.
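As an illustration of such keyword arguments, the following sketch creates a chunked, gzip-compressed dataset and reads the settings back. The dataset name `compressed`, the chunk shape, and the compression level are arbitrary choices for this example, not recommendations.

```python
import numpy as np
import h5py

with h5py.File("sketch_layout.h5", "w", driver="core", backing_store=False) as f:
    dset_c = f.create_dataset(
        "compressed",
        shape=(100, 100),
        dtype=np.float32,
        chunks=(10, 100),      # chunked layout: each chunk holds 10 full rows
        compression="gzip",    # gzip (deflate) filter
        compression_opts=4,    # gzip compression level
    )
    layout = (dset_c.chunks, dset_c.compression, dset_c.compression_opts)

print(layout)  # → ((10, 100), 'gzip', 4)
```

Compression in HDF5 operates on chunks, which is why the two options usually appear together.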

In [17]:
dset

<HDF5 dataset "test2": shape (10, 10), type "<c8">

## Read

Datasets are read with NumPy-style slicing and dicing; each read returns a NumPy array.

In [18]:
out = dset[...]

In [19]:
out

array([[ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j]], dtype=complex64)

In [20]:
type(out)

numpy.ndarray

In [21]:
dset[1:3,2:4]

array([[ 0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j]], dtype=complex64)
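Since an all-zero dataset makes it hard to see which elements a slice selects, here is a sketch with distinguishable values. The dataset name `grid` and its contents are made up for this example.

```python
import numpy as np
import h5py

with h5py.File("sketch_read.h5", "w", driver="core", backing_store=False) as f:
    # Element (i, j) holds 10*i + j, so each value encodes its own position.
    grid = f.create_dataset("grid", data=np.arange(100).reshape(10, 10))

    block = grid[1:3, 2:4]   # 2x2 sub-block, as in the cell above
    strided = grid[::5, 0]   # rows 0 and 5 of the first column
    scalar = grid[9, 9]      # a single element

print(block.tolist())    # → [[12, 13], [22, 23]]
print(strided.tolist())  # → [0, 50]
print(int(scalar))       # → 99
```

Only the selected elements are read from the file, which matters when the dataset is much larger than available memory.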

## Update

Updates use the same NumPy-style slicing and dicing, this time on the left-hand side of an assignment.

In [22]:
dset[1:4,1] = 2.0+0.j

In [23]:
dset[:,1]

array([ 0.+0.j,  2.+0.j,  2.+0.j,  2.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
        0.+0.j,  0.+0.j,  0.+0.j], dtype=complex64)
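One detail worth seeing in code: a write through the dataset changes the file, while modifying an array that was read earlier does not. The sketch below recreates a dataset like `test2` in a scratch file (the file name and the values written are assumptions) and shows both cases.

```python
import numpy as np
import h5py

with h5py.File("sketch_update.h5", "w", driver="core", backing_store=False) as f:
    d = f.create_dataset("test2", (10, 10), dtype=np.complex64)

    d[1:4, 1] = 2.0 + 0.0j                               # scalar broadcast over the slice
    d[8, :] = np.full(10, 1 + 1j, dtype=np.complex64)    # a whole row from an array

    snapshot = d[...]       # a NumPy copy of the data
    snapshot[0, 0] = 9.0    # changes the copy only
    in_file = d[0, 0]       # the dataset itself is unchanged

    col = d[:, 1]

print(col)
print(in_file)
```

To push changes made on `snapshot` back into the file, it has to be assigned to the dataset again, for example with `d[...] = snapshot`.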

## Delete

The objects in an HDF5 file (groups, datasets, datatype objects) are interlinked. Deleting an object first and foremost means *unlinking* it from its parent group. The storage of the underlying object *may or may not* be freed as a result. We'll return to this point later.

In [24]:
list(f.keys())

[u'my dataset', u'test1', u'test2']

In [25]:
del f['test2']

In [26]:
list(f.keys())

[u'my dataset', u'test1']

In [27]:
f.close()
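The unlinking behavior described above can be observed with a second hard link. In this sketch (the name `alias` and the scratch file are hypothetical), deleting `test2` removes only that name; the dataset stays reachable through the other link.

```python
import numpy as np
import h5py

with h5py.File("sketch_delete.h5", "w", driver="core", backing_store=False) as f:
    f["test2"] = np.zeros((10, 10))
    f["alias"] = f["test2"]    # second hard link to the same dataset

    del f["test2"]             # removes the link named "test2" only

    names = sorted(f.keys())
    still_there = f["alias"].shape

print(names)        # → ['alias']
print(still_there)  # → (10, 10)
```

What happens to the storage once the last link is gone is a separate question that this sketch does not answer.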

### Advanced Topic for Discussion

*What really happens when a dataset is deleted?*