# Structured Data

Structured arrays and record arrays provide efficient storage for compound, heterogeneous data. For most work of this kind, however, a pandas DataFrame is often the better choice.

In [1]:
import numpy as np

In [3]:
name = ['Alice', 'Bob', 'Cathy', 'Doug']  # strings
age = [25, 45, 37, 19]                    # ints
weight = [55.0, 85.5, 68.0, 61.5]         # floats 

Recall we created a simple array using:

In [4]:
x = np.zeros(4, dtype=int)

For a structured array we can use:

In [5]:
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
print(data.dtype)

[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]


'U10' is a Unicode string with a maximum length of 10 characters. <br>
'i4' is a 4-byte (32-bit) integer. <br>
'f8' is an 8-byte (64-bit) float.
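As a quick check of these format codes, the sketch below asks NumPy for the size of each field. The variable names `dt` and `field_sizes` are illustrative only; the sketch assumes the default packed layout (no `align=True`).

```python
import numpy as np

dt = np.dtype({'names': ('name', 'age', 'weight'),
               'formats': ('U10', 'i4', 'f8')})

# Size of each field in bytes. NumPy stores 'U' strings with 4 bytes per
# character, so 'U10' occupies 40 bytes even though it holds 10 characters.
field_sizes = {field: dt[field].itemsize for field in dt.names}
print(field_sizes)   # {'name': 40, 'age': 4, 'weight': 8}

# With the default packed layout, one record is the sum of its fields.
print(dt.itemsize)   # 52
```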

In [8]:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

[('Alice', 25, 55. ) ('Bob', 45, 85.5) ('Cathy', 37, 68. )
 ('Doug', 19, 61.5)]
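The same array can also be built in one step from a list of tuples, one tuple per record. This is a minimal sketch; the name `data_rows` is an assumption introduced here for illustration.

```python
import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])

# zip() groups the three lists into (name, age, weight) tuples;
# each tuple becomes one record of the structured array.
data_rows = np.array(list(zip(name, age, weight)), dtype=dt)

print(data_rows['age'])      # [25 45 37 19]
print(data_rows[2]['name'])  # Cathy
```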


We can now refer to these by index or name:

In [14]:
data['name']

array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')

In [12]:
data[0]

('Alice', 25, 55.)

In [11]:
data[-1]['name']

'Doug'

We can combine this with Boolean masking:

In [15]:
# get all names where age is under 30
data[data['age'] < 30]['name']

array(['Alice', 'Doug'], dtype='<U10')
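Masks on different fields can be combined with the element-wise operators `&` and `|`. The sketch below rebuilds the same data so that it runs on its own; the threshold values are arbitrary choices for illustration.

```python
import numpy as np

data = np.zeros(4, dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

# Parentheses are required: & binds more tightly than < and >.
mask = (data['age'] < 40) & (data['weight'] > 60)
selected = data[mask]['name']
print(selected)   # ['Cathy' 'Doug']
```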

If we want to do anything more complicated than this, it's probably best to use a pandas DataFrame.

## Creating Structured Arrays

We saw this dictionary method:

In [16]:
np.dtype({'names':('name', 'age', 'weights'),
          'formats':('U10', 'i4', 'f8')})

dtype([('name', '<U10'), ('age', '<i4'), ('weights', '<f8')])

We can use Python types or NumPy dtypes if we want:

In [17]:
np.dtype({'names':('name', 'age', 'weights'),
          'formats':((np.str_, 10), int, np.float32)})

dtype([('name', '<U10'), ('age', '<i4'), ('weights', '<f4')])

We can pass a list of tuples:

In [18]:
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])

If the names of the types don't matter:

In [19]:
np.dtype('S10,i4,f8')

dtype([('f0', 'S10'), ('f1', '<i4'), ('f2', '<f8')])

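The optional first character, < or >, specifies the byte order (little-endian or big-endian, respectively) used to store multi-byte values. The next character specifies the type of data (see below), and the trailing number gives the size of the object in bytes (for 'U', it counts characters rather than bytes).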
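To make the byte-order prefix concrete, the sketch below stores the same integer with both conventions and inspects the raw bytes. The names `big` and `little` are illustrative.

```python
import numpy as np

big = np.dtype('>i4')      # big-endian 4-byte signed integer
little = np.dtype('<i4')   # little-endian 4-byte signed integer

# Both describe the same kind and size of data...
print(big.kind, big.itemsize)        # i 4
print(little.kind, little.itemsize)  # i 4

# ...but the value 1 is laid out differently in memory.
big_bytes = np.array([1], dtype=big).tobytes()
little_bytes = np.array([1], dtype=little).tobytes()
print(big_bytes)     # b'\x00\x00\x00\x01'
print(little_bytes)  # b'\x01\x00\x00\x00'
```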

| Character | Description | Example |
|---|---|---|
| 'b' | Byte | np.dtype('b') |
| 'i' | Signed integer | np.dtype('i4') == np.int32 |
| 'u' | Unsigned integer | np.dtype('u1') == np.uint8 |
| 'f' | Floating point | np.dtype('f8') == np.float64 |
| 'c' | Complex floating point | np.dtype('c16') == np.complex128 |
| 'S', 'a' | String | np.dtype('S5') |
| 'U' | Unicode string | np.dtype('U') == np.str_ |
| 'V' | Raw data (void) | np.dtype('V') == np.void |
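The numeric type codes listed above can be checked against NumPy's scalar types directly. This is a small sketch covering only the fixed-size numeric codes; the dictionary `checks` is a name introduced here.

```python
import numpy as np

# Map each type-code string to the scalar type it should equal.
checks = {
    'b': np.int8,
    'i4': np.int32,
    'u1': np.uint8,
    'f8': np.float64,
    'c16': np.complex128,
}

results = {code: np.dtype(code) == scalar for code, scalar in checks.items()}
for code, ok in results.items():
    print(f"{code:>3} -> {ok}")

# A byte string of length 5 occupies exactly 5 bytes.
print(np.dtype('S5').itemsize)   # 5
```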

## More Advanced Compound Types

We can create a type where each element contains an array or matrix of values:

In [23]:
tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(1, dtype=tp)
print(X[0])
print(X['mat'][0])

(0, [[0., 0., 0.], [0., 0., 0.], [0., 0., 0.]])
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


Now every element in X consists of an id and a 3×3 matrix. Why would we do this instead of using a multi-dimensional array or a Python dictionary? A NumPy dtype maps directly onto a C structure definition, so the buffer containing the array content can be accessed directly within a C program.
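The memory layout behind this claim can be inspected from Python. The sketch below assumes the default packed layout, which for this dtype would correspond to a C struct along the lines of `struct { int64_t id; double mat[3][3]; }`; that struct is an illustration, not something NumPy generates.

```python
import numpy as np

tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(2, dtype=tp)
X['id'] = [1, 2]
X['mat'][1] = np.eye(3)

# Byte offset of each field inside one record.
offsets = {field: tp.fields[field][1] for field in tp.names}
print(offsets)       # {'id': 0, 'mat': 8}
print(tp.itemsize)   # 80  (8 bytes for id + 9 * 8 bytes for mat)

# The raw buffer is simply the records laid end to end...
raw = X.tobytes()
print(len(raw))      # 160

# ...so it can be reinterpreted with the same dtype, much as C code
# would cast the buffer to an array of structs.
Y = np.frombuffer(raw, dtype=tp)
print(Y['id'])       # [1 2]
```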

## Record Arrays

The np.recarray class is almost identical to a structured array, with one additional feature: fields can be accessed as attributes rather than as dictionary keys. Previously we accessed them as:

In [27]:
data['age']

array([25, 45, 37, 19])

Now we can use fewer keystrokes:

In [28]:
data_rec = data.view(np.recarray)
data_rec.age

array([25, 45, 37, 19])
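Note that `view` does not copy the data: the record array and the original structured array share one buffer. A minimal sketch, rebuilding a small version of `data` so that it runs on its own:

```python
import numpy as np

data = np.zeros(4, dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
data['age'] = [25, 45, 37, 19]

data_rec = data.view(np.recarray)

# Writing through the record array changes the original array too.
data_rec.age[0] = 26
print(data['age'][0])                    # 26
print(np.shares_memory(data, data_rec))  # True
```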

The downside is extra overhead on every field access, even when the record array is indexed with the same dictionary-style syntax:

In [29]:
%timeit data['age']
%timeit data_rec['age']
%timeit data_rec.age

104 ns ± 2.93 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
2.33 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.47 µs ± 520 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
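`%timeit` is an IPython magic, so it only works in a notebook or IPython shell. In a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The repetition count `n` is an arbitrary choice, and the absolute numbers will differ from machine to machine.

```python
import timeit
import numpy as np

data = np.zeros(4, dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
data_rec = data.view(np.recarray)

n = 20_000  # number of calls per measurement (arbitrary)
timings = {
    "data['age']": timeit.timeit(lambda: data['age'], number=n),
    "data_rec['age']": timeit.timeit(lambda: data_rec['age'], number=n),
    "data_rec.age": timeit.timeit(lambda: data_rec.age, number=n),
}

# Report the average time per call in nanoseconds.
for label, total in timings.items():
    print(f"{label:16s} {total / n * 1e9:8.0f} ns per call")
```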
