## NumPy Data Types
More than meets the eye

Austin Godber
@godber

DesertPy - 8/26/201

# What are NumPy Data Types?

We've seen them before as the simple data type of a NumPy array:

In [1]:
import numpy as np
np.ones((3,4), dtype=np.int32)

array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]], dtype=int32)

So, what's there to talk about?

What simple data types are there and how can we specify them?

Well ... there are about half a dozen ways.

The one that clicks for me is the "array interface" string.  Call `np.dtype()` with a string whose first character gives the type and whose remaining characters give the size in bytes.  For example, `np.dtype('i4')` is a 32-bit signed integer.  Here are the valid characters:

```
'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)
```
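To get a feel for the table above, here is a small sketch that builds a dtype from a handful of these strings and reports the size NumPy assigns to each item. The particular codes chosen are just examples, and the `sizes` dictionary is a name made up for this illustration.

```python
import numpy as np

# Example codes only: a kind character followed by a size.
codes = ("i2", "i4", "u1", "f8", "c16", "S5", "U5")

# Map each code to the number of bytes one array item occupies.
sizes = {code: np.dtype(code).itemsize for code in codes}

for code in codes:
    dt = np.dtype(code)
    print(f"{code:4} kind={dt.kind} itemsize={dt.itemsize} name={dt.name}")
```

Note that for `'U'` the number counts characters rather than bytes, so `'U5'` takes more than five bytes per item.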

In [2]:
np.ones((2,3), dtype=np.dtype('i2'))

array([[1, 1, 1],
       [1, 1, 1]], dtype=int16)

In [3]:
np.ones((2,3), dtype=np.dtype('i4'))

array([[1, 1, 1],
       [1, 1, 1]], dtype=int32)

So, these look the same, but they are stored differently in memory, right ...

`i2` uses two bytes to store the integers in the array below:

In [4]:
np.ones((2,3), dtype=np.dtype('i2')).tobytes()  # tostring() is a deprecated alias, removed in NumPy 2.0

b'\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00'

`i4` uses four bytes to store the integers in the array below:

In [5]:
np.ones((2,3), dtype=np.dtype('i4')).tobytes()

b'\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00'

One can even specify byte order by starting the string with `>` (big-endian) or `<` (little-endian).

In [6]:
np.ones((2,3), dtype=np.dtype('>i2'))

array([[1, 1, 1],
       [1, 1, 1]], dtype=int16)

In [7]:
np.ones((2,3), dtype=np.dtype('>i2')).tobytes()

b'\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01'
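As a quick side-by-side, the sketch below stores the same two values in little-endian and big-endian form and compares the raw bytes. The values 1 and 256 are picked only because their byte patterns are easy to read; `little` and `big` are illustrative names.

```python
import numpy as np

# Same values, two byte orders.
little = np.array([1, 256], dtype="<i2")
big = little.astype(">i2")  # converts the values, so the bytes are swapped in memory

print(little.tobytes())             # b'\x01\x00\x00\x01'
print(big.tobytes())                # b'\x00\x01\x01\x00'
print(np.array_equal(little, big))  # True: the values are unchanged
```

The arrays compare equal because the dtype tells NumPy how to interpret the bytes; only the storage differs.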

## Great, now what?

# Let Us Dig Deeper

... take a look at the NumPy docs on data types ...

A data type object (an instance of numpy.dtype class) describes how
the bytes in the fixed-size block of memory corresponding to an
array item should be interpreted. 

It describes the following aspects of the data:

1. Type of the data
2. Size of the data
3. Byte order of the data
4. If the data type is a record, an aggregate of other data types,
  1. what are the names of the “fields” of the record, by which they can be accessed,
  2. what is the data-type of each field, and
  3. which part of the memory block each field takes.
5. If the data type is a sub-array, what is its shape and data type.

## Whoa, did you see numbers 4 and 5!?!?

It describes the following aspects of the data:

1. Type of the data
2. Size of the data
3. Byte order of the data
4. **If the data type is a record**, an aggregate of other data types,
  1. what are the names of the “fields” of the record, by which they can be accessed,
  2. what is the data-type of each field, and
  3. which part of the memory block each field takes.
5. **If the data type is a sub-array**, what is its shape and data type.
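Each of the five aspects listed above can be read back from a dtype object. The following is a minimal sketch using a made-up record dtype with a 5-byte string field and a sub-array field of two half-precision floats; the field names `name` and `grades` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical record dtype: a byte string plus a sub-array of two float16 values.
dt = np.dtype([("name", "S5"), ("grades", "<f2", (2,))])

print(dt.kind)      # type of the data ('V' for a record)
print(dt.itemsize)  # size of one item in bytes
print(dt.names)     # field names of the record

# For each field: its data type and its byte offset within the item.
for field_name in dt.names:
    field_dtype, offset = dt.fields[field_name][:2]
    print(field_name, field_dtype, offset)

# Sub-array field: element type and shape.
print(dt["grades"].base, dt["grades"].shape)
```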

# What, pray tell, is a RECORD?

A record is like a C structure, and a record array is an array of them.  These are arrays of composite data types where one can use Python dictionary-style notation to interact with the fields of the array elements.

So, what's that mean?

Let's look.

In [8]:
simple_record_dt = np.dtype('S5,i2')  # 'a5' is the older spelling of 'S5'; the 'a' alias was removed in NumPy 2.0
simple_record = np.zeros((2,3), dtype=simple_record_dt)
simple_record

array([[(b'', 0), (b'', 0), (b'', 0)],
       [(b'', 0), (b'', 0), (b'', 0)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

What do we get?  A 2x3 array in which each element is a 5-byte string followed by a two-byte integer.  But what are the `'f0'` and `'f1'` values?

Implicitly assigned field names.
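To make the implicit naming concrete, this short sketch compares the comma-separated string form with an explicit list of `(name, type)` tuples. The variable names `implicit` and `explicit` are illustrative.

```python
import numpy as np

# Comma-separated form: NumPy assigns the names f0, f1, ... in order.
implicit = np.dtype("S5,i2")

# The same layout with the names written out.
explicit = np.dtype([("f0", "S5"), ("f1", "i2")])

print(implicit.names)        # ('f0', 'f1')
print(implicit == explicit)  # True
```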

## Using field names

In [9]:
simple_record['f0'] = (('a', 'b', 'c'), ('d', 'e', 'f'))
simple_record

array([[(b'a', 0), (b'b', 0), (b'c', 0)],
       [(b'd', 0), (b'e', 0), (b'f', 0)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

In [10]:
simple_record.tobytes()

b'a\x00\x00\x00\x00\x00\x00b\x00\x00\x00\x00\x00\x00c\x00\x00\x00\x00\x00\x00d\x00\x00\x00\x00\x00\x00e\x00\x00\x00\x00\x00\x00f\x00\x00\x00\x00\x00\x00'

## Broadcasting to a field

In [11]:
simple_record['f1'] = 21
simple_record

array([[(b'a', 21), (b'b', 21), (b'c', 21)],
       [(b'd', 21), (b'e', 21), (b'f', 21)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

In [12]:
simple_record.tobytes()

b'a\x00\x00\x00\x00\x15\x00b\x00\x00\x00\x00\x15\x00c\x00\x00\x00\x00\x15\x00d\x00\x00\x00\x00\x15\x00e\x00\x00\x00\x00\x15\x00f\x00\x00\x00\x00\x15\x00'
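One way to convince yourself of this layout is to decode the raw bytes with Python's standard `struct` module. The sketch below rebuilds a two-element version of the record and unpacks it; the format string `'<5sh'` (little-endian, 5-byte string, 2-byte signed integer, no padding) is my reading of the dtype, not something taken from the NumPy docs.

```python
import struct
import numpy as np

# A smaller stand-in for the record array above.
rec = np.zeros(2, dtype=[("f0", "S5"), ("f1", "<i2")])
rec["f0"] = (b"a", b"b")
rec["f1"] = 21

raw = rec.tobytes()

# Walk the buffer one record (rec.itemsize bytes) at a time.
records = [
    struct.unpack_from("<5sh", raw, i * rec.itemsize)
    for i in range(len(rec))
]
print(records)
```

Each record occupies seven bytes: five for the zero-padded string and two for the integer, where `\x15` is 21.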

## Manipulating Record Fields

In [13]:
simple_record['f1'] = ((1, 2, 3), (4, 5, 6))
simple_record['f1'] = simple_record['f1'] * 10
simple_record

array([[(b'a', 10), (b'b', 20), (b'c', 30)],
       [(b'd', 40), (b'e', 50), (b'f', 60)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

## Accessing records

* Indexing with a single field returns a view

```
simple_record['f1']
```

* Indexing with a list of field names returns a view as of NumPy 1.16 (earlier versions returned a copy), so call `.copy()` when an independent array is needed.

```
simple_record[['f0', 'f1']]
```
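The view behaviour of single-field indexing is easy to check: writing through the field writes into the record array. This is a small sketch with a throwaway array; `rec` and `counts` are names invented for the example, and the multi-field result is printed rather than assumed because it depends on the NumPy version.

```python
import numpy as np

rec = np.zeros(3, dtype=[("f0", "S5"), ("f1", "<i2")])

counts = rec["f1"]        # single field: a view into rec's memory
counts[:] = (10, 20, 30)  # assignment through the view changes rec

print(rec["f1"])
print(np.shares_memory(rec, counts))

# Multi-field indexing: whether this shares memory depends on the NumPy version.
print(np.shares_memory(rec, rec[["f0", "f1"]]))
```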


## Naming Fields

Provide a list of tuples where the first element of the tuple is the name and the following value(s) define the type.

In [14]:
name_grade_dt = np.dtype([('name', 'S5'), ('grades', 'f2', (2,))])
name_grade_dt

dtype([('name', 'S5'), ('grades', '<f2', (2,))])

In [15]:
a = np.zeros(3, dtype=name_grade_dt)
a

array([(b'', [0.0, 0.0]), (b'', [0.0, 0.0]), (b'', [0.0, 0.0])], 
      dtype=[('name', 'S5'), ('grades', '<f2', (2,))])

## Filling out the array

In [16]:
a['name'] = ('Steve', 'Bob', 'Sally')
a['grades'] = [np.random.rand(2) for x in range(3)]
a

array([(b'Steve', [0.91796875, 0.68359375]),
       (b'Bob', [0.310302734375, 0.33349609375]),
       (b'Sally', [0.71630859375, 0.041839599609375])], 
      dtype=[('name', 'S5'), ('grades', '<f2', (2,))])

What does it look like in memory?

In [17]:
a.tobytes()

b'SteveX;x9Bob\x00\x00\xf74V5Sally\xbb9[)'

A good reminder of how different a floating-point value's bytes in memory are from its text representation.

Compute the averages of grades...

In [18]:
np.vstack((a['name'], a['grades'].mean(axis=1)))

array([[b'Steve', b'Bob', b'Sally'],
       [b'0.80078125', b'0.32177734375', b'0.379150390625']], 
      dtype='|S32')
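Stacking names and means into one plain array forces everything into a single string type, as the `|S32` dtype above shows. An alternative sketch, assuming you want to keep the means numeric, is to put the result in a second record array. The `summary` array, its `mean` field, and the grade values here are made up for illustration.

```python
import numpy as np

grades_dt = np.dtype([("name", "S5"), ("grades", "<f2", (2,))])
a = np.zeros(3, dtype=grades_dt)
a["name"] = (b"Steve", b"Bob", b"Sally")
a["grades"] = [(0.5, 1.0), (0.25, 0.75), (1.0, 0.0)]  # made-up values

# A second record type: the name plus one numeric mean per person.
summary = np.zeros(a.shape, dtype=[("name", "S5"), ("mean", "<f4")])
summary["name"] = a["name"]
summary["mean"] = a["grades"].mean(axis=1)

print(summary)
```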

# That's all well and good, but why don't I just use Pandas?

You can, but this is super useful for reading arbitrary packed binary data files.

Writing out a sample file:

In [19]:
a.tofile('grades.bin')
%cat grades.bin

SteveX;x9Bob  �4V5Sally�9[)

Reading in the sample file, specifying a single record type using `dtype`:

In [20]:
np.fromfile('grades.bin', dtype=[('name', 'S5'), ('grades', 'f2', (2,))])

array([(b'Steve', [0.91796875, 0.68359375]),
       (b'Bob', [0.310302734375, 0.33349609375]),
       (b'Sally', [0.71630859375, 0.041839599609375])], 
      dtype=[('name', 'S5'), ('grades', '<f2', (2,))])
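The round trip above can be sketched as a self-contained script. Since `tofile` writes only the raw bytes with no header, the reader has to supply the same dtype, and the file size is just the record count times the item size. The temporary directory and the grade values below are assumptions made for the example.

```python
import os
import tempfile
import numpy as np

record_dt = np.dtype([("name", "S5"), ("grades", "<f2", (2,))])
original = np.zeros(3, dtype=record_dt)
original["name"] = (b"Steve", b"Bob", b"Sally")
original["grades"] = [(0.5, 1.0), (0.25, 0.75), (1.0, 0.0)]  # made-up values

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "grades.bin")
    original.tofile(path)                 # raw bytes only, no header
    file_size = os.path.getsize(path)
    restored = np.fromfile(path, dtype=record_dt)

print(file_size)                                   # 3 records * 9 bytes each
print(original.tobytes() == restored.tobytes())
```

Writing the byte order explicitly (`'<f2'`) in the dtype keeps the file readable on machines whose native byte order differs.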