## NumPy Data Types
More than meets the eye

Austin Godber
@godber

DesertPy - 8/26/201

# What are NumPy Data Types?

We've seen them before as the simple data type of a NumPy array:

In [14]:
import numpy as np
np.ones((3,4), dtype=np.int32)

array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]], dtype=int32)

So, what's there to talk about?

What simple data types are there and how can we specify them?

Well ... there are about half a dozen ways.

The one that clicks for me is the "array interface" style.  Call `np.dtype()` with a string whose first character is the type and whose remaining characters give the size in bytes.  For example, `np.dtype('i4')` is a 32-bit signed integer.  Here are the valid characters:

```
'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)
```
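To make the character table concrete, here is a minimal sketch that builds one dtype per kind and prints what NumPy resolves it to. The particular sizes chosen (`b1`, `u4`, `c16`, and so on) are illustrative picks, not taken from the slides.

```python
import numpy as np

# Each string is a type character followed by the item size in bytes
# (for 'U' the number is a character count, stored as 4 bytes each).
codes = ("b1", "i2", "u4", "f8", "c16", "S5", "U3", "V6")

# Map every code to the dtype object NumPy builds from it.
resolved = {code: np.dtype(code) for code in codes}

for code, dt in resolved.items():
    # kind is the one-letter category; itemsize is the bytes per element.
    print(code, "->", dt.name, "kind =", dt.kind, "itemsize =", dt.itemsize)
```

Note that `'U3'` reports an `itemsize` of 12 rather than 3, since NumPy stores each Unicode character in four bytes.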

In [96]:
np.ones((2,3), dtype=np.dtype('i2'))

array([[1, 1, 1],
       [1, 1, 1]], dtype=int16)

In [97]:
np.ones((2,3), dtype=np.dtype('i4'))

array([[1, 1, 1],
       [1, 1, 1]], dtype=int32)

So, these look the same, but they are stored differently in memory, right ...

`i2` uses two bytes to store the integers in the array below:

In [100]:
np.ones((2,3), dtype=np.dtype('i2')).tostring()

'\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00'

`i4` uses four bytes to store the integers in the array below:

In [101]:
np.ones((2,3), dtype=np.dtype('i4')).tostring()

'\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00'
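The same size difference can be read off the arrays directly, without counting bytes by eye. A small sketch follows; it uses `tobytes()` because `tostring()` is deprecated in newer NumPy releases and both return the same raw bytes.

```python
import numpy as np

small = np.ones((2, 3), dtype=np.dtype("i2"))
large = np.ones((2, 3), dtype=np.dtype("i4"))

# itemsize is bytes per element; nbytes is itemsize times the element count.
print(small.itemsize, small.nbytes)  # 2 12
print(large.itemsize, large.nbytes)  # 4 24

# The raw buffer length matches nbytes in both cases.
small_len = len(small.tobytes())
large_len = len(large.tobytes())
print(small_len, large_len)  # 12 24
```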

One can even specify byte order by starting the string with `>` (big-endian) or `<` (little-endian).

In [102]:
np.ones((2,3), dtype=np.dtype('>i2'))

array([[1, 1, 1],
       [1, 1, 1]], dtype=int16)

In [103]:
np.ones((2,3), dtype=np.dtype('>i2')).tostring()

'\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01'
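As a quick check that byte order changes only the layout and not the values, the sketch below builds the same numbers in both orders. The array contents are arbitrary illustrative values.

```python
import numpy as np

little = np.array([1, 2], dtype="<i2")
big = np.array([1, 2], dtype=">i2")

# Same values, opposite byte layout within each two-byte item.
little_bytes = little.tobytes()
big_bytes = big.tobytes()
print(little_bytes)  # b'\x01\x00\x02\x00'
print(big_bytes)     # b'\x00\x01\x00\x02'

# Comparisons work on values, so the arrays are equal.
same_values = np.array_equal(little, big)
print(same_values)  # True

# astype() converts between byte orders while preserving the values.
converted = big.astype("<i2")
print(converted.tobytes() == little_bytes)  # True
```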

## Great, now what?

# Let Us Dig Deeper

... take a look at the NumPy docs on data types ...

A data type object (an instance of numpy.dtype class) describes how
the bytes in the fixed-size block of memory corresponding to an
array item should be interpreted. 

It describes the following aspects of the data:

1. Type of the data
2. Size of the data
3. Byte order of the data
4. If the data type is a record, an aggregate of other data types,
  1. what are the names of the “fields” of the record, by which they can be accessed,
  2. what is the data-type of each field, and
  3. which part of the memory block each field takes.
5. If the data type is a sub-array, what is its shape and data type.

## Whoa, did you see numbers 4 and 5!?!?

It describes the following aspects of the data:

1. Type of the data
2. Size of the data
3. Byte order of the data
4. **If the data type is a record**, an aggregate of other data types,
  1. what are the names of the “fields” of the record, by which they can be accessed,
  2. what is the data-type of each field, and
  3. which part of the memory block each field takes.
5. **If the data type is a sub-array**, what is its shape and data type.
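Each of the aspects listed above can be read back from a dtype object. Here is a hedged sketch using an illustrative record dtype with one string field and one sub-array field; the field names `name` and `grades` are just examples.

```python
import numpy as np

dt = np.dtype([("name", "S5"), ("grades", "<f2", (2,))])

# 1 and 2: type and size of the whole record.
print(dt.kind)      # 'V' (structured dtypes report the void kind)
print(dt.itemsize)  # 9 (5 bytes of string plus 2 x 2 bytes of float16)

# 3: byte order is reported per dtype ('=' means native, '|' not applicable).
print(dt["name"].byteorder, dt["grades"].base.byteorder)

# 4: field names, each field's dtype, and its byte offset in the record.
offsets = {}
for field_name in dt.names:
    field_dtype, offset = dt.fields[field_name][:2]
    offsets[field_name] = offset
    print(field_name, field_dtype, offset)

# 5: a sub-array field exposes its element dtype and shape.
grades = dt["grades"]
print(grades.base, grades.shape)  # float16 (2,)
```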

# What, pray tell, is a RECORD?

A record array is like an array of C structures.  Each element is a composite data type, and one can use Python dictionary-style notation to access the fields of the array elements.

So, what's that mean?

Let's look.

In [88]:
simple_record_dt = np.dtype('a5,i2')
simple_record = np.zeros((2,3), dtype=simple_record_dt)
simple_record

array([[('', 0), ('', 0), ('', 0)],
       [('', 0), ('', 0), ('', 0)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

What do we get?  A 2x3 array in which each element holds a 5-character string followed by a two-byte integer.  But what are the `'f0'` and `'f1'` values?

Implicitly assigned field names.
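The implicit names and the packed layout can be confirmed from the dtype itself. A minimal sketch follows; it spells the string type as `S5` rather than `a5`, since `S5` is what NumPy reports in the output above.

```python
import numpy as np

dt = np.dtype("S5,i2")

# Fields declared in a comma-separated string get the names f0, f1, ...
print(dt.names)  # ('f0', 'f1')

# Byte offset of each field inside one record: the string comes first,
# and the integer starts right after its 5 bytes.
f0_offset = dt.fields["f0"][1]
f1_offset = dt.fields["f1"][1]
print(f0_offset, f1_offset)  # 0 5

# One record occupies 5 + 2 = 7 bytes with no padding.
print(dt.itemsize)  # 7
```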

## Using field names

In [160]:
simple_record['f0'] = (('a', 'b', 'c'), ('d', 'e', 'f'))
simple_record

array([[('a', 0), ('b', 0), ('c', 0)],
       [('d', 0), ('e', 0), ('f', 0)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

In [90]:
simple_record.tostring()

'a\x00\x00\x00\x00\x00\x00b\x00\x00\x00\x00\x00\x00c\x00\x00\x00\x00\x00\x00d\x00\x00\x00\x00\x00\x00e\x00\x00\x00\x00\x00\x00f\x00\x00\x00\x00\x00\x00'

## Broadcasting to a field

In [93]:
simple_record['f1'] = 21
simple_record

array([[('a', 21), ('b', 21), ('c', 21)],
       [('d', 21), ('e', 21), ('f', 21)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

In [94]:
simple_record.tostring()

'a\x00\x00\x00\x00\x15\x00b\x00\x00\x00\x00\x15\x00c\x00\x00\x00\x00\x15\x00d\x00\x00\x00\x00\x15\x00e\x00\x00\x00\x00\x15\x00f\x00\x00\x00\x00\x15\x00'

## Manipulating Record Fields

In [110]:
simple_record['f1'] = ((1, 2, 3), (4, 5, 6))
simple_record['f1'] = simple_record['f1'] * 10
simple_record

array([[('a', 10), ('b', 20), ('c', 30)],
       [('d', 40), ('e', 50), ('f', 60)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

## Accessing records

* Indexing with a single field returns a view

```
simple_record['f1']
```

* Indexing with a list of field names returned a copy in older NumPy releases; since NumPy 1.16 it returns a view containing only the selected fields.

```
simple_record[['f0', 'f1']]
```
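The view behaviour of single-field indexing is easy to verify: a write through the field shows up in the parent array. A small sketch, rebuilding an array like `simple_record` under the hypothetical name `rec`; it does not assert anything about multi-field indexing, since that depends on the NumPy version.

```python
import numpy as np

rec = np.zeros((2, 3), dtype=np.dtype("S5,i2"))

# Single-field indexing returns a view onto the same memory.
counts = rec["f1"]
shares = np.shares_memory(counts, rec)
print(shares)  # True

# Writing through the view changes the parent record array.
counts[0, 0] = 99
print(rec[0, 0]["f1"])  # 99

# The view steps over whole 7-byte records, not 2-byte integers.
print(counts.strides)  # (21, 7)
```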


## Naming Fields

Provide a list of tuples where the first element of the tuple is the name and the following value(s) define the type.

In [167]:
name_grade_dt = np.dtype([('name', 'S5'), ('grades', 'f2', (2,))])
name_grade_dt

dtype([('name', 'S5'), ('grades', '<f2', (2,))])

In [168]:
a = np.zeros((3), dtype=name_grade_dt)
a

array([('', [0.0, 0.0]), ('', [0.0, 0.0]), ('', [0.0, 0.0])], 
      dtype=[('name', 'S5'), ('grades', '<f2', (2,))])
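One consequence of the sub-array field worth seeing: indexing by that field appends the sub-array shape to the array shape. A minimal sketch using the same dtype definition as above; the assigned values are arbitrary.

```python
import numpy as np

name_grade_dt = np.dtype([("name", "S5"), ("grades", "f2", (2,))])
a = np.zeros(3, dtype=name_grade_dt)

# The array itself is one-dimensional with three records.
print(a.shape)  # (3,)

# A plain field keeps that shape; the sub-array field gains an axis.
print(a["name"].shape)    # (3,)
print(a["grades"].shape)  # (3, 2)

# Rows of the field view correspond to records, so this sets one
# record's two grades (both values are exactly representable in float16).
a["grades"][1] = (0.5, 0.25)
second_grades = a[1]["grades"].tolist()
print(second_grades)  # [0.5, 0.25]
```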

## Filling out the array

In [170]:
a['name'] = ('Steve', 'Bob', 'Sally')
a['grades'] = [np.random.rand(2) for x in range(3)]
a

array([('Steve', [0.81982421875, 0.89990234375]),
       ('Bob', [0.1552734375, 0.25]),
       ('Sally', [0.81982421875, 0.98974609375])], 
      dtype=[('name', 'S5'), ('grades', '<f2', (2,))])

What does it look like in memory?

In [171]:
a.tostring()

'Steve\x8f:3;Bob\x00\x00\xf80\x004Sally\x8f:\xeb;'

A good reminder of how different a floating-point value looks in memory compared with its text representation.
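To connect the odd-looking characters back to numbers, the sketch below takes the four bytes that follow `Steve` in the output above and decodes them as little-endian float16 values. The variable names are illustrative.

```python
import numpy as np

# The four bytes after 'Steve' in the dump: '\x8f', ':', '3', ';'.
raw = b"\x8f:3;"

# Printable characters are just bytes that happen to fall in the ASCII range.
print([hex(b) for b in raw])  # ['0x8f', '0x3a', '0x33', '0x3b']

# Reinterpret the buffer as two little-endian 2-byte floats.
grades = np.frombuffer(raw, dtype="<f2")
decoded = grades.tolist()
print(decoded)  # [0.81982421875, 0.89990234375]
```

These match the two grades shown for Steve in the array output.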

In [173]:
a['grades'].mean(axis=1)

array([ 0.85986328,  0.20263672,  0.90478516], dtype=float16)

In [199]:
np.hstack((a['name'], a['grades'].mean(axis=1)))

array(['Steve', 'Bob', 'Sally', '0.85986328125', '0.20263671875',
       '0.90478515625'], 
      dtype='|S32')

In [212]:
np.vstack((a['name'], a['grades'].mean(axis=1)))

array([['Steve', 'Bob', 'Sally'],
       ['0.85986328125', '0.20263671875', '0.90478515625']], 
      dtype='|S32')

In [210]:
np.concatenate((a['name'], a['grades'].mean(axis=1)), axis=0)

array(['Steve', 'Bob', 'Sally', '0.85986328125', '0.20263671875',
       '0.90478515625'], 
      dtype='|S32')

In [219]:
[name for name in np.nditer(a['name'])]

[array('Steve', 
       dtype='|S5'), array('Bob', 
       dtype='|S5'), array('Sally', 
       dtype='|S5')]

In [215]:
a['name']

array(['Steve', 'Bob', 'Sally'], 
      dtype='|S5')