# Numpy

## Data types in Numpy

Numpy is based on **arrays**. You can think of an array as a list (one dimension) or a table (two dimensions) of elements that all have the same data type.

An array can be created from an existing list or tuple with the `array` function. 
Do not forget the [ ] or ( ) brackets: the values have to be passed as one list or tuple, not as separate arguments. 
An n-dimensional array can be created from nested lists and tuples.

For more information about creating arrays: 
http://docs.scipy.org/doc/numpy/user/basics.creation.html

In [2]:
import numpy

# create array from a list:
print( numpy.array([1, 3, 5, 7, 9]) )

# create array from a tuple:
print( numpy.array((2, 4, 6, 8)) )

[1 3 5 7 9]
[2 4 6 8]


An array of a given shape can also be created with the `empty`, `zeros` and `ones` functions. 
The shape of the array is the input for these functions. 
Shape and dimensionality are explained in the next section. 
The `empty` function does not initialize its elements, so the content of an empty array is arbitrary and has to be overwritten before use.

In [3]:
# create an empty array:
print( numpy.empty((3)) )

# create array with only zeros:
print( numpy.zeros((4)) )

# create array with only ones:
print( numpy.ones((5)) )

[ -2.68156159e+154  -2.68156159e+154   6.74179579e+199]
[ 0.  0.  0.  0.]
[ 1.  1.  1.  1.  1.]
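
As a small sketch that goes slightly beyond the functions named above: the shape can also be a tuple with several dimensions, and the optional `dtype` argument selects the element type. `numpy.full` is not mentioned in the text above; it is included here only as a related way to fill an array with a constant of your own choosing.

```python
import numpy

# 2 by 3 array of zeros; the shape is passed as a tuple
zeros_2d = numpy.zeros((2, 3))

# ones stored as integers instead of the default float64
ones_int = numpy.ones(4, dtype=int)

# array filled with a constant of our choosing
sevens = numpy.full((2, 2), 7.0)

print(zeros_2d.shape)
print(ones_int, ones_int.dtype)
print(sevens)
```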


If we want to create an array with random numbers, we can use the `numpy.random.random` function. It returns an array with numbers drawn uniformly from the half-open interval [0, 1), so 0 can occur but 1 cannot.

The expression `numpy.random.uniform(x, y, z)` returns an array with `z` random numbers uniformly drawn from the interval between `x` (included) and `y` (excluded).

In [4]:
# create array with random numbers between 0 and 1:
print( numpy.random.random(4) )

# create array with random numbers between 5 and 10:
print( numpy.random.uniform(5, 10, 5) )

[ 0.81991612  0.9588003   0.41758327  0.13524802]
[ 9.85064941  7.22872404  9.77355862  7.72513108  5.72537798]
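
Because the numbers are random, the output above differs from run to run. One possible way to make a run repeatable, sketched here with the legacy `numpy.random.seed` function and an arbitrarily chosen seed of 42, is to fix the state of the generator before drawing:

```python
import numpy

numpy.random.seed(42)            # fix the state of the global generator
first = numpy.random.uniform(5, 10, 5)

numpy.random.seed(42)            # same seed, so the same numbers again
second = numpy.random.uniform(5, 10, 5)

print(numpy.array_equal(first, second))
print(first.min() >= 5, first.max() < 10)
```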


The identity matrix is an $n \times n$ matrix with ones on the diagonal and zeros everywhere else. 
The identity matrix can be created with the `eye(n)` function. Since the identity matrix is always square, only one input parameter is needed to create the 2-dimensional matrix.

In [5]:
# create n by n identity matrix:
print( numpy.eye(2) )

[[ 1.  0.]
 [ 0.  1.]]


Numpy has a function similar to the built-in `range()` function: `arange(a, b, s)`, where $a$ is the start point, $b$ is the end point (not included) and $s$ is the step size. Unlike `range()`, it accepts floats as well as integers.
Another function to create a range is `linspace(a, b, i)`, where $a$ is the start point, $b$ is the end point (included by default), and $i$ is the number of items.

The advantage of `linspace` is that you can specify the number of items; the advantage of `arange` is that you can specify the step size.

In [6]:
# fixed step size:
x = numpy.arange(1, 11.9, 2.1)
print( x )

# fixed number of items:
y = numpy.linspace(0, 2, 9)
print( y )

[  1.    3.1   5.2   7.3   9.4  11.5]
[ 0.    0.25  0.5   0.75  1.    1.25  1.5   1.75  2.  ]
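
The difference in how the two functions treat the end point is easy to overlook. The sketch below uses the `retstep` and `endpoint` keyword arguments of `linspace`, which are not discussed in the text above, to make the behaviour visible:

```python
import numpy

# arange stops before the end point
stepped = numpy.arange(0, 10, 2)
print(stepped)            # 0, 2, 4, 6, 8

# linspace includes the end point by default;
# retstep=True also returns the step size that was used
values, step = numpy.linspace(0, 1, 5, retstep=True)
print(values, step)

# with endpoint=False the end point is left out
open_values = numpy.linspace(0, 1, 5, endpoint=False)
print(open_values)
```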


## Dimensionality of Arrays

The shape of array objects is determined by the number of dimensions in the array and the length of each dimension.

### Creating one dimensional arrays

In linear algebra a vector is either a row vector or a column vector, and Numpy represents the two in different ways.

#### Row vectors

A row vector consists of a single row of values with multiple columns. By default, a one-dimensional Numpy array is a row vector. In order to create a column vector, you need to use a two-dimensional array, with the second dimension equal to 1. We will discuss them after 2D arrays.

In [7]:
a = numpy.zeros((3))
b = numpy.array([1, 2, 3])
print( a )
print( a.ndim, a.shape, a.size )
print( b )
print( b.ndim, b.shape, b.size )

[ 0.  0.  0.]
1 (3,) 3
[1 2 3]
1 (3,) 3


### Creating two dimensional arrays

An array can be 2-dimensional, with shape $(m, n)$, where $m$ is the number of rows and $n$ the number of columns. 
In a 2-dimensional array, the first dimension runs over the rows and the second dimension over the columns.

In [8]:
print( numpy.array([[1, 3, 6, 5], [2, 4, 3, 0]]) )

[[1 3 6 5]
 [2 4 3 0]]


### N dimensional arrays

In general, an array can be $n$-dimensional. You can think of $n$-dimensional arrays in terms of the bookshelf analogy:

- 1d array is a single row of a bookshelf, where a book can be identified by its position in the row
- 2d array is the whole bookshelf, where a book can be identified by its row number and its position in the row
- 3d array is a room full of bookshelves, where a book can be identified by the number of the bookshelf, row, and position in the row
- 4d array is a library with rooms with bookshelves, where a book can be identified by the room, bookshelf, row and position in the row

In [9]:
# create a three dimensional array:
z1 = numpy.array([[[1, 3], [2, 4]], [[11, 13], [12, 14]] ])

# dimensions/shape/size of the array:
print(z1)
print( "Number of dimensions:", z1.ndim )
print( "Length of each dimension:", z1.shape )
print( "The total number of elements:", z1.size )

[[[ 1  3]
  [ 2  4]]

 [[11 13]
  [12 14]]]
Number of dimensions: 3
Length of each dimension: (2, 2, 2)
The total number of elements: 8


In [10]:
# access some elements in the three dimensional array:
print( "First element:", z1[0, 0, 0] )
print( "Last element:", z1[1, 1, 1] )
print( "Some element:", z1[1, 0, 1] )

First element: 1
Last element: 14
Some element: 13


In [11]:
## Extract dimensions
a = numpy.random.random(2*3).reshape(2,3)
print(a)
print( "First row:\n", a[0, :] )
print( "Second column:\n", a[:, 1] )
print( "First two columns:\n", a[:,0:2] )

[[  5.26626229e-01   3.58504156e-01   6.04682331e-01]
 [  6.77856813e-01   3.55446355e-04   1.63886240e-01]]
First row:
 [ 0.52662623  0.35850416  0.60468233]
Second column:
 [  3.58504156e-01   3.55446355e-04]
First two columns:
 [[  5.26626229e-01   3.58504156e-01]
 [  6.77856813e-01   3.55446355e-04]]


### Reshaping arrays
A 1D array can be converted to an $n$ dimensional array using the `reshape()` function. 

Note that the total number of elements in the array has to be the same as the product of the lengths of the dimensions. For example, if the length of the list is 24, then we can reshape it to a 4 by 6 matrix, but also to a 2 by 3 by 4 array.

Let's assume we have a 2 by 3 by 4 array, which we will call `z`. Since indices in Python start at 0, the first element of the array is `z[0, 0, 0]`, and the last element is not `z[2, 3, 4]` but `z[1, 2, 3]`.

Let `x` be an array. Three attributes give insight into its dimensionality:
- `x.ndim` : the number of dimensions.
- `x.shape` : the length of each dimension.
- `x.size` : the total number of elements.

In [12]:
# reshape into a two dimensional array:
print( numpy.arange(2, 14, 2).reshape((2, 3)) )

# reshape into a three dimensional array:
z2 = numpy.arange(24).reshape((2, 3, 4))

# dimensions/shape/size of the array:
print( z2 )
print("Number of dimensions:", z2.ndim)
print("Length of each dimension:", z2.shape)
print("The total number of elements:", z2.size)



[[ 2  4  6]
 [ 8 10 12]]
[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
Number of dimensions: 3
Length of each dimension: (2, 3, 4)
The total number of elements: 24
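
The rule that the number of elements has to match the product of the dimension lengths can be tried out directly. The following sketch also uses `-1` as a dimension length, a convenience that is not covered in the text above: it asks Numpy to infer that one length from the others.

```python
import numpy

x = numpy.arange(24)

# -1 lets Numpy infer the remaining length: 24 / 4 = 6
inferred = x.reshape((4, -1))
print(inferred.shape)

# 24 elements cannot fill a 5 by 5 array
try:
    x.reshape((5, 5))
except ValueError as error:
    print("reshape failed:", error)
```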


### Creating one dimensional column vectors

Column vectors have multiple rows and a single column. Numpy represents column vectors as a two-dimensional array with only one column.

In [13]:
a = numpy.zeros((3,1))
b = numpy.array([[1], [2], [3]])
c = numpy.array([1,2,3]).reshape((3,1))
print( a )
print(a.ndim, a.shape, a.size)
print( b )
print(b.ndim, b.shape, b.size)
print( c )
print(c.ndim, c.shape, c.size)

[[ 0.]
 [ 0.]
 [ 0.]]
2 (3, 1) 3
[[1]
 [2]
 [3]]
2 (3, 1) 3
[[1]
 [2]
 [3]]
2 (3, 1) 3
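
A common question is how to turn an existing row vector into a column vector. The sketch below shows one possible approach with `numpy.newaxis`, which is not used elsewhere in this notebook, and illustrates that transposing a 1-dimensional array does not help:

```python
import numpy

row = numpy.array([1, 2, 3])

# transposing a 1-dimensional array leaves the shape unchanged
print(row.T.shape)

# adding a new axis turns it into a 3 by 1 column vector
column = row[:, numpy.newaxis]
print(column.shape)

# ravel() goes back to one dimension
flat_again = column.ravel()
print(flat_again.shape)
```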


## Array Indexing

For complete information about indexing see
http://docs.scipy.org/doc/numpy/user/basics.indexing.html

There are many ways to address content in an array:

- We have already gone over Matrix indexing, in which the contents of the array are accessed by specifying an index for each dimension

In [22]:
a = numpy.random.uniform(-0.5,.5,(5,5))
print( a )

# Return a value specified by a matrix index
print( a[2, 3] )

[[-0.24589574  0.11177302  0.13531942  0.3479577  -0.4192539 ]
 [ 0.37813752 -0.35211268 -0.41315332 -0.44628705 -0.36826755]
 [-0.23055912 -0.45735929 -0.14739955  0.35885095  0.21489465]
 [ 0.48975864  0.40247149 -0.35892262 -0.30953596 -0.04852892]
 [ 0.24586157 -0.29458672  0.18208263 -0.10046938 -0.1520591 ]]
0.358850951681


- Linear indexing treats the n-dimensional array as a 1-dimensional list. One example of a linear index is the value returned when the `argmin` and `argmax` functions are applied to an n-dimensional array. 

In [23]:
# Print the linear index of the maximum value
max_index = numpy.argmax(a)
print( max_index )

# Print that value, note the necessary flattening
print( a.flat[max_index] )

15
0.489758641035


- Boolean indexing, which returns all values in the array for which the index is True.

In [24]:
# Create a boolean index for positive numbers in array a
index = a > 0.0
print( index )
# Print all the positive numbers
print( a[index] )

[[False  True  True  True False]
 [ True False False False False]
 [False False False  True  True]
 [ True  True False False False]
 [ True False  True False False]]
[ 0.11177302  0.13531942  0.3479577   0.37813752  0.35885095  0.21489465
  0.48975864  0.40247149  0.24586157  0.18208263]


- Indexing with an array of indices. In this case you store the indices as integers in a separate array, and the result contains exactly the elements of the array at these indices. 

In [25]:
b = numpy.linspace(0,1,10)
print( b )

# Print numbers at prime indices
index = numpy.array([ 2, 3, 5, 7])
print( b[index] )

[ 0.          0.11111111  0.22222222  0.33333333  0.44444444  0.55555556
  0.66666667  0.77777778  0.88888889  1.        ]
[ 0.22222222  0.33333333  0.55555556  0.77777778]


### Linear and matrix indexing

Indexing in a 1-dimensional array works the same as indexing in a Python list. If you want to apply something to every element of the array, a single for-loop over the items does the trick.

Indexing in an n-dimensional array uses one index for every dimension. To access one element of the array, the index of every dimension has to be given. To access more than one element, the slice `:` can be used. It works the same way as with lists, except that a separate slice can be given for every dimension. If no index is given for a trailing dimension, then `:` is assumed for it.
If the index is `[a:b]`, then the indices that are used are `a` up to but not including `b`.

In [26]:
z = numpy.arange(24).reshape((2, 3, 4))
print( z )

# Print a few slices of a 3-dimensional array:
print( "\nSlices:" )
print( z[0:2, 1:3, 3] )
print( z[:, 2, :] )

[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]

Slices:
[[ 7 11]
 [19 23]]
[[ 8  9 10 11]
 [20 21 22 23]]
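
A few more slicing forms can be sketched on the same array. The `...` (Ellipsis) notation and the step value in a slice are not discussed in the text above; they are shown here as optional extras:

```python
import numpy

z = numpy.arange(24).reshape((2, 3, 4))

# omitted trailing dimensions are treated as ":"
print(numpy.array_equal(z[1], z[1, :, :]))

# "..." stands for as many ":" as needed
last_column = z[..., 3]
print(last_column)

# a step can be given as a third slice value: every second element
every_second = z[0, 0, ::2]
print(every_second)
```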


To convert from a linear index to a matrix index, use the function `numpy.unravel_index()`. The first argument is the linear index and the second argument is the shape of the array for which you want to transform the index. For example: `numpy.unravel_index(linear_index, (2,3))`. 

In [27]:
# Converting a linear index to a matrix index:
linear_index = 10
matrix_index = numpy.unravel_index(linear_index, z.shape)
print( z.flatten()[linear_index] )
print( z[matrix_index] )

print("For a matrix with dimensions (2, 3, 4), the linear index: ", linear_index, " is equal to \
matrix index: ", matrix_index)

10
10
For a matrix with dimensions (2, 3, 4), the linear index:  10  is equal to matrix index:  (0, 2, 2)
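
The conversion also works in the other direction. As a sketch, `numpy.ravel_multi_index` (not mentioned in the text above) turns a matrix index back into a linear index, and combining `argmax` with `unravel_index` locates the maximum of an n-dimensional array:

```python
import numpy

z = numpy.arange(24).reshape((2, 3, 4))

# matrix index -> linear index
linear_index = numpy.ravel_multi_index((0, 2, 2), z.shape)
print(linear_index)

# locate the maximum of an n-dimensional array as a matrix index
position = numpy.unravel_index(numpy.argmax(z), z.shape)
print(position, z[position])
```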


### Boolean indexing

A boolean index can be created directly, but most often it is built by specifying a certain condition.

The condition will return a True or False for every position in the array and when the condition is True then the corresponding element will be retrieved.

In [29]:
# Boolean indexing
x = numpy.arange(1, 6)
y = numpy.array([True, False, True, False, True ])
print( x[y] )

# Boolean indexing by using a condition
print( x[x>3] )

[1 3 5]
[4 5]
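
Conditions can be combined, and a boolean index can also be used on the left-hand side of an assignment. The following is a minimal sketch with made-up values:

```python
import numpy

x = numpy.arange(1, 11)

# combine conditions with & (and), | (or), ~ (not); the parentheses are required
middle = x[(x > 3) & (x < 8)]
print(middle)

outside = x[(x <= 3) | (x >= 8)]
print(outside)

# a boolean index can also select the elements to overwrite
y = x.copy()
y[y % 2 == 0] = 0
print(y)
```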


### Indexing with an array of indices

Two advantages of specifying an array of indices are that you can explicitly choose the order in which the values are returned and that you can return the value at a certain position more than once. 

In [30]:
x = numpy.arange(100, 111)
indices = numpy.array([8, 3, 8, 4, 9, 3])
print( x )
print( indices )
print( x[indices] )

[100 101 102 103 104 105 106 107 108 109 110]
[8 3 8 4 9 3]
[108 103 108 104 109 103]
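
Two further properties of indexing with index arrays are sketched below, using small made-up arrays: the result is a new array rather than a view, and in more than one dimension you can pass one index array per dimension to pick individual elements.

```python
import numpy

x = numpy.arange(100, 111)
picked = x[numpy.array([8, 3, 8])]

# the result is a new array: changing it leaves x untouched
picked[0] = -1
print(x[8])

# in 2 dimensions, one index array per dimension selects individual elements
grid = numpy.arange(12).reshape((3, 4))
rows = numpy.array([0, 1, 2])
cols = numpy.array([3, 2, 1])
selected = grid[rows, cols]
print(selected)     # elements (0,3), (1,2), (2,1)
```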


## Saving arrays to file

To save a Numpy array to a file, you always have to specify the filename and the array you want to save. The following functions can be used:
- `numpy.savetxt(filename, array)` : save an array to a text file. Some optional arguments are: `delimiter=' '`, `newline='\n'`, `header=''`. http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.savetxt.html#numpy.savetxt
- `numpy.save(filename, array)` : save an array to a binary file in numpy `.npy` format. http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.save.html#numpy.save
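
A round trip through both formats could look like the sketch below. The file names are hypothetical and the files are written to a temporary folder; `numpy.loadtxt` and `numpy.load` are the matching functions for reading the files back.

```python
import os
import tempfile

import numpy

data = numpy.arange(6).reshape((2, 3)) / 4

with tempfile.TemporaryDirectory() as folder:
    text_file = os.path.join(folder, "data.csv")      # hypothetical file name
    binary_file = os.path.join(folder, "data.npy")    # hypothetical file name

    # human-readable text with a comma as delimiter
    numpy.savetxt(text_file, data, delimiter=",")
    from_text = numpy.loadtxt(text_file, delimiter=",")

    # binary format that also stores shape and data type
    numpy.save(binary_file, data)
    from_binary = numpy.load(binary_file)

print(numpy.allclose(data, from_text))
print(numpy.array_equal(data, from_binary))
```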


# Extras about arrays

### For loops over arrays

Most of the time you want to use arrays to avoid loops, but sometimes a loop is still needed. Iterating over a 1-dimensional Numpy array works the same as iterating over a list in plain Python:

In [32]:
a = numpy.arange(5, 10)

for element in a:
    print( element )
    
for index, element in enumerate(a):
    print( "Element {} at index {}".format(element, index) )

5
6
7
8
9
Element 5 at index 0
Element 6 at index 1
Element 7 at index 2
Element 8 at index 3
Element 9 at index 4


It is also possible to iterate across a multi-dimensional Numpy array. If only one loop is used, then it will iterate over the first dimension and process a Numpy array in each step of the loop.

In [33]:
z = numpy.arange(24).reshape((2, 3, 4))
print( z )

# for loop over the first dimension:
for firstdimension in z:
    print( "Print separator" )
    print( firstdimension )

[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
Print separator
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Print separator
[[12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]]


If two nested for loops are used, then the loops iterate over the first two dimensions. 

In [34]:
for firstdimension in z:
    for seconddimension in firstdimension:
        print( "Print separator" )
        print( seconddimension )

Print separator
[0 1 2 3]
Print separator
[4 5 6 7]
Print separator
[ 8  9 10 11]
Print separator
[12 13 14 15]
Print separator
[16 17 18 19]
Print separator
[20 21 22 23]


The same process occurs for arrays with more dimensions. Each for loop iterates over one dimension and yields either Numpy arrays or single elements, depending on how many dimensions remain.

When you use a for loop over a 3 dimensional array with shape (2, 3, 4), the loop runs exactly 2 times and each iteration yields a 3 by 4 array. 
When you use two nested for loops over a 3 dimensional array with shape (2, 3, 4), the inner loop body runs exactly 2*3=6 times and each iteration yields an array of length 4. 

If you want to loop over all the elements in the array regardless of dimensionality, then you should use the `flat` attribute instead of nested loops:

In [35]:
for element in z.flat:
    print( element )

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
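
Looping over `flat` gives the values but not their positions. If the matrix index of each element is needed as well, one option is `numpy.ndenumerate`, which is not covered in the text above. A minimal sketch on a small array:

```python
import numpy

z = numpy.arange(6).reshape((2, 3))

# ndenumerate yields (matrix index, value) pairs
pairs = []
for index, value in numpy.ndenumerate(z):
    pairs.append((index, int(value)))

for index, value in pairs:
    print("Element {} at index {}".format(value, index))
```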


### Copy of matrix

When you operate on an array, you have to know whether the array was copied or only referenced, because this determines whether a change to the new array also changes the original array.

There are three cases:
- No copy at all: only a reference to the existing object. When you change anything, including the shape, the original array is also changed.
- View or shallow copy: different array objects share the same data. When you reshape the view, the change in shape is not shared with the original array, but when you change the data, the original array is changed as well. A shallow copy can be made with `y = x.view()`. Slicing an array also returns a view, so no copy of the data is made.
- Deep copy: complete copy of array and data. When you change something in the copied array, the original array remains the same. A deep copy can be made in numpy with `y = x.copy()`. 



In [38]:
# no copy
a = numpy.arange(10)
b = a
b.shape = (5,2)
b[0,1] = 1000
print( a )
print( b )

[[   0 1000]
 [   2    3]
 [   4    5]
 [   6    7]
 [   8    9]]
[[   0 1000]
 [   2    3]
 [   4    5]
 [   6    7]
 [   8    9]]


In [39]:
# shallow copy
a = numpy.arange(10)
c = a.view()
c.shape = (5,2)
c[0,1] = 1000
print( a )
print( c )


[   0 1000    2    3    4    5    6    7    8    9]
[[   0 1000]
 [   2    3]
 [   4    5]
 [   6    7]
 [   8    9]]


In [40]:
# deep copy
a = numpy.arange(10)
d = a.copy()
d.shape = (5,2)
d[0,1] = 1000
print( a )
print( d )

[0 1 2 3 4 5 6 7 8 9]
[[   0 1000]
 [   2    3]
 [   4    5]
 [   6    7]
 [   8    9]]
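
When it is unclear which of the three cases applies, it can be checked directly. The sketch below uses `numpy.shares_memory`, a function that is not mentioned in the text above, to test whether two arrays use the same data:

```python
import numpy

a = numpy.arange(10)
reference = a            # no copy
view = a.view()          # shallow copy
piece = a[2:5]           # slicing also gives a view
deep = a.copy()          # deep copy

print(reference is a)
print(numpy.shares_memory(a, view))
print(numpy.shares_memory(a, piece))
print(numpy.shares_memory(a, deep))

# writing through the slice changes the original, but not the deep copy
piece[0] = 1000
print(a[2], deep[2])
```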


# Numpy Data types

### Data types

Data types were already discussed in the previous notebook, but here is a small recap. 

One array can only have one data type. 
The data type of the array `x` can be obtained with the `x.dtype` attribute. 
In case you mix different data types, the elements are converted to the same type. If you want to specify the data type yourself, pass the `dtype` argument to the `array()` function when you create the array.
The data type of the array can be:
- float (float64)
- integer (int32 or int64)
- boolean (bool)
- complex (complex128)
- string (e.g. `<U16`)

Let's assume you created an array `z` with the data type integer and you want to convert it to the data type `float` afterwards. You can then use the `astype()` function, for example `z.astype('float')`.

### Data types in Python

Before discussing Numpy datatypes, first a small list of the data types in Python:

##### Immutable types:
- boolean (True, False)
- int (integer)
- float
- complex 
- str (string)
- bytes
- tuple ( )

Objects of these types cannot be changed after they are created: an operation that appears to modify one creates a new object instead.

##### Mutable types:
- list [ ]
- set
- dict { } (dictionary)

Objects of these types can be changed in place after being created.

#### Examples of Numpy arrays of different types:

In [2]:
# array with integers:
x = numpy.array([1, 3, 5, 7, 9])
print(x, x.dtype)

(array([1, 3, 5, 7, 9]), dtype('int64'))


In [3]:
# array with floats:
y = numpy.array([2.2, 4.4, 6.6, 8.8])
print(y, y.dtype)

(array([ 2.2,  4.4,  6.6,  8.8]), dtype('float64'))


In [4]:
# array with booleans:
z = numpy.array([True, False, True])
print(z, z.dtype)

(array([ True, False,  True], dtype=bool), dtype('bool'))


In [5]:
# array with strings:
x = numpy.array(["a", "b", "cde"])
print(x, x.dtype)

(array(['a', 'b', 'cde'], 
      dtype='|S3'), dtype('S3'))


### Type conversion

In case you mix different data types and you do not explicitly specify the data types, then the elements are converted to the same type. 

You can specify the data type of the array when you create the array with the `array()` function using the `dtype` keyword argument. Note that not every combination is possible. For example, when your array contains text, you cannot choose float as the data type, because text cannot be converted to floats unless it exactly represents a floating point number.

In [6]:
# array with mixed data types:
x = numpy.array([1, 3.4, True, 2.3+4.5j, "a"])
print(x, x.dtype)

(array(['1', '3.4', 'True', '(2.3+4.5j)', 'a'], 
      dtype='|S64'), dtype('S64'))


In [7]:
# explicitly specify the data type:
y = numpy.array([9, 8, 7, 6], dtype='float')
print(y, y.dtype)

(array([ 9.,  8.,  7.,  6.]), dtype('float64'))


In [8]:
# Try to convert strings to integers

strings = ["12", "3", "24"]
z = numpy.array(strings, dtype='int')
print (z, z.dtype)

(array([12,  3, 24]), dtype('int64'))


In [9]:
numerals = ["one", "two", "three"]
a = numpy.array(numerals, dtype='int')
print (a, a.dtype)

ValueError: invalid literal for long() with base 10: 'one'

### Convert to a different data type

The `astype` function is another way to convert between different data types.
For example, `x.astype(float)` converts `x` to the data type float. 
Converting data types is called casting. 

In [None]:
# convert an array with integers to the data type float:
x = numpy.array([1, 3, 5, 7, 9])
print(x, x.dtype)
g = x.astype('float')
print(g, g.dtype)

In [None]:
# convert an array with strings to float:
y = numpy.array(["1.4", "3.4", "5.4"])
print(y, y.dtype)
h = y.astype('float')
print(h, h.dtype)

Sometimes the data type is converted but the content changes slightly. 
For example, when converting from float to integer, the fractional part is dropped, so the numbers are rounded towards zero (which equals rounding down only for positive numbers). 
Likewise, when converting an array with only zeros and ones to the data type Boolean, each 0 is converted to False and each 1 to True. 

In [None]:
# from float to integer
x = numpy.array([2.2, 3.2, 2.8])
print(x, x.dtype)
a = x.astype('int')
print(a, a.dtype)
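
The cell above only contains positive numbers. With negative numbers it becomes visible that `astype('int')` drops the fractional part instead of rounding down. The sketch below, with made-up values, compares it with `numpy.floor` and `numpy.round`:

```python
import numpy

x = numpy.array([-2.8, -0.5, 2.2, 2.8])

truncated = x.astype('int')               # drops the fractional part
floored = numpy.floor(x).astype('int')    # always rounds down
nearest = numpy.round(x).astype('int')    # rounds to the nearest integer

print(truncated)
print(floored)
print(nearest)
```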

In [None]:
# from 0-1 to boolean
x = numpy.array([0, 1, 1, 0])
print(x, x.dtype)
a = x.astype('bool')
print(a, a.dtype)

Not every data type can be converted to all other data types. Some examples:


In [None]:
# from string to boolean
x = numpy.array(["true", "false"])
a = x.astype('bool')
print(a, a.dtype)

In [None]:
# from non-numeric strings to integer
x = numpy.array(["a", "b", "c"])
#x = numpy.array(["1", "2"])
a = x.astype('int')  # raises ValueError; the commented-out strings above would convert
print(a, a.dtype)

### To infinity and beyond

In Numpy, division by zero results in inf (infinity) and a RuntimeWarning. This is different from plain Python, where division by zero raises a ZeroDivisionError.

In [None]:
# division by zero in Numpy
x = numpy.array([4])
y = numpy.array([0])
print(x/y)
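
How Numpy reacts to a division by zero can be changed for a block of code. The sketch below uses the `numpy.errstate` context manager, which is not covered in the text above, first to silence the warnings and then to turn the warning into an exception:

```python
import numpy

x = numpy.array([4.0, 0.0, -1.0])
y = numpy.zeros(3)

# silence the warnings for this block only
with numpy.errstate(divide='ignore', invalid='ignore'):
    result = x / y
print(result)          # inf, nan and -inf

# or turn the warning into an exception
try:
    with numpy.errstate(divide='raise'):
        numpy.array([4.0]) / numpy.array([0.0])
except FloatingPointError as error:
    print("caught:", error)
```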

#### Values that are infinity

When you want to use infinity as a value for a variable or inside an array, you can use the following expressions:
- `numpy.inf` for positive infinity
- `-numpy.inf` for negative infinity
- `numpy.PINF` and `numpy.NINF` are older aliases for positive and negative infinity; they were removed in NumPy 2.0

To check whether the value of a variable is infinity, the following functions are useful:
- `numpy.isinf(x)` : This function returns True when the value of x is either positive infinity or negative infinity. 
- `numpy.isneginf(x)` : This function returns True when the value of x is negative infinity.
- `numpy.isposinf(x)` : This function returns True when the value of x is positive infinity.
- `numpy.isfinite(x)` : This function returns True when the value of x is neither infinity nor NaN, so it is the opposite of `isinf()` only for values that are not NaN.

#### Similarly, you can create NaN values

Besides infinity there is Not A Number (NaN), the result of undefined operations such as 0/0:
- `numpy.nan` : the NaN value
- `numpy.isnan(x)` : This function returns True when the value of x is Not A Number (NaN).
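
NaN behaves differently from ordinary numbers in comparisons and in reductions. The sketch below illustrates this with made-up values; `numpy.nanmean` is not mentioned in the text above and is shown as one possible way to ignore NaN values in a mean.

```python
import numpy

values = numpy.array([1.0, numpy.nan, 3.0, numpy.inf])

# NaN is not equal to anything, not even to itself
print(numpy.nan == numpy.nan)

nan_mask = numpy.isnan(values)
finite_mask = numpy.isfinite(values)
print(nan_mask)
print(finite_mask)

# NaN propagates through ordinary reductions; nanmean skips it
plain_mean = numpy.mean(values[:3])
nan_aware_mean = numpy.nanmean(values[:3])
print(plain_mean, nan_aware_mean)

# keep only the finite values
finite_part = values[finite_mask]
print(finite_part)
```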

# Structured arrays

A structured array consists of a number of columns, where each column can be a different datatype. 

Full information about structured arrays: 
http://docs.scipy.org/doc/numpy-1.10.1/user/basics.rec.html#structured-arrays

One of the possible ways to specify a structured array is to use a list of tuples as `dtype`. 
For every column in the array, a tuple is specified with the name of the column and the type of data in it. For example: 

In [14]:
dtype = [('Name', 'U10'), ('Country', 'U10'), ('Area', 'float64')]

The content of the array can then be given as a list of tuples, like so:

In [15]:
city = numpy.array([('Amsterdam', 'Netherlands', 219.3),
                    ('Paris',     'France',      105.4 ),
                    ('Barcelona', 'Spain',       101.9 )],
                     dtype=dtype)
print( city )

[('Amsterdam', 'Netherland', 219.3) ('Paris', 'France', 105.4)
 ('Barcelona', 'Spain', 101.9)]

Note that `'Netherlands'` is printed as `'Netherland'`: the `Country` column was declared as `U10`, a string of at most 10 characters, so longer values are silently truncated.


### Dimensionality of structured arrays

Although structured arrays consist of rows and columns, Numpy treats them as one-dimensional arrays in which each row is a single element.


In [16]:
# Print information about the array
print( city.shape )
print( city.dtype )

(3,)
[('Name', '<U10'), ('Country', '<U10'), ('Area', '<f8')]


### Indexing structured arrays

The rows in a structured array can be accessed by standard array indexing. The columns of the array are indexed by using the column names that were specified when the array was created.

In [17]:
# Access first row
print( city[0] )

# Access first two rows
print( city[0:2] )

# Access column by name
print( city['Area'] )

# Access two columns using list of names
print( city[['Name', 'Area']] )

('Amsterdam', 'Netherland', 219.3)
[('Amsterdam', 'Netherland', 219.3) ('Paris', 'France', 105.4)]
[ 219.3  105.4  101.9]
[('Amsterdam', 219.3) ('Paris', 105.4) ('Barcelona', 101.9)]
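
Column access combines naturally with boolean indexing and sorting. The sketch below rebuilds the `city` array so that it runs on its own; as a deliberate change, the `Country` field is declared as `U12` instead of `U10` so that the longest name fits. Sorting by a column uses the `order` argument of `numpy.sort`, which is not discussed in the text above.

```python
import numpy

# same layout as the city array above, but with a wider Country field
dtype = [('Name', 'U10'), ('Country', 'U12'), ('Area', 'float64')]
city = numpy.array([('Amsterdam', 'Netherlands', 219.3),
                    ('Paris',     'France',      105.4),
                    ('Barcelona', 'Spain',       101.9)],
                   dtype=dtype)

# a boolean index built from one column selects whole rows
large = city[city['Area'] > 105]
print(large['Name'])

# sort the rows by the values of one column
by_area = numpy.sort(city, order='Area')
print(by_area['Name'])
```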


### Accessing and modifying column names in structured arrays

The column names are stored in the `.dtype.names` attribute, which can be read and also reassigned.

In [18]:
print( city.dtype.names )

('Name', 'Country', 'Area')


In [19]:
city.dtype.names = ('name', 'country', 'area')
print( city['area'] )

[ 219.3  105.4  101.9]


### Loading data into structured arrays

Structured arrays are useful for loading and working with tabular data with heterogeneous column types. 

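Tabular data can be loaded directly from a text file into a structured array with `genfromtxt`. With `names=True` the column names are read from the first line of the file: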

In [21]:
population = numpy.genfromtxt("populations.txt", 
             names=True,
             dtype=['int','float','float','float'])

# Print the lynx column
print( population['lynx'] )

[  4000.   6100.   9800.  35200.  59400.  41700.  19000.  13000.   8300.
   9100.   7400.   8000.  12300.  19500.  45700.  51100.  29700.  15800.
   9700.  10100.   8600.]
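
The cell above needs the file `populations.txt`, whose content is not shown here. To try `genfromtxt` without that file, the sketch below reads from an in-memory text buffer instead. The column names and the three rows are illustrative placeholders that only assume a whitespace-separated layout with a header line; they are not taken from the actual file.

```python
import io

import numpy

# illustrative rows, assuming a header line followed by numeric columns
text = """year hare lynx carrot
1900 30000 4000 48300
1901 47200 6100 48200
1902 70200 9800 41500
"""

# names=True takes the column names from the first line;
# without a dtype argument every column is read as float
data = numpy.genfromtxt(io.StringIO(text), names=True)

print(data.dtype.names)
print(data['lynx'])
```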


# Vectorized operations

## Basic operations on arrays with the same shape

The basic operations on arrays are applied elementwise.
The basic operations are addition, subtraction, multiplication, division and power.
The simplest case is when the shapes of the arrays are exactly the same; an elementwise operation is then straightforward. 

In [101]:
# basic operations between two arrays with the same shape:
x = numpy.array([10, 20, 30, 40])
y = numpy.array([5, 7, 52, 34])

print("y - x = ", y - x)
print("x + y = ", x + y)
print("x * y = ", x * y)
print("x / y = ", x / y)

y - x =  [ -5 -13  22  -6]
x + y =  [15 27 82 74]
x * y =  [  50  140 1560 1360]
x / y =  [ 2.          2.85714286  0.57692308  1.17647059]


## Basic operations on arrays with different shapes

Besides operations between arrays of the same shape, operations between arrays of different shapes are also allowed, although they are not always possible. Applying an operation to arrays with different shapes is called broadcasting.

There are a few different types of broadcasting:
- Basic operations between an array and a constant: there are no restrictions on the shape.

- Basic operations between an array and a row vector: the number of columns in the array has to be the same as the length of the row vector.

- Basic operations between an array and a column vector: the number of rows in the array has to be the same as the length of the column vector.

For example, let $x$ be a $2\times 3$ array, let $y$ be a row vector with 3 elements, and let $z$ be a column vector with 2 elements.

When applying operations between array $x$ and row vector $y$, the operation is applied to each row of $x$, which is why the number of columns in the array has to be the same as the length of the row vector.

When applying operations between array $x$ and column vector $z$, the operation is applied to each column of $x$, which is why the number of rows in the array has to be the same as the length of the column vector. 
When the number of rows or columns does not match, the operation raises an error.

For more information about Broadcasting:
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

In [102]:
# constant term
x = numpy.array([20, 25, 30, 35])
print("x - 2 = ", x - 2)
print("x * 2 = ", x * 2)
print("x **2 = ", x**2)

x - 2 =  [18 23 28 33]
x * 2 =  [40 50 60 70]
x **2 =  [ 400  625  900 1225]


In [103]:
# operations between array and vector
x = numpy.array([[1, 2, 3], [4, 5, 6]])
y = numpy.array([5, 5, 5]) # row vector
z = numpy.array([[1], [2]]) # column vector

print(x)
print(y)
print(z)

[[1 2 3]
 [4 5 6]]
[5 5 5]
[[1]
 [2]]


In [104]:
# array and row vector
print("Operations between x and y which are applied for each row")
print("x + y = \n", x+y)
print("x * y = \n", x*y)

Operations between x and y which are applied for each row
x + y = 
 [[ 6  7  8]
 [ 9 10 11]]
x * y = 
 [[ 5 10 15]
 [20 25 30]]


In [105]:
# array and column vector
print("Operations between x and z which are applied for each column")
print("x + z = \n", x+z)
print("x * z = \n", x*z)

Operations between x and z which are applied for each column
x + z = 
 [[2 3 4]
 [6 7 8]]
x * z = 
 [[ 1  2  3]
 [ 8 10 12]]
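
Two more cases are worth trying out: a column vector combined with a row vector, and shapes that do not fit together. The following is a small sketch with made-up values:

```python
import numpy

x = numpy.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
column = numpy.array([[10], [20]])        # shape (2, 1)
row = numpy.array([1, 2, 3])              # shape (3,)

# a column vector and a row vector broadcast to a full 2 by 3 table
table = column + row
print(table)

# lengths that do not match raise an error
try:
    x + numpy.array([1, 2])               # shape (2,) against (2, 3)
except ValueError as error:
    print("broadcasting failed:", error)
```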


## Boolean operations on arrays

Boolean conditions can also be applied to arrays. They are applied to every element in the array. Several different conditions can be used, such as: equal to (`==`), not equal to (`!=`), greater than (`>`) or greater than or equal to (`>=`), and smaller than (`<`) or smaller than or equal to (`<=`). 

In [107]:
# boolean operations on arrays
x = numpy.array([10, 20, 30, 14, 15, 16])
y = numpy.array([7, 5, 5, 7, 5, 7]) 
print("(x > 15) = ", x>15)
print("(y == 7) = ", y==7)

(x > 15) =  [False  True  True False False  True]
(y == 7) =  [ True False False  True False  True]
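
The boolean array that a condition returns can be summarized or used to choose between values. The sketch below uses `any`, `all`, `sum` and `numpy.where`; none of these uses are described in the text above, and the labels `"high"` and `"low"` are made up for the example.

```python
import numpy

x = numpy.array([10, 20, 30, 14, 15, 16])
mask = x > 15

print(mask.any())        # is at least one element larger than 15?
print(mask.all())        # are all elements larger than 15?
print(mask.sum())        # True counts as 1, so this counts the matches

# choose elementwise between two values depending on the condition
labels = numpy.where(mask, "high", "low")
print(labels)
```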


## Mathematical functions applied on vectors

A lot of mathematical functions can be applied to arrays and they are applied elementwise, such as:
- numpy.sqrt(x): square root
- numpy.sin(x): sine
- numpy.cos(x): cosine
- numpy.tan(x): tangent
- numpy.exp(x): exponential
- numpy.log(x): natural logarithm

In [108]:
x = numpy.array([1, 2, 3, 4])
print("x = ", x)
print("sqrt(x) = ", numpy.sqrt(x))
print("sin(x) = ", numpy.sin(x) )
print("cos(x) = ", numpy.cos(x) )
print("tan(x) = ", numpy.tan(x) )
print("exp(x) = ", numpy.exp(x) )
print("log(x) = ", numpy.log(x) )

x =  [1 2 3 4]
sqrt(x) =  [ 1.          1.41421356  1.73205081  2.        ]
sin(x) =  [ 0.84147098  0.90929743  0.14112001 -0.7568025 ]
cos(x) =  [ 0.54030231 -0.41614684 -0.9899925  -0.65364362]
tan(x) =  [ 1.55740772 -2.18503986 -0.14254654  1.15782128]
exp(x) =  [  2.71828183   7.3890561   20.08553692  54.59815003]
log(x) =  [ 0.          0.69314718  1.09861229  1.38629436]


## Reductions

Some functions can be applied to the entire array or to only one dimension:

- x.sum() and numpy.cumsum(x)
- x.min() and x.argmin()
- x.max() and x.argmax()

These functions have a parameter called `axis`. When `axis=0`, the sum (or the minimum, etc.) per column is returned. When `axis=1`, the sum per row is returned. In higher dimensional arrays, the same logic applies. 

One important thing to notice is that when `argmin` or `argmax` is applied without an `axis`, the index of the minimum or maximum is returned as a linear index and not as an index in all the dimensions (see  [2b_arrays.ipynb](2b_arrays.ipynb))


In [109]:
x = numpy.array([[1, 6, 5], [2, 7, 8]])

# functions applied to the entire array:
print("sum:", x.sum())
print("minimum:", x.min(), "and index of minimum:", x.argmin())
print("maximum:", x.max(), "and index of maximum:", x.argmax())

sum: 29
minimum: 1 and index of minimum: 0
maximum: 8 and index of maximum: 5


In [110]:
# functions applied to only one dimension of the array:
print("column sums:", x.sum(axis=0))
print("row sums:", x.sum(axis=1))
print("minimum per column:", x.min(axis=0))
print("maximum per row:", x.max(axis=1))

column sums: [ 3 13 13]
row sums: [12 17]
minimum per column: [1 6 5]
maximum per row: [6 8]
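
As a sketch of how these pieces fit together, the linear index from `argmax` can be converted to a row and column with `numpy.unravel_index`, and `argmax` itself also accepts the `axis` parameter. The last line shows that `cumsum` keeps a running total rather than reducing to one value.

```python
import numpy

x = numpy.array([[1, 6, 5], [2, 7, 8]])

# linear index of the maximum, converted to a (row, column) index
row, column = numpy.unravel_index(x.argmax(), x.shape)
print(row, column)

# with an axis, argmax returns one position per column or per row
per_column = x.argmax(axis=0)
per_row = x.argmax(axis=1)
print(per_column)
print(per_row)

# cumsum keeps the running total instead of reducing to a single value
running = numpy.cumsum(x, axis=1)
print(running)
```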


## Sorting
Arrays can be sorted in a way that is similar to sorting lists in Python. The functions `sort` and `argsort` can be applied to arrays: `sort` orders the values, and `argsort` returns the indices that would sort the array.
When applied to a 2-dimensional array, both functions work along the last dimension by default, so every row is sorted separately and the indices refer to positions within the row.

In [113]:
# sorting a 1-dimensional array:
print("Applied to 1-dimensional array")
x = numpy.array([5, 3, 6, 2, 6, 8])
print("unsorted x:", x)
y = x.argsort()
x.sort()
print("sorted x: ", x)
print("indices of argsort:", y)

Applied to 1-dimensional array
unsorted x: [5 3 6 2 6 8]
sorted x:  [2 3 5 6 6 8]
indices of argsort: [3 1 0 2 4 5]


In [114]:
# sorting a 2-dimensional array:
print("Applied to 2-dimensional array")
x = numpy.array([[5, 3, 6], [2, 6, 8]])
print("unsorted x:", x)
y = x.argsort()
x.sort()
print("sorted x: ", x)
print("indices of argsort:", y)

Applied to 2-dimensional array
unsorted x: [[5 3 6]
 [2 6 8]]
sorted x:  [[3 5 6]
 [2 6 8]]
indices of argsort: [[1 0 2]
 [0 1 2]]
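
The method `x.sort()` used above sorts the array in place. As a complementary sketch, the function `numpy.sort` returns a sorted copy instead, and its `axis` parameter selects the direction of the sort. The names `by_row`, `by_column` and `flat` are illustrative only.

```python
import numpy

x = numpy.array([[5, 3, 6], [2, 6, 8]])

# numpy.sort returns a sorted copy and leaves x unchanged
by_row = numpy.sort(x)             # default axis=-1: sort within each row
by_column = numpy.sort(x, axis=0)  # sort within each column
flat = numpy.sort(x, axis=None)    # flatten first, then sort all values

print("sorted per row:", by_row)
print("sorted per column:", by_column)
print("flattened and sorted:", flat)
print("x is unchanged:", x)
```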


## Reversing

Slicing with a negative step, `a[::-1]`, returns a view of the array in reverse order.

In [115]:
a = numpy.random.randint(0,10,5)
print(a)
print()
print(a[::-1])

[8 2 8 3 7]

[7 3 8 2 8]
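
Because the reversed result is a view and not a copy, it shares its data with the original array. The sketch below shows this with a fixed array (instead of random numbers, so the result is reproducible) and also reverses the axes of a 2-dimensional array one at a time; the arrays `a` and `m` are made up for illustration.

```python
import numpy

a = numpy.array([8, 2, 8, 3, 7])
r = a[::-1]  # reversed view, no data is copied

# writing through the view changes the original array as well
r[0] = 100
print(a)

# in a 2-dimensional array each axis can be reversed separately
m = numpy.array([[1, 2, 3], [4, 5, 6]])
rows_reversed = m[::-1]        # reverse the order of the rows
columns_reversed = m[:, ::-1]  # reverse the order of the columns
print(rows_reversed)
print(columns_reversed)
```

If an independent reversed array is needed, calling `.copy()` on the view is one way to get it.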


### Rounding 

If you want to round every element in the array then the following rounding functions can be used:
- numpy.round(x, decimals = 2 )
- numpy.floor(x)
- numpy.ceil(x)


In [117]:
# rounding 
x = 10*numpy.random.random((1,5))
print("not rounded:", x)

x1 = numpy.round(x, decimals = 2)
print("round:", x1)

x2 = numpy.floor(x)
print("floor:", x2)

x3 = numpy.ceil(x)
print("ceil:", x3)

not rounded: [[ 6.62056169  9.99372912  5.54082401  5.97427961  1.36261238]]
round: [[ 6.62  9.99  5.54  5.97  1.36]]
floor: [[ 6.  9.  5.  5.  1.]]
ceil: [[  7.  10.   6.   6.   2.]]
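
One detail worth checking on values that lie exactly halfway between two integers: `numpy.round` rounds such ties to the nearest even value, which can differ from the "round half up" rule taught in school. A small sketch with hand-picked values:

```python
import numpy

x = numpy.array([0.5, 1.5, 2.5, -1.5])

rounded = numpy.round(x)   # ties go to the nearest even value
floored = numpy.floor(x)   # largest integer not greater than x
ceiled = numpy.ceil(x)     # smallest integer not less than x

print("round:", rounded)
print("floor:", floored)
print("ceil: ", ceiled)
```

Note that 0.5 and 2.5 are rounded down to 0 and 2, while 1.5 is rounded up to 2.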


### Statistics

To apply some basic statistical functions to the numpy array x, the following functions can be useful:
- numpy.median(x) : median
- numpy.mean(x) : mean
- numpy.average(x, axis= , weights= ) : (weighted) average
- numpy.std(x) : standard deviation
- numpy.var(x) : variance
- numpy.cov(x) : covariance matrix
- numpy.corrcoef(x) : Pearson product-moment correlation coefficients

Most of these functions can be applied to the entire array or along one axis, using the `axis` parameter (`cov` and `corrcoef` have no `axis` parameter and by default treat each row as one variable). Similar functions exist which ignore NaN values: `nanmedian`, `nanmean`, `nanstd`, `nanvar`.
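
As a short illustration of some of the functions listed above, the following sketch computes a few statistics on two small hand-made arrays. The arrays and the weights are arbitrary example values.

```python
import numpy

y = numpy.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

overall_mean = numpy.mean(y)           # mean of all elements
column_mean = numpy.mean(y, axis=0)    # mean of each column
row_std = numpy.std(y, axis=1)         # standard deviation of each row

# weighted average per row; the last column counts twice as much
row_weighted = numpy.average(y, axis=1, weights=[1, 1, 2])

# with a missing value, mean returns nan while nanmean skips it
z = numpy.array([[1.0, 2.0, 3.0], [4.0, numpy.nan, 6.0]])
plain = numpy.mean(z)
ignoring_nan = numpy.nanmean(z, axis=0)

print(overall_mean, column_mean, row_std)
print(row_weighted)
print(plain, ignoring_nan)
```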

For more statistical functions in numpy: http://docs.scipy.org/doc/numpy/reference/routines.statistics.html

# SciPy

SciPy is designed to be the library package for scientific and technical computing. What that means is a bit fuzzy.

But it has a large set of really useful tools, the building blocks of models: linear algebra, optimization routines, integration routines, tools for dealing with sparse datasets, and a number of special purpose tools.
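
To give a flavour of these building blocks, here is a minimal sketch that touches three of the areas mentioned above: integration, optimization and linear algebra. It assumes SciPy is installed; the integrand, the function being minimized and the linear system are toy examples chosen so that the exact answers are known.

```python
import numpy
from scipy import integrate, linalg, optimize

# integration: integral of sin(t) from 0 to pi (exact value: 2)
area, error_estimate = integrate.quad(numpy.sin, 0, numpy.pi)
print("integral:", area)

# optimization: location of the minimum of (t - 2)^2 (exact value: t = 2)
result = optimize.minimize_scalar(lambda t: (t - 2) ** 2)
print("minimum at:", result.x)

# linear algebra: solve the system A v = b (exact solution: v = [2, 3])
A = numpy.array([[3.0, 1.0], [1.0, 2.0]])
b = numpy.array([9.0, 8.0])
v = linalg.solve(A, b)
print("solution:", v)
```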


https://docs.scipy.org/doc/scipy/reference/

## Exercises

### Exercise 1
Form the 2-D array (without typing it in explicitly):
```python
[[1,  6, 11],
 [2,  7, 12],
 [3,  8, 13],
 [4,  9, 14],
 [5, 10, 15]]
```
and generate a new array containing its 2nd and 4th rows.

### Exercise 2

Generate a $5\times 5 \times 5$ 3D array of random numbers between -10.0 and 10.0. Reshape it to a $5 \times 25$ matrix, and extract the first two rows of this matrix. 


### Exercise 3

Load the content of the file [populations.txt](populations.txt) into a numpy array. Extract the first column into a vector and assign it to variable named `year`, extract the second column and assign it to variable `hare`, etc for the four columns.

Convert the variables `year` and `carrot` into the datatype `int`.


### Exercise 4

Modify the code above to determine what happens when you use a list of lists instead of a list of tuples to define the types. What happens if you do so when specifying the values in a structured array?

### Exercise 5

Create a $4\times3$ matrix of random numbers between $0$ and $1$. 
Find the row and column position of the minimum and the maximum value.

### Exercise 6

Uncomment and complete the following code to print the years with the smallest number of hares, lynxes and carrots in the populations dataset.

In [28]:
#for species in [....]:
#    year = ...
#    print("Least # of {} in year {}".format(species, year))

### Exercise 7

Use the population data to

1. Select all the years in which there are more than 50000 lynxes;
2. Select all the years in which there are more lynxes than hares.

### Exercise 8

Indexing with an array is often useful when we want to randomize the order of items in some data. Uncomment and complete the following code, which creates a scrambled version of the population data:

In [31]:
## Create an index for the rows of population (from 0 to population.shape[0])
#index = ...

## Shuffle the index
#numpy.random.shuffle(index)

## Create a scrambled version
#population_rand = ...

### Exercise 9

Save the population data to a `.npy` file. Figure out how to load it back into a numpy array.

### Exercise 10
The files

- [irisa.txt](irisa.txt)
- [irisb.txt](irisb.txt)
- [irisc.txt](irisc.txt)

contain the data for the iris dataset. Each file has these columns:

- `SepalLength` 
- `SepalWidth`
- `PetalLength` 
- `PetalWidth` 
- `Species`

Load this data, and create a single array with all the species.

### Exercise 11

Some conversions between datatypes lose information, and are therefore not reversible. 
Decide which of the following conversions are lossy i.e. not reversible. Write examples to check your guess.

1. float -> int
2. bool  -> int
3. int   -> '<U16'
4. int   -> float
5. float64 -> float32
6. float32 -> float64

In [None]:
# write your examples here

### Exercise 12

The infinity value (numpy.inf) can be included in an array with some datatypes but not all. Write example code to show which datatypes are consistent with including infinity and which are not.

In [None]:
# write your examples here

### Exercise 13

The following function `linetofloat` takes a string which contains float numbers separated by spaces and returns an array of floats. Complete the definition of the function.
For example:

```
linetofloat("3.14 12.3 4.0") -> array([3.14, 12.3, 4.0])
```

In [None]:
def linetofloat(text):
    #
    #
    #
    return

### Exercise 14

Open and inspect the file [populations.txt](populations.txt). It contains some numerical data. 
Search the internet to find out which numpy function you can use to load this data into a numpy array. Load the data, and convert it to `float32`.

In [None]:
# convert values in populations.txt to float32 here

### Exercise 15

Uncomment and complete the following code loading the data from file [populations.txt](populations.txt). Load the year column as an `int`, and the other columns as `float`.

In [1]:
#dtype = [('year',  ...
#         ('hare',  ...
#         ...
#          ] 
#population = numpy.loadtxt("populations.txt", dtype=...)

### Exercise 16

Define a function `standardize` which converts a vector of numbers to z-scores.

In [106]:
def standardize(x):
    return ...

### Exercise 17

- Define a function `to_cm` which takes a vector of measurements in inches and converts them to centimeters.
- Define a function `to_celsius` which takes a vector of measurements in Fahrenheit and converts them to Celsius: C = (F-32)/1.8


### Exercise 18

Define a function `scale` which takes a vector of numbers and brings them to the range from 0 to 1:
$$\mathrm{scale}(x_i) = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

In [111]:
def scale(x):
    return (x - x.min())/(x.max() - x.min())

### Exercise 19

The function `softmax` is often used in machine learning and statistics to convert a vector of arbitrary numbers into a vector of probabilities summing up to $1$. Softmax is computed by taking the exponential of each number, and then dividing each exponential by the sum of all the exponentials:
$$ \mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{k=1}^N \exp(x_k)}$$

Implement the softmax function. Verify that in the resulting vector all numbers are between 0 and 1. Verify that the resulting numbers sum up to $1$.



In [112]:
def softmax(x):
    ...

### Exercise 20

The file `winequality-red.csv` contains measurements of wine samples, together with a quality rating. You can load this data into a structured array like this:

In [116]:
data = numpy.genfromtxt("winequality-red.csv", names=True, delimiter=';')

- Sort the data according to the quality rating, from lowest to highest
- Now sort the wines from highest to lowest