# 2. Arrays and shapes

Make sure that you import numpy first! Otherwise none of the functions in this notebook are available.

In [2]:
import numpy

### Create an array
An array is an ordered collection of elements that all have the same data type. 
A numpy array can be created from an existing list or tuple with the array() function. 
Make sure you do not forget the [ ] or ( ) brackets! 
An n-dimensional array can also be created from nested lists and tuples.

For more information about creating arrays: 
http://docs.scipy.org/doc/numpy/user/basics.creation.html

In [8]:
# create array from a list:
print( numpy.array([1, 3, 5, 7, 9]) )

# create array from a tuple:
print( numpy.array((2, 4, 6, 8)) )

[1 3 5 7 9]
[2 4 6 8]
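
The cell above only creates 1-dimensional arrays. As a small sketch of the nested case mentioned in the text (the values are made up for illustration), a list of lists or a mix of tuples and lists gives a 2-dimensional array:

```python
import numpy

# a list of lists gives a 2-dimensional array: every inner list becomes one row
m = numpy.array([[1, 2, 3], [4, 5, 6]])
print(m)
print(m.shape)

# tuples and lists can be mixed, as long as all rows have the same length
n = numpy.array([(1, 2), [3, 4], (5, 6)])
print(n.shape)
```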


An uninitialized array, or an array filled with only zeros or ones, can be created with the empty(), zeros() and ones() functions. 
The shape of the array is the input for these functions. 
More about the shape and dimensionality is explained in the next section. 
The empty() function does not initialize the values, so the content of the array is arbitrary and is not always the same.

In [9]:
# create an empty array:
print( numpy.empty((3)) )

# create array with only zeros:
print( numpy.zeros((4)) )

# create array with only ones:
print (numpy.ones((5)) )

[ 0.  0.  0.]
[ 0.  0.  0.  0.]
[ 1.  1.  1.  1.  1.]


If we want to create an array with random numbers, we can use the numpy.random.random() function, which gives an array with numbers between 0 and 1 (0 is included, 1 is not). To get an array with values between some numbers a and b, let x be an array with numbers between 0 and 1 and use the transformation y=a+(b-a)*x.

In [11]:
# create array with random numbers between 0 and 1:
print(numpy.random.random(4))

# create array with random numbers between 5 and 10:
print( 5+(10-5)*numpy.random.random(4) )

[ 0.19233279  0.3567885   0.44667283  0.57190252]
[ 7.13343011  7.78418765  9.39459221  9.54447269]


The identity matrix is an n by n matrix with ones on the diagonal and zeros everywhere else. 
The identity matrix can be created with the eye(n) function. Since the identity matrix is always square, only one input parameter is needed to create the 2-dimensional matrix.

In [12]:
# create n by n identity matrix:
print( numpy.eye(2))

[[ 1.  0.]
 [ 0.  1.]]


Numpy has a function that is similar to the range() function: arange(a, b, s), where a is the start point, b is the end point (which is not included in the result) and s is the step size. The function can have floats as inputs.
Another function to create a range is the linspace(a, b, i) function, where a is the start point, b is the end point (which is included by default), and i is the number of items.
The advantage of the linspace function is that you can specify the number of items and the advantage of the arange function is that you can specify the step size.

In [14]:
# use fixed step size:
x = numpy.arange(1, 11.9, 2.1)
print(x)

# use fixed number of items:
y = numpy.linspace(0, 2, 9)
print(y)

[  1.    3.1   5.2   7.3   9.4  11.5]
[ 0.    0.25  0.5   0.75  1.    1.25  1.5   1.75  2.  ]


### Dimensions
The array class in numpy is called "ndarray". 
An array can be 1-dimensional, which is often called a vector. A 1-dimensional array has only one axis, so strictly speaking it is neither a row nor a column vector, although it is displayed as a row. In order to create a column vector, which is a 2-dimensional array with one column, we can use for example: zeros((m,1)).
An array can be 2-dimensional, where m is the number of rows and n the number of columns. 
In case of a 2-dimensional array, the first dimension is the rows and the second dimension is the columns.
In order to create an array with m rows and n columns, we can use for example: zeros((m,n)).
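
The difference between a 1-dimensional array and a row or column vector can be checked with the ndim and shape attributes. This is a minimal sketch; the sizes m and n are hypothetical values chosen only for illustration:

```python
import numpy

m, n = 3, 2  # hypothetical sizes, chosen only for illustration

v = numpy.zeros(m)         # 1-dimensional array with m elements
col = numpy.zeros((m, 1))  # 2-dimensional array: m rows, 1 column
row = numpy.zeros((1, m))  # 2-dimensional array: 1 row, m columns
mat = numpy.zeros((m, n))  # 2-dimensional array: m rows, n columns

for name, a in [("v", v), ("col", col), ("row", row), ("mat", mat)]:
    print(name, "ndim:", a.ndim, "shape:", a.shape)
```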

An array can be n-dimensional. It might be hard to understand the contents if the array is displayed. 
For example, take an array with dimensions (2, 3, 4) then this will be displayed as if it is a list with two 3 by 4 matrices.
The first 3 by 4 matrix will display the elements of the 2nd and 3rd dimension in case the index of the first dimension is 0 and the second 3 by 4 matrix will display the elements of the 2nd and 3rd dimension in case the index of the first dimension is 1.

If an array is 1-dimensional and you want to create a higher-dimensional array from it, you can use the reshape() function. Notice that the total number of elements in the array has to be the same as the product of the lengths of the dimensions. For example, if the length of the array is 24, then we can reshape it to a 4 by 6 matrix, but also to a 2 by 3 by 4 array.
Let's assume we have a 2 by 3 by 4 matrix, which we will call z. Since the index in Python starts at 0, the first element of the array is z[0,0,0], but the last element of the array is not z[2, 3, 4] but it is z[1, 2, 3]! 

Let x be the ndarray. Some important attributes to get insight into the dimensionality:
- x.ndim : the number of dimensions.
- x.shape : the length of each dimension.
- x.size : the total number of elements.


In [16]:
# create one dimensional arrays:
print( numpy.zeros((2)) )
print( numpy.array([1, 2, 3]) )

# create two dimensional arrays:
print( numpy.arange(2, 14, 2).reshape((2, 3)) )
print( numpy.array([[1, 3], [2, 4]]) )

[ 0.  0.]
[1 2 3]
[[ 2  4  6]
 [ 8 10 12]]
[[1 3]
 [2 4]]


In [17]:
# create three dimensional arrays:
z1 = numpy.arange(24).reshape((2, 3, 4))
z2 = numpy.array([[[1, 3], [2, 4]], [[11, 13], [12, 14]] ])

# dimensions/shape/size of the array:
print(z1)
print("Number of dimensions:", z1.ndim)
print("Length of each dimension:", z1.shape)
print("The total number of elements:", z1.size)

# access some elements in the three dimensional array:
print("First element:", z1[0, 0, 0])
print("Last element:", z1[1, 2, 3])
print("Some element:", z1[1, 0, 1])

[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
Number of dimensions: 3
Length of each dimension: (2, 3, 4)
The total number of elements: 24
First element: 0
Last element: 23
Some element: 13


### Data types

Data types were already discussed in the previous notebook, but here is a small recap. 

One array can only have one data type. 
The data type of the array x can be obtained with the x.dtype attribute. 
In case you mix different data types, the elements are converted to the same type. If you want to specify the data type when you create the array with the array() function, you can use the dtype argument.
The data type of the array can be:
- float (float64)
- integer (int32 or int64)
- boolean (bool)
- complex (complex128)
- string (<U3 or <U64)

Let's assume you created an array z with the data type integer and you want to convert it to the data type float afterwards; then you can use the astype() function. For example, z.astype(float) returns a copy of z with the data type float.
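
The recap above can be tried out with a short sketch. The values are made up for illustration, and the exact integer type (int32 or int64) depends on your platform:

```python
import numpy

# mixing integers and a float: all elements are converted to float
a = numpy.array([1, 2, 3.5])
print(a, a.dtype)

# specify the data type when the array is created
b = numpy.array([1, 2, 3], dtype=float)
print(b, b.dtype)

# convert an existing integer array to float; astype returns a new array
z = numpy.array([1, 2, 3])
zf = z.astype(float)
print(z.dtype, "->", zf.dtype)
```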

### Structured arrays

In principle, you can mix data types as long as you specify which part of the array has which data type. This type of array is often called a structured array. Parts of the array are indicated with a named field. There are a lot of different ways to specify a structured array.

One of the possible ways to specify a structured array is to use a list of tuples as dtype. 
For every column in the array, a tuple specifies the name and the data type of that column.
For example: dtype = [('name of column 1', 'data type of column 1'), ('name of column 2', 'data type of column 2')], 
where the name of the column and the data type are both strings.

One can access the rows of the array by normal indexing and one can access the columns of the array by using the column names that are specified when the array was created.

Another possibility is that a field of a structured array holds an array itself, so that every record contains another array. 

For more information about structured arrays: 
http://docs.scipy.org/doc/numpy-1.10.1/user/basics.rec.html#structured-arrays

In [None]:
# create a structured array
x = numpy.array([(1, 2.45, 4+6j), (2, 3.45, 8+12j)], dtype=[('col1', 'int8'),('col2', 'float32'), ('col3', 'complex128')])
print(x)
print(x.dtype)

# access one row or column
print(x[1])
print(x['col2'])
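
The case where a record contains another array is not shown in the cell above. A minimal sketch, assuming a hypothetical field 'scores' that holds three floats per record (the names and values are invented for illustration):

```python
import numpy

# each record has a name and an array of three scores (hypothetical example data)
dt = [('name', 'U10'), ('scores', 'float64', (3,))]
r = numpy.array([('anna', [7.5, 8.0, 6.5]), ('ben', [5.0, 9.0, 7.0])], dtype=dt)

# the field with the arrays is a 2 by 3 array: one row of scores per record
print(r['scores'])
print(r['scores'].shape)

# access the fields of the first record
print(r[0]['name'], r[0]['scores'].mean())
```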

### Indexing

Basically there are four different ways of indexing:
- Linear indexing, which transforms the n-dimensional array to a 1-dimensional list. This linear index is returned when the argmin and argmax functions are applied to an n-dimensional array without specifying an axis. When the shape of the array is known, we can transform the linear index to the matrix index.
- Matrix indexing, which uses one index for every dimension of the array. This is the most commonly used way to access elements in the array. In this case it is important to keep in mind that the index in Python starts at 0, so for every dimension the index starts at 0 and ends at the length of the dimension - 1. 
- Boolean indexing, which returns all values in the array for which the index is True.
- Indexing with an array of indices. In this case you specify a separate array in which you store the indices as integers and you will return exactly the elements of the array with these indices. 

For more information about indexing:
http://docs.scipy.org/doc/numpy/user/basics.indexing.html


### Linear and matrix indexing

Indexing in a 1-dimensional array is the same as indexing in a Python list. And if you want to apply something to every element of the array, then one simple for-loop over the items can do the trick.

Indexing in an n-dimensional array has one index for every dimension. To access one element of the array, the index of every dimension should be given. When accessing more than one element, the slicing ":" can be used. This works in the same way as it does with lists, but you can use the ":" for every dimension. If no index is given for a trailing dimension, then ":" is assumed for that dimension.
If the index is [a:b] then the indices that are used are a up to but not including b.

If you have the linear index and you want to convert it to a matrix index, then you can use the function: numpy.unravel_index().
The first argument is the linear index and the second argument is the shape of the array for which you want to transform the index. For example: numpy.unravel_index(linear_index, (2,3)). The function also works on a list of linear indices, and then returns an array of indices in the first dimension and an array of indices in the second dimension.


In [23]:
# indexing in a 3-dimensional array
z = numpy.arange(24).reshape((2, 3, 4))
print(z)

# slices
print(z[0:2, 1:3, 3])
print(z[:, 2, :])

# linear indexing
linear_index = 10
print("\n For a matrix with dimensions (2, 3, 4), the linear index: ", linear_index, " is equal to \
matrix index: ", numpy.unravel_index(linear_index, z.shape))


[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
[[ 7 11]
 [19 23]]
[[ 8  9 10 11]
 [20 21 22 23]]

 For a matrix with dimensions (2, 3, 4), the linear index:  10  is equal to matrix index:  (0, 2, 2)
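
The linear index in the cell above was chosen by hand. As a sketch of the more common situation, the linear index returned by argmax can be converted to a matrix index in the same way (the small array z is made up for illustration):

```python
import numpy

z = numpy.array([[3, 9, 1],
                 [7, 2, 8]])

# without an axis argument, argmax returns the linear index of the largest value
linear_index = z.argmax()
print(linear_index)

# convert the linear index to a matrix index using the shape of the array
matrix_index = numpy.unravel_index(linear_index, z.shape)
print(matrix_index)
print(z[matrix_index])

# a list of linear indices gives one array of indices per dimension
rows, cols = numpy.unravel_index([0, 4, 5], z.shape)
print(rows, cols)
```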


### Boolean indexing

Boolean indexing is easy when you know exactly which elements of the array you want to have.
Then you can create a boolean array with True at the position of the elements you want to have.
Most of the time you will not use this approach; instead you will use boolean indexing by specifying a certain condition.
The condition returns True or False for every position in the array, and the elements for which the condition is True are retrieved.

In [30]:
# Boolean indexing
x = numpy.arange(1, 6)
y = numpy.array([True, False, True, False, True ])
print("Only elements of x for which the value in y is equal to True: ", x[y])

# boolean indexing by using a condition
print("Only elements of x for which the condition is True: ", x[x>3])

Only elements of x for which the value in y is equal to True:  [1 3 5]
Only elements of x for which the condition is True:  [4 5]


### Indexing with an array of indices

In this case you specify a separate array that stores the indices as integers, and exactly the elements of the array at these indices are returned.
One advantage of this is that you can explicitly specify the order in which you want to have the values, and that you can return the value at a certain position multiple times. 


In [38]:
x = numpy.arange(100, 111)
y = numpy.array([8, 3, 8, 4, 9, 3])
print("Array x: ", x)
print("Array with indices: ", y)
print("Indexing with an array of indices will give:", x[y])

Array x:  [100 101 102 103 104 105 106 107 108 109 110]
Array with indices:  [8 3 8 4 9 3]
Indexing with an array of indices will give: [108 103 108 104 109 103]


### For loops over arrays

Most of the time you want to use arrays to avoid loops, but sometimes you want to use a loop. Then it is important to know how numpy iterates over an array. When a for loop is used to iterate over an array then the first for loop will iterate over the first dimension. When two nested for loops are used, then the loops iterate over the first two dimensions. 

It is easier to use an example to explain it.
When you use a for loop over a 3-dimensional array with shape (2, 3, 4), you get exactly 2 iterations, and in each iteration a 3 by 4 array is used. 
When you use two nested for loops over a 3-dimensional array with shape (2, 3, 4), you get exactly 2*3=6 iterations, and in each iteration an array of length 4 is used. 

When you want to loop over all the elements in the array, it is better to use the flat attribute instead of nested loops. For example, x.flat gives an iterator over all elements of x. 

In [18]:
# for loop over the first dimension:
z = numpy.arange(24).reshape((2, 3, 4))
print("\n For loop over the first dimension")
for firstdimension in z:
    print(firstdimension)

print("\n For loop over the first and second dimension")
# for loop over the first and second dimension:
for firstdimension in z:
    for seconddimension in firstdimension:
        print(seconddimension)
     
print("\n For loop over all elements with the flat method")
for element in z.flat:
    print(element)


 For loop over the first dimension
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]]

 For loop over the first and second dimension
[0 1 2 3]
[4 5 6 7]
[ 8  9 10 11]
[12 13 14 15]
[16 17 18 19]
[20 21 22 23]

 For loop over all elements with the flat method
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


### Vector stacking

Sometimes you want to combine two or more vectors to create an array. This is called vector stacking. Vector stacking can be done in two different ways: horizontal and vertical. 
- horizontal stack: numpy.hstack([x, y, z])
- vertical stack: numpy.vstack([x, y, z])

In [47]:
x = numpy.arange(0,5)                     
y = numpy.arange(5, 10)   
z = numpy.arange(10, 15)
print("Horizontal stack: ",  numpy.hstack([x,y, z]) )
print("Vertical stack: ")
print( numpy.vstack([x,y, z]))

Horizontal stack:  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
Vertical stack: 
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]


### Split array into vectors

The opposite of vector stacking is to split the array into pieces. The first argument is the array and the second argument is the number of pieces to split the array into. It is important that the number of rows (for a vertical split) or columns (for a horizontal split) can be divided equally over the number of pieces. Two different ways of splitting:
- Vertical split: numpy.vsplit(x, 2)
- Horizontal split: numpy.hsplit(x, 3)


In [54]:
x = numpy.arange(16).reshape(4,4)
print(x)
print("Vertical split: ")
print(numpy.vsplit(x, 2))
print("Horizontal split: ")
print(numpy.hsplit(x, 4))

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
Vertical split: 
[array([[0, 1, 2, 3],
       [4, 5, 6, 7]]), array([[ 8,  9, 10, 11],
       [12, 13, 14, 15]])]
Horizontal split: 
[array([[ 0],
       [ 4],
       [ 8],
       [12]]), array([[ 1],
       [ 5],
       [ 9],
       [13]]), array([[ 2],
       [ 6],
       [10],
       [14]]), array([[ 3],
       [ 7],
       [11],
       [15]])]
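
What happens when the number of pieces does not fit can be seen in the following sketch, which uses the same 4 by 4 array as above. The variable split_failed is only added here to record the outcome:

```python
import numpy

x = numpy.arange(16).reshape(4, 4)

# 4 columns can be split into 2 equal pieces
left, right = numpy.hsplit(x, 2)
print(left.shape, right.shape)

# 4 rows cannot be split into 3 equal pieces, so numpy raises an error
try:
    numpy.vsplit(x, 3)
    split_failed = False
except ValueError as error:
    split_failed = True
    print("vsplit failed:", error)
```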


### Copy of matrix

When you do operations on an array, you have to be careful whether the array is copied or only referenced. When you change something in the new array, the original array may be changed as well.

There are three cases:
- No copy at all: only a reference to the existing object. When you change anything then the original array is also changed.
- View or shallow copy: different array objects share the same data. When you change the shape of the view, the change in shape is not shared with the original array, but when you change the data, the change is shared and the original array is also changed. A shallow copy can be made with y = x.view(). When you slice an array, only a view is returned and no copy of the data is made.
- Deep copy: complete copy of array and data. When you change something in the copied array, the original array remains the same. A deep copy can be made in numpy with y = x.copy(). 



In [64]:
# shallow copy
a = numpy.arange(10)
b = a.view()
b.shape = (2,5)
b[0, 3] = 1000
print("shallow copy: ", b)
print("original: ", a)

shallow copy:  [[   0    1    2 1000    4]
 [   5    6    7    8    9]]
original:  [   0    1    2 1000    4    5    6    7    8    9]


In [67]:
# deep copy
a = numpy.arange(10)
c = a.copy()
c.shape = (5,2)
c[3,1] = 1000
print("deep copy: ", c)
print("original: ", a)

deep copy:  [[   0    1]
 [   2    3]
 [   4    5]
 [   6 1000]
 [   8    9]]
original:  [0 1 2 3 4 5 6 7 8 9]
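
The two cells above cover the shallow and the deep copy. The remaining cases, plain assignment (no copy at all) and slicing (a view), can be sketched as follows; numpy.shares_memory is used here only to check whether two arrays use the same data:

```python
import numpy

a = numpy.arange(10)

# no copy at all: b is just another name for the same array object
b = a
b[0] = -1
print(b is a, a[0])

# a slice is a view: it is a new array object that shares the data with a
s = a[2:5]
s[0] = 1000
print(s is a, a[2])

# numpy.shares_memory reports whether two arrays use the same data
print(numpy.shares_memory(a, s), numpy.shares_memory(a, a.copy()))
```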


### Load data set

Before you can load a data set you have to make sure that the data set is in the right path. When you want to use the data set in Jupyter you have to make sure that the data set is uploaded to your Jupyter server and is located in the directory of the notebook.

Most functions to load a data set need a filename as argument. Note that the filename must include the extension and that each function expects a specific file format. 
Next to a filename, often a delimiter is specified, which is the character that is used to split the columns. Moreover, a lot of optional arguments can be specified, such as dtype, missing_values, skip_header, and names. The arguments that are available depend on the function that is used, so when you want to use a function it is best to look up the optional arguments in the reference guide.

Loading a data set with Numpy:
- numpy.genfromtxt(filename) : this can be used to open .txt files. The advantage of the genfromtxt function is that it can specify how to handle missing values. http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.genfromtxt.html
- numpy.load(filename) : this can be used to open .npy and .npz files. http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.load.html
- numpy.loadtxt(filename): this can be used to open .txt files. http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html
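
As a small sketch of how genfromtxt handles missing values, the example below reads a made-up table with one empty entry. To keep it self-contained, io.StringIO stands in for a file on disk; with a real data set you would pass the filename instead:

```python
import io
import numpy

# hypothetical file contents with a header line and one missing value
raw = "a,b,c\n1,2,3\n4,,6\n7,8,9\n"

# by default a missing value in a float column becomes nan
default = numpy.genfromtxt(io.StringIO(raw), delimiter=',', skip_header=1)
print(default)

# with filling_values the missing value is replaced by the given value
filled = numpy.genfromtxt(io.StringIO(raw), delimiter=',', skip_header=1,
                          filling_values=0)
print(filled)
```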

When you want to load a data set in CSV format, you can also use the csv library from the Python standard library. You can use the function csv.reader(open('filename.txt'), delimiter=' '). See the code block below for an example of the csv library.

In [None]:
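# The use of the csv library to load a csv file
import csv
import numpy

def load_table(path):
    # split every row into the features (all but the last column) and the label (last column)
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter=' ')
        return [ (row[0:-1], row[-1]) for row in reader ]

# the file iris-train.txt has to be located in the directory of the notebook
X, Y = zip(*load_table('iris-train.txt'))
X = numpy.array(X, dtype='float')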

### Save data set to file

When you want to save a numpy array to a separate file, you always have to specify the filename and the array you want to save. You can use the following functions:
- numpy.savetxt(filename, array) : save an array to a text file. Some optional arguments are: delimiter=' ', newline = '\n', header = ' '. http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.savetxt.html#numpy.savetxt
- numpy.save(filename, array) : save an array to a binary file in numpy .npy format. http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.save.html#numpy.save
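
A minimal round trip with both functions could look as follows. The filenames x.txt and x.npy are hypothetical, and a temporary directory is used so that no existing files are overwritten:

```python
import os
import tempfile
import numpy

x = numpy.arange(6, dtype=float).reshape(2, 3)

# write to a temporary directory so that no existing files are overwritten
with tempfile.TemporaryDirectory() as folder:
    txt_path = os.path.join(folder, 'x.txt')   # hypothetical filenames
    npy_path = os.path.join(folder, 'x.npy')

    # save the array as text and in the binary .npy format
    numpy.savetxt(txt_path, x, delimiter=' ')
    numpy.save(npy_path, x)

    # load both files again
    from_txt = numpy.loadtxt(txt_path)
    from_npy = numpy.load(npy_path)

print(numpy.array_equal(x, from_txt), numpy.array_equal(x, from_npy))
```

The text file can be opened in any editor, while the .npy file keeps the data type and shape of the array.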


#### Exercise 2.1 Load, transform and save data set

Load a data set from the internet, convert some of the data types of the columns, replace the missing values, select a subset of the columns, select a subset of the rows based on a condition, and then save the array to a file.

One website where you can find a lot of data sets is:
http://archive.ics.uci.edu/ml/datasets.html

In [None]:
# Exercise 2.1


