# Answers to exercises

In [1]:
import numpy

### 2.1) Load, transform and save a data set

Load a data set from the internet, convert some of the data types of the columns, replace the missing values, select a subset of the columns, select a subset of the rows based on a condition, and then save the array to a file.

One website where you can find a lot of data sets is:
http://archive.ics.uci.edu/ml/datasets.html
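
The steps of this exercise can be sketched as follows. This is only one possible approach, not a reference answer: the small in-memory table, the "?" marker for missing values, the choice of columns and the file name `selected_rows.csv` are all assumptions made for illustration. A real file downloaded from the site would take the place of the `StringIO` object.

```python
import io
import os
import tempfile

import numpy

# Hypothetical stand-in for a downloaded file: four measurements and a text
# label per row, with "?" marking a missing value.
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,?,1.4,0.2,setosa\n"
    "6.3,3.3,6.0,2.5,virginica\n"
    "5.8,2.7,?,1.9,virginica\n"
)

# Read the four numeric columns as floats; "?" is loaded as nan.
data = numpy.genfromtxt(raw, delimiter=",", usecols=(0, 1, 2, 3),
                        missing_values="?", filling_values=numpy.nan)

# Replace every missing value by the mean of its column.
column_means = numpy.nanmean(data, axis=0)
rows, columns = numpy.nonzero(numpy.isnan(data))
data[rows, columns] = column_means[columns]

# Keep the first and third column, and only the rows where the first
# column is larger than 5.
subset = data[:, [0, 2]]
selected = subset[subset[:, 0] > 5.0]

# Save the result and read it back as a check.
path = os.path.join(tempfile.gettempdir(), "selected_rows.csv")
numpy.savetxt(path, selected, delimiter=",", fmt="%.4f")
reloaded = numpy.loadtxt(path, delimiter=",")
print(reloaded)
```

Filling missing values with the column mean is just one option; dropping incomplete rows would work as well.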

### 3.1) Transformation
Load the iris training data set. Make sure that the data set is uploaded to your Jupyter server. The features are stored in the array X: the lengths and widths of the sepals and petals, measured in centimetres. Suppose we want to apply the following transformations:
- Convert the first feature (sepal length in cm) to inches (1 inch = 2.54 cm).
- Rescale the second feature (sepal width in cm) to the range between 0 and 1.
- Convert the third feature (petal length in cm) to a boolean indicating whether it is larger than the mean (1 if larger than the mean, 0 otherwise).
- Replace the fourth feature (petal width in cm) by its exponential.

In practice, you would rarely convert features that share a unit into different units before training an algorithm. Usually you do the opposite: you rescale the features so that they are on the same scale and normalized. For now, this is simply a good exercise to practice some transformations.

In [48]:
import csv

def load_table(path):
    # every row holds the features followed by the label, separated by spaces
    with open(path) as f:
        reader = csv.reader(f, delimiter=' ')
        return [ (row[0:-1], row[-1]) for row in reader ]

X, Y = zip(*load_table('iris-train.txt'))
X = numpy.array(X, dtype='float')

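# work on a copy, otherwise X_new is only another name for X and X is overwritten
X_new = X.copy()

# feature 1: sepal length from centimetres to inches
X_new[:, 0] = X[:, 0] / 2.54
# feature 2: sepal width rescaled to the range [0, 1]
X_new[:, 1] = (X[:, 1] - X[:, 1].min()) / ( X[:, 1].max() - X[:, 1].min() )
# feature 3: 1 if the petal length is larger than the mean, 0 otherwise
X_new[:, 2] = (X[:, 2] > numpy.mean(X[:, 2]))
# feature 4: exponential of the petal width
X_new[:, 3] = numpy.exp(X[:, 3])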

print(X_new[0:5, :])

[[ 2.16535433  0.18181818  1.          3.32011692]
 [ 2.20472441  0.22727273  1.          3.66929667]
 [ 1.92913386  0.40909091  0.          1.10517092]
 [ 2.20472441  0.27272727  1.          7.3890561 ]
 [ 2.36220472  0.          0.          2.71828183]]


### 3.2) Practice with indexing in 2-dimensional arrays
Create a 2-dimensional array called A. You are free to choose the content, as long as the data type is either integer or float.
Then try the following:
- Compute the total sum of the elements in the array in two ways: first with the sum() function, then with a loop instead of the sum() function.
- Return the minimum of the entire array and its index, but transform the linear index into a row index and a column index. Think of a general formula to transform the linear index into a two-dimensional index. Afterwards, you can check your result with the unravel_index() function.
- Create another 2-dimensional array (B) with the same shape as A. Sort array A, and then sort array B in the same way as array A. For example, if the element A[1,4] ends up in position [0,0] after sorting, then the element B[1,4] also has to be in position [0,0]. For this you can use the argsort method, see the documentation: http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.argsort.html

Answer: a general formula to transform the linear index into a row and column index is the following: the row index is the floor of the linear index divided by the number of columns in A, and the column index is the remainder of that division. This assumes the default row-major (C) order. The unravel_index() function performs exactly this transformation, and it also works for arrays with more than two dimensions.

To use the argsort() method on the entire array, the axis parameter has to be set to None. To sort an array according to a given order, the trick A[order] can be used. Note that argsort(axis=None) returns linear indices, so we have to transform them back into row and column indices, which can be done with the unravel_index() function.
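
The same idea extends to more dimensions by peeling off one axis at a time, starting with the last one. The snippet below is a small sketch of that; the shape (2, 3, 4) and the linear index 17 are arbitrary choices for illustration.

```python
import numpy

shape = (2, 3, 4)          # hypothetical 3-dimensional shape
linear_index = 17

# Peel off the last axis first, then the middle axis; what remains is the
# index along the first axis.
rest, k = divmod(linear_index, shape[2])
i, j = divmod(rest, shape[1])
manual = (i, j, k)

# unravel_index performs the same computation for any number of dimensions.
reference = tuple(int(v) for v in numpy.unravel_index(linear_index, shape))
print(manual, reference)
```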

In [2]:
A = numpy.floor(10*numpy.random.random((2,4)))
print(A)

# total sum with sum method
total_sum = A.sum()  # without an axis argument, sum() already adds up all elements
print("Total sum with sum method: ", total_sum)

# total sum with for loops
total_sum = 0
for row in A:
    for element in row:
        total_sum += element
print("Total sum with for loops:", total_sum)

# minimum and linear index
minimum = A.min()
linear_index = A.argmin()

# transform the linear index into row and column index
n, m = A.shape
row = linear_index // m
column = linear_index % m
print("Minimum is: ", minimum, "and linear index is ", linear_index)
print("Transform the linear index: ", linear_index, " into row index: ", row, "and column index: ", column)
# you can check if you have done the transformation correctly when applying the unravel_index function
print("Unravel_index gives indices: ", numpy.unravel_index(linear_index, A.shape))

[[ 6.  5.  4.  8.]
 [ 1.  8.  5.  1.]]
Total sum with sum method:  38.0
Total sum with for loops: 38.0
Minimum is:  1.0 and linear index is  4
Transform the linear index:  4  into row index:  1 and column index:  0
Unravel_index gives indices:  (1, 0)


In [3]:
# sorting
A = numpy.floor(10*numpy.random.random((2,4)))
B = numpy.floor(10*numpy.random.random((2,4)))
print("A:", A)
print("B:", B)
C = A.argsort(axis = None)
print("Argsort:", C)
print("Sorted A: ", A[numpy.unravel_index(C, A.shape)])
print("Sorted B: ", B[numpy.unravel_index(C, A.shape)])

A: [[ 5.  5.  3.  4.]
 [ 8.  3.  9.  1.]]
B: [[ 9.  5.  7.  0.]
 [ 9.  7.  6.  3.]]
Argsort: [7 2 5 3 0 1 4 6]
Sorted A:  [ 1.  3.  3.  4.  5.  5.  8.  9.]
Sorted B:  [ 3.  7.  7.  0.  9.  5.  9.  6.]
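
The cell above returns the sorted values as flat arrays. If the sorted values should keep the 2-dimensional shape, as the exercise text suggests, one option is to index the flattened arrays and reshape the result. This sketch reuses the values of A and B printed above; `kind="stable"` is an added assumption so that equal values keep their original order.

```python
import numpy

A = numpy.array([[5., 5., 3., 4.], [8., 3., 9., 1.]])
B = numpy.array([[9., 5., 7., 0.], [9., 7., 6., 3.]])

# Linear indices that sort the flattened A.
order = A.argsort(axis=None, kind="stable")

# Apply the same order to both arrays and restore the original shape.
A_sorted = A.ravel()[order].reshape(A.shape)
B_sorted = B.ravel()[order].reshape(B.shape)
print(A_sorted)
print(B_sorted)
```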


### 3.3) Normalization
Load the iris train data set. Make sure that the data set is on your Jupyter server. The features of the data set are located in the array X. Then calculate the basic statistics for every feature and then normalize the features.

In [43]:
import csv

def load_table(path):
    # every row holds the features followed by the label, separated by spaces
    with open(path) as f:
        reader = csv.reader(f, delimiter=' ')
        return [ (row[0:-1], row[-1]) for row in reader ]

X, Y = zip(*load_table('iris-train.txt'))
X = numpy.array(X, dtype='float')

# work on a copy, otherwise X_normal is only another name for X and X is overwritten
X_normal = X.copy()
print("Mean for every feature: ", numpy.mean(X, axis=0))
print("Standard deviation for every feature: ", numpy.std(X, axis=0))

print("Normalization")
for i in range(X.shape[1]):
    X_normal[:, i] = (X[:, i] - numpy.mean(X[:, i]) ) /  numpy.std(X[:, i])
print(X_normal[1:5, :])

Mean for every feature:  [ 5.896  3.036  4.042  1.334]
Standard deviation for every feature:  [ 0.82388349  0.40927253  1.73263845  0.78092509]
Normalization
[[-0.3592741  -0.82096886  0.0911904  -0.04353811]
 [-1.2089088   0.15637502 -1.46712662 -1.58017717]
 [-0.3592741  -0.57663289  0.49519852  0.85283468]
 [ 0.12623144 -2.04264872 -0.02424049 -0.42769787]]
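
The loop over the columns can also be replaced by broadcasting, because the per-feature mean and standard deviation have shape (4,) and are applied to every row automatically. The small array below is made-up data standing in for the iris features.

```python
import numpy

# Hypothetical stand-in for the iris feature array (3 samples, 4 features).
X = numpy.array([[5.0, 3.0, 4.0, 1.0],
                 [6.0, 2.5, 5.0, 1.5],
                 [7.0, 3.5, 6.0, 2.0]])

# Subtract the column means and divide by the column standard deviations.
# This creates a new array, so X itself is left untouched.
X_normal = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_normal.mean(axis=0))  # close to 0 for every feature
print(X_normal.std(axis=0))   # close to 1 for every feature
```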


### 3.4) Number of consecutive equal values

One application of the numpy.diff function is counting the number of consecutive equal values. Take x = [10, 10, 10, 20, 20, 20, 20, 30, 30, 40, 40, 40, 40, 40] and return the unique values and, for each of them, the number of consecutive equal values.

In [117]:
x = numpy.array([10, 10, 10, 20, 20, 20, 20, 30, 30, 40, 40, 40, 40, 40])
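# differences between neighbours are nonzero exactly where the value changes
d = numpy.diff(x)
i = numpy.nonzero((d != 0))
# index of the last element of every run, including the final run
i = numpy.hstack([i[0], x.shape[0]-1])
y = x[i]
# boundaries of the runs; the differences between them are the run lengths
i = numpy.hstack([0, i+1])
c = numpy.diff(i)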
print("Unique values: ", y)
print("Count: ", c)


Unique values:  [10 20 30 40]
Count:  [3 4 2 5]
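
The same technique can be wrapped in a function that also handles values that are not sorted, where a value may start a new run later on. The function name `run_lengths` is made up for this sketch, and it assumes a non-empty one-dimensional input.

```python
import numpy

def run_lengths(x):
    """Return the value and the length of every run of equal neighbours."""
    x = numpy.asarray(x)
    # A new run starts at index 0 and right after every change of value.
    starts = numpy.hstack([0, numpy.nonzero(numpy.diff(x) != 0)[0] + 1])
    # The distance between consecutive starts (and the end) is the run length.
    counts = numpy.diff(numpy.hstack([starts, x.shape[0]]))
    return x[starts], counts

values, counts = run_lengths([10, 10, 10, 20, 20, 20, 20, 30, 30, 40, 40, 40, 40, 40])
print(values, counts)

values2, counts2 = run_lengths([3, 3, 1, 1, 1, 3])
print(values2, counts2)
```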


### 4.1) Element wise product and dot product
Create two 2 by 3 arrays called A and B, where the contents of A are random numbers between 1 and 11 and the contents of B are the numbers 1 up to 6. First multiply the arrays elementwise and then try to multiply them with the dot product. Which error does the dot product give, and why does it not work? Can you think of an operation on one of the matrices that fixes the problem?

Answer: The dot product does not work because of the shapes of the matrices, as the error message "ValueError: shapes (2,3) and (2,3) not aligned: 3 (dim 1) != 2 (dim 0)" tells you. Both matrices have the shape (2,3), but the dot product requires the number of columns in A to be equal to the number of rows in B. The dot product is possible if we change the shape of matrix B to (3,2). This can be done in two different ways, either with the reshape function or by taking the transpose. Alternatively, we could change the shape of A to (3,2) and then take the dot product B * A.
Note that these methods all result in different matrices.

In [7]:
A = 1 + (11-1)*numpy.random.random((2,3))
B = numpy.arange(1, 7).reshape(2,3)
print(A)
print(B)

# elementwise product
C = A*B
print(C)

# dot product: raises a ValueError, because the shapes (2,3) and (2,3) are not aligned
D = A.dot(B)
print(D)

# alternative dot product
E = numpy.dot(A, B)
print(E)


[[  8.28223023   1.7522902   10.5138572 ]
 [  1.83374827   3.11618221   7.10217603]]
[[1 2 3]
 [4 5 6]]
[[  8.28223023   3.50458041  31.5415716 ]
 [  7.33499309  15.58091107  42.61305617]]


ValueError: shapes (2,3) and (2,3) not aligned: 3 (dim 1) != 2 (dim 0)

In [10]:
A = 1 + (11-1)*numpy.random.random((2,3))
B = numpy.arange(1, 7).reshape(2,3)

# in order to get the dot product working, we change the shape of B to (3,2)
C = B.reshape(3,2)
D = A.dot(C)
print(D)

# alternative solution which uses the transpose
E = numpy.transpose(B)
F = A.dot(E)
print(F)

[[ 64.63161975  88.34723639]
 [ 48.01168319  67.58799865]]
[[  44.1736182   115.32046812]
 [  33.79399932   92.52294568]]
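
The third option mentioned in the answer, changing the shape of A and computing the dot product of B with A, is not shown in the cells above. A minimal sketch, using fixed numbers for A instead of random ones so that the result is reproducible:

```python
import numpy

A = numpy.arange(10, 16).reshape(2, 3)   # fixed values chosen for illustration
B = numpy.arange(1, 7).reshape(2, 3)

# Reshape A to (3,2): B has shape (2,3), so the result has shape (2,2).
G = B.dot(A.reshape(3, 2))
print(G)

# Taking the transpose of A gives the shape (3,2) as well, but a different matrix.
H = B.dot(A.T)
print(H)
```

Reshaping keeps the elements in their row-by-row order, while the transpose swaps rows and columns, which is why G and H differ.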


### 4.2) Transpose
Create two matrices (A and B) which are not square matrices such that you can calculate the dot product. Then calculate the following things: $(A*B)^T$ and $B^T * A^T$ where * is the dot product. What do you notice?

Answer: If you did it correctly, you will notice that $(A*B)^T = B^T * A^T$. This is a well-known identity in linear algebra. For more information: http://www.math.nyu.edu/~neylon/linalgfall04/project1/dj/proptranspose.htm

In [16]:
A = numpy.arange(6).reshape(2,3)
B = numpy.arange(6,12).reshape(3,2)
C = numpy.transpose(numpy.dot(A, B))
D = numpy.dot(numpy.transpose(B), numpy.transpose(A))
print("(A*B)^T = ")
print(C)
print("B^T * A^T = ")
print(D)

(A*B)^T = 
[[ 28 100]
 [ 31 112]]
B^T * A^T = 
[[ 28 100]
 [ 31 112]]
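
Instead of comparing the printed matrices by eye, the identity can be checked numerically. The sketch below uses random matrices with arbitrarily chosen shapes (4,3) and (3,5), the `@` operator for the dot product, and `numpy.allclose` to allow for floating point rounding.

```python
import numpy

rng = numpy.random.default_rng(0)
A = rng.random((4, 3))
B = rng.random((3, 5))

lhs = (A @ B).T      # transpose of the product, shape (5, 4)
rhs = B.T @ A.T      # product of the transposes in reversed order
same = numpy.allclose(lhs, rhs)
print(same)
```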


### 4.3) Inverse
Try to create a 2 by 2 matrix whose determinant is equal to zero and then try to compute its inverse. When the determinant of a matrix is equal to zero, computing the inverse gives you an error message. Can you understand this error message? After that, fix the matrix such that the inverse exists. Hint: there is a simple formula for the determinant of a 2 by 2 matrix, and using this formula you can easily choose the numbers in the matrix.

Answer: for a 2 by 2 matrix A = [[a, b], [c, d]], the determinant is det(A) = a*d - b*c. When you try to compute the inverse, the error message tells you that the matrix is singular, which means that the determinant is equal to zero. To fix the matrix, change an entry such that a*d - b*c is no longer zero.

In [5]:
A = numpy.array([[1, 2], [1, 2]])
print("Determinant: ", numpy.linalg.det(A))
print("Inverse: ", numpy.linalg.inv(A))

Determinant:  0.0


LinAlgError: Singular matrix
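
The last part of the exercise, fixing the matrix, could look like this. Changing the entry d from 2 to 3 is an arbitrary choice; any change that makes a*d - b*c nonzero works.

```python
import numpy

# The singular matrix from above: 1*2 - 2*1 = 0.
singular = numpy.array([[1, 2], [1, 2]])
try:
    numpy.linalg.inv(singular)
    raised = False
except numpy.linalg.LinAlgError:
    raised = True

# Changing d to 3 gives the determinant 1*3 - 2*1 = 1, so the inverse exists.
A = numpy.array([[1, 2], [1, 3]])
inverse = numpy.linalg.inv(A)
print("Determinant: ", numpy.linalg.det(A))
print("Inverse: ", inverse)

# A matrix times its inverse gives the identity matrix.
print(A.dot(inverse))
```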

### 4.4) Euclidean distances
Load the iris training data set (make sure that the data set is uploaded to your Jupyter server).
Take the first n data points, which are the first n rows. Then calculate the pairwise Euclidean distances between the data points. You can save the results in an array D, where D[i,j] is the Euclidean distance between data points i and j. Find the two closest data points (note that a data point may not be paired with itself), and check whether or not their labels are the same.

In [34]:
import csv

def load_table(path):
    # every row holds the features followed by the label, separated by spaces
    with open(path) as f:
        reader = csv.reader(f, delimiter=' ')
        return [ (row[0:-1], row[-1]) for row in reader ]

X, Y = zip(*load_table('iris-train.txt'))
X = numpy.array(X, dtype='float')

n = 5
X = X[0:n, :]
Y = Y[0:n]

# start with infinity everywhere, so that the diagonal and the lower triangle never win the minimum
D = numpy.full((n, n), numpy.inf)
for i in range(n):
    for j in range(i+1, n):
        D[i,j] = numpy.linalg.norm((X[i,:] - X[j, :]), ord=2)
print(D)

minimum = D.min()
index_minimum = numpy.unravel_index(D.argmin(), (n,n))
print("The two closest data points are the rows with index {} and {} and the distance is {} and the labels are {} and {}.".format(
    index_minimum[0], index_minimum[1], minimum, Y[index_minimum[0]], Y[index_minimum[1]]))


[[        inf  0.26457513  3.19843712  0.96953597  0.78102497]
 [        inf         inf  3.06267857  0.99498744  0.73484692]
 [        inf         inf         inf  3.96862697  3.01330383]
 [        inf         inf         inf         inf  1.52643375]
 [        inf         inf         inf         inf         inf]]
The two closest data points are the rows with index 0 and 1 and the distance is 0.26457513110645914 and the labels are versicolor and versicolor.
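
The double loop can be avoided with broadcasting, in the spirit of exercise 4.5. This is a sketch on four made-up 2-dimensional points rather than the iris data; the diagonal is set to infinity so that a point is never paired with itself.

```python
import numpy

# Hypothetical data points, one per row.
X = numpy.array([[0.0, 0.0], [3.0, 4.0], [3.0, 5.0], [10.0, 0.0]])

# diff[i, j, :] is the difference between point i and point j.
diff = X[:, None, :] - X[None, :, :]
D = numpy.sqrt((diff ** 2).sum(axis=-1))

# Exclude the distance of every point to itself.
numpy.fill_diagonal(D, numpy.inf)

i, j = numpy.unravel_index(D.argmin(), D.shape)
print("Closest pair:", i, j, "with distance", D[i, j])
```

Unlike the loop, this computes every distance twice (D is symmetric), which trades some memory and arithmetic for much less Python overhead.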


### 4.5) Speed up calculations

Solve the following exercises with code that is as efficient as possible, and avoid using loops.
- Create a 10000 by 10000 matrix with random numbers between 0 and 1. Then divide all numbers that are larger than 0.5 by 2. You can check the result by computing the mean of the array before and after the operation. Measure the time it takes to run the code.
- Create a matrix A = [[11, 12, 13, 14, 15], [21, 22, 23, 24, 25], [31, 32, 33, 34, 35]]. Find the indices of the elements that are multiples of 11, as well as the elements themselves.

In [61]:
# divide all numbers that are larger than 0.5 by 2
import time
start = time.process_time()
x = numpy.random.random((10000, 10000))
print("Mean before: ", numpy.mean(x))
i = x > 0.5
x[i] = x[i]/2
print("Mean after: ", numpy.mean(x))
end = time.process_time()
print("Execution time: ", end-start)

# indices for the elements which are multiples of 11
A = numpy.array( [[11, 12, 13, 14, 15], [21, 22, 23, 24, 25], [31, 32, 33, 34, 35]] )
condition = (A % 11 == 0)
print("Indices: ", numpy.where(condition))
print("Elements: ", A[condition])


Mean before:  0.500043967725
Mean after:  0.312515811166
Execution time:  4.697339982000003
Indices:  (array([0, 1, 2]), array([0, 1, 2]))
Elements:  [11 22 33]
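
The mean after the operation can be predicted. For uniform random numbers between 0 and 1, the half below 0.5 has mean 0.25 and stays as it is, while the half above 0.5 has mean 0.75 and becomes 0.375 after dividing by 2. The expected mean is therefore (0.25 + 0.375) / 2 = 0.3125, which agrees with the output above. The sketch below checks this on a smaller, seeded array; the size 1000 by 1000 and the seed are arbitrary choices.

```python
import numpy

rng = numpy.random.default_rng(0)
x = rng.random((1000, 1000))

# Boolean mask of the elements to change, then an in-place division.
mask = x > 0.5
x[mask] /= 2

mean_after = x.mean()
print("Mean after: ", mean_after)   # expected to be close to 0.3125
```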


### 4.6) Rounding grades

Create a three-dimensional array with shape (3, 3, 10) filled with random floats between 0 and 10, and assume that these numbers are grades. Avoid using loops in these exercises and try to write efficient code.

- Assume that these are the grades of 10 students who did 3 assignments with 3 questions per assignment. Calculate the average grade of each student over all assignments, given that every assignment and every question has equal weight.
- Round these grades to halves, except for the grades between 5 and 6, which are rounded to integers, since a 5.5 is not given as a grade.


In [74]:
# calculate average grade for every student
x = 10*numpy.random.random((3,3,10))
y = x.mean(axis=(0, 1))  # average over assignments and questions; the last axis holds the students
print("Average grade per student: ")
print(y)

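# rounding grades
# grades strictly between 5 and 6 are rounded to integers first
i = numpy.logical_and(y > 5.0 , y < 6.0)
y[i] = numpy.round(y[i])
# then everything is rounded to halves; the integers from the previous step stay unchanged
y = numpy.round(y*2)/2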
print("Rounded grades: ")
print(y)



Average grade per student: 
[ 5.08684082  5.61091428  4.19889438  6.04894404  3.60938891  6.91047978
  6.23495447  5.28813231  5.3985673   5.20395994]
Rounded grades: 
[ 5.   6.   4.   6.   3.5  7.   6.   5.   5.   5. ]
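
The two rounding steps can also be written as a single expression with `numpy.where`, which avoids modifying the input array. The function name `round_grades` and the example grades are made up for this sketch. Note that `numpy.round` rounds exact halves to the nearest even number, so an average of exactly 5.5 becomes 6.0 and 4.25 becomes 4.0; if a different tie rule is required, the rounding has to be adapted.

```python
import numpy

def round_grades(grades):
    """Round to halves, except between 5 and 6 where grades become integers."""
    grades = numpy.asarray(grades, dtype=float)
    between = (grades > 5.0) & (grades < 6.0)
    return numpy.where(between, numpy.round(grades), numpy.round(grades * 2) / 2)

rounded = round_grades([5.2, 5.6, 4.2, 6.9, 3.6])
print(rounded)
```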
