# 3. Vectorized operations

In [2]:
import numpy

### Basic operations on arrays with the same shape

The basic arithmetic operations on arrays are applied elementwise: addition, subtraction, multiplication, division and power.
The simplest case is when both arrays have exactly the same shape, so that each element is combined with the element at the same position in the other array.

In [5]:
# basic operations between two arrays with the same shape:
x = numpy.array([10, 20, 30, 40])
y = numpy.array([5, 7, 52, 34])

print("y - x = ", y - x)
print("x + y = ", x + y)
print("x * y = ", x * y)
print("x / y = ", x / y)

y - x =  [ -5 -13  22  -6]
x + y =  [15 27 82 74]
x * y =  [  50  140 1560 1360]
x / y =  [ 2.          2.85714286  0.57692308  1.17647059]


### Basic operations on arrays with different shapes

Operations between arrays of different shapes are also allowed, although not every combination of shapes is possible. Applying an operation to arrays of different shapes is called broadcasting.

Some common cases of broadcasting:
- An array and a constant: there are no restrictions on the shape of the array.
- An array and a row vector: the number of columns in the array has to equal the length of the row vector.
- An array and a column vector: the number of rows in the array has to equal the length of the column vector.

The shapes still matter when an operation combines an array with a row or column vector.
For example, let x be a 2 by 3 array, y a row vector with 3 elements, and z a column vector with 2 elements; then operations between x and y and between x and z are allowed.
An operation between array x and row vector y is applied to each row of x, which requires the number of columns of x to equal the length of y.
An operation between array x and column vector z is applied to each column of x, which requires the number of rows of x to equal the length of z.
If the relevant number of rows or columns does not match, the operation raises an error.
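
The shape rules above can be checked with a small sketch. The arrays `row`, `col` and `bad` are made-up examples, not part of the original notebook; the last one is chosen so that its length matches neither the rows nor the columns in a way NumPy can broadcast.

```python
import numpy

x = numpy.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = numpy.array([10, 20, 30])           # shape (3,): matches the 3 columns of x
col = numpy.array([[100], [200]])         # shape (2, 1): matches the 2 rows of x
bad = numpy.array([1, 2])                 # shape (2,): does not match the 3 columns

row_sum = x + row   # row is added to every row of x
col_sum = x + col   # col is added to every column of x
print(row_sum)
print(col_sum)

# Incompatible shapes raise a ValueError instead of producing a result.
try:
    x + bad
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("broadcasting failed:", err)
```

Note that a 1-dimensional array with 2 elements is treated like a row vector, which is why `bad` fails even though x has 2 rows; it would have to be reshaped to shape (2, 1) to act as a column vector.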

For more information about Broadcasting:
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

In [6]:
# constant term
x = numpy.array([20, 25, 30, 35])
print("x - 2 = ", x - 2)
print("x * 2 = ", x * 2)
print("x **2 = ", x**2)

x - 2 =  [18 23 28 33]
x * 2 =  [40 50 60 70]
x **2 =  [ 400  625  900 1225]


In [44]:
# operations between array and vector
x = numpy.array([[1, 2, 3], [4, 5, 6]])
y = numpy.array([5, 5, 5]) # row vector
z = numpy.array([[1], [2]]) # column vector

# array and row vector
print("Operations between x and y which are applied for each row")
print("x + y = ", x+y)
print("x * y = ", x*y)

# array and column vector
print("Operations between x and z which are applied for each column")
print("x + z = ", x+z)
print("x * z = ", x*z)

Operations between x and y which are applied for each row
x + y =  [[ 6  7  8]
 [ 9 10 11]]
x * y =  [[ 5 10 15]
 [20 25 30]]
Operations between x and z which are applied for each column
x + z =  [[2 3 4]
 [6 7 8]]
x * z =  [[ 1  2  3]
 [ 8 10 12]]


### Applications

Interesting applications of operations on vectors are:
- Normalization: z = (x - mean(x)) / stdev(x). The result is called the z-score; for roughly normally distributed data most z-scores lie between -3 and 3. Normalization is often applied before using machine learning algorithms.
- Feature scaling: y = (x - min(x)) / (max(x) - min(x)), which maps the values to the range 0 to 1.
- Conversion between different scales of measurement, for example from Fahrenheit to Celsius, from dollars to euros, or from inches to centimetres.

These transformations can be applied to the whole array, or to a single row or column.
To transform a single row or column of an array, use indexing to select that vector.
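
As a minimal sketch of transforming a single column, the example below converts only the first column of a small array. The array `data` and the meaning of its columns are made up for illustration.

```python
import numpy

# Hypothetical measurements: column 0 is a temperature in Fahrenheit,
# column 1 is a humidity percentage that should stay untouched.
data = numpy.array([[32.0, 40.0],
                    [212.0, 55.0],
                    [50.0, 60.0]])

# data[:, 0] selects the first column; assigning to it updates only that column.
data[:, 0] = (data[:, 0] - 32) / 1.8
print(data)
```

The array is created with float values on purpose: assigning fractional results into an integer array would truncate them.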


In [46]:
# normalization
x = numpy.random.random((1,5))
z = (x - numpy.mean(x)) / numpy.std(x)
print("Normalized: ", z)

Normalized:  [[ 0.3848132  -0.81109531  1.4612922   0.35514762 -1.39015771]]


In [37]:
# transform Fahrenheit to Celsius: C = (F-32)/1.8
F = 32+(212-32)*numpy.random.random((1,4))
C = (F-32)/1.8
print("Fahrenheit:", F)
print("Celsius:", C)

Fahrenheit: [[ 145.89290709   33.99930381  199.26635605  161.82835961]]
Celsius: [[ 63.27383727   1.11072434  92.92575336  72.12686645]]


### Boolean operations on arrays

Besides the basic operations, boolean conditions can also be applied to arrays. The condition is evaluated for every element and the result is a boolean array. Available comparisons include equal to (==), not equal to (!=), greater than (> or >=), and less than (< or <=).

In [41]:
# boolean operations on arrays
x = numpy.array([10, 20, 30, 14, 15, 16])
y = numpy.array([7, 5, 5, 7, 5, 7]) 
print("(x > 15) = ", x>15)
print("(y == 7) = ", y==7)

(x > 15) =  [False  True  True False False  True]
(y == 7) =  [ True False False  True False  True]


Such a boolean array can also be used as a mask: an array that selects only certain elements for an operation. This is useful when you want to transform only the part of an array that satisfies a certain condition. The technique relies on boolean indexing: the transformation is applied only to the elements where the mask has the value True. The advantage of storing the mask in a variable, rather than writing the condition directly inside the index, is that the same mask can be applied to several arrays, as long as they have the same shape.

In [51]:
# to replace all elements of x that are larger than 15 with 15, we can use a mask:
x = numpy.array([10, 20, 30, 14, 15, 16])
mask = (x > 15)
x[mask] = 15
print("Mask array: ", mask)
print("Transformed array: ", x)

Mask array:  [False  True  True False False  True]
Transformed array:  [10 15 15 14 15 15]
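
To illustrate reusing one mask on several arrays of the same shape, here is a small sketch. The arrays `scores` and `labels` are hypothetical names introduced only for this example.

```python
import numpy

scores = numpy.array([10, 20, 30, 14, 15, 16])
labels = numpy.array(["a", "b", "c", "d", "e", "f"])

# The mask is computed once from scores...
mask = scores > 15

# ...and then selects the matching positions in both arrays.
selected_scores = scores[mask]
selected_labels = labels[mask]
print(selected_scores)
print(selected_labels)
```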


### Mathematical functions applied on vectors

A lot of mathematical functions can be applied to arrays and they are applied elementwise, such as:
- numpy.sqrt(x): square root
- numpy.sin(x)
- numpy.cos(x)
- numpy.tan(x)
- numpy.exp(x): exponential
- numpy.log(x): natural logarithm

In [51]:
x = numpy.array([1, 2, 3, 4])
print("x = ", x)
print("sqrt(x) = ", numpy.sqrt(x))
print("sin(x) = ", numpy.sin(x) )
print("cos(x) = ", numpy.cos(x) )
print("tan(x) = ", numpy.tan(x) )
print("exp(x) = ", numpy.exp(x) )
print("log(x) = ", numpy.log(x) )

x =  [1 2 3 4]
sqrt(x) =  [ 1.          1.41421356  1.73205081  2.        ]
sin(x) =  [ 0.84147098  0.90929743  0.14112001 -0.7568025 ]
cos(x) =  [ 0.54030231 -0.41614684 -0.9899925  -0.65364362]
tan(x) =  [ 1.55740772 -2.18503986 -0.14254654  1.15782128]
exp(x) =  [  2.71828183   7.3890561   20.08553692  54.59815003]
log(x) =  [ 0.          0.69314718  1.09861229  1.38629436]


### Summing over an array and finding the minimum or maximum
The following functions can be applied to the entire array x or along a single dimension:
- x.sum() and numpy.cumsum(x)
- x.min() and x.argmin()
- x.max() and x.argmax()

These functions take a parameter called axis. With axis=0 the result is computed per column, for example the sum or the minimum of each column. With axis=1 the result is computed per row. The same logic applies to higher-dimensional arrays.

Besides the normal sum function, NumPy also implements the cumulative sum. The cumsum function returns the cumulative sum of the elements along a given axis, which can be passed as the second argument. This is useful in probability calculations when you have the probabilities of the individual outcomes but need the cumulative distribution function.
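
A minimal sketch of both uses of cumsum follows. The probabilities are a made-up example (a fair four-sided die), chosen so that the running totals are easy to check by hand.

```python
import numpy

# Hypothetical discrete distribution: four equally likely outcomes.
probabilities = numpy.array([0.25, 0.25, 0.25, 0.25])
cumulative = numpy.cumsum(probabilities)   # running total: 0.25, 0.5, 0.75, 1.0
print(cumulative)

x = numpy.array([[1, 2], [3, 4]])
down_columns = numpy.cumsum(x, axis=0)   # accumulate down each column
along_rows = numpy.cumsum(x, axis=1)     # accumulate along each row
flattened = numpy.cumsum(x)              # no axis: accumulate over the flattened array
print(down_columns)
print(along_rows)
print(flattened)
```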

Note that argmin() and argmax(), when called without an axis, return the index into the flattened array (the linear index), not a separate index for each dimension.
For example, if x is a 2 by 2 matrix and argmin returns 1, this corresponds to position (0,1) in the matrix. Likewise, if argmax returns 3, this corresponds to position (1,1) in the matrix.
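
The correspondence between the linear index and the matrix position can be checked with numpy.unravel_index, as in this short sketch. It reuses the 2 by 2 matrix from the example cell of this section.

```python
import numpy

x = numpy.array([[6, 5], [7, 8]])

flat_index = x.argmin()                                # linear index of the minimum: 1
position = numpy.unravel_index(flat_index, x.shape)   # (row, column) tuple: (0, 1)
print(flat_index, position, x[position])

# With an axis, argmin returns one index per column (axis=0) or per row (axis=1).
per_column = x.argmin(axis=0)
print(per_column)
```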


In [48]:
x = numpy.array([[6, 5], [7, 8]])

# functions applied to the entire array:
print("sum:", x.sum())
print("minimum:", x.min(), "and index of minimum:", x.argmin())
print("maximum:", x.max(), "and index of maximum:", x.argmax())
print()

# functions applied to only one dimension of the array:
print("column sums:", x.sum(axis=0))
print("row sums:", x.sum(axis=1))
print("minimum per column:", x.min(axis=0))
print("maximum per row:", x.max(axis=1))

sum: 26
minimum: 5 and index of minimum: 1
maximum: 8 and index of maximum: 3

column sums: [13 13]
row sums: [11 15]
minimum per column: [6 5]
maximum per row: [6 8]


### Sorting
Arrays can be sorted in much the same way as lists in Python, using the methods sort() and argsort().
When applied to a 2-dimensional array, sort() by default sorts each row separately, so the indices returned by argsort() are positions within the row.

In [47]:
# sorting a 1-dimensional array:
print("Applied to 1-dimensional array")
x = numpy.array([5, 3, 6, 2, 6, 8])
print("unsorted x:", x)
y = x.argsort()
x.sort()
print("sorted x: ", x)
print("indices of argsort():", y)
print()

# sorting a 2-dimensional array:
print("Applied to 2-dimensional array")
x = numpy.array([[5, 3, 6], [2, 6, 8]])
print("unsorted x:", x)
y = x.argsort()
x.sort()
print("sorted x: ", x)
print("indices of argsort():", y)

Applied to 1-dimensional array
unsorted x: [5 3 6 2 6 8]
sorted x:  [2 3 5 6 6 8]
indices of argsort(): [3 1 0 2 4 5]

Applied to 2-dimensional array
unsorted x: [[5 3 6]
 [2 6 8]]
sorted x:  [[3 5 6]
 [2 6 8]]
indices of argsort(): [[1 0 2]
 [0 1 2]]
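
The per-row behaviour is only the default. As a sketch, the function numpy.sort accepts an axis argument and, unlike the method x.sort(), returns a sorted copy and leaves x unchanged.

```python
import numpy

x = numpy.array([[5, 3, 6], [2, 6, 8]])

by_row = numpy.sort(x)              # default axis=-1: each row is sorted
by_column = numpy.sort(x, axis=0)   # each column is sorted
flattened = numpy.sort(x, axis=None)  # the array is flattened first, then sorted

print(by_row)
print(by_column)
print(flattened)
print(x)   # still in the original order
```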


#### Exercise 3.1) Transformation
Load the iris train data set. Make sure that the data set is uploaded to your Jupyter server. The features are in array X. The features are the length and the width of the sepals and petals, measured in centimetres. Suppose we want to create the following transformations:
- Transform the first feature (sepal length in cm) to inches (1 inch = 2.54 cm).
- Transform the second feature (sepal width in cm) to the range between 0 and 1.
- Transform the third feature (petal length in cm) to a boolean value indicating whether or not it is larger than the mean.
- Transform the fourth feature (petal width in cm) by applying the exponential function.

Normally, when you use features to train an algorithm, you do not convert every feature to a different metric. Most of the time you do the opposite: you transform the features so that they share the same metric and are normalized. For now this is a good exercise to practice some transformations.


In [None]:
# Exercise 3.1
import csv

def load_table(path):
    # each row holds the feature values followed by the class label
    with open(path) as f:
        reader = csv.reader(f, delimiter=' ')
        return [ (row[0:-1], row[-1]) for row in reader ]

X, Y = zip(*load_table('iris-train.txt'))
X = numpy.array(X, dtype='float')



#### Exercise 3.2) Practice with 2-dimensional arrays
Create a 2-dimensional array called A. You are free to choose the content, as long as the data type is integer or float.
Then try the following:
- Compute the total sum of the elements in the array in two ways: first with the sum() function, and second with a loop instead of the sum() function.
- Return the minimum of the entire array and its index, but transform the linear index into a row index and a column index. Think of a general formula that transforms the linear index into a two-dimensional index. Afterwards, you can check your result with the unravel_index() function.
- Create another 2-dimensional array B with the same shape as A. Sort array A, and then reorder array B in the same way as array A. For example, if the element A[1,4] ends up in position [0,0] after sorting, then the element B[1,4] also has to end up in position [0,0]. For this you can use the argsort method, see the documentation: http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.argsort.html

In [None]:
# Exercise 3.2



### Rounding 

If you want to round every element in the array then the following rounding functions can be used:
- numpy.round(x, decimals = 2 )
- numpy.floor(x)
- numpy.ceil(x)


In [43]:
# rounding 
x = 10*numpy.random.random((1,5))
print("not rounded:", x)

x1 = numpy.round(x, decimals = 2)
print("round:", x1)

x2 = numpy.floor(x)
print("floor:", x2)

x3 = numpy.ceil(x)
print("ceil:", x3)

not rounded: [[ 6.95070703  6.9695847   1.20842674  2.29773688  2.45624574]]
round: [[ 6.95  6.97  1.21  2.3   2.46]]
floor: [[ 6.  6.  1.  2.  2.]]
ceil: [[ 7.  7.  2.  3.  3.]]
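
Two details of these functions are easy to overlook, so here is a small sketch with made-up values: numpy.round rounds values that lie exactly halfway to the nearest even number, and floor and ceil always move towards minus and plus infinity, also for negative numbers.

```python
import numpy

halves = numpy.array([0.5, 1.5, 2.5, 3.5])
rounded = numpy.round(halves)   # halfway cases go to the nearest even value
print(rounded)

values = numpy.array([-1.5, 1.5])
floored = numpy.floor(values)   # largest integer not above each value
ceiled = numpy.ceil(values)     # smallest integer not below each value
print(floored)
print(ceiled)
```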


### Statistics

To apply some basic statistical functions to the numpy array x, the following functions can be useful:
- numpy.median(x) : median
- numpy.mean(x) : mean
- numpy.average(x, axis= , weights= ) : (weighted) average
- numpy.std(x) : standard deviation
- numpy.var(x) : variance
- numpy.corrcoef(x) : Pearson product-moment correlation coefficients
- numpy.correlate(x, y) : cross-correlation between two vectors x and y
- numpy.cov(x) : covariance matrix

The summary functions (median, mean, average, std and var) can be applied to the entire array or along one axis, by passing the parameter axis. Similar functions exist that ignore NaN values: nanmedian(), nanmean(), nanstd() and nanvar().
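
The sketch below shows the axis parameter, the effect of a NaN value, and a weighted average. The array and the weights are made up for illustration.

```python
import numpy

x = numpy.array([[1.0, 2.0, 3.0],
                 [4.0, numpy.nan, 6.0]])

# mean per column: a single NaN makes the result for that column NaN
column_mean = numpy.mean(x, axis=0)
# nanmean skips the NaN and averages the remaining values in that column
column_nanmean = numpy.nanmean(x, axis=0)
print(column_mean)
print(column_nanmean)

# weighted average: (3*1 + 1*2 + 1*3) / (3 + 1 + 1)
weighted = numpy.average([1.0, 2.0, 3.0], weights=[3, 1, 1])
print(weighted)
```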

For more statistical functions in numpy: http://docs.scipy.org/doc/numpy/reference/routines.statistics.html

### Histograms

To compute the data of a histogram, without plotting it, the following functions can be used:
- numpy.histogram(x, bins=10 ) : basic histogram
- numpy.histogram2d(x, y) : histogram of two vectors x and y
- numpy.histogramdd(x) : multidimensional histogram

numpy.histogram returns two arrays: the first contains the frequencies and the second contains the boundaries of the bins in the histogram.

In [71]:
x = numpy.random.random((4,5))
print("Mean: ", numpy.mean(x), "and variance: ", numpy.var(x))

# histogram
numpy.histogram(x, bins=5)

Mean:  0.54218218314 and variance:  0.0663362651995


(array([4, 2, 4, 5, 5]),
 array([ 0.07968814,  0.24726595,  0.41484377,  0.58242158,  0.74999939,
         0.91757721]))
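
Since the example above uses random data, the following sketch uses fixed, made-up values so the result can be verified by hand. The range argument is set explicitly here to get round bin boundaries.

```python
import numpy

x = numpy.array([0.1, 0.2, 0.2, 0.6, 0.9, 1.0])

# two bins over [0, 1]: the boundaries are 0.0, 0.5 and 1.0
counts, edges = numpy.histogram(x, bins=2, range=(0.0, 1.0))
print(counts)   # three values fall in each bin
print(edges)    # one more boundary than there are bins
```

The last bin includes its right boundary, which is why the value 1.0 is counted.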

#### Exercise 3.3) Normalization
Load the iris train data set. Make sure that the data set is on your Jupyter server. The features of the data set are located in the array X. Then calculate the basic statistics for every feature and then normalize the features.

In [None]:
# Exercise 3.3
import csv

def load_table(path):
    # each row holds the feature values followed by the class label
    with open(path) as f:
        reader = csv.reader(f, delimiter=' ')
        return [ (row[0:-1], row[-1]) for row in reader ]

X, Y = zip(*load_table('iris-train.txt'))
X = numpy.array(X, dtype='float')




### Differences in values in a vector

A simple way to calculate the differences between consecutive values in a vector is the numpy.diff(x) function. A useful application is locating the positions where the value changes in a very large vector that contains long runs of equal values.

In [54]:
x = numpy.array([1, 3, 4, 5, 10, 2])
y = numpy.diff(x)
print(y)

[ 2  1  1  5 -8]
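
Locating the positions where the value changes can be sketched as follows. The vector is a made-up example, and combining numpy.diff with numpy.nonzero is one possible approach rather than the only one.

```python
import numpy

x = numpy.array([7, 7, 7, 9, 9, 4])

steps = numpy.diff(x)   # zero wherever two neighbouring values are equal
# nonzero gives the positions of the changes in steps; adding 1 turns them
# into the indices in x at which a new value starts
change_positions = numpy.nonzero(steps)[0] + 1
print(steps)
print(change_positions)
```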


#### Exercise 3.4) Number of consecutive equal values

One application of the numpy.diff function is counting the number of consecutive equal values. Take x = [10, 10, 10, 20, 20, 20, 20, 30, 30, 40, 40, 40, 40, 40] and return the unique values and, for each of them, the number of consecutive equal values.

In [None]:
# Exercise 3.4


