# 3. Vectorized operations

In [1]:
import numpy as np

## Vector addition with lists

Given two lists `x = [10, 20, 30, 40]` and `y = [5, 7, 52, 34]`, how would we sum the elements at corresponding indices?
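One possible answer with plain Python lists is sketched below. It assumes both lists have the same length; the name `sums` is only for illustration. Note that `+` on lists concatenates them rather than adding elementwise.

```python
x = [10, 20, 30, 40]
y = [5, 7, 52, 34]

# "+" on two lists joins them end to end; it does not add elementwise
concatenated = x + y
print(concatenated)

# pair up the elements at the same index and add each pair
sums = [a + b for a, b in zip(x, y)]
print(sums)  # [15, 27, 82, 74]
```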

## Numpy arrays make such operations easy.

## Operations on arrays with the same shape

- **Elementwise** operations apply an operator to elements at the same position in two arrays.

In [2]:
# basic operations between two arrays with the same shape:
x = np.array([10, 20, 30, 40], dtype=float)
y = np.array([5, 7, 52, 34], dtype=float)

In [3]:
print("y - x = ", y - x)

y - x =  [ -5. -13.  22.  -6.]


In [4]:
print("x + y = ", x + y)

x + y =  [ 15.  27.  82.  74.]


In [5]:
print("x * y = ", x * y)

x * y =  [   50.   140.  1560.  1360.]


In [6]:
print("x / y = ", x / y)

x / y =  [ 2.          2.85714286  0.57692308  1.17647059]


## Operations on arrays with different shapes

Operations on arrays with different shapes involve **broadcasting**.

For more information about broadcasting see:
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

There are several scenarios:

### Array and a scalar. 
There are no restrictions on the shape.

In [7]:
# scalar 
x = np.array([20, 25, 30, 35])
print("x - 2 = ", x - 2)
print("x * 2 = ", x * 2)
print("x **2 = ", x**2)

x - 2 =  [18 23 28 33]
x * 2 =  [40 50 60 70]
x **2 =  [ 400  625  900 1225]


### Array and a row vector. 
The number of columns in the array has to be the same as the length of the row vector.

In [8]:
x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([5, 5, 5]) # row vector
z = np.array([[1], [2]]) # column vector

print(x)
print()
print(y)
print()
print(z)

[[1 2 3]
 [4 5 6]]

[5 5 5]

[[1]
 [2]]


In [9]:
# array and row vector
print("Operations between x and y which are applied for each row")
print("x + y = \n", x+y)
print("x * y = \n", x*y)

Operations between x and y which are applied for each row
x + y = 
 [[ 6  7  8]
 [ 9 10 11]]
x * y = 
 [[ 5 10 15]
 [20 25 30]]


### Array and a column vector. 

The number of rows in the array has to be the same as the length of the column vector.


In [10]:
# array and column vector
print("Operations between x and z which are applied for each column")
print("x + z = \n", x+z)
print("x * z = \n", x*z)

Operations between x and z which are applied for each column
x + z = 
 [[2 3 4]
 [6 7 8]]
x * z = 
 [[ 1  2  3]
 [ 8 10 12]]
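The shape requirements above can be checked directly. The following is a small sketch, not part of the original notebook: it repeats `x`, `y` and `z` from the cells above and adds a hypothetical vector `w` whose length matches neither requirement, so the operation is expected to fail.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
y = np.array([5, 5, 5])                # shape (3,): matches the number of columns
z = np.array([[1], [2]])               # shape (2, 1): matches the number of rows
w = np.array([1, 2])                   # shape (2,): hypothetical mismatching vector

print((x + y).shape)  # (2, 3)
print((x + z).shape)  # (2, 3)

# the last dimensions are 3 and 2, so the shapes are incompatible
mismatch_message = None
try:
    x + w
except ValueError as err:
    mismatch_message = str(err)
    print("cannot broadcast:", mismatch_message)
```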


## Vector transformations

- Standardization: `z = (x - mean(x)) / stdev(x)`. Standardized values (z-scores) have zero mean and unit standard deviation. 

- Scaling: `y = (x - min(x)) / (max(x) - min(x))`, which brings the values into the range 0 to 1.

- Conversion between units of measurement. Some examples: from Fahrenheit to Celsius, from dollars to euros, or from inches to centimetres.


### Exercise 3.1

Define a function `standardize` which converts a vector of numbers to z-scores.


In [11]:
# 8< ..........................................
x = np.random.normal(-1,3,10)
print(x)

def standardize(x):
    return (x - np.mean(x))/np.std(x)

print(standardize(x))
print(np.mean(standardize(x)))
print(np.std(standardize(x)))


[-0.66027983 -2.41907795  0.00470794 -4.18076799 -3.82755385 -2.4734285
  4.11420284  1.88341238 -2.64559563 -2.74836844]
[ 0.25342289 -0.44850334  0.51881569 -1.15158371 -1.01061798 -0.47019433
  2.15889149  1.26859583 -0.53890525 -0.57992128]
7.77156117238e-17
1.0


### Exercise 3.2
- Define a function `to_cm` which takes a vector of measurements in inches and converts them to centimeters.
- Define a function `to_celsius` which takes a vector of measurements in Fahrenheit and converts them to Celsius: `C = (F - 32) / 1.8`


In [12]:
# 8< ..............................................
inch = np.array([1.0, 2.0, 10.0])
f = np.array([-40, 0.0, 100.0])

def to_cm(x):
    return x * 2.54
def to_celsius(x):
    return (x-32)/1.8

print(to_cm(inch))
print(to_celsius(f))

[  2.54   5.08  25.4 ]
[-40.         -17.77777778  37.77777778]


## Boolean operations on arrays

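Comparison operations can be applied elementwise in the same way as the arithmetic operations. The result is an array of boolean values.
- equal to (`==`),
- not equal to (`!=`),
- greater than (`>`) or greater than or equal to (`>=`),
- less than (`<`) or less than or equal to (`<=`).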

In [13]:
# boolean operations on arrays
x = np.array([10, 20, 30, 14, 15, 16])
y = np.array([7, 5, 5, 7, 5, 7]) 
print("(x > 15) = ", x>15)
print("(y == 7) = ", y==7)

(x > 15) =  [False  True  True False False  True]
(y == 7) =  [ True False False  True False  True]
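The remaining operators work the same way, and two arrays of the same shape can also be compared with each other. A brief sketch, reusing the values of `x` and `y` from the cell above; the name `count` is only for illustration:

```python
import numpy as np

x = np.array([10, 20, 30, 14, 15, 16])
y = np.array([7, 5, 5, 7, 5, 7])

print("(x != 20) = ", x != 20)
print("(x <= 15) = ", x <= 15)

# comparison between two arrays, element by element
print("(x >= 2*y) = ", x >= 2 * y)

# True counts as 1 when summed, so this counts the elements above 15
count = np.sum(x > 15)
print("number of elements > 15:", count)
```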


## Functions applied on vectors

All the following functions can be applied to arrays in an elementwise fashion:
- `np.sqrt`: square root
- `np.sin`: sine
- `np.cos`: cosine
- `np.tan`: tangent
- `np.exp`: exponential
- `np.log`: natural logarithm (base e)
- `np.log2`: base-2 logarithm 
- `np.log10`: base-10 logarithm

In [14]:
x = np.array([1, 2, 3, 4])
print("x = ", x)
print("sqrt(x) = ", np.sqrt(x))
print("sin(x) = ", np.sin(x) )
print("cos(x) = ", np.cos(x) )
print("tan(x) = ", np.tan(x) )
print("exp(x) = ", np.exp(x) )
print("log(x) = ", np.log(x) )
print("log2(x) = ", np.log2(x) )
print("log10(x) = ", np.log10(x) )

x =  [1 2 3 4]
sqrt(x) =  [ 1.          1.41421356  1.73205081  2.        ]
sin(x) =  [ 0.84147098  0.90929743  0.14112001 -0.7568025 ]
cos(x) =  [ 0.54030231 -0.41614684 -0.9899925  -0.65364362]
tan(x) =  [ 1.55740772 -2.18503986 -0.14254654  1.15782128]
exp(x) =  [  2.71828183   7.3890561   20.08553692  54.59815003]
log(x) =  [ 0.          0.69314718  1.09861229  1.38629436]
log2(x) =  [ 0.         1.         1.5849625  2.       ]
log10(x) =  [ 0.          0.30103     0.47712125  0.60205999]


## Reductions

The following operations can be applied to the entire array or to only one dimension:

- `.sum` and `.cumsum`
- `.min` and `.argmin`
- `.max` and `.argmax`
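The cumulative sum keeps all the partial sums instead of returning a single number. A short sketch on a small matrix; without an `axis` argument the array is flattened row by row first:

```python
import numpy as np

x = np.array([[1, 6, 5], [2, 7, 8]])

# running total over the flattened array 1, 6, 5, 2, 7, 8
flat_cumsum = np.cumsum(x)
print(flat_cumsum)  # [ 1  7 12 14 21 29]

# running total within each row
row_cumsum = np.cumsum(x, axis=1)
print(row_cumsum)
```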

In [15]:
x = np.array([[1, 6, 5], [2, 7, 8]])
print(x)
# functions applied to the entire array:
print("sum:", x.sum())
print("sum:", np.sum(x))
print("minimum:", x.min(), "and index of minimum:", x.argmin())
print("maximum:", x.max(), "and index of maximum:", x.argmax())

[[1 6 5]
 [2 7 8]]
sum: 29
sum: 29
minimum: 1 and index of minimum: 0
maximum: 8 and index of maximum: 5


One important thing to notice is that the index returned by `argmin` or `argmax` is the linear index and not the multidimensional index (see [2b_arrays.ipynb](2b_arrays.ipynb)).
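If the row and column position is needed, the linear index can be converted with `np.unravel_index`. A minimal sketch using the same matrix as above; the names `flat_index`, `row` and `col` are only for illustration:

```python
import numpy as np

x = np.array([[1, 6, 5], [2, 7, 8]])

flat_index = x.argmax()  # linear index into the flattened array
row, col = np.unravel_index(flat_index, x.shape)

print("linear index:", flat_index)
print("row and column:", row, col)
print("value at that position:", x[row, col])
```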

## Axis
Many reduction functions have a parameter called `axis`. When `axis=0` the operation is carried out on columns, so that the result has one element per column. When `axis=1` the operation is carried out on rows, so that the result has one element per row, and analogously for values `axis=2` or higher.

In [16]:
# functions applied to only one dimension of the array:
print("Sum columns:", x.sum(axis=0))
print("Sum rows:", x.sum(axis=1))
print("Minimum per column:", x.min(axis=0))
print("Maximum per row:", x.max(axis=1))

Sum columns: [ 3 13 13]
Sum rows: [12 17]
Minimum per column: [1 6 5]
Maximum per row: [6 8]
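One way to think about `axis` for higher dimensions is that the reduced axis disappears from the shape of the result. The sketch below uses a hypothetical 3-dimensional array `a` to show this:

```python
import numpy as np

# hypothetical array with shape (2, 3, 4)
a = np.arange(24).reshape(2, 3, 4)

print(a.sum(axis=0).shape)  # (3, 4): axis 0 is removed
print(a.sum(axis=1).shape)  # (2, 4): axis 1 is removed
print(a.sum(axis=2).shape)  # (2, 3): axis 2 is removed

# the first entry along axis=2 sums the values 0, 1, 2, 3
print(a.sum(axis=2)[0, 0])  # 6
```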


### Exercise 3.3
Define a function `scale` which takes a vector of numbers and brings them into the range from 0 to 1:
$$\mathrm{scale}(x_i) = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

In [17]:
# 8< .............................................
def scale(x):
    return (x - x.min())/(x.max() - x.min())

z = np.arange(0,10)
print(z)
print(scale(z))



[0 1 2 3 4 5 6 7 8 9]
[ 0.          0.11111111  0.22222222  0.33333333  0.44444444  0.55555556
  0.66666667  0.77777778  0.88888889  1.        ]


### Exercise 3.4a

The function `softmax` is often used in machine learning and statistics to convert a vector of arbitrary numbers into a vector of probabilities summing up to $1$. Softmax is computed by taking the exponential of each number, and then dividing each exponential by the sum of all the exponentials:
$$ \mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{k=1}^N \exp(x_k)}$$

Implement the softmax function. Verify that all numbers in the resulting vector are between 0 and 1. Verify that the resulting numbers sum up to $1$.

In [18]:
# 8< ...........................................
z = np.random.normal(0,2,10)
print(z)
def softmax(x):
    E = np.exp(x)
    return E /np.sum(E)

print(softmax(z))
print(np.sum(softmax(z)))
print(np.all(softmax(z) >= 0.0))
print(np.all(softmax(z) <= 1.0))

[ 1.29362293  1.60196491  0.25413965  2.97832401 -1.09922943  0.48074146
  2.19909529 -0.05847433  2.55426853 -0.12801561]
[ 0.06604408  0.08989698  0.02335565  0.356033    0.00603436  0.0292957
  0.1633336   0.01708541  0.23298357  0.01593764]
1.0
True
True
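A side note that goes beyond the exercise: for very large inputs `np.exp` can overflow. A common workaround, sketched here under the hypothetical name `softmax_stable`, is to subtract the maximum before exponentiating. This does not change the result, because the common factor cancels in the division.

```python
import numpy as np

def softmax_stable(x):
    # subtracting the maximum makes the largest exponent exp(0) = 1
    shifted = x - np.max(x)
    E = np.exp(shifted)
    return E / np.sum(E)

small = np.array([1.0, 2.0, 3.0])
big = np.array([1000.0, 1000.0, 1000.0])

print(softmax_stable(small))
print(softmax_stable(big))  # three equal probabilities
```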


### Exercise 3.4b

Implement a version of the `softmax` function which takes a matrix, and converts the values to probabilities such that each column sums up to 1.

In [19]:
# 8< ...........................................
z = np.random.normal(0,2,(4,5))
print(z)
print()
def softmax(x):
    E = np.exp(x)
    return E /np.sum(E, axis=0)

print(softmax(z))
print(np.sum(softmax(z), axis=0))
print(np.all(softmax(z) >= 0.0))
print(np.all(softmax(z) <= 1.0))

[[-5.35157607 -1.06649141  3.29657302 -1.69418694  1.28064987]
 [-0.60018168 -0.26807045 -3.33730707 -0.87365274  1.43964105]
 [-2.33028296 -1.09355215  0.90738449  0.24054154 -2.02341205]
 [ 1.77545606 -2.14270074 -2.83979995  0.43696363  1.23384472]]

[[  7.23359050e-04   2.20448015e-01   9.13090350e-01   5.37102168e-02
    3.16122109e-01]
 [  8.37256029e-02   4.89841987e-01   1.20076007e-03   1.22014202e-01
    3.70598675e-01]
 [  1.48417409e-02   2.14562520e-01   8.37341002e-02   3.71790943e-01
    1.16123204e-02]
 [  9.00709297e-01   7.51474789e-02   1.97478962e-03   4.52484638e-01
    3.01666896e-01]]
[ 1.  1.  1.  1.  1.]
True
True


## Sorting

For sorting, use the functions `sort` and `argsort`.

In [20]:
# sorting a 1-dimensional array:
print("Applied to 1-dimensional array")
x = np.array([5, 3, 6, 2, 6, 8])
print("Original x:           ", x)
print("Sorted   x:           ", np.sort(x))
y = x.argsort()
print("Indices of argsort:   ", y)
print("Sorted using indices: ", x[y])

Applied to 1-dimensional array
Original x:            [5 3 6 2 6 8]
Sorted   x:            [2 3 5 6 6 8]
Indices of argsort:    [3 1 0 2 4 5]
Sorted using indices:  [2 3 5 6 6 8]
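The indices from `argsort` are mainly useful for reordering a second array in the same way. A small sketch with made-up data; the arrays `scores` and `names` are hypothetical:

```python
import numpy as np

scores = np.array([5, 3, 6, 2])
names = np.array(["a", "b", "c", "d"])

# indices that would sort the scores from lowest to highest
order = np.argsort(scores)
print(order)  # [3 1 0 2]

# apply the same order to the names
names_by_score = names[order]
print(names_by_score)
```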


**Attention:** The method `.sort` sorts the array in place, that is, it overwrites the original order. Use with caution.

In [21]:
x = np.array([2, 3, 1])
print("x=", x)
x.sort()
print("x=", x)

x= [2 3 1]
x= [1 2 3]


Sorting can also be done per row or column.

In [22]:
x = np.array([[5, 3, 4],[2, 4, 2]])
print("Original: ")
print(x)
print("Columns are sorted: ")
print(np.sort(x, axis=0))

print("Rows are sorted: ")
print(np.sort(x, axis=1))


Original: 
[[5 3 4]
 [2 4 2]]
Columns are sorted: 
[[2 3 2]
 [5 4 4]]
Rows are sorted: 
[[3 4 5]
 [2 2 4]]


## Reversing

There is a special indexing syntax in `numpy` to obtain a view of the array in the reverse order. 

In [23]:
a = np.random.randint(0,10,5)
print(a)
print()
print(a[::-1])

[6 8 4 0 6]

[6 0 4 8 6]
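Because the reversed array is a view, it shares its data with the original. The sketch below illustrates this with the values printed above; the names `r` and `c` are only for illustration. Use `.copy()` when an independent array is needed.

```python
import numpy as np

a = np.array([6, 8, 4, 0, 6])
r = a[::-1]  # view on the same data, read back to front

print(np.shares_memory(a, r))  # True

# writing to the first element of the view changes the last element of a
r[0] = 99
print(a)

# an explicit copy does not share memory with the original
c = a[::-1].copy()
print(np.shares_memory(a, c))  # False
```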


### Exercise 3.5

The file `winequality-red.csv` contains measurements of wine samples, together with a quality rating. You can load this data into a pandas dataframe:

In [24]:
import pandas as pd
data = pd.read_csv("winequality-red.csv", delimiter=';')
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


1. Extract the column `quality` as a numpy array and sort the values from lowest to highest.
2. Now sort the same array from highest to lowest.
3. Sort the whole dataframe according to the values in the column `quality`, from highest to lowest.

In [25]:
# 8<---------------------

quality = data['quality'].values
# Ascending
quality_asc = np.sort(quality)
print(quality_asc)

[3 3 3 ..., 8 8 8]


In [26]:
# Descending
quality_desc = quality_asc[::-1]
print(quality_desc)

[8 8 8 ..., 3 3 3]


In [27]:
# 8<-----------
data.sort_values(by='quality', ascending=False)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
495,10.7,0.350,0.53,2.60,0.070,5.0,16.0,0.99720,3.15,0.65,11.00,8
1403,7.2,0.330,0.33,1.70,0.061,3.0,13.0,0.99600,3.23,1.10,10.00,8
390,5.6,0.850,0.05,1.40,0.045,12.0,88.0,0.99240,3.56,0.82,12.90,8
1061,9.1,0.400,0.50,1.80,0.071,7.0,16.0,0.99462,3.21,0.69,12.50,8
1202,8.6,0.420,0.39,1.80,0.068,6.0,12.0,0.99516,3.35,0.69,11.70,8
828,7.8,0.570,0.09,2.30,0.065,34.0,45.0,0.99417,3.46,0.74,12.70,8
481,9.4,0.300,0.56,2.80,0.080,6.0,17.0,0.99640,3.15,0.92,11.70,8
455,11.3,0.620,0.67,5.20,0.086,6.0,19.0,0.99880,3.22,0.69,13.40,8
1449,7.2,0.380,0.31,2.00,0.056,15.0,29.0,0.99472,3.23,0.76,11.30,8
440,12.6,0.310,0.72,2.20,0.072,6.0,29.0,0.99870,2.88,0.82,9.80,8


## Rounding

Rounding functions:
- `np.round`
- `np.floor`
- `np.ceil`


In [28]:
# rounding 
x = 10*np.random.random((1,5))
print("not rounded:", x)

x1 = np.round(x, decimals = 2)
print("round:", x1)

x2 = np.floor(x)
print("round down:", x2)

x3 = np.ceil(x)
print("round up:", x3)

not rounded: [[ 4.13868648  2.01932944  4.07060578  8.35601446  6.8164255 ]]
round: [[ 4.14  2.02  4.07  8.36  6.82]]
round down: [[ 4.  2.  4.  8.  6.]]
round up: [[ 5.  3.  5.  9.  7.]]
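Two details are easy to miss: `np.round` rounds values exactly halfway between two integers to the nearest even integer, and `np.floor` rounds negative numbers away from zero. A short sketch with a hypothetical array `halves`:

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, -0.5, -1.5])

# halfway cases go to the nearest even integer: 0, 2, 2, -0, -2
rounded = np.round(halves)
print("round:", rounded)

# floor always goes down, also for negative numbers: 0, 1, 2, -1, -2
floored = np.floor(halves)
print("round down:", floored)

# ceil always goes up: 1, 2, 3, -0, -1
ceiled = np.ceil(halves)
print("round up:", ceiled)
```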


## Statistics

Statistics functions:

- `np.median` : median
- `np.mean` : mean
- `np.average`: (weighted) average
- `np.std` : standard deviation
- `np.var` : variance
- `np.cov` : covariance matrix
- `np.corrcoef` : Pearson's product-moment correlation coefficients

These functions can be applied to the entire array, or to only one axis, specified via the `axis=` keyword. 

Similar functions exist which ignore NaN: `nanmedian`, `nanmean`, `nanstd`, `nanvar`.
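The following sketch, with made-up data, illustrates the difference between `np.mean` and `np.nanmean`, the `axis=` keyword, and the weights of `np.average`:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])
print(np.mean(x))     # nan, because one value is missing
print(np.nanmean(x))  # 2.0, the NaN is ignored

m = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
col_means = np.mean(m, axis=0)  # one mean per column
print(col_means)

# weighted average: (3*1 + 1*2 + 0*3) / (3 + 1 + 0)
weighted = np.average([1.0, 2.0, 3.0], weights=[3, 1, 0])
print(weighted)  # 1.25
```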

For more statistical functions in numpy: http://docs.scipy.org/doc/numpy/reference/routines.statistics.html

### Exercise 3.6a

Define a function `print_summary` which takes a pandas dataframe with numerical columns, and prints, for each column, basic statistics:

- name (name of the column in the data frame)
- mean 
- median
- min (minimum value)
- max (maximum value)
- std (standard deviation)

For example:
```
column: fixed_acidity
mean: 8.31963727329581
median: 7.9
min: 4.6
max: 15.9
std: 1.7405518001102729
column: volatile_acidity
mean: 0.5278205128205128
median: 0.52
min: 0.12
max: 1.58
...
```

### Exercise 3.6b
Modify the above function so that it takes an additional argument where the user can specify the number of decimal digits to display. For example, `print_summary(data, decimals=2)`:
```
column: fixed_acidity
mean: 8.32
median: 7.9
min: 4.6
max: 15.9
std: 1.74
....
```


In [29]:
import numpy as np
population = pd.read_csv("population.csv", delimiter='\t')
population.head()

Unnamed: 0,year,hare,lynx,carrot
0,1900,30000.0,4000.0,48300
1,1901,47200.0,6100.0,48200
2,1902,70200.0,9800.0,41500
3,1903,77400.0,35200.0,38200
4,1904,36300.0,59400.0,40600


In [30]:
# 8<-------------------------
def print_summary(df, decimals=2):
    for col in df.columns:
        print("column: {}".format(col))
        print("mean: {}"  .format(np.round(np.mean(   df[col]), decimals)))
        print("median: {}".format(np.round(np.median( df[col]), decimals)))
        print("min: {}"   .format(np.round(np.min(    df[col]), decimals)))
        print("max: {}"   .format(np.round(np.max(    df[col]), decimals)))
        print("std: {}"   .format(np.round(np.std(    df[col]), decimals)))
        print()
        
        
print_summary(population, decimals=3)

column: year
mean: 1910.0
median: 1910.0
min: 1900
max: 1920
std: 6.055

column: hare
mean: 34080.952
median: 25400.0
min: 7600.0
max: 77400.0
std: 20897.906

column: lynx
mean: 20166.667
median: 12300.0
min: 4000.0
max: 59400.0
std: 16254.592

column: carrot
mean: 42400.0
median: 41800.0
min: 36700
max: 48300
std: 3322.506



## Python modules

A Python module is a collection of reusable functions. You can create a module by putting some function definitions in a file with the extension `.py`. For example, put some of the functions you defined above in a file called `functions.py`. You can then use them from any notebook or other Python code by importing like this:

```python
from functions import * 
```
This will import all functions from this module, and they can be used directly.

The alternative is:

```python
import functions as F
```
where `F` is some shortened name. If your module has the function `scale`, you will then call it as `F.scale`.

## Imports inside modules (IMPORTANT)
The module must import everything it uses. For example, if your module `functions` uses numpy, then it needs to import it at the top of the file. It is not sufficient for the notebook which uses the module functions to import numpy.
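As an illustration, the file `functions.py` might look like the sketch below. The selection of functions is only an assumption; the point is the `import` line at the top of the file.

```python
# functions.py (sketch of a possible module file)

# the module imports numpy itself, independently of the notebook
import numpy as np


def standardize(x):
    # convert a vector of numbers to z-scores
    return (x - np.mean(x)) / np.std(x)


def scale(x):
    # bring the values of a vector into the range 0 to 1
    return (x - np.min(x)) / (np.max(x) - np.min(x))
```

Without the `import numpy as np` line, calling `standardize` from a notebook would raise a `NameError`, even if the notebook has imported numpy.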

Try this in a new notebook.


**For the programming exam you will submit Python modules with some function definitions.** 
Make sure you understand this concept if you haven't seen it before.

