![MLTrain logo](https://mltrain.cc/wp-content/uploads/2017/11/mltrain_logo-4.png "MLTrain logo")

To reproduce this session on your laptop, upload `numpy.ipynb` to AzureML Studio.

In [1]:
% run changeNBLayout.py

-------------------------
# Introduction #

The `numpy` package is the basis for numeric computation in Python: it provides __high-performance__ vector and matrix operations behind a compact API.  
`numpy` methods and functions are implemented in C and Fortran underneath, so when only numpy calls are made in sequence the performance is comparable to a direct C or Fortran implementation.  
A sequence of _numpy-only_ calls is generally known as a '_vectorized computation_'.
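
As a rough illustration of the difference, the sketch below computes the same sum of squares twice: once with an explicit Python loop and once as a single vectorized expression. The array `values` and both result names are made up for this example, and the code uses `print(...)` calls so that it also runs on Python 3.

```python
import numpy as np

values = np.arange(1000, dtype=np.float64)

# Pure-Python loop: the interpreter handles every element one at a time
loop_total = 0.0
for v in values:
    loop_total += v * v

# Vectorized: the multiplication and the sum both run inside compiled numpy code
vec_total = np.sum(values * values)

print(np.isclose(loop_total, vec_total))
```

Both paths give the same number; on large arrays the vectorized form is typically much faster because the per-element work never returns to the interpreter.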

__NB:__ numpy processes in-memory data, so the library offers little support for efficient reads from and writes to disk. It is the programmer's responsibility to either compact the in-memory representation of the data or handle paging to disk.  
  
The usual practice for handling large arrays is to use scipy's __sparse array__ libraries and numpy's _memory-mapped files_, both outside the scope of this lecture.

In [2]:
import numpy as np
import random as ran
from os import linesep as endl

# Our library version 
print 'Current Numpy version is:', np.__version__

Current Numpy version is: 1.13.3


In [8]:
np.show_config()

lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blis_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE


--------------------------
# Constructors #

### The ndarray constructor ###

In [3]:
arr = np.ndarray(shape = [2, 4], dtype = int, order = 'C')
print '2x4 integer ndarray with arbitrary memory contents:', endl, arr

2x4 integer ndarray with arbitrary memory contents: 
[[              0               0 140016874143536 140016874175296]
 [140016874104808 140016874175576 140017436771552               0]]


### The numpy.array factory function ###

In [4]:
# A numpy.ndarray object with 10 samples from the Gamma distribution
ran.seed(101)
data = [ran.gammavariate(.6, .8) for _ in xrange(10)]
print '10 elements from the Gamma distribution:', endl, np.array(data, dtype = float)

10 elements from the Gamma distribution: 
[ 0.45144449  0.31371191  0.08575201  0.14052511  0.01337472  0.01973911
  0.20938272  0.17227148  0.24616949  0.27341855]


### Specific ndarray constructors ###
There are special functions for constructing arrays with specific content:

In [6]:
# range ctor:
print 'Integer range ctor with arguments <start>, <stop>, <step>:', endl, np.arange(0, 10, 1)

Integer range ctor with arguments <start>, <stop>, <step>: 
[0 1 2 3 4 5 6 7 8 9]


In [7]:
# linspace
print '10 equispaced elements in [-1, 1)', endl, np.linspace(-1, 1, 10, endpoint = False)


10 equispaced elements in [-1, 1) 
[-1.  -0.8 -0.6 -0.4 -0.2  0.   0.2  0.4  0.6  0.8]


In [8]:
# logspace
print '10 logarithmically spaced elements in [e^-1, e^1]', endl, np.logspace(-1, 1, 10, endpoint = True, base = np.e)



10 logarithmically spaced elements in [e^-1, e^1] 
[ 0.36787944  0.45942582  0.57375342  0.71653131  0.89483932  1.11751907
  1.39561243  1.742909    2.17662993  2.71828183]
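
The relationship between `arange`, `linspace` and `logspace` can be checked directly. The sketch below is only an illustration (the names `lin`, `by_hand`, `log` and `exponents` are not part of the session): with `endpoint = False` the `linspace` step is `(stop - start) / num`, and `logspace` raises `base` to evenly spaced exponents.

```python
import numpy as np

# linspace with endpoint=False: step is (1 - (-1)) / 10 = 0.2
lin = np.linspace(-1, 1, 10, endpoint=False)
by_hand = -1 + 0.2 * np.arange(10)

# logspace: evenly spaced exponents, i.e. base ** linspace(start, stop, num)
log = np.logspace(-1, 1, 10, endpoint=True, base=np.e)
exponents = np.linspace(-1, 1, 10, endpoint=True)

print(np.allclose(lin, by_hand))
print(np.allclose(log, np.e ** exponents))
```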


In [5]:
# identity
print np.eye(5)

[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]]


In [9]:
# zeros
print '2 by 4 zeros', endl, np.zeros(shape = [2, 4])



2 by 4 zeros 
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]


In [10]:
# ones
print '2 by 4 units', endl, np.ones(shape = [2, 4])



2 by 4 units 
[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]


In [11]:
# full
print '2 by 4 of value randomly selected from the normal distribution:' 
print np.full([2, 4], fill_value = ran.normalvariate(0., .1), dtype = float)


2 by 4 of value randomly selected from the normal distribution:
[[ 0.10838194  0.10838194  0.10838194  0.10838194]
 [ 0.10838194  0.10838194  0.10838194  0.10838194]]


### meshgrid and mgrid ###

These functions generate grid points in $\mathbb R^n$ (n is the dimension of the grid) from a list of coordinate vectors.  
So if x = [1, 2] and y = [0, 1] each of `meshgrid` and `mgrid` will generate 2 2D matrices of the form:  

``` Python
[[1, 2],  [[0, 0],
 [1, 2]],  [1, 1]]
```

This is particularly useful for evaluating multivariate functions using vector operations, as we shall see below.

`meshgrid`

In [12]:
x = np.linspace(-1, 1, 4, endpoint = False)
y = np.logspace(-1, 1, 4, base = 10, endpoint = True)

X, Y = np.meshgrid(x, y, indexing = 'xy')

# np.set_printoptions(linewidth = 132, precision = 3)
print 'X = ', endl, X
print endl, 'Y =', endl, Y

print endl, 'Coordinate grid:', endl, [[(round(_x, 3), round(_y, 3)) for _x in x] for _y in y]

X =  
[[-1.  -0.5  0.   0.5]
 [-1.  -0.5  0.   0.5]
 [-1.  -0.5  0.   0.5]
 [-1.  -0.5  0.   0.5]]

Y = 
[[  0.1          0.1          0.1          0.1       ]
 [  0.46415888   0.46415888   0.46415888   0.46415888]
 [  2.15443469   2.15443469   2.15443469   2.15443469]
 [ 10.          10.          10.          10.        ]]

Coordinate grid: 
[[(-1.0, 0.1), (-0.5, 0.1), (0.0, 0.1), (0.5, 0.1)], [(-1.0, 0.464), (-0.5, 0.464), (0.0, 0.464), (0.5, 0.464)], [(-1.0, 2.154), (-0.5, 2.154), (0.0, 2.154), (0.5, 2.154)], [(-1.0, 10.0), (-0.5, 10.0), (0.0, 10.0), (0.5, 10.0)]]


Evaluating  $\mathtt {sin ({x^2} + {y^2})}$  on the generated $\mathtt{(X, Y)}$ grid:

In [13]:
np.sin(X**2 + Y**2)

array([[ 0.84683184,  0.25708055,  0.00999983,  0.25708055],
       [ 0.93752378,  0.44881914,  0.21378067,  0.44881914],
       [-0.5984752 , -0.98398663, -0.99749472, -0.98398663],
       [ 0.45202579, -0.27728286, -0.50636564, -0.27728286]])
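
The same grid evaluation can also be written without building `X` and `Y` explicitly. The following sketch, which assumes the same `x` and `y` vectors as above, relies on broadcasting a row vector against a column vector; `on_grid` and `by_broadcast` are illustrative names.

```python
import numpy as np

x = np.linspace(-1, 1, 4, endpoint=False)
y = np.logspace(-1, 1, 4, base=10, endpoint=True)

X, Y = np.meshgrid(x, y, indexing='xy')
on_grid = np.sin(X**2 + Y**2)

# x as a 1x4 row and y as a 4x1 column broadcast to the same 4x4 result
by_broadcast = np.sin(x[np.newaxis, :]**2 + y[:, np.newaxis]**2)

print(on_grid.shape)
print(np.allclose(on_grid, by_broadcast))
```

`meshgrid` is convenient when the coordinate matrices themselves are needed (for plotting, say); broadcasting avoids allocating them when only the function values matter.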

`mgrid` is syntactic sugar for `start:end:step` meshgrids:

In [8]:
i, j = np.mgrid[4:0:-1, 0:8:2]
print i
print endl, j

[[4 4 4 4]
 [3 3 3 3]
 [2 2 2 2]
 [1 1 1 1]]

[[0 2 4 6]
 [0 2 4 6]
 [0 2 4 6]
 [0 2 4 6]]
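
Note from the output that the first slice varies along the rows, which corresponds to matrix-style (`'ij'`) indexing rather than the `'xy'` indexing used in the `meshgrid` cell. A small check, with `rows` and `cols` as illustrative names:

```python
import numpy as np

i, j = np.mgrid[4:0:-1, 0:8:2]

# Equivalent construction: explicit ranges plus indexing='ij',
# where the first vector varies along rows and the second along columns
rows, cols = np.meshgrid(np.arange(4, 0, -1), np.arange(0, 8, 2), indexing='ij')

print(np.array_equal(i, rows) and np.array_equal(j, cols))
```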


### Random array generators ###
Numpy has a rich set of functions for sampling from univariate and multivariate distributions.  
Ref: https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html

In [15]:
print 'A 4 by 4 array with samples drawn from the uniform distribution in [0., 1.):'
print np.random.rand(4, 4)


A 4 by 4 array with samples drawn from the uniform distribution in [0., 1.):
[[ 0.75666134  0.57062423  0.90765377  0.73192674]
 [ 0.61907143  0.81945312  0.60340252  0.77756764]
 [ 0.46019069  0.16219173  0.95292449  0.88069386]
 [ 0.99714982  0.80113108  0.94189929  0.98103806]]


In [16]:
print '4 floats drawn from U[0., 1.)'
print np.random.sample(size = 4)



4 floats drawn from U[0., 1.)
[ 0.03327835  0.74927103  0.00351881  0.53768181]


In [17]:
# Create a vector of probabilities using 'toProb', then use np.random.choice
# to select with replacement (bootstrap) from {0,...,9} 4 elements:
toProb = lambda _: _/sum(_)
print 'Bootstrap from data using probability distribution'
print np.random.choice(np.arange(10), size = 4, replace = True, p = toProb(np.random.sample(10)))



Bootstrap from data using probability distribution
[5 7 9 9]
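
The draws above change on every run. When repeatable results are needed, one option is to work with explicitly seeded `np.random.RandomState` objects. The sketch below is an assumption-laden illustration: the seed `101` and the names `rng_a`, `rng_b`, `weights` and `probs` are arbitrary choices, not part of the session.

```python
import numpy as np

# Two generators seeded identically produce identical streams
rng_a = np.random.RandomState(101)
rng_b = np.random.RandomState(101)

weights = rng_a.random_sample(10)
probs = weights / weights.sum()        # normalize so the entries sum to 1

draw_a = rng_a.choice(np.arange(10), size=4, replace=True, p=probs)

rng_b.random_sample(10)                # consume the same 10 numbers first
draw_b = rng_b.choice(np.arange(10), size=4, replace=True, p=probs)

print(np.array_equal(draw_a, draw_b))
```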


In [18]:
# Random bytes, each drawn uniformly from {0, ..., 255}
print 'Random bytes', endl, repr(np.random.bytes(5))


Random bytes 
'\x10\x121\x1b\xd9'


### Distributions ### 
The list of supported distributions is long; see [the reference page](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html) for the full set.

In [19]:
print 'Floats drawn from N(0, 1):'
print np.random.randn(5, 5)


Floats drawn from N(0, 1):
[[-1.25429664 -0.7606203   0.01656388  0.09359257 -1.00257475]
 [-0.22227737  1.49466576  0.56839911 -1.79843154  0.50618778]
 [ 0.44979852 -0.06396355  0.21936311  0.96753712  0.48622614]
 [-1.47508678  0.52961624 -1.73713092  1.17493539  0.20939395]
 [ 2.73741004  0.18507445  1.18037864  0.44321329  0.25311176]]


In [20]:
print endl, 'Chi-square distribution with 5 degrees of freedom:'
print np.random.chisquare(5, 5)



Chi-square distribution with 5 degrees of freedom:
[ 4.4469966   6.18528533  2.14855819  7.91270868  6.20203405]


In [21]:
print endl, '10 samples from the exponential distribution'
print np.random.exponential(size = 10)


10 samples from the exponential distribution
[ 2.08566106  0.15850284  0.08979094  1.20079065  0.48318528  3.46827911
  2.3598703   2.30678439  0.18095178  0.00598623]


---------------------------------
# ndarray attributes #

In [102]:
from string import ascii_letters
sarr = np.array(list(ascii_letters), dtype = str)

print 'An array of characters:', endl, sarr
print sarr.dtype, sarr.shape, sarr.size, sarr.ndim, sarr.nbytes, sarr.itemsize


An array of characters: 
['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z' 'A' 'B' 'C' 'D' 'E' 'F' 'G'
 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z']
|S1 (52,) 52 1 52 1
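
The attributes printed above are related to each other: `size` is the product of the entries of `shape`, and `nbytes` equals `size * itemsize`. A small sketch on a hypothetical 3x4 array of 16-bit integers (the name `m` is illustrative):

```python
import numpy as np

m = np.zeros((3, 4), dtype=np.int16)

print(m.shape)      # length of each axis
print(m.ndim)       # number of axes
print(m.size)       # total number of elements
print(m.itemsize)   # bytes per element
print(m.nbytes)     # size * itemsize
```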


<span style = "font-size: 180%; font-weight: bold; color: darkgreen">Quiz</span>  
Create a 10x10 matrix of random floats.  
What's the size of it in bytes?

---------------------------------------------------
# Indexing and selection #

In [116]:
arr = np.arange(10)

# Element accessor:
print arr[5]

# Range and 'shard' accessors (slices)
print arr[2:5]
print arr[::-1]
print arr[:-1:2]
print arr[-1::-2]

5
[2 3 4]
[9 8 7 6 5 4 3 2 1 0]
[0 2 4 6 8]
[9 7 5 3 1]


### Fancy indexing ###
'Fancy' indexing is indexing with lists (or arrays) of integers:

In [3]:
arr2 = np.linspace(-1, 1, 12).reshape(4, -1)
print arr2[[1, 3]]
print endl, arr2[:, [0, 1]]

[[-0.45454545 -0.27272727 -0.09090909]
 [ 0.63636364  0.81818182  1.        ]]

[[-1.         -0.81818182]
 [-0.45454545 -0.27272727]
 [ 0.09090909  0.27272727]
 [ 0.63636364  0.81818182]]
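
One practical difference between slices and fancy indexing is worth keeping in mind: a basic slice is a view that shares memory with the original array, whereas indexing with a list returns an independent copy. A minimal sketch (the names `view` and `picked` are illustrative):

```python
import numpy as np

arr = np.arange(10)

view = arr[2:5]          # basic slice: shares memory with arr
picked = arr[[2, 3, 4]]  # integer-list index: independent copy

view[0] = 99             # writes through to arr[2]
picked[1] = -1           # leaves arr untouched

print(arr)
```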


In [118]:
# Counter-intuitive: the following will NOT return a 2x2 array:
arr2[[0, 1], [0, 1]]

array([-1.   , -0.273])
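
The reason is that two index lists are matched element by element, so the expression addresses the positions (0, 0) and (1, 1) instead of the cross product of rows and columns. The sketch below spells this out; `paired` and `explicit` are illustrative names.

```python
import numpy as np

arr2 = np.linspace(-1, 1, 12).reshape(4, -1)

# ([0, 1], [0, 1]) is read as the coordinate pairs (0, 0) and (1, 1)
paired = arr2[[0, 1], [0, 1]]
explicit = np.array([arr2[0, 0], arr2[1, 1]])

print(paired.shape)
print(np.array_equal(paired, explicit))
```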

<div style = "color: darkred; font-size: 200%; font-weight: bold;  text-decoration: underline"> 
Exercise 
</div>  

Given a 2D array `np.random.randn(10, 5)` select a subarray consisting of rows 1, 3 and 4 and columns 0, 2, 3

### Assignment ###

In [112]:
# Assign 3 to first row's elements. This is an application of 'broadcasting'
arr2[0] = 3
print arr2

[[ 3.     3.     3.   ]
 [-0.455 -0.273 -0.091]
 [ 0.091  0.273  0.455]
 [ 0.636  0.818  1.   ]]
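
Broadcasting in assignments is not limited to scalars. As a sketch under the same 4x3 setup, a row vector can be assigned to a block of rows and a list with one value per row can be assigned to a column; the specific values are arbitrary.

```python
import numpy as np

arr2 = np.linspace(-1, 1, 12).reshape(4, -1)

arr2[0] = 3                      # scalar broadcast over the whole first row
arr2[1:3] = [7, 8, 9]            # one row vector copied into rows 1 and 2
arr2[:, 1] = [10, 20, 30, 40]    # one value per row for column 1

print(arr2)
```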


In [115]:
# A fancier example of fancy indexing:
arr3 = np.linspace(-1, 1, 10)
arr3Ix = np.random.choice(10, size = [4, 4])
arr3[arr3Ix]

array([[-0.778, -0.111,  0.556,  0.556],
       [-0.556, -1.   ,  1.   ,  0.333],
       [-0.111, -1.   ,  0.556,  0.778],
       [-0.556, -1.   , -0.111,  1.   ]])

### Boolean indexing ###

In the following examples, we use hypothetical grades of MLTrain instructors in 4 courses.  
The `names` vector contains the instructor names and `grades` is a matrix with one column per course. Each row of `grades` refers to the entry of `names` at the same position.

In [6]:
grades = np.random.uniform(0, 1, size = [5, 4])
names = np.array(['Alex', 'Chris', 'Nick', 'Chris', 'Chris'])


In [7]:
print 'A mask for Chris:', endl, names == 'Chris'


A mask for Chris: 
[False  True False  True  True]


In [28]:
print 'Select the grades of Chris:'
print grades[names == 'Chris']


Select the grades of Chris:
[[ 0.29648067  0.62647038  0.62315227  0.28414599]
 [ 0.94647544  0.61264416  0.53760535  0.66312209]
 [ 0.08808139  0.43668582  0.05748348  0.15250833]]


In [29]:
print 'and the grades of the others:'
print grades[names != 'Chris']


and the grades of the others:
[[ 0.29971325  0.62940657  0.93927455  0.65521197]
 [ 0.6413069   0.13714796  0.08391276  0.6144379 ]]


In [30]:
print 'the grades of Chris and Alex:'
print grades[np.isin(names, ['Alex', 'Chris'])]


the grades of Chris and Alex:
[[ 0.29971325  0.62940657  0.93927455  0.65521197]
 [ 0.29648067  0.62647038  0.62315227  0.28414599]
 [ 0.94647544  0.61264416  0.53760535  0.66312209]
 [ 0.08808139  0.43668582  0.05748348  0.15250833]]


In [31]:
print 'Select the grades of Chris that are > .4 (nasty):'
print grades[(grades > .4) & np.repeat(names == 'Chris', 4).reshape(-1, 4)]

Select the grades of Chris that are > .4 (nasty):
[ 0.62647038  0.62315227  0.94647544  0.61264416  0.53760535  0.66312209
  0.43668582]
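
The `repeat` / `reshape` construction can be avoided by letting broadcasting expand the row mask. The sketch below compares the two; it generates its own `grades` from a seeded `RandomState` (the seed `0` and the names `tiled`, `column` and `same` are assumptions made for this example).

```python
import numpy as np

rng = np.random.RandomState(0)   # seed chosen arbitrarily for this sketch
grades = rng.uniform(0, 1, size=[5, 4])
names = np.array(['Alex', 'Chris', 'Nick', 'Chris', 'Chris'])

# Construction used above: repeat the row mask 4 times and reshape to 5x4
tiled = np.repeat(names == 'Chris', 4).reshape(-1, 4)

# Broadcasting alternative: a 5x1 column mask stretches across the 4 columns
column = (names == 'Chris')[:, np.newaxis]

same = np.array_equal(grades[(grades > .4) & tiled],
                      grades[(grades > .4) & column])
print(same)
```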


### where ###

In [162]:
# We want to grade qualitatively, so grades > .5 become 'HIGH' and the rest 'LOW'
qGrades = np.where(grades <= .5, 'LOW', 'HIGH')

print "Chris' grades:"
print qGrades[names == 'Chris']

Chris' grades:
[['HIGH' 'HIGH' 'LOW' 'LOW']
 ['HIGH' 'LOW' 'HIGH' 'LOW']
 ['LOW' 'LOW' 'LOW' 'LOW']]


### choose ###
From N arrays a1, ..., aN of the same shape and an index array ix of equal shape with values in {0, ..., N-1}, `choose` builds an output array whose element at each position is taken from the array selected by ix at that position (the value k selects the (k+1)-th array).  
  
Let's emulate a random experiment using `choose`:  
I toss a coin 10 times and if I get heads I select from the normal distribution, otherwise from the exponential.  

For those with a probability background, I have created a __mixture model__.  

### <span style = "color: darkred;">  Quiz: </span> ###
<span style = "background: #E9B96E">
Let's test our knowledge on 'Statistical Inference':  
There's an underlying classification problem in the construction below. Can you describe it?
</span>

In [166]:
choices = np.random.choice([0, 1], 10)
arr1 = np.random.standard_normal(10)
arr2 = np.random.standard_exponential(10)

# The candidate arrays are passed together as one sequence (second argument);
# a third positional argument would be interpreted as the `out` buffer
np.choose(choices, [arr1, arr2])
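
`np.choose` takes the index array first and the candidate arrays as a single sequence in its second argument. The mechanics are easier to follow on small deterministic inputs; the arrays `picks`, `a0` and `a1` below are made up for illustration.

```python
import numpy as np

picks = np.array([0, 1, 1, 0])
a0 = np.array([10, 11, 12, 13])
a1 = np.array([20, 21, 22, 23])

# Position k of the result is a0[k] when picks[k] == 0 and a1[k] when picks[k] == 1
chosen = np.choose(picks, [a0, a1])

# With only two candidate arrays this matches np.where
same = np.array_equal(chosen, np.where(picks == 0, a0, a1))

print(chosen)
print(same)
```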

--------------
# Matrix operations #

### Element-wise operations ###

In [57]:
arr1 = np.tile(np.linspace(1, 4, 4, dtype = np.float32), 4).reshape(4, -1)
arr2 = np.random.choice(np.random.randn(4), [4, 4])

print arr1
print arr2

[[ 1.  2.  3.  4.]
 [ 1.  2.  3.  4.]
 [ 1.  2.  3.  4.]
 [ 1.  2.  3.  4.]]
[[-1.15735847  0.73996427 -1.15735847 -1.15735847]
 [-0.78481835  0.73996427 -0.278621   -0.78481835]
 [ 0.73996427  0.73996427  0.73996427 -0.278621  ]
 [-0.78481835 -0.278621   -0.278621   -0.78481835]]


In [58]:
print arr1 + arr2

[[-0.15735847  2.73996427  1.84264153  2.84264153]
 [ 0.21518165  2.73996427  2.721379    3.21518165]
 [ 1.73996427  2.73996427  3.73996427  3.721379  ]
 [ 0.21518165  1.721379    2.721379    3.21518165]]


In [59]:
print arr1 * arr2

[[-1.15735847  1.47992854 -3.47207542 -4.6294339 ]
 [-0.78481835  1.47992854 -0.83586301 -3.13927341]
 [ 0.73996427  1.47992854  2.2198928  -1.11448402]
 [-0.78481835 -0.55724201 -0.83586301 -3.13927341]]


In [60]:
print arr2 / (arr1)

[[-1.15735847  0.36998213 -0.38578616 -0.28933962]
 [-0.78481835  0.36998213 -0.09287367 -0.19620459]
 [ 0.73996427  0.36998213  0.24665476 -0.06965525]
 [-0.78481835 -0.1393105  -0.09287367 -0.19620459]]


In [62]:
print 'Element-wise function application (ufunc):'
print np.log(arr1)

Element-wise function application (ufunc):
[[ 0.          0.69314718  1.09861231  1.38629436]
 [ 0.          0.69314718  1.09861231  1.38629436]
 [ 0.          0.69314718  1.09861231  1.38629436]
 [ 0.          0.69314718  1.09861231  1.38629436]]


### Matrix multiplication ###

In [178]:
print 'Matrix multiplication:'
print np.dot(arr1, arr2)

print endl, 'Matrix-vector multiplications:'
print np.dot(arr1, arr2[:, 0])
print np.dot(arr2[0], arr1)

Matrix multiplication:
[[  2.542 -11.484  -5.994 -11.667]
 [  2.542 -11.484  -5.994 -11.667]
 [  2.542 -11.484  -5.994 -11.667]
 [  2.542 -11.484  -5.994 -11.667]]

Matrix-vector multiplications:
[ 2.542  2.542  2.542  2.542]
[ -3.624  -7.248 -10.872 -14.496]
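
It is easy to confuse `*` with matrix multiplication. The sketch below contrasts the two on small integer matrices whose products can be checked by hand; `A`, `B` and `v` are illustrative.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])
v = np.array([1, -1])

print(A * B)          # element-wise product
print(np.dot(A, B))   # matrix product
print(np.dot(A, v))   # A times v, with v treated as a column vector
print(np.dot(v, A))   # v times A, with v treated as a row vector
```

A 1D array has no row or column orientation of its own; `np.dot` interprets it according to the side on which it appears.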


### Linear algebra (BLAS and LAPACK) ###
See https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.linalg.html

In [187]:
import numpy.linalg as la

# svd returns the third factor already transposed (conjugate-transposed for complex input)
U, sigma, Vh = la.svd(arr2)
print 'Singular values of arr2:'
print sigma

eigValues, eigVectors = la.eig(arr2)
print endl, 'The eigenvalues of arr2 are:'
print eigValues

Singular values of arr2:
[ 7.127  4.499  1.87   0.997]

The eigenvalues of arr2 are:
[ 2.340+1.501j  2.340-1.501j -1.973+1.961j -1.973-1.961j]
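
A quick way to sanity-check a decomposition is to rebuild the matrix from its factors. The sketch below does this for the SVD of a small symmetric matrix `M` (chosen for this example so that the values are easy to verify) and uses `la.eigh`, the eigenvalue routine for symmetric matrices, which returns real eigenvalues.

```python
import numpy as np
import numpy.linalg as la

M = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 0., 2.]])

# svd returns U, the singular values (largest first) and V already transposed
U, sigma, Vt = la.svd(M)
rebuilt = np.dot(U * sigma, Vt)      # U * sigma scales column k of U by sigma[k]

# For a symmetric matrix, eigh returns real eigenvalues in ascending order
eig_values, eig_vectors = la.eigh(M)

print(np.allclose(rebuilt, M))
print(eig_values)
```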


-----------------------------------
# Restructuring arrays #

### reshape | resize ###

In [32]:
arr = np.arange(10)
print arr


[0 1 2 3 4 5 6 7 8 9]


In [39]:
print 'Make a 2x5 array:'
print arr.reshape(2, 5)


Make a 2x5 array:
[[0 1 2 3 4]
 [5 6 7 8 9]]


In [40]:
print 'This is the same'
print arr.reshape(2, -1)


This is the same
[[0 1 2 3 4]
 [5 6 7 8 9]]


In [41]:
print '"newaxis" does not change the data, but it inserts an axis of length 1 and so reshapes as well:'
print arr.reshape(2, -1)[:, np.newaxis, :]


"newaxis" does not change the data, but it inserts an axis of length 1 and so reshapes as well:
[[[0 1 2 3 4]]

 [[5 6 7 8 9]]]


In [42]:
print 'ndarray.resize modifies the array in place and pads with zeros:'
x = arr.copy()
x.resize(5, 3)
print x


ndarray.resize modifies the array in place and pads with zeros:
[[0 1 2]
 [3 4 5]
 [6 7 8]
 [9 0 0]
 [0 0 0]]


In [43]:
print 'numpy.resize returns a new array and pads by repeating the input from the start:'
print np.resize(arr, [5, 3])

numpy.resize returns a new array and pads by repeating the input from the start:
[[0 1 2]
 [3 4 5]
 [6 7 8]
 [9 0 1]
 [2 3 4]]


### repeating and tiling ###

In [44]:
arr = np.arange(4)
print 'Repeat each element twice:'
print np.repeat(arr, 2)


Repeat each element twice:
[0 0 1 1 2 2 3 3]


In [45]:
print endl, 'We can have different repeats per element:'
print np.repeat(arr, [2, 1, 0, 3])




We can have different repeats per element:
[0 0 1 3 3 3]


In [46]:
print endl, 'We can repeat along an axis:'
print np.repeat(arr.reshape(2, 2), 2, axis = 1)



We can repeat along an axis:
[[0 0 1 1]
 [2 2 3 3]]


In [47]:
print endl, '"Tile" repeats the array as a block, adding new axes if necessary:'
print np.tile(arr, [3, 2, 2])


"Tile" repeats the array as a block, adding new axes if necessary:
[[[0 1 2 3 0 1 2 3]
  [0 1 2 3 0 1 2 3]]

 [[0 1 2 3 0 1 2 3]
  [0 1 2 3 0 1 2 3]]

 [[0 1 2 3 0 1 2 3]
  [0 1 2 3 0 1 2 3]]]


### stacking ###

`stack` joins a list of equally shaped arrays along a new dimension.  
For vectors the result is intuitive, but for arrays with more dimensions it is more involved.  
To see what actually happens, add an extra dimension to one of the arrays before stacking, as below:

In [48]:
print 'Stack along the 1st dim (rows):'
print arr[np.newaxis, :]
print np.stack([arr, -arr], axis = 0)


Stack along the 1st dim (rows):
[[0 1 2 3]]
[[ 0  1  2  3]
 [ 0 -1 -2 -3]]


In [52]:
print 'Stack along columns'
print arr[:, np.newaxis]
print np.stack([arr, -arr], axis = 1)


Stack along columns
[[0]
 [1]
 [2]
 [3]]
[[ 0  0]
 [ 1 -1]
 [ 2 -2]
 [ 3 -3]]


In [53]:
print 'Stack 2 2D arrays one after the other:'
arr2 = arr.reshape(2, 2)
print np.stack([arr2, -arr2], axis = 0)


Stack 2 2D arrays one after the other:
[[[ 0  1]
  [ 2  3]]

 [[ 0 -1]
  [-2 -3]]]


In [54]:
print 'Now stack by transforming each element to a 1x2 array:'
print np.stack([arr2, -arr2], axis = 2)

Now stack by transforming each element to a 1x2 array:
[[[ 0  0]
  [ 1 -1]]

 [[ 2 -2]
  [ 3 -3]]]
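
A compact way to keep track of the results is to look at shapes only: stacking N arrays of a common shape inserts a new axis of length N at position `axis`. The sketch below uses a hypothetical 3x4 array `a` so that every axis has a distinct length.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Two inputs, so the new axis has length 2; it lands at position k
shapes = [np.stack([a, -a], axis=k).shape for k in (0, 1, 2)]
print(shapes)

# Stacking is the same as adding the axis first and then concatenating along it
manual = np.concatenate([a[:, np.newaxis, :], (-a)[:, np.newaxis, :]], axis=1)
print(np.array_equal(manual, np.stack([a, -a], axis=1)))
```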


### Concatenate ###
To "stack" without adding a new dimension, use `concatenate`.  
Since the join axis must already exist, vectors can only be concatenated on axis = 0 and 2D arrays on axis = 0 or 1.

In [56]:
arr = np.arange(4).reshape(2, 2)
print np.concatenate([arr, -arr], axis = 1)

[[ 0  1  0 -1]
 [ 2  3 -2 -3]]
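
In terms of shapes, `concatenate` adds up the lengths along the join axis and requires the inputs to agree on every other axis. A short sketch with illustrative names `rows`, `cols` and `wide`:

```python
import numpy as np

arr = np.arange(4).reshape(2, 2)

rows = np.concatenate([arr, -arr], axis=0)   # 4x2: inputs placed one below the other
cols = np.concatenate([arr, -arr], axis=1)   # 2x4: inputs placed side by side

# Only the axes that are NOT being joined have to match
wide = np.concatenate([arr, np.zeros((2, 3), dtype=int)], axis=1)

print(rows.shape)
print(cols.shape)
print(wide.shape)
```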


<div style = "color: darkred; font-size: 200%; font-weight: bold;  text-decoration: underline"> 
Exercise 
</div>  

Concatenate 2 vectors columnwise.  
Is there a more direct way using `np.stack`?