# PDS NumPy Workshop

## Structure

1 - NumPy
* Motivation for using NumPy instead of standard Python lists
* NumPy basics - what are ndarrays/numpy arrays, creating arrays with lists (of dimensions 1, 2, and 3), ndim, shape, size, dtype, np.zeros, np.ones, arange, linspace
* Indexing and slicing with multidimensional arrays, reshaping and reshaping with -1 as argument
* Stacking (vstack, hstack)
* Math operations with NumPy - adding/subtracting arrays of the same shape, adding/subtracting/multiplying an array by a constant with broadcasting, elementwise product, matrix product, aggregate functions like sum(), min(), max()
* Linear algebra with np.linalg, solving systems of linear equations, finding eigenvalues of matrix, inverse, matrix power

2 - PCA with MNIST
* explain MNIST dataset and goal - classifying digits based on their grayscale pixel values
* explain PCA - dimensionality reduction to get the directions where the image varies most
* the actual PCA code, in steps a-e:
* a - Preprocess data to convert to mean 0 and stdev 1
* b - Compute covariances and eigenvalue/eigenvectors, choose top k eigenvectors to capture variance
* c - Project training images onto reduced dimensionality eigenbasis and reproject to standard basis
* d - Project test images onto eigenbasis and back
* e - Test the transformed test images vs the transformed training images with labels using k nearest neighbors
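
Steps a-d above can be sketched in plain NumPy. This is a minimal illustration on small synthetic data rather than MNIST; the array sizes, the train/test split, the choice `k = 2`, and all variable names are assumptions made for the example, not the workshop's actual code.

```python
import numpy as np

# Synthetic stand-in for flattened images: 200 samples, 10 features (assumption, not MNIST).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))            # two hidden factors drive most of the variation
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))
X_train, X_test = X[:150], X[150:]

# a - standardize each feature to mean 0 and stdev 1, using training statistics only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
std[std == 0] = 1.0                           # guard against constant features
Z_train = (X_train - mean) / std
Z_test = (X_test - mean) / std

# b - covariance matrix, eigenvalues/eigenvectors, keep the top k eigenvectors
cov = np.cov(Z_train, rowvar=False)           # shape (10, 10)
eig_vals, eig_vecs = np.linalg.eigh(cov)      # eigh handles symmetric matrices, ascending order
order = np.argsort(eig_vals)[::-1]            # reorder from largest to smallest eigenvalue
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
k = 2
basis = eig_vecs[:, :k]                       # columns are the top-k directions
explained = eig_vals[:k].sum() / eig_vals.sum()

# c - project training data onto the eigenbasis and reproject to the standard basis
train_coords = Z_train @ basis                # shape (150, k)
train_approx = train_coords @ basis.T         # shape (150, 10)

# d - same projection for the test data, reusing the training basis
test_coords = Z_test @ basis
test_approx = test_coords @ basis.T

print(f"fraction of variance captured by {k} components: {explained:.3f}")
```

Step e would then compare rows of `test_coords` (or `test_approx`) against the labeled training rows with k nearest neighbors.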

## NumPy Basics

### ndarrays

In [140]:
# Python lists are the usual built-in way to hold "arrays of numbers"
ls = [1,2,3,4]

In [141]:
ls

[1, 2, 3, 4]

In [2]:
# List elements can be of different types, which makes lists flexible, but they get slow when dealing with large 
# amounts of data and it's not easy to do complex math/linear algebra with them

In [142]:
# That's why we use NumPy instead - Python library for scientific computing and linear algebra that is based on 
# its "ndarray" data structure.
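
To make the motivation concrete, here is a small sketch of how the same arithmetic behaves on a list and on an ndarray. The variable names are made up for illustration.

```python
import numpy as np

values = [1.5, 2.0, 3.25]

# With a list, arithmetic on every element needs an explicit loop or comprehension
doubled_list = [2 * v for v in values]

# On a list, * means repetition, not multiplication
repeated = values * 2

# On an ndarray, * is applied to every element
values_arr = np.array(values)
doubled_arr = values_arr * 2

print(repeated)
print(doubled_arr)
```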

In [11]:
import numpy as np # this is the standard way to import it

In [12]:
# One dimensional array (vector)
vec = np.array([1,2,3,4])

In [13]:
vec

array([1, 2, 3, 4])

In [14]:
# Two dimensional array (matrix)
mat = np.array([[1,2,3],[4,5,6]])

In [15]:
mat

array([[1, 2, 3],
       [4, 5, 6]])

In [16]:
# Three dimensional array (3-d tensor)
tensor = np.array([[[1,2,3], [4,5,6]], [[7,8,9], [10, 11, 12]]])

In [17]:
tensor

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [19]:
# In general, can manipulate arrays with an arbitrary number of dimensions

In [21]:
# To get the number of dimensions of an array, use ndim.
print(vec.ndim) # 1
print(mat.ndim) # 2
print(tensor.ndim) # 3

1
2
3


In [27]:
# To get the shape of an array, use shape
print(vec.shape) 
print(mat.shape)
print(tensor.shape)

(4,)
(2, 3)
(2, 2, 3)


In [28]:
# To get the number of elements in an array, use size
print(vec.size) 
print(mat.size)
print(tensor.size)

4
6
12


In [31]:
# NumPy has convenient ways to quickly construct arrays without manually creating a list first.
zeros = np.zeros((3, 4)) # creates 3 x 4 matrix filled with zeros

In [33]:
zeros

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [35]:
ones = np.ones((3, 4)) # creates 3 x 4 matrix filled with ones

In [36]:
ones

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
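
The trailing dots in the outputs above (`0.`, `1.`) show that `np.zeros` and `np.ones` create floating-point arrays by default. Every array has a single element type, available as `dtype`. A short sketch, with illustrative variable names:

```python
import numpy as np

ints = np.array([1, 2, 3])                 # all integers -> integer dtype
floats = np.array([1, 2.5, 3])             # one float promotes the whole array to float64
default_zeros = np.zeros((3, 4))           # float64 unless told otherwise
int_zeros = np.zeros((3, 4), dtype=int)    # request an integer array explicitly
as_float = ints.astype(np.float64)         # astype returns a converted copy

print(ints.dtype, floats.dtype, default_zeros.dtype, int_zeros.dtype)
```

The exact integer width (for example 32 or 64 bits) can depend on the platform and NumPy version, so the printed integer dtype may differ between machines.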

In [37]:
nums = np.arange(10) # creates vector filled with numbers from 0 to 9, inclusive

In [39]:
nums

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [40]:
linear = np.linspace(0, 10, 50) # creates vector filled with 50 equally spaced numbers from 0 to 10, inclusive

In [42]:
linear

array([ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
        1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469,
        2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
        3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102,
        4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
        5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735,
        6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
        7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367,
        8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
        9.18367347,  9.3877551 ,  9.59183673,  9.79591837, 10.        ])
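
Because `linspace` includes both endpoints, 50 points from 0 to 10 are spaced 10/49 apart, not 10/50. The sketch below checks this with the `retstep` and `endpoint` options of `np.linspace`.

```python
import numpy as np

# retstep=True also returns the spacing between consecutive samples
linear, step = np.linspace(0, 10, 50, retstep=True)

# endpoint=False leaves out the stop value, giving a spacing of 10/50 = 0.2
no_end = np.linspace(0, 10, 50, endpoint=False)

print(step)          # spacing with the endpoint included
print(no_end[:3])    # first few samples without the endpoint
```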

### Indexing

In [43]:
# Indexing into 1-D NumPy arrays is very similar to indexing into Python lists

In [44]:
arr = np.arange(9)**2

In [45]:
arr

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64])

In [48]:
arr[0] # NumPy arrays are 0-indexed, like lists in Python and arrays in Java/C

0

In [50]:
arr[0:2] # element at index 0 is included, element at index 2 is not, just like list slicing in Python

array([0, 1])

In [52]:
# can also iterate through 1-D ndarrays like you would through lists
for elem in arr:
    print(elem)

0
1
4
9
16
25
36
49
64


In [54]:
# Multidimensional arrays need an index for each of their axes/dimensions

In [58]:
multi = np.array([[1,2,3,4,5],[6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])

In [59]:
multi

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

In [60]:
multi[0, 2] # element in row 0, column 2

3

In [62]:
multi[0:2, 1:5] # array with elements in rows 0-1 (row 2 not included) and columns 1-4

array([[ 2,  3,  4,  5],
       [ 7,  8,  9, 10]])

In [63]:
multi[:, 2:4] # array with elements in all rows and columns 2-3 of the original

array([[ 3,  4],
       [ 8,  9],
       [13, 14]])

In [64]:
multi[-1, :] # array with the last row of the original

array([11, 12, 13, 14, 15])
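
One difference from lists is worth knowing: basic slicing of an ndarray returns a view onto the same data, not a copy, so writing through the slice changes the original. The sketch below also shows boolean-mask indexing, which the cells above do not cover; it is included here as an extra illustration.

```python
import numpy as np

multi = np.arange(1, 16).reshape(3, 5)

# Boolean mask: pick the elements that satisfy a condition (returns a new 1-D array)
evens = multi[multi % 2 == 0]

# Basic slicing returns a view, so this assignment also changes multi[0, 0]
first_row = multi[0, :]
first_row[0] = 100

# Use .copy() when the original must stay untouched
second_row = multi[1, :].copy()
second_row[0] = -1

print(evens)
print(multi)
```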

### Reshaping and stacking

In [65]:
# Can change the shape of an array - useful when we need data to be in a specific form for a computation

In [72]:
orig = np.arange(9)
arr = orig.reshape(3, 3) # reshape returns an array with the same values in the specified shape (a view when possible); orig keeps its shape

In [73]:
orig

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [74]:
arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [75]:
orig.resize(3, 3) # resize changes the shape of the array in place

In [76]:
orig

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [77]:
orig.resize(9)

In [80]:
new_arr = orig.reshape(3, -1) # when -1 is an argument to reshape, NumPy figures out what the one missing dimension 
                              # must be on its own (only one dimension can be -1)

In [83]:
new_arr # in this case, since the original array has 9 elements, if its reshaped version has 3 rows it must have 3 
        # columns

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
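
A few more uses of -1 in `reshape`, as a small sketch. The 12-element array is chosen only for illustration.

```python
import numpy as np

orig = np.arange(12)

two_rows = orig.reshape(2, -1)    # 12 elements in 2 rows -> 6 columns
column = orig.reshape(-1, 1)      # a single column -> 12 rows
flat = two_rows.reshape(-1)       # -1 alone flattens back to 1-D

# If the sizes cannot match, reshape raises a ValueError
try:
    orig.reshape(5, -1)
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(two_rows.shape, column.shape, flat.shape, reshape_failed)
```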

In [93]:
a = np.arange(6).reshape(2, 3)
b = np.arange(8).reshape(2, 4)
c = np.arange(9).reshape(3, 3)

In [94]:
a

array([[0, 1, 2],
       [3, 4, 5]])

In [95]:
b

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [96]:
c

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [99]:
np.hstack((a, b)) # if two arrays have the same number of rows, can stack them horizontally with hstack

array([[0, 1, 2, 0, 1, 2, 3],
       [3, 4, 5, 4, 5, 6, 7]])

In [100]:
np.vstack((a, c)) # if two arrays have the same number of columns, can stack them vertically with vstack

array([[0, 1, 2],
       [3, 4, 5],
       [0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
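
When the shapes do not line up, stacking fails. The sketch below shows that failure, the equivalent `np.concatenate` call with an explicit axis, and how `vstack` treats a 1-D array as a single row.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
c = np.arange(9).reshape(3, 3)

# vstack on 2-D arrays is concatenation along axis 0 (rows)
stacked = np.concatenate((a, c), axis=0)

# a has 2 rows and c has 3, so they cannot be stacked horizontally
try:
    np.hstack((a, c))
    hstack_ok = True
except ValueError:
    hstack_ok = False

# A 1-D array with 3 elements is treated as one row by vstack
row = np.arange(3)
with_row = np.vstack((a, row))

print(stacked.shape, hstack_ok, with_row.shape)
```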

In [145]:
mat = np.arange(9).reshape(3, 3)

In [147]:
mat

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [148]:
mat.T # transposes an array

array([[0, 3, 6],
       [1, 4, 7],
       [2, 5, 8]])

### Useful functions on arrays

In [110]:
a = np.arange(1, 9)

In [111]:
a

array([1, 2, 3, 4, 5, 6, 7, 8])

In [112]:
a.sort() # sorts the array in place (a is already sorted here, so nothing changes)

In [113]:
a

array([1, 2, 3, 4, 5, 6, 7, 8])

In [114]:
a.max() # max element

8

In [115]:
a.min() # min element

1

In [116]:
a.sum() # sum of all elements

36

In [117]:
a.prod() # product of all elements

40320

In [137]:
a.mean() # average of elements

4.5

In [138]:
a.std() # standard deviation of elements

2.29128784747792
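
On multidimensional arrays, the same aggregate functions accept an `axis` argument to reduce along rows or columns instead of over everything. A minimal sketch, which also shows the `ddof` option of `std`:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]

col_sums = m.sum(axis=0)          # collapse the rows: one sum per column
row_sums = m.sum(axis=1)          # collapse the columns: one sum per row
row_max = m.max(axis=1)           # largest element in each row
col_means = m.mean(axis=0)        # mean of each column

# std() divides by N by default; ddof=1 divides by N - 1 (sample standard deviation)
a = np.arange(1, 9)
population_std = a.std()
sample_std = a.std(ddof=1)

print(col_sums, row_sums, row_max, col_means)
```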

In [128]:
# Let's do a speed test - sum all numbers from 0 to 10^8-1

In [133]:
nums_list = list(range(10**8)) # using Python lists

In [134]:
%%time
sum(nums_list)

CPU times: user 1.31 s, sys: 3.48 s, total: 4.78 s
Wall time: 6.17 s


4999999950000000

In [135]:
np_list = np.arange(10**8) # using NumPy

In [136]:
%%time
np_list.sum() # should be about 20x as fast

CPU times: user 123 ms, sys: 108 ms, total: 231 ms
Wall time: 271 ms


4999999950000000
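
The `%%time` magic only works inside Jupyter/IPython. Outside a notebook, the same comparison can be sketched with the standard-library `timeit` module. The smaller size (10**6) and the repetition count are arbitrary choices to keep the run short, and the measured ratio will vary by machine.

```python
import timeit

import numpy as np

n = 10**6
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)   # explicit 64-bit integers so the sum cannot overflow

# Time 5 repetitions of each summation
list_time = timeit.timeit(lambda: sum(py_list), number=5)
numpy_time = timeit.timeit(lambda: np_arr.sum(), number=5)

print(f"list:  {list_time:.4f} s")
print(f"numpy: {numpy_time:.4f} s")
```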

### Basic math

In [151]:
a = np.arange(9).reshape(3, 3)
b = (np.arange(9)**2).reshape(3, 3)

In [152]:
a

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [153]:
b

array([[ 0,  1,  4],
       [ 9, 16, 25],
       [36, 49, 64]])

In [154]:
a + b # can add arrays with the same shape elementwise

array([[ 0,  2,  6],
       [12, 20, 30],
       [42, 56, 72]])

In [155]:
a + 3 # can add a constant to all elements in an array

array([[ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [156]:
a * b # can multiply arrays with the same shape elementwise

array([[  0,   1,   8],
       [ 27,  64, 125],
       [216, 343, 512]])

In [157]:
a * 3 # can multiply all elements in an array by a constant

array([[ 0,  3,  6],
       [ 9, 12, 15],
       [18, 21, 24]])
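
Adding or multiplying by a constant is the simplest case of broadcasting. The same mechanism lets arrays of different but compatible shapes be combined, as in this sketch (the arrays `row` and `col` are made up for illustration):

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
row = np.array([10, 20, 30])              # shape (3,)
col = np.array([[100], [200], [300]])     # shape (3, 1)

plus_row = a + row    # row is added to every row of a
plus_col = a + col    # col is added to every column of a

# A common use: subtract each column's mean from that column
centered = a - a.mean(axis=0)

print(plus_row)
print(plus_col)
print(centered)
```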

### Linear algebra

In [158]:
# NumPy has built-in functions for most common linear algebra operations, many of them in np.linalg

In [162]:
vec1 = np.arange(9)
vec2 = np.arange(9)**2

In [163]:
vec1

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [164]:
vec2

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64])

In [165]:
vec1.dot(vec2) # dot product of two vectors (1-D arrays)

1296

In [166]:
mat1 = np.arange(9).reshape(3, 3)
mat2 = (np.arange(9)**2).reshape(3, 3)

In [168]:
mat1

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [169]:
mat2

array([[ 0,  1,  4],
       [ 9, 16, 25],
       [36, 49, 64]])

In [167]:
mat1 @ mat2 # matrix product of two matrices (2-D arrays)

array([[ 81, 114, 153],
       [216, 312, 432],
       [351, 510, 711]])
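
It is easy to mix up `@` (matrix product) and `*` (elementwise product). The sketch below contrasts them and shows that `@` also handles matrix-vector products and checks that the inner dimensions agree.

```python
import numpy as np

mat1 = np.arange(9).reshape(3, 3)
vec = np.array([1, 0, -1])

matrix_product = mat1 @ mat1      # rows of the left times columns of the right
elementwise = mat1 * mat1         # each element squared
mat_vec = mat1 @ vec              # matrix times vector gives a vector

# For 2-D arrays, np.dot gives the same result as @
same_as_dot = np.array_equal(matrix_product, np.dot(mat1, mat1))

# Inner dimensions must match: (3, 3) @ (2, 3) is not defined
rect = np.ones((2, 3))
try:
    mat1 @ rect
    shapes_ok = True
except ValueError:
    shapes_ok = False

print(mat_vec, same_as_dot, shapes_ok)
```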

In [171]:
mat1 @ mat1 @ mat1 # can be awkward to write, so ...

array([[ 180,  234,  288],
       [ 558,  720,  882],
       [ 936, 1206, 1476]])

In [172]:
np.linalg.matrix_power(mat1, 3) # raise matrix to a power

array([[ 180,  234,  288],
       [ 558,  720,  882],
       [ 936, 1206, 1476]])

In [174]:
eig_vals, eig_vectors = np.linalg.eig(mat1)

In [175]:
eig_vals # eigenvalues (the last one is 0 up to floating-point error, since mat1 is singular)

array([ 1.33484692e+01, -1.34846923e+00, -2.48477279e-16])

In [177]:
eig_vectors # columns in this matrix are the corresponding (normalized) eigenvectors

array([[ 0.16476382,  0.79969966,  0.40824829],
       [ 0.50577448,  0.10420579, -0.81649658],
       [ 0.84678513, -0.59128809,  0.40824829]])
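
A quick way to sanity-check the output of `np.linalg.eig` is to verify the defining relation A v = λ v for every column at once. A minimal sketch:

```python
import numpy as np

mat1 = np.arange(9).reshape(3, 3)
eig_vals, eig_vectors = np.linalg.eig(mat1)

# Multiplying by eig_vals scales column j by eigenvalue j, so this checks A V = V diag(lambda)
relation_holds = np.allclose(mat1 @ eig_vectors, eig_vectors * eig_vals)

# Each eigenvector (column) has unit length
norms = np.linalg.norm(eig_vectors, axis=0)

# The eigenvalues sum to the trace, and the near-zero eigenvalue reflects rank 2
trace_matches = np.isclose(eig_vals.sum(), np.trace(mat1))
rank = np.linalg.matrix_rank(mat1)

print(relation_holds, norms, trace_matches, rank)
```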

In [182]:
np.linalg.det(mat2) # matrix determinant

-216.00000000000006

In [184]:
np.trace(mat1) # trace of a matrix

12

In [186]:
np.linalg.inv(mat2) # matrix inverse

array([[ 0.93055556, -0.61111111,  0.18055556],
       [-1.5       ,  0.66666667, -0.16666667],
       [ 0.625     , -0.16666667,  0.04166667]])
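
Solving a system of linear equations A x = b is listed in the workshop structure; a minimal sketch with `np.linalg.solve` follows. The right-hand side `rhs` and the small singular matrix are made up for illustration.

```python
import numpy as np

mat2 = (np.arange(9) ** 2).reshape(3, 3)
rhs = np.array([1.0, 2.0, 3.0])

# Solve mat2 @ x = rhs directly, without forming the inverse
x = np.linalg.solve(mat2, rhs)
solution_ok = np.allclose(mat2 @ x, rhs)

# The inverse times the original matrix gives the identity (up to rounding)
inverse = np.linalg.inv(mat2)
inverse_ok = np.allclose(mat2 @ inverse, np.eye(3))

# A singular matrix has no inverse; here the second row is twice the first
singular = np.array([[1.0, 2.0], [2.0, 4.0]])
try:
    np.linalg.inv(singular)
    invertible = True
except np.linalg.LinAlgError:
    invertible = False

print(x, solution_ok, inverse_ok, invertible)
```

Using `solve` is generally preferable to computing `inv(A) @ b`, since it avoids forming the inverse explicitly.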