# Numpy tutorial

Oliver W. Layton

CS251/2: Data Analysis and Visualization

Spring 2023

In [1]:
import numpy as np 
import time

## Numpy ndarray basics

### Creation from Python lists

In [2]:
# Make a numpy array from a 2D python list
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [3]:
# print it
print(arr)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


### Data type of ndarray

In [4]:
# determine data type
print('Type of array is\n', arr.dtype)

Type of array is
 int64


The data type can be set or changed in a few ways:

1. when creating array — (a) implicitly or (b) explicitly
2. by casting types.

In [5]:
# 1a implicitly
arr = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
print(arr)
print(arr.dtype)


[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]
float64


In [6]:
# 1b explicitly
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype = float)
print(arr)
print(arr.dtype)

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]
float64


In [7]:
# 2. Cast with astype. NOTE: This is a METHOD of the array, not a FUNCTION
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr_float = arr.astype(float)
print(arr_float)
print(arr_float.dtype)

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]
float64


In [8]:
# The dtype can also be a string. Be careful in your CSV parser that your "numbers"
# aren't actually strings!
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]],dtype = str)
print(arr)

[['1' '2' '3']
 ['4' '5' '6']
 ['7' '8' '9']]
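
If numbers do arrive as strings (for example from a hand-written CSV parser), `astype` can convert them back to a numeric type. The following is a minimal sketch; the variable names `str_arr` and `num_arr` are illustrative and not part of the original notebook.

```python
import numpy as np

# Numbers that were read in as strings
str_arr = np.array([['1', '2', '3'], ['4', '5', '6']])
print(str_arr.dtype.kind)   # U  (a unicode string dtype)

# Cast to float so that arithmetic behaves as expected
num_arr = str_arr.astype(float)
print(num_arr.dtype)        # float64
print(num_arr.sum())        # 21.0
```

Checking `dtype` right after loading data is a cheap way to catch this problem early.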


### Convert back to Python list

In [9]:
# Convert back from ndarray to Python list
arrAsList = arr.tolist()
print('Back as a Python list:\n', arrAsList)

Back as a Python list:
 [['1', '2', '3'], ['4', '5', '6'], ['7', '8', '9']]


### Other ways to create ndarrays quickly

#### 1. zeros

- Pass a list (the shape) to get a multi-dimensional array
- Pass a single int to get a 1D vector of that length

In [10]:
arr = np.zeros([10,3])
arr

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [11]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

#### 2. ones

In [12]:
arr = np.ones([5,7])
arr

array([[1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.]])

In [13]:
# can easily make any constant array
arr_10s = 10*np.ones([3,4])
print(arr_10s)

[[10. 10. 10. 10.]
 [10. 10. 10. 10.]
 [10. 10. 10. 10.]]


#### 3. Random values

In [14]:
# Uniform random values
np.random.seed(0)
arr_rand = np.random.random([2,4])
arr_rand

array([[0.5488135 , 0.71518937, 0.60276338, 0.54488318],
       [0.4236548 , 0.64589411, 0.43758721, 0.891773  ]])

In [15]:
np.random.randint(low = 0, high=11, size = (5,5))



array([[ 8, 10,  1,  6,  7],
       [ 7,  8,  1,  5,  9],
       [ 8,  9,  4,  3,  0],
       [ 3,  5,  0,  2,  3],
       [ 8,  1,  3,  3,  3]])

#### 4. Equally spaced floats in an interval

In [16]:
np.linspace(0,20, 4)

array([ 0.        ,  6.66666667, 13.33333333, 20.        ])

#### 5. Equally spaced ints in an interval

In [17]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [18]:
np.arange(-5,10,2)

array([-5, -3, -1,  1,  3,  5,  7,  9])
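
The two spacing functions above treat the end of the interval differently, which is an easy source of off-by-one errors. A small sketch of the contrast (the variable names are illustrative):

```python
import numpy as np

# arange is step-based: the stop value is excluded
ints = np.arange(0, 10, 2)
print(ints)      # [0 2 4 6 8]

# linspace is count-based: the stop value is included by default
floats = np.linspace(0, 10, 6)
print(floats)    # [ 0.  2.  4.  6.  8. 10.]
```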

#### 6. Identity matrix

In [19]:
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

### Check dimensions — `shape`

In [20]:
# check shape of 2D array
arr_rand.shape

(2, 4)

In [21]:
big = np.zeros([3,4,5,6])
big.shape

(3, 4, 5, 6)

In [22]:
# check number of dimensions (M)
print(big.ndim)
print(arr_rand.ndim)

4
2


In [23]:
# Access 1st dim (#rows), 2nd dim (#cols)
print(arr_rand.shape[0], arr_rand.shape[1])

2 4


In [24]:
# Check number of elements total
print('Num elements in arr_rand:', arr_rand.size)

Num elements in arr_rand: 8
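
The three attributes above are related: `ndim` is the length of `shape`, and `size` is the product of the entries in `shape`. A quick sketch to check this on the 4D array from earlier (`np.prod` is an addition here, not used in the original notebook):

```python
import numpy as np

big = np.zeros([3, 4, 5, 6])
print(big.ndim == len(big.shape))        # True
print(big.size)                          # 360
print(big.size == np.prod(big.shape))    # True
```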


## Brief detour: Rapidly build python lists (list comprehension)

In [25]:
# Brief detour: In python you can replace the workflow of 
# list-building by creating an empty list and looping to append...
myList = []
for i in range(5):
    myList.append(i)
print('myList built the usual way', myList)

myList built the usual way [0, 1, 2, 3, 4]


In [26]:
# ...with Python list comprehensions
myListComp = [i for i in range(5)]
print('myListComp', myListComp)

myListComp [0, 1, 2, 3, 4]


In [27]:
# you can build lists using any function of i. How about i^2?
myListSqr = [i**2 for i in range(5)]
print('myListSqr', myListSqr)

myListSqr [0, 1, 4, 9, 16]
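
Two follow-ups that may be useful: a comprehension can also filter with an `if` clause, and for purely numeric work the same result is often available directly from numpy. A short sketch (variable names are illustrative):

```python
import numpy as np

# Comprehension with a filter: keep only even values
evens = [i for i in range(10) if i % 2 == 0]
print(evens)          # [0, 2, 4, 6, 8]

# The squares example, as a list and as an ndarray
squares_list = [i**2 for i in range(5)]
squares_arr = np.arange(5)**2
print(squares_list)   # [0, 1, 4, 9, 16]
print(squares_arr)    # [ 0  1  4  9 16]
```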


## ndarray indexing

Basic accessing and modifying of ndarrays.

### Access and modify single elements

In [28]:
# To access elements in a multidimensional ndarray use ONE set of square brackets []
# Make a new random array
np.random.seed(0)  # ensures random numbers come up the same each time. Useful for debugging.
arr = np.random.randint(low = 1, high = 11, size = (3,4,5))
print(arr)
print(arr.shape)

[[[ 6  1  4  4  8]
  [10  4  6  3  5]
  [ 8  7  9  9  2]
  [ 7  8  8  9  2]]

 [[ 6 10  9 10  5]
  [ 4  1  4  6  1]
  [ 3  4  9  2  4]
  [ 4  4  8  1  2]]

 [[10 10  1  5  8]
  [ 4  3  8  3  1]
  [ 1  5  6  6  7]
  [ 9  5  2  5 10]]]
(3, 4, 5)


In [29]:
# Get the element at index (1, 1, 1)
arr[1,1,1]

1

In [30]:
# Modifying single values is similar
arr[0,0,0] = 99
print('arr is now:\n', arr)

arr is now:
 [[[99  1  4  4  8]
  [10  4  6  3  5]
  [ 8  7  9  9  2]
  [ 7  8  8  9  2]]

 [[ 6 10  9 10  5]
  [ 4  1  4  6  1]
  [ 3  4  9  2  4]
  [ 4  4  8  1  2]]

 [[10 10  1  5  8]
  [ 4  3  8  3  1]
  [ 1  5  6  6  7]
  [ 9  5  2  5 10]]]


### Slicing: real power of numpy

Use **colon** notation for all values in a dimension

Access and modify different ranges of data along different dimensions 

Make a 3x5 random array. Access 2nd column

In [31]:
np.random.seed(0)
rand_arr = np.random.random([3,5])
rand_arr

array([[0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ],
       [0.64589411, 0.43758721, 0.891773  , 0.96366276, 0.38344152],
       [0.79172504, 0.52889492, 0.56804456, 0.92559664, 0.07103606]])

In [32]:
rand_arr[:,1]

array([0.71518937, 0.43758721, 0.52889492])

Access 1st row

In [33]:
rand_arr[0]

array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ])

Access last 2 columns

In [34]:
rand_arr[:,-2:]

array([[0.54488318, 0.4236548 ],
       [0.96366276, 0.38344152],
       [0.92559664, 0.07103606]])

Access columns at indices 1-2 and in 1st row. Careful about off-by-one.

- Low range (before :) CONTAINS that index
- High range (after :) DOES NOT contain that index (the slice stops at i-1)

In [35]:
rand_arr[0,1:3]

array([0.71518937, 0.60276338])
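
The inclusive-low, exclusive-high rule is easier to verify on an array whose values equal their position. A minimal sketch, assuming a small integer grid built with `reshape` (not used elsewhere in this notebook):

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(grid[0, 1:3])    # columns 1 and 2 of row 0 -> [1 2]
print(grid[:, -2:])    # last 2 columns of every row
# [[ 2  3]
#  [ 6  7]
#  [10 11]]
print(grid[1:, :2])    # rows 1 onward, first 2 columns
# [[4 5]
#  [8 9]]
```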

Use slicing to assign values efficiently in batch without loops

In [36]:
# Assign 1st row to -1s
rand_arr[0] = -1
print(rand_arr)

[[-1.         -1.         -1.         -1.         -1.        ]
 [ 0.64589411  0.43758721  0.891773    0.96366276  0.38344152]
 [ 0.79172504  0.52889492  0.56804456  0.92559664  0.07103606]]


In [37]:
# Assign 2nd row to increasing ints
rand_arr[1] = np.arange(rand_arr.shape[1])
rand_arr

array([[-1.        , -1.        , -1.        , -1.        , -1.        ],
       [ 0.        ,  1.        ,  2.        ,  3.        ,  4.        ],
       [ 0.79172504,  0.52889492,  0.56804456,  0.92559664,  0.07103606]])

In [38]:
# Multiply the 3rd row by 5 and update the row
rand_arr[2] = 5*rand_arr[2]
rand_arr

array([[-1.        , -1.        , -1.        , -1.        , -1.        ],
       [ 0.        ,  1.        ,  2.        ,  3.        ,  4.        ],
       [ 3.95862519,  2.6444746 ,  2.84022281,  4.62798319,  0.35518029]])

### What if we want to access a set of rows or columns that are not adjacent?

Can't use colon notation. Instead use `np.ix_`

In [39]:
np.random.seed(0)
arr = np.random.randint(low = 0, high = 11, size = (4,5))
arr

array([[ 5,  0,  3,  3,  7],
       [ 9,  3,  5,  2,  4],
       [ 7,  6,  8,  8, 10],
       [ 1,  6,  7,  7,  8]])

In [40]:
arr[np.ix_([i for i in range(arr.shape[0])], [0,2,4])]

array([[ 5,  3,  7],
       [ 9,  5,  4],
       [ 7,  8, 10],
       [ 1,  7,  8]])

Example: Say we want column indices 0, 2, 4 and all rows.

**Syntax for `np.ix_`:**
- `np.ix_` goes inside the square brackets: `arr[np.ix_(blah)]`
- Give it `M` arguments (e.g. 2 for a 2D matrix).
- Each argument is a Python list (or ndarray) of indices to take along that dimension.
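
The syntax rules above can be sketched on a small array whose values make the selection easy to check. The sketch also shows what happens when two index lists are passed without `np.ix_`: numpy pairs them up element by element instead of taking every row/column combination. Variable names are illustrative.

```python
import numpy as np

grid = np.arange(20).reshape(4, 5)

# Rows 0 and 3 crossed with columns 0, 2, 4 -> a 2x3 block
sub = grid[np.ix_([0, 3], [0, 2, 4])]
print(sub)
# [[ 0  2  4]
#  [15 17 19]]

# Without np.ix_, two lists are paired up: elements (0, 0) and (3, 4)
pairs = grid[[0, 3], [0, 4]]
print(pairs)   # [ 0 19]

# When all rows are wanted, a colon plus one index list gives the same
# result as passing every row index to np.ix_
cols = grid[:, [0, 2, 4]]
same = np.array_equal(cols, grid[np.ix_(np.arange(4), [0, 2, 4])])
print(same)    # True
```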

## Memory

- Numpy tries to be efficient with arrays, so assignment (`b = a`) does not copy the data: both names refer to the same array. To get an independent (deep) copy, you need to use the `.copy()` method

In [41]:
a = np.linspace(-1, 1, 5)
print(a)

[-1.  -0.5  0.   0.5  1. ]


In [42]:
b = a
b[0] = 99
print(b)

[99.  -0.5  0.   0.5  1. ]


In [43]:
# changed a!
a

array([99. , -0.5,  0. ,  0.5,  1. ])

In [44]:
# fixed with .copy()
b = a.copy()
b[0] = 100

print(a)
print(b)

[99.  -0.5  0.   0.5  1. ]
[100.   -0.5   0.    0.5   1. ]
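
The same sharing applies to slices, which return views onto the original data rather than copies. A minimal sketch using `np.shares_memory` (not used in the original notebook) to check which arrays share data; the variable names are illustrative:

```python
import numpy as np

a = np.linspace(-1, 1, 5)
alias = a                 # same array object, no data copied
view = a[1:4]             # a slice is a view onto the same data
independent = a.copy()    # separate data

print(alias is a)                          # True
print(np.shares_memory(a, view))           # True
print(np.shares_memory(a, independent))    # False

view[0] = 42              # writes through to a[1]
print(a[1])               # 42.0
print(independent[1])     # -0.5
```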


## Apply functions over dimensions (`axes`)

- Axes are the numpy term for different ndarray dimensions. 
- *Idea*: Do we want to apply an operation (e.g. sum) on the rows OR columns of a ndarray?
- *Example*: axis 0 is the rows, axis 1 is the columns, etc.
- We can apply functions over one or more axes super efficiently in one line of code! This is called **Vectorization**, and it is MUCH MUCH faster than loops (stay tuned).

In [45]:
one = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])
one

array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])

Sum along rows -> "collapse" across rows to get sum within each column — 3 numbers

In [46]:
np.sum(one, axis = 0) #0 stands for the rows, 1 stands for the cols

array([10, 10, 10])

Sum along columns -> "collapse" across columns to get sum within each row — 4 numbers

In [47]:
np.sum(one, 1)

array([ 3,  6,  9, 12])

**Careful:** Applying a function without specifying the axis may compute across the ENTIRE ndarray.

**Mnemonic trick:** Applying a function along an axis eliminates that dimension from the shape. Left with remaining dimensions.
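
The mnemonic can be checked on a 3D array, where each choice of axis removes a different entry from the shape. A short sketch (the array `cube` is illustrative):

```python
import numpy as np

cube = np.ones([3, 4, 5])

print(np.sum(cube, axis=0).shape)   # (4, 5)  axis 0 is gone
print(np.sum(cube, axis=1).shape)   # (3, 5)  axis 1 is gone
print(np.sum(cube, axis=2).shape)   # (3, 4)  axis 2 is gone

# No axis given: np.sum collapses everything to a single number
print(np.sum(cube))                 # 60.0
```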

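In [48]:
print(one.shape)
the_mean = np.mean(one, axis=0)  # collapse axis 0 (rows): one mean per column
print(f'Mean across axis 0: {the_mean.shape}')

print(one.shape)
the_mean = np.mean(one, axis=1)  # collapse axis 1 (columns): one mean per row
print(f'Mean across axis 1: {the_mean.shape}')

(4, 3)
Mean across axis 0: (3,)
(4, 3)
Mean across axis 1: (4,)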

## Broadcasting (basics)

**This is the most useful numpy feature thus far! This will become your bread-and-butter!**

We will cover the basics now and revisit broadcasting in more detail in a few weeks.

### Broadcasting scalars

As we saw, we can create an array of any size with any constant value WITHOUT ANY LOOPS. This is the simplest example of numpy **broadcasting** the scalar across the ndarray.

In [None]:
# Example with basic arithmetic
myArr = 5*np.ones([5,5])
myArr
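
Scalar broadcasting works with any arithmetic operator, not only multiplication. The sketch below also compares the `5*np.ones(...)` idiom with `np.full`, a numpy function not used in the original notebook that builds a constant array directly. Variable names are illustrative.

```python
import numpy as np

by_broadcast = 5 * np.ones([5, 5])
by_full = np.full([5, 5], 5.0)
print(np.array_equal(by_broadcast, by_full))   # True

# The scalar is applied to every element
shifted = by_broadcast - 2
print(shifted[0, 0], shifted.shape)            # 3.0 (5, 5)
```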

### Applying an operation to corresponding values in two arrays that have the same shape

Broadcasting allows you to efficiently (*and in one line of code*) add, subtract, multiply, and perform other operations on corresponding values in two arrays.

#### Examples: subtracting arrays with several different shapes

In [None]:
# 1D arrays
arr1 = np.arange(10)
arr2 = 5*np.ones(10)
print(f'arr1: {arr1}')
print(f'arr2: {arr2}')
print(f'Shape of arr1: {arr1.shape}')
print(f'Shape of arr2: {arr2.shape}')

arr3 = arr1 - arr2
arr3

In [50]:
# 2D arrays
arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2 = np.array([[1, 1, 1], [0, 0, 0], [2, 2, 2]])
print(f'arr1:\n{arr1}')
print(f'arr2:\n{arr2}')
print(f'Shape of arr1: {arr1.shape}')
print(f'Shape of arr2: {arr2.shape}')

arr3 = arr1 - arr2
arr3

arr1:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
arr2:
[[1 1 1]
 [0 0 0]
 [2 2 2]]
Shape of arr1: (3, 3)
Shape of arr2: (3, 3)


array([[0, 1, 2],
       [4, 5, 6],
       [5, 6, 7]])

### Singleton dimensions

Sometimes you have a 1D array with a shape like `(blah,)` and need to subtract it from another array shaped like a **column vector** of a matrix, `(blah, 1)`. Because the two arrays hold the same number of values, you might expect broadcasting to pair up corresponding values. BUT broadcasting does not work that way by default: combining shapes `(3,)` and `(3, 1)` produces a `(3, 3)` result. The cell below adds the missing dimension to `arr1` with `np.newaxis` so that the subtraction is element-wise.

In [57]:
arr1 = np.arange(3)
arr1 = arr1[:,np.newaxis] # add a singleton dimension: (3,) -> (3, 1); arr1[:,np.newaxis,np.newaxis] would add two
arr2 = np.array([[1], [2], [3]])
print(f'arr1:\n{arr1}')
print(f'arr2:\n{arr2}')
print(f'arr1 shape: {arr1.shape}')
print(f'arr2 shape: {arr2.shape}')
print('Trying to broadcast arr1 - arr2...')
arr3 = arr1 - arr2
print(arr3)
arr3.shape


arr1:
[[0]
 [1]
 [2]]
arr2:
[[1]
 [2]
 [3]]
arr1 shape: (3, 1)
arr2 shape: (3, 1)
Trying to broadcast arr1 - arr2...
[[-1]
 [-1]
 [-1]]


(3, 1)
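
For contrast, here is a sketch of what happens when the singleton dimension is left out. The names `flat` and `col` are illustrative stand-ins for a 1D array and a column vector.

```python
import numpy as np

flat = np.arange(3)                  # shape (3,)
col = np.array([[1], [2], [3]])      # shape (3, 1)

# Without a singleton dimension, numpy broadcasts to every pairing: 3x3
print((flat - col).shape)            # (3, 3)

# With it, the subtraction is element-wise
diff = flat[:, np.newaxis] - col
print(diff.shape)                    # (3, 1)
print(diff[:, 0])                    # [-1 -1 -1]
```

No error is raised in the first case, so checking the shape of the result is the most reliable way to notice the problem.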

To get broadcasting to subtract off corresponding values in the arrays, we need to add a **singleton dimension** (an extra dimension of size 1) to `arr1` so that its shape matches the other array and numpy also interprets `arr1` as a column vector.

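In [None]:
arr1 = np.arange(3)
arr1 = arr1[:, np.newaxis]  # add a singleton dimension: (3,) -> (3, 1)
print(f'arr1:\n{arr1}')
print(f'arr1 shape: {arr1.shape}')
print(f'arr2 shape: {arr2.shape}')
print('Trying to broadcast arr1 - arr2...')
arr3 = arr1 - arr2
print(arr3)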


### Squeeze: How to get rid of all singleton dimensions

"Undo" a new axis / singleton dimension

In [58]:
print(arr1.shape)

(3, 1)


In [60]:
arr1 = np.squeeze(arr1)
print(f'arr1:\n{arr1}')
print(f'arr1 shape: {arr1.shape}')

arr1:
[0 1 2]
arr1 shape: (3,)


Removes ALL singleton dimensions (if you have more than one):

In [None]:
ex = np.zeros([1, 2, 1, 3, 1, 4, 5, 1])
print(ex.shape)
print(np.squeeze(ex).shape)  # only the non-singleton dimensions remain

## Vectorization speed vs loops

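Time how long it takes to sum an ndarray with a loop vs. a vectorized call. Note that both timed functions below also build the array, so each measurement includes that shared setup cost.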

In [62]:
def timeit(fun):
    '''Just a function to time the runtime of another function'''
    def timer():
        start = time.time()
        fun()
        end = time.time()
        print(f'Took {end - start:.3} secs to run.')
    return timer


@timeit
def sumLoop():
    '''Use for loop to sum a row vector'''
    longRow = np.array([i for i in range(1, 1000000)])
    theSum = 0
    for i in range(len(longRow)):
        theSum += longRow[i]


@timeit
def sumVectorized():
    '''Vectorized version of summing a row vector'''
    longRow = np.array([i for i in range(1, 1000000)])
    theSum = np.sum(longRow)

In [63]:
# Dynamic typing in python makes for loops with lots of small
# operations slow
print('sumLoop:')
sumLoop()

# Vectorization lets Numpy skip per-element type checks at runtime
# and use efficient pre-compiled functions to batch-process
# the computation over the matrix
print('sumVectorized:')
sumVectorized()

sumLoop:
Took 0.224 secs to run.
sumVectorized:
Took 0.0874 secs to run.
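
Because both functions above build the array inside the timed region, the gap between loop and vectorized summing looks smaller than it is. The following sketch times only the summation. It uses `time.perf_counter` and `np.arange` with an explicit `dtype`, which are choices made here rather than part of the original notebook; the measured times will vary by machine.

```python
import time
import numpy as np

# Build the data once, outside the timed regions
long_row = np.arange(1, 1000000, dtype=np.int64)

# Time the Python loop
start = time.perf_counter()
loop_sum = 0
for value in long_row:
    loop_sum += value
loop_secs = time.perf_counter() - start

# Time the vectorized sum
start = time.perf_counter()
vec_sum = np.sum(long_row)
vec_secs = time.perf_counter() - start

print(f'loop: {loop_secs:.3} secs, vectorized: {vec_secs:.3} secs')
print(loop_sum == vec_sum)   # True: both approaches give the same total
```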


## Combining multiple ndarrays

**Problem:**
- You have two ndarrays and want to concatenate them
- You have an ndarray and **want to append a column or row vector**

### Add/append a new column — "stack horizontally"

**Mnemonic**: Columns go horizontally.

Have `a`:

    [[1, 2]
     [3, 4]]
and `b`

    [[9]
     [9]]
    
want to make:

    [[1, 2, 9]
     [3, 4, 9]]
    
i.e. stack horizontally. Could be two matrices (not just a matrix and a vector).

**Caveat:** We need to make sure shapes are compatible for stacking (the arrays must have the same number of rows):

- Result shape = `(2, 3)`
- We are starting with `a` shape: `(2, 2)`

The shape of `b` needs to be `(2, 1)` (why?)

In [64]:
a = np.array([[1,2],[3,4]])
b = 9*np.ones([2,1])
print(a)
print(b)

[[1 2]
 [3 4]]
[[9.]
 [9.]]


In [65]:
c = np.hstack([a,b]) # arrays to stack need to be in a list (or tuple)
print(c)
c.shape

[[1. 2. 9.]
 [3. 4. 9.]]


(2, 3)
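
The other case from the problem statement, appending a row, follows the same pattern with `np.vstack` ("stack vertically"), a numpy function not shown in the original notebook. The sketch below also shows how to append a 1D vector as a column by first giving it a singleton dimension. Variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# Append a row: the new row needs shape (1, 2)
row = np.array([[7, 8]])
stacked_rows = np.vstack([a, row])
print(stacked_rows.shape)    # (3, 2)

# A 1D vector of shape (2,) needs a singleton dimension before hstack
col_1d = np.array([9, 9])
stacked_cols = np.hstack([a, col_1d[:, np.newaxis]])
print(stacked_cols)
# [[1 2 9]
#  [3 4 9]]
```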

## Switching around the axes of an ndarray and matrix multiplication in numpy

We can't matrix multiply the following ndarrays due to shape issues:

In [68]:
a = 3*np.ones([3, 4])
b = 2*np.ones([3, 4])

Need to pair up as

    (3, 4) x (4, 3)

OR

    (4, 3) x (3, 4)
    
Use the transpose to help out!

In [69]:
a @ b.T #np.transpose(b)

array([[24., 24., 24.],
       [24., 24., 24.],
       [24., 24., 24.]])

In [70]:
a * b

array([[6., 6., 6., 6.],
       [6., 6., 6., 6.],
       [6., 6., 6., 6.]])
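
Both valid pairings mentioned above can be checked directly, along with the error raised when the shapes do not line up. A minimal sketch reusing the same `a` and `b`; the `failed` flag is an illustrative addition:

```python
import numpy as np

a = 3 * np.ones([3, 4])
b = 2 * np.ones([3, 4])

print((a @ b.T).shape)    # (3, 3)  from (3, 4) x (4, 3)
print((a.T @ b).shape)    # (4, 4)  from (4, 3) x (3, 4)
print((a.T @ b)[0, 0])    # 18.0    sum of three products 3*2

# Without a transpose the inner dimensions (4 and 3) do not match
failed = False
try:
    a @ b
except ValueError as err:
    failed = True
    print('Shape mismatch:', err)
```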

Note: Transposing a 1D ndarray vector has no effect unless you first add a singleton dimension

In [72]:
a = np.ones(10)
a.shape

(10,)

In [73]:
a.T.shape

(10,)

In [74]:
a = a[:, np.newaxis]
print(a.shape)
print(a.T.shape)

(10, 1)
(1, 10)


### Matrix vs. element-wise multiplication

- Star (`*`) operator means element-wise multiplication