In [20]:
%autosave 0

Autosave disabled


*This notebook is part of the course materials for CS 345: Machine Learning Foundations and Practice at Colorado State University.
Original versions were created by Asa Ben-Hur.
The content is available [on GitHub](https://github.com/asabenhur/CS345).*

*The text is released under the [CC BY-SA license](https://creativecommons.org/licenses/by-sa/4.0/), and code is released under the [MIT license](https://opensource.org/licenses/MIT).*

<img style="padding: 10px; float:left;" alt="CC-BY-SA icon.svg in public domain" src="https://upload.wikimedia.org/wikipedia/commons/d/d0/CC-BY-SA_icon.svg" width="125">

# Numpy

Numpy (Numerical Python) is Python's library for numerical data, and provides a wealth of functionality for working with array data.

Numpy features include:
  * A fast and efficient multidimensional array object, ``ndarray``
  * Functions for performing computations on arrays
  * Tools for reading and writing array-based datasets to disk
  * Linear algebra operations, and random number generation

Our first step is to **import** the package (note the "import as" shortcut):

In [1]:
import numpy as np

**Python note:**  Instead of the above import, we could have written ``from numpy import *``, which would shorten every statement by removing the ``np.`` prefix from each Numpy command.  That is not a good idea, because names in the Numpy namespace conflict with built-in Python functions such as ``min`` and ``max``.
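
To make the conflict concrete, here is a small sketch using ``sum``, one more name that exists both as a Python built-in and in the Numpy namespace. The second positional argument means different things in the two functions (a start value versus an axis), so a star import could silently change the result of existing code. The variable names are chosen only for illustration.

```python
import builtins
import numpy as np

values = [0, 1, 2, 3, 4]

# Built-in sum: the second argument is the starting value, so this is -1 + 10
builtin_result = builtins.sum(values, -1)

# np.sum: the second argument is the axis, so this sums along the last axis
numpy_result = np.sum(values, -1)

print(builtin_result, numpy_result)
```

With ``import numpy as np`` both functions stay available under distinct names, so it is always clear which one is being called.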

Arrays are the standard data containers in Numpy, and can have any number of dimensions.

Let's create a one dimensional array:

In [2]:
my_array = np.array([1, 2, 3])
my_array

array([1, 2, 3])

What have we gained over using a Python list?

In [3]:
my_list = [1, 2, 3]

In fact, Numpy arrays are less flexible than Python lists:

In [None]:
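my_list[0] = 'a'         # fine: a list can hold objects of any type
try:
    my_array[0] = 'a'    # fails: this array can only hold integers
except ValueError as err:
    print("ValueError:", err)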

The reduced flexibility of Numpy arrays comes with improved efficiency in terms of both storage (why?) and execution, along with a wealth of functionality for fast manipulation of numeric data.
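
As a hint for the storage question, the following sketch compares the memory used by an array and by a list holding the same integers. The list figure is only a rough estimate (it adds the size of the list object to the sizes of the integer objects it refers to) and depends on the Python implementation; the variable names are illustrative.

```python
import sys
import numpy as np

n = 1000
int_array = np.arange(n, dtype=np.int64)
int_list = list(range(n))

# Every element of the array occupies the same fixed number of bytes
print(int_array.itemsize)   # bytes per element
print(int_array.nbytes)     # itemsize * number of elements

# A list stores references; each integer is a separate Python object
list_bytes = sys.getsizeof(int_list) + sum(sys.getsizeof(x) for x in int_list)
print(list_bytes)
```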

Furthermore, using Numpy's C API, libraries written in C or Fortran can operate on the data stored in a Numpy array without needing to copy the data.

Let's demonstrate the speed advantage of Numpy arrays:

In [4]:
import numpy as np
my_array = np.arange(1000000)
my_list = list(range(1000000))
# Note: why not simply do my_list = range(1000000)?

In [5]:
%timeit my_array2 = my_array * 2

2.66 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [6]:
%timeit my_list2 = [x * 2 for x in my_list]

90.7 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In this run the Numpy version is roughly 34 times faster than the list comprehension, which demonstrates the advantage of Numpy arrays over Python lists for numeric work.

**Note about timing Python code**
``%time`` is another magic command that measures the execution time of a code snippet.  ``%timeit`` results are generally more accurate, because ``%timeit`` runs the code many times and does some clever things under the hood to prevent system calls from interfering with the timing. For example, it prevents the cleanup of unused Python objects (known as garbage collection), which might otherwise affect the timing. For this reason, ``%timeit`` usually reports noticeably shorter times than ``%time`` (see [this article](https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html) for more information about profiling Python code).
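
Outside of a notebook, similar measurements can be made with the standard library ``timeit`` module, which also turns off garbage collection while timing by default. The sketch below is one possible way to use it; the array size and the repeat counts are arbitrary choices made for illustration.

```python
import timeit
import numpy as np

my_array = np.arange(100_000)

# Run the statement 100 times per measurement, and repeat the measurement 5 times
timings = timeit.repeat(lambda: my_array * 2, repeat=5, number=100)

# Report the best time per call, in seconds
best_per_call = min(timings) / 100
print(f"best of 5: {best_per_call:.2e} s per call")
```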

### Numpy ndarrays

Numpy ndarrays enable you to perform mathematical operations on entire arrays in a single operation, without writing explicit ``for`` loops.  This is called *vectorization*, and it is key to writing efficient machine learning code.

For example:

In [None]:
data = np.array( [ [1,2,3], [4,5,6] ])
data

In [None]:
data * 10    # multiply array by a scalar
data + data  # add arrays

You can also perform Boolean operations on arrays:

In [None]:
a1 = np.array([[1., 2., 3.], [4., 5., 6.]])
a2 = np.array([[0., 4., 1.], [7., 2., 12.]])
a2 > a1

Every array has a shape, which is a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:


In [None]:
data.shape
data.dtype

While it's clear what Numpy does when asked to add a scalar to an array or to add two arrays of the same size, check what happens when you add two arrays of unequal size, e.g. a one dimensional array to a two dimensional array.

In [None]:
# define two arrays, one which is two dimensional with a shape (2,3), 
# and another which is one dimensional.
# what will its size need to be for the operation to work?
# what is Numpy doing in this case?

Note that Numpy inferred the type from the data that we provided.  You can check what happens if there are floats in the input to the array constructor.  The `dtype` attribute will tell you what kind of array got created.

In [8]:
arr1 = np.array([6, 7.5, 8, 0, 1])
print(arr1,arr1.dtype)
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr2,arr2.dtype)

[6.  7.5 8.  0.  1. ] float64
[[1 2 3 4]
 [5 6 7 8]] int64


In [9]:
print(arr1.shape, arr2.shape)


(5,) (2, 4)


Contrast this with the Python ``len`` builtin:

In [10]:
len(arr1), len(arr2)

(5, 2)
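
The relationship between ``len`` and the array attributes can be summarized in a short sketch: ``len`` reports only the size of the first dimension, while ``shape``, ``size`` and ``ndim`` describe the whole array.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(len(arr2))       # length of the first dimension only
print(arr2.shape[0])   # the same number, read from the shape
print(arr2.size)       # total number of elements
print(arr2.ndim)       # number of dimensions
```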

### Data types

Each array has a dtype associated with it, which is the type used to store the elements of the array.
The numerical dtypes are named as follows: a type name, like float or int, followed by a number indicating the number of bits per element. A standard double-precision floating-point value (what’s used under the hood in Python’s float object) takes up 8 bytes or 64 bits. Thus, this type is known in Numpy as ``float64``.
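
The naming scheme can be checked directly: the ``itemsize`` attribute of a dtype gives the number of bytes per element. The sketch below also uses ``astype``, which is not discussed above, to show one way of converting an existing array to another dtype.

```python
import numpy as np

for name in ["int8", "int32", "float32", "float64"]:
    dt = np.dtype(name)
    # itemsize is in bytes, so multiply by 8 to get bits
    print(name, dt.itemsize * 8)

# astype creates a new array with the requested dtype;
# converting floats to integers discards the fractional part
floats = np.array([1.7, 2.2, 3.9])
ints = floats.astype(np.int32)
print(ints, ints.dtype)
```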



In [None]:
a1 = np.array([1, 2, 3], dtype=np.float64)
a2 = np.array([1, 2, 3], dtype=np.int32)
a1.dtype
a2.dtype

What would be the resulting data type for an array that contains strings and numbers?

In [None]:
# create an array that contains strings and numbers and check its data type

Some functions for creating arrays:

In [None]:
a = np.zeros((2,2))      # Create an array of zeros
print(a)

b = np.ones((2,2))       # Create an array of ones
print(b)

d = np.eye(2)            # Create a 2x2 identity matrix
print(d)

e = np.random.random((3,3))
print(e)  

f = np.arange(2, 3, 0.1)
print(f)

g = np.linspace(1., 4., 6)
print(g)

### Sidenote - getting **help** on Python objects:

To get help on an object, e.g. the Numpy **linspace** function, you can do one of the following:

```python
?np.linspace
```

or

```python
help(np.linspace)
```

In [None]:
help(np.linspace)

### Array indexing and slicing

We'll begin with one dimensional arrays:

In [None]:
a = np.array([2,3,4])
print(a[0], a[1], a[2])
a[0] = 5                  # Change an element of the array
a

Can you explain what's happening in the following piece of code?

In [None]:
b = a
b[0] = 1234
a
b

In [None]:
a = np.arange(10)
a
# indexing a single element
a[5]
# a slice:
a[5:8]
a[5:8] = 12
a

Let's see how slices behave in Numpy:

In [None]:
a_slice = a[5:8]
a_slice

In [None]:
a_slice[1] = 12345
a

Numpy has been designed to be able to work with very large arrays, so eagerly copying data could cause severe performance and memory problems.

If you want a copy of a slice instead of a view, you will need to explicitly copy it using e.g. ``arr[5:8].copy()``.
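
A minimal sketch of the difference between a view and a copy follows. It uses ``np.shares_memory`` to check whether two arrays share the same underlying data; the array and variable names are illustrative.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]           # a view: shares memory with arr
copy = arr[5:8].copy()    # an independent copy

view[0] = -1              # also changes arr[5]
copy[1] = -2              # leaves arr untouched

print(arr)
print(np.shares_memory(arr, view), np.shares_memory(arr, copy))
```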


#### Two dimensional arrays

Let's look at two dimensional arrays.

In [11]:
b = np.array([[1,2,3],[4,5,6]])
b,b.shape

print(b[0, 0], b[0, 1], b[1, 0])
print(b[0][0], b[0][1], b[1][0])


1 2 4
1 2 4


The latter form of indexing works, because each row of a two dimensional array is an array as well:

In [12]:
a = b[1]
a

array([4, 5, 6])

## Array indexing


In [13]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
a[0][1]

2

To access the first row in the data matrix:

In [14]:
row = a[0]    # the first row of a
row, row.shape

(array([1, 2, 3, 4]), (4,))

To access a column of the matrix:

In [15]:
col = a[:, 0]
col, col.shape

(array([1, 5, 9]), (3,))

We can perform slicing on multiple dimensions, creating a submatrix:

In [16]:
submatrix = a[1:3, 1:4]
submatrix, submatrix.shape

(array([[ 6,  7,  8],
        [10, 11, 12]]),
 (2, 3))

### Advanced indexing

You can index an array using an integer array:


In [17]:
a[ [0, 2] ]   # extract a given set of rows

array([[ 1,  2,  3,  4],
       [ 9, 10, 11, 12]])

In [18]:
a[:, [0,2]]  # extract a given set of columns

array([[ 1,  3],
       [ 5,  7],
       [ 9, 11]])

**Question:** does indexing using an array create a copy of the array or simply a view, as in the case of slicing?

In [None]:
## write some code to answer this question

#### Exercises

Describe the effect of each of the following slices:

In [None]:
a2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

In [None]:
a2d[:2]

In [None]:
a2d[:2, 1:]

In [None]:
a2d[1, :2]

In [None]:
a2d[:2, 2]

In [None]:
a2d[:, :1]

#### Boolean indexing

In [None]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4)
(names,data)

In [None]:
names == 'Bob'

In [None]:
data[names == 'Bob']

In [None]:
cond = names == 'Bob'
data[~cond]

In [None]:
cond = (names == 'Bob') | (names == 'Will')
cond
data[cond]

In [None]:
data < 0

**Note:** The Python keywords ``and`` and ``or`` do not work with boolean arrays. Use ``&`` (and) and ``|`` (or) instead.
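
The following sketch shows what goes wrong with ``or`` and how the element-wise operator behaves instead. It uses a shortened version of the ``names`` array so that it can be run on its own.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])

try:
    # `or` asks for a single truth value for each whole array, which is ambiguous
    mask = (names == 'Bob') or (names == 'Will')
except ValueError as err:
    print("ValueError:", err)

# The element-wise operator works; the parentheses are needed because
# | binds more tightly than ==
mask = (names == 'Bob') | (names == 'Will')
print(mask)
```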


In [None]:
data[data < 0] = 0
data

In [None]:
data[names != 'Joe'] = 7
data

#### Reshaping arrays

What is the effect of the following operation?

In [None]:
a = np.arange(15).reshape((3, 5))
(a,a.shape)

Here's where it becomes interesting...

In [None]:
a = np.arange(15).reshape((-1, 5))
a

This is a neat trick:  ``-1`` here means "as many rows as needed", i.e. Numpy infers the size of that dimension from the total number of elements.
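
A few more uses of ``-1`` are sketched below. Only one dimension can be left for Numpy to infer, and the total number of elements must be divisible by the product of the other dimensions.

```python
import numpy as np

a = np.arange(15)

print(a.reshape((-1, 5)).shape)   # rows inferred
print(a.reshape((3, -1)).shape)   # columns inferred
print(a.reshape(-1).shape)        # flatten to one dimension

try:
    a.reshape((-1, 4))            # 15 is not divisible by 4
except ValueError as err:
    print("ValueError:", err)
```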

In [None]:
a.T
a.T.shape

This is called the "transpose" of a matrix.

In [None]:
np.transpose(a)

## Universal Functions: Fast Element-Wise Array Functions

A universal function, or *ufunc*, is a function that performs element-wise operations on data in an ndarray. 


In [None]:
a = np.arange(10)
a
np.sqrt(a)
np.exp(a)

In [None]:
x = np.random.randn(8)
y = np.random.randn(8)
x
y
np.maximum(x, y)

A complete list of ufuncs is available in the [Numpy documentation](https://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs).

### Stacking arrays

In [None]:
x = np.array( [1,2,3,4] )
y = np.array( [5,6,7,8] )
np.vstack([x,y])
np.hstack([x,y])

### Avoid loops when you can

Consider the following piece of code:

In [None]:
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = np.empty_like(x)   # Create an empty matrix with the same shape as x

# Add the vector v to each row of the matrix x with an explicit loop
for i in range(4):
    y[i] = x[i] + v

y

As we know, loops are slow in Python.  There is a much more efficient way of doing this:

In [None]:
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v
y

This is called **broadcasting**.
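
Broadcasting compares the shapes of the two arrays starting from the last dimension; two dimensions are compatible when they are equal or when one of them is 1. The sketch below illustrates this with the same ``x`` and ``v``, plus a hypothetical vector ``w`` (not used elsewhere in this notebook) that has one entry per row.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])   # shape (4, 3)
v = np.array([1, 0, 1])                                          # shape (3,)
w = np.array([10, 20, 30, 40])                                   # shape (4,)

# Shapes are compared from the last dimension backwards:
# (4, 3) and (3,) are compatible, so v is added to every row
row_result = x + v

# (4, 3) and (4,) are not compatible; adding an axis gives w the shape (4, 1),
# which is stretched across the columns, so w[i] is added to row i
col_result = x + w[:, np.newaxis]

print(row_result.shape, col_result.shape)
print(col_result)
```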

In [None]:
points = np.arange(-5, 5, 0.05)
xs, ys = np.meshgrid(points, points)
ys

In [None]:
z = np.sqrt(xs ** 2 + ys ** 2)

In [None]:
xs.shape

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.imshow(z, cmap=plt.cm.gray)
plt.colorbar()
plt.title(r"plot of $\sqrt{x^2 + y^2}$ for a grid of values")

### Expressing Conditional Logic as Array Operations


In [None]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

In [None]:
list(zip([1,2,3],['a','b','c']))

In [None]:
result = [(x if c else y)
          for x, y, c in zip(xarr, yarr, cond)]
result

In [None]:
result = np.where(cond, xarr, yarr)
result

In [None]:
arr = np.random.randn(4, 4)
arr
arr > 0
np.where(arr > 0, 1, -1)

### Mathematical and Statistical Methods

In [None]:
a = np.random.randn(5, 4)
a
a.mean()
np.mean(a)
a.sum()

In [None]:
a.mean(axis=1)
a.sum(axis=0)

In [None]:
a = np.array([0, 1, 2, 3, 4, 5, 6, 7])
a.cumsum()

### Example:  Random Walks

A random walk in one dimension is a random process where at each step the walker goes a unit step either to the left or to the right.  Random walks have interesting statistical properties that can be investigated by simulating them.

In [None]:
import random
def random_walk(n):
    """Return a list of positions in a random walk"""
    position = 0
    walk = [position]
    for i in range(n):
        position += 2*random.randint(0, 1)-1
        walk.append(position)
    return walk

walk = random_walk(1000)

In [None]:
%timeit random_walk(1000)

Let's create a more efficient Numpy version:

In [None]:
def random_walk_vectorized(n):
    steps = np.random.choice([-1,+1], n)
    return np.cumsum(steps)

walk = random_walk_vectorized(1000)

In [None]:
%timeit random_walk_vectorized(1000)

In [None]:
num_steps = 10000
distance = random_walk_vectorized(num_steps)
t = np.arange(num_steps)
plt.plot(t, distance, 'g.')


In [None]:
?np.cumsum

To investigate statistical properties of random walks we'll need to generate many random walks.  To make this efficient, we'll do it all at once:

In [None]:
nwalks = 5000
nsteps = 1000
draws = np.random.randint(0, 2, size=(nwalks, nsteps)) # 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(axis=1)
sq_distance = walks**2
mean_sq_distance = np.mean(sq_distance, axis=0)


In [None]:
steps.shape, walks.shape

In [None]:
t = np.arange(1, nsteps + 1)   # column i of walks holds the position after i + 1 steps
plt.plot(t, np.sqrt(mean_sq_distance), 'g.',t, np.sqrt(t), 'y-')
plt.xlabel("$t$")
plt.ylabel("square root of mean square distance")
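
The reference curve $\sqrt{t}$ in this plot can be motivated by a short derivation. It assumes that the steps $s_i$ are independent and equal to $+1$ or $-1$ with equal probability; the symbols $X_t$ and $s_i$ are notation introduced here for the position after $t$ steps and the $i$-th step.

$$
X_t = \sum_{i=1}^{t} s_i, \qquad
\mathbb{E}[X_t^2] = \sum_{i=1}^{t} \mathbb{E}[s_i^2] + \sum_{i \neq j} \mathbb{E}[s_i]\,\mathbb{E}[s_j] = t + 0 = t
$$

Since $s_i^2 = 1$ and $\mathbb{E}[s_i] = 0$, the mean square distance after $t$ steps is $t$, so its square root is expected to follow $\sqrt{t}$.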

### Numpy documentation

These are some of the basics that are relevant to our course.  You can find more details in the [Numpy user manual](https://docs.scipy.org/doc/numpy/user/) and the detailed [reference guide](https://docs.scipy.org/doc/numpy/reference).

### Exercises

* Declare an 8x8 matrix and fill it with a checkerboard pattern of zeros and ones.


* Create a 10x10 array with random values and find the minimum and maximum values in the entire array, and in each column / row.



* Create a matrix with arbitrary floating point numbers, and then normalize its values to be (i) between 0 and 1, and (ii) between -1 and 1.

* Create a random vector of size 100, where each element is between 0 and 5, and replace the maximum value by 0.