<a href="https://colab.research.google.com/github/albertomanfreda/intensive_school_ml/blob/master/lessonNumPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NumPy

NumPy is a powerful library for mathematical operations, including tensor manipulation (which is crucial for Neural Networks).

It is the *de facto* standard mathematical library for operating with arrays and matrices in Python. Many other useful libraries are built on top of NumPy, or are able to operate on NumPy arrays.

The basic data structure in NumPy is the **ndarray**, a multi-dimensional array (or tensor).
Each ndarray has a number of dimensions (or axes), **ndim**, and a **shape**, which is a tuple storing the length along each axis. The total number of elements is the **size** of the ndarray, which is the product of the lengths of all the axes. Finally, **len()** returns the length of the first axis, so for a one-dimensional ndarray len() == size.
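
As a quick check of these definitions, the sketch below builds a small array with 2 rows and 3 columns and compares ndim, shape, size and len(). The names `grid` and `flat` are only illustrative.

```python
import numpy as np

# Illustrative array with 2 rows and 3 columns
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.ndim)   # number of axes: 2
print(grid.shape)  # length along each axis: (2, 3)
print(grid.size)   # total number of elements: 2 * 3 = 6
print(len(grid))   # length of the first axis only: 2

# For a 1d array, len() and size coincide
flat = np.arange(4)
print(len(flat) == flat.size)  # True
```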

Unlike a list or a tuple, which can store items of any type, the elements of a NumPy ndarray are homogeneous, with a type decided upon array creation.
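
To make the homogeneity concrete, here is a minimal sketch (variable names are illustrative) showing that NumPy promotes a list of mixed integers and floats to a single common type, and that the type can be changed afterwards with astype().

```python
import numpy as np

# A Python list can mix integers and floats freely
mixed_list = [1, 2.5, 3]

# Converting it to an ndarray promotes every element to a common type
promoted = np.array(mixed_list)
print(promoted, promoted.dtype)  # the dtype is float64

# astype() returns a new array of the requested type;
# converting floats to integers discards the fractional part
truncated = promoted.astype(np.int64)
print(truncated, truncated.dtype)  # the dtype is int64
```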

The standard way to create a NumPy ndarray is the *array* function, which accepts as input an iterable (like a list or a tuple) of elements.

In [None]:
import numpy as np # This alias is pretty much standard 

# Let's define a function to inspect the properties of an array
def print_array_properties(arr):
    """ Print the type, number of axes, shape and total number of elements of
    a NumPy ndarray. """
    msg = '{}d array of type={}, shape={} ({} elements in total).\n'
    print(msg.format(arr.ndim, arr.dtype, arr.shape, arr.size))

# A one-dimensional array (or vector) of floating-point numbers.
# The dtype parameter is optional: NumPy will usually infer it by itself.
# Note: the old np.float alias was removed in NumPy 1.24; use np.float64.
v = np.array([0., 1., 2., 3.], dtype=np.float64)
# Each array has a shape, which is a tuple with the length of each axis
print(v)
print_array_properties(v)

# A 2d matrix can be created from a list of lists, like this
m = np.array([[0, 1, 2], 
              [3, 4, 5]])
print(m)
print_array_properties(m)

# 3d tensor
t = np.array([[[0, 1], 
               [2, 3]],
              [[3, 4],
               [5, 6]],
              [[6, 7],
               [8, 9]]])
print(t)
print_array_properties(t)

Elements in a NumPy ndarray can be accessed with square brackets, much like lists and tuples. Furthermore, NumPy ndarrays support slicing. It takes a bit of practice to get accustomed to the syntax for multidimensional arrays, so don't worry if you struggle at the beginning.
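
As a compact reference for the multidimensional syntax, the sketch below uses an illustrative 3x4 array named `table`. The pattern is one index or slice per axis, separated by commas.

```python
import numpy as np

# Illustrative 3x4 array holding the integers 0..11
table = np.arange(12).reshape(3, 4)

# One index (or slice) per axis: table[rows, columns]
print(table[0, -1])    # first row, last column: 3
print(table[-1, :])    # last row: 8, 9, 10, 11
print(table[:, ::2])   # every row, every other column
print(table[1:, 1:3])  # rows 1 and 2, columns 1 and 2

# An integer index drops its axis, a slice keeps it
print(table[1, :].shape)    # (4,)
print(table[1:2, :].shape)  # (1, 4)
```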

In [None]:
# Create an array of integers with np.arange
# arange() uses the same syntax as range() and slices
v = np.arange(3, 20, 2)
print(v)
# Random access
print(v[3])
# Usual slicing for 1-d arrays: remember the lesson on slicing!
print(v[1:7:3])
print(v[5:])
print(v[:-3])

In [None]:
# Create a 2d array by reshaping a 1d array. reshape() requires the new shape
# to be compatible with the old one: the total number of elements must be the same.
m = np.arange(15).reshape(5, 3)
print(m, '\n')

# Random access
print(m[2, 1], '\n')
# This syntax works too
print(m[2][1], '\n')

# Slicing
print(m[1:3, 0:2], '\n')

# Select an entire row
print(m[2,:],'\n')

# Select an entire column
print(m[:,0],'\n')

## Vectorization 

Though at first glance a NumPy ndarray may look similar to a Python list (or a list of lists, for multi-dimensional arrays), they are quite different objects. Elements in a Python list can be of any type and are not guaranteed to (and in general don't) occupy contiguous memory addresses. Elements in a newly created NumPy ndarray, on the other hand, are stored in a contiguous block of memory, and their type is known in advance.
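
The memory layout can be inspected directly. The following sketch (with an illustrative array named `values`) prints the element type, the number of bytes per element, the total size of the data buffer, and whether the data are laid out contiguously in row-major (C) order.

```python
import numpy as np

# Illustrative 2x3 array of 64-bit integers
values = np.arange(6, dtype=np.int64).reshape(2, 3)

print(values.dtype)     # int64: one fixed type for all elements
print(values.itemsize)  # 8 bytes per element
print(values.nbytes)    # 6 elements * 8 bytes = 48 bytes in total
print(values.flags['C_CONTIGUOUS'])  # True: a single row-major block
```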

The advantage of this is that NumPy is able to delegate the task of performing mathematical operations on the array’s contents to optimized, compiled C code. This process is referred to as **vectorization**. The gain in performance can be huge: let's try it ourselves.

In [None]:
# We will use the time library to measure execution time
import time

def sum_pure_python(n, num_trials):
    """ Sum the first n integers and measure the execution time. Repeat for
    num_trials times, return the best time. Pure Python version.
    """
    times_pure_python = []
    for _ in range(num_trials):
        # perf_counter() is a high-resolution clock meant for timing code
        tstart = time.perf_counter()
        sum(range(n))
        tstop = time.perf_counter()
        times_pure_python.append(tstop - tstart)
    return min(times_pure_python)

def sum_numpy(n, num_trials):
    """ Sum the first n integers and measure the execution time. Repeat for
    num_trials times, return the best time. NumPy version.
    """
    times_numpy = []
    for _ in range(num_trials):
        # perf_counter() is a high-resolution clock meant for timing code
        tstart = time.perf_counter()
        np.sum(np.arange(n))
        tstop = time.perf_counter()
        times_numpy.append(tstop - tstart)
    return min(times_numpy)

n = 1000000
num_trials = 5
pure_python_best = sum_pure_python(n, num_trials)
print('Pure Python time (best of {}): {:.5f} s'.format(
    num_trials, pure_python_best))
numpy_best = sum_numpy(n, num_trials)
print('NumPy time (best of {}): {:.5f} s'.format(num_trials, numpy_best))
print('NumPy code was {:.2f} times faster'.format(
    pure_python_best / numpy_best))

The bottom line is that you should try to use NumPy, rather than explicitly (or implicitly) looping in pure Python, whenever you are doing intensive mathematical computations on a large number of elements (say, 10^5 or more).

NumPy allows you to use vectorization easily by implementing mathematical operations so that they act automatically on the entire array.
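
To see what "acting on the entire array" replaces, the sketch below writes the same computation twice: once as an explicit Python loop and once as a single vectorized expression. The Celsius to Fahrenheit conversion and the variable names are just an illustration.

```python
import numpy as np

celsius = np.array([-10.0, 0.0, 25.0, 100.0])

# Explicit Python loop: one element at a time
fahrenheit_loop = np.empty_like(celsius)
for i in range(len(celsius)):
    fahrenheit_loop[i] = celsius[i] * 9 / 5 + 32

# Vectorized version: the same formula applied to the whole array at once
fahrenheit_vec = celsius * 9 / 5 + 32

print(fahrenheit_vec)
print(np.allclose(fahrenheit_loop, fahrenheit_vec))  # True
```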

In [None]:
x = np.arange(10)
# Every element of x will be squared
print(x**2, '\n')

# Divide every element of x by ten and add 1 to each
y = x / 10 + 1
print(y, '\n')

# Other mathematical operations from the NumPy library
print(np.exp(-x), '\n')
print(np.sin(2 * y**2), '\n')

# Operations between arrays are vectorized too
# Note: their shapes have to be compatible
print(x + y, '\n')

## Matrix multiplication vs element-wise multiplication


When operating with matrices, there are two kinds of multiplication: element-wise multiplication and matrix multiplication.

Element-wise multiplication is easy: each element of the first matrix is multiplied by the corresponding element of the second. It requires the two matrices to have the same shape and returns a matrix of that shape. It is a commutative operation: a \* b = b \* a
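
A minimal sketch of these properties on two small 2x2 matrices (the names `left` and `right` are illustrative):

```python
import numpy as np

left = np.array([[1, 2],
                 [3, 4]])
right = np.array([[10, 20],
                  [30, 40]])

# Each element is multiplied by the element in the same position
product = left * right
print(product)

# The result has the same shape as the operands
print(product.shape == left.shape)  # True

# The order of the operands does not matter
print(np.array_equal(left * right, right * left))  # True
```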

Matrix multiplication, on the other hand, can be performed between two matrices of shapes $m \times n$ and $n \times p$ (that is, the number of columns of the first must equal the number of rows of the second). The result is an $m \times p$ matrix defined as:

$$ c_{ij} = \sum_{k=1}^n a_{ik} \, b_{kj} $$
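
The definition can be translated almost literally into three nested loops. The function below, `matmul_loops`, is a hypothetical helper written only to illustrate the formula (it is far slower than NumPy's own matrix product and should not be used in practice). The final check compares it with the @ operator.

```python
import numpy as np

def matmul_loops(a, b):
    """Matrix product computed term by term from the definition
    (illustrative helper, not a NumPy function)."""
    m, n = a.shape
    n_b, p = b.shape
    if n != n_b:
        raise ValueError('inner dimensions must match')
    c = np.zeros((m, p), dtype=np.result_type(a, b))
    for i in range(m):          # row of the result
        for j in range(p):      # column of the result
            for k in range(n):  # sum over the shared dimension
                c[i, j] += a[i, k] * b[k, j]
    return c

lhs = np.arange(6).reshape(2, 3)   # 2 x 3
rhs = np.arange(12).reshape(3, 4)  # 3 x 4
result = matmul_loops(lhs, rhs)

print(result.shape)                         # (2, 4)
print(np.array_equal(result, lhs @ rhs))    # True
```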

In [None]:
# Element-wise multiplication
a = np.array([1, 2, 3])
b = np.array([-1, -2, -3])
print(a * b)

In [None]:
# Matrix multiplication
a = np.arange(12).reshape((4, 3))
print(a)
print_array_properties(a)

# Create an array of given shape filled with a given value
b = np.full((3, 2), -1)
print(b)
print_array_properties(b)

c = a.dot(b)
print(c)
print_array_properties(c)

# Equivalent syntaxes (for 2d matrices):
c = a @ b
c = np.matmul(a, b)

# You can also multiply 1d arrays: this is basically the inner product
v1 = np.array([1., 2., 3.])
v2 = np.array([-1., -2., -3.])
print(v1.dot(v2))

# IMPORTANT NOTE: while matmul and @ are completely equivalent, matmul and dot
# are so only for 2d matrices. For arrays with ndim > 2 they do different
# things. Please check the documentation in that case.

## Operations along an axis

A few NumPy functions accept an additional argument, **axis**, which makes the function act not on the entire array but **along an axis**: the given axis is collapsed, and the result has one value for each position along the remaining axes. For example, if you have a 3x2 matrix, you can calculate the global mean (which is a single number), the mean along axis 0 (down the rows), which returns an array of size 2 with the mean of each column, or the mean along axis 1 (across the columns), which returns an array of size 3 with the mean of each row.
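
A useful rule of thumb is that the axis you pass is the one that disappears from the shape. The sketch below illustrates this with sum() on an illustrative 3x2 array named `scores`; mean() and the other reductions behave the same way.

```python
import numpy as np

scores = np.array([[1., 2.],
                   [3., 4.],
                   [5., 6.]])  # shape (3, 2)

# No axis: reduce over every element
print(scores.sum())  # 21.0

# axis=0 collapses the 3 rows: one value per column, shape (2,)
col_totals = scores.sum(axis=0)
print(col_totals, col_totals.shape)

# axis=1 collapses the 2 columns: one value per row, shape (3,)
row_totals = scores.sum(axis=1)
print(row_totals, row_totals.shape)
```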

In [None]:
import numpy as np
arr = np.array([[-1., 1.],
                [-2., 2.],
                [-3., 3.]])
# Mean of the entire array
print(arr.mean())
# Mean of each column (axis 0, the rows, is collapsed): 2 values
print(arr.mean(axis=0))
# Mean of each row (axis 1, the columns, is collapsed): 3 values
print(arr.mean(axis=1))

## Random number generation

In [None]:
# Set the seed for the RNG (so that results are reproducible)
np.random.seed(0)

# Generate an array of random numbers uniformly distributed between 0 and 1
# Here the dimensions must not be passed as a tuple, but as separate arguments
print(np.random.rand(2, 3), '\n')

# Generate gaussian numbers with given mean (loc) and sigma (scale)
print(np.random.normal(loc=1., scale=2., size=10))



## Boolean arrays and masks

A **boolean array** is an array filled with boolean values (True or False). The special thing about boolean arrays is that they can be used as **masks** to select values from other arrays, using the usual square bracket syntax. Let's see how that works:
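
Besides selecting values, a mask can also be counted (True counts as 1) and used on the left-hand side of an assignment. The sketch below shows all three uses on an illustrative array of temperatures; the names and the threshold of 25 are assumptions made for the example.

```python
import numpy as np

temps = np.array([18.5, 22.0, 30.5, 15.0, 27.5])
hot = temps > 25.0  # boolean mask, same shape as temps

# Selecting: keep only the elements where the mask is True
print(temps[hot])

# Counting: summing a boolean array counts the True values
print(hot.sum())  # 2

# Assigning: overwrite only the selected elements (on a copy)
clipped = temps.copy()
clipped[hot] = 25.0
print(clipped)
```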

In [None]:
a = np.array([10, 2, 4, 7, 1, 8, 9])
# Create a boolean array with a conditional expression
mask = a >= 5
print(mask)
print_array_properties(mask)

# Let's use it to select values on another array
b = np.arange(len(a))
print(b)
print(b[mask])

# Of course you do not need the intermediate variable
print(b[a >= 5])


In [None]:
""" Multiple masks can be combined using logical operators:
 ~ (not), & (and), | (or) if they have the same shape""" 
a = np.array([10, 2, 4, 7, 1, 8, 9])
b = np.arange(len(a))
mask_b = b < 4
mask_a = a >= 5
print(~mask_b)
print(mask_a & mask_b)
print(mask_a | mask_b)

## Other useful functions

In [None]:

# Create an array of zeros of the given shape
# Since shape is a tuple, we need double parentheses
zero_arr = np.zeros((2, 3))
print(zero_arr, '\n')

# Create an array of ones
# If we pass a single number as shape the array is 1d
one_arr = np.ones(10)
print(one_arr, '\n')
# Equivalent syntax:
o = np.ones((10,))

# Array of equispaced values
print(np.linspace(0., 10., 5), '\n')

# Flatten an array
arr = np.array([[0., 1.],
                [2., 3.],
                [4., 5.]])
print(arr.flatten(), '\n')

# Transpose a matrix
print(arr.T, '\n') # or, equivalently, np.transpose(arr)