## Numpy

This section introduces numpy, a Python library which adds support for large-scale vectors and matrices, as well as fast and efficient implementations of important mathematical functions operating on data.

To start, import numpy (frequently as `np` to make calls shorter).

In [71]:
import numpy as np

Numpy has many convenience functions for generating arrays of numbers. For example, to create an array of all integers from 0 up to (but not including) 10:

In [59]:
np.arange(0, 10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

The `numpy.linspace` function gives you `n` evenly spaced numbers between two endpoints, with both endpoints included by default.

In [None]:
np.linspace(0, 10, 8)
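
To make the difference between the two functions concrete, here is a small sketch. The variable names `stepped` and `spaced` are only for illustration.

```python
import numpy as np

# arange takes a step size and excludes the stop value.
stepped = np.arange(0, 10, 2)

# linspace takes a number of points and includes both endpoints by default.
spaced = np.linspace(0, 1, 5)

print(stepped)  # [0 2 4 6 8]
print(spaced)   # [0.   0.25 0.5  0.75 1.  ]
```

As a rule of thumb, `arange` is convenient when you know the step, and `linspace` when you know how many points you want.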

Numpy is a very large library with many convenient functions. A review of them is beyond the scope of this chapter. We will introduce numpy functions as we go, depending on when we need them. A nice but non-comprehensive review can also be found [here](https://cs231n.github.io/python-numpy-tutorial/).

## Vectors

Vectors can be represented using `numpy.array`. So for example, to represent the vector:

$$
v = \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}
$$

In [60]:
np.array([2, 3, 1])

array([2, 3, 1])

When two vectors of equal length are added, the elements are added point-wise.

$$
\begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} + \begin{bmatrix} 0 \\ 2 \\ -2 \end{bmatrix} = \begin{bmatrix} 2 \\ 5 \\ -1 \end{bmatrix}
$$


In [61]:
a = np.array([2, 3, 1])
b = np.array([0, 2, -2])
c = a + b
print(c)

[ 2  5 -1]


A vector can be multiplied element-wise by a number (called a "scalar"). For example:
    

$$
3 \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 6 \\ 9 \\ 3 \end{bmatrix}
$$


In [62]:
3 * np.array([2,3,1])

array([6, 9, 3])

A dot product is defined as the sum of the element-wise products of two equal-sized vectors. For two vectors $a$ and $b$, it is denoted as $a \cdot b$ or, treating both as row vectors, as $a b^T$ (where $T$ refers to the transpose operation, introduced further down in this notebook).

$$
\begin{bmatrix} 1 & -2 & 2 \end{bmatrix} \begin{bmatrix} 0 \\ 2 \\ 3 \end{bmatrix} = 2
$$

This can be calculated with the `numpy.dot` function:

In [57]:
a = np.array([1,-2,2])
b = np.array([0,2,3])
c = np.dot(a, b)
print(c)

2


Or the shorter way:

In [58]:
c = a.dot(b)
print(c)

2
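
As a quick check of the definition, the sketch below computes the same dot product by summing the element-wise products, and with the `@` operator, which Python 3.5 and later support for numpy arrays. The names `manual` and `with_operator` are only for illustration.

```python
import numpy as np

a = np.array([1, -2, 2])
b = np.array([0, 2, 3])

# Dot product written out as the sum of element-wise products.
manual = np.sum(a * b)

# The @ operator gives the same result for two 1-D arrays.
with_operator = a @ b

print(manual, with_operator)  # 2 2
```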


## Matrices

A matrix is a rectangular array of numbers.

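Numpy can create matrices from normal Python lists using `numpy.matrix`. Note that the numpy documentation no longer recommends `numpy.matrix` for new code; regular arrays created with `numpy.array` work for everything shown here, with `@` in place of `*` for matrix multiplication. For example: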

In [37]:
np.matrix([[2,3,1],[0, 4,-2]])

matrix([[ 2,  3,  1],
        [ 0,  4, -2]])

To instantiate a matrix of all zeros:

In [38]:
np.zeros((3, 3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

To instantiate a matrix of all ones:

In [30]:
np.ones((2, 2))

array([[1., 1.],
       [1., 1.]])

In linear algebra, a square matrix whose elements are all zeros except for those on the main diagonal, which are ones, is called an "identity matrix."

For example:

$$
\mathbf I =
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
$$

is a 3x3 identity matrix. It is called an identity matrix because multiplying by it is analogous to multiplying a scalar by 1: a vector multiplied by an identity matrix is unchanged.

$$
\mathbf I v = v
$$

To instantiate an identity matrix, use `numpy.eye`. For example:

In [63]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
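
The property $\mathbf I v = v$ can be checked directly. This is a minimal sketch using an arbitrary example vector `v`:

```python
import numpy as np

v = np.array([2, 3, 1])
I = np.eye(3)

# Multiplying by the identity matrix returns the same values
# (as floats, because np.eye produces a float array).
result = I @ v
print(result)  # [2. 3. 1.]
```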

To instantiate a matrix of random elements (between 0 and 1), you can use `numpy.random.random`:

In [64]:
A = np.random.random((2, 3))
print(A)

[[0.77756496 0.30393509 0.36155097]
 [0.74992142 0.60643816 0.45455456]]


Transposition reverses the axes of a matrix, so the element at `i,j` in the transposed matrix is equal to the element at `j,i` in the original. The matrix $A$ transposed is denoted as $A^T$.

In [65]:
A_transpose = np.transpose(A)
print(A_transpose)

[[0.77756496 0.74992142]
 [0.30393509 0.60643816]
 [0.36155097 0.45455456]]


It can also be done with the shorthand `.T` operation, as in:

In [66]:
A_transpose = A.T
print(A_transpose)

[[0.77756496 0.74992142]
 [0.30393509 0.60643816]
 [0.36155097 0.45455456]]
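
Because the random values above change on every run, a fixed matrix makes the index swap easier to see. The sketch below uses a small example array chosen only for illustration:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(A.shape)    # (2, 3)
print(A.T.shape)  # (3, 2)

# Element (i, j) of the transpose equals element (j, i) of the original.
print(A.T[2, 0] == A[0, 2])  # True
```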


Like regular vectors, matrices are added point-wise (or element-wise) and must be of the same size. So for example:

$$
\begin{bmatrix} 4 & 3 \\ 3 & -1 \\ -2 & 1 \end{bmatrix} + \begin{bmatrix} -2 & 1 \\ 5 & 3 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 4 \\ 8 & 2 \\ -1 & 1 \end{bmatrix} 
$$


In [67]:
a = np.matrix([[4, 3],[3,-1],[-2,1]])
b = np.matrix([[-2, 1],[5,3],[1,0]])
c = a + b
print(c)

[[ 2  4]
 [ 8  2]
 [-1  1]]


Also like vectors, matrices can be multiplied element-wise by a scalar.

$$
-2 \begin{bmatrix} 1 & -2 & 0 \\ 6 & 4 & -2 \end{bmatrix} = \begin{bmatrix} -2 & 4 & 0 \\ -12 & -8 & 4 \end{bmatrix} 
$$


In [68]:
a = np.matrix([[1,-2,0],[6,4,-2]])
-2 * a

matrix([[ -2,   4,   0],
        [-12,  -8,   4]])

To multiply two matrices together, you take the dot product of each row of the first matrix and each column of the second matrix, as in:  

![matrix multiplication via wikipedia](https://wikimedia.org/api/rest_v1/media/math/render/svg/780671bdfb3fab93187156d2c226df6fe36746fe)

So in order to multiply matrices $A$ and $B$ together, as in $C = A B$, $A$ must have the same number of columns as $B$ has rows. The result $C$ has as many rows as $A$ and as many columns as $B$. For example:


$$
\begin{bmatrix} 1 & -2 & 0 \\ 6 & 4 & -2 \end{bmatrix} * \begin{bmatrix} 4 & -1 \\ 0 & -2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 4 & 3 \\ 22 & -20 \end{bmatrix} 
$$


In [69]:
a = np.matrix([[1,-2,0],[6,4,-2]])
b = np.matrix([[4,-1],[0,-2],[1,3]])
c = a * b
print(c)

[[  4   3]
 [ 22 -20]]
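
The same product can be computed with plain `numpy.array` objects and the `@` operator. The sketch below also shows the row-times-column rule for one entry, and what happens when the shapes do not line up. The name `shape_error` is only for illustration.

```python
import numpy as np

a = np.array([[1, -2, 0], [6, 4, -2]])     # shape (2, 3)
b = np.array([[4, -1], [0, -2], [1, 3]])   # shape (3, 2)

# With plain arrays, use @ (or np.matmul) for matrix multiplication.
c = a @ b
print(c.shape)  # (2, 2)

# Each entry is the dot product of a row of a with a column of b.
print(c[1, 0] == np.dot(a[1, :], b[:, 0]))  # True

# Shapes that do not line up raise a ValueError.
shape_error = None
try:
    a @ a   # (2, 3) @ (2, 3): 3 columns vs. 2 rows
except ValueError as err:
    shape_error = str(err)
```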


The Hadamard product of two matrices differs from normal multiplication in that it is the element-wise multiplication of two matrices. 

$$
A \odot B =
\begin{bmatrix}
A_{1,1} B_{1,1} & \dots & A_{1,n} B_{1,n} \\
\vdots & \ddots & \vdots \\
A_{m,1} B_{m,1} & \dots & A_{m,n} B_{m,n}
\end{bmatrix}
$$

So for example:

$$
\begin{bmatrix} 3 & 1 \\ 0 & 5 \end{bmatrix} \odot \begin{bmatrix} -2 & 4 \\ 1 & -2 \end{bmatrix} = \begin{bmatrix} -6 & 4 \\ 0 & -10 \end{bmatrix}
$$

To calculate this with numpy, instantiate the matrices with `numpy.array` instead of `numpy.matrix`. For arrays, the `*` operator multiplies element-wise by default, and `numpy.multiply` does the same thing explicitly:

In [53]:
a = np.array([[3,1],[0,5]])
b = np.array([[-2,4],[1,-2]])
np.multiply(a,b)

array([[ -6,   4],
       [  0, -10]])
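
To see how the two kinds of product differ on the same inputs, the sketch below compares `*` and `@` for arrays. The names `hadamard` and `matrix_product` are only for illustration.

```python
import numpy as np

a = np.array([[3, 1], [0, 5]])
b = np.array([[-2, 4], [1, -2]])

hadamard = a * b         # element-wise (Hadamard) product
matrix_product = a @ b   # ordinary matrix product

print(hadamard)
# [[ -6   4]
#  [  0 -10]]
print(matrix_product)
# [[ -5  10]
#  [  5 -10]]
```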

## Derivatives

The derivative of a function $f(x)$ is the instantaneous slope of the function at a given point, and is denoted as $f^\prime(x)$.


$$f^\prime(x) = \lim_{\Delta x\to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} $$

The derivative of $f$ with respect to $x$ can also be denoted as $\frac{df}{dx}$.
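
The limit definition can be explored numerically by evaluating the quotient for smaller and smaller values of $\Delta x$. This is only a sketch: `square` and `forward_difference` are hypothetical helper names, and $x^2$ is chosen because its derivative at $x = 3$ is known to be 6.

```python
def square(x):
    return x ** 2

def forward_difference(func, x, dx):
    # The quotient inside the limit, evaluated for a finite dx.
    return (func(x + dx) - func(x)) / dx

# The estimates approach the exact derivative, 6, as dx shrinks.
for dx in [1.0, 0.1, 0.001]:
    print(dx, forward_difference(square, 3.0, dx))
```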

The derivative of a power function (a single polynomial term) is given by the power rule:

$$
f(x) = a x ^ b
$$

$$
\frac{df}{dx} = b a x^{b-1}
$$
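
As a sanity check, the formula can be compared against a numerical estimate. The sketch below assumes the example $f(x) = 2x^3$; the helpers `power_rule` and `central_difference` are hypothetical names introduced here.

```python
def power_rule(coef, power, x):
    # Derivative of coef * x**power according to the power rule.
    return power * coef * x ** (power - 1)

def central_difference(func, x, dx=1e-6):
    # Symmetric finite-difference estimate of the derivative.
    return (func(x + dx) - func(x - dx)) / (2 * dx)

coef, power = 2.0, 3.0   # f(x) = 2x^3, so df/dx = 6x^2
analytic = power_rule(coef, power, 1.5)
numeric = central_difference(lambda x: coef * x ** power, 1.5)

print(analytic)  # 13.5
```

The value of `numeric` should agree with `analytic` up to a small floating-point error.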



## Chain rule

Functions can be composites of multiple functions. For example, consider the function:

$$
f(x) = (4x-5)^3
$$

This function can be broken down by letting:

$$
h(x) = 4x-5 \\
g(x) = x^3 \\
f(x) = g(h(x)) 
$$

The chain rule states that the derivative of a composite function $f(x) = g(h(x))$ is:

$$
f^\prime(x) = g^\prime(h(x)) h^\prime(x)
$$

or, in Leibniz notation:

$$
\frac{df}{dx} = \frac{dg}{dh} \frac{dh}{dx}
$$

Since $g$ and $h$ are both polynomials, we can easily calculate that:

$$
g^\prime(x) = 3x^2 \\
h^\prime(x) = 4
$$

and therefore, evaluating $g^\prime$ at $h(x) = 4x-5$:

$$
\frac{df}{dx} = 3(4x-5)^2 \cdot 4 = 12(4x-5)^2
$$

The chain rule is extremely important to the study of neural networks, because it is what allows us to find the derivative of the network's cost function analytically. We will see more about this in the next notebook.

In [90]:
def h(x):
    return 4*x-5

def g(x):
    return x**3

def f(x):
    return g(h(x))

def h_deriv(x):
    return 4

def g_deriv(x):
    return 3*(x**2)

def f_deriv(x):
    # chain rule: evaluate g' at h(x), then multiply by h'(x)
    return g_deriv(h(x)) * h_deriv(x)


In [91]:
f(5)

3375

In [92]:
f_deriv(2)

108
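
A chain-rule derivative can be sanity-checked against a numerical estimate. The sketch below is self-contained and uses $f(x) = (4x-5)^3$, whose chain-rule derivative is $g^\prime(h(x)) h^\prime(x) = 3(4x-5)^2 \cdot 4$. The helper names `f_deriv_chain` and `central_difference` are hypothetical.

```python
def f(x):
    return (4 * x - 5) ** 3

def f_deriv_chain(x):
    # g'(h(x)) * h'(x) with g(u) = u**3 and h(x) = 4x - 5
    return 3 * (4 * x - 5) ** 2 * 4

def central_difference(func, x, dx=1e-6):
    # Symmetric finite-difference estimate of the derivative.
    return (func(x + dx) - func(x - dx)) / (2 * dx)

analytic = f_deriv_chain(2)            # 3 * 3**2 * 4
numeric = central_difference(f, 2.0)

print(analytic)  # 108
```

The value of `numeric` should be very close to 108, which confirms that $g^\prime$ has to be evaluated at $h(x)$ rather than at $x$.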

## Multivariable functions

A function may have more than one variable. For example:

$$
f(X) = w_1 x_1 + w_2 x_2 + w_3 x_3 + ... + w_n x_n + b 
$$

or

$$
f(X) = b + \sum_i w_i x_i
$$

One useful trick to simplify this formula is to append a $1$ to the input vector $X$, so that:

$$
X = \begin{bmatrix} x_1 & x_2 & ... & x_n & 1 \end{bmatrix}
$$

and let $b$ just be an element in the weights vector, so:

$$
W = \begin{bmatrix} w_1 & w_2 & ... & w_n  & b \end{bmatrix}
$$

So then we can rewrite the function as:

$$
f(X) = W X^T
$$
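
The equivalence of the two forms can be checked in numpy. This is a minimal sketch; the weights, bias, and input below are arbitrary example values, not anything prescribed by the text.

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # example weights (arbitrary values)
b = 0.25                          # example bias
x = np.array([1.0, 2.0, 3.0])    # example input

# Original form: b + sum_i w_i x_i
explicit = b + np.sum(w * x)

# Bias trick: append 1 to the input and b to the weights.
X = np.append(x, 1.0)
W = np.append(w, b)
folded = np.dot(W, X)

print(explicit, folded)  # 4.75 4.75
```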


A partial derivative of a multivariable function is the derivative of the function with respect to just one of the variables, holding all the others constant.

The partial derivative of $f$ with respect to $x_i$ is denoted as $\frac{\partial f}{\partial x_i}$.
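
As a worked example, take the linear function $f(X) = b + \sum_j w_j x_j$ from above. Differentiating with respect to $x_i$ treats $b$ and every other $x_j$ as constants, so only the term $w_i x_i$ contributes:

$$
\frac{\partial f}{\partial x_i} = \frac{\partial}{\partial x_i} \left( b + \sum_j w_j x_j \right) = w_i
$$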

## Gradient

The [gradient](https://en.wikipedia.org/wiki/Gradient) of a function is the vector containing each of its partial derivatives, evaluated at a point $X$.

$$
\nabla f(X) = ( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n}  )
$$
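
A gradient can be approximated numerically by estimating one partial derivative at a time. The sketch below assumes a linear function $f(X) = b + \sum_i w_i x_i$ with arbitrary example values, for which the gradient is simply the weight vector; `numerical_gradient` is a hypothetical helper, not a numpy function.

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # example weights (arbitrary values)
b = 0.25                          # example bias

def f(x):
    return b + np.dot(w, x)

def numerical_gradient(func, x, dx=1e-6):
    # Estimate each partial derivative by nudging one coordinate at a time.
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        step = np.zeros_like(x, dtype=float)
        step[i] = dx
        grad[i] = (func(x + step) - func(x - step)) / (2 * dx)
    return grad

x = np.array([1.0, 2.0, 3.0])
grad = numerical_gradient(f, x)
print(np.allclose(grad, w))  # True
```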

We will look more closely at the gradient later when we get into how neural networks are trained.