# Math for Machine Learning

## What is Machine Learning?

Arthur Samuel, 1959, "Field of study that gives computers the ability to learn without being explicitly programmed".

## The Machine Learning Pipeline in Mathematics

1. **Data Processing**
    * Uses linear algebra to format the data in a way algorithms can ingest.
2. **Feature Engineering and Selection**
    * Uses vectors and matrices to transform the data so that it is easy for algorithms to understand.
3. **Modeling**
    * Uses geometry, probability, norms, and statistics to define the problem in a way the algorithm can optimize.
4. **Optimization**
    * Uses vector calculus to iterate until certain conditions are met. Then you choose the best model.


## Vectors

### Norm

A norm measures the length of a vector, that is, its distance from the origin.

### Norm Properties

1. All distances are non-negative. $\Vert \vec{v} \Vert \geq 0$
2. Distances scale with scalar multiplication. $\Vert a \vec{v} \Vert = \vert a \vert \cdot \Vert \vec{v} \Vert$
3. *Triangle Inequality*. If I travel from $A$ to $B$ then $B$ to $C$, that is at least as far as going from $A$ to $C$. $\Vert \vec{v} + \vec{w} \Vert \leq \Vert \vec{v} \Vert + \Vert \vec{w} \Vert$. 
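    * Equality holds when $\vec{v}$ and $\vec{w}$ point in the same direction, that is, when $A$, $B$, and $C$ lie on the same line with $B$ between $A$ and $C$: $\Vert \vec{v} + \vec{w} \Vert = \Vert \vec{v} \Vert + \Vert \vec{w} \Vert$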
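
The three properties can be checked numerically for the Euclidean norm. The sketch below is only an illustration, not a proof. The vectors `v`, `w` and the scalar `a` are arbitrary values chosen for this example, and `w_parallel` is a hypothetical vector pointing in the same direction as `v`, used to show the equality case of the triangle inequality.

```python
import numpy as np

# Arbitrary example values (assumptions made for this illustration)
v = np.array([3.0, -4.0])
w = np.array([1.0, 2.0])
a = -2.5

norm_v = np.linalg.norm(v)  # Euclidean norm by default
norm_w = np.linalg.norm(w)

# Property 1: non-negativity
nonneg = norm_v >= 0

# Property 2: ||a v|| = |a| * ||v||
homogeneous = np.isclose(np.linalg.norm(a * v), abs(a) * norm_v)

# Property 3: triangle inequality
triangle = np.linalg.norm(v + w) <= norm_v + norm_w

# Equality case: w_parallel is a positive multiple of v
w_parallel = 2.0 * v
equality = np.isclose(
    np.linalg.norm(v + w_parallel),
    norm_v + np.linalg.norm(w_parallel),
)

print(nonneg, homogeneous, triangle, equality)
```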

    
### Types of Norms

For $\vec{v} = \begin{pmatrix} v_1\\ v_2\\ ...\\ v_n\end{pmatrix}$

1. **Euclidean Norm**. 
\begin{align*}
\Vert \vec{v} \Vert_2 &= \sqrt{v_1^2 + v_2^2 + ... + v_n^2} \\
                      &= \sqrt{\sum_{i=1}^{n} {v_i^2}}
\end{align*}
2. $L_p-\text{Norm}$, for $p \geq 1$
\begin{equation*}
\Vert \vec{v} \Vert_p = \Big( \sum_{i=1}^{n} \vert v_i \vert ^p \Big)^{1/p}
\end{equation*}
3. $L_1-\text{Norm}$
\begin{equation*}
\Vert \vec{v} \Vert_1 = \sum_{i=1}^{n} \vert v_i \vert 
\end{equation*}
Also known as the taxicab metric, the Manhattan norm, ...
4. $L_\infty-\text{Norm}$
\begin{equation*}
\Vert \vec{v} \Vert_\infty = \lim_{p \to \infty} \Vert \vec{v} \Vert_p = \lim_{p \to \infty} \Big( \sum_{i=1}^{n} \vert v_i \vert ^p \Big)^{1/p} = \max_{1 \leq i \leq n} \vert v_i \vert
\end{equation*}
5. $L_0-\text{Norm}$, which is not a norm, is the number of non-zero elements.

## Vectors in Code

* So far we have used LaTeX to write down the mathematical equations.
* Now let's use Python.

In [5]:
# these define two vectors as plain Python lists
v = [1, 2, 3]
w = [1, 1, 1]

# this defines a matrix
A = [[1, 2, 3], [-1, 0, 1], [1, 1, 1]]

In [6]:
# + on Python lists concatenates them; it does not add element-wise
v + w

[1, 2, 3, 1, 1, 1]

In [7]:
import numpy as np

np.array(v) + np.array(w)

array([2, 3, 4])

In [10]:
# * on a Python list repeats it; it does not scale the entries
2*v

[1, 2, 3, 1, 2, 3]

In [11]:
2*np.array(v)

array([2, 4, 6])

In [12]:
# L_p-Norms

print(np.linalg.norm(v, ord=1))
print(np.linalg.norm(v, ord=2))
print(np.linalg.norm(v, ord=np.inf))

6.0
3.7416573867739413
3.0
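
To connect the formulas to the library call, here is a minimal sketch that computes the $L_p$ norm directly from its definition and compares it with `np.linalg.norm`. The helper name `lp_norm` is made up for this example and assumes $p \geq 1$. The loop shows the value approaching the largest absolute entry as $p$ grows, and `np.count_nonzero` stands in for the $L_0$ "norm".

```python
import numpy as np

def lp_norm(x, p):
    """L_p norm computed directly from the definition (assumes p >= 1)."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

v = [1, 2, 3]

# Agrees with np.linalg.norm for p = 1 and p = 2
print(lp_norm(v, 1), np.linalg.norm(v, ord=1))
print(lp_norm(v, 2), np.linalg.norm(v, ord=2))

# As p grows, the value approaches the largest absolute entry (the L_inf norm)
for p in (4, 16, 64):
    print(p, lp_norm(v, p))
print(np.max(np.abs(v)), np.linalg.norm(v, ord=np.inf))

# The L_0 "norm" counts the non-zero entries
print(np.count_nonzero([0, 2, 0, 5]))
```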


## Matrices

### Linear Algebra Operations



#### Dot Products

If we are given two vectors, $\vec{v} = \begin{pmatrix} v_1\\ v_2\\ ...\\ v_n\end{pmatrix}$, $\vec{w} = \begin{pmatrix} w_1\\ w_2\\ ...\\ w_n\end{pmatrix}$, then their dot product is  

\begin{align*}
\vec{v} \cdot \vec{w} &= \vec{v}^T \vec{w} = \begin{pmatrix} v_1 & v_2 & ... & v_n\end{pmatrix} \begin{pmatrix} w_1\\ w_2\\ ...\\ w_n\end{pmatrix}\\
                      &= \sum_{i=1}^{n} v_iw_i
\end{align*}

If we view this in terms of angles, the angle between $\vec{v}$ and $\vec{w}$, $\theta$, is equal to

\begin{equation*}
\theta = \arccos{\frac{\vec{v} \cdot \vec{w}}{\Vert \vec{v} \Vert \Vert \vec{w} \Vert}}
\end{equation*}

Or

\begin{equation*}
\vec{v} \cdot \vec{w} = \Vert \vec{v} \Vert \Vert \vec{w} \Vert \cos{\theta}
\end{equation*}
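
As a small sketch of the angle formula, the helper below (the name `angle_between` is an assumption for this example) computes $\theta$ from the dot product and the Euclidean norms. The call to `np.clip` is a defensive addition that is not part of the formula: it guards against rounding pushing the cosine slightly outside $[-1, 1]$.

```python
import numpy as np

def angle_between(v, w):
    """Angle in radians between two non-zero vectors, in [0, pi]."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    cos_theta = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    # Clip guards against values slightly outside [-1, 1] caused by rounding
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

v = np.array([1.0, 0.0])
print(np.degrees(angle_between(v, [1.0, 1.0])))   # acute angle, dot product > 0
print(np.degrees(angle_between(v, [0.0, 2.0])))   # right angle, dot product = 0
print(np.degrees(angle_between(v, [-1.0, 1.0])))  # obtuse angle, dot product < 0
```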

* **Orthogonality**. $\vec{v} \cdot \vec{w} = 0$, assuming $\vec{v} \neq 0, \vec{w} \neq 0$.  
$\vec{v} \cdot \vec{w}$ is $0$ when $\cos{\theta} = 0$, which happens when $\theta = \pi/2$ $(90^\circ)$, since $\arccos$ returns an angle in $[0, \pi]$. 
This notion of orthogonality works in any dimension.

* $\vec{v} \cdot \vec{w} > 0$, assuming $\vec{v} \neq 0, \vec{w} \neq 0$.  
$\vec{v} \cdot \vec{w}$ is greater than $0$ when $\cos{\theta} > 0$, which happens when $0 \leq \theta < \pi/2$. In other words, every vector that lies on the same side as $\vec{v}$ of the line/plane/etc. orthogonal to $\vec{v}$ has a positive dot product with $\vec{v}$.

* $\vec{v} \cdot \vec{w} < 0$, assuming $\vec{v} \neq 0, \vec{w} \neq 0$.  
$\vec{v} \cdot \vec{w}$ is less than $0$ when $\cos{\theta} < 0$, which happens when $\pi/2 < \theta \leq \pi$. In other words, every vector that lies on the opposite side from $\vec{v}$ of the line/plane/etc. orthogonal to $\vec{v}$ has a negative dot product with $\vec{v}$.

This gives us a geometric way to understand dot products, and it leads to the notion of a hyperplane.  

#### Hyperplane Definition

* It is the set of all vectors orthogonal to a given non-zero vector.
* In 2D, it is the line orthogonal to a given vector.
* In 3D, it is the plane orthogonal to a given vector.
* Higher dimensions are similar. For example, in 4D it is the 3D space orthogonal to a given vector.

Notice that hyperplanes defined this way pass through the origin ($(0, 0, 0)$ in 3D for example). We can extend the above definition by adding "or a translate to a different point". Thus, a hyperplane can also pass through a point other than the origin.  

So this is the geometric notion of a hyperplane: a flat subset, one dimension lower than the surrounding high-dimensional space, that separates the space into two halves. 
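
The sign of a dot product tells us on which side of a hyperplane a point lies. The sketch below is one way to express that. The function name `side_of_hyperplane`, its `offset` parameter (used here to describe a translated hyperplane as the points $\vec{x}$ with $\vec{n} \cdot \vec{x} = \text{offset}$), and the example `normal` vector are all assumptions introduced for this illustration.

```python
import numpy as np

def side_of_hyperplane(x, normal, offset=0.0):
    """Return +1, -1, or 0 according to the sign of normal . x - offset."""
    return int(np.sign(np.dot(normal, x) - offset))

normal = np.array([1.0, 1.0])  # hypothetical normal vector

# Hyperplane through the origin: points x with normal . x = 0
print(side_of_hyperplane([2.0, 1.0], normal))    # same side as the normal
print(side_of_hyperplane([1.0, -1.0], normal))   # on the hyperplane
print(side_of_hyperplane([-3.0, 0.5], normal))   # opposite side

# Translated hyperplane: points x with normal . x = 2
print(side_of_hyperplane([1.0, 1.0], normal, offset=2.0))  # on the translate
```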

#### Matrix Multiplication

If we are given a weight matrix, $W = \begin{pmatrix} - \vec{w_1} -\\ - \vec{w_2} -\\ ...\\ - \vec{w_m} -\end{pmatrix}$, and a feature vector, $\vec{v} = \begin{pmatrix} v_1\\ v_2\\ ...\\ v_n\end{pmatrix}$, then   

\begin{align*}
W \vec{v} &= \begin{pmatrix} - \vec{w_1} -\\ - \vec{w_2} -\\ ...\\ - \vec{w_m} -\end{pmatrix} \begin{pmatrix} v_1\\ v_2\\ ...\\ v_n\end{pmatrix} \\
         &= \begin{pmatrix} \vec{w_1} \cdot \vec{v}\\ \vec{w_2} \cdot \vec{v}\\ ...\\ \vec{w_m} \cdot \vec{v}\end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} w_{1i}{v_i}\\ \sum_{i=1}^{n} w_{2i}{v_i}\\ ...\\ \sum_{i=1}^{n} w_{mi}{v_i}\end{pmatrix}
\end{align*}
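
The product $W \vec{v}$ is therefore one dot product per row of $W$. A minimal sketch, using a small made-up $2 \times 3$ weight matrix and a feature vector in $\mathbb{R}^3$ (both values are assumptions for this example), compares the row-by-row computation with NumPy's `@` operator.

```python
import numpy as np

# Hypothetical weight matrix: m = 2 rows, each a weight vector in R^3
W = np.array([[1, 2, 3],
              [-1, 0, 1]])
v = np.array([1, 1, 1])  # feature vector in R^3

# One dot product per row of W
by_rows = np.array([np.dot(w_i, v) for w_i in W])

# NumPy's built-in matrix-vector product
direct = W @ v

print(by_rows, direct)
```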

Now, if we have a weight matrix, $W = \begin{pmatrix} - \vec{w_1} -\\ - \vec{w_2} -\\ ...\\ - \vec{w_m} -\end{pmatrix}$, and a feature matrix, $X = \begin{pmatrix} \vec{v_1} & \vec{v_2} & ... & \vec{v_k} \end{pmatrix}$, then

\begin{align*}
W X &= \begin{pmatrix} - \vec{w_1} -\\ - \vec{w_2} -\\ ...\\ - \vec{w_m} -\end{pmatrix} \begin{pmatrix} \vec{v_1} & \vec{v_2} & ... & \vec{v_k} \end{pmatrix} \\
         &= \begin{pmatrix} \vec{w_1} \cdot \vec{v_1} & \vec{w_1} \cdot \vec{v_2} & ... & \vec{w_1} \cdot \vec{v_k}\\ \vec{w_2} \cdot \vec{v_1} & \vec{w_2} \cdot \vec{v_2} & ... & \vec{w_2} \cdot \vec{v_k}\\ ... & ... & ... & ...\\ \vec{w_m} \cdot \vec{v_1} & \vec{w_m} \cdot \vec{v_2} & ... & \vec{w_m} \cdot \vec{v_k} \end{pmatrix}
\end{align*}

Thus, the $i, j$-th element of the matrix $WX$ will correspond to the $i$-th feature of the $j$-th data point.  

**Formal Definition**

If $A$ is a matrix whose rows are the feature vectors $\vec{w_i}$ and $B$ is a matrix whose columns are the data vectors $\vec{v_j}$, then the $i,j$-th entry of the product $C = AB$ is $\vec{w_i} \cdot \vec{v_j}$, which is to say the $i$-th feature of the $j$-th vector. 
 
$$\boxed{
c_{i,j} = \sum_l a_{i,l} b_{l,j}
}$$
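
The entry-wise formula translates directly into three nested loops. The sketch below is written for clarity rather than speed, and in practice NumPy's `@` operator does the same job. The function name `matmul_from_definition` and the $3 \times 2$ matrix `B` are assumptions for this example. `A` reuses the matrix defined in the "Vectors in Code" section.

```python
import numpy as np

def matmul_from_definition(A, B):
    """Compute C = AB entry by entry using c_ij = sum_l a_il * b_lj."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    m, n = A.shape
    n_b, k = B.shape
    if n != n_b:
        raise ValueError("inner dimensions must match")
    C = np.zeros((m, k))
    for i in range(m):          # row of A
        for j in range(k):      # column of B
            for l in range(n):  # summation index
                C[i, j] += A[i, l] * B[l, j]
    return C

A = [[1, 2, 3], [-1, 0, 1], [1, 1, 1]]  # matrix from the earlier code cell
B = [[1, 0], [0, 1], [2, 2]]            # hypothetical 3x2 data matrix

C = matmul_from_definition(A, B)
print(C)
print(np.allclose(C, np.array(A) @ np.array(B)))
```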

Matrix multiplication and examples

Hadamard product

Matrix product properties

Geometry of matrix operations

Determinant computation

Matrix invertibility

Linear dependency