# Introduction to Machine Learning

The following notebook is a concise summary of "Machine Learning Crash Course" - the 2nd course in [Google's Machine Learning Series](https://developers.google.com/machine-learning). 

## Supervised Learning: Recap

Recall key terminology:

* Label ($y$) - variable we're predicting 
* Feature ($x_1, x_2, \ldots, x_n$) - input variables that describe our data 

Models map instances of data (feature values $x_1, \ldots, x_n$) to predicted labels ($y'$). 

Supervised learning typically involves either regression (continuous values predicted) or classification (discrete values predicted) models. 

## Linear Regression & Loss Introduction

<img src = "https://miro.medium.com/v2/resize:fit:597/1*RqL8NLlCpcTIzBcsB-3e7A.png" height="70%" width="50%">

Let $\vec{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$ be a vector of features. Our goal in using a regression model is to make predictions $y$ that are as close to the target $t$ as possible. For a linear regression, $y = \sum_{i=1}^{n} w_ix_i + b$, where: 
* $y$ is the prediction 
* $\vec{w}$ is the weight vector 
* $b$ is the bias 
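The linear model above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `predict` and the example values for `x`, `w`, and `b` are assumptions chosen for the example, not values from the course.

```python
def predict(x, w, b):
    """Return the linear-regression prediction y = sum_i w_i * x_i + b."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


# Hypothetical example values (not taken from any real dataset).
x = [2.0, 3.0]    # feature vector
w = [0.5, -1.0]   # weight vector
b = 4.0           # bias

y = predict(x, w, b)  # 0.5 * 2.0 + (-1.0) * 3.0 + 4.0
print(y)
```

Each feature contributes to the prediction in proportion to its weight, and the bias shifts the result independently of the features.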

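A common loss function is "squared (L2) loss." For a single example it is $(y - y')^2$; summed over a set of examples, $$L_2 = \sum(y - y')^2,$$ where $y$ is the observed value and $y'$ is the predicted value (written $t$ and $y$, respectively, in the linear regression definition above). 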

The "mean squared error" (MSE) is simply the average L2 loss over the entire dataset, or $$ \frac{1}{n} \sum_{(x, y) \in D}(y - y')^2, $$ where $n$ is the number of data points, $x$ is a feature or set of features, $y$ is the label, $y'$ is the model's prediction for $x$, and $D$ is the dataset (with $(x, y)$ pairs).
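To make the two formulas concrete, here is a small sketch that computes the squared loss per example and averages it into the mean squared error. The helper names `squared_loss` and `mean_squared_error`, as well as the labels and predictions, are made up for illustration.

```python
def squared_loss(y, y_pred):
    """Squared (L2) loss for one example: (y - y')^2."""
    return (y - y_pred) ** 2


def mean_squared_error(labels, predictions):
    """Average squared loss over all (label, prediction) pairs."""
    n = len(labels)
    if n == 0 or n != len(predictions):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(squared_loss(y, yp) for y, yp in zip(labels, predictions)) / n


# Hypothetical observed labels and model predictions.
labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]

# Per-example squared losses are 1, 0 and 4, so the MSE is 5 / 3.
mse = mean_squared_error(labels, predictions)
print(mse)
```

Because the errors are squared, the single prediction that is off by 2 contributes four times as much loss as the one that is off by 1.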



## Reducing Loss

<img src = "https://developers.google.com/static/machine-learning/crash-course/images/GradientDescentDiagram.svg" style="background-color: white">

We take an iterative approach to reducing loss. Models use features to generate a predicted label (for a single feature, $y' = w_1x_1 + b$). This is compared to the true label from the dataset to determine loss. Since our goal is to minimize loss, the model 'updates' the parameters $w_1$ and $b$ over and over again until the loss stops decreasing. 

However, the process of 'updating' is still a black box at this stage. How does our model know *how* to update the parameters? How does it ensure loss is minimized? The most common mechanism is known as gradient descent.

Given a differentiable function $f(x_1,...,x_n)$ that maps $\mathbb{R}^n$ to $\mathbb{R}$, $f$ has a partial derivative $\frac{\partial f}{\partial x_i}$ with respect to each variable $x_i$. At a given point $a$, these derivatives define the vector

$$\nabla f(a) = \left(\frac{\partial f}{\partial {x_1}}(a),...,\frac{\partial f}{\partial x_n}(a)\right),$$

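which is also called the *gradient* of $f$ at point $a$. The gradient points in the direction of steepest ascent, so when $f$ is the loss, weights are initialized to 'reasonable' (often trivial) values and then adjusted in the direction of the negative gradient, the "direction of steepest descent."  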
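As an illustration of the definition, the gradient can be approximated numerically by nudging one coordinate at a time. The sketch below uses central finite differences; the function `numerical_gradient`, the step size `h`, and the test function `f` are assumptions made for this example, and real training code would normally use analytic or automatic differentiation instead.

```python
def numerical_gradient(f, a, h=1e-6):
    """Approximate the gradient of f at point a with central differences."""
    grad = []
    for i in range(len(a)):
        forward = list(a)
        backward = list(a)
        forward[i] += h   # nudge only coordinate i upward
        backward[i] -= h  # and downward
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def f(p):
    """Hypothetical test function f(x1, x2) = x1^2 + 3 * x2^2."""
    x1, x2 = p
    return x1 ** 2 + 3 * x2 ** 2


# The analytic gradient is (2 * x1, 6 * x2), i.e. (2, 12) at the point (1, 2).
grad = numerical_gradient(f, [1.0, 2.0])
print(grad)
```

The printed values should be close to the analytic gradient, with small differences due to floating-point rounding.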

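Each subsequent point is calculated by multiplying the current gradient by the learning rate ($\alpha$) and subtracting the result from the current point: $w \leftarrow w - \alpha \nabla f(w)$.
 * For one-dimensional functions, the ideal learning rate is $\frac{1}{f''(x)}$ (the inverse of the second derivative of $f$ at $x$) 
 * For higher-dimensional functions, the ideal learning rate is the inverse of the Hessian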
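The iterative procedure can be sketched for a single-feature linear model trained with mean squared error: at every step each parameter moves against its gradient, scaled by the learning rate. Everything named here (`gradient_descent`, the toy data, `alpha=0.05`, `steps=2000`) is an assumption chosen so that the example converges; it is not the course's own code.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Fit y' = w * x + b by full-batch gradient descent on the MSE."""
    w, b = 0.0, 0.0  # trivial starting values
    n = len(xs)
    for _ in range(steps):
        # Prediction error y' - y for every example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient, scaled by the learning rate.
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = gradient_descent(xs, ys)
print(w, b)  # should be close to 2 and 1
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence needlessly slow; the value used here is simply one that works for this toy dataset.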

Note: The Hessian matrix $H_f$ is defined as $\nabla^2 f$, or:

$$H_f = \begin{bmatrix}{\dfrac {\partial ^{2}f}{\partial x_{1}^{2}}}&{\dfrac {\partial ^{2}f}{\partial x_{1}\,\partial x_{2}}}&\cdots &{\dfrac {\partial ^{2}f}{\partial x_{1}\,\partial x_{n}}}\\[2.2ex]{\dfrac {\partial ^{2}f}{\partial x_{2}\,\partial x_{1}}}&{\dfrac {\partial ^{2}f}{\partial x_{2}^{2}}}&\cdots &{\dfrac {\partial ^{2}f}{\partial x_{2}\,\partial x_{n}}}\\[2.2ex]\vdots &\vdots &\ddots &\vdots \\[2.2ex]{\dfrac {\partial ^{2}f}{\partial x_{n}\,\partial x_{1}}}&{\dfrac {\partial ^{2}f}{\partial x_{n}\,\partial x_{2}}}&\cdots &{\dfrac {\partial ^{2}f}{\partial x_{n}^{2}}}\end{bmatrix}$$
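For a quadratic function, using the inverse of the second derivative (or of the Hessian) as the learning rate reaches the minimum in a single step. The sketch below checks this on two hypothetical quadratics, one in one dimension and one in two; the function names and test functions are assumptions made for illustration, and the 2x2 inverse is written out by hand to keep the example dependency-free.

```python
def newton_step_1d(x, first_derivative, second_derivative):
    """One gradient step using the learning rate 1 / f''(x)."""
    alpha = 1.0 / second_derivative(x)
    return x - alpha * first_derivative(x)


# Hypothetical f(x) = 3 * (x - 2)^2 + 1, so f'(x) = 6 * (x - 2) and f''(x) = 6.
x_new = newton_step_1d(10.0, lambda x: 6.0 * (x - 2.0), lambda x: 6.0)
print(x_new)  # approximately 2, the minimizer of f


def newton_step_2d(point, gradient, hessian):
    """One step using the inverse of a 2x2 Hessian as the 'learning rate'."""
    (a, b), (c, d) = hessian
    det = a * d - b * c
    if det == 0:
        raise ValueError("Hessian is singular")
    inverse = [[d / det, -b / det], [-c / det, a / det]]
    g1, g2 = gradient
    step = [inverse[0][0] * g1 + inverse[0][1] * g2,
            inverse[1][0] * g1 + inverse[1][1] * g2]
    return [point[0] - step[0], point[1] - step[1]]


# Hypothetical g(x1, x2) = x1^2 + x1 * x2 + 2 * x2^2, minimized at (0, 0).
point = [3.0, -1.0]
gradient = [2 * point[0] + point[1], point[0] + 4 * point[1]]
hessian = [[2.0, 1.0], [1.0, 4.0]]  # constant for a quadratic

minimum = newton_step_2d(point, gradient, hessian)
print(minimum)  # approximately [0, 0]
```

For non-quadratic functions the second derivatives change from point to point, so a single step generally does not land on the minimum, and computing or inverting the Hessian can be expensive when there are many parameters.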

The set of data points used to calculate a given gradient is known as a batch; the number of data points it contains is the batch size.
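To show where the batch enters the computation, here is a sketch of mini-batch gradient descent for a single-feature linear model: the gradient at each step is computed from only `batch_size` examples rather than the whole dataset. The function name `minibatch_sgd`, the toy data, and all hyperparameter values are assumptions picked for this illustration.

```python
import random


def minibatch_sgd(xs, ys, batch_size=2, alpha=0.02, epochs=1000, seed=0):
    """Fit y' = w * x + b, computing each gradient from one small batch."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            m = len(batch)
            # Errors and MSE gradient on this batch only.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = (2.0 / m) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = (2.0 / m) * sum(errors)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [2.0 * x + 1.0 for x in xs]

w, b = minibatch_sgd(xs, ys)
print(w, b)  # should be close to 2 and 1
```

Setting `batch_size` to the length of the dataset recovers full-batch gradient descent, while a batch size of 1 gives the noisiest but cheapest gradient estimate per step.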