# Introduction to Machine Learning

The following notebook is a concise summary of "Machine Learning Crash Course" - the 2nd course in [Google's Machine Learning Series](https://developers.google.com/machine-learning). 

## Supervised Learning: Recap

Recall key terminology:

* Label ($y$) - variable we're predicting 
* Features ($x_1, x_2, \ldots, x_n$) - input variables that describe our data 

Models map instances of data ($x_n$) to predicted labels ($y'$). 

Supervised learning typically involves either regression (predicting continuous values) or classification (predicting discrete values). When working with supervised learning models, we assume that: 
* Examples are independent and identically distributed (i.i.d.): each is drawn at random from the same distribution
* The distribution is stationary (it does not change over time)

## Linear Regression & Loss Introduction

<img src = "https://miro.medium.com/v2/resize:fit:597/1*RqL8NLlCpcTIzBcsB-3e7A.png" height="70%" width="50%">

If $\vec{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$ is a vector of features, our goal in using a regression model is to make predictions $y'$ that are as close to the target $y$ as possible. For a linear regression, $y' = \sum_{i} w_ix_i + b$, where: 
* $y'$ is the prediction 
* $\vec{w}$ is the weight vector 
* $b$ is the bias 
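
The weighted sum above can be sketched in NumPy as a dot product plus a bias. The feature values, weights and bias below are hypothetical numbers chosen only for illustration; they do not come from a trained model.

```python
import numpy as np

# Hypothetical values, for illustration only
x = np.array([2.0, 3.0, 5.0])    # feature vector (x_1, x_2, x_3)
w = np.array([0.5, -1.0, 2.0])   # weight vector (w_1, w_2, w_3)
b = 1.0                          # bias

# Prediction: weighted sum of the features plus the bias
prediction = np.dot(w, x) + b
print(prediction)  # 9.0
```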

A common loss function is "squared (L2) loss." For a single example, $$L_2 = (y - y')^2,$$ where $y$ is the observed value and $y'$ is the predicted value. 

The "mean squared error" (MSE) is the average L2 loss over the entire dataset, or $$ \frac{1}{n} \sum_{(x, y) \in D}(y - y')^2, $$ where $n$ is the number of data points, $x$ is a feature or set of features, $y$ is the label, $y'$ is the model's prediction for $x$, and $D$ is the dataset of $(x, y)$ pairs.
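
As a small worked example, the squared loss and its mean can be computed directly with NumPy. The observed and predicted values here are made up for illustration.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only
observed = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 11.0])

# Squared loss for each example, then the average over the dataset
squared_errors = (observed - predicted) ** 2
mse = squared_errors.mean()

print(squared_errors)  # [1. 0. 1. 4.]
print(mse)             # 1.5
```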



## Reducing Loss

<img src = "https://developers.google.com/static/machine-learning/crash-course/images/GradientDescentDiagram.svg" style="background-color: white">

We take an iterative approach to reducing loss. Models use features to generate a predicted label ($y' = w_1x_1 + b$). This is compared to the true label from the dataset to determine loss. Since our goal is to minimize loss, the model 'updates' the parameters $b$ and $w$ over and over again until loss is minimized. 

However, the process of 'updating' is still a black box at this stage. How does our model know *how* to update the parameters? How does it ensure loss is minimized? The most common mechanism is known as gradient descent.

Given a differentiable function $f(x_1, \ldots, x_n)$ that maps $\mathbb{R}^n$ to $\mathbb{R}$, $f$ has a partial derivative $\frac{\partial f}{\partial x_i}$ with respect to each variable $x_i$. At a given point $a$, these derivatives define the vector

$$\nabla f(a) = \left(\frac{\partial f}{\partial {x_1}}(a),...,\frac{\partial f}{\partial x_n}(a)\right),$$

which is also called the *gradient* of $f$ at point $a$. Weights are initialized to 'reasonable' (often trivial) values, and adjusted in the "direction of steepest descent," which is the direction of the negative gradient.

The next point is found by multiplying the current gradient by the learning rate ($\alpha$) and subtracting the result from the current parameters.
 * For one-dimensional functions, the ideal learning rate is $\frac{1}{f''(x)}$ 
 * For higher-dimensional functions, the ideal learning rate is the inverse of the Hessian

Note: The Hessian matrix $H_f$ is defined as $\nabla^2 f$ or:

$$H_f = \begin{bmatrix}{\dfrac {\partial ^{2}f}{\partial x_{1}^{2}}}&{\dfrac {\partial ^{2}f}{\partial x_{1}\,\partial x_{2}}}&\cdots &{\dfrac {\partial ^{2}f}{\partial x_{1}\,\partial x_{n}}}\\[2.2ex]{\dfrac {\partial ^{2}f}{\partial x_{2}\,\partial x_{1}}}&{\dfrac {\partial ^{2}f}{\partial x_{2}^{2}}}&\cdots &{\dfrac {\partial ^{2}f}{\partial x_{2}\,\partial x_{n}}}\\[2.2ex]\vdots &\vdots &\ddots &\vdots \\[2.2ex]{\dfrac {\partial ^{2}f}{\partial x_{n}\,\partial x_{1}}}&{\dfrac {\partial ^{2}f}{\partial x_{n}\,\partial x_{2}}}&\cdots &{\dfrac {\partial ^{2}f}{\partial x_{n}^{2}}}\end{bmatrix}$$
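
To make the update rule concrete, here is a minimal sketch of gradient descent on a one-dimensional function. The loss $f(x) = (x - 3)^2$ and the helper names `gradient` and `gradient_descent` are assumptions made for this example. Because $f''(x) = 2$ everywhere, the one-dimensional "ideal" learning rate is $0.5$, which reaches the minimum in a single step for this particular quadratic.

```python
def gradient(x):
    """Derivative of the hypothetical loss f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


def gradient_descent(start, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x


# f''(x) = 2, so 1 / f''(x) = 0.5 lands on the minimum in one step
one_step = gradient_descent(start=0.0, learning_rate=0.5, steps=1)

# A smaller rate also converges, but needs many more steps
many_steps = gradient_descent(start=0.0, learning_rate=0.1, steps=50)

print(one_step)    # 3.0
print(many_steps)  # very close to 3.0
```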

The number of data points used to calculate a single gradient is known as the batch size. Reducing the batch size makes each iteration cheaper to compute, at the cost of a noisier gradient estimate. In stochastic gradient descent (SGD), each iteration uses a batch size of 1 (i.e. a single example), chosen at random. Others, such as mini-batch SGD, use batches of 10 to 1,000 examples, also chosen at random. 

Note: This is distinct from an epoch, which is one full pass through the training dataset; the number of epochs is the number of such passes the model makes. 
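
The relationship between batches and epochs can be sketched with a small mini-batch SGD loop for a one-feature linear model. Everything here is an assumption made for illustration: the synthetic data (a line $y = 2x + 1$ with a little noise), the learning rate, the batch size and the number of epochs.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data from a hypothetical line y = 2x + 1, plus small noise
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.05, size=200)

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 20   # batch_size = 1 would be plain SGD
epochs = 50       # number of full passes through the data

for epoch in range(epochs):
    order = rng.permutation(len(x))          # reshuffle once per epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (w * x[idx] + b) - y[idx]    # prediction minus label
        # Gradients of the batch's mean squared error w.r.t. w and b
        grad_w = 2.0 * np.mean(error * x[idx])
        grad_b = 2.0 * np.mean(error)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)  # should land near 2 and 1
```

With 200 examples and a batch size of 20, each epoch performs 10 parameter updates.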

Below are some important suggestions for tuning hyperparameters (epochs, learning rates and batch size): 
* Loss should initially decrease rapidly, then approach a horizontal asymptote
    * If this convergence doesn't happen, increase epochs
    * If loss decreases slowly, increase $\alpha$
    * If loss fluctuates, decrease $\alpha$
* Begin with large batch sizes, and decrease as far as possible without degradation of results
* Low $\alpha$ with higher epochs or batch size often works well 
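
The effect of the learning rate on the loss curve can be illustrated on a toy problem. This sketch assumes a hypothetical one-parameter loss $(w - 3)^2$ and a helper named `loss_history`; the three rates are picked only to show a slow decrease, a rapid convergence, and a loss that grows from step to step.

```python
def loss_history(learning_rate, steps=20, start=0.0):
    """Loss at each step while minimizing the hypothetical loss (w - 3)^2."""
    w = start
    history = []
    for _ in range(steps):
        history.append((w - 3.0) ** 2)
        w -= learning_rate * 2.0 * (w - 3.0)   # gradient step
    return history


for alpha in (0.01, 0.3, 1.05):
    losses = loss_history(alpha)
    print(f"alpha={alpha}: first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

In this toy setting the smallest rate is still far from the minimum after 20 steps, the middle rate flattens out quickly, and the largest rate overshoots so that the loss increases.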

When training a model, it's important to avoid *overfitting*. The model below has minimal loss, but still serves as a bad model: new data would likely be miscategorized. 

<img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/1200px-Overfitting.svg.png" width="70%">

In other words, the best models are usually simple (see: Ockham's Razor). To ensure you aren't overfitting, it's important to split data into distinct sets: 
* Training set: data used to train your model (~70%)
* Validation set: data used to tune your model (~15%)
* Test set: data used to confirm results (~15%)

Validation and test sets should be representative of the entire data set and large enough to yield statistically significant results. Minimize the model's exposure to the test set to prevent overfitting to it. 
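
One way to produce such a split is to shuffle the row indices once and slice them. This is a minimal sketch using NumPy on a randomly generated, hypothetical dataset; the 70/15/15 proportions follow the rough guideline above, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical dataset: 100 examples with 3 features each
features = rng.normal(size=(100, 3))
labels = rng.normal(size=100)

# Shuffle the indices once so every split is drawn at random
order = rng.permutation(len(features))
train_end = len(order) * 70 // 100   # first 70% for training
val_end = len(order) * 85 // 100     # next 15% for validation

train_idx = order[:train_end]
val_idx = order[train_end:val_end]
test_idx = order[val_end:]           # remaining 15% for testing

x_train, y_train = features[train_idx], labels[train_idx]
x_val, y_val = features[val_idx], labels[val_idx]
x_test, y_test = features[test_idx], labels[test_idx]

print(len(x_train), len(x_val), len(x_test))  # 70 15 15
```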

## Features in Detail

"Feature engineering" refers to the process of converting raw data to a feature vector. 

Good values for feature vectors are:
* 


## NumPy & Pandas Review

Note: ```array``` (NumPy), ```list``` (Python) and ```tensor``` (TensorFlow) are all analogous structures to matrices.

The following code snippets will deal with array creation and manipulation in NumPy. 

In [12]:
import numpy as np 

# Create a 2D array with selected values
new_array = np.array([[1, 2], [3, 4]])
print(new_array)

# Create an array of all 1s (shape: 2 rows x 3 columns)
ones_array = np.ones((2, 3))
print(ones_array)

# Create an array of nums from x (inclusive) to y (not inclusive)
set_array = np.arange(6, 21)
print(set_array)

# Create an array of random floating pts in [0, 1)
floats = np.random.random([4])
print(floats)

# Create an array of random ints from x (inclusive) to y (not inclusive)
rand_array = np.random.randint(low=20, high=100, size=8)
print(rand_array)

[[1 2]
 [3 4]]
[[1. 1. 1.]
 [1. 1. 1.]]
[ 6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
[0.16162712 0.68256191 0.57716913 0.35841684]
[52 82 78 42 29 46 62 84]


The following code snippets will focus on pandas (note: in pandas, DataFrames are the main data structure - this is explored below). 

In [25]:
import numpy as np
import pandas as pd

# Create array to store temp vs light data
my_data = np.array([[30, 20], [20, 10], [10, 5], [0, 0]])

# Create cols
my_cols = ['temp', 'light']

# Create and print 
dataframe = pd.DataFrame(data=my_data, columns=my_cols)
print(dataframe, '\n')

# Create a new column based on doubled light values
dataframe["new col"] = dataframe["light"] * 2
print(dataframe, '\n')

# Print the first n rows (in this case, 2)
print(dataframe.head(2), '\n')

# Print the nth row (index starts at 0) (in this case, print the first)
print(dataframe.iloc[[0]], '\n')

# Print rows 0 (inclusive) to 3 (exclusive)
print(dataframe[0:3], '\n')

# Copy a DataFrame - creates an INDEPENDENT copy
copy = dataframe.copy()

# Reference a DataFrame (simply assign to another var) 
ref = dataframe

# Shuffle rows (helpful for ensuring test/validation/training datasets differ)
shuffled = dataframe.reindex(np.random.permutation(dataframe.index))
print(shuffled)

   temp  light
0    30     20
1    20     10
2    10      5
3     0      0 

   temp  light  new col
0    30     20       40
1    20     10       20
2    10      5       10
3     0      0        0 

   temp  light  new col
0    30     20       40
1    20     10       20 

   temp  light  new col
0    30     20       40 

   temp  light  new col
0    30     20       40
1    20     10       20
2    10      5       10 

   temp  light  new col
2    10      5       10
3     0      0        0
1    20     10       20
0    30     20       40
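
The difference between copying and referencing a DataFrame can be checked with a short experiment. This sketch uses a small hypothetical DataFrame with the same column names as the cell above; `DataFrame.copy()` makes a deep copy by default, while plain assignment only creates a second name for the same object.

```python
import pandas as pd

original = pd.DataFrame({"temp": [30, 20], "light": [20, 10]})

independent = original.copy()   # deep copy by default
reference = original            # second name for the same object

# Modify the original and see which of the two follows the change
original.loc[0, "temp"] = 99

print(independent.loc[0, "temp"])  # 30 (unaffected by the change)
print(reference.loc[0, "temp"])    # 99 (same underlying object)
```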
