# Introduction
There are 3 major types of machine learning:
- Supervised
- Unsupervised
- Reinforcement

The building blocks of a machine learning algorithm are:
- Data
- Model
- Objective Function
- Optimization Algorithm

# Introduction to Neural Networks
The building blocks of a machine learning algorithm are:
- **Data**
    - We need a certain amount of data to train the model with
    - This is usually historical data
- **Model**
    - The simplest model we can train is a linear model
    - Building on the linear model, deep learning allows us to create complex non-linear models
- **Objective Function**
    - This estimates how correct the model's outputs are on average
    - The entire machine learning framework boils down to optimizing this function
- **Optimization Algorithm**
    - This consists of the mechanics through which we vary the parameters of the model to optimize the objective function

# Types of Machine Learning
- Supervised
    - This refers to the case where we provide the algorithm with inputs and their corresponding desired outputs.
    - Based on this information, it learns how to produce outputs as close as possible to the ones we are looking for
    - Can be split into two types: Classification and Regression
    - **Classification:** outputs are split into categories
    - **Regression:** outputs are numerical
- Unsupervised
    - We feed inputs but do not give target outputs, meaning we don't tell the algorithm exactly what our goal is 
    - Instead we ask it to find some kind of dependence or underlying logic in the data provided
- Reinforcement
    - We train a model how to act in an environment based on the rewards it receives

# The Linear Model
### Single Output
- $y = wx + b$
- $w$ and $x$ are vectors with n entries and $b$ is a scalar
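
As a minimal sketch of the single-output model, the snippet below computes $y = wx + b$ as a dot product plus a bias. The function name `linear_model` and the sample numbers are illustrative assumptions, not part of the original notes.

```python
def linear_model(x, w, b):
    """Return y = w . x + b for one observation with n input features.

    `linear_model` is a hypothetical helper name used only for illustration.
    """
    if len(x) != len(w):
        raise ValueError("x and w must have the same number of entries")
    # Dot product of the weights and inputs, then add the scalar bias.
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


# Made-up example values with n = 3 entries.
y = linear_model(x=[1.0, 2.0, 3.0], w=[0.5, -1.0, 2.0], b=0.25)
print(y)
```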

### Multiple Outputs
$$y_1 = x_1w_{11} + x_2w_{21} + b_1$$
$$y_2 = x_1w_{12} + x_2w_{22} + b_2$$

If we have $m$ outputs, $k$ inputs, and $n$ observations, the output matrix has shape $n \times m$, the input matrix has shape $n \times k$, the weights matrix has shape $k \times m$, and the biases vector has shape $1 \times m$.

$$\begin{bmatrix}
y_{11} & y_{12} & \cdots & y_{1m} \\
y_{21} & y_{22} & \cdots & y_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
y_{n1} & y_{n2} & \cdots & y_{nm}
\end{bmatrix}
= 
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk}
\end{bmatrix}
\begin{bmatrix}
w_{11} & w_{12} & \cdots & w_{1m} \\
w_{21} & w_{22} & \cdots & w_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
w_{k1} & w_{k2} & \cdots & w_{km}
\end{bmatrix}
+
\begin{bmatrix}
b_1 & b_2 & \cdots & b_m
\end{bmatrix}$$
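
To make the shapes concrete, here is a small pure-Python sketch of the matrix form, assuming the $1 \times m$ bias row is added to every row of the product. The helper name `affine` and the numbers are made up for illustration.

```python
def affine(X, W, b):
    """Compute Y = XW + b with plain lists (hypothetical helper).

    X is n x k, W is k x m, b has m entries and is added to each row of XW.
    """
    n, k, m = len(X), len(W), len(b)
    return [
        [sum(X[i][p] * W[p][j] for p in range(k)) + b[j] for j in range(m)]
        for i in range(n)
    ]


# Illustrative values: n = 3 observations, k = 2 inputs, m = 4 outputs.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 0.0, -2.0]]
b = [0.0, 1.0, 2.0, 3.0]

Y = affine(X, W, b)
print(len(Y), len(Y[0]))  # number of rows (n) and columns (m) of the output
```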

# What is an Objective Function?
The objective function is the measure used to evaluate how well the model's outputs match the desired correct values. Objective functions are generally split into two types: loss and reward.

### Loss Functions
- These are also called cost functions
- The lower the loss function, the higher the level of accuracy of the model
- An intuitive example is a loss function that measures the prediction error: minimizing the error minimizes the loss
- Usually used in supervised learning

### Reward Functions
- These are essentially the opposite of loss functions: the higher the reward function, the higher the level of accuracy of the model
- This is used mostly in reinforcement learning, where the goal is to maximize a specific result

## Most Common Loss Functions
**Target:** The target ($t$) is the desired value at which we are aiming. Generally we want our output ($y$) to be as close to the target as possible.

### Regression: L2-Norm
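- L2-norm $= \sum_i (y_i - t_i)^2$
- The name comes from the vector norm: this sum is the squared Euclidean distance between the outputs and the targets
- Minimizing this loss is the least-squares criterion behind OLS (ordinary least squares) regression, as implemented in statsmodels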
- The lower this sum, the lower the error, so the lower the loss function
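
A short sketch of this loss, using the sum-of-squares form given above. The function name `l2_loss` and the sample outputs and targets are assumptions chosen for illustration.

```python
def l2_loss(outputs, targets):
    """Sum of squared differences between outputs y_i and targets t_i.

    `l2_loss` is a hypothetical name; the formula is the one from the notes.
    """
    return sum((y - t) ** 2 for y, t in zip(outputs, targets))


targets = [3.0, -0.5, 2.0]          # made-up target values
close_outputs = [2.5, 0.0, 2.0]     # predictions near the targets
far_outputs = [0.0, 2.0, 5.0]       # predictions far from the targets

close_loss = l2_loss(close_outputs, targets)
far_loss = l2_loss(far_outputs, targets)
print(close_loss, far_loss)
```

The closer set of outputs produces the smaller sum, which is the behaviour a loss function needs.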

### Classification: Cross-Entropy
- $L\left(\vec{y}, \vec{t} \right) = - \sum_i t_i \ln(y_i)$
- The target vector is 0 for the wrong categories and 1 for the correct category. The output vector gives the probability that an input belongs to each category.

#### Example:
Say we are classifying pictures into cat, dog, and horse. Given a picture of a dog the target vector would be \[0, 1, 0\], similarly given a picture of a horse the target vector would be \[0, 0, 1\].

For the picture of the dog, we get an output vector of \[0.4, 0.4, 0.2\], and for the picture of the horse, we get an output vector of \[0.1, 0.2, 0.7\]. Clearly the second output is better, since it assigns a higher probability to the correct category, but we can check this using cross-entropy.

$$L\left(\vec{y}, \vec{t} \right) = -0 \cdot \ln(0.4) - 1 \cdot \ln(0.4) - 0 \cdot \ln(0.2) = 0.92$$
$$L\left(\vec{y}, \vec{t} \right) = -0 \cdot \ln(0.1) - 0 \cdot \ln(0.2) - 1 \cdot \ln(0.7) = 0.36$$
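
The two values in the example can be reproduced with a few lines of Python. This is only a sketch: the function name `cross_entropy` is an assumption, and terms with a zero target are skipped because they contribute nothing to the sum.

```python
import math


def cross_entropy(outputs, targets):
    """Cross-entropy L(y, t) = -sum_i t_i * ln(y_i) (hypothetical helper).

    Terms with t_i == 0 are skipped: they add zero to the sum, and skipping
    them avoids evaluating ln(0) when an output probability is exactly 0.
    """
    return -sum(t * math.log(y) for y, t in zip(outputs, targets) if t != 0)


# Category order: [cat, dog, horse], as in the example above.
dog_loss = cross_entropy([0.4, 0.4, 0.2], [0, 1, 0])
horse_loss = cross_entropy([0.1, 0.2, 0.7], [0, 0, 1])

print(round(dog_loss, 2), round(horse_loss, 2))
```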

These are the most common loss functions, but there are others that are useful. In fact, **any** function that holds the basic property of higher for worse results and lower for better results can be a loss function.

# Optimization Algorithm
### 1-Parameter Gradient Descent
- The learning rate is the rate at which the machine learning algorithm forgets old beliefs for new ones
- It is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.
- Generally we want the learning rate to be high enough that we reach the closest minimum in a reasonable amount of time, but low enough that we don't oscillate around the minimum
- There are techniques to allow us to choose the correct rate

There are several key takeaways:
1. Using gradient descent we can find the minimum by trial and error
2. The update rule means that, with a suitable learning rate, each trial is better than the previous one
3. The learning rate should be high enough so we don't iterate forever, but low enough so we don't oscillate forever
4. We should stop updating once we've converged, i.e. when $|x_{i+1} - x_i|$ falls below a small threshold such as $0.001$
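
The points above can be sketched as a short loop that repeats $x_{i+1} = x_i - \eta f'(x_i)$ until successive values differ by less than a small tolerance. The example function $f(x) = 5x^2 + 3x - 4$, the starting point, the learning rate, and the name `gradient_descent_1d` are all assumptions chosen for illustration.

```python
def gradient_descent_1d(grad, x0, eta=0.01, tol=0.001, max_iter=10_000):
    """Minimize a 1-parameter function given its derivative `grad`.

    Hypothetical helper: eta is the learning rate, tol the stopping
    threshold on |x_{i+1} - x_i|, and max_iter a safety cap on iterations.
    """
    x = x0
    for _ in range(max_iter):
        x_new = x - eta * grad(x)      # the update rule
        if abs(x_new - x) < tol:       # converged: the step is tiny
            return x_new
        x = x_new
    return x


# Assumed example: f(x) = 5x^2 + 3x - 4, so f'(x) = 10x + 3.
# The true minimum is at x = -0.3.
x_min = gradient_descent_1d(lambda x: 10 * x + 3, x0=4.0)
print(x_min)
```

With this tolerance the loop stops close to, but not exactly at, the true minimum; a smaller `tol` trades more iterations for a more precise answer.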

### n-dimensional Gradient Descent
Loss: 
$$\text{L2-norm} = \frac{1}{2}\sum_i (y_i - t_i)^2$$
It is convention to divide the L2-norm by 2, which will be explained later.

The update rule changes from:
$$x_{i+1} = x_i - \eta f'(x_i)$$
to
$$w_{i+1} = w_i - \eta \nabla_{\vec{w}} L(y,t)$$
$$b_{i+1} = b_i - \eta \nabla_{\vec{b}} L(y,t)$$

It is essentially the same rule, applied to the weights matrix $w$ and the biases vector $b$ instead of the single number $x$. The initial rule was used to explain the concept, as it would be a familiar method from calculus.

![image.png](attachment:image.png)

- In the third line we differentiate by w using the chain rule
- We can see in the third line why the convention is to look at 1/2 L2-norm
- In the last line we define a new variable $\delta_i = y_i-t_i$
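
For reference, one way to write out such a derivation is sketched here, assuming a single-output linear model $y_i = x_i w + b$. The line layout is an assumption and may not match the figure exactly.

$$\begin{aligned}
\nabla_w L &= \nabla_w \frac{1}{2}\sum_i (y_i - t_i)^2 \\
&= \nabla_w \frac{1}{2}\sum_i (x_i w + b - t_i)^2 \\
&= \sum_i x_i (x_i w + b - t_i) \\
&= \sum_i x_i (y_i - t_i) \\
&= \sum_i x_i \delta_i
\end{aligned}$$

In this sketch the chain rule is applied in the third line, where the factor of 2 from the square cancels the $\frac{1}{2}$. The same steps with respect to $b$ give an inner derivative of 1 instead of $x_i$, which leaves only the sum of the $\delta_i$.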

We get:
$$\nabla_w L = \sum_i x_i \delta_i$$
$$\nabla_b L = \sum_i \delta_i$$

So our update rule becomes (the subscript on $w$ and $b$ counts iterations, while the sum runs over observations):
$$w_{i+1} = w_i - \eta \sum_i x_i \delta_i$$
$$b_{i+1} = b_i - \eta \sum_i \delta_i$$
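
As a rough sketch of these update rules in action, the snippet below fits a single-output linear model in pure Python. The helper name `train_linear`, the learning rate, the number of passes, and the toy data (generated from $t = 2x_1 - 3x_2 + 5$) are all assumptions chosen for illustration.

```python
def train_linear(X, t, eta=0.01, epochs=5000):
    """Fit y = w . x + b by gradient descent on the 1/2 L2-norm loss.

    Hypothetical helper. X is a list of observations (each a list of
    features), t the list of targets, eta the learning rate.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        # delta_i = y_i - t_i for every observation
        deltas = [
            sum(w_j * x_j for w_j, x_j in zip(w, x)) + b - t_i
            for x, t_i in zip(X, t)
        ]
        # Gradients: sum_i x_i * delta_i for w, and sum_i delta_i for b
        grad_w = [sum(x[j] * d for x, d in zip(X, deltas)) for j in range(n_features)]
        grad_b = sum(deltas)
        # Update rule: step against the gradient, scaled by eta
        w = [w_j - eta * g for w_j, g in zip(w, grad_w)]
        b -= eta * grad_b
    return w, b


# Toy data from an assumed "true" model: t = 2*x1 - 3*x2 + 5
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
t = [2 * x1 - 3 * x2 + 5 for x1, x2 in X]

w, b = train_linear(X, t)
print(w, b)
```

On this noise-free data the learned weights and bias should approach the values used to generate the targets. A fixed number of passes is used here for simplicity; a convergence check on the size of the update would also work.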