# Supervised Learning - Linear Models

Linear Models:
1. Building blocks for neural networks
2. Supervised Learning:
    - Labelled set of examples
    - Ground truth for the given example
    - Goal: learn the function that best fits the data

## Concepts
1. $x_i$ = the given example
2. $y_i$ = the corresponding label/target value
3. $x_i = (x_{i1}, \dots, x_{id})$ = the $d$ features to learn from
4. $X = ((x_1, y_1), ..., (x_l, y_l))$ = the training set with $l$ examples
5. $a(x)$ = the model or the hypothesis

### Goal

$x \to a(x) \to y^{pred}$, where $a(x)$ can be a **regression** or a **classification** model.

For a model with a single feature: $y = w_1x + b$.

For a model with $d$ features: $y = b + w_1x_1 + w_2x_2 + \dots + w_dx_d$, where

- $w_1, \dots, w_d$ are the model coefficients (weights)
- $b$ is the bias term
- $d+1$ is the number of parameters of the model (the number of features plus the bias term)

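In vector notation the model can be written as $a(x) = w^Tx$, where the bias is absorbed into $w$ by appending a constant feature equal to 1 to $x$. In matrix notation, for all examples at once: $a(X) = Xw$, where $X$ here denotes the matrix whose rows are the examples $x_i$.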
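The matrix form can be sketched in NumPy as follows. This is a minimal illustration, assuming the bias is handled by appending a constant-1 column to the feature matrix; the helper names `add_bias_column` and `predict` and the toy values are made up for this example.

```python
import numpy as np

def add_bias_column(X):
    """Append a constant-1 feature so the bias b becomes the last entry of w."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def predict(X, w):
    """Matrix form a(X) = Xw; X has shape (l, d + 1), w has shape (d + 1,)."""
    return X @ w

# Toy data (made-up values): l = 3 examples, d = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
w = np.array([0.5, -1.0, 2.0])  # (w_1, w_2, b): d + 1 = 3 parameters

X = add_bias_column(X_raw)
y_pred = predict(X, w)  # one prediction per example
print(y_pred)
```

Each entry of `y_pred` equals $b + w_1x_{i1} + w_2x_{i2}$ for the corresponding row, so the matrix product computes all predictions in one step.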

## Quality measure: Loss functions

To measure the quality of a model, we need to measure how far its predictions are from the true target values. Loss functions are used for this.

### Loss functions

#### Mean Squared Error (MSE) Loss

$L(w) =\frac{1}{l}\sum_{i=1}^{l} (w^Tx_i - y_i)^2$

The goal is then to **minimize** the loss: $\min_w L(w)$
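As a rough sketch, the MSE loss can be computed and minimized in NumPy as shown here. The function name `mse_loss` and the toy data are assumptions for illustration, and `np.linalg.lstsq` is used as one convenient way to find the minimizing $w$; the notes themselves do not prescribe a particular solver.

```python
import numpy as np

def mse_loss(w, X, y):
    """L(w) = (1/l) * sum_i (w^T x_i - y_i)^2."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

# Toy data (made-up): targets follow y = 2x + 1 exactly.
# The second column is a constant 1 that plays the role of the bias.
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
y = np.array([3.0, 5.0, 7.0])

w_bad = np.array([0.0, 0.0])                    # arbitrary starting guess
w_fit, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares minimizer

loss_bad = mse_loss(w_bad, X, y)
loss_fit = mse_loss(w_fit, X, y)
print(loss_bad, loss_fit, w_fit)
```

The fitted weights should recover a slope close to 2 and a bias close to 1, with a loss near zero, while the all-zero guess gives a much larger loss.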

# Linear model classification

## 1. Binary classification

- $y \in \{-1, 1\}$
- $a(x) = \operatorname{sign}(w^Tx)$

Number of parameters: $d$ ($w \in \mathbb{R}^d$)
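A minimal sketch of the binary decision rule, assuming NumPy and made-up weights. The name `predict_binary` is hypothetical, and mapping a score of exactly 0 to the class $+1$ is an arbitrary tie-breaking choice that the notes do not specify.

```python
import numpy as np

def predict_binary(X, w):
    """a(x) = sign(w^T x), with labels in {-1, +1}."""
    scores = X @ w
    # A score of exactly 0 is assigned to +1 (arbitrary tie-break).
    return np.where(scores >= 0, 1, -1)

# Toy data (made-up): 3 examples, d = 2 features.
X = np.array([[2.0, 1.0],
              [-1.0, 0.5],
              [0.0, -3.0]])
w = np.array([1.0, 1.0])

labels = predict_binary(X, w)  # scores are 3.0, -0.5, -3.0
print(labels)
```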

## 2. Multi-class Classification

- $y \in \{1, \dots, k\}$, with $k$ as the number of classes
- $a(x) = \arg\max_{j \in \{1, \dots, k\}} w_j^Tx$

Number of parameters: $k \cdot d$ (one weight vector $w_j \in \mathbb{R}^d$ per class)
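The multi-class rule can be sketched the same way: keep one weight vector per class and pick the class with the highest score. The function name `predict_multiclass`, the stacking of the weight vectors into a matrix `W`, and the toy values are assumptions made for this illustration.

```python
import numpy as np

def predict_multiclass(X, W):
    """W has shape (k, d): row j holds the weight vector of class j + 1.

    Returns labels in {1, ..., k}, matching the notation of the notes.
    """
    scores = X @ W.T                      # shape (l, k): one score per class
    return np.argmax(scores, axis=1) + 1  # argmax is 0-based, labels are 1-based

# Toy setup (made-up): k = 3 classes, d = 2 features.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
X = np.array([[2.0, 0.5],
              [0.1, 3.0],
              [-2.0, -1.0]])

labels = predict_multiclass(X, W)
print(labels, W.size)
```

Here `W.size` is $k \cdot d = 6$, since every class needs its own $d$-dimensional weight vector.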

## Loss computation

### Accuracy Loss

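- The fraction of points that are classified correctly

$L(w) = \frac{1}{l}\sum_{i=1}^{l} [a(x_i) = y_i]$

- $[P]$ is 1 if $P$ is true
- $[P]$ is 0 if $P$ is false


### Squared Loss

- Consider $x_i$ such that $y_i = 1$; the squared loss on this example is $(w^Tx_i - 1)^2$
- Over the whole training set, with $y_i \in \{-1, 1\}$:

$L(w) = \frac{1}{l}\sum_{i=1}^{l} (w^Tx_i - y_i)^2$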
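Both measures from this section can be compared on a small example. This is a sketch under stated assumptions: labels are in $\{-1, 1\}$, the squared loss treats the labels as regression targets for the raw score $w^Tx_i$, and the function names and data are made up.

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Fraction of examples with a(x_i) == y_i."""
    return np.mean(y_pred == y_true)

def squared_loss(w, X, y):
    """(1/l) * sum_i (w^T x_i - y_i)^2, with y_i in {-1, +1}."""
    return np.mean((X @ w - y) ** 2)

# Toy data (made-up): 4 examples, d = 2 features.
X = np.array([[2.0, 1.0],
              [-1.0, 0.5],
              [0.0, -3.0],
              [1.0, -0.5]])
y = np.array([1, -1, -1, -1])
w = np.array([1.0, 1.0])

scores = X @ w                          # 3.0, -0.5, -3.0, 0.5
y_pred = np.where(scores >= 0, 1, -1)   # sign rule, ties go to +1

acc = accuracy(y_pred, y)
sq = squared_loss(w, X, y)
print(acc, sq)
```

In this toy example the first point is classified correctly, yet its score of 3.0 still contributes $(3 - 1)^2 = 4$ to the squared loss, so the two measures do not always agree on what counts as a good prediction.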



# Logistic regression

- Class scores (**logits**) from a linear model
- How it works:
    - $z = (w_1^Tx, \dots, w_k^Tx)$, one scalar score per class
    - Apply the exponential to each score: $(e^{z_1}, \dots, e^{z_k})$
    - Normalize this vector by the sum of its elements (**softmax transform**): $\sigma(z) = \left(\frac{e^{z_1}}{\sum_{j=1}^{k} e^{z_j}}, \dots, \frac{e^{z_k}}{\sum_{j=1}^{k} e^{z_j}}\right)$
       - The softmax transform yields a probability distribution over all classes: every entry is positive and the entries sum to 1.
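The steps above can be sketched in NumPy as follows. The weight matrix `W` and input `x` are made-up values. Subtracting the maximum logit before exponentiating is a common numerical-stability trick that is not part of the notes; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import numpy as np

def softmax(z):
    """Map a vector of logits to a probability distribution."""
    shifted = z - np.max(z)   # stability trick (assumption, not in the notes)
    exp_z = np.exp(shifted)   # exponentiate each score
    return exp_z / np.sum(exp_z)  # normalize so the entries sum to 1

# Toy setup (made-up): k = 3 classes, d = 2 features.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
x = np.array([2.0, 0.5])

z = W @ x            # logits: one score w_j^T x per class
probs = softmax(z)   # class probabilities
print(z, probs, probs.sum())
```

The class with the largest logit also receives the largest probability, so taking the arg max of `probs` gives the same prediction as taking the arg max of the raw scores.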
    
