# Supervised Learning
---


A supervised learning (SL) algorithm learns a mapping from an input $x$ to an output $y$. The key idea of SL is that you give the learning algorithm examples to learn from: data labeled with the "right answers".

Each training example consists of an input $x$ and its desired output label $y$.

**Examples:**

| Input (X) | Output (Y) | Application |
| --- | --- | --- |
| Email | spam? (0/1) | spam filtering |
| Audio | text transcripts | speech recognition |
| English | Spanish | machine translation |
| ad, user info | click? (0/1) | online advertising |
| image of phone | defect? (0/1) | visual inspection |

# Algorithms

## Linear Regression

**Model:**
$a(x)= b + w_1x_1+\cdots+w_dx_d$

- $w_1,\dots,w_d$ - coefficients (weights)
- $b$ - bias
- $d+1$ parameters.
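A minimal sketch of the model above in NumPy. The weights, bias, and input are hypothetical values chosen only for illustration ($d=3$).

```python
import numpy as np

def linear_model(x, w, b):
    """Return a(x) = b + w_1*x_1 + ... + w_d*x_d."""
    return b + np.dot(w, x)

# Hypothetical values for d = 3 features (not from the course material).
w = np.array([0.5, -1.0, 2.0])
b = 0.25
x = np.array([1.0, 2.0, 3.0])

prediction = linear_model(x, w, b)   # 0.25 + 0.5 - 2.0 + 6.0
n_parameters = w.size + 1            # d weights plus one bias
```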

  <center><img src="Linear_model.png." width="350" height="280"></center>  
    
    
**Loss Function:**
How to measure model quality?
    
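*Mean squared error:*
    $$L(w)=\frac{1}{\ell}\sum_{i=1}^{\ell}(w^Tx_i-y_i)^2=\frac{1}{\ell}\|Xw-y\|^2$$

Here $\ell$ is the number of training examples, $X$ is the matrix whose rows are $x_i^T$, and the bias $b$ is absorbed into $w$ by appending a constant feature equal to $1$ to every $x_i$.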
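The two forms of the mean squared error (sum over examples and matrix norm) can be compared numerically. This is a small sketch on made-up data; `X`, `y`, and `w` are illustrative values, not taken from the course.

```python
import numpy as np

def mse(X, w, y):
    """Mean squared error: average of (w^T x_i - y_i)^2 over all examples."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

# Toy data: 3 examples, 2 features (illustrative values only).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 1.0])

loss = mse(X, w, y)                                   # residuals are [0, -1, -1]
loss_norm = np.linalg.norm(X @ w - y) ** 2 / len(y)   # same value via the norm form
```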
    
*Fitting a model to training data:*
 
$$\min_{w}L(w) \rightarrow w=(X^TX)^{-1}X^Ty$$
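The closed form comes from setting the gradient $\frac{2}{\ell}X^T(Xw-y)$ to zero, which gives $X^TXw=X^Ty$. A sketch of fitting with this formula on synthetic, noiseless data follows; it assumes $X^TX$ is invertible. Solving the linear system is used here instead of forming the inverse explicitly, and `np.linalg.lstsq` is shown as an alternative that also handles the non-invertible case.

```python
import numpy as np

# Synthetic data (assumption for illustration): y is an exact linear function of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

# Normal equations: solve (X^T X) w = X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, usually preferred for numerical stability.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both estimates should recover `w_true` up to floating-point error because the data contains no noise.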

## Linear Classification

*Binary classification:*
- $y\in \{-1, 1\}$
- $a(x)=\text{sign }(w^Tx)$
- Number of parameters: $d$ $(w\in \mathbb{R}^d)$.
- Model without bias.
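A short sketch of the binary classifier $a(x)=\text{sign}(w^Tx)$. The weight vector and inputs are hypothetical. Note that `np.sign` returns `0` when the score is exactly zero, a case the rule above leaves undefined.

```python
import numpy as np

def predict_binary(X, w):
    """Return sign(w^T x) for every row x of X; labels are in {-1, +1}."""
    return np.sign(X @ w)

# Hypothetical weights and two inputs (d = 2).
w = np.array([1.0, -2.0])
X = np.array([[3.0, 1.0],    # score 3 - 2 = 1
              [1.0, 2.0]])   # score 1 - 4 = -3

labels = predict_binary(X, w)
```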

<center><img src="linear_md_classif.png" width="350" height="280"> </center>

*Multiclass classification:*
- $y\in \{1,\dots,K\}$
- $a(x)=\arg\max_{k\in\{1,\dots,K\}} (w_k^Tx)$
- Number of parameters: $K\cdot d$ $(w_k\in \mathbb{R}^d)$.
- *Example:* for scores $z=(7,-7.5,10)$, the prediction is $a(x)=3$
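A sketch of the multiclass rule, assuming the $K$ weight vectors are stacked as rows of a matrix `W` (a layout chosen here for convenience). Classes are numbered from 1, so 1 is added to NumPy's zero-based `argmax`. The matrix `W` and input `x` are made-up values; the score vector `z` is the one used in the softmax example of these notes.

```python
import numpy as np

def predict_multiclass(W, x):
    """Return the class k in 1..K with the largest score w_k^T x."""
    scores = W @ x
    return int(np.argmax(scores)) + 1

# Hypothetical weights: K = 3 classes, d = 2 features, so K * d = 6 parameters.
W = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [3.0, 1.0]])
x = np.array([2.0, 1.0])              # scores are (4, -1, 7)
predicted = predict_multiclass(W, x)

# Applying the same rule directly to a score vector.
z = np.array([7.0, -7.5, 10.0])
class_from_scores = int(np.argmax(z)) + 1
```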

### Classification loss:

- Classification accuracy:
$$\frac{1}{\ell} \sum_{i=1}^{\ell}\left[a\left(x_{i}\right)=y_{i}\right]$$
    - Not differentiable
    - Doesn't assess model confidence
    - $[P]$ - Iverson bracket:
    \begin{align*}
    [P]= \begin{cases}1, & P \text { is true } \\ 0, & P \text { is false }\end{cases}
    \end{align*}
- Mean squared error:

    Consider an example $x_i$ such that $y_i=1$, and let $r=w^Tx_i$ be the model's score.
    Squared loss: $(w^Tx_i-1)^2$   
   - If $r=1$, then the loss is zero.
   - If $r\in (0,1)$, the example is classified correctly but the model is not confident in its decision, so we penalize it for ***low confidence***.
   - If $r\in (-\infty,0]$, the example is not classified correctly, so we give it a larger penalty.
   - If $r\in (1,\infty)$, the example is classified correctly, yet the squared loss still penalizes it for ***high confidence***, which is undesirable for classification.
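Both quantities from this list can be computed in a few lines. The sketch below uses made-up predictions and scores. It shows that for a target of $1$ the squared loss penalizes a confident correct score of $3$ exactly as much as a wrong score of $-1$.

```python
import numpy as np

def accuracy(predictions, targets):
    """Fraction of examples with a(x_i) = y_i (mean of the Iverson brackets)."""
    return np.mean(predictions == targets)

def squared_loss(score, target=1.0):
    """Squared loss (w^T x_i - y_i)^2 for a given score w^T x_i."""
    return (score - target) ** 2

# Made-up predictions and labels: two out of four are correct.
preds = np.array([1, -1, 1, 1])
targets = np.array([1, 1, 1, -1])
acc = accuracy(preds, targets)

# Scores for examples whose true label is y_i = 1.
scores = np.array([1.0, 0.5, -1.0, 3.0])
losses = squared_loss(scores)
```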
   
   
<center><img src="Penalties.png" width="350" height="280"> </center>

## Logistic Regression

**Class probabilities:**

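Class scores (logits) from a linear model:

$$z=(w_1^Tx,\dots,w_K^Tx)$$

Convert the scores to a probability distribution:

1. Exponentiate so that every component is positive: $(e^{z_1},\cdots,e^{z_K})$
2. Normalize so that the components sum to one. The result is the softmax transform (its output is a probability distribution):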
$$\sigma(z)=\left( \frac{e^{z_1}}{\sum e^{z_k}},\cdots,\frac{e^{z_K}}{\sum e^{z_k}} \right)$$

*Example:* $z=(7,-7.5,10)$, $\sigma(z)\approx (0.05,0,0.95)$ 
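The example above can be reproduced with a short softmax sketch. Subtracting the maximum score before exponentiating is a common numerical-stability trick added here; it does not change the result because the shift cancels in the ratio.

```python
import numpy as np

def softmax(z):
    """Map a score vector z to probabilities e^{z_k} / sum_j e^{z_j}."""
    shifted = z - np.max(z)      # avoids overflow for large scores
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([7.0, -7.5, 10.0])
probs = softmax(z)               # components are positive and sum to 1
```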

**Predicted class probabilities (model output):**
$$\sigma(z)=\left( \frac{e^{z_1}}{\sum e^{z_k}},\cdots,\frac{e^{z_K}}{\sum e^{z_k}} \right)$$

- *Target values* for class probabilities:
$$p=([y=1],\dots,[y=K])$$

- *Measure the distance between probability distributions* 
    - Cross-entropy: Given a target distribution $P$ and an approximation $Q$ of it, the cross-entropy of $Q$ relative to $P$ is the average number of bits needed to represent an event drawn from $P$ when using a code optimized for $Q$ instead of $P$.
    
    $$H(P,Q)=-\sum_{x\in X}P(x)\log{(Q(x))}$$
    
    *Note:* With the base-$2$ logarithm the result is measured in bits. With the base-$e$ (natural) logarithm the result is measured in units called nats. The numerical example below uses the natural logarithm.
    
    - Similarity between $\sigma$ and $p$ can be measured by the cross-entropy:
    $$H(p,\sigma)=-\sum_{k=1}^{K}[y=k] \log \frac{e^{z_{k}}}{\sum_{j=1}^{K} e^{z_{j}}}=-\log \frac{e^{z_{y}}}{\sum_{j=1}^{K} e^{z_{j}}}$$
    
        Example: Suppose $K=3$ and $y=1$, so the target is $p=(1,0,0)$: 

        - $\sigma=(1,0,0)$, then $H(p,\sigma)=0$
        - $\sigma=(0.5,0.25,0.25)$, then $H(p,\sigma)\approx 0.693$
        - $\sigma=(0,1,0)$, then $H(p,\sigma)= +\infty$
        
    *Note:* Cross-entropy assigns a high penalty to models that are confident in wrong decisions.
    
    - Cross-entropy is differentiable and can be used as a loss function. 
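The cross-entropy values for $K=3$ and $y=1$ can be checked numerically. This is a sketch using the natural logarithm; the helper names `one_hot` and `cross_entropy` are chosen here for illustration. Terms with $p_k=0$ are skipped so that $0\cdot\log 0$ is treated as $0$.

```python
import numpy as np

def one_hot(y, K):
    """Target distribution p = ([y=1], ..., [y=K]) for a label y in 1..K."""
    p = np.zeros(K)
    p[y - 1] = 1.0
    return p

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                          # terms with p_k = 0 contribute nothing
    with np.errstate(divide="ignore"):    # let log(0) evaluate to -inf quietly
        return float(-np.sum(p[mask] * np.log(q[mask])))

p = one_hot(1, 3)                                   # (1, 0, 0)
h_perfect = cross_entropy(p, [1.0, 0.0, 0.0])       # correct and fully confident
h_unsure = cross_entropy(p, [0.5, 0.25, 0.25])      # correct but not confident
h_wrong = cross_entropy(p, [0.0, 1.0, 0.0])         # fully confident in a wrong class
```

The three results are zero, $\log 2 \approx 0.693$, and infinity respectively.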
    
$$
\begin{aligned}
L(w) &=-\sum_{i=1}^{\ell} \sum_{k=1}^{K}\left[y_{i}=k\right] \log \frac{e^{w_{k}^{T} x_{i}}}{\sum_{j=1}^{K} e^{w_{j}^{T} x_{i}}} \\
&=-\sum_{i=1}^{\ell} \log \frac{e^{w_{y_{i}}^{T} x_{i}}}{\sum_{j=1}^{K} e^{w_{j}^{T} x_{i}}} \rightarrow \min _{w}
\end{aligned}
$$
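A sketch of this loss over a whole dataset, assuming the weight vectors $w_k$ are stacked as rows of a matrix `W` and labels take values in $1,\dots,K$. The data and weights are made up. The log-probabilities are computed with a max shift for numerical stability, which leaves the value of the loss unchanged.

```python
import numpy as np

def softmax_cross_entropy_loss(W, X, y):
    """Sum over examples of -log softmax(W x_i)[y_i], labels y_i in 1..K."""
    logits = X @ W.T                                       # shape (l, K)
    logits = logits - logits.max(axis=1, keepdims=True)    # stability shift
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(y)), y - 1].sum())

# Toy data (illustrative): 3 examples, 3 features, one example per class.
X = np.eye(3)
y = np.array([1, 2, 3])

W_zero = np.zeros((3, 3))        # all classes equally likely
W_good = 5.0 * np.eye(3)         # large score for the correct class

loss_uniform = softmax_cross_entropy_loss(W_zero, X, y)   # equals 3 * log(3)
loss_good = softmax_cross_entropy_loss(W_good, X, y)      # much smaller, but positive
```

In practice this loss is minimized with a gradient-based method, since no closed-form solution like the one for linear regression is available.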


# Comments

- A common observation in machine learning is that models which classify the training examples correctly with high confidence (a large margin) tend to generalize better.

# References

[1] Introduction to Deep Learning - HSE University - Coursera course.

[2] Supervised Machine Learning: Regression and Classification - DeepLearning.AI, Stanford University - Coursera course.

[3] https://machinelearningmastery.com/cross-entropy-for-machine-learning/