# Supervised Learning


## Definitions and Notations

* $x^{(i)}$ denotes the input variables (features); the space of all such $x^{(i)}$ is $X$.
* $y^{(i)}$ denotes the output (target) variable that we are trying to predict; the space of all such $y^{(i)}$ is $Y$.
* A pair $(x^{(i)},y^{(i)})$ is called a training example, and the dataset we use to learn, a list of $n$ training examples $\{(x^{(i)},y^{(i)});i=1,2,...,n\}$, is called a training set.
* The superscript $(i)$ in the notation is simply an index into the training set.

## Learning Problem


* The goal is, given a training set, to learn a function $h: X \to Y$ so that $h(x)$ is a “good” predictor for the corresponding value of $y$. For historical reasons, this function $h$ is called a hypothesis.

* If the target variable is continuous, the learning problem is **regression**.
* When $y$ can take on only a small number of discrete values, the learning problem is **classification**.

# Linear Regression

* We need to choose a form for the hypothesis $h$. An initial choice is to approximate $y$ as a linear function of $x$: $$ h_{\theta}(x)=\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}$$ (here with only two features).

* The $\theta_{i}$'s are the **parameters** (also called **weights**) parametrizing the space of linear functions mapping from $X$ to $Y$.
* Using the convention $x_{0}=1$ (the **intercept term**), we can write $$h(x)=\sum_{i=0}^{d}\theta_{i}x_{i}=\theta^{T}x,$$
where $\theta$ and $x$ are both viewed as vectors and $d$ is the number of input variables (not counting $x_{0}$).

* Given a training set, we have to pick, or learn, the parameters $\theta$ so that we can predict $y$.
* One reasonable method is to make $h(x)$ close to $y$, at least for the training examples. To formalize this, we define a function that measures, for each value of the $\theta$'s, how close the $h(x^{(i)})$'s are to the corresponding $y^{(i)}$'s.

* We define the **cost (loss) function**: $$J(\theta)=\frac{1}{2}\sum_{i=1}^{n}(h_{\theta}(x^{(i)})-y^{(i)})^{2},$$ this is the least-squares cost function.
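The hypothesis and the cost function can be written down almost literally in code. The following is a minimal sketch in Python with NumPy; the function names `hypothesis` and `cost` and the toy data are assumptions made for illustration, and each input vector is assumed to already contain the intercept entry $x_{0}=1$.

```python
import numpy as np

def hypothesis(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x, where x already includes x_0 = 1."""
    return float(np.dot(theta, x))

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        total += (hypothesis(theta, x_i) - y_i) ** 2
    return 0.5 * total

# Hypothetical toy data: three examples, one feature, first column is x_0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

theta_exact = np.array([0.0, 2.0])  # y = 2 * x_1 fits the toy data exactly
theta_zero = np.array([0.0, 0.0])   # predicts 0 for every example

J_exact = cost(theta_exact, X, y)
J_zero = cost(theta_zero, X, y)
print(J_exact, J_zero)
```

A parameter vector that reproduces every target gives a cost of zero, while a worse fit gives a larger cost, which is what makes $J$ usable as a measure of closeness.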



## LMS Algorithm

* Given:  $(x^{(i)},y^{(i)}),$ $i=1,...,n$
* Minimize: $J(\theta)=\frac{1}{2}\sum_{i=1}^{n}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$

* We want to choose $\theta$ so as to minimize $J(\theta)$.

* We can use the gradient descent algorithm that starts with some “initial guess” for $\theta$, and that repeatedly changes $\theta$ to make $J(\theta)$ smaller, until hopefully we converge to a value of
$\theta$ that minimizes $J(\theta)$.

* The **gradient descent** algorithm starts from an initial $\theta$ and repeatedly performs the update $$\theta_{j}:=\theta_{j}-\alpha \frac{\partial J( \theta)}{\partial \theta_{j}} ,$$ until some stopping condition is met.
This update is performed simultaneously for all values of $j = 0, \ldots, d$.

* $\alpha$ is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$.

* To implement the algorithm, we need the partial derivative that appears in the update. For a single training example $(x,y)$, so that the sum in $J$ has only one term, we have: $$\frac{\partial J( \theta)}{\partial \theta_{j}}=\frac{\partial}{\partial \theta_{j}}\frac{1}{2}(h_{\theta}(x)-y)^{2}$$
$$=2\cdot\frac{1}{2}(h_{\theta}(x)-y)\cdot\frac{\partial}{\partial \theta_{j}}(h_{\theta}(x)-y)$$
$$=(h_{\theta}(x)-y)\cdot\frac{\partial}{\partial \theta_{j}} \left(\sum _{i=0}^{d}\theta_{i}x_{i}-y\right)$$
$$=(h_{\theta}(x)-y)x_{j}$$

* For a single training example, this gives the update rule: $$\theta_{j}:=\theta_{j}+\alpha(y^{(i)}-h_{\theta}(x^{(i)}))x_{j}^{(i)}.$$ This rule is called the **LMS** (least mean squares) update rule.
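As an illustration of the single-example rule, here is a minimal sketch of one LMS step in Python with NumPy. The function name `lms_update`, the learning rate, and the example values are assumptions chosen for illustration.

```python
import numpy as np

def lms_update(theta, x_i, y_i, alpha):
    """One LMS step on a single example: theta_j += alpha * (y - h(x)) * x_j for all j."""
    error = y_i - np.dot(theta, x_i)    # y^(i) - h_theta(x^(i))
    return theta + alpha * error * x_i  # every coordinate j is updated simultaneously

theta = np.array([0.0, 0.0])
x_i = np.array([1.0, 2.0])  # x_0 = 1 (intercept term), x_1 = 2
y_i = 4.0

theta_new = lms_update(theta, x_i, y_i, alpha=0.1)
error_before = abs(y_i - theta @ x_i)
error_after = abs(y_i - theta_new @ x_i)
print(theta_new, error_before, error_after)
```

The size of the step is proportional to the error term $y^{(i)}-h_{\theta}(x^{(i)})$: an example that is already predicted well barely changes the parameters, while a badly predicted one changes them more.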

For a training set with more than one example, the update sums over all examples:
* Repeat until convergence { $$\theta_{j}:=\theta_{j}+\alpha\sum_{i=1}^{n}(y^{(i)}-h_{\theta}(x^{(i)}))x_{j}^{(i)}$$ (for every $j$) }

* Because $J$ is a convex quadratic function, the optimization problem for linear regression has only one global optimum and no other local optima; thus gradient descent always converges to the global minimum (assuming the learning rate $\alpha$ is not too large).
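The full-training-set update can be sketched as follows in Python with NumPy. This is an illustrative sketch under stated assumptions, not a tuned implementation: the function name `batch_gradient_descent` is hypothetical, a fixed number of iterations stands in for "until convergence", and the learning rate and toy data are chosen only so that the run converges.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, num_iters=5000):
    """Repeat theta_j += alpha * sum_i (y^(i) - h(x^(i))) * x_j^(i) for every j."""
    theta = np.zeros(X.shape[1])  # initial guess
    for _ in range(num_iters):
        errors = y - X @ theta                  # y^(i) - h_theta(x^(i)) for every i
        theta = theta + alpha * (X.T @ errors)  # sum over i, all j updated at once
    return theta

# Hypothetical toy data generated by y = 1 + 2 * x_1; first column is x_0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])

theta = batch_gradient_descent(X, y)
print(theta)  # close to the generating parameters [1, 2]
```

If `alpha` is made too large for the data, the iterates can grow without bound instead of settling, which is the caveat about the learning rate in practice.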

# The Matrix Picture



* Given a training set of $n$ examples $(x^{(i)},y^{(i)})$, define the matrix $X$ to be the $n$-by-$d$ matrix ($n$-by-$(d+1)$ if the intercept term is included) that contains the inputs of the training examples in its rows:
$$X = \begin{bmatrix}
     (x^{(1)})^{T} \\
     \vdots \\
     (x^{(n)})^{T}
    \end{bmatrix}$$

* Let $\overrightarrow{y}$ be the $n$-dimensional vector containing the target values from the training set:
$$\overrightarrow{y} = \begin{bmatrix}
     y^{(1)} \\
     \vdots \\
     y^{(n)}
    \end{bmatrix}$$

* Since $h_{\theta}(x^{(i)})=(x^{(i)})^{T}\theta$, we can verify that
$$X \theta - \overrightarrow{y} = \begin{bmatrix}
     (x^{(1)})^{T}\theta \\
     \vdots \\
     (x^{(n)})^{T}\theta
    \end{bmatrix} - \begin{bmatrix}
     y^{(1)} \\
     \vdots \\
     y^{(n)}
    \end{bmatrix}
    = \begin{bmatrix}
     h_{\theta}(x^{(1)})-y^{(1)} \\
     \vdots \\
     h_{\theta}(x^{(n)})-y^{(n)}
    \end{bmatrix}$$

* Using the fact that $z^{T}z=\sum_{i}z_{i}^{2}$ for any vector $z$, $$\frac{1}{2}(X\theta-\overrightarrow{y})^{T}(X\theta-\overrightarrow{y})=\frac{1}{2} \sum_{i=1}^{n}(h_{\theta}(x^{(i)})-y^{(i)})^{2}=J(\theta),$$ which is the cost function.
* To minimize $J$, we have to find its derivatives with respect to $\theta$.
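The equality between the matrix form and the sum form of the cost can be checked numerically. The sketch below uses Python with NumPy; the function names `cost_sum` and `cost_matrix` and the toy values are assumptions for illustration, with one row of `X` per training example and the intercept column included.

```python
import numpy as np

def cost_sum(theta, X, y):
    """J(theta) as a sum over training examples."""
    return 0.5 * sum((x_i @ theta - y_i) ** 2 for x_i, y_i in zip(X, y))

def cost_matrix(theta, X, y):
    """J(theta) as 1/2 * (X theta - y)^T (X theta - y)."""
    residual = X @ theta - y
    return 0.5 * (residual @ residual)

# Hypothetical toy data: rows of X are (x^(i))^T with x_0 = 1 in the first column.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.5, -1.0])

J_sum = cost_sum(theta, X, y)
J_mat = cost_matrix(theta, X, y)
print(J_sum, J_mat)
```

Both functions compute the same quantity; the matrix form avoids the explicit loop over examples.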




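* Some matrix derivative results, where $\operatorname{tr}$ denotes the trace and $\vert A\vert$ the determinant of a square matrix:
  $$\nabla_{A}\operatorname{tr} AB = B^{T}$$
  $$\nabla_{A}\vert A\vert=\vert A\vert(A^{-1})^{T}$$
  $$\nabla_{A^{T}}f(A)= (\nabla_{A}f(A))^{T}$$
  $$\nabla_{A}\operatorname{tr}ABA^{T}C=CAB+C^{T}AB^{T}$$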


* Combining the last two results gives
 $$\nabla_{A^{T}}\operatorname{tr}ABA^{T}C=B^{T}A^{T}C^{T}+BA^{T}C$$
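Identities like these are easy to get wrong by a transpose, so a numerical check can be reassuring. The sketch below, in Python with NumPy, compares two of the results against a central finite-difference estimate of the gradient. The helper `numerical_gradient`, the matrix shapes, and the random test matrices are assumptions for illustration.

```python
import numpy as np

def numerical_gradient(f, A, eps=1e-6):
    """Central-difference estimate of the matrix of partial derivatives df/dA_ij."""
    grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            step = np.zeros_like(A)
            step[i, j] = eps
            grad[i, j] = (f(A + step) - f(A - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))   # AB is 3x3, so tr(AB) is defined
B2 = rng.standard_normal((4, 4))  # A B2 A^T C is 3x3
C = rng.standard_normal((3, 3))

# Check grad_A tr(AB) = B^T
grad_trace_ab = numerical_gradient(lambda M: np.trace(M @ B), A)
ok_trace_ab = np.allclose(grad_trace_ab, B.T, atol=1e-5)

# Check grad_A tr(A B A^T C) = C A B + C^T A B^T
grad_trace_abac = numerical_gradient(lambda M: np.trace(M @ B2 @ M.T @ C), A)
ok_trace_abac = np.allclose(grad_trace_abac, C @ A @ B2 + C.T @ A @ B2.T, atol=1e-5)

print(ok_trace_ab, ok_trace_abac)
```

A check like this does not prove the identities, but a mismatch would immediately reveal a misplaced transpose.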
 


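* The cost function is $$J(\theta)=\frac{1}{2}(X\theta-\overrightarrow{y})^{T}(X\theta-\overrightarrow{y}),$$ so its gradient is
 $$\nabla_{\theta} J(\theta)=\nabla_{\theta}\frac{1}{2}(X\theta-\overrightarrow{y})^{T}(X\theta-\overrightarrow{y})$$
 $$=\frac{1}{2}\nabla_{\theta}(\theta^{T}X^{T}X\theta-\theta^{T}X^{T}\overrightarrow{y}-\overrightarrow{y}^{T}X\theta+\overrightarrow{y}^{T}\overrightarrow{y})$$
 $$=\frac{1}{2}\nabla_{\theta} \operatorname{tr}(\theta^{T}X^{T}X\theta-\theta^{T}X^{T}\overrightarrow{y}-\overrightarrow{y}^{T}X\theta+\overrightarrow{y}^{T}\overrightarrow{y})$$
 $$=\frac{1}{2}\nabla_{\theta}(\operatorname{tr} \theta^{T}X^{T}X\theta-2\operatorname{tr}\overrightarrow{y}^{T}X\theta)$$
 $$=\frac{1}{2}(X^{T}X\theta+X^{T}X\theta-2X^{T}\overrightarrow{y})$$
 $$=X^{T}X\theta-X^{T}\overrightarrow{y}$$
 The third line uses the fact that a real number equals its own trace. The fourth uses $\operatorname{tr}A=\operatorname{tr}A^{T}$, so the two middle terms are equal, and drops $\overrightarrow{y}^{T}\overrightarrow{y}$, which does not depend on $\theta$. The fifth applies the result for $\nabla_{A^{T}}\operatorname{tr}ABA^{T}C$ with $A^{T}=\theta$, $B=B^{T}=X^{T}X$ and $C=I$, together with $\nabla_{A}\operatorname{tr}AB=B^{T}$.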
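The result $X^{T}X\theta-X^{T}\overrightarrow{y}$ should agree, coordinate by coordinate, with the sums $\sum_{i}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}$ that appear in the gradient descent update. A small sketch in Python with NumPy compares the two; the function names and the toy values are assumptions for illustration.

```python
import numpy as np

def gradient_sum(theta, X, y):
    """dJ/dtheta_j = sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i), one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        for x_i, y_i in zip(X, y):
            grad[j] += (x_i @ theta - y_i) * x_i[j]
    return grad

def gradient_matrix(theta, X, y):
    """Gradient in matrix form: X^T X theta - X^T y."""
    return X.T @ X @ theta - X.T @ y

# Hypothetical toy data with the intercept column x_0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.5, -1.0])

g_sum = gradient_sum(theta, X, y)
g_mat = gradient_matrix(theta, X, y)
print(g_sum, g_mat)
```

The matrix form computes all $d+1$ partial derivatives in one expression, which is why it is the form usually used in vectorized code.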
 
 
 
 
 * To minimize $J$, we set its derivatives to zero and obtain the **normal equations**:
 $$X^{T}X\theta=X^{T}\overrightarrow{y}$$
 
 
 
 * Thus, provided $X^{T}X$ is invertible, the value of $\theta$ that minimizes $J(\theta)$ is given in closed form by the equation
 $$\theta=(X^{T}X)^{-1}X^{T}\overrightarrow{y}$$
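The closed-form solution can be sketched in a few lines of Python with NumPy. The function name `normal_equation` and the toy data are assumptions for illustration, and the sketch assumes $X^{T}X$ is invertible. It solves the linear system $X^{T}X\theta=X^{T}\overrightarrow{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical practice and gives the same $\theta$ in exact arithmetic.

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y for theta (assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data generated by y = 1 + 2 * x_1; first column is x_0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])

theta = normal_equation(X, y)
gradient_at_theta = X.T @ X @ theta - X.T @ y  # should vanish at the minimizer
print(theta, gradient_at_theta)
```

Unlike gradient descent, this needs no learning rate and no iteration, at the price of solving a $(d+1)$-by-$(d+1)$ linear system.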