### Probabilistic Interpretation of Linear Regression

- Why is minimizing the cost $J$ sensible?
- Assume $Y = \theta_0 + \theta_1x_1 + \eta$, where $\eta$ is a random variable drawn from a Gaussian distribution with mean 0 and variance $\sigma^2$.

- Since $Y$ is determined in part by a random quantity, $Y$ is also random. We use a property of Gaussian variables:
- If $Z$ is Gaussian with mean $0$ and variance $1$, then $W = a + bZ$ is Gaussian with mean $a$ and variance $b^{2}$. Proof of the mean and variance (that $W$ is itself Gaussian follows because an affine transformation of a Gaussian variable is Gaussian):
We have $E[W] = E[a + bZ] = E[a] + E[bZ] = a + bE[Z] = a$

and $Var[W] = E[(W - E[W])^2] = E[(a + bZ - E[a + bZ])^2] = E[b^2Z^2] = b^2E[Z^2]$.

Since $Var[Z] = E[Z^2] - (E[Z])^2$ and $E[Z] = 0$, we get $E[Z^2] = Var[Z] = 1$ and $Var[W] = b^2$.
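The property can also be checked empirically. The following is a minimal sketch, assuming NumPy is available; the constants `a` and `b`, the sample size, and the seed are arbitrary illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed, chosen arbitrarily for reproducibility

a, b = 2.0, 3.0                    # illustrative constants (assumptions)
z = rng.standard_normal(200_000)   # samples of Z with mean 0 and variance 1
w = a + b * z                      # W = a + bZ

sample_mean = w.mean()             # should be close to a
sample_var = w.var()               # should be close to b**2
print(sample_mean, sample_var)
```

With this many samples, the sample mean and variance should land close to $a$ and $b^2$, though never exactly on them.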

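- Thus, if $\eta$ is normal with mean $0$ and variance $\sigma^2$, then for a fixed input $x_1$, $Y$ is also normal with mean $\theta_0 + \theta_1x_1$ and variance $\sigma^2$ (take $a = \theta_0 + \theta_1x_1$ and $b = \sigma$, writing $\eta = \sigma Z$).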

- Then, we can write the likelihood of one training example using the normal density. If we assume that the examples are independent and that the noise is normally distributed (crucial assumptions), then the negative log-likelihood of the training set is $\frac{1}{2}\left[\frac{1}{\sigma^2} \sum_{n}[y_{n} - (\theta_0 + \theta_1x_n)]^2 + N\log(\sigma^2)\right]$ up to an additive constant. As a function of $\theta$, and ignoring constants and the positive scale factor, this is the same as our cost function $J$ from the previous lecture.
- So once again, minimizing the cost $J = \sum_{n} (y_n - \theta^Tx_n)^2$ is the same as maximizing the log-likelihood.
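The step from the Gaussian assumption to the squared-error cost can be written out explicitly. This is a sketch for the single-feature model above, where $\mu_n = \theta_0 + \theta_1 x_n$ is shorthand introduced here for convenience (it does not appear in the lecture):

$$
\begin{aligned}
p(y_n \mid x_n; \theta) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_n - \mu_n)^2}{2\sigma^2}\right) \\
\log \prod_{n=1}^{N} p(y_n \mid x_n; \theta) &= -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n - \mu_n)^2 - \frac{N}{2}\log(\sigma^2) - \frac{N}{2}\log(2\pi)
\end{aligned}
$$

Only the first term depends on $\theta$, and it is $J$ multiplied by the negative constant $-\frac{1}{2\sigma^2}$, so the $\theta$ that maximizes the log-likelihood is the one that minimizes $J$.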

### Why does the probabilistic interpretation help us? 

- It makes the assumptions behind linear regression explicit: the data are independent and normally distributed, with a mean that is linear in the features and a constant variance. Other learning algorithms make fewer or different assumptions.

- Multivariate linear regression: 
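- We have a design matrix $ X = \begin{bmatrix}
    x_1^T \\
    x_2^T \\
    \vdots \\
    x_N^T \\ 
    \end{bmatrix} \in \mathbb{R}^{N \times (D + 1)}$, where each row is a training example and each column is a feature (the extra column is the constant feature 1 that pairs with the intercept $\theta_0$).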

- We can define the vector of errors as $y - X\Theta$ and the cost as the squared L2-norm of this vector: $J(\Theta) = || y - X\Theta||^{2} = (y - X\Theta)^T(y - X\Theta) = \sum_{n}(y_n - x_n^{T}\Theta)^2$. Expanding, we can write $J(\Theta) = C - 2y^TX\Theta + \Theta^TX^TX\Theta$, where $C = y^Ty$ does not depend on $\Theta$. The gradient is $\frac{\partial J}{\partial \Theta} = 2X^TX\Theta - 2X^Ty$; setting it to zero and solving for $\Theta$ gives $\Theta = (X^TX)^{-1}X^Ty$, provided $X^TX$ is invertible.

- Even though we have an explicit solution to linear regression, computing it costs $O(ND^2)$ for the matrix multiplication $X^TX$ and $O(D^3)$ for the matrix inversion, so gradient descent may be more efficient when $D$ is large.
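The closed-form solution can be sketched in a few lines of NumPy. The data here are synthetic, and `true_theta`, the sizes, and the noise level are assumptions made purely for illustration. The sketch solves the linear system $(X^TX)\Theta = X^Ty$ rather than forming the inverse explicitly, which is a common numerical choice and not something the lecture prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 200, 3                                   # illustrative sizes (assumptions)
features = rng.normal(size=(N, D))
X = np.hstack([np.ones((N, 1)), features])      # design matrix, shape (N, D + 1)

true_theta = np.array([1.0, 2.0, -3.0, 0.5])    # hypothetical ground-truth weights
y = X @ true_theta + 0.1 * rng.normal(size=N)   # targets with Gaussian noise

# Normal equations: solve (X^T X) theta = X^T y for theta.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)
```

With enough data and small noise, `theta_hat` should be close to `true_theta`.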

### Gradient descent for linear regression:

t = 0

initialize $\Theta^0$

repeat:

  $\frac{\partial J}{\partial \Theta} = X^TX\Theta^t - X^Ty$
    
  $\Theta^{t + 1} = \Theta^t - \eta \frac{\partial J}{\partial \Theta}$
  
  $ t = t + 1 $

until convergence.

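Here $\eta$ is the step size (learning rate), not the noise term from earlier, and the constant factor of 2 in the gradient has been absorbed into it. Each iteration of gradient descent is $O(ND)$ when the gradient is computed as $X^T(X\Theta - y)$, since it requires the entire training set.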
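The loop above might look like the following in NumPy. This is a sketch under stated assumptions: the function name `gradient_descent`, the default step size, the stopping tolerance, and the choice to average the gradient over the $N$ examples are all illustrative and not taken from the lecture.

```python
import numpy as np

def gradient_descent(X, y, step_size=0.1, max_iters=10_000, tol=1e-10):
    """Batch gradient descent for least squares (hypothetical helper)."""
    num_examples, num_features = X.shape
    theta = np.zeros(num_features)              # Theta^0
    for _ in range(max_iters):
        # Full-batch gradient X^T (X theta - y), averaged over N. The averaging
        # is an assumption made so the step size does not scale with N.
        grad = X.T @ (X @ theta - y) / num_examples
        new_theta = theta - step_size * grad
        converged = np.linalg.norm(new_theta - theta) < tol
        theta = new_theta
        if converged:                           # "until convergence"
            break
    return theta

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=100)

theta_gd = gradient_descent(X, y)
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)   # reference solution
print(theta_gd, theta_exact)
```

On a well-conditioned problem like this one, the iterates should agree with the least-squares solution to within the chosen tolerance.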

### Stochastic Gradient Descent

- A more efficient algorithm is stochastic gradient descent. Instead of computing the gradient across all of the training examples, we randomly select one training example and compute that example's contribution to the overall gradient. We then update our weights with that value. Another variation, minibatch gradient descent, relies on more than one training example per iteration; it uses batches drawn randomly from the training set. 
 
 SGD is only $O(D)$ per iteration, but it typically takes more iterations to converge because each update uses a noisy estimate of the full gradient.
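Both variants can be sketched with one function, where `batch_size=1` gives plain SGD and a larger value gives minibatch gradient descent. The function name `minibatch_sgd`, the constant step size, the fixed iteration count, and sampling with replacement are assumptions made for illustration.

```python
import numpy as np

def minibatch_sgd(X, y, step_size=0.02, batch_size=1, num_iters=5000, seed=0):
    """SGD for least squares; batch_size=1 is plain SGD (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    num_examples, num_features = X.shape
    theta = np.zeros(num_features)
    for _ in range(num_iters):
        # Pick a random batch of example indices (sampling with replacement).
        idx = rng.integers(0, num_examples, size=batch_size)
        residual = X[idx] @ theta - y[idx]
        # Gradient estimate from the batch only: O(batch_size * D) work.
        grad = X[idx].T @ residual / batch_size
        theta = theta - step_size * grad
    return theta

rng = np.random.default_rng(1)
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
true_theta = np.array([0.5, -1.0, 2.0])         # hypothetical ground-truth weights
y = X @ true_theta + 0.1 * rng.normal(size=500)

theta_sgd = minibatch_sgd(X, y, batch_size=1)
theta_mini = minibatch_sgd(X, y, batch_size=16)
print(theta_sgd, theta_mini)
```

With a constant step size the iterates do not settle exactly; they hover near the least-squares solution, and larger batches usually reduce that jitter.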
 
 
 ### What happens if $X^TX$ is not invertible?
 
 - Why could this happen? $X^TX$ is singular (determinant $= 0$) when the columns of $X$ are not all linearly independent.
 - If $D + 1 > N$ (more features than data points, i.e. not enough training data), then the matrix $X^TX$ will not be invertible.
 - If two or more columns of $X$ are linearly dependent (for example, two features are perfectly correlated), then $X^TX$ is also not invertible and the solution will not be unique.
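The second case can be reproduced with a small synthetic example. In this sketch the data and the duplicated feature `x2 = 2 * x1` are made up for illustration. It checks the rank of $X$ (which equals the rank of $X^TX$) and shows two different weight vectors that give identical predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x1 = rng.normal(size=N)
x2 = 2.0 * x1                                # perfectly correlated with x1
X = np.column_stack([np.ones(N), x1, x2])    # 3 columns, only 2 independent
y = 1.0 + 3.0 * x1 + 0.1 * rng.normal(size=N)

# rank(X^T X) equals rank(X); a rank below 3 means X^T X is singular.
rank = np.linalg.matrix_rank(X)

# lstsq still returns a least-squares solution (the minimum-norm one).
theta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Moving weight between x1 and x2 leaves predictions unchanged,
# because 2 * x1 - 1 * x2 = 0 for every example.
theta_alt = theta_min_norm + np.array([0.0, 2.0, -1.0])
same_predictions = np.allclose(X @ theta_min_norm, X @ theta_alt)
print(rank, same_predictions)
```

Since both weight vectors fit the data equally well, the least-squares problem has infinitely many minimizers, which is what non-uniqueness means here.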
 
 ### Nonlinear hypotheses 
 
 - If the relationship between the features and the target is not linear, we can use feature engineering to introduce nonlinear features.
 - Let $\phi(x)$ be a function that transforms the feature vector $x$. We define $\phi: \mathbb{R}^{D} \longrightarrow \mathbb{R}^{M}$ so that the target is approximately linear in the transformed features, and then apply our linear algorithms to the transformed features.
- For example, for a scalar input $x$, a polynomial feature map is $\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^m \end{bmatrix}$, which gives the hypothesis function $h_\theta(x) = \theta^T\phi(x)$ and the objective function $\sum_{n} (\theta^T\phi(x_n) - y_n)^2$.

- Through transforming and using nonlinear features, we can do nonlinear regression. 
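As an illustration, a polynomial feature map for scalar inputs can be built with `np.vander` and fit with ordinary least squares. The helper name `polynomial_features`, the quadratic ground truth, and the noise level are assumptions chosen for this sketch.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs x, shape (N,), to rows [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
# Hypothetical nonlinear target: a quadratic in x plus Gaussian noise.
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.05 * rng.normal(size=100)

Phi = polynomial_features(x, degree=2)            # transformed design matrix
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta_hat)
```

The model is still linear in $\theta$, so the same least-squares machinery applies; only the inputs have been transformed.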


    

