# Linear Regression with Multiple Variables

### Multiple Features

Consider the housing price example from week 1. In that example there was a single variable, x, the size of the house in square feet, which was used to predict the price (y) of the home. What if other variables also contribute to the price of the house, e.g. the number of bedrooms, the number of floors, and the age of the home in years?

Notation:
- $n$ = number of features
- $x^{(i)}$ = input (features) of the $i^{th}$ training example. For example, $x^{(2)} = \begin{bmatrix} 1416 \\ 3 \\ 2 \\ 40 \end{bmatrix}$ is the vector of feature values for the 2nd training example.
- $x^{(i)}_j$ = value of the feature *j* of the $i^{th}$ training example.
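
The notation above can be mirrored in a small sketch. It assumes the training set is stored as a list of rows, one per example. Only the second row comes from these notes; the other rows and the helper names `example` and `feature` are made up for illustration.

```python
# Hypothetical training set: each row is one training example with columns
# [size in sq ft, bedrooms, floors, age in years].
# Only the second row appears in the notes; the rest are illustrative values.
X = [
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
    [852, 2, 1, 36],
]

m = len(X)     # number of training examples
n = len(X[0])  # number of features


def example(X, i):
    """Return x^(i), using the notes' 1-based example index i."""
    return X[i - 1]


def feature(X, i, j):
    """Return x^(i)_j, using 1-based indices for both example i and feature j."""
    return X[i - 1][j - 1]


x_2 = example(X, 2)        # the feature vector of the 2nd training example
x_2_3 = feature(X, 2, 3)   # number of floors of the 2nd training example
```

Python lists are zero-indexed, so the helpers subtract 1 to match the 1-based indices used in the notes.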

What is the new hypothesis for multiple features?
- The previous hypothesis for univariate linear regression was $h_\theta(x) = \theta_0+\theta_1x$
- With the four housing features, the hypothesis becomes $h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3+\theta_4x_4$
- The above formula is a weighted sum of the features, here 4 of them. For n features it can be written more generally as $h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_nx_n$
    - for convenience of notation, define $x_0 = 1$, i.e. $x^{(i)}_0 = 1$ for every training example
    - $x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1}$, so x is a zero-indexed feature vector.
    - the parameters can also be written as a vector: $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}$
- $h_\theta(x) = \theta^{T}x$
    - $\theta^{T}x = \begin{bmatrix}\theta_0 & \theta_1 & \cdots & \theta_n\end{bmatrix}\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta_0x_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_nx_n = h_\theta(x)$
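
As a minimal sketch of the vectorized hypothesis, the inner product $\theta^{T}x$ can be computed in plain Python. The function names `add_intercept` and `hypothesis` and the parameter values are assumptions chosen for illustration; they are not taken from the course.

```python
def add_intercept(features):
    """Prepend x_0 = 1 so the vector lines up with theta_0 .. theta_n."""
    return [1.0] + list(features)


def hypothesis(theta, x):
    """Compute h_theta(x) = theta^T x for a zero-indexed x with x[0] == 1."""
    if len(theta) != len(x):
        raise ValueError("theta and x must both have n + 1 entries")
    return sum(t * xj for t, xj in zip(theta, x))


# Hypothetical parameter values, chosen only to illustrate the arithmetic.
theta = [80.0, 0.1, 10.0, 5.0, -2.0]

# Feature vector of the 2nd training example from the notes, with x_0 = 1 added.
x = add_intercept([1416, 3, 2, 40])

# 80*1 + 0.1*1416 + 10*3 + 5*2 - 2*40, roughly 181.6 up to float rounding
prediction = hypothesis(theta, x)
```

Adding $x_0 = 1$ is what lets $\theta_0$ be handled like every other parameter instead of as a special case.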

### Gradient Descent for Multiple Variables

- Think of the parameters $\theta_0, \theta_1, \ldots, \theta_n$ as a single $(n+1)$-dimensional vector $\theta$
- Cost function:
$J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum\limits_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$
    - instead of writing $J(\theta_0, \theta_1, \ldots, \theta_n)$, write $J(\theta)$, a function of the parameter vector $\theta$
- Gradient descent:
    - Repeat until convergence, updating every $\theta_j$ for $j = 0, \ldots, n$ simultaneously: $\theta_j := \theta_j -\alpha\frac{\partial}{\partial\theta_j}J(\theta)$
    - Substituting the partial derivative of $J(\theta)$ gives $\theta_j := \theta_j -\alpha\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}_j$
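
The cost function and the update rule above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the course's own code. The function names, the fixed iteration count used instead of a convergence test, and the tiny data set (generated from $y = 1 + 2x_1 + 3x_2$) are all made up for the example.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta^T x, where x already includes x_0 = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum over i of (h_theta(x^(i)) - y^(i))^2."""
    m = len(X)
    return sum((hypothesis(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def gradient_step(theta, X, y, alpha):
    """Apply one simultaneous update to every theta_j."""
    m = len(X)
    # Errors are computed once from the old theta, so all theta_j are
    # updated simultaneously rather than one after another.
    errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
    return [
        theta[j] - alpha / m * sum(e * xi[j] for e, xi in zip(errors, X))
        for j in range(len(theta))
    ]


def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Run a fixed number of steps (an assumption; no convergence test)."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = gradient_step(theta, X, y, alpha)
    return theta


# Made-up data generated from y = 1 + 2*x_1 + 3*x_2; each row starts with x_0 = 1.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]

theta = gradient_descent(X, y, alpha=0.1, iterations=5000)
```

On this data the learned `theta` should approach `[1, 2, 3]` and `cost(theta, X, y)` should approach zero. The learning rate and iteration count were picked for this toy problem and would need tuning on real data.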


### Gradient Descent in Practice I: Feature Scaling
- If the features are on a similar scale, gradient descent can converge more quickly.
- E.g. $x_1$ = size (0-2000 ft$^2$) and $x_2$ = number of bedrooms (1-5)
    - the contours of the cost function $J(\theta)$ would be skewed, i.e. very tall and thin ovals, because $x_1$ ranges up to 2000 while $x_2$ ranges only up to 5, a large difference in scale.
    - one solution would be to scale the feature values:
        - $x_1 = \frac{\text{size (ft}^2\text{)}}{2000}$
        - $x_2 = \frac{\text{number of bedrooms}}{5}$
- The goal is to get every feature into approximately the $-1 \le x_i \le 1$ range.
- Mean normalization:
    - replace $x_i$ with $x_i - \mu_i$ to make features have approximately zero mean (this does not apply to $x_0 = 1$)
    - e.g. $x_1 = \frac{\text{size} - 1000}{2000}$, $x_2=\frac{\text{number of bedrooms} - 2}{5}$
    - in general, $x_i := \frac{x_i - \mu_i}{s_i}$
    - $\mu_i$ = average value of $x_i$ in the training set
    - $s_i$ = the range of the values, i.e. the max value minus the min value
        - $s_i$ can also be replaced with the standard deviation.
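
Mean normalization as described above can be sketched like this. The function name `mean_normalize`, the `use_std` flag, the choice of the population standard deviation, and the sample data are assumptions made for illustration. The input is expected to hold only the real features, without the constant $x_0 = 1$ column.

```python
def mean_normalize(X, use_std=False):
    """Scale each feature column to (x_i - mu_i) / s_i.

    s_i is the range (max - min) by default, or the population standard
    deviation when use_std is True. Returns (scaled rows, mus, scales).
    """
    m = len(X)
    n = len(X[0])
    mus, scales = [], []
    for j in range(n):
        column = [row[j] for row in X]
        mu = sum(column) / m
        if use_std:
            s = (sum((v - mu) ** 2 for v in column) / m) ** 0.5
        else:
            s = max(column) - min(column)
        if s == 0:
            s = 1.0  # constant feature: avoid division by zero
        mus.append(mu)
        scales.append(s)
    scaled = [[(row[j] - mus[j]) / scales[j] for j in range(n)] for row in X]
    return scaled, mus, scales


# Made-up examples with columns [size in sq ft, bedrooms].
X = [[2000, 5], [1000, 3], [0, 1]]
scaled, mus, scales = mean_normalize(X)
```

The returned `mus` and `scales` should be kept, since any new input has to be scaled with the same values before the learned parameters are applied to it.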

### Gradient Descent in Practice II: Learning Rate