  Machine Learning Online Class
  Exercise 1: Linear regression with multiple variables

# Optional Exercises
If you have successfully completed the previous notebook, congratulations! You
now understand linear regression and should be able to start using it on your
own datasets.
For the rest of this programming exercise, we have included the following
optional exercises. These exercises will help you gain a deeper understanding
of the material, and if you are able to do so, we encourage you to complete
them as well.

In [None]:
using Plots
plotlyjs()

# Linear Regression with Multiple Variables

In this part, you will implement linear regression with multiple variables to
predict the prices of houses. Suppose you are selling your house and you
want to know what a good market price would be. One way to do this is to
first collect information on recent houses sold and make a model of housing
prices.
The file `data/ex1data2.txt` contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the
second column is the number of bedrooms, and the third column is the price
of the house.
This notebook has been set up to help you step through this exercise.


## Feature Normalization
We will start by loading and displaying some values from this dataset. By looking at the values, note that house sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, first performing feature scaling can make gradient descent converge much more quickly.
Your task here is to complete the code in `feature_normalize()` to
- Subtract the mean value of each feature from the dataset.
- After subtracting the mean, additionally scale (divide) the feature values by their respective “standard deviations.”

The standard deviation is a way of measuring how much variation there is in the range of values of a particular feature (most data points will lie within ±2 standard deviations of the mean); this is an alternative to taking the range of values (max-min).
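Written out for feature $j$ of example $i$, the two steps listed above amount to the update below. The symbols $\mu_j$ and $\sigma_j$ (mean and standard deviation of feature $j$ over the training set) are notation introduced here for illustration; they correspond to the `mu` and `sigma` that `feature_normalize()` returns.

$$ x_j^{(i)} := \frac{x_j^{(i)} - \mu_j}{\sigma_j} $$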

You can use the `std()` function to compute the standard deviation. For example, inside `feature_normalize()`,
the quantity `X[:,1]` contains all the values of `x1` (house sizes) in the training set, so `std(X[:,1])` computes the standard deviation of the house sizes.

At the time that `feature_normalize()` is called, the extra column of 1’s corresponding to $x_0 = 1$ has not yet been added to `X` (see `ex1_multi.m` for
details). Normalize every feature in this way; your code should work with datasets of all sizes (any number of features / examples).

Note that the data from the file is loaded into the variable `X_raw`, which is then passed to `feature_normalize()` to compute `X_norm`. After adding the column of ones we finally have `X`. These three variables are kept separate mainly for ease of use in a notebook, since it lets you run each step independently of the others. `X` in the text above may refer to any of these three variables.

Note that each column of the matrix `X` corresponds to one feature.



In [2]:
function feature_normalize(X)
    #   feature_normalize returns a normalized version of X where
    #   the mean value of each feature is 0 and the standard deviation
    #   is 1. This is often a good preprocessing step to do when
    #   working with learning algorithms.
    
    # Instructions: First, compute the mean 'mu' and standard deviation
    #               'sigma' for each dimension of X.
    #               There is a direct way to do this in Julia, check
    #               out the help documentation for std and mean and 
    #               play around with the optional 'region' argument you
    #               can provide to these functions in order to get the
    #               desired result.
    
    #               Next you should subtract 'mu' for each feature and
    #               then divide the features by 'sigma'.
    
    #               Finally return the normalized version of X, mu and 
    #               sigma.
    #
    #               Note that X is a matrix where each column is a 
    #               feature and each row is an example. You need 
    #               to perform the normalization separately for 
    #               each feature.
    
    # ====================== YOUR CODE HERE ======================
    mu = mean(X, 1)
    sigma = std(X, 1)
    X_norm = (X .- mu) ./ sigma
    # ============================================================
    return X_norm, mu, sigma
end

## Load Data
data = readdlm("../data/ex1data2.txt", ',')
X_raw = data[:, 1:2]
y = data[:, 3]
m = length(y)

# Show some data points
X_raw[1:10, :]

10×2 Array{Float64,2}:
 2104.0  3.0
 1600.0  3.0
 2400.0  3.0
 1416.0  2.0
 3000.0  4.0
 1985.0  4.0
 1534.0  3.0
 1427.0  3.0
 1380.0  3.0
 1494.0  3.0

In [3]:
X_norm, mu, sigma = feature_normalize(X_raw)

# Add intercept term to X_norm
X = [ones(m) X_norm];

In [4]:
# First 10 examples after normalization
X[1:10, :]

10×3 Array{Float64,2}:
 1.0   0.13001    -0.223675
 1.0  -0.50419    -0.223675
 1.0   0.502476   -0.223675
 1.0  -0.735723   -1.53777 
 1.0   1.25748     1.09042 
 1.0  -0.0197317   1.09042 
 1.0  -0.58724    -0.223675
 1.0  -0.721881   -0.223675
 1.0  -0.781023   -0.223675
 1.0  -0.637573   -0.223675

***
**Implementation note:**
When normalizing the features, it is important to store the values used for normalization: the mean value and the standard deviation used for the computations. After learning the parameters
from the model, we often want to predict the prices of houses we have not seen before. Given a new x value (living room area and number of bedrooms), we must first normalize x using the mean and standard deviation
that we had previously computed from the training set.

***
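The note above can be sketched as follows. This is a minimal illustration, assuming Julia 1.x, where `mean` and `std` come from the `Statistics` standard library and take a `dims` keyword (the notebook's own cells use the older positional form). The names `X_train` and `x_new_norm` and the toy values are made up for this example.

```julia
using Statistics

# Hypothetical toy training set: each row is an example, each column a feature.
X_train = [2104.0 3.0;
           1600.0 3.0;
           2400.0 4.0]

# Statistics are computed once, from the training set only, and stored.
mu = mean(X_train, dims=1)      # 1×2 row of column means
sigma = std(X_train, dims=1)    # 1×2 row of column standard deviations

# A new, unseen example is scaled with the stored mu and sigma,
# never with statistics recomputed from the new data.
x_new = [1650.0 3.0]
x_new_norm = (x_new .- mu) ./ sigma
```

Reusing the stored `mu` and `sigma` keeps the new example on the same scale as the features the parameters were learned on.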

# Part 2: Gradient Descent
Previously, you implemented gradient descent on a univariate regression
problem. The only difference now is that there is one more feature in the matrix `X`. The hypothesis function and the batch gradient descent update rule remain unchanged.
You should complete the code in `compute_cost_multi()` and `gradient_descent_multi!()`
to implement the cost function and gradient descent for linear regression with multiple variables. If your code in the previous part (single variable) already
supports multiple variables, you can use it here too.
Make sure your code supports any number of features.
You can use `size(X, 2)` to find out how many columns `X` has (the features, plus the intercept column once it has been added).
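For reference, the hypothesis, the cost function, and the batch update rule can be written as below. The notation ($m$ training examples, $x^{(i)}$ for the $i$-th example including $x_0 = 1$, learning rate $\alpha$) is assumed here to match the lectures; all $\theta_j$ are updated simultaneously.

$$ h_\theta(x) = \theta^T x, \qquad J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 $$

$$ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} $$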

As a first step, you must be able to compute the cost for a multidimensional `X`. If necessary, adapt `compute_cost()` from the first part so that it can handle a multidimensional `X`.

As a second step, write the code that performs gradient descent on this data in the function `gradient_descent_multi!()`. The exclamation point at the end of the function name is a Julia convention for in-place functions: functions ending with `!` modify one or more of their input arguments. In this case, your code should modify the input argument `theta`.
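The difference between modifying an argument in place and merely rebinding a local name is easy to miss, so here is a small sketch. The functions `shift!` and `shift_rebind` are hypothetical helpers written for this illustration and are not part of the exercise.

```julia
# Hypothetical helper: shifts every entry of v in place.
function shift!(v, delta)
    v .+= delta      # broadcasted assignment writes into the existing array
    return v
end

# Rebinding the local name does NOT change the caller's array.
function shift_rebind(v, delta)
    v = v .+ delta   # creates a new array and rebinds the local variable only
    return v
end

theta_demo = zeros(3)
shift_rebind(theta_demo, 1.0)   # theta_demo is still all zeros
shift!(theta_demo, 1.0)         # theta_demo is now all ones
```

Inside `gradient_descent_multi!()`, this means the update has to write into the existing array (for example via `theta[:]` or a broadcasted `.+=`) for the caller to see the new values.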

In [32]:
function compute_cost_multi(X, y, theta)
  # Compute cost for linear regression
  # Computes the cost J of using theta as the
  # parameter for linear regression to fit the data points in X and y

  # Instructions: Compute the cost of a particular choice of theta
  #               You should set J to the cost.
  
  # Initialize some useful values, such as the number of training examples m.
  m = length(y)


  # ====================== YOUR CODE HERE ======================
  h = X * theta;
  J = 1/(2*m) * sum((h .- y).^2)
  # ============================================================
  return J
end


function gradient_descent_multi!(theta, X, y, alpha, num_iters)
    
    # Initialize some useful values
    m, n = size(X)
    J_history = zeros(num_iters)
    d_theta = zeros(n)
    
    # Hint: Perform a loop over num_iters and fill in values 
    #       J_history[iter] = compute_cost_multi(X, y, theta)


    # ====================== YOUR CODE HERE ======================
    for iter in 1:num_iters
        h = X * theta
        for j in 1:n
            x = X[:, j]
            d_theta[j] = -alpha/m * sum((h .- y) .* x)
        end
        theta[:] += d_theta
        J_history[iter] = compute_cost_multi(X, y, theta)
    end
    # ============================================================

    return J_history
end


# Choose some alpha value
alpha = 0.1;
num_iters = 50;

# Init Theta and Run Gradient Descent
theta = zeros(3);
J_history = gradient_descent_multi!(theta, X, y, alpha, num_iters);
@show(theta);

theta = [3.38658e5, 1.04128e5, -172.205]


In [37]:
plot(J_history / 1e10, color=:red, width=2, yaxis=("cost / 10^10"), xaxis=("iteration"))

### 3.2.1 Optional (ungraded) exercise: Selecting learning rates
In this part of the exercise, you will get to try out different learning rates for the dataset and find a learning rate that converges quickly. You can change the learning rate in the cell above by trying out different values for `alpha` and examining how the optimization progresses.

If you picked a learning rate within a good range, your plot should look similar to Figure 4. If your graph looks very different, especially if your value of $J(\theta)$
increases or even blows up, adjust your learning rate and try again. We recommend trying values of the learning rate $\alpha$ on a log scale, each roughly a factor of 3 apart (e.g., 0.3, 0.1, 0.03, 0.01 and so on).
You may also want to adjust the number of iterations you are running if that will help you see the overall trend in the curve.
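One way to organize such a comparison is to run gradient descent once per candidate learning rate and keep each cost history. The sketch below is self-contained and uses made-up data (`X_demo`, `y_demo`) and a hypothetical vectorized helper `run_gd`, so it is independent of the functions you wrote above; with your own code you would call `gradient_descent_multi!()` on a fresh `theta` for each `alpha` instead.

```julia
# Synthetic stand-in for a normalized design matrix (intercept column first);
# these values are made up for illustration only.
X_demo = [1.0 -1.0; 1.0 0.0; 1.0 1.0]
y_demo = [1.0, 3.0, 5.0]          # generated by y = 3 + 2x

cost(X, y, theta) = sum((X * theta .- y) .^ 2) / (2 * length(y))

# Hypothetical helper: vectorized batch gradient descent returning
# the final parameters and the cost after every iteration.
function run_gd(X, y, alpha, num_iters)
    theta = zeros(size(X, 2))
    J = zeros(num_iters)
    for iter in 1:num_iters
        theta -= (alpha / length(y)) * (X' * (X * theta .- y))
        J[iter] = cost(X, y, theta)
    end
    return theta, J
end

alphas = [0.3, 0.1, 0.03, 0.01]
histories = Dict(a => run_gd(X_demo, y_demo, a, 50)[2] for a in alphas)

for a in alphas
    println("alpha = $a, final cost = $(histories[a][end])")
end
```

Each entry of `histories` can then be drawn on the same axes to see which learning rate drives the cost down fastest.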

![](../figures/Fig_4.png)

Next, use this value of $\theta$ to predict the price of a house with 1650 square feet and 3 bedrooms. You will use this prediction later to check your implementation of the normal equations. Don’t forget to normalize your features when you make this prediction!

In [38]:
# Recall that the first column of X is all-ones. Thus, it does
# not need to be normalized.

price = 0  # You should change this

# ====================== YOUR CODE HERE ======================
x_new = [1650.0, 3.0]'
x_new = (x_new .- mu) ./ sigma
x_new = [1 x_new]
price = dot(x_new, theta)
# ============================================================

println("Predicted price of a 1650 sq-ft, 3 br house (using gradient descent): $price");

Predicted price of a 1650 sq-ft, 3 br house (using gradient descent): 292748.0852321537


# Part 3: Normal Equations
In the lecture videos, you learned that the closed-form solution to linear regression is

$$ \theta = \left( X^TX \right)^{-1}X^Ty.$$


Using this formula does not require any feature scaling, and you will get an exact solution in one calculation: there is no “loop until convergence” like in gradient descent.
Complete the code in `normal_eqn()` to use the formula above to calculate $\theta$. Remember that while you don’t need to scale your features, we still need to add a column of 1’s to the `X` matrix to have an intercept term $\theta_0$.
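As a quick sanity check of the formula, the sketch below applies it to made-up data (`X_demo`, `y_demo`) that lies exactly on a known plane, so the recovered parameters can be compared with the ones used to generate it. It also shows Julia's backslash operator, which solves the same least-squares problem without forming an explicit inverse; that is an alternative shown for comparison, not what the exercise asks you to implement.

```julia
using LinearAlgebra

# Made-up data that lies exactly on y = 1 + 2*x1 + 3*x2 (first column is the intercept).
X_demo = [1.0 1.0 2.0;
          1.0 2.0 1.0;
          1.0 3.0 5.0;
          1.0 4.0 3.0]
y_demo = X_demo * [1.0, 2.0, 3.0]

# Formula from the text, written with an explicit inverse.
theta_inv = inv(X_demo' * X_demo) * X_demo' * y_demo

# Alternative (not what the exercise asks for): solve the same
# least-squares problem with the backslash operator.
theta_solve = X_demo \ y_demo
```

Both results should agree with the generating parameters up to floating-point rounding.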

In [46]:
function normal_eqn(X, y)
    # Look up inv() in the Julia help (type ?inv in the REPL)
    # and use it to compute the closed-form solution
    # ====================== YOUR CODE HERE ======================
    theta = inv(X' * X) * X' * y
    # ============================================================
    return theta
end

# Load Data
data = readdlm("../data/ex1data2.txt", ',')
X = data[:, 1:2]
y = data[:, 3]
m = length(y)

# Add intercept term to X
X = [ones(m) X];

theta = normal_eqn(X, y)
x_new = [1, 1650, 3]
price = dot(x_new, theta)

@show(theta)
println("Predicted price of a 1650 sq-ft, 3 br house (using normal eqn): $price");

theta = [89597.9, 139.211, -8738.02]
Predicted price of a 1650 sq-ft, 3 br house (using normal eqn): 293081.46433489607


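Compare the house price predicted by the closed-form solution with the one predicted by gradient descent. Then compare the two $\theta$ vectors, keeping in mind that the gradient descent $\theta$ was fitted on normalized features while the normal-equation $\theta$ was fitted on the raw features, so their entries are not directly comparable.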
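To compare the two $\theta$ vectors directly, they need to be on the same feature scale, because gradient descent above ran on normalized features while the normal equations used raw ones. Expanding $\theta_0 + \sum_j \theta_j (x_j - \mu_j)/\sigma_j$ shows that the raw-scale slopes are $\theta_j / \sigma_j$ and the raw-scale intercept is $\theta_0 - \sum_j \theta_j \mu_j / \sigma_j$. The sketch below checks this on made-up data, assuming Julia 1.x with the `Statistics` standard library; every name ending in `_demo` is hypothetical, and both fits use the backslash operator so that they are exact.

```julia
using Statistics

# Made-up raw features and targets, for illustration only.
X_raw_demo = [1000.0 2.0; 1500.0 3.0; 2000.0 3.0; 2500.0 4.0; 1800.0 2.0]
y_demo = [200.0, 290.0, 360.0, 450.0, 310.0]
m_demo = length(y_demo)

mu_demo = mean(X_raw_demo, dims=1)
sigma_demo = std(X_raw_demo, dims=1)
X_norm_demo = (X_raw_demo .- mu_demo) ./ sigma_demo

# Exact least-squares fits in both parameterizations.
theta_norm = [ones(m_demo) X_norm_demo] \ y_demo
theta_raw = [ones(m_demo) X_raw_demo] \ y_demo

# Map the normalized-scale parameters back to the raw scale.
slopes = theta_norm[2:end] ./ vec(sigma_demo)
intercept = theta_norm[1] - sum(slopes .* vec(mu_demo))
theta_converted = [intercept; slopes]
```

If gradient descent has fully converged, converting its $\theta$ this way should give values close to the normal-equation $\theta$; a visible gap suggests that more iterations or a different learning rate are needed.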