 Machine Learning Online Class - Exercise 2: Logistic Regression

# Instructions
 
This file contains code that helps you get started on the second part of the exercise, which covers regularization with logistic regression. You will need to complete the following functions in this exercise:

     sigmoid()
     cost_function_reg()
     gradient!()
     predict()

For this exercise, you will not need to change any code in this file, or any other files other than those mentioned above.

# 2) Regularized logistic regression
In this part of the exercise, you will implement regularized logistic regression
to predict whether microchips from a fabrication plant pass quality assurance
(QA). During QA, each microchip goes through various tests to ensure
it is functioning correctly.
Suppose you are the product manager of the factory and you have the
test results for some microchips on two different tests. From these two tests,
you would like to determine whether the microchips should be accepted or
rejected. To help you make the decision, you have a dataset of test results
on past microchips, from which you can build a logistic regression model.

## 2.1 Visualizing the Data
Similar to the previous parts of this exercise, *plot_data()* is used to generate a
figure where the axes are the two test scores, and the positive
(y = 1, accepted) and negative (y = 0, rejected) examples are shown with
different markers.

The figure shows that our dataset cannot be separated into positive and
negative examples by a straight line through the plot. Therefore, a straightforward
application of logistic regression will not perform well on this dataset,
since logistic regression will only be able to find a linear decision boundary.


In [1]:
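# Standard-library modules needed on Julia 0.7 and later
using DelimitedFiles   # provides readdlm
using LinearAlgebra    # provides dot
using Statistics       # provides mean
using Plots
using Optim
gr()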

filename = "../data/ex2data2.txt"
delimiter = ','
data = readdlm(filename, delimiter)
X_data = data[:, [1, 2]]
y_data = data[:, 3];

In [2]:
#############################################################
# The functions in this section are defined for you, there
# is no need to change code in this cell.
############################################################

function plot_data(X, y)
    accepted = y .== 1.0
    rejected = .!accepted
    # Plot from the X argument rather than the global X_data
    scatter(X[accepted, 1], X[accepted, 2],
            m=(1.0, :circle, 4),
            c=:black, label="accepted")

    scatter!(X[rejected, 1], X[rejected, 2],
             m=(1.0, :square, 3),
             c=:yellow, label="rejected")
    xlabel!("Microchip Test 1")
    ylabel!("Microchip Test 2")
end

function map_feature(X1, X2; degree=6)
    out = ones(size(X1 ,1))
    for i = 1:degree
        for j = 0:i
            out = hcat(out, (X1.^(i - j)).*(X2.^j))
        end
    end
    return out
end

function predict_proba(x, theta)
    sigmoid(dot(x, theta))
end


function plot_decision_boundary(X, y, theta)
    # linspace was removed in Julia 1.0; range with keywords is the replacement
    x1 = range(-1, stop=1.5, length=50)
    x2 = range(-1, stop=1.5, length=50)
    p = zeros(length(x1), length(x2))

    for i = 1:length(x1)
        for j = 1:length(x2)
            p[i,j] = predict_proba(map_feature(x1[i], x2[j]), theta)
        end
    end
    p = p'
    contour(x1, x2, p, levels=3)
    
    accepted = y .== 1.0
    rejected = .!accepted
    
    scatter!(X[accepted, 1], X[accepted, 2], c=:black, label="accepted")
    scatter!(X[rejected, 1], X[rejected, 2], c=:yellow, label="rejected")
    xlabel!("Microchip Test 1")
    ylabel!("Microchip Test 2")
end

plot_decision_boundary (generic function with 1 method)

In [3]:
plot_data(X_data, y_data)

## 2.2 Feature Mapping
In this part, you are given a dataset with data points that are not
linearly separable. However, you would still like to use logistic
regression to classify the data points. To do so, you introduce more features: in particular, you add
polynomial features to the data matrix (similar to polynomial regression).

In the provided function *map_feature()*, we will map the features into
all polynomial terms of $x_1$ and $x_2$ up to the sixth power.
Note that *map_feature()* also adds a column of ones for us, so the intercept term is handled.

$$
\text{map_feature(x)} = [1,\ x_1,\ x_2,\ x_1^2,\ x_1x_2,\ x_2^2,\ x_1^3,\ \dots,\  x_1x_2^5,\ x_2^6]^\top
$$

As a result of this mapping, our vector of two features (the scores on
two QA tests) has been transformed into a 28-dimensional vector. A logistic
regression classifier trained on this higher-dimension feature vector will have
a more complex decision boundary and will appear nonlinear when drawn in
our 2-dimensional plot.
While the feature mapping allows us to build a more expressive classifier,
it is also more susceptible to overfitting. In the next parts of the exercise, you
will implement regularized logistic regression to fit the data and also see for
yourself how regularization can help combat the overfitting problem.
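
The 28 dimensions can be checked by counting the terms that the loops in *map_feature()* generate. The snippet below is a small sketch of that count; the helper name `count_mapped_features` is made up for illustration and is not part of the exercise files.

```julia
# Hypothetical helper (not part of the exercise): for each total degree i,
# map_feature() emits the i + 1 monomials x1^(i-j) * x2^j with j = 0..i,
# and one extra column of ones is added for the intercept.
count_mapped_features(degree) = 1 + sum(i + 1 for i in 1:degree)

n_features = count_mapped_features(6)

# The same count in closed form: (d + 1)(d + 2) / 2 monomials of degree <= d
closed_form = (6 + 1) * (6 + 2) ÷ 2

println("mapped features for degree 6: ", n_features)
```

Both expressions give 28, which matches the number of columns reported by `size(X)` in the next cell.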

In [4]:
const X = map_feature(X_data[:, 1], X_data[:, 2])
const y = copy(y_data)
const m = size(X, 1)
const lambda = 0.0;

@show(size(X));

size(X) = (118, 28)


## 2.3 Cost Function and Gradient
### 2.3.1
Now you will implement code to compute the cost function and gradient for regularized logistic regression. Complete the code in *cost_function_reg()* and *gradient!()* to return the cost and gradient respectively.
Recall that the regularized cost function in logistic regression is

$$
J(\theta) = \frac{1}{m}\sum_{i=1}^m[-y^{(i)} \log(h_\theta(x^{(i)})) - (1-y^{(i)})\log(1-h_\theta(x^{(i)}))] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2.
$$

Note that you should not regularize the parameter $\theta_0$.
Recall that in Julia indexing starts from 1; hence, you should not regularize the *theta[1]* parameter in the code (which corresponds to $\theta_0$). The gradient
of the cost function is a vector where the $j$-th element is defined as follows:

$$
\frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \hspace{5cm} \text{for j = 0}
$$

$$
\frac{\partial J(\theta)}{\partial\theta_j} = \left(\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}  \right) + \frac{\lambda}{m}\theta_j\hspace{3cm} \text{for j $\geq$ 1}
$$

Once you are done, call your *cost_function_reg()* using the initial value of θ (initialized to all zeros). You should see that the
cost is about 0.693.
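
As a quick sanity check on that value (a short derivation added here, not part of the original exercise text): with $\theta = 0$ every hypothesis value is $h_\theta(x^{(i)}) = \sigma(0) = \tfrac{1}{2}$ and the penalty term vanishes, so each example contributes $\log 2$ regardless of its label:

$$
J(0) = \frac{1}{m}\sum_{i=1}^m\left[-y^{(i)} \log\tfrac{1}{2} - (1-y^{(i)})\log\tfrac{1}{2}\right] = \log 2 \approx 0.693.
$$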


In [7]:
sigmoid(z) = 1 ./ (1 .+ exp.(-z))

function cost_function_reg(theta)
    h = sigmoid(X * theta)
    J = 1/m * sum(-y .* log.(h) .- (1 .- y) .* log.(1 .- h))
    # Penalize the squared parameters, skipping theta[1] (the intercept θ_0)
    J += lambda / (2m) * sum(theta[2:end] .^ 2)
    return J
end


function gradient!(grad, theta)
    h = sigmoid(X * theta)
    grad[:] = 1 / m * X' * (h .- y)
    for j = 2:size(theta, 1)
      grad[j] += lambda / m * theta[j]
    end
end


theta_init = zeros(size(X, 2))
grad = zero(theta_init)  # array of zeros with the same shape as theta_init

cost = cost_function_reg(theta_init)
@show(cost);

cost = 0.6931471805599453
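
A cost of about 0.693 at θ = 0 does not exercise the penalty term or the gradient. One common way to test both is a finite-difference gradient check. The sketch below is only an illustration under stated assumptions: the names `reg_cost`, `reg_gradient` and `numerical_gradient`, the tiny data matrix, and the choice `lambda = 1.0` are all made up for the example, and the functions take the data as arguments instead of using the notebook's globals.

```julia
sigmoid(z) = 1 ./ (1 .+ exp.(-z))

# Regularized cost; theta[1] (the intercept) is excluded from the penalty.
function reg_cost(theta, X, y, lambda)
    m = size(X, 1)
    h = sigmoid(X * theta)
    J = sum(-y .* log.(h) .- (1 .- y) .* log.(1 .- h)) / m
    return J + lambda / (2m) * sum(theta[2:end] .^ 2)
end

# Analytic gradient of reg_cost with respect to theta.
function reg_gradient(theta, X, y, lambda)
    m = size(X, 1)
    grad = X' * (sigmoid(X * theta) .- y) / m
    grad[2:end] .+= lambda / m .* theta[2:end]
    return grad
end

# Central-difference approximation of the gradient of f at theta.
function numerical_gradient(f, theta; delta=1e-6)
    g = zero(theta)
    for j in eachindex(theta)
        step = zero(theta)
        step[j] = delta
        g[j] = (f(theta + step) - f(theta - step)) / (2 * delta)
    end
    return g
end

# Made-up toy data: a column of ones plus two features, four examples.
X_toy = [1.0 0.5 -1.2; 1.0 -0.3 0.8; 1.0 1.5 0.1; 1.0 -1.0 -0.7]
y_toy = [1.0, 0.0, 1.0, 0.0]
theta_toy = [0.1, -0.2, 0.3]
lambda_toy = 1.0

analytic = reg_gradient(theta_toy, X_toy, y_toy, lambda_toy)
numeric = numerical_gradient(t -> reg_cost(t, X_toy, y_toy, lambda_toy), theta_toy)
max_gap = maximum(abs.(analytic .- numeric))
println("largest gap between analytic and numerical gradient: ", max_gap)
```

If the two gradients disagree by more than a small tolerance, the cost and gradient are probably inconsistent, for example because the penalty uses `theta` where the formula has `theta .^ 2`.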


### 2.3.2 Learning Parameters
Similar to the previous parts, you will use *optimize()* from the Optim package to learn the optimal parameters θ. If you have completed the cost and gradient for regularized logistic regression (*cost_function_reg()* and *gradient!()*) correctly, you should be able to step
through the next part to learn the parameters θ.
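
Before running the optimizer on the real cost, it may help to see the same call pattern on a problem with a known answer. The sketch below assumes a recent version of Optim; the toy objective and the names `toy_cost` and `toy_gradient!` are invented for illustration.

```julia
using Optim

# Toy objective with a known minimum at (1, -2); chosen only for illustration.
toy_cost(x) = (x[1] - 1)^2 + (x[2] + 2)^2

# In-place gradient: Optim passes the storage array first, then the point.
function toy_gradient!(G, x)
    G[1] = 2 * (x[1] - 1)
    G[2] = 2 * (x[2] + 2)
    return G
end

toy_result = optimize(toy_cost, toy_gradient!, zeros(2), LBFGS())

toy_minimizer = Optim.minimizer(toy_result)   # point where the minimum was found
toy_minimum = Optim.minimum(toy_result)       # objective value at that point
println("minimizer = ", toy_minimizer, ", minimum = ", toy_minimum)
```

The gradient function has the same argument order as *gradient!(grad, theta)* above: the array to fill comes first and the current parameters second.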



In [8]:
result = optimize(cost_function_reg, gradient!, theta_init, LBFGS())

theta_opt = result.minimizer
cost_opt = result.minimum

@show(cost_opt);

cost_opt = 0.27124705591155157


## 2.4 Plotting the Decision Boundary
To help you visualize the model learned by this classifier, we have provided the function *plot_decision_boundary()*, which plots the (non-linear) decision boundary that separates the positive and negative examples. In *plot_decision_boundary()*, we plot the non-linear decision boundary by computing the classifier’s predictions on an evenly spaced grid and then drawing a contour plot of where the predictions change from y = 0 to y = 1.

**Optional Exercise**:
Try different values of lambda (0, 1, 10, 100) and see how regularization affects the decision boundary.
How does the decision boundary change when you vary lambda? How does
the training set accuracy vary?
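
Because `lambda` is declared `const` in this notebook, trying several values means editing that cell and re-running everything after it. One alternative, sketched here under assumptions, is to build the cost and gradient as closures over `lambda`. The helpers `make_objective` and `sweep_lambdas` and the one-feature toy dataset are hypothetical and only illustrate the pattern; they are not part of the exercise files.

```julia
using Optim

sigmoid(z) = 1 ./ (1 .+ exp.(-z))

# Hypothetical helper: returns (cost, grad!) closures for one value of lambda.
function make_objective(X, y, lambda)
    m = size(X, 1)
    function cost(theta)
        h = sigmoid(X * theta)
        J = sum(-y .* log.(h) .- (1 .- y) .* log.(1 .- h)) / m
        return J + lambda / (2m) * sum(theta[2:end] .^ 2)
    end
    function grad!(grad, theta)
        grad[:] = X' * (sigmoid(X * theta) .- y) / m
        grad[2:end] .+= lambda / m .* theta[2:end]
        return grad
    end
    return cost, grad!
end

# Hypothetical helper: fits one model per lambda and records |theta[2]|.
function sweep_lambdas(X, y, lambdas)
    slopes = Float64[]
    for lam in lambdas
        cost_fn, grad_fn! = make_objective(X, y, lam)
        res = optimize(cost_fn, grad_fn!, zeros(size(X, 2)), LBFGS())
        push!(slopes, abs(Optim.minimizer(res)[2]))
    end
    return slopes
end

# Made-up, non-separable toy data with a single feature plus an intercept column.
x_toy = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
X_toy = hcat(ones(length(x_toy)), x_toy)
y_toy = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0]

slopes = sweep_lambdas(X_toy, y_toy, (0.0, 1.0, 10.0, 100.0))
println("|theta[2]| for lambda = 0, 1, 10, 100: ", slopes)
```

On this toy problem the fitted slope shrinks as lambda grows, which is the effect to look for in the decision boundary of the microchip data: larger values of lambda give a smoother boundary.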

In [9]:
plot_decision_boundary(X_data, y_data, theta_opt)

In [10]:
# Predict whether the label is 0 or 1 using learned logistic
# regression parameters theta
# yhat = predict(X, theta) computes the predictions for X using a
# threshold at 0.5 (i.e., if sigmoid(theta' * x) >= 0.5, predict 1, else 0).

function predict(X, theta)
    h = sigmoid(X * theta)
    # Explicit threshold: round.(Int, h) would round a tie at exactly 0.5 down to 0
    return Int.(h .>= 0.5)
end

predict (generic function with 1 method)

In [11]:
yhat = predict(X, theta_opt)
accuracy = mean(yhat .== y)
@show(accuracy)
@show(cost_opt)

accuracy = 0.8728813559322034
cost_opt = 0.27124705591155157

