##Machine Learning Wk 3

<h3 align="right"><u>Video: Classification Problems</u></h3>
Examples
* Email: Spam/Not Spam?
* Online Transactions: Fraudulent (Yes/No)?
* Tumor: Malignant/Benign?

$y \in \{0, 1\}$
* 0: "Negative Class" (e.g. benign tumor)
* 1: "Positive Class" (e.g. malignant tumor)

By convention, 0 indicates an absence of something. But the assignment is somewhat arbitrary.
#####Applying linear regression to a classification problem
<img src="images/wk3_classification_529.png">

Threshold classifier output $h_\theta(x)$ at 0.5:
- If $h_\theta(x) >= 0.5$, predict "y=1"
- If $h_\theta(x) < 0.5$, predict "y=0"

By adding one outlier, linear regression no longer gives us a good model (<span style="color:magenta">magenta</span> line vs. <span style="color:blue">blue</span> line). For this reason, <b>it is not a good idea to use linear regression for classification problems.</b>

* <b>Linear regression</b>: $h_\theta(x)$ can be $>1$ or $<0$

* <b>Logistic regression</b>: $0 <= h_\theta(x) <= 1$, i.e. the output of logistic regression always lies between 0 and 1.

<h3 align="right"><u>Video: Hypothesis Representation</u></h3>

Determine the function we will use to represent the hypothesis for classification problems.
#####Logistic Regression Model
<img src="images/wk3_hypothesis_305.png" align="right" width="50%">
Want $0 <= h_\theta(x) <= 1$

$\qquad h_\theta(x) = g(\theta^Tx)$

$\qquad g(z) = \frac{1}{1+e^{-z}}$

$\qquad h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$


Plotted to the right:
* <b>Sigmoid function</b><br>
* <b>Logistic function</b><br> The two terms are interchangeable; both refer to the function $g(z)$.
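
As a rough illustration of $g(z)$, here is a minimal Python sketch. The function name `sigmoid` and the sign-based branching (used only to avoid floating-point overflow for very negative $z$) are choices made for this example, not something taken from the lecture.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    # Branch on the sign of z so math.exp never receives a large
    # positive argument, which would raise OverflowError.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(sigmoid(0.0))    # 0.5: the curve crosses 0.5 at z = 0
print(sigmoid(10.0))   # close to 1 for large positive z
print(sigmoid(-10.0))  # close to 0 for large negative z
```

Both branches compute the same mathematical function; the second is just $g(z)$ with numerator and denominator multiplied by $e^{z}$.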

<b>Interpretation of Hypothesis Output</b><br>
$h_\theta(x) =$ estimated probability that $y = 1$ on input x

Example: If $x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$<br>
$\qquad h_\theta(x) = 0.7$<br><br>
Tell the patient there is a 70% chance that the tumor is malignant.

$h_\theta(x) = p(y=1 | x;\theta)$<br>
"probability that y=1, given x, parameterized by $\theta$"
<br><br>
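
The probabilistic reading of $h_\theta(x)$ can be sketched in Python as follows. The parameter values and tumor size are hypothetical, picked only so the numbers are easy to follow; since $y$ is either 0 or 1, the probability of $y = 0$ is taken as the complement.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); fine for the moderate z used here."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): estimated P(y = 1 | x; theta)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical parameters and input, for illustration only.
theta = [-4.0, 1.0]
x = [1.0, 5.0]  # x_0 = 1 (intercept term), x_1 = tumorSize

p_malignant = hypothesis(theta, x)   # P(y = 1 | x; theta)
p_benign = 1.0 - p_malignant         # P(y = 0 | x; theta)
print(f"P(y=1) = {p_malignant:.3f}, P(y=0) = {p_benign:.3f}")
```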

<img src="images/wk3_hypothesis_314.png" width="80%">

<h3 align="right"><u>Video: Decision Boundary</u></h3>

Suppose we predict:
<img src="images/wk3_decision_405.png" align="right" width="50%">
* <span style="color:magenta">"$y = 1$" if $h_\theta(x) >= 0.5$ <br>
$\qquad g(z) >= 0.5$ when $z >= 0$ <br>
$\qquad h_\theta(x) = g(\theta^Tx) >= 0.5$ <br>
$\qquad whenever\;\theta^Tx >= 0$</span>


* <span style="color:red">"$y = 0$" if $h_\theta(x) < 0.5$ <br>
$\qquad g(z) < 0.5$ when $z < 0$ <br>
$\qquad h_\theta(x) = g(\theta^Tx) < 0.5$ <br>
$\qquad whenever\;\theta^Tx < 0$</span>
<br><br><br>

<b>Decision Boundary Example</b>
<img src="images/wk3_decision_813.png" align="right" width="40%">

$h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2)$<br>
parameters fit (next video): $\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$

Predict "y = 1" if $-3 +x_1 + x_2 >= 0$<br>
$\qquad -3 + x_1 + x_2 = \theta^Tx$<br>
<span style="color:magenta">$\qquad x_1 + x_2 = 3$</span>

<span style="color:magenta">Magenta line is Decision Boundary</span>
<br><br><br>
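
A small sketch of the linear decision boundary example, using the parameter vector $\theta = [-3, 1, 1]^T$ from the notes. The function name `predict` and the test points are made up for illustration. Because $g(z) >= 0.5$ exactly when $z >= 0$, the sketch checks the sign of $\theta^Tx$ and never evaluates the sigmoid.

```python
def predict(theta, x):
    """Predict y = 1 exactly when theta^T x >= 0, i.e. h_theta(x) >= 0.5."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]  # parameters from the example above

# Each point is [x_0, x_1, x_2] with x_0 = 1.
print(predict(theta, [1.0, 1.0, 1.0]))  # 0: x_1 + x_2 = 2 < 3
print(predict(theta, [1.0, 2.0, 2.0]))  # 1: x_1 + x_2 = 4 >= 3
print(predict(theta, [1.0, 1.5, 1.5]))  # 1: exactly on the boundary
```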

<b>Non-linear decision boundaries</b>
<img src="images/wk3_decision_1230.png" align="right" width="40%">

$h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2)$

$\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$

Predict "y=1" if $-1 + x_1^2 + x_2^2 >= 0$<br>
<span style="color:magenta">$\qquad x_1^2 + x_2^2 >= 1$</span>
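
The circular boundary can be sketched the same way once the squared terms are added as extra features. The helper name `features` is hypothetical; the parameter vector is the one given above.

```python
def features(x1, x2):
    """Map (x1, x2) to the feature vector [1, x1, x2, x1^2, x2^2]."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]

def predict(theta, x1, x2):
    """Predict y = 1 when theta^T features(x1, x2) >= 0."""
    z = sum(t * f for t, f in zip(theta, features(x1, x2)))
    return 1 if z >= 0 else 0

theta = [-1.0, 0.0, 0.0, 1.0, 1.0]  # parameters from the example above

print(predict(theta, 0.0, 0.0))  # 0: inside the unit circle
print(predict(theta, 1.0, 1.0))  # 1: outside the unit circle
print(predict(theta, 1.0, 0.0))  # 1: on the circle x1^2 + x2^2 = 1
```

The model is still linear in $\theta$; the boundary is curved only because the features themselves are non-linear in $x_1$ and $x_2$.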


Higher order polynomials can give even more complex decision boundaries.


<h3 align="right"><u>Video: Cost Function</u></h3>

How to fit parameters $\theta$ for logistic regression.

<b>Supervised learning model of fitting a logistic regression model</b>

Training set: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})\}$

m examples $\qquad x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1}$ 
$\qquad x_0 = 1, y \in \{0,1\}$


$h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$

<b>How to choose parameters $\theta$?</b>

####Cost Function
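Linear regression: $J(\theta) = \frac{1}{m} \sum_{i=1}^m \frac{1}{2} (h_\theta(x^{(i)}) - y^{(i)})^2$

With the sigmoid hypothesis $h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$, this squared-error cost is non-convex in $\theta$, so gradient descent is not guaranteed to reach the global minimum.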

<img src="images/wk3_cost_420.png" width="90%">


####Logistic cost function
<img src="images/wk3_cost_609.png" align="right" width="40%">
$Cost(h_\theta(x),y) = \begin{cases} 
-\log(h_\theta(x)) & \text{if y = 1}\\
-\log(1-h_\theta(x)) & \text{if y = 0}
\end{cases}$
<br><br><br><br>


<img src="images/wk3_cost_620.png" align="left" width="30%">
Cost = 0 if $y = 1$ and $h_\theta(x) = 1$<br>
But as $h_\theta(x) \to 0$, $Cost \to \infty$

This captures the intuition that if $h_\theta(x) = 0$ (i.e. we predict $P(y=1 | x;\theta) = 0$) but in fact $y = 1$, the learning algorithm is penalized with a very large cost.
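
To see the penalty numerically, here is a minimal sketch of the per-example cost. The function name `cost` and the sample values of $h_\theta(x)$ are illustrative; the sketch assumes $h$ lies strictly between 0 and 1 on the side being penalized, since $\log(0)$ is undefined.

```python
import math

def cost(h, y):
    """Per-example logistic cost: -log(h) if y = 1, -log(1 - h) if y = 0."""
    if y == 1:
        return -math.log(h)
    return -math.log(1.0 - h)

# The cost grows without bound as the prediction moves away from the label.
for h in (0.99, 0.5, 0.01):
    print(f"h = {h:4.2f}  cost if y=1: {cost(h, 1):.3f}  cost if y=0: {cost(h, 0):.3f}")
```

A confident and correct prediction costs almost nothing, while a confident and wrong one is penalized heavily.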

<img src="images/wk3_cost_1000.png" align="right" width="30%">




<h3 align="right"><u>Video: Simplified Cost Function and Gradient Descent</u></h3>

####Logistic regression cost function
$J(\theta) = \frac{1}{m} \sum_{i=1}^m Cost(h_\theta(x^{(i)}),y^{(i)})\\
Cost(h_\theta(x),y) = \begin{cases}
-\log(h_\theta(x)) & \text{if y = 1}\\
-\log(1-h_\theta(x)) & \text{if y = 0}
\end{cases}$<br>
Note: y=0 or 1 always
<br><br>
Simplified way of writing the cost function:<br>
$\begin{align}
Cost(h_\theta(x),y) &= -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))\\
\text{If y=1} &= -\log(h_\theta(x))\\
\text{If y=0} &= -\log(1-h_\theta(x))
\end{align}$

<span style="color:blue">
$$J(\theta) = - \frac{1}{m} \left[\sum_{i=1}^m \left( y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right) \right]$$
</span>
$$\text{To fit parameters } \theta:\\
\min_\theta J(\theta)\\
\text{To make a prediction given new } x:\\
\text{Output } h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$$
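
The cost $J(\theta)$ can be sketched in plain Python as below. The toy training set (`X`, `y`) and the function name `cost_function` are hypothetical and exist only for illustration; each row of `X` starts with $x_0 = 1$.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); fine for the moderate z used here."""
    return 1.0 / (1.0 + math.exp(-z))

def cost_function(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1 - y)*log(1 - h))."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m

# Hypothetical toy data: one feature plus the intercept term x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

# With theta = 0 every h is 0.5, so J = log(2), about 0.693.
print(cost_function([0.0, 0.0], X, y))
# Parameters that separate the two classes give a lower cost.
print(cost_function([-4.5, 2.0], X, y))
```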

<img src="images/wk3_gradient_830.png" width="90%">

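Update rule (all $\theta_j$ updated simultaneously; $x^{(i)}$ and $\theta$ are vectors):<br>
$$\theta := \theta - \alpha \frac{1}{m} \sum_{i=1}^m [(h_\theta(x^{(i)})-y^{(i)}) \, x^{(i)}]$$
Vectorized implementation, with the training examples stacked as the rows of the matrix $X$:<br>
$$\theta := \theta - \frac{\alpha}{m} X^T (g(X\theta) - \vec{y})$$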
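
The update rule can be sketched as a plain-Python batch gradient descent loop. This is an illustration under stated assumptions, not the course's reference implementation: the toy data, the learning rate `alpha=0.1`, and the iteration count are arbitrary choices, and `gradient_step` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); fine for the moderate z used here."""
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j using all m examples."""
    m = len(y)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        error = sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += error * x_ij  # accumulate (h - y) * x_j
    # Build a new list so all components are updated from the old theta.
    return [t - alpha * g / m for t, g in zip(theta, grad)]

# Hypothetical toy data: intercept x_0 = 1 plus one centered feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.1)

predictions = [1 if sigmoid(sum(t * v for t, v in zip(theta, x_i))) >= 0.5 else 0
               for x_i in X]
print(theta)
print(predictions)
```

Returning a fresh list from `gradient_step` is what makes the update simultaneous: no $\theta_j$ is overwritten before the gradient for the others has been computed.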

<h3 align="right"><u>Video: Advanced Optimization</u></h3>

Gradient descent is one algorithm that can be used to minimize $J(\theta)$, but other algorithms may be more efficient:
* Conjugate gradient
* BFGS
* L-BFGS

Advantages: No need to manually pick $\alpha$, often faster than gradient descent<br>
Disadvantages: More complex

<img src="images/wk3_adv_824.png" width="90%">

<img src="images/wk3_adv_1314.png" width="90%">

<h3 align="right"><u>Video: Multiclass Classification: One-vs-all</u></h3>

Examples:
* Email foldering/tagging: Work, Friends, Family, Hobby (four classes: y=1/2/3/4)
* Medical diagnosis: Not ill, Cold, Flu (y=1/2/3)
* Weather: Sunny, Cloudy, Rain, Snow (y=1/2/3/4)


<img src="images/wk3_multi_430.png" width="90%">

Split up the classification problem into many one-vs-all problems.

Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.

On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$:<br>
$\max_i\,h_\theta^{(i)}(x)$
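
The prediction step of one-vs-all can be sketched as follows, assuming the per-class classifiers have already been trained. The parameter vectors in `thetas` are invented for illustration and do not come from any real fit; `predict_one_vs_all` is a hypothetical name.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); fine for the moderate z used here."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Return the class label i whose classifier output h^(i)(x) is largest."""
    probs = {
        label: sigmoid(sum(t * v for t, v in zip(theta, x)))
        for label, theta in thetas.items()
    }
    return max(probs, key=probs.get)

# Hypothetical, already-trained parameters: one vector per class label.
thetas = {
    1: [2.0, -1.0, -1.0],
    2: [-1.0, 1.0, -0.5],
    3: [-1.0, -0.5, 1.0],
}

x = [1.0, 3.0, 0.5]  # [x_0, x_1, x_2] with x_0 = 1
print(predict_one_vs_all(thetas, x))  # 2: classifier 2 gives the highest output
```

The outputs of the separate classifiers need not sum to 1; only their ranking matters for the prediction.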