# Week 3
## Classification and Representation
### Classification
Variable y is discrete-valued.
For example:
 * Is email spam or not? (binary)
 * Is an online transaction fraudulent or not? (binary)
 * Is a handwritten character an A, B, C, ...? (multiclass)

For binary classification, assign y to one of two values:
y in {0, 1} (0 = negative class, 1 = positive class)

Can try to fit a straight line to the data and then "threshold" it: if h(x) is above some threshold j (e.g. 0.5), then predict y = 1, otherwise predict y = 0.
 * However, the data may not fit linearly.

Logistic regression is a classification algorithm (despite the name) whose prediction h(x) always stays between 0 and 1, i.e. within the range of the actual values of y.

### Hypothesis Representation
What is the function used to represent the hypothesis h(x)?
h(x) = g(w^T x) (g of theta transpose x; theta and w both denote the parameter vector in these notes)

Where g(z) = 1 / (1 + e^-z)
^ This is also known as the sigmoid function or logistic function.
 * The sigmoid function has asymptotes at 0 (as z -> -inf) and at 1 (as z -> +inf), so it never produces a value < 0 or > 1. g(0) = 0.5.

So,
h(x) = 1 / (1 + e^-(w^T x))

 * h(x) is the estimated probability that y = 1 with input x.
 * For a logistic regression model trained on tumor size where y is {benign = 0, malignant = 1}, if h(x) is 0.7, then the tumor given by x has an estimated 70% chance of being malignant (and a 30% chance of being benign).

hw(x) = P( y=1 | x; w)
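
A minimal Python sketch of the sigmoid and the hypothesis above. The function names are my own (not from the course), and x is assumed to already include the bias term x0 = 1.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z), written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(w, x):
    """h_w(x) = g(w^T x): estimated probability that y = 1 given x."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return sigmoid(z)

print(sigmoid(0.0))                                    # 0.5
print(hypothesis([-3.0, 1.0, 1.0], [1.0, 2.0, 2.0]))   # g(1), roughly 0.73
```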

### Decision Boundary
Set 0.5 as the threshold on h(x): h(x) >= 0.5 means predict y = 1, h(x) < 0.5 means predict y = 0.
When is h(x) = 0.5? 
g(z) = 0.5 when z = 0
h(x) = g(w^T x) = 0.5 when w^T x = 0

Example:
hw(x) = g(w0 + w1x1 + w2x2)
If w = \[-3; 1; 1\]
Then y = 1 if -3 + x1 + x2 >= 0
x1 + x2 >= 3

 * graph of x1 + x2 >= 3 is the Decision Boundary
 * decision boundary is a property of hypothesis, not the data
 * But training data affects the weights.
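
The w = \[-3; 1; 1\] example can be checked with a short sketch. `predict` is a hypothetical helper name, and the test points are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x1, x2, threshold=0.5):
    """Return 1 if h_w(x) = g(w0 + w1*x1 + w2*x2) >= threshold, else 0."""
    z = w[0] + w[1] * x1 + w[2] * x2
    return 1 if sigmoid(z) >= threshold else 0

w = [-3.0, 1.0, 1.0]
print(predict(w, 1.0, 1.0))   # x1 + x2 = 2 < 3  -> 0
print(predict(w, 1.0, 2.0))   # x1 + x2 = 3      -> 1 (on the boundary, h = 0.5)
print(predict(w, 2.0, 2.0))   # x1 + x2 = 4 >= 3 -> 1
```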
 
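For hw(x) = g(5 - x1): y = 1 when 5 - x1 >= 0, i.e. when x1 <= 5. (The 0.5 threshold applies to h(x); the matching threshold on w^T x is 0.)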

**Non-linear decision boundaries**
hw(x) = g(w0 + (w1 * x1) + (w2 * x2) + (w3 * x1^2) + (w4 * x2^2))
Let's say w = \[-1; 0; 0; 1; 1\]
So y = 1 if (-1 + x1^2 + x2^2) >= 0, i.e. x1^2 + x2^2 >= 1
 * x1^2 + x2^2 = 1 is a circle of radius 1 centered on the origin

So the decision boundary is that circle: everything inside produces y = 0, everything on or outside it produces y = 1.
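
A quick sketch of the circular boundary, assuming the feature mapping \[1, x1, x2, x1^2, x2^2\] from the hypothesis above. The helper names and sample points are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(x1, x2):
    """Map (x1, x2) to [1, x1, x2, x1^2, x2^2] so a linear model can draw a circle."""
    return [1.0, x1, x2, x1 * x1, x2 * x2]

def predict(w, x1, x2):
    z = sum(wj * fj for wj, fj in zip(w, features(x1, x2)))
    return 1 if sigmoid(z) >= 0.5 else 0

w = [-1.0, 0.0, 0.0, 1.0, 1.0]
print(predict(w, 0.0, 0.0))   # inside the unit circle  -> 0
print(predict(w, 0.5, 0.5))   # inside the unit circle  -> 0
print(predict(w, 2.0, 0.0))   # outside the unit circle -> 1
```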

**Higher Order polynomials**
 * Decision boundary might be an ellipse or more complex shapes.


## Logistic Regression Model
### Cost Function
How to fit parameters w?
We could use the same cost function as linear regression, taking the average of the squared errors between h(x) and y.
 * However, this produces a non-convex function of w because of the nonlinear sigmoid inside h(x). It can have many local optima, so gradient descent is not guaranteed to reach the global minimum.
 
Logistic Regression Cost Function:
Cost(hw(x),y) = 
  -log(hw(x)) if y=1
  -log(1 - hw(x)) if y=0

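For y = 1, the cost -log(hw(x)) is 0 when hw(x) = 1 and grows without bound as hw(x) approaches 0. So this works well as a cost function.
As hw(x) -> 0 with y = 1, Cost -> inf
 * This penalizes confident false negatives very heavily. If we predict a very small probability of y = 1, and we are wrong, that will cost us a lot.
 * The y = 0 case mirrors this: -log(1 - hw(x)) is 0 when hw(x) = 0 and -> inf as hw(x) -> 1.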

This was a helpful video.

### Simplified Cost Function and Gradient Descent
We can compress the two lines from the two cases above into one equation.
Cost(hw(x),y) = -y * log(hw(x))  -  (1-y) log(1 - hw(x))

Basically the (1-y) in the second term becomes 0 if y = 1, and we only use the first term. The "-y" in the first term is 0 if y = 0, and we use the second term.
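
To convince myself the one-line form matches the two-case form, here is a small sketch. The function names are hypothetical, and h is assumed to be strictly between 0 and 1 (log(0) would raise an error).

```python
import math

def cost_piecewise(h, y):
    """Two-case form: -log(h) if y = 1, -log(1 - h) if y = 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def cost_combined(h, y):
    """One-line form: -y*log(h) - (1-y)*log(1-h)."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

for h in (0.01, 0.5, 0.99):
    for y in (0, 1):
        print(h, y, round(cost_piecewise(h, y), 4), round(cost_combined(h, y), 4))
```

The cost for y = 1 climbs from about 0.01 at h = 0.99 to about 4.6 at h = 0.01, which is the heavy penalty for a confident wrong answer.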

There are other cost functions we could use. This one is derived from the principle of maximum likelihood estimation, and it is convex.

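We want to minimize J(w), the average of Cost(hw(x), y) over the m training examples. The update rule has the same form as in linear regression: wj := wj - alpha * (1/m) * SUM over i of (hw(x_i) - y_i) * x_ij, updating all wj simultaneously. The only difference is that hw(x) is now the sigmoid of w^T x.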
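
A rough sketch of batch gradient descent for logistic regression, assuming the usual update wj := wj - alpha * (1/m) * SUM (h(x_i) - y_i) * x_ij. The toy data, the learning rate, and the iteration count are all made up; the first column of X is the bias term.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def J(w, X, y):
    """Average logistic cost over the training set."""
    return sum(-yi * math.log(h(w, xi)) - (1 - yi) * math.log(1.0 - h(w, xi))
               for xi, yi in zip(X, y)) / len(X)

def gradient_step(w, X, y, alpha):
    """One simultaneous update of every w_j."""
    m = len(X)
    errors = [h(w, xi) - yi for xi, yi in zip(X, y)]
    grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(len(w))]
    return [wj - alpha * gj for wj, gj in zip(w, grad)]

# Made-up, deliberately non-separable data so the weights stay bounded.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
initial_cost = J(w, X, y)
for _ in range(2000):
    w = gradient_step(w, X, y, alpha=0.1)
print(initial_cost, J(w, X, y), w)
```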


### Advanced Optimization
Besides gradient descent, there are more sophisticated optimization algorithms that can scale much better to large datasets.
Optimization algorithms:
 * Gradient descent
 * Conjugate gradient
 * BFGS
 * L-BFGS

Advantages of the last three over gradient descent:
 * Clever inner loop (a line search algorithm) means you don't need to manually pick the learning rate.
 * Often converge faster than gradient descent (partly due to the above)

Disadvantages:
 * More complex, so better to use a library implementation than to write your own.
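
To get a feel for the "inner-loop line search" idea, here is a sketch of plain gradient descent with a backtracking (Armijo) line search. This is NOT conjugate gradient or BFGS, just the simplest illustration I could write of picking the step size automatically. The constants (`shrink`, `c`, `max_halvings`) and the toy data are assumptions of mine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, X, y):
    """J(w): average logistic cost over the training set."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        h = min(max(h, 1e-12), 1.0 - 1e-12)  # keep log() away from 0
        total += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    return total / len(X)

def gradient(w, X, y):
    """dJ/dw_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
        for j, xj in enumerate(xi):
            grad[j] += err * xj / len(X)
    return grad

def line_search_step(w, X, y, step=1.0, shrink=0.5, c=1e-4, max_halvings=50):
    """One descent step; the step size is halved until the cost drops enough."""
    g = gradient(w, X, y)
    g_sq = sum(gj * gj for gj in g)
    j0 = cost(w, X, y)
    for _ in range(max_halvings):
        candidate = [wj - step * gj for wj, gj in zip(w, g)]
        if cost(candidate, X, y) <= j0 - c * step * g_sq:
            return candidate
        step *= shrink
    return w  # no acceptable step found; stay put

# Made-up, non-separable data; first column is the bias term.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
history = [cost(w, X, y)]
for _ in range(100):
    w = line_search_step(w, X, y)
    history.append(cost(w, X, y))
print(history[0], history[-1])
```

No learning rate is passed in by hand: each step starts from a generous step size and backs off until the cost actually decreases.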


## Multiclass Classification
### Multiclass Classification: One-vs-all
May have n possible discrete values of y.

One-vs-All / One-vs-rest Classification:
Turn the training set into n separate binary classification problems. For the ith class, we train a classifier h_i(x) that estimates the probability that x is in that class versus all the others. For a new x we (hopefully) find just one class that fits. More than one may fit, but we just choose the class whose classifier gives the highest probability.
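
A sketch of the prediction step for one-vs-all, assuming the n binary classifiers have already been trained. The weights below are hand-picked and hypothetical, chosen so that two classifiers both say "yes" and the higher probability wins.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(classifiers, x):
    """classifiers: one weight vector per class. Returns (best_class_index, probabilities)."""
    probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for w in classifiers]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs

classifiers = [
    [2.0, -1.0, 0.0],    # class 0: fires for small x1
    [-3.0, 1.0, 0.0],    # class 1: fires for large x1
    [-1.0, 0.0, 1.0],    # class 2: fires for large x2
]
x = [1.0, 0.5, 4.0]      # bias term, x1, x2

best, probs = predict_one_vs_all(classifiers, x)
print(best, [round(p, 3) for p in probs])
```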


## Quiz 
5/5 (100%)

*I'm honestly impressed with myself, didn't feel very confident in the material at all*


## Solving the Problem of Overfitting (Friday)
### The problem of overfitting
For linear regression, fitting a straight line to data that would be better matched by a quadratic function is "underfitting". The model is said to have "high bias".

Overfitting is the opposite, with "high variance". If you fit a high-order polynomial to data that a quadratic function would suit, the curve can pass through the training points but introduces unnecessary variability between them.

With too many features, the hypothesis can fit the training set very well but may not generalize to new data.

How to address overfitting:
 * Two options:
 ** Reduce the number of features, possibly by manually selecting which features to keep. (Later in the course: model selection algorithms.)
 ** Regularization: keep all features but reduce the magnitude of the parameters w. This works well when we have a lot of features, each of which contributes a bit to predicting y.


### Cost Function
Regularizing cost function:
 * We want smaller values for parameters w1, w2, ..., wn (by convention w0 is not penalized)
 * This will give us a "simpler" hypothesis - the function will be smoother.
 * So the hypothesis will be less prone to overfitting.

Example:
 * training set with 100 features, so 100 parameters w1 ... w100 (plus w0). Hard to pick in advance which features will be relevant or not.
 * to your cost function, add the sum of all wj^2 (j = 1..n) weighted by the regularization parameter lambda: J(w) = 1/(2m) * \[ SUM over i of (hw(x_i) - y_i)^2 + lambda * SUM from 1-n of wj^2 \]
 ** lambda controls the tradeoff between two goals: the first term wants to fit the training set well, the second term wants to keep the parameters small (a smoother function).
 ** adding this sum raises the cost of large parameters and thus dampens the values of w, except for those which contribute heavily to fitting the training set.
 ** if lambda is too large, we will underfit (the penalty dominates and w1 ... wn are all pushed toward 0, leaving roughly h(x) = w0).
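
A sketch of the regularized cost for linear regression, assuming the course's convention J(w) = 1/(2m) * \[ SUM of squared errors + lambda * SUM of wj^2 for j >= 1 \], with w0 left out of the penalty. The function name and the tiny dataset are mine.

```python
def regularized_cost(w, X, y, lam):
    """Squared-error cost plus an L2 penalty on w1..wn (w0 is not penalized)."""
    m = len(X)
    squared_error = sum(
        (sum(wj * xj for wj, xj in zip(w, xi)) - yi) ** 2 for xi, yi in zip(X, y)
    )
    penalty = lam * sum(wj ** 2 for wj in w[1:])
    return (squared_error + penalty) / (2 * m)

# Made-up data lying exactly on y = 2 * x1; first column is the bias term.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]

print(regularized_cost([0.0, 2.0], X, y, lam=0.0))   # perfect fit, no penalty -> 0.0
print(regularized_cost([0.0, 2.0], X, y, lam=3.0))   # same fit, penalty 3 * 2^2 / (2 * 3) -> 2.0
```

With lambda > 0 even a perfect fit has nonzero cost, so the optimizer is pulled toward smaller w1 ... wn.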


### Regularized Linear Regression
For linear regression without regularization, we used gradient descent and updated w all at once.

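With regularization, the update for each wj with j >= 1 gets an extra (lambda/m) * wj inside the gradient term (w0 is updated as before):
wj := wj - alpha * \[ (1/m) * SUM over i of (hw(x_i) - y_i) * x_ij + (lambda/m) * wj \]

Rearranging gives wj := wj * (1 - alpha * lambda/m) - alpha * (1/m) * SUM over i of (hw(x_i) - y_i) * x_ij. The factor (1 - alpha * lambda/m) is slightly less than 1 (e.g. ~0.99), so each step just shrinks wj a little and then does the normal gradient descent update.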
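
The "shrink, then normal step" view can be sketched as below for linear regression. This assumes the shrink factor is (1 - alpha * lambda / m) and that w0 is not shrunk; the function name and numbers are made up.

```python
def regularized_step(w, X, y, alpha, lam):
    """One regularized gradient descent step for linear regression, h(x) = w^T x."""
    m = len(X)
    errors = [sum(wj * xj for wj, xj in zip(w, xi)) - yi for xi, yi in zip(X, y)]
    new_w = []
    for j, wj in enumerate(w):
        grad_j = sum(e * xi[j] for e, xi in zip(errors, X)) / m
        shrink = 1.0 if j == 0 else 1.0 - alpha * lam / m   # w0 is left alone
        new_w.append(wj * shrink - alpha * grad_j)
    return new_w

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # first column is the bias term
y = [2.0, 4.0, 6.0]

# Here alpha * lam / m = 0.1, so w1 is multiplied by 0.9 before the usual update.
print(regularized_step([0.0, 1.0], X, y, alpha=0.1, lam=3.0))
```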


### Regularized Logistic Regression
Add to the logistic regression cost function: lambda/(2m) * SUM from 1-n of wj^2

The derivative with respect to w0 does not change, but the derivative with respect to every other wj gains an extra term: add (lambda/m) * wj
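
A sketch of the regularized gradient for logistic regression, assuming the extra term is (lambda/m) * wj for j >= 1 and nothing extra for the bias weight. The function name and the two-point dataset are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_gradient(w, X, y, lam):
    """Gradient of the regularized logistic cost; the bias weight gets no penalty term."""
    m = len(X)
    errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi for xi, yi in zip(X, y)]
    grad = []
    for j, wj in enumerate(w):
        g = sum(e * xi[j] for e, xi in zip(errors, X)) / m
        if j >= 1:
            g += (lam / m) * wj
        grad.append(g)
    return grad

# Two made-up points with x1 = 0, so the data term of the gradient is zero
# and only the regularization term shows up.
X = [[1.0, 0.0], [1.0, 0.0]]
y = [0, 1]
print(regularized_gradient([0.0, 2.0], X, y, lam=0.0))   # [0.0, 0.0]
print(regularized_gradient([0.0, 2.0], X, y, lam=4.0))   # [0.0, 4.0]
```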

There are much more powerful nonlinear classifiers than logistic regression with polynomial features. Learning these next!


## Quiz
5/5 (100%)

## Programming Assignment (Saturday)