# Week 6
## Evaluating a Learning Algorithm
### Deciding What to Try Next
There is usually a difference between someone who knows the math but doesn't really understand how to apply it, and someone who deeply understands it and knows how to apply it to real-world problems (that's me).

We're gonna go through some examples on how to apply the tools we have learned.

Debugging a learning algorithm
 - More training examples (not always as helpful as you might think)
 - Try smaller sets of features (to prevent overfitting)
 - Try gathering additional features (another type of gathering more data, may be a large project and you can't be sure how helpful it will be beforehand).
 - Add polynomial features (x^2, x1x2, etc.)
 - Decrease / increase regularization lambda

A lot of times, people will go by gut feeling when deciding how to debug. This wastes time and money.

Machine learning diagnostic:
 - a test to run to gain insight into what is/isn't working.
 - These can take time to implement, but will usually save more time by preventing you from going down the wrong path.

### Evaluating a Hypothesis
Overfitting - just because a hypothesis works well to predict training examples, doesn't mean it will generalize.

Split the data into training and test sets (usually something like a 70% / 30% split). It's best to randomize which examples go in each set. After a number of training iterations, compute the error on the test set and make sure it is not increasing.
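A minimal sketch of the randomized 70% / 30% split described above, in plain Python. The function name `train_test_split_70_30`, the `seed` parameter, and the toy data are my own assumptions for illustration.

```python
import random

def train_test_split_70_30(examples, train_fraction=0.7, seed=0):
    """Shuffle the examples, then split them into (training, test) lists."""
    shuffled = list(examples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # randomize which examples land in each set
    cutoff = round(train_fraction * len(shuffled))
    return shuffled[:cutoff], shuffled[cutoff:]

# Hypothetical data: ten (x, y) pairs.
data = [(x, 2 * x) for x in range(10)]
train_set, test_set = train_test_split_70_30(data)
print(len(train_set), len(test_set))
```

Fixing the seed keeps the split reproducible, which helps when comparing models against the same test set.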

Misclassification error: err(hOfX, y) =
 - 1 if hOfX >= 0.5 and y = 0
 - 1 if hOfX < 0.5 and y = 1
 - 0 otherwise

Test error = 1/m SUM(err(hOfXi, yi)) for all i

 - Presumably, you can also adapt misclassification errors to n output classes.
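The misclassification error and the test error above can be sketched as follows for the binary case. The function names and the sample values are assumptions made for this example.

```python
def misclassification_error(h, y, threshold=0.5):
    """Return 1 if the thresholded prediction disagrees with the label, else 0."""
    predicted = 1 if h >= threshold else 0
    return 0 if predicted == y else 1

def average_misclassification(hypothesis_outputs, labels):
    """Test error: the mean misclassification error over all m examples."""
    m = len(labels)
    total = sum(misclassification_error(h, y)
                for h, y in zip(hypothesis_outputs, labels))
    return total / m

outputs = [0.9, 0.2, 0.6, 0.4]   # hypothetical h(x) values on the test set
labels = [1, 0, 0, 1]
print(average_misclassification(outputs, labels))  # 2 of 4 wrong -> 0.5
```

For n output classes, one plausible adaptation is to compare the predicted class (the one with the largest output) against the true class instead of thresholding at 0.5.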

### Model Selection and Train/Validation/Test Sets
How do you decide what degree of polynomial to fit to a dataset? What features should you include? How do you choose the right lambda?

- These are called model selection problems

To decide on a polynomial degree for your model, you could do it the naive way: train each candidate model, plot the test set error of each, and pick the model which minimizes this value.
 - How do you know that another model wouldn't do better with more training iterations, or perform better on a general data set?
 - We actually chose the value d (polynomial degree) which fits best to the test set. We may have overfit!!
 
Instead, let's split the original data into three sets: training set, cross validation set, and test set. Usually pick a ratio like 60%/20%/20%.
 - Now we can do the above steps, except use the cross validation set to compare the performance of each different degree model. Pick the model with the lowest cross validation error, and then use the test set to measure the generalization error of that model.
 - We expect the chosen model's cross validation error to be lower than its test error, since d was fit to the cross validation set, but the test error now gives us a more accurate idea of our performance.
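Here is a rough sketch of that procedure: fit each polynomial degree on the training set, pick d by cross validation error, and only then touch the test set. The least-squares solver (normal equations with Gaussian elimination), all function names, and the toy quadratic data are assumptions for illustration. It does no handling of singular systems.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (A^T A) c = A^T y."""
    n = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, n):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(ata[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (aty[i] - tail) / ata[i][i]
    return coeffs  # coeffs[i] multiplies x^i

def squared_error_cost(coeffs, xs, ys):
    """J = 1/(2m) * sum((h(x) - y)^2), with no regularization term."""
    def h(x):
        return sum(c * x ** i for i, c in enumerate(coeffs))
    return sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

def select_degree(train, cv, degrees):
    """Fit every degree on the training set; keep the lowest cross validation error."""
    best = None
    for d in degrees:
        coeffs = fit_polynomial(train[0], train[1], d)
        cv_error = squared_error_cost(coeffs, cv[0], cv[1])
        if best is None or cv_error < best[1]:
            best = (d, cv_error, coeffs)
    return best

# Hypothetical data drawn from y = x^2, already split into three sets.
train = ([-2, -1, 0, 1, 2, 3], [4, 1, 0, 1, 4, 9])
cv = ([-1.5, 0.5, 2.5], [2.25, 0.25, 6.25])
held_out = ([-0.5, 1.5], [0.25, 2.25])

best_degree, cv_error, coeffs = select_degree(train, cv, degrees=[0, 1, 2])
generalization_error = squared_error_cost(coeffs, held_out[0], held_out[1])
print(best_degree, generalization_error)
```

The test set is only used once, after d has been chosen, so its error is not biased by the selection step.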

## Bias vs Variance
### Diagnosing Bias vs Variance
Training error tends to decrease as polynomial degree of the model increases (as we start to possibly overfit).

Cross validation error is roughly U-shaped: it decreases as polynomial degree increases, but rises again once the model starts to overfit the training set and stops generalizing to the cross validation set.

If training error and cross validation error are high, you have a "high bias" problem. If cross validation is high and training error is low, you have a "high variance" problem.
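The rule of thumb above could be written down as a tiny helper. This is only a sketch: the `acceptable_error` cutoff for what counts as "high" is an assumption you would have to choose for your own problem.

```python
def diagnose(train_error, cv_error, acceptable_error):
    """Rough bias/variance diagnosis from training and cross validation errors."""
    if train_error > acceptable_error and cv_error > acceptable_error:
        return "high bias"       # underfitting: both errors are high
    if cv_error > acceptable_error:
        return "high variance"   # overfitting: low training error, high CV error
    return "ok"

print(diagnose(train_error=0.30, cv_error=0.32, acceptable_error=0.10))
print(diagnose(train_error=0.01, cv_error=0.30, acceptable_error=0.10))
```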

### Regularization and Bias/Variance
Similar to bias/variance in the polynomial degree model selection section, but this time for the choice of lambda.

Large lambda can result in high bias (underfit). Each of the thetas (other than theta_0) will approach 0 as lambda increases, and so our hypothesis will be nearly flat.

Small lambda can result in high variance (overfit). This is basically no regularization, so it performs just like a high-order polynomial fit to the data.

Intermediate lambda gives us a reasonable fit to the data.

How to choose lambda:
 - Use the regularization term only when training. Leave it out when calculating training, validation, and test set errors.
 - Try training models with lambdas in the range \[0, 10\]. Your lambdas should increase exponentially, e.g. 0, 0.1, 0.2, 0.4, 0.8, ... up to roughly 10.
 - Use the cross validation set to measure the error for each model, and pick the model that minimizes the error.
 - Finally, check how well the chosen model performs on the test set (J(theta) without the regularization term).
 - Can you do this at the same time as picking polynomial degree? Are there any dependencies?
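A small sketch of the lambda search above. To keep it short it assumes a one-parameter model h(x) = theta * x, for which the regularized cost has a closed-form minimum. The function names and toy data are made up for illustration.

```python
def lambda_grid(start=0.1, maximum=10.0):
    """Return 0 followed by values that double from `start` while they stay <= maximum."""
    grid = [0.0]
    value = start
    while value <= maximum:
        grid.append(value)
        value *= 2
    return grid

def train_regularized(xs, ys, lam):
    """Minimize J = 1/(2m) * (sum((theta*x - y)^2) + lam * theta^2) in closed form."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def unregularized_error(theta, xs, ys):
    """Squared-error cost WITHOUT the lambda term, used for evaluation only."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

def select_lambda(train, cv, grid):
    """Train once per lambda, then keep the lambda with the lowest CV error."""
    results = []
    for lam in grid:
        theta = train_regularized(train[0], train[1], lam)  # lambda used in training only
        results.append((unregularized_error(theta, cv[0], cv[1]), lam, theta))
    return min(results)  # tuples compare by CV error first

# Hypothetical data: training labels are 10% too high, CV labels follow y = 2x.
train = ([1, 2, 3], [2.2, 4.4, 6.6])
cv = ([4, 5], [8, 10])

best_cv_error, best_lambda, best_theta = select_lambda(train, cv, lambda_grid())
print(best_lambda, best_theta)  # picks lambda = 1.6 for this toy data
```

In this toy case a little shrinkage pulls theta back toward 2, which is why a non-zero lambda wins on the cross validation set.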

### Learning Curves
Plot training set and cross validation set errors for linearly increasing training set sizes.

Usually training error will increase logarithmically (initially fits very well, and will begin to level out over time).

Usually cross validation error will decrease exponentially (initial high error, decreases and levels out).

When you have a high bias, your training and cross validation errors will approach each other and level off at a relatively high error value. Note that more training data will not help very much, since as more data is added you don't get very good returns on your error.

When you have a high variance, there will be a gap between your training and cross validation errors. Your training set will be overfit, resulting in a low training error. Your cross validation error will be high, but might continue to decrease with a higher training set size. In this scenario, it may help to get more data to train your model on.
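A sketch of how the points of a learning curve might be computed. It assumes a simple straight-line model fit by closed-form least squares, and deliberately fits it to quadratic toy data so the curve shows the high bias pattern. All names and data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line y = a + b*x (constant fit if x has no spread)."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    spread = sum((x - mean_x) ** 2 for x in xs)
    if spread == 0:
        return mean_y, 0.0
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / spread
    return mean_y - slope * mean_x, slope

def cost(model, xs, ys):
    """Squared-error cost 1/(2m) * sum((h(x) - y)^2)."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

def learning_curve(train_x, train_y, cv_x, cv_y):
    """Return (m, training error, CV error) for training set sizes m = 1 .. len(train_x)."""
    points = []
    for m in range(1, len(train_x) + 1):
        model = fit_line(train_x[:m], train_y[:m])
        train_err = cost(model, train_x[:m], train_y[:m])  # error on the m examples used
        cv_err = cost(model, cv_x, cv_y)                   # error on the full CV set
        points.append((m, train_err, cv_err))
    return points

# Hypothetical data from y = x^2: a straight line is too simple (high bias).
train_x = [0, 1, 2, 3, 4, 5]
train_y = [x * x for x in train_x]
cv_x = [0.5, 1.5, 2.5, 3.5, 4.5]
cv_y = [x * x for x in cv_x]

points = learning_curve(train_x, train_y, cv_x, cv_y)
for m, train_err, cv_err in points:
    print(m, round(train_err, 3), round(cv_err, 3))
```

Note that the training error is measured only on the m examples used for fitting, while the cross validation error always uses the whole cross validation set.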

### Deciding What to Do Next Revisited
Debugging a learning algorithm:
 - More training examples (this will help if you have a high variance problem. i.e. if you are overfitting to the training set)
 - Smaller sets of features (this will also help high variance problems and will prevent overfitting)
 - Adding more (or higher order) features (this will help high bias problems. If your current hypothesis is too simple, it may help to add more higher-order features)
 - Decreasing lambda (this will fix high bias problems)
 - Increasing lambda (this will fix high variance problems)

Neural network-specific issues:
 - Small networks have fewer parameters and so are more prone to underfitting. They are, however, computationally cheaper.
 - Large networks have more parameters and so are more prone to overfitting. They are more computationally expensive.
 - You can use the cross validation technique to select the number of hidden layers to use in a neural network.

## Quiz
5/5 (100%)

## Building A Spam Classifier
### Prioritizing what to work on
How to build a spam classifier
1. Supervised learning: x features of email, y=spam(1) or not(0)
2. features x: Choose 100 words indicative of spam or not
3. Feature vector would have dimension 100
4. Usually you pick the most frequently occurring words for your features.
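The feature vector from the steps above might look like this in code. The five-word vocabulary is a stand-in for the ~100 words mentioned above, and the tokenizing regex is an assumption for illustration.

```python
import re

def email_features(email_text, vocabulary):
    """Return a 0/1 vector: entry j is 1 if vocabulary[j] appears in the email."""
    words = set(re.findall(r"[a-z0-9']+", email_text.lower()))
    return [1 if word in words else 0 for word in vocabulary]

vocabulary = ["andrew", "buy", "deal", "discount", "now"]   # hypothetical word list
x = email_features("Buy now! Huge DISCOUNT on watches.", vocabulary)
print(x)  # [0, 1, 0, 1, 1]
```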

How do you get a low error on a spam classifier?
- Collect lots of data. This will help sometimes, but not all the time.
- Develop sophisticated features based on email routing information
- Develop more sophisticated features based on the email message.
  - For example, can we combine "discount" and "discounts" into the same feature? Can we correct misspellings?

Even people who have spent a lot of time working on the spam problem don't know which option is best to explore. Far too many pick a solution up front rather than trying to solve the problem in the most effective way.

### Error Analysis
Start with a simple algorithm that is quick to implement (e.g. < 1 day). Use it to get some errors on cross-validation data.

Plot learning curves and use them to pick an approach: more data, more features, etc. See if you can spot trends in the examples that your algorithm failed on, and then address those shortcomings.

It's important to have a numerical way to evaluate the performance of your algorithm (e.g. cross validation error), so you can quickly tell whether a change helped.


## Handling Skewed Data
### Error Metrics for Skewed Classes
What if one of the classes is very rare? We may want to minimize false positives (we predicted true but it was actually false) and false negatives (we predicted false but it was actually true).

Precision = true positives / predicted positives = true positives / (true positives + false positives)

Recall = true positives / actual positives = true positives / (true positives + false negatives)

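We can use precision and recall, rather than just classification accuracy, to evaluate the performance of an algorithm with skewed classes. With a very rare class, always predicting the common class gives high accuracy but zero recall.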
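A minimal sketch of the two formulas above, computed from 0/1 predictions and labels. The function name and the toy data are assumptions. Returning 0 when a denominator is 0 is a convention chosen for this example.

```python
def precision_recall(predictions, labels):
    """Compute (precision, recall) for binary 0/1 predictions and labels."""
    true_pos = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    false_pos = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    false_neg = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]            # hypothetical skewed data: 20% positive
always_negative = [0] * 10                          # 80% accurate, but useless
some_classifier = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(precision_recall(always_negative, labels))    # (0.0, 0.0)
print(precision_recall(some_classifier, labels))    # (0.5, 0.5)
```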


### Trading Off Precision and Recall
Higher precision leads to lower recall (we will only deliver cancer diagnoses when we are very confident, but we may miss some people).

Higher recall leads to lower precision.

However, the relationship may not be linear. Lemme guess... you can plot these?

- Taking the average of precision and recall is usually not very good.
- F1 score (or just "F Score"): 2(PR/(P+R))
- Since F1 takes the product, if either precision or recall is very low, you won't get a good score. That makes it good at striking a balance.
- Also, a perfect F score is "1".
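To see the trade-off and the F1 score together, here is a sketch that sweeps the decision threshold over some made-up hypothesis outputs. The function names, the thresholds, and the data are assumptions for illustration.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R); defined here as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_at(threshold, probabilities, labels):
    """Predict 1 when h(x) >= threshold, then compute (precision, recall)."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

probabilities = [0.95, 0.80, 0.60, 0.55, 0.30, 0.10]   # hypothetical h(x) on CV examples
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall_at(threshold, probabilities, labels)
    print(threshold, round(p, 2), round(r, 2), round(f1_score(p, r), 2))
```

On this toy data, raising the threshold pushes precision up and recall down. Picking the threshold with the highest F1 on the cross validation set is one reasonable way to choose between them.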


## Using Large Data Sets
How much data should you train on?

- Some algorithms which don't perform well on small training set sizes perform great on larger training set sizes.
- Can a human expert confidently predict y given x? If so, then a very large training set is likely to improve the algorithm. If a human expert can predict y, then x probably contains enough information for a model with many parameters to have low bias, and the very large training set keeps that model's variance low.

I'm not sure if I follow that argument. A human expert may be able to sort through many features, but an algorithm might have high variance due to overfitting. Anyway, I like the common sense of picking a complete set of features such that you yourself could solve the problem with the given info.

## Quiz 2
5/5 (100%)