# Evaluating a Learning Algorithm

- It doesn't matter much that your parameters fit the training set well if the model doesn't generalize to data it hasn't seen.
- The dataset is broken into three parts: training set, cross-validation set, and testing set.
- A typical way to split the dataset: 60% training, 20% cross-validation, 20% test.
- We can now calculate three separate error values for the three different sets using the following method:
    - Optimize the parameters in Θ using the training set for each polynomial degree.
    - Find the polynomial degree d with the least error using the cross validation set.
    - Estimate the generalization error using the test set with Jtest(Θ(d)), where Θ(d) is the parameter vector learned for the polynomial degree d that had the lowest cross-validation error.
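
The three steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the synthetic noisy-quadratic data, the helper names (`poly_features`, `fit`, `cost`) and the range of degrees 1 to 8 are made up for the example and are not from the course material.

```python
import numpy as np

def poly_features(x, degree):
    """Columns [1, x, x^2, ..., x^degree] for a 1-D input array."""
    return np.vander(x, degree + 1, increasing=True)

def fit(x, y, degree):
    """Least-squares fit of the parameters Theta for one polynomial degree."""
    theta, *_ = np.linalg.lstsq(poly_features(x, degree), y, rcond=None)
    return theta

def cost(theta, x, y):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2), no regularization."""
    degree = len(theta) - 1
    residual = poly_features(x, degree) @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Synthetic data (an assumption for illustration): a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=200)

# Shuffle, then split 60% / 20% / 20%.
idx = rng.permutation(len(x))
train, cv, test = np.split(idx, [int(0.6 * len(x)), int(0.8 * len(x))])

# Step 1: optimize Theta on the training set for each polynomial degree.
thetas = {d: fit(x[train], y[train], d) for d in range(1, 9)}

# Step 2: pick the degree with the lowest cross-validation error.
cv_errors = {d: cost(thetas[d], x[cv], y[cv]) for d in thetas}
best_d = min(cv_errors, key=cv_errors.get)

# Step 3: estimate the generalization error on the untouched test set.
j_test = cost(thetas[best_d], x[test], y[test])
print(f"best degree: {best_d}, J_test: {j_test:.3f}")
```

The test set is only touched once, after the degree has been chosen, so `j_test` is not biased by the model selection step.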


**Additional Notes**
- **Training phase:** you present your data from your "gold standard" and train your model by pairing the input with the expected output.
- **Validation/Test phase:** you estimate how well your model has been trained (this depends on the size of your data, the value you would like to predict, the input, etc.) and estimate model properties (mean error for numeric predictors, classification errors for classifiers, recall and precision for IR models, etc.).
- **Application phase:** now you apply your freshly developed model to real-world data and get the results. Since you normally don't have any reference value in this type of data (otherwise, why would you need your model?), you can only speculate about the quality of your model output using the results of your validation phase.

The validation phase is often split into two parts:
- In the first part you just look at your models and select the best-performing approach using the validation data (= validation).
- Then you estimate the accuracy of the selected approach (= test).

Hence the separation into three parts, for example 50/25/25.


# Bias and Variance 
- Look at the graph
- **LEFT**: This is a high-bias problem. The model underfits, so it performs poorly on the training set as well as on the CV and test sets.
- **RIGHT**: This is a high-variance problem. The model overfits the training set, so what it learned does not carry over to the CV or test set.
- **BIAS**: Both the training error and the CV error will be high
- **VARIANCE**: The training error will be low (you are fitting the training set well), and the CV error will be much higher than the training error

- **Linear Regression with regularization**
    - When lambda is large: we tend to underfit the data
    - When lambda is small: we tend to overfit the data
    - NEED TO LOOK FOR AN INTERMEDIATE LAMBDA
    - LARGE LAMBDA: **HIGH BIAS** because the penalty shrinks the parameters toward zero, which flattens the hypothesis and makes it underfit. An underfit model is too simple to capture the real pattern in the data, so its predictions are wrong in a similar, consistent way.
    - SMALL LAMBDA: **HIGH VARIANCE** because lambda is close to zero (the penalty adds almost nothing, so the equation stays essentially the same) and the model overfits. An overfit model gives predictions that are scattered around the correct value and change a lot from one training sample to the next.
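
A minimal sketch of how an intermediate lambda could be searched for, assuming a degree-8 polynomial model, synthetic noisy-sine data, and a hand-picked list of candidate lambdas (all of these are illustrative choices, not from the notes). Each lambda is fit with the regularized normal equation on the training set, and the models are compared using the unregularized cost on the cross-validation set.

```python
import numpy as np

def poly_features(x, degree):
    """Columns [1, x, x^2, ..., x^degree]; column 0 is the bias term."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Regularized normal equation: theta = (X'X + lam * L)^-1 X'y,
    where L is the identity matrix with a 0 in the bias position."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # the bias parameter is not penalized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def cost(theta, X, y):
    """Unregularized squared-error cost, so different lambdas are comparable."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(1)
x_train, x_cv = rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 20)
y_train = np.sin(3 * x_train) + rng.normal(scale=0.2, size=20)
y_cv = np.sin(3 * x_cv) + rng.normal(scale=0.2, size=20)

X_train, X_cv = poly_features(x_train, 8), poly_features(x_cv, 8)

lambdas = [0, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
results = []
for lam in lambdas:
    theta = fit_regularized(X_train, y_train, lam)
    j_train, j_cv = cost(theta, X_train, y_train), cost(theta, X_cv, y_cv)
    results.append((lam, j_train, j_cv))
    print(f"lambda={lam:<5} J_train={j_train:.4f} J_cv={j_cv:.4f}")

best_lam = min(results, key=lambda r: r[2])[0]
print("lambda with lowest CV error:", best_lam)
```

The training cost can only go up as lambda grows, which is why the choice has to be made on the cross-validation cost rather than the training cost.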
<img src="Image7.png">
- **Learning Curves**
     - As you add more observations, the training error increases: a model can fit a handful of points almost perfectly, but it is harder to fit every point in a larger set. The cross-validation error usually decreases at the same time, because the model generalizes better.
     - The more data, the closer together the training and cross-validation costs get
     - **If a learning algorithm is suffering from high bias**, getting more training data WILL NOT by itself help much
     - This is because high bias means the model is underfitting. If we create a simple model that can't represent the real pattern, we will consistently get the wrong result, so more data points won't change much.
<img src="Image8.png">
<img src="Image9.png">
         - There's a high error on the training set because we are underfitting
         - Look at the graphs above. When we have a simple model (since that's when underfitting occurs), adding more observations doesn't fix the underfitting.


- **If a learning algorithm is suffering from high variance**, getting more training data is likely to help
     - This is because high variance means the model overfits. Since the model fits the noise in its particular training sample, its predictions change a lot from one sample to the next. As we get more data, the noise averages out and the model is pushed toward the underlying pattern.
<img src="Image10.png">
<img src="Image11.png">
         - The cross-validation error continues to decrease as more data is added, because the model can no longer fit the noise and has to capture the underlying pattern instead
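
Both learning-curve cases can be reproduced with a small experiment. The sketch below is an illustration under assumptions: the data is a synthetic noisy quadratic, a degree-1 polynomial stands in for the high-bias model and a degree-8 polynomial for the high-variance model, and the helper names are invented for the example.

```python
import numpy as np

def half_mse(coeffs, x, y):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    residual = np.polyval(coeffs, x) - y
    return float(np.mean(residual ** 2)) / 2

def learning_curve(x_train, y_train, x_cv, y_cv, degree, sizes):
    """For each m, train on the first m examples, then report the cost
    on those m examples and on the full cross-validation set."""
    rows = []
    for m in sizes:
        coeffs = np.polyfit(x_train[:m], y_train[:m], degree)
        rows.append((m,
                     half_mse(coeffs, x_train[:m], y_train[:m]),
                     half_mse(coeffs, x_cv, y_cv)))
    return rows

# Synthetic data (an assumption for illustration): y = x^2 plus noise.
rng = np.random.default_rng(2)
x_train, x_cv = rng.uniform(-3, 3, 100), rng.uniform(-3, 3, 100)
y_train = x_train ** 2 + rng.normal(scale=0.5, size=100)
y_cv = x_cv ** 2 + rng.normal(scale=0.5, size=100)

# High bias: a straight line cannot represent a parabola, however much data it sees.
high_bias = learning_curve(x_train, y_train, x_cv, y_cv, 1, [2, 5, 10, 20, 50, 100])
# High variance: a degree-8 polynomial overfits small samples but improves with more data.
high_var = learning_curve(x_train, y_train, x_cv, y_cv, 8, [10, 20, 50, 100])

for name, rows in (("high bias", high_bias), ("high variance", high_var)):
    print(name)
    for m, j_train, j_cv in rows:
        print(f"  m={m:3d}  J_train={j_train:10.3f}  J_cv={j_cv:10.3f}")
```

In the high-bias run both costs should level off at a high value, while in the high-variance run the cross-validation cost should keep dropping toward the training cost as m grows.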
         

# Extras
**If you have a HIGH VARIANCE PROBLEM:**
- You can get more training examples
- Try smaller sets of features (because you are overfitting)
- Try increasing lambda, so you don't overfit the training set as much. The higher the lambda, the more the regularization applies.

**If you have a HIGH BIAS PROBLEM:**
- Try getting additional features
- Try adding polynomial features
- Try decreasing lambda, so you can try to fit the data better. The lower the lambda, the less the regularization applies.

**Model Complexity Effects:**
- Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model consistently fits poorly.
- Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
- In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.

### Overfitting in Machine Learning
Overfitting refers to a model that models the training data too well.

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.

### Underfitting in Machine Learning
Underfitting refers to a model that can neither model the training data nor generalize to new data.

An underfit machine learning model is not a suitable model, and this will be obvious because it performs poorly even on the training data.

---

# SECOND PART

# Building a Spam Classifier
- So how could you spend your time to improve the accuracy of this classifier?
    - Collect lots of data (for example the "honeypot" project, although more data doesn't always help)
    - Develop sophisticated features (for example: using email header data in spam emails)
    - Develop algorithms to process your input in different ways (recognizing misspellings in spam).
    - It is difficult to tell which of the options will be most helpful.
    
- Selecting the model or approach can be difficult because there's no clear recipe for which one to implement.
    - Thus, Andrew suggests building a simple model first and then plotting a learning curve to see whether more data is likely to help
    - Error analysis: look at the mistakes the model made, then group those errors into categories to find what they have in common.
        - Stemming software: in the SPAM/NONSPAM example, it treats different forms of the same word as one word.
- It is very important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance. For example if we use stemming, which is the process of treating the same word with different forms (fail/failing/failed) as one word (fail), and get a 3% error rate instead of 5%, then we should definitely add it to our model. However, if we try to distinguish between upper case and lower case letters and end up getting a 3.2% error rate instead of 3%, then we should avoid using this new feature. Hence, we should try new things, get a numerical value for our error rate, and based on our result decide whether we want to keep the new feature or not.


- **Error Analysis**
     - The recommended approach to solving machine learning problems is to:
     - Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
     - Plot learning curves to decide if more data, more features, etc. are likely to help.
     - Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
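
Spotting a trend in the errors usually comes down to tallying them by category. A minimal sketch, assuming the misclassified cross-validation emails have already been labeled by hand; the category names and counts below are hypothetical and only there to show the mechanics.

```python
from collections import Counter

# Hypothetical hand-assigned categories for 10 misclassified CV emails.
misclassified = [
    "pharma", "phishing", "phishing", "replica", "phishing",
    "other", "pharma", "phishing", "other", "phishing",
]

counts = Counter(misclassified)
total = len(misclassified)

# Largest category first: that is where new features are most likely to pay off.
for category, n in counts.most_common():
    print(f"{category:10s} {n:3d}  {n / total:.0%}")
```

Here half of the errors fall into one category, so features aimed at that category would be the first thing to try, followed by re-measuring the single error number on the cross-validation set.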

# Handling Skewed Data

<img src="Confusion-Matrix.png">

- An improvement in classification accuracy alone isn't a reliable way to judge the model. For example, if accuracy improves from 99.2% to 99.5%, are we making better predictions or are we simply predicting 0 more often?
- We create a Prediction/Actual box (in a binary problem, it would be 2 by 2).

- **Classifier Accuracy**
    - (true positives + true negatives) / (total examples)
    - This is a good measure UNLESS the data is skewed toward one class
    - If the data is skewed, we can't tell whether the model is actually good or is simply predicting the majority class most of the time


- Another method to use is **Recall/Precision**:
    - **Precision**: Out of all the patients that we predicted have cancer (or 1), what fraction actually have cancer?
        - TRUE POSITIVES/PREDICTED POSITIVES -> TRUE POSITIVES/(TRUE POSITIVES + FALSE POSITIVES)
        - Using the box, this is represented as row 1
    - **Recall**: Out of all the patients that actually have cancer (or 1), what fraction did we correctly detect as having cancer?
        - TRUE POSITIVES/ACTUAL POSITIVES -> TRUE POSITIVES/(TRUE POSITIVES + FALSE NEGATIVES)
        - Using the box, this is represented as col. 1 
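
The formulas above can be checked on a small skewed example. This sketch assumes 1000 patients of whom 5 have cancer, and two made-up classifiers; returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the notes.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    # Convention (an assumption): 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Convention (an assumption): 0.0 when there are no actual positives.
    return tp / (tp + fn) if tp + fn else 0.0

# Skewed data: 5 of 1000 patients actually have cancer.
y_true = [1] * 5 + [0] * 995

# Classifier A always predicts 0.
always_zero = [0] * 1000
# Classifier B catches 4 of the 5 cases and raises 6 false alarms.
useful = [1] * 4 + [0] * 1 + [1] * 6 + [0] * 989

for name, y_pred in (("always 0", always_zero), ("useful", useful)):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    print(f"{name:9s} accuracy={accuracy(tp, fp, fn, tn):.3f} "
          f"precision={precision(tp, fp):.2f} recall={recall(tp, fn):.2f}")
```

Classifier A has the higher accuracy even though it never detects a single case, which is exactly the situation precision and recall are meant to expose.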

#### Trading Off Precision and Recall
- In the cancer case, we can raise the threshold from 0.5 to 0.7, so we only predict 1 when we are more confident
- If we do this, we will have a higher precision and a lower recall.
- Precision looks at the predictions we made: how well did we predict? Even if we didn't flag ALL the positive cases, if the positive predictions we did make were mostly right, we have good precision.
- Recall, however, looks at all the actual positives (the 1's) and asks how many of them we caught. It doesn't penalize false alarms: if we predicted 1 for every example, recall would be perfect, because every actual 1 would be predicted as 1.
- Averaging the two scores is NOT a good way to combine them!
- A better way to combine them is the **F-SCORE or F1 Score**: 2 × precision × recall / (precision + recall)
- **F-Score** falls between 0 and 1, 0 being the worst and 1 being the best
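
The trade-off and the F1 score can be seen on a toy example. This is a sketch under assumptions: the ten labels and predicted probabilities are invented, and returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def precision_recall(y_true, scores, threshold):
    """Predict 1 when the score is at or above the threshold, then score the result."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Invented labels and predicted probabilities of the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.65, 0.55, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

for threshold in (0.5, 0.7):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} F1={f1_score(p, r):.2f}")

# Why the plain average misleads: a classifier that predicts 1 for almost everything.
p, r = 0.02, 1.0
print(f"average={(p + r) / 2:.2f}  F1={f1_score(p, r):.3f}")
```

Raising the threshold from 0.5 to 0.7 trades recall for precision here. In the last line the plain average looks respectable while F1 stays close to 0, because F1 is dragged down by whichever of the two scores is small.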


# Using Large Data Sets
- Some researchers have compared different algorithms while varying the amount of training data. What they found is that, as the training set grew, the algorithms kept improving and ended up performing roughly the same.
- To have low bias: make sure the model has enough parameters to work with.
- To have low variance: make sure the model has enough data (a lot).
