# Supervised and Unsupervised Learning

## Supervised Learning

In supervised learning, we are given a data set and already know what the correct output should look like, under the assumption that there is a relationship between the input and the output.

Supervised learning problems are categorized into:
* **Regression problem**: predicting results within a continuous output.
* **Classification problem**: predicting results in a discrete output (discrete categories), for example with logistic regression.

**Example of regression problem**

Given features of houses, such as their size, we try to predict their price. Price is a function of the size and is a continuous output.
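The house-price example can be sketched as a one-feature least-squares fit. This is a minimal illustration, not a prescribed method: the sizes and prices are made-up numbers, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: house size (square meters) and sale price.
sizes = [50, 70, 90, 110]
prices = [150000, 210000, 270000, 330000]

slope, intercept = fit_line(sizes, prices)

# The prediction is a continuous value, which is what makes this regression.
predicted_price = intercept + slope * 100
```

Because the output can take any value on a continuous scale, the same fitted line answers questions for sizes that never appeared in the training data.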

**Example of classification problem**

Given the features of a house, we try to predict whether the house "sells for more or less than the asking price." Here we are classifying the houses into two discrete categories.
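A rough sketch of the classification example, using logistic regression trained by gradient descent. The single feature (a standardized "demand score" per house), the labels, the learning rate, and the step count are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, steps=2000):
    """Batch gradient descent on the logistic loss for one feature."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / m
        b -= learning_rate * sum(errors) / m
    return w, b

def predict(w, b, x):
    """Return 1 (sells above asking) or 0 (sells at or below asking)."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Made-up data: demand score per house, and whether it sold above asking.
scores = [-3, -2, -1, 1, 2, 3]
sold_above = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(scores, sold_above)
predictions = [predict(w, b, x) for x in scores]
```

The model still computes a continuous probability internally; thresholding it at 0.5 is what turns the output into one of two discrete categories.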

## Unsupervised Learning

In unsupervised learning, we are given little or no idea about what our results should look like. We can try clustering the data based on relationships among the variables or pick up...

With unsupervised learning there is no feedback based on the prediction results.

Examples:

* Clustering: take a collection of 1,000,000 different people and find a way to automatically group them into groups that are somehow similar or related by different variables, such as location, role, and so on.
* Non-clustering: identifying individual voices and music from a mix of sounds.
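One common clustering technique is k-means, which the notes above do not name; it is used here only as an illustrative choice. The sketch below assumes each person is described by two made-up numeric features and uses a naive initialization (the first `k` points), which is a simplification.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Made-up people described by (age, commute distance in km).
people = [(25, 5), (60, 40), (27, 6), (62, 42), (24, 4), (61, 41)]
labels, centroids = kmeans(people, k=2)
```

No labels are supplied anywhere in this code: the grouping comes only from the distances between the points, which is what makes the task unsupervised.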


## Model Selection and Train/Validation/Test Sets


A hypothesis may have a low error on the training examples but still be inaccurate on new data (because of overfitting). Thus, to evaluate a hypothesis, given a dataset of training examples, we can split the data into two sets: a training set and a test set. Typically, the training set consists of 70% of the data and the test set is the remaining 30%.

The new procedure using these two sets is then:

* Optimize the parameters using the training set.
* Estimate the generalization error using the test set.
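A minimal sketch of the 70/30 split, assuming the examples fit in a Python list. The function name `train_test_split` and the fixed seed are illustrative choices; shuffling first guards against any ordering in the original data.

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the data, then cut it into training and test parts."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Made-up (input, output) pairs.
examples = [(x, 2 * x + 1) for x in range(10)]
train_set, test_set = train_test_split(examples)
```

The parameters are then fitted on `train_set` only, and the error measured on `test_set` serves as the estimate of the generalization error.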

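To choose between several candidate models as well, we can add a third set, the cross validation set, so that the test set is not used for the choice itself. One way to break down our dataset into the three sets is: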

* Training set: 60%
* Cross validation set: 20%
* Test set: 20%

We can now calculate three separate error values for the three different sets using the following method:

* Optimize the parameters using the training set.
* Find the model (for example, the polynomial degree d) with the least error using the cross validation set.
* Estimate the generalization error using the test set.
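The three-step procedure can be sketched as follows. The two candidate models (a constant predictor and a straight line) and the synthetic data are assumptions picked to keep the example short; any set of candidate models could take their place.

```python
import math
import random

def three_way_split(examples, seed=0):
    """Shuffle, then cut into 60% training, 20% cross validation, 20% test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    a, b = round(0.6 * n), round(0.8 * n)
    return shuffled[:a], shuffled[a:b], shuffled[b:]

def fit_mean(train):
    """Candidate 1: always predict the mean training output."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Candidate 2: least-squares straight line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def mse(h, examples):
    return sum((h(x) - y) ** 2 for x, y in examples) / len(examples)

# Made-up data: a line plus a small deterministic wiggle standing in for noise.
examples = [(x, 2 * x + 1 + 0.1 * math.sin(x)) for x in range(20)]
train, cv, test = three_way_split(examples)

candidates = {"mean": fit_mean, "line": fit_line}
fitted = {name: fit(train) for name, fit in candidates.items()}   # step 1: training set
cv_errors = {name: mse(h, cv) for name, h in fitted.items()}      # step 2: cross validation set
best_name = min(cv_errors, key=cv_errors.get)
test_error = mse(fitted[best_name], test)                         # step 3: test set
```

The test set is touched exactly once, after the model has been chosen, so `test_error` is not biased by the selection step.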


## Neural Networks: Diagnosing Bias vs. Variance

In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis.

* We need to distinguish whether bias or variance is the problem contributing to bad predictions.
* High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between these two.

The training error will tend to decrease as we increase the degree d of the polynomial.

At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.
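The two error curves can be computed with a short experiment. This sketch assumes synthetic data (samples of a sine curve with a small deterministic wiggle standing in for noise) and fits each polynomial through the normal equations with a hand-written solver; `solve`, `fit_poly`, and `error` are illustrative helper names.

```python
import math

def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_poly(xs, ys, d):
    """Least-squares fit of a degree-d polynomial via the normal equations."""
    rows = [[x ** k for k in range(d + 1)] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d + 1)] for i in range(d + 1)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(d + 1)]
    return solve(xtx, xty)

def error(theta, xs, ys):
    """J = 1/(2m) * sum of squared prediction errors."""
    preds = [sum(t * x ** k for k, t in enumerate(theta)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / (2 * len(xs))

# Synthetic data on [-1, 1]; even indices train, odd indices cross-validate.
xs = [-1 + 2 * i / 29 for i in range(30)]
ys = [math.sin(3 * x) + 0.1 * math.cos(50 * x) for x in xs]
train_x, train_y = xs[0::2], ys[0::2]
cv_x, cv_y = xs[1::2], ys[1::2]

degrees = list(range(1, 7))
thetas = [fit_poly(train_x, train_y, d) for d in degrees]
train_errors = [error(t, train_x, train_y) for t in thetas]
cv_errors = [error(t, cv_x, cv_y) for t in thetas]
best_degree = degrees[cv_errors.index(min(cv_errors))]
```

Printing `train_errors` next to `cv_errors` shows the pattern described above: the training error can only go down as d grows, while the cross validation error is the one to watch when picking d.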


The regularization parameter λ has a similar effect on the fit. As λ increases, our fit becomes more rigid and tends to underfit. On the other hand, as λ approaches 0, we tend to overfit the data. So how do we choose our parameter λ to get it 'just right'? To choose the model and the regularization term λ, we need to:

* Create a list of lambdas (e.g. λ ∈ {0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24}).
* Create a set of models with different degrees or any other variants.
* Iterate through the λs and, for each λ, go through all the models to learn some Θ.
* Compute the cross validation error J_CV(Θ) using the learned Θ (computed with λ), without regularization (that is, with λ = 0).
* Select the best combo that produces the lowest error on the cross validation set.
* Using the best combo Θ and λ, apply it on J_test(Θ) to see if it has a good generalization of the problem.
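The steps above can be sketched as a grid search. To keep the example short it makes a strong simplification: each "model" is a one-parameter hypothesis h(x) = θ·x^d, for which the regularized least-squares solution has a closed form. The data, the split, and the helper names are assumptions.

```python
import math

def fit_theta(xs, ys, degree, lam):
    """Minimizer of (1/2m) * sum((theta * x**degree - y)^2) + (lam/2m) * theta^2."""
    phi = [x ** degree for x in xs]
    return sum(p * y for p, y in zip(phi, ys)) / (sum(p * p for p in phi) + lam)

def cost(theta, xs, ys, degree):
    """Unregularized cost J(theta) = (1/2m) * sum((h(x) - y)^2)."""
    return sum((theta * x ** degree - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Made-up data: a quadratic plus a small deterministic wiggle standing in for noise.
xs = [(i + 1) / 10 for i in range(30)]
ys = [1.5 * x ** 2 + 0.05 * math.sin(7 * i) for i, x in enumerate(xs)]

# Deterministic 60/20/20 split by index.
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 5 < 3]
cv = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 5 == 3]
test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 5 == 4]
train_x, train_y = zip(*train)
cv_x, cv_y = zip(*cv)
test_x, test_y = zip(*test)

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
degrees = [1, 2, 3]

best = None
for lam in lambdas:
    for degree in degrees:
        theta = fit_theta(train_x, train_y, degree, lam)  # learned with regularization
        cv_error = cost(theta, cv_x, cv_y, degree)        # evaluated without it
        if best is None or cv_error < best[0]:
            best = (cv_error, lam, degree, theta)

best_cv_error, best_lambda, best_degree, best_theta = best
test_error = cost(best_theta, test_x, test_y, best_degree)
```

Note that λ appears only in `fit_theta`. The cross validation and test errors are computed with the plain cost, so they stay comparable across different values of λ.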

**Deciding What to Do Next Revisited**

Our decision process can be broken down as follows:

* Getting more training examples: fixes high variance.
* Trying smaller sets of features: fixes high variance.
* Adding features: fixes high bias.
* Adding polynomial features: fixes high bias.
* Decreasing λ: fixes high bias.
* Increasing λ: fixes high variance.
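The decision process can be turned into a small lookup. The diagnosis rule used here is a common rule of thumb rather than something stated in the list itself: a high training error suggests high bias, while a low training error combined with a high cross validation error suggests high variance. The threshold `acceptable_error` is a hypothetical, problem-specific value.

```python
REMEDIES = {
    "high bias": ["add features", "add polynomial features", "decrease lambda"],
    "high variance": [
        "get more training examples",
        "try a smaller set of features",
        "increase lambda",
    ],
    "ok": [],
}

def diagnose(train_error, cv_error, acceptable_error):
    """Rule-of-thumb diagnosis from training and cross validation errors."""
    if train_error > acceptable_error:
        return "high bias"      # the model cannot even fit the training set
    if cv_error > acceptable_error:
        return "high variance"  # fits the training set but does not generalize
    return "ok"

# Hypothetical error values for illustration.
diagnosis = diagnose(train_error=0.02, cv_error=0.30, acceptable_error=0.05)
suggestions = REMEDIES[diagnosis]
```

A model can suffer from both problems at once; this sketch reports high bias first because a model that cannot fit its own training data will not benefit from variance-reducing remedies.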

**Diagnosing Neural Networks**

* A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
* A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting.

Using a single hidden layer is a good starting default. You can train neural networks with different numbers of hidden layers, compare them on your cross validation set, and then select the one that performs best.

**Model Complexity Effects**

* Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model consistently fits poorly.
* Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
* In reality, we would want to choose a model somewhere in between, one that can generalize well but also fits the data reasonably well.