# Chapter 2. Supervised Learning

[Supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) is used when we want to predict a certain outcome from a given input, and we have examples of input/output pairs.

## Classification and Regression

There are two major types of supervised machine learning problems, called [classification](https://en.wikipedia.org/wiki/Statistical_classification) and [regression](https://en.wikipedia.org/wiki/Regression_analysis).

In classification, the goal is to predict a *class label*, which is a choice from a predefined list of possibilities.  
In Chapter 1, we used the example of classifying irises into one of three possible species.  
Classification is sometimes separated into *binary classification*, which is the special case of distinguishing between exactly two classes, and *multiclass classification*, which is classification between more than two classes.

For regression tasks, the goal is to predict a continuous number: a floating-point number in programming terms, or a real number in mathematical terms.  
Predicting a person's annual income from their education, their age, and where they live is an example of a regression task.  
When predicting income, the predicted value is an amount and can be any number in a given range.  
Another example of a regression task is predicting the yield of a corn farm given attributes such as previous yields, weather, and number of employees working on the farm.  
The yield again can be an arbitrary number.

An easy way to distinguish between classification and regression tasks is to ask whether there is some kind of continuity in the output.  
If there is continuity between possible outcomes, then the problem is a regression problem.  
Think about predicting annual income -- there is clear continuity in the output.  
Whether a person makes $50,000 or $50,001 per year doesn't make much difference, even though they are technically different dollar amounts.  
By contrast, recognizing which language a book is written in is a classification problem because there is no matter of degree.  
The book is written in English, or Arabic, or French, or some other language; there is no continuity between languages and there is no language that is *between* Arabic and French.
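
As a rough illustration of the distinction, the sketch below asks scikit-learn's `type_of_target` helper to label three small target arrays. The arrays (`species`, `spam`, `income`) are hypothetical values made up for this example, not data from the chapter. The helper is only a heuristic: it treats floats that all happen to be whole numbers as class labels, so the income values here deliberately include cents.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, invented purely for illustration.
species = np.array(["setosa", "versicolor", "virginica", "setosa"])  # three labels
spam = np.array([0, 1, 1, 0])                                        # two labels
income = np.array([48250.75, 50000.50, 50001.25, 61999.90])          # dollar amounts

targets = {"species": species, "spam": spam, "income": income}

# type_of_target inspects the values and reports what kind of target they look like.
kinds = {name: type_of_target(y) for name, y in targets.items()}

for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

Targets reported as `binary` or `multiclass` call for a classifier, while a `continuous` target calls for a regressor.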

## Generalization, Overfitting, and Underfitting

We want to build a model that is able to generalize as accurately as possible.  
Building a model that is too complex for the amount of information available is called [overfitting](https://en.wikipedia.org/wiki/Overfitting).  
Overfitting occurs when you fit a model too closely to the particularities of the training set and come up with a model that works well on that training set but is not able to generalize to new data.  
On the other hand, if your model is too simple, or its scope is too broadly defined, you might not be able to capture all the aspects of and variability in the data.  
This is known as [underfitting](https://en.wikipedia.org/wiki/Overfitting#Underfitting), and will result in your model performing poorly on both the training and test sets because it cannot capture the underlying trend of the data.  
You can learn more about underfitting vs. overfitting [in the scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html).
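
A minimal sketch of both failure modes, loosely following the setup of the scikit-learn example linked above: polynomial regressions of increasing degree are fit to noisy samples of a cosine curve. The synthetic data, the degrees (1, 4, 15), and the 50/50 split are assumptions chosen for illustration. Typically the degree-1 model scores poorly on both sets (underfitting), while a very high degree tends to score well on the training set but worse on the test set (overfitting); the exact numbers depend on the random seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a cosine curve plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.cos(1.5 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# scores[degree] = (R^2 on the training set, R^2 on the test set)
scores = {}
for degree in (1, 4, 15):
    # Higher polynomial degree means a more complex model.
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression()
    )
    model.fit(X_train, y_train)
    scores[degree] = (
        model.score(X_train, y_train),
        model.score(X_test, y_test),
    )
    print(
        f"degree={degree:2d}  "
        f"train R^2={scores[degree][0]:.3f}  test R^2={scores[degree][1]:.3f}"
    )
```

Comparing the two columns is the point: a large gap between training and test scores suggests overfitting, while low scores on both suggest underfitting.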

### Relation of Model Complexity to Dataset Size

It's important to note that model complexity is intimately tied to the variation of inputs contained in your training dataset.  
The larger the variety of data points that your dataset contains, the more complex a model you can use without overfitting.  
Usually, collecting more data points will yield more variety, so larger datasets allow you to build more complex models.  
In the real world, you often have the ability to decide how much data to collect, and collecting more might be more beneficial than tweaking and tuning your model.  
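
One way to explore this relationship is a learning curve: keep a fairly complex model fixed and train it on progressively larger subsets of the data. The sketch below uses scikit-learn's `learning_curve` with a degree-9 polynomial regression on synthetic cosine data. The data, the degree, and the training sizes are all assumptions for illustration. One would usually expect the gap between training and cross-validated scores to narrow as the training set grows, though results vary with the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a cosine curve plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.cos(1.5 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=200)

# A deliberately flexible model: degree-9 polynomial regression.
model = make_pipeline(
    PolynomialFeatures(9, include_bias=False), LinearRegression()
)

# With 5-fold cross-validation on 200 samples, each fold trains on at most 160.
sizes, train_scores, test_scores = learning_curve(
    model, X, y, train_sizes=[20, 40, 80, 160], cv=5
)

# Average the per-fold R^2 scores for each training-set size.
for n, train, test in zip(
    sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)
):
    print(f"n_train={n:3d}  train R^2={train:.3f}  cv R^2={test:.3f}")
```

If the cross-validated score is still climbing at the largest size, collecting more data is likely to help more than further tuning.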

## Supervised Machine Learning Algorithms