## Detect Overfitting and Underfitting with Learning Curves
For this quiz, we'll train three models on the circular dataset below:

- a Decision Tree model,
- a Logistic Regression model, and
- a Support Vector Machine model.

![circle-data](circle-data.png)

One of the models overfits, one underfits, and the other one is just right. First, we'll write some code to draw the learning curves for each model, and then we'll look at the learning curves to decide which model is which.

Before we start, let's recall what the learning curves look like in each of the three cases:
![learning-curves](learning-curves.png)

For the first part of the quiz, all you need to do is uncomment one of the classifiers and hit 'Test Run' to see the graph of the learning curve. But if you like coding, here are some details. We'll be using the function called **learning_curve**:

**train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, num_trainings))**

No need to worry about all the parameters of this function (you can read more about them in the scikit-learn documentation), but here we'll explain the main ones:

 - **estimator** is the actual classifier we're using for the data, e.g., LogisticRegression() or GradientBoostingClassifier().
 - **X** and **y** are our data, split into features and labels.
 - **train_sizes** are the sizes of the chunks of data used to draw each point in the curve.
 - **train_scores** are the training scores for the algorithm trained on each chunk of data.
 - **test_scores** are the testing scores for the algorithm trained on each chunk of data.
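As a minimal sketch of such a call, the snippet below runs `learning_curve` on a synthetic stand-in dataset. The use of `make_circles` (and its `noise` and `factor` values) is an assumption made only so the example is self-contained; the quiz itself reads its data from `data.csv`.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Stand-in data (assumption): a noisy circle inside a larger circle.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=55)

num_trainings = 5  # number of points drawn on the curve

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(), X, y, cv=3, n_jobs=1,
    train_sizes=np.linspace(0.1, 1.0, num_trainings),
)

print(train_sizes)         # absolute number of training samples per point
print(train_scores.shape)  # (number of points, number of folds)
print(test_scores.shape)   # same shape as train_scores
```

Note that `train_sizes` is passed in as fractions of the available training data, while the returned `train_sizes` holds absolute sample counts.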

Two very important observations:

- For each training size, the training and testing scores come in as a list of 3 values, one per fold, because the function uses 3-fold cross-validation here. (In scikit-learn 0.22 and later, `cv=None` defaults to 5-fold, so you would get 5 values unless you pass `cv=3`.)
- **Very important:** As you can see, we defined our curves with Training and Testing Error, and this function defines them with Training and Testing Score. **These are opposite: the higher the error, the lower the score**. Thus, when you see the curve, you need to flip it upside down in your mind in order to compare it with the curves above.
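To make the per-fold structure concrete, here is a small sketch that averages the scores across folds before reporting them. The synthetic `make_circles` data is an assumption standing in for `data.csv`, and `cv=3` is passed explicitly so the result does not depend on the library's default.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Stand-in data (assumption), not the quiz's data.csv.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=55)

train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel='rbf', gamma=1000), X, y, cv=3,
    train_sizes=np.linspace(0.1, 1.0, 5),
)

# Each row holds one score per fold; average across folds (axis=1)
# to get a single value per training size.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

for n, tr, te in zip(train_sizes, train_mean, test_mean):
    print(f"{int(n):4d} samples: train score {tr:.2f}, test score {te:.2f}")
```

These are scores (accuracy for classifiers), so higher is better, which is the opposite orientation from the error curves shown earlier.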

### Part 1: Drawing the learning curves
Here, we'll be comparing three models:

- A **Logistic Regression** model.
- A **Decision Tree** model (the code uses `GradientBoostingClassifier`, which is an ensemble of decision trees).
- A **Support Vector Machine** model with an rbf kernel, and a gamma parameter of 1000 (this is another type of model, don't worry about how it works for now).

Uncomment the code for each one, and examine the learning curve that gets drawn. If you're curious about the code used to draw the learning curves, it's on the **utils.py** tab.
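The contents of **utils.py** are not reproduced here, so the following is only a hypothetical sketch of what such a plotting helper could look like. The function name `draw_learning_curves`, its parameters, and the `make_circles` stand-in data are all assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve


def draw_learning_curves(X, y, estimator, num_trainings=10):
    """Hypothetical helper: plot mean training and CV scores vs. training size."""
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=3, n_jobs=1,
        train_sizes=np.linspace(0.1, 1.0, num_trainings),
    )
    # Average the per-fold scores to get one value per training size.
    train_mean = train_scores.mean(axis=1)
    test_mean = test_scores.mean(axis=1)

    fig, ax = plt.subplots()
    ax.plot(train_sizes, train_mean, "o-", label="Training score")
    ax.plot(train_sizes, test_mean, "o-", label="Cross-validation score")
    ax.set_title("Learning Curves")
    ax.set_xlabel("Training examples")
    ax.set_ylabel("Score")
    ax.legend(loc="best")
    ax.grid(True)
    return fig


# Stand-in data (assumption), not the quiz's data.csv.
X_demo, y_demo = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=55)
fig = draw_learning_curves(X_demo, y_demo, LogisticRegression())
```

Returning the figure rather than calling `plt.show()` inside the helper leaves the caller free to display or save it.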

In [2]:
# Suppress FutureWarning messages before importing the libraries
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Read the data and split it into features (X) and labels (y)
data = pd.read_csv('data.csv')
X = np.array(data[['x1', 'x2']])
y = np.array(data['y'])

# Fix random seed
np.random.seed(55)


In [None]:
# TODO: Uncomment one of the three classifiers, and hit "Test Run"
# to see the learning curve. Use these to answer the quiz below.

### Reminder: the higher the error, the lower the score

In [4]:
### Logistic Regression
estimator = LogisticRegression()



![logistic](logistic.png)

In [None]:
### Decision Tree
#estimator = GradientBoostingClassifier()




![gradient](gradient.png)


In [None]:
### Support Vector Machine
#estimator = SVC(kernel='rbf', gamma=1000)

![SVM](SVM.png)

![quiz](quiz.png)

We can observe from the curves that:

- The **Logistic Regression** model has a low training and testing score.
- The **Decision Tree** model has a high training and testing score.
- The **Support Vector Machine** model has a high training score, and a low testing score.

From here, we can determine that the Logistic Regression model underfits, the SVM model overfits, and the Decision Tree model is just right.
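The reasoning above can be written down as a rough rule of thumb. The sketch below is only illustrative: the function name `diagnose`, the thresholds `good` and `gap`, and the example scores are all made-up assumptions, not values taken from the quiz's curves.

```python
def diagnose(train_score, test_score, good=0.85, gap=0.10):
    """Rough heuristic (hypothetical thresholds) for reading final scores.

    - Large gap between training and testing score -> overfits.
    - Low training score (and hence low testing score) -> underfits.
    - Otherwise both scores are high and close -> just right.
    """
    if train_score - test_score > gap:
        return "overfits"
    if train_score < good:
        return "underfits"
    return "just right"


# Illustrative, made-up scores for the three situations described above.
print(diagnose(0.55, 0.50))  # low training and testing score
print(diagnose(0.95, 0.93))  # high training and testing score
print(diagnose(1.00, 0.60))  # high training score, low testing score
```

In practice the right thresholds depend on the problem, so looking at the whole curve is more reliable than any fixed cutoff.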

Equivalently, we can flip these curves (as they measure score, and our original curves measure error) and compare them with the following three curves. When we do, we can see that they look a lot like the three curves we saw before. (Note: The fact that we flip the curves doesn't mean that the error is 1 minus the score. It only means that as the model gets better, the error decreases and the score increases.)




![error-curves](error-curves.png)

Now, we should check whether this is visible in the actual models. When we plot the decision boundary for each one of these models, we get the following:
![models](models.png)

When we look at the models above, does it make sense that the first one underfits, the second one is just right, and the third one overfits? It does, right? We can see that the data is correctly bounded by a circle or a square. Here is what our models do:

- The **Logistic Regression** model uses a line, which is too simplistic. It doesn't do very well on the training set. Thus, it underfits.
- The **Decision Tree** model uses a square, which is a pretty good fit, and generalizes well. Thus, this model is good.
- The **Support Vector Machine** model actually draws a tiny circle around each point. This is clearly just memorizing the training set, and won't generalize well. Thus, it overfits.

It's always good to do a reality check when we can, and see that our models actually do have the behavior that the metrics tell us.
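One way to run a numerical version of this reality check is to hold out part of the data and compare training and testing accuracy for each model. This is a sketch under assumptions: the `make_circles` data stands in for `data.csv`, and the split size and random seeds are arbitrary choices.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption), not the quiz's data.csv.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.5, random_state=55)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=55
)

models = {
    "Logistic Regression": LogisticRegression(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=55),
    "SVM (rbf, gamma=1000)": SVC(kernel="rbf", gamma=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns mean accuracy for classifiers
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train {results[name][0]:.2f}, test {results[name][1]:.2f}")
```

A model that underfits should score poorly on both sets, while one that overfits should show a clear drop from training to testing accuracy.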