# Detect Overfitting and Underfitting with Learning Curves

Of the three models, one overfits, one underfits, and one is just right. First, we'll write some code to draw the learning curves for each model; then we'll look at the curves to decide which model is which.

<img src="https://video.udacity-data.com/topher/2017/June/594dbe26_learning-curves/learning-curves.png">

If you like coding, here are some details. We'll be using scikit-learn's `learning_curve` function:
```python
import numpy as np
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=None, n_jobs=1, train_sizes=np.linspace(0.1, 1.0, num_trainings))
```

No need to worry about all the parameters of this function (the scikit-learn documentation covers them in detail); here we'll explain the main ones:

- `estimator` is the actual classifier we're using for the data, e.g., `LogisticRegression()` or `GradientBoostingClassifier()`.
- `X` and `y` are our data, split into features and labels.
- `train_sizes` are the sizes of the chunks of data used to draw each point in the curve.
- `train_scores` are the training scores for the algorithm trained on each chunk of data.
- `test_scores` are the testing (cross-validation) scores for the algorithm trained on each chunk of data.
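
To make these inputs and outputs concrete, here is a minimal sketch of a call. The synthetic dataset from `make_classification`, the choice of `LogisticRegression`, and the values `num_trainings = 4` and `cv=3` are all assumptions made for illustration; they are not the lesson's data or settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data stands in for the lesson's dataset (an assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

num_trainings = 4  # number of points on the curve (illustrative value)
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=3,
    train_sizes=np.linspace(0.1, 1.0, num_trainings))

print(train_sizes)         # absolute number of training samples per point
print(train_scores.shape)  # (number of training sizes, number of folds)
print(test_scores.shape)   # same shape as train_scores
```

Each row of `train_scores` and `test_scores` belongs to one training size, and each column to one cross-validation fold.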


Two very important observations:

- For each training size, the training and testing scores come in as a list of values, one per cross-validation fold. With 3-fold cross-validation that means 3 values. Note that `cv=None` uses scikit-learn's default number of folds, which is 5 in current versions (it was 3 before version 0.22).
- Very important: as you can see, we defined our curves with training and testing error, while this function reports training and testing score. These are opposites: the higher the error, the lower the score. So when you see the curve, you need to flip it upside down in your mind to compare it with the curves above.
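
Both observations can be handled in a few lines of NumPy. The sketch below uses made-up score arrays shaped like the output of `learning_curve` (3 training sizes, 3 folds), and it assumes the score is bounded by 1, as accuracy is, so that error can be taken as `1 - score`.

```python
import numpy as np

# Hypothetical scores for 3 training sizes and 3 folds (made-up numbers,
# shaped like the arrays learning_curve returns).
train_scores = np.array([[1.00, 0.98, 0.99],
                         [0.95, 0.94, 0.96],
                         [0.92, 0.91, 0.93]])
test_scores = np.array([[0.70, 0.72, 0.68],
                        [0.80, 0.82, 0.81],
                        [0.88, 0.87, 0.89]])

# Average over the folds (axis=1) to get one point per training size.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

# Assuming a score bounded by 1 (such as accuracy), error = 1 - score.
train_error = 1 - train_mean
test_error = 1 - test_mean

print(train_error)
print(test_error)
```

Plotting `train_error` and `test_error` against the training sizes gives curves in the same orientation as the error curves in the figure, so no mental flipping is needed.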



---

# Grid Search

Grid Search in sklearn is very simple. We'll illustrate it with an example. Let's say we'd like to train a support vector machine, and we'd like to decide between the following parameters:

- `kernel`: `poly` or `rbf`.
- `C`: 0.1, 1, or 10.

(Note: These parameters can be used as a black box now, but we'll see them in detail in the Supervised Learning Section of the nanodegree.)
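
To see which candidate models these choices imply, one option is scikit-learn's `ParameterGrid`, which expands a dictionary of value lists into every combination. This is only a sketch for inspecting the grid; using `ParameterGrid` here is an illustrative choice, not a required step of the search.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'kernel': ['poly', 'rbf'], 'C': [0.1, 1, 10]}

# ParameterGrid expands the dictionary into every combination of values.
candidates = list(ParameterGrid(parameters))
for params in candidates:
    print(params)

print(len(candidates))  # 2 kernels x 3 values of C = 6 candidates
```

Grid search trains and scores one model for each of these combinations.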

The steps are the following:
1. Import GridSearchCV
```python
from sklearn.model_selection import GridSearchCV
```
2. Select the parameters:

Here we pick the parameters we want to choose from and form a dictionary. The keys are the names of the parameters, and the values are the lists of possible values for each parameter.
```python
parameters = {'kernel': ['poly', 'rbf'], 'C': [0.1, 1, 10]}
```
3. Create a scorer.

We need to decide what metric we'll use to score each of the candidate models. Here, we'll use the F1 score.
```python
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score
scorer = make_scorer(f1_score)
```
4. Create a `GridSearchCV` object with the parameters and the scorer. Use this object to fit the data.
```python
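from sklearn.svm import SVC
# The classifier to tune: a support vector machine, as in the example above.
clf = SVC()
# Create the object.
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)
# Fit the data.
grid_fit = grid_obj.fit(X, y)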
```
5. Get the best estimator.
```python
best_clf = grid_fit.best_estimator_
```
Now you can use the estimator `best_clf` to make predictions.
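
Putting the five steps together, a minimal end-to-end sketch might look like the following. The synthetic dataset, the train/test split, and the variable names `X_train`, `X_test`, `y_train`, and `y_test` are assumptions added so the example runs on its own; the lesson's own data would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Steps 2 and 3: the parameter grid and the scorer.
parameters = {'kernel': ['poly', 'rbf'], 'C': [0.1, 1, 10]}
scorer = make_scorer(f1_score)

# Step 4: search over the grid using cross-validation on the training data.
grid_obj = GridSearchCV(SVC(), parameters, scoring=scorer)
grid_fit = grid_obj.fit(X_train, y_train)

# Step 5: the best estimator, refit on all of the training data.
best_clf = grid_fit.best_estimator_
predictions = best_clf.predict(X_test)

print(grid_fit.best_params_)
print(f1_score(y_test, predictions))
```

Holding out a test set that the search never sees is a design choice in this sketch: it gives a score for `best_clf` that was not used to pick the parameters.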

On the next page, you'll find a lab where you can use `GridSearchCV` to optimize a decision tree model.

---