#### Overfitting and Underfitting With Machine Learning Algorithms

Poor performance in machine learning is most often caused by either overfitting or underfitting the data. After reading this you will understand the concept of generalization in machine learning and the problems of overfitting and underfitting that go along with it. Let's get started.

#### Approximate a Target Function in Machine Learning


Supervised machine learning is best described as approximating a target function (f) that maps input variables (X) to an output variable (Y). This characterization covers the range of classification and prediction problems and the machine learning algorithms that can be used to address them. An important consideration in learning the target function from the training data is how well the model generalizes to new data. Generalization is critical because the data we collect is only a sample: it is incomplete, noisy, and non-representative.

#### Generalization in Machine Learning 

In machine learning we describe the learning of the target function from training data as inductive learning. Induction refers to learning general concepts from specific examples, which is exactly the problem that supervised machine learning aims to solve. This is different from deduction, which works the other way around and seeks to learn specific concepts from general rules. Generalization refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning. The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen. Two terms are used in machine learning to describe how well a model learns and generalizes to new data: overfitting and underfitting. Overfitting and underfitting are the two biggest causes of poor performance of machine learning algorithms.

#### Statistical Fit

In statistics, a fit refers to how well you approximate a target function. This is good terminology to use in machine learning, because supervised machine learning algorithms seek to approximate the unknown underlying mapping function for the output variables given the input variables. Statisticians often describe the goodness of fit, which refers to measures used to estimate how well the approximation of the function matches the target function. Some of these methods are useful in machine learning (e.g. calculating the residual errors), but some of these techniques assume we know the form of the target function we are approximating, which is not the case in machine learning. If we knew the form of the target function, we would use it directly to make predictions, rather than trying to learn an approximation from samples of noisy training data.
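As a rough illustration of what "calculating the residual errors" can look like, here is a minimal sketch that fits a straight line by ordinary least squares and then summarizes the residuals with the root mean squared error. The data values and the helper names (`fit_line`, `residuals`, `rmse`) are made up for this example, and a straight line is only an assumed model form.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def residuals(xs, ys, intercept, slope):
    """Residual = observed value minus the model's prediction."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def rmse(res):
    """Root mean squared error: one common summary of the residuals."""
    return math.sqrt(sum(r * r for r in res) / len(res))

# Toy data, invented for illustration: roughly y = 2x + 1 with small deviations.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

intercept, slope = fit_line(xs, ys)
res = residuals(xs, ys, intercept, slope)
print(f"intercept={intercept:.3f} slope={slope:.3f} rmse={rmse(res):.3f}")
```

Small residuals on the data used for fitting only say that the line matches those particular samples. They say nothing yet about how the fit holds up on new data.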

#### Overfitting in Machine Learning

Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize. Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns. For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.
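To make the idea concrete, here is a small sketch of a flexible nonparametric model memorizing noise. It does not use a decision tree or pruning. Instead it swaps in k-nearest neighbors, where the parameter `k` plays the role of the constraint on how much detail the model learns. The data generator, the 20% label-noise rate, and all function names are assumptions made for this example.

```python
import random

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Fraction of points in `data` that a k-NN model built on `train` labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)

def make_data(n, rng, noise=0.2):
    """True rule: label is 1 when x > 0.5. A fraction `noise` of labels is flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else 0
        if rng.random() < noise:
            y = 1 - y  # label noise the model should ideally ignore
        data.append((x, y))
    return data

rng = random.Random(0)
train = make_data(200, rng)
test = make_data(200, rng)

# k=1 is the most flexible setting; a larger odd k smooths over noisy labels.
for k in (1, 15):
    print(f"k={k:2d} train={accuracy(train, train, k):.2f} test={accuracy(train, test, k):.2f}")
```

With `k=1` every training point is its own nearest neighbor, so the training accuracy is perfect even though a fifth of the labels are noise. That gap between training and test accuracy is the signature of overfitting. Raising `k` gives up the perfect training score and would typically be expected to narrow that gap.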

#### Underfitting in Machine Learning

Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit machine learning model is not a suitable model, and this will be obvious because it performs poorly even on the training data. Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternative machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.
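A quick sketch of what underfitting looks like in numbers: a straight line fitted to data that follows a parabola. The data and the helper names are invented for this example. The point is only that the error is already poor on the training data itself, which is what makes underfitting easy to spot.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Noise-free toy data from y = x^2 on x = -5..5 (made up for illustration).
xs = [float(x) for x in range(-5, 6)]
ys = [x * x for x in xs]

intercept, slope = fit_line(xs, ys)
line_preds = [intercept + slope * x for x in xs]
train_mse = mse(ys, line_preds)

# Baseline: always predict the mean of the training targets.
mean_y = sum(ys) / len(ys)
baseline_mse = mse(ys, [mean_y] * len(ys))

# R^2 compares the model against that baseline on the training data.
r_squared = 1.0 - train_mse / baseline_mse
print(f"slope={slope:.2f} intercept={intercept:.2f} train_mse={train_mse:.1f} r2={r_squared:.2f}")
```

On this symmetric data the best line is flat (slope 0, intercept 10), so its training MSE of 78 is no better than always predicting the mean, and the training R² is 0. No amount of extra training data would fix this. The model form is simply too rigid for the target function.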

#### A Good Fit in Machine Learning

Ideally, you want to select a model at the sweet spot between underfitting and overfitting. This is the goal, but it is very difficult to do in practice. To understand this goal we can look at the performance of a machine learning algorithm over time as it is learning from training data. We can plot both the skill on the training data and the skill on the test dataset we have held back from the training process. Over time, as the algorithm learns, the error for the model on the training data generally keeps decreasing. If you have set up your data correctly and chosen an appropriate algorithm, you should see error on the test set initially decrease and _eventually start to increase again_. This means the model has just crossed the cusp of overfitting; it has trained for too long. The sweet spot is the point just before the error on the test dataset starts to increase, where the model has the lowest error on the unseen test dataset. This is often not a useful technique in practice, because by choosing the stopping point for training using the error curve on the test set, the test set is no longer "unseen" or a standalone objective measure. Some knowledge (a lot of useful knowledge) about that data has leaked into the training procedure. There are two additional techniques you can use to help find the sweet spot in practice: resampling methods and a validation dataset.
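The sketch below shows one way to locate the sweet spot on an error curve without touching the test set: track the error on a separate validation set and pick the epoch where it is lowest. The error values are invented purely to show the shape of the curves, and the function names and the `patience` parameter are assumptions for this example rather than any particular library's API.

```python
def best_epoch(val_errors):
    """Index of the epoch with the lowest validation error (first one on ties)."""
    return min(range(len(val_errors)), key=lambda i: val_errors[i])

def early_stop_epoch(val_errors, patience=2):
    """Walk the curve epoch by epoch, as training would, and stop once
    `patience` epochs have passed without a new best validation error.
    Returns the best epoch seen before stopping."""
    best = 0
    for i in range(1, len(val_errors)):
        if val_errors[i] < val_errors[best]:
            best = i
        elif i - best >= patience:
            break
    return best

# Illustrative, made-up curves: training error keeps falling,
# validation error falls and then rises again.
train_err = [0.50, 0.35, 0.25, 0.18, 0.13, 0.10, 0.08, 0.07]
val_err = [0.52, 0.40, 0.33, 0.30, 0.29, 0.31, 0.34, 0.38]

print("lowest validation error at epoch", best_epoch(val_err))
print("early stopping keeps epoch", early_stop_epoch(val_err, patience=2))
```

Both functions pick epoch 4 here (counting from 0), the last epoch before the validation error turns upward. Because the stopping point is chosen on validation data, the test set stays untouched for a final, unbiased check.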

#### How To Limit Overfitting

Both overfitting and underfitting can lead to poor model performance. But by far the most common problem in applied machine learning is overfitting. Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care the most about, namely how well the algorithm performs on unseen data. There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting: (1) use a resampling technique to estimate model accuracy and (2) hold back a holdout dataset. The most popular resampling technique is k-fold cross-validation. It allows you to train and test your model k times on different subsets of training data and build up an estimate of the performance of a machine learning model on unseen data. A holdout dataset is simply a subset of your data that you hold back from your machine learning algorithms until the very end of your project. After you have selected and tuned your machine learning algorithms on your training dataset, you can evaluate the learned models on the holdout dataset to get one final objective idea of how the model might perform on unseen data. My view is that if the one-time holdout performance is good, the model is ready to put into production. If not, it is back to the drawing board, starting over from square one. Cross-validation is a gold standard in applied machine learning for estimating model accuracy on unseen data. If you have plenty of data to spare and can afford to set aside a holdout set, that is also excellent practice.
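Here is a minimal, standard-library sketch of how the two techniques can fit together: set aside a holdout set first, run k-fold cross-validation on the remaining data, and score the holdout set once at the very end. The toy data, the 5-sample holdout size, and the placeholder "predict the training mean" model are all assumptions made for illustration. Any model with a fit and a predict step could be dropped in.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(xs, ys, k, fit, predict):
    """Average test-fold MSE over k train/test splits."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errs = [(ys[i] - predict(model, xs[i])) ** 2 for i in test_idx]
        scores.append(sum(errs) / len(errs))
    return sum(scores) / k

# Placeholder model (hypothetical): always predict the mean training target.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

# Made-up data for illustration.
xs = [float(i) for i in range(25)]
ys = [2.0 * x + 1.0 for x in xs]

# 1) Set the holdout set aside before any model selection happens.
order = list(range(len(xs)))
random.Random(1).shuffle(order)
holdout_idx, dev_idx = order[:5], order[5:]
dev_x = [xs[i] for i in dev_idx]
dev_y = [ys[i] for i in dev_idx]

# 2) Estimate performance with 5-fold cross-validation on the rest.
cv_estimate = cross_val_mse(dev_x, dev_y, k=5, fit=fit_mean, predict=predict_mean)

# 3) At the very end, train on all development data and score the holdout once.
final_model = fit_mean(dev_x, dev_y)
holdout_mse = sum(
    (ys[i] - predict_mean(final_model, xs[i])) ** 2 for i in holdout_idx
) / len(holdout_idx)

print(f"cross-validation MSE estimate: {cv_estimate:.2f}")
print(f"one-time holdout MSE: {holdout_mse:.2f}")
```

The key design choice is the order of operations: the holdout indices are fixed before cross-validation starts and are only used in step 3, so nothing learned during tuning can leak into the final check.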