# Model Performance

A key goal of machine learning is to build a model that we can trust to predict new samples about as accurately as it predicted the data on which it was evaluated. Without this confidence, the model’s predictions are useless.

On a practical note, all model building efforts are constrained by the existing data. For many problems, the data may have a limited number of samples, may be of less-than-desirable quality, and/or may be unrepresentative of future samples. While there are ways to build predictive models on small data sets, we will assume that data quality is sufficient and that it is representative of the entire sample population.

The techniques for preparing the data in this manner are explained in the Data Preparation course.

Working under these assumptions about the dataset, we must use the data at hand to find the best predictive model. Almost all predictive modeling techniques have tuning parameters that enable the model to flex to find the structure in the data. Hence, we must use the existing data to identify settings for the model’s parameters that yield the best and most realistic predictive performance. This is known as model tuning.

In this course, we will learn about:

1. Overfitting
2. Bias-Variance trade-off
3. Techniques to prepare data for validating model performance
4. Model Tuning parameters
5. Model Evaluation


## Overfitting problem

There now exist many techniques that can learn the structure of a set of data so well that when the model is applied to the data on which the model was built, it correctly predicts every sample. In addition to learning the general patterns in the data, the model has also learned the characteristics of each sample’s unique noise. This type of model is said to be over-fit and will usually have poor accuracy when predicting a new sample.


In the example below, the blue curve represents the model and the red dots represent the training data points:

<img src="../images/overfitting.png" style="width: 700px;">

As you can see above, the curve tries to pass through every data point in the training dataset.

Overfitting means that the model fits the training data, including its noise, so closely that it performs very poorly when we use it to predict new data. In other words, the model has been so flexible in fitting the training dataset that it has lost the general pattern in the overall data.
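The effect can be reproduced with a small experiment. The sketch below is only an illustration under assumed conditions: the data come from a made-up sine function with added noise, and the polynomial degrees (3 and 12) are arbitrary choices. The more flexible polynomial always fits the training points at least as closely, and it will usually do worse on fresh samples, although the exact numbers depend on the random seed.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "real" relationship, chosen only for this illustration
    return np.sin(2 * np.pi * x)

# A small noisy training set and a larger test set from the same process
x_train = np.sort(rng.uniform(0, 1, 15))
y_train = true_f(x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = true_f(x_test) + rng.normal(0, 0.3, x_test.size)

def mse(y, y_hat):
    # Mean squared error between observed and predicted values
    return float(np.mean((y - y_hat) ** 2))

errors = {}
for degree in (3, 12):
    # Least-squares polynomial fit of the given degree
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    errors[degree] = (mse(y_train, poly(x_train)), mse(y_test, poly(x_test)))
    print(f"degree={degree:2d}  train MSE={errors[degree][0]:.3f}  "
          f"test MSE={errors[degree][1]:.3f}")
```

A large gap between a low training error and a much higher test error is the typical signature of an over-fit model.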




## Bias - Variance Trade-off

In order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves
low variance and low bias.


### Variance

Variance refers to the amount by which function f would change if we
estimated it using a different training data set. Since the training data
are used to fit the statistical learning method, different training data sets
will result in a different function f. But ideally the estimate for f should not vary
too much between training sets. However, if a method has high variance
then small changes in the training data can result in large changes in estimated f. In
general, more flexible statistical methods have higher variance.


### Bias

Bias refers to the error that is introduced by approximating
a real-life problem, which may be extremely complicated, by a much
simpler model. For example, linear regression assumes that there is a linear
relationship between Y and X1, X2, ..., Xp. It is unlikely that any real-life
problem truly has such a simple linear relationship, and so performing linear
regression will undoubtedly result in some bias in the estimate of f.


### Bias-Variance Trade-off

In the diagram below, as the model complexity increases, the variance increases and the bias decreases.
As a result, the total error (the expected test error, which combines the squared bias, the variance, and the irreducible noise) follows a U-shaped curve. It first decreases as the complexity grows, up to the point where the model is most accurate on new data. Beyond that point the model starts overfitting the training data and the total error increases again, even though the training error keeps falling.

The point where the total error is minimal is the trade-off point between bias and variance. It corresponds to the model that is expected to generalize best to unseen data.
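For reference, the decomposition behind this trade-off can be written as follows. The notation is an assumption for illustration: x_0 is a new input, y_0 its observed response, \hat{f} the fitted model, and \varepsilon the irreducible noise term.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\varepsilon)
```

The first two terms depend on the chosen method and its flexibility. The last term cannot be reduced by any model.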

<img src="../images/complexity-error.png" style="width: 700px;">


Reference: An Introduction to Statistical Learning with Applications in R




## Train-Test Split

In order to tune the model to find the structure in the data in the best possible fashion, we must use the existing data to identify settings for the model’s parameters that yield the best and most realistic predictive performance.

Traditionally, this has been achieved by splitting the existing data into training and test sets. The training set is used to build and tune the model and the test set is used to estimate the model’s predictive performance. Modern approaches to model building split the data into multiple training and testing sets, which have been shown to often find better tuning parameters and give a more accurate representation of the model’s predictive performance.

To avoid over-fitting, we use a general model building approach that encompasses model tuning and model evaluation
with the ultimate goal of finding the reproducible structure in the data. This approach entails splitting existing data into distinct sets for the purposes of tuning model parameters and evaluating model performance. The choice of data splitting method depends on characteristics of the existing data such as its size and structure.

When a large amount of data is at hand, a set of samples can be set aside to evaluate the final model. The “training” data set is the general term for the samples used to create the model, while the “test” or “validation” data set is used to assess its performance.

## Exercise

In the following exercise, write code to create a train-test split using the sklearn library.

In [6]:
from sklearn import datasets
import pandas as pd

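# Load the Boston housing dataset and name the columns with boston_dataset.feature_names
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston_dataset = datasets.load_boston()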
boston_data = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston_data['MEDV'] = boston_dataset.target

#create X and y as input and output vectors
X = boston_data[['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'PTRATIO', 'B', 'LSTAT']]
y = boston_data[['MEDV']]

# Write code below to create train-test split

### hint
Use sklearn.model_selection.train_test_split

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
X_train.head()

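|     | CRIM    | ZN   | CHAS | NOX   | RM    | DIS     | RAD | PTRATIO | B      | LSTAT |
|-----|---------|------|------|-------|-------|---------|-----|---------|--------|-------|
| 84  | 0.05059 | 0.0  | 0.0  | 0.449 | 6.389 | 4.7794  | 3.0 | 18.5    | 396.9  | 9.62  |
| 354 | 0.04301 | 80.0 | 0.0  | 0.413 | 5.663 | 10.5857 | 4.0 | 22.0    | 382.8  | 8.05  |
| 221 | 0.40771 | 0.0  | 1.0  | 0.507 | 6.164 | 3.048   | 8.0 | 17.4    | 395.24 | 21.46 |
| 34  | 1.61282 | 0.0  | 0.0  | 0.538 | 6.096 | 3.7598  | 4.0 | 21.0    | 248.31 | 20.34 |
| 267 | 0.57834 | 20.0 | 0.0  | 0.575 | 8.297 | 2.4216  | 5.0 | 13.0    | 384.54 | 7.44  |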




## k-fold Cross Validation

In the previous lesson, we saw how to split a large dataset into a “training” data set and a “test” or “validation” data set used to assess performance.

However, when the number of samples is not large, a strong case can be made that a test set should be avoided because every sample may be needed for model building. Additionally, the size of the test set may not have sufficient power or precision to make reasonable judgements. Resampling methods, such as cross-validation, can be used to produce appropriate estimates of model performance using the training set.

Generally, resampling techniques for estimating model performance operate similarly: a subset of samples is used to fit a model and the remaining samples are used to estimate the efficacy of the model. This process is repeated multiple times and the results are aggregated and summarized. The differences between techniques usually center on the method by which the subsamples are chosen.

In k-fold cross-validation, the samples are randomly partitioned into k sets of roughly equal size. A model is fit using all samples except the first subset (called the first fold). The held-out samples are predicted by this model and used to estimate performance measures. The first subset is returned to the training set and the procedure repeats with the second subset held out, and so on. The k resampled estimates of performance are summarized (usually with the mean and standard error) and used to understand the relationship between the tuning parameter(s) and model utility.

Leave-one-out cross-validation (LOOCV) is the special case where k is the number of samples. In this case, since only one sample is held out at a time, the final performance is calculated from the k individual held-out predictions. k-fold cross-validation can also be repeated several times with different random partitions. For example, if 10-fold cross-validation was repeated five times, 50 different held-out sets would be used to estimate model efficacy.

The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller (i.e., the bias is smaller for k = 10 than k = 5). In this context, the bias is the difference between the estimated and true values of performance.
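As a rough sketch of how the resampled estimates can be summarized, the example below runs 10-fold cross-validation and then 10-fold cross-validation repeated five times. The synthetic dataset from `make_regression`, the linear regression model, and the R-squared scoring are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, RepeatedKFold, cross_val_score

# Synthetic regression data, used only for illustration
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# 10-fold cross-validation: one performance estimate per held-out fold
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold, scoring="r2")

# Summarize the k estimates with the mean and the standard error
mean_score = scores.mean()
std_error = scores.std(ddof=1) / np.sqrt(len(scores))
print(f"10-fold R^2: {mean_score:.3f} +/- {std_error:.3f}")

# 10-fold cross-validation repeated 5 times gives 50 held-out sets
repeated = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
repeated_scores = cross_val_score(model, X, y, cv=repeated, scoring="r2")
print("Number of held-out sets:", len(repeated_scores))
```

Setting `shuffle=True` makes the partition random, as described above. Without it, `KFold` splits the samples in their existing order.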


### Exercise

In the code below, see how KFold from sklearn is used to create the cross-validation splits.

In [9]:
from sklearn.model_selection import KFold
import numpy as np
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)
print(kf)  
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

KFold(n_splits=2, random_state=None, shuffle=False)
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]




## Repeated Training/Test Splits

Repeated training/test splits is also known as “leave-group-out cross-validation” or “Monte Carlo cross-validation.” This technique simply creates multiple splits of the data into modeling and prediction sets. The proportion of the data going into each subset and the number of repetitions are controlled by the data scientist. The bias of the resampling technique decreases as the amount of data in the modeling subset approaches the amount in the full data set. A good rule of thumb is to put about 75–80% of the data into the modeling subset. Higher proportions are a good idea if the number of repetitions is large.

Increasing the number of repetitions decreases the uncertainty of the performance estimates.
For example, to get a rough estimate of model performance, 25 repetitions will be adequate if the user is willing to accept some instability in the resulting values. However, to get stable estimates of performance, it is suggested to choose a larger number of repetitions (say 50–200).
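One way to sketch repeated training/test splits is with `ShuffleSplit` from sklearn. The synthetic data, the linear regression model, the 80/20 proportion, and the 25 repetitions below are illustrative assumptions, not recommendations for a particular problem.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 25 independent random splits, each holding out 20% of the samples
splitter = ShuffleSplit(n_splits=25, test_size=0.2, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=splitter, scoring="r2")

# Inspect the sizes of the first modeling and prediction sets
train_idx, test_idx = next(splitter.split(X))
print("Modeling set size:", len(train_idx), "Prediction set size:", len(test_idx))
print(f"Mean R^2 over {len(scores)} repetitions: {scores.mean():.3f}")
print(f"Standard deviation of the estimates: {scores.std(ddof=1):.3f}")
```

Unlike k-fold cross-validation, the same sample can appear in the prediction set of several repetitions.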

### The Bootstrap

The bootstrap is another resampling technique for estimating model performance. A bootstrap sample is a random sample of the data taken with replacement: after a data point is selected for the subset, it is still available for further selection. The bootstrap sample is the same size as the original data set. As a result, some samples will be represented multiple times in the bootstrap sample while others will not be selected at all. The samples not selected are usually referred to as the “out-of-bag” samples. For a given iteration of bootstrap resampling, a model is built on the selected samples and is used to predict the out-of-bag samples.
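A single bootstrap iteration can be sketched with NumPy alone. The ten-sample data set and the variable names below are hypothetical and serve only to show how the bootstrap sample and the out-of-bag samples relate.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 10
indices = np.arange(n_samples)  # row indices of a hypothetical data set

# Draw a bootstrap sample: same size as the data, sampled with replacement
boot_idx = rng.choice(indices, size=n_samples, replace=True)

# Out-of-bag samples are the rows that were never selected
oob_idx = np.setdiff1d(indices, boot_idx)

print("Bootstrap sample:", boot_idx)
print("Out-of-bag samples:", oob_idx)
```

A model would then be fit on the rows in `boot_idx` and evaluated on the rows in `oob_idx`, and the whole procedure repeated many times.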



## Tuning parameters

Various ML libraries provide implementations of the ML algorithms.
With each algorithm, the libraries expose parameters that can be tuned so that the model achieves the best possible performance.
These parameters have to be determined by evaluating the performance of the model and adjusting them accordingly.

For example, in the KNN algorithm for regression, the optimal 'k' value needs to be determined by fitting the algorithm with various k values and comparing the performance of the resulting models. There are several approaches to evaluating the performance of the models, which will be dealt with in the next lesson in this course.
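A minimal sketch of tuning k for KNN regression is shown below, assuming synthetic data and an arbitrary grid of candidate values. `GridSearchCV` fits the model for each candidate using cross-validation and keeps the value with the best average score.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, used only for illustration
X, y = make_regression(n_samples=200, n_features=4, noise=15.0, random_state=0)

# Candidate values for the tuning parameter k (an arbitrary grid)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}

# Evaluate every candidate with 5-fold cross-validation
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
print("Best k:", best_k)
print("Cross-validated MSE for best k:", -search.best_score_)
```

The scoring name is negated because sklearn always maximizes the score, so a lower mean squared error corresponds to a higher (less negative) value.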




## Changing the data

The performance of the model can often be improved by refining the data used for training the model. Some of the techniques include:

1. Removing Predictors
2. Removing correlations
3. Creating Computed Variables

### Removing Predictors

There are potential advantages to removing predictors prior to modeling. First, fewer predictors means decreased computational time and complexity. Second, if two predictors are highly correlated, this implies that they are measuring the same underlying information. Removing one should not compromise the performance of the model and might lead to a more parsimonious and interpretable model. Third, some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables.

Consider a predictor variable that has a single unique value; we refer to this type of data as a zero variance predictor. For some models, such an uninformative variable may have little effect on the calculations. A model such as linear regression, however, would find these data problematic, and they are likely to cause an error in the computations. These data carry no information and can easily be discarded. Similarly, some predictors might have only a handful of unique values that occur with very low frequencies. These “near-zero variance predictors” may have a single value for the vast majority of the samples. They could be treated in a similar way.
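Zero variance predictors can be detected automatically. The sketch below uses `VarianceThreshold` from sklearn on a small made-up table; the column names and values are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: "country" has a single unique value
df = pd.DataFrame({
    "rooms": [4, 5, 6, 5, 7],
    "age": [30, 12, 45, 8, 20],
    "country": [1, 1, 1, 1, 1],
})

# Keep only the predictors whose variance is greater than the threshold
selector = VarianceThreshold(threshold=0.0)
selector.fit(df)

kept_columns = df.columns[selector.get_support()].tolist()
df_reduced = df[kept_columns]
print("Kept predictors:", kept_columns)
```

Raising `threshold` above zero would also remove predictors whose variance is very small, which is one simple way to screen for near-zero variance predictors.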


### Between-Predictor Correlations

Collinearity is the technical term for the situation where a pair of predictor variables have a substantial correlation with each other. It is also possible to have relationships between multiple predictors at once (called multicollinearity).

For example, in a housing price dataset, the crime rate might be strongly correlated with data on how secure the residents feel. Such redundancy should be removed, for instance by keeping only one predictor from each highly correlated group.
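One simple way to find and drop highly correlated predictors is to inspect the correlation matrix. The sketch below is an assumption-laden illustration: the data are simulated, the column names are hypothetical, and the 0.9 cutoff is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated predictors: perceived safety is almost a mirror of crime rate
crime_rate = rng.uniform(0, 10, 100)
df = pd.DataFrame({
    "crime_rate": crime_rate,
    "perceived_safety": 10 - crime_rate + rng.normal(0, 0.3, 100),
    "rooms": rng.normal(6, 1, 100),
})

# Absolute pairwise correlations between predictors
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the cutoff
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)

print("Dropped predictors:", to_drop)
```

This pairwise check does not detect multicollinearity that involves three or more predictors at once.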



### Creating Computed Variables

In some cases, a computed variable (feature) derived from multiple input variables might give a better model than the individual variables included as they are. For example, in a stock market dataset, a variable holding the average of the last 5 days can be more informative than the individual daily values. (This is called a moving average.)
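A 5-day moving average can be computed with the pandas `rolling` method. The closing prices and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical daily closing prices
prices = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]})

# Computed variable: mean of the current day and the 4 days before it
prices["ma_5"] = prices["close"].rolling(window=5).mean()

print(prices)
```

The first four rows of `ma_5` are missing because fewer than five days of history are available for them.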




### R-squared

R-squared measures how well the independent (predictor) variables explain the variation in the target variable. On the data used to fit a linear regression it lies between 0 and 1.0; on new data it can be negative when the model predicts worse than simply using the mean of the target. The higher the value, the better the predictors explain the target. A value close to 0.0 indicates that the selected predictors are poor indicators of the target variable. Adding informative independent variables (for example, through feature engineering) can raise R-squared and improve predictions. Note, however, that R-squared on the training data never decreases when a predictor is added, so a higher training value alone does not mean better predictions on new data.
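R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up observed and predicted values and compares the manual result with `r2_score` from sklearn.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: variation the model fails to explain
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: variation around the mean of the target
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print("Manual R-squared:", r2_manual)
print("sklearn R-squared:", r2_score(y_true, y_pred))
```

Here the residual sum of squares is 0.75 and the total sum of squares is 20, so R-squared is 1 - 0.75 / 20 = 0.9625.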

