We will learn how to evaluate the performance of our k-nearest neighbors algorithm. We'll define what we mean by performance, and then look at how to calculate metrics to judge whether the model is "good" or not. We'll also start using a handy R library called `caret`, which creates machine learning models and automates the process of evaluating their performance. Instead of coding everything by hand, we'll use `caret` to perform the various steps of the machine learning workflow.

We judge the performance of a machine learning algorithm by evaluating how well it predicts the outcomes of data it hasn't seen before. We can think of a machine learning algorithm as a function that takes in data and outputs predictions. If we feed in data that the algorithm hasn't seen yet, we can get predictions and then compare them to the actual outcomes contained in the data. This process is called **holdout validation**. Holdout validation is the simplest form of **cross-validation**, a term often used more generally for evaluating model performance on data that was held back from training.

By using this process, we don't actually have to go out and collect more data to evaluate our algorithms. We can divide a single dataset into parts, using one part to build the model and another part to assess its performance.
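To make the idea concrete, here is a minimal base R sketch of splitting one dataset into two parts. The data frame `toy` and its columns are made up for illustration, and the 80/20 split is only one possible choice.

```r
# Hypothetical toy data: 10 listings with made-up prices.
toy <- data.frame(id = 1:10, price = seq(50, 140, by = 10))

set.seed(1)  # make the random split reproducible

# Randomly pick 80% of the row numbers for the training part.
train_rows <- sample(nrow(toy), size = 0.8 * nrow(toy))

toy_train <- toy[train_rows, ]   # rows the model is allowed to see
toy_test  <- toy[-train_rows, ]  # rows held out for evaluation

nrow(toy_train)  # 8
nrow(toy_test)   # 2
```

Every row ends up in exactly one of the two parts, so the held-out rows really are unseen by anything built from `toy_train`.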

The word `caret` is actually an acronym, short for **Classification And REgression Training**. Classification and regression are the standard names for the two tasks that machine learning models typically perform. In our case, the k-nearest neighbors algorithm is attempting to predict a good rental price, which is a numeric value, so this is a regression problem. As we've mentioned before, `caret` streamlines the process of holdout validation, and it is capable of much more.

We'll home in on the first step of holdout validation:

![image.png](attachment:image.png)

The code below is an example of how to use the `createDataPartition()` function:

`train_indices <- createDataPartition(y = data[["tidy_price"]],
                                     p = 0.8,
                                     list = FALSE)`

![image.png](attachment:image.png)
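As a runnable illustration, the sketch below applies `createDataPartition()` to R's built-in `mtcars` dataset, which stands in for our listings data here. The choice of `mpg` as the outcome is an assumption made only for this example.

```r
library(caret)

set.seed(1)  # the partition is random, so fix the seed for reproducibility

# Ask for roughly 80% of the rows, returned as a matrix of row indices
# (list = FALSE) rather than a list.
train_indices <- createDataPartition(y = mtcars[["mpg"]],
                                     p = 0.8,
                                     list = FALSE)

train_cars <- mtcars[train_indices, ]   # rows selected for training
test_cars  <- mtcars[-train_indices, ]  # everything else becomes the test set

# The two parts together account for every row exactly once.
nrow(train_cars) + nrow(test_cars) == nrow(mtcars)
```

The negative index `-train_indices` is what guarantees that no row appears in both sets.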

**Task**

![image.png](attachment:image.png)

**Answer**

`set.seed(1)
library(caret)
train_indices <- createDataPartition(y = dc_listings[["tidy_price"]],
                                     p = 0.7,
                                     list = FALSE)`
                                     
`train_listings <- dc_listings[train_indices,]
test_listings <- dc_listings[-train_indices,]`

Now that we have our training set and test set, we need to "train" the algorithm and evaluate it against the test set. Before we can actually do the training, there is another step we need to take with the `caret` library. Holdout validation is actually one of many ways to do validation. Some of the other validation methods require setting particular parameters. 

`caret` conveniently provides us with a function to set these parameters for the validation process before we start training. This function is `trainControl()`. Below is an example of how we use the `trainControl()` function:

`train_control <- trainControl(method = "none")`

Holdout validation is one of the simplest methods to evaluate the performance of a model, and as such, it is not always the best. What's important to grasp here is that we need to specify these parameters in `trainControl()` before training the algorithm with `caret`. The `method = "none"` argument here specifies that we don't want to do any special resampling or multiple-fold validation with our algorithm. We'll come back to this function when we start examining more sophisticated validation methods, but for now the above code works.
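For a sense of what other settings look like, the sketch below builds two control objects side by side. The 5-fold setting is only an illustrative preview and is not used elsewhere in this file.

```r
library(caret)

# No resampling: fit a single model on the training data we supply.
holdout_control <- trainControl(method = "none")

# Illustrative alternative: 5-fold cross-validation.
cv_control <- trainControl(method = "cv", number = 5)

# The returned objects are lists that simply record our choices.
holdout_control$method
cv_control$method
cv_control$number
```

Neither call trains anything. Each one only records the validation settings that `caret` will read later.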

![image.png](attachment:image.png)

The code below is an example use of the `train()` function.

`train_control <- trainControl(method = "none")`

`knn_model <- train(outcome ~ predictor1 + predictor2, 
                   data = training_data, 
                   method = "knn", 
                   trControl = train_control)`
                   
The first argument that we see in the `train()` example above is a formula. The formula, `outcome ~ predictor1 + predictor2`, is what we provide to the `train()` function to tell it what we are trying to predict (`outcome`) through two features in the data (`predictor1` and `predictor2`).

Earlier, we only learned how to use one feature (`accommodates`) to predict `tidy_price`, so we might write this as `tidy_price ~ accommodates`. The `~` character is what we use to separate the outcome we'd like to predict from the features we're using to do so.
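A formula is an ordinary R object, so we can create one and inspect it without any data. The short sketch below uses only base R, and the variable name `price_formula` is made up for illustration.

```r
# Outcome on the left of ~, features on the right, separated by +.
price_formula <- tidy_price ~ accommodates + maximum_nights

class(price_formula)     # "formula"

# all.vars() lists every variable named in the formula, outcome first.
all.vars(price_formula)  # "tidy_price" "accommodates" "maximum_nights"
```

The column names are only looked up in the data when the formula is passed to a modeling function, which is why this runs even though no `tidy_price` column exists here.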

![image.png](attachment:image.png)

The output of the `train()` function is a list that essentially contains the trained machine learning model. With this trained model, we can move further with the holdout process.

**Task**

![image.png](attachment:image.png)


**Answer**

`set.seed(1)
library(caret)
train_indices <- createDataPartition(y = dc_listings[["tidy_price"]],
                                     p = 0.7,
                                     list = FALSE)`
                                     
`train_listings <- dc_listings[train_indices,]
test_listings <- dc_listings[-train_indices,]
train_control <- trainControl(method = "none")`

`knn_model <- train(tidy_price ~ accommodates + maximum_nights, 
                   data = train_listings, 
                   method = "knn", 
                   trControl = train_control)`

Above, we used the `train()` function to "train" a version of the k-nearest neighbors algorithm. By "training an algorithm", we really mean that the rows of the training data will be used as the neighbors for new listings whose prices we want to predict. With a trained model, we can now start producing predictions on the test set! This is the next step in the holdout process.
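To illustrate what "using the training rows as neighbors" means, here is a hand-written sketch of a single k-nearest neighbors prediction with one feature. The data and the helper `knn_predict_one()` are hypothetical, and `caret`'s own implementation differs in its details (for example, how it measures distance across several features).

```r
# Hypothetical training rows: made-up listings, not taken from dc_listings.
toy_training <- data.frame(
  accommodates = c(1, 2, 2, 4, 6),
  tidy_price   = c(50, 80, 90, 150, 220)
)

# Predict a price by averaging the prices of the k closest training rows.
knn_predict_one <- function(new_accommodates, training, k = 3) {
  # Distance from the new listing to every training row.
  distances <- abs(training$accommodates - new_accommodates)
  # Row positions of the k smallest distances.
  nearest <- order(distances)[seq_len(k)]
  mean(training$tidy_price[nearest])
}

# A new listing that accommodates 2 people: its three nearest neighbors
# have prices 80, 90, and 50, so the prediction is their average.
knn_predict_one(2, toy_training, k = 3)
```

Notice that nothing is "learned" ahead of time in this sketch: the training data itself is the model, and all the work happens at prediction time.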

![image.png](attachment:image.png)

**Task**

Using the example code above, create predictions for `test_listings`.
* Use both `accommodates` and `maximum_nights` as features in this algorithm.
    
**Answer**

`set.seed(1)
library(caret)
train_indices <- createDataPartition(y = dc_listings[["tidy_price"]],
                                     p = 0.7,
                                     list = FALSE)`
                                     
`train_listings <- dc_listings[train_indices,]
test_listings <- dc_listings[-train_indices,]
train_control <- trainControl(method = "none")`

`knn_model <- train(tidy_price ~ accommodates + maximum_nights, 
                   data = train_listings, 
                   method = "knn", 
                   trControl = train_control)`
                   
`test_predictions <- predict(knn_model, newdata = test_listings)`

![image.png](attachment:image.png)

We typically quantify this in terms of **error**, or how much the predictions differ from the actual values. If the predicted price closely matches the actual price, then the error will be small. Conversely, if the predicted price is nowhere near the actual price, then we would see a large error. Using the results of the `predict()` function, we can create a new column in the test set that captures the error for each listing. Below is an example of how we might do this using hypothetical column names:

`test_predictions <- predict(knn_model, newdata = test_listings)`

`library(dplyr)  # provides %>% and mutate()
test_listings <- test_listings %>%
    mutate(
        error = actual_price - test_predictions
    )`

**Task**

Create a new column in `test_listings` called `error` that captures how much the model predictions differ from the actual listing prices.

**Answer**

`test_listings <- test_listings %>%
  mutate(
    error = tidy_price - test_predictions
  )`

![image.png](attachment:image.png)

There is a specific term for this summary value: **the error metric**. There are many types of error metrics. We'll focus on one metric: **the root mean squared error (RMSE)**.

![image.png](attachment:image.png)

This difference is then squared. We square the difference between the predicted and actual value for multiple reasons. In short, squaring makes every error non-negative. If we simply summed the raw errors, we would run into trouble: some errors are positive while others are negative, so they would cancel each other out. This cancellation is undesirable, so we use squaring to prevent it.
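A tiny example with made-up error values shows the cancellation problem and how squaring avoids it.

```r
# Hypothetical errors: two predictions too high, two too low.
errors <- c(-20, 20, -5, 5)

# Summing the raw errors suggests a perfect model, which is misleading.
sum(errors)    # 0

# Squaring first makes every term non-negative, so nothing cancels.
sum(errors^2)  # 850
```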

Each difference is squared, and then we calculate the mean, or average, of the squared differences. Recall that the mean is useful for summarizing numerical data, which is why we use it here. After calculating the average squared difference, we take its square root. Taking the square root puts the error back in the units of the original outcome. In our case, squaring the difference of listing prices means that its unit is also squared (dollars squared). This doesn't make sense, so we convert it back into dollar amounts to make the RMSE interpretable again. All of these details together are why this error metric is named "root mean squared error".
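Written as a formula, the steps above look like this. The symbols are our own notation for this sketch: $n$ is the number of listings in the test set, $y_i$ is the actual price of listing $i$, and $\hat{y}_i$ is the predicted price.

```latex
$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$
```

Reading from the inside out gives the name in reverse: the squared error, then its mean, then the root.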

As a last note, some sources omit the square-root step, leaving the error metric in terms of squared differences. This metric is called the **mean squared error (MSE)**. It is still a valid error metric and is widely used, so it's important to be aware of it. For our purposes, however, we'll use the RMSE.

**Task**

1. First, calculate the squared error based on the original error column that we created. Give this column the name `squared_error`.
2. Using this new `squared_error` column, calculate the **RMSE**. Assign this to the variable `rmse`.

**Answer**

`test_listings <- test_listings %>%
  mutate(
    squared_error = error^2
  )`

`rmse <- sqrt(mean(test_listings$squared_error))`
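As a sanity check on the hand calculation, `caret` also exports an `RMSE()` helper that takes predictions and observed values. The sketch below compares the two approaches on made-up numbers rather than on `test_listings`.

```r
library(caret)

# Hypothetical actual and predicted prices.
actual    <- c(100, 150, 200)
predicted <- c(110, 140, 190)

# By hand: every error is 10 dollars in size, so the RMSE is 10.
manual_rmse <- sqrt(mean((actual - predicted)^2))

# The same quantity using caret's helper.
caret_rmse <- RMSE(pred = predicted, obs = actual)

all.equal(manual_rmse, caret_rmse)  # TRUE
```

Doing the calculation by hand at least once makes it clearer what the helper is doing for us.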

In this file, we learned how to evaluate our k-nearest neighbors model using holdout validation. Holdout validation is incredibly useful to know for a data scientist. There are many types of validation out there, but all of them follow a similar process to holdout validation.