# Introduction

This is the companion workbook to [Picking the Best Model with Caret](https://www.kaggle.com/rtatman/picking-the-best-model-with-caret). Read through the lesson and then come here to complete the exercises so all your work is in one central place.

For this workbook we'll be working with a dataset of FIFA (football/soccer) players from EA Sports' FIFA video game series. You'll be predicting a player's rating (the `Rating` column) from their other attributes.
____

**Remember**: If you want to share this notebook you need to make it public so that other people can see it. You can do that by forking this notebook and then selecting "public" on the drop-down menu to the left of the "Publish" button.
____
# Table of Contents: 

* [Setting up our environment](#Setting-up-our-environment)
* [Clean our data](#Clean-our-data)
* [Split data into testing & training](#Split-data-into-testing-&-training)
* [Fit a baseline model](#Fit-a-baseline-model)
* [Tune a model using caret](#Tune-a-model-using-caret)
* [Compare our models](#Compare-our-models)

# Setting up our environment
___

Here I've set up the environment for you. Remember to run this cell first or the other cells won't work! :)

In [1]:
# libraries we'll use
library(tidyverse) # utility functions
library(caret) # hyperparameter tuning
library(randomForest) # random forest models
library(Metrics) # useful metrics (including rmse)

# read in data
player_statistics <- read_csv("../input/FullData.csv")

# Clean our data
___

This dataset is pretty clean. All you'll need to do is remove rows with NAs and select just the numeric columns.

In [5]:
# omit rows with NAs and keep only the numeric columns
player_statistics <- player_statistics %>%
    na.omit() %>%
    select_if(is.numeric)
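
To see what these two cleaning steps do, here is a minimal sketch on a tiny made-up data frame. It uses base R only so it runs without extra packages; the `players` data frame and its values are hypothetical and are not taken from FullData.csv.

```r
# toy data frame (hypothetical values, not from FullData.csv)
players <- data.frame(
  Name   = c("A", "B", "C", "D"),
  Rating = c(90, 85, NA, 70),
  Speed  = c(80, NA, 75, 60),
  stringsAsFactors = FALSE
)

# drop every row that contains at least one NA
complete_rows <- na.omit(players)

# keep only the numeric columns (Name is character, so it is dropped)
numeric_only <- complete_rows[, sapply(complete_rows, is.numeric), drop = FALSE]

print(numeric_only)
```

Only players A and D survive, and only the `Rating` and `Speed` columns remain.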


Check your dataframe out to make sure it looks reasonable. 

In [6]:
# check out your data frame using the str() function
str(player_statistics)


# Split data into testing & training
____

Split your data so that 80% of it is in the training set and 20% is in the testing set.

In [8]:
# set a random seed
set.seed(1234)

# 80/20 train/test split (p is the proportion of rows that goes into training)
training_indexs <- createDataPartition(player_statistics$Rating, p = .8, list = FALSE)
training <- player_statistics[training_indexs, ]
testing  <- player_statistics[-training_indexs, ]
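
It's worth checking that a split has the proportions you intended. The sketch below shows the idea on made-up ratings, using base R's `sample()` as a simple stand-in for `createDataPartition()` (which additionally tries to keep the distribution of the outcome similar in both sets). The names `ratings`, `train_idx`, `training_toy` and `testing_toy` are hypothetical.

```r
set.seed(1234)

# made-up ratings, standing in for player_statistics$Rating
ratings <- round(runif(1000, min = 45, max = 94))
n <- length(ratings)

# pick 80% of the row indexes at random for training
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

# rows in train_idx go to training, everything else to testing
training_toy <- ratings[train_idx]
testing_toy  <- ratings[-train_idx]

# proportion of rows in each set
cat("training:", length(training_toy) / n, "\n")
cat("testing: ", length(testing_toy) / n, "\n")
```

On your real data, `nrow(training) / nrow(player_statistics)` should come out close to 0.8.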


Convert your predictors into a matrix & get a vector of your target variables.

In [9]:
# get a matrix of predictors and a vector of our target variable
predictors <- training %>% select(-Rating) %>% as.matrix()
output <- training$Rating



# Fit a baseline model
____

Fit a baseline model using randomForest. I'd recommend setting "ntree" to 25.

> **How can I figure out what a good ntree is?** You can check the output of a random forest model as it adds more trees by setting the argument do.trace to TRUE. It will print out the mean squared error and the percent of the variance your model doesn't explain for each number of trees. You can then pick an ntree that is near the "elbow", the point at which adding another tree stops dramatically improving your model's fit.
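
As a rough illustration of looking for the elbow, here is a small sketch on R's built-in `mtcars` data (an assumption made for the example; it is not the FIFA data). Passing an integer to `do.trace` prints progress every that many trees, and for a regression forest the out-of-bag error after each tree is stored in the model's `mse` component.

```r
library(randomForest)

set.seed(1234)

# predict mpg from the other mtcars columns, printing progress every 10 trees
trace_model <- randomForest(x = mtcars[, -1], y = mtcars$mpg,
                            ntree = 100, do.trace = 10)

# out-of-bag mean squared error after 1, 2, ..., 100 trees
oob_mse <- trace_model$mse

# plot error against number of trees and look for where the curve flattens
plot(trace_model)
```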

Once you've trained a model, you can examine it by calling the variable you assigned it to. (So if you called your model "base_model", you can look at your model by running a line that's just "base_model".)

In [10]:
# fit a model
base_model <- randomForest(x = predictors, y = output,
                           ntree = 25) # number of trees

# examine your model
base_model


Finally, go ahead and calculate the rmse (root mean squared error) for your base model on your held-out test data.

In [11]:
# find the rmse on our test data
rmse(predict(base_model, testing), testing$Rating)
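
If you're curious what the root mean squared error actually computes, here is a minimal hand-rolled version in base R. The helper `rmse_by_hand` and the example numbers are made up for illustration; in the workbook itself you can keep using `rmse()` from the Metrics package.

```r
# root mean squared error: square the errors, average them, take the square root
rmse_by_hand <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# made-up ratings and predictions
actual    <- c(80, 75, 90)
predicted <- c(78, 75, 94)

# errors are 2, 0 and -4, so the squared errors are 4, 0 and 16
rmse_by_hand(actual, predicted) # sqrt(20 / 3), about 2.58
```

Because the errors are squared before averaging, a few large misses raise the rmse more than many small ones.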


# Tune a model using caret
____

Now that you have a base model to compare against, try tuning a model using the "train" function from the caret package. You can examine your model by printing the variable you assigned your model to using the print() function.

In [12]:
# tune a model
tuned_model <- train(x = predictors, y = output,
                     ntree = 5, # number of trees (passed to random forest)
                     method = "rf") # random forests

# examine your tuned model
print(tuned_model)
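
By default, train() picks its own candidate mtry values and evaluates them with bootstrap resampling. If you want more control, one option is to pass your own resampling scheme and grid of values. The sketch below does this on R's built-in `mtcars` data (an assumption for the example, not the FIFA data); `cv_control`, `mtry_grid` and `cv_model` are hypothetical names.

```r
library(caret)
library(randomForest)

set.seed(1234)

# evaluate each candidate with 5-fold cross-validation
cv_control <- trainControl(method = "cv", number = 5)

# the mtry values we want caret to try
mtry_grid <- expand.grid(mtry = c(2, 4, 6))

cv_model <- train(x = mtcars[, -1], y = mtcars$mpg,
                  method = "rf",          # random forests
                  ntree = 25,             # passed on to randomForest
                  trControl = cv_control,
                  tuneGrid = mtry_grid)

# the mtry value caret selected, and the resampled error for every candidate
cv_model$bestTune
cv_model$results
```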

You can also check out the error over the different mtry values that caret tried by passing the model to the ggplot function. 

In [13]:
# plot the error over the various mtry values
ggplot(tuned_model)

# Compare our models
___

Now that we have our two models, let's compare them to each other.

> **Tip:** You can access the automatically-picked best model by getting the finalModel component for your tuned model. So if you called your tuned model ```tuned_model```, the best model would be ```tuned_model$finalModel```.

First, compare the root mean squared error (rmse) for each of your models on the test data. Which model has a lower overall error on the test data? Why might this be?

In [14]:
# get rmse for the base model on the testing data
print("Base model rmse:")
print(rmse(predict(base_model, testing), testing$Rating))

# get rmse for the tuned model on the testing data
print("Tuned model rmse:")
print(rmse(predict(tuned_model$finalModel, testing), testing$Rating))

Second, look at the five most important variables for each model. Are they the same?

In [15]:
# plot the relative variable importance for our tuned & untuned models

# two columns, 1 row (for plots)
par(mfrow = c(1,2))

# plot both variable importances
varImpPlot(base_model, n.var = 5)
varImpPlot(tuned_model$finalModel, n.var = 5)
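
If you'd rather compare the numbers behind the plots, the importance() function from randomForest returns the same scores as a matrix. Here is a small sketch on R's built-in `mtcars` data (an assumption for the example); `toy_model`, `imp` and `top5` are hypothetical names.

```r
library(randomForest)

set.seed(1234)

# small regression forest predicting mpg from the other mtcars columns
toy_model <- randomForest(x = mtcars[, -1], y = mtcars$mpg, ntree = 25)

# one row per predictor; for regression the default score is IncNodePurity
imp <- importance(toy_model)

# sort from most to least important and keep the top five
top5 <- head(imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE], 5)
print(top5)
```

Running the same lines on `base_model` and `tuned_model$finalModel` gives you two short tables you can compare side by side.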


# And that's it! :)
___

Nice work! Now that you've got some practice, why not try using caret to tune a different model, like xgboost? You can check out an [R xgboost tutorial here](https://www.kaggle.com/rtatman/machine-learning-with-xgboost-in-r).

Happy analyzing!