# TODO

1. ROC curve for trees + logistic regression
2. Cross validation of linear regression -- see bottom of script
3. Better model metrics around linear regression + logistic regression

# The Machine Learning Process

* Exploratory data analysis
* Build the simplest model
* Iterate


# Machine Learning: Key Terms

<table>
    <tr>
    <td> <img src="confusion_matrix.png" alt="Drawing" style="width: 500px;"/> </td>
    <td> <img src="bootstrapping.png" alt="Drawing" style="width: 500px;"/> </td>
    </tr>
</table>



# R Packages We'll Use

* caret: package for Classification And REgression Training (https://topepo.github.io/caret/index.html)
* tidyverse: set of packages for tidy data science (https://www.tidyverse.org)
* keras: interface for R into the keras deep learning library (https://keras.rstudio.com)
* rpart: package for Recursive Partitioning And Regression Trees (https://cran.r-project.org/web/packages/rpart/rpart.pdf)



In [1]:
# Package names must be passed as one character vector; otherwise the second
# string is interpreted as the library path (the `lib` argument).
install.packages(c("tidyverse", "caret", "skimr", "AppliedPredictiveModeling", "keras", "modelr", "rpart"))


# Sources

* Alligator Data: https://www.r-bloggers.com/simple-linear-regression-2/
* Abalone Data: http://archive.ics.uci.edu/ml/datasets/Abalone



# Resources
* DataCamp’s Machine Learning with R skill track (requires paid access).
* Useful worked example: https://cfss.uchicago.edu/persp003_linear_regression.html
* Modeling in the tidyverse: http://r4ds.had.co.nz/model-basics.html
* Exploratory visualizations: https://machinelearningmastery.com/data-visualization-in-r/
* https://tutorials.iq.harvard.edu/R/Rstatistics/Rstatistics.html
* MLR is working towards a scikit-like implementation https://github.com/mlr-org/mlr


# Supervised vs. Unsupervised

In supervised learning (SML), the algorithm is trained on labelled examples: each input comes with the output we want the model to produce. SML splits into classification, where the output is categorical, and regression, where the output is numerical.

In unsupervised learning (UML), no labels are provided, and the algorithm looks only for structure in the unlabelled input data.
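
To make the distinction concrete, here is a minimal base R sketch. The choice of the built-in `mtcars` data, the predictors, and the 0.5 cut-off are assumptions made for illustration only.

```r
# Supervised, regression: a numeric label (mpg) is given to the learner.
reg_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Supervised, classification: a categorical label (am: 0 = automatic, 1 = manual).
clf_fit  <- glm(am ~ wt, data = mtcars, family = binomial)
clf_pred <- as.integer(predict(clf_fit, type = "response") > 0.5)

# Unsupervised: no label at all; k-means only looks for structure in the inputs.
set.seed(1)
clusters <- kmeans(scale(mtcars[, c("wt", "hp")]), centers = 2)

# Predictions can be checked against the known labels; clusters cannot.
table(predicted = clf_pred, actual = mtcars$am)
table(cluster = clusters$cluster)
```

Only the two supervised fits can be scored against a known answer; the cluster assignments have no reference labels to compare with.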

# Data Pre-processing

In [None]:
## Missing values

Real datasets often come with missing values. In R, these should be encoded as NA. There are two basic approaches to dealing with them:

* Drop the observations with missing values or, if one feature contains a very high proportion of NAs, drop the feature altogether. This is only appropriate when the proportion of missing values is relatively small; otherwise it could lead to losing too much data.
* Impute the missing values.

Data imputation can however have critical consequences depending on the proportion of missing values and their nature. From a statistical point of view, missing values are classified as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR), and the type of the missing values will influence the efficiency of the imputation method.
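
Both approaches can be sketched in base R. The built-in `airquality` data and median imputation are assumptions chosen for illustration; median imputation is only one simple option and ignores whether values are MCAR, MAR or MNAR.

```r
# airquality ships with base R; some of its columns contain NAs.
data(airquality)
colMeans(is.na(airquality))   # proportion of NAs per feature

# Approach 1: drop every observation that has at least one NA.
aq_dropped <- na.omit(airquality)

# Approach 2: impute, here with the column median.
aq_imputed <- airquality
for (col in names(aq_imputed)) {
  missing <- is.na(aq_imputed[[col]])
  aq_imputed[[col]][missing] <- median(aq_imputed[[col]], na.rm = TRUE)
}

# Dropping loses rows; imputing keeps them all.
c(original = nrow(airquality), dropped = nrow(aq_dropped), imputed = nrow(aq_imputed))
```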

The figure below shows how different imputation methods perform depending on the proportion and nature of missing values (from Lazar et al., on quantitative proteomics data).

In [None]:
We could perform imputation manually, but caret provides a whole range of pre-processing methods, including imputation methods, that can directly be passed when training the model.

In [None]:
We have seen in the Unsupervised learning chapter how data at different scales can substantially disrupt a learning algorithm. Scaling (division by the standard deviation) and centring (subtraction of the mean) can also be applied directly during model training by setting the preProcess argument. Note that train applies no pre-processing unless preProcess is set, although the "pca" and "knnImpute" methods centre and scale the data themselves.

train(X, Y, preProcess = "scale")
train(X, Y, preProcess = "center")
As we have discussed in the section about Principal component analysis, PCA can be used as a pre-processing method: it generates a set of high-variance, perpendicular predictors, which prevents collinearity.

train(X, Y, preProcess = "pca")

In [None]:
## Multiple pre-processing methods

It is possible to chain multiple pre-processing methods: imputation, centring, scaling and PCA.

train(X, Y, preProcess = c("knnImpute", "center", "scale", "pca"))
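
To show what centring, scaling and PCA compute, here is a base R sketch that does not go through caret. The `mtcars` columns are an arbitrary choice for illustration, and this is not how caret implements these steps internally.

```r
X <- as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")])

# Centring subtracts each column mean; scaling divides by each column sd.
X_centred <- sweep(X, 2, colMeans(X), "-")
X_scaled  <- sweep(X_centred, 2, apply(X, 2, sd), "/")

# scale() does both steps in one call.
stopifnot(isTRUE(all.equal(X_scaled, scale(X), check.attributes = FALSE)))

# PCA on the standardised data gives uncorrelated (perpendicular) predictors.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
round(cor(pca$x), 10)     # off-diagonal entries are zero up to rounding
summary(pca)$importance   # variance explained per component
```

In a real workflow the means, standard deviations and rotation should be estimated on the training data only and then reused on new data.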




# Recipes and Baking

In [None]:
To deal with the dummy variable issue, we can expand the recipe with more steps:
library(recipes)
mod_rec <- recipe(Sale_Price ~ Longitude + Latitude + Neighborhood, data = ames_train) %>%
  step_log(Sale_Price, base = 10) %>%
  # Lump factor levels that occur in <= 5% of data as "other"
  step_other(Neighborhood, threshold = 0.05) %>%
  # Create dummy variables for _any_ factor variables
  step_dummy(all_nominal())
Note that we can use standard dplyr selectors as well as some new ones based on the data type (all_nominal()) or on the role in the analysis (all_predictors()).

In [None]:
Now that we have a preprocessing specification, let's run it on the training set to prepare the recipe:
mod_rec_trained <- prep(mod_rec, training = ames_train, retain = TRUE, verbose = TRUE)
## oper 1 step log [training]
## oper 2 step other [training]
## oper 3 step dummy [training]
Here, the "training" determines which factor levels to pool and enumerates the factor levels of the Neighborhood variable. Setting retain = TRUE keeps the processed version of the training set around so we don't have to recompute it.

In [None]:
Once the recipe is prepared, it can be applied to any data set using bake:
# The argument is new_data in current recipes releases (it was newdata in early ones)
ames_test_dummies <- bake(mod_rec_trained, new_data = ames_test)
names(ames_test_dummies)
## [1] "Sale_Price" "Longitude" "Latitude"
## [4] "Neighborhood_College_Creek" "Neighborhood_Old_Town" "Neighborhood_Edwards"
## [7] "Neighborhood_Somerset" "Neighborhood_Northridge_Heights" "Neighborhood_Gilbert"
## [10] "Neighborhood_Sawyer" "Neighborhood_other"
If retain = TRUE, the training set does not need to be "rebaked": the juice function returns the processed version of the training data.
Selectors can be used with bake; the default is everything().
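
The split between preparing (estimating from the training set) and baking (applying to any data set) can be sketched in base R. The helpers `prep_other` and `bake_other` and the toy data are hypothetical, written for this illustration; they are not part of recipes and only imitate the idea behind step_other and step_dummy.

```r
# Toy data (made up for illustration).
train <- data.frame(nbhd = c(rep("A", 25), rep("B", 14), "C"))
test  <- data.frame(nbhd = c("A", "C", "D"))

# "prep": learn from the training set which levels are frequent enough to keep.
prep_other <- function(x, threshold = 0.05) {
  freq <- prop.table(table(as.character(x)))
  names(freq)[freq > threshold]
}

# "bake": apply the learned levels to any data set, lumping the rest as "other".
bake_other <- function(x, keep) {
  x <- as.character(x)
  x <- ifelse(x %in% keep, x, "other")
  factor(x, levels = c(keep, "other"))
}

keep <- prep_other(train$nbhd)             # estimated once, on the training data
test$nbhd <- bake_other(test$nbhd, keep)   # reused unchanged on the test data

# Dummy (indicator) columns; the fixed levels keep train and test columns aligned.
model.matrix(~ nbhd - 1, data = test)
```

Because the kept levels are fixed at the preparation stage, a level that only appears in the test set (here "D") falls into "other" instead of creating a new column.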

# Training vs. Test Set

We typically split data into training and test data sets:

* Training set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.
* Test set: these data can be used to get an independent assessment of model efficacy. They should not be used during model training.

In [None]:
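library(rsample)
# `ames` is assumed to be the Ames housing data, e.g. ames <- AmesHousing::make_ames()
# Make sure that you get the same random numbers
set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price")
ames_train <- training(data_split)
ames_test <- testing(data_split)
# Proportion of the data that ended up in the training set
nrow(ames_train) / nrow(ames)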



library(ggplot2)
## Do the distributions line up?
ggplot(ames_train, aes(x = Sale_Price)) +
  geom_line(stat = "density", trim = TRUE) +
  geom_line(data = ames_test, stat = "density", trim = TRUE, col = "red")
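
For readers without the Ames data, the idea of a stratified split can be sketched in base R. This is an approximation written for illustration, not rsample's implementation; the `mtcars` data, the quartile bins and the 75% proportion are assumptions.

```r
set.seed(4595)

# Bin the numeric outcome into quartiles, then sample 75% within each bin,
# so that training and test sets cover the outcome range in similar proportions.
bins <- cut(mtcars$mpg,
            breaks = quantile(mtcars$mpg, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)
train_idx <- unlist(lapply(split(seq_len(nrow(mtcars)), bins), function(idx) {
  idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
}))

car_train <- mtcars[train_idx, ]
car_test  <- mtcars[-train_idx, ]

# Proportion of rows in the training set.
nrow(car_train) / nrow(mtcars)
```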

# Comparing Models

In [None]:
We can now use the caret::resamples function to collect the resampling results of the models, so that we can compare them and pick the one with the highest AUC and lowest AUC standard deviation.

model_list <- list(glmnet = glm_model,
                   rf = rf_model,
                   knn = knn_model,
                   svm = svm_model,
                   nb = nb_model)
resamp <- resamples(model_list)
resamp

In [None]:
summary(resamp)

lattice::bwplot(resamp, metric = "ROC")
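
The comparison rests on every model being evaluated on the same resamples. A minimal base R sketch of that idea follows; it uses RMSE on `mtcars` with two illustrative linear models rather than AUC on the classifiers above, and it is not how caret::resamples is implemented.

```r
set.seed(42)
k <- 5
# Assign each row to one of k folds; the same folds are used for all models.
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

formulas <- list(small = mpg ~ wt,
                 large = mpg ~ wt + hp + disp)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# One row per fold, one column per model.
cv_rmse <- sapply(formulas, function(f) {
  sapply(seq_len(k), function(i) {
    fit <- lm(f, data = mtcars[folds != i, ])
    rmse(mtcars$mpg[folds == i], predict(fit, mtcars[folds == i, ]))
  })
})

# Mean and spread of the metric across resamples, per model.
rbind(mean = colMeans(cv_rmse), sd = apply(cv_rmse, 2, sd))
```

For an error metric such as RMSE, lower is better, so the preferred model has a low mean and a low standard deviation.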

# Grid Search

In [None]:
We usually don't have two-dimensional data, so a quantitative method for measuring overfitting is needed. Resampling fits that description. A simple method for tuning a model is to use grid search:
├── Create a set of candidate tuning parameter values
├── For each resample
│   ├── Split the data into analysis and assessment sets
│   ├── [preprocess data]
│   └── For each tuning parameter value
│       ├── Fit the model using the analysis set
│       └── Compute the performance on the assessment set and save
├── For each tuning parameter value, average the performance over resamples
├── Determine the best tuning parameter value
└── Create the final model with the optimal parameter(s) on the training set
Random search is a similar technique where the candidate set of parameter values is simulated at random across a wide range. Also, an example of nested resampling can be found here.
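
The grid search steps above can be sketched in base R. The tuning parameter (polynomial degree), the built-in `cars` data, 5-fold cross-validation and RMSE are all assumptions chosen to keep the example small; there is no pre-processing step here.

```r
set.seed(123)
k <- 5
grid <- 1:4   # candidate tuning parameter values (polynomial degrees)
folds <- sample(rep(seq_len(k), length.out = nrow(cars)))

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# perf[i, j]: performance of candidate j on the assessment set of resample i.
perf <- matrix(NA_real_, nrow = k, ncol = length(grid),
               dimnames = list(NULL, paste0("degree_", grid)))

for (i in seq_len(k)) {
  analysis   <- cars[folds != i, ]   # used to fit the model
  assessment <- cars[folds == i, ]   # used to measure performance
  for (j in seq_along(grid)) {
    fit <- lm(dist ~ poly(speed, grid[j]), data = analysis)
    perf[i, j] <- rmse(assessment$dist, predict(fit, assessment))
  }
}

avg_perf    <- colMeans(perf)              # average over resamples
best_degree <- grid[which.min(avg_perf)]   # best tuning parameter value

# Final model: refit on the whole training set with the chosen value.
final_fit <- lm(dist ~ poly(speed, best_degree), data = cars)
```

For random search, `grid` would be drawn at random from a wide range instead of being fixed up front.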

# Sources / Resources

* https://lgatto.github.io/IntroMachineLearningWithR


In [None]:
## Credit

Many parts of this course have been influenced by DataCamp's Machine Learning with R skill track, in particular the Machine Learning Toolbox (supervised learning chapter) and the Unsupervised Learning in R (unsupervised learning chapter) courses.

The very hands-on approach has also been influenced by the Software and Data Carpentry lessons and teaching styles.

In [None]:
* http://www.tidyverse.org/
* R for Data Science
* Jenny's purrr tutorial or Happy R Users Purrr
* Programming with dplyr vignette
* Selva Prabhakaran's ggplot2 tutorial
* caret package documentation
* CRAN Machine Learning Task View