# Section 12 - Supervised Learning

In [1]:
library(ggplot2)
library(tidyr)
library(magrittr)
library(gridExtra)
library(data.table)

# install.packages("caret", dependencies = TRUE)
library(caret)

# install.packages("randomForest")
library(randomForest)

options(repr.plot.width = 7, repr.plot.height = 5, repr.plot.res = 100)
options(warn=-1)

file_path <- file.path("/Users", "donatabuozyte", "Downloads", "extdata")


Attaching package: ‘magrittr’

The following object is masked from ‘package:tidyr’:

    extract

Loading required package: lattice
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:gridExtra’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin



In [2]:
# --------------------

## Introduction

### 1. Motivation

Interested in good prediction, not in identifying most predictive features or drawing conclusions about statistical independence.

New goal is less ambitious, but focuses on prediction accuracy.

### 2. Supervised vs. Unsupervised Learning

Supervised learning: build model taking feature values as input and returning predictions by training an algorithm.

Unsupervised learning: build model identifying patterns in distribution of data.

---

In terms of statistical learning:
- supervised learning: fit conditional distribution $p(y|x)$ with outcome $y$ and input $x$.
- unsupervised learning: fit distribution $p(x)$ or some aspects of it.

### 3. Notation

Outcome (to be predicted): $y$.

Features (predict the outcome): $x_{1}, \ldots, x_{p}$

### 4. Basic Approach in Supervised Machine Learning

Regression/classification tasks: have series of features and unknown numeric or categorical outcome.

How to build model:
1. collect data where outcome is known.
2. train model based on this data.
3. use trained model to determine unknown outcomes.
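
The three steps above can be sketched in base R. This is a minimal illustration on simulated data, not part of the lecture material: the data frames `known` and `unknown`, the linear model, and all numeric values are assumptions chosen for the example.

```r
set.seed(1)  # fixed seed so the simulated data are reproducible

# 1. collect data where the outcome is known (simulated here, hypothetical values)
known <- data.frame(x = runif(50, 0, 10))
known$y <- 2 + 3 * known$x + rnorm(50, sd = 1)

# 2. train a model on this data (a simple linear regression as stand-in)
fit <- lm(y ~ x, data = known)

# 3. use the trained model to predict outcomes for inputs with unknown outcome
unknown <- data.frame(x = c(2.5, 7.0))
predicted <- predict(fit, newdata = unknown)
predicted
```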

In [3]:
# --------------------

## Over- and Underfitting

Goal: model generalizing well and capturing trend of data.

Underfitting: model does not capture trend of data -> high error between actual and predicted outcomes.

Overfitting: model does not generalize well, but fits training data too well.

### 1. Example: Polynomial Curve Fitting

Data: $n = 10$, generated via $f(x) = \sin{(2 \pi x)}$ with added random noise from normal distribution.

Model: $y(x, w) = \sum_{j=0}^{m} w_{j} x^{j}$. (Note: non-linear in $x$, but linear in coefficients $w$)

Find coefficients: use linear regression model to find $w^{*}$ for given order $m$.

---

Results:
- $m \in \{0,1\}$: underfitting.
- $m=3$: quite good approximation, captures trends well, good generalization.
- $m \in \{9,10\}$: overfitting.

---

Conclusion:

As supervised learning depends only on data, the complexity (e.g. the order of polynomials) of the functions in the possible models should be chosen based on data and amount of data. For example, polynomial degree should be lower than available number of data points.

Also, instead of minimizing the error on training data, one could minimize the (generalization) error on unseen data.
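
A rough sketch of the polynomial example in base R, assuming a noise standard deviation of 0.3, equally spaced training inputs and 100 unseen test points (none of these values are given in the notes). It fits polynomials of order 1, 3 and 9 with `lm()` and compares the error on training data with the error on unseen data. The training error can only decrease with growing order, while the error on unseen data typically rises again for high orders.

```r
set.seed(42)

# training data: n = 10 points from sin(2 * pi * x) plus normal noise (sd assumed)
n <- 10
train <- data.frame(x = seq(0, 1, length.out = n))
train$y <- sin(2 * pi * train$x) + rnorm(n, sd = 0.3)

# unseen data from the same process, used to estimate the generalization error
test <- data.frame(x = runif(100))
test$y <- sin(2 * pi * test$x) + rnorm(100, sd = 0.3)

# root mean squared error of a fitted model on a dataset
rmse <- function(fit, dt) sqrt(mean((dt$y - predict(fit, newdata = dt))^2))

orders <- c(1, 3, 9)
errors <- t(sapply(orders, function(m) {
  fit <- lm(y ~ poly(x, m), data = train)   # linear regression on polynomial terms
  c(m = m, train = rmse(fit, train), test = rmse(fit, test))
}))
errors
```

With $m = 9$ and $n = 10$ the polynomial passes through every training point, so the training error is numerically zero although the model has mostly learned the noise.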

In [4]:
# --------------------

## Splitting Dataset for Performance Assessment

Initial datasets:
- available dataset (known outcomes)
- independent dataset (unknown outcomes)

Assumption: all observations are i.i.d.

---

Common strategy: divide available dataset into training data (to train model) and test data (to evaluate the performance of the trained model).
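
A minimal sketch of such a split in base R. The simulated data frame `dt` and the 80/20 ratio are assumptions for illustration; the notes do not prescribe a ratio.

```r
set.seed(1)

# hypothetical available dataset with known outcome y
dt <- data.frame(x = rnorm(100))
dt$y <- dt$x + rnorm(100)

# randomly assign 80% of the rows to training data, the rest to test data
train_idx <- sample(nrow(dt), size = floor(0.8 * nrow(dt)))
train_dt <- dt[train_idx, ]
test_dt  <- dt[-train_idx, ]

# train on training data only, evaluate on test data only
fit <- lm(y ~ x, data = train_dt)
test_mse <- mean((test_dt$y - predict(fit, newdata = test_dt))^2)
test_mse
```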

### 1. Overfitting to Training Dataset

Detected when measured error on test data notably larger than on training data.

Alternatives:
- performance measures (precision, recall)
- visualize performance on training vs. test data (ROC, precision-recall curves)
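
As a small illustration of the performance measures, precision and recall can be computed from a confusion matrix. The label vectors below are made up, and treating "M" as the positive class is an arbitrary assumption. Computing the same measures once on training data and once on test data allows the comparison described above.

```r
# hypothetical true and predicted labels for 8 observations
actual    <- factor(c("M", "M", "F", "F", "M", "F", "M", "F"), levels = c("F", "M"))
predicted <- factor(c("M", "F", "F", "F", "M", "M", "M", "F"), levels = c("F", "M"))

# confusion matrix: rows are predictions, columns are true labels
conf_mat <- table(predicted = predicted, actual = actual)

tp <- conf_mat["M", "M"]   # predicted M, actually M
fp <- conf_mat["M", "F"]   # predicted M, actually F
fn <- conf_mat["F", "M"]   # predicted F, actually M

precision <- tp / (tp + fp)   # share of predicted positives that are correct
recall    <- tp / (tp + fn)   # share of actual positives that are found
c(precision = precision, recall = recall)
```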

### 2. Cross-validation

Strategy to assess performance of model to prevent overfitting.

Idea:
1. split data randomly into $k$ different folds.
2. train model on $k-1$ folds.
3. evaluate on fold not used for training.

How to pick $k$:
- large $k$: good, but slow computation
- usually: $k \in \{5, 10\}$
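
The idea can be written out by hand in base R before relying on a package. This is only a sketch on simulated data with a linear model as placeholder; `dt`, `fold` and `cv_mse` are names introduced for the example.

```r
set.seed(1)

# hypothetical dataset with known outcome
dt <- data.frame(x = rnorm(100))
dt$y <- 2 * dt$x + rnorm(100)

k <- 5
# 1. split data randomly into k folds (one random fold label per row)
fold <- sample(rep(seq_len(k), length.out = nrow(dt)))

cv_mse <- sapply(seq_len(k), function(i) {
  # 2. train model on the k-1 folds other than fold i
  fit <- lm(y ~ x, data = dt[fold != i, ])
  # 3. evaluate on the fold not used for training
  held_out <- dt[fold == i, ]
  mean((held_out$y - predict(fit, newdata = held_out))^2)
})

cv_mse         # one error estimate per fold
mean(cv_mse)   # averaged cross-validation error
```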

#### 2.1 Pitfalls of Cross-validation

Assumption: training and test samples i.i.d. (otherwise test data could e.g. contain data not included in training data).

---

*Example: non-independent distributions*

Dataset: contains repeated measures (replicates form clusters); the simulated trend between replicates is $y = x$.

Cross-validation at data point level: favors models learning clusters.

Cross-validation at cluster level: learn trend across clusters, but difficult without application knowledge.
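
A sketch of how folds could be assigned at cluster level in base R, assuming the cluster membership of each observation is known (column `cluster`, a hypothetical name). The simulated data place cluster centers on the line $y = x$ with small scatter within each cluster. All replicates of a cluster end up in the same fold, so no cluster is split between training and evaluation.

```r
set.seed(1)

# simulate 10 clusters with 5 repeated measures each; centers lie on y = x
n_clusters <- 10
dt <- data.frame(cluster = rep(seq_len(n_clusters), each = 5))
cluster_center <- rnorm(n_clusters, sd = 3)
dt$x <- cluster_center[dt$cluster] + rnorm(nrow(dt), sd = 0.1)
dt$y <- cluster_center[dt$cluster] + rnorm(nrow(dt), sd = 0.1)

# assign folds per cluster instead of per data point
k <- 5
cluster_fold <- sample(rep(seq_len(k), length.out = n_clusters))
dt$fold <- cluster_fold[dt$cluster]

# every cluster appears in exactly one fold
table(fold = dt$fold, cluster = dt$cluster)
```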

#### 2.2 Cross-validation in R (`trainControl()`, `train()`)

Define validation specifications via `trainControl()`.

Train model via `train()` with previously defined training control for cross validation.

In [5]:
# example:
#   logistic regression for binary classification of gender based on height values

heights_dt <- fread(file.path(file_path, "height.csv")) %>%
                na.omit() %>%
                .[sex %in% c("F", "M")] %>%
                .[, sex:=as.factor(toupper(sex))]

# generate control structure
k <- 5    # number of folds
fitControl <- trainControl(method = "cv",                        # cv for cross-validation
                           number = k, 
                           classProbs=TRUE,                      # compute class probabilities 
                           summaryFunction = twoClassSummary     # for computing sensitivity, specificity, area
                                                                 #     under ROC curve
                                                                 #     (for performance measure)
                           )

In [6]:
# run CV
lr_fit <- train(sex~.,                      # formula and dataset definition
                data = heights_dt,          # formula and dataset definition
                method = "glm",             # model specification
                family = "binomial",        # model specification
                trControl = fitControl,     # validation specification
                metric = "ROC"              # Specify which metric to optimize
                )

# return average values for sensitivity, specificity, area under ROC curve of the k trained models
lr_fit

# return performance measures for each model
lr_fit$resample

# return final model
lr_fit$finalModel

Generalized Linear Model 

540 samples
  4 predictor
  2 classes: 'F', 'M' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 433, 432, 432, 431, 432 
Resampling results:

  ROC        Sens       Spec     
  0.9665917  0.9009412  0.9026013


ROC,Sens,Spec,Resample
0.962807,0.94,0.9122807,Fold1
0.9596552,0.9,0.8965517,Fold2
0.9706897,0.9,0.9137931,Fold3
0.9800541,0.8823529,0.9482759,Fold4
0.9597523,0.8823529,0.8421053,Fold5



Call:  NULL

Coefficients:
         (Intercept)                height                mother  
            -26.2826                0.5084               -0.1829  
              father  `V5King of the Hill`  
             -0.1780                2.8171  

Degrees of Freedom: 539 Total (i.e. Null);  535 Residual
Null Deviance:	    746.2 
Residual Deviance: 247.3 	AIC: 257.3

In [7]:
# --------------------

## Random Forests as Alternative Models

Go-to non-interpretable supervised learning model, can achieve state-of-the-art performance.

Application: regression or classification.

Robust to overfitting, allows fitting flexible functions.

### 1. Basics of Decision Trees

Idea: partition or segment training dataset into several simple regions.

Make prediction for new input by using mean/mode of training observations on region to which new input is assigned.

#### 1.1 Decision Trees for Regression Tasks

Example: dataset predicting baseball player's salary based on feature years (most important feature) and hits.

Splits in decision tree:
- $\text{Years} < 4.5$: if true then salary is $5.11$, else ...
- $\text{Hits} < 117.5$: if true then salary is $6.00$, else $6.74$.

Divide players into regions (*leaf nodes*) defined by decision tree, i.e.:
- $R_{1} = \{ X | \text{Years} < 4.5 \}$
- $R_{2} = \{ X | \text{Years} \geq 4.5, \text{Hits} < 117.5 \}$
- $R_{3} = \{ X | \text{Years} \geq 4.5, \text{Hits} \geq 117.5 \}$

---

*How to choose splits:*

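Goal: find regions $R_{1}, \ldots, R_{J}$ s.t. RSS is minimized, i.e. $\arg\min_{R_{1}, \ldots, R_{J}} \sum_{j=1}^{J} \sum_{i \in R_{j}} (y_{i} - \hat{y}_{R_{j}})^{2}$, where $\hat{y}_{R_{j}}$ is the mean outcome of the training observations in $R_{j}$.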

Problem: computationally unfeasible due to number of possible partitions.

Solution: top-down greedy approach.

Approach: (iteratively) select feature $X_{j}$ and threshold $s$ s.t. the split $R_{1}(j,s) = \{ X | X_{j} < s \}$, $R_{2}(j,s) = \{ X | X_{j} \geq s \}$ results in the best reduction in RSS, i.e. $\min_{j, s} \left[ \sum_{i \in R_{1}(j,s)} (y_{i} - \hat{y}_{R_{1}})^{2} + \sum_{i \in R_{2}(j,s)} (y_{i} - \hat{y}_{R_{2}})^{2} \right]$, until stopping criterion is met.

Result: $J$ regions, predict outcome of any sample by assigning sample to region (outcome of sample = mean of observations in that region).
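
One step of the greedy approach can be sketched in base R as an exhaustive search over features and thresholds. The helper names `rss()` and `best_split()` and the toy values for `Years`, `Hits` and `y` are hypothetical. Midpoints between neighbouring feature values serve as candidate thresholds, which is one common choice and an assumption here.

```r
# residual sum of squares of a region when predicting its mean
rss <- function(y) sum((y - mean(y))^2)

# search all features j and thresholds s for the split with the smallest total RSS
best_split <- function(X, y) {
  best <- list(rss = Inf)
  for (j in names(X)) {
    xs <- sort(unique(X[[j]]))
    if (length(xs) < 2) next                      # nothing to split on
    for (s in (head(xs, -1) + tail(xs, -1)) / 2) {
      left <- X[[j]] < s                          # R1(j, s); the rest is R2(j, s)
      total <- rss(y[left]) + rss(y[!left])
      if (total < best$rss) best <- list(feature = j, threshold = s, rss = total)
    }
  }
  best
}

# toy data (made up): outcome jumps between Years = 4 and Years = 5
X <- data.frame(Years = c(1, 2, 3, 4, 5, 6, 7, 8),
                Hits  = c(100, 150, 90, 130, 110, 140, 95, 160))
y <- c(5.0, 5.2, 5.1, 5.1, 6.6, 6.8, 6.7, 6.9)

res <- best_split(X, y)
res$feature     # "Years"
res$threshold   # 4.5
```

Applying `best_split()` again within each resulting region, until a stopping criterion such as a minimum region size is met, would yield the full tree.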

#### 1.2 Decision Trees for Classification Tasks

Similar to regression, but prediction is most commonly occurring class of observations in region.

Additionally: use cross-entropy instead of RSS as splitting criterion.
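
For a region with class proportions $\hat{p}_{k}$, the cross-entropy is $-\sum_{k} \hat{p}_{k} \log \hat{p}_{k}$: it is zero for a pure region and largest when classes are evenly mixed. A short sketch in base R, with hypothetical helper names and made-up class labels:

```r
# cross-entropy of the class labels in one region
cross_entropy <- function(classes) {
  p <- table(classes) / length(classes)   # class proportions in the region
  p <- p[p > 0]                           # skip empty classes (0 * log(0) is treated as 0)
  -sum(p * log(p))
}

# prediction for a region: most commonly occurring class
majority_class <- function(classes) names(which.max(table(classes)))

pure  <- c("F", "F", "F", "F")
mixed <- c("F", "M", "F", "M")

cross_entropy(pure)    # 0
cross_entropy(mixed)   # log(2), about 0.693
majority_class(c("F", "M", "M"))   # "M"
```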

### 2. Random Forests for Classification and Regression Tasks

Idea: aggregation/bagging, i.e. generate many different predictors via decision trees with bootstrapping (for randomness), and combine them.

Steps:
1. build $B$ decision trees using training data (fitted models: $T_{1}, \ldots, T_{B}$).
2. for every observation in the test set predict $\hat{y}_{j}$ using tree $T_{j}$.
3. final prediction: $\hat{y} = \frac{1}{B} \sum_{j=1}^{B} \hat{y}_{j}$ for continuous outcomes; $\hat{y}$ with the majority vote for categorical outcomes.
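
The aggregation in step 3 can be illustrated with made-up predictions of $B = 3$ trees for 4 test observations. The matrices below are hypothetical and stand in for the output of fitted trees.

```r
# continuous outcome: rows are trees T_1..T_3, columns are test observations
reg_pred <- rbind(c(5.1, 6.0, 6.7, 5.3),
                  c(5.0, 6.2, 6.9, 5.2),
                  c(5.2, 5.8, 6.8, 5.1))
reg_final <- colMeans(reg_pred)   # average over trees
reg_final                         # 5.1 6.0 6.8 5.2

# categorical outcome: each tree votes for a class
class_pred <- rbind(c("F", "M", "M", "F"),
                    c("F", "M", "F", "F"),
                    c("M", "M", "M", "F"))
class_final <- apply(class_pred, 2, function(votes) names(which.max(table(votes))))
class_final                       # "F" "M" "M" "F"
```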

---

Ensure that trees in forest are different via randomness in two ways:
1. bootstrap training data by sampling observations from training data with replacement (hence all trees are trained on different data).
2. randomly select subset of features to be considered as split candidates, drawing a new subset at each split (hence not all features are considered at once; cf. "No. of variables tried at each split" in the `randomForest()` output). This reduces correlation between trees, prevents overfitting.
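
Both sources of randomness can be sketched in base R. The sample size, the feature names and `mtry = 1` are assumptions for illustration. The feature subset is drawn only once here, whereas `randomForest()` draws `mtry` candidate features anew at each split.

```r
set.seed(1)

n <- 100                                     # number of training observations (assumed)
features <- c("height", "mother", "father")  # feature names (assumed)

# 1. bootstrap: draw n row indices with replacement for one tree
boot_idx <- sample(n, size = n, replace = TRUE)

# rows never drawn are "out-of-bag" for this tree and can serve as its test data
oob_idx <- setdiff(seq_len(n), boot_idx)
length(unique(boot_idx))   # number of distinct rows the tree is trained on
length(oob_idx)            # number of rows left out

# 2. random feature subset: candidates allowed for one split
mtry <- 1
candidate_features <- sample(features, size = mtry)
candidate_features
```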

### 3. Random Forests in R (`randomForest()`)

Note: optimize/tune hyperparameters for each application for better performance.

In [8]:
# example:
#   train random forest classifier for predicting gender of each person given height values

heights_dt$V5 <- NULL
rf_classifier <- randomForest(sex~.,               # Define formula and data
                              data=heights_dt,     # Define formula and data
                              
                              # Hyperparameters (ntree for the forest, the others for each tree)
                              ntree=100,           # Define number of trees
                              nodesize = 5,        # Minimum size of leaf nodes
                              maxnodes = 30,       # Maximum number of leaf nodes
                                                           
                              importance=TRUE      # Output the feature importances
                              )

rf_classifier

rf_classifier$importance

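# observations (for the run shown below; values vary between runs without a fixed seed):
#   - height of person is by far the most important feature
#   - mother's height matters more for predicting females than for predicting males
#   - father's height is somewhat more important than mother's height for both classes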


Call:
 randomForest(formula = sex ~ ., data = heights_dt, ntree = 100,      nodesize = 5, maxnodes = 30, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 1

        OOB estimate of  error rate: 10.19%
Confusion matrix:
    F   M class.error
F 227  25  0.09920635
M  30 258  0.10416667

Unnamed: 0,F,M,MeanDecreaseAccuracy,MeanDecreaseGini
height,0.35319603,0.291433521,0.31947897,152.06655
mother,0.03919943,0.007175171,0.02241581,19.29353
father,0.04425455,0.026287798,0.03476921,26.05173


In [9]:
heights_dt[, sex_predicted := predict(rf_classifier, heights_dt[,-c("sex")])]
head(heights_dt)

height,sex,mother,father,sex_predicted
150,F,170,177,F
152,F,157,175,F
152,F,155,156,F
153,F,155,175,F
153,F,150,176,F
153,F,150,165,F


##### End of Section 12!

##### End of the lecture! :)