## Machine Learning II: SVMs & Overfitting

Today we're going to walk through an example of predicting tumor and normal status directly from gene expression values using **support vector machines**. We'll also learn about the dreaded [overfitting](https://en.wikipedia.org/wiki/Overfitting). As usual, make sure your kernel is set to R system-wide. Start by running the following code:

In [None]:
# Load useful R packages
library(e1071) # contains implementation of SVM
library(caret) # contains methods for obtaining training & testing accuracy  
library(tidyverse)

We'll first define a custom function to read in our data:

In [None]:
read_data <- function(data_filename, labels_filename){
  # INPUT: 
  #   data_filename -- path to the .pcl file you are loading
  #   labels_filename -- path to the file containing the tumor status label for each sample
  # OUTPUT: 
  #   samples x genes dataframe (gene expression values normalized/centered)
  
  # PCL files are tab-delimited with samples as column names, genes as row names
  dataset = read.delim(data_filename, sep='\t', header=T, row.names = 1)
  
  # Labels are a space-delimited file where the first column is sample names and the 
  # second column is the tumor status label  
  labels = read.delim(labels_filename, sep=' ', header=F)
  colnames(labels) = c("Sample", "Label")
  
  # Check to make sure the sample names are the same in the labels and data
  if(length(intersect(colnames(dataset), labels$Sample)) == 0){
    print("Sample names do not match between labels and data. Please make sure you're using the correct combo of file names!")
    return(NA)  
    }
    
  # Standardize each gene's expression values so we can compare them: scale() subtracts
  # the gene's mean and divides by its standard deviation.
  # Note that the 'apply' call (which just applies the scale() fxn to each row) returns a
  # transposed matrix such that genes are now columns and samples are now rows 
  save_samplenames = colnames(dataset) 
  dataset_transformed = apply(dataset, 1, scale) 
  rownames(dataset_transformed) = save_samplenames
  dataset_transformed = data.frame(dataset_transformed)
  
  # add a column for label 
  dataset_transformed$Sample = row.names(dataset_transformed)
  dataset_transformed = dataset_transformed %>% left_join(labels, by="Sample")
  dataset_transformed$Sample = NULL # we don't need this column after we've used it to connect samples to labels
  
  
  return(dataset_transformed)
}
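
The transposition mentioned in the comments of `read_data` is easy to miss, so here is a minimal sketch on a made-up 2 x 3 matrix (the gene and sample names are hypothetical). It also shows that `scale()`, with its default arguments, divides each gene by its standard deviation in addition to subtracting the mean.

```r
# Hypothetical toy expression matrix: 2 genes (rows) x 3 samples (columns)
toy <- matrix(c(1, 2, 3,
                10, 20, 60),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Apply scale() to each row (gene); apply() returns one column per gene
toy_scaled <- apply(toy, 1, scale)
rownames(toy_scaled) <- colnames(toy)  # restore sample names, as read_data does

dim(toy)                  # 2 genes x 3 samples
dim(toy_scaled)           # 3 samples x 2 genes
colMeans(toy_scaled)      # each gene now has a mean of (approximately) 0
apply(toy_scaled, 2, sd)  # and a standard deviation of 1
```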

We'll then read in the [METABRIC](https://www.mercuriolab.umassmed.edu/metabric) dataset, which contains gene expression values for both tumor and non-tumor ("normal") tissue samples. **METABRIC_dataset.pcl** contains the gene expression values for each sample, while **Metabric_labels.txt** contains the labels (Tumor/Normal) for each sample. 

In [None]:
# Reset to ~/32_Data_ML_II after testing locally, making this dir 
MetabricFile = "~/32_Data_ML_II/METABRIC_dataset.pcl"
MetabricLabels  = "~/32_Data_ML_II/Metabric_labels.txt"

metabric = read_data(MetabricFile, MetabricLabels)
metabricY = factor(metabric$Label)
metabricX = metabric %>% select(-Label)

Now, we'll construct our SVM from the data, and get the fitted (predicted) labels for each sample:

In [None]:
svm_mod_metabric <- svm(x=metabricX, y=metabricY, data=metabric, kernel = "linear", cost=0.000001) # "cost" is the C parameter
prediction_metabric <- predict(svm_mod_metabric, metabricX)
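
To get a feel for what the `cost` argument does, here is a small sketch on simulated two-class data rather than the METABRIC samples. Everything in it (the toy features `f1` and `f2`, the class means, and the list of cost values) is an assumption chosen for illustration; the exact accuracies you see will depend on the simulated data.

```r
library(e1071)

set.seed(1)  # reproducible illustration only

# Hypothetical toy data: two features, two partly overlapping classes
n <- 100
toy_x <- data.frame(f1 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)),
                    f2 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)))
toy_y <- factor(rep(c("Normal", "Tumor"), each = n))

# Fit a linear SVM at several cost values and record the training accuracy
costs <- c(1e-6, 1e-3, 1, 100)
toy_acc <- sapply(costs, function(cst) {
  fit <- svm(x = toy_x, y = toy_y, kernel = "linear", cost = cst)
  mean(predict(fit, toy_x) == toy_y)
})

data.frame(cost = costs, training_accuracy = toy_acc)
```

Comparing the rows of the resulting table gives a rough picture of how strongly the fit depends on C for this particular toy problem.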

We'll want to know the 'training accuracy,' or the proportion of the SVM's classifications that were correct:

In [None]:
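# confusionMatrix expects the predictions first (data) and the true labels second (reference)
AccuracyStatsTraining = confusionMatrix(data = prediction_metabric, reference = metabricY)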
print(AccuracyStatsTraining$overall["Accuracy"])
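
Accuracy is simply the fraction of samples whose predicted label matches the true label. As a sanity check on what `confusionMatrix` reports, the same number can be computed by hand. The sketch below uses made-up label vectors (`toy_truth` and `toy_pred` are hypothetical and not part of the METABRIC data).

```r
toy_truth <- factor(c("Tumor", "Tumor", "Normal", "Tumor", "Normal"),
                    levels = c("Normal", "Tumor"))
toy_pred  <- factor(c("Tumor", "Normal", "Normal", "Tumor", "Normal"),
                    levels = c("Normal", "Tumor"))

# Cross-tabulate predictions against the true labels
toy_table <- table(Predicted = toy_pred, Actual = toy_truth)
print(toy_table)

# Accuracy = correct predictions (the diagonal) / all predictions
toy_accuracy <- sum(diag(toy_table)) / sum(toy_table)
print(toy_accuracy)  # 4 of 5 correct, so 0.8
```

On the real data, `mean(prediction_metabric == metabricY)` should give the same value as the accuracy printed above.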

Congratulations! You've built your first SVM, and on training data it separates tumor data from normal data with over 90% accuracy! Now that we've done this with some biomedical data, let's take a step back and talk about things we should consider as we build a model.

_**Q1:** What are our labels?_

_**Q2:** What are our features?_

_**Q3:** What are our examples?_

### Overfitting in machine learning ###

When you train a computer to build a model that describes data that you've seen, a challenge known as "overfitting" can arise. When fitting the model, we want to find a model that fits the data as well as possible. However, real data is noisy. The model that fits data we have with the least error may capture the main features of the data, but may also capture noise in the data that we don't intend to model. When a model fits noise in training data, we call this problem overfitting.

For example, imagine that a professor wants to test a group of students' knowledge of calculus. She gives the students previous exam questions and answers to study. However, in the final exam, she uses the same questions to test the students. Some of the students could do very well because they memorized answers to the questions even though they don't understand calculus. The professor realizes this problem and then gives the students a new set of questions to test them. The students who memorized all the answers to previous exam questions may fail the new exam because they have no idea how to solve the new problems. We would say that those students have "overfit" to training data.

How can overfitting be a problem with machine learning? Don't we want the model to fit the data as well as possible? The reason is we want a model that captures the features that will also exist in some new data. If the model fits the noise in the data, the model will perform poorly on new data sets!

Let's use simulations to illustrate the overfitting problem. We are going to simulate two variables x and y and we let **y = x + e**, where e is some noise. That is, y is a linear function of x. _You don't need to know how this code works. We're not going to focus on regression during this course. You may want to have it to refer to in the future._

In [None]:
# This code will make our data by adding random noise to a linear relationship
# Simulate two variables x and y
# y=x+e, e is some noise
x = seq(0, 2, length.out=10)
y = x + 0.5*rnorm(length(x))

# Make dataframe of these simulated data
SimulatedData = data.frame(X=x, Y=y)
# Just the 'X' column 
Xdata = SimulatedData %>% select(X)

In [None]:
# Plot the points
ggplot(SimulatedData, aes(x=X, y=Y)) + geom_point() + ggtitle("Simulated Points") + theme_classic()

Next, we want to train linear regression models on x and use the models to predict y. The models we are going to use are:  
1. A simple linear regression model: Y~X  
2. A complex multiple regression model: Y ~ X + X^2 + X^3 + X^4 ... + X^10  

We want to choose the model that will most accurately predict y.

In [None]:
# Fit a simple linear regression model to our data
simple_regression = lm(Y~X, data = SimulatedData)
simple_predictions = predict(simple_regression, newdata=Xdata )
simulatedSimple = data.frame(Xsimple = SimulatedData$X, Ysimple = simple_predictions)

# Fit a multiple regression model to our data
multiple_regression = lm(Y ~ X + I(X^2) + I(X^3) + I(X^4) + I(X^5) + I(X^6) + I(X^7) + I(X^8) + I(X^9) + I(X^10), data = SimulatedData)
XdataMultiple = data.frame(X= seq(0, 2, length.out=1000))
multiple_predictions =  predict(multiple_regression, newdata=XdataMultiple )
multiple_predictions_10 = predict(multiple_regression, newdata=Xdata)
simulatedMultiple =  data.frame(Xmultiple=XdataMultiple$X, Ymultiple=multiple_predictions)

In [None]:
# Plot original points in black, simple linear regression predictions in blue, multiple regression predictions in red
ggplot() + geom_point(data=SimulatedData, aes(x=X, y=Y)) +
  geom_line(data=simulatedSimple, aes(x=Xsimple, y=Ysimple), color="blue") +
  geom_line(data=simulatedMultiple, aes(x=Xmultiple, y=Ymultiple), color="red") 

Let's calculate the mean squared error, which is just the mean((differences between predicted and actual Y values)^2):

In [None]:
mse_simple = mean((SimulatedData$Y - simulatedSimple$Ysimple)^2)
mse_multiple = mean((SimulatedData$Y - multiple_predictions_10)^2)

# Mean squared error for simple regression model
print(paste0("MSE for simple regression model: ", mse_simple))
# Mean squared error for the multiple regression model 
print(paste0("MSE for multiple regression model: ", mse_multiple))

The multiple regression model fits the data perfectly (MSE is almost 0). The predicted values are exactly the same as the observed values, since the prediction curve goes through every point. The simple regression model, by contrast, captures the linear relationship between x and y but does not reproduce the observed values perfectly. Should we then choose the multiple regression model over the simple one, since it fits the data so much better?
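
The near-zero error follows from counting coefficients: a polynomial with as many coefficients as there are data points can pass through every point, whatever the points are. A minimal sketch with four made-up points and a cubic (four coefficients) shows the same effect on a smaller scale; the point values are arbitrary.

```r
# Four arbitrary (hypothetical) points with no particular pattern
toy_points <- data.frame(X = c(0, 1, 2, 3), Y = c(1.0, -0.5, 2.0, 0.3))

# A cubic has 4 coefficients (intercept + 3 powers), one per data point
cubic_fit <- lm(Y ~ X + I(X^2) + I(X^3), data = toy_points)

# Residuals are zero up to floating-point error: the curve hits every point
max(abs(residuals(cubic_fit)))
```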

_**Q4:** Which model do you think is the better model? Why?_

Remember that we want to find a model that fits the data well and, most importantly, can predict well on some new data. Let's simulate some new data and see the prediction performance of each model on the new data.

In [None]:
# New data
xnew = seq(0, 2, length.out=10)
ynew = xnew + 0.5*rnorm(length(xnew))
SimulatedNew = data.frame(X = xnew, Y = ynew)

predict_simple = predict(simple_regression, newdata=SimulatedNew)
predict_multiple =  predict(multiple_regression, newdata=SimulatedNew)

SimulatedNew$Ysimple = predict_simple
SimulatedNew$Ymultiple = predict_multiple

In [None]:
# Plot original model and new data 
ggplot() + geom_point(data=SimulatedNew, aes(x=X, y=Y)) +
  geom_line(data=simulatedSimple, aes(x=Xsimple, y=Ysimple), color="blue") +
  geom_line(data=simulatedMultiple, aes(x=Xmultiple, y=Ymultiple), color="red") + ggtitle("Regression Models Performance with New Data")

In [None]:
mse_simple_new = mean((ynew - predict_simple)^2)
mse_multiple_new = mean((ynew - predict_multiple)^2)
print(paste0("MSE for simple regression model(new data): ", mse_simple_new))
print(paste0("MSE for multiple regression model(new data): ", mse_multiple_new))

The multiple regression model will almost certainly perform worse than the simple regression model on the new data (we can't know for sure in your case, because the new data are simulated afresh each time; check with your neighbors to see what they get, or clear and re-run the code to see another example). This is because the multiple regression model overfits the training data: it captures not only the true linear relationship between x and y but also the random noise. The simple regression model captures only the linear relationship.
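
Because a single simulated data set can go either way, one way to see the overall pattern is to repeat the whole experiment many times and count how often the complex model does worse on fresh data. The sketch below is self-contained and re-creates the simulation from this section; the number of repetitions (200), the seed, and the helper name `one_run` are arbitrary choices made for illustration.

```r
set.seed(42)  # reproducible illustration only

one_run <- function() {
  x <- seq(0, 2, length.out = 10)
  train <- data.frame(X = x, Y = x + 0.5 * rnorm(length(x)))
  fresh <- data.frame(X = x, Y = x + 0.5 * rnorm(length(x)))

  simple  <- lm(Y ~ X, data = train)
  complex <- lm(Y ~ X + I(X^2) + I(X^3) + I(X^4) + I(X^5) + I(X^6) +
                  I(X^7) + I(X^8) + I(X^9) + I(X^10), data = train)

  # The complex fit has more coefficients than points, so predict() may warn
  pred_simple  <- predict(simple, newdata = fresh)
  pred_complex <- suppressWarnings(predict(complex, newdata = fresh))

  c(simple  = mean((fresh$Y - pred_simple)^2),
    complex = mean((fresh$Y - pred_complex)^2))
}

runs <- replicate(200, one_run())  # 2 x 200 matrix of MSE values

# Fraction of repetitions in which the complex model had the larger MSE on new data
mean(runs["complex", ] > runs["simple", ])
```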

This also demonstrates that it is not a good idea to train and evaluate a model on the same data set. If we do, we tend to choose the model that overfits the data. Nevertheless, in real data analysis, you will occasionally see papers reporting nearly perfect model fitting results. If you look closely, you may find that the authors fit and evaluated the model on the same data set. You now know that this is a typical overfitting problem. In your future research, watch out for overfitting when you try machine learning models on your data!

There are several ways to guard against overfitting. One is to use regularization in the model to reduce the model complexity. Another is to train the model on one dataset and evaluate the model on a separate dataset. For now, we'll cover evaluating on a separate dataset.
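
When no independent dataset is available, a common approach is to hold out part of a single dataset for evaluation. Here is a minimal sketch under that assumption, using simulated data with the same y = x + noise relationship as the example above; the 70/30 split and all variable names are illustrative choices, not something prescribed by this lesson.

```r
set.seed(1)  # reproducible illustration only

# Simulate one dataset of 100 points from y = x + noise
sim <- data.frame(X = runif(100, 0, 2))
sim$Y <- sim$X + 0.5 * rnorm(nrow(sim))

# Randomly assign 70% of the rows to training; the rest are held out for testing
train_idx <- sample(nrow(sim), size = round(0.7 * nrow(sim)))
train_set <- sim[train_idx, ]
test_set  <- sim[-train_idx, ]

# Fit on the training rows only
holdout_fit <- lm(Y ~ X, data = train_set)

# Evaluate on both sets; only the test MSE estimates performance on unseen data
mse_train <- mean((train_set$Y - predict(holdout_fit, newdata = train_set))^2)
mse_test  <- mean((test_set$Y  - predict(holdout_fit, newdata = test_set))^2)
c(train = mse_train, test = mse_test)
```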

## Homework: BRCA Tumor/Normal - Revisited!


We are lucky enough to have an independent validation dataset of breast cancers from The Cancer Genome Atlas (TCGA). Let's see how our classifier does here! 

Note, you may have to re-run the very beginning of this notebook (fitting **svm_mod_metabric**) if you've left the CoCalc session since starting it. 

In [None]:
TCGAFile = "~/32_Data_ML_II/TCGA_dataset.pcl"
TCGALabels = "~/32_Data_ML_II/TCGA_labels.txt"

# Testing accuracy of our SVM on TCGA data 
TCGA = read_data(TCGAFile, TCGALabels)
TCGA_y = factor(TCGA$Label)
TCGA_x = TCGA %>% select(-Label)

prediction_TCGA <- predict(svm_mod_metabric, TCGA_x )
# As before: predictions first (data), true labels second (reference)
AccuracyStatsTesting = confusionMatrix(data = prediction_TCGA, reference = TCGA_y)
print(AccuracyStatsTesting$overall["Accuracy"])

_**Q1**: Run the code in the cell above this and report the training and testing accuracy observed with C = 0.000000001 (1 pt)_

_**Q2**: Do you think that your breast cancer classifier is under or overfitting your data? Why or why not? (3 pts)_

_**Q3:** Based on your answer to Q1, should you raise, lower, or keep C the same here? Justify your answer. (4 pts)_

_**Q4**: Now, try fitting the model with a different value for C. Report your training and testing accuracy (2 pts)._