## Machine Learning III: Really Overfit

After today's discussion, we'd like to reinforce the issues with overfitting and the care required to avoid it.

We have machine learning at our fingertips, and we've just discussed some of the dangers. 

For this activity, we've managed to get our hands on some gene expression data about a disease (D1). This dataset has features in columns and examples in rows. Each feature is the expression value of one gene, while each row represents a person. We want to be able to predict whether or not a person has the disease.

For this, we'll use an SVM: we'll build a few models and experiment with them.

As in the previous activity, we'll load some libraries and a custom function to read our data in (for simplicity).

**Execute the two cells below**

In [None]:
# Load useful R packages
library(e1071) # contains implementation of SVM
library(caret) # contains methods for obtaining training & testing accuracy
library(tidyverse)

In [None]:
read_data <- function(data_filename, labels_filename){
  # INPUT:
  #   data_filename -- path to the .pcl file you are loading
  #   labels_filename -- path to the file containing the label for each sample
  # OUTPUT:
  #   samples x genes dataframe (gene expression values normalized/centered)

  # PCL files are tab-delimited with samples as column names, genes as row names
  dataset = read.delim(data_filename, sep='\t', header=T, row.names = 1)

  # Labels are a space-delimited file where the first column is sample names and the
  # second column is the tumor status label
  labels = read.delim(labels_filename, sep=' ', header=F)
  colnames(labels) = c("Sample", "Label")

  # Check to make sure the sample names are the same in the labels and data
  if(length(intersect(colnames(dataset), labels$Sample)) == 0){
    print("Sample names do not match between labels and data. Please make sure you're using the correct combo of file names!")
    return(NA)
    }

  # Mean-center each gene's expression values (scale() also divides by the standard deviation by default) so we can compare them
  # Note that the 'apply' call (which just applies the scale() fxn to each row) returns a
  # transposed matrix such that genes are now columns and samples are now rows 
  save_samplenames = colnames(dataset)
  dataset_transformed = apply(dataset, 1, scale)
  rownames(dataset_transformed) = save_samplenames
  dataset_transformed = data.frame(dataset_transformed)

  # add a column for label
  dataset_transformed$Sample = row.names(dataset_transformed)
  dataset_transformed = dataset_transformed %>% left_join(labels, by="Sample")
  dataset_transformed$Sample = NULL # we don't need this column after we've used it to connect samples to labels


  return(dataset_transformed)
}
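To make the `apply(dataset, 1, scale)` step in `read_data` more concrete, here is a minimal sketch on a tiny made-up matrix. The `toy` objects and their values are hypothetical and are not part of D1; the point is only to show that the result comes back transposed (samples in rows, genes in columns) and that each gene ends up centered and scaled.

```r
# Toy genes x samples matrix (made-up values, for illustration only)
toy <- matrix(c(1,  2,  3,
                10, 20, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# apply() over rows returns one column per input row,
# so the result is samples x genes (transposed relative to the input)
toy_scaled <- apply(toy, 1, scale)
rownames(toy_scaled) <- colnames(toy)

print(dim(toy_scaled))            # number of samples, then number of genes
print(colMeans(toy_scaled))       # each gene is centered on 0
print(apply(toy_scaled, 2, sd))   # each gene has standard deviation 1
```

Because both toy genes follow the same pattern at different magnitudes, they become identical after scaling, which is why scaling makes genes comparable.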

Our data comprises:

* `D1.pcl`: contains the gene expression values for each sample. 
* `D1_labels.txt`: contains the labels (Case/Control) for each sample.

Let's load the data in and prepare it for building the SVM model.

**Execute the code below.**

In [None]:
# Reset to ~/32_Data_ML_II after testing locally, making this dir
D1pclfile = "~/33_Data_ML-III/D1.pcl"
D1labelfile  = "~/33_Data_ML-III/D1_labels.txt"

D1 = read_data(D1pclfile, D1labelfile)
D1_Y = factor(D1$Label)
D1_X = D1 %>% select(-Label)

Now, we'll construct our SVM from the data, and get the fitted (predicted) labels for each sample:

In [None]:
svm_mod_D1 <- svm(x=D1_X, y=D1_Y, data=D1, kernel = "linear", cost=0.000001) # "cost" is the C parameter
prediction_D1 <- predict(svm_mod_D1, D1_X)

We'll want to know the 'training accuracy,' or the proportion of the SVM's classifications that were correct:

In [None]:
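# confusionMatrix() expects the predictions first (data) and the true labels second (reference)
AccuracyStatsTraining = confusionMatrix(data = prediction_D1, reference = D1_Y)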
print(AccuracyStatsTraining$overall["Accuracy"])
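To see what this accuracy number means, the sketch below computes it by hand in base R for a handful of made-up labels. The `truth` and `pred` vectors are hypothetical and unrelated to D1; they only illustrate that accuracy is the fraction of samples on the diagonal of the confusion table.

```r
# Made-up true labels and predictions for five samples (illustration only)
truth <- factor(c("Case", "Case", "Control", "Control", "Control"),
                levels = c("Case", "Control"))
pred  <- factor(c("Case", "Control", "Control", "Control", "Case"),
                levels = c("Case", "Control"))

# Cross-tabulate predictions against the true labels
conf_table <- table(Predicted = pred, Actual = truth)
print(conf_table)

# Accuracy = correctly classified samples (the diagonal) / all samples
accuracy <- sum(diag(conf_table)) / sum(conf_table)
print(accuracy)  # 3 of the 5 samples are classified correctly
```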

**Q1:** What accuracy does this basic model return?

**Q2:** OK, now let the ML fly: see if you can improve the model. Options you could consider:

    - Changing the "cost" parameter to different values
    - Changing the 'kernel' to use a different model function, e.g.
        `svm(..., kernel="polynomial", degree=2)`     # a polynomial kernel of degree 2
        
Explore this space and provide at least *three examples* of different models and evaluate their accuracy.
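If you are unsure how to structure the comparison, here is one possible pattern, sketched on a small synthetic dataset rather than on D1. The helper `train_accuracy`, the `toy_X` / `toy_Y` objects, and the particular cost and kernel values are all assumptions made for illustration, not recommended settings.

```r
library(e1071)

set.seed(1)  # arbitrary seed so the sketch is repeatable

# Synthetic data (not D1): two made-up features, label depends on their sum
n <- 40
toy_X <- matrix(rnorm(n * 2), nrow = n, dimnames = list(NULL, c("g1", "g2")))
toy_Y <- factor(ifelse(toy_X[, "g1"] + toy_X[, "g2"] > 0, "Case", "Control"))

# Hypothetical helper: fit an SVM with the given settings and
# return the proportion of training samples classified correctly
train_accuracy <- function(X, Y, ...) {
  mod <- svm(x = X, y = Y, ...)
  mean(predict(mod, X) == Y)
}

acc_linear_low  <- train_accuracy(toy_X, toy_Y, kernel = "linear", cost = 0.01)
acc_linear_high <- train_accuracy(toy_X, toy_Y, kernel = "linear", cost = 10)
acc_poly        <- train_accuracy(toy_X, toy_Y, kernel = "polynomial", degree = 2, cost = 10)

print(c(linear_low = acc_linear_low, linear_high = acc_linear_high, poly = acc_poly))
```

The same pattern should carry over to `D1_X` and `D1_Y`, but the numbers you get on D1 will differ from anything seen on this toy data.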

**Provide and Execute your code below.**

In [None]:
## Example 1



In [None]:
## Example 2



In [None]:
## Example 3



**Q3:** What is the accuracy of your best model?

OK, you built your best model using all of the data you had. If you stopped here and reported your results, you would run the risk of overfitting. How will you know that the model you have picked is really the best?
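To illustrate why training accuracy alone can be misleading, here is a minimal sketch that uses pure random noise instead of real expression data. Everything in it (the `make_noise` helper, the sample and gene counts, the seed, and the cost value) is an assumption chosen for illustration. Because features and labels are unrelated, any accuracy above chance on the training set reflects memorization rather than a real signal.

```r
library(e1071)

set.seed(42)       # arbitrary seed so the sketch is repeatable
n_samples <- 40    # made-up sizes: many more "genes" than samples
n_genes   <- 200

# Hypothetical helper: random "expression" values with labels that are
# assigned independently of them, so there is nothing real to learn
make_noise <- function(n, p) {
  X <- matrix(rnorm(n * p), nrow = n,
              dimnames = list(NULL, paste0("g", seq_len(p))))
  Y <- factor(rep(c("Case", "Control"), length.out = n))
  list(X = X, Y = Y)
}

train <- make_noise(n_samples, n_genes)
test  <- make_noise(n_samples, n_genes)   # independent draw, same labels layout

noise_mod <- svm(x = train$X, y = train$Y, kernel = "linear", cost = 10)

train_acc <- mean(predict(noise_mod, train$X) == train$Y)
test_acc  <- mean(predict(noise_mod, test$X)  == test$Y)

print(c(training = train_acc, testing = test_acc))
```

With far more features than samples, a model can typically fit the training labels very well even when they are noise, while accuracy on the independent draw should hover around chance.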

We've provided you an independent data set that you can use for testing:

**D2_test.pcl** contains the gene expression values for each sample. 

**D2_test_labels.txt** contains the labels (Case/Control) for each sample.

**Execute the code below**

In [None]:
# Reset to ~/32_Data_ML_II after testing locally, making this dir
D2pclfile = "~/33_Data_ML-III/D2_test.pcl"
D2labelfile  = "~/33_Data_ML-III/D2_test_labels.txt"

D2 = read_data(D2pclfile, D2labelfile)
D2_Y = factor(D2$Label)
D2_X = D2 %>% select(-Label)

Now, let's use the model we trained on D1 to predict the labels in D2.

In [None]:
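# R is case-sensitive: the model fitted above is svm_mod_D1 (swap in your best model's name if it differs)
prediction_D2 <- predict(svm_mod_D1, D2_X)
# Predictions go first (data), true labels second (reference)
AccuracyStatsTesting = confusionMatrix(data = prediction_D2, reference = D2_Y)
print(AccuracyStatsTesting$overall["Accuracy"])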

**Q4.** What do you observe about the prediction accuracy of your best trained model in the test set? Does this model *actually* do a good job or not?

**Q5.** You only considered / had time to train three models above. But if you had enough time, could you have found an *even better* model for prediction? Why or why not? Is this problematic, and if so, why?