In [1]:
# Install the Bioconductor package ExperimentHub to access the example data
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()
BiocManager::install("ExperimentHub")

# Install glmnet for LASSO and Elastic Net regression
install.packages("glmnet")

In [None]:
library("BiocManager")
library("ExperimentHub")
library("Biobase")  # provides exprs(), used below to extract the data matrix

library("glmnet")

# Problem 9: Microbiome data

Download the dataset from Bioconductor using ExperimentHub.

In [None]:
eh = ExperimentHub()
data = eh[["EH361"]]

In this dataset, the presence of $N_{\textrm{dim}}$ microbial species was measured for each of $N_{\textrm{obs}}$ patients.
In addition, the disease state of each patient was recorded, which can be "n" (no cancer), "adenoma" (precancerous), or "cancer".
For simplicity, let's use only the "n" and "cancer" states and remove the adenomas.

In [None]:
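# Columns are patients (samples), rows are microbial species
colnames(data)
rownames(data)

# Disease state of every patient, then keep only "n" and "cancer"
data$disease
dataCancerNoCancer = data[, data$disease %in% c("n", "cancer")]
dataCancerNoCancer$disease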

## Question i

Inspect the data.

How many patients are there ($N_{\textrm{obs}}$)?

How many species of microbe were measured ($N_{\textrm{dim}}$)?


## Question ii

In a generalized linear model (GLM) for a binary outcome, here logistic regression, the probability that a patient has cancer, given the presence of microbes $X_1, \ldots, X_{N_{\textrm{dim}}}$, is

$p = \frac{\operatorname{exp} (\beta_1 X_1 + ... )}{1+\operatorname{exp}\left( \beta_1 X_1 + ... \right)}$.
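
As a quick numerical check of the formula above, the following sketch evaluates $p$ for one hypothetical patient. The coefficient and measurement values are made up for illustration and are not taken from the dataset; `plogis` is base R's logistic function.

```r
# Hypothetical coefficients and microbe measurements (illustration only)
beta = c(0.8, -1.2, 0.3)
X = c(1.0, 0.5, 2.0)

eta = sum(beta * X)                  # linear predictor beta_1 X_1 + ...
pManual = exp(eta) / (1 + exp(eta))  # formula as written above
pBuiltin = plogis(eta)               # same quantity via base R

print(c(eta = eta, pManual = pManual, pBuiltin = pBuiltin))
```

Both computations should agree; `plogis` is the numerically safer choice when the linear predictor is large.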

This model has a likelihood function $L(\beta_1, \ldots, \beta_{N_{\textrm{dim}}}; Y_1, \ldots, Y_{N_{\textrm{obs}}})$, which you can see [here](https://en.wikipedia.org/wiki/Generalized_linear_model).
Plain maximum likelihood does not work here: since $N_{\textrm{dim}}>N_{\textrm{obs}}$, the problem is underdetermined and the maximum likelihood estimate is not well defined.
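
One way to see the problem is that the data matrix has at most $N_{\textrm{obs}}$ linearly independent columns, so different coefficient vectors can give identical fitted probabilities. The sketch below illustrates this on simulated data; the sizes and the random matrix are assumptions chosen for illustration, not the microbiome data.

```r
set.seed(1)
nObs = 5    # hypothetical number of patients
nDim = 10   # hypothetical number of species
X = matrix(rnorm(nObs * nDim), nrow = nObs, ncol = nDim)

# The rank of X cannot exceed nObs, so nDim - rank directions of beta are unconstrained
rankX = qr(X)$rank
print(rankX)

# A direction orthogonal to all rows of X: adding it to beta
# leaves the linear predictor X %*% beta unchanged
nullDir = qr.Q(qr(t(X)), complete = TRUE)[, nDim]
print(max(abs(X %*% nullDir)))   # numerically zero
```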

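LASSO regression, on the other hand, maximizes the penalized log-likelihood $\mathrm{log}\left(L(\beta)\right) - \lambda \sum_i | \beta_i |$, where the sum runs over all $N_{\textrm{dim}}$ coefficients.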

Perform LASSO regression for a sweep over $\lambda$.
Plot the value of all the parameters $\beta_i$ versus $\lambda$.


In [None]:
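# Response: disease state as a two-level factor ("cancer", "n")
y = factor(dataCancerNoCancer$disease)
# Predictors: glmnet expects one row per patient, so transpose the species-by-patient matrix
x = t(exprs(dataCancerNoCancer))

# alpha = 1 (the default) gives the LASSO penalty
lassoFit = glmnet(y=y, x=x, family="binomial")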

plot(lassoFit, xvar = "lambda", label = TRUE)

Perform cross-validation for each value of $\lambda$.
What value of $\lambda$ minimizes the cross-validation error?

In [None]:
# Folds are assigned at random; set a seed (arbitrary value) for reproducible results
set.seed(1)
crossValidationOutput = cv.glmnet(y=y, x=x, family="binomial")

plot(crossValidationOutput)

bestLambda = crossValidationOutput$lambda.min
# Predict on the training data and tabulate predictions against the true labels
predictedClass = predict(lassoFit, newx = x, type="class", s=bestLambda)
table(predicted = as.vector(predictedClass), actual = y)

## 3.  Elastic Net regression

Ridge regression maximizes the penalized log-likelihood

$$\mathrm{log}\left(L(\beta)\right) - \lambda \sum_i  \beta_i ^2,$$

while LASSO regression maximizes

$$\mathrm{log}\left(L(\beta)\right) - \lambda \sum_i | \beta_i |.$$

In this problem, we explore the objective

$$\mathrm{log}\left(L(\beta)\right) - \lambda \left(\alpha \sum_i | \beta_i | +  (1-\alpha) \sum_i  \beta_i ^2\right),$$

which is called __Elastic Net__. Setting $\alpha=1$ recovers LASSO, and setting $\alpha=0$ recovers Ridge.
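
To make the role of $\alpha$ concrete, here is a minimal sketch of the penalty term as written above, evaluated on a made-up coefficient vector. The function name `elasticNetPenalty` is hypothetical. Note that, according to its documentation, `glmnet` scales the quadratic term by an extra factor of 1/2, so its $\lambda$ values are not directly comparable to this formula.

```r
# Penalty term lambda * (alpha * sum|beta_i| + (1 - alpha) * sum beta_i^2), as defined above
elasticNetPenalty = function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2))
}

beta = c(0.5, -2, 0)   # hypothetical coefficients
lambda = 0.1

for (alpha in c(0, 0.5, 1)) {
  cat("alpha =", alpha, " penalty =", elasticNetPenalty(beta, lambda, alpha), "\n")
}
```

At $\alpha=0$ the function returns the Ridge penalty, at $\alpha=1$ the LASSO penalty, and values in between mix the two.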

i) In a two-dimensional parameter space, Ridge regression corresponds to finding the optimal parameters on a circle, while LASSO regression corresponds to finding the optimal parameters on a diamond. What shape does Elastic Net correspond to?

ii) The `glmnet` package was built for Elastic Net regression. Look up the [glmnet package help files](https://www.rdocumentation.org/packages/glmnet/versions/3.0-2/topics/glmnet) to find out how to perform Elastic Net regression for a specific $\alpha$. Do this for $\alpha=0.5$, and return the confusion matrix.

In [None]:
# CODE HERE

iii) Perform a sweep over $\alpha=0$ to $\alpha=1$. Plot the number of species included versus $\alpha$.

Hint: The `cv.glmnet` output object contains a vector `$nzero`, which gives the number of nonzero coefficients at each value of $\lambda$, i.e. the number of species included in the model.
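
The hint can be tried out on simulated data before applying it to the microbiome data. The sketch below assumes `glmnet` is installed; `xSim` and `ySim` are made-up data for illustration and have nothing to do with the real dataset. It looks up the entry of `$nzero` that corresponds to `lambda.min` for a single value of $\alpha$.

```r
library("glmnet")

set.seed(42)   # cv.glmnet assigns folds at random, so fix the seed for reproducibility
nObs = 100
nDim = 30
xSim = matrix(rnorm(nObs * nDim), nrow = nObs, ncol = nDim)
# Only the first two simulated features influence the outcome
pSim = plogis(2 * xSim[, 1] - 2 * xSim[, 2])
ySim = factor(rbinom(nObs, size = 1, prob = pSim))

cvFit = cv.glmnet(x = xSim, y = ySim, family = "binomial", alpha = 0.5)

# $nzero has one entry per value in $lambda; pick the one at lambda.min
nSelected = cvFit$nzero[which(cvFit$lambda == cvFit$lambda.min)]
print(nSelected)
```

Repeating the last two steps inside a loop over several $\alpha$ values would give one count per $\alpha$, which is the quantity the question asks you to plot.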

In [None]:
# CODE HERE

iv) What $\alpha$ value minimizes the number of false positives (non-cancer patients that are predicted to have cancer)? What $\alpha$ value minimizes the number of false negatives (cancer patients that are predicted to not have cancer)?
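
Counting false positives and false negatives only requires the predicted and the true labels. The sketch below uses short made-up label vectors (not model output) to show one way of extracting both counts from a `table`; the variable names are hypothetical.

```r
# Made-up labels for six patients (illustration only)
truth     = factor(c("n", "n", "n", "cancer", "cancer", "cancer"), levels = c("n", "cancer"))
predicted = factor(c("n", "cancer", "n", "cancer", "n", "n"),      levels = c("n", "cancer"))

confusion = table(predicted = predicted, truth = truth)
print(confusion)

falsePositives = confusion["cancer", "n"]   # predicted cancer, truly no cancer
falseNegatives = confusion["n", "cancer"]   # predicted no cancer, truly cancer
cat("false positives:", falsePositives, " false negatives:", falseNegatives, "\n")
```

Fixing the factor levels on both vectors keeps the table 2x2 even when a model predicts only one class, so the indexing by name does not fail.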


In [None]:
# CODE HERE