# Project A3

## [A3.1] Assignment description
### Gene Expression Data
In this assignment, you will analyze gene expression data and gradually select the most informative markers for a diagnosis task. The data is made of hundreds of probeset expression values measured for a hundred different patients suffering from Glioblastoma (a form of brain cancer).

In particular, we are looking for expression markers that can predict whether a patient suffers from Glioblastoma or No tumor. The data has already been preprocessed and is available below as two compressed files:

 - [train.csv.bz2](https://inginious.info.ucl.ac.be/course/LGBIO2010/A3Q1/train.csv.bz2)
 - [test.csv.bz2](https://inginious.info.ucl.ac.be/course/LGBIO2010/A3Q1/test.csv.bz2)

These files are large: download each file once and work with the local copy on your computer.

Such files can be decompressed using the following command:
```
bzip2 -d data.csv.bz2
```
Note that decompressing the data files is not mandatory (and not doing it saves space on your disk). It is only useful if you want to glance at the data first (see the important note below on how to import the data into an R session). Each decompressed .csv file contains a large two-dimensional table whose entries are separated by commas. Most data analysis software can take such a comma-separated file as input.

The first line of each file contains the headers or names of the columns. The subsequent lines contain the actual data. Each line contains x entries corresponding to the content of the table in the following order:

 - A string specifying the patient ID, followed by
 - (x-2) probeset expression values, and
 - a string which can either be Glioblastoma or No tumor. This status corresponds to the two conditions of interest in the subsequent analysis.

Important
---------
You will perform all analyses on the content of the train file, except for A3.9 where you will work with the content of the test file.

Before analyzing the gene expression data, you will have to load the train and test data into your R session.

Download both files and import them as data.frame objects in R. The read.csv() R function can be useful here. Observe that the read.csv() function can read the compressed and uncompressed versions of the files. Directly reading the compressed versions tends to be much faster.

Ensure that the row names (the patient IDs) and column names (probeset names and "labels" for the last column) are actual names of the data frame and not elements of the data frame.

From now on, the train and test datasets refer to the train and test data frames imported in your R session from the corresponding files.
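The import step could look like the following minimal sketch. It builds a tiny stand-in file so that it runs without the real data: the file name `toy.csv.bz2`, the probeset names and the values are all made up, and the sketch assumes that the patient IDs sit in the first column.

```r
# Build a tiny bzip2-compressed CSV that mimics the layout described above
# (toy values only; the real files are train.csv.bz2 and test.csv.bz2).
toy <- data.frame(
  id      = c("patient1", "patient2", "patient3"),
  probe_a = c(1.2, 3.4, 5.6),
  probe_b = c(7.8, 9.0, 1.2),
  labels  = c("Glioblastoma", "No tumor", "Glioblastoma")
)
path <- file.path(tempdir(), "toy.csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() decompresses the file transparently; row.names = 1 turns the
# first column (patient IDs) into row names instead of a regular column.
train_toy <- read.csv(path, row.names = 1)

rownames(train_toy)  # patient IDs
colnames(train_toy)  # probeset names followed by "labels"
```

By default, read.csv() may alter column names that are not syntactically valid R names; it is worth checking the imported names against the header of the file.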

### [Validation] Gene expression data
Report the dimensions (number of rows and number of columns) of the train data frame as an unnamed R integer vector of length two, stored in a .rds file.
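As a rough illustration of the expected format, the sketch below stores the dimensions of a toy data frame in a .rds file and reads them back. The data frame and the file name `dims.rds` are placeholders, not part of the assignment.

```r
# Toy stand-in for the train data frame
toy <- data.frame(
  probe_a = c(1.2, 3.4, 5.6),
  labels  = c("Glioblastoma", "No tumor", "Glioblastoma")
)

dims <- dim(toy)  # unnamed integer vector: number of rows, number of columns

# saveRDS() writes a single R object to disk; readRDS() restores it
out <- file.path(tempdir(), "dims.rds")
saveRDS(dims, out)
readRDS(out)
```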

## [A3.2] Probesets with largest variance
In this exercise, you will proceed to non-specific filtering of the probesets.

Rank the various probesets by increasing variance on all samples. Keep only the 25% of probesets with the largest variances.

<center>
<img src="https://inginious.info.ucl.ac.be/course/LGBIO2010/data/A3/VariancePlot.png" width="500" height="500">

Illustration of your task. It does not correspond to the actual data you must process.
</center>

### Question 1: Generic identification of probesets with largest variance

Provide an R function identifying the probesets with largest variance kept after non-specific filtering. The data and the percentage of probesets to keep are parameters of the function.

Specifications
--------------
Your implementation:
- must provide an R function called ns_filtering,
- the function takes exactly two parameters:

    - df: a data frame with rows corresponding to patients and columns corresponding to probesets. It can be obtained from the train data frame described in [A3.1] exercise with the following R instructions:
``` R
# Assuming data is the train data frame described in [A3.1]
df = data[, -which(colnames(data) == "labels")]
# Or equivalently
df = subset(data, select = -labels)
```  
    - kept: a floating-point value corresponding to the fraction (or percentage) of probesets that are kept (typically 0.25); if the resulting number of probesets to keep is not an integer, round it,

- returns a vector of the variance of kept probesets:

    - values are sorted by decreasing variance,
    - variance values are named after the associated probeset names,
    - the vector contains variances of probesets that are kept after non-specific filtering.

Observe that your function can be used to find the 5 probesets with the largest variance:
```
ns_filtering(df, 0.25)[1:5]
```
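The ranking idea behind non-specific filtering can be sketched on toy data as follows. This is only an illustration under assumptions: the data frame, the probeset names and the 50% fraction are invented, and it is not meant as the reference implementation of ns_filtering.

```r
# Toy data: 4 patients (rows) x 4 probesets (columns), made-up values
toy <- data.frame(
  p1 = c(1, 2, 3, 4),
  p2 = c(0, 10, 0, 10),
  p3 = c(5, 5, 5, 6),
  p4 = c(1, 1, 2, 2),
  row.names = c("s1", "s2", "s3", "s4")
)

# Variance of each column (probeset) over all samples, as a named vector
vars <- sapply(toy, var)

# Number of probesets to keep for a fraction of 0.5
n_keep <- round(0.5 * length(vars))

# Sort by decreasing variance and keep the first n_keep entries
top <- sort(vars, decreasing = TRUE)[seq_len(n_keep)]
top
```

Note that R's round() rounds exact halves to the even integer (round(2.5) is 2); the assignment does not state which convention is expected in that case.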

ns_filtering = function(df, kept){

}


### Question 2: Non-specific filtering
Provide the variances, in decreasing order, of the 25% of probesets with largest variance after non-specific filtering of the train dataset.

Important
---------
All subsequent exercises are performed using only the 25% of probesets with largest variance kept after non-specific filtering.

The values must be reported as a named R vector of values, sorted by decreasing values, stored in a .rds file. Each value must be named after the corresponding probeset.

## [A3.3] t-Test
In exercises [A3.3] and [A3.4] (Corrections for multiple testing), you will use statistical tests to find probesets that are differentially expressed between the two conditions of interest, in the train dataset.

Using a t-test to distinguish between the two conditions under study, rank the various probesets (after non-specific filtering) by increasing p-value.
<center>
<img src="https://inginious.info.ucl.ac.be/course/LGBIO2010/data/A3/WelchTtest.jpg" width="500" height="500">

Illustration of your task. It does not correspond to the actual data you must process.
</center>

### Question 1: Generic identification of significant probesets
Provide an R function computing the number of probesets that are deemed significant according to a t-test with no specific correction. The data and the significance level are parameters of the function.

Specifications
--------------
Your implementation:

- must provide an R function called differential_selection,
- the function takes exactly three parameters:
    - data: a data frame as described in the context of the exercise [A3.1].
    - target: an R character string corresponding to the name of the column containing the status of the patients in data.
    - alpha: a numerical floating-point value corresponding to the significance level of the t-test (typically 0.05).
- returns an integer value corresponding to the number of probesets that are deemed significant according to a t-test with no correction.

Observe that your function can be used to answer Question 2 using:
```
# data is the data frame obtained after non-specific filtering (keeping only some probesets as described in [A3.2]) of the data frame described in [A3.1]
differential_selection(data, "labels", 0.05)
```
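A per-probeset t-test can be sketched on simulated data as shown below. The probeset names `shifted` and `flat` and the simulated values are assumptions made for the illustration; t.test() with two samples performs a Welch test by default.

```r
set.seed(1)

# Simulated toy data: 10 samples per condition, 2 made-up probesets
labels <- rep(c("Glioblastoma", "No tumor"), each = 10)
expr <- data.frame(
  shifted = c(rnorm(10, mean = 5), rnorm(10, mean = 0)),  # differs between groups
  flat    = rnorm(20)                                      # same distribution
)

# One t-test per probeset (column), collecting the p-values in a named vector
pvals <- sapply(expr, function(x) {
  t.test(x[labels == "Glioblastoma"], x[labels == "No tumor"])$p.value
})

# Rank probesets by increasing p-value and count those below alpha
ranked <- sort(pvals)
n_significant <- sum(pvals < 0.05)
```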

#Fill in with your R code
differential_selection = function(data, target, alpha){
}

### Question 2: Significant probesets
How many features (= probesets) are deemed significant according to such a t-test? Consider a 5% significance level and no p-value correction at this point.

The expected answer is an integer (for example, 1).

## [A3.4] Corrections for multiple testing
How many features are considered significant after a Bonferroni correction of the original t-test (see [A3.3] exercise)? How many features are considered significant after an FDR correction of the original t-test?

### Question 1: Generic definition of corrected p-values
Provide an R function computing the corrected p-values of the features that are considered significant after a correction of the original test. The data and the significance level of the test, as well as the correction method, are parameters of the function.

Specifications
--------------
Your implementation:
- must provide an R function called correct,
- the function takes exactly four parameters:

    - data: a data frame as described in the context of the exercise [A3.1].
    - target: an R character string corresponding to the name of the column containing the status of the patients in data.
    - alpha: a numerical floating-point value corresponding to the significance level of the t-test (typically 0.05).
    - method: an R character string corresponding to the correction to consider (acceptable values are bonferroni or fdr).

- returns a named vector of the corrected p-values:

    - values are the corrected p-values of the kept features,
    - the name of each element is the name of the associated probeset,
    - the values are sorted according to increasing corrected p-values.

Observe that your function can be used to answer Question 2 and Question 3 using:

```
# data is the data frame described in [A3.1] after non-specific filtering in [A3.2]
Q2 = correct(data, "labels", 0.05, "bonferroni")
Q3 = correct(data, "labels", 0.05, "fdr")
```
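Both corrections are available through the base R function p.adjust(). The sketch below applies them to a handful of invented p-values (the names `a` to `e` are placeholders) and keeps the corrected values that fall below the significance level.

```r
# Made-up uncorrected p-values for five hypothetical probesets
p <- c(a = 0.001, b = 0.008, c = 0.02, d = 0.03, e = 0.3)
alpha <- 0.05

# p.adjust() accepts "bonferroni" and "fdr" (an alias of "BH") as methods
bonf <- p.adjust(p, method = "bonferroni")
fdr  <- p.adjust(p, method = "fdr")

# Keep the corrected p-values below alpha, sorted by increasing value
bonf_kept <- sort(bonf[bonf < alpha])
fdr_kept  <- sort(fdr[fdr < alpha])
```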

#Fill in with your R code
correct = function(data, target, alpha, method){

}

### Question 2: Bonferroni correction
Report an R vector of the corrected p-values of the features that are considered significant after Bonferroni correction of the original t-test. Consider a significance level of 5%. The vector is named after the corresponding probeset names. Corrected p-values should be ordered by increasing value.

Your result should be submitted as a .rds file.

### Question 3: FDR correction
Report an R vector of the corrected p-values of the features that are considered significant after FDR correction of the original t-test. Consider a significance level of 5%. The vector is named after the corresponding probeset names. Corrected p-values should be ordered by increasing value.

Your result should be submitted as a .rds file.

### Question 4: Correction methods
You considered two alternative approaches to correct for multiple testing, namely Bonferroni and FDR. Select the valid propositions below.
- [ ] The Bonferroni correction is less conservative than the FDR correction: more probesets are considered differentially expressed with the Bonferroni correction
- [ ] The top-one most significant probeset is guaranteed to be the same after either Bonferroni or after FDR correction
- [ ] The Bonferroni-corrected p-values are lower than the FDR-corrected p-values
- [ ] The FDR-corrected p-values are lower than the Bonferroni-corrected p-values
- [ ] The FDR correction is less conservative than the Bonferroni correction: more probesets are considered differentially expressed with the FDR correction

### Question 5: Differentially expressed probesets
Report an R vector of the uncorrected p-values of the 50 most differentially expressed features. The vector is named after the corresponding probesets' names. The p-values should be ordered by increasing value.

Your result should be submitted as a .rds file.

## [A3.5] Heatmaps
In exercises [A3.5], [A3.6], and [A3.7], you will use heatmaps and 2-D plots to get an idea of the possible associations among some of the variables and the distribution of part of the data.

A heatmap illustrates two hierarchical clusterings, respectively along the features (= probesets) and along the samples.

Check the heatmap.2 function of the gplots R package to produce heatmaps.

<center>
<img src="https://inginious.info.ucl.ac.be/course/LGBIO2010/data/A3/Heatmap.jpg" width=500 height=500>

Illustration of your task. The heatmap does not correspond to the actual data you must process.
</center>

### Question 1: Data transformation
Centering and normalizing features to unit variance is a common data transformation. Select the appropriate motivation(s) for such a data transformation, in the context of descriptive analysis, among the propositions below.

- [ ] To ensure normality of the distribution, ensuring that normality assumptions of statistical tests are met
- [ ] To ensure that variables present the same scale, ensuring similar contribution weights of the variables in subsequent analyses
- [ ] To ensure that variables present the same distribution, ensuring that variables represent similar underlying models
- [ ] To ensure that outliers have a reduced influence, ensuring that the conclusions are not driven by these outliers

### Question 2: Heatmap
Report a heatmap of all samples along the 50 most differentially expressed features (according to your analysis on the data after non-specific filtering).

- Use some color coding [1] to label each sample with its respective condition (Glioblastoma or No tumor).
- Ensure that the plot contains appropriate titles and is readable [2].

The submitted file should be a .png file.

[1]	Check the RowSideColors or ColSideColors arguments of the heatmap.2 function.
[2]	Consider setting the trace argument to none.
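A possible call to heatmap.2 is sketched below on random toy data. The matrix, the sample and probeset names, and the colors are assumptions; the call is guarded so that the sketch still runs when the gplots package is not installed. To write a file, the call would typically be wrapped between png(...) and dev.off().

```r
set.seed(42)

# Toy matrix: 6 samples (rows) x 4 probesets (columns), made-up names
expr <- matrix(rnorm(24), nrow = 6,
               dimnames = list(paste0("s", 1:6), paste0("probe", 1:4)))
labels <- rep(c("Glioblastoma", "No tumor"), each = 3)

# One color per sample, chosen from its condition
side_cols <- ifelse(labels == "Glioblastoma", "red", "blue")

# Put probesets on rows and samples on columns, so that ColSideColors
# labels the samples
mat <- t(expr)

if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::heatmap.2(mat,
                    scale = "row",            # center and scale each probeset
                    trace = "none",           # no trace lines inside the cells
                    ColSideColors = side_cols,
                    main = "Toy heatmap (random data)")
}
```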

### Question 3: [Open question] Graph description and analysis
Regarding the file submitted in Question 2:

- Describe the data transformation (if any) used before actually generating the graph, or explicitly state that you did not transform the data (if you did not),
- Briefly describe the heatmap,
- Discuss to which extent the samples are clustered consistently with the conditions of interest,
- Explain the reason(s) for possible difference(s), if any, or why the clustering is fully consistent with the conditions of interest.

#Answer either in English or in French

## [A3.6] 2-D plot of most discriminative features
You identified probesets that are differentially expressed between the two conditions of interest in exercises [A3.3] and [A3.4]. Can you actually observe different distributions of the most discriminative features between these conditions?

<center>

<img src="https://inginious.info.ucl.ac.be/course/LGBIO2010/data/A3/2Dplot.png" width=500 height=500>

Illustration of your task. The graph does not correspond to the actual data you must process.
</center>

### Question 1: 2-D plot of most discriminative features
Report a 2-D plot of all samples along the two most discriminative features according to your differential expression analysis.

- Use some specific coding (e.g. a color code) to label each sample with its respective condition.
- Report the most discriminating probeset along the x-axis and the second best along the y-axis.
- Report the names of the chosen features along the x- and y-axis of your plot.
- Ensure that the plot is readable.

The submitted file should be a .png file.
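Such a plot can be produced with base R graphics, as in the minimal sketch below. The simulated values, the axis names and the colors are placeholders for illustration only.

```r
set.seed(7)

# Simulated toy values for two made-up probesets, 10 samples per condition
labels <- rep(c("Glioblastoma", "No tumor"), each = 10)
x <- c(rnorm(10, mean = 3), rnorm(10, mean = 0))
y <- c(rnorm(10, mean = 0), rnorm(10, mean = 2))

# Named vector mapping each condition to a color
cols <- c("Glioblastoma" = "red", "No tumor" = "blue")
point_cols <- cols[labels]

# Scatter plot with one point per sample, colored by condition
plot(x, y, col = point_cols, pch = 19,
     xlab = "probe_best (toy name)", ylab = "probe_second (toy name)",
     main = "Toy 2-D plot of two probesets")
legend("topright", legend = names(cols), col = cols, pch = 19)
```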

### Question 2: Expression of the most discriminating probesets
Determine whether the two most discriminative features, according to your differential expression analysis, are under-expressed, over-expressed or if they tend to present similar expression (i.e. their mean expression values do not significantly differ after FDR correction and a significance threshold alpha = 5%).

Report an R vector with named values (considering the probesets as names). Values can be either:

- 1 if the Glioblastoma patients present over-expression,
- -1 if the Glioblastoma patients present under-expression,
- 0 if the Glioblastoma patients present similar expression to No tumor patients.

Report the values for the two most discriminative features considering your differential expression analysis.
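One way to encode the three possible values is sketched below. The helper name `expression_direction`, its arguments and the toy values are hypothetical; the sketch assumes that the FDR-corrected p-value of the probeset has already been computed and that "significant" means strictly below alpha.

```r
# Hypothetical helper: returns 1, -1 or 0 for one probeset.
#   x      expression values of the probeset for all samples
#   labels condition of each sample
#   adj_p  FDR-corrected p-value of the probeset (computed elsewhere)
expression_direction <- function(x, labels, adj_p, alpha = 0.05) {
  if (adj_p >= alpha) {
    return(0)  # no significant difference: similar expression
  }
  # Sign of the difference between the group means
  sign(mean(x[labels == "Glioblastoma"]) - mean(x[labels == "No tumor"]))
}

labels <- rep(c("Glioblastoma", "No tumor"), each = 3)
res <- c(
  probe_up   = expression_direction(c(5, 6, 7, 1, 2, 3), labels, adj_p = 0.001),
  probe_down = expression_direction(c(1, 2, 3, 5, 6, 7), labels, adj_p = 0.001),
  probe_flat = expression_direction(c(1, 2, 3, 1, 2, 4), labels, adj_p = 0.4)
)
res
```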

## [A3.7] 2-D plot of least discriminative features
You identified probesets that are differentially expressed between the two conditions of interest in exercises [A3.3] and [A3.4]. Can you actually observe different distributions of the least discriminative features (among all features kept after non-specific filtering, see [A3.2]) between these conditions?

<center>

<img src="https://inginious.info.ucl.ac.be/course/LGBIO2010/data/A3/2Dplot.png" height=500 width=500>

Illustration of your task. The graph does not correspond to the actual data you must process.
</center>

### Question 1: 2-D plot of least significant features
Report a 2-D plot of all samples along the two least discriminative features to distinguish between both conditions.

- Use some specific coding (e.g. a color code) to label each sample with its respective condition.
- Report the least discriminating probeset along the x-axis and the second least discriminating one along the y-axis.
- Report the names of the chosen features along the x- and y-axis of your plot.
- Ensure that the plot is readable.

The submitted file should be a .png file.



### Question 2: Expression of the least discriminating probesets
Determine whether the two least discriminative features are under-expressed, over-expressed or if they tend to present similar expression (i.e. their mean expression values do not significantly differ after FDR correction and a significance threshold alpha = 5%).

Report an R vector with named values (considering the probesets as names). Values can be either:

- 1 if the Glioblastoma patients present over-expression,
- -1 if the Glioblastoma patients present under-expression,
- 0 if the Glioblastoma patients present similar expression to No tumor patients.

Report the values for the two least discriminative features to distinguish between both conditions as a .rds file.

## [A3.8] Support Vector Machines
In exercises [A3.8] and [A3.9], you will fit a linear SVM on a training set and evaluate the model on a test set.



### Question 1: SVM weights
Fit a linear SVM on the training set using all probesets obtained after non-specific filtering (see [A3.2]). We recommend the LiblineaR R package. Each feature should be centered and normalized to unit variance before the actual SVM training.

Report an R data frame with the 10 most important features according to the absolute weight values of such a linear model. The data frame should have 10 rows, named after the probesets, and two columns named "Weight" and "Rank". The "Weight" column should contain the absolute weight values of these 10 features. The "Rank" column should contain the respective ranks of those features according to the ranking of p-values computed in exercise [A3.3].

You are invited to consult the documentation of the LiblineaR R package. To build an SVM model, the type argument should be equal to 2. Once a model has been built from a training sample, the model parameters can easily be accessed as model$W[1,]. Such model parameters also include an intercept or bias term, by default. You are expected to stick to this default setting when fitting the SVM. Yet, when reporting the 10 most important features, only the 10 most important actual probesets should be reported (whatever the fitted bias term).
```
#Build a model from a training sample
model <- LiblineaR(...)
#Access the model parameters
model$W[1,]
```
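A more complete version of the calls above might look like the following sketch on simulated data. The probeset names and values are made up, and the fit is guarded so that the sketch still runs when the LiblineaR package is not installed. The sketch relies on the bias entry of the weight vector being named "Bias", as described in the LiblineaR documentation.

```r
set.seed(3)

# Simulated training data: 40 samples x 3 made-up probesets
x <- matrix(rnorm(40 * 3), ncol = 3,
            dimnames = list(NULL, c("probe_a", "probe_b", "probe_c")))
y <- factor(rep(c("Glioblastoma", "No tumor"), each = 20))
# Shift one probeset in one condition so that it becomes informative
x[y == "Glioblastoma", "probe_a"] <- x[y == "Glioblastoma", "probe_a"] + 3

# Center each feature and normalize it to unit variance
x_scaled <- scale(x, center = TRUE, scale = TRUE)

if (requireNamespace("LiblineaR", quietly = TRUE)) {
  # type = 2 as required above; the bias term is kept at its default
  model <- LiblineaR::LiblineaR(data = x_scaled, target = y, type = 2)

  # Absolute weights of the actual probesets, without the bias term
  w <- abs(model$W[1, ])
  w <- w[names(w) != "Bias"]
  print(sort(w, decreasing = TRUE))
}
```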

Allowed extensions: .rds



### Question 2: [Open question] SVM weights - R code
Submit below the R code you implemented, including possible calls to some existing R package(s) through the library(...) function, to answer the previous question.

#Fill in with your R code below

## [A3.9] Prediction
Compute the labels (either Glioblastoma or No tumor) of the test samples predicted by the linear SVM estimated on the training set.

### Question 1: Confusion matrix
Report a confusion matrix between predicted and true labels on the test set.

Important notes:
----------------
You must now work on the test set and no longer on the training set.

The SVM model produced in exercise [A3.8] is built with a reduced set of probesets and these probesets have been scaled before fitting the SVM. As such, remove from the test set all probesets that have been removed by the non-specific filtering (see [A3.2]). Then, normalize each remaining probeset (from the test set) using the mean and variance scaling parameters computed from the training set. Check the use of the scale function in the examples included in the LiblineaR documentation.

Upload an R table stored in a .rds file. The table contains the predictions on the rows and the true labels on the columns. The first row and the first column are called Glioblastoma. The second row and the second column are called No tumor.
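The scaling step described in the notes above can be sketched with base R only. The two small matrices are invented; the point is that the test data reuse the centering and scaling parameters stored as attributes by scale() on the training data.

```r
# Toy training and test matrices with the same (made-up) probesets
train_x <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
                  dimnames = list(NULL, c("probe_a", "probe_b")))
test_x <- matrix(c(2.5, 5, 25, 50), ncol = 2,
                 dimnames = list(NULL, c("probe_a", "probe_b")))

# Scale the training data; scale() stores the parameters as attributes
train_scaled <- scale(train_x, center = TRUE, scale = TRUE)

# Scale the test data with the TRAINING means and standard deviations
test_scaled <- scale(test_x,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))
test_scaled
```

The first test sample sits exactly at the training means, so both of its scaled values are 0. The scaled test matrix is what would then be passed to the predict function for the fitted model.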

As an example, suppose the following:

- prediction: c("Glioblastoma", "Glioblastoma", "No tumor", "Glioblastoma", "No tumor")
- true labels: c("Glioblastoma", "Glioblastoma", "Glioblastoma", "No tumor", "No tumor")

To produce the associated table, you would use the following R instructions:

```
prediction = c("Glioblastoma", "Glioblastoma", "No tumor", "Glioblastoma", "No tumor")
truth = c("Glioblastoma", "Glioblastoma", "Glioblastoma", "No tumor", "No tumor")
res = table(prediction, truth)
```

### Question 2: Generic classification accuracy
Provide an R function computing the classification accuracy of a model. The confusion matrix is a parameter of the function.

Specifications
--------------
Your implementation:

- must provide an R function called classification_accuracy,
- the function takes exactly one parameter:
    - confusion_matrix: a confusion matrix as described in the previous Question.
- returns a floating-point value in the range [0.0,1.0] corresponding to the classification accuracy of the model given the provided confusion matrix.

Observe that your function can be used to answer Question 3. To compute the classification accuracy of the example provided in Question 1, you would use:
```
predictions = c("Glioblastoma", "Glioblastoma", "No tumor", "Glioblastoma", "No tumor")
truth = c("Glioblastoma", "Glioblastoma", "Glioblastoma", "No tumor", "No tumor")
accuracy = classification_accuracy(table(predictions, truth))
```
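For reference, the accuracy is the fraction of samples on the diagonal of the confusion matrix. The sketch below works this out by hand for the example of Question 1; it assumes that both labels appear among the predictions and the true labels, so that the table is square with matching row and column order.

```r
predictions <- c("Glioblastoma", "Glioblastoma", "No tumor", "Glioblastoma", "No tumor")
truth <- c("Glioblastoma", "Glioblastoma", "Glioblastoma", "No tumor", "No tumor")

# Rows are predictions, columns are true labels
cm <- table(predictions, truth)

# Correct predictions lie on the diagonal: (2 + 1) / 5
acc <- sum(diag(cm)) / sum(cm)
acc  # 0.6
```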

#Fill in with your R code
classification_accuracy = function(confusion_matrix){

}

### Question 3: Classification accuracy
Report the classification accuracy of the SVM on the test set (report a floating-point number in the range [0.0,1.0] with at least two decimals).