# Intro to Bioinformatics Applications of R

<div class="alert alert-block alert-info">
    
Let's practice using R to manipulate and visualize data. We will use publicly available data ([GSE166925](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE166925)) from NCBI GEO, collected from patients with inflammatory bowel disease and colorectal cancer.
 
<br><br> 
RNA-seq data usually comes with two files: 
<ul>
<li> a metadata file with information about the samples (e.g., sample ID, disease status, gender), where each row represents a sample and each column holds one attribute </li>
<li> a gene expression file with the gene expression information for each sample where the rows represent the genes and the columns represent the samples </li>
</ul>
</div>

### Libraries
Every time you load a library, R attaches the package's environment to its search path. The library's functions become available because R looks through the attached environments, in order, to resolve a function call. Loading a new library can therefore mask an existing function that has the same name. You can still call a specific package's version by using the package specifier, `package::function()`.
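
As a small illustration of masking and the package specifier, the sketch below defines a throwaway `mean()` in the workspace (purely hypothetical, for demonstration only) and then reaches the original through `base::mean()`.

```r
# search() lists the environments R looks through, in order, to resolve a name
search_path <- search()

# Define a throwaway function that masks base::mean (for illustration only)
mean <- function(x) "masked version"

masked_result <- mean(c(1, 2, 3))           # finds our definition first
qualified_result <- base::mean(c(1, 2, 3))  # explicitly uses the base package

# Remove our definition so that mean() refers to base::mean again
rm(mean)
```

The same `package::function()` form is what the notebook uses later when it calls `ensembldb::select()`.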

Note: You can also suppress the messages printed when loading a library using [`suppressPackageStartupMessages()`](https://www.geeksforgeeks.org/how-to-disable-messages-when-loading-a-package-in-r/). 

One of the benefits of using libraries is that you don't need to rewrite code that already exists! We can use the `getGEO()` function from the `GEOquery` library to automatically download the data from GEO instead of manually collecting the sample information. The code used to download this dataset and generate the files is in the solutions notebook.

<div class="alert alert-block alert-success">

<p>We now have both the metadata and gene expression files ready to analyze.</p>

</div>

### Data Manipulation
We need to format the data in a way that we can use for later analyses/visualizations.

Let's read in the sample metadata table that was saved to GSE166925-meta.tsv (file location: `~/module-1-programming/data/GSE166925-meta.tsv`) into a `meta` dataframe object. We can use the [`read.table()`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table) function to generate dataframes from files.

In [None]:
meta <- read.table('FILE_NAME')
head(meta)
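
If you are unsure which arguments to pass to `read.table()`, the sketch below may help. It writes a tiny tab-separated file to a temporary location and reads it back. The file contents and column names are made up for illustration, and whether the real metadata file needs exactly these arguments is an assumption.

```r
# Write a tiny tab-separated file to a temporary location (made-up values)
toy_path <- tempfile(fileext = ".tsv")
writeLines(c("geo_accession\ttitle\tdisease",
             "GSM0001\tsample_A\tControl",
             "GSM0002\tsample_B\tCD",
             "GSM0003\tsample_C\tUC"),
           toy_path)

# header = TRUE uses the first line as column names; sep = "\t" splits on tabs
toy_meta <- read.table(toy_path, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)
head(toy_meta)
```

Setting `sep = "\t"` matters when fields contain spaces, because the default separator splits on any whitespace.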

Let's also use the title column as the row names (the index) instead of the GSM IDs, using `rownames()`.

*Hint: Use the $ to access the title column*

In [None]:
rownames(meta) <- 
head(meta)
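
For reference, here is how row names behave on a toy data frame (all values are hypothetical). Once the row names are set, a row can be looked up by name rather than by position.

```r
# Toy metadata with made-up sample names
toy_meta <- data.frame(geo_accession = c("GSM0001", "GSM0002", "GSM0003"),
                       title = c("sample_A", "sample_B", "sample_C"),
                       disease = c("Control", "CD", "UC"),
                       stringsAsFactors = FALSE)

# Use the values of one column as the row names
rownames(toy_meta) <- toy_meta$title

# Rows can now be selected by name
toy_meta["sample_B", "disease"]
```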

Now let's read in the gene expression data table from the file GSE166925_salmon.genes.tpms.txt.gz into an `expr` dataframe. (file location: `~/public/module-1-programming/data/GSE166925_salmon.genes.tpms.txt.gz`)

In [None]:
expr <- read.table('FILENAME',header=TRUE)
head(expr)

We'll review this more in the RNA-seq module, but when looking at gene expression data, we usually use log-transformed normalized expression for analysis. The file provided already contains normalized data (TPM, as given in the filename), so we only need to take the log, which we can do using `lapply()`.

*Hint: We only want to take the log of the data, not the gene names*

In [None]:
expr[-1] <- lapply(DATA, FUNCTION)
head(expr)
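
A minimal sketch of this pattern on toy values is shown below. The base-2 logarithm and the pseudocount of 1 are assumptions, not something the notebook specifies; the pseudocount is one common way to avoid `-Inf` for genes with zero expression.

```r
# Toy expression table: first column holds IDs, the rest hold made-up values
toy_expr <- data.frame(gene_id = c("ENSG_toy1", "ENSG_toy2", "ENSG_toy3"),
                       s1 = c(0, 3, 7),
                       s2 = c(1, 15, 0),
                       stringsAsFactors = FALSE)

# toy_expr[-1] selects every column except the first, so the IDs are untouched;
# lapply() applies the function to each remaining column in turn
toy_expr[-1] <- lapply(toy_expr[-1], function(x) log2(x + 1))
toy_expr
```

Assigning back into `toy_expr[-1]` keeps the result a data frame with the ID column still in place.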

The provided file only has the gene IDs (ENSG IDs) but it's easier to use the gene names/symbols, so let's find a library that can add the gene names.

In [None]:
suppressPackageStartupMessages(library(EnsDb.Hsapiens.v79))

gene_keys <- GENE_COLUMN

geneIDs <- ensembldb::select(EnsDb.Hsapiens.v79, 
                             keys= gene_keys, 
                             keytype = "GENEID", 
                             columns = c("GENEID","SYMBOL"))
head(geneIDs)

This created a new dataframe `geneIDs` with matching ENSG IDs and gene names. To get the gene names into our `expr` dataframe, we can use the `merge()` function. 

The [`merge()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge) function allows you to merge two dataframes by specifying columns that are matching. 

For our two dataframes, the 'gene_id' column in the `expr` dataframe and the 'GENEID' column in the `geneIDs` dataframe both contain the ENSG IDs, so we can merge on those columns and create a new dataframe `expr2` which has the gene names and gene expression.

In [None]:
expr2 <- merge(DATAFRAME1, DATAFRAME2, by.x='COLUMN', by.y='COLUMN')
head(expr2)
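
The toy example below sketches how `merge()` behaves when the key columns have different names. All IDs and symbols are invented. Note that the key column in the result takes its name from the first dataframe, and that only IDs present in both dataframes are kept by default.

```r
# Toy lookup table and toy expression table (made-up IDs and values)
toy_ids <- data.frame(GENEID = c("ENSG_toy1", "ENSG_toy2", "ENSG_toy3"),
                      SYMBOL = c("GENE_A", "GENE_B", "GENE_C"),
                      stringsAsFactors = FALSE)
toy_expr <- data.frame(gene_id = c("ENSG_toy2", "ENSG_toy1", "ENSG_toy4"),
                       s1 = c(1.5, 2.0, 0.3),
                       s2 = c(0.7, 1.1, 2.9),
                       stringsAsFactors = FALSE)

# by.x names the key column in the first dataframe, by.y in the second
toy_merged <- merge(toy_ids, toy_expr, by.x = "GENEID", by.y = "gene_id")
toy_merged
```

Here `ENSG_toy3` and `ENSG_toy4` are dropped because each appears in only one table; `all.x = TRUE` or `all.y = TRUE` would keep unmatched rows and fill the gaps with `NA`.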

To make the `expr2` dataframe match the format of the `meta` dataframe (rows with samples), we can use the `t()` function which transposes the data, essentially switching the rows and columns.

First, let's remove the "GENEID" column from `expr2`.

In [None]:
expr2 <- expr2
head(expr2)

Now let's transpose the `expr2` dataframe.

*Hint: Use `t()` to transpose the dataframe, then cast it using `data.frame()`.*

In [None]:
expr2 <- 
head(expr2)

Set the column names to the values in the first row using `colnames()`, then remove that row.

In [None]:
colnames(expr2) <- 
expr2 <- expr2
head(expr2)
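
The sketch below walks through the same transpose steps on a toy data frame (invented symbols and values). One thing to be aware of: transposing a data frame that mixes text and numbers yields text everywhere, so the final conversion back to numeric is included here as a likely necessary extra step rather than something the notebook states.

```r
# Toy table: one symbol column followed by made-up sample columns
toy <- data.frame(SYMBOL = c("GENE_A", "GENE_B"),
                  s1 = c(1.5, 2.0),
                  s2 = c(3.0, 4.0),
                  stringsAsFactors = FALSE)

# t() returns a character matrix here because one column holds text
toy_t <- data.frame(t(toy), stringsAsFactors = FALSE)

# The first row now holds the symbols: use it for the column names, then drop it
colnames(toy_t) <- as.character(unlist(toy_t[1, ]))
toy_t <- toy_t[-1, ]

# Convert the remaining text values back to numbers, keeping the row names
toy_t[] <- lapply(toy_t, as.numeric)
toy_t
```

The empty brackets in `toy_t[] <-` replace the column contents while preserving the data frame's shape and row names.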

We can now merge the `meta` and `expr2` dataframes so we have all the information in the same dataframe.

In [None]:
merged_data <- merge(DATAFRAME1, DATAFRAME2, by.x='COLUMN', by.y='COLUMN')
head(merged_data)
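
When the matching identifiers live in the row names rather than in a column, `merge()` accepts the special value `"row.names"`. The toy sketch below (made-up data) shows that form; whether you merge on row names or on a named column in the exercise depends on how you built your dataframes.

```r
# Toy metadata and toy expression, both indexed by sample name
toy_meta <- data.frame(disease = c("Control", "CD", "UC"),
                       row.names = c("s1", "s2", "s3"),
                       stringsAsFactors = FALSE)
toy_expr <- data.frame(GENE_A = c(1.0, 2.0, 3.0),
                       row.names = c("s1", "s2", "s3"))

# by = "row.names" matches on the row names of both dataframes
toy_all <- merge(toy_meta, toy_expr, by = "row.names")
toy_all
```

The matched names come back in a column called `Row.names`, and the result gets fresh numeric row names.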

### Visualizations using Basic Plots
Let's explore the sample metadata by creating some plots using the `meta` dataframe.

Let's first look at the distribution of patient ages in our dataset.

*Hint: use `colnames(meta)` to see the available data columns*

**Histograms:**

[`hist()`](https://www.datamentor.io/r-programming/histogram/) : generates a histogram

In [None]:
hist()
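
As a reference for the call shape, here is `hist()` on a made-up vector of ages. The values and labels are hypothetical.

```r
# Made-up ages, for illustration only
toy_ages <- c(23, 35, 41, 29, 52, 47, 38, 61, 33, 45)

# hist() takes a numeric vector; main= and xlab= set the title and axis label
h <- hist(toy_ages, main = "Toy age distribution", xlab = "Age (years)")

# The returned object stores the bin boundaries and the count in each bin
h$counts
```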

Next, generate a bar plot of gender using [`barplot()`](https://www.statmethods.net/graphs/bar.html).

The input for `barplot()` is a table of counts for each condition, not a list of the conditions. 

Ex: instead of a list/vector with "female", "female", "female", "male", we need to use the [`table()`](https://www.statology.org/table-function-in-r/) function to get the counts, female: 3 and male: 1.

In [None]:
gender_table <- table()
gender_table

In [None]:
barplot(gender_table)
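
The example vector described above can be run directly to see what `table()` returns and how `barplot()` uses it:

```r
# The four-value example from the text
toy_gender <- c("female", "female", "female", "male")

# table() counts how often each distinct value occurs
toy_counts <- table(toy_gender)
toy_counts

# barplot() draws one bar per entry of the counts table
barplot(toy_counts, ylab = "Number of samples")
```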

<div class="alert alert-block alert-info">
    <h3>R Practice for basic visualizations: barplot</h3><br>
    
<p>
    Generate a stacked bar plot using <code>barplot()</code> of the disease and inflammation status. The x-axis should be divided by disease, and each disease will be subdivided into inflamed/uninflamed.
    
Hints:
- You can look here for [examples](https://www.statmethods.net/graphs/bar.html)
- Generate the counts table with disease and inflammation status
- You can use the `beside=` parameter when calling `barplot()` so the groups aren't stacked on top of each other
- Use the `legend=` parameter to include the legend in the plot
- You can also add title (`main=`), x-axis label (`xlab=`), y-axis label (`ylab=`), colors (`col=`)
</p>
    

</div>
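
If the two-way counts table is unfamiliar, the sketch below shows the idea on toy data. The column names `disease` and `inflammation` and all values are assumptions for illustration; check `colnames(meta)` for the real names.

```r
# Toy sample annotations (made-up)
toy <- data.frame(
  disease = c("Control", "Control", "CD", "CD", "CD", "UC", "UC"),
  inflammation = c("uninflamed", "uninflamed", "inflamed", "inflamed",
                   "uninflamed", "inflamed", "uninflamed"),
  stringsAsFactors = FALSE)

# Two-way table: rows become the sub-groups, columns become the x-axis groups
toy_counts <- table(toy$inflammation, toy$disease)
toy_counts

# beside = TRUE draws the sub-groups side by side instead of stacked
barplot(toy_counts, beside = TRUE, legend.text = TRUE,
        main = "Toy samples", xlab = "Disease", ylab = "Number of samples",
        col = c("firebrick", "steelblue"))
```

Swapping the two arguments to `table()` swaps which variable ends up on the x-axis.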

**Scatter plots:**

Let's make a scatter plot for the first two genes in our `expr2` data frame.

[`plot()`](https://www.w3schools.com/r/r_graph_plot.asp) : a general function that creates various plots depending on the input. Given two numeric vectors, it generates a scatter plot.

In [None]:
x1 <- expr2[]
x2 <- expr2[]

plot()

We can add a title to our plot with the genes by using [`paste()`](https://www.digitalocean.com/community/tutorials/paste-in-r). `paste()` allows you to combine variables and/or strings into a single string. 

In [None]:
title = paste()
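
A quick illustration of `paste()` with placeholder gene names (hypothetical values):

```r
# Placeholder names, for illustration only
x_gene <- "GENE_A"
y_gene <- "GENE_B"

# paste() joins its arguments with a space by default
toy_title <- paste(x_gene, "vs", y_gene)

# paste0() joins with no separator at all
toy_label <- paste0(x_gene, "_", y_gene)

toy_title
toy_label
```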

In [None]:
x1 <- expr2[,1]
x2 <- expr2[,2]

x_gene <- names(expr2[1])
y_gene <- names(expr2[2])

plot(x1, x2, xlab=, ylab=, main=title)

We can also plot data using gene names. Let's plot FGR and GCLC. 

In [None]:
x1 <- 
x2 <- 

plot(,, xlab=, ylab=,main=)
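
Selecting columns by name works the same way as selecting them by position. The sketch below uses a toy data frame with invented gene names and values; storing the names in variables lets the same strings label the axes and the title.

```r
# Toy expression table: samples as rows, genes as columns (made-up values)
toy <- data.frame(GENE_A = c(1.2, 2.5, 3.1, 4.8),
                  GENE_B = c(0.9, 2.2, 3.5, 4.1),
                  row.names = c("s1", "s2", "s3", "s4"))

x_gene <- "GENE_A"
y_gene <- "GENE_B"

# [[ ]] with a column name returns that column as a vector
x1 <- toy[[x_gene]]
x2 <- toy[[y_gene]]

plot(x1, x2, xlab = x_gene, ylab = y_gene,
     main = paste(x_gene, "vs", y_gene))
```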

**Boxplots:**

We can also compare the expression of certain genes between certain groups using boxplots. 

[`boxplot()`](https://www.tutorialspoint.com/r/r_boxplots.htm) : generates a boxplot

Use `boxplot()` to plot the TYK2 gene expression by disease.

In [None]:
boxplot(COLUMN1 ~ COLUMN2, DATAFRAME)
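
The formula `y ~ group` reads as "y split by group". A toy version with invented expression values is sketched below; the column names are hypothetical.

```r
# Toy data: three made-up samples per disease group
toy <- data.frame(disease = rep(c("Control", "CD", "UC"), each = 3),
                  GENE_A = c(1.0, 1.2, 1.1, 2.0, 2.4, 2.2, 1.8, 1.6, 2.1),
                  stringsAsFactors = FALSE)

# Left of ~ is the value to plot, right of ~ is the grouping column
bp <- boxplot(GENE_A ~ disease, data = toy,
              xlab = "Disease", ylab = "GENE_A expression")

# The returned object records the group order and the group sizes
bp$names
bp$n
```

Groups are drawn in alphabetical order unless the grouping column is a factor with explicitly ordered levels.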

<div class="alert alert-block alert-info">
    <h3>R Practice for basic visualizations: boxplot</h3><br>
    <p>Generate a boxplot comparing STAT3 expression among the Control, CD and UC samples only.
    
Hints:
- You can look here for [examples](https://www.statmethods.net/graphs/boxplot.html) 
- Look at the [`subset=`](https://stackoverflow.com/questions/38908230/how-do-i-subset-a-box-plots-in-r) parameter in boxplot() which uses the format `dataframe$column %in% c(variable1, variable2)`
- You can also add a title, x and y labels, and colors
   </p>
</div>

<div class="alert alert-block alert-success">

<p>You should have 3 boxes for CD, Control and UC showing the expression of STAT3. </p>

</div>

### Visualizations using ggplot

In [None]:
library('ggplot2')

[ggplot2](https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html) works primarily with dataframes, so we have to supply ggplot with a dataframe. As we go through ggplot, a key thing to notice is how the plot can be continually enhanced by adding layers and themes (generally indicated by a ```+``` sign followed by another statement) to an existing plot. 

Here is a ggplot data visualization cheat sheet for your own reference: https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf

Now let's initialize a basic ggplot with our `merged_data` data frame, using the TYK2 and STAT3 columns of data. The `aes()` function is what we will use to specify the x and y axes.

In [None]:
ggplot(DATAFRAME, aes(x=, y=))

Do you see a blank ggplot? If so, don't worry, you did this step correctly. While we have x and y labels that match the columns we initialized the plot with, we don't see any plotted data. This is because ggplot does not make assumptions about the plot you are meaning to draw. Initializing the ggplot only tells ggplot what dataframe and what x and y columns from the dataframe should be used.

Now let's make a scatterplot. We will do this by adding on a layer using ```geom_point()```.

In [None]:
ggplot(merged_data, aes(x=TYK2, y=STAT3)) + 
    geom_point()

From here, we can add on a smoothing layer using the method "lm" to draw a line of best fit using `geom_smooth(method='lm')`.

In [None]:
ggplot(merged_data, aes(x=TYK2, y=STAT3)) +
    geom_point() +
    geom_smooth(method='lm')

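We can also adjust the x- and y-axis limits using `xlim(c(min, max))` and `ylim(c(min, max))`. Note that points outside these limits are dropped (with a warning) before the line of best fit is computed.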

In [None]:
ggplot(merged_data, aes(x=TYK2, y=STAT3)) + 
    geom_point() + 
    geom_smooth(method='lm') + 
    xlim(c(0, 5)) + 
    ylim(c(0, 5))

<div class="alert alert-block alert-info">
    <h3>R Practice for ggplot</h3><br>
    
Generate a box plot like we did before (STAT3 expression for Control, CD, UC) but use ggplot this time (`geom_boxplot()`). 
<br>

You can look here for [examples](http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization)
    
<br>
<p>Suggestions to include:</p>  
<ul>
<li>Ordering the box plots (Control, CD, UC)</li>
    <li>Labels for title, x and y axis (<code>labs()</code>)</li>
<li>Colors for each disease group</li>
    <li>Add points for the samples (look at <code>geom_point()</code> or <code>geom_jitter()</code>)</li>
<li>Font size</li>
<li>ggplot theme</li>
    </ul>

</div>

<div class="alert alert-block alert-success">

You should have three boxes showing the expression of STAT3 for Control, CD and UC.
    
</div>

### [Pheatmap](https://www.rdocumentation.org/packages/COMPASS/versions/1.10.2/topics/pheatmap)
Pheatmap enables you to draw clustered heatmaps with finer control over the graphical parameters.

In [None]:
library(pheatmap)

Let's see how the top genes identified in this [meta-analysis paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8440637/) perform in our dataset.
First, read in the file `data/PMC8440637_top_genes.txt` into a dataframe `gene_table` so we can see the top 10 up- and top 10 down-regulated genes from this paper.

In [None]:
# Get list of top genes from file
gene_table <- read.table('FILE',header=TRUE)
genes <- as.vector()
genes

Now let's create a new data frame `data` by getting the expression for these genes from the `expr2` data frame.

The pheatmap function typically has the genes as rows and samples as columns, so let's also use `t()` and `data.frame()` to transpose our dataframe into the correct format.

In [None]:
data <- data.frame(t(expr2[,genes]))
head(data)
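
To see what the subsetting and transposing do, here is the same pattern on a toy data frame with invented genes and values. A character vector inside the brackets selects columns by name, in the order given.

```r
# Toy expression table: samples as rows, genes as columns (made-up values)
toy <- data.frame(GENE_A = c(1, 2, 3),
                  GENE_B = c(4, 5, 6),
                  GENE_C = c(7, 8, 9),
                  row.names = c("s1", "s2", "s3"))

# Hypothetical genes of interest
toy_genes <- c("GENE_C", "GENE_A")

# Select those columns, then flip so that genes become rows.
# A name missing from the columns raises "undefined columns selected";
# intersect(toy_genes, colnames(toy)) is one way to guard against that.
toy_subset <- data.frame(t(toy[, toy_genes]))
toy_subset
```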

Plot the gene expression using `pheatmap()`.

In [None]:
pheatmap(data)

Let's increase the size so the heatmap isn't compressed using `options(repr.plot.width = #, repr.plot.height = #)`.

In [None]:
options(repr.plot.width = 14, repr.plot.height = 8)
pheatmap(data)

<div class="alert alert-block alert-info">
    <h3>R Practice for pheatmap</h3><br>
    <p>Let's change some of the parameters so we can try to explain what we're seeing with the data.</p>
<br>
    
You can look here for [examples](https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/)
    
<ul>
<li>Subset the samples so you're only showing the Control, CD and UC samples</li>
<li>Add annotations to the samples using the <code>meta</code> data frame (try adding one or more of the columns to see if any of the characteristics describe the change in gene expression for the samples)</li>
<li>Add annotations to the genes to show whether they were classified as up or down regulated (Hint: look in the file for the top genes)</li>
<li>You can also change some of the aesthetics</li>
</ul>
</div>

1. Create a new dataframe `sample_annotations` with only the Control, CD and UC samples from `meta`

*Hint: use this format `dataframe$column %in% c(variable1, variable2)`*

2. Create a new dataframe `data2` with only the Control, CD and UC samples from `data` 

*Hint: get a list of sample IDs for Control, CD and UC and use those sample IDs to filter `data`*

3. Create a new dataframe `gene_annotations` with the regulation information from the `gene_table` dataframe. 

Hints:
- Subset `gene_table` so it only has the Symbol and Regulation columns
- Change the rownames to the Symbol using `rownames(gene_annotations) <-`
- Remove the Symbol column after setting the rownames

*The resulting dataframe should have the genes as the rownames and one column for Regulation with up/down*

4. Plot the data using `pheatmap()`. 

Set the row (gene) and column (sample) annotations using the parameters `annotation_row=` and `annotation_col=`. 

For the sample annotations, try different subsets of the columns to see which of the sample metadata best explains the heatmap. Ex: you can try just gender, gender and disease, or other combinations.
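
The data-preparation steps above (1 to 3) can be sketched on toy data as follows. Every value, the `CRC` label, and the `toy_` names are hypothetical, and the final `pheatmap()` call is left out so that the sketch needs only base R.

```r
# Toy stand-ins for meta, data and gene_table (all values made up)
toy_meta <- data.frame(disease = c("Control", "CD", "UC", "CRC"),
                       gender = c("female", "male", "female", "male"),
                       row.names = c("s1", "s2", "s3", "s4"),
                       stringsAsFactors = FALSE)
toy_data <- data.frame(s1 = c(1, 5), s2 = c(2, 6), s3 = c(3, 7), s4 = c(4, 8),
                       row.names = c("GENE_A", "GENE_B"))
toy_gene_table <- data.frame(Symbol = c("GENE_A", "GENE_B"),
                             Regulation = c("up", "down"),
                             stringsAsFactors = FALSE)

# Step 1: keep only the rows whose disease is one of the wanted groups
keep <- toy_meta$disease %in% c("Control", "CD", "UC")
toy_sample_annotations <- toy_meta[keep, ]

# Step 2: use the remaining sample names to pick the matching columns
toy_data2 <- toy_data[, rownames(toy_sample_annotations)]

# Step 3: move the symbols into the row names, leaving one Regulation column
toy_gene_annotations <- toy_gene_table[, c("Symbol", "Regulation")]
rownames(toy_gene_annotations) <- toy_gene_annotations$Symbol
toy_gene_annotations$Symbol <- NULL
```

The row names of the two annotation data frames must match the column names and row names of the matrix being plotted, which is why both are derived from the same sample and gene identifiers.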

<div class="alert alert-block alert-warning">
<strong>Question:</strong> What results did you find from your heatmap? Do the results support the findings from the meta-analysis done in the paper that these are top up- and down-regulated genes in IBD?
</div>

*Type your findings here*

### Session Info

In [None]:
sessionInfo()