# Intro to Bioinformatics Applications of R

<div class="alert alert-block alert-info">
    
Let's practice using R to analyze data through data manipulation and visualization. We will use publicly available data ([GSE166925](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE166925)) from NCBI GEO, collected from patients with inflammatory bowel disease and colorectal cancer.

</div>

First, let's get the data using the `GEOquery` library.

### Libraries
Every time you load a library, R attaches that package's environment to its search path. The package's functions become available because R looks through the attached environments, in order, to resolve each function call. Loading a new library can therefore mask an existing function that has the same name. You can still call a specific package's version with the package specifier, `package::function()`.

Note: You can also suppress the messages printed when loading a library using [`suppressPackageStartupMessages()`](https://www.geeksforgeeks.org/how-to-disable-messages-when-loading-a-package-in-r/). 
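As a minimal sketch of masking and the package specifier, the example below uses only `stats`, which ships with R. The masking function defined here is purely illustrative and is removed at the end.

```r
# Attach a package quietly; 'stats' ships with R and is normally attached already.
suppressPackageStartupMessages(library(stats))

# Define a function whose name masks stats::median in the current environment.
median <- function(x) "masked!"

masked_result <- median(c(1, 5, 9))           # calls the masking version
explicit_result <- stats::median(c(1, 5, 9))  # package specifier picks the original

print(masked_result)
print(explicit_result)

# Remove the masking function so median() resolves to stats::median again.
rm(median)
```

The same `package::function()` pattern works for any installed package, whether or not it has been attached with `library()`.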

One of the benefits of using libraries is you don't need to rewrite code that already exists! We can use the `getGEO()` function from the `GEOquery` library to automatically get the data from GEO instead of manually trying to get the sample information.

The code is provided here, but we won't run it:

    # Get data
    suppressPackageStartupMessages(library(GEOquery))
    
    # If the download fails, you might need a larger connection buffer:
    # Sys.setenv(VROOM_CONNECTION_SIZE=500072)
    gse <- getGEO(GEO='GSE166925', destdir='data/')

Using the gse object we created, we can get the sample information/metadata and store it in a `meta` dataframe we can later refer to.

    # Generate data frame for metadata
    gse_data <- gse[[1]]
    
    columns <- c("title","disease:ch1","gender:ch1", "inflammation_status:ch1", "nancy_score:ch1",
                    "patient_age:ch1","site_taken:ch1")
    
    # Keep only the metadata columns listed above
    meta <- pData(gse_data)[, names(pData(gse_data)) %in% columns]

    # Save metadata table to file for later use
    write.table(meta, "./GSE166925-meta.tsv", sep='\t',quote=FALSE)

### [Data: GSE166925](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE166925)

Let's read in the sample metadata table that was saved to GSE166925-meta.tsv into a `meta` dataframe object. We can use the [`read.table()`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table) function to generate dataframes from files. Let's also use the title column as the index instead of the GSM IDs.
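The round trip can be sketched on a toy table. The sample titles, GSM-style IDs, and values below are made up for illustration, and a temporary file stands in for GSE166925-meta.tsv.

```r
# Toy metadata standing in for the real table (all values are made up).
toy_meta <- data.frame(
  title = c("sample_A", "sample_B", "sample_C"),
  disease = c("Control", "CD", "UC"),
  row.names = c("GSM0000001", "GSM0000002", "GSM0000003"),
  stringsAsFactors = FALSE
)

# Write it the same way the metadata was saved above.
meta_file <- tempfile(fileext = ".tsv")
write.table(toy_meta, meta_file, sep = "\t", quote = FALSE)

# Read it back. The header has one fewer field than the data rows,
# so read.table() uses the first column (the GSM IDs) as row names.
meta <- read.table(meta_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Use the title column as the index instead of the GSM IDs.
rownames(meta) <- meta$title
print(meta)
```

With the real file, note that `read.table()` by default rewrites column names that are not valid R names, so a name such as `disease:ch1` would be read as `disease.ch1`.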

Let's also read in the gene expression data table from the file GSE166925_salmon.genes.tpms.txt.gz into an `expr` dataframe. 

This file is available for download on the GEO series page.

We'll review this more in the RNA-seq module, but when looking at gene expression data, we usually use the log-transformed normalized expression for analysis. The file provided already has normalized data (TPM is given in the filename), so we can take the log using `apply()`.
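A minimal sketch of the log transform on a toy table follows. The choice of `log2(x + 1)` (a pseudocount of 1 avoids taking the log of zero) is an assumption, since the text does not say which base or pseudocount is used, and the gene IDs and values are made up.

```r
# Toy TPM table: one ID column plus two sample columns (values are made up).
expr <- data.frame(
  gene_id = c("gene_1", "gene_2", "gene_3"),
  sample_A = c(0, 3, 15),
  sample_B = c(1, 7, 31),
  stringsAsFactors = FALSE
)

# Apply log2(x + 1) to every sample column; MARGIN = 2 means column-wise.
sample_cols <- setdiff(names(expr), "gene_id")
expr[, sample_cols] <- apply(expr[, sample_cols], 2, function(x) log2(x + 1))
print(expr)
```

Because `log2()` is vectorized, `log2(expr[, sample_cols] + 1)` would give the same values; `apply()` is shown here because it is the function named in the text.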

The provided file only has the gene IDs (ENSG IDs) but it's easier to use the gene names/symbols, so let's find a library that can add the gene names.

This created a new dataframe `geneIDs` with matching ENSG IDs and gene names. To get the gene names into our `expr` dataframe, we can use the `merge()` function. 

The [`merge()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge) function allows you to merge two dataframes by specifying columns that are matching. 

In our case, the 'gene_id' column in the `expr` dataframe and the 'GENEID' column in the `geneIDs` dataframe both contain the ENSG IDs, so we can merge on those columns.
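The merge can be sketched on two toy data frames. The `SYMBOL` column name, the placeholder IDs, and the values are assumptions for illustration; only `gene_id` and `GENEID` come from the text.

```r
# Toy expression table keyed by 'gene_id' (placeholder IDs, made-up values).
expr <- data.frame(
  gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  sample_A = c(1.2, 0.0, 3.4),
  stringsAsFactors = FALSE
)

# Toy lookup table keyed by 'GENEID'; the 'SYMBOL' column name is hypothetical.
geneIDs <- data.frame(
  GENEID = c("ENSG_B", "ENSG_A", "ENSG_D"),
  SYMBOL = c("GeneB", "GeneA", "GeneD"),
  stringsAsFactors = FALSE
)

# by.x names the key column in the first table, by.y the key in the second.
expr2 <- merge(expr, geneIDs, by.x = "gene_id", by.y = "GENEID")
print(expr2)
```

By default `merge()` keeps only rows whose key appears in both tables, so `ENSG_C` and `ENSG_D` are dropped here; `all.x = TRUE` would keep every row of `expr` and fill missing symbols with `NA`.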

### Visualizations using Basic Plots
Let's explore the sample metadata by creating some plots using the `meta` dataframe.

Let's first look at the distribution of patient ages in our dataset.

**Histograms:**

[`hist()`](https://www.datamentor.io/r-programming/histogram/) : generates a histogram
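A minimal sketch with made-up ages is shown below. The column name `patient_age` is hypothetical, and the ages are stored as text here because metadata values may arrive as character strings, in which case they need converting before plotting.

```r
# Toy metadata with ages stored as text (values are made up).
meta <- data.frame(
  patient_age = c("23", "35", "41", "58", "62", "29", "47"),
  stringsAsFactors = FALSE
)

# hist() needs numeric input, so convert first.
ages <- as.numeric(meta$patient_age)

# hist() draws the plot and invisibly returns the bin breaks and counts.
age_hist <- hist(ages, main = "Distribution of patient ages",
                 xlab = "Patient age", col = "steelblue")
print(age_hist$counts)
```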

Next, generate a bar plot of patient gender using [`barplot()`](https://www.statmethods.net/graphs/bar.html).

The input for `barplot()` is a table of counts for each condition, not a list of the conditions. 

Ex: instead of a list/vector with "female", "female", "female", "male", we need to use the [`table()`](https://www.statology.org/table-function-in-r/) function to get the counts, female: 3 and male: 1.
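The example above can be sketched directly, using the same four values:

```r
# A vector of conditions, one entry per sample.
gender <- c("female", "female", "female", "male")

# table() converts the vector into counts per condition.
gender_counts <- table(gender)
print(gender_counts)

# barplot() takes the counts, not the raw vector.
barplot(gender_counts, main = "Samples by gender",
        xlab = "Gender", ylab = "Number of samples")
```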

<div class="alert alert-block alert-info">
    <h3>R Practice for basic visualizations: barplot</h3><br>
    
<p>Generate a stacked bar plot using [`barplot()`](https://www.statmethods.net/graphs/bar.html) of the disease and inflammation status. The x-axis should be divided by disease, and each disease will be subdivided into inflamed/uninflamed.
    
[Hints:](https://www.statmethods.net/graphs/bar.html)
- Generate the counts table with disease and inflammation status
- You can use the `beside=` parameter when calling `barplot()` so the groups aren't stacked on top of each other
- Use the `legend=` parameter to include the legend in the plot
- You can also add title (`main=`), x-axis label (`xlab=`), y-axis label (`ylab=`), colors (`col=`)</p>

</div>

**Scatter plots:**

Let's make a scatter plot for the first two genes in our `expr2` data frame.

[`plot()`](https://www.w3schools.com/r/r_graph_plot.asp) : general function that creates various plots depending on the input. Given two numeric vectors, it generates a scatter plot.

Note: we just need the gene expression values, not the gene ID and symbol, so we need to subset the data by using `-c()` to designate which columns we don't need.

There's an error! Let's make sure the format of our data matches what is needed for the input.

We can add a title to our plot with the genes by using [`paste()`](https://www.digitalocean.com/community/tutorials/paste-in-r). `paste()` allows you to combine variables and/or strings into a single string. 
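The steps above can be sketched on a toy `expr2`. This assumes the gene ID and symbol are the first two columns and that the symbol column is called `SYMBOL`; both are assumptions, as are the values. One likely cause of the error is that a row taken from a data frame is still a data frame, not a numeric vector.

```r
# Toy expr2: two annotation columns followed by sample columns (made-up values).
expr2 <- data.frame(
  gene_id = c("ENSG_A", "ENSG_B"),
  SYMBOL = c("GeneA", "GeneB"),
  sample_1 = c(1.0, 2.1),
  sample_2 = c(2.0, 3.9),
  sample_3 = c(3.0, 6.2),
  stringsAsFactors = FALSE
)

# Drop the two annotation columns; each result is still a one-row data frame.
gene1_df <- expr2[1, -c(1, 2)]
gene2_df <- expr2[2, -c(1, 2)]

# Convert the rows to plain numeric vectors before plotting.
gene1 <- as.numeric(gene1_df)
gene2 <- as.numeric(gene2_df)

# paste() joins its arguments with a space by default.
plot_title <- paste(expr2$SYMBOL[1], "vs", expr2$SYMBOL[2])
plot(gene1, gene2, main = plot_title,
     xlab = expr2$SYMBOL[1], ylab = expr2$SYMBOL[2])
```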

We can also plot data using gene names. Let's plot FGR and GCLC. 
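Selecting rows by gene name can be sketched as follows. The `SYMBOL` column name and the position of the annotation columns are assumptions, and the expression values are made up rather than taken from the dataset.

```r
# Toy expr2 with made-up expression values for three genes.
expr2 <- data.frame(
  gene_id = c("id_1", "id_2", "id_3"),
  SYMBOL = c("FGR", "GCLC", "TYK2"),
  sample_1 = c(2.0, 4.5, 5.1),
  sample_2 = c(3.5, 4.0, 5.6),
  sample_3 = c(1.5, 5.2, 4.9),
  stringsAsFactors = FALSE
)

# Pick each gene's row by symbol, drop the annotation columns, convert to numeric.
fgr <- as.numeric(expr2[expr2$SYMBOL == "FGR", -c(1, 2)])
gclc <- as.numeric(expr2[expr2$SYMBOL == "GCLC", -c(1, 2)])

plot(fgr, gclc, main = paste("FGR", "vs", "GCLC"), xlab = "FGR", ylab = "GCLC")
```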

**Boxplots**

We can also compare the expression of certain genes between certain groups using boxplots. 

[`boxplot()`](https://www.tutorialspoint.com/r/r_boxplots.htm) : generates a boxplot

Let's create a new dataframe for the gene 'TYK2', where the gene expression is matched to the sample metadata.
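One way to build such a dataframe is sketched below on toy data. It assumes the sample columns of `expr2` are named after the sample titles used as row names in `meta`, and that the metadata has a `disease` column; the column names and values are illustrative.

```r
# Toy expression row for TYK2 and toy metadata (made-up values).
expr2 <- data.frame(
  gene_id = "id_1", SYMBOL = "TYK2",
  sample_A = 4.1, sample_B = 5.3, sample_C = 4.4, sample_D = 5.9,
  stringsAsFactors = FALSE
)
meta <- data.frame(
  disease = c("Control", "CD", "Control", "CD"),
  row.names = c("sample_A", "sample_B", "sample_C", "sample_D"),
  stringsAsFactors = FALSE
)

# Turn the TYK2 row into a one-column data frame indexed by sample name.
tyk2_row <- expr2[expr2$SYMBOL == "TYK2", -c(1, 2)]
tyk2 <- data.frame(TYK2 = as.numeric(tyk2_row), row.names = names(tyk2_row))

# Look up each sample's disease group by row name so the two stay matched.
tyk2$disease <- meta[rownames(tyk2), "disease"]
print(tyk2)

# The formula TYK2 ~ disease draws one box per disease group.
boxplot(TYK2 ~ disease, data = tyk2, main = "TYK2 expression by disease",
        xlab = "Disease", ylab = "TYK2 expression")
```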

<div class="alert alert-block alert-info">
    <h3>R Practice for basic visualizations: boxplot</h3><br>
    <p>Generate a boxplot comparing STAT3 between Control vs CD vs UC samples only
    
Hints:
- Look at the [`subset=`](https://stackoverflow.com/questions/38908230/how-do-i-subset-a-box-plots-in-r) parameter in boxplot()
- Can also add title, x and y labels, colors
    </p>

</div>

<div class="alert alert-block alert-success">

<p>You should have 3 boxes for CD, Control and UC showing the expression of STAT3. </p>

</div>

### Visualizations using ggplot

    library('ggplot2')

ggplot2 works primarily with dataframes, so we have to supply a dataframe when creating a plot. As we go through ggplot, a key thing to notice is how a plot can be continually enhanced by adding layers and themes to it (generally indicated by a `+` sign followed by another statement).

Here is a ggplot data visualization cheat sheet for your own reference: https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf

Let's first create a new data frame as before with both TYK2 and STAT3.

Now let's initialize a basic ggplot with our `data` data frame, using the TYK2 and STAT3 columns. The `aes()` function is what we will use to specify the x and y axes.

Do you see a blank ggplot? If so, don't worry, you did this step correctly. While we have x and y labels that match the columns we initialized the plot with, we don't see any plotted data. This is because ggplot does not make assumptions about the plot you are meaning to draw. Initializing the ggplot only tells ggplot what dataframe and what x and y columns from the dataframe should be used.
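A minimal sketch of this initialization step, using a toy `data` data frame with made-up TYK2 and STAT3 values:

```r
library(ggplot2)

# Toy data frame: one row per sample, one column per gene (made-up values).
data <- data.frame(
  TYK2 = c(4.1, 5.3, 4.4, 5.9, 5.0),
  STAT3 = c(6.0, 7.4, 6.3, 8.1, 7.0)
)

# Initialize the plot: data frame plus the x/y mapping, but no layers yet.
p <- ggplot(data, aes(x = TYK2, y = STAT3))
print(p)  # draws labeled axes with no data points
```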

Now let's make a scatterplot. We will do this by adding on a layer using `geom_point()`.

From here, we can add on a smoothing layer using the method "lm" to draw a line of best fit. 

We can also adjust the x and y limits.
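The layering described above can be sketched in one chain on toy data. The values and the particular axis limits are made up for illustration.

```r
library(ggplot2)

# Toy data frame with made-up expression values.
data <- data.frame(
  TYK2 = c(4.1, 5.3, 4.4, 5.9, 5.0),
  STAT3 = c(6.0, 7.4, 6.3, 8.1, 7.0)
)

p <- ggplot(data, aes(x = TYK2, y = STAT3)) +
  geom_point() +                                 # scatter plot layer
  geom_smooth(method = "lm", formula = y ~ x) +  # line of best fit
  xlim(3, 7) +                                   # x-axis limits
  ylim(5, 9)                                     # y-axis limits
print(p)
```

Note that `xlim()` and `ylim()` drop any points outside the range before the line is fitted; `coord_cartesian(xlim = ..., ylim = ...)` zooms without removing data, which may be preferable when the limits cut into the data.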

<div class="alert alert-block alert-info">
    <h3>R Practice for ggplot</h3><br>
<p>Generate a box plot like we did before (STAT3 expression for Control, CD, UC) but use ggplot this time. </p>

</div>

<div class="alert alert-block alert-success">

<p>In addition to the suggestions from before, you can also try changing the ggplot themes. </p>

<p>Suggestions to include:</p>  
<ul>
<li>Ordering the box plots (Control, CD, UC)</li>
<li>Labels for title, x and y axis</li>
<li>Colors for each disease group</li>
<li>Font size</li>
<li>ggplot theme</li>

</ul>
</div>

### [Pheatmap](https://www.rdocumentation.org/packages/COMPASS/versions/1.10.2/topics/pheatmap)

    library(pheatmap)

Let's see how the top genes identified in this [meta-analysis paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8440637/) perform in our dataset.
First, read in the file PMC8440637_top_genes.txt into a table so we can see the top 10 up- and top 10 down-regulated genes from this paper.

Now let's create a new data frame by getting the expression for these genes from the `expr2` data frame.

Plot the gene expression using `pheatmap()`.
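A minimal sketch of the call is shown below on a random toy matrix. The name `data2` and the genes-as-rows, samples-as-columns layout are assumptions about the data frame built in the previous step.

```r
library(pheatmap)

# Toy expression matrix: genes as rows, samples as columns (random values).
set.seed(1)
data2 <- matrix(
  rnorm(24), nrow = 4,
  dimnames = list(paste0("gene_", 1:4), paste0("sample_", 1:6))
)

# pheatmap() clusters rows and columns by default and draws the heatmap.
heat <- pheatmap(data2)
```

If the expression values are held in a data frame rather than a matrix, make sure it contains only numeric columns, with gene names as row names, before passing it in.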

Let's increase the size so the heatmap isn't compressed.
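One way to do this, assuming the notebook is running in Jupyter with the R kernel, is to set the plot size options before plotting. The width and height values below are arbitrary.

```r
# Set the default plot size (in inches) for subsequent plots in a Jupyter R notebook.
# These options are read by the notebook's display machinery; outside Jupyter
# they are stored but have no visible effect.
options(repr.plot.width = 12, repr.plot.height = 8)

print(getOption("repr.plot.width"))
print(getOption("repr.plot.height"))
```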

<div class="alert alert-block alert-info">
    <h3>R Practice for pheatmap</h3><br>
    <p>Let's change some of the parameters so we can try to explain what we're seeing with the data.</p>
    <ul>
    <li>Subset the samples so you're only showing the Control, CD and UC samples</li>
    <li>Add annotations to the samples using the meta data frame (try adding one or more of the columns to see if any of the characteristics describe the change in gene expression for the samples)</li>
    <li>Add annotations to the genes to show whether they were classified as up or down regulated (Hint: look in the file for the top genes)</li>
    <li>Can also change some of the aesthetics</li>
    </ul>
</div>

1. Create a new dataframe `sample_annotations` with only the Control, CD and UC samples from `meta`

2. Create a new dataframe `data3` with only the Control, CD and UC samples from `data2` 

*Hint: get a list of sample IDs for Control, CD and UC and use those sample IDs to filter `data2`*

3. Create a new dataframe `gene_annotations` with the regulation information from the `gene_table` dataframe. 

Hints:
- Subset `gene_table` so it only has the Symbol and Regulation columns
- Change the rownames to the Symbol
- Remove the Symbol column after setting the rownames
- The resulting dataframe should have the genes as the rownames and one column for Regulation with up/down

4. Plot the data using `pheatmap()`. Set the row (gene) and column (sample) annotations using the parameters `annotation_row=` and `annotation_col=`. For the sample annotations, try different subsets of the columns to see which of the sample metadata best explains the heatmap. Ex: you can try subsets of gender, gender and disease, or other combinations.
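The four steps can be sketched on toy data so that the real result is left for you to find. The `disease` and `gender` column names, the "Other" group label, the gene names, and all values are made up; only `Symbol`, `Regulation`, and the dataframe names come from the steps above.

```r
library(pheatmap)

# Toy inputs: random expression, sample metadata, and a gene table.
set.seed(1)
data2 <- matrix(
  rnorm(24), nrow = 4,
  dimnames = list(c("GeneA", "GeneB", "GeneC", "GeneD"), paste0("sample_", 1:6))
)
meta <- data.frame(
  disease = c("Control", "CD", "UC", "Other", "CD", "Control"),
  gender = c("female", "male", "female", "male", "male", "female"),
  row.names = colnames(data2),
  stringsAsFactors = FALSE
)
gene_table <- data.frame(
  Symbol = rownames(data2),
  Regulation = c("up", "up", "down", "down"),
  stringsAsFactors = FALSE
)

# 1. Keep only the Control, CD and UC samples from the metadata.
sample_annotations <- meta[meta$disease %in% c("Control", "CD", "UC"), ]

# 2. Use those sample IDs to filter the expression columns.
data3 <- data2[, rownames(sample_annotations)]

# 3. Build gene annotations: genes as row names, one Regulation column.
gene_annotations <- gene_table[, c("Symbol", "Regulation")]
rownames(gene_annotations) <- gene_annotations$Symbol
gene_annotations$Symbol <- NULL

# 4. Plot with row (gene) and column (sample) annotations.
heat <- pheatmap(
  data3,
  annotation_row = gene_annotations,
  annotation_col = sample_annotations[, c("disease", "gender")]
)
```

The row names of `annotation_col` must match the column names of the plotted data, and the row names of `annotation_row` must match its row names; otherwise the annotations will not line up with the heatmap.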

<div class="alert alert-block alert-warning">
<strong>Question:</strong> What results did you find from your heatmap? Do the results support the findings from the meta-analysis done in the paper that these are the top up- and down-regulated genes in IBD?
</div>