# Bioinformatics 101

In this notebook, we will manipulate gene expression (RNA-seq) data. The data is from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE183947. The notebook follows [this YouTube playlist](https://www.youtube.com/playlist?list=PLJefJsd1yfhbIhblS-85alaFsPdU00DaA) by Bioinformagician.

## Importing libraries and getting data

We set print options because Jupyter Notebook truncates wide tables by default.

In [1]:
options(repr.matrix.max.cols=50, repr.matrix.max.rows=100)

In [2]:
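library(dplyr)
library(tidyverse)  # also attaches dplyr, tidyr, ggplot2 and other core packages
# GEOquery is a Bioconductor package; if it is missing, install it with
# BiocManager::install("GEOquery")
library(GEOquery)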


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


ERROR: Error in library(GEOquery): there is no package called ‘GEOquery’


In [None]:
dat <- read.csv(file = '../data/GSE183947_fpkm.csv')

In [None]:
dim(dat)

The website that provided the data describes it in more detail. The data consists of 30 samples with breast cancer and 30 samples without cancer.

Looking at the data, it has 20246 genes and 60 samples. Each value is the expression level of a gene in a sample, normalized as [FPKM](https://docs.gdc.cancer.gov/Encyclopedia/pages/FPKM/#:~:text=Description,total%20number%20of%20mapped%20reads.) (Fragments Per Kilobase of transcript per Million mapped reads).
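
To make the FPKM definition concrete, here is a minimal sketch that computes it by hand. All gene names, lengths, and counts are invented for illustration, and the total number of mapped fragments is assumed to be the sum of the toy counts.

```r
# Toy example: every number below is invented for illustration.
counts  <- c(geneA = 500, geneB = 1500, geneC = 8000)   # fragments mapped to each gene
lengths <- c(geneA = 1000, geneB = 3000, geneC = 2000)  # transcript length in base pairs

# Assumption: the whole library consists of these three genes.
total_mapped <- sum(counts)

# FPKM = fragments * 1e9 / (transcript length in bp * total mapped fragments)
fpkm <- counts * 1e9 / (lengths * total_mapped)
print(fpkm)
```

In this toy case `geneA` and `geneB` end up with the same FPKM even though `geneB` has three times as many fragments, because it is also three times as long.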

In [None]:
head(dat)

After looking at the data, we fetch the metadata to find out which samples are cancerous and which are not. `getGEO` downloads it for us.

In [None]:
gse <- getGEO(GEO = 'GSE183947')

In [None]:
gse

From this, we can get the metadata.

In [None]:
phenoData(gse[[1]])

In [None]:
metadata <- pData(phenoData(gse[[1]]))
head(metadata, 1)

There are many columns, but we only care about the 1st, 10th, 11th, and 17th.

In [None]:
metadata.subset <- select(metadata, c(1,10,11,17))
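
Selecting by position breaks silently if the column order ever changes, so selecting by name can be safer. The sketch below shows that both forms give the same result on a small stand-in table; `toy_meta` and its values are made up, and its column order does not match the real metadata.

```r
library(dplyr)

# Hypothetical stand-in for the GEO metadata (values are invented).
toy_meta <- data.frame(
  title = c("sample 1", "sample 2"),
  status = c("Public", "Public"),
  characteristics_ch1 = c("tissue: breast tumor", "tissue: normal breast tissue"),
  description = c("S1", "S2"),
  stringsAsFactors = FALSE
)

# Select by position, as in the cell above ...
by_position <- select(toy_meta, c(1, 3, 4))

# ... or by name, which keeps working if columns are reordered.
by_name <- select(toy_meta, title, characteristics_ch1, description)

print(identical(by_position, by_name))
```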

In [None]:
metadata.subset[1,]

In [None]:
head(metadata.subset)

## Using pipe

We want to change the names of columns and values because they are horrible. What do `characteristics_ch1` and `characteristics_ch1.1` mean? We can change them by using pipes (`%>%`). By putting `%>%` at the end of each line, the result of that line is passed to the following line as its first argument.

Here are the functions used in the pipes:
  - `select` selects columns from `metadata`.
  - `rename` renames columns.
  - `mutate` creates new columns or overwrites existing ones.
  - `gsub` replaces a pattern. In this case, it replaces `"tissue: "` with `""`, which deletes the matching string from the `tissue` column.
  - `head` returns the first 6 rows.
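
As a small illustration of how the pipe chains these functions, the sketch below runs the same cleanup on a made-up table and checks that the piped version matches the nested function calls. `toy_meta` and its values are invented for this example.

```r
library(dplyr)

# Hypothetical stand-in for the GEO metadata (values are invented).
toy_meta <- data.frame(
  title = c("sample 1", "sample 2"),
  characteristics_ch1 = c("tissue: breast tumor", "tissue: normal breast tissue"),
  stringsAsFactors = FALSE
)

# Piped version: each result becomes the first argument of the next call.
piped <- toy_meta %>%
  rename(tissue = characteristics_ch1) %>%
  mutate(tissue = gsub("tissue: ", "", tissue))

# The same steps written as nested calls, read from the inside out.
nested <- mutate(
  rename(toy_meta, tissue = characteristics_ch1),
  tissue = gsub("tissue: ", "", tissue)
)

print(piped)
print(identical(piped, nested))
```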

In [None]:
metadata %>%
    select(1,10,11,17) %>%
    rename(tissue = characteristics_ch1) %>%
    rename(metastasis = characteristics_ch1.1) %>%
    mutate(tissue = gsub('tissue: ', '', tissue)) %>%
    mutate(metastasis = gsub('metastasis: ', '', metastasis)) %>%
    head()

After checking that it has the right format, we assign it to `metadata.modified`.

In [None]:
metadata.modified <- metadata %>%
    select(1,10,11,17) %>%
    rename(tissue = characteristics_ch1) %>%
    rename(metastasis = characteristics_ch1.1) %>%
    mutate(tissue = gsub("tissue: ", "", tissue)) %>%
    mutate(metastasis = gsub("metastasis: ", "", metastasis))

In [None]:
metadata.modified[1:3,]

Right now, our data is in a wide format, which means each sample has its own column. We will reshape the data into a long format, with one row per gene and sample, which is easier to work with. We will use the `gather` function to reshape the data by providing names for the key and value columns. `-gene` leaves the gene column alone.
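
Here is a minimal sketch of the reshape on a made-up table with two genes and two samples. It also shows `pivot_longer`, which the tidyr documentation recommends over the superseded `gather`; the toy names and values are invented.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per gene, one column per sample.
toy_wide <- data.frame(
  gene = c("geneA", "geneB"),
  S1 = c(1.5, 0.0),
  S2 = c(2.5, 4.0),
  stringsAsFactors = FALSE
)

# gather(): column names go into 'samples', cell values into 'FPKM'.
toy_long <- toy_wide %>%
  gather(key = "samples", value = "FPKM", -gene)

# pivot_longer(): the newer equivalent (rows come back in a different order).
toy_long2 <- toy_wide %>%
  pivot_longer(cols = -gene, names_to = "samples", values_to = "FPKM")

print(toy_long)
print(dim(toy_long))
```

A wide table with g genes and s samples becomes g × s rows, so 2 genes and 2 samples give 4 rows here. If `dat` has 20246 genes and 60 sample columns, `dat.long` should have 1,214,760 rows.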

In [None]:
head(dat)

In [None]:
dat.long <- dat %>%
    rename(gene = X) %>%
    gather(key = 'samples', value = 'FPKM', -gene)

In [None]:
head(dat.long)

In [None]:
dim(dat.long)

Now we will join `dat.long` and `metadata.modified` so that we have gene expression and cancer status in one dataframe. We use `left_join`, which keeps every row of the left dataframe (`dat.long` in this case) and adds the matching columns from the right one. We first take a look at the result of the pipeline with `head`, then assign the result to `dat.long`.
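
The behavior of `left_join` can be sketched on two made-up tables. The names `toy_long` and `toy_meta` and their values are invented; sample `S3` deliberately has no metadata row, to show what happens to unmatched rows.

```r
library(dplyr)

# Hypothetical long expression table.
toy_long <- data.frame(
  gene = c("geneA", "geneA", "geneA"),
  samples = c("S1", "S2", "S3"),
  FPKM = c(1.5, 2.5, 0.7),
  stringsAsFactors = FALSE
)

# Hypothetical metadata; note there is no row for S3.
toy_meta <- data.frame(
  description = c("S1", "S2"),
  tissue = c("breast tumor", "normal breast tissue"),
  stringsAsFactors = FALSE
)

# Match 'samples' on the left with 'description' on the right.
toy_joined <- toy_long %>%
  left_join(toy_meta, by = c("samples" = "description"))

print(toy_joined)
```

All three rows of the left table survive, and the unmatched sample gets `NA` in `tissue`. Checking for `NA` after a join is a quick way to spot sample names that do not line up between the two tables.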

In [None]:
dat.long %>%
    left_join(., metadata.modified, by = c('samples' = 'description')) %>%
    head()

In [None]:
dat.long <- dat.long %>%
    left_join(., metadata.modified, by = c('samples' = 'description'))

In [None]:
head(dat.long)

## Explore data

We can now explore the data. Since we are interested in breast cancer, we want to look at the `BRCA1` and `BRCA2` genes, which are tumor suppressor genes.

Pipeline:
  - `filter` keeps the rows that match a condition.
  - `group_by` groups by the gene and tissue columns so that we can calculate values per group.
  - `summarise` makes a summary by calculating the mean and median of each group.
  - `arrange` sorts by values. `-mean_FPKM` sorts in descending order.
  - `head` returns the first 6 rows of the result.

In [None]:
dat.long %>%
    filter(gene == 'BRCA1' | gene == 'BRCA2') %>%
    group_by(gene, tissue) %>%
    summarise(mean_FPKM = mean(FPKM),
              median_FPKM = median(FPKM)) %>%
    arrange(-mean_FPKM) %>%
    head()
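
To see what this pipeline returns without the real data, here is a sketch on a made-up table. The FPKM values are invented. Two choices here are assumptions rather than part of the original pipeline: `.groups = "drop"`, which removes the grouping from the result, and `desc()`, which is an equivalent way to sort in descending order.

```r
library(dplyr)

# Hypothetical long table with invented FPKM values.
toy_long <- data.frame(
  gene = c("BRCA1", "BRCA1", "BRCA1", "BRCA1", "TP53", "TP53"),
  tissue = c("tumor", "tumor", "normal", "normal", "tumor", "normal"),
  FPKM = c(10, 14, 4, 6, 30, 20),
  stringsAsFactors = FALSE
)

toy_summary <- toy_long %>%
  filter(gene %in% c("BRCA1", "BRCA2")) %>%   # same as gene == 'BRCA1' | gene == 'BRCA2'
  group_by(gene, tissue) %>%
  summarise(mean_FPKM = mean(FPKM),
            median_FPKM = median(FPKM),
            .groups = "drop") %>%             # return an ungrouped table
  arrange(desc(mean_FPKM))                    # same ordering as arrange(-mean_FPKM)

print(toy_summary)
```

The result has one row per gene and tissue combination that is present after filtering, so the toy table yields two rows, with the higher mean first.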