08_R_functions_consensus_peaks.Rmd

---
title: "Learning functions in R to create consensus peaks"
author: "JR"
date: "11/29/2020"
output: html_document
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(GenomicRanges)
source("util/intersect_functions.R")
```

 
First we are going to go over a "function" in R. This is a really nice feature of 
defining a function for R that you may run frequently. For example we have over 
1,000 peak files we have to import into R using the function below. 

We need a few minimum aspects in defining a function:
-note we use #' markdown for a function.

@description: describes the function
@param is a paramater that the function needs. 

We have used many of R's built in base functions that require paramters too.


R has many built in functions such as mutate that needs parameters. For example,
table and summary we used before requires a parameter of an object to summarize etc.
But the cool part is you can make your own functions just like these base R funcitons. 


How and why would we write a function in R?

If you're finding yourself doing the same set of tasks over and over and over, this
can be wrapped into function. We saw an example of this in bash with the what-time-is-it.sh
script.

Now, let's see what it looks like in R. We know what a function is from math -- it takes
inputs and maps it to outputs, let's take a look at a classic polynomial function in R.

```{r}
# This function has two parameters, and we have to let R know
# that we want to be able to change the values of x and y.
# f(x) = x * y
# We also have to decide on a name for the function, we will call it 'fun'
fun <- function(x, y) { }
# Inside the curly braces is where we'll put the operations that the
# function will carry out.
# Then finally we will have the function return a value
fun <- function(x, y) {
  ans <- x * y
  return(ans)
}

fun(2,-4)
```

When creating a function, you can also add documentation about what that function 
does and the data types and expectationn for the the parameters it takes.

```{r}
#' A function to multiply two numbers
#'
#' @description 
#' This function will multiply the input values of X and Y
#' 
#' @param x one number you'd like to multiply
#' @param y the other number you'd like to multiply
fun <- function(x, y) {
  ans <- x * y
  return(ans)
}
```


Note that if you were creating an R package (All the libraries/packages we load into R
were created by grad students like yourself!).

Thus, the formatting above the function will compile into the nice documentation 
that you see in the help viewer pane (? fun-name).

For the purposes of this class project's organization, we'll put most of the functions
that we use in multiple analysis RMarkdown files, into the `../util` directory. 

Now let's look at the functions involved in creating consensus peaks.


```{r}
#' intersect replicates into a "consensus peak list" 
#' 
#' @description 
#' this function will take the  union of peak widths across replicates for a given
#' DNA binding protein. the function that will take a list of granges objects and return 
#  one granges object with merged peaks that are in all replicates
#' 
#' @param 
#'  the path to consensus peak files
#' # We're going to iterate over all the files to make it work. 

```


The only parameter for the create_consensus_peaks is the file path to the peak files.

So basically we are making a new R function (it's in util already under create_consensus_peaks).
Below we will go through each step of the function that occcurs in between { }. 

```{r}
create_consensus_peaks <- function(broadpeakfilepath = "/scratch/Shares/rinnclass/ .... <path to your broadpeaks") {
  
  fl <- list.files(broadpeakfilepath, 
                   full.names = TRUE)
  
  
  #fl <- fl[grep("broadPeak", fl)]
  #This is for safety if you want to grab the right file type from 'macs' output.

# fl: this is just going to be a list of the peak file names
# this used the R function "list.files" -- take a look:
# ?list.files
# So we have now read in all the file names in the paramater path.


## Here we are reading in files first again and putting them in the object "create_consensus_peaks"
## We are using grep to just grab broadPeak files. This is because the macs folder has many different
## versions of peak files. It is just useful to be sure it grabs the right file type.
## In our case the folder all_peak_files is only broadPeak files so a bit redundant....for safety.
  
  tf_name <- sapply(fl, function(x){

## The first command sapply (? sapply) needs a parameter of what to work on (fl or list of file names)
## then sapply wants to perform a funciton -- here we set the function to carry out the content in { }
## Here these commands are being executed in the function(x) that is anonymous or
## unnamed function. 
    
    file_name <- str_extract(x, "[\\w-]+\\.broadPeak")
    str_extract(file_name, "^[^_]+(?=_)")
  })
  
## Here we just named the file with str_extract and regex.
# ?str_extract (need tidyverse library loaded)
# The paramaters are:
  
# string: the input vector of coerciable data (e.g. character string) -- peak_files
# here we are using x that is being read in by the function which is an individual file name.

  
# pattern: the pattern to look for across the input vector. this is using regular
# expression (regex). You don't need to know this much regex, but if you are ever 
# curious you can use the online regex calculator :)
# 
# https://regexr.com/
# 
# if you paste in the two regex commands below it will tell you what they are looking for.

## Now our goal is to associate the replicate files with a unique DBP name. So 
## 'unique_tf' will only provide one DBP name -- but will later be associated with
## each replicate file associated with that DBP name.

  unique_tf <- unique(tf_name)

## ?unique
## here we see the unique function provide information to index each DBP as a unique 
## entry.

 
  consensus_peaks <- list()
  
  # This for loop will iterate over all dna binding proteins.
  for (i in 1:length(unique_tf)) {
  # this is why we set up unique_tf above so the for loop will only run on one DBP
  # at a time, and merges the peaks of the replicate files associated with each DBP.
  # then the for loop will move onto the next unique_tf and merge those peak files.
    
    dbp <- unique_tf[i]
    
    cat("Now I'm running", dbp, "\n")
    # this is just a fun way of seeing the progress of the for loop -- also good
    # for debugging when a file fails -- you know the last successful run.
    
    tf_files <- fl[grep(dbp, fl)]
    # here we are creating dbp to use in grep to grab the dbp entry and find it in fl
    # this way we only intersect the replicate files that have the same dbp name.
    # this is why we indexed into unique_tf[i]. Then the next unique DBP name will
    # run in the next iteration of the loop.
    
    if (length(tf_files) > 1) {
    
    # turns out some files don't have replicates :( so we are saying here that if
    # the current (i) entry for 'tf_files' is equal or less than 1 to stop running.
    # we can't intersect peaks if there is only one peak file :)
      
    peak_list <- c()
    
    # making a new object peak list that will be populated by the loop below.
    # we leave it empty until the loop runs by simpling putting c()
    
    for (j in 1:length(tf_files)) {
    
## This for loop is going to go through each peak file and store it in peak_list
      
      peak_list <- c(peak_list, read_peaks(tf_files[j]))
    }
    
    # read_peaks is a function we already made in .util. It's pretty basic and 
    # commented out below
    
    # read_peaks <- function(broad_peak_file, filter_to_canonical_chr = TRUE) {
  
    # dat <- read.table(broad_peak_file, sep = "\t")
  # if(filter_to_canonical_chr == TRUE) {
    # dat <- dat[dat$V1 %in% c(paste0("chr", 1:22), "chrM", "chrX", "chrY"),]
 # }
  # gr <- GRanges(seqnames = dat$V1,
               # ranges = IRanges(start=dat$V2,end=dat$V3))
  #return(gr)
#}
    
## Read peaks is a good example of 'ranges' in action. notice the redundant cleaning up
## to cannonical chromosomes (which is also below).
    
## Setting up peak_list and printing 'c()' the information of read_peaks for a unique
## DBP that we set up above. This will result in read_peaks reading in the peaks from 
## the replicate files in tf_files will both be read in.
    

    canonical_chr <- c(paste0("chr", 1:22), "chrM", "chrX", "chrY")
    # we are using paste0 to make a list of chr1 - chr22 and chrM etc
    # these are the only chromosomes we want so setting this up for below
    
    for (i in 1:length(peak_list)) {
      peak_list[[i]] <- peak_list[[i]][which(seqnames(peak_list[[i]]) %in% canonical_chr)]
    }
  
## This for loop is going to look throuhg the peak_list and retreive only the
## seqnames 'which' (?seqnames) that are %in% (handy r function!) our cannonical chr list.
   
     final_peakset <- intersect_peaks(peak_list = peak_list)
## intersect peaks is a function we made in the util folder intersect_functions.R
## It basically uses the genomicRanges package findOverlaps that does the bulk 
## of the intersecting peaks between files in 'peak_files'
## the only parameter this function needs is the peak list object (they are named the same)
## Here is the intersect_peaks function below

#    intersect_peaks <- function(peak_list) {
#     combined_peaks <- peak_list[[1]]
#     for(i in 2:length(peak_list)) {
#     suppressWarnings(pl_ov <- findOverlaps(combined_peaks, peak_list[[i]]))
#   pl1 <- combined_peaks[unique(pl_ov@from)]
#   pl2 <- peak_list[[i]][unique(pl_ov@to)]
#   suppressWarnings(combined_peaks <- GenomicRanges::reduce(union(pl1, pl2)))
# }
#  return(combined_peaks)
#}
     
## of note the function reduce, that is a function that will merge two overlaping
## ranges. We take the union of peak in list 1 and 2 (pl1, pl2). This means if they
## overlap it will take the furtherst ends of the ranges. 
     
## now we return to the for loop with an 'if' statement that will only operate
## if the condition of the if statement are true. Here saying the final peak list
## needs to be greater than 0 -- otherwise there were no peaks.
## if greater than 0 peaks are found it will name each peak and put it into an
## object final_peakset and make a new column 'name' that consists of 
## the DBP (or TF) name with an underscore followed by the 1 through all the peaks
## that are in the final_peakset of overlapping peaks.
    if (length(final_peakset) > 0) {
      final_peakset$name <- paste0(dbp, "_", 1:length(final_peakset))
    }
    
    consensus_peaks <- c(consensus_peaks, list(final_peakset))
    names(consensus_peaks)[length(consensus_peaks)] <- dbp
    
## Now we are just tidying and finalizing the consensus_peaks. First concatinating
## consensus_peaks with all the peaks in final_peakset. The list function is 
## listing all the peaks in final_peakset and printing them to consensus_peaks.
## lastly using names to name each peak by the DBP (or TF) using <- to input the
## DBP name.
    }
  }
  return(consensus_peaks)
}
    

# Write the consensus peaks to a file and take the name of the file
# from the name of the list element -- which   should be the DBP name.
for (i in 1:length(consensus_peaks)) {
  rtracklayer::export(consensus_peaks[[i]], 
                      paste0("results/consensus_peaks/", 
                             names(consensus)[i],".bed"))
}
```