# GCB5350 - Creating a pipeline for analysis

## Instructions

In this adventure, you will convert the "pipeline" we debugged in the previous adventure to an actual pipeline (with some outputs).

Once pipelined, you can then use the power of automation to normalize a bunch of data files and look at the outputs.

Here is the 'debugged' version of the previous code:

### My first pipeline: GExpr_std_pipe.R

## written by: XXXX
## created on: YYYY

## This code is designed to take in two comma separated value files, one with gene expression data
## the other with gene location information, merge the tables, and standardize the expression 
## values.

## BLOCK ZERO
## Load libraries used for this pipeline
library(tidyverse)

## BLOCK 0.5
## Read input directly from the command line

### BLOCK ONE
## This bit of code: ...
GExpr <- read.table(file="GExp_snippet.csv", sep=",",header=T)
Loc <- read.table(file="Loc_snippet.csv", sep=",", header=T)
z <- left_join(GExpr,Loc, by="geneid") %>%
      relocate(chr,pos,.after=geneid)

### BLOCK TWO
## This bit of code: ...
x <- z %>%
  rowwise(geneid) %>%
  mutate(ave = mean(c_across(starts_with("GTEX")),na.rm=T)) %>%
  mutate(sd = sd(c_across(starts_with("GTEX")),na.rm=T)) %>%
  relocate(ave,sd,.after=pos)


### BLOCK THREE
## This bit of code: ...
for (i in seq_along(rownames(x))) {
  for(j in 6:length(x[i,])) {
      this_ave = x[i,]$ave
      this_sd = x[i,]$sd
      x[i,j] = (x[i,j] - this_ave) / this_sd
  }
}

### BLOCK FOUR
## This bit of code: ...
x <- x %>%
  rowwise(geneid) %>%
  mutate(ave_std = mean(c_across(starts_with("GTEX")),na.rm=T)) %>%
  mutate(sd_std = sd(c_across(starts_with("GTEX")), na.rm=T)) %>%
  relocate(ave_std,sd_std,.after=sd)
         
### BLOCK FIVE
## This bit of code: ...


**Q1.** First, edit the above code to make it more readable. Use comments to:

* State the author and the date the pipeline was created.
* Give one comment for each block of code to describe what the block is designed to do.

**Add your edits to the code above.**

**Q2.** Next, add a new block of code, **BLOCK FIVE**, which:

* writes the table `x` to a new file called `GExp_snippet.csv.std`
* passes these settings to `write.table()`: no row names, include column names, no quotes, comma as the separator.

**Add your block of code to the code above.**
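The `write.table()` settings listed above can be tried out on a small table first. This is a minimal sketch, not the pipeline's BLOCK FIVE: the data frame `toy` and the temporary path `out_path` are made-up stand-ins for `x` and the real output file.

```r
# Toy table standing in for `x` (hypothetical values, not the course data)
toy <- data.frame(geneid   = c("ENSG0001", "ENSG0002"),
                  GTEX.A01 = c(1.03, -0.643),
                  GTEX.A02 = c(-0.45, NA))

# Hypothetical output path; a temp file keeps the example away from real data
out_path <- tempfile(fileext = ".csv.std")

write.table(toy, file = out_path,
            sep = ",",          # separate the data by comma
            row.names = FALSE,  # no row names
            col.names = TRUE,   # do include column names
            quote = FALSE)      # no quotes around text values

# Read the raw lines back to see exactly what was written
written <- readLines(out_path)
print(written)
```

Reading the file back with `readLines()` is a quick way to confirm there are no quotes or row numbers before pointing the same call at `x`.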

**Q3.** Now let's make the file input *more generic* by:

* adding code to section **BLOCK 0.5** to obtain the files that need to be analyzed from the command line.
* modifying the `file` argument of `read.table()` in **BLOCK ONE** to use the files that were specified on the command line.

A slide on Canvas in the 'Reproducible Pipelines' module may be helpful to you here.

**Provide your edits to the code above.**
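One common way to read command-line input in R is `commandArgs(trailingOnly = TRUE)`, which returns a character vector of the arguments passed to the script. The sketch below assumes the expression file comes first and the location file second; the helper `parse_inputs()` and the names `expr_file` and `loc_file` are hypothetical, and the exact contents of the vector depend on how the script is invoked.

```r
# Character vector of the arguments supplied on the command line
args <- commandArgs(trailingOnly = TRUE)

# Hypothetical helper: check the argument count and name the two inputs.
# Assumes the order <expression file> <location file>.
parse_inputs <- function(args) {
  if (length(args) != 2) {
    stop("Usage: Rscript GExpr_std_pipe.R <expression.csv> <location.csv>")
  }
  list(expr_file = args[1], loc_file = args[2])
}

# Demonstrated on a fixed vector so the sketch also runs interactively;
# in the script itself you would call parse_inputs(args) instead.
inputs <- parse_inputs(c("GExp_snippet.csv", "Loc_snippet.csv"))
print(inputs$expr_file)
print(inputs$loc_file)
```

The values in `inputs` could then replace the hard-coded file names in the two `read.table()` calls.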

**Q4.** There's one more change that we need to make: our output file has a fixed name, so it will be overwritten each time the code runs! Let's make it *more generic* by further modifying the **BLOCK FIVE** code to:

* create a new variable called `outfile`.
* use `paste()` with `sep=""` to build a new file name: the filename the user specified on the command line (the one used to create the variable `GExpr`) with ".std" appended to it.
* change the output file from what you had previously to `outfile`.

(**Note:** CoCalc will be VERY confused if you try to open this newly created file because it thinks you've changed the file extension. This does not matter in UNIX: viewing the file from a UNIX terminal with `head` or `more` works fine.)

**Make your modifications to the code above.**
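The file-name construction can be sketched in isolation. Here `expr_file` is a hypothetical variable holding the expression file name; in the pipeline it would come from the command line rather than being typed in.

```r
# Stand-in for the expression file name given on the command line
expr_file <- "GExp_snippet.csv"

# sep = "" joins the pieces with nothing in between
outfile <- paste(expr_file, ".std", sep = "")
print(outfile)  # "GExp_snippet.csv.std"
```

`paste0(expr_file, ".std")` gives the same result, since `paste0()` is `paste()` with `sep = ""`.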

**Q5.** OK, now we're ready to port this over into UNIX and try this out via `Rscript`:

* Open a UNIX terminal (Remember UNIX? Click "Files" -> "(+) New" -> ">_ Linux Terminal")
* In UNIX, use `touch` to create a new file called `GExpr_std_pipe.R`
* Open this file using the text editor `nano` or `emacs`, and copy all of the code you edited/created above into this .R script
* Save the file.
* Use `Rscript` and the example files we used previously (`GExp_snippet.csv` and `Loc_snippet.csv`, the ones referenced in the script) to test your script out. This **might** look like:
   
        $ Rscript GExpr_std_pipe.R --args GExp_snippet.csv Loc_snippet.csv
       
But this may vary depending on how you have set up your file input.

* Check to see that you've created the expected output file and that it looks like what you expect!

To save you from flipping back and forth, here's "the answer" you should get to make sure your output is correct!

| geneid   | chr | pos     | ave   | sd    | ave_std   | sd_std | GTEX.A01 | GTEX.A02 | GTEX.A03 | GTEX.A04 | GTEX.A05 | GTEX.A06 |
|----------|-----|---------|-------|-------|-----------|--------|----------|----------|----------|----------|----------|----------|
| ENSG0001 | 11  | 1023832 | -1.2  | 2.22  | -6.66E-17 | 1      | 1.03     | -0.45    | 1.12     | NA       | -0.855   | -0.855   |
| ENSG0002 | 17  | 199299  | -1.42 | 2.31  | -1.11E-16 | 1      | -0.643   | -0.643   | -0.643   | -0.643   | 1.44     | 1.13     |
| ENSG0003 | 22  | 111238  | 1.26  | 0.207 | -3.89E-17 | 1      | -1.25    | -0.772   | NA       | 0.193    | 0.675    | 1.16     |
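The `ave_std` values in the table are not exactly zero (for example `-6.66E-17`) because of floating-point rounding; after standardization each row's mean is zero and its standard deviation is one only up to that tolerance. A minimal sketch of the same check on one made-up vector (the values in `v` are hypothetical, not from the course files):

```r
# One hypothetical "row" of expression values, with a missing value
v <- c(2, 4, NA, 6, 8)

# Standardize: subtract the mean, divide by the standard deviation
v_std <- (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)

ave_std <- mean(v_std, na.rm = TRUE)
sd_std  <- sd(v_std, na.rm = TRUE)

# all.equal() compares within a small tolerance rather than exactly
print(all.equal(ave_std, 0))
print(all.equal(sd_std, 1))
```

Comparing with `all.equal()` rather than `==` avoids false alarms from values like `-1.11E-16`.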