# Piping commands with grep, sed, samtools, and bedtools

## Introduction

You may have heard people talking about pipelines and piping commands in bioinformatics. Piping is key to bioinformatics because it saves the time and disk space that would otherwise go into creating intermediate files. In this notebook, we're going to learn some basic Unix commands and pipe (chain) them together.

## Searching with *grep*

Sometimes you want to find a phrase within your text file. In Microsoft Word, you can use find (CTRL+f). How would we do this on the command line? One command that is powerful for searching files on UNIX systems is **grep**, which stands for **g**lobal **r**egular **e**xpression **p**rint.

Let's create a file called `sample.txt` (you can use vim and paste these lines in) with the following contents:
```
#LastName  FirstName Gender    Year
Marina     Ryan      Male      3rd
Wheeler    Emily     Female    4th
Gorkin     David     Male      NA
Chiou      Josh      Male      2nd
Geusz      Ryan      Male      2nd
Kolodziej  Krystyna  Female    2nd
```

Let's say that I just want to pull out lines corresponding to students in their 2nd year.
I can do this with a simple command using grep.

    grep "2nd" sample.txt

What this command does is search the `sample.txt` file for the phrase "2nd".
You should get an output that looks something like this.

```
Chiou      Josh      Male      2nd
Geusz      Ryan      Male      2nd
Kolodziej  Krystyna  Female    2nd
```
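A few other **grep** options are handy once the basic search works. The sketch below recreates `sample.txt` so it runs on its own; the variable names (`second_years`, `ryans`) are just illustrative.

```shell
# Recreate the example file so this block is self-contained.
cat > sample.txt <<'EOF'
#LastName  FirstName Gender    Year
Marina     Ryan      Male      3rd
Wheeler    Emily     Female    4th
Gorkin     David     Male      NA
Chiou      Josh      Male      2nd
Geusz      Ryan      Male      2nd
Kolodziej  Krystyna  Female    2nd
EOF

# -c counts the matching lines instead of printing them.
second_years=$(grep -c "2nd" sample.txt)
echo "2nd-year students: $second_years"

# -v inverts the match: print every line that does NOT contain "2nd".
grep -v "2nd" sample.txt

# -i ignores case, so "ryan" also matches "Ryan".
ryans=$(grep -i -c "ryan" sample.txt)
echo "Lines mentioning Ryan: $ryans"
```

Note that `grep -v "2nd"` also prints the header line, because the header does not contain "2nd".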

For more practice with **grep**, check out this [tutorial](https://www.panix.com/~elflord/unix/grep.html), which also demonstrates some more complicated use cases.

## Editing standard output with *sed*

Sometimes you want to find strings in your file, and replace them with something else. In Microsoft Word, this is called find-and-replace. On the command line, we can use **sed**, which stands for **s**tream **ed**itor.

Let's use our `sample.txt` file again. David likes to go by Dave, so let's make that substitution.

    sed "s/David/Dave/g" sample.txt
    
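What this command does is substitute (s) "David" with "Dave" for every occurrence on each line (g, for global) as it reads `sample.txt`. Note that **sed** prints the edited text to standard output; the file `sample.txt` itself is not changed.
You should get an output that looks something like this on the command line.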

```
#LastName  FirstName Gender    Year
Marina     Ryan      Male      3rd
Wheeler    Emily     Female    4th
Gorkin     Dave     Male      NA
Chiou      Josh      Male      2nd
Geusz      Ryan      Male      2nd
Kolodziej  Krystyna  Female    2nd
```
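Because **sed** only writes to standard output, the edit is lost unless you capture it. One minimal way to keep the result, sketched here, is to redirect the output to a new file; the name `sample.renamed.txt` is a hypothetical choice for illustration.

```shell
# Recreate the example file so this block is self-contained.
cat > sample.txt <<'EOF'
#LastName  FirstName Gender    Year
Marina     Ryan      Male      3rd
Wheeler    Emily     Female    4th
Gorkin     David     Male      NA
Chiou      Josh      Male      2nd
Geusz      Ryan      Male      2nd
Kolodziej  Krystyna  Female    2nd
EOF

# Send sed's standard output to a new file (hypothetical name).
sed "s/David/Dave/g" sample.txt > sample.renamed.txt

# The original file still says David; the new file says Dave.
grep "Gorkin" sample.txt
grep "Gorkin" sample.renamed.txt
```

Writing to a new file rather than editing in place keeps the original around in case the substitution matched more than you intended.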

For more practice with **sed**, check out this [tutorial](http://www.grymoire.com/Unix/Sed.html), which also demonstrates some more complicated use cases.

## Combining *grep* and *sed* with pipes

Let's combine what we've learned so far with a pipe. Pipes are a way to chain commands together so that we don't have to create intermediate files. 

Let's say that we want to extract all lines that contain "Male". For simplicity, we also want to change all instances of "Male" to "M". 

We can do this by piping commands. A pipe (`|`) sends the standard output of one command to the standard input of the next, so the second command works directly on the first command's results. You can pipe as many commands together as you want, which can create one really long command. One really useful command that you've learned already is **less**. Let's combine all of these together into one command.

    grep "Male" sample.txt | sed "s/Male/M/g" | less
    
This should give us an output that looks something like this.

```
Marina     Ryan      M      3rd
Gorkin     David     M      NA
Chiou      Josh      M      2nd
Geusz      Ryan      M      2nd
```
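To see what the pipe saves, the sketch below runs the same two steps with and without an intermediate file and checks that the results agree. All output file names here (`males.tmp.txt`, `males.two_step.txt`, `males.piped.txt`) are hypothetical.

```shell
# Recreate the example file so this block is self-contained.
cat > sample.txt <<'EOF'
#LastName  FirstName Gender    Year
Marina     Ryan      Male      3rd
Wheeler    Emily     Female    4th
Gorkin     David     Male      NA
Chiou      Josh      Male      2nd
Geusz      Ryan      Male      2nd
Kolodziej  Krystyna  Female    2nd
EOF

# Without a pipe: write an intermediate file, read it back, then clean up.
grep "Male" sample.txt > males.tmp.txt
sed "s/Male/M/g" males.tmp.txt > males.two_step.txt
rm males.tmp.txt

# With a pipe: same two steps, no intermediate file.
grep "Male" sample.txt | sed "s/Male/M/g" > males.piped.txt

# cmp -s is silent and succeeds only if the two files are identical.
cmp -s males.two_step.txt males.piped.txt && echo "Outputs are identical"

# Pipes can keep going: count the lines instead of paging through them.
# tr strips the padding that some versions of wc add around the number.
n_male=$(grep "Male" sample.txt | sed "s/Male/M/g" | wc -l | tr -d ' ')
echo "Male students: $n_male"
```

The search is case-sensitive, so "Female" lines are not picked up by `grep "Male"`.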

## Using pipes with samtools

[Samtools](http://www.htslib.org/doc/samtools.html) is a powerful bioinformatics program that you're going to use for viewing and manipulating reads in `.sam` or `.bam` files. These are the standard formats for storing aligned reads, so chances are you'll be using samtools a lot. Samtools is set up to enable piping - let's see how we can use this to our advantage.

Let's download some data so we can play around with piping. I picked a random `.bam` file from ENCODE, but you can use whatever `.bam` file you want for this.

    wget https://www.encodeproject.org/files/ENCFF510XHR/@@download/ENCFF510XHR.bam

Now that we've downloaded the file, let's check out the reads with `samtools view`. Again, we can pipe this to `less` so that our command line doesn't get overloaded. The `-S` flag of `less` turns off line wrapping so that each line in your file takes up only one line on screen (you can scroll back and forth with the left and right arrow keys).

    samtools view ENCFF510XHR.bam | less -S
    
Let's say that we want to just pull out reads that aligned to chr9. We can use **grep** to search for lines that contain "chr9" like so:

    samtools view ENCFF510XHR.bam | grep "chr9" | less -S
    
NOTE: **grep** matches substrings by default, so if you search for "chr1", you also get lines that contain "chr10", "chr11", "chr12", and so on. It also matches anywhere on the line, not just in the chromosome column. We can use a different command to get exact matches. [AWK](https://en.wikipedia.org/wiki/AWK) is a powerful programming language for processing field-separated text files. In this command, we print every line whose 3rd field ($3) is exactly "chr1" - see the SAM file specification if you are unsure why we're looking at the 3rd field for the chromosomal alignment info.

    samtools view ENCFF510XHR.bam | awk '$3=="chr1"' | less -S 
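The difference between substring and exact-field matching is easy to check on a few made-up records. The sketch below does not need samtools: `mock_reads.txt` is a hypothetical tab-separated file holding only the first four SAM fields, with invented read names and positions.

```shell
# Four made-up records: read name, flag, chromosome, position (tab-separated).
printf 'read1\t0\tchr1\t100\n'   >  mock_reads.txt
printf 'read2\t0\tchr10\t200\n'  >> mock_reads.txt
printf 'read3\t16\tchr11\t300\n' >> mock_reads.txt
printf 'read4\t0\tchr1\t400\n'   >> mock_reads.txt

# Substring match: chr1, chr10, and chr11 all contain "chr1".
substring_hits=$(grep -c "chr1" mock_reads.txt)

# Exact match on the 3rd field only; n+0 prints 0 if nothing matched.
exact_hits=$(awk '$3=="chr1" {n++} END {print n+0}' mock_reads.txt)

# grep -w matches whole words, which also excludes chr10 and chr11,
# but it still looks at every field rather than just the 3rd.
word_hits=$(grep -w -c "chr1" mock_reads.txt)

echo "substring: $substring_hits  exact field: $exact_hits  whole word: $word_hits"
```

On real alignments the `awk` form is the safer choice, since a chromosome name can also appear in other columns such as the mate's reference name.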

Now that we've covered basic piping, it's worth noting that you could've saved time by indexing the `.bam` file and querying a region directly, like this. (Indexing requires a coordinate-sorted `.bam` file.)
    
    samtools index ENCFF510XHR.bam
    samtools view ENCFF510XHR.bam chr1:1000000-2000000 | less -S
    
This command pulls out all reads from chr1 between 1 Mb and 2 Mb and pipes them to `less`.

## Using pipes with bedtools

[BEDtools](http://bedtools.readthedocs.io/en/latest/index.html) is a program suite developed for BED files - which are used to define genomic regions. We often use BED files to define regions of ChIP-seq and ATAC-seq peaks, and piping is especially useful for manipulating these files.

Let's download a file from ENCODE and check out how this works. We're going to start with the [bedtools intersect](http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html) tool, which can work with `.bed`, `.bam`, `.sam`, `.gtf`, `.vcf`, and many other genomic file formats.

    wget https://www.encodeproject.org/files/ENCFF846YPA/@@download/ENCFF846YPA.bed.gz
    
Let's check out the file and see what it contains. It's gzipped, so let's decompress it first.

    gunzip ENCFF846YPA.bed.gz
    less -S ENCFF846YPA.bed
    
You should see something that looks like this in the first few lines...

```
chr7    106808618       106809664       .       0       .       4760.47152657298        -1      3.915135906622  445
chr12   120755258       120755932       .       0       .       3506.63375389865        -1      3.915135906622  343
chr12   58145508        58146594        .       0       .       3223.69035472088        -1      3.915135906622  783
chr7    108209953       108210606       .       0       .       2981.61237398703        -1      3.915135906622  302
chr11   65479191        65479822        .       0       .       2728.6959357355 -1      3.915135906622  283
```

The first thing to note is that fields 1-3 give the genomic coordinates of each region: for example, on the first line chr7 defines the chromosome, 106808618 defines the start point of the region, and 106809664 defines the end point.
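Since the coordinates sit in fixed columns, `awk` can compute on them directly. As a small sketch, the block below copies the first three columns of the five lines shown above into a hypothetical file, `peaks.head.bed`, and prints each peak's width (end minus start).

```shell
# First three columns of the five example lines (tab-separated).
printf 'chr7\t106808618\t106809664\n'  >  peaks.head.bed
printf 'chr12\t120755258\t120755932\n' >> peaks.head.bed
printf 'chr12\t58145508\t58146594\n'   >> peaks.head.bed
printf 'chr7\t108209953\t108210606\n'  >> peaks.head.bed
printf 'chr11\t65479191\t65479822\n'   >> peaks.head.bed

# Print chromosome, start, end, and width (end - start) for every region.
awk 'BEGIN {OFS="\t"} {print $1, $2, $3, $3 - $2}' peaks.head.bed

# Width of the first peak only.
first_width=$(awk 'NR==1 {print $3 - $2}' peaks.head.bed)
echo "First peak width: $first_width bp"

# Pipe an exact-match filter into wc to count the chr12 peaks.
chr12_peaks=$(awk '$1=="chr12"' peaks.head.bed | wc -l | tr -d ' ')
echo "chr12 peaks: $chr12_peaks"
```

The same `awk` commands should work on the full `ENCFF846YPA.bed` file, since they only touch the first three columns.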

Let's say that we want to pull out all reads that overlap these peaks. We're going to use the `ENCFF510XHR.bam` file that we downloaded earlier combined with `bedtools intersect`. Since `bedtools intersect` writes binary BAM data to standard output here, we use `-` to tell `samtools view` to read from stdin. Notice that the order of the files for `bedtools intersect` matters - since we want the reads as our output, we put the `.bam` file first (`-a`). The redirection operator (>) at the end tells the shell to send our output to a new file (ENCFF846YPA.reads.sorted.sam).

    bedtools intersect -a ENCFF510XHR.bam -b ENCFF846YPA.bed | samtools view - > ENCFF846YPA.reads.sorted.sam
    
If we want to see just the reads from chr9, we can add `grep` or `awk` to the pipe before writing the output.

    bedtools intersect -a ENCFF510XHR.bam -b ENCFF846YPA.bed | samtools view - | grep "chr9" > ENCFF846YPA.chr9.reads.sorted.sam
