# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Introduction-to-Unix---awk-and-Makefiles" data-toc-modified-id="Introduction-to-Unix---awk-and-Makefiles-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction to Unix - awk and Makefiles</a></div><div class="lev1 toc-item"><a href="#Working-with-tabular-files:-Awk" data-toc-modified-id="Working-with-tabular-files:-Awk-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Working with tabular files: Awk</a></div><div class="lev2 toc-item"><a href="#Example-of-tabular-file:-the-GFF3-format" data-toc-modified-id="Example-of-tabular-file:-the-GFF3-format-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Example of tabular file: the GFF3 format</a></div><div class="lev2 toc-item"><a href="#Basic-AWK-syntax:-filters" data-toc-modified-id="Basic-AWK-syntax:-filters-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Basic AWK syntax: filters</a></div><div class="lev4 toc-item"><a href="#Exercise" data-toc-modified-id="Exercise-2.2.0.1"><span class="toc-item-num">2.2.0.1&nbsp;&nbsp;</span>Exercise</a></div><div class="lev2 toc-item"><a href="#Awk:-printing-columns-and-doing-operations" data-toc-modified-id="Awk:-printing-columns-and-doing-operations-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Awk: printing columns and doing operations</a></div><div class="lev3 toc-item"><a href="#Exercise-(difficult)" data-toc-modified-id="Exercise-(difficult)-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Exercise (difficult)</a></div><div class="lev2 toc-item"><a href="#AWK:-searching-by-regular-expressions" data-toc-modified-id="AWK:-searching-by-regular-expressions-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>AWK: searching by regular expressions</a></div><div class="lev3 toc-item"><a href="#Last-exercise!" 
data-toc-modified-id="Last-exercise!-2.4.1"><span class="toc-item-num">2.4.1&nbsp;&nbsp;</span>Last exercise!</a></div><div class="lev1 toc-item"><a href="#Bonus:-Makefiles" data-toc-modified-id="Bonus:-Makefiles-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Bonus: Makefiles</a></div><div class="lev2 toc-item"><a href="#Defining-pipelines-with-Makefiles" data-toc-modified-id="Defining-pipelines-with-Makefiles-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Defining pipelines with Makefiles</a></div><div class="lev2 toc-item"><a href="#How-to-run-Makefile-rules" data-toc-modified-id="How-to-run-Makefile-rules-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>How to run Makefile rules</a></div><div class="lev1 toc-item"><a href="#The-last-slide" data-toc-modified-id="The-last-slide-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The last slide</a></div>

# Introduction to Unix - awk and Makefiles

Giovanni M. Dall'Olio, 20/02/2017. All materials available here: https://dalloliogm.github.io/ 

```
  _______________
 / Part 4        \
 \ awk and make  /
  ---------------
         \   ^__^
          \  (oo)\_______
             (__)\       )\/\
                 ||----w |
                 ||     ||
```

Welcome to the Programming for Evolutionary Biology workshop!!

**How to use these slides**: Press Space to get to the next slide. Use arrows to navigate the subsections.


In [13]:
# Configuration - this will not appear in the slideshow
alias grep='grep --color'
pwd
cd ../exercises

/home/gmd78366/workspace/peb_unix_intro/exercises


# Working with tabular files: Awk


The **awk** command lets you search and manipulate tabular files from the command line.

Think of it as the command-line equivalent of Excel/Calc. It lets you search on specific columns of a file, do numerical operations, or change the order of the columns.

The advantage of a command-line tool over graphical software is that the memory footprint is much lower: awk reads a file one line at a time instead of loading it all at once. So you can search and modify large files in a fraction of the time that it would take with Excel.

## Example of tabular file: the GFF3 format

The file genes/chr8.gff is an example of a file in the GFF3 format:

In [14]:
head genes/chr8.gff

##gff-version 3
##source-version refgene 1.28.10
##date 2016-09-08
##genome-build .	hg19
chr8	refgene	gene	18248755	18258723	.	+	.	gene_id=10;symbol=NAT2;;ID=10
chr8	refgene	gene	100549014	100549089	.	-	.	gene_id=100126309;symbol=MIR875;;ID=100126309    
chr8	refgene	gene	144895127	144895212	.	-	.	gene_id=100126338;symbol=MIR937;;ID=100126338
chr8	refgene	gene	145619364	145619445	.	-	.	gene_id=100126351;symbol=MIR939;;ID=100126351
chr8	refgene	gene	91970706	91997485	.	-	.	gene_id=100127983;symbol=C8orf88;;ID=100127983
chr8	refgene	gene	74332309	74353753	.	+	.	gene_id=100128126;symbol=STAU2-AS1;;ID=100128126


As you can see it is a tab-separated file, which we could easily read in Excel or Calc.

The format specifications are defined [here](https://genome.ucsc.edu/FAQ/FAQformat.html#format3), but in short:

- the first, fourth and fifth columns contain the chromosome name and coordinates
- the second column describes the tool or resource that generated the annotation
- the third column describes the type of feature (e.g. gene, transcript, exon, TF binding site, histone acetylation mark, etc.)
- the ninth column contains several fields, separated by a semicolon


## Basic AWK syntax: filters

The basic AWK syntax is the following:

```
awk 'filters {print statements}' filename
```

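By default, awk splits each line into columns wherever it finds spaces or tabs. For a strictly tab-separated file such as GFF3 this usually gives the same result, but you can set the separator explicitly with the `-F` option (e.g. `-F'\t'`).

Each column of the file can be referred to with a dollar sign followed by the column number.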

For example $2 refers to the second column, and so on.
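
To see how the field separator changes what $2 means, here is a minimal sketch. The sample line is made up for illustration (it is not taken from the GFF file): it contains the text "a b", then a tab, then "c".

```shell
# One toy line: "a b", a tab, then "c" (hypothetical data, for illustration only).
line=$(printf 'a b\tc')

# Default splitting: any run of spaces or tabs separates fields.
default_second=$(printf '%s\n' "$line" | awk '{print $2}')

# Tab-only splitting: "a b" stays together as the first field.
tab_second=$(printf '%s\n' "$line" | awk -F'\t' '{print $2}')

echo "default:  $default_second"
echo "tab-only: $tab_second"
```

With the default separator the second column is "b", while with `-F'\t'` it is "c". This only matters when a column can itself contain spaces.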

The following code selects all the lines on chromosome 8 whose feature starts after position 100000 and ends before position 200000:

In [15]:
awk '$1=="chr8" && $4>100000 && $5<200000 ' genes/chr8.gff

chr8	refgene	gene	182200	197339	.	+	.	gene_id=169270;symbol=ZNF596;;ID=169270
chr8	refgene	gene	116086	117024	.	-	.	gene_id=441308;symbol=OR4F21;;ID=441308
chr8	refgene	gene	158345	182318	.	-	.	gene_id=644128;symbol=RPL23AP53;;ID=644128
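
The same kind of filter can be written with the coordinates passed in as variables, so that the awk program does not need to be edited for each new region. This is only a sketch: the variable names `lo` and `hi` are made up, and the temporary file holds three lines copied from the slides rather than the full genes/chr8.gff.

```shell
# Three gene lines copied from the slides, written to a temporary file.
gff=$(mktemp)
printf 'chr8\trefgene\tgene\t%s\t%s\t.\t%s\t.\t%s\n' \
    18248755 18258723 + 'gene_id=10;symbol=NAT2;;ID=10' \
    182200 197339 + 'gene_id=169270;symbol=ZNF596;;ID=169270' \
    116086 117024 - 'gene_id=441308;symbol=OR4F21;;ID=441308' > "$gff"

# -v sets an awk variable from the shell; lo and hi are hypothetical names.
matches=$(awk -F'\t' -v lo=100000 -v hi=200000 \
    '$1 == "chr8" && $4 > lo && $5 < hi' "$gff")

echo "$matches"
rm -f "$gff"
```

Only the ZNF596 and OR4F21 lines should be printed, because NAT2 lies outside the requested region.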


#### Exercise

Can you print all the lines for features located between positions 5000000 and 10000000?

In [16]:
awk '$4 > 5000000 && $5 < 10000000 ' genes/chr8.gff


chr8	refgene	gene	7143733	7212876	.	-	.	gene_id=100128890;symbol=FAM66B;ID=100128890
chr8	refgene	gene	7215498	7220490	.	-	.	gene_id=100131980;symbol=ZNF705G;ID=100131980
chr8	refgene	gene	7812535	7866277	.	+	.	gene_id=100132103;symbol=FAM66E;ID=100132103
chr8	refgene	gene	7783859	7809935	.	+	.	             _________
chr8	refgene	gene	6261077	6264069	.	-	.	            / Cows in \
chr8	refgene	gene	7272385	7274354	.	-	.	            | the     |
chr8	refgene	gene	7946463	7946611	.	-	.	            \ Genome! /
chr8	refgene	gene	6602685	6602765	.	+	.	             ---------
chr8	refgene	gene	8905955	8906028	.	+	.	                      \   ^__^
chr8	refgene	gene	6602689	6602761	.	-	.	                       \  (oo)\_______
chr8	refgene	gene	6693076	6699975	.	+	.	                          (__)\       )\/\
chr8	refgene	gene	8559666	8561617	.	+	.	                              ||----w |
chr8	refgene	gene	9182561	9192590	.	+	.	                              ||      |
chr8	refgene	gene	81

## Awk: printing columns and doing operations

Awk can also print only specific columns, and do arithmetic operations on them.

Remember that each column can be referred to as \$1, \$2, \$3, etc...

For example, the following code prints the first column, and the difference between the fifth and the fourth (that is, the length of each feature). We can pipe the output to head or less, to make it easier to read:

In [17]:
awk '{print $1, $5-$4}' genes/chr8.gff | head


##gff-version 0
##source-version 0
##date 0
##genome-build 0
chr8 9968
chr8 75
chr8 85
chr8 81
chr8 26779
chr8 21444


Notice how this also prints the headers of the file. We can exclude these by adding a grep condition:

In [18]:
awk '{print $1, $5-$4, $9}' genes/chr8.gff | grep -v '^#' |  head

chr8 9968 gene_id=10;symbol=NAT2;;ID=10
chr8 75 gene_id=100126309;symbol=MIR875;;ID=100126309
chr8 85 gene_id=100126338;symbol=MIR937;;ID=100126338
chr8 81 gene_id=100126351;symbol=MIR939;;ID=100126351
chr8 26779 gene_id=100127983;symbol=C8orf88;;ID=100127983
chr8 21444 gene_id=100128126;symbol=STAU2-AS1;;ID=100128126
chr8 12197 gene_id=100128338;symbol=FAM83H-AS1;;ID=100128338
chr8 1835 gene_id=100128627;symbol=CDC42P3;;ID=100128627
chr8 3282 gene_id=100128750;symbol=RBPMS-AS1;;ID=100128750
chr8 69143 gene_id=100128890;symbol=FAM66B;ID=100128890
grep: write error: Broken pipe
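
As an alternative to grep, the header lines can also be skipped inside awk itself, using a filter in front of the print statement. The sketch below assumes a small temporary file with one header and two gene lines copied from the slides; `-v OFS='\t'` is an optional extra that keeps the output tab-separated.

```shell
# One header line and two gene lines copied from the slides.
gff=$(mktemp)
{
    printf '##gff-version 3\n'
    printf 'chr8\trefgene\tgene\t18248755\t18258723\t.\t+\t.\tgene_id=10;symbol=NAT2;;ID=10\n'
    printf 'chr8\trefgene\tgene\t100549014\t100549089\t.\t-\t.\tgene_id=100126309;symbol=MIR875;;ID=100126309\n'
} > "$gff"

# !/^#/ keeps only the lines that do not start with "#".
# OFS is the separator awk puts between the printed columns.
lengths=$(awk -F'\t' -v OFS='\t' '!/^#/ {print $1, $5 - $4, $9}' "$gff")

echo "$lengths"
rm -f "$gff"
```

The filter `!/^#/` plays the same role as `grep -v '^#'`, but without starting a second program.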


### Exercise (difficult)

Starting from the previous command, can you extract the gene symbol into a separate column?

Hints: pipe an additional awk statement after the first. Use the -F option to specify a different field separator.

In [19]:
awk '{print $1, $5-$4, $9}' genes/chr8.gff | grep -v '^#' | awk -F';' '{print $1, $2}' | head

chr8 9968 gene_id=10 symbol=NAT2
chr8 75 gene_id=100126309 symbol=MIR875
chr8 85 gene_id=100126338 symbol=MIR937
chr8 81 gene_id=100126351 symbol=MIR939
chr8 26779 gene_id=100127983 symbol=C8orf88
chr8 21444 gene_id=100128126 symbol=STAU2-AS1
chr8 12197 gene_id=100128338 symbol=FAM83H-AS1
chr8 1835 gene_id=100128627 symbol=CDC42P3
chr8 3282 gene_id=100128750 symbol=RBPMS-AS1
chr8 69143 gene_id=100128890 symbol=FAM66B
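
The solution above still leaves the "symbol=" prefix in the output. One possible way to keep only the gene name is to split the ninth column inside awk. This is a sketch rather than the official solution: it uses awk's built-in `split` and `substr` functions, the array name `attrs` is arbitrary, and the two input lines are copied from the slides.

```shell
# Two gene lines copied from the slides, written to a temporary file.
gff=$(mktemp)
printf 'chr8\trefgene\tgene\t%s\t%s\t.\t%s\t.\t%s\n' \
    18248755 18258723 + 'gene_id=10;symbol=NAT2;;ID=10' \
    7143733 7212876 - 'gene_id=100128890;symbol=FAM66B;ID=100128890' > "$gff"

symbols=$(awk -F'\t' -v OFS='\t' '!/^#/ {
    n = split($9, attrs, ";")          # break column 9 into its fields
    symbol = "."                       # placeholder if no symbol is found
    for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^symbol=/)
            symbol = substr(attrs[i], 8)   # drop the 7-character "symbol=" prefix
    print $1, $5 - $4, symbol
}' "$gff")

echo "$symbols"
rm -f "$gff"
```

Looping over all the fields, instead of assuming that the symbol is always the second one, keeps the command working if the order of the fields in column 9 changes.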


## AWK: searching by regular expressions

Awk can also be used to search by regular expression.

For example, the following code will print all the lines in which the symbol starts with "MIR":

In [20]:
awk '$9 ~ /symbol=MIR/ {print $0}' genes/chr8.gff 

chr8	refgene	gene	100549014	100549089	.	-	.	gene_id=100126309;symbol=MIR875;;ID=100126309    
chr8	refgene	gene	144895127	144895212	.	-	.	gene_id=100126338;symbol=MIR937;;ID=100126338
chr8	refgene	gene	145619364	145619445	.	-	.	gene_id=100126351;symbol=MIR939;;ID=100126351
chr8	refgene	gene	65285775	65295842	.	+	.	gene_id=100130155;symbol=MIR124-2HG;;ID=100130155
chr8	refgene	gene	128972879	128972941	.	+	.	gene_id=100302161;symbol=MIR1205;;ID=100302161
chr8	refgene	gene	10682883	10682953	.	-	.	gene_id=100302166;symbol=MIR1322;;ID=100302166
chr8	refgene	gene	129021144	129021202	.	+	.	gene_id=100302170;symbol=MIR1206;;ID=100302170
chr8	refgene	gene	129061398	129061484	.	+	.	gene_id=100302175;symbol=MIR1207;;ID=100302175
chr8	refgene	gene	128808208	128808274	.	+	.	gene_id=100302185;symbol=MIR1204;;ID=100302185
chr8	refgene	gene	145625476	145625559	.	-	.	gene_id=100302196;symbol=MIR1234;;ID=100302196
chr8	refgene	gene	113655722	113655812	.	+	.	gene_id=100302225;symbol=MIR2053;;ID
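
The pattern above also matches names such as MIR124-2HG, which carry a suffix after the digits. If only symbols of the form "MIR" plus digits are wanted, the regular expression can be made stricter. The sketch below is one way to do it, tried on three lines copied from the slides; whether this is the right definition of a miRNA symbol for your data is an assumption to check.

```shell
# Three gene lines copied from the slides, written to a temporary file.
gff=$(mktemp)
printf 'chr8\trefgene\tgene\t%s\t%s\t.\t%s\t.\t%s\n' \
    18248755 18258723 + 'gene_id=10;symbol=NAT2;;ID=10' \
    100549014 100549089 - 'gene_id=100126309;symbol=MIR875;;ID=100126309' \
    65285775 65295842 + 'gene_id=100130155;symbol=MIR124-2HG;;ID=100130155' > "$gff"

# [0-9]+ means "one or more digits"; the final ";" forces the symbol to end there.
mir_only=$(awk -F'\t' '$9 ~ /symbol=MIR[0-9]+;/' "$gff")

echo "$mir_only"
rm -f "$gff"
```

Here only the MIR875 line should be printed: NAT2 does not start with "MIR", and MIR124-2HG has a "-" where the pattern expects the closing semicolon.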

### Last exercise!

Calculate the length of the gene POU5F1B.

Find the gene whose gene_id is equal to that number.

In [21]:
awk '$9 ~ /POU5F1B/ {print $5-$4}' genes/chr8.gff 


1584


In [22]:
awk '$9 ~ /gene_id=1584;/ {print $0}' genes/chr8.gff 

chr8	refgene	Good_Job!	143953773	143961236	.	-	.	gene_id=1584;symbol=CYP11B1;;ID=1584


# Bonus: Makefiles

Let's have a look at the file called Makefile in the unix_intro directory:

In [24]:
cd ..
head Makefile

# This is a Makefile, which will be explained later in the course.
# Please don't look at it yet :-)

publish: slides_bash commit
	echo "convert the slides to pdf, commit, and push to github"
	git push


test_exercises: start help ignorecase multiplefiles
generate_exercises: generate_grep generate_awk


Press space or the down key to continue

## Defining pipelines with Makefiles

Makefiles are a basic way to define pipelines of shell commands.

Nowadays there are more sophisticated tools available, but many of them build on the same ideas as Makefiles.


A Makefile is a collection of "rules".

Each of these rules follows this basic syntax:

```
target: prerequisites
	commands to execute
```

Note that the command lines must be indented with a tab character, not with spaces.

As you can see in the included Makefile, most of the rules regenerate the exercise files, or execute some commands without having to type them every time.

For example, the rule "testrule" is associated with two echo commands.

## How to run Makefile rules

To execute a rule in the Makefile, simply type:

```
make [name of the rule]
```

For example:



In [25]:
make testrule

echo this is a Makefile rule
this is a Makefile rule
echo You can associate it to as many commands you want
You can associate it to as many commands you want


The program "make" will automatically look for a file named "Makefile" in the current directory, and execute the rule with the given name.

Rules can also depend on other rules. For example the two rules "test_exercises" and "generate_exercises" at the beginning of the file are a way to call several other rules together.
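
To try this without touching the workshop Makefile, the sketch below writes a tiny Makefile into a temporary directory and runs it. The rule names `both`, `first` and `second` are made up for this example, and it assumes that `make` is installed.

```shell
# Write a tiny Makefile into a temporary directory.
# Each \t is the tab that must start a command line.
workdir=$(mktemp -d)
printf 'both: first second\n\techo both done\nfirst:\n\techo first rule\nsecond:\n\techo second rule\n' \
    > "$workdir/Makefile"

# Run the rule "both" from inside that directory.
# -s tells make not to print each command before running it.
output=$(cd "$workdir" && make -s both)

echo "$output"
rm -rf "$workdir"
```

Because "both" lists "first" and "second" as prerequisites, make runs those two rules before running the command of "both" itself.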

# The last slide

This is the last slide of the workshop. To finish, try to execute the rule "goodbye" in the Makefile.

In [26]:
make goodbye

 _____________ 
/ I hope you  \
| have        |
| enjoyed the |
| workshop    |
\ :-)         /
 ------------- 
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
 ___________ 
( Now let's )
( go for    )
( dinner    )
 ----------- 
        o   ^__^
         o  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
 _______________________ 
/ Note: real genomes do \
| not contain hidden    |
\ cows                  /
 ----------------------- 
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
