# Doing filtering tasks

As mentioned before, `awk` is very useful to filter datasets. We learned that it is a small programming language. So, using `awk` already means you are doing programming - congrats! Let's do a bit more with it.

Usually, you type `awk` followed by the program, which is a string enclosed in single quotes `''`, and then the file name (or you feed it standard input*).

\*What was that again?

Within the quotation marks, certain structures have to be respected. For example, you usually set the filtering condition first, e.g. a certain column (`$1` means column 1) should match a pattern (`=="orange"` means being equal to "orange"). Then, you define what it should spit out, surrounded by braces (`{ print $1 }` means the program outputs the first column only). You see that there is a syntax and a logical order of things. The good thing is that `awk` can be used on the command line to process any text file.
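To make the "condition first, action in braces" order concrete, here is a minimal sketch. The two-column sample file (fruit, count) is made up for illustration and is not the course's `fileA.txt`; the variable names are likewise only assumptions.

```shell
# Hypothetical two-column sample (fruit, count); not the course's fileA.txt.
sample=$(mktemp)
printf '%s\n' 'orange 12' 'apple 20' 'orange 7' > "$sample"

# Condition first, action in braces: for lines whose column 1 is "orange",
# print only column 2.
from_file=$(awk '$1=="orange" { print $2 }' "$sample")

# The same program, reading standard input instead of a file name.
from_stdin=$(awk '$1=="orange" { print $2 }' < "$sample")

echo "$from_file"
echo "$from_stdin"
rm -f "$sample"
```

Both calls should print the counts of the two orange lines; only the way the data reaches `awk` differs.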

Here is one possible [cheatsheet](https://www.shortcutfoo.com/app/dojos/awk/cheatsheet); you may use Google for more.

All of this (the first lectures on bash and this one on awk) is extremely useful when working with sequencing data as it comes out of the sequencing platform, because you deal with very large text files, and within each text file there is a structure. You make use of this structure by querying something from a file. Hence, these tools are fundamental for doing modern genetic and genomic work.


## Matching with awk

What can we do? We can try to match things.

`awk '$2=="orange" { print $0 }' fileA.txt`

Now compare

`awk '$2=="apple" { print $0 }' fileA.txt | awk '$5== "B" {print $0}'`

and

`awk '$2=="apple" && $5=="B" { print $0 }' fileA.txt`

* What is the difference?
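One way to check your answer is to run both forms on a tiny file and compare. The six-column sample below is invented for this sketch (it only mimics the column positions used above), so treat the data and variable names as assumptions.

```shell
# Hypothetical six-column sample; only columns 2 and 5 matter here.
sample=$(mktemp)
printf '%s\n' \
  '10 apple red x B 1.5' \
  '18 apple green x A 2.0' \
  '25 orange orange x B 0.9' > "$sample"

# Two awk processes: the second one filters the output of the first.
piped=$(awk '$2=="apple" { print $0 }' "$sample" | awk '$5=="B" { print $0 }')

# One awk process: both conditions are checked on each line in a single pass.
combined=$(awk '$2=="apple" && $5=="B" { print $0 }' "$sample")

echo "$piped"
echo "$combined"
rm -f "$sample"
```

On this sample the selected lines are the same; what differs is how many programs run and how often the data is read.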

## Defining separators

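By default, `awk` splits each line into columns at any run of spaces or tabs. Maybe we need to be explicit about separators, which is done by setting the variable `FS` (field separator)...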

`awk -v FS="\t" '$2=="apple" && $5=="B" { print $0 }' fileA.txt`

and

`awk -v FS="\t" '$2=="apple" { print $0 }' fileA.txt | awk -v FS="[ ]" '$2=="B" { print $0 }'`
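Why the separator matters can be seen with a field that contains a space. This is a sketch on a made-up, tab-separated line (the value "granny smith" is an assumption for illustration); `NF` is awk's built-in count of fields on the current line.

```shell
# Hypothetical tab-separated line; column 3 contains a space.
sample=$(mktemp)
printf '10\tapple\tgranny smith\tx\tB\t1.5\n' > "$sample"

# Default splitting: spaces AND tabs separate fields, so "granny smith"
# becomes two columns and everything after it shifts by one.
default_split=$(awk '{ print NF, $5 }' "$sample")

# Tab as the only separator: "granny smith" stays one column.
tab_split=$(awk -v FS="\t" '{ print NF, $5 }' "$sample")

# The same filter therefore finds the line only with the explicit separator.
default_hit=$(awk '$2=="apple" && $5=="B" { print $1 }' "$sample")
tab_hit=$(awk -v FS="\t" '$2=="apple" && $5=="B" { print $1 }' "$sample")

echo "default: $default_split"
echo "tab:     $tab_split"
rm -f "$sample"
```

With default splitting the line has seven fields and column 5 is `x`; with `FS="\t"` it has six and column 5 is `B`.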


## Pattern search and regular expressions

Now, how do we search for something that is contained in a column (rather than matching it exactly)?

`awk '$2=="apple" { print $0 }' fileA.txt`

`awk '$2~"apple" { print $0 }' fileA.txt`

* What is the difference?
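To check your answer, you could try both operators on a small made-up file. The sample data here is hypothetical and chosen so that one value merely contains the search word.

```shell
# Hypothetical sample: column 2 holds the fruit name.
sample=$(mktemp)
printf '%s\n' '10 apple' '11 pineapple' '12 orange' > "$sample"

# == compares the whole column with the string.
exact=$(awk '$2=="apple" { print $0 }' "$sample")

# ~ asks whether the pattern occurs anywhere inside the column.
contains=$(awk '$2~"apple" { print $0 }' "$sample")

echo "exact:"
echo "$exact"
echo "contains:"
echo "$contains"
rm -f "$sample"
```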

Now let's use regular expressions some more:

`awk '$2~"[aoi]pple" { print $0 }' fileA.txt`

and

`awk '$2~"^[aoi]pple" { print $0 }' fileA.txt`
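The effect of the character class `[aoi]` and the anchor `^` (start of the column) can be sketched on invented values. The entries "opple" and "grapple" are made up purely to show which strings each pattern accepts.

```shell
# Hypothetical sample with look-alike values in column 2.
sample=$(mktemp)
printf '%s\n' '1 apple' '2 pineapple' '3 opple' '4 grapple' > "$sample"

# [aoi]pple: an a, o or i followed by "pple", anywhere in the column.
unanchored=$(awk '$2~"[aoi]pple" { print $2 }' "$sample")

# ^[aoi]pple: the same, but the match must begin at the start of the column.
anchored=$(awk '$2~"^[aoi]pple" { print $2 }' "$sample")

echo "unanchored:"
echo "$unanchored"
echo "anchored:"
echo "$anchored"
rm -f "$sample"
```

Without the anchor all four values match, because "pineapple" and "grapple" contain "apple"; with the anchor only "apple" and "opple" remain.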

Note that `awk` does not know what a header is: it treats the first line like any other line! That needs to be considered when applying this tool.


* If you have a header, what may you do?
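One possible approach, sketched here on a made-up file with a header line, uses awk's built-in `NR` (the number of the current line). The header text and data are assumptions for illustration.

```shell
# Hypothetical sample whose first line is a header.
sample=$(mktemp)
printf '%s\n' 'count fruit' '10 apple' '12 orange' '20 pineapple' > "$sample"

# Plain filter: the header is tested like any other line and is dropped
# here, because "fruit" does not contain "apple".
plain=$(awk '$2~"apple" { print $0 }' "$sample")

# Keep the header: print line 1 unconditionally, filter the rest.
keep_header=$(awk 'NR==1 || $2~"apple" { print $0 }' "$sample")

# Skip the header: only look at lines after the first.
data_only=$(awk 'NR>1 { print $2 }' "$sample")

echo "$keep_header"
rm -f "$sample"
```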

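Caution with this symbol: `|`. Inside the quoted `awk` program it means "or" within a regular expression; it is not the pipe of the command line.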

`awk '$5~"A|C" { print $0 }' fileA.txt`

How about taking the opposite?

`awk '$5!~"A|C" { print $0 }' fileA.txt`

`awk '$5!="B" { print $0 }' fileA.txt`
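A short sketch of how these matches behave, using an invented file whose fifth column holds `A`, `B`, `C` and the combined value `AB` (all hypothetical). It prints column 1 so the selected rows are easy to tell apart.

```shell
# Hypothetical sample; column 1 is a row number, column 5 a category.
sample=$(mktemp)
printf '%s\n' '1 a b c A' '2 a b c B' '3 a b c C' '4 a b c AB' > "$sample"

# "A|C": column 5 contains an A or a C somewhere (so "AB" matches too).
either=$(awk '$5~"A|C" { print $1 }' "$sample")

# !~ negates the regular-expression match.
neither=$(awk '$5!~"A|C" { print $1 }' "$sample")

# != negates the exact comparison.
not_b=$(awk '$5!="B" { print $1 }' "$sample")

# Anchoring both ends restricts the match to exactly "A" or exactly "C".
exactly=$(awk '$5~"^(A|C)$" { print $1 }' "$sample")

echo "either: $(echo $either)"
echo "exactly: $(echo $exactly)"
rm -f "$sample"
```

Note that `$5!~"A|C"` and `$5!="B"` are not opposites of each other in general; on this sample they select complementary rows only because of how the values were chosen.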


## Comparing numbers

We can filter by comparison!

`awk '$1>=15 { print $0 }' fileA.txt`

`awk '$6>1.1 && $6<=2 { print $0 }' fileA.txt`
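The boundaries of `>=`, `>` and `<=` are easiest to see on values that sit exactly on the limits. The sample below is invented for this sketch and contains no header line.

```shell
# Hypothetical sample: column 1 is an integer, column 6 a decimal number.
sample=$(mktemp)
printf '%s\n' \
  '10 a b c A 1.1' \
  '15 a b c B 1.5' \
  '20 a b c C 2' \
  '30 a b c A 2.5' > "$sample"

# >= includes the limit itself, so the row with 15 is kept.
at_least_15=$(awk '$1>=15 { print $1 }' "$sample")

# > excludes 1.1, <= includes 2; && requires both conditions.
in_range=$(awk '$6>1.1 && $6<=2 { print $1 }' "$sample")

echo "at least 15: $(echo $at_least_15)"
echo "in range:    $(echo $in_range)"
rm -f "$sample"
```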

## Special characters are special

* Finally, we try to match the asterisk `*`. Does this work?

`awk '$6==* { print $0 }' fileA.txt`

Or this?

`awk '$6=="*" { print $0 }' fileA.txt`
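A small sketch to try this safely: the two-column sample is made up, and the `[*]` bracket form is one possible way (an addition, not from the commands above) to match a literal asterisk inside a regular expression.

```shell
# Hypothetical sample: column 2 is a number or a literal asterisk.
sample=$(mktemp)
printf '%s\n' '1 1.5' '2 *' '3 2.0' > "$sample"

# Unquoted *: awk reads it as the multiplication operator with a missing
# operand, which is a syntax error, so awk exits with a failure status.
if awk '$2==* { print $1 }' "$sample" >/dev/null 2>&1; then
  unquoted=ok
else
  unquoted=error
fi

# Quoted "*": an ordinary string, compared exactly.
quoted=$(awk '$2=="*" { print $1 }' "$sample")

# In a regular expression, * is special; inside brackets it is literal.
bracket=$(awk '$2~"[*]" { print $1 }' "$sample")

echo "unquoted: $unquoted, quoted: row $quoted, bracket: row $bracket"
rm -f "$sample"
```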

Or this: 

`grep "." fileA.txt`

versus this: 

`grep "\." fileA.txt`
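To see the difference between the two `grep` calls, it helps to count matching lines with `-c` on a file where only one line contains a dot. The three-line sample (including an empty line) is an assumption for this sketch; `-F`, which treats the pattern as a fixed string, is shown as an alternative to escaping.

```shell
# Hypothetical sample: one line without a dot, one with a dot, one empty.
sample=$(mktemp)
printf '%s\n' 'no dot here' 'value 1.5' '' > "$sample"

# "." is a regular expression meaning "any single character":
# every non-empty line matches.
any_char=$(grep -c "." "$sample")

# "\." is an escaped dot: only lines containing a literal "." match.
literal_dot=$(grep -c "\." "$sample")

# -F switches off regular expressions, so "." is taken literally.
fixed_string=$(grep -c -F "." "$sample")

echo "any character: $any_char"
echo "literal dot:   $literal_dot"
echo "fixed string:  $fixed_string"
rm -f "$sample"
```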


## Now is the time to ask fundamental questions!

## Next, we will go for the first challenge!
