# Brief recap

File system & navigation: `cd `, `ls`, `less`, `cat` etc.; `cp` etc.

Subsetting: `grep "pattern" file`, `cut -f1-2 -d " " file`, `awk 'pattern {subset}'` - complex subsetting

Patterns, special characters, regular expressions: `==`, `&&`, `*`, `[abcd]`, `A|B` etc.

**Anything unclear so far?**

From now on, we stop working with the apples and oranges in the small text files in the subdirectories of this course. Instead, we will work on **real data** (big files) in the main home directory. Keep these files throughout the next sessions, as things build on each other. The notebooks may be updated in between sessions, and any changes you made within the notebooks may be overwritten.

In order to go there, do `cd ` without further input, and then let's go!


# Reference genomes & raw sequencing data 

We know what a reference genome is. The file format is called fasta; the files are plain text (or compressed with `.gz`) and usually have the extension `.fa`.

Apart from the header lines starting with `>`, this file is nothing but a string of letters, representing a sequence (e.g. the human genome of ~3.1 Gbp). It is too big for everyone to download individually, so we all use one shared copy in one directory.

```
less /home/local/ANTHROPOLOGY/kuhlwilmm83/refgen/hg19/hg19.p13.plusMT.no_alt_analysis_set.fa.gz
```

We should do some inspections here!
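
A few inspections one might start with are sketched below. To keep the sketch runnable anywhere, it builds a tiny made-up fasta file (`toy.fa.gz`, a hypothetical name with invented content) instead of reading the real hg19 file. On the cluster, the same commands should work with the hg19 path in place of `$fa`, but expect them to take a while on a file of that size.

```shell
# Build a tiny, made-up fasta file so the commands can be tried anywhere.
# (toy.fa.gz and its contents are invented for illustration.)
workdir=$(mktemp -d)
fa="$workdir/toy.fa.gz"
printf '>chr1\nACGTACGTAC\nGGTTAACC\n>chrMT\nNNNNACGT\n' | gzip > "$fa"

# Look at the first lines without decompressing the file on disk.
zcat "$fa" | head -n 3

# Count the sequences: every sequence starts with a ">" header line.
n_seqs=$(zcat "$fa" | grep -c '^>')
echo "sequences: $n_seqs"

# List the sequence names (header lines without the ">").
seq_names=$(zcat "$fa" | grep '^>' | tr -d '>')
echo "$seq_names"

# Count the bases: drop header lines, remove newlines, count characters.
n_bases=$(zcat "$fa" | grep -v '^>' | tr -d '\n' | wc -c)
echo "bases: $n_bases"
```

Piping through `zcat` keeps the file compressed on disk, which matters for a genome-sized file.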

*Now, let's get some data!* Fastq files contain the raw sequencing data. Here is a link to a small example data set.

```
wget https://ucloud.univie.ac.at/index.php/s/BDxyMZaGyKedssT/download

tar -zxvf download

less test.fastq.gz
```

So, you see here:

Header: read id, sequence info, read pair

Sequence: ATGCGCGTATCGATGCTATGC… bla bla

Qualities: confidence for each base called (ASCII)
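
To make the ASCII encoding of the qualities concrete, here is a minimal sketch that decodes a single quality character. It assumes the Phred+33 encoding (the one the `-phred33` option of the trimming command refers to), in which the Phred score is the ASCII code of the character minus 33. The character `I` is just an example chosen for illustration.

```shell
# Decode one quality character, assuming the Phred+33 encoding.
qual_char='I'                            # example character, chosen for illustration
ascii_code=$(printf '%d' "'$qual_char")  # numeric ASCII code of the character
phred=$((ascii_code - 33))               # Phred score = ASCII code - 33
echo "$qual_char -> ASCII $ascii_code -> Phred $phred"
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q = 40 means one expected error in 10,000 base calls.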

**What do we do now?**

* Count the number of reads, and the number of lines, in the fastq file. How do they relate?

* Look at the start and end of the file.

* Get the unique reads (hint: NR%4==2)

* Search for patterns. E.g. "CGTATGCCGTCTTCTGCTTG" - what is going on? And why does it matter?
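
One possible way to approach these tasks is sketched below. A fastq record takes four lines (header, sequence, a `+` separator line, qualities), which is why the hint uses `NR%4==2` to pick the sequence lines. The sketch runs on a made-up four-read file (`toy.fastq.gz`, a hypothetical name) so the numbers can be checked by hand; for the real data, replace `$fq` with `test.fastq.gz`.

```shell
# Build a tiny, made-up fastq file: four reads, two of them identical,
# one containing the search pattern from the last task.
workdir=$(mktemp -d)
fq="$workdir/toy.fastq.gz"
pattern="CGTATGCCGTCTTCTGCTTG"
{
  printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n'
  printf '@read2\nACGTACGTAC\n+\nIIIIIIIIII\n'
  printf '@read3\nTTGGCCAATT\n+\nIIIIIIIIII\n'
  printf '@read4\nGGA%s\n+\nIIIIIIIIIIIIIIIIIIIIIII\n' "$pattern"
} | gzip > "$fq"

# Count lines and reads: four lines per read.
n_lines=$(zcat "$fq" | wc -l)
n_reads=$((n_lines / 4))
echo "lines: $n_lines, reads: $n_reads"

# Look at the start and the end of the file (one record each).
zcat "$fq" | head -n 4
zcat "$fq" | tail -n 4

# Unique reads: keep only the sequence lines, then collapse duplicates.
n_unique=$(zcat "$fq" | awk 'NR%4==2' | sort -u | wc -l)
echo "unique sequences: $n_unique"

# Count the reads whose sequence contains the pattern.
n_pattern=$(zcat "$fq" | awk 'NR%4==2' | grep -c "$pattern")
echo "reads with pattern: $n_pattern"
```

Filtering for the sequence lines first avoids accidental matches in header or quality lines.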


# Adapter trimming

![image.png](attachment:image.png)

For trimming, we use a tool called `trimmomatic`, one of several possible tools for this task. It is implemented in a language called `java`, and we will not look at what that looks like. On the good side, it is a command line tool, so you can use it the same way as `grep` or other basic commands: you provide parameters and say where to take the input from and where to write the output. These are some standard parameters (but of course, with new data, you may need to make adjustments):

```
java -jar /opt/trimmomatic.jar SE -phred33 test.fastq.gz \
test_p.fastq.gz \
ILLUMINACLIP:/home/local/ANTHROPOLOGY/kuhlwilmm83/refgen/adapters/TruSeq2-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
``` 

* Let's count the reads before and after this step!
* What is the TruSeq2-SE.fa file?
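
For the read counts, a small helper function can save some typing. The sketch below is one way to do it; `count_reads` is a hypothetical helper name, and the two toy files are made up to stand in for `test.fastq.gz` (before trimming) and `test_p.fastq.gz` (after trimming).

```shell
# Hypothetical helper: number of reads in a gzipped fastq file
# (total number of lines divided by four).
count_reads() {
  zcat "$1" | awk 'END { print NR / 4 }'
}

# Made-up stand-ins for the files before and after trimming.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\n####\n' | gzip > "$workdir/before.fastq.gz"
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' | gzip > "$workdir/after.fastq.gz"

before=$(count_reads "$workdir/before.fastq.gz")
after=$(count_reads "$workdir/after.fastq.gz")
echo "reads before: $before, after: $after, removed: $((before - after))"
```

On the cluster, `count_reads test.fastq.gz` and `count_reads test_p.fastq.gz` would give the real numbers.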

Now we have performed the first important step in raw data filtering, congrats!


# FASTQC

Before moving on, let's have a quick check of the raw data quality! In order to do that, we can use a program called FastQC. It is already installed on the cluster, and we can run it easily:

``` 
mkdir fastqc-output
/opt/FastQC/fastqc -o ./fastqc-output test.fastq.gz
```

This will create a report on data quality in `html` format, which we can inspect.

Now we have made sure the data looks good.

## Next time: mapping to the reference genome