# Recap


Raw sequencing data: `.fastq` format 

Removing adapters: `TRIMMOMATIC`

Mapping: `bwa`, from sequencing data to `.bam` format

`samtools` program

`samtools view` - look at (binary) file

`samtools sort` - nicely sort reads along the chromosomes

`samtools index` - create an index file

`samtools stats` - create statistics for `.bam` file


# Now: More to do with samtools

## Quality filtering
We can use `samtools` to do quality filtering. Assume you want to keep only reads with a length of at least 35 base pairs (why would that be the case?) and a mapping quality score of at least 25. Then, the way to go is:

```
samtools view -m 35 -q 25 test.sorted.bam
```

Now, count the number of reads, and compare to the unfiltered number of reads. What do you observe?

How about using `samtools view -c`?
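To make the effect of the two thresholds concrete, the sketch below mimics the filter on a few made-up SAM records using plain `awk`. The helper names `toy_read` and `count_kept` and the file `toy.sam` are hypothetical, and read length is approximated by the length of the sequence column, so treat it as an illustration rather than a replacement for `samtools`.

```shell
# Emit one tab-separated toy SAM record (made-up data, not from the course files).
# Arguments: read name, MAPQ, read length.
toy_read() {
  bases=$(printf "%${3}s" '' | tr ' ' 'A')   # dummy sequence of the requested length
  quals=$(printf "%${3}s" '' | tr ' ' 'I')   # dummy base qualities
  printf '%s\t0\tchr1\t100\t%s\t%sM\t*\t0\t0\t%s\t%s\n' "$1" "$2" "$3" "$bases" "$quals"
}

# Count records with read length >= $1 and MAPQ >= $2, reading SAM text from stdin.
# Header lines (starting with @) are skipped; column 5 is MAPQ, column 10 is the sequence.
count_kept() {
  awk -F'\t' -v minlen="$1" -v minq="$2" \
    '!/^@/ && length($10) >= minlen && $5 >= minq { n++ } END { print n + 0 }'
}

{
  toy_read r1 60 40   # long and well mapped: kept
  toy_read r2 60 30   # too short: dropped
  toy_read r3 10 40   # low mapping quality: dropped
  toy_read r4 25 35   # exactly at both thresholds: kept
} > toy.sam

total=$(count_kept 0 0 < toy.sam)
kept=$(count_kept 35 25 < toy.sam)
echo "kept $kept of $total reads"
```

On the real data, `samtools view -c test.sorted.bam` prints the unfiltered count and `samtools view -c -m 35 -q 25 test.sorted.bam` the filtered one.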

## Merging

Often, you will have sequencing data from different sequencing runs. Usually, you want to treat each of them separately for adapter cutting, mapping and QC, but merge them for downstream analyses. Simply appending one file to the end of the other, however, may lead to problems, because each `.bam` file carries its own header and metadata. `samtools` has a function that deals with this and creates properly merged datasets.

```
samtools merge -o test.merged.bam test.sorted.bam test3.sorted.bam 
samtools index test.merged.bam 
```

These commands merge the two files and index the result. In our case, this is not actually a good idea, because the two files come from different data sources. So keep in mind that you should know what you are doing!


## Looking

Now we can have a look at the data, e.g. at a region near the beginning of chromosome 1. `samtools` also lets you scroll through the alignments, although you may not want to do this for the whole genome:
 
```
samtools tview -p chr1:10000 test.merged.bam 
```


# Duplicates

Library preparation for sequencing includes a PCR step, which leads to some PCR duplicates in the data. These duplicates are not biologically meaningful, so one may want to remove them. There are different strategies: some tools just mark duplicates but leave them in the data, while others remove them outright.
One of the tools for this purpose is the Genome Analysis Toolkit (GATK): `gatk MarkDuplicates`. We will work more with this toolkit later on. It is one of the main toolkits for many tasks related to genetic data, and identifying duplicates is an important step.

```
gatk MarkDuplicates -I test3.sorted.bam -O test3.markdup.sorted.bam -M metrics.txt
```

What happened to the data? Let's inspect the metrics file!
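The metrics file is plain text, so `less metrics.txt` is enough to read it. As a rough sketch for pulling out single values, the snippet below assumes the usual Picard-style layout, in which a line starting with `## METRICS CLASS` is followed by one line of column names and one line of values. The function name `show_metrics` and the toy file, with its shortened column set and made-up numbers, are hypothetical.

```shell
# Print each metric as "NAME<TAB>VALUE" from a Picard-style metrics file.
# Assumption: column names follow the "## METRICS CLASS" line, values come on the next line.
# Only the first row of values is shown.
show_metrics() {
  awk '
    /^## METRICS CLASS/ {
      getline; n = split($0, name, "\t")   # line with the column names
      getline; split($0, value, "\t")      # line with the matching values
      for (i = 1; i <= n; i++) print name[i] "\t" value[i]
      exit
    }
  ' "$1"
}

# Toy metrics file (made-up values, only three of the real columns)
printf '## METRICS CLASS\tpicard.sam.DuplicationMetrics\n' >  toy_metrics.txt
printf 'LIBRARY\tUNPAIRED_READS_EXAMINED\tPERCENT_DUPLICATION\n' >> toy_metrics.txt
printf 'lib1\t1000\t0.05\n' >> toy_metrics.txt

# Extract a single metric by name
pct_dup=$(show_metrics toy_metrics.txt | awk -F'\t' '$1 == "PERCENT_DUPLICATION" { print $2 }')
echo "fraction of duplicates: $pct_dup"
```

With the real output, `show_metrics metrics.txt` would list all columns one per line, which is easier to read than the wide table.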

Now, let's try the following:

```
gatk MarkDuplicates --REMOVE_DUPLICATES -I test3.sorted.bam -O test3.rmdup.sorted.bam -M metrics2.txt
```


# Let's have a look at the coverage! 

Do:

```
mosdepth -x -n -f /home/local/ANTHROPOLOGY/kuhlwilmm83/refgen/hg19/hg19.p13.plusMT.no_alt_analysis_set.fa test3 test3.markdup.sorted.bam
```

What happens and why?

Now we do it properly and inspect the output!

Is this expected? What does it tell us about the amount of data needed in such an experiment? We can estimate this from the size of the human genome.
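A back-of-the-envelope version of this estimate: the expected mean coverage is the total number of sequenced bases divided by the genome size. The sketch below uses assumed, hypothetical inputs (one million reads of 100 bp, a genome of roughly 3.1 billion bp, and a 30x target), so replace them with the numbers from your own data.

```shell
# Assumed inputs (hypothetical values, replace with your own numbers)
n_reads=1000000          # number of mapped reads
read_len=100             # average read length in bp
genome_size=3100000000   # approximate size of the human genome in bp

# Expected mean coverage = total sequenced bases / genome size
coverage=$(awk -v n="$n_reads" -v l="$read_len" -v g="$genome_size" \
  'BEGIN { printf "%.4f", n * l / g }')

# Number of reads needed to reach a target mean coverage (30x is an assumed target)
target=30
reads_needed=$(awk -v t="$target" -v l="$read_len" -v g="$genome_size" \
  'BEGIN { printf "%.0f", t * g / l }')

echo "expected coverage: ${coverage}x"
echo "reads needed for ${target}x: ${reads_needed}"
```

This ignores duplicates, unmapped reads and uneven coverage, so real experiments typically need more data than this simple estimate suggests.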


# Something on ancient DNA

Let's compare the statistics for modern vs. ancient DNA!

The data in the `test3.*` files is from a modern individual.

[Here](https://ucloud.univie.ac.at/index.php/s/4XSRnpreQxFC6KD) you can find data from an ancient individual.

* Download and inspect it!
