# Recap

Data comes in different formats: `.fa.gz` for genomes, `.fastq.gz` for raw reads, `.bam` for mapped reads, `.vcf.gz` for genotype calls.

* Format means that specific elements are expected in a specific order.
* Reference genomes have different versions - always know which one you use!

Command-line tools can filter such data, and there are specialised tools for each format: `samtools` for mapped reads, `gatk MarkDuplicates` for marking duplicate reads, `gatk HaplotypeCaller` for genotype calling.

Variant calls can be for all basepairs, for only SNPs, or in a condensed genome-wide format ("gvcf") - but all are based on the convention of the `VCF` file format. `VCF` files can contain many individuals.

One important toolkit to work with the files is `bcftools`.

`bcftools concat` to add rows (genomic positions for the same individuals), `bcftools merge` to add columns (individuals for the same positions)

`bcftools view -r ` to obtain regions

`bcftools view -S ` to obtain individuals (listed in a file of sample names)

`bcftools view -a ` to trim alleles

`bcftools filter` to perform a lot of filtering

Now for something slightly different. We have looked at sequencing data, and often we obtain numbers. It is always a good idea to visualise some of this. How do you do this from the command line (assuming you are not on your laptop but a remote server)?


# Some data visualisation

## Proportions of sequencing reads

Assume you store information on your raw data in files. You have 20 sequencing experiments, and obtain values for

* raw sequencing reads
* reads after adapter removal
* reads mapped
* reads with sufficient quality (e.g. -m 35 -q 30)
* reads after duplicate removal

These proportions tell you if something went wrong - and where it went wrong. A problem in library preparation may lead to high adapter abundance. DNA degradation may lead to a small number of mapped reads, short reads or reads of low quality. PCR overamplification may lead to high numbers of duplicates, etc. It is useful to look at this visually to check what is going on.

Here is a table with the values: `~/bioinfo-course/notebooks/12_seq_viz/data/seqdata_summaries.tsv`

For visualisation, working in R is convenient. Unfortunately, RStudio does not run on the server with the current platform, so in this course we work in the R command-line interface.

This also means that plots do not show up immediately; you need to save them to a file.

However, if you have RStudio on your computer, you may download the material and use it there.

```
R --vanilla
```

Now we load the table:

```
seqdata<-read.table("~/bioinfo-course/notebooks/12_seq_viz/data/seqdata_summaries.tsv", sep="\t",header=T)
```

* A first step is to get the proportion of reads retained at each stage, for all samples at once rather than one by one.

```
seqdata[,-1]/seqdata[,2]
```
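
Why this one-liner works: when a data frame is divided by a vector with one value per row, R divides each column element by element, so every row ends up divided by its own raw read count. A minimal sketch with a made-up toy table (the sample names and counts are invented for illustration, not taken from the course file):

```r
# Toy table with invented read counts (NOT the course data)
toy <- data.frame(
  sample = c("A", "B"),
  raw    = c(1000, 2000),
  mapped = c(800, 500),
  dedup  = c(600, 100)
)

# Drop the sample-name column, then divide the remaining columns by the raw counts.
# toy[, 2] has one value per row, so each row is divided by its own raw count.
prop <- toy[, -1] / toy[, 2]
print(prop)
```

The `raw` column becomes 1 for every sample, which is a quick check that the division used the right column.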

* Maybe easier to read as percentage?

```
round(seqdata[,-1]/seqdata[,2]*100,1)
```

As you may see, there are problematic samples, where you lose a large proportion of reads in the later filtering steps. Some might be ancient DNA, some might be degraded, some might be overamplified by PCR.

Let's visualise it! An easy way with base R is a barplot, showing the loss of data in each sample. For that, you need to turn the table into a matrix and transpose it (exchange columns and rows).

* Let's do it!

```
bdata<-t(as.matrix(seqdata[,-1]))

png("barplot.png",800,400)
barplot(bdata,beside=T,names.arg=seqdata[,1])
dev.off()
```
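
The transpose matters because `barplot` with `beside=T` draws one group of bars per matrix column and one bar per matrix row. A small sketch with an invented toy table (names and counts are assumptions for illustration) shows how the shape changes:

```r
# Toy table with invented read counts (NOT the course data)
toy <- data.frame(
  sample = c("A", "B", "C"),
  raw    = c(1000, 2000, 1500),
  mapped = c(800, 500, 1200)
)

# as.matrix() on the numeric columns: samples in rows, steps in columns
m <- as.matrix(toy[, -1])
dim(m)        # 3 samples x 2 steps

# t() swaps rows and columns: steps in rows, samples in columns
bm <- t(m)
dim(bm)       # 2 steps x 3 samples
rownames(bm)  # the step names, usable as legend text
```

After the transpose, each sample is one column and therefore one group of bars, with one bar per processing step.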

How about a legend?

```
png("barplot.png",800,400)
barplot(bdata,beside=T,names.arg=seqdata[,1],legend.text=colnames(seqdata)[-1])
dev.off()
```

Ok, this looks ugly... We may need to adapt the plot a little bit to give it enough space (`ylim`), and change the legend a bit (`args.legend` for making it horizontal and located in the middle):

```
png("barplot.png",800,400)
barplot(bdata,beside=T,names.arg=seqdata[,1],legend.text=colnames(seqdata)[-1],args.legend=list(x="top",horiz=T),ylim=c(0,7200000))
dev.off()
```

As you can see, some samples visually stick out at different steps. If this were a screening, where you want to know how well a handful of samples work for sequencing, you could now choose which ones to continue with. This is very nice!

* From what we know about aDNA, such samples tend to lose a lot of data at each step; this might be the case for two samples.

* One sample has most data removed as unmapped - likely a sample with very little human DNA.

* One sample has a sharp drop when removing low quality mappings - this might be due to ultra-short reads?

* One sample loses most of the data when removing duplicates - this could be the case after over-amplification in a PCR.


## Coverage distribution

Let's take the coverage from the chr18 data from a previous challenge! This is the `bcftools stats` output, and again we can use `awk` to subset the information which we need, the sequencing depths (`DP`).

```
awk '$1=="DP"' ~/bioinfo-course/notebooks/12_seq_viz/data/chr18stats.txt > ~/test/DP.txt
```

Note that you operate here in `bash` to run certain analyses, and in `R` to visualise. *These are different environments with different structures and commands!* You need to be aware of this, and may use two parallel Terminals.

* Reading in the table the same way:

```
covdata<-read.table("~/test/DP.txt", sep="\t",header=T)
```

* Oh no, another problem: the first line is data, not a header. What to do?
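
One way to see what goes wrong is to read a few lines both ways. The sketch below uses three invented, tab-separated lines that only mimic the layout of the `DP` rows (all numbers are made up), read through the `text=` argument of `read.table` instead of a file:

```r
# Three invented lines in the layout of the DP rows (tab-separated, no header line)
dp_lines <- c(
  "DP\t0\t1\t0\t0.0\t150\t15.0",
  "DP\t0\t2\t0\t0.0\t350\t35.0",
  "DP\t0\t>500\t0\t0.0\t5\t0.5"
)

# header=TRUE uses the first data row as column names, so that row is lost
with_header <- read.table(text = dp_lines, sep = "\t", header = TRUE)

# header=FALSE keeps every row; columns get the default names V1, V2, ...
no_header <- read.table(text = dp_lines, sep = "\t", header = FALSE)

nrow(with_header)
nrow(no_header)
colnames(no_header)
```

For the file written above, the same change would be `header=F` in the `read.table` call.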

Here, we only need the third and the sixth column, i.e. the depth and the number of sites at each depth. We can draw a line plot from these, right? As before, wrap the plotting call in `png()` and `dev.off()` to save the figure on the server.

```
plot(x=covdata[,3],y=covdata[,6],type="l")
```

Ok, not nice, because almost no positions are at depth >25.

```
covdata2<-covdata[which(as.numeric(covdata[,3])<=25),]
plot(x=covdata2[,3],y=covdata2[,6],type="l")
```
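
One detail in the filter above: the depth column may contain a non-numeric bin (very high depths can be collected in a bin such as `>500`), so `as.numeric()` turns that entry into `NA` and `which()` then skips it. A minimal sketch on an invented toy table (the column names `bin` and `sites` and all values are assumptions for illustration):

```r
# Toy depth table with invented values; the last bin is not a number
toy_cov <- data.frame(
  bin   = c("1", "2", "30", ">500"),
  sites = c(150, 350, 20, 5)
)

# ">500" cannot be converted and becomes NA; R would warn here,
# suppressWarnings() only silences that message
depth <- suppressWarnings(as.numeric(toy_cov$bin))

# which() returns only the positions where the test is TRUE, skipping NA
keep <- which(depth <= 25)
toy_cov[keep, ]
```

Without `which()`, the `NA` would produce an extra row of `NA` values in the subset.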

Much better! Finally, let's make the plot a bit nicer:

* a thick line between dots
* axis labels
* a nice title
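
The three points above could be put together roughly as follows. This is only a sketch: the toy depths and site counts are invented, the output file name and the label texts are placeholders, and with the course data you would pass `covdata2[,3]` and `covdata2[,6]` instead.

```r
# Invented toy values standing in for depth and number of sites
toy_depth <- 1:10
toy_sites <- c(5, 20, 60, 120, 150, 130, 80, 40, 15, 5)

# Save to a file, since plots do not show up on the server
out <- file.path(tempdir(), "coverage.png")
png(out, 800, 400)
plot(x = toy_depth, y = toy_sites,
     type = "b",   # points joined by lines
     lwd  = 3,     # thicker line
     pch  = 16,    # filled dots
     xlab = "Sequencing depth",
     ylab = "Number of sites",
     main = "Coverage distribution")
dev.off()
```

`type="o"` would draw the line through the dots instead of leaving small gaps around them; which one looks nicer is a matter of taste.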

Now, we can add the coverage distribution of another sample, as provided here:


## Heterozygosity

Now, we may go back to the 1000 genomes VCF file we used earlier, and obtain heterozygous positions. 

```
tbd
```


* Let's plot the results!

```
tbd
```

