# Recap

Data comes in different formats: `.fa.gz` for genomes, `.fastq.gz` for raw reads, `.bam` for mapped reads, `.vcf.gz` for genotype calls.

* Format means that specific elements are expected in a specific order.
* Reference genomes have different versions - always know which one you use!

General command line tools can filter such files, and specialised tools exist for this kind of data: `samtools` for mapped reads, `gatk MarkDuplicates` for marking duplicate reads, `gatk HaplotypeCaller` for genotype calling.

Variant calls can cover all base pairs, only SNPs, or the whole genome in a condensed form ("gvcf"), but all of them follow the convention of the `VCF` file format. A `VCF` file can hold many individuals.

One important toolkit to work with the files is `bcftools`.

`bcftools concat` to add rows (genomic positions for the same individuals), `bcftools merge` to add columns (individuals for the same positions)
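A minimal sketch of these two commands, wrapped in small helpers. The helper names and all file names are hypothetical; `-Oz` writes bgzipped output and `-o` names the output file. Note that `bcftools merge` expects bgzipped and indexed input files.

```shell
# Hypothetical helpers; the first argument is the output file,
# all remaining arguments are input VCF files.

# Add rows: stack files covering different positions of the SAME individuals,
# e.g. one file per chromosome, given in genomic order.
concat_chromosomes() {
    out=$1; shift
    bcftools concat -Oz -o "$out" "$@"
}

# Add columns: join files holding DIFFERENT individuals.
merge_individuals() {
    out=$1; shift
    bcftools merge -Oz -o "$out" "$@"
}

# Usage (placeholder file names):
#   concat_chromosomes all_chr.vcf.gz chr1.vcf.gz chr2.vcf.gz
#   merge_individuals both_groups.vcf.gz groupA.vcf.gz groupB.vcf.gz
```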

`bcftools view -r ` to obtain regions

`bcftools view -S ` to obtain individuals

`bcftools view -a ` to trim alternate alleles that are not seen in the selected individuals

`bcftools filter` to include or exclude sites with expressions (`-i` / `-e`), for example on the `QUAL` column
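The `view` and `filter` commands can be chained. The sketch below is one possible combination, not a recommended pipeline: the helper name, the file names, the region and the `QUAL<30` cut-off are placeholders chosen for illustration. Selecting a region with `-r` requires an indexed input file.

```shell
# Hypothetical helper: subset_vcf INPUT REGION SAMPLE_LIST OUTPUT
subset_vcf() {
    vcf=$1; region=$2; samples=$3; out=$4
    # Select region and individuals, trim unused alternate alleles,
    # and pass uncompressed BCF (-Ou) down the pipe.
    bcftools view -r "$region" -S "$samples" -a -Ou "$vcf" |
        # Exclude sites below an arbitrary example quality threshold.
        bcftools filter -e 'QUAL<30' -Oz -o "$out" -
    # Index the result so it can be queried by region later.
    bcftools index -t "$out"
}

# Usage (placeholder names):
#   subset_vcf input.vcf.gz chr1:1000000-2000000 samples.txt subset.vcf.gz
```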

Now for something slightly different. We have looked at sequencing data, and often we obtain numbers. It is always a good idea to visualise some of this. How do you do this from the command line (assuming you are not on your laptop but a remote server)?


# Some data visualisation

## Proportions of sequencing reads

Assume you store information on your raw data in files. You have 20 sequencing experiments, and obtain values for

* raw sequencing reads
* reads after adapter removal
* reads mapped
* reads with sufficient quality (e.g. -m 35 -q 30)
* reads after duplicate removal

These proportions tell you whether something went wrong, and at which step. A problem in library preparation may lead to high adapter abundance. DNA degradation may lead to a small number of mapped reads, short reads or reads of low quality. PCR overamplification may lead to high numbers of duplicates, etc. It is useful to look at this visually to check what is going on.

Here is a table with the values: `~/bioinfo-course/notebooks/12_seq_viz/data/seqdata_summary.tsv`

For visualisation, working in R is nice. Unfortunately, RStudio does not run on the server with the current platform, so we work in the R command line interface instead:

```
R --vanilla
```

Now we load the table:

```
seqdata<-read.table("~/bioinfo-course/notebooks/12_seq_viz/data/seqdata_summary.tsv", sep="\t",header=T,as.is=T)
```

A first step is to get the proportions of reads retained at each stage, calculated for all samples at once rather than one by one.

```
tbd
```
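As a quick cross-check outside R, the same proportions can be sketched in `bash` with `awk` (run this in the shell, not inside the R session). The column layout is an assumption, since the table itself is not shown here: a header line, the sample name in column 1, raw reads in column 2, and the read counts after each later step in the remaining columns. The helper name is hypothetical.

```shell
# Hypothetical helper: divide every count column by the raw read count.
# ASSUMED layout: header line, column 1 = sample, column 2 = raw reads,
# columns 3 and up = reads remaining after each later step.
read_proportions() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 { print; next }          # keep the header unchanged
        {
            raw = $2                     # remember raw reads before overwriting
            for (i = 2; i <= NF; i++)
                $i = (raw > 0) ? sprintf("%.3f", $i / raw) : "NA"
            print
        }' "$1"
}

# Usage:
#   read_proportions ~/bioinfo-course/notebooks/12_seq_viz/data/seqdata_summary.tsv
```

Each row then shows the fraction of the raw reads that survives each step, so a sample that drops sharply at one column points to the step where it went wrong.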

As you may see, there are some problematic samples. Some might be ancient DNA, some might be degraded, some might be overamplified by PCR.

* Let's visualise it.

```
plot tbd
```

As you can see, there are samples that visually stick out at different points. If this were a screening run to find out how well a set of samples works for sequencing, you could now choose which ones to continue with. This is very nice!


## Coverage distribution

```
tbd
```
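One possible way to prepare a coverage distribution in `bash`, assuming mapped reads in a BAM file (the name `sample.bam` is a placeholder): `samtools depth -a` prints chromosome, position and depth for every position, and the hypothetical helper below counts how many positions share each depth.

```shell
# Hypothetical helper: read "chrom <TAB> pos <TAB> depth" lines from stdin
# and print "depth <TAB> number of positions", sorted by depth.
depth_histogram() {
    awk -F'\t' -v OFS='\t' '
        { n[$3]++ }                          # count positions per depth value
        END { for (d in n) print d, n[d] }' |
        sort -k1,1n
}

# Usage (placeholder file names):
#   samtools depth -a sample.bam | depth_histogram > sample.depth_hist.tsv
```

The resulting two-column table is small and can be loaded into R with `read.table` for plotting.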


## Heterozygosity

Now we can take the 1000 Genomes VCF file and obtain the heterozygous positions:

```
tbd
```
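A sketch of one way to count heterozygous genotypes per individual in `bash`, assuming a multi-sample VCF (the file name `input.vcf.gz` is a placeholder, not the course file). `bcftools query` prints one line per individual and site, and `awk` counts the genotypes whose two alleles differ. The helper name is hypothetical.

```shell
# Hypothetical helper: read "sample <TAB> genotype" lines from stdin and print
# "sample <TAB> heterozygous genotypes <TAB> called genotypes".
count_hets() {
    awk -F'\t' -v OFS='\t' '
        {
            called[$1] += 0; het[$1] += 0        # make sure every sample is listed
            n = split($2, a, "[/|]")             # unphased (/) or phased (|) genotype
            if (n == 2 && a[1] != "." && a[2] != ".") {
                called[$1]++                     # both alleles are called
                if (a[1] != a[2]) het[$1]++      # alleles differ: heterozygous
            }
        }
        END { for (s in called) print s, het[s], called[s] }' |
        sort -k1,1
}

# Usage (placeholder file names):
#   bcftools query -f '[%SAMPLE\t%GT\n]' input.vcf.gz | count_hets > het_counts.tsv
```

Dividing the second column by the third gives the proportion of heterozygous genotypes among the called sites for each individual.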

Note that you operate in `bash` to run certain analyses, and in `R` to visualise. These are different environments!

* Let's plot the results!

```
tbd
```

