# Recap

Data comes in different formats: `.fa.gz` for genomes, `.fastq.gz` for raw reads, `.bam` for mapped reads, `.vcf.gz` for genotype calls.

* Format means that specific elements are expected in a specific order.
* Reference genomes have different versions - always know which one you use!

Besides general command-line tools for filtering, there are specialised tools for such data: `samtools` for mapped reads, `gatk MarkDuplicates` for flagging duplicate reads, `gatk HaplotypeCaller` for genotype calling.

Variant calls can cover all basepairs, only SNPs, or come in a condensed genome-wide format ("gvcf"), but all follow the convention of the `VCF` file format. `VCF` files can contain many individuals.

One important toolkit to work with the files is `bcftools`.

`bcftools concat` to add rows (genomic positions for the same individuals), `bcftools merge` to add columns (individuals for the same positions).

`bcftools view -r` to obtain regions.

`bcftools view -S` to obtain individuals.


## Quality filtering

### Subsetting SNPs by quality

When analysing data, we usually want to use only variants that meet certain conditions, for example variants of good quality.

* As before, we may extract biallelic SNPs that passed the quality checks.

```
bcftools view chr18.FIN.vcf.gz -f PASS -m 2 -M 2 -v snps | bgzip -c > chr18.FIN.biallelic.snps.vcf.gz &
bcftools view chr18.IBS.vcf.gz -f PASS -m 2 -M 2 -v snps | bgzip -c > chr18.IBS.biallelic.snps.vcf.gz
```

If we have a look at these files, we can see that they contain a lot of positions where all individuals within the population are `0/0` (or `0|0`). If we are interested in what is going on within the population, we may restrict the data to sites that are polymorphic within it.
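To get a feeling for how many sites are homozygous reference in every individual, a rough sketch with `awk` may help. The toy VCF and its sample names (`S1` to `S3`) are made up for illustration, and the sketch assumes that the genotype is the first `FORMAT` subfield.

```shell
# Toy VCF with three hypothetical samples (tab-separated via printf).
toy=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
  printf '18\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|0\t0|0\n'
  printf '18\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|1\t0|0\t1|1\n'
  printf '18\t300\t.\tG\tA\t.\tPASS\t.\tGT\t0/0\t0/0\t0/0\n'
} > "$toy"

# Count sites where every sample genotype is homozygous reference.
n_mono=$(awk -F'\t' '
  /^#/ { next }                      # skip header lines
  {
    mono = 1
    for (i = 10; i <= NF; i++) {     # sample columns start at column 10
      split($i, f, ":")              # assumption: GT is the first subfield
      if (f[1] != "0|0" && f[1] != "0/0") { mono = 0; break }
    }
    if (mono) n++
  }
  END { print n + 0 }
' "$toy")
echo "monomorphic reference sites: $n_mono"
rm -f "$toy"
```

On real data, the same `awk` program could read from `bcftools view -H chr18.FIN.biallelic.snps.vcf.gz`, which prints the records without the header.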


### Restricting data to polymorphic sites

Looking at the [documentation](https://samtools.github.io/bcftools/bcftools.html#view) of `bcftools view`, we find the flag `-a` to trim unseen alternative alleles. In this case, one should first trim the alleles, and afterwards filter for biallelic SNPs.

* Let's trim and filter, and count how many sites are left in each case!

```
bcftools view chr18.FIN.vcf.gz -a | bcftools view -f PASS -m 2 -M 2 -v snps | bgzip -c > chr18.FIN.true_biallelic.snps.vcf.gz &
bcftools view chr18.IBS.vcf.gz -a | bcftools view -f PASS -m 2 -M 2 -v snps | bgzip -c > chr18.IBS.true_biallelic.snps.vcf.gz
```

* Given that the number of individuals in each dataset is similar, does it tell us something about the diversity within these populations?
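One possible way to do the counting is to count the non-header lines of each compressed file. The sketch below uses two small made-up files (`toy.FIN.vcf.gz`, `toy.IBS.vcf.gz`) so that it runs anywhere; the helper name `count_records` is hypothetical.

```shell
# Build two toy gzipped VCFs (hypothetical data): FIN has 2 sites, IBS has 3.
tmp=$(mktemp -d)
for pop in FIN IBS; do
  {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '18\t100\t.\tA\tG\t.\tPASS\t.\n'
    printf '18\t200\t.\tC\tT\t.\tPASS\t.\n'
    [ "$pop" = FIN ] || printf '18\t300\t.\tG\tA\t.\tPASS\t.\n'
  } | gzip -c > "$tmp/toy.$pop.vcf.gz"
done

# Count records (non-header lines) in a gzip-compatible VCF.
count_records() {
  gzip -dc "$1" | awk '!/^#/ { n++ } END { print n + 0 }'
}

n_fin=$(count_records "$tmp/toy.FIN.vcf.gz")
n_ibs=$(count_records "$tmp/toy.IBS.vcf.gz")
echo "FIN: $n_fin sites, IBS: $n_ibs sites"
rm -rf "$tmp"
```

Files written by `bgzip` can be read the same way, since they are gzip-compatible. With `bcftools` at hand, `bcftools view -H chr18.FIN.true_biallelic.snps.vcf.gz | wc -l` should give the same kind of count.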


### Much more filtering

Beyond the very basic filtering above, `bcftools filter` provides a variety of filtering options. The easiest are defined with `-e` to *exclude* sites matching a certain condition, and `-i` to *include* those matching a certain condition.

```
bcftools filter
```

You need to provide [*expressions*](https://samtools.github.io/bcftools/bcftools.html#expressions), much as you would for `awk`. They support comparison operators (`>=` etc.), logical operators (`&&`), 0-based selection of samples with square brackets (`[5]` for one sample, `[5-6]` for a range), and many more features which cannot be comprehensively discussed here.

* Let's try some!

```
bcftools filter -i 'GT[0]=="0|1"' chr18.vcf.gz | less

bcftools filter -e 'GT[20]=="0|0"' chr18.vcf.gz | less
```

* To see what it is doing, look at:

```
bcftools filter -e 'GT[20]=="0|0"' chr18.vcf.gz | grep -v "^#" | cut -f 30 | less

bcftools filter -i 'GT[1]=="1|2"' chr18.vcf.gz | less
```
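The column number in `cut -f 30` follows from the sample index: the expressions count samples from 0, a VCF record has 9 fixed columns before the first sample, and `cut` counts columns from 1. A small sketch of that arithmetic, using a made-up header line with hypothetical sample names `S1` to `S3`:

```shell
idx=20                     # 0-based sample index, as in GT[20]
col=$((idx + 10))          # 9 fixed VCF columns, and cut numbers columns from 1
echo "GT[$idx] is column $col"

# Toy header line with three hypothetical samples.
header=$(printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3')

# Look up the name of the sample with index 1 (the second sample).
name=$(printf '%s\n' "$header" | cut -f $((1 + 10)))
echo "sample index 1 is $name"
```

On a real file, `bcftools query -l chr18.vcf.gz | sed -n '21p'` should print the name of the sample addressed by `GT[20]`.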

* In the `INFO` column, there is a lot of information, such as allele frequencies (`AF`).

```
bcftools filter -i 'INFO/AF>0.3' chr18.vcf.gz | cut -f 8 | less
```
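To illustrate what such a condition on `AF` selects, here is a rough stand-in written in `awk` rather than `bcftools`. The three records are made up, and the sketch assumes a single `AF` value per site (biallelic sites only).

```shell
# Toy records without header (hypothetical data); column 8 is INFO.
toy=$(mktemp)
{
  printf '18\t100\t.\tA\tG\t.\tPASS\tAC=1;AF=0.10;AN=10\n'
  printf '18\t200\t.\tC\tT\t.\tPASS\tAC=4;AF=0.40;AN=10\n'
  printf '18\t300\t.\tG\tA\t.\tPASS\tAC=9;AF=0.90;AN=10\n'
} > "$toy"

# Print the positions whose AF entry in INFO exceeds 0.3.
kept=$(awk -F'\t' '
  {
    n = split($8, kv, ";")               # INFO entries are separated by ";"
    for (i = 1; i <= n; i++) {
      if (kv[i] ~ /^AF=/) {
        af = substr(kv[i], 4) + 0        # assumption: one AF value per site
        if (af > 0.3) print $2
      }
    }
  }
' "$toy" | paste -sd ' ' -)
echo "positions with AF > 0.3: $kept"
rm -f "$toy"
```

For real files, `bcftools filter` remains the better choice, since it also handles multiallelic sites and keeps the header intact.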

The 1000 Genomes file is heavily reduced, containing only genotypes. We may also try filtering on a more "raw" file from a previous session or the challenge.

* Just to try out some:

```
bcftools filter -i 'GT=="./."' chr18_a.vcf.gz | less

bcftools filter -i 'FORMAT/DP>5' chr18_a.vcf.gz | less

bcftools filter -e 'FORMAT/DP<5 && FORMAT/DP>8' chr18_a.vcf.gz | less
```

With these types of commands, you can get a nice and clean dataset of genotypes.


## Now for something slightly different.

We have looked at sequencing data, and often we obtain numbers. It is always a good idea to visualise some of this. How do you do this from the command line (assuming you are not on your laptop but a remote server)?
