# Recap

Data comes in different formats: `.fa.gz` for genomes, `.fastq.gz` for raw reads, `.bam` for mapped reads, `.vcf.gz` for genotype calls.

* Format means that specific elements are expected in a specific order.
* Reference genomes have different versions - always know which one you use!

`samtools` to inspect, sort, merge, filter them

`gatk MarkDuplicates` to mark or remove PCR duplicates from the data

`mosdepth` to get coverage

Genotype calling: what is the state of an individual at a given position in the genome?

Variant calls can be for all basepairs, for only SNPs, or in a condensed genome-wide format ("gvcf") - but all based on the convention of the `VCF` file format. `VCF` files can be for many individuals.

Program: `gatk HaplotypeCaller`, then `gatk GenotypeGVCFs` - needs reference genome and mapped data

`bcftools` as toolkit for inspection, filtering, etc.

`bcftools view` - can include parameters for filtering, e.g. `-m2` means at least 2 alleles

`bcftools concat` - concatenate files

`bcftools stats` - obtain useful statistics


## Diversity data

### 1000 Genomes data

The 1000 Genomes Project is a commonly used reference data resource in human genetic studies.

In the (final) phase 3 data, there are 2,504 individuals from 26 populations ([Sudmant et al. 2015](https://doi.org/10.1038/nature15394)).

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/1000_Genomes_Project.svg/1280px-1000_Genomes_Project.svg.png?1657572184698" alt="1KG" width="1000"/>

**Figure 2 [Locations](https://www.wikiwand.com/en/1000_Genomes_Project#/Human_genome_samples) of population samples of 1000 Genomes Project.** Each circle represents the number of sequences in the final release.

More information can be found on their [website](https://www.internationalgenome.org/).


# Population-scale VCFs

So, we should get data from the 1000 Genomes project:

```
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
```

This is a multi-individual VCF file. It is not in the `gvcf` format: only variant sites (mostly SNPs) are included, which makes it easier to handle.

* Let's have a look at the header as well as the metadata file provided by the project!
* Inspect the chromosome names!

```
bcftools view -h ALL.chr18.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz 

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
less integrated_call_samples_v3.20130502.ALL.panel
```
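A quick way to get an overview of the panel file is to count the samples per population. The sketch below runs on a tiny stand-in file, assuming a tab-separated layout with a header line and the columns sample, pop, super_pop and gender; the sample IDs are invented and `count_per_pop` is just an illustrative helper name.

```shell
# Toy stand-in for the panel file (tab-separated: sample, pop, super_pop, gender).
# The sample IDs below are invented for illustration only.
panel=$(mktemp)
printf 'sample\tpop\tsuper_pop\tgender\n'  > "$panel"
printf 'toy01\tFIN\tEUR\tfemale\n'        >> "$panel"
printf 'toy02\tFIN\tEUR\tmale\n'          >> "$panel"
printf 'toy03\tIBS\tEUR\tfemale\n'        >> "$panel"

# Count samples per population (column 2), skipping the header line.
count_per_pop() {
    awk 'NR > 1 { n[$2]++ } END { for (p in n) print p, n[p] }' "$1" | sort
}

count_per_pop "$panel"
```

On the real data, the same function could be called with `integrated_call_samples_v3.20130502.ALL.panel` as its argument.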


### Reference problems

Now we encounter another typical problem: different [reference genomes](https://gatk.broadinstitute.org/hc/en-us/articles/360035891071-Reference-genome) for humans. A first problem is that existing data from different sources might have been mapped and genotype called using the newer version of the human genome ([GRCh38](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/), known as "hg38", from 2013) or the previous version ([GRCh37](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.13/), known as "hg19", from 2009). These are the most common ones, and using even older ones is really not recommended! The two versions have different coordinates, so they are not compatible.

To make things worse, even within the same version, where the coordinate system is identical and each chromosome has the same number of bases from start to end, there exist variants with different chromosome names ("chr18" vs. "18"), coming from different [sources](https://genome-euro.ucsc.edu/cgi-bin/hgGateway?db=hg19&redirect=manual&source=genome.ucsc.edu)!

[Here](https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19) is an explanation of the issue. That means there are four different human genome versions in common use (hg19 and hg38, each with and without "chr").

**When obtaining a dataset, you always need to check which reference genome in which format was used!**
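One way to check the naming convention is to list the distinct contig names in the records. The following is a minimal sketch on an invented, uncompressed toy VCF; `chrom_style` is a hypothetical helper, and for a real compressed file the records would first be read with `bcftools view` or `zcat`.

```shell
# Toy VCF (uncompressed, invented records) to demonstrate the check.
toy_vcf=$(mktemp)
printf '##fileformat=VCFv4.1\n'                            > "$toy_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'  >> "$toy_vcf"
printf '18\t10000\t.\tA\tG\t.\tPASS\t.\n'                 >> "$toy_vcf"
printf '18\t10050\t.\tC\tT\t.\tPASS\t.\n'                 >> "$toy_vcf"

# Print "chr" if all contig names start with "chr", "plain" if none do,
# and "mixed" if both styles occur in the same file.
chrom_style() {
    grep -v '^#' "$1" | cut -f 1 | sort -u |
        awk '/^chr/ { c++ } !/^chr/ { p++ }
             END { if (c && p) print "mixed"; else if (c) print "chr"; else print "plain" }'
}

chrom_style "$toy_vcf"
```

This only tells you about the naming style; which genome version was used still has to be read from the header or the documentation of the dataset.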

It's already a rather big file: you may see that it contains >2M records for 2,504 individuals - a data frame of *>5.6 billion cells* for one of the smaller chromosomes... Even for the smallest human chromosome (chr21), this is >1M records!


## Subsetting data

### Subsetting regions

* With this in mind, let's subset the file to only a part of it, to make work easier!

```
bcftools index ALL.chr18.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz 
bcftools view -r 18:1-25000000 -Oz -o chr18.vcf.gz ALL.chr18.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
bcftools view -h chr18.vcf.gz
```

* Let's calculate some stats:

```
# the stats output is plain text, so give it a text file name
bcftools stats chr18.vcf.gz > stats_chr18.txt
bcftools index chr18.vcf.gz
```

The `index` step above is useful for retrieving subsets from the data, and generally recommended for `vcf` files.
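The output of `bcftools stats` is a long text file; its summary numbers sit on tab-separated lines starting with `SN`. Below is a small sketch for pulling them out, run on an invented excerpt of such a file (the numbers are made up, and `summary_numbers` and `stat_value` are illustrative helper names).

```shell
# Invented excerpt mimicking the "SN" (summary numbers) lines of a stats file.
stats=$(mktemp)
printf '# SN, Summary numbers:\n'            > "$stats"
printf 'SN\t0\tnumber of samples:\t3\n'     >> "$stats"
printf 'SN\t0\tnumber of records:\t120\n'   >> "$stats"
printf 'SN\t0\tnumber of SNPs:\t100\n'      >> "$stats"

# Show all summary lines (label and value only).
summary_numbers() {
    grep '^SN' "$1" | cut -f 3-
}

# Extract one value by its label, e.g. "number of records".
stat_value() {
    awk -F '\t' -v label="$2" '$1 == "SN" && $3 == (label ":") { print $4 }' "$1"
}

summary_numbers "$stats"
stat_value "$stats" "number of records"
```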

### Subsetting individuals

If we want to extract data for some individuals, for example samples from the same population, we can use the argument `-S` with a file containing the names of the samples to extract. Let's assume you want to compare the Finns to the Iberians: you could use `awk` to create a nice list with the identifiers for each. (Btw, again a practical use case of a simple command line tool.)

```
awk '$2=="FIN" { print $1 }' integrated_call_samples_v3.20130502.ALL.panel > FIN.list
awk '$2=="IBS" { print $1 }' integrated_call_samples_v3.20130502.ALL.panel > IBS.list
```

Now, each line in these files contains one sample ID. Then we can extract genetic variants from the FIN and IBS populations with the following commands, respectively.

```
bcftools view -Oz -o chr18.FIN.vcf.gz chr18.vcf.gz -S FIN.list &
bcftools view -Oz -o chr18.IBS.vcf.gz chr18.vcf.gz -S IBS.list
```

* Now there is a symbol `&` at the end of the first line. What does it mean?

* Let's have a look at the files!

As a result, you get a file containing the same variants as before, just for a subset of individuals instead of the whole thing.
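The effect of the `&` can be tried out without any genetic data. In this sketch, `sleep` stands in for a long-running `bcftools` call and `slow_job` is a made-up helper; the point is only how the shell schedules the two commands.

```shell
# Two stand-in jobs (sleep instead of bcftools) to show background execution.
out=$(mktemp)
slow_job() { sleep 1; echo "$1 done" >> "$out"; }

slow_job first &     # `&` sends this job to the background; the shell moves on immediately
slow_job second      # runs in the foreground while the first job is still running
wait                 # block until all background jobs have finished

sort "$out"
```

Both jobs take about one second together instead of two. The `wait` at the end is worth remembering in scripts: without it, later steps might start before the background output file is complete.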


## Adding data in vcf files

To clarify the difference between `bcftools concat` and `bcftools merge`: you <u>concatenate</u> different genomic <u>positions</u> for the same individuals, or you <u>merge</u> the same genomic positions for different <u>individuals</u>.

**Concatenating:**

![image-2.png](attachment:image-2.png)

**Merging:**

![image-3.png](attachment:image-3.png)
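The same distinction can be mimicked with plain text tables: concatenating stacks rows (positions), merging adds columns (individuals). This is only a conceptual sketch with invented genotypes, using `cat` and `paste`; `bcftools` additionally takes care of headers, compression and matching positions, which `paste` does not (it assumes identical row order).

```shell
d=$(mktemp -d)

# Same two individuals, different positions -> stack the rows (concatenate).
printf 'pos\tind1\tind2\n100\t0/0\t0/1\n'  > "$d/regionA.tsv"
printf '200\t1/1\t0/1\n'                   > "$d/regionB.tsv"
cat "$d/regionA.tsv" "$d/regionB.tsv"      > "$d/concat.tsv"

# Same positions, different individuals -> put the columns side by side (merge).
printf 'pos\tind1\n100\t0/0\n200\t1/1\n'   > "$d/sample1.tsv"
printf 'pos\tind2\n100\t0/1\n200\t0/1\n'   > "$d/sample2.tsv"
paste "$d/sample1.tsv" "$d/sample2.tsv" | cut -f 1,2,4 > "$d/merge.tsv"

cat "$d/concat.tsv"
cat "$d/merge.tsv"
```

In this toy case both routes end up with the same table of two positions and two individuals.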


## Merge

Ok, then let's merge the vcfs of two individuals.

Here is a link with new files to download: 

```
wget https://ucloud.univie.ac.at/index.php/s/enPq6XjDqkaeXRW/download
tar -zxvf download.1
```

Pay attention to the name of the downloaded file! If a file called `download` already exists in the directory, `wget` adds a number to the new one (e.g. `download.1`), so adjust the `tar` command accordingly.

You may also tidy up and remove all original download files: `rm download*`

```
bcftools index denisovan.filt.vcf.gz
bcftools index altai.filt.vcf.gz

bcftools merge -Oz -o merged_file.vcf.gz -r 21:1-35000000 -g /home/local/ANTHROPOLOGY/kuhlwilmm83/refgen/hg19/hg19.p13.plusMT.no_alt_analysis_set.fa.gz altai.filt.vcf.gz denisovan.filt.vcf.gz
```

This is the basic merging function. Type only `bcftools merge` to learn possible settings.

* Now let's have a look at the resulting file!

## Concat

Assume you have vcf files for separate regions or chromosomes and want to make one file out of them (concatenate). This is again tricky with compressed files that have headers and different metadata. `bcftools` can do that for you as well:

```
bcftools merge -Oz -o merged_file2.vcf.gz -r 21:35000001-40000000 -g /home/local/ANTHROPOLOGY/kuhlwilmm83/refgen/hg19/hg19.p13.plusMT.no_alt_analysis_set.fa.gz altai.filt.vcf.gz denisovan.filt.vcf.gz


bcftools concat -Oz -o merged_concat_file.vcf.gz merged_file.vcf.gz merged_file2.vcf.gz 
```


## Data filtering

Let's continue with the files we just created!

An important type of filtering is to keep only positions that are variable:

```
bcftools view -m2 -M2 -v snps merged_concat_file.vcf.gz | less

bcftools view chr18.FIN.vcf.gz -f PASS -m 2 -M 2 -v snps -Oz -o chr18.FIN.biallelic.snps.vcf.gz &
bcftools view chr18.IBS.vcf.gz -f PASS -m 2 -M 2 -v snps -Oz -o chr18.IBS.biallelic.snps.vcf.gz
```

* What is the point of `-m`, `-M` and `-v`?

* How would you filter to **remove** SNPs, only keep insertions or deletions etc?
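To build some intuition for what such a selection does, here is a rough `awk` imitation on an invented toy table with only the first five VCF columns. It is a simplified approximation (exactly two alleles, both one base long), not how `bcftools` classifies variants internally; `biallelic_snps` is a hypothetical helper.

```shell
vcf=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\n'   > "$vcf"
printf '18\t100\t.\tA\tG\n'           >> "$vcf"   # biallelic SNP
printf '18\t200\t.\tA\tG,T\n'         >> "$vcf"   # SNP with three alleles
printf '18\t300\t.\tAT\tA\n'          >> "$vcf"   # deletion
printf '18\t400\t.\tC\t.\n'           >> "$vcf"   # no ALT allele, one allele only

# Print positions with exactly two alleles where REF and ALT are single bases.
biallelic_snps() {
    awk -F '\t' '!/^#/ {
        n_alleles = ($5 == ".") ? 1 : 1 + split($5, alt, ",")
        if (n_alleles == 2 && length($4) == 1 && length($5) == 1) print $2
    }' "$1"
}

biallelic_snps "$vcf"
```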


### Restricting data to polymorphic sites

If we have a look at these files, we can see that they contain a lot of positions where all individuals within the population are `0/0` (or `0|0`). If we are interested in what is going on within the population, we may restrict the data to sites that are polymorphic within it.

Looking at the [documentation](https://samtools.github.io/bcftools/bcftools.html#view) of `bcftools view`, we find the flag `-a` to trim unseen alternative alleles. In this case, one should first trim the alleles, and afterwards filter for biallelic SNPs.

* Let's trim and filter, and count how much is left in each case!

```
bcftools view chr18.FIN.vcf.gz -a | bcftools view -m 2 -M 2 -v snps -Oz -o chr18.FIN.true_biallelic.snps.vcf.gz &
bcftools view chr18.IBS.vcf.gz -a | bcftools view -m 2 -M 2 -v snps -Oz -o chr18.IBS.true_biallelic.snps.vcf.gz
```

* Given that the number of individuals in each dataset is similar, does it tell us something about the diversity within these populations?
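The idea behind the trimming can be illustrated with a toy example: keep only sites where at least one individual in the subset carries the alternative allele. The sketch assumes an invented VCF whose FORMAT column contains only `GT`, and `alt_seen` is a made-up helper name.

```shell
vcf=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ttoy01\ttoy02\n'  > "$vcf"
printf '18\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|0\n'                          >> "$vcf"
printf '18\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|1\t0|0\n'                          >> "$vcf"
printf '18\t300\t.\tG\tA\t.\tPASS\t.\tGT\t1|1\t0|1\n'                          >> "$vcf"

# Print positions where any genotype (column 10 onwards) contains a non-reference allele.
alt_seen() {
    awk -F '\t' '!/^#/ {
        for (i = 10; i <= NF; i++)
            if ($i ~ /[1-9]/) { print $2; next }
    }' "$1"
}

alt_seen "$vcf"
alt_seen "$vcf" | wc -l    # number of sites left
```

Position 100 is dropped because both toy individuals are `0|0` there, which is exactly the kind of site that remains in a population subset of a larger call set.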


### Much more filtering

Beyond the very basic filtering above, `bcftools filter` provides a variety of options for filtering. The easiest are defined with `-e` to *exclude* sites that meet a certain condition, and `-i` to *include* those that meet a certain condition.

```
bcftools filter
```

You need to provide [*expressions*](https://samtools.github.io/bcftools/bcftools.html#expressions), in a very similar way you provide *expressions* to `awk`, including comparison operators (`>=` etc), logical operators (`&&`), selection of samples with square brackets (`[5:6]`), and many more which cannot be comprehensively discussed here.

* Let's try some!

```
bcftools filter -i 'GT[0]=="0|1"' chr18.vcf.gz | less

bcftools filter -e 'GT[20]=="0|0"' chr18.vcf.gz | less
```

* To see what it is doing, look at:

```
bcftools filter -e 'GT[20]=="0|0"' chr18.vcf.gz | grep -v "^#" | cut -f 30 | less

bcftools filter -i 'GT[1]=="1|2"' chr18.vcf.gz | less
```

* In the `INFO` column, there is a lot of information, such as allele frequencies (`AF`).

```
bcftools filter -i 'INFO/AF>0.3' chr18.vcf.gz | cut -f 8 | less
```
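To see what an expression on `INFO/AF` has to do under the hood, the following sketch pulls the `AF` value out of the `INFO` column with `awk` and applies a threshold. The records are invented, `af_above` is a hypothetical helper, and it assumes a single `AF` value per site (no comma-separated values for multiallelic sites).

```shell
vcf=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'     > "$vcf"
printf '18\t100\t.\tA\tG\t.\tPASS\tAC=1;AF=0.05;AN=20\n'    >> "$vcf"
printf '18\t200\t.\tC\tT\t.\tPASS\tAC=9;AF=0.45;AN=20\n'    >> "$vcf"

# usage: af_above FILE THRESHOLD
# Print position and AF for records whose AF exceeds the threshold.
af_above() {
    awk -F '\t' -v min="$2" '!/^#/ {
        n = split($8, kv, ";")               # INFO is a ";"-separated list of KEY=VALUE
        for (i = 1; i <= n; i++)
            if (kv[i] ~ /^AF=/) {
                af = substr(kv[i], 4) + 0    # text after "AF=", forced to a number
                if (af > min) print $2, af
            }
    }' "$1"
}

af_above "$vcf" 0.3
```

This is of course much more fragile than the `bcftools` expression, which is the reason to prefer the dedicated tool for real data.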

The 1000 genomes file is particularly reduced, containing only genotypes. We may also try it on a more "raw" file from a previous session or the challenge.

* Just to try out some:

```
bcftools filter -i 'GT=="./."' merged_concat_file.vcf.gz | less

bcftools filter -i 'FORMAT/DP>5' testb.vcf.gz | less

bcftools filter -e 'FORMAT/DP<5 || FORMAT/DP>8' testb.vcf.gz | less
```

With these types of commands, you can get a nice and clean dataset of genotypes. Of course, always make sure that any tag you filter on (like `DP`) is actually present in the file. VCF files have a fixed general structure, but their specific content is flexible.
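Whether a tag is present can be read from the header, where each per-sample tag is declared on a `##FORMAT=<ID=...>` line. Here is a minimal sketch on an invented header saved as a text file (for a real file, the header could be written out with `bcftools view -h`); `has_format_tag` is an illustrative helper name.

```shell
# Invented header that declares GT but not DP.
hdr=$(mktemp)
printf '##fileformat=VCFv4.1\n'                                            > "$hdr"
printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'   >> "$hdr"

# usage: has_format_tag HEADER_FILE TAG
# Succeeds (exit status 0) if the header declares the FORMAT tag.
has_format_tag() {
    grep -q "^##FORMAT=<ID=$2," "$1"
}

has_format_tag "$hdr" GT && echo "GT present"
has_format_tag "$hdr" DP || echo "DP missing"
```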


# Challenge 4 is coming!

That's it for this course this time, except the last challenge!
