# Genotypes!

So far, that is nice progress! Now we want to know the differences between the reference genome and the sequencing data we have generated or obtained. This is called genotype calling: getting a list of positions in the genome and the alleles an individual carries at each of them. This is the information you actually want from genomic data.

Genotype files contain different types of information:

* Header
* Position
* Alleles
* Info
* Genotypes

Note the transition from individual reads in a bam file to coordinates in the reference genome.

### VCF files

The variant call format (VCF) is a standard format of a text file to store genetic variants and their metadata. An example of a VCF file is shown below.

![image.png](attachment:image.png)

Lines starting with the character `#` are header lines, which define different kinds of metadata. Lines that do not start with `#` store genetic variants, one variant per line. This figure is from https://davetang.github.io/learning_vcf_file/#introduction.

More information about VCF files can be found in [hts-specs](https://samtools.github.io/hts-specs/).
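The split between header lines and variant records can be tried out on a tiny hand-made file. This is only a sketch: the sample name, positions and alleles below are invented for illustration and do not come from the course data.

```shell
# Build a tiny, made-up VCF (hypothetical sample name and positions) in a temp directory.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=chr21>\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n'
  printf 'chr21\t1000\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\n'
  printf 'chr21\t2000\t.\tG\tA\t60\tPASS\t.\tGT\t1/1\n'
} > "$toy"

# Header lines start with '#'; every other line is one variant record.
n_header=$(grep -c '^#' "$toy")
n_records=$(grep -vc '^#' "$toy")
# The last header line names the columns; sample columns start at column 10.
columns=$(grep '^#CHROM' "$toy" | cut -f 1-5)
echo "header lines: $n_header, records: $n_records"
rm -rf "$tmpdir"
```

The same two `grep` calls work on a real, uncompressed VCF file; for a compressed file, decompress it first (for example with `gzip -dc`).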


## GATK

[Genome Analysis ToolKit](https://gatk.broadinstitute.org/hc/en-us): This is the most common tool for getting genotypes from sequencing data. It can do many more things (hence *toolkit*), but genotype calling is one of its most important core uses.

We can run GATK to call genotypes:


```
module load gatk
```

```
gatk --java-options "-Xmx1g" HaplotypeCaller -L chr21 -R $REF -I test.rmdup.sorted.filtered.bam -ERC GVCF -O test.vcf.gz
```

You need to specify the input (`-I`), the output (`-O`) and, of course, the reference genome to compare to (`-R`). There are many more options, and their details may depend on the version of GATK you are using. You can find the documentation on the website or by running `gatk HaplotypeCaller --help`.

* Let's inspect the output!

In our case, we restrict the analysis to chromosome 21 (`-L chr21`), as this is the smallest chromosome and therefore quick and easy to process in a small sequencing dataset like this one. Obviously, it will take much longer for whole genomes with several-fold coverage. However, obtaining genotypes from "standard" sequencing data is not complex in itself, as the tool is well maintained in that regard. What would be "non-standard" situations?

The flag `-ERC` (emit reference confidence) determines whether genotype information is reported for all positions (`-ERC GVCF`, as used here) or only for positions where differences from the reference genome are observed (the default, `-ERC NONE`).
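In a GVCF, stretches that match the reference are typically written as blocks whose only alternative allele is the symbolic `<NON_REF>`, while variant sites list a real alternative allele next to it. A minimal sketch of telling the two apart, using an invented three-line file (positions, depths and the sample name are made up):

```shell
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.g.vcf"
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n'
  printf 'chr21\t1000\t.\tA\t<NON_REF>\t.\t.\tEND=1499\tGT:DP\t0/0:10\n'
  printf 'chr21\t1500\t.\tC\tT,<NON_REF>\t80\t.\t.\tGT:DP\t0/1:12\n'
  printf 'chr21\t1501\t.\tG\t<NON_REF>\t.\t.\tEND=2000\tGT:DP\t0/0:11\n'
} > "$toy"

# Reference blocks: ALT (column 5) is exactly <NON_REF>.
ref_blocks=$(awk -F '\t' '!/^#/ && $5 == "<NON_REF>"' "$toy" | wc -l | tr -d ' ')
# Variant sites: ALT contains a real allele in addition to <NON_REF>.
variant_sites=$(awk -F '\t' '!/^#/ && $5 != "<NON_REF>"' "$toy" | wc -l | tr -d ' ')
echo "reference blocks: $ref_blocks, variant sites: $variant_sites"
rm -rf "$tmpdir"
```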

## Something you may do with the VCF file

* How many SNPs are in the VCF file? (heterozygous sites: 0/1; homozygous different from the reference: 1/1)

* What is the position of the first and the last SNP?


## Data analysis of VCF files

VCF files may contain a lot of information! Since they are text files, you may actually filter them with command line tools such as `awk`.

For example, let's try `zgrep "1/1" test.vcf.gz` or `zcat test.vcf.gz | awk '$4=="C" && $5~"T"'`.
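One caveat with plain `zgrep` is that it matches the pattern anywhere in a line, including header lines and other fields. A slightly more careful sketch with `awk` skips the header and looks only at the genotype in the sample column. The file below is invented for illustration (made-up positions, depths and sample name) and assumes a single sample in column 10 with `GT` as the first `FORMAT` entry.

```shell
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.vcf.gz"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n'
  printf 'chr21\t1000\t.\tC\tT\t50\t.\t.\tGT:DP\t0/1:12\n'
  printf 'chr21\t2000\t.\tG\tA\t60\t.\t.\tGT:DP\t1/1:9\n'
  printf 'chr21\t3000\t.\tC\tT\t40\t.\t.\tGT:DP\t1/1:7\n'
  printf 'chr21\t4000\t.\tA\tG\t30\t.\t.\tGT:DP\t0/0:15\n'
} | gzip -c > "$toy"

# Skip header lines, take the genotype (first ':'-separated entry of column 10), and tally it.
counts=$(gzip -dc "$toy" | awk -F '\t' '!/^#/ { split($10, f, ":"); n[f[1]]++ }
  END { printf "het=%d hom_alt=%d", n["0/1"], n["1/1"] }')
# Count C-to-T changes only, comparing REF (column 4) and ALT (column 5) exactly.
c_to_t=$(gzip -dc "$toy" | awk -F '\t' '!/^#/ && $4 == "C" && $5 == "T"' | wc -l | tr -d ' ')
echo "$counts c_to_t=$c_to_t"
rm -rf "$tmpdir"
```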

Our `test.vcf.gz` is a toy example from very poor data, so it is not informative. However, a **VCF** file can in principle contain millions of positions for thousands of individuals. With such a complex data format, there are many more things you may want to do, and more specialized tools are available. One of the most common ones is `bcftools`. There are others, such as [vcftools](https://vcftools.sourceforge.net/).


## BCFTOOLS

`bcftools` is an efficient program for manipulating and analysing VCF files. As with most programs, there are many possible commands in `bcftools`. More information about `bcftools` can be found in its [manual](https://samtools.github.io/bcftools/bcftools.html).

```
module load bcftools htslib
```


To find out available options, you can type the following command in the terminal:

```
bcftools
```

You don't even need to specify `--help` as the program tries to be helpful by default. That is very much appreciated, and often not the case...

It is a great program that can do many things. But you really need to know what things you can and may do, what exactly it is doing, and what each flag/parameter is doing. Otherwise, it's quite easy to screw things up here.

Now, let's have a look at some of the functions!

```
bcftools view

bcftools view -Ov -r chr21:10000000-15000000 test.vcf.gz | less
```

As you can see, you can specify the part of the chromosome you want to look at, much like with `samtools` for `bam` files. Like `bam` files, `vcf` files can also be indexed, and some downstream applications rely on this:

```
bcftools index test.vcf.gz
```

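To save the output in compressed form, the following two approaches are equivalent (note that the two commands extract different regions here, so the second one overwrites the file written by the first):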

```
bcftools view -Ov -r chr21:10000000-15000000 test.vcf.gz | bgzip > test_out.vcf.gz

bcftools view -Oz -o  test_out.vcf.gz -r chr21:30000000-35000000 test.vcf.gz
```

* What does the `-O` flag do? 

An important type of filtering is to keep only positions that are variable:

```
bcftools view -v snps test.vcf.gz | less
```

Here, you have only one individual, and the `<NON_REF>` marking is a bit difficult to handle. However, with multiple (many) samples this becomes more relevant, as we will see later. You may also choose indels or other things.


If there is too much information in the genotype calls, and you only want the pure genotypes, you may do this:

```
bcftools annotate -x 'FORMAT' test.vcf.gz | less
```

Assume you have vcf files for separate regions or chromosomes and want to combine them into one file (concatenate). This is tricky to do by hand with compressed files that have headers and different metadata. `bcftools` can do that for you as well:

```
bcftools view -Oz -o  test_out2.vcf.gz -r chr21:35000001-40000000 test.vcf.gz

bcftools concat -Oz -o test_out.merged.vcf.gz test_out.vcf.gz test_out2.vcf.gz 
```

Finally, like `samtools` for mapped reads, `bcftools` can provide useful summary statistics for genotype files:

```
bcftools stats test.vcf.gz > test_bcfstats.txt
```

* Let's inspect the output! 

## Merging

To clarify the difference between `bcftools concat` and `bcftools merge`: you concatenate different genomic positions for the same individuals, or you merge the same genomic positions for different individuals.

**Concatenating:**

![image.png](attachment:image.png)

**Merging:**

![image-2.png](attachment:image-2.png)
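The difference in shape can be illustrated with plain text tools on toy genotype tables. This is only a conceptual sketch with invented data: real VCF files have headers, and merging has to reconcile alleles between files, which is why `bcftools` is used in practice.

```shell
tmpdir=$(mktemp -d)
# Same two samples, different positions (toy genotype tables: CHROM, POS, sampleA, sampleB).
printf '18\t100\t0|0\t0|1\n18\t200\t1|1\t0|0\n' > "$tmpdir/region1.tsv"
printf '18\t300\t0|1\t0|1\n' > "$tmpdir/region2.tsv"
# Same positions as region1, but a different sample (CHROM, POS, sampleC).
printf '18\t100\t1|1\n18\t200\t0|1\n' > "$tmpdir/other_sample.tsv"

# "Concatenate": stack rows; the number of sample columns stays the same.
cat "$tmpdir/region1.tsv" "$tmpdir/region2.tsv" > "$tmpdir/concat.tsv"
# "Merge": add sample columns; the number of rows stays the same.
cut -f 3 "$tmpdir/other_sample.tsv" > "$tmpdir/sampleC.col"
paste "$tmpdir/region1.tsv" "$tmpdir/sampleC.col" > "$tmpdir/merge.tsv"

# Report each table as rows x columns.
concat_shape=$(awk -F '\t' 'END { print NR "x" NF }' "$tmpdir/concat.tsv")
merge_shape=$(awk -F '\t' 'END { print NR "x" NF }' "$tmpdir/merge.tsv")
echo "concat: $concat_shape, merge: $merge_shape"
rm -rf "$tmpdir"
```

Concatenating makes the table longer, merging makes it wider.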


## Diversity data

### 1000 Genomes data

The 1000 Genomes Project is a commonly used reference data resource in human genetic studies.

In the (final) phase 3 data, there are 2,504 individuals from 26 populations ([Sudmant et al. 2015](https://doi.org/10.1038/nature15394)).

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/1000_Genomes_Project.svg/1280px-1000_Genomes_Project.svg.png?1657572184698" alt="1KG" width="1000"/>

**Figure 2 [Locations](https://www.wikiwand.com/en/1000_Genomes_Project#/Human_genome_samples) of population samples of 1000 Genomes Project.** Each circle represents the number of sequences in the final release.

More information can be found on their [website](https://www.internationalgenome.org/).

## Extract data from a population

So, let's use data from the 1000 Genomes project:

```
less $WDIR/share/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
```

This is a multi-individual VCF file. It is not in the `gvcf` format: only variant sites are listed, which makes it easier to handle.

* Let's have a look at the header as well as the metadata file provided by the project!
* Inspect the chromosome names!

```
bcftools view -h $WDIR/share/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz 

less $WDIR/share/integrated_call_samples_v3.20130502.ALL.panel
```

Again, to make things easier, we keep the long file name in a variable:

```
TGDATA="$WDIR/share/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz" 

INDFILE="$WDIR/share/integrated_call_samples_v3.20130502.ALL.panel"
```


### Reference problems

Now, we encounter another typical problem - different [reference genomes](https://gatk.broadinstitute.org/hc/en-us/articles/360035891071-Reference-genome) for humans. A first problem is that existing data from different sources may have been mapped and genotyped against the newer version of the human genome ([GRCh38](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/), or "hg38", from 2013) or the previous version ([GRCh37](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.13/), or "hg19", from 2009). These are the most common ones, and even older ones are really not recommended! These versions have different coordinates, so they are not compatible.

To make things worse, even within the same version, with the same coordinate system and the same number of bases from start to end, there are variants with different chromosome names ("chr18" vs. "18") coming from different [sources](https://genome-euro.ucsc.edu/cgi-bin/hgGateway?db=hg19&redirect=manual&source=genome.ucsc.edu)!

[Here](https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19) is an explanation of the issue. That means, there are four different human genome versions commonly used (hg19 and hg38 with and without "chr").

**When obtaining a dataset, you always need to check which reference genome in which format was used!**
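A quick first check is to look at the chromosome names in the first column of the records. The sketch below uses two invented mini-files and a hypothetical helper function `naming_style`. It only reports the naming style; it cannot tell hg19 from hg38, for which you need the header or the dataset documentation.

```shell
tmpdir=$(mktemp -d)
# Two toy record sets using the two naming styles (positions are made up).
printf '##fileformat=VCFv4.2\n18\t100\t.\tA\tG\n18\t200\t.\tC\tT\n' > "$tmpdir/a.vcf"
printf '##fileformat=VCFv4.2\nchr18\t100\t.\tA\tG\n' > "$tmpdir/b.vcf"

# Hypothetical helper: report the naming style found in column 1 of the records.
naming_style() {
  awk -F '\t' '!/^#/ { if ($1 ~ /^chr/) c++; else p++ }
    END { if (c && p) print "mixed"; else if (c) print "chr-prefixed"; else print "plain" }' "$1"
}
style_a=$(naming_style "$tmpdir/a.vcf")
style_b=$(naming_style "$tmpdir/b.vcf")
echo "a.vcf: $style_a, b.vcf: $style_b"
rm -rf "$tmpdir"
```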

The 1000 Genomes file is already rather big: you can see that it contains >2M records for 2,504 individuals - a data frame of *>5.6 billion cells* for one of the smaller chromosomes... Even for the smallest human chromosome (chr21), this is >1M records!
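The size estimate is simply records times individuals. As a back-of-the-envelope sketch, with a record count that is an assumed round number for illustration (count the records in the real file for the true value):

```shell
# Assumed round number of records, for illustration only.
records=2250000
# Number of individuals in the phase 3 data.
samples=2504
# One genotype cell per record and individual.
cells=$((records * samples))
echo "$records records x $samples samples = $cells genotype cells"
```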


## Subsetting & filtering

### Subsetting regions

* With this in mind, let's subset the file to make work easier! Note how the `-r` changes!

```
bcftools view -r 18:1-3000000 -Oz -o chr18.vcf.gz $TGDATA
bcftools view -h chr18.vcf.gz
```

* Let's calculate some stats:

```
bcftools stats chr18.vcf.gz > stats_chr18.txt
bcftools index chr18.vcf.gz
```

The `index` step above is useful for retrieving subsets from the data, and generally recommended for `vcf` files.

### Subsetting individuals

If we want to extract data for some individuals, for example samples from the same population, we can use the argument `-S` with a file containing the names of the samples to extract. Let's assume you want to compare the Finns to the Iberians: you could use `awk` to create a list of identifiers for each population.

```
awk '$2=="FIN" { print $1 }' $INDFILE > FIN.list
awk '$2=="IBS" { print $1 }' $INDFILE > IBS.list
```
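To see what the `awk` filter does, it can be run on a small stand-in for the panel file. The sample IDs below are invented, and the four-column layout (sample, population, super-population, sex) is an assumption about the panel file that you should check against the real one.

```shell
tmpdir=$(mktemp -d)
panel="$tmpdir/toy.panel"
{
  printf 'sample\tpop\tsuper_pop\tgender\n'
  printf 'S001\tGBR\tEUR\tmale\n'
  printf 'S002\tFIN\tEUR\tfemale\n'
  printf 'S003\tFIN\tEUR\tmale\n'
  printf 'S004\tIBS\tEUR\tfemale\n'
} > "$panel"

# Keep column 1 (sample ID) of every line whose column 2 (population) matches.
awk '$2=="FIN" { print $1 }' "$panel" > "$tmpdir/FIN.list"
awk '$2=="IBS" { print $1 }' "$panel" > "$tmpdir/IBS.list"

fin_ids=$(paste -sd ' ' "$tmpdir/FIN.list")
ibs_ids=$(paste -sd ' ' "$tmpdir/IBS.list")
echo "FIN: $fin_ids | IBS: $ibs_ids"
rm -rf "$tmpdir"
```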

Now, each line in these files contains one sample ID. Then we can extract genetic variants from the FIN and IBS populations with the following commands, respectively.

```
bcftools view -Oz -o chr18.FIN.vcf.gz chr18.vcf.gz -S FIN.list &
bcftools view -Oz -o chr18.IBS.vcf.gz chr18.vcf.gz -S IBS.list
```

* Note the `&` at the end of the first line. What does it mean?

* Let's have a look at the files!

As a result, you get a file containing the same variants as before, just for a subset of individuals instead of the whole thing.
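When two independent commands are run side by side like this, it is worth making sure that both have finished before their output is used. A minimal sketch with `sleep` as a stand-in for the slow commands (the log file names are hypothetical):

```shell
tmpdir=$(mktemp -d)
# Stand-ins for two slow, independent commands; the first one runs in the background.
(sleep 1; echo "FIN done" > "$tmpdir/fin.log") &
(sleep 1; echo "IBS done" > "$tmpdir/ibs.log")
# Block until every background job of this shell has finished.
wait
finished=$(cat "$tmpdir/fin.log" "$tmpdir/ibs.log" | wc -l | tr -d ' ')
echo "finished jobs: $finished"
rm -rf "$tmpdir"
```

Without the `wait`, a following step could start reading a file that is still being written.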


## Quality filtering

### Subsetting SNPs by quality

When analysing data, we usually want to use variants that meet certain conditions, for example variants of good quality.

* As before, we may extract biallelic SNPs that passed the quality checks.

```
bcftools view chr18.FIN.vcf.gz -f PASS -m 2 -M 2 -v snps | bgzip -c > chr18.FIN.biallelic.snps.vcf.gz &
bcftools view chr18.IBS.vcf.gz -f PASS -m 2 -M 2 -v snps | bgzip -c > chr18.IBS.biallelic.snps.vcf.gz
```

If we have a look at these files, we can see that they contain a lot of positions where all individuals within the population are `0/0` (or `0|0`). If we are interested in what is going on within the population, we may restrict the data to sites that are polymorphic within it.


### Restricting data to polymorphic sites

Looking at the [documentation](https://samtools.github.io/bcftools/bcftools.html#view) of `bcftools view`, we find the flag `-a` to trim unseen alternative alleles. In this case, one should first trim the alleles, and afterwards filter for biallelic SNPs.

* Let's trim and filter, and count how much is left in each case!

```
bcftools view chr18.FIN.vcf.gz -a | bcftools view -f PASS -m 2 -M 2 -v snps | bgzip -c > chr18.FIN.true_biallelic.snps.vcf.gz &
bcftools view chr18.IBS.vcf.gz -a | bcftools view -f PASS -m 2 -M 2 -v snps | bgzip -c > chr18.IBS.true_biallelic.snps.vcf.gz
```
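Counting what is left comes down to counting non-header lines. The sketch below does this on an invented three-sample file, and also counts the records in which at least one sample carries a non-reference allele, which is roughly what survives trimming followed by the biallelic filter. It assumes that `GT` is the only `FORMAT` field, so that any digit from 1 to 9 in a sample column indicates an alternative allele.

```shell
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.vcf.gz"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
  printf '18\t100\t.\tA\tG\t100\tPASS\t.\tGT\t0|0\t0|0\t0|0\n'
  printf '18\t200\t.\tC\tT\t100\tPASS\t.\tGT\t0|1\t0|0\t1|1\n'
  printf '18\t300\t.\tG\tA\t100\tPASS\t.\tGT\t0|0\t1|0\t0|0\n'
} | gzip -c > "$toy"

# All records, header excluded.
n_all=$(gzip -dc "$toy" | grep -vc '^#')
# Records where at least one sample carries a non-reference allele.
n_alt_seen=$(gzip -dc "$toy" | awk -F '\t' '!/^#/ {
  for (i = 10; i <= NF; i++) if ($i ~ /[1-9]/) { n++; break }
} END { print n + 0 }')
echo "records: $n_all, with an alternative allele in this subset: $n_alt_seen"
rm -rf "$tmpdir"
```

For the real files, the first count (`gzip -dc file.vcf.gz | grep -vc '^#'`) is enough to compare the filtered FIN and IBS datasets.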

* Given that the number of individuals in each dataset is similar, does it tell us something about the diversity within these populations?


### Much more filtering

Beyond the very basic filtering above, `bcftools filter` provides a variety of filtering options. The simplest are `-e` to *exclude* sites that meet a certain condition, and `-i` to *include* only those that meet it.

```
bcftools filter
```

You need to provide [*expressions*](https://samtools.github.io/bcftools/bcftools.html#expressions), much as you provide *expressions* to `awk`, including comparison operators (`>=` etc.), logical operators (`&&`), selection of samples with square brackets (`[5:6]`), and many more that cannot be comprehensively discussed here.

* Let's try some!

```
bcftools filter -i 'GT[0]=="0|1"' chr18.vcf.gz | less

bcftools filter -e 'GT[20]=="0|0"' chr18.vcf.gz | less

```

* To see what it is doing, look at:

```
bcftools filter -e 'GT[20]=="0|0"' chr18.vcf.gz | grep -v "^#" | cut -f 30 | less

bcftools filter -i 'GT[1]=="1|2"' chr18.vcf.gz | less
```
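The `cut -f 30` follows from how the columns are laid out: a VCF has nine fixed columns before the first sample, `cut` counts columns from 1, and the sample subscript in the expression counts from 0. A small sketch of that arithmetic, with a toy header and hypothetical sample names:

```shell
# A toy header line with three hypothetical sample names.
header=$(printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3')
# Nine fixed columns come first; columns count from 1 while subscripts count from 0.
sample_index=1
column=$((sample_index + 10))
sample_name=$(printf '%s\n' "$header" | cut -f "$column")
echo "GT[$sample_index] -> column $column ($sample_name)"
# The same rule gives the column used above for GT[20].
column_for_20=$((20 + 10))
echo "GT[20] -> column $column_for_20"
```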

* In the `INFO` column, there is a lot of information, such as allele frequencies (`AF`).

```
bcftools filter -i 'INFO/AF>0.3' chr18.vcf.gz | cut -f 8 | less
```
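To get a feeling for what such an `INFO` expression evaluates, the same test can be written out by hand in `awk` on an invented file. This is only an illustration of the idea, not how `bcftools` works internally. It assumes one `AF` value per record; at multi-allelic sites `AF` holds several comma-separated values, and this sketch would only look at the first.

```shell
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.vcf"
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '18\t100\t.\tA\tG\t100\tPASS\tAC=5;AF=0.05;AN=100\n'
  printf '18\t200\t.\tC\tT\t100\tPASS\tAC=40;AF=0.4;AN=100\n'
  printf '18\t300\t.\tG\tA\t100\tPASS\tAC=31;AF=0.31;AN=100\n'
} > "$toy"

# Split INFO (column 8) on ';', find the AF entry, and print POS where AF > 0.3.
kept=$(awk -F '\t' '!/^#/ {
  n = split($8, kv, ";")
  for (i = 1; i <= n; i++)
    if (kv[i] ~ /^AF=/) { af = substr(kv[i], 4); if (af + 0 > 0.3) print $2 }
}' "$toy" | paste -sd ' ' -)
echo "positions with AF > 0.3: $kept"
rm -rf "$tmpdir"
```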

With these types of commands, you can get a nice and clean dataset of genotypes.
