# Introduction

## Download and get data

Change into the directory `notebooks/vcf/data/`, then download and unpack the example data:

`wget https://ucloud.univie.ac.at/index.php/s/sjVDEgg2KDvI9u8/download`

`tar -zxvf download`


## VCF files

The variant call format (VCF) is a standard text file format for storing genetic variants and their metadata. An example of a VCF file is shown below.

<img src="https://davetang.github.io/learning_vcf_file/img/vcf_format.png" alt="VCF example" width="1000"/>

**Figure 1 A VCF file.** Lines starting with the character `#` are header lines. These header lines define different metadata. Lines that do not start with `#` store genetic variants. Each line represents a variant. This figure is from https://davetang.github.io/learning_vcf_file/#introduction.


More information about VCF files can be found in [hts-specs](https://samtools.github.io/hts-specs/).
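
To make the layout in Figure 1 concrete, the sketch below builds a tiny VCF file by hand and separates header lines from variant lines with standard shell tools. The file name `toy.vcf`, the sample names `S1` and `S2`, and the two variants are made up for illustration and are not part of the course data.

```shell
# Build a tiny, made-up VCF file (two samples, two variants).
# printf turns each \t into a real tab, which the VCF format requires.
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=21>\n'
  printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf '21\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n'
  printf '21\t2000\t.\tC\tT\t.\tPASS\t.\tGT\t0|0\t0|1\n'
} > toy.vcf

# Count header lines (starting with '#') and variant lines (everything else).
n_header=$(grep -c '^#' toy.vcf)
n_variants=$(grep -vc '^#' toy.vcf)
echo "header lines: $n_header"
echo "variant lines: $n_variants"

# Sample names are columns 10 onwards of the '#CHROM' line.
samples=$(grep '^#CHROM' toy.vcf | cut -f 10- | tr '\t' ' ')
echo "samples: $samples"
```

The same `grep` commands also work on any uncompressed VCF file; for a `.vcf.gz` file, decompress it to a stream first.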

### The 1000 Genomes Project

The 1000 Genomes Project is a commonly used reference data resource in human genetic studies.

In the phase 3 data, there are 2,504 individuals from 26 populations ([Sudmant et al. 2015](https://doi.org/10.1038/nature15394)).

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/1000_Genomes_Project.svg/1280px-1000_Genomes_Project.svg.png?1657572184698" alt="1KG" width="1000"/>

**Figure 2 Locations of population samples of the 1000 Genomes Project.** Each circle represents the number of sequences in the final release. This figure is from https://www.wikiwand.com/en/1000_Genomes_Project#/Human_genome_samples.

In this course, we usually use genotype data from the YRI and CEU populations as examples.

Here,
- YRI: Yoruba in Ibadan, Nigeria.
- CEU: Utah Residents (CEPH) with Northern and Western European ancestry.

More information can be found on the project [website](https://www.internationalgenome.org/).



# bcftools

`bcftools` is an efficient program for manipulating and analysing VCF files. It provides several commands.
To list the available commands, type the following in the terminal.

```
bcftools
```

In this tutorial, we will use `bcftools view` to look at genetic variants and extract some data from VCF files. More information about `bcftools` can be found in its [manual](https://samtools.github.io/bcftools/bcftools.html).

## Check VCF files

It is simple to view data stored in a VCF file. Just type `bcftools view <VCF file name>` in the terminal. 

Here, we provide an example VCF file named `chr21.YRI.CEU.vcf.gz`. You can type the following command in your terminal to have a look at our example data.

```
bcftools view chr21.YRI.CEU.vcf.gz
```

As usual, this writes the whole file to the terminal, which is impractical for a file of this size. Pipe the output to `less` to page through it:

```
bcftools view chr21.YRI.CEU.vcf.gz | less
```

With the argument `-H`, no header lines are printed out by `bcftools`. This is convenient when you have a VCF file with many header lines and just want to look at the variants.

```
bcftools view chr21.YRI.CEU.vcf.gz -H | less
```
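
The counterpart of `-H` is `-h`, which prints only the header. The following is a minimal sketch, assuming `bcftools` is installed. It uses a made-up two-variant file `toy.vcf` (hypothetical samples `S1` and `S2`) so that it runs without the course data, but the same two commands apply to `chr21.YRI.CEU.vcf.gz`.

```shell
# Made-up toy VCF so the example is self-contained.
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=21>\n'
  printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf '21\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n'
  printf '21\t2000\t.\tC\tT\t.\tPASS\t.\tGT\t0|0\t0|1\n'
} > toy.vcf

# -h: header only. The last header line holds the column and sample names.
last_header=$(bcftools view -h toy.vcf | tail -n 1)
echo "$last_header"

# -H: variants only, so counting lines counts variants.
n_variants=$(bcftools view -H toy.vcf | wc -l | tr -d ' ')
echo "variants: $n_variants"
```

Counting lines this way is a quick check of how many variants a file contains before doing anything more expensive with it.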

As you can see, the input file is compressed (`.gz`), but the output of `bcftools view` is uncompressed plain text, which is why you can read it.

The common tool for compressing this kind of file is `bgzip`. Try looking at the compressed output:

```
bcftools view chr21.YRI.CEU.vcf.gz -H | bgzip -c | less
```

You would usually write this compressed output to a file rather than display it.

How do you do that?

## Extract data from a population

If we want to extract data for some individuals, for example samples from the same population, we can use the argument `-S` with a file containing the names of the samples we want to extract.

Here, we provide two example files `YRI.list` and `CEU.list` with samples from the YRI and CEU populations.

An example from `YRI.list` is below. Each line contains a sample name.

```
NA18486
NA18488
NA18489
NA18498
NA18499
```

Then we can extract genetic variants from the YRI and CEU populations with the following commands, respectively.

```
bcftools view chr21.YRI.CEU.vcf.gz -S YRI.list | bgzip -c > chr21.YRI.vcf.gz
bcftools view chr21.YRI.CEU.vcf.gz -S CEU.list | bgzip -c > chr21.CEU.vcf.gz
```

Let's have a look at the files!
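
One way to look at the result is to list the sample names in the new file with `bcftools query -l` and compare them with the list that was passed to `-S`. The sketch below does this on a made-up toy file (hypothetical samples `S1` to `S3`, list file `toy.list`) so that it runs on its own, assuming `bcftools` is installed. It uses `-Oz -o` as an alternative to piping through `bgzip`.

```shell
# Made-up toy input standing in for chr21.YRI.CEU.vcf.gz.
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=21>\n'
  printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
  printf '21\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\t0|0\n'
} > toy.vcf

# Made-up sample list standing in for YRI.list: one sample name per line.
printf 'S1\nS3\n' > toy.list

# Extract the listed samples and write bgzip-compressed output directly.
bcftools view -S toy.list -Oz -o toy.subset.vcf.gz toy.vcf

# List the sample names in the new file and compare them with the list.
bcftools query -l toy.subset.vcf.gz > toy.subset.samples
diff toy.list toy.subset.samples && echo "sample lists match"
```

For the course data, the same check would compare `YRI.list` with the output of `bcftools query -l chr21.YRI.vcf.gz`.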


## Filtering data

When analysing data, we usually want to keep only variants that meet certain conditions, for example variants of good quality. `bcftools view` provides several arguments for filtering the data in VCF files.

For example, the following commands extract biallelic single nucleotide polymorphisms (SNPs) that passed quality checks from the YRI and CEU populations.

```
bcftools view chr21.YRI.vcf.gz -f PASS -m 2 -M 2 -v snps | bgzip -c > chr21.YRI.biallelic.snps.vcf.gz
bcftools view chr21.CEU.vcf.gz -f PASS -m 2 -M 2 -v snps | bgzip -c > chr21.CEU.biallelic.snps.vcf.gz
```
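
A quick sanity check after filtering is to count the records before and after. The sketch below applies the same filter arguments to a made-up toy file, `toy.filter.vcf`, with four hypothetical records: a biallelic SNP, a multiallelic SNP, an indel, and a SNP whose FILTER column is not `PASS`. It assumes `bcftools` is installed; only the first record is expected to survive the filtering.

```shell
# Made-up toy VCF with one record of each kind described above.
{
  printf '##fileformat=VCFv4.2\n'
  printf '##FILTER=<ID=q10,Description="Quality below 10">\n'
  printf '##contig=<ID=21>\n'
  printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf '21\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n'
  printf '21\t2000\t.\tC\tT,G\t.\tPASS\t.\tGT\t0|1\t1|2\n'
  printf '21\t3000\t.\tG\tGA\t.\tPASS\t.\tGT\t0|1\t0|0\n'
  printf '21\t4000\t.\tT\tC\t.\tq10\t.\tGT\t0|1\t0|0\n'
} > toy.filter.vcf

# Count records without any filter, then with the filter arguments used above.
n_before=$(bcftools view -H toy.filter.vcf | wc -l | tr -d ' ')
n_after=$(bcftools view -H -f PASS -m 2 -M 2 -v snps toy.filter.vcf | wc -l | tr -d ' ')
echo "records before filtering: $n_before"
echo "records after filtering:  $n_after"
```

Removing one filter argument at a time and re-counting shows which records each argument is responsible for dropping.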

What is the meaning of the arguments `-f`, `-m`, `-M`, and `-v`? Ask bcftools directly with `bcftools view -h`, or use the [manual](https://samtools.github.io/bcftools/bcftools.html#view)!

What would happen with the argument `-a`?