# Introduction

## Download and get data

Go to: `~/appladmix/notebooks/2_data/data/`

`wget https://ucloud.univie.ac.at/index.php/s/sjVDEgg2KDvI9u8/download`

`tar -zxvf download`


### The 1000 Genomes Project

The 1000 Genomes Project is a commonly used reference data resource in human genetic studies.

In the phase 3 data, there are 2,504 individuals from 26 populations ([Sudmant et al. 2015](https://doi.org/10.1038/nature15394)).

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/1000_Genomes_Project.svg/1280px-1000_Genomes_Project.svg.png?1657572184698" alt="1KG" width="1000"/>

**Figure 2 Locations of population samples of 1000 Genomes Project.** Each circle represents the number of sequences in the final release. This figure is from https://www.wikiwand.com/en/1000_Genomes_Project#/Human_genome_samples.

In this course, we usually use genotype data from the YRI and CEU populations as examples.

Here,
- YRI: Yoruba in Ibadan, Nigeria.
- CEU: Utah Residents (CEPH) with Northern and Western European ancestry.

More information can be found on the project [website](https://www.internationalgenome.org/).


# Handling data in a VCF file

VCF files have columns (individuals) and rows (variant sites, each holding the genotypes of all individuals at that position). We can add columns with `bcftools merge`, and select columns with *e.g.* `bcftools view -s`. Likewise, we can add rows with `bcftools concat` and select rows with *e.g.* `bcftools view -r`.

**Concatenating:**

![image.png](attachment:image.png)

**Merging:**

![image-2.png](attachment:image-2.png)

So let's first try to get a subset of rows:

```
bcftools view -H -r 21:10000000-15000000 chr21.YRI.CEU.vcf.gz | less
```
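To make the row and column picture concrete, here is a small sketch that needs no bcftools. It builds a toy VCF body and uses `awk` to imitate, only conceptually, what `bcftools view -r` (rows) and `bcftools view -s` (columns) do. The sample names `S1` and `S2`, the positions, and the genotypes are invented for illustration.

```shell
# Toy VCF (invented data): 9 fixed columns, then one column per sample.
toy_vcf() {
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf '21\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t0|0\n'
  printf '21\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|1\t0|1\n'
  printf '21\t300\t.\tG\tA\t.\tPASS\t.\tGT\t0|0\t0|1\n'
}

# Select rows: keep the header plus sites on chromosome 21 with 150 <= POS <= 350
# (conceptually similar to `bcftools view -r 21:150-350`).
rows=$(toy_vcf | awk -F'\t' '/^#/ || ($1 == "21" && $2 >= 150 && $2 <= 350)')

# Select columns: keep the 9 fixed columns plus sample S2, which is column 11
# (conceptually similar to `bcftools view -s S2`).
cols=$(toy_vcf | awk -F'\t' -v OFS='\t' '{ print $1,$2,$3,$4,$5,$6,$7,$8,$9,$11 }')

printf '%s\n' "$rows"
printf '%s\n' "$cols"
```

On real data, bcftools is preferable because it uses the index for region queries and keeps the VCF header consistent.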

## Subsetting

Here, we mainly use subsetting to restrict the data to the samples we are interested in. If we want to extract data from some individuals, for example samples from the same population, we can use the argument `-S` with a file containing the names of the samples we want to extract.

Here, we provide two example files `YRI.list` and `CEU.list` with samples from the YRI and CEU populations.

An example from `YRI.list` is below. Each line contains a sample name.

```
NA18486
NA18488
NA18489
NA18498
NA18499
```

Then we can extract genetic variants from the YRI and CEU populations with the following commands, respectively.

```
bcftools view chr21.YRI.CEU.vcf.gz -S YRI.list | bgzip -c > chr21.YRI.vcf.gz
bcftools view chr21.YRI.CEU.vcf.gz -S CEU.list | bgzip -c > chr21.CEU.vcf.gz
```
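Before subsetting, it can help to check that every name in a sample list is actually present in the VCF, since `bcftools view -S` stops with an error on unknown sample names unless `--force-samples` is given. The sketch below is one possible way to do this; the helper `missing_samples`, the file names `wanted.list` and `in_vcf.list`, and the sample name `NA00000` are made up for illustration.

```shell
# Report names in a sample list that do not occur among the VCF's samples.
# $1: file with the requested sample names (one per line)
# $2: file with the sample names present in the VCF (one per line)
missing_samples() {
  grep -vxFf "$2" "$1"
}

# Toy demonstration with invented content; NA00000 is a made-up name.
tmp=$(mktemp -d)
printf 'NA18486\nNA18488\nNA00000\n' > "$tmp/wanted.list"
printf 'NA18486\nNA18488\nNA18489\n' > "$tmp/in_vcf.list"
missing=$(missing_samples "$tmp/wanted.list" "$tmp/in_vcf.list")
echo "$missing"
rm -r "$tmp"

# With the course data this could be used as (requires bcftools):
#   bcftools query -l chr21.YRI.CEU.vcf.gz > in_vcf.list
#   missing_samples YRI.list in_vcf.list
```

An empty result means that all requested samples are in the file.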

Let's have a look at the files!


## Filtering data

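When analysing data, we usually want to restrict ourselves to variants that meet certain conditions, for example, variants of good quality. `bcftools view` offers several arguments to filter the data in VCF files accordingly.

For example, the following commands extract biallelic single nucleotide polymorphisms (SNPs) from the YRI and CEU populations. These commands select on the number of alleles and the variant type only; they do not test the FILTER column, so restricting the output to records marked `PASS` would need an additional option such as `-f PASS`.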

```
bcftools view chr21.YRI.vcf.gz -m 2 -M 2 -v snps | bgzip -c > chr21.YRI.biallelic.snps.vcf.gz
bcftools view chr21.CEU.vcf.gz -m 2 -M 2 -v snps | bgzip -c > chr21.CEU.biallelic.snps.vcf.gz
```

What is the meaning of the arguments `-m`, `-M`, and `-v`? Ask bcftools directly: `bcftools view -h` 

Or use the [manual](https://samtools.github.io/bcftools/bcftools.html#view)!

One can do much more filtering with `bcftools`, but this should suffice for now.
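After looking up the flags, you can check your understanding against the toy sketch below. It approximates with `awk` what the combination `-m 2 -M 2 -v snps` selects: sites with exactly two alleles (REF plus one ALT) where both alleles are single bases. The records are invented, and this is only a rough approximation of the bcftools logic, not a replacement for it.

```shell
# Toy records (invented): CHROM POS ID REF ALT
toy_sites() {
  printf '21\t100\t.\tA\tG\n'      # biallelic SNP
  printf '21\t200\t.\tC\tT,G\n'    # multiallelic SNP (3 alleles)
  printf '21\t300\t.\tG\tGA\n'     # biallelic insertion
  printf '21\t400\t.\tT\tC\n'      # biallelic SNP
}

# Keep positions of sites with exactly two alleles, both of length 1.
kept=$(toy_sites | awk -F'\t' '{
  n_alleles = split($5, alt, ",") + 1              # REF plus each ALT allele
  is_snp = (length($4) == 1 && length($5) == 1)    # single-base REF and ALT
  if (n_alleles == 2 && is_snp) print $2
}')
echo "$kept"
```

Only positions 100 and 400 remain; the multiallelic site and the insertion are dropped.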


## Adding data

Very good. I assume we are all somewhat interested in ***admixture***. Ok, how about using Neanderthals and Denisovans as possible source populations? 

Let's download the VCF files:

```
wget https://ucloud.univie.ac.at/index.php/s/enPq6XjDqkaeXRW/download

tar -zxvf download.1

bcftools index altai.filt.vcf.gz
bcftools index denisovan.filt.vcf.gz
```


And perform a merging operation:

```
bcftools merge -O z -o chr21.merged.vcf.gz -g /home/local/ANTHROPOLOGY/kuhlwilmm83/refgen/hg19/hg19.p13.plusMT.no_alt_analysis_set.fa.gz chr21.YRI.CEU.vcf.gz altai.filt.vcf.gz denisovan.filt.vcf.gz
```

Let's have a look at the file!



## Data analysis

More importantly, we want to get an idea of what the data contain. For example, what is the genetic diversity of these populations?

Let's start with heterozygosity (how many heterozygous sites does each individual carry on this chromosome?). It is easy to obtain, again with bcftools:

```
bcftools stats -s -  chr21.merged.vcf.gz | grep "^PSC" -B 1 > hets.txt
```

If you inspect the file, you will see that there is one line per sample, with several columns containing the statistics. Column 6 (`nHets`) is the number of heterozygous genotypes.
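For a quick look at only the sample names and the heterozygous counts, the two relevant columns can be pulled out with `awk`. The sketch below runs on a toy table with invented sample names and numbers that mimics the first six columns of the `PSC` lines; on the real data the same `awk` expression could be applied to `hets.txt`.

```shell
# Toy stand-in for hets.txt (invented samples and counts), tab-separated.
toy_psc() {
  printf '# PSC\t[2]id\t[3]sample\t[4]nRefHom\t[5]nNonRefHom\t[6]nHets\n'
  printf 'PSC\t0\tsampleA\t900\t40\t60\n'
  printf 'PSC\t0\tsampleB\t880\t30\t90\n'
}

# Keep only data lines (first field is exactly "PSC") and
# print the sample name (column 3) and the number of hets (column 6).
table=$(toy_psc | awk -F'\t' '$1 == "PSC" { print $3 "\t" $6 }')
printf '%s\n' "$table"

# On the course data (assuming hets.txt was created as above):
#   awk -F'\t' '$1 == "PSC" { print $3 "\t" $6 }' hets.txt
```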

Maybe this is easier in `R`?

We can:
* run R from the command line
* read in these summary tables
* plot a histogram
* calculate the mean for CEU and YRI separately, compared to the archaics
* perform a statistical test of whether CEU and YRI are different
* There are ~30 million bases for which we can have data on chr21, and heterozygosity is often expressed as hets per 1,000 bp. We can present the values in this way.

```
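# Start an interactive R session from the shell
R --vanilla

# Read the per-sample counts; comment.char="" keeps the header line that starts with "#"
hets<-read.table("hets.txt", sep="\t",header=T,comment.char="")

# Histogram of the number of heterozygous sites per sample (column 6)
png("hets_histogram.png",600,600)
hist(hets[,6],breaks=20)
dev.off()

# Sample names of the two populations
ceu<-unlist(read.table("CEU.list", sep="\t",header=F))
yri<-unlist(read.table("YRI.list", sep="\t",header=F))

# Wilcoxon rank-sum test: do CEU and YRI differ in their number of hets?
wilcox.test(hets[which(hets[,3]%in%ceu),6],hets[which(hets[,3]%in%yri),6])

# Approximate number of bases with data on chr21
genom=30000000

# Hets per 1,000 bp: population means first, then the two archaic genomes
mean(hets[which(hets[,3]%in%ceu),6])/genom*1000
mean(hets[which(hets[,3]%in%yri),6])/genom*1000
hets[which(hets[,3]%in%c("AltaiNea","DenisovaPinky")),6]/genom*1000
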
```
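The scaling to hets per 1,000 bp is plain arithmetic: mean number of hets, divided by the number of bases, times 1,000. A minimal sketch of that calculation from the shell, using two invented het counts (30,000 and 45,000) and the same approximate chromosome length of 30 million bases:

```shell
# Approximate number of bases with data on chr21 (same value as in the R code).
genome=30000000

# Mean of the invented counts, scaled to heterozygous sites per 1,000 bp.
per_kb=$(printf '30000\n45000\n' |
  awk -v g="$genome" '{ s += $1 } END { printf "%.2f", s / NR / g * 1000 }')
echo "$per_kb"
```

Here the mean count is 37,500, which corresponds to 1.25 heterozygous sites per 1,000 bp.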

## Genetic diversity

The number of heterozygous positions reflects an evolutionarily important feature, as it is a very direct metric of genetic diversity.

* We can interpret the numbers above in the context of evolution.
* Which evolutionary forces can influence the observed heterozygosity?
* Which technical problems can influence it?


# Now, let's go for the first challenge!