# Data analysis

Before going further, it is important to get an idea of what the data look like. For example, how genetically diverse are these populations?

Let's start with heterozygosity: how many heterozygous sites does each individual carry on this chromosome? This is easy to obtain, again with bcftools:

```
bcftools stats -s -  chr21.merged.vcf.gz | grep "^PSC" -B 1 > hets.txt
```

If you inspect the file, you will see a header line (kept by `-B 1`) followed by one line per sample, with several columns containing the statistics. Column 6 (`nHets`) is the number of heterozygous genotypes.
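
As a quick check before moving on, the two columns of interest can be pulled out on the command line. This is only a minimal sketch, assuming tab-separated fields with the sample name in column 3 and the heterozygous count in column 6; the helper name `psc_hets` is made up for this illustration.

```shell
# Print "sample<TAB>nHets" for every PSC row of a bcftools stats table.
# Assumption: sample name is in column 3 and nHets in column 6.
# The header line starts with "# PSC", so it is skipped by the exact match on "PSC".
psc_hets() {
    awk -F'\t' '$1 == "PSC" { print $3 "\t" $6 }' "$@"
}

# Usage on the file created above (uncomment to run), sorted by count:
# psc_hets hets.txt | sort -k2,2n
```

Sorting by the second column makes it easy to spot samples with unusually low or high counts.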

Maybe this is easier in `R`?

We can:
* run R from the command line (`R --vanilla`) or in RStudio, and move to your own directory with `setwd()`

* read in these summary tables

* plot a histogram

* calculate the mean for CEU and YRI separately, and compare them to the archaic individuals

* perform a statistical test of whether CEU and YRI are different

* There are ~30 million bases on chr21 for which we can have data. Heterozygosity is often expressed as hets per 1,000 bp, so we can present the values that way.


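```
# Read the per-sample counts; comment.char="" keeps the header line that starts with "#"
hets <- read.table("hets.txt", sep="\t", header=TRUE, comment.char="")

# Histogram of the number of heterozygous sites per sample (column 6)
png("hets_histogram.png", 600, 600)
hist(hets[,6], breaks=20)
dev.off()

# Sample names belonging to each population
ceu <- unlist(read.table("~/appladmix/notebooks/1_data/data/CEU.list", sep="\t", header=FALSE))
yri <- unlist(read.table("~/appladmix/notebooks/1_data/data/YRI.list", sep="\t", header=FALSE))

# Wilcoxon rank-sum test: do CEU and YRI differ? (column 3 holds the sample name)
wilcox.test(hets[which(hets[,3] %in% ceu),6], hets[which(hets[,3] %in% yri),6])

# Approximate number of bases with data on chr21
genom <- 30000000

# Heterozygous sites per 1,000 bp: population means, then the two archaic individuals
mean(hets[which(hets[,3] %in% ceu),6]) / genom * 1000
mean(hets[which(hets[,3] %in% yri),6]) / genom * 1000
hets[which(hets[,3] %in% c("AltaiNea","DenisovaPinky")),6] / genom * 1000
```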
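
To make the scaling to hets per 1,000 bp explicit, the arithmetic can also be sketched in the shell. This is an illustrative helper only: the name `hets_per_kb` is hypothetical, the default length is the ~30 million bases mentioned above, and the count in the example is a made-up toy number, not a result from the data.

```shell
# Convert a count of heterozygous sites into hets per 1,000 bp.
# $1 = number of heterozygous sites
# $2 = number of bases with data (defaults to 30000000, the chr21 approximation used here)
hets_per_kb() {
    awk -v n="$1" -v len="${2:-30000000}" 'BEGIN { printf "%.3f\n", n / len * 1000 }'
}

# Toy example with a made-up count: 30000 / 30000000 * 1000
hets_per_kb 30000   # prints 1.000
```

Passing a different second argument shows how sensitive the value is to the assumed number of callable bases.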

## Genetic diversity

The number of heterozygous positions reflects an evolutionarily important feature, as it is a very direct metric of genetic diversity.

* We can interpret the numbers above in the context of evolution.
* Which evolutionary forces can influence the observed heterozygosity?
* Which technical problems can influence it?


# Now, let's go for the first challenge!