## Heterozygosity

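Now, let's look at heterozygosity on a more global scale. For this, we can use `bcftools stats`. The `-s` option selects the samples for which per-sample statistics are reported (`-s -` includes all of them), and the rows starting with `PSC` contain the per-sample counts. We keep those rows, plus the header line just before them (`-B 1`), as our raw table.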


```
bcftools stats -s - $TGDATA | grep "^PSC" -B 1 > hets.txt
```

Keep in mind that this, like many other tasks, can be solved in several ways. I show one version here, but there are alternatives.

If you inspect the file, you will see one line per sample, with several columns of statistics. Column 6 is the number of heterozygous sites (`nHets`).

* Let's have a look at the data!

```
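# start R from the shell:
R --vanilla

# inside R: read the per-sample counts
# comment.char="" keeps the header line, which starts with "#"
hets<-read.table("hets.txt", sep="\t",header=T,comment.char="")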


#png("hets_histogram.png",600,600)
hist(hets[,6])
#dev.off()
```
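Because the header fields of the `PSC` table look like `[6]nHets`, `read.table` rewrites them into syntactically valid `R` names. The following sketch uses a small made-up table, modelled on the first six columns of the `PSC` header, to show what the column names become, so that columns can be addressed by name rather than by position. The sample names and counts are invented, and `hets_toy` is a hypothetical name:

```r
# Toy stand-in for hets.txt: a header line plus two invented samples.
# Only the first six columns are included; all values are made up.
toy <- paste(
    "# PSC\t[2]id\t[3]sample\t[4]nRefHom\t[5]nNonRefHom\t[6]nHets",
    "PSC\t0\tS1\t100\t20\t30",
    "PSC\t0\tS2\t90\t25\t35",
    sep = "\n")

# Same read.table settings as for the real file
hets_toy <- read.table(text = toy, sep = "\t", header = TRUE, comment.char = "")

# Show the names R has assigned to the columns
print(colnames(hets_toy))

# Address the heterozygous-site counts by name instead of by index
print(hets_toy$X.6.nHets)
```

On the real data, `hets$X.6.nHets` should then refer to the same column as `hets[,6]`.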

* Is there a relationship between homozygous alternative (1/1) and heterozygous (0/1) sites?

```
#png("hets_vs_homALT.png",600,600)
plot(hets[,5],hets[,6])
#dev.off()

cor.test(hets[,5],hets[,6])
```

* Now, I would like to stratify this by population groups. For this, we need to load the metadata and do some merging. Unfortunately, the metadata file contains stray empty columns, so we pass `fill=T` to make `R` tolerate them. Then we merge the metadata with the sample ID and heterozygous-site count columns (columns 3 and 6) of our table.


```
meta<-read.table("/lisc/scratch/course/2024w550001/share/integrated_call_samples_v3.20130502.ALL.panel", sep="\t",header=T,fill=T)

head(meta)

hetmet<-merge(hets[,c(3,6)],meta,by.x=1,by.y=1)

head(hetmet)
```
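By default, `merge` keeps only rows whose key appears in both tables, so samples without metadata would be dropped silently. A minimal sketch on invented data (all names and values here are hypothetical) shows how one might check for this:

```r
# Toy tables: three samples with counts, but metadata for only two of them
hets_toy <- data.frame(sample = c("S1", "S2", "S3"), nHets = c(30, 35, 28))
meta_toy <- data.frame(sample = c("S1", "S2"),
                       pop = c("IBS", "FIN"),
                       super_pop = c("EUR", "EUR"))

# Merge on the first column of each table, as in the tutorial
merged_toy <- merge(hets_toy, meta_toy, by.x = 1, by.y = 1)

# Compare row counts before and after merging
print(c(before = nrow(hets_toy), after = nrow(merged_toy)))

# List the samples that have no metadata entry
missing_toy <- setdiff(hets_toy$sample, meta_toy$sample)
print(missing_toy)
```

On the real data, comparing `nrow(hets)` with `nrow(hetmet)` would be the equivalent check.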

* Let's have some fun with statistics in `R`, comparing the number of heterozygous sites between two populations (IBS and FIN):

```
mean(hetmet[which(hetmet$pop=="IBS"),2])
mean(hetmet[which(hetmet$pop=="FIN"),2])
wilcox.test(hetmet[which(hetmet$pop=="IBS"),2],hetmet[which(hetmet$pop=="FIN"),2])
```
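Instead of computing the mean for one population at a time, `tapply` can summarise all groups in one call. This is only a sketch on invented numbers; `hetmet_toy` and its values are hypothetical:

```r
# Toy stand-in for the merged table: counts and population labels are invented
hetmet_toy <- data.frame(nHets = c(10, 20, 30, 50),
                         pop = c("IBS", "IBS", "FIN", "FIN"))

# Mean number of heterozygous sites per population, all groups at once
pop_means <- tapply(hetmet_toy$nHets, hetmet_toy$pop, mean)
print(pop_means)
```

With the real table, the same idea would be `tapply(hetmet[,2], hetmet$pop, mean)`.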

* Then, let's make a boxplot stratified by continental population (already adding colour and labels to make it nicer):

```
#png("hets_superpop_boxplot.png",600,600)
boxplot(X.6.nHets~super_pop,data=hetmet,
    col=c("blue","green","orange","yellow","red"),
    xlab="Continental population",ylab="Heterozygous sites")
#dev.off()
```

Even this very simple statistic has a biological meaning, which makes it very informative!

* A final thing: nicely ordering the data by super-population, and plotting the distribution by more specific populations. This requires a bit of `R`-specific data handling, to get a properly sorted dataframe.

```
mypops<-c("AFR","SAS","EUR","EAS","AMR")
sortedpops<-list()
for (npop in mypops) { sortedpops[[npop]]<-unique(hetmet$pop[which(hetmet$super_pop==npop)]) }
hetmetnice<-data.frame(nHets=hetmet[,2],pop=hetmet$pop,super_pop=hetmet$super_pop)
hetmetnice$pop = factor(hetmetnice$pop, levels=unlist(sortedpops))

#png("hets_pop_boxplot.png",1200,600)
boxplot(nHets~pop,data=hetmetnice,
    xlab="Population",ylab="Heterozygous sites",
    col=c(rep("blue",length(sortedpops[[1]])),rep("green",length(sortedpops[[2]])),rep("orange",length(sortedpops[[3]])),rep("yellow",length(sortedpops[[4]])),rep("red",length(sortedpops[[5]]))))
#dev.off()
```
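The order of the boxes is controlled by the factor levels of `pop`. As an alternative to the loop, the levels can also be derived with `match` and `order`. The sketch below uses four real 1000 Genomes population codes in a tiny invented table; the object names are hypothetical:

```r
# Toy data: four populations and their super-populations
pop_toy <- c("GBR", "YRI", "CHB", "LWK")
super_toy <- c("EUR", "AFR", "EAS", "AFR")

# Desired order of the super-populations
order_toy <- c("AFR", "EUR", "EAS")

# Rank each row by its super-population, then take populations in that order
lev_toy <- unique(pop_toy[order(match(super_toy, order_toy))])

# A factor with these levels is plotted in exactly this order
f_toy <- factor(pop_toy, levels = lev_toy)
print(levels(f_toy))
# prints "YRI" "LWK" "GBR" "CHB"
```

Either way, the resulting factor levels fix the left-to-right order of the boxes in the plot.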

Now you can see how heterozygosity differs between populations and super-populations. Boxplots are not an ideal way of presenting data, but they are convenient in base `R`.
You might prefer a violin plot with individual dots, which requires a more specialized `R` package such as `ggplot2`:

```
library("ggplot2")
colvec<-c("blue","green","orange","yellow","red");names(colvec)<-mypops

ggplot(hetmetnice) + theme_minimal() +
    geom_violin(mapping=aes(x=pop,y=nHets,fill=super_pop),adjust=1.0,
        draw_quantiles=c(0.25, 0.5, 0.75),scale="width",na.rm=T) +
    geom_jitter(mapping=aes(x=pop,y=nHets),height=0,width=0.2,
        na.rm=T,inherit.aes=F,show.legend=F,size=.2) +
    scale_fill_manual(values=colvec) +
    theme(panel.border=element_blank(),
        axis.text.x=element_text(angle=45,vjust=1,hjust=1,size=12.5),
        axis.text.y=element_text(size=11),
        plot.title=element_text(face="bold",hjust=0.5,size=15),
        axis.title.y=element_text(size=12)) +
    xlab("") + ylab("Number of heterozygous sites") +
    ggtitle(label="1000 Genomes heterozygous sites on chr18")
```
