# Calculating a PCA in R

Last time, we calculated a PCA using a command-line tool. There are many ways of doing this, and now we will do it in R as well.

But first, let's have a look at the output we generated last time and make a nicer figure from it. Then we will calculate the PCA with R functions.


## Making a nicer plot

I'm not happy with the plot. Maybe one could pass some options to the tool itself to get it into shape, but it is even better to make a nice plot of your own. Let's use `R` then!

```
R --vanilla
```

We read in the data and metadata:
```
pca<-read.table("geno_pca.eigenvec",sep="\t", header=T,comment.char="")
ceu<-unlist(read.table("../CEU.list", sep="\t",header=F,comment.char=""))
yri<-unlist(read.table("../YRI.list", sep="\t",header=F,comment.char=""))
```

We change groupings to our liking:
```
pca$Group[which(pca$SampleName%in%ceu)]<-"European"
pca$Group[which(pca$SampleName%in%yri)]<-"African"
pca$Group[which(pca$SampleName=="AltaiNea")]<-"Neanderthal"
pca$Group[which(pca$SampleName=="DenisovaPinky")]<-"Denisovan"
```

And add a nice color coding:
```
pca$Cluster[which(pca$SampleName%in%ceu)]<-"blue"
pca$Cluster[which(pca$SampleName%in%yri)]<-"green"
pca$Cluster[which(pca$SampleName=="AltaiNea")]<-"orange"
pca$Cluster[which(pca$SampleName=="DenisovaPinky")]<-"red"
```
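The two blocks above repeat the same sample tests, once for the groups and once for the colours. As a minimal sketch of an alternative, assuming the `Group` column is already filled in, a named vector can serve as a lookup table from group to colour. The toy data frame and the name `group_cols` are made up for illustration and are not part of the original workflow.

```r
# hypothetical lookup table: group label -> colour
group_cols <- c(European = "blue", African = "green",
                Neanderthal = "orange", Denisovan = "red")

# toy stand-in for the real `pca` table (sample names s1 and s2 are invented)
toy <- data.frame(SampleName = c("s1", "s2", "AltaiNea", "DenisovaPinky"),
                  Group = c("European", "African", "Neanderthal", "Denisovan"),
                  stringsAsFactors = FALSE)

# index the lookup table by group label to get one colour per sample
toy$Cluster <- unname(group_cols[toy$Group])
toy
```

The same vector could also feed the legend, for example with `legend=names(group_cols)` and `fill=group_cols`.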

And now make a nice plot:

```
pdf("PCA_chr21_col.pdf",6,6)
plot(pca[,4],pca[,5],col=pca[,3],pch=16,
    xlab="PC1",
    ylab="PC2",
    main=paste("PCA of chr21"))
legend("bottomright",legend=c("European","African","Neanderthal","Denisovan"), fill=c("blue","green","orange","red"),bty="n" )
dev.off()
```



## A simple PCA in R

### Data preparation

Let's take the same file. What we need now is a genotype matrix, that is, a table of numbers which can be used by a statistical method. This table should not contain missing data or sites that are not biallelic SNVs. We can ensure this with another filtering step.

Another neat feature of `bcftools` is that it can convert `vcf` formatted files into almost any format you need. In this case, we really just want the genotypes without anything else. `bcftools query` is good for that, since you can define which fields to extract and how to separate them.

* Let's do it in one go!

```
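# in BASH
# 1) trim alternative alleles not seen in these samples, 2) keep only biallelic SNPs,
# 3) exclude sites with any missing genotype, 4) print one line of space-separated genotypes per site

bcftools view -a chr21.merged.vcf.gz | bcftools view -m 2 -M 2 -v snps | bcftools filter -e 'GT[*] = "mis"' | bcftools query -f '[%GT ]\n' > allgts.txt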
```

* This is still a very big file! Let's "thin" it a bit by randomly taking 100000 SNPs:

```
shuf -n 100000 allgts.txt > somegts.txt
```

There are other ways of doing this, but this one is convenient and keeps things simple.
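As one such alternative, the thinning could also be done inside `R` once a table is loaded. The sketch below uses a made-up toy table instead of the real genotype file, and the seed and the number of rows to keep are arbitrary assumptions. Note that this requires reading the full file into memory first, which is exactly what `shuf` avoids.

```r
set.seed(2024)   # arbitrary seed, only to make the random draw reproducible

# toy stand-in for the genotype table: 10 sites (rows) x 3 individuals (columns)
toy <- data.frame(V1 = rep(c("0|0", "0|1"), 5),
                  V2 = rep(c("1|1", "0|0"), 5),
                  V3 = rep("0|1", 10),
                  stringsAsFactors = FALSE)

n_keep <- 4                                 # number of sites to keep (assumption)
keep <- sort(sample(nrow(toy), n_keep))     # random row indices, sorted to preserve site order
thinned <- toy[keep, ]
dim(thinned)
```

Unlike `shuf`, which also shuffles the lines, sorting the sampled indices keeps the sites in their original order.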

* Then we can switch into `R`.


### Data preparation in R

```
R --vanilla

snps<-read.table("somegts.txt", sep=" ",header=F,comment.char="")
head(snps)
```

* Note that the separator is now a space, because that is how we defined it in the previous step. However, this adds a trailing space to the end of each line, so R believes there is an additional (empty) column. Let's remove it and then inspect the first and last columns.

```
snps<-snps[,-ncol(snps)]

table(snps[,1])
table(snps[,209])
```
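To see where the extra column comes from, here is a small self-contained sketch. The two toy lines mimic the space-separated output described above, each ending in a space; the object names are made up for illustration.

```r
# two toy lines of genotypes, each with a trailing space before the line break
toy_lines <- "0|0 0|1 1|1 \n0|1 0|0 0|0 \n"

toy <- read.table(text = toy_lines, sep = " ", header = FALSE, comment.char = "")
dim(toy)                       # 2 rows, 4 columns: three genotypes plus one empty column
last_col <- toy[, ncol(toy)]   # the spurious column created by the trailing space

toy <- toy[, -ncol(toy)]       # drop it, as done for `snps`
dim(toy)                       # 2 rows, 3 columns
```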

Now, these are not really numbers, but genotypes...

* A very easy way to turn them into single numbers would be using the `ifelse` statement like this:

```
snps[,1]<-ifelse(snps[,1]%in%c("0|0","0/0"),0,snps[,1])
snps[,1]<-ifelse(snps[,1]%in%c("1|1","1/1"),2,snps[,1])
snps[,1]<-ifelse(snps[,1]%in%c("0|1","1|0","0/1","1/0"),1,snps[,1])
table(snps[,1])
```

* However, this is not a numeric matrix:

```
is.numeric(snps[,1])
snps[,1]<-as.numeric(snps[,1])
is.numeric(snps[,1])
```

This looks better, but you certainly don't want to type the same thing 208 more times, and you would very likely make mistakes along the way.

* Let's use a `for` loop and nested `ifelse` statements to make it easier! As a side-effect, the whole column will be numeric as well!

```
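for (j in (2:ncol(snps))) {
  # homozygous reference -> 0, homozygous alternative -> 2, heterozygous -> 1, anything else -> NA
  snps[,j]<-ifelse(snps[,j]%in%c("0|0","0/0"),0,
            ifelse(snps[,j]%in%c("1|1","1/1"),2,
            ifelse(snps[,j]%in%c("0|1","1|0","0/1","1/0"),1,NA)))
  print(j)  # show progress
}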


table(snps[,3])
table(snps[,208])
is.numeric(snps[,159])
```

At this stage it is fine to do things this way. It would be even better to write a `function` to make it more efficient...
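As a minimal sketch of what such a function could look like: the helper `gt_to_count` below is hypothetical (it is not part of any package) and uses a named vector as a lookup table instead of nested `ifelse` calls. It is shown on a made-up toy table.

```r
# hypothetical helper: genotype strings -> number of alternative alleles (0, 1 or 2)
gt_to_count <- function(gt) {
  lookup <- c("0|0" = 0, "0/0" = 0,
              "0|1" = 1, "1|0" = 1, "0/1" = 1, "1/0" = 1,
              "1|1" = 2, "1/1" = 2)
  # any genotype that is not in the lookup table becomes NA
  unname(lookup[as.character(gt)])
}

# toy table: 3 sites (rows) x 2 individuals (columns)
toy <- data.frame(V1 = c("0|0", "0/1", "1|1"),
                  V2 = c("1|0", "./.", "0/0"),
                  stringsAsFactors = FALSE)

# apply the helper to every column at once, keeping the data frame structure
toy[] <- lapply(toy, gt_to_count)
toy
sapply(toy, is.numeric)
```

This should be applied to a freshly read table, before any column has been converted by hand, because an already numeric column would no longer match the genotype strings and would turn into `NA`.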

If you feel that coming up with efficient solutions is complicated, that is fine - it is a matter of practice and experience. We need to learn things in order to do them.


### Calculating the PCA

* Did we forget anything? Ah, yes, we need the proper functions to calculate this! A possible (and very nice) library with such functions is [adegenet](https://adegenet.r-forge.r-project.org/).

FYI: R has many packages, and often they are not pre-installed on a given system. If that is the case, you can usually install one with `install.packages("adegenet")`. Fortunately, it is already available here, so you don't need to!

The method we want to use is called `dudi.pca` (it comes from the `ade4` package, which is loaded together with `adegenet`). It needs the individuals in the rows and the positions in the columns, that is, a transposed table, which we get with `t()`. As you see, nested functions are very common in `R`, to combine things on the fly without intermediate objects.

* Let's do this:

```
library("adegenet")

pca_object <- dudi.pca(t(snps),nf=20,scannf=F)

tb<-list()
for (j in (1:ncol(snps)))  { tb[[j]]<-table(snps[,j]) }

pca_object
summary(pca_object)
```

This works so far! We have created a higher-level `R` object with lots of information. Most important are the coordinates for each individual, which are stored in the sub-object called `li` and can be accessed with the `$` sign.
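To see what "individuals in the rows" means in practice, here is a small sketch on a made-up genotype matrix. It deliberately uses `prcomp` from base R instead of `dudi.pca`, so that it runs without extra packages; the scaling conventions of the two functions differ, so the numbers are not expected to be identical. In `prcomp`, the per-individual coordinates are stored in `$x`, which plays a role similar to `$li`.

```r
set.seed(42)   # arbitrary seed for the toy data

# toy genotype matrix: 50 sites (rows) x 6 individuals (columns), coded 0/1/2
toy <- matrix(sample(0:2, 300, replace = TRUE), nrow = 50,
              dimnames = list(NULL, paste0("ind", 1:6)))

dim(toy)      # 50 x 6: sites in rows, as in the genotype table
dim(t(toy))   # 6 x 50: individuals in rows, as the PCA expects

# base-R PCA as a stand-in (an assumption of this sketch, not the method used above)
toy_pca <- prcomp(t(toy))

# one row of coordinates per individual
dim(toy_pca$x)
rownames(toy_pca$x)
```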

### Plotting the PCA

* Let's plot this!

```
pdf("PCA_1000g.pdf",12,12)
plot(pca_object$li[,1],pca_object$li[,2])
dev.off()
```

Ok, this is just some distribution of dots... It is nicer to color it. For this, we need the metadata again. In `R`, we can then create a matching colour vector:

```
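ceu<-unlist(read.table("../CEU.list", sep="\t",header=F,comment.char=""))
yri<-unlist(read.table("../YRI.list", sep="\t",header=F,comment.char=""))
# first column: population label, second column: sample name
names<-rbind(cbind("CEU",ceu),cbind("YRI",yri),cbind(c("Neanderthal","Denisovan"),c("AltaiNea","DenisovaPinky")))
mypops<-c("CEU","YRI","Neanderthal","Denisovan")
cols=c("blue","green","orange","red")

# one colour per sample, replacing each population label by its colour
# (assumes the rows of `names` are in the same order as the samples in the VCF, i.e. the columns of `snps`)
colvector<-names[,1]
for (pop in (1:length(mypops))) { colvector<-gsub(mypops[pop],cols[pop],colvector) }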
```
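A colour vector is only correct if its order matches the order of the individuals in the genotype table. As a hedged sketch of a more robust approach, the colours can be matched by sample name. The sample order could be obtained with `bcftools query -l` on the VCF file; here, the sample IDs `NA1` to `NA3` and the vector `pop_of` are invented for illustration.

```r
# toy sample order, standing in for the order of the columns in the VCF
samples <- c("NA2", "NA1", "AltaiNea", "NA3", "DenisovaPinky")

# hypothetical lookup: sample name -> population label
pop_of <- c(NA1 = "CEU", NA2 = "YRI", NA3 = "CEU",
            AltaiNea = "Neanderthal", DenisovaPinky = "Denisovan")

mypops <- c("CEU", "YRI", "Neanderthal", "Denisovan")
cols <- c("blue", "green", "orange", "red")

# look up each sample's population, then that population's colour
colvector <- cols[match(pop_of[samples], mypops)]
colvector   # "green" "blue" "orange" "blue" "red"
```

Because the lookup goes through the sample names, the result stays correct even if the metadata lists are in a different order than the VCF.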

* Let's plot it!

```
pdf("PCA_1000g_col.pdf",12,12)
plot(pca_object$li[,1],pca_object$li[,2],col=colvector)
legend("topright",legend=mypops, fill=cols,bty="n" )
dev.off()
```

This is actually already quite good, considering that it is just a part of chromosome 21! We can do some more polishing to make it look nicer, and add the information on how much of the variation is explained by these first two PCs.

* Let's add nice axis labels (`xlab` and `ylab`), and points instead of empty circles (with `pch`):


```
percent_variation=round(pca_object$eig/sum(pca_object$eig)*100,2)
pdf("PCA_1000g_nice.pdf",12,12)
plot(pca_object$li[,1],pca_object$li[,2],col=colvector,pch=16,
    xlab=paste("PC1"," (",percent_variation[1],"%)",sep=""),
    ylab=paste("PC",2," (",percent_variation[2],"%)",sep=""),
    main=paste("PCA of 1000G, chr21"))
legend("topright",legend=mypops, fill=cols,bty="n" )
dev.off()
```

Now, this looks good, and you can gain some insight into the relationships of these populations!
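The percentage of variation per PC is simply each eigenvalue divided by the sum of all eigenvalues. A tiny worked example with made-up eigenvalues (not taken from the real data) shows the arithmetic and how the axis label is assembled:

```r
# made-up eigenvalues, standing in for pca_object$eig
eig <- c(5, 3, 1.5, 0.5)

# share of the total variation per component, in percent
percent_variation <- round(eig / sum(eig) * 100, 2)
percent_variation   # 50 30 15 5

# axis label built the same way as in the plot above
xlab_text <- paste("PC1", " (", percent_variation[1], "%)", sep = "")
xlab_text           # "PC1 (50%)"
```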



# Very good! Next step: D-statistics!