# eigenstrat format dataset

Now, everything is running. Let's first get our genetic data! We will look at a dataset with 104 populations (total of 198 individuals), with data for 1.24 million positions across the human genome.

Download and unpack the data:

```
cd ~/notebooks/admixtools/data

wget https://ucloud.univie.ac.at/index.php/s/fYNi6xXR83R1YEc/download
tar -zxvf download
```

Let's have a look at the top few lines of each file to understand what is inside:

    head dataset.ind

    head dataset.snp

Check the length of each of those files, which tells us how many individuals and how many SNPs are in the dataset:

```
wc -l dataset.ind
wc -l dataset.snp
```
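Beyond the raw line count, it can help to see how many individuals carry each population label. The sketch below assumes the usual EIGENSTRAT `.ind` layout (sample ID, sex, population label, separated by whitespace). The file `example.ind`, its sample IDs, and the helper `count_per_pop` are made up for illustration; the same function can be pointed at `dataset.ind`.

```shell
# Build a tiny example in the .ind layout: sample ID, sex, population label.
# The sample IDs are hypothetical.
cat > example.ind <<'EOF'
sampleA M Yoruba.DG
sampleB F Yoruba.DG
sampleC M Papuan.DG
EOF

# Count individuals per population label (third column), sorted by label.
count_per_pop() {
    awk '{ n[$3]++ } END { for (p in n) print p, n[p] }' "$1" | sort
}

count_per_pop example.ind
# prints:
# Papuan.DG 1
# Yoruba.DG 2
```

Running `count_per_pop dataset.ind | wc -l` then gives the number of distinct population labels.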

The "dataset.geno" file is in binary format and therefore cannot be viewed as text, but it contains the genotype of each individual at each of the 1.24 million positions.
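As a rough sanity check, the expected size of the genotype file can be estimated from the counts above. This sketch assumes the packed binary layout commonly used with EIGENSTRAT data (2 bits per genotype, each SNP record padded to whole bytes); that layout is an assumption, not something the dataset documentation here confirms, and the SNP count is the rounded figure from the text.

```shell
# Rough size estimate for a packed genotype file (assumed layout, see above).
n_ind=198        # individuals, as stated in the text
n_snp=1240000    # approximate number of SNPs, as stated in the text

# Four genotypes fit into one byte; round up to whole bytes per SNP record.
bytes_per_snp=$(( (n_ind + 3) / 4 ))
est_bytes=$(( bytes_per_snp * n_snp ))
echo "expected size: about $(( est_bytes / 1000000 )) MB"

# Compare with the actual file, if it is present in the current directory.
if [ -f dataset.geno ]; then
    ls -lh dataset.geno
fi
```

If the real file size is far from this estimate, the file is probably stored in a different layout.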

The dataset includes 74 modern populations from across the world and 28 ancient individuals, including some archaic hominins:

```
    Altai_Neanderthal.DG
    Chagyrskaya_Neandert.SG
    Denisova.DG
    etc
```

Files in this format are often analysed with ADMIXTOOLS. To make life easier, we will use the admixtools package in R, which provides the same functionality as the usual command-line tools, and more.

You may use RStudio or start R from the command line, and then load the package:

```
# in the terminal: start R
R --vanilla
# inside R: load the package
library(admixtools)
```


## f4-statistics 

f4-statistics can be used to investigate allele sharing between populations, a classical marker of admixture. They are very similar to D-statistics.


MK: picture!!

For example, we can test whether modern non-Africans (X) share more alleles with Neanderthals than Africans do:

```
f4(Yoruba, X; Neanderthal, Chimpanzee)
f4 > 0      Yoruba shares more alleles with Neanderthal than X does
f4 ~ 0      Yoruba and X share equally many alleles with Neanderthal
f4 < 0      X shares more alleles with Neanderthal than Yoruba does
```

Or we can test whether a modern population (X) shares more alleles with Neanderthals than with Denisovans:

```
f4(Yoruba, X; Neanderthal, Denisovan)
f4 > 0      X closer to Denisovan than to Neanderthal
f4 < 0      X closer to Neanderthal than to Denisovan
```
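The signs in these examples follow from the standard definition of the statistic in terms of allele frequencies. As a sketch, write $a$, $b$, $c$, $d$ for the frequency of one allele at a site in populations $A$, $B$, $C$, $D$ (this notation is introduced here only for illustration); the statistic is the average over all sites of

$$
f_4(A, B; C, D) = \mathbb{E}\big[(a - b)(c - d)\big]
$$

If $B$ tends to carry the alleles that are frequent in $C$, then $(a - b)$ and $(c - d)$ tend to have opposite signs and the average is negative; if $B$ instead tracks $D$, the average is positive. Swapping $A$ with $B$, or $C$ with $D$, flips the sign.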

Now, the first step is to calculate the basic statistics. Because the full dataset is too large for the Binder environment, we restrict the calculation to a subset of populations.

There is a command to do this:

```
extract_f2(pref = "dataset",
    outdir = "genos", blgsize = 500000,
    overwrite = TRUE, maxmem = 1000,
    pops = c("Altai_Neanderthal.DG", "Denisova.DG", "Finnish.DG", "Japanese.DG",
             "Mbuti.DG", "Papuan.DG", "Yoruba.DG", "Chimp.REF"))
```

This function calculates pairwise f2-statistics only for the selected populations and writes them to new files. There is not much to see here, though you may inspect the new directory "genos", which contains some binary files.
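Pairwise statistics are enough because an f4-statistic can be written as a combination of f2-statistics. As a sketch, with $a$, $b$, $c$, $d$ denoting the allele frequencies at a site in populations $A$, $B$, $C$, $D$ (notation introduced here for illustration) and $f_2(A, B) = \mathbb{E}\big[(a - b)^2\big]$:

$$
f_4(A, B; C, D) = \mathbb{E}\big[(a - b)(c - d)\big] = \tfrac{1}{2}\Big(f_2(A, D) + f_2(B, C) - f_2(A, C) - f_2(B, D)\Big)
$$

Expanding the four squares on the right-hand side leaves $2(ac - ad - bc + bd) = 2(a - b)(c - d)$, which gives the left-hand side after halving. This identity holds for the expected values; the estimates computed in practice include additional corrections.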

Now we load that into R:

```
f2_blocks = f2_from_precomp("genos")
```

This is a large array holding the pairwise f2-statistics for each genomic block, which are the basic statistics we need. But we want to focus on f4-statistics to see whether there was admixture between some populations. So, we calculate f4-statistics.

First, let's check the f4-stats for Neanderthal introgression:

```
f4_table <- f4(f2_blocks, pop1 = "Yoruba.DG", pop2 = c("Finnish.DG", "Japanese.DG", "Mbuti.DG", "Papuan.DG"), pop3 = "Altai_Neanderthal.DG", pop4 = "Chimp.REF")
f4_table
```
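When reading the resulting table, note that besides the four population labels it reports an estimate (`est`), its standard error (`se`) and a Z-score (`z`). As a sketch of how these relate, with the standard error obtained by a jackknife over the genomic blocks defined earlier:

$$
Z = \frac{\hat{f}_4}{\mathrm{SE}(\hat{f}_4)}
$$

A common convention, not a strict rule, is to treat $|Z| > 3$ as a significant deviation from zero.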

What do we see here?

```
f4_table <- f4(f2_blocks, pop1 = "Yoruba.DG", pop2 = c("Finnish.DG", "Japanese.DG", "Mbuti.DG", "Papuan.DG"), pop3 = "Altai_Neanderthal.DG", pop4 = "Denisova.DG")
f4_table
f4_table <- f4(f2_blocks, pop1 = "Japanese.DG", pop2 = c("Finnish.DG", "Japanese.DG", "Mbuti.DG", "Papuan.DG"), pop3 = "Altai_Neanderthal.DG", pop4 = "Denisova.DG")
f4_table
```

MK figure on history



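Calling `f4(f2_blocks)` without specifying populations returns the statistic for all possible combinations of populations. The resulting table can then be filtered for the comparison of interest: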


```
f4_table[which(f4_table[,2]=="Yoruba.DG" & f4_table[,3]=="Altai_Neanderthal.DG" & f4_table[,4]=="Chimp.REF"),]
```

## admixture graphs