# Nucleotide diversity

## Using REF as ancestral alleles

We want to understand how ancestral allele inference affects the final tree sequence object.
Here we test a tree sequence built by imposing the reference (REF) allele as the
ancestral allele. With this choice, the number of mutations is roughly equal to the
number of variants (whereas with the ancestral alleles estimated by
`est-sfs` we see millions of mutations). Let's start by loading the
*REF-based* tree sequence:

In [None]:
import pandas as pd
import numpy as np
import tskit

In [None]:
ts = tskit.load("30075eb48ae770281697a7bb32e1b6/SMARTER-OA-OAR3-forward-0.4.9.focal.26.trees")
ts

I need to calculate nucleotide diversity *per site*. The only way to do this seems
to be to define windows containing the SNPs and then calculate the nucleotide
diversity with the `tskit.TreeSequence.diversity` function. First, create the windows
around the SNPs:

In [None]:
def create_windows(ts):
    """
    Build the window breakpoints for the diversity function, so that each
    site at position ``p`` gets its own 1 bp window ``[p, p + 1)``
    """
    # create a numpy array with the site positions
    sites = np.array([site.position for site in ts.sites()])

    # duplicate each position and add [0, 1], so each site yields (p, p + 1)
    windows = np.repeat(sites, 2) + np.tile([0, 1], len(sites))

    # prepend 0 as the first breakpoint
    windows = np.insert(windows, 0, 0)

    # append the sequence length as the last breakpoint
    windows = np.append(windows, ts.sequence_length)

    return windows
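
To make the layout concrete, here is a minimal sketch of the same steps on toy values (the two site positions and the sequence length of 10 are hypothetical, not taken from the data). Each site ends up in a 1 bp window, and the odd-indexed windows are the ones sitting on the sites. The final check reflects my understanding that tskit expects strictly increasing breakpoints, so this layout assumes no site at position 0 and no two sites at adjacent positions.

```python
import numpy as np

# hypothetical toy values: two sites on a sequence of length 10
sites = np.array([3.0, 7.0])
sequence_length = 10.0

# same steps as in create_windows, without a tree sequence
windows = np.repeat(sites, 2) + np.tile([0, 1], len(sites))
windows = np.insert(windows, 0, 0)
windows = np.append(windows, sequence_length)

# pair consecutive breakpoints into [left, right) intervals
intervals = list(zip(windows[:-1], windows[1:]))

# the odd-indexed intervals are the 1 bp windows covering the sites
snp_intervals = intervals[1::2]

# breakpoints should be strictly increasing (assumed tskit requirement)
strictly_increasing = bool(np.all(np.diff(windows) > 0))

print(windows)
print(snp_intervals)
print(strictly_increasing)
```

With `n` sites this gives `2n + 1` windows, of which the `n` odd-indexed ones are kept.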

Calculate diversity *per SNP position*: call the `create_windows` function and keep only the
odd-indexed windows, which are the 1 bp windows covering the SNPs:

In [None]:
# [1::2] keeps every second window starting from index 1 (the windows on the SNPs)
ts_diversity = ts.diversity(windows=create_windows(ts))[1::2]
ts_diversity[:10]

Now let's compare this with the nucleotide diversity calculated by VCFtools. Here's the
command line to calculate nucleotide diversity *per site*:

```bash
cd 30075eb48ae770281697a7bb32e1b6
vcftools --gzvcf SMARTER-OA-OAR3-forward-0.4.9.focal.26.vcf.gz --out allsamples_pi --site-pi
```

The `allsamples_pi.sites.pi` output is a *TSV* file with the positions and the nucleotide diversity. Read it with pandas:

In [None]:
vcftools_diversity = pd.read_csv("30075eb48ae770281697a7bb32e1b6/allsamples_pi.sites.pi", sep="\t")
vcftools_diversity.head()

Are these values similar?

In [None]:
np.isclose(ts_diversity, vcftools_diversity["PI"], atol=1e-6).all()

## EST-SFS output as ancestral alleles

Can we calculate nucleotide diversity using the *tree files* generated by the pipeline
using the `est-sfs` output as ancestral alleles?

In [None]:
ts1 = tskit.load("results/tsinfer/SMARTER-OA-OAR3-forward-0.4.9.focal.26.trees")
ts1

Same procedure as before:

In [None]:
# [1::2] keeps every second window starting from index 1 (the windows on the SNPs)
ts1_diversity = ts1.diversity(windows=create_windows(ts1))[1::2]
ts1_diversity[:10]

Are these values similar to the values calculated using VCFtools?

In [None]:
np.isclose(ts1_diversity, vcftools_diversity["PI"], atol=1e-6).all()

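So *nucleotide diversity* is the same in both cases (using the REF allele as the ancestral allele and using the `est-sfs` output as the ancestral allele). This is what we would expect, since per-site diversity depends only on the allele frequencies and not on which allele is labelled as ancestral.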
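
A minimal sketch of why the choice of ancestral allele should not matter here: per-site diversity can be written as the fraction of sample pairs that carry different alleles, and that fraction does not change when the 0/1 labels are swapped. The helper `site_pi` and the toy allele counts below are hypothetical and only illustrate the idea; this is not the tskit or VCFtools implementation.

```python
from itertools import combinations


def site_pi(alleles):
    """Fraction of sample pairs carrying different alleles at one site."""
    pairs = list(combinations(alleles, 2))
    return sum(a != b for a, b in pairs) / len(pairs)


# hypothetical site: 10 sampled chromosomes, 3 carrying "A" and 7 carrying "G"
alleles = ["A"] * 3 + ["G"] * 7

# polarise the same site in the two possible ways (0 = ancestral, 1 = derived)
g_ancestral = [0 if a == "G" else 1 for a in alleles]
a_ancestral = [0 if a == "A" else 1 for a in alleles]

pi_g = site_pi(g_ancestral)
pi_a = site_pi(a_ancestral)

# closed form for a biallelic site with k copies of one allele among n samples
n, k = len(alleles), 3
pi_closed = 2 * k * (n - k) / (n * (n - 1))

print(pi_g, pi_a, pi_closed)
```

The closed form is symmetric in `k` and `n - k`, which is the same statement as saying that swapping ancestral and derived alleles leaves the value unchanged.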