# Nucleotide diversity

## Using REF as ancestral alleles

We want to understand how ancestral allele inference affects the final tree sequence object.
Here we test a tree sequence created by imposing the reference allele as the
ancestral allele. With this choice, the number of mutations is roughly the same as the
number of variants (whereas with ancestral alleles estimated by
`est-sfs` we see millions of mutations). Let's start by loading the data for this
*REF-based* tree sequence:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import tskit

from tskitetude import get_project_dir
from tskitetude.helper import create_windows

In [None]:
ts = tskit.load(str(get_project_dir() / "experiments/smarter-background-sheeps/SMARTER-OA-OAR3-forward-0.4.9.focal.26.trees"))
ts

I need to calculate nucleotide diversity *per site*. The only way to do this seems
to be to define windows containing the SNPs and then calculate the nucleotide
diversity with the `tskit.TreeSequence.diversity` method. I've created a
`create_windows` function in the `helper` module for this:

In [None]:
# the trailing slice keeps every second window, starting from index 1
ts_diversity = ts.diversity(windows=create_windows(ts))[1::2]
ts_diversity[:10]
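The implementation of `create_windows` is not shown here. As a rough sketch of the idea, and assuming the helper surrounds each site with a one-base window (which is what the `[1::2]` slice suggests), the breakpoints could be built as follows. The function name `site_windows` and its arguments are hypothetical and are not part of `tskitetude`; the sketch also assumes integer positions, no site at position 0, and no two sites at adjacent positions, since tskit requires strictly increasing breakpoints that start at 0 and end at the sequence length.

```python
def site_windows(positions, sequence_length):
    """Return breakpoints so that every odd-indexed window is [pos, pos + 1).

    Hypothetical sketch, not the real tskitetude helper.
    """
    breakpoints = [0]
    for pos in positions:
        # one-base window that contains only this site
        breakpoints.extend([pos, pos + 1])
    breakpoints.append(sequence_length)

    # tskit expects strictly increasing breakpoints
    for left, right in zip(breakpoints, breakpoints[1:]):
        if not left < right:
            raise ValueError(f"breakpoints not strictly increasing: {left}, {right}")
    return breakpoints


windows = site_windows([10, 25, 40], sequence_length=100)
print(windows)
```

With this layout, windows at even indices cover the gaps between sites and windows at odd indices cover the sites themselves, so `result[1::2]` returns one value per site. Wrapping the list with `np.asarray` before passing it to tskit keeps the input type unambiguous.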

Now let's compare these values with the nucleotide diversity calculated by vcftools. Here's the
command line to calculate nucleotide diversity *per site*:

```bash
cd experiments/smarter-background-sheeps/
vcftools --gzvcf SMARTER-OA-OAR3-forward-0.4.9.focal.26.vcf.gz --out allsamples_pi --site-pi
```

The `allsamples_pi.sites.pi` is a *TSV* file with the positions and the nucleotide diversity. Read it with pandas:

In [None]:
vcftools_diversity = pd.read_csv(get_project_dir() / "experiments/smarter-background-sheeps/allsamples_pi.sites.pi", sep="\t")
vcftools_diversity.head()

Are these values similar?

In [None]:
np.isclose(ts_diversity, vcftools_diversity["PI"], atol=1e-6).all()
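For intuition about what is being compared, per-site nucleotide diversity is the probability that two distinct chromosomes drawn from the sample carry different alleles at that site. The sketch below computes this from a list of allele calls. The function `site_pi` is illustrative only: it is not taken from vcftools or tskit, and it assumes there are no missing genotypes.

```python
from collections import Counter


def site_pi(alleles):
    """Mean pairwise difference at a single site (illustrative sketch)."""
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two chromosomes")
    counts = Counter(alleles)
    # ordered pairs of chromosomes that carry different alleles
    mismatches = sum(c * (n - c) for c in counts.values())
    # divide by the number of ordered pairs of distinct chromosomes
    return mismatches / (n * (n - 1))


# four chromosomes, one of them carrying the alternative allele
print(site_pi(["A", "A", "A", "G"]))  # 0.5
```

Only the allele counts enter the calculation, not which allele is labelled as ancestral, so this quantity should not depend on how the ancestral state was chosen.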

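Calculate diversity using *branch* mode, which measures the average branch length separating pairs of samples in the trees rather than counting allelic differences: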

In [None]:
# the trailing slice keeps every second window, starting from index 1
ts_diversity_branch = ts.diversity(mode='branch', windows=create_windows(ts))[1::2]
ts_diversity_branch[:10]

Plot the two tskit diversities against the vcftools output:

In [None]:
plt.scatter(ts_diversity, vcftools_diversity["PI"])

In [None]:
plt.scatter(ts_diversity_branch, vcftools_diversity["PI"])
plt.xlim(0, 300)

## EST-SFS output as ancestral alleles

Can we calculate nucleotide diversity using the *tree files* generated by the pipeline
using the `est-sfs` output as ancestral alleles?

In [None]:
ts1 = tskit.load(str(get_project_dir() / "results/tsinfer/SMARTER-OA-OAR3-forward-0.4.9.focal.26.trees"))
ts1

Same procedure as before:

In [None]:
# the trailing slice keeps every second window, starting from index 1
ts1_diversity = ts1.diversity(windows=create_windows(ts1))[1::2]
ts1_diversity[:10]

Are these values similar to the values calculated using VCFtools?

In [None]:
np.isclose(ts1_diversity, vcftools_diversity["PI"], atol=1e-6).all()
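A single boolean hides how close the two sets of values are, and `np.isclose` raises a broadcasting error if the arrays differ in length. As a minimal sketch, a small helper can report the size of the largest discrepancy and the number of sites outside the tolerance. The name `summarize_agreement` is hypothetical, and the sketch assumes both arrays are already ordered by site position.

```python
import numpy as np


def summarize_agreement(a, b, atol=1e-6):
    """Summarize how well two per-site arrays agree (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    close = np.isclose(a, b, atol=atol)
    return {
        "n_sites": int(a.size),
        "n_mismatch": int((~close).sum()),
        "max_abs_diff": float(np.abs(a - b).max()),
    }


# toy values, not taken from the sheep dataset
print(summarize_agreement([0.1, 0.2, 0.3], [0.1, 0.2, 0.4]))
```

In the notebook this could be called as `summarize_agreement(ts1_diversity, vcftools_diversity["PI"])`.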

So *nucleotide diversity* is the same in both cases (using REF as the ancestral allele and using the `est-sfs` output as the ancestral allele).