# Reading data from chromosome 26
* [One-way statistics](#one-way-statistics)
* [Multi-way statistics](#multi-way-statistics)

In [None]:
import tskit
import tsinfer
import numpy as np
import pandas as pd

from tskitetude import get_project_dir

Loading data from one chromosome:

In [None]:
ts = tskit.load(get_project_dir() / "results/tsinfer/SMARTER-OA-OAR3-forward-0.4.9.focal.26.trees")
ts

In [None]:
samples = tsinfer.load(get_project_dir() / "results/tsinfer/SMARTER-OA-OAR3-forward-0.4.9.focal.26.samples")
print(samples.info)

<a id='one-way-statistics'></a>
## One-way statistics
We refer to statistics that are defined with respect to a single set of samples as “one-way”. An example of such a statistic is diversity, which is computed using the [TreeSequence.diversity()](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.diversity) method:

In [None]:
pi = ts.diversity()
print(f"Average diversity per unit sequence length = {pi:.3G}")
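
As a rough illustration of what this number measures, site diversity can be thought of as the mean number of differences between all pairs of sampled haplotypes, divided by the sequence length. The sketch below is a toy, pure-Python version on made-up haplotypes. The names `mean_pairwise_diversity` and `toy_haplotypes` are hypothetical, and this is not how tskit computes the statistic internally:

```python
from itertools import combinations


def mean_pairwise_diversity(haplotypes, sequence_length):
    """Mean number of differences per pair of haplotypes, per unit of length."""
    pairs = list(combinations(haplotypes, 2))
    # count the sites at which the two haplotypes of each pair differ
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs) / sequence_length


# hypothetical toy data: 4 haplotypes typed at 5 sites on a sequence of length 10
toy_haplotypes = ["00110", "00100", "10100", "00111"]
toy_pi = mean_pairwise_diversity(toy_haplotypes, sequence_length=10)
print(f"Toy diversity per unit sequence length = {toy_pi:.3G}")
```

Dividing by the sequence length rather than by the number of sites is what makes the value a per-unit-length average.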


`TreeSequence.diversity()` computes the mean genetic diversity (also known as `pi`) in each of the sets of nodes given in `sample_sets`. The statistic is also known as *sample heterozygosity*; a common citation for the definition is [Nei and Li (1979)](https://doi.org/10.1073/pnas.76.10.5269) (equation 22), so it is sometimes called “Nei’s pi” (but also sometimes “Tajima’s pi”). Called without arguments, it averages diversity across the whole sequence and returns a single number. We’ll usually want to compute statistics in [windows](https://tskit.dev/tskit/docs/stable/stats.html#sec-stats-windows) along the genome, and we use the `windows` argument to do this:

In [None]:
print("Sequence length = ", ts.sequence_length)
# windows = np.linspace(0, ts.sequence_length, num=int(ts.sequence_length / 1_000_000) + 1)
# the windows must start at 0 and end at ts.sequence_length
windows = np.append(np.arange(0, ts.sequence_length, 5_000_000), ts.sequence_length)
# convert the breakpoints to integers
windows = windows.astype(int)
pi = ts.diversity(windows=windows)
df = pd.DataFrame({"windows": windows[1:], "pi": pi})
df["pi"] = df["pi"].map(lambda x: f"{x:.3G}")
df
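
The breakpoint logic used above can be checked on small numbers. This is a minimal sketch in plain Python, assuming an integer sequence length. The helper name `make_windows` is hypothetical and only mirrors the `np.arange` plus `np.append` construction:

```python
def make_windows(sequence_length, step):
    """Breakpoints 0, step, 2 * step, ... closed by sequence_length."""
    breakpoints = list(range(0, int(sequence_length), step))
    # the last breakpoint must be the sequence length itself
    breakpoints.append(int(sequence_length))
    return breakpoints


toy_windows = make_windows(sequence_length=12, step=5)
print(toy_windows)  # [0, 5, 10, 12]
# n + 1 breakpoints delimit n windows; the last window may be shorter
print(len(toy_windows) - 1, "windows")
```

Since the result has one value per window, labelling the rows with `windows[1:]` means that each row is identified by the right edge of its window.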

Suppose we wanted to compute diversity within a specific subset of samples. We can do this using the `sample_sets` argument:

In [None]:
A = ts.samples()[:100]
d = ts.diversity(sample_sets=A)
print(d)

Here, we’ve computed the average diversity within the first hundred samples across the whole genome. As we’ve not specified any windows, this is again a single value. We can also compute diversity in multiple sample sets at the same time by providing a list of sample sets as an argument:

In [None]:
A = ts.samples()[:100]
B = ts.samples()[100:200]
C = ts.samples()[200:300]
d = ts.diversity(sample_sets=[A, B, C])
print(d)

OK, this was done by following the tutorial and selecting samples by index. But can I select my data by *breed*? This information does not seem to be stored in the tree sequence (`ts`) object itself, but in the *sample* data I used to generate it. Let's find the samples for each breed. Remember that my data has 11477 samples, which stand for the pairs of chromosomes of my 5739 individuals:

In [None]:
print(f"I have {ts.samples().size} samples")
print(f"which stand for {sum(1 for _ in samples.individuals())} individuals")

In [None]:
def get_sample_indexes(samples: tsinfer.SampleData, breed: str):
    # return np.where(samples.individuals_metadata["breed"] == breed)[0]
    # get breed index from samples.populations_metadata
    breed_idx = next((index for index, d in enumerate(samples.populations_metadata) if d['breed'] == breed), None)

    # get the ids of the individuals of that breed (a set makes the lookup below fast)
    individuals = {i.id for i in samples.individuals() if i.population == breed_idx}

    # get the ids of the samples belonging to those individuals
    # (a new name avoids overwriting the `samples` argument)
    sample_ids = [s.id for s in samples.samples() if s.individual in individuals]

    return sample_ids
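
The function above walks from population to individuals to samples. The same chain can be sketched on toy records with plain Python lists. Everything here (`populations`, `individual_population`, `sample_individual`, `toy_sample_ids`) is hypothetical stand-in data, not the tsinfer API:

```python
# hypothetical toy records standing in for the tsinfer sample data
populations = [{"breed": "TEX"}, {"breed": "MER"}]
individual_population = [0, 1, 0]       # individual id -> population id
sample_individual = [0, 0, 1, 1, 2, 2]  # sample id -> individual id (two per individual)


def toy_sample_ids(breed):
    # population index of the requested breed
    breed_idx = next((i for i, p in enumerate(populations) if p["breed"] == breed), None)
    if breed_idx is None:
        raise ValueError(f"unknown breed: {breed}")
    # individuals assigned to that population
    individuals = {i for i, pop in enumerate(individual_population) if pop == breed_idx}
    # samples (chromosome copies) carried by those individuals
    return [s for s, ind in enumerate(sample_individual) if ind in individuals]


toy_tex = toy_sample_ids("TEX")
toy_mer = toy_sample_ids("MER")
print(toy_tex, toy_mer)  # [0, 1, 4, 5] [2, 3]
```

Raising an error for an unknown breed is a design choice of this sketch. It avoids silently returning an empty list when the breed code is mistyped.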

Get indexes for *MER* and *TEX* and calculate diversity:

In [None]:
TEX = get_sample_indexes(samples, "TEX")
MER = get_sample_indexes(samples, "MER")

In [None]:
pi = ts.diversity(sample_sets=[TEX, MER])
print(pi)

The same as before, but using windows:

In [None]:
windows = np.append(np.arange(0, ts.sequence_length, 5_000_000), ts.sequence_length)
windows = windows.astype(int)
pi = ts.diversity(sample_sets=[TEX, MER], windows=windows)
# the columns of pi follow the order of sample_sets: TEX first, then MER
df = pd.DataFrame({"windows": windows[1:], "TEX_pi": pi[:, 0], "MER_pi": pi[:, 1]})
df["TEX_pi"] = df["TEX_pi"].map(lambda x: f"{x:.3G}")
df["MER_pi"] = df["MER_pi"].map(lambda x: f"{x:.3G}")
df

<a id='multi-way-statistics'></a>
## Multi-way statistics

Many population genetic statistics compare multiple sets of samples to each other. For example, the [TreeSequence.divergence()](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.divergence) method computes the divergence between two subsets of samples:

In [None]:
d = ts.divergence([TEX, MER])
print(f"Divergence between TEX and MER: {d:.3G}")

The divergence between two sets of samples is a single number, and we again return a single floating point value as the result. We can also compute this in windows along the genome, as before:

In [None]:
d = ts.divergence([TEX, MER], windows=windows)
print(d)

A powerful feature of tskit’s stats API is that we can compute the divergences between multiple sets of samples simultaneously using the `indexes` argument:

In [None]:
CRL = get_sample_indexes(samples, "CRL")

In [None]:
d = ts.divergence([TEX, MER, CRL], indexes=[(0, 1), (0, 2)])
print(d)

The indexes argument is used to specify which pairs of sets we are interested in. In this example we’ve computed two different divergence values and the output is therefore a `numpy` array of length 2.
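
To see how `indexes` selects pairs, here is a toy sketch in plain Python. It treats divergence as the mean number of differences between haplotypes drawn from two different sets, per unit of length. The names `pairwise_divergences` and `toy_sets` are hypothetical, the data are made up, and this is not tskit's implementation:

```python
from itertools import product


def hamming(a, b):
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_divergences(sample_sets, indexes, sequence_length):
    """One divergence value for each (i, j) pair listed in indexes."""
    results = []
    for i, j in indexes:
        # compare every haplotype of set i with every haplotype of set j
        pairs = list(product(sample_sets[i], sample_sets[j]))
        total = sum(hamming(a, b) for a, b in pairs)
        results.append(total / len(pairs) / sequence_length)
    return results


# hypothetical toy sample sets: positions 0, 1 and 2 in the list
toy_sets = [["00110", "00100"], ["10100", "00111"], ["11111"]]
toy_result = pairwise_divergences(toy_sets, indexes=[(0, 1), (0, 2)], sequence_length=10)
print(toy_result)
```

The output has one entry per pair, in the order the pairs are given: set 0 against set 1 first, then set 0 against set 2.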

As before, we can combine computing multiple statistics in multiple windows to return a 2D `numpy` array:

In [None]:
d = ts.divergence([TEX, MER, CRL], indexes=[(0, 1), (0, 2)], windows=windows)
# indexes (0, 1) and (0, 2) compare TEX with MER and TEX with CRL
df = pd.DataFrame({"windows": windows[1:], "TEXvsMER": d[:, 0], "TEXvsCRL": d[:, 1]})
for column in df.columns[1:]:
    df[column] = df[column].map(lambda x: f"{x:.3G}")
df

Each row again corresponds to a window, which contains the average divergence values between the chosen sets.