# Analyze the simulated data

## Simulation on 2K SNPs

In [None]:
import numpy as np
import pandas as pd

import tskit

from tskitetude import get_data_dir

Analyze the `data/sheepTSsimMilano/ts300I2k.vcf.gz` file generated using msprime. First get the list of all sample names, then call `create_tstree` with the following parameters:

```bash
bcftools query -l ts300I2k.vcf.gz > ts300I2k.sample_names.txt
create_tstree --vcf ts300I2k.vcf.gz --focal ts300I2k.sample_names.txt --ancestral_as_reference \
    --output_samples ts300I2k.inferred.samples --output_trees ts300I2k.inferred.trees --num_threads 16 \
    --mutation_rate 5.87e-9 --ne 34500
```

The values `5.87e-9` and `34500` are the mutation rate and the effective population size, respectively. The `--ancestral_as_reference` flag treats the ancestral allele as the reference allele. The `--num_threads 16` flag sets the number of threads to use. The `--output_samples` flag sets the output file for the inferred samples, and the `--output_trees` flag sets the output file for the inferred trees.
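As a rough baseline for the diversity values computed in this notebook, the two parameters can be combined into the classical neutral expectation $\theta = 4 N_e \mu$. This is only a sketch: it assumes a diploid, constant-size, neutrally evolving population and a mutation rate expressed per site per generation, neither of which is stated for the simulation itself.

```python
# Parameters passed to create_tstree above.
mutation_rate = 5.87e-9  # assumed to be per site, per generation
ne = 34500               # effective population size (assumed diploid)

# Neutral expectation theta = 4 * Ne * mu for a constant-size diploid
# population (an assumption, the simulated demography may differ).
expected_pi = 4 * ne * mutation_rate
print(f"Expected diversity: {expected_pi:.4e}")
```

Values far from this baseline are not necessarily wrong; they may simply reflect the simulated demography or the SNP subsampling.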

In [None]:
mutation_rate = 5.87e-9
print("Mutation rate: ", mutation_rate)

In [None]:
ts300I2k = tskit.load(get_data_dir() / "sheepTSsimMilano/ts300I2k.inferred.trees")
ts300I2k

In [None]:
ts300I2k.diversity()

In [None]:
ts300I2k.diversity(mode="branch") * mutation_rate
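The product above works because branch-mode diversity is the mean branch length separating two samples (in time units), so multiplying it by the mutation rate gives the expected number of pairwise differences per site. The following sketch illustrates the idea in plain Python on a hypothetical four-sample tree; the node IDs and node times are made up for the example and do not come from the simulations.

```python
from itertools import combinations

# Hypothetical 4-sample tree: parent pointers and node times (generations).
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1000.0, 5: 1000.0, 6: 2500.0}
samples = [0, 1, 2, 3]


def ancestors(node):
    """Return the path from node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def mrca_time(a, b):
    """Time of the most recent common ancestor of samples a and b."""
    on_path = set(ancestors(a))
    return next(time[n] for n in ancestors(b) if n in on_path)


def branch_diversity(sample_ids):
    """Mean branch length between two distinct samples (all at time 0)."""
    pairs = list(combinations(sample_ids, 2))
    return sum(2 * mrca_time(a, b) for a, b in pairs) / len(pairs)


mutation_rate = 5.87e-9
branch_pi = branch_diversity(samples)
expected_site_pi = branch_pi * mutation_rate
print(branch_pi, expected_site_pi)
```

On real data the two estimates will generally differ, since site-mode diversity depends on the mutations actually observed while the branch-mode value depends on the inferred node times.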

## Simulation on 25K SNPs

```bash
bcftools query -l ts300I25k.vcf.gz > ts300I25k.sample_names.txt
create_tstree --vcf ts300I25k.vcf.gz --focal ts300I25k.sample_names.txt --ancestral_as_reference \
    --output_samples ts300I25k.inferred.samples --output_trees ts300I25k.inferred.trees --num_threads 16 \
    --mutation_rate 5.87e-9 --ne 34500
```

In [None]:
ts300I25k = tskit.load(get_data_dir() / "sheepTSsimMilano/ts300I25k.inferred.trees")
ts300I25k

In [None]:
ts300I25k.diversity()

In [None]:
ts300I25k.diversity(mode="branch") * mutation_rate

## Simulation on entire dataset

```bash
bcftools query -l tsm100M300I.vcf.gz > tsm100M300I.sample_names.txt
create_tstree --vcf tsm100M300I.vcf.gz --focal tsm100M300I.sample_names.txt \
    --ancestral_as_reference --output_samples tsm100M300I.inferred.samples \
    --output_trees tsm100M300I.inferred.trees --num_threads 16 \
    --mutation_rate 5.87e-9 --ne 34500
```

In [None]:
tsm100M300I = tskit.load(get_data_dir() / "sheepTSsimMilano/tsm100M300I.inferred.trees")
tsm100M300I

In [None]:
tsm100M300I.diversity()

In [None]:
tsm100M300I.diversity(mode="branch") * mutation_rate

## Calculate FST
Define the sample sets (arrays of sample node IDs), one per group:

In [None]:
indList = [np.arange(10)] + [np.arange(600*i+10, 600*(i+1)+10) for i in range(8)]
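To make the layout of these sample sets explicit, the sketch below rebuilds the same index blocks with plain Python ranges and checks their sizes and boundaries. Whether each block corresponds to one simulated population is an assumption that should be checked against the simulation setup; the name `sample_sets` is used here only for illustration.

```python
# Same layout as indList, written with plain ranges for illustration.
sample_sets = [list(range(10))] + [
    list(range(600 * i + 10, 600 * (i + 1) + 10)) for i in range(8)
]

sizes = [len(s) for s in sample_sets]
bounds = [(s[0], s[-1]) for s in sample_sets]
all_ids = [n for s in sample_sets for n in s]

print(sizes)    # one block of 10 IDs, then eight blocks of 600 IDs
print(bounds)   # first and last ID of each block
print(len(all_ids) == len(set(all_ids)))  # True when blocks do not overlap
```

The blocks cover IDs 0 to 4809 without gaps, so each tree sequence needs at least 4810 samples for every set to be valid.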

In [None]:
[i.Fst([indList[0], indList[1]], mode="branch") for i in [ts300I2k, ts300I25k, tsm100M300I]]


And then with `site` mode:

In [None]:
[i.Fst([indList[0], indList[1]], mode="site") for i in [ts300I2k, ts300I25k, tsm100M300I]]
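For reference, the tskit documentation defines this statistic from the diversity within each sample set, $d(X)$ and $d(Y)$, and the divergence between them, $d(X, Y)$, as $F_{ST} = 1 - 2\,(d(X) + d(Y)) / (d(X) + 2\,d(X, Y) + d(Y))$. The helper below is a minimal sketch of that formula; the function name and the input numbers are illustrative only and are not taken from the simulations.

```python
def fst_from_diversities(d_x, d_y, d_xy):
    """Fst = 1 - 2 * (d(X) + d(Y)) / (d(X) + 2 * d(X, Y) + d(Y))."""
    return 1 - 2 * (d_x + d_y) / (d_x + 2 * d_xy + d_y)


# Divergence equal to within-set diversity: no differentiation.
print(fst_from_diversities(1.0, 1.0, 1.0))  # → 0.0

# Divergence three times the within-set diversity.
print(fst_from_diversities(1.0, 1.0, 3.0))  # → 0.5
```

Because the same formula is applied in both modes, differences between the `branch` and `site` values come from how diversity and divergence are measured, not from the definition of the statistic.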


In [None]:
tmp = {
    "simulation": ["ts300I2k", "ts300I25k", "tsm100M300I"],
    "diversity": [ts300I2k.diversity(), ts300I25k.diversity(), tsm100M300I.diversity()],
    "diversity_branch": [
        ts300I2k.diversity(mode="branch") * mutation_rate,
        ts300I25k.diversity(mode="branch") * mutation_rate,
        tsm100M300I.diversity(mode="branch") * mutation_rate
    ],
    "FST_branch": [i.Fst([indList[0], indList[1]], mode="branch") for i in [ts300I2k, ts300I25k, tsm100M300I]],
    "FST_site": [i.Fst([indList[0], indList[1]], mode="site") for i in [ts300I2k, ts300I25k, tsm100M300I]]
}
pd.DataFrame(tmp)

In [None]:
tree = ts300I2k.at_index(1)
tree.draw()

In [None]:
tree.root

Get the age of the tree, i.e. the time of its root node:

In [None]:
tree.time(tree.root)

Iterate over the trees and get the time of the first root of each one, skipping the first and the last trees:

In [None]:
# ts.trees() reuses a single Tree object while iterating, so storing its
# output with list() does not give independent trees; aslist() does.
skipped_trees = ts300I2k.aslist()[1:-1]
[tree.time(tree.roots[0]) for tree in skipped_trees]

In [None]:
skipped_trees

In [None]:
[(i, len(tree.roots)) for i, tree in enumerate(ts300I2k.trees()) if len(tree.roots) > 1]