## Tree space

This notebook walks through example calculations in BHV tree space. It is structured as follows:
1. Imports
2. BHV Space
3. Phylogenetic-tree dataset
4. Applications
    * Distance between trees
    * Mean
    * PCA
    * Regression 


### 1. Imports

In [1]:
import numpy as np

import geomstats.backend as gs
from geomstats.learning.frechet_mean import FrechetMean
from geomstats.geometry.stratified.bhv_space import generate_random_tree

gs.random.seed(666)

### 2. BHV Space

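BHV (Billera-Holmes-Vogtmann) space is the space of labelled trees with edge lengths. It was designed with phylogenetic trees in mind. BHV space is a stratified space: each tree topology contributes a Euclidean orthant whose coordinates are the edge lengths, and orthants are glued along lower-dimensional faces where some edge lengths are zero. As such, a geodesic is a straight segment when both trees share a topology and, in general, a piecewise-linear path through several orthants.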


For more details on this space, please see the original paper -- here --
For a simple visualisation to gain intuition about the non-Euclidean nature of this space, see this link -- here -- 
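
To make the gluing concrete, consider the smallest non-trivial case. With four labels, an unrooted tree has at most one interior edge and there are three possible splits, so the interior-edge part of BHV space is three half-lines joined at the origin (the star tree). The sketch below is a toy model in plain Python, not geomstats code: the name `bhv4_distance` and the `(split_index, length)` encoding are assumptions made for illustration.

```python
def bhv4_distance(tree_a, tree_b):
    """Geodesic distance between two 4-label trees in a toy model.

    Each tree is a pair (split_index, length). split_index in {0, 1, 2}
    selects one of the three possible interior splits (hypothetical encoding),
    and length >= 0 is the interior edge length (0 is the star tree).
    """
    split_a, length_a = tree_a
    split_b, length_b = tree_b
    if split_a == split_b:
        # Same half-line: the geodesic is a straight segment.
        return abs(length_a - length_b)
    # Different half-lines: the geodesic must pass through the origin.
    return length_a + length_b


print(bhv4_distance((0, 1.0), (0, 3.0)))  # 2.0
print(bhv4_distance((0, 1.0), (1, 3.0)))  # 4.0
```

Two trees with the same edge lengths are therefore further apart when their topologies differ, which is the non-Euclidean behaviour the rest of the notebook relies on.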

In [2]:
from geomstats.geometry.stratified.trees import (
    ForestTopology,
    Split,
    delete_splits,
    generate_splits,
)

from geomstats.geometry.stratified.bhv_space import Tree, TreeSpace


def generate_random_phylogenetic_tree(n_labels, p_keep=0.9, btol=1e-8):
    """Generate a random instance of a phylogenetic ``Tree``.

    Phylogenetic trees have two additional constraints:
        - They only have interior edges (i.e., no singleton splits).
        - There is a root node (ignored for now).

    Parameters
    ----------
    n_labels : int
        Number of labels of the tree.
    p_keep : float between 0 and 1
        Probability that a sampled edge is kept rather than randomly deleted.
        This is not exactly the probability: some edges cannot be deleted,
        since doing so might violate the requirement that any two labels are
        separated by a split.
        Defaults to 0.9.
    btol : float
        Tolerance for the boundary of the edge lengths. Defaults to 1e-08.

    Returns
    -------
    tree : Tree
        Random tree containing interior splits only.
    """
    labels = list(range(n_labels))

    initial_splits = generate_splits(labels)
    temp_splits = delete_splits(initial_splits, labels, p_keep, check=False)

    # Keep interior edges only: drop splits that separate a single label.
    splits = [
        split
        for split in temp_splits
        if len(split.part1) != 1 and len(split.part2) != 1
    ]

    # Inverse-CDF sampling: -log(1 - U) with U uniform on (0, 1) is Exp(1).
    x = gs.random.uniform(size=(len(splits),), low=0, high=1)
    x = gs.minimum(gs.maximum(btol, x), 1 - btol)
    lengths = gs.maximum(btol, gs.abs(gs.log(1 - x)))

    return Tree(splits, lengths)


tree = generate_random_phylogenetic_tree(5, p_keep=1)
print(tree)
trees = np.array([generate_random_phylogenetic_tree(5, p_keep=1) for _ in range(10)])

(({0, 1, 4}|{2, 3}, {0, 4}|{1, 2, 3});[1.20543094 1.85909642])
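
The printed tree has two splits, which is the maximum for five labels: an unrooted tree with n labels has at most n - 3 interior edges. Two splits can appear in the same tree only if they are compatible, meaning at least one of the four pairwise intersections of their parts is empty. A minimal check in plain Python follows. It writes splits as pairs of sets rather than geomstats `Split` objects, and the helper names `is_interior` and `are_compatible` are illustrative, not part of the library.

```python
def is_interior(split):
    """Return True if neither part of the split is a single label."""
    part1, part2 = split
    return len(part1) > 1 and len(part2) > 1


def are_compatible(split_a, split_b):
    """Return True if the two splits can belong to the same tree."""
    # Compatible iff at least one of the four pairwise intersections is empty.
    return any(not (p & q) for p in split_a for q in split_b)


s1 = ({0, 1, 4}, {2, 3})
s2 = ({0, 4}, {1, 2, 3})
s3 = ({0, 2, 3}, {1, 4})

print(are_compatible(s1, s2))  # True: {2, 3} and {0, 4} are disjoint
print(are_compatible(s2, s3))  # False: all four intersections are non-empty
print(is_interior(({0}, {1, 2, 3, 4})))  # False: singleton split
```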


In [5]:
bhv_space = TreeSpace(n_labels=5)

fm = FrechetMean(bhv_space, max_iter=500, epsilon=1e-3)
fm.fit(trees, use_frechet_convergence=True, verbose=True)

fm.estimate_

1.1796135794898073 inf [((({0, 1, 4}, {2, 3}), ({0, 4}, {1, 2, 3})), (np.float64(0.5116596693167555), np.float64(0.2960100805290133)))]
1.1208284871469338 0.2997571352769839 [((({0, 2, 3}, {1, 4}), ({0, 1, 4}, {2, 3})), (np.float64(0.8111016129025034), np.float64(0.33684492203471444)))]
0.533423562775134 0.3736260503161277 [((({0, 2, 3}, {1, 4}), ({0, 1, 4}, {2, 3})), (np.float64(0.28346106832041773), np.float64(0.258511321090819)))]
0.3200541376650804 0.1464871738547806 [((({0, 1, 4}, {2, 3}), ({0, 4}, {1, 2, 3})), (np.float64(0.21151116052448174), np.float64(0.03312325842883364)))]
0.23995386027666898 0.10676451023296218 [((({0, 1, 2}, {3, 4}), ({0, 1}, {2, 3, 4})), (np.float64(0.02487280236563147), np.float64(0.00811358371554939)))]
0.07814290743649369 0.006951031267014018 [((({0, 1, 2}, {3, 4}), ({0, 3, 4}, {1, 2})), (np.float64(0.014170347189937276), np.float64(0.06929294983195342)))]
0.15323448067034057 0.06701840592150665 [((({0, 1, 2}, {3, 4}), ({0, 3, 4}, {1, 2})), (np.float64

[((({0, 4}, {1, 2, 3}), ({0, 3, 4}, {1, 2})), (np.float64(0.014815935079507972), np.float64(0.03515239779883555)))]
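
The estimate above has short edges compared with most of the sampled trees. Fréchet means in stratified spaces can be pulled towards lower-dimensional strata, and a toy model shows the effect in closed form. The sketch below uses three half-lines glued at the origin (the four-label case) rather than the five-label space of this notebook; `spider_frechet_mean` and the `(leg, length)` encoding are assumptions made for illustration, not geomstats API.

```python
def spider_frechet_mean(points):
    """Fréchet mean of points on three half-lines glued at the origin.

    Each point is (leg, length) with leg in {0, 1, 2} and length >= 0.
    Returns (leg, length), or (None, 0.0) when the mean is the origin.
    """
    n = len(points)
    total = sum(length for _, length in points)
    for leg in range(3):
        on_leg = sum(length for k, length in points if k == leg)
        # Restricted to this leg at position t > 0, points on other legs lie
        # at distance t + length, so the sum of squared distances is
        # minimised at t = (on_leg - off_leg) / n.
        candidate = (on_leg - (total - on_leg)) / n
        if candidate > 0:
            return leg, candidate
    # No leg has a positive minimiser: the mean sits at the origin.
    return None, 0.0


print(spider_frechet_mean([(0, 1.0), (1, 1.0), (2, 1.0)]))  # (None, 0.0)
print(spider_frechet_mean([(0, 1.2), (1, 1.0), (2, 1.0)]))  # (None, 0.0)
```

In the second call one point has moved, yet the mean stays at the origin. A mean on leg 0 only appears once the points on that leg outweigh the other two legs combined.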

In [68]:
trees

[((({0, 1, 3}, {2, 4}), ({0, 3}, {1, 2, 4})), (np.float64(1.301431500448973), np.float64(1.1286004811317047))),
 ((({0, 2, 3}, {1, 4}), ({0, 1, 4}, {2, 3})), (np.float64(0.012784572529367867), np.float64(3.025324999765537))),
 ((({0, 2, 3}, {1, 4}), ({0, 2}, {1, 3, 4})), (np.float64(0.5337321511169704), np.float64(0.05004438386422791))),
 ((({0, 2, 3}, {1, 4}), ({0, 2}, {1, 3, 4})), (np.float64(0.10528114249819379), np.float64(0.7094113393909464))),
 ((({0, 2, 3}, {1, 4}), ({0, 2}, {1, 3, 4})), (np.float64(1.3631802378411033), np.float64(0.2234530233638419))),
 ((({0, 1, 3}, {2, 4}), ({0, 2, 4}, {1, 3})), (np.float64(0.21429779443746455), np.float64(1.2067926168161975))),
 ((({0, 1, 2}, {3, 4}), ({0, 3, 4}, {1, 2})), (np.float64(0.3470473042808842), np.float64(1.4893440122988808))),
 ((({0, 1, 4}, {2, 3}), ({0, 4}, {1, 2, 3})), (np.float64(0.005121933498856595), np.float64(0.11974982886241947))),
 ((({0, 2, 4}, {1, 3}), ({0, 4}, {1, 2, 3})), (np.float64(0.11760593232598163), np.float64

In [55]:
from geomstats.learning.frechet_mean import FrechetMean

fm = FrechetMean(bhv_space)

fm.fit(trees)
# doesn't work because we don't have a norm defined for trees

In [54]:
# BAD: fit raises the TypeError shown below, since Tree objects cannot be cast to float

from geomstats.learning.knn import KNearestNeighborsClassifier

labels = np.array([i % 3 for i in range(10)]).reshape(-1, 1)
print(labels)
print(len(trees))

trees = np.array(trees).reshape(-1, 1)
knn = KNearestNeighborsClassifier(bhv_space, n_neighbors=3)
knn.fit(trees, labels)

[[0]
 [1]
 [2]
 [0]
 [1]
 [2]
 [0]
 [1]
 [2]
 [0]]
10


TypeError: float() argument must be a string or a real number, not 'Tree'
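
The error message indicates that `fit` tries to convert each `Tree` to a float, which an array of tree objects cannot satisfy. One possible workaround, sketched here under assumptions, is to skip array conversion and classify directly from pairwise distances. The code is plain Python: `knn_predict` is a hypothetical helper, and `toy_distance` (three half-lines glued at the origin) stands in for a BHV distance function, which this notebook has not yet introduced.

```python
from collections import Counter


def toy_distance(point_a, point_b):
    """Distance on three half-lines glued at the origin (stand-in metric)."""
    leg_a, length_a = point_a
    leg_b, length_b = point_b
    if leg_a == leg_b:
        return abs(length_a - length_b)
    return length_a + length_b


def knn_predict(query, train_points, train_labels, distance, n_neighbors=3):
    """Majority vote among the n_neighbors training points closest to query."""
    ranked = sorted(
        range(len(train_points)),
        key=lambda i: distance(query, train_points[i]),
    )
    votes = Counter(train_labels[i] for i in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


train_points = [(0, 1.0), (0, 1.5), (0, 2.0), (1, 1.0), (1, 1.5), (1, 2.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict((0, 1.2), train_points, train_labels, toy_distance))  # a
print(knn_predict((1, 0.2), train_points, train_labels, toy_distance))  # b
```

Because the classifier only ever calls `distance`, any object type works as long as a metric between two such objects is available.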