# Generate IBS-based Neighbor-Joining tree

## Sheep samples

Collect the latest SMARTER genotype dataset and unpack it into `data/interim` folder:

```bash
cd data/interim
wget ftp://webserver.ibba.cnr.it/smarter/SHEEP/OAR3/SMARTER-OA-OAR3-top-0.4.10.zip
unzip SMARTER-OA-OAR3-top-0.4.10.zip -d SMARTER-OA-OAR3-top-0.4.10
```

Next, start by generate a IBS matrix of sheep samples using `plink`. Focus on a sample
subset for simplicity. Remember to remove missing data from the whole dataset:

```bash
plink --chr-set 26 no-xy no-mt --allow-no-sex --keep SMARTER-OA-OAR3-top-0.4.10/5_breeds-0-50K.csv \
    --bfile SMARTER-OA-OAR3-top-0.4.10/SMARTER-OA-OAR3-top-0.4.10 \
    --geno 0.1 --distance square gz ibs --out SMARTER-OA-OAR3-top-0.4.10/5_breeds-0-50K
```

Now, let's do the python stuff:

In [1]:
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from skbio import DistanceMatrix
from skbio.tree import nj

from src.features.utils import get_interim_dir

In [2]:
# Load IBS data from PLINK
dtype_dict = {'IID1': str, 'IID2': str, 'DST': float}
ibs_data = pd.read_csv(
    get_interim_dir() / "SMARTER-OA-OAR3-top-0.4.10/5_breeds-0-50K.ibs.genome.gz",
    sep=r'\s+',
    usecols=['IID1', 'IID2', 'DST'],
    dtype=dtype_dict
)

In [3]:
# This transform a pairwise items to a matrix
ibs_pivot = ibs_data.pivot(index='IID1', columns='IID2', values='DST').fillna(0)

# This will transform the IBS values to a distance matrix: 0 means identical, 1 means different
ibs_pivot = 1 - ibs_pivot
np.fill_diagonal(ibs_pivot.values, 0)
upper_triangular_matrix = ibs_pivot.values
individuals = ibs_pivot.index.values

The matrix I have is only the upper triangle of the matrix, so I need to mirror it to get the full matrix. This could be done by adding the transposed matrix to the original matrix. The diagonal should be subtracted to avoid double counting, in this case however is zero, since I called `np.fill_diagonal` with zeros.

In [4]:
distance_matrix = upper_triangular_matrix + upper_triangular_matrix.T - np.diag(upper_triangular_matrix.diagonal())

def is_symmetric(matrix, tol=1e-8):
    return np.allclose(matrix, matrix.T, atol=tol)

is_symmetric(distance_matrix)

True

In [5]:
# Now we have the distance matrix and the corresponding sample IDs
print(f"Distance matrix shape: {distance_matrix.shape}")
print(f"Number of samples: {len(individuals)}")

Distance matrix shape: (107, 107)
Number of samples: 107


In [6]:
# Create skbio DistanceMatrix
dm = DistanceMatrix(distance_matrix, individuals)

In [7]:
# Deleting all variables except `dm`
del distance_matrix
del dtype_dict
del ibs_data
del ibs_pivot
del individuals
del upper_triangular_matrix

In [8]:
# Build the Neighbor-Joining Tree
nj_tree = nj(dm)

In [9]:
# Save the tree to a Newick file
nj_tree.write("5_breeds-0-50K.nwk")

# Show the resulting tree in ASCII format
print(nj_tree.ascii_art())

                              /-BROA-BMN-000002904
                    /--------|
                   |          \-BROA-BMN-000002898
                   |
                   |                    /-BROA-BMN-000002903
                   |                   |
                   |                   |                    /-BROA-BMN-000002894
                   |          /--------|          /--------|
          /--------|         |         |         |         |          /-BROA-BMN-000002910
         |         |         |         |         |          \--------|
         |         |         |          \--------|                    \-BROA-BMN-000002896
         |         |         |                   |
         |         |         |                   |          /-BROA-BMN-000002905
         |         |         |                    \--------|
         |         |         |                              \-BROA-BMN-000002893
         |         |         |
         |          \--------|              

### Step 1: Assign Coordinates to Nodes

First, we need to assign x and y coordinates to each node in the tree.

* X-coordinate: For leaf nodes, assign x-coordinates evenly spaced along the x-axis. For internal nodes, assign x-coordinates as the average of their children's x-coordinates.
* Y-coordinate: Assign y-coordinates based on the depth (level) of the node in the tree.

In [10]:
# Get all the leaf nodes and assign x-coordinates
leaves = list(nj_tree.tips())
n_leaves = len(leaves)
for i, leaf in enumerate(leaves):
    leaf.x = i

# Function to assign x-coordinates to internal nodes
def assign_x(node):
    if node.is_tip():
        return node.x
    else:
        child_xs = [assign_x(child) for child in node.children]
        node.x = sum(child_xs) / len(child_xs)
        return node.x

assign_x(nj_tree)

# Function to assign y-coordinates based on depth
def assign_y(node, depth=0):
    node.y = -depth  # Negative depth so that the root is at y=0
    for child in node.children:
        assign_y(child, depth + 1)

assign_y(nj_tree)

### Step 2: Collect Edges

We need to collect the parent-child relationships to plot the edges of the tree.

In [11]:
edges = []

def collect_edges(node):
    for child in node.children:
        edges.append((node, child))
        collect_edges(child)

collect_edges(nj_tree)


### Step 3: Prepare Data for Plotly

Create lists of x and y coordinates for the edges and nodes.

In [12]:
# Prepare edge coordinates
edge_x = []
edge_y = []
for parent, child in edges:
    edge_x += [parent.x, child.x, None]  # None separates traces
    edge_y += [parent.y, child.y, None]

# Prepare node coordinates and labels
node_x = []
node_y = []
node_text = []

def collect_nodes(node):
    node_x.append(node.x)
    node_y.append(node.y)
    node_text.append(node.name if node.name else '')

    for child in node.children:
        collect_nodes(child)

collect_nodes(nj_tree)

### Step 4: Create Plotly Traces

Now, create Plotly traces for edges and nodes.

In [13]:
# Edge trace
edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    mode='lines',
    line=dict(color='black', width=1),
    hoverinfo='none'
)

# Node trace
node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers+text',
    text=node_text,
    textposition='top center',
    hoverinfo='text',
    marker=dict(color='blue', size=5)
)

### Step 5: Plot the Tree

Set up the layout and create the figure.

In [14]:
layout = go.Layout(
    showlegend=False,
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    hovermode='closest',
    margin=dict(b=20,l=5,r=5,t=40)
)

fig = go.Figure(data=[edge_trace, node_trace], layout=layout)
fig.show()