# Generate IBS-based Neighbor-Joining tree

## Sheep samples

Collect the latest SMARTER genotype dataset and unpack it into the `data/interim` folder:

```bash
cd data/interim
wget ftp://webserver.ibba.cnr.it/smarter/SHEEP/OAR3/SMARTER-OA-OAR3-top-0.4.10.zip
unzip SMARTER-OA-OAR3-top-0.4.10.zip -d SMARTER-OA-OAR3-top-0.4.10
```

Next, generate an IBS (identity-by-state) matrix of the sheep samples using `plink`.
For simplicity, restrict the analysis to a subset of samples with `--keep`. Remember to
filter out missing data: `--geno 0.1` drops variants with a missing call rate above 10%:

```bash
plink --chr-set 26 no-xy no-mt --allow-no-sex --keep SMARTER-OA-OAR3-top-0.4.10/5_breeds-0-50K.csv \
    --bfile SMARTER-OA-OAR3-top-0.4.10/SMARTER-OA-OAR3-top-0.4.10 --geno 0.1 \
    --distance square gz ibs  --out SMARTER-OA-OAR3-top-0.4.10/5_breeds-0-50K
```
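
To make the meaning of the IBS values concrete, here is a minimal conceptual sketch of allele sharing between two samples. It is not `plink`'s implementation: the function name `ibs_similarity`, the 0/1/2 allele-count coding, and the use of `None` for missing calls are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def ibs_similarity(g1: Sequence[Optional[int]], g2: Sequence[Optional[int]]) -> float:
    """Proportion of alleles shared identical-by-state between two samples.

    Genotypes are coded as allele counts (0, 1 or 2); None marks a missing call.
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip markers missing in either sample
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared at this marker
        compared += 2
    if compared == 0:
        raise ValueError("no markers genotyped in both samples")
    return shared / compared


# Toy genotypes (made up for this example)
sample_a = [0, 1, 2, 2, None]
sample_b = [0, 2, 0, 2, 1]
ibs = ibs_similarity(sample_a, sample_b)
print(f"IBS similarity: {ibs:.3f}, distance: {1 - ibs:.3f}")
```

The distance used later in this notebook is simply `1 - IBS`, so identical genotypes give a distance of 0.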

Now load the IBS matrix in Python and convert it into a distance matrix:

In [1]:
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from skbio import DistanceMatrix
from skbio.tree import nj

from src.features.utils import get_interim_dir

In [2]:
ibs_data = pd.read_table(get_interim_dir() / 'SMARTER-OA-OAR3-top-0.4.10/5_breeds-0-50K.mibs.gz', header=None)
sample_names = pd.read_table(get_interim_dir() / 'SMARTER-OA-OAR3-top-0.4.10/5_breeds-0-50K.mibs.id', header=None, names=['breed', "sample"])
distance_matrix = 1 - ibs_data

# assign names to the distance matrix
individuals = sample_names["sample"].tolist()
distance_matrix.index = individuals
distance_matrix.columns = individuals

In [3]:
# Now we have the distance matrix and the corresponding sample IDs
print(f"Distance matrix shape: {distance_matrix.shape}")
print(f"Number of samples: {len(individuals)}")

Distance matrix shape: (108, 108)
Number of samples: 108
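
`skbio`'s `DistanceMatrix` expects a symmetric matrix with a zero diagonal, so it can be worth checking the data before constructing it. The sketch below is one possible check using only NumPy; the helper name `check_distance_matrix` and the tolerance are assumptions, and the toy values are made up.

```python
import numpy as np


def check_distance_matrix(values, atol=1e-8):
    """Return a symmetric, zero-diagonal copy of `values`, or raise if it is far from one."""
    arr = np.asarray(values, dtype=float)
    if arr.ndim != 2 or arr.shape[0] != arr.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {arr.shape}")
    if np.isnan(arr).any():
        raise ValueError("matrix contains missing values")
    if not np.allclose(arr, arr.T, atol=atol):
        raise ValueError("matrix is not symmetric")
    if not np.allclose(np.diag(arr), 0.0, atol=atol):
        raise ValueError("diagonal is not zero")
    cleaned = (arr + arr.T) / 2  # remove tiny asymmetries caused by rounding
    np.fill_diagonal(cleaned, 0.0)  # force an exactly zero diagonal
    return cleaned


# Toy IBS matrix for three samples (made-up values)
toy_ibs = np.array([
    [1.00, 0.75, 0.70],
    [0.75, 1.00, 0.72],
    [0.70, 0.72, 1.00],
])
toy_distances = check_distance_matrix(1 - toy_ibs)
print(toy_distances)
```

In this notebook the same helper could be applied to `distance_matrix.values` before the next cell.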


In [4]:
# Create skbio DistanceMatrix
dm = DistanceMatrix(distance_matrix, individuals)

In [5]:
# Build the Neighbor-Joining Tree
nj_tree = nj(dm)

In [6]:
# Save the tree to a Newick file
nj_tree.write("5_breeds-0-50K.nwk")

# Show the resulting tree in ASCII format
print(nj_tree.ascii_art())

                    /-FROA-ROU-000000911
          /--------|
         |         |          /-FROA-ROU-000000919
         |          \--------|
         |                    \-FROA-ROU-000000913
         |
         |                    /-FROA-ROU-000000918
         |          /--------|
         |         |          \-FROA-ROU-000000907
         |         |
         |         |                              /-FROA-ROU-000000909
         |         |                    /--------|
         |         |                   |          \-FROA-ROU-000000908
         |         |                   |
         |         |                   |                                        /-SEOA-GUT-000004888
         |         |                   |                                       |
         |         |                   |                                       |                              /-SEOA-GUT-000004889
         |         |                   |                                       |             

### Step 1: Assign Coordinates to Nodes

First, we need to assign x and y coordinates to each node in the tree.

* X-coordinate: For leaf nodes, assign x-coordinates evenly spaced along the x-axis. For internal nodes, assign x-coordinates as the average of their children's x-coordinates.
* Y-coordinate: Assign y-coordinates based on the depth (level) of the node in the tree.

In [7]:
# Get all the leaf nodes and assign x-coordinates
leaves = list(nj_tree.tips())
n_leaves = len(leaves)
for i, leaf in enumerate(leaves):
    leaf.x = i

# Function to assign x-coordinates to internal nodes
def assign_x(node):
    if node.is_tip():
        return node.x
    else:
        child_xs = [assign_x(child) for child in node.children]
        node.x = sum(child_xs) / len(child_xs)
        return node.x

assign_x(nj_tree)

# Function to assign y-coordinates based on depth
def assign_y(node, depth=0):
    node.y = -depth  # Negative depth: the root sits at y=0 and descendants are drawn below it
    for child in node.children:
        assign_y(child, depth + 1)

assign_y(nj_tree)
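
The recursive helpers above work well for a tree of this size. For much larger or very unbalanced trees, recursion could run into Python's recursion limit, so an iterative variant may be preferable. The sketch below shows the same layout rules without recursion; the `Node` class is a hypothetical stand-in used to keep the example self-contained, not `skbio`'s `TreeNode`.

```python
class Node:
    """Minimal stand-in for a tree node (hypothetical, for illustration only)."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = children or []
        self.x = None
        self.y = None

    def is_tip(self):
        return not self.children


def layout(root):
    """Assign x (leaf order, or mean of children) and y (negative depth) iteratively."""
    # First pass: preorder traversal sets y and records the visiting order
    order = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        node.y = -depth
        order.append(node)
        # Push children reversed so they are popped left to right
        for child in reversed(node.children):
            stack.append((child, depth + 1))

    # Tips get consecutive integer x positions, left to right
    next_x = 0
    for node in order:
        if node.is_tip():
            node.x = next_x
            next_x += 1

    # Second pass: in reversed preorder every child is visited before its parent
    for node in reversed(order):
        if not node.is_tip():
            node.x = sum(child.x for child in node.children) / len(node.children)
    return root


# Toy tree: ((A, B), C)
inner = Node(children=[Node("A"), Node("B")])
tree = layout(Node(children=[inner, Node("C")]))
print(tree.x, tree.y, inner.x, inner.y)
```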

### Step 2: Collect Edges

We need to collect the parent-child relationships to plot the edges of the tree.

In [8]:
edges = []

def collect_edges(node):
    for child in node.children:
        edges.append((node, child))
        collect_edges(child)

collect_edges(nj_tree)


### Step 3: Prepare Data for Plotly

Create lists of x and y coordinates for the edges and nodes.

In [9]:
# Prepare edge coordinates
edge_x = []
edge_y = []
for parent, child in edges:
    edge_x += [parent.x, child.x, None]  # None breaks the line between consecutive edges
    edge_y += [parent.y, child.y, None]

# Prepare node coordinates and labels
node_x = []
node_y = []
node_text = []

def collect_nodes(node):
    node_x.append(node.x)
    node_y.append(node.y)
    node_text.append(node.name if node.name else '')

    for child in node.children:
        collect_nodes(child)

collect_nodes(nj_tree)

### Step 4: Create Plotly Traces

Now, create Plotly traces for edges and nodes.

In [10]:
# Edge trace
edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    mode='lines',
    line=dict(color='black', width=1),
    hoverinfo='none'
)

# Node trace
node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers+text',
    text=node_text,
    textposition='top center',
    hoverinfo='text',
    marker=dict(color='blue', size=5)
)

### Step 5: Plot the Tree

Set up the layout and create the figure.

In [11]:
layout = go.Layout(
    showlegend=False,
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    hovermode='closest',
    margin=dict(b=20,l=5,r=5,t=40)
)

fig = go.Figure(data=[edge_trace, node_trace], layout=layout)
fig.show()
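
All markers in the figure above share one color. Since Plotly's `marker.color` also accepts a list with one color per point, the leaves could be colored by breed. The following is a minimal sketch of building such a list; the helper `breed_colors`, the palette, and the toy sample names are assumptions for illustration.

```python
def breed_colors(node_names, sample_to_breed, palette, default="lightgray"):
    """Return one color per node: leaves by breed, unnamed/internal nodes get `default`."""
    breeds = sorted(set(sample_to_breed.values()))
    # Cycle through the palette if there are more breeds than colors
    breed_to_color = {
        breed: palette[i % len(palette)] for i, breed in enumerate(breeds)
    }
    return [
        breed_to_color.get(sample_to_breed.get(name), default)
        for name in node_names
    ]


# Toy input (made-up sample names); '' stands for an unnamed internal node
toy_names = ["", "ROU-1", "GUT-1", "ROU-2"]
toy_breeds = {"ROU-1": "ROU", "ROU-2": "ROU", "GUT-1": "GUT"}
toy_palette = ["#1f77b4", "#ff7f0e", "#2ca02c"]

colors = breed_colors(toy_names, toy_breeds, toy_palette)
print(colors)
```

In this notebook, the mapping could be built with `dict(zip(sample_names["sample"], sample_names["breed"]))`, and the resulting list passed as `marker=dict(color=colors, size=5)` when creating `node_trace`.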

## Calculate intra-*breed* distances and perform outlier analysis

### Detailed Procedure

1. **Intra-*breed* Distance Calculation**:
   - For each *breed*, we extract the distances between individuals belonging to that same group.
   - We then calculate the mean ($\mu$) and the standard deviation ($\sigma$) of the distances between individuals of the same *breed*.
   
2. **Identification of Potential *Crossbreeds***:
   - For each individual, we calculate the average distance between them and other members of their *breed*.
   - If an individual’s average distance from their *breed* exceeds the intra-*breed* mean of that *breed* by more than a chosen number of standard deviations (for example, above $\mu + 2\sigma$), then that individual is considered a potential *crossbreed*.

3. **Step-by-Step Algorithm**:
   - **Input**: Distance matrix $D$, individuals’ breed assignments.
   - **Output**: List of individuals that are potentially *crossbreeds* or assigned to the wrong *breed*.
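
The procedure above can be illustrated on a toy example before applying it to the real data. The sketch below uses only the standard library; the function `flag_outliers`, the five individuals, and their distances are made up for illustration. `pstdev` is the population standard deviation, which corresponds to the default behavior of `np.std`.

```python
from itertools import combinations
from statistics import mean, pstdev


def flag_outliers(distances, members, threshold=2.0):
    """Flag members whose mean distance to the others exceeds mu + threshold * sigma.

    `distances` maps frozenset({a, b}) to the distance between a and b.
    """
    pairwise = [distances[frozenset(pair)] for pair in combinations(members, 2)]
    mu, sigma = mean(pairwise), pstdev(pairwise)
    flagged = []
    for ind in members:
        others = [distances[frozenset((ind, o))] for o in members if o != ind]
        if mean(others) > mu + threshold * sigma:
            flagged.append(ind)
    return mu, sigma, flagged


# One toy breed: A-D are close to each other (0.2), E is far from everyone (0.4)
members = ["A", "B", "C", "D", "E"]
distances = {
    frozenset(pair): (0.4 if "E" in pair else 0.2)
    for pair in combinations(members, 2)
}

for k in (1.0, 2.0):
    mu, sigma, flagged = flag_outliers(distances, members, threshold=k)
    print(f"threshold={k}: mu={mu:.3f}, sigma={sigma:.3f}, flagged={flagged}")
```

In this toy case E is flagged with a threshold of 1 but not with a threshold of 2, because E's own distances are included in $\mu$ and $\sigma$ and inflate both. In a small group, a strict threshold may therefore miss a single distant individual.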

In [12]:
# Function to calculate intra-breed distances using an assignment DataFrame
def calculate_intra_breed_stats_with_assignments(distance_df, assignment_df):
    unique_breeds = assignment_df['breed'].unique()
    breed_stats = {}

    for breed in unique_breeds:
        # Get the samples for a given breed
        breed_samples = assignment_df[assignment_df['breed'] == breed]['sample'].values
        breed_distances = distance_df.loc[breed_samples, breed_samples].values

        # Extract only values above the diagonal (distances between distinct individuals)
        upper_triangle_indices = np.triu_indices_from(breed_distances, k=1)
        intra_breed_distances = breed_distances[upper_triangle_indices]

        # Calculate mean and standard deviation
        mean_distance = np.mean(intra_breed_distances)
        std_distance = np.std(intra_breed_distances)

        breed_stats[breed] = {
            'mean_distance': mean_distance,
            'std_distance': std_distance
        }

    return breed_stats

# Function to identify potential crossbreeds using the assignment DataFrame
def identify_outliers_with_assignments(distance_df, assignment_df, breed_stats, threshold=2):
    outliers = []
    for _, row in assignment_df.iterrows():
        ind = row['sample']
        breed = row['breed']

        # Get other samples of the same breed (excluding the current individual)
        breed_samples = assignment_df[(assignment_df['breed'] == breed) & (assignment_df['sample'] != ind)]['sample'].values
        distances_to_breed = distance_df.loc[ind, breed_samples].values

        # Calculate the mean distance to other individuals of the same breed
        mean_distance_to_breed = np.mean(distances_to_breed)

        # Compare with the intra-breed mean and standard deviation
        mean_intra_breed = breed_stats[breed]['mean_distance']
        std_intra_breed = breed_stats[breed]['std_distance']

        if mean_distance_to_breed > mean_intra_breed + threshold * std_intra_breed:
            outliers.append(ind)

    return outliers

In [13]:
# Calculate the intra-breed statistics using the assignment DataFrame
breed_stats_with_assignments = calculate_intra_breed_stats_with_assignments(distance_matrix, sample_names)

# Identify potential crossbreeds with the assignment DataFrame
outliers_with_assignments = identify_outliers_with_assignments(distance_matrix, sample_names, breed_stats_with_assignments)

# Display results
outliers_with_assignments, breed_stats_with_assignments


([],
 {'BMN': {'mean_distance': 0.26521966233766237,
   'std_distance': 0.028097615357187217},
  'GUT': {'mean_distance': 0.2352332727272727,
   'std_distance': 0.014657077441748476},
  'MSP': {'mean_distance': 0.3090749761904762,
   'std_distance': 0.027964204201841508},
  'ROU': {'mean_distance': 0.27674486666666664,
   'std_distance': 0.007965034427410922},
  'SAM': {'mean_distance': 0.2958760346320346,
   'std_distance': 0.014550799173203073}})