GPT-4 prompt:

I have this exercise and the following data:

*Available Data:
tree.csv with columns [Parent,Child,age_ch,t,species]
vert_genes.csv with columns [ensembl_id,orthId,glength,species]

*Subject/Background information:
GAUSSIAN TREE MODELS OF GENE LENGTH EVOLUTION
3. INFERENCE OF HIDDEN NODES
Only the leaf nodes, $X_i$, are in practice observed, and prediction of the unobserved gene lengths of the ancestors, $Z_0, \ldots, Z_{n-2}$, is an inference problem.
FIGURE 2. An illustration of the Bayesian network structure used in this project. The variables $X_1, \ldots, X_n$ are gene lengths of orthologous genes from $n$ present day species, and the variables are the leaves in a tree where the root, $Z_0$, and the other non-leaves, $Z_1, \ldots, Z_{n-2}$, are generally unobserved. Each non-leaf variable has precisely two children and equals the gene length in the evolutionarily most recent common ancestor of its children.
The conditional distribution of $Z_0, \ldots, Z_{n-2}$ given $X_1, \ldots, X_n$ is Gaussian and can be found by matrix algebra. It is useful to implement this for testing, but you must implement at least one efficient algorithm for computing $Z_i \mid X_1, \ldots, X_n$ for $i=0, \ldots, n-2$.
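
For reference, the matrix-algebra route mentioned above is the standard Gaussian conditioning identity. The block notation ($\mu_Z$, $\Sigma_{ZX}$, and so on) is introduced here for illustration and does not come from the exercise text; $Z$ stacks $Z_0, \ldots, Z_{n-2}$ and $X$ stacks $X_1, \ldots, X_n$.

```latex
\begin{aligned}
\begin{pmatrix} Z \\ X \end{pmatrix}
  &\sim \mathcal{N}\!\left(
     \begin{pmatrix} \mu_Z \\ \mu_X \end{pmatrix},
     \begin{pmatrix} \Sigma_{ZZ} & \Sigma_{ZX} \\ \Sigma_{XZ} & \Sigma_{XX} \end{pmatrix}
   \right), \\
Z \mid X = x
  &\sim \mathcal{N}\!\left(
     \mu_Z + \Sigma_{ZX}\Sigma_{XX}^{-1}(x - \mu_X),\;
     \Sigma_{ZZ} - \Sigma_{ZX}\Sigma_{XX}^{-1}\Sigma_{XZ}
   \right).
\end{aligned}
```

The marginal of a single $Z_i$ is then read off from the $i$-th entry of the conditional mean and the $i$-th diagonal entry of the conditional covariance.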


*Exercise:
Implement inference algorithms for computing the conditional distribution of each of the variables $Z_0, \ldots, Z_{n-2}$ given $X_1, \ldots, X_n$.
How would I do this in a simple way in Python?

In [1]:
# Imports
import pandas as pd
import numpy as np
from scipy.linalg import block_diag
from collections import defaultdict

In [4]:
# Read data
tree_df = pd.read_csv('data/tree.csv')
vert_genes_df = pd.read_csv('data/vert_genes.csv')


In [18]:
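# Construct the tree as {parent: {'left': ..., 'right': ...}}
def build_tree(tree_df):
    tree = {}
    for _, row in tree_df.iterrows():
        # Rows whose Parent equals this row's Child, i.e. the children of row['Child']
        # (grandchildren of row['Parent']), not the sibling of row['Child']
        right_child = tree_df.loc[tree_df['Parent'] == row['Child'], 'Child'].values
        # Each row overwrites any earlier entry stored for the same parent
        tree[row['Parent']] = {'left': row['Child'], 'right': right_child[0] if len(right_child) > 0 else None}
    return tree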

# Get the number of species
n = len(vert_genes_df['species'].unique())

# Define a function to compute the covariance matrix for a subtree rooted at node
def compute_covariance_matrix(tree, node):
    if node not in tree:  # If the node is a leaf
        return np.empty((0, 0))  # Return an empty 2-dimensional array

    children = tree[node]

    # Recursively compute covariance matrices for child subtrees
    cov_matrices = [compute_covariance_matrix(tree, child) for child in children.values()]

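    # Calculate a 2x2 block for the current node from the left child's row only
    # (the right child's age_ch and t are not used)
    age_ch = tree_df.loc[tree_df['Child'] == children['left'], 'age_ch'].values[0]
    t = tree_df.loc[tree_df['Child'] == children['left'], 't'].values[0]
    cov_matrix = age_ch * np.array([[t, t / 2], [t / 2, t]])

    # The 2x2 block is added to the block-diagonal matrix of the children,
    # so the two shapes must be compatible for this sum to succeed
    return block_diag(*cov_matrices) + cov_matrix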

# Define a function to compute the conditional distribution of a hidden node given leaf nodes
def compute_conditional_distribution(tree, node, inv_cov, X):
    if node not in tree:  # If the node is a leaf
        return None

    children = tree[node]

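    # Find the indices of the hidden node and its children in the covariance matrix.
    # These are tree_df row labels; .index[0] raises IndexError for a child that is a leaf,
    # because leaves never appear in the 'Parent' column.
    idx_node = tree_df.loc[tree_df['Parent'] == node].index[0]
    idx_child1 = tree_df.loc[tree_df['Parent'] == children['left']].index[0]
    idx_child2 = tree_df.loc[tree_df['Parent'] == children['right']].index[0]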

    # Calculate the conditional distribution parameters
    mean = (inv_cov[idx_node, idx_child1] * X[children['left']] + inv_cov[idx_node, idx_child2] * X[children['right']]) / inv_cov[idx_node, idx_node]
    variance = 1 / inv_cov[idx_node, idx_node]

    return mean, variance


# Identify the root node (the parent that never appears as a child).
# This must be defined before it is used below.
root_node = tree_df.loc[~tree_df['Parent'].isin(tree_df['Child']), 'Parent'].values[0]

tree = build_tree(tree_df)
covariance_matrix = compute_covariance_matrix(tree, root_node)
inv_covariance_matrix = np.linalg.inv(covariance_matrix)
X = vert_genes_df.set_index('species')['glength'].to_dict()

# Identify the internal nodes (excluding leaf nodes and the root node)
internal_nodes = set(tree_df['Parent']).difference(set(tree_df['Child']), {root_node})

conditional_distributions = {node: compute_conditional_distribution(tree, node, inv_covariance_matrix, X) for node in internal_nodes}

print(conditional_distributions)

{nan: None}
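
For testing, a brute-force reference can assemble the joint covariance of all nodes and condition on the leaves directly. The sketch below is only an illustration under a model the text above does not spell out: it assumes the root is N(mu0, tau2) and each child equals its parent plus independent N(0, sigma2 * t) noise, with `t` taken to be the branch length above the child. The names `joint_covariance`, `condition_on_leaves`, `sigma2`, `tau2` and the toy tree are all hypothetical.

```python
import numpy as np

def joint_covariance(order, parent, t, sigma2=1.0, tau2=1.0):
    """Covariance of all nodes. `order` must list parents before their children.

    Assumed model (not from the exercise text): Var(root) = tau2 and
    child = parent + N(0, sigma2 * t[child]).
    """
    idx = {v: i for i, v in enumerate(order)}
    S = np.zeros((len(order), len(order)))
    for v in order:
        i = idx[v]
        if v not in parent:                 # the root has no parent
            S[i, i] = tau2
            continue
        p = idx[parent[v]]
        S[i, :] = S[p, :]                   # v inherits its parent's covariances
        S[:, i] = S[:, p]
        S[i, i] = S[p, p] + sigma2 * t[v]   # plus the variance added on its own branch
    return S, idx

def condition_on_leaves(S, mu, hidden, observed, x):
    """Mean and covariance of the hidden nodes given observed values x."""
    S_ho = S[np.ix_(hidden, observed)]
    S_oo = S[np.ix_(observed, observed)]
    K = np.linalg.solve(S_oo, S_ho.T).T     # equals S_ho @ inv(S_oo), S_oo being symmetric
    mean = mu[hidden] + K @ (x - mu[observed])
    cov = S[np.ix_(hidden, hidden)] - K @ S_ho.T
    return mean, cov

# Hypothetical toy tree: 5 is the root, 4 is internal, 1-3 are leaves.
order = [5, 4, 3, 1, 2]
parent = {4: 5, 3: 5, 1: 4, 2: 4}
t = {4: 1.0, 3: 2.0, 1: 1.0, 2: 1.0}

S, idx = joint_covariance(order, parent, t)
hidden = [idx[5], idx[4]]
observed = [idx[3], idx[1], idx[2]]
mu = np.zeros(len(order))                   # prior mean mu0 = 0 at every node
x = np.array([13.0, 26.0, 0.0])             # made-up leaf values for nodes 3, 1, 2
mean, cov = condition_on_leaves(S, mu, hidden, observed, x)
```

This costs a dense solve over all leaves, so it is better suited to checking a faster algorithm on small trees than to the full data.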


In [13]:
tree = build_tree(tree_df)
print(tree)


{222.0: {'left': 2, 'right': None}, 221.0: {'left': 222, 'right': 1}, 220.0: {'left': 221, 'right': 3}, 219.0: {'left': 220, 'right': 4}, 218.0: {'left': 219, 'right': 5}, 217.0: {'left': 218, 'right': 6}, 216.0: {'left': 217, 'right': 7}, 215.0: {'left': 216, 'right': 8}, 224.0: {'left': 11, 'right': None}, 226.0: {'left': 13, 'right': None}, 225.0: {'left': 226, 'right': 12}, 213.0: {'left': 214, 'right': 215}, 241.0: {'left': 17, 'right': None}, 240.0: {'left': 241, 'right': 16}, 239.0: {'left': 240, 'right': 18}, 238.0: {'left': 239, 'right': 19}, 242.0: {'left': 22, 'right': None}, 243.0: {'left': 24, 'right': None}, 235.0: {'left': 236, 'right': 237}, 234.0: {'left': 235, 'right': 25}, 233.0: {'left': 234, 'right': 26}, 249.0: {'left': 29, 'right': None}, 248.0: {'left': 249, 'right': 28}, 247.0: {'left': 248, 'right': 30}, 252.0: {'left': 33, 'right': None}, 251.0: {'left': 252, 'right': 32}, 250.0: {'left': 251, 'right': 34}, 254.0: {'left': 37, 'right': None}, 258.0: {'left': 
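
The printed tree shows parents with 'right': None, which would not be expected when every non-leaf has exactly two children. One possible way to build the mapping is to group the (Parent, Child) rows by parent. The sketch below uses plain Python and a made-up edge list. It assumes, based on the float keys and the nan in the output above rather than on anything stated about tree.csv, that the root's own row has a missing Parent. `build_children` is a hypothetical helper name.

```python
from collections import defaultdict

def build_children(edges):
    """Group (parent, child) pairs into {parent: [children]}.

    Rows whose parent is missing (None or NaN) are skipped; this is assumed
    to be how the root's own row appears in the edge table.
    """
    children = defaultdict(list)
    for parent, child in edges:
        if parent is None or parent != parent:   # NaN is the only value not equal to itself
            continue
        children[int(parent)].append(int(child))
    return dict(children)

# Hypothetical toy edge list in the (Parent, Child) layout:
# node 5 is the root, 4 is internal, 1-3 are leaves.
edges = [(float('nan'), 5), (5.0, 4), (5.0, 3), (4.0, 1), (4.0, 2)]

children = build_children(edges)
all_children = {c for kids in children.values() for c in kids}
root = (set(children) - all_children).pop()      # the only parent that is nobody's child
leaves = sorted(all_children - set(children))    # children that are never parents
```

With the real table, the pairs could be produced by `tree_df[['Parent', 'Child']].itertuples(index=False, name=None)`, assuming the column names listed at the top.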

In [14]:
conditional_distributions = {str(node): compute_conditional_distribution(tree, node, inv_covariance_matrix, X) for node in tree if str(node).startswith('Z')}
print("Processed nodes:", [str(node) for node in tree if str(node).startswith('Z')])
print(conditional_distributions)


Processed nodes: []
{}
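
The node labels in the tree are numeric, so the 'Z' prefix filter selects nothing. For the efficient algorithm the exercise asks for, one common choice on a tree is a two-pass (upward, then downward) Gaussian message-passing scheme, sketched below in plain Python. It is not a definitive implementation and rests on assumptions that do not come from the exercise text: the root is N(mu0, tau2), each child given its parent is N(parent, sigma2 * t[child]), and all branch lengths are positive. The function name, parameters and toy data are hypothetical.

```python
def posterior_marginals(children, t, x, root, mu0=0.0, tau2=1.0, sigma2=1.0):
    """Return {internal node: (mean, variance)} given the leaf values x.

    children: {node: [child, ...]} for internal nodes; leaves are not keys.
    t: {node: branch length above that node}; x: {leaf: observed value}.
    """
    # Pre-order list of internal nodes (parents before children).
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(c for c in children[v] if c in children)

    def through_branch(prec, mean, c):
        """Pass a Gaussian message across the branch above c (variances add)."""
        return 1.0 / (1.0 / prec + sigma2 * t[c]), mean

    # Upward pass: msg_up[c] is the message from c to its parent, as (precision, mean).
    msg_up, up = {}, {}
    for v in reversed(order):
        for c in children[v]:
            if c not in children:                  # leaf: observed exactly
                msg_up[c] = (1.0 / (sigma2 * t[c]), x[c])
            else:
                msg_up[c] = through_branch(*up[c], c)
        prec = sum(msg_up[c][0] for c in children[v])
        mean = sum(msg_up[c][0] * msg_up[c][1] for c in children[v]) / prec
        up[v] = (prec, mean)

    # Downward pass: combine each upward belief with the message from above.
    down = {root: (1.0 / tau2, mu0)}               # the prior plays the role of the root's parent message
    post = {}
    for v in order:
        prec = up[v][0] + down[v][0]
        mean = (up[v][0] * up[v][1] + down[v][0] * down[v][1]) / prec
        post[v] = (mean, 1.0 / prec)
        for c in children[v]:
            if c in children:
                # Remove c's own message from v's posterior, then cross the branch to c.
                cav_prec = prec - msg_up[c][0]
                cav_mean = (prec * mean - msg_up[c][0] * msg_up[c][1]) / cav_prec
                down[c] = through_branch(cav_prec, cav_mean, c)
    return post

# Hypothetical toy tree: 5 is the root, 4 is internal, 1-3 are leaves.
children = {5: [4, 3], 4: [1, 2]}
t = {4: 1.0, 3: 2.0, 1: 1.0, 2: 1.0}
x = {3: 13.0, 1: 26.0, 2: 0.0}                     # made-up leaf values
marginals = posterior_marginals(children, t, x, root=5)
```

Each node is visited twice, so the cost grows linearly with the number of nodes, and the result can be compared against a dense matrix computation on small trees.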
