# Create a constraint tree

In this notebook, we will create a family-level constraint tree to use in a BEAST2 analysis of Diptera genera for which we have both molecular data and larval phenotypes scored.

Previously, we obtained a dataset with several Diptera genus names with larval data, and used matrix condenser to parse these names and obtain NCBI IDs.

We then used phylotaR to obtain a molecular dataset for these genera, finding that there are 6 genes with a good representation among them. Later we completed the phylotaR dataset with manually downloaded data for additional genera not present initially.

We also downloaded constraint trees from supplementary data available in relevant publications. These encompass both an all-Diptera tree and several robust and more recent trees for less inclusive taxa.

We now need to accomplish the following tasks to infer a phylogeny:

1. Obtain a constraint tree for all of the genera included in our alingments
  * Use dendropy to extract tip names in all published phylogenies
  * Create chatGPT prompts
  * Use chatGPT to parse these tip names and extract structured taxonomic data
  * Run TaxReformer to get additional information about these names
  * Rename all tips to family level according to the most recent taxonomy
  * Prune the large all-Diptera tree at higher levels (e. g. superfamily) when a better tree is available for those taxa
  * Graft the better trees in place
  * Collapse all familes to single tips
  * Graft all genera tips to the correct family, making sure names match our alingments

## Get taxonomic data for tips

### Use dendropy yo extract tip names

Here we will open all of our constraint tree sources and use dendropy to extract tip names. 

In [13]:
import dendropy, numpy as np, copy
from pathlib import Path
from collections import defaultdict

# Initialize an empty list to store the tip names in their original order
tip_names = []

# Specify your directory
directory = Path("../constraint_trees")

# Define the list of schemas we'll try for each file
schemas = ['newick', 'nexus']

files = ['Bayless2022_Analysis6_64taxa1130aa_132mp_roguesremoved.rerooted.tre',
         'Buenaventura2020_oest_50per_RAxML.rerooted.tre',
         'Shin2018_RAxML_bipartitions.100BT_data_v.rerooted.tree',
         'Winkler2022_ephy_aa.rerooted.tre',
         'Wiegmann2011_pg_437.nex']

# Go through each file in the directory
for file in files:
    # Check if the file is a phylogenetic tree (e.g., in Newick or Nexus format)
    # Try to load the tree with each schema
    for schema in schemas:
        try:
            tree = dendropy.Tree.get(
                path=str(directory/file),
                schema=schema,
                preserve_underscores=True
            )

            # If we get here, it means the tree was loaded successfully.
            # We can break the loop over schemas.
            break
        except Exception as e:
            pass

    else:
        # If we get here, it means all schemas failed. We skip this file.
        print(f"Skipping {file} as no schema could load it.")
        continue

    # Replace spaces in taxon names with underscores
    for taxon in tree.taxon_namespace:
        taxon.label = taxon.label.replace(" ", "_")

    print(file)

    # Add the names of all the leaf nodes (tips) to the list, preserving original order
    # and avoiding duplicates
    tree.print_plot()
    for leaf in tree.leaf_node_iter():
        if leaf.taxon.label not in tip_names:
            tip_names.append(leaf.taxon.label)


Bayless2022_Analysis6_64taxa1130aa_132mp_roguesremoved.rerooted.tre
                                          /--------- Meghyperus                
/-----------------------------------------+                                    
|                                         |  /------ Hilarini                  
|                                         \--+                                 
|                                            |  /--- Heteropsilopus_ingenuus   
|                                            \--+                              
|                                               \--- Stilpon_pauciseta         
+                                                                              
|                                            /------ Lonchoptera_bifurcata     
|  /-----------------------------------------+                                 
|  |                                         |  /--- Platypeza_anthrax         
|  |                                         \--+   

### Process tip names with chatGPT and parse results into table
Now that we have the list, we can split it in sets of 35 and create chatGPT prompts to extract information from them.

In [14]:
chunk_size=35
lists_names = [tip_names[i:i + chunk_size] for i in range(0, len(tip_names), chunk_size)]

In [15]:
prompts = []
for l in lists_names:
    new_prompt = ('''The following list contains tip names extracted from a Diptera phylogenetic tree. Parse them to make a table including the following columns:
```tip_name,superfamily,family,subfamily,tribe,genus,species,additional_identifiers```
```
''' +
                  '\n'.join(l) +
                  '''
                  ```
Rules:
- Taxonomic ranks above genus may be abbreviated in the tip names (for example, SAR for Sarcophagidae). If there is an abbreviation, try your best guess to write athe full name. Consider family missing if there is no data about it in the original tip name, do not infer from genus name alone.
- Leave a cell empty if information is missing. "gen" usually means a genus is unknown, leave genus empty. "sp" usually means a species is unknown, leave species empty.
Output the table in csv format.

                  '''
                 )
    prompts.append(new_prompt)

Now let's loop through these prompts and copy them one by one with a GUI:
    

In [16]:
import ipywidgets as widgets

# Your list of strings
string_list = prompts
index = 0

# Create widgets
next_button = widgets.Button(description="Next")
back_button = widgets.Button(description="Back")
jump_button = widgets.Button(description="Jump")
index_input = widgets.BoundedIntText(min=1, max=len(string_list), description='Index:')
textarea = widgets.Textarea(rows=20, cols=200)
output = widgets.Output()
label = widgets.Label()

def update_label():
    label.value = f"Showing string {index+1} of {len(string_list)}"

def update_textarea():
    if index < len(string_list):
        textarea.value = string_list[index]
    else:
        textarea.value = ""

def on_next_button_clicked(b):
    global index
    with output:
        if index < len(string_list) - 1:
            index += 1
            update_label()
            update_textarea()
        else:
            print("End of list")

def on_back_button_clicked(b):
    global index
    with output:
        if index > 0:
            index -= 1
            update_label()
            update_textarea()
        else:
            print("Start of list")

def on_jump_button_clicked(b):
    global index
    with output:
        desired_index = index_input.value - 1  # Subtract 1 because our list is 0-indexed
        if 0 <= desired_index < len(string_list):
            index = desired_index
            update_label()
            update_textarea()
        else:
            print("Invalid index")

next_button.on_click(on_next_button_clicked)
back_button.on_click(on_back_button_clicked)
jump_button.on_click(on_jump_button_clicked)

button_box = widgets.HBox([back_button, next_button, jump_button, index_input])
display_area = widgets.VBox([button_box, label, textarea, output])

display(display_area)

# Update the textarea and label to start
update_textarea()
update_label()

VBox(children=(HBox(children=(Button(description='Back', style=ButtonStyle()), Button(description='Next', styl…

We will now loop through the prompts above, use them in chatGPT and paste the response in a csv table.

The results can be found in this chat: https://chat.openai.com/share/13934687-d473-4eb6-8af3-4c7dedc836e5

Now that we have done it, let's load the table and make sure everything looks all right.

In [17]:
import pandas as pd
gpt_table = pd.read_csv("../constraint_trees/taxonomy_parsing/parsed_gpt_tips.csv")
gpt_table

Unnamed: 0,tip_name,superfamily,family,subfamily,tribe,genus,species,additional_identifiers
0,Liriomyza_sativae_AG1433_4372_308,,,,,Liriomyza,sativae,AG1433_4372_308
1,Braula_coeca_Braula_coeca_GDDH01_1_5741_317,,,,,Braula,coeca,Braula_coeca_GDDH01_1_5741_317
2,Braula_coeca_I6794_507_280,,,,,Braula,coeca,I6794_507_280
3,Leucophenga_varia_L_varia_4359_279_assm.augustus,,,,,Leucophenga,varia,L_varia_4359_279_assm.augustus
4,Leucophenga_maculata_leucma_323_264,,,,,Leucophenga,maculata,leucma_323_264
...,...,...,...,...,...,...,...,...
2596,Chymomyza_costata,,,,,Chymomyza,costata,
2597,Eristalis_pertinax,,,,,Eristalis,pertinax,
2598,Lonchoptera_bifurcata,,,,,Lonchoptera,bifurcata,
2599,Megaselia_abdita,,,,,Megaselia,abdita,


Let's check if any row has made-up tip names, or any missing, and then check for duplicated tip names

In [19]:
gpt_table.loc[~gpt_table['tip_name'].isin(tip_names)]

Unnamed: 0,tip_name,superfamily,family,subfamily,tribe,genus,species,additional_identifiers
43,Chordonota_sp_CB0436_Stratiomyidae_S,,Stratiomyidae,,,Chordonota,,CB0436
44,Leucoptilum_sp_CB0326_Stratiomyidae_S,,Stratiomyidae,,,Leucoptilum,,CB0326
45,Ditylometopa_elegans_Stratiomyidae_S,,Stratiomyidae,,,Ditylometopa,elegans,
46,Euryneura_propinqua_Stratiomyidae_S,,Stratiomyidae,,,Euryneura,propinqua,
47,Diaphorostylus_nasica_Stratiomyidae_S,,Stratiomyidae,,,Diaphorostylus,nasica,
...,...,...,...,...,...,...,...,...
1134,Cyphomyia_nr_androgyna_CB0201_Stratiomyidae_S,,Stratiomyidae,,,Cyphomyia,,CB0201
1135,Dicyphoma_schaefferi_Stratiomyidae_S,,Stratiomyidae,,,Dicyphoma,schaefferi,
1136,Abasanistus_rubricornis_Stratiomyidae_S,,Stratiomyidae,,,Abasanistus,rubricornis,
1137,Labocerina_atrata_Stratiomyidae_S,,Stratiomyidae,,,Labocerina,atrata,


In [20]:
[t for t in tip_names if t not in gpt_table['tip_name'].tolist()]

[]

In [21]:
gpt_table['tip_name'].loc[gpt_table['tip_name'].duplicated()]

Series([], Name: tip_name, dtype: object)

Great, now each entry in the table is unique and we have correspondences to all tip names, let's update the table to use [TaxReformer](https://github.com/brunoasm/TaxReformer/blob/master/README.md)

In [22]:
def create_name(row):
    if pd.notnull(row['genus']) and pd.notnull(row['species']):
        return row['genus'] + ' ' + row['species']
    else:
        for column in ['species', 'genus', 'tribe', 'subfamily', 'family', 'superfamily']:
            if pd.notnull(row[column]):
                return row[column]

gpt_table['name'] = gpt_table.apply(create_name, axis=1)

gpt_table

Unnamed: 0,tip_name,superfamily,family,subfamily,tribe,genus,species,additional_identifiers,name
0,Liriomyza_sativae_AG1433_4372_308,,,,,Liriomyza,sativae,AG1433_4372_308,Liriomyza sativae
1,Braula_coeca_Braula_coeca_GDDH01_1_5741_317,,,,,Braula,coeca,Braula_coeca_GDDH01_1_5741_317,Braula coeca
2,Braula_coeca_I6794_507_280,,,,,Braula,coeca,I6794_507_280,Braula coeca
3,Leucophenga_varia_L_varia_4359_279_assm.augustus,,,,,Leucophenga,varia,L_varia_4359_279_assm.augustus,Leucophenga varia
4,Leucophenga_maculata_leucma_323_264,,,,,Leucophenga,maculata,leucma_323_264,Leucophenga maculata
...,...,...,...,...,...,...,...,...,...
2596,Chymomyza_costata,,,,,Chymomyza,costata,,Chymomyza costata
2597,Eristalis_pertinax,,,,,Eristalis,pertinax,,Eristalis pertinax
2598,Lonchoptera_bifurcata,,,,,Lonchoptera,bifurcata,,Lonchoptera bifurcata
2599,Megaselia_abdita,,,,,Megaselia,abdita,,Megaselia abdita


Let's save the table to run TaxReformer.

In [10]:
#gpt_table.to_csv("../constraint_trees/taxonomy_parsing/parsed_gpt_tips_for_taxreformer.csv")

### Obtain taxonomic information with TaxReformer

Now let's run TaxReformer. We run it using docker, so it is easier to do outside this notebook. We ran TaxReformer and saved an edited version of the table (with original columns other than tip name removed). Let's now open it an merge with our table.

In [24]:
taxref_df = pd.read_csv("../constraint_trees/taxonomy_parsing/output_matched_all_edited.csv",index_col=0)
taxref_df

Unnamed: 0_level_0,name,updated_fullname,taxonomy_source,rank_matched,matched,score,ott_id,ncbi_id,name_source,matched_id_in_source,...,section,superfamily,family,subfamily,tribe,subtribe,species group,subgenus,species subgroup,notes
tip_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Liriomyza_sativae_AG1433_4372_308,Liriomyza sativae,Liriomyza sativae,OTT,species,Liriomyza sativae,100,246452,127406.0,OTT,246452,...,Schizophora,Opomyzoidea,Agromyzidae,Phytomyzinae,,,,,,
Braula_coeca_Braula_coeca_GDDH01_1_5741_317,Braula coeca,Braula coeca,OTT,species,Braula coeca,100,26327,305562.0,OTT,26327,...,Schizophora,Ephydroidea,Braulidae,,,,,,,superfamily manually updated
Braula_coeca_I6794_507_280,Braula coeca,Braula coeca,OTT,,Braula coeca,100,26327,305562.0,OTT,26327,...,Schizophora,Ephydroidea,Braulidae,,,,,,,superfamily manually updated
Leucophenga_varia_L_varia_4359_279_assm.augustus,Leucophenga varia,Leucophenga varia,OTT,species,Leucophenga varia,100,765416,745178.0,OTT,765416,...,Schizophora,Ephydroidea,Drosophilidae,Steganinae,Steganini,Leucophengina,,,,
Leucophenga_maculata_leucma_323_264,Leucophenga maculata,Leucophenga maculata,OTT,species,Leucophenga maculata,100,742404,30051.0,OTT,742404,...,Schizophora,Ephydroidea,Drosophilidae,Steganinae,Steganini,Leucophengina,maculata group,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ruppellia_multisetosis_Therevidae,Ruppellia,Ruppellia,OTT,genus,Ruppellia,100,545287,95102.0,OTT,545287,...,,Therevidae,Phycinae,Ruppellia_multisetosis_Therevidae,,Therevidae,,,,
Propebrevitrichia_patersonensi_Scenopinidaes,Scenopinidae,Scenopinidae,OTT,family,Scenopinidae,100,653872,50675.0,OTT,653872,...,,,,Propebrevitrichia_patersonensi_Scenopinidaes,,Scenopinidae,,,,
CerdistusLRC_Asilidae,Cerdistus,Cerdistus,OTT,genus,Cerdistus,100,919308,188249.0,OTT,919308,...,,Asilidae,Asilinae,CerdistusLRC_Asilidae,,Asilidae,,,,
Hilarini,Empidinae,Empidinae,OTT,subfamily,Empidinae,100,235209,92562.0,OTT,235209,...,,Empididae,,Hilarini,,,,,,


In [25]:
tipnames2tax = (gpt_table[["name","tip_name"]].
                merge(taxref_df, on = ["name","tip_name"], how = "left").
                set_index("tip_name").
                rename(columns={'updated_genus': 'genus', 'updated_species': 'species'})
               )

cols = list(tipnames2tax   .columns)
cols.remove('genus')
cols.remove('subgenus')
cols.remove('species')
cols.remove('species group')
cols.remove('species subgroup')
tipnames2tax    = tipnames2tax[cols+['genus', 'subgenus', 'species group','species subgroup', 'species']]

tipnames2tax.loc[tipnames2tax['species'].notna(), 'species'] = tipnames2tax['genus'] + ' ' + tipnames2tax['species']
                
tipnames2tax                 

Unnamed: 0_level_0,name,updated_fullname,taxonomy_source,rank_matched,matched,score,ott_id,ncbi_id,name_source,matched_id_in_source,...,family,subfamily,tribe,subtribe,notes,genus,subgenus,species group,species subgroup,species
tip_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Liriomyza_sativae_AG1433_4372_308,Liriomyza sativae,Liriomyza sativae,OTT,species,Liriomyza sativae,100.0,246452.0,127406.0,OTT,246452.0,...,Agromyzidae,Phytomyzinae,,,,Liriomyza,,,,Liriomyza sativae
Braula_coeca_Braula_coeca_GDDH01_1_5741_317,Braula coeca,Braula coeca,OTT,species,Braula coeca,100.0,26327.0,305562.0,OTT,26327.0,...,Braulidae,,,,superfamily manually updated,Braula,,,,Braula coeca
Braula_coeca_I6794_507_280,Braula coeca,Braula coeca,OTT,,Braula coeca,100.0,26327.0,305562.0,OTT,26327.0,...,Braulidae,,,,superfamily manually updated,,,,,
Leucophenga_varia_L_varia_4359_279_assm.augustus,Leucophenga varia,Leucophenga varia,OTT,species,Leucophenga varia,100.0,765416.0,745178.0,OTT,765416.0,...,Drosophilidae,Steganinae,Steganini,Leucophengina,,Leucophenga,,,,Leucophenga varia
Leucophenga_maculata_leucma_323_264,Leucophenga maculata,Leucophenga maculata,OTT,species,Leucophenga maculata,100.0,742404.0,30051.0,OTT,742404.0,...,Drosophilidae,Steganinae,Steganini,Leucophengina,,Leucophenga,,maculata group,,Leucophenga maculata
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Chymomyza_costata,Chymomyza costata,Chymomyza costata,OTT,,Chymomyza costata,100.0,305192.0,76946.0,OTT,305192.0,...,Drosophilidae,Drosophilinae,Colocasiomyini,,,,,costata group,,
Eristalis_pertinax,Eristalis pertinax,Eristalis pertinax,OTT,species,Eristalis pertinax,100.0,4354572.0,1572519.0,OTT,4354572.0,...,Syrphidae,Eristalinae,Eristalini,,,Eristalis,,,,Eristalis pertinax
Lonchoptera_bifurcata,Lonchoptera bifurcata,Lonchoptera bifurcata,OTT,species,Lonchoptera bifurcata,100.0,170799.0,385268.0,OTT,170799.0,...,Lonchopteridae,,,,,Lonchoptera,,,,Lonchoptera bifurcata
Megaselia_abdita,Megaselia abdita,Megaselia abdita,OTT,species,Megaselia abdita,100.0,1062936.0,88686.0,OTT,1062936.0,...,Phoridae,Metopininae,Megaseliini,,,Megaselia,,,,Megaselia abdita


In [26]:
ordered_ranks = ['domain',
       'kingdom', 'phylum', 'subphylum', 'class', 'subclass', 'infraclass',
       'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'section',
       'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus',
       'subgenus', 'species group', 'species subgroup', 'species']

tipnames2tax.columns

Index(['name', 'updated_fullname', 'taxonomy_source', 'rank_matched',
       'matched', 'score', 'ott_id', 'ncbi_id', 'name_source',
       'matched_id_in_source', 'updated_genus_ott_id', 'updated_genus_ncbi_id',
       'updated_species_ott_id', 'updated_species_ncbi_id',
       'ott_accepted_name', 'ott_version', 'higher_source', 'domain',
       'kingdom', 'phylum', 'subphylum', 'class', 'subclass', 'cohort',
       'infraclass', 'superorder', 'order', 'suborder', 'infraorder',
       'section', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe',
       'notes', 'genus', 'subgenus', 'species group', 'species subgroup',
       'species'],
      dtype='object')

## Join trees
### Load constraint trees and add metadata to tips

Now let's load our constraint trees and add taxonomic information as metadata.

In [53]:
#tns = dendropy.TaxonNamespace()

def load_tree(file):
    # Try to load the tree with each schema
    for schema in schemas:
        try:
            tree = dendropy.Tree.get(
                path=str(file),
                schema=schema,
                preserve_underscores=True,
                rooting="default-rooted"#,
                #taxon_namespace = tns
            )
    
            # If we get here, it means the tree was loaded successfully.
            # We can break the loop over schemas.
            break
        except Exception as e:
            pass
    else:
        # If we get here, it means all schemas failed. We skip this file.
        print(f"Skipping {file} as no schema could load it.")
        
    for leaf in tree.leaf_node_iter():
        leaf.taxon.label = leaf.taxon.label.replace(" ","_")
        
        taxon_info = tipnames2tax.loc[leaf.taxon.label]

        # Add the taxonomic information as metadata to the tip
        for column in ['taxonomy_source', 'ott_id', 'ncbi_id', 'ott_version', 'updated_fullname'] + ordered_ranks:
            if taxon_info[column] is not np.nan:
                leaf.annotations.add_new(column, taxon_info[column])
        
        
    return(tree)
                
    

    

directory = Path("../constraint_trees")
schemas = ['newick', 'nexus']
files = {'Diptera':directory/'Wiegmann2011_pg_437.nex',
         'Brachycera':directory/'Shin2018_RAxML_bipartitions.100BT_data_v.rerooted.tree',
         'Schizophora':directory/'Bayless2022_Analysis6_64taxa1130aa_132mp_roguesremoved.rerooted.tre',
         'Oestroidea':directory/'Buenaventura2020_oest_50per_RAxML.rerooted.tre',
         'Ephydroidea':directory/'Winkler2022_ephy_aa.rerooted.tre'
        }


original_trees = {focal_taxon:load_tree(file) for focal_taxon, file in files.items()}

### Create synthetic constraint tree

Now let's add names to nodes in our trees. For each higher taxon, we will search for the MRCA of all tips containing that taxon, and we will add the taxon as property to the MRCA. While we do it, we will check for conflicts (i. e. non-monophyletic higher taxa) for the focal taxa in our trees.

In [41]:
for tree_key in original_trees:
    tree = original_trees[tree_key]
    unique_taxa_tree = dict()

    # Iterate over each rank
    for rank in ordered_ranks:

        unique_taxa_tree[rank] = set()
        taxon_tips = dict()

        for leaf in tree.leaf_node_iter():
            annotation = leaf.annotations.get_value(rank)
            if annotation:
                unique_taxa_tree[rank].add(annotation)

                if annotation not in taxon_tips:
                    taxon_tips[annotation] = []
                taxon_tips[annotation].append(leaf)

        for taxon, tips in taxon_tips.items():
            # Get the MRCA of these tips
            mrca = tree.mrca(taxon_labels=[tip.taxon.label for tip in tips])

            # Verify that all descendants of the MRCA are not associated with any other taxon of the same rank
            conflict_count = 0
            total_descendants = 0
            conflict_tips = []

            for descendant in mrca.leaf_iter():
                total_descendants += 1
                descendant_taxon = descendant.annotations.get_value(rank)
                if descendant_taxon and descendant_taxon != taxon:
                    conflict_count += 1
                    conflict_tips.append(descendant.taxon.label)

            annotated_tips = len(tips)
            
            if taxon in original_trees.keys():
                if conflict_count == 0:
                    # If no conflict was found, add the taxon as a property of the MRCA
                    mrca.annotations.add_new(rank, taxon)
                    print(f"Tree {tree_key}, Rank {rank}, Taxon {taxon} annotated. {annotated_tips} tips annotated, {total_descendants} tips descending from MRCA.")
                else:
                    #print(f"Tree {tree_key}, Rank {rank}, Taxon {taxon} has conflicts. {annotated_tips} tips annotated, {total_descendants} tips descending from MRCA, {conflict_count} conflicts at tips: {conflict_tips}")
                    print(f"Tree {tree_key}, Rank {rank}, Taxon {taxon} has conflicts. {annotated_tips} tips annotated, {total_descendants} tips descending from MRCA, {conflict_count}")


Tree Diptera, Rank order, Taxon Diptera annotated. 203 tips annotated, 206 tips descending from MRCA.
Tree Diptera, Rank suborder, Taxon Brachycera annotated. 157 tips annotated, 160 tips descending from MRCA.
Tree Diptera, Rank section, Taxon Schizophora annotated. 110 tips annotated, 112 tips descending from MRCA.
Tree Diptera, Rank superfamily, Taxon Ephydroidea has conflicts. 8 tips annotated, 10 tips descending from MRCA, 2
Tree Diptera, Rank superfamily, Taxon Oestroidea annotated. 11 tips annotated, 11 tips descending from MRCA.
Tree Brachycera, Rank order, Taxon Diptera has conflicts. 1076 tips annotated, 1095 tips descending from MRCA, 1
Tree Brachycera, Rank suborder, Taxon Brachycera annotated. 1063 tips annotated, 1082 tips descending from MRCA.
Tree Brachycera, Rank section, Taxon Schizophora annotated. 1 tips annotated, 1 tips descending from MRCA.
Tree Schizophora, Rank order, Taxon Diptera annotated. 62 tips annotated, 64 tips descending from MRCA.
Tree Schizophora, Ran

We had to manually reroot some trees and update taxonomy in Ephydroidea, and now we are free from conflicts. Let's combine these trees into one. We will do this by replacing the focal node in each tree as follows:

- For each tree, we will remove outgroups
- We will use the Diptera tree as base
- We will then replace Brachycera in this tree
- We will then replace Schizophora in this tree
- We will then replace Oestroidea and Ephydroidea

### Prune outgroups

In [42]:
# To store the pruned trees
pruned_trees = {k:v.clone() for k,v in original_trees.items()}

for ingroup_taxon, tree in pruned_trees.items():
    # Find the ingroup node.
    ingroup_node = None
    for node in tree.postorder_node_iter():
        # Skip leaf nodes
        if node.is_leaf():
            continue
        
        # Infer the ranks from the node's annotations
        ranks = list(node.annotations.values_as_dict().keys())

        for rank in ranks:
            if node.annotations.get_value(rank) == ingroup_taxon:
                ingroup_node = node
                break
        if ingroup_node is not None:
            break

    # If ingroup_node is not found, skip this iteration.
    if ingroup_node is None:
        print('fail')
        continue

    # Infer the ingroup taxa from the descendant nodes of the ingroup node.
    ingroup_taxa = set(leaf.taxon for leaf in ingroup_node.leaf_iter())

    # Remove outgroup taxa from the tree's taxon namespace.
    tree.prune_taxa([taxon for taxon in tree.taxon_namespace if taxon not in ingroup_taxa])

### Replace Brachycera in the Diptera tree

In [131]:
# Step 2.1: In both the 'Brachycera' and 'Diptera' trees, find all annotations of rank "family"
brachycera_tree = pruned_trees['Brachycera'].clone()
diptera_tree = pruned_trees['Diptera'].clone()

brachycera_families = set()
diptera_families = set()

for node in brachycera_tree.postorder_node_iter():
    annotations_dict = node.annotations.values_as_dict()
    if "family" in annotations_dict:
        brachycera_families.add(annotations_dict["family"])

for node in diptera_tree.postorder_node_iter():
    if node.is_leaf():
        annotations_dict = node.annotations.values_as_dict()
        if "family" in annotations_dict:
            diptera_families.add(annotations_dict["family"])

# Step 2.2: Consider only the family names in common between both trees
common_families = brachycera_families.intersection(diptera_families)

# Step 2.3: In the 'Diptera' tree, search for the MRCA of all tips containing these family annotations
tips_with_common_families = []

for node in diptera_tree.postorder_node_iter():
    if node.is_leaf():
        annotations_dict = node.annotations.values_as_dict()
        if "family" in annotations_dict and annotations_dict["family"] in common_families:
            tips_with_common_families.append(node)

# Step 2.4: Find the MRCA of these tips
if tips_with_common_families:
    mrca_node_in_diptera = diptera_tree.mrca(taxa=[tip.taxon for tip in tips_with_common_families])

    # Print out the MRCA node for verification (optional)
    if mrca_node_in_diptera is not None:
        print("MRCA found: ", mrca_node_in_diptera.annotations.values_as_dict())
        # Extract the subtree rooted at MRCA and print it
        subtree = dendropy.Tree(seed_node=copy.deepcopy(mrca_node_in_diptera))
        print("Subtree rooted at MRCA: ")
        print(subtree.print_plot())

        # Step 2.5: Replace the subtree starting at the MRCA in the 'Diptera' tree with the entire 'Brachycera' tree
        # Getting the parent of the MRCA
        parent_node = mrca_node_in_diptera.parent_node

        # Removing the MRCA node from the Diptera tree
        parent_node.remove_child(mrca_node_in_diptera)

        # Adding the Brachycera tree in place of the MRCA in Diptera tree
        parent_node.add_child(brachycera_tree.seed_node)

        print("Subtree replaced successfully.")
    else:
        print("No MRCA found for the given tips.")
else:
    print("No tips found with common families.")
    

# Step 2.7: Encode splits and update namespace
diptera_tree.update_taxon_namespace()
diptera_tree.update_bipartitions()


MRCA found:  {'suborder': 'Brachycera'}
Subtree rooted at MRCA: 
                                            /---- Trichophthalma_sp            
                                  /---------+                                  
                                  |         | /-- Xylophagus_abdominalis       
                                  |         \-+                                
                                  |           \-- Exeretonevra_angustifrons    
                               /--+                                            
                               |  |           /-- Rhagio_hirtis                
                               |  | /---------+                                
                               |  | |         \-- Vermileo_opacus              
                               |  | |                                          
                               |  \-+         /-- Chrysopilus_thoracicus       
                               |    | /-------+        

### Replace Schizophora in the resulting tree

In [132]:
schizophora_tree = pruned_trees['Schizophora'].clone()

schizophora_families = set()
diptera_families = set()

for node in schizophora_tree.postorder_node_iter():
    annotations_dict = node.annotations.values_as_dict()
    if "family" in annotations_dict:
        schizophora_families.add(annotations_dict["family"])

for node in diptera_tree.postorder_node_iter():
    if node.is_leaf():
        annotations_dict = node.annotations.values_as_dict()
        if "family" in annotations_dict:
            diptera_families.add(annotations_dict["family"])

# Step 2.2: Consider only the family names in common between both trees
common_families = schizophora_families.intersection(diptera_families)

# Step 2.3: In the 'Diptera' tree, search for the MRCA of all tips containing these family annotations
tips_with_common_families = []

for node in diptera_tree.postorder_node_iter():
    if node.is_leaf():
        annotations_dict = node.annotations.values_as_dict()
        if "family" in annotations_dict and annotations_dict["family"] in common_families:
            tips_with_common_families.append(node)

# Step 2.4: Find the MRCA of these tips
if tips_with_common_families:
    mrca_node_in_diptera = diptera_tree.mrca(taxa=[tip.taxon for tip in tips_with_common_families])

    # Print out the MRCA node for verification (optional)
    if mrca_node_in_diptera is not None:
        print("MRCA found: ", mrca_node_in_diptera.annotations.values_as_dict())

        # Step 2.5: Replace the subtree starting at the MRCA in the 'Diptera' tree with the entire 'Schizophora' tree
        # Getting the parent of the MRCA
        parent_node = mrca_node_in_diptera.parent_node

        # Removing the MRCA node from the Diptera tree
        parent_node.remove_child(mrca_node_in_diptera)

        # Adding the Schizophora tree in place of the MRCA in Diptera tree
        parent_node.add_child(schizophora_tree.seed_node)

        print("Subtree replaced successfully.")
    else:
        print("No MRCA found for the given tips.")
else:
    print("No tips found with common families.")
    

# Step 2.7: Encode splits and update namespace
diptera_tree.update_taxon_namespace()
diptera_tree.update_bipartitions()


MRCA found:  {'taxonomy_source': 'OTT', 'ott_id': 693712.0, 'ncbi_id': 385270.0, 'ott_version': 'ott3.5draft1', 'updated_fullname': 'Pherbellia cinerella', 'domain': 'Eukaryota', 'kingdom': 'Metazoa', 'phylum': 'Arthropoda', 'subphylum': 'Hexapoda', 'class': 'Insecta', 'subclass': 'Pterygota', 'infraclass': 'Neoptera', 'cohort': 'Holometabola', 'order': 'Diptera', 'suborder': 'Brachycera', 'infraorder': 'Muscomorpha', 'section': 'Schizophora', 'superfamily': 'Sciomyzoidea', 'family': 'Sciomyzidae'}
Subtree replaced successfully.


### Replace Oestroidea in the resulting tree

In [133]:
oestroidea_tree = pruned_trees['Oestroidea'].clone()

oestroidea_families = set()
diptera_families = set()

for node in oestroidea_tree.postorder_node_iter():
    annotations_dict = node.annotations.values_as_dict()
    if "family" in annotations_dict:
        oestroidea_families.add(annotations_dict["family"])

for node in diptera_tree.postorder_node_iter():
    if node.is_leaf():
        annotations_dict = node.annotations.values_as_dict()
        if "family" in annotations_dict:
            diptera_families.add(annotations_dict["family"])

# Step 2.2: Consider only the family names in common between both trees
common_families = oestroidea_families.intersection(diptera_families)

# Step 2.3: In the 'Diptera' tree, search for the MRCA of all tips containing these family annotations
tips_with_common_families = []

for node in diptera_tree.postorder_node_iter():
    if node.is_leaf():
        annotations_dict = node.annotations.values_as_dict()
        if "family" in annotations_dict and annotations_dict["family"] in common_families:
            tips_with_common_families.append(node)

# Step 2.4: Find the MRCA of these tips
if tips_with_common_families:
    mrca_node_in_diptera = diptera_tree.mrca(taxa=[tip.taxon for tip in tips_with_common_families])

    # Print out the MRCA node for verification (optional)
    if mrca_node_in_diptera is not None:
        print("MRCA found: ", mrca_node_in_diptera.annotations.values_as_dict())
        
        # Extract the subtree rooted at MRCA and print it
        subtree = dendropy.Tree(seed_node=copy.deepcopy(mrca_node_in_diptera))
        print("Subtree rooted at MRCA: ")
        print(subtree.print_plot())


        # Step 2.5: Replace the subtree starting at the MRCA in the 'Diptera' tree with the entire 'Oestroidea' tree
        # Getting the parent of the MRCA
        parent_node = mrca_node_in_diptera.parent_node

        # Removing the MRCA node from the Diptera tree
        parent_node.remove_child(mrca_node_in_diptera)

        # Adding the Oestroidea tree in place of the MRCA in Diptera tree
        parent_node.add_child(oestroidea_tree.seed_node)

        print("Subtree replaced successfully.")
    else:
        print("No MRCA found for the given tips.")
else:
    print("No tips found with common families.")
    

# Step 2.7: Encode splits and update namespace
diptera_tree.update_taxon_namespace()
diptera_tree.update_bipartitions()


MRCA found:  {'superfamily': 'Oestroidea'}
Subtree rooted at MRCA: 
/------------------------------------------------------- Sarcophaga_bullata    
|                                                                              
+                                    /------------------ Pollenia              
|                 /------------------+                                         
|                 |                  \------------------ Triarthria_setipennis 
\-----------------+                                                            
                  |                  /------------------ Calliphora_vomitoria  
                  \------------------+                                         
                                     \------------------ Stomorhina_subapicalis
                                                                               
                                                                               
None
Subtree replaced successfully.


### Replace Ephydroidea in the resulting tree

In [134]:
ephydroidea_tree = pruned_trees['Ephydroidea'].clone()

ephydroidea_families = set()
diptera_families = set()

for node in ephydroidea_tree.postorder_node_iter():
    annotations_dict = node.annotations.values_as_dict()
    if "family" in annotations_dict:
        ephydroidea_families.add(annotations_dict["family"])

for node in diptera_tree.postorder_node_iter():
    if node.is_leaf():
        annotations_dict = node.annotations.values_as_dict()
        if "family" in annotations_dict:
            diptera_families.add(annotations_dict["family"])

# Step 2.2: Consider only the family names in common between both trees
common_families = ephydroidea_families.intersection(diptera_families)

# Step 2.3: In the 'Diptera' tree, search for the MRCA of all tips containing these family annotations
tips_with_common_families = []

for node in diptera_tree.postorder_node_iter():
    if node.is_leaf():
        annotations_dict = node.annotations.values_as_dict()
        if "family" in annotations_dict and annotations_dict["family"] in common_families:
            tips_with_common_families.append(node)

# Step 2.4: Find the MRCA of these tips
if tips_with_common_families:
    mrca_node_in_diptera = diptera_tree.mrca(taxa=[tip.taxon for tip in tips_with_common_families])

    # Print out the MRCA node for verification (optional)
    if mrca_node_in_diptera is not None:
        print("MRCA found: ", mrca_node_in_diptera.annotations.values_as_dict())
        
        # Extract the subtree rooted at MRCA and print it
        subtree = dendropy.Tree(seed_node=copy.deepcopy(mrca_node_in_diptera))
        print("Subtree rooted at MRCA: ")
        print(subtree.print_plot())


        # Step 2.5: Replace the subtree starting at the MRCA in the 'Diptera' tree with the entire 'Ephydroidea' tree
        # Getting the parent of the MRCA
        parent_node = mrca_node_in_diptera.parent_node

        # Removing the MRCA node from the Diptera tree
        parent_node.remove_child(mrca_node_in_diptera)

        # Adding the Ephydroidea tree in place of the MRCA in Diptera tree
        parent_node.add_child(ephydroidea_tree.seed_node)

        print("Subtree replaced successfully.")
    else:
        print("No MRCA found for the given tips.")
else:
    print("No tips found with common families.")
    

# Step 2.7: Encode splits and update namespace
diptera_tree.update_taxon_namespace()
diptera_tree.update_bipartitions()


MRCA found:  {}
Subtree rooted at MRCA: 
/------------------------------------------------------ Scatella_tenuicosta    
|                                                                              
|                                              /------- Diastata_repleta       
+      /---------------------------------------+                               
|      |                                       \------- Curtonotum             
|      |                                                                       
\------+       /--------------------------------------- Cryptochetum           
       |       |                                                               
       |       |                       /--------------- Braula_coeca           
       \-------+       /---------------+                                       
               |       |               |       /------- Stegana                
               |       |               \-------+                               

### Plot the resulting synthetic tree

In [135]:
diptera_tree.print_plot()

/-------------------- Deuterophlebia_coloradensis                              
|                                                                              
/-------------------- Nymphomyia_dolichopeza                                   
|                                                                              
|                /--- Trichocera_brevicornis                                   
|                |                                                             
|/---------------+/-- Ula_elegans                                              
||               ||                                                            
||               ||/- Dactylolabis_montana                                     
||               \++                                                           
||                ||/ Hexatoma_longicornis                                     
||                |\+                                                          
||                + \ Hoplolabis_armata 

### List non-mophyletic genera

In [136]:
# create a dictionary to store all leaf nodes in the same genus
genus_dict = defaultdict(list)

# iterate over the leaf nodes in the tree
for leaf in diptera_tree.leaf_node_iter():
    genus = leaf.annotations.get_value("genus")
    genus_dict[genus].append(leaf.taxon)

# list to store non-monophyletic genera
non_monophyletic = []

# iterate over each genus
for genus, taxa in genus_dict.items():
    # get the MRCA of all taxa in the genus
    mrca = diptera_tree.mrca(taxa=taxa)
    descendants = set(mrca.leaf_nodes())
    # if all descendants of the MRCA are in the genus, then it's monophyletic
    if not all([leaf.taxon in taxa for leaf in descendants]) and genus is not None:
        non_monophyletic.append(genus)

# print all non-monophyletic genera
print("Non-monophyletic genera:", non_monophyletic)

# print tree plot for each non-monophyletic genus
for genus in non_monophyletic:
    taxa = genus_dict[genus]
    mrca = diptera_tree.mrca(taxa=taxa)
    print("MRCA for genus", genus, ":")
    subtree = dendropy.Tree(seed_node=copy.deepcopy(mrca))
    print("Subtree rooted at MRCA: ")
    print(subtree.print_plot())

Non-monophyletic genera: ['Esenbeckia', 'Mesembrinella']
MRCA for genus Esenbeckia :
Subtree rooted at MRCA: 
                 /---- Esenbeckia_Palassomyia_fascipe_Tabanidaennis            
/----------------+                                                             
|                \---- Mycteromyia_sp_1_Tabanidae                              
+                                                                              
|   /----------------- Pegasomyia_abaureus_Tabanidae                           
\---+                                                                          
    |   /------------- Esenbeckia_Esenbeckia_prasiniv_Tabanidaeentris          
    \---+                                                                      
        |    /-------- Esenbeckia_nr_lugubris_Fidena__TabanidaeLaphriomyia_sp_4
        \----+                                                                 
             |   /---- Esenbeckia_Ricardoa_delta_Tabanidae                     
          

### Fix non-monophyletic genera

There are only two non-mophyletic genera. Mesembrinella is made non-monophyletic because Mesembrinella spicata was updated to Henriquella spicata in Open Tree Taxonomy. However, a recent taxonomic treatment shows that it is currently considered Mesembrinella (https://www.biotaxa.org/Zootaxa/article/view/zootaxa.4659.1.1), and so says the phylogenetic tree. In this case, we will update the node information.

The other conflict is Esenbeckia. It seems the subgenus Palassomyia renders it non-monophyletic. In this case, we will remove this node from the tree.

Let's start by updating Mesembrinella spicata:

In [140]:
# Find the corresponding node in the tree
specific_node = diptera_tree.find_node_with_taxon_label("MES_Mesembrinella_spicata")

# Annotations to be updated
annotations_to_update = {
    "taxonomy_source": "OTT+manual",
    "updated_fullname": "Mesembrinella spicata",
    "genus": "Mesembrinella",
    "species": "Mesembrinella spicata"
}

# Update annotations for the specific node
for annotation in specific_node.annotations:
    if annotation.name in annotations_to_update:
        annotation.value = annotations_to_update[annotation.name]

# Print all annotations for the specific node to confirm the updates
for annotation in specific_node.annotations:
    print(f"{annotation.name}: {annotation.value}")


taxonomy_source: OTT+manual
ott_id: 4389190.0
ncbi_id: nan
ott_version: ott3.5draft1
updated_fullname: Mesembrinella spicata
domain: Eukaryota
kingdom: Metazoa
phylum: Arthropoda
subphylum: Hexapoda
class: Insecta
subclass: Pterygota
infraclass: Neoptera
cohort: Holometabola
order: Diptera
suborder: Brachycera
infraorder: Muscomorpha
section: Schizophora
superfamily: Oestroidea
family: Calliphoridae
subfamily: Mesembrinellinae
genus: Mesembrinella
species: Mesembrinella spicata


Now let's remove Esenbeckia_Palassomyia_fascipe_Tabanidaennis

In [138]:
# Remove the taxon from the tree
diptera_tree.prune_taxa_with_labels(["Esenbeckia_Palassomyia_fascipe_Tabanidaennis"])

Now let's check again

In [141]:
# create a dictionary to store all leaf nodes in the same genus
genus_dict = defaultdict(list)

# iterate over the leaf nodes in the tree
for leaf in diptera_tree.leaf_node_iter():
    genus = leaf.annotations.get_value("genus")
    genus_dict[genus].append(leaf.taxon)

# list to store non-monophyletic genera
non_monophyletic = []

# iterate over each genus
for genus, taxa in genus_dict.items():
    # get the MRCA of all taxa in the genus
    mrca = diptera_tree.mrca(taxa=taxa)
    descendants = set(mrca.leaf_nodes())
    # if all descendants of the MRCA are in the genus, then it's monophyletic
    if not all([leaf.taxon in taxa for leaf in descendants]) and genus is not None:
        non_monophyletic.append(genus)

# print all non-monophyletic genera
print("Non-monophyletic genera:", non_monophyletic)

# print tree plot for each non-monophyletic genus
for genus in non_monophyletic:
    taxa = genus_dict[genus]
    mrca = diptera_tree.mrca(taxa=taxa)
    print("MRCA for genus", genus, ":")
    subtree = dendropy.Tree(seed_node=copy.deepcopy(mrca))
    print("Subtree rooted at MRCA: ")
    print(subtree.print_plot())

Non-monophyletic genera: []


### List non-monophyletic families

Now let's do the same for families

In [192]:
# create a function that returns a label combining the family name and the taxon label
def label_with_family(node):
    family = node.annotations.get_value("family")
    return f"{family}_{node.taxon.label}"

# create a dictionary to store all leaf nodes in the same family
family_dict = defaultdict(list)

# iterate over the leaf nodes in the tree
for leaf in diptera_tree.leaf_node_iter():
    family = leaf.annotations.get_value("family")
    family_dict[family].append(leaf.taxon)

# list to store non-monophyletic families
non_monophyletic = []
monophyletic = []

# iterate over each family
for family, taxa in family_dict.items():
    # get the MRCA of all taxa in the family
    mrca = diptera_tree.mrca(taxa=taxa)
    descendants = set(mrca.leaf_nodes())
    # if all descendants of the MRCA are in the family, then it's monophyletic
    if not all([leaf.taxon in taxa for leaf in descendants]) and family is not None:
        non_monophyletic.append(family)
    else:
        monophyletic.append(family)

# print all non-monophyletic families
print("Non-monophyletic families:", non_monophyletic)
print("Monophyletic families:", monophyletic)


Non-monophyletic families: ['Limoniidae', 'Mycetophilidae', 'Hilarimorphidae', 'Nemestrinidae', 'Acroceridae', 'Xylomyidae', 'Stratiomyidae', 'Rhagionidae', 'Tabanidae', 'Bombyliidae', 'Mydidae', 'Asilidae', 'Scenopinidae', 'Therevidae', 'Conopidae', 'Heleomyzidae', 'Calliphoridae', 'Oestridae', 'Rhiniidae', 'Ephydridae', 'Drosophilidae', 'Empididae', 'Hybotidae', 'Dolichopodidae']
Monophyletic families: ['Deuterophlebiidae', 'Nymphomyiidae', 'Trichoceridae', 'Pediciidae', 'Cylindrotomidae', 'Tipulidae', 'Ptychopteridae', 'Blephariceridae', 'Tanyderidae', 'Psychodidae', 'Dixidae', 'Chaoboridae', 'Culicidae', 'Ceratopogonidae', 'Chironomidae', 'Thaumaleidae', 'Simuliidae', 'Perissommatidae', 'Anisopodidae', 'Synneuridae', 'Scatopsidae', 'Axymyiidae', 'Bibionidae', 'Pachyneuridae', 'Diadocidiidae', 'Keroplatidae', 'Sciaridae', 'Lygistorrhinidae', 'Cecidomyiidae', None, 'Pantophthalmidae', 'Xylophagaidae', 'Vermileonidae', 'Austroleptidae', 'Pelecorhynchidae', 'Oreoleptidae', 'Athericidae

Several families are non-monophyletic. Let's not fix this now. Let's just keep this in mind when building constraints

### Save tree as NEXML

In [143]:
with open('../constraint_trees/result/all_taxa_merged.nexml', 'w') as output_file:
    diptera_tree.write(file=output_file, schema="nexml")

## Prepare a genus-level constraint tree

Since genus will be our unit of analysis, let's collapse all genera and rename tips to just their genus names and ncbi IDs. We will also remove genera without NCBI IDS

In [162]:
# Clone the original tree
cloned_tree = diptera_tree.clone()

# Iterate over the leaf nodes of the cloned tree
for leaf in cloned_tree.leaf_node_iter():
    # Get the genus annotation for the leaf node
    genus = leaf.annotations.get_value("genus")
    # If the genus annotation does not exist or is None, prune the leaf node
    if genus is None:
        cloned_tree.prune_subtree(leaf)

# Now, the cloned_tree should only contain leaves that have a non-null genus annotation


In [163]:
# Function to check if all elements in a list are equal
def all_equal(iterable):
    iterator = iter(iterable)
    try:
        first = next(iterator)
    except StopIteration:
        return True
    return all(first == rest for rest in iterator)

# Step 1: create a list of all genera listed in the "genus" attribute of leaves in the tree.
genus_list = list(set([leaf.annotations.get_value("genus") for leaf in cloned_tree.leaf_node_iter() if leaf.annotations.get_value("genus") is not None]))

# Step 2: Loop through genera
for genus in genus_list:
    # Step 3: Find leaves with this genus
    leaves = [leaf for leaf in cloned_tree.leaf_node_iter() if leaf.annotations.get_value("genus") == genus]
    
    
    if len(leaves) == 1:
        leaf = leaves[0]
        leaf.annotations.drop(name="species")
        continue
    

    else:
        mrca = cloned_tree.mrca(taxon_labels=[leaf.taxon.label for leaf in leaves])
        
        # Clear all annotations of this MRCA
        mrca.annotations.clear()

        # Get the unique set of annotations across all leaves for the current genus
        annotation_keys = set().union(*[leaf.annotations for leaf in leaves])

        # For each unique annotation, replace it at the MRCA only if it has the same value in all tips
        for key in annotation_keys:
            # Get the values of the current annotation for all leaves
            annotation_values = [leaf.annotations.get_value(key) for leaf in leaves]

            # If the annotation has the same value in all leaves, set it for the MRCA
            if all_equal(annotation_values):
                mrca.annotations.add_new(key, annotation_values[0])

        # Prune all leaves under the MRCA, making it a new leaf
        for leaf in leaves:
            cloned_tree.prune_subtree(leaf)


In [179]:
to_keep = []
for leaf in cloned_tree.leaf_node_iter():
    if not np.isnan(leaf.annotations.get_value("ncbi_id")):
        leaf.taxon.label = leaf.annotations.get_value("genus") + "_" + str(int(leaf.annotations.get_value("ncbi_id")))
        to_keep.append(leaf.taxon.label)
        
genus_tree = cloned_tree.extract_tree_with_taxa_labels(to_keep)

In [180]:
genus_tree.print_plot()

/--------------------------------------------------- Deuterophlebia_560720     
|                                                                              
|/-------------------------------------------------- Nymphomyia_560767         
||                                                                             
||                                       /---------- Trichocera_560761         
||                                       |                                     
|| /-------------------------------------+ /-------- Ula_560789                
+| |                                     | |                                   
|| |                                     | |    /--- Dactylolabis_560718       
|| |                                     \-+/---+                              
|| |                                       ||   | /- Hexatoma_560713           
|| |                                       ||   \-+                            
|| |                                    

Next:

- Check overlap of our dataset at the genus level
- If there is good overlap, specifiy constraints as a nexus file

In [187]:
# define the path
path = Path('../') / 'alignments' / 'aligned_trimmed_cleaned'

# get all files
all_files = list(path.glob("*"))

# filter out the files that include "updated" in the name
files_without_updated = [file for file in all_files if "updated" not in file.name]

# Create an empty set to store sequence names
sequence_names = set()

# Loop over each file
for file in files_without_updated:
    with open(file, 'r') as f:
        for line in f:
            # If the line starts with '>', it's a sequence name
            if line.startswith('>'):
                # Add sequence name to the set (remove '>' and newline character)
                sequence_names.add(line[1:].strip())

In [189]:
# Get leaf labels from the tree
tree_labels = {leaf.taxon.label for leaf in genus_tree.leaf_node_iter()}

# Compute the overlap
overlap_seq_tree = sequence_names & tree_labels
overlap_tree_seq = tree_labels & sequence_names

# Calculate the percentage
percentage_seq_in_tree = len(overlap_seq_tree) / len(sequence_names) * 100
percentage_tree_in_seq = len(overlap_tree_seq) / len(tree_labels) * 100

print(f"{percentage_seq_in_tree}% of sequence names are in the tree.")
print(f"{percentage_tree_in_seq}% of tree labels are in the sequence names.")

2.3985239852398523% of sequence names are in the tree.
9.774436090225564% of tree labels are in the sequence names.
