# Analyze cladistic properties of taxonomic terms given a tree

This notebook assigns one of three labels to each taxonomic term, given a reference tree: "single" (one tip), "monophyletic", or "non-monophyletic" (paraphyletic or polyphyletic). In the code these labels are 'uni', 'mono' and 'poly', respectively.

Dependencies

In [1]:
import pandas as pd
from skbio import TreeNode

Custom function

In [2]:
def cladistic(tree, taxa, classified=None):
    """Determines the cladistic property of the given taxon set.

    Parameters
    ----------
    tree : skbio.TreeNode
        reference tree
    taxa : iterable of str
        taxa (tip names)
    classified : iterable of str
        (optional) classified taxa at the rank

    Returns
    -------
    str
        'uni' if input taxon is a single tip in given tree
        'mono' if input taxa are monophyletic in given tree
        'poly' if input taxa are polyphyletic in given tree


    Notes
    -----
    In the following tree example:
                                  /-a
                        /--------|
                       |          \-b
              /--------|
             |         |          /-c
             |         |         |
             |          \--------|--d
    ---------|                   |
             |                    \-e
             |
             |                    /-f
             |          /--------|
              \--------|          \-g
                       |
                        \-h
    ['a'] returns 'uni'
    ['c', 'd', 'e'] returns 'mono'
    ['a', 'c', 'f'] returns 'poly'
    ['f', 'h'] returns 'poly'

    Paraphyly is not distinguished from polyphyly by this function; both
    return 'poly'.

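    If "classified" is provided, tips that are unclassified at the rank (not
    in "classified") are ignored when determining the cladistic property. For
    example, in this tree, tips are labeled by taxon and one tip is
    unclassified (blank):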
    
                        /-a
              /--------|
             |          \--
    ---------|
             |          /-a
              \--------|
                        \-a
    
    
    The three tips of taxon 'a' return 'mono' instead of 'poly'.

    Raises
    ------
    ValueError
        if one or more taxon names are not present in the tree
    """
    tips = []
    taxa = set(taxa)
    for tip in tree.tips():
        if tip.name in taxa:
            tips.append(tip)
    n = len(taxa)
    if len(tips) < n:
        raise ValueError('Taxa not found in the tree.')
    if n == 1:
        return 'uni'
    else:
        subset = tree.lca(tips).subset()
        if len(subset) == n:
            return 'mono'
        elif classified is not None:
            if (subset - taxa).intersection(classified):
                return 'poly'
            else:
                return 'mono'
        else:
            return 'poly'
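As a rough illustration of the logic above, the sketch below re-implements the same check on the docstring's example tree without scikit-bio. The nested-tuple tree and the helper names (`tip_set`, `smallest_clade`, `cladistic_sketch`) are assumptions made for this example only; they are not part of the notebook or of scikit-bio.

```python
# Hypothetical stand-in for the reference tree: tuples are clades, strings are tips.
TREE = ((('a', 'b'), ('c', 'd', 'e')), (('f', 'g'), 'h'))


def tip_set(node):
    """Return the set of tip names under a node."""
    if isinstance(node, str):
        return {node}
    tips = set()
    for child in node:
        tips |= tip_set(child)
    return tips


def smallest_clade(node, taxa):
    """Return the tip set of the smallest clade that contains all taxa."""
    if not isinstance(node, str):
        for child in node:
            if taxa <= tip_set(child):
                return smallest_clade(child, taxa)
    return tip_set(node)


def cladistic_sketch(tree, taxa, classified=None):
    """Mirror the uni / mono / poly decision on the toy tree."""
    taxa = set(taxa)
    if not taxa <= tip_set(tree):
        raise ValueError('Taxa not found in the tree.')
    if len(taxa) == 1:
        return 'uni'
    # Tips that share the smallest enclosing clade but are not in the taxon.
    extra = smallest_clade(tree, taxa) - taxa
    if classified is not None:
        # Only tips classified at the rank count against monophyly.
        extra &= set(classified)
    return 'poly' if extra else 'mono'


for query in (['a'], ['c', 'd', 'e'], ['a', 'c', 'f'], ['f', 'h']):
    print(query, cladistic_sketch(TREE, query))
```

Passing `classified=['f', 'h']` for the query `['f', 'h']` treats tip `g` as unclassified, so the sketch reports 'mono' rather than 'poly'.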

Read input tree

In [3]:
tree_fp = '../trees/astral.nwk'

In [4]:
tree = TreeNode.read(tree_fp)
tree.count(tips=True)

10575

Read taxonomy table

In [5]:
dft = pd.read_table('rank_names.tsv', index_col=0)
dft.head()

genome,kingdom,phylum,class,order,family,genus,species
G000005825,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Bacillus,Bacillus pseudofirmus
G000006175,Archaea,Euryarchaeota,Methanococci,Methanococcales,Methanococcaceae,Methanococcus,Methanococcus voltae
G000006605,Bacteria,Actinobacteria,Actinobacteria,Corynebacteriales,Corynebacteriaceae,Corynebacterium,Corynebacterium jeikeium
G000006725,Bacteria,Proteobacteria,Gammaproteobacteria,Xanthomonadales,Xanthomonadaceae,Xylella,Xylella fastidiosa
G000006745,Bacteria,Proteobacteria,Gammaproteobacteria,Vibrionales,Vibrionaceae,Vibrio,Vibrio cholerae


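Whether unclassified taxa at the rank count against monophyly. With strict = True, any tip in the clade that does not belong to the taxon, classified or not, makes the taxon 'poly'. With strict = False, tips unclassified at the rank are ignored.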

In [6]:
strict = True
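The difference between the two modes can be sketched with plain sets. The genome IDs and variable names below are made up for illustration; `clade` stands for the tips under the taxon's lowest common ancestor.

```python
# Hypothetical example: a clade of four tips, three of which belong to the taxon.
clade = {'G1', 'G2', 'G3', 'G4'}
members = {'G1', 'G2', 'G3'}
# Genomes that have a name at this rank; G4 is unclassified here.
classified = {'G1', 'G2', 'G3'}

# Tips in the clade that are not members of the taxon.
intruders = clade - members

# Strict mode: every intruder breaks monophyly.
strict_label = 'poly' if intruders else 'mono'

# Relaxed mode: only intruders classified at the rank break monophyly.
relaxed_label = 'poly' if intruders & classified else 'mono'

print(strict_label, relaxed_label)
```

In this toy case the only intruder is unclassified, so the strict label is 'poly' and the relaxed label is 'mono'.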

In [7]:
%%time
dfc = pd.DataFrame()
columns = ['rank', 'taxon', 'num', 'cladistic']
for rank in dft.columns:
    # genome -> taxon at this rank, skipping unclassified genomes
    g2taxon = dft[rank].dropna().to_dict()
    # invert to taxon -> list of genomes
    taxon2gs = {}
    for g, taxon in g2taxon.items():
        taxon2gs.setdefault(taxon, []).append(g)
    data = []
    for taxon, gs in taxon2gs.items():
        # strict mode passes None, so every extra tip in the clade counts
        clad = cladistic(tree, gs, None if strict else g2taxon.keys())
        data.append([rank, taxon, len(gs), clad])
    dfc = pd.concat([dfc, pd.DataFrame(
        data, columns=columns).sort_values(by=['taxon'])])
dfc.head()

CPU times: user 3min 9s, sys: 372 ms, total: 3min 10s
Wall time: 3min 10s
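The grouping step inside the loop (genome-to-taxon mapping inverted into taxon-to-genomes) can be tried on its own with plain dictionaries. The genome IDs and the taxon assignments below are invented for illustration, and the NaN filter stands in for the `dropna()` call used in the notebook.

```python
import math

# Hypothetical genome -> taxon mapping for one rank; NaN marks an unclassified genome.
genome_to_taxon = {
    'G1': 'Bacillus',
    'G2': 'Vibrio',
    'G3': 'Bacillus',
    'G4': math.nan,
}

# Keep only genomes that have a name at this rank.
g2taxon = {g: t for g, t in genome_to_taxon.items() if isinstance(t, str)}

# Invert the mapping: taxon -> list of genomes assigned to it.
taxon2gs = {}
for g, taxon in g2taxon.items():
    taxon2gs.setdefault(taxon, []).append(g)

print(taxon2gs)
```

Each list in `taxon2gs` is what gets passed to `cladistic` as the taxon set, and the keys of `g2taxon` are the classified genomes used in relaxed mode.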


In [8]:
dfc.to_csv('cladistics.%s.tsv' % ('strict' if strict else 'relax'),
           sep='\t', index=False)