# Prepare tree rendering package

This notebook manipulates a tree and generates rendering packages. It does the following:

 - Collapse a tree at given taxonomic rank(s) based on several criteria.
 - Generate a color gradient for branch support values.
 - Generate files that can be directly parsed and rendered using [**iTOL**](https://itol.embl.de/) and [**FigTree**](http://tree.bio.ed.ac.uk/software/figtree/).

## Preparation

### Dependencies

In [1]:
import re
import numpy as np
import pandas as pd

In [2]:
from skbio import TreeNode

### Input files

Tree file (with node IDs, and without support values)

In [3]:
tree_fp = '../trees/release/astral.cons.nwk'

Taxonomic information file (original or tax2tree-curated)

In [4]:
taxonomy_fp = '../taxonomy/tax2tree/ncbi/astral/filled_ranks.tsv'

Custom node attributes (bootstrap, estimated time range, metadata category, additional name, etc.)

In [5]:
custom_attrs_fps = {
    'lpp': '../trees/release/supports/astral.txt'}

### Parameters

Collapse the tree from this rank up. For example, "class" will have the tree collapsed at class (if possible) or phylum. Leave empty or None if not needed.

In [6]:
# collapse_rank = None
collapse_rank = 'genus'

Determine the visual length of a collapsed clade (triangle or line). Options are: mean, std (don't use), min, 25%, 50% (median), 75% and max.

In [7]:
collapse_length_func = '50%'

Clades with fewer descendants than this threshold will not be collapsed. Either a fixed number or a rank-to-number dictionary (e.g., phylum = 1, class = 10). Set to 0 if not needed.

In [8]:
# min_clade_size = 0
# for full-scale (10k-taxon) trees
min_clade_size = {'kingdom': 1, 'phylum': 1, 'class': 5, 'order': 50, 'family': 50, 'genus': 50, 'species': 50}
# for 1k-taxon trees
# min_clade_size = {'kingdom': 1, 'phylum': 1, 'class': 1, 'order': 5, 'family': 10, 'genus': 10, 'species': 10}
# for class / phylum only:
# min_clade_size = {'kingdom': 1, 'phylum': 1, 'class': 10, 'order': 0, 'family': 0, 'genus': 0, 'species': 0}
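A minimal sketch of how such a threshold could be looked up per rank, mirroring the filtering step later in this notebook. The helper name `size_threshold` and the candidate clades are hypothetical and only serve as illustration.

```python
def size_threshold(min_clade_size, rank):
    """Return the minimum clade size for a rank (hypothetical helper)."""
    if isinstance(min_clade_size, dict):
        return min_clade_size[rank]
    return min_clade_size  # a fixed number applies to every rank


# per-rank thresholds, as in the full-scale setting above
thresholds = {'kingdom': 1, 'phylum': 1, 'class': 5, 'order': 50,
              'family': 50, 'genus': 50, 'species': 50}

# made-up candidate clades: (rank, number of descendants)
candidates = [('phylum', 3), ('class', 4), ('genus', 120)]

# keep only clades that reach the threshold of their rank
kept = [(rank, size) for rank, size in candidates
        if size >= size_threshold(thresholds, rank)]
print(kept)
```

With a fixed number instead of a dictionary, the same value is applied regardless of rank.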

Among split clades of the same taxon (e.g., `Firmicutes_1`, `Firmicutes_2`), those smaller than this *fraction* of the dominant clade will not be collapsed. For example, if `Firmicutes_1` has 1000 tips and `Firmicutes_10` has 45 tips (< 1000 * 5% = 50), then `Firmicutes_10` will not be collapsed.


In [9]:
# min_split_clade_frac = 0
min_split_clade_frac = 0.05
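The arithmetic behind this filter can be sketched as follows. This is an illustration in plain Python rather than the pandas code used later in the notebook; the clade sizes and the helper `split_suffix` are made up, and the clade with suffix `_1` is assumed to be the dominant one.

```python
min_split_clade_frac = 0.05

# made-up clade sizes (number of tips) for one split taxon
clade_sizes = {'Firmicutes_1': 1000, 'Firmicutes_2': 120, 'Firmicutes_10': 45}


def split_suffix(name):
    """Split 'Firmicutes_10' into ('Firmicutes', '10') (hypothetical helper)."""
    taxon, _, idx = name.rpartition('_')
    return taxon, idx


# size of the dominant clade (suffix "_1") of each taxon
dominant = {split_suffix(k)[0]: v for k, v in clade_sizes.items()
            if split_suffix(k)[1] == '1'}

# keep split clades that reach the given fraction of the dominant clade
kept = sorted(k for k, v in clade_sizes.items()
              if v >= dominant[split_suffix(k)[0]] * min_split_clade_frac)
print(kept)
```

Here the cutoff is 1000 * 0.05 = 50 tips, so `Firmicutes_10` (45 tips) is dropped while the other two clades remain.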

Whether to delete tips not belonging to any collapsed clades.

In [10]:
delete_uncollapsed = True

Whether to hide uncollapsed tip names. Effective when `delete_uncollapsed` is `False`.

In [11]:
hide_uncollapsed = True

Manipulate node labels using the following regular expressions (pairs of pattern and replacement).

In [12]:
label_format_regexes = [
    (r'^Candidatus ', r'Ca. '),
    (r'^candidate division ', r'Ca. ')
]
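As an illustration, the pattern and replacement pairs could be applied in order to each label like this. The helper `format_label` is hypothetical, and the labels are only examples.

```python
import re

label_format_regexes = [
    (r'^Candidatus ', r'Ca. '),
    (r'^candidate division ', r'Ca. ')
]


def format_label(label, regexes):
    """Apply each (pattern, replacement) pair in order (hypothetical helper)."""
    for pattern, repl in regexes:
        label = re.sub(pattern, repl, label)
    return label


# illustrative labels; the last one matches no pattern and stays unchanged
examples = ['Candidatus Pelagibacter', 'candidate division TM7', 'Bacillus']
formatted = [format_label(x, label_format_regexes) for x in examples]
print(formatted)
```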

Append rank code to taxon (e.g., `Bacteria` => `k__Bacteria`).

In [13]:
append_rank_code = True

Append clade size to taxon.

In [14]:
append_clade_size = True

Low and high end of color gradient.

In [15]:
color_range = ('#f0f0f0', '#191919')  # gray
# color_range = ('#deebf7', '#3182bd')  # blue

### Helpers

Basic utilities

In [16]:
def sort_node_ids(d):
    """Sort names of tips and internal nodes."""
    return sorted(d, key=lambda x: (x[0], int(x[1:])))

In [17]:
def digits(num):
    """Get number digits after decimal point."""
    if not num.replace('.', '').isdigit() or num.count('.') != 1:
        raise ValueError('Not a valid float number: %s' % num)
    return len(num.split('.')[1])

In [18]:
def de_suffix(taxon, names):
    """Restore suffixed taxon name."""
    if '_' not in taxon:
        return taxon
    res = '_'.join(taxon.split('_')[:-1])
    return res if res in names else taxon

Node dimension calculation

In [19]:
def get_clade_dimensions(node):
    """Calculate the dimensions of a clade.

    Parameters
    ----------
    node : skbio.TreeNode
        clade to calculate

    Returns
    -------
    pd.Series
        count, mean, std, min, 25%, 50%, 75%, max
    """
    lengths = pd.Series(x.accumulate_to_ancestor(node) for x in node.tips())
    return lengths.describe()

Selective tree shearing and pruning

In [20]:
def selective_prune(tree, tips_to_keep, nodes_to_keep=[]):
    """Shear a tree and selectively prune it.

    Parameters
    ----------
    tree : skbio.TreeNode
        tree to shear
    tips_to_keep : iterable of str
        tip names to keep
    nodes_to_keep : iterable of str
        internal node names to keep

    Returns
    -------
    tree : skbio.TreeNode
        resulting tree

    Notes
    -----
    Inherited from scikit-bio's `shear` and  `prune` functions, but will
    selectively remove internal nodes.
    """
    tcopy = tree.deepcopy()

    ids = set(tips_to_keep)
    marked = set()
    for tip in tcopy.tips():
        if tip.name in ids:
            marked.add(tip)
            for anc in tip.ancestors():
                if anc in marked:
                    break
                else:
                    marked.add(anc)

    for node in list(tcopy.traverse()):
        if node not in marked:
            node.parent.remove(node)

    ids = set(nodes_to_keep)
    nodes_to_remove = []
    for node in tcopy.traverse(include_self=False):
        if len(node.children) == 1:
            if node.name not in ids:
                nodes_to_remove.append(node)

    for node in nodes_to_remove:
        child = node.children[0]

        if child.length is None or node.length is None:
            child.length = child.length or node.length
        else:
            child.length += node.length

        if node.parent is None:
            continue

        node.parent.append(child)
        node.parent.remove(node)

    return tcopy

Newick string formatting

In [21]:
def format_newick(tree, operators=',:_;()[] ', digits=None):
    """Generate a Newick string from a tree.

    Parameters
    ----------
    tree : skbio.TreeNode
        tree to convert to a Newick string
    operators : str
        list of characters that have special meaning in a tree file so that
        a node name containing any of them must be quoted
    digits : int or tuple of (int, int)
        number of digits (float and scientific) to print in a branch length

    Returns
    -------
    str
        formatted Newick string

    Notes
    -----
    Modified from scikit-bio's `_tree_node_to_newick`. In addition to the
    prototype, it can do:
    
    1. Keep spaces without converting them to underscores.
    2. Print branch lengths based on given precision.
    """
    res = ''
    operators = set(operators or '')
    if isinstance(digits, int):
        digits = (digits, digits) 
    current_depth = 0
    nodes_left = [(tree, 0)]
    while len(nodes_left) > 0:
        entry = nodes_left.pop()
        node, node_depth = entry
        if node.children and node_depth >= current_depth:
            res += '('
            nodes_left.append(entry)
            nodes_left += ((child, node_depth + 1) for child in
                           reversed(node.children))
            current_depth = node_depth + 1
        else:
            if node_depth < current_depth:
                res += ')'
                current_depth -= 1
            if node.name:
                escaped = "%s" % node.name.replace("'", "''")
                if any(t in operators for t in node.name):
                    res += "'"
                    res += escaped
                    res += "'"
                else:
                    res += escaped
            if node.length is not None:
                res += ':'
                length = str(node.length)
                if digits:
                    # digits = (float, scientific), as stated in the docstring
                    length = '%.*g' % ((digits[1] if 'e' in length
                                        else digits[0]), node.length)
                res += length

            if nodes_left and nodes_left[-1][1] == current_depth:
                res += ','
    return res + ';'

Color gradient generation

In [22]:
def hex2rgb(h):
    return tuple(int(h.lstrip('#')[i: i + 2], 16) for i in (0, 2, 4))

In [23]:
def rgb2hex(r):
    return '#{:02x}{:02x}{:02x}'.format(r[0], r[1], r[2])

In [24]:
def make_color_palette(start, end, n=101):
    """Generate a gradient of n colors (101 by default).

    Parameters
    ----------
    start : str
        start color in hex format
    end : str
        end color in hex format
    n : int
        number of colors to return

    Returns
    -------
    list of str
        colors in hex format
    """
    start_, end_ = hex2rgb(start), hex2rgb(end)
    seqs = [np.linspace(start_[i], end_[i], n).astype(int) for i in range(3)]
    rgbs = [[seqs[x][i] for x in range(3)] for i in range(n)]
    return [rgb2hex(x) for x in rgbs]
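The interpolation idea can be sketched without NumPy. This is a standard-library illustration, not the function above: the name `interpolate_hex` is hypothetical, it requires `n >= 2`, and for large `n` its floating-point rounding may differ slightly from `np.linspace`.

```python
def interpolate_hex(start, end, n):
    """Linearly interpolate n colors between two hex colors (stdlib sketch)."""
    def to_rgb(h):
        h = h.lstrip('#')
        return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

    s, e = to_rgb(start), to_rgb(end)
    colors = []
    for i in range(n):
        frac = i / (n - 1)
        # truncate toward zero, as astype(int) does for non-negative values
        rgb = [int(s[c] + (e[c] - s[c]) * frac) for c in range(3)]
        colors.append('#{:02x}{:02x}{:02x}'.format(*rgb))
    return colors


# three-step gray gradient between the two ends of color_range
palette = interpolate_hex('#f0f0f0', '#191919', 3)
print(palette)
```

The middle color comes from 240 + (25 - 240) * 0.5 = 132.5, truncated to 132 (hex `84`) in each channel.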

In [25]:
def make_color_gradient(node2val, colors):
    """Map node values to colors on a gradient.

    Parameters
    ----------
    node2val : dict of float or int
        node ID to value map
    colors : list of str
        101 colors for values of 0 to 100

    Returns
    -------
    dict of str
        node ID to color map
    """
    for id_, val in node2val.items():
        # test for empty string before np.isnan, which fails on str
        if val is None or val == '' or np.isnan(val):
            node2val[id_] = 0
        elif not isinstance(val, int) and not isinstance(val, float):
            raise ValueError('Invalid number %s.' % val)

    # shrink larger integers to 0-100 range
    max_val = max(node2val.values())
    if max_val > 100:
        for id_ in node2val:
            node2val[id_] /= (max_val / 100)

    # convert fraction into percentage, and percentage to integer
    convert = True if max_val <= 1 else False
    for id_ in node2val:
        try:
            node2val[id_] = (
                int(node2val[id_] * 100) if convert else int(node2val[id_]))
        except ValueError:
            print('%s' % id_)

    # map support to color
    return {k: colors[v] for k, v in node2val.items()}

iTOL file generation

In [26]:
def write_itol_label(f, id2label):
    """Generate iTOL node label file."""
    f.write('LABELS\n')
    f.write('SEPARATOR TAB\n')
    f.write('DATA\n')
    for id_ in sort_node_ids(id2label):
        f.write('%s\t%s\n' % (id_, id2label[id_]))

In [27]:
def write_itol_collapse(f, nodes_to_collapse):
    """Generate an iTOL collapse file.

    Parameters
    ----------
    nodes_to_collapse : iterable of str
        node IDs to collapse
    f : file handle
        file to write collapse information
    """
    f.write('COLLAPSE\n')
    f.write('DATA\n')
    for id_ in nodes_to_collapse:
        f.write('%s\n' % id_)

In [28]:
def write_itol_tree_colors(f, id2color, target='branch',
                           label_or_style='normal', size='1'):
    """Generate an iTOL tree colors file.

    Parameters
    ----------
    f : file handle
        file to write tree colors
    id2color : dict of str
        node ID to color map
    target, label_or_style, size : str or dict of str
        iToL flavors, either a fixed value or a node ID to value map
        target == "type" in iTOL jargon
    """
    f.write('TREE_COLORS\n')
    f.write('SEPARATOR TAB\n')
    f.write('DATA\n')
    # format: ID, target, color, label_or_style, size_factor
    for id_ in sort_node_ids(id2color):
        f.write('%s\t%s\t%s\t%s\t%s\n' % (
            id_, target[id_] if isinstance(target, dict) else target,
            id2color[id_], label_or_style[id_] if isinstance(
                label_or_style, dict) else label_or_style,
            size[id_] if isinstance(size, dict) else size))

In [29]:
def write_itol_dataset_text(f, title, id2text, position='0.5',
                            color='#000000', style='normal',
                            size='1', rotation='0'):
    """Generate an iTOL text dataset file.

    Parameters
    ----------
    f : file handle
        file to write node texts
    title : str
        title of this dataset
    id2text : dict of str or float
        node ID to text map
    position, color, style, size, rotation : str or dict of str
        iToL flavors, either a fixed value or a node ID to value map
    """
    f.write('DATASET_TEXT\n')
    f.write('SEPARATOR TAB\n')
    f.write('DATASET_LABEL\t%s\n' % title)
    f.write('SHOW_INTERNAL\t1\n')
    f.write('DATA\n')
    # format: ID, label, position, color, style, size_factor, rotation
    for id_ in sort_node_ids(id2text):
        text = id2text[id_]
        if isinstance(text, float):
            text = '%.3g' % text
        f.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\n' % (
            id_,
            text,
            position[id_] if isinstance(position, dict) else position,
            color[id_] if isinstance(color, dict) else color,
            style[id_] if isinstance(style, dict) else style,
            size[id_] if isinstance(size, dict) else size,
            rotation[id_] if isinstance(rotation, dict) else rotation))

In [30]:
def write_itol_dataset_style(f, title, ids, target='branch', what='node',
                             color='#000000', factor='1', style='normal',
                             bgcolor=None):
    """Generate an iTOL style dataset file.

    Parameters
    ----------
    f : file handle
        file to write node texts
    title : str
        title of this dataset
    ids : iterable of str
        node ID list
    target, what, color, factor, style, bgcolor : str or dict of str
        iToL flavors, either a fixed value or a node ID to value map
    """
    f.write('DATASET_STYLE\n')
    f.write('SEPARATOR TAB\n')
    f.write('DATASET_LABEL\t%s\n' % title)
    f.write('COLOR\t#000000\n')
    f.write('DATA\n')
    # format: ID, target, what, color, factor, style, bgcolor
    for id_ in sort_node_ids(ids):
        f.write('%s\t%s\t%s\t%s\t%s\t%s' % (
            id_,
            target[id_] if isinstance(target, dict) else target,
            what[id_] if isinstance(what, dict) else what,
            color[id_] if isinstance(color, dict) else color,
            factor[id_] if isinstance(factor, dict) else factor,
            style[id_] if isinstance(style, dict) else style))
        if bgcolor is not None:
            # parenthesize so that the tab is written in both cases
            f.write('\t%s' % (
                bgcolor[id_] if isinstance(bgcolor, dict) else bgcolor))
        f.write('\n')

FigTree file generation

In a FigTree-compatible Nexus tree file, nodes (tips and internal nodes) and taxa may contain attributes in the following format:
```
(taxon1,taxon2)[&!name="Escherichia coli",support=90,range={80,95},!color=#ff0000]:1.234,...
```

Here "!name", "support", "range" and "!color" are the attributes.
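A minimal sketch of how such an annotation could be assembled from a dictionary of attributes. The helper `figtree_attr_str` is hypothetical and simplified; it sorts attributes by name, so the order differs from the example above.

```python
def figtree_attr_str(attrs):
    """Join attributes into a FigTree-style annotation (illustrative helper)."""
    if not attrs:
        return ''
    pairs = ('%s=%s' % (key, val) for key, val in sorted(attrs.items()))
    return '[&%s]' % ','.join(pairs)


# values are pre-formatted strings: text is double-quoted, tuples use braces
attrs = {
    '!name': '"Escherichia coli"',
    'support': '90',
    'range': '{80,95}',
    '!color': '#ff0000',
}
annotation = figtree_attr_str(attrs)
newick_fragment = '(taxon1,taxon2)%s:1.234' % annotation
print(newick_fragment)
```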

In [31]:
def make_figtree_attr_str(name, attr_db, attrs=None):
    """Generate a FigTree-compatible attribute string.

    Parameters
    ----------
    name : str
        name of node or taxon to annotate
    attr_db : dict of dict of str
        map of names to attributes
    attrs : dict of str, optional
        pre-defined attributes to include (the input dict is not modified)

    Returns
    -------
    str
        formatted attribute string

    Notes
    -----
    For example, attr_db = {
        '!name': {'N1': '"spA"', 'N2': '"spB"'...},
        'support': {'N1': '90', 'N2': '75'...}
    }

    For node "N2", the result will be `[&!name="spB",support=75]`.

    All values should be str. Strings should be double-quoted. Tuples should be
    written like `{"spA",0.95,#000000}`. Special FigTree-aware attributes such
    as "name", "color" and "collapse" should have a prefix `!`.
    """
    # work on a copy: a mutable default argument would be shared across
    # calls and leak attributes from one node or taxon to the next
    attrs = dict(attrs) if attrs else {}
    for attr in attr_db:
        if name in attr_db[attr]:
            val = attr_db[attr][name]
            if val:  # omit null or empty string
                attrs[attr] = val
    attr_strs = []
    for attr, val in sorted(attrs.items(), key=lambda x: x[0]):
        attr_strs.append('%s=%s' % (attr, val))
    return '[&%s]' % ','.join(attr_strs) if len(attr_strs) > 0 else ''

In [32]:
def add_figtree_node_attrs(tree, node2attrs):
    """Add FigTree-compatible attributes to nodes of a tree.

    Parameters
    ----------
    tree : skbio.TreeNode
        tree to add node attributes to
    node2attrs : dict of dict
        map of node names to attributes
    """
    for node in tree.traverse(include_self=True):
        if not node.name:
            continue
        attrs = {} if node.is_tip() else {'id': '"%s"' % node.name}
        attr_str = make_figtree_attr_str(node.name, node2attrs, attrs)
        node.name = ('%s%s' % (node.name, attr_str) if node.is_tip()
                     else attr_str)

In [33]:
def add_figtree_taxon_attrs(tree, taxon2attrs):
    """Add FigTree-compatible attributes to the taxon labels.

    Parameters
    ----------
    tree : skbio.TreeNode
        tree to add node attributes to
    taxon2attrs : dict of dict
        map of taxa to attributes

    Returns
    -------
    list of str
        taxon labels with attributes appended
    """
    res = []
    for taxon in sorted(tree.subset()):
        attr_str = make_figtree_attr_str(taxon, taxon2attrs)
        res.append('%s%s' % (taxon, attr_str))
    return res

In [34]:
def write_figtree_nexus(tree, f, title='tree1', taxlabels=None):
    """Generate a FigTree-compatible Nexus tree file.

    Parameters
    ----------
    tree : skbio.TreeNode
        tree to add node attributes to
    f : file handle
        file to write nexus tree
    title : str
        title of the tree
    taxlabels : list of str
        custom taxon labels to write
    """
    f.write('#NEXUS\n')
    f.write('begin taxa;\n')
    f.write('\tdimensions ntax=%d;\n' % tree.count(tips=True))
    f.write('\ttaxlabels\n')
    if taxlabels is None:
        taxlabels = sorted(tree.subset())
    for taxon in taxlabels:
        f.write('\t%s\n' % taxon)
    f.write(';\n')
    f.write('end;\n')
    f.write('\n')
    f.write('begin trees;\n')
    f.write('\ttree %s = [&%s] ' % (
        title, 'R' if len(tree.children) == 2 else 'U'))
    f.write(format_newick(tree, operators=None))
    f.write('\n')
    f.write('end;\n')

## Pre-processing

### Read and process tree

Read tree.

In [35]:
tree = TreeNode.read(tree_fp)
n, m = tree.count(), tree.count(tips=True)
print('Tree has %d tips and %d internal nodes.' % (m, n - m))

Tree has 10575 tips and 10574 internal nodes.


In [36]:
tips = tree.subset()

Convert null branch lengths to zero.

In [37]:
for node in tree.traverse(include_self=False):
    node.length = node.length or 0.0

Get the precision (maximum number of digits in float or scientific notation) of branch lengths. This is useful for correctly formatting branch lengths after collapsing the tree.

In [38]:
max_f, max_e = 0, 0
for node in tree.traverse():
    if node.length is not None:
        x = str(float(node.length))
        if 'e' in x:
            max_e = max(max_e, digits(str(float(x.split('e')[0]))))
        else:
            max_f = max(max_f, digits(x))
max_f, max_e

(10, 5)
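A short sketch of how these two precisions could be used with the `%.*g` format. The helper `format_length` is hypothetical, and the branch lengths are made up. Note that `%g` counts significant digits rather than digits after the decimal point.

```python
max_f, max_e = 10, 5  # precisions reported above


def format_length(length, max_f, max_e):
    """Format a branch length with %g-style precision (hypothetical helper)."""
    precision = max_e if 'e' in str(length) else max_f
    return '%.*g' % (precision, length)


plain = format_length(0.5273656381, max_f, max_e)        # kept as is
scientific = format_length(1.2345678e-05, max_f, max_e)  # 5 significant digits
rounded = format_length(1 / 3, max_f, max_e)             # cut to 10 digits
print(plain, scientific, rounded)
```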

Calculate number of descendants of each node.

In [39]:
node2n = {}
for node in tree.postorder(include_self=True):
    if node.is_tip():
        node2n[node.name] = 1
    else:
        node2n[node.name] = sum([node2n[x.name] for x in node.children])

### Read and process taxonomy

In [40]:
dfr = pd.read_csv(taxonomy_fp, sep='\t', index_col=0)
dfr = dfr[dfr.index.isin(tips)]
dfr.index.name = 'node'
dfr.dropna().head(5)

| node | kingdom | phylum | class | order | family | genus | species |
| --- | --- | --- | --- | --- | --- | --- | --- |
| G000005825 | Bacteria | Firmicutes_1 | Bacilli_1 | Bacillales_1 | Bacillaceae_3 | Bacillus_2 | Bacillus pseudofirmus |
| G000006175 | Archaea | Euryarchaeota_1 | Methanococci | Methanococcales | Methanococcaceae | Methanococcus | Methanococcus voltae |
| G000006605 | Bacteria | Actinobacteria | Actinobacteria | Corynebacteriales | Corynebacteriaceae | Corynebacterium | Corynebacterium falsenii |
| G000006725 | Bacteria | Proteobacteria_1 | Gammaproteobacteria | Xanthomonadales | Xanthomonadaceae | Xylella | Xylella fastidiosa |
| G000006785 | Bacteria | Firmicutes_1 | Bacilli_1 | Lactobacillales | Streptococcaceae | Streptococcus | Streptococcus pyogenes |


In [41]:
ranks = dfr.columns.tolist()
ranks

['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']

## Tree annotation

### Generate node labels

The lowest common ancestor (LCA) of genomes represented by each taxon will receive this taxon as the node label. One node may receive multiple taxa if they all meet this criterion.

If this operation is applied to the tax2tree consensus strings (`consensus_ranks.tsv`), the outcome should match the labels decorated to the tree by tax2tree (`decorations_by_rank.tsv`).

In the current analysis, the input file should be the tax2tree consensus string filled by taxa representing single genomes (`filled_ranks.tsv`). Therefore the outcome will contain more information. Both tips and internal nodes will be included.
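The labeling rule can be illustrated on a toy tree in plain Python. This sketch uses a child-to-parent map and a simple LCA routine as a stand-in for scikit-bio's `tree.lca`; all node names and taxa are made up.

```python
# toy tree as a child -> parent map (N1 is the root); all names are made up
parent = {'G1': 'N2', 'G2': 'N2', 'G3': 'N1', 'N2': 'N1', 'N1': None}

# toy taxonomy of the three tips
taxonomy = {
    'G1': {'kingdom': 'Bacteria', 'genus': 'Bacillus'},
    'G2': {'kingdom': 'Bacteria', 'genus': 'Bacillus'},
    'G3': {'kingdom': 'Bacteria', 'genus': 'Vibrio'},
}


def path_to_root(node):
    """List a node and its ancestors, from the node up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lca(tips):
    """Find the lowest node shared by the root paths of all tips."""
    common = path_to_root(tips[0])
    for tip in tips[1:]:
        others = set(path_to_root(tip))
        common = [x for x in common if x in others]
    return common[0]


labels = {}
for rank in ('kingdom', 'genus'):
    for taxon in sorted({t[rank] for t in taxonomy.values()}):
        members = sorted(k for k, v in taxonomy.items() if v[rank] == taxon)
        # a taxon with a single genome labels the tip itself
        node = members[0] if len(members) == 1 else lca(members)
        labels.setdefault(node, {})[rank] = taxon
print(labels)
```

In this toy case the root `N1` receives the kingdom, `N2` receives the genus shared by `G1` and `G2`, and the tip `G3` is labeled directly because its genus has only one genome.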

In [42]:
labels = {}
for rank in ranks:
    for taxon in dfr[rank].value_counts().index:
        indices = dfr[dfr[rank] == taxon].index.tolist()
        node = (indices[0] if len(indices) == 1
                else tree.lca(list(tips.intersection(indices))).name)
        labels.setdefault(node, {})[rank] = taxon

In [43]:
dfl = pd.DataFrame.from_dict(labels, orient='index')
dfl.index.name = 'node'
dfl = dfl[ranks]
dfl = dfl.loc[sorted(dfl.index, key=lambda x: (x[0], int(x[1:])))]
dfl.head(3)

| node | kingdom | phylum | class | order | family | genus | species |
| --- | --- | --- | --- | --- | --- | --- | --- |
| G000005825 | | | | | | | Bacillus pseudofirmus |
| G000006175 | | | | | | | Methanococcus voltae |
| G000006725 | | | | | | | Xylella fastidiosa |


Get the highest-rank name when multiple ranks have names in a node label.

In [44]:
def get_highest_taxon(row):
    """Get the highest taxon in a row."""
    for rank in row.index:
        if pd.notnull(row[rank]):
            return rank, row[rank]
    return np.nan, np.nan

In [45]:
dfl['hrank'], dfl['htaxon'] = zip(*dfl.apply(get_highest_taxon, axis=1))
dfl[['hrank', 'htaxon']].dropna().head(5)

| node | hrank | htaxon |
| --- | --- | --- |
| G000005825 | species | Bacillus pseudofirmus |
| G000006175 | species | Methanococcus voltae |
| G000006725 | species | Xylella fastidiosa |
| G000006745 | species | Vibrio cholerae |
| G000006845 | species | Neisseria gonorrhoeae |


### Collapse clades at or above a rank

Identify the ranks to collapse.

In [46]:
collapse_ranks = []
for rank in ranks:
    collapse_ranks.append(rank)
    if collapse_rank and rank == collapse_rank:
        break
print('Collapse at the following ranks: %s.' % ', '.join(collapse_ranks))

Collapse at the following ranks: kingdom, phylum, class, order, family, genus.


Generate a list of candidate nodes.

In [47]:
df_can = dfl[dfl['hrank'].isin(collapse_ranks)][['hrank', 'htaxon']]
df_can['size'] = df_can.index.to_series().map(node2n)
df_can.head()

| node | hrank | htaxon | size |
| --- | --- | --- | --- |
| G000007185 | class | Methanopyri | 1 |
| G000007485 | genus | Tropheryma | 1 |
| G000008265 | family | Picrophilaceae | 1 |
| G000008325 | genus | Methylococcus | 1 |
| G000008625 | genus | Aquifex | 1 |


Exclude nodes with number of descendants below threshold.

In [48]:
if min_clade_size:
    to_keep = []
    for row in df_can.itertuples():
        th = min_clade_size[row.hrank] if isinstance(min_clade_size, dict) else min_clade_size
        if row.size >= th:
            to_keep.append(row.Index)
    df_can = df_can[df_can.index.isin(to_keep)]
df_can.shape[0]

288

In [49]:
df_ = df_can[df_can['htaxon'].str.contains(r'_\d+$', regex=True)].copy()

Among split clades of the same taxon, exclude those smaller than a given fraction of the dominant clade.

In [50]:
if min_split_clade_frac > 0:
    # split clades are those whose taxon ends with a numeric suffix (raw
    # strings avoid the invalid escape sequence warning for "\d")
    df_ = df_can[df_can['htaxon'].str.contains(r'_\d+$', regex=True)].copy()
    df_['taxon'], df_['idx'] = zip(*df_['htaxon'].apply(lambda x: x[::-1]).str.split(
        '_', n=1).apply(lambda x: (x[1][::-1], x[0][::-1])))
    top_clade_sizes = dict(df_.query('idx == "1"')[['taxon', 'size']].values.tolist())
    df_ = df_[df_['size'] >= df_['taxon'].map(top_clade_sizes) * min_split_clade_frac]
    df_can = df_can[df_can.index.isin(df_.index) | ~df_can['htaxon'].str.contains(r'_\d+$', regex=True)]
df_can.shape[0]

270

Get the dimensions of clades represented by internal nodes.

In [51]:
tips = tree.subset()
dimensions = {x: get_clade_dimensions(tree.find(x)) for x in df_can.index if x not in tips}
df_dim = pd.DataFrame.from_dict(dimensions, orient='index')
df_dim = df_dim.loc[sorted(df_dim.index, key=lambda x: int(x[1:]))]
df_dim.head()

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N2 | 669.0 | 0.923035 | 0.14575 | 0.592703 | 0.796175 | 0.900001 | 1.048704 | 1.226321 |
| N3 | 9906.0 | 0.911626 | 0.1653 | 0.475954 | 0.786587 | 0.910457 | 1.016349 | 1.715364 |
| N17 | 2.0 | 0.426866 | 9e-06 | 0.42686 | 0.426863 | 0.426866 | 0.426869 | 0.426872 |
| N18 | 366.0 | 0.755701 | 0.107231 | 0.491407 | 0.673813 | 0.730512 | 0.844795 | 1.077242 |
| N21 | 3.0 | 0.410779 | 0.093041 | 0.350268 | 0.357211 | 0.364154 | 0.441034 | 0.517913 |


Determine which clades (as represented by nodes) should be collapsed.

The rationale is: Start from the lowest rank, move up the hierarchy. If a node is already marked as "collapsed", all its ancestral nodes will be prohibited from being selected.

In [52]:
nodes_to_collapse = []
nodes_to_skip = set()
for rank in collapse_ranks[::-1]:
    for node in df_can[df_can['hrank'] == rank].index:
        if node not in nodes_to_skip:
            nodes_to_collapse.append(node)
            for anc in tree.find(node).ancestors():
                nodes_to_skip.add(anc.name)
print('Nodes to collapse: %d.' % len(nodes_to_collapse))

Nodes to collapse: 207.
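The bottom-up selection can be traced on a toy example in plain Python. The tree is a child-to-parent map, and the node names and candidate lists are made up for illustration.

```python
# toy tree as a child -> parent map (N1 is the root); all names are made up
parent = {'N3': 'N2', 'N2': 'N1', 'N4': 'N1', 'N1': None}

# candidate nodes per rank, ordered from the highest to the lowest rank
ranks = ['phylum', 'family', 'genus']
candidates = {'phylum': ['N1'], 'family': ['N2', 'N4'], 'genus': ['N3']}

to_collapse, to_skip = [], set()
for rank in reversed(ranks):  # start from the lowest rank
    for node in candidates[rank]:
        if node in to_skip:
            continue  # a descendant of this node is already collapsed
        to_collapse.append(node)
        # prohibit all ancestors of the collapsed node
        anc = parent[node]
        while anc is not None:
            to_skip.add(anc)
            anc = parent[anc]
print(to_collapse)
```

Collapsing the genus node `N3` blocks its ancestors `N2` and `N1`, so at the family rank only `N4` is still eligible.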


Calculate how many tips (genomes) are covered by the collapsed clades.

In [53]:
tips_covered = set()
for name in nodes_to_collapse:
    node = tree.find(name)
    tips = set([name]) if node.is_tip() else node.subset()
    if len(tips.intersection(tips_covered)) > 0:
        raise ValueError('Overlapping clades detected.')
    tips_covered.update(tips)
tips_missed = tree.subset() - tips_covered
print('Tips covered: %d. Tips missed: %d.'
      % (len(tips_covered), len(tips_missed)))

Tips covered: 6944. Tips missed: 3631.


## Tree visualization

### Tree and labels manipulation

#### Tree pruning

Original tree dimensions.

In [54]:
tree_tips = tree.subset()
tree_nodes = set(x.name for x in tree.non_tips(include_self=True))
print('Original tree has %d tips and %d internal nodes.'
      % (len(tree_tips), len(tree_nodes)))

Original tree has 10575 tips and 10574 internal nodes.


Prune tree to include collapsed clades only.

In [55]:
nodes_w_labels = [x for x in dfl['htaxon'].dropna().index if x in tree_nodes]
tree1 = selective_prune(tree, tips_covered, nodes_w_labels) if collapse_rank and delete_uncollapsed else tree.copy()
tree1_tips = tree1.subset()
tree1_nodes = set(x.name for x in tree1.non_tips(include_self=True))
print('Output tree has %d tips and %d internal nodes.'
      % (len(tree1_tips), len(tree1_nodes)))

Output tree has 6944 tips and 6996 internal nodes.


Export pruned tree.

In [56]:
with open('pruned_tree.nwk', 'w') as f:
    f.write('%s\n' % format_newick(tree1, operators=None))

#### Tree shrinking

Generate a tree in which the collapsed clades are actually deleted.

In [57]:
tree3 = tree1.copy()
nodes_to_remove = []
for node in tree3.non_tips():
    if node.name in nodes_to_collapse:
        node.length += df_dim[collapse_length_func][node.name]
        nodes_to_remove.extend(node.children)
tree3.remove_deleted(lambda x: x in nodes_to_remove)
tree3.prune()
print('Collapsed tree has %d tips.' % tree3.count(tips=True))
with open('collapsed_tree.nwk', 'w') as f:
    f.write(format_newick(tree3, operators=None))

Collapsed tree has 207 tips.


#### Node label formatting

Format node label strings.

In [58]:
name_map = dfl['htaxon'].to_dict()
if len(label_format_regexes) > 0:
    for id_ in name_map:
        for pattern, repl in label_format_regexes:
            name_map[id_] = re.sub(pattern, repl, name_map[id_])
if append_rank_code is True:
    for id_ in name_map:
        name_map[id_] = '%s__%s' % (dfl['hrank'][id_][0], name_map[id_])
if append_clade_size is True:
    for id_ in name_map:
        n = node2n[id_]
        if n > 1:
            name_map[id_] = '%s (%d)' % (name_map[id_], node2n[id_])
sorted(name_map.items())[:5]

[('G000005825', 's__Bacillus pseudofirmus'),
 ('G000006175', 's__Methanococcus voltae'),
 ('G000006725', 's__Xylella fastidiosa'),
 ('G000006745', 's__Vibrio cholerae'),
 ('G000006845', 's__Neisseria gonorrhoeae')]

In [59]:
tip_name_map, node_name_map = {}, {}
for id_, name in name_map.items():
    if id_ in tree_tips:
        tip_name_map[id_] = name
    elif id_ in tree_nodes:
        node_name_map[id_] = name

#### Additional attributes

In [60]:
if custom_attrs_fps:
    dfa = {}
    for name, fp in custom_attrs_fps.items():
        dfa[name] = pd.read_table(fp, index_col=0, names=[name])

### FigTree file generation

Generate FigTree tip and node name maps.

In [61]:
figtree_tip_name_map = {k: '"%s"' % v for k, v in tip_name_map.items()}
figtree_node_name_map = {k: '"%s"' % v for k, v in node_name_map.items()}

In [62]:
if collapse_rank and hide_uncollapsed:
    for name in figtree_tip_name_map:
        if name in tips_missed:
            figtree_tip_name_map[name] = '""'

Let FigTree display internal node labels without displaying labels of tips (including collapsed clades).

In [63]:
no_labels = set(nodes_to_collapse).union(tree1_tips)
figtree_label_map = {k: v for k, v in figtree_node_name_map.items() if k not in no_labels}

In [64]:
tip2attrs = {'!name': figtree_tip_name_map}
node2attrs = {'!name': figtree_node_name_map, 'label': figtree_label_map}

Generate a FigTree collapse map.

In [65]:
figtree_collapse_map = {}
tree_radius = max(x.accumulate_to_ancestor(tree1) for x in tree1.tips())
for name in nodes_to_collapse:
    if name not in tree1_tips:
        length = df_dim[collapse_length_func][name]
        height = tree_radius - tree1.find(name).accumulate_to_ancestor(tree1) - length
        figtree_collapse_map[name] = '{"collapsed",%.*g}' % (max_f, height)
sorted(figtree_collapse_map.items())[:5]

[('N100', '{"collapsed",1.080952001}'),
 ('N10087', '{"collapsed",0.6039728893}'),
 ('N1016', '{"collapsed",0.9363783172}'),
 ('N1040', '{"collapsed",0.7580735353}'),
 ('N1043', '{"collapsed",0.5273656381}')]
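The height computed above is what remains of the tree radius after subtracting the node's distance from the root and the visual length of the clade. A worked example with made-up numbers:

```python
max_f = 10  # float precision of branch lengths, as determined earlier

# made-up values for a single collapsed clade
tree_radius = 2.0    # largest root-to-tip distance in the tree
node_depth = 0.5     # distance from the root to the clade's node
clade_length = 0.25  # visual length of the clade (e.g., median tip depth)

height = tree_radius - node_depth - clade_length
attr = '{"collapsed",%.*g}' % (max_f, height)
print(attr)
```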

In [66]:
node2attrs['!collapse'] = figtree_collapse_map

Generate FigTree size map.

In [67]:
node2attrs['size'] = node2n

Generate additional attributes for FigTree.

In [68]:
if custom_attrs_fps:
    for name, df_ in dfa.items():
        # test the column's dtype; Series.iteritems was removed in pandas 2.0
        if np.issubdtype(df_[name].dtype, np.number):
            map_ = {k: str(v) for k, v in df_[name].items()}
        else:
            map_ = {k: '"%s"' % v for k, v in df_[name].items()}
        tip2attrs[name] = node2attrs[name] = map_

Write FigTree files.

In [69]:
tree2 = tree1.copy()
taxlabels = add_figtree_taxon_attrs(tree2, tip2attrs)
add_figtree_node_attrs(tree2, node2attrs)
with open('figtree.tre', 'w') as f:
    write_figtree_nexus(tree2, f, taxlabels=taxlabels)

In [70]:
print('Task completed.')

Task completed.


### iTOL files generation

Step 1: Upload the already exported pruned tree file (Newick format) to iTOL.

Write iTOL node label file. (Applies to both tips and internal nodes, including collapsed triangles.)

In [71]:
with open('label.txt', 'w') as f:
    write_itol_label(f, {**node_name_map, **tip_name_map})

Write iTOL branch text file.

In [72]:
branch_name_map = {k: v for k, v in node_name_map.items() if k not in nodes_to_collapse}
with open('branch_text.txt', 'w') as f:
    # position = 0.5: at the middle of branch
    write_itol_dataset_text(f, 'branch text', branch_name_map, position='0.5', size='1')

Write iTOL collapse file.

In [73]:
if collapse_rank:
    with open('collapse.txt', 'w') as f:
        write_itol_collapse(f, sorted(
            x for x in nodes_to_collapse if x in tree1_nodes))

Write iTOL files for extra node attributes.

In [74]:
color_gradient = make_color_palette(color_range[0], color_range[1])

In [75]:
if custom_attrs_fps:
    for name, df_ in dfa.items():
        # node text
        with open('%s_node_text.txt' % name, 'w') as f:
            write_itol_dataset_text(
                f, '%s node text' % name, df_[name].to_dict(), position='1', size='1')
        # branch color gradient
        if np.issubdtype(df_[name].dtype, np.number):
            branch_color_map = make_color_gradient(df_[name].to_dict(), color_gradient)
            with open('%s_branch_color.txt' % name, 'w') as f:
                write_itol_dataset_style(
                    f, '%s color gradient' % name, branch_color_map, target='branch',
                    what='node', color=branch_color_map)

In [76]:
print('Task completed!')

Task completed!


### Misc features

A funky workaround for rendering collapsed clades with sizes proportional to their numbers of descendants.

In the output tree, collapsed clades are converted into polytomies whose branches have equal length. FigTree or iTOL will render each one like a sector, although the sector actually consists of numerous tiny branches.

Then one may further recover the actual sectors by hacking the resulting SVG file.

```
tree3 = tree1.copy()
polytomized_labels = {}
for node in tree3.levelorder():
    if node.name in nodes_to_collapse:
        if node.is_tip():
            polytomized_labels[node.name] = name_map[node.name]
        else:
            length = df_dim[collapse_length_func][node.name]
            tips = [x for x in node.tips()]
            n = len(tips)
            for i, tip in enumerate(tips):
                tip.length = length
                if i == int(n / 2):
                    polytomized_labels[tip.name] = name_map[node.name]
            node.children = tips
with open('polytomized_pruned_tree.nwk', 'w') as f:
    f.write(format_newick(tree3, operators=None))
with open('polytomized_label.txt', 'w') as f:
    write_itol_label(f, polytomized_labels)
```

The following code sets branch widths proportional to the number of descendants. However, iTOL freezes when dealing with this many styles.

```
from math import sqrt, ceil
node2n_intree = {k: v for k, v in node2n.items() if k in tree_tips or k in tree_nodes}
factor_map = {k: sqrt(v) for k, v in node2n_intree.items()}
max_val = max(factor_map.values())
factor_map = {k: ceil(v / max_val * 100) / 10 for k, v in factor_map.items()}
with open('branch_size_factor.txt', 'w') as f:
    write_itol_dataset_style(
        f, 'branch size factor', factor_map, target='branch',
        what='node', factor=factor_map, style='normal')
```