## Summarizing annotations to a term and descendants

This notebook demonstrates summarizing annotation counts for a term and its descendants.

An example use case is a GO annotator exploring whether to refactor a subtree of GO.

If this were a routine task we would build a command-line or even a web interface,
but a notebook keeps the logic flexible, and in any case this is intended largely
as a demonstration.

### Boilerplate

 * import the relevant ontobio libraries
 * set up key objects

In [43]:
import pandas as pd

## Create an ontology factory in order to fetch GO
from ontobio.ontol_factory import OntologyFactory
ofactory = OntologyFactory()

## GOLR queries
from ontobio.golr.golr_query import GolrAssociationQuery

## rendering ontologies
from ontobio import GraphRenderer

In [44]:
## Load GO. Note the first time this runs Jupyter will show '*' - be patient
ont = ofactory.create("go")  

### Finding descendants

Here we use the in-memory ontology object; no external service calls are made.

Change the value of `term_id` to any term you like.

In [52]:

term_id = "GO:0009070" ## serine family amino acid biosynthetic process
## Follow subClassOf and part-of (BFO:0000050) edges; reflexive=True includes term_id itself
descendants = ont.descendants(term_id, reflexive=True, relations=['subClassOf', 'BFO:0000050'])

In [53]:
descendants

['GO:0009070',
 'GO:0071269',
 'GO:0019344',
 'GO:0019345',
 'GO:0006545',
 'GO:0006535',
 'GO:0016260',
 'GO:0019264',
 'GO:0019343',
 'GO:0004124',
 'GO:0019265',
 'GO:0009090',
 'GO:0070179',
 'GO:0006564']
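To make the role of `relations` and `reflexive` concrete, here is a conceptual sketch of a relation-filtered descendant search over a toy edge list. It is not ontobio's implementation: `toy_descendants`, `EDGES`, and the node names `A` to `E` are hypothetical, made up for illustration only.

```python
from collections import deque

# Hypothetical toy edges as (child, relation, parent). This is not real GO data.
EDGES = [
    ("B", "subClassOf", "A"),
    ("C", "subClassOf", "B"),
    ("D", "BFO:0000050", "B"),  # D part_of B
    ("E", "RO:0002211", "A"),   # E regulates A; excluded by the filter used below
]

def toy_descendants(edges, root, relations, reflexive=False):
    """Collect nodes that reach `root` through edges whose relation is in `relations`."""
    # Index children by parent, keeping only the requested relations
    children = {}
    for child, rel, parent in edges:
        if rel in relations:
            children.setdefault(parent, []).append(child)

    seen = {root}
    order = [root] if reflexive else []
    queue = deque([root])
    while queue:  # breadth-first walk down the filtered graph
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

with_part_of = toy_descendants(EDGES, "A", ["subClassOf", "BFO:0000050"], reflexive=True)
is_a_only = toy_descendants(EDGES, "A", ["subClassOf"])
print(with_part_of)
print(is_a_only)
```

Dropping `BFO:0000050` from the relation list removes `D` from the result, in the same way that `GO:0004124` is reachable from `term_id` in the real query only through a part-of edge.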

### Rendering subtrees

We use the good old-fashioned tree renderer.

(This doesn't scale well for lattice-like subontologies.)

In [55]:
renderer = GraphRenderer.create('tree')

In [56]:
print(renderer.render_subgraph(ont, nodes=descendants))

. GO:0009070 ! serine family amino acid biosynthetic process
 % GO:0006545 ! glycine biosynthetic process
  % GO:0019264 ! glycine biosynthetic process from serine
  % GO:0019265 ! glycine biosynthetic process, by transamination of glyoxylate
 % GO:0006564 ! L-serine biosynthetic process
 % GO:0019344 ! cysteine biosynthetic process
  % GO:0006535 ! cysteine biosynthetic process from serine
  % GO:0019343 ! cysteine biosynthetic process via cystathionine
  % GO:0019345 ! cysteine biosynthetic process via S-sulfo-L-cysteine
  < GO:0004124 ! cysteine synthase activity
 % GO:0009090 ! homoserine biosynthetic process
 % GO:0016260 ! selenocysteine biosynthetic process
 % GO:0070179 ! D-serine biosynthetic process
 % GO:0071269 ! L-homocysteine biosynthetic process
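The layout above can be mimicked with a few lines of recursion. The sketch below is not ontobio's renderer: `render_tree`, `CHILDREN`, `LABELS`, and `SYMBOLS` are hypothetical names, and the reading of `%` as subClassOf and `<` as part-of is an assumption inferred from the output above. The data is a hand-copied fragment of that output.

```python
# Hand-copied fragment of the subtree above: parent -> [(relation, child), ...]
CHILDREN = {
    "GO:0009070": [("subClassOf", "GO:0019344")],
    "GO:0019344": [
        ("subClassOf", "GO:0006535"),
        ("BFO:0000050", "GO:0004124"),
    ],
}
LABELS = {
    "GO:0009070": "serine family amino acid biosynthetic process",
    "GO:0019344": "cysteine biosynthetic process",
    "GO:0006535": "cysteine biosynthetic process from serine",
    "GO:0004124": "cysteine synthase activity",
}
# Assumed meaning of the edge symbols in the rendered tree
SYMBOLS = {"subClassOf": "%", "BFO:0000050": "<"}

def render_tree(node, symbol=".", depth=0):
    """Return one line per node, indented by one space per level of depth."""
    lines = [f"{' ' * depth}{symbol} {node} ! {LABELS[node]}"]
    for rel, child in CHILDREN.get(node, []):
        lines.extend(render_tree(child, SYMBOLS[rel], depth + 1))
    return lines

tree_lines = render_tree("GO:0009070")
print("\n".join(tree_lines))
```

In a scheme like this, a node with several parents is printed once under each parent, which is one reason a tree view becomes unwieldy for lattice-like subontologies.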




### Summarizing annotations

We write a short function that wraps the Golr call and returns a summary dict.

The dict is keyed by taxon label. It also includes an `ALL` entry holding the total count.


In [57]:
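def summarize(t: str) -> dict:
    """
    Summarize annotations to term `t`: total count plus a count per taxon label
    """
    ## rows=0: we only need the counts, not the association records themselves
    q = GolrAssociationQuery(object=t, rows=0, object_category='function')
    result = q.exec()
    fc = result['facet_counts']
    if 'taxon_label' in fc:
        item = {'ALL': result['numFound']}  ## make sure this is the first entry
        for k,v in fc['taxon_label'].items():
            item[k] = v
        return item
    else:
        ## no taxon facet in the response: return an empty summary
        return {}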


In [58]:
print(summarize(term_id))

{'ALL': 3765, 'Triticum aestivum': 100, 'Brassica napus': 84, 'Arabidopsis thaliana': 81, 'Glycine max': 81, 'Gossypium hirsutum': 80, 'Escherichia coli K-12': 71, 'Nicotiana tabacum': 69, 'Zea mays': 65, 'Helianthus annuus': 63, 'Brassica rapa subsp. pekinensis': 60, 'Caenorhabditis elegans': 59, 'Populus trichocarpa': 58, 'Medicago truncatula': 55, 'Saccharomyces cerevisiae S288C': 55, 'Solanum tuberosum': 53, 'Hordeum vulgare subsp. vulgare': 50, 'Aspergillus nidulans FGSC A4': 48, 'Manihot esculenta': 47, 'Musa acuminata subsp. malaccensis': 45, 'Vitis vinifera': 45, 'Oryza sativa Japonica Group': 44, 'Selaginella moellendorffii': 44, 'Setaria italica': 42, 'Branchiostoma floridae': 40, 'Candida albicans SC5314': 40}
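The dict-building step can be tried without a network call. The sketch below assumes the query result has the shape that `summarize` relies on (a `numFound` total and a `facet_counts` dict keyed by facet name). `summary_from_result` and `mock_result` are hypothetical names, and the counts are copied from the output above rather than fetched.

```python
def summary_from_result(result: dict) -> dict:
    """Flatten a Golr-style result into {'ALL': total, taxon_label: count, ...}."""
    fc = result.get('facet_counts', {})
    if 'taxon_label' not in fc:
        return {}
    item = {'ALL': result['numFound']}  # inserted first, so it stays the first key
    item.update(fc['taxon_label'])
    return item

# Mock result; the numbers are taken from the printed summary above
mock_result = {
    'numFound': 3765,
    'facet_counts': {
        'taxon_label': {'Triticum aestivum': 100, 'Brassica napus': 84},
    },
}

summary = summary_from_result(mock_result)
empty = summary_from_result({'numFound': 0, 'facet_counts': {}})
print(summary)
print(empty)
```

Note that in the real output the per-taxon counts do not add up to `ALL`; presumably the facet returns only the most frequent taxa, so `ALL` is the figure to use for totals.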


In [62]:
def summarize_set(ids) -> pd.DataFrame:
    """
    Summarize annotations for each term in `ids`; return a dataframe with one row per term
    """
    items = []
    for id in ids:
        item = {'id': id, 'name:': ont.label(id)}
        for k,v in summarize(id).items():
            item[k] = v
        items.append(item)
    # taxa with no annotations for a term come out as NaN; show them as 0
    df = pd.DataFrame(items).fillna(0)
    # sort by total number of annotations, largest first
    df.sort_values('ALL', axis=0, ascending=False, inplace=True)
    return df
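The `fillna(0)` step has a visible side effect: any taxon column that was missing from at least one row is stored as floats, which is why the table below shows values such as `100.0`. The sketch below illustrates this on made-up rows; the ids `T:1` and `T:2` and the taxon names are hypothetical, and the cast back to integers is an optional tidy-up, not part of the notebook's code.

```python
import pandas as pd

# Made-up summary rows; the second row has no entry for 'Taxon B'
rows = [
    {'id': 'T:1', 'ALL': 10, 'Taxon A': 6, 'Taxon B': 4},
    {'id': 'T:2', 'ALL': 25, 'Taxon A': 25},
]

# Same pattern as summarize_set: missing taxa become NaN, then 0
df = pd.DataFrame(rows).fillna(0).sort_values('ALL', ascending=False)
print(df.dtypes)  # 'Taxon B' is a float column because it held NaN before fillna

# Optional tidy-up: cast every count column back to integers
count_cols = [c for c in df.columns if c != 'id']
tidy = df.astype({c: int for c in count_cols})
print(tidy)
```

Columns present in every row, such as `ALL` here, keep their integer type, which matches the table below where only `ALL` is printed without a decimal point.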

### Summarizing a GO term and its descendants

We just write out the dataframe directly.

More advanced visualizations are easy with plotly etc.; we leave these as an exercise for the reader.

In [63]:
df = summarize_set(descendants)
df

| | id | name: | ALL | Triticum aestivum | Brassica napus | Arabidopsis thaliana | Glycine max | Gossypium hirsutum | Escherichia coli K-12 | Nicotiana tabacum | ... | Bradyrhizobium diazoefficiens USDA 110 | Deinococcus radiodurans R1 | Phytophthora ramorum | Batrachochytrium dendrobatidis JAM81 | Chlamydomonas reinhardtii | Clostridium botulinum A str. Hall | Ciona intestinalis | Nematostella vectensis | Dictyostelium discoideum | Pristionchus pacificus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GO:0009070 | serine family amino acid biosynthetic process | 3765 | 100.0 | 84.0 | 81.0 | 81.0 | 80.0 | 71.0 | 69.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | GO:0019344 | cysteine biosynthetic process | 1624 | 52.0 | 45.0 | 52.0 | 43.0 | 34.0 | 32.0 | 30.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | GO:0006545 | glycine biosynthetic process | 1018 | 19.0 | 21.0 | 12.0 | 22.0 | 30.0 | 0.0 | 24.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | GO:0006535 | cysteine biosynthetic process from serine | 580 | 22.0 | 20.0 | 18.0 | 19.0 | 15.0 | 8.0 | 11.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | GO:0019264 | glycine biosynthetic process from serine | 573 | 9.0 | 13.0 | 8.0 | 16.0 | 21.0 | 0.0 | 19.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | GO:0004124 | cysteine synthase activity | 492 | 22.0 | 20.0 | 24.0 | 19.0 | 15.0 | 7.0 | 11.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 13 | GO:0006564 | L-serine biosynthetic process | 482 | 9.0 | 9.0 | 9.0 | 5.0 | 8.0 | 16.0 | 5.0 | ... | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 0.0 |
| 11 | GO:0009090 | homoserine biosynthetic process | 385 | 17.0 | 7.0 | 6.0 | 9.0 | 7.0 | 7.0 | 8.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | GO:0019343 | cysteine biosynthetic process via cystathionine | 377 | 8.0 | 5.0 | 0.0 | 5.0 | 4.0 | 0.0 | 8.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | GO:0019265 | glycine biosynthetic process, by transaminatio... | 200 | 0.0 | 0.0 | 0.0 | 3.0 | 4.0 | 0.0 | 0.0 | ... | 4.0 | 3.0 | 3.0 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
