# Parsing functional annotation

We used EggNOG-mapper to functionally annotate the predicted proteins. The output of
EggNOG-mapper is a tab-separated file that contains information about the seed ortholog of each
query, as well as information about the corresponding orthogroup and its phylogenetic context. Here
we will parse this output to reduce it to one description per gene, and see how well the different
annotation sources are represented.

In [1]:
import pandas as pd
import numpy as np

First read the data. Remember that emapper output has four comment lines in the header and another
three in the footer:

In [2]:
emapper_loc = "/Volumes/scratch/pycnogonum/genome/draft/annot_merge/out.emapper.annotations"

emapper = pd.read_csv(emapper_loc, sep="\t", skiprows=4, skipfooter=3, engine="python")
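Hard-coding `skiprows=4` and `skipfooter=3` breaks silently if the number of comment lines changes. A minimal sketch of an alternative, assuming that every comment line starts with `##` while the column header starts with a single `#` (as in `#query`): drop the `##` lines first, then hand the rest to pandas. The helper name `read_emapper` and the toy input are illustrative only and are not part of the original analysis.

```python
import io

import pandas as pd


def read_emapper(handle):
    """Parse emapper-style text, dropping every line that starts with '##'."""
    # Keep the '#query ...' header line: it starts with a single '#'.
    rows = [line for line in handle if not line.startswith("##")]
    return pd.read_csv(io.StringIO("".join(rows)), sep="\t")


# Toy input mimicking the layout: comment lines, header, records, footer.
toy = io.StringIO(
    "## comment line 1\n"
    "## comment line 2\n"
    "#query\tseed_ortholog\tscore\n"
    "g1.t1\t1234.XP_1\t100.5\n"
    "PB.1.1\t1234.XP_2\t250.0\n"
    "## footer line\n"
)
toy_df = read_emapper(toy)
```

With a real file this would be called as `read_emapper(open(emapper_loc))`; it does not need `engine="python"`, because no footer has to be skipped.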

We need to extract the gene ID from the protein ID. The gene ID is usually everything before the
first dot, but PacBio IsoSeq gene IDs contain a dot themselves, so we need to parse those differently.

In [3]:
def parse_gene_id(x):
    if "PB" in x:
        parts = x.split(".")
        return ".".join(parts[:2])
    elif x.startswith("r2") or x.startswith("g") or x.startswith("at"):
        return x.split(".")[0]
    else:
        # raise (rather than return) so that unexpected IDs fail loudly
        raise ValueError(f"Unknown gene ID format: {x}")
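The same rule can also be written as a single vectorized regular expression. This is only a sketch under an assumption about the ID layout: PacBio protein IDs look like `PB.<gene>.<isoform>`, and every other ID has the gene name before the first dot. The example IDs below are hypothetical and chosen only to exercise each branch.

```python
import pandas as pd

# Hypothetical protein IDs, one per annotation source.
toy_ids = pd.Series(["PB.12.3", "g100.t1", "r2_g5.t2", "at_DN7.p1"])

# Try 'PB.<field>' first; otherwise take everything before the first dot.
gene_pattern = r"^(PB\.[^.]+|[^.]+)"
toy_genes = toy_ids.str.extract(gene_pattern, expand=False)
```

Unlike the function above, this pattern does not reject unknown prefixes, so it is better suited as a cross-check than as a replacement.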

Apply the parser to each row...

In [4]:
emapper["gene"] = emapper["#query"].apply(parse_gene_id)

...and now group by gene ID and keep the best-scoring isoform/CDS for each gene.

In [5]:
summary = emapper.groupby("gene").apply(lambda x: x.sort_values("score", ascending=False).head(1), include_groups=False)
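An equivalent way to pick the top-scoring row per gene, shown here on toy data, is to look up the row label of the maximum score in each group with `idxmax`. This is a sketch, not the code used above; it assumes the row index is unique, and unlike the `apply` version it keeps `gene` as an ordinary column with a flat index. The toy IDs and scores are made up.

```python
import pandas as pd

# Toy table: two isoforms for each of two genes.
toy = pd.DataFrame(
    {
        "#query": ["g1.t1", "g1.t2", "PB.1.1", "PB.1.2"],
        "gene": ["g1", "g1", "PB.1", "PB.1"],
        "score": [50.0, 80.0, 120.0, 90.0],
    }
)

# idxmax returns, per gene, the index label of the highest-scoring row.
best_rows = toy.groupby("gene")["score"].idxmax()
best = toy.loc[best_rows]
```

On ties, `idxmax` keeps the first row it encounters, so the result depends on input order in that case.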

Next we need to determine which annotation source each gene comes from. This is easy, since the
first character of the gene ID identifies the source:

* P: PacBio IsoSeq (round 1)
* g: BRAKER3 (round 1)
* r: BRAKER3 (round 2)
* a: assembled transcriptomes (round 3)

We will keep track of these in dictionaries to make calculations easier. First, let's find out how
many genes each source contributes to the merged GFF3 file.

```bash
$ grep "      gene    " merged_sorted.gff3 | grep PB | wc -l
8904
$ grep "      gene    " merged_sorted.gff3 | grep r2 | wc -l
2223
$ grep "      gene    " merged_sorted.gff3 | grep at | wc -l
774
$ grep "      gene    " merged_sorted.gff3 | grep "=g" | wc -l
3596
```
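Plain substring matching with `grep` can in principle also hit text elsewhere on the line (for example, `at` may occur in other fields). A stricter count could look only at the `ID` attribute of `gene` features. The sketch below assumes a standard nine-column, tab-separated GFF3 with the gene ID stored as `ID=...` and the same prefixes used in the commands above; the function name, the toy records, and their source column values are hypothetical.

```python
from collections import Counter

# Gene ID prefix -> annotation source; the prefixes are mutually exclusive.
PREFIXES = {"PB": "PacBio", "r2": "BRAKER3_r2", "at": "Trinity", "g": "BRAKER3_r1"}


def count_genes_by_source(lines):
    """Count GFF3 'gene' features per source, based on the ID attribute."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # only well-formed gene features are counted
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        gene_id = attrs.get("ID", "")
        for prefix, source in PREFIXES.items():
            if gene_id.startswith(prefix):
                counts[source] += 1
                break
    return counts


toy_gff3 = [
    "##gff-version 3\n",
    "chr1\tsrcA\tgene\t1\t100\t.\t+\t.\tID=PB.1\n",
    "chr1\tsrcA\tmRNA\t1\t100\t.\t+\t.\tID=PB.1.1;Parent=PB.1\n",
    "chr1\tsrcB\tgene\t200\t300\t.\t-\t.\tID=g7\n",
    "chr1\tsrcB\tgene\t400\t500\t.\t+\t.\tID=r2_g3\n",
    "chr1\tsrcC\tgene\t600\t700\t.\t+\t.\tID=at_DN5\n",
]
toy_counts = count_genes_by_source(toy_gff3)
```

Run on the real file, e.g. `count_genes_by_source(open("merged_sorted.gff3"))`, the result could be compared against the `grep` counts as a consistency check.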

In [6]:
translate = {
    "P": "PacBio",
    "g": "BRAKER3_r1",
    "r": "BRAKER3_r2",
    "a": "Trinity",
}

totals = {
    "PacBio": 8904,
    "BRAKER3_r1": 3596,
    "BRAKER3_r2": 2223,
    "Trinity": 774
}

rounds = {
    "PacBio": 1,
    "BRAKER3_r1": 1,
    "BRAKER3_r2": 2,
    "Trinity": 3
}

In [7]:
annotated = pd.DataFrame(summary["#query"].str[:1].value_counts())
annotated.index = [translate[x] for x in annotated.index]
annotated["totals"] = [totals[x] for x in annotated.index]
annotated["round"] = [rounds[x] for x in annotated.index]

We can now see what percentage of the gene models from each annotation source received an EggNOG-mapper annotation.

In [8]:
(annotated["count"] / annotated["totals"] * 100).round(2)

PacBio        73.69
BRAKER3_r1    57.40
BRAKER3_r2    51.73
Trinity       31.27
dtype: float64

As expected, the more trustworthy annotations find more orthologs. The same pattern holds if we
aggregate the round 1 sources:

In [9]:
tmp = annotated.groupby("round").sum()
(tmp["count"] / tmp["totals"] * 100).round(2)

round
1    69.00
2    51.73
3    31.27
dtype: float64

Write out the reduced annotation file for downstream analysis.

In [10]:
summary_loc = "/Volumes/scratch/pycnogonum/genome/draft/annot_merge/out.emapper.best.annotations"
summary.to_csv(summary_loc, sep="\t", index=False)

In [11]:
annotated["count"].sum()

np.int64(10017)

In [12]:
annotated["totals"].sum()

np.int64(15497)