# Gene annotation

The gene identifiers in the model are RefSeq locus tags.

[The RefSeq feature table of the assembly](https://www.ncbi.nlm.nih.gov/assembly/GCF_000685155.1) can be used to annotate each gene with its NCBI protein accession. Those accessions can then be mapped through UniProt to retrieve UniProtKB identifiers.

In [1]:
from collections import defaultdict
from pathlib import Path

import cobra
import re
from datatable import dt, f, join, update

In [2]:
ROOT = Path.cwd().parent
model_file = str(ROOT / "iMENI452.xml")

In [3]:
model = cobra.io.read_sbml_model(model_file)

Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  1.000e+00  ratio =  1.000e+00
Problem data seem to be well scaled


Note that the leading "#" was removed from the first line of the feature table so that the header is read properly.

In [4]:
feat_table = dt.fread(ROOT / "GCF_000685155.1_ANME2D_V10_feature_table.txt", header=True)
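
The manual edit of the header line could also be scripted. The following is a minimal sketch, assuming the header is the first line of the file and begins with `#`; `strip_header_hash` is a hypothetical helper and not part of datatable.

```python
def strip_header_hash(text: str) -> str:
    """Remove leading '#' characters (and following spaces) from the first line only."""
    first, sep, rest = text.partition("\n")
    return first.lstrip("#").lstrip(" ") + sep + rest


# Tiny stand-in for the feature table; the real file has many more columns.
sample = "# feature\tclass\tassembly\ngene\tprotein_coding\tGCF_000685155.1\n"
cleaned = strip_header_hash(sample)
print(cleaned.splitlines()[0].split("\t"))
```

The cleaned text could then be written back to disk before calling `dt.fread`.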

In [6]:
feat_table.head()

Unnamed: 0_level_0,feature,class,assembly,assembly_unit,seq_type,chromosome,genomic_accession,start,end,strand,…,GeneID,locus_tag,feature_interval_length,product_length,attributes
Unnamed: 0_level_1,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,Unnamed: 6_level_1,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,Unnamed: 11_level_1,Unnamed: 12_level_1,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪
0,gene,protein_coding,GCF_000685155.1,Primary Assembly,unplaced scaffold,(unknown),NZ_JMIY01000001.1,1,265,+,…,(unknown),ANME2D_RS00005,265,,partial;old_locus_tag=ANME2D_00001
1,CDS,with_protein,GCF_000685155.1,Primary Assembly,unplaced scaffold,(unknown),NZ_JMIY01000001.1,1,265,+,…,(unknown),ANME2D_RS00005,265,87.0,partial
2,gene,protein_coding,GCF_000685155.1,Primary Assembly,unplaced scaffold,(unknown),NZ_JMIY01000001.1,522,818,+,…,(unknown),ANME2D_RS00010,297,,old_locus_tag=ANME2D_00003
3,CDS,with_protein,GCF_000685155.1,Primary Assembly,unplaced scaffold,(unknown),NZ_JMIY01000001.1,522,818,+,…,(unknown),ANME2D_RS00010,297,98.0,
4,gene,protein_coding,GCF_000685155.1,Primary Assembly,unplaced scaffold,(unknown),NZ_JMIY01000001.1,984,1664,-,…,(unknown),ANME2D_RS00015,681,,old_locus_tag=ANME2D_00004
5,CDS,with_protein,GCF_000685155.1,Primary Assembly,unplaced scaffold,(unknown),NZ_JMIY01000001.1,984,1664,-,…,(unknown),ANME2D_RS00015,681,226.0,
6,gene,protein_coding,GCF_000685155.1,Primary Assembly,unplaced scaffold,(unknown),NZ_JMIY01000001.1,2310,3011,-,…,(unknown),ANME2D_RS00020,702,,old_locus_tag=ANME2D_00005
7,CDS,with_protein,GCF_000685155.1,Primary Assembly,unplaced scaffold,(unknown),NZ_JMIY01000001.1,2310,3011,-,…,(unknown),ANME2D_RS00020,702,233.0,
8,gene,protein_coding,GCF_000685155.1,Primary Assembly,unplaced scaffold,(unknown),NZ_JMIY01000001.1,3072,4781,-,…,(unknown),ANME2D_RS00025,1710,,old_locus_tag=ANME2D_00006
9,CDS,with_protein,GCF_000685155.1,Primary Assembly,unplaced scaffold,(unknown),NZ_JMIY01000001.1,3072,4781,-,…,(unknown),ANME2D_RS00025,1710,569.0,


In [11]:
feat_table.names

('feature',
 'class',
 'assembly',
 'assembly_unit',
 'seq_type',
 'chromosome',
 'genomic_accession',
 'start',
 'end',
 'strand',
 'product_accession',
 'non-redundant_refseq',
 'related_accession',
 'name',
 'symbol',
 'GeneID',
 'locus_tag',
 'feature_interval_length',
 'product_length',
 'attributes')

In [19]:
cds = feat_table[f.feature == "CDS", ["locus_tag", "name", "GeneID", "symbol", "product_accession"]]

In [20]:
gene_names = dt.Frame(locus_tag=[gene.id for gene in model.genes])

In [21]:
cds.key = "locus_tag"

In [23]:
cds.head()

In [24]:
df_genes = gene_names[:, :, join(cds)]

In [25]:
df_genes.head()

Unnamed: 0_level_0,locus_tag,name,GeneID,symbol,product_accession
Unnamed: 0_level_1,▪▪▪▪,▪▪▪▪,Unnamed: 3_level_1,▪▪▪▪,▪▪▪▪
0,ANME2D_RS14405,alanine dehydrogenase,(unknown),,WP_048093069.1
1,ANME2D_RS03200,aminotransferase class I/II-fold pyridoxal phospha…,(unknown),,WP_048089082.1
2,ANME2D_RS05600,pyridoxal phosphate-dependent aminotransferase,(unknown),,WP_048089698.1
3,ANME2D_RS07885,pyridoxal phosphate-dependent aminotransferase,(unknown),,WP_048090317.1
4,ANME2D_RS08380,aminotransferase class I/II-fold pyridoxal phospha…,(unknown),,WP_048090480.1
5,ANME2D_RS05565,argininosuccinate synthase,(unknown),,WP_048089688.1
6,ANME2D_RS01360,argininosuccinate lyase,(unknown),argH,WP_048088486.1
7,ANME2D_RS03615,adenylosuccinate synthase,(unknown),,WP_048089156.1
8,ANME2D_RS00280,adenylosuccinate lyase,(unknown),,WP_048088213.1
9,ANME2D_RS00300,aspartate ammonia-lyase,(unknown),,WP_048088217.1


In [26]:
dt.isna(df_genes["GeneID"]).sum()

Unnamed: 0_level_0,GeneID
Unnamed: 0_level_1,▪▪▪▪▪▪▪▪
0,452


All 452 `GeneID` values are missing, so we can drop the column.

In [27]:
del df_genes["GeneID"]

In [34]:
df_genes[f.locus_tag=="ANME2D_RS14405", ["name", "symbol", "product_accession"]][0, [0,1,2]].to_list()

[['alanine dehydrogenase'], [''], ['WP_048093069.1']]

In [38]:
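for gene in model.genes:
    # Select the row of this gene; iterating the frame yields one column at a time
    matched = df_genes[f.locus_tag == gene.id, ["name", "symbol", "product_accession"]]
    name, symbol, ncbiprotein = [m[0, 0] for m in matched]
    gene.name = name
    gene.annotation = {
        "locus_tag": gene.id,
        "ncbiprotein": ncbiprotein,
    }
    # Most genes have no symbol in the feature table; only store non-empty ones
    if symbol:
        gene.annotation["symbol"] = symbol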
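
The loop above filters the frame once per gene. A single pass that builds a dictionary keyed by locus tag may be cheaper for larger models. This is a minimal sketch in plain Python, assuming the rows are available as tuples (for a datatable frame, `df_genes.to_tuples()` should provide them); `build_lookup` is a hypothetical helper.

```python
def build_lookup(rows):
    """Map locus_tag -> (name, symbol, product_accession) in one pass."""
    lookup = {}
    for locus_tag, name, symbol, accession in rows:
        lookup[locus_tag] = (name, symbol, accession)
    return lookup


# Two rows copied from the df_genes preview above.
rows = [
    ("ANME2D_RS14405", "alanine dehydrogenase", "", "WP_048093069.1"),
    ("ANME2D_RS01360", "argininosuccinate lyase", "argH", "WP_048088486.1"),
]
lookup = build_lookup(rows)

name, symbol, ncbiprotein = lookup["ANME2D_RS01360"]
annotation = {"locus_tag": "ANME2D_RS01360", "ncbiprotein": ncbiprotein}
if symbol:  # only store a symbol when the feature table provides one
    annotation["symbol"] = symbol
print(annotation)
```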

In [40]:
model.genes.ANME2D_RS00220

0,1
Gene identifier,ANME2D_RS00220
Name,phosphoglycerate dehydrogenase
Memory address,0x07f456605ac10
Functional,True
In 1 reaction(s),PGCD


In [42]:
model.genes.ANME2D_RS00220.annotation

{'locus_tag': 'ANME2D_RS00220', 'ncbiprotein': 'WP_048088203.1'}

Next, we write the ncbiprotein (RefSeq protein) accession of every gene to a file, so that the list can be mapped to UniProt [here](https://www.uniprot.org/uploadlists/).

In [43]:
with open(ROOT / "refseq_proteins", "w") as file:
    file.write("\n".join([gene.annotation["ncbiprotein"] for gene in model.genes]))

The result was downloaded as a tab-separated table.

In [46]:
uni = dt.fread(ROOT / "uniprot-yourlistM2021102892C7BAECDB1C5C413EE0E0348724B682257D40T.tab")

In [53]:
uni.names = {"yourlist:M2021102892C7BAECDB1C5C413EE0E0348724B682257D40T": "ncbiprotein"}

In [54]:
uni.head()

Unnamed: 0_level_0,ncbiprotein,Entry,Entry name,Status,Protein names,Gene names,Organism,Length
Unnamed: 0_level_1,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪
0,WP_048093069.1,A0A062V0U9,A0A062V0U9_9EURY,unreviewed,Alanine dehydrogenase (AlaDH) (EC 1.4.1.1),ala ANME2D_03016,Candidatus Methanoperedens nitroreducens,329
1,WP_048089698.1,A0A062VAG2,A0A062VAG2_9EURY,unreviewed,Aminotransferase (EC 2.6.1.-),ANME2D_01156,Candidatus Methanoperedens nitroreducens,366
2,WP_048090317.1,A0A062V4F6,A0A062V4F6_9EURY,unreviewed,Aminotransferase (EC 2.6.1.-),ANME2D_01641,Candidatus Methanoperedens nitroreducens,379
3,WP_048090480.1,A0A062V4T2,A0A062V4T2_9EURY,unreviewed,Aminotransferase (EC 2.6.1.-),ANME2D_01743,Candidatus Methanoperedens nitroreducens,384
4,WP_048088486.1,A0A062VBZ7,A0A062VBZ7_9EURY,unreviewed,Argininosuccinate lyase (ASAL) (EC 4.3.2.1) (Argin…,argH ANME2D_00289,Candidatus Methanoperedens nitroreducens,488
5,WP_048089156.1,A0A062V3K3,A0A062V3K3_9EURY,unreviewed,Adenylosuccinate synthetase (AMPSase) (AdSS) (EC 6…,purA ANME2D_00749,Candidatus Methanoperedens nitroreducens,421
6,WP_048088213.1,A0A062V905,A0A062V905_9EURY,unreviewed,Adenylosuccinate lyase (ASL) (EC 4.3.2.2) (Adenylo…,ANME2D_00062,Candidatus Methanoperedens nitroreducens,445
7,WP_048088217.1,A0A062V6T6,A0A062V6T6_9EURY,unreviewed,Aspartate ammonia-lyase (EC 4.3.1.1),ANME2D_00066,Candidatus Methanoperedens nitroreducens,477
8,WP_048088667.1,A0A062V9X1,A0A062V9X1_9EURY,unreviewed,Aspartate carbamoyltransferase regulatory chain,pyrI ANME2D_00389,Candidatus Methanoperedens nitroreducens,153
9,WP_048088669.1,A0A062VCA0,A0A062VCA0_9EURY,unreviewed,Aspartate carbamoyltransferase (EC 2.1.3.2) (Aspar…,pyrB ANME2D_00390,Candidatus Methanoperedens nitroreducens,302


In [55]:
dt.unique(uni["ncbiprotein"]).nrows

359

In [56]:
uni["ncbiprotein"].nrows

359
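
The two counts agree, so no RefSeq accession appears twice in the mapping. Had they differed, the repeated accessions could be listed with a counter. This is a small sketch on a plain list (the repeated value is constructed for illustration); `duplicated` is a hypothetical helper.

```python
from collections import Counter


def duplicated(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, n in counts.items() if n > 1]


# The first accession is repeated on purpose to show the check firing.
ids = ["WP_048093069.1", "WP_048089698.1", "WP_048093069.1"]
print(duplicated(ids))
```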

In [58]:
for gene in model.genes:
    # Not every RefSeq protein was mapped, so only annotate genes with a hit
    matched = uni[f.ncbiprotein == gene.annotation["ncbiprotein"], "Entry"]
    if matched.nrows:
        gene.annotation["uniprot"] = matched[0, 0]

In [59]:
model.genes.ANME2D_RS00220.annotation

{'locus_tag': 'ANME2D_RS00220',
 'ncbiprotein': 'WP_048088203.1',
 'uniprot': 'A0A062VCL9'}
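
The UniProt table holds 359 rows while the model has 452 genes, so some genes may end up without a `uniprot` key. The following sketch lists them, using plain dictionaries in place of `gene.annotation`; the second entry is made up for illustration and `missing_key` is a hypothetical helper.

```python
def missing_key(annotations, key="uniprot"):
    """Return the identifiers whose annotation dict lacks `key`."""
    return [gene_id for gene_id, ann in annotations.items() if key not in ann]


annotations = {
    "ANME2D_RS00220": {
        "locus_tag": "ANME2D_RS00220",
        "ncbiprotein": "WP_048088203.1",
        "uniprot": "A0A062VCL9",
    },
    # Hypothetical gene with no UniProt match, for illustration only.
    "GENE_WITHOUT_MATCH": {
        "locus_tag": "GENE_WITHOUT_MATCH",
        "ncbiprotein": "WP_PLACEHOLDER",
    },
}
unmapped = missing_key(annotations)
print(unmapped)
```

With the model loaded, the same function could be applied to `{gene.id: gene.annotation for gene in model.genes}`.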

In [60]:
cobra.io.write_sbml_model(model, model_file)

# SBO terms

In [63]:
# SBO:0000243 is the Systems Biology Ontology term for "gene"
for gene in model.genes:
    gene.annotation["sbo"] = "SBO:0000243"

In [65]:
cobra.io.write_sbml_model(model, model_file)