# Introduction
In Issue 100, we discussed adding gene ID information to the reactions in the model. Currently, GPRs are stored based on the original annotation. But since we have added more reactions that we have evidence for, we need to ensure consistency across the model.

To do so, we will make all GPRs fit the RTMOXXXXX format we currently use. Then, for each gene associated with a reaction, I will add the KEGG ID to the gene annotation so that it can be traced back to the published P. thermoglucosidasius genome in KEGG. Finally, in some cases we will add the refseq_name annotation, which can contain the gene name if we want.

In [1]:
import cameo
import pandas as pd
import cobra.io
import escher

In [2]:
model = cobra.io.read_sbml_model('../model/g-thermo.xml')

In [3]:
model_e_coli = cameo.load_model('iML1515')

First, I will make sure that all the genes in the model comply with the RTMOXXXXX system used in the original annotation. To do so, I will add one to the highest number among the current genes (RTMO05952). I will start by filtering for the reactions without an associated gene.

Then, for the reactions where the gene name was stored as the gene, I will move that name to the gene.annotation['refseq_name'] field.

N.B.: the NADKX reactions should get the same gene annotation as the NADK reaction, as they are side reactions of the same enzyme. I will just fix that by hand.

In [4]:
model.reactions.NADK1.gene_reaction_rule = 'RTMO02237 or RTMO03852'
model.reactions.NADK2.gene_reaction_rule = 'RTMO02237 or RTMO03852'
model.reactions.NADK3.gene_reaction_rule = 'RTMO02237 or RTMO03852'
model.reactions.NADK4.gene_reaction_rule = 'RTMO02237 or RTMO03852'
model.reactions.NADK5.gene_reaction_rule = 'RTMO02237 or RTMO03852'

In [5]:
#save&commit
cobra.io.write_sbml_model(model,'../model/g-thermo.xml')

In [67]:
#give an RTMO gene ID to every non-exchange reaction that lacks one
for rct in model.reactions:
    #first filter for reactions without a gene
    if not rct.gene_reaction_rule: #reactions without GPR
        #filter out for exchange reactions
        if rct.id.startswith('EX'):
            continue
        else:
            #find newest highest gene number
            genes =[]
            for gene in model.genes:
                if gene.id.startswith('RTMO'): #just find the RTMO genes
                    genes.append(gene.id[4:]) #add just the numbers
            genes.sort(reverse = True) #sort with the highest up top
            highest = int(genes[2]) #skip the FA synthase, as this one has a bit of a different numbering
            new = highest + 1
            rct.gene_reaction_rule = f"RTMO{new:05d}" #zero-pad to five digits
    elif rct.gene_reaction_rule.startswith('RTMO'): #these are fine, just leave them
        continue
    else: #these are the ones with 'letter' names
        #here I need to store the original name, add the RTMO system and then move the old name
        old_name = rct.gene_reaction_rule #store name
        #then give it a new RTMOXXXXX name, same as above
        genes =[]
        for gene in model.genes:
            if gene.id.startswith('RTMO'): #just find the RTMO genes
                genes.append(gene.id[4:]) #add just the numbers
        genes.sort(reverse = True) #sort with the highest up top
        highest = int(genes[2]) #skip the FA synthase, as this one has a bit of a different numbering
        new = highest + 1
        rct.gene_reaction_rule = f"RTMO{new:05d}" #zero-pad to five digits
        #then store the old name as the name of the new gene
        new_gene = rct.gene_reaction_rule
        model.genes.get_by_id(new_gene).name = old_name
        continue
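The numbering step above can also be written as a small helper that compares the numeric parts as integers instead of sorting strings. This is only a sketch: the function name `next_rtmo_id` and the `skip_top` parameter are made up for illustration, and `skip_top=2` assumes (as the `genes[2]` index above does) that the two highest numbers belong to the FA synthase genes. The gene IDs in the toy list are placeholders, not IDs from the model.

```python
def next_rtmo_id(gene_ids, skip_top=2):
    """Return the next free RTMO ID, ignoring the `skip_top` highest numbers.

    Hypothetical helper: skip_top=2 mirrors the genes[2] index used above,
    on the assumption that the two highest numbers are the FA synthase genes.
    """
    numbers = sorted(
        (int(gid[4:]) for gid in gene_ids
         if gid.startswith('RTMO') and gid[4:].isdigit()),
        reverse=True,
    )
    highest = numbers[skip_top]  # raises IndexError if there are too few RTMO genes
    return f"RTMO{highest + 1:05d}"


# toy gene list; the two large IDs stand in for the FA synthase genes
example_ids = ['RTMO00001', 'RTMO05952', 'RTMO90001', 'RTMO90002', 'fabH']
print(next_rtmo_id(example_ids))
```

Comparing integers avoids surprises if the numeric parts ever differ in length, which a plain string sort would order incorrectly.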

In [6]:
#save&commit
cobra.io.write_sbml_model(model,'../model/g-thermo.xml')

While doing the above, I noticed that our model has some genes that are not associated with any reaction. Here I will list them and remove them.

In [3]:
genes_remove = []
for gene in model.genes:
    if len(gene.reactions) == 0:
        genes_remove.append(gene)
    else: 
        continue

In [4]:
len(genes_remove)

111

In [5]:
#remove genes with no reaction
for gene in genes_remove:
    model.genes.remove(gene)

In [6]:
#save & commit
cobra.io.write_sbml_model(model,'../model/g-thermo.xml')

Now that the RTMO gene IDs are consistent, it would be nice to have a bit more information associated with them. For that, I will first try to map the EC code of each reaction to the KEGG gene ID associated with it, making it easier to identify what an RTMO number means.

To do so, I will use this database: http://rest.kegg.jp/link/ptl/enzyme, which links each annotated gene in the KEGG genome to an enzyme. The KEGG genes that are found will be added to that gene's annotation as kegg.genes.

In [6]:
#import the data
df_kegg_genes = pd.read_csv('http://rest.kegg.jp/link/ptl/enzyme', header=None, sep = '\t')

In [7]:
#need to change headers
df_kegg_genes.columns = ['EC', 'Gene']

In [8]:
#need to get rid of the 'ec:' and 'ptl:' parts
df_kegg_genes['EC'] = df_kegg_genes['EC'].str.replace(r'ec:', '')

In [9]:
df_kegg_genes['Gene'] = df_kegg_genes['Gene'].str.replace(r'ptl:', '')
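As an illustration of what the cleaned table represents, the same two-column layout can be turned into a plain EC-to-genes lookup. This is a sketch using only the standard library; the sample rows and gene IDs (`GENE_A` etc.) are made up, and `strip_prefix` is a hypothetical helper, not part of the notebook.

```python
import csv
import io
from collections import defaultdict

# made-up sample in the two-column, tab-separated layout of the KEGG link output
sample = "ec:1.1.1.1\tptl:GENE_A\nec:1.1.1.1\tptl:GENE_B\nec:2.7.1.2\tptl:GENE_C\n"


def strip_prefix(value):
    """Drop everything up to and including the first ':' (e.g. 'ec:' or 'ptl:')."""
    return value.partition(':')[2] if ':' in value else value


# one EC number can map to several genes, so collect them in lists
ec_to_genes = defaultdict(list)
for ec, gene in csv.reader(io.StringIO(sample), delimiter='\t'):
    ec_to_genes[strip_prefix(ec)].append(strip_prefix(gene))

print(dict(ec_to_genes))
```

A dictionary like this makes the "no match" case explicit: a missing EC number is simply absent from the keys.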

In [10]:
no_anno = []
diff_ec = []
no_kegg = []
for gene in model.genes:
    gene.annotation['kegg.genes'] =[]
    try:
        ec = list(gene.reactions)[0].annotation['ec-code']#lift the ec-codes from the first reaction 
        if len(str(ec)) > 11: #this is when there are two or more ec codes:
            first = list(gene.reactions)[0].annotation['ec-code'][0]
        elif len(str(ec)) <= 11: #this is when there is just one ec code:
            first = list(gene.reactions)[0].annotation['ec-code']
        try: 
            second = list(gene.reactions)[1].annotation['ec-code']
            if first in second: #i.e. the reactions have the same ec code
                try:
                    kegg_gene = df_kegg_genes.loc[df_kegg_genes["EC"] == first]
                    for index, row in kegg_gene.iterrows(): #for each gene found, it should be added
                        anno = row['Gene'] #find the KEGG gene for that ec-code
                        gene.annotation['kegg.genes'] = gene.annotation['kegg.genes'] + [anno]
                except IndexError:
                    no_kegg.append(gene.id) #when the e.c. code doesnt have a matching kegg gene
            else: #if the two recations have different ec-codes
                diff_ec.append(gene.id)
        except IndexError: #i.e. if there is no second reaction for that gene
            try: #try to map the first ec-code that is there
                kegg_gene = df_kegg_genes.loc[df_kegg_genes["EC"] == first]
                for index, row in kegg_gene.iterrows(): #for each gene found, it should be added
                        anno = row['Gene'] #find the KEGG gene for that ec-code
                        gene.annotation['kegg.genes'] = gene.annotation['kegg.genes'] + [anno]
            except IndexError:#when the first e.c. code doesnt have a matching kegg gene
                #try the second e.c. code
                ec2= list(gene.reactions)[0].annotation['ec-code']
                if len(str(ec2)) > 11: #this is when there are two or more ec codes:
                    second_ec = list(gene.reactions)[0].annotation['ec-code'][1]
                elif len(str(ec)) <= 11: #this is when there is just one ec code:
                    continue #has been covered by the code above
                try:
                    kegg_gene = df_kegg_genes.loc[df_kegg_genes["EC"] == second_ec]
                    for index, row in kegg_gene.iterrows(): #for each gene found, it should be added
                        anno = row['Gene'] #find the KEGG gene for that ec-code
                        gene.annotation['kegg.genes'] = gene.annotation['kegg.genes'] + [anno]
                except IndexError:
                    no_kegg.append(gene.id) #when the first and second e.c. code doesnt have a matching kegg gene
    except KeyError: #one of the first two reactions has no ec-code, so store the gene for later
        no_anno.append(gene.id)        
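The cell above distinguishes a single ec-code from a list of ec-codes by the length of its string representation. A sketch of an alternative is to normalise the annotation to a list first and then look every code up. The helper names `as_ec_list` and `kegg_genes_for` and the lookup table are hypothetical, and unlike the cell above this version drops duplicate genes.

```python
def as_ec_list(ec_annotation):
    """Return the 'ec-code' annotation as a list, whether it was a str or a list."""
    if ec_annotation is None:
        return []
    if isinstance(ec_annotation, str):
        return [ec_annotation]
    return list(ec_annotation)


def kegg_genes_for(ec_annotation, ec_to_genes):
    """Collect KEGG genes for every ec-code, keeping order and dropping duplicates."""
    found = []
    for ec in as_ec_list(ec_annotation):
        for gene in ec_to_genes.get(ec, []):  # unknown codes yield nothing
            if gene not in found:
                found.append(gene)
    return found


# hypothetical lookup table (EC number -> KEGG gene IDs); the values are made up
ec_to_genes = {'1.1.1.1': ['GENE_A', 'GENE_B'], '2.7.1.2': ['GENE_C']}
print(kegg_genes_for('1.1.1.1', ec_to_genes))
print(kegg_genes_for(['9.9.9.9', '2.7.1.2'], ec_to_genes))
print(kegg_genes_for('9.9.9.9', ec_to_genes))
```

An empty result is then an ordinary return value that can be tested with `if not genes:`, rather than something signalled by an exception.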

In [10]:
#save&commit
cobra.io.write_sbml_model(model,'../model/g-thermo.xml')

Now we have three lists: 
- diff_ec (109 genes): these have reactions with different ec-codes. Here I will need to go through all the reactions of the gene and combine all of the matching kegg.genes, so that the gene in our model carries every KEGG gene found.


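- no_kegg (0 genes): no gene ended up in this list. Note that a lookup without a match returns an empty dataframe instead of raising an IndexError, so genes whose ec-code has no KEGG match are not collected here; they simply keep an empty kegg.genes list.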


- no_anno (145 genes): the first reaction of this gene (or the second, when there is one) has no ec-code, so we can't match a KEGG gene to it. Here I will go through all the genes. If they encode a transporter I will ignore them (as these don't seem to have a KEGG annotation?) and then see which ones remain and may need more checking.
 

__Fix diff_ec__

Here I will write the script that should fix some of the genes in the diff_ec list, to decrease the number of genes without an annotation associated with them.

In [19]:
for gene in model.genes:
    if gene.id in diff_ec: #just get the ones we had an issue with
        tot_anno = []
        for rct in gene.reactions: #iterate through all the reactions of this gene
            ec_codes = rct.annotation['ec-code']  
            if len(str(ec_codes)) <= 11: #this is when there is just one ec code:
                ec = rct.annotation['ec-code']
                try:
                    kegg_gene = df_kegg_genes.loc[df_kegg_genes["EC"] == ec]
                    for index, row in kegg_gene.iterrows(): #for each gene found, it should be added
                        anno = row['Gene'] #find the KEGG gene for that ec-code
                        tot_anno.append(anno)
                except IndexError:
                    continue
            elif len(str(ec_codes)) > 11: #this is when there is more than one ec code:
                for ec in ec_codes:
                    try:
                        kegg_gene = df_kegg_genes.loc[df_kegg_genes["EC"] == ec]
                        for index, row in kegg_gene.iterrows(): #for each gene found, it should be added
                            anno = row['Gene'] #find the KEGG gene for that ec-code
                            tot_anno.append(anno)
                    except IndexError:
                        continue
        gene.annotation['kegg.genes'] = tot_anno
    else:
        continue

In [20]:
cobra.io.write_sbml_model(model,'../model/g-thermo.xml')

I have now added the KEGG ID to the genes that were missing one because the E.C. codes of their reactions didn't match each other. We've gone from 425 unannotated genes to 319.

Finally, I will check what is wrong with the genes in no_anno to try to fix those too. Some of these genes are associated with transport reactions, which do not necessarily come with a gene annotation in KEGG.

In [11]:
for gene in model.genes:
    if gene.id in no_anno:
        if len(gene.reactions) == 1:#when they only have one reaction, check if it is a transport
            rct_id = list(gene.reactions)[0].id
            if rct_id.endswith(('t', 'abc', 'pts')): #remove transport reactions from this list
                no_anno.remove(gene.id)
            else:
                continue
        else: #keep genes with more than one reaction, then we can check them later
            continue
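The transport filter above can be sketched as a function that builds a new list instead of removing items in place. Everything here is illustrative: the suffix tuple follows the three endings checked above, while `drop_single_transport_genes`, the gene IDs and the reaction IDs are made up, and a plain dictionary stands in for `gene.reactions`.

```python
TRANSPORT_SUFFIXES = ('t', 'abc', 'pts')  # the three ID endings checked above


def is_transport(reaction_id):
    """True when the reaction ID ends in one of the transport suffixes."""
    return reaction_id.endswith(TRANSPORT_SUFFIXES)


def drop_single_transport_genes(gene_ids, gene_to_reactions):
    """Keep genes unless their only reaction is a transport reaction."""
    kept = []
    for gid in gene_ids:
        reactions = gene_to_reactions[gid]
        if len(reactions) == 1 and is_transport(reactions[0]):
            continue  # single transport reaction: no KEGG match expected
        kept.append(gid)
    return kept


# made-up stand-in for gene.reactions (gene ID -> reaction IDs)
gene_to_reactions = {
    'RTMO00001': ['GLCpts'],
    'RTMO00002': ['PGI'],
    'RTMO00003': ['O2t', 'CAT'],
    'RTMO00004': ['ACt'],
}
print(drop_single_transport_genes(list(gene_to_reactions), gene_to_reactions))
```

Because the same filter is applied several times in this notebook, a function like this would also avoid repeating the loop.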

So there are only 49 genes that are not transport genes and that don't have a KEGG ID associated with them. I will look into them a bit more to try to figure out a solution.

In [12]:
len(no_anno)

49

In some cases, it seems that the first reaction doesn't have an ec-code, but the second one does. So code similar to what I used for the diff_ec list should solve the issue for some of these genes too.

In [13]:
for gene in model.genes:
    if gene.id in no_anno: #just get the ones we had an issue with
        tot_anno = []
        for rct in gene.reactions: #iterate through all the reactions of this gene
            try: 
                ec_codes = rct.annotation['ec-code']  
                if len(str(ec_codes)) <= 11: #this is when there is just one ec code:
                    ec = rct.annotation['ec-code']
                    try:
                        kegg_gene = df_kegg_genes.loc[df_kegg_genes["EC"] == ec]
                        for index, row in kegg_gene.iterrows(): #for each gene found, it should be added
                            anno = row['Gene'] #find the KEGG gene for that ec-code
                            tot_anno.append(anno)
                    except IndexError:
                        continue
                elif len(str(ec_codes)) > 11: #this is when there is more than one ec code:
                    for ec in ec_codes:
                        try:
                            kegg_gene = df_kegg_genes.loc[df_kegg_genes["EC"] == ec]
                            for index, row in kegg_gene.iterrows(): #for each gene found, it should be added
                                anno = row['Gene'] #find the KEGG gene for that ec-code
                                tot_anno.append(anno)
                        except IndexError:
                            continue
            
            except KeyError:
                continue
                #print(gene.id)
        gene.annotation['kegg.genes'] = tot_anno
    else:
        continue

This only solved the issue for 2 genes... It is something, but the remaining ones are worth looking into further.

RTMO12345 appears often, but looking into it, it actually has kegg.genes associated with it... so something is wrong with this no_anno list. I should clean it and remove every gene that actually does have a kegg.genes annotation.

In [14]:
for gene in model.genes:
    if gene.id in no_anno:
        if len(gene.annotation['kegg.genes']) > 0:#when the gene already has an annotation
            no_anno.remove(gene.id) #remove it from the list
        else:
            continue
    else: 
        continue

This again only removed two genes from the list. The remaining genes all belong to reactions that don't have an ec-code in their annotation and so won't be mapped to KEGG IDs this way.

However, there are still many unannotated genes: 299. Around 100 of those we know won't be annotated as they are transporters, so there are still about 200 that are unannotated without a logical explanation in my eyes. I will take a closer look to see if I can get those annotated further.

In [27]:
unannotated =[]
for genes in model.genes:
    if not len(genes.annotation['kegg.genes']):
        unannotated.append(genes.id)
    else:
        continue
len(unannotated)

299

In [30]:
#remove transport reactions from unannotated
for gene in model.genes:
    if gene.id in unannotated:
        if len(gene.reactions) == 1:#when they only have one reaction, check if it is a transport
            rct_id = list(gene.reactions)[0].id
            if rct_id.endswith(('t', 'abc', 'pts')): #remove transport reactions from this list
                unannotated.remove(gene.id)
            else:
                continue
        else: #keep genes with more than one reaction, then we can check them later
            continue

Some of these genes have a reaction, e.g. CAT, whose first ec-code doesn't match but whose second ec-code does. So here, I think I can fix this by modifying and running some code I wrote earlier.

In [46]:
for gene in model.genes:
    if gene.id in unannotated: #just get the ones we had an issue with
        tot_anno = []
        for rct in gene.reactions: #iterate through all the reactions of this gene
            try: 
                ec_codes = rct.annotation['ec-code']  
                if len(str(ec_codes)) <= 11: #this is when there is just one ec code:
                    ec = rct.annotation['ec-code']
                    try:
                        kegg_gene = df_kegg_genes.loc[df_kegg_genes["EC"] == ec]
                        for index, row in kegg_gene.iterrows(): #for each gene found, it should be added
                            anno = row['Gene'] #find the KEGG gene for that ec-code
                            tot_anno.append(anno)
                    except IndexError:
                        continue
                elif len(str(ec_codes)) > 11: #this is when there is more than one ec code:
                    for ec in ec_codes:
                        try:
                            kegg_gene = df_kegg_genes.loc[df_kegg_genes["EC"] == ec]
                            for index, row in kegg_gene.iterrows(): #for each gene found, it should be added
                                anno = row['Gene'] #find the KEGG gene for that ec-code
                                tot_anno.append(anno)
                        except IndexError:
                            continue
            except KeyError:
                continue
        gene.annotation['kegg.genes'] = tot_anno
    else:
        continue

In [47]:
unannotated =[]
for genes in model.genes:
    if not len(genes.annotation['kegg.genes']):
        unannotated.append(genes.id)
    else:
        continue
len(unannotated)

218

In [48]:
#remove transport reactions from unannotated
for gene in model.genes:
    if gene.id in unannotated:
        if len(gene.reactions) == 1:#when they only have one reaction, check if it is a transport
            rct_id = list(gene.reactions)[0].id
            if rct_id.endswith(('t', 'abc', 'pts')): #remove transport reactions from this list
                unannotated.remove(gene.id)
            else:
                continue
        else: #keep genes with more than one reaction, then we can check them later
            continue

In [49]:
len(unannotated)

120

So now we only have 120 genes without the kegg.genes annotation. Looking into these more, this is because their reactions either don't have an ec-code, or the ec-code they have doesn't give a match in the dataframe. So I will just leave these.

In [58]:
#save&commit
cobra.io.write_sbml_model(model,'../model/g-thermo.xml')

Another thing I noticed is that the newly added genes didn't get an SBO term, which they should have. So I will just add that now.

In [73]:
for gene in model.genes:
    if int(gene.id[-4:]) > 5952: #i.e. all the newly added genes
        gene.annotation['sbo'] = 'SBO:0000243'
    else:
        continue

In [74]:
#save&commit
cobra.io.write_sbml_model(model,'../model/g-thermo.xml')

Using the KEGG database, we can now also add the UniProt IDs to each of the genes in our model to provide extra information. I will add these to the annotation field ['uniprot']. In the same manner, we can add the ncbi-proteinid information to the model in the annotation field ['ncbigi']. I will do that now.

To do these two things, I will use the following databases:
- http://rest.kegg.jp/conv/uniprot/ptl 

- http://rest.kegg.jp/conv/ncbi-proteinid/ptl

In [3]:
#import data for uniprot
df_uniprot = pd.read_csv('http://rest.kegg.jp/conv/uniprot/ptl', header=None, sep = '\t')

In [4]:
#need to change headers
df_uniprot.columns = ['Gene', 'UniprotID']

In [5]:
#need to get rid of the 'up:' and 'ptl:' parts
df_uniprot['Gene'] = df_uniprot['Gene'].str.replace(r'ptl:', '')

In [6]:
df_uniprot['UniprotID'] = df_uniprot['UniprotID'].str.replace(r'up:', '')

In [27]:
no_anno_uni = []

for gene in model.genes:
    try: #try to lift the genes annotation
        tot_anno = []
        kegg = gene.annotation['kegg.genes']
        if isinstance(kegg, str): #i.e. there is only one gene annotation
            uni_id = df_uniprot.loc[df_uniprot["Gene"] == kegg]
            for index, row in uni_id.iterrows(): #for each uniprot ID found, it should be added
                prot_id = row['UniprotID'] #collect the uniprot ID
                tot_anno.append(prot_id)
        elif isinstance(kegg, list): #i.e. if there are multiple kegg gene IDs, we need the uniprot ID for each
            for anno in kegg: #go through each gene annotation
                uni_id = df_uniprot.loc[df_uniprot["Gene"] == anno]
                for index, row in uni_id.iterrows(): #for each uniprot ID found, it should be added
                    prot_id = row['UniprotID'] #collect the uniprot ID
                    tot_anno.append(prot_id)
        gene.annotation['uniprot'] = tot_anno
    except KeyError: #if they dont have the kegg.genes annotation they will be added here
        no_anno_uni.append(gene.id)

In [28]:
len(no_anno_uni) #len should be 218, which it is!

218
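The ID conversion above boils down to looking up each KEGG gene in a gene-to-ID table, where one gene may have several entries. A minimal sketch of that idea with plain dictionaries follows; `build_lookup` and `map_ids` are hypothetical names, and the gene and protein IDs are placeholders, not real entries.

```python
from collections import defaultdict


def build_lookup(pairs):
    """Turn (kegg_gene, target_id) pairs into a gene -> [target IDs] dictionary."""
    lookup = defaultdict(list)
    for gene, target in pairs:
        lookup[gene].append(target)
    return dict(lookup)


def map_ids(kegg_annotation, lookup):
    """Return all target IDs for a kegg.genes annotation given as a str or a list."""
    genes = [kegg_annotation] if isinstance(kegg_annotation, str) else list(kegg_annotation)
    ids = []
    for gene in genes:
        ids.extend(lookup.get(gene, []))  # genes without an entry add nothing
    return ids


# placeholder rows in the (gene, ID) order of the conversion table
pairs = [('GENE_A', 'PROT_1'), ('GENE_A', 'PROT_2'), ('GENE_B', 'PROT_3')]
lookup = build_lookup(pairs)
print(map_ids('GENE_A', lookup))
print(map_ids(['GENE_B', 'GENE_X'], lookup))
```

The same two functions would work unchanged for the ncbi-proteinid table, since only the contents of the lookup differ.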

In [29]:
#save&commit
cobra.io.write_sbml_model(model,'../model/g-thermo.xml')

I will now do the same for the NCBI protein IDs as I did for the UniProt IDs.

In [3]:
#import data for the ncbi protein IDs
df_ncbi = pd.read_csv('http://rest.kegg.jp/conv/ncbi-proteinid/ptl', header=None, sep = '\t')

In [4]:
#need to change headers
df_ncbi.columns = ['Gene', 'NCBIprotID']

In [5]:
#need to get rid of the 'ncbi-proteinid:' and 'ptl:' parts
df_ncbi['Gene'] = df_ncbi['Gene'].str.replace(r'ptl:', '')

In [6]:
df_ncbi['NCBIprotID'] = df_ncbi['NCBIprotID'].str.replace(r'ncbi-proteinid:', '')

In [7]:
no_anno_ncbi = []
for gene in model.genes:
    try: #try to lift the genes annotation
        tot_anno = []
        kegg = gene.annotation['kegg.genes']
        if isinstance(kegg, str): #i.e. there is only one gene annotation
            ncbi_id = df_ncbi.loc[df_ncbi["Gene"] == kegg]
            for index, row in ncbi_id.iterrows(): #for each ncbi protein ID found, it should be added
                prot_id = row['NCBIprotID'] #collect the ncbi protein ID
                tot_anno.append(prot_id)
        elif isinstance(kegg, list): #i.e. if there are multiple kegg gene IDs, we need the ncbi protein ID for each
            for anno in kegg: #go through each gene annotation
                ncbi_id = df_ncbi.loc[df_ncbi["Gene"] == anno]
                for index, row in ncbi_id.iterrows(): #for each ncbi protein ID found, it should be added
                    prot_id = row['NCBIprotID'] #collect the ncbi protein ID
                    tot_anno.append(prot_id)
        gene.annotation['ncbigi'] = tot_anno
    except KeyError: #if they dont have the kegg.genes annotation they will be added here
        no_anno_ncbi.append(gene.id)

In [8]:
len(no_anno_ncbi) #len should be 218, which it is!

218

In [10]:
#save&commit
cobra.io.write_sbml_model(model,'../model/g-thermo.xml')

After the changes made here, I will re-run memote to check that the gene annotations have improved. I will attach the report as '../reports/2020-07-02-517b016.html'. We can see that the quality of the gene annotations has improved to 46%. It is still not great, but it is also not crucial to put a lot of energy into improving this further.

N.B.: you can see that mass balance is no longer 100%. This is an issue caused by some metabolites that lost their formula. I will add the formulas back in now.

In [5]:
model.metabolites.ps_c.formula = 'C38.14H74.28NO10P'
model.metabolites.pa_c.formula = 'C35.14H69.28O8P'
model.metabolites.cdpdag_c.formula = 'C44.14H79.28N3O15P2'
model.metabolites.aglyc3p_c.formula = 'C19.07H38.14O7P'
model.metabolites.pg_c.formula = 'C38.14H75.28O10P'
model.metabolites.pgp_c.formula = 'C38.14H76.28O13P2'
model.metabolites.pe_c.formula = 'C37.14H74.28NO8P'
model.metabolites.clpn_c.formula = 'C73.28H142.56O17P2'
model.metabolites.acylACP_c.formula = 'C27.07H51.14O8N2PRS'
model.metabolites.acylcoa_c.formula = 'C37.07H63.14O17N7P3S'
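The formulas restored above contain non-integer element counts. As a quick sanity check on such strings, the sketch below splits a formula into element counts with a regular expression. `parse_formula` is a hypothetical helper written for this illustration and is not how cobra or memote read formulas; it assumes element symbols are one uppercase letter optionally followed by one lowercase letter, and it silently ignores any other characters.

```python
import re

# element symbol followed by an optional (possibly fractional) count
TOKEN = re.compile(r'([A-Z][a-z]?)(\d+(?:\.\d+)?)?')


def parse_formula(formula):
    """Return a dict of element -> count; a missing count means 1."""
    counts = {}
    for element, amount in TOKEN.findall(formula):
        counts[element] = counts.get(element, 0.0) + (float(amount) if amount else 1.0)
    return counts


print(parse_formula('C38.14H74.28NO10P'))    # the ps_c formula from above
print(parse_formula('C27.07H51.14O8N2PRS'))  # the acylACP_c formula, with the generic R group
```

Summing such dictionaries over the metabolites of a reaction, weighted by their stoichiometric coefficients, is one way to see which element is out of balance.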

In [6]:
#save&commit
cobra.io.write_sbml_model(model,'../model/g-thermo.xml')

The mass balance is now fixed again, and I've uploaded a new memote report: '../reports/2020-07-02-29b13d9.html'