This script generates the literature data to be annotated. Each reference must be assigned to one of two topics: activity assay of sequences or organisms (1), or sequencing or unrelated (2). During the construction of the graph (`2-ref-org-acc-graph.ipynb`), we excluded the references that were unrelated to activity.

### Generate the annotation file

In [46]:
import os
from os.path import join
import pandas as pd
CURRENT_DIR = os.getcwd()
print(CURRENT_DIR)

d:\Python\aox\enzyme-mining-aox


In [47]:
DATADIR = os.path.join(CURRENT_DIR, "data", "aox")
CURATEDIR = os.path.join(DATADIR, "graph", "ref_org_acc", "curate")

filenames = {
    # download or curate from the database website
    "pubmed": join(DATADIR, "raw", "pubmed.tsv"),
    "brenda_reference": join(DATADIR, "raw", "brenda_reference.tsv"),

    # results
    "reference": join(CURATEDIR, "references.tsv"),
    "reference_annotation": join(CURATEDIR, "references_annotation.tsv"),
}

Disambiguate the literature data and construct a unified local reference id (`rid`).

In [48]:
reference_pubmed = pd.read_csv(filenames['pubmed'], sep='\t')
reference_pubmed = reference_pubmed.drop(columns=['query', 'abstract', 'doi'])
reference_pubmed.head()

Unnamed: 0,title,year,pubmed id
0,Isolation of alcohol oxidase and two other met...,1985,3889590
1,Molecular cloning and characterization of a ge...,1985,2582370
2,Functional characterization of the two alcohol...,1989,2657390
3,Structural comparison of the Pichia pastoris a...,1989,2660463
4,Modification of flavin adenine dinucleotide in...,1991,1770353


In [49]:
reference_brenda = pd.read_csv(filenames['brenda_reference'], sep='\t')
reference_brenda = reference_brenda.rename(columns={
    'REF': 'brenda id',
    'TITLE ': 'title',
    'YEAR ': 'year',
    'PUBMED ID ': 'pubmed id'
})
reference_brenda['brenda id'] = reference_brenda['brenda id'].astype(str)
reference_brenda = reference_brenda[['brenda id', 'title', 'year', 'pubmed id']]

reference_brenda.head()

Unnamed: 0,brenda id,title,year,pubmed id
0,484905,"Alcohol oxidase, a flavoprotein from several B...",1968,5636370
1,484906,Alcohol oxidase from basidiomycetes,1975,236460
2,484907,Purification and characterization of alcohol o...,1979,-
3,484908,Purification and properties of alcohol oxidase...,1979,118005
4,484909,Irreversible inactivation of the flavoenzyme a...,1980,7006601
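
The BRENDA export above has trailing spaces in its column headers (`'TITLE '`, `'YEAR '`, `'PUBMED ID '`) and uses `-` where no PubMed id is available. As a minimal sketch, assuming a small in-memory stand-in for `brenda_reference.tsv` (the toy rows and the variable names `raw` and `df` are illustrative, not taken from the real file), the headers can be normalized before renaming and the placeholder turned into a proper missing value:

```python
import io
import pandas as pd

# Toy stand-in for brenda_reference.tsv (assumption: same header layout,
# including the trailing spaces seen in the rename mapping of the notebook).
raw = (
    "REF\tTITLE \tYEAR \tPUBMED ID \n"
    "484907\tPaper A\t1979\t-\n"
    "484908\tPaper B\t1979\t118005\n"
)

# Read everything as text so that ids are not converted to numbers.
df = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)

# Strip stray whitespace from the headers, then rename as in the notebook.
df.columns = df.columns.str.strip()
df = df.rename(columns={
    "REF": "brenda id",
    "TITLE": "title",
    "YEAR": "year",
    "PUBMED ID": "pubmed id",
})

# Treat the "-" placeholder as a missing PubMed id.
df["pubmed id"] = df["pubmed id"].replace("-", pd.NA)
print(df)
```

Stripping the headers first means the rename mapping no longer depends on the exact whitespace of the export.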


In [50]:
# # Convert the 'title' columns to sets
# set_pubmed_titles = set(reference_pubmed['title'])
# set_brenda_titles = set(reference_brenda['title'])

# # Calculate the intersection
# intersection_titles = set_pubmed_titles.intersection(set_brenda_titles)

# # Get the number of intersecting titles
# intersection_count = len(intersection_titles)

# print(f"Number of intersecting titles: {intersection_count}")


In [51]:
merged_reference = pd.concat([reference_pubmed, reference_brenda], axis=0, ignore_index=True, sort=False).fillna("")
merged_reference_sorted = merged_reference.sort_values(by=['year', 'title'], ascending=[True, True]).reset_index(drop=True)
merged_reference_sorted['type'] = -1
merged_reference_sorted['rid'] = merged_reference_sorted.index
merged_reference_sorted.insert(0, 'rid', merged_reference_sorted.pop('rid'))

merged_reference_sorted

Unnamed: 0,rid,title,year,pubmed id,brenda id,type
0,0,"Alcohol oxidase, a flavoprotein from several B...",1968,5636370,484905,-1
1,1,Alcohol oxidase from basidiomycetes,1975,236460,484906,-1
2,2,Purification and characterization of alcohol o...,1979,-,484907,-1
3,3,Purification and properties of alcohol oxidase...,1979,118005,484908,-1
4,4,Irreversible inactivation of the flavoenzyme a...,1980,7006601,484909,-1
...,...,...,...,...,...,...
117,117,Comprehensive insights into the production of ...,2021,-,762556,-1
118,118,Enantioselective oxidation of secondary alcoho...,2021,33910055,762679,-1
119,119,Genome analysis of Candida subhashii reveals i...,2021,34129020,,-1
120,120,Increased activity of alcohol oxidase at high ...,2021,33750541,762989,-1


In [52]:
# Write the merged table to file as a backup.
merged_reference_sorted.to_csv(filenames["reference_annotation"], sep='\t', index=False, header=True)

# Copy to be annotated manually.
# merged_reference_sorted.to_csv(filenames["reference"], sep='\t', index=False, header=True)


### Manually Annotate Literature Data

The file can be edited with external tools such as Excel. Each title was checked to assign the reference to a topic. Curating 122 references took a PhD-level curator about 9 minutes, i.e., a curation speed of roughly `4.5 s/paper`.
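
Because the `type` column is initialized to `-1` before manual editing, a quick check after annotation can reveal rows that were missed or mislabeled. The following is a minimal sketch on toy data; the helper name `unannotated_rows` and the example rows are hypothetical, and it assumes that only the labels `1` and `2` are valid:

```python
import pandas as pd

def unannotated_rows(df, allowed=(1, 2)):
    """Return the rows whose 'type' is not one of the allowed topic labels."""
    return df[~df["type"].isin(allowed)]

# Toy stand-in for the manually annotated table (assumption: same columns
# as the merged reference table, with 'type' filled in by hand).
annotated = pd.DataFrame({
    "rid": [0, 1, 2],
    "title": ["Paper A", "Paper B", "Paper C"],
    "type": [1, 2, -1],  # the last row still carries the -1 placeholder
})

leftover = unannotated_rows(annotated)
print(leftover["rid"].tolist())
```

An empty result would indicate that every reference received one of the two topic labels.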

### Analysis of the annotation results

The analysis was also carried out in Excel. Summary statistics:

| Source | Activity assay of sequences or organisms (1) | Sequencing or unrelated (2) | Total |
|--------|----------|----------|-------|
| All | 92 | 30 | 122 |
| Brenda | 79 | 8 | 87 |
| UniProt | 13(16) | 22(19) | 35 |
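
The same kind of summary could also be produced in pandas rather than Excel. This is only a sketch on toy data: it assumes that a reference counts as coming from Brenda when its `brenda id` is non-empty, which may not match how the sources were assigned for the table above, and the `source` column and the label `other` are introduced here for illustration.

```python
import pandas as pd

# Toy annotated table (assumption: 'brenda id' is an empty string for
# references that did not come from Brenda, as after fillna("")).
toy = pd.DataFrame({
    "rid": [0, 1, 2, 3, 4],
    "brenda id": ["484905", "484906", "", "", "762989"],
    "type": [1, 1, 2, 1, 2],
})

# Hypothetical source label derived from the presence of a Brenda id.
toy["source"] = toy["brenda id"].ne("").map({True: "Brenda", False: "other"})

# Count references per source and topic, with row and column totals.
summary = pd.crosstab(toy["source"], toy["type"], margins=True, margins_name="Total")
print(summary)
```

`pd.crosstab` with `margins=True` adds the totals row and column, mirroring the layout of the table.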

Three references were duplicated; their PubMed ids are 1770353, 17660304, and 17476699.
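
Duplicates of this kind can be located programmatically by looking for PubMed ids that occur more than once. The sketch below uses toy data and made-up ids; it assumes that empty strings and `-` mark a missing PubMed id and casts the column to text, since ids from different sources may be stored with different types.

```python
import pandas as pd

# Toy merged table (assumption: same 'pubmed id' conventions as above).
refs = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper A", "Paper C", "Paper D"],
    "pubmed id": ["111", "222", "111", "-", "-"],
    "brenda id": ["", "", "9001", "9002", "9003"],
})

# Compare ids as text so that 111 and "111" would be treated alike.
pmid = refs["pubmed id"].astype(str)

# Ignore placeholders, which would otherwise all look like duplicates.
has_pmid = ~pmid.isin(["", "-"])

# keep=False marks every row of a duplicated id, not only the repeats.
dup_mask = has_pmid & pmid.duplicated(keep=False)
duplicate_ids = sorted(pmid[dup_mask].unique())
print(duplicate_ids)
```

Rows flagged by `dup_mask` can then be reviewed by hand before deciding which record to keep.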


