Antimicrobial resistance is a significant public health problem worldwide. In recent years, the scientific community has intensified its efforts to combat this problem; many experiments have been carried out, and many articles have been published in this area. However, the growing volume of biological literature increases the difficulty of the biocuration process because of the cost and time it requires. Modern text mining tools that adopt artificial intelligence techniques can help accelerate this research. In this article, we propose a text mining model capable of identifying and prioritizing scientific articles in the context of antimicrobial resistance. We retrieved scientific articles from the PubMed database, adopted machine learning techniques to generate vector representations of the retrieved articles, and measured their similarity to the context. As a result of this process, we obtained a dataset labeled "Relevant" and "Irrelevant" and used it to train a supervised learning algorithm to classify new records. The model's overall accuracy reached 90%, and the F-measure (the harmonic mean of precision and recall) reached 82% for the positive class and 93% for the negative class, showing that the model identifies scientific articles relevant to the context with good quality.
Figure 1 shows the text mining (TM) steps implemented in this work to label the data.
We retrieved a collection of relevant articles in the Drug Resistance, Microbial domain from the PubMed Central (PMC) database of the National Library of Medicine at the US National Institutes of Health (NLM/NIH).
We used the MeSH hierarchy terms for antimicrobial resistance (https://meshb.nlm.nih.gov/record/ui?ui=D004352) and obtained a list of PMCIDs (the unique identifiers PubMed Central assigns to each document), which we then used to access the full texts of the articles through the E-Fetch utility.
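As an illustration of this retrieval step, the sketch below uses Biopython's Entrez wrapper for the NCBI E-utilities to search PMC and fetch full texts with E-Fetch; the query string, record limit, and fetch parameters shown here are assumptions for demonstration, not necessarily the exact values used in the study.

```python
# Hypothetical sketch of the PMCID retrieval step with Biopython's Entrez
# wrapper for the NCBI E-utilities.
from Bio import Entrez

Entrez.email = "your.email@example.org"  # required by NCBI usage policy

# Search PMC with the MeSH heading "Drug Resistance, Microbial" (D004352);
# the query string and retmax are illustrative assumptions.
search_handle = Entrez.esearch(
    db="pmc",
    term='"drug resistance, microbial"[MeSH Terms]',
    retmax=10000,
)
search_results = Entrez.read(search_handle)
search_handle.close()
pmcids = search_results["IdList"]

# Fetch the full-text XML of each retrieved article through E-Fetch
for pmcid in pmcids[:5]:  # limited to a few records for illustration
    fetch_handle = Entrez.efetch(db="pmc", id=pmcid, rettype="full",
                                 retmode="xml")
    full_text_xml = fetch_handle.read()
    fetch_handle.close()
    # ... parse the JATS XML and store the article body for the next steps
```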
We used the Doc2Vec unsupervised learning algorithm from the Gensim library, which implements the Paragraph Vector Distributed Memory (PV-DM) model, to obtain embeddings of the retrieved documents (Figure 1-C). With the trained model, we inferred the similarity of each document to the AMR context, represented by 4,290 terms extracted from CARD and the Gene Ontology database (Figure 1-D), and automatically labeled each scientific article (Figure 1-E) as relevant or irrelevant.
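The following is a minimal sketch of how the documents could be embedded with Gensim's PV-DM implementation and labeled by cosine similarity to the AMR context; the hyperparameters, the placeholder inputs, and the 0.25 decision threshold are illustrative assumptions rather than the study's actual settings.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
import numpy as np

# Placeholders: in the real pipeline, `articles` holds the PMC full texts and
# `context_terms` the 4,290 AMR terms from CARD and the Gene Ontology database.
articles = ["Carbapenem resistance in Klebsiella pneumoniae ...",
            "Photosynthesis regulation in Arabidopsis thaliana ..."]
context_terms = ["beta-lactamase", "efflux pump", "antibiotic resistance gene"]

corpus = [TaggedDocument(words=simple_preprocess(text), tags=[i])
          for i, text in enumerate(articles)]

# PV-DM model (dm=1); hyperparameters are assumed values for illustration
model = Doc2Vec(corpus, dm=1, vector_size=300, window=8, min_count=1,
                workers=4, epochs=20)

# Represent the AMR context as a single vector inferred from the term list
context_vector = model.infer_vector(simple_preprocess(" ".join(context_terms)))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Label each article by its similarity to the context (threshold is assumed)
labels = ["Relevant" if cosine(model.dv[i], context_vector) >= 0.25
          else "Irrelevant"
          for i in range(len(articles))]
```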
Table 3. Results of the labeling and classification steps compared with expert annotation
| Step           | Dataset / Model | Relevant | Irrelevant |
|----------------|-----------------|----------|------------|
| Labeling       | Dataset_1       | 80%      | 68%        |
|                | Dataset_2       | 66%      | 34%        |
| Classification | SVM_1           | 93%      | 89%        |
|                | SVM_2           | 60%      | 29%        |
Table 3 presents the percentage of correct predictions, both in the labeling and in the classification stage, compared with the data labeled by experts. These results support the hypothesis that Paragraph Vector (distributed representations of sentences and documents), combined with similarity to a specific context, can not only perform binary classification of large volumes of data satisfactorily but also improve the percentage of correct predictions when the labeled data are submitted to supervised classifiers.
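For reference, a hedged sketch of the supervised step could look like the following, where a linear SVM from scikit-learn is trained on the document vectors with the automatically assigned labels; the kernel, regularization, split ratio, and synthetic stand-in data are assumptions for illustration.

```python
# Sketch of the supervised-classification step: a linear SVM trained on the
# document vectors with the automatic labels. Synthetic data stand in for the
# Doc2Vec vectors (X) and the "Relevant"/"Irrelevant" labels (y).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 300))                       # stand-in document vectors
y = rng.choice(["Relevant", "Irrelevant"], size=1000)  # stand-in labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

classifier = SVC(kernel="linear", C=1.0)
classifier.fit(X_train, y_train)

# Per-class precision, recall, and F-measure, as reported in Table 3
print(classification_report(y_test, classifier.predict(X_test)))
```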
Relevant articles in the Drug Resistance domain from the PubMed Central database:
Set of pre-trained document embeddings in the AMR domain.
ATTENTION:
1. Download the model files (document embeddings) from the "results" directory.
2. Select all the zipped files and use the unzip option here; this will merge the parts that had to be split into files of up to 1 GB to be uploaded to GitHub.
Table S1 - CARD and the Gene Ontology Database
Table S2 - Dataset Doc2vec Label