Antimicrobial resistance is a significant public health problem worldwide. In recent years, the scientific community has intensified its efforts to combat this problem; many experiments have been carried out, and many articles have been published in this area. However, the growing volume of biological literature increases the difficulty of the biocuration process because of the cost and time it requires. Modern text mining tools that adopt artificial intelligence techniques can help accelerate this research. In this article, we propose a text mining model capable of identifying and prioritizing scientific articles in the context of antimicrobial resistance. We retrieved scientific articles from the PubMed database, adopted machine learning techniques to generate vector representations of the retrieved articles, and measured their similarity to the context. As a result of this process, we obtained a dataset labeled "Relevant" and "Irrelevant" and used it to train a supervised learning algorithm to classify new records. The model's overall accuracy reached 90%, and the F-measure (the harmonic mean of precision and recall) reached 82% for the positive class and 93% for the negative class, showing that the model identifies scientific articles relevant to the context with good quality.
Figure 1 shows the text mining (TM) steps implemented in this work to label the data.
We retrieved a collection of relevant articles in the Drug Resistance, Microbial domain from the PubMed Central (PMC) database of the National Library of Medicine at the US National Institutes of Health (NLM/NIH).
We used the MeSH hierarchy terms for antimicrobial resistance (https://meshb.nlm.nih.gov/record/ui?ui=D004352) and obtained a list of PMCIDs (the unique identifiers PubMed Central assigns to each document), which we then used to access the full texts of the articles through the E-Fetch utility.
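As an illustration of this retrieval step, the sketch below uses Biopython's Entrez wrapper for the NCBI E-utilities to search PMC and fetch full texts with E-Fetch; the query string, record limit, and fetch parameters shown here are assumptions for demonstration, not necessarily the exact values used in the study.

```python
# Hypothetical sketch of the PMCID retrieval step with Biopython's Entrez
# wrapper for the NCBI E-utilities.
from Bio import Entrez

Entrez.email = "your.email@example.org"  # required by NCBI usage policy

# Search PMC with the MeSH heading "Drug Resistance, Microbial" (D004352);
# the query string and retmax are illustrative assumptions.
search_handle = Entrez.esearch(
    db="pmc",
    term='"drug resistance, microbial"[MeSH Terms]',
    retmax=10000,
)
search_results = Entrez.read(search_handle)
search_handle.close()
pmcids = search_results["IdList"]

# Fetch the full-text XML of each retrieved article through E-Fetch
for pmcid in pmcids[:5]:  # limited to a few records for illustration
    fetch_handle = Entrez.efetch(db="pmc", id=pmcid, rettype="full",
                                 retmode="xml")
    full_text_xml = fetch_handle.read()
    fetch_handle.close()
    # ... parse the JATS XML and store the article body for the next steps
```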
We used the Doc2Vec unsupervised learning algorithm from the Gensim library, which implements the Paragraph Vector Distributed Memory (PV-DM) model, to obtain embeddings of the retrieved documents (Figure 1-C). With the trained model, we inferred the similarity of each document to the AMR context, represented by 4,290 terms extracted from CARD and the Gene Ontology database (Figure 1-D), and automatically labeled each scientific article (Figure 1-E) as relevant or irrelevant.
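The following is a minimal sketch of how the documents could be embedded with Gensim's PV-DM implementation and labeled by cosine similarity to the AMR context; the hyperparameters, the placeholder inputs, and the 0.25 decision threshold are illustrative assumptions rather than the study's actual settings.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
import numpy as np

# Placeholders: in the real pipeline, `articles` holds the PMC full texts and
# `context_terms` the 4,290 AMR terms from CARD and the Gene Ontology database.
articles = ["Carbapenem resistance in Klebsiella pneumoniae ...",
            "Photosynthesis regulation in Arabidopsis thaliana ..."]
context_terms = ["beta-lactamase", "efflux pump", "antibiotic resistance gene"]

corpus = [TaggedDocument(words=simple_preprocess(text), tags=[i])
          for i, text in enumerate(articles)]

# PV-DM model (dm=1); hyperparameters are assumed values for illustration
model = Doc2Vec(corpus, dm=1, vector_size=300, window=8, min_count=1,
                workers=4, epochs=20)

# Represent the AMR context as a single vector inferred from the term list
context_vector = model.infer_vector(simple_preprocess(" ".join(context_terms)))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Label each article by its similarity to the context (threshold is assumed)
labels = ["Relevant" if cosine(model.dv[i], context_vector) >= 0.25
          else "Irrelevant"
          for i in range(len(articles))]
```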
Table 3. Results of the labeling and classification steps compared with expert annotation
| Step           | Dataset / Model | Relevant | Irrelevant |
|----------------|-----------------|----------|------------|
| Labeling       | Dataset_1       | 80%      | 68%        |
|                | Dataset_2       | 66%      | 34%        |
| Classification | SVM_1           | 93%      | 89%        |
|                | SVM_2           | 60%      | 29%        |
Table 3 presents the percentage of correct predictions, both in the labeling and in the classification stage, compared with the data labeled by experts. These results support the hypothesis that Paragraph Vector (distributed representations of sentences and documents), combined with similarity to a specific context, can not only perform binary classification of large volumes of data satisfactorily but also improve the percentage of correct predictions when the labeled data are submitted to supervised classifiers.
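For reference, a hedged sketch of the supervised step could look like the following, where a linear SVM from scikit-learn is trained on the document vectors with the automatically assigned labels; the kernel, regularization, split ratio, and synthetic stand-in data are assumptions for illustration.

```python
# Sketch of the supervised-classification step: a linear SVM trained on the
# document vectors with the automatic labels. Synthetic data stand in for the
# Doc2Vec vectors (X) and the "Relevant"/"Irrelevant" labels (y).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 300))                       # stand-in document vectors
y = rng.choice(["Relevant", "Irrelevant"], size=1000)  # stand-in labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

classifier = SVC(kernel="linear", C=1.0)
classifier.fit(X_train, y_train)

# Per-class precision, recall, and F-measure, as reported in Table 3
print(classification_report(y_test, classifier.predict(X_test)))
```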
Relevant articles in the Drug Resistance domain from the PubMed Central database:
Set of pre-trained document embeddings in the AMR domain.
ATTENTION:
1. Download the model files (document embeddings) from the "results" directory.
2. Select all the zipped files and use the unzip option here; this will merge the parts that had to be split into files of up to 1 GB to be uploaded to GitHub.
Table S1 - CARD and the Gene Ontology Database
Table S2 - Dataset Doc2vec Label