labgm/TextMiningAMR

Text Mining for Identification of Biological Entities Related to Antibiotic Resistant Organisms

Antimicrobial resistance is a significant public health problem worldwide. In recent years, the scientific community has intensified its efforts to combat it: many experiments have been conducted and many articles published in this area. However, the growing volume of biological literature makes biocuration harder because of the cost and time it requires. Modern text mining tools that adopt artificial intelligence can assist this research. In this article, we propose a text mining model capable of identifying and prioritizing scientific articles in the context of antimicrobial resistance. We retrieved scientific articles from the PubMed database, used machine learning techniques to generate vector representations of the retrieved articles, and measured their similarity to the context. This process produced a dataset labeled "Relevant" and "Irrelevant", which we used to train a supervised learning algorithm to classify new records. The model reached 90% overall accuracy, and the F-measure (the harmonic mean of precision and recall) reached 82% for the positive class and 93% for the negative class, showing quality in the identification of scientific articles relevant to the context.

Overview

Figure 1. Proposed TM model.


Figure 1 shows the TM steps implemented in this work to label the data. We retrieved a collection of relevant articles in the Drug Resistance, Microbial domain from the PubMed Central (PMC) database of the National Library of Medicine and the US National Institutes of Health (NIH/NLM). We used the MeSH hierarchy terms for antimicrobial resistance (https://meshb.nlm.nih.gov/record/ui?ui=D004352) and obtained a list of PMCIDs (unique identifiers that PubMed Central assigns to each document), which we used to access the full texts of the articles through the EFetch utility.
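The retrieval step described above can be sketched with the NCBI E-utilities. The snippet below is a minimal illustration, not the query used in this work: the helper names `esearch_url` and `efetch_url`, the `[MeSH Terms]` field tag, the `retmax` value, and the example identifiers are all assumptions. It only builds the request URLs, so it runs without network access.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(mesh_term: str, retmax: int = 100) -> str:
    """Build an ESearch URL listing PMC records indexed under a MeSH term.

    The query syntax is an assumption; the README does not give the exact query.
    """
    params = {
        "db": "pmc",
        "term": f'"{mesh_term}"[MeSH Terms]',
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmcids: list[str]) -> str:
    """Build an EFetch URL that requests full-text XML for a batch of PMCIDs."""
    # Strip the "PMC" prefix so that only the numeric part is sent.
    ids = ",".join(p.upper().removeprefix("PMC") for p in pmcids)
    params = {"db": "pmc", "id": ids, "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Hypothetical identifiers, for illustration only.
search = esearch_url("Drug Resistance, Microbial")
fetch = efetch_url(["PMC123", "456"])
```

The resulting URLs can be passed to any HTTP client; for large batches, NCBI's usage guidelines on request rates apply.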

We used the Doc2Vec unsupervised learning algorithm from the Gensim library, which implements the Paragraph Vector – Distributed Memory model, to obtain the embedding of the retrieved documents (Figure 1-C). With the pre-trained model, we inferred the similarity of the documents to the AMR context, represented by 4,290 terms extracted from CARD and the Gene Ontology Database (Figure 1-D), and automatically labeled each of the scientific articles (Figure 1-E) as relevant or irrelevant.
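The labeling step can be illustrated with a small sketch. The README does not state the similarity measure or the cutoff, so cosine similarity and the `threshold` of 0.5 below are assumptions, as are the function names and the toy vectors. In practice the vectors would come from the trained Doc2Vec model (for example through Gensim's `infer_vector`), not be written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_documents(doc_vectors, context_vector, threshold=0.5):
    """Label each document by its similarity to the context vector.

    The threshold is a hypothetical cutoff chosen for this example.
    """
    labels = {}
    for doc_id, vec in doc_vectors.items():
        sim = cosine_similarity(vec, context_vector)
        labels[doc_id] = "Relevant" if sim >= threshold else "Irrelevant"
    return labels


# Toy 3-dimensional vectors, invented for illustration only.
context = [1.0, 0.0, 1.0]
docs = {"doc_a": [0.9, 0.1, 0.8], "doc_b": [0.0, 1.0, 0.0]}
labels = label_documents(docs, context)
```

Here `doc_a` points in nearly the same direction as the context vector and is labeled "Relevant", while `doc_b` is orthogonal to it and is labeled "Irrelevant".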

Results

Table 3. Results of Labeling and Classification vs Experts steps

| | Relevant | Irrelevant |
|---|---|---|
| Labeling | | |
| Dataset_1 | 80% | 68% |
| Dataset_2 | 66% | 34% |
| Classification | | |
| SVM_1 | 93% | 89% |
| SVM_2 | 60% | 29% |

Table 3 presents the percentage of correct predictions in both the labeling and the classification stages, compared with the data labeled by experts. It validates the hypothesis that Paragraph Vector (Distributed Representations of Sentences and Documents), combined with similarity to a specific context, can not only perform binary classification of large volumes of data satisfactorily but also improve the percentage of correct answers when the labeled data are submitted to supervised classifiers.
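A per-class percentage of correct predictions against expert labels could be computed along the following lines. This is a sketch under an assumption: the README does not say exactly how the percentages were calculated, so here each value is the share of documents the experts assigned to a class that the pipeline labeled the same way. The function name and the four example documents are hypothetical.

```python
def per_class_agreement(predicted, expert):
    """Percentage of expert-labeled documents in each class that were predicted correctly."""
    totals, correct = {}, {}
    for doc_id, expert_label in expert.items():
        totals[expert_label] = totals.get(expert_label, 0) + 1
        if predicted.get(doc_id) == expert_label:
            correct[expert_label] = correct.get(expert_label, 0) + 1
    return {
        label: 100.0 * correct.get(label, 0) / totals[label]
        for label in totals
    }


# Invented labels for four documents, for illustration only.
expert = {"d1": "Relevant", "d2": "Relevant", "d3": "Irrelevant", "d4": "Irrelevant"}
predicted = {"d1": "Relevant", "d2": "Irrelevant", "d3": "Irrelevant", "d4": "Irrelevant"}
agreement = per_class_agreement(predicted, expert)
```

In this toy case, one of the two expert-relevant documents is recovered (50%) and both expert-irrelevant documents are (100%).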

Initial Dataset

Relevant articles in the Drug Resistance domain from the PubMed Central database:

Initial Dataset

Document Embedding (AMR Context)

Set of pre-trained document embeddings in the AMR domain.

Document Embedding 1

Document Embedding 2

Document Embedding 3_part1

Document Embedding 3_part2

Document Embedding 3_part3

Document Embedding 3_part4

Document Embedding 4_part1

Document Embedding 4_part2

Document Embedding 4_part3

Document Embedding 4_part4

  • ATTENTION:

1. Download the template files (Document Embedding) in the "results" directory.

2. Select all the zipped files and use the "unzip here" option; this merges the parts, which had to be split into files of up to 1 GB to be uploaded to GitHub.

Figures and Tables

Figures

Tables

Complementary Materials

Table S1 - CARD and the Gene Ontology Database

Table S2 - Dataset Doc2vec Label

Table S3 - Dataset TF-IDF Label

Table S4 - Experiments vs Specialists
