## Task 1: Search and Extract Data from PubMed Using Keywords

**Author**: Ariful Mondal (ariful.mondal@gmail.com)

**Date**: February 12, 2021

Search using MeSH keywords and fetch literature from PubMed (NCBI).

In [14]:
# Install Biopython if it is not already installed - https://biopython.org/
#!pip install biopython  # run once

In [1]:
from Bio import Entrez   # Entrez Programming Utilities help - https://www.ncbi.nlm.nih.gov/books/NBK25501/
from Bio import Medline  # for parsing the MEDLINE file format

In [8]:
# testing connections
# Entrez.email = "ariful.mondal@gmail.com" # Always tell NCBI who you are
#handle = Entrez.einfo()
#result = handle.read()
#handle.close()
#print(result)

In [22]:
# testing handle
#handle = Entrez.efetch(db="pubmed", id='19304878', rettype="medline", retmode="xml")
#records = Entrez.read(handle)
#for record in records["PubmedArticle"]:
#    print(record["MedlineCitation"]["Article"]["ArticleTitle"])

Biopython: freely available Python tools for computational molecular biology and bioinformatics.
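`Bio.Entrez` is a wrapper around the NCBI E-utilities HTTP endpoints. As a rough illustration of what a search request looks like, the sketch below builds an `esearch` URL with the standard library only and sends nothing over the network. The helper name `esearch_url` and the example query are hypothetical and are not part of this notebook or of Biopython.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities; esearch.fcgi is the search endpoint.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, db="pubmed", **extra):
    """Build an esearch URL (hypothetical helper; no request is sent)."""
    params = {"db": db, "term": term}
    params.update(extra)
    # urlencode percent-encodes brackets and quotes in the query term.
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


# Illustrative values only: a made-up query restricted to the last 30 days.
url = esearch_url("asthma[Mesh]", reldate=30, datetype="pdat")
print(url)
```

In practice `Entrez.esearch` takes the same parameters as keyword arguments and handles the request, so a helper like this is mainly useful for understanding or debugging a query.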


### Define keywords/MeSH terms

MeSH database: drug side effects and congenital anomalies both have entries in the MeSH database (https://www.ncbi.nlm.nih.gov/mesh/68064420 and https://www.ncbi.nlm.nih.gov/mesh/68000013).

Dataset: PubMed (https://pubmed.ncbi.nlm.nih.gov/)

#### Keywords for Drug adverse events

In [2]:
key_words_to_search_ADE = "\"Drug-Related Side Effects and Adverse Reactions\"[Mesh] AND  human[Mesh] AND English[Language]"

#### Keywords for Congenital Abnormalities

In [44]:
key_words_to_search_CA = "\"Congenital Abnormalities\"[Mesh] AND  human[Mesh] AND English[Language]"

#### Keywords for Congenital Abnormalities due to drug doses

In [45]:

key_words_to_search_CADS = "(\"Congenital Abnormalities/drug effects\"[Mesh] OR \"Congenital Abnormalities/drug therapy\"[Mesh] ) AND  human[Mesh] AND English[Language]"


#### Keywords for Other Samples not in ADE or CA

In [46]:
key_words_to_search_others = (
    "human[Mesh] AND English[Language] NOT (\"Congenital Abnormalities\"[Mesh] "
    "OR \"Drug-Related Side Effects and Adverse Reactions\"[Mesh])"
)

Note: The queries could be refined further by researching MeSH and PubMed topics.
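The four query strings share one pattern: optional MeSH terms joined with OR, the `human[Mesh]` and `English[Language]` filters joined with AND, and an optional NOT clause. A minimal sketch of a builder for that pattern follows. The function names `mesh` and `build_query` are hypothetical, and the output differs from the hand-written strings only in whitespace.

```python
def mesh(term):
    """Wrap a MeSH heading in quotes and tag it with [Mesh]."""
    return '"%s"[Mesh]' % term


def build_query(include, exclude=(), filters=("human[Mesh]", "English[Language]")):
    """Combine MeSH terms and filters into one PubMed query string."""
    parts = []
    if include:
        joined = " OR ".join(mesh(t) for t in include)
        # Parenthesise only when several terms are OR-ed together.
        parts.append("(%s)" % joined if len(include) > 1 else joined)
    parts.extend(filters)
    query = " AND ".join(parts)
    if exclude:
        query += " NOT (%s)" % " OR ".join(mesh(t) for t in exclude)
    return query


ade = build_query(["Drug-Related Side Effects and Adverse Reactions"])
cads = build_query([
    "Congenital Abnormalities/drug effects",
    "Congenital Abnormalities/drug therapy",
])
others = build_query([], exclude=[
    "Congenital Abnormalities",
    "Drug-Related Side Effects and Adverse Reactions",
])
print(ade)
print(cads)
print(others)
```

Building the strings this way avoids hand-escaping quotes and keeps the filters identical across all four searches.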

### Define parameters

In [3]:
# Define parameters
email_id = "ariful.mondal@gmail.com" # Always tell NCBI who you are
records_to_download = 2500           # total number of records to download
batch_size = 500                     # retrieve records in batches of at most 500 (retmax), for responsible scraping
number_of_days_to_search = 1825      # search for literature published in the last 5 years (5 x 365 days)
db_name = "pubmed"                   # NCBI database name; dataset: PubMed (https://pubmed.ncbi.nlm.nih.gov/)
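The parameters are related: the day window is five years times 365, and the number of fetch requests follows from the record cap and the batch size. One possible way to keep them together is a small dataclass, sketched below. The class `SearchConfig`, its field `years_back`, and the placeholder email address are assumptions introduced for illustration; the notebook itself uses module-level variables.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConfig:
    """Hypothetical container for the search parameters."""
    email_id: str
    records_to_download: int = 2500
    batch_size: int = 500
    years_back: int = 5
    db_name: str = "pubmed"

    def __post_init__(self):
        if self.batch_size <= 0 or self.records_to_download < 0:
            raise ValueError("batch_size must be positive and records_to_download non-negative")

    @property
    def number_of_days_to_search(self):
        # Same approximation as the notebook: 365 days per year.
        return self.years_back * 365

    def batches_needed(self, found):
        """Number of fetch requests for `found` search hits."""
        wanted = min(found, self.records_to_download)
        return math.ceil(wanted / self.batch_size)


config = SearchConfig(email_id="you@example.org")  # placeholder address
print(config.number_of_days_to_search, config.batches_needed(2044))
```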

### Define function for reusable search and download of data from NCBI

In [5]:
# Search NCBI and download the matching records in MEDLINE format
def search_extract_abstract_ncbi(keywords, outfile):
    Entrez.email = email_id
    try:
        # usehistory="y" keeps the result set on the NCBI history server for batched fetching
        search_handle = Entrez.esearch(
            db=db_name, term=keywords, reldate=number_of_days_to_search, datetype="pdat", usehistory="y"
        )
        search_results = Entrez.read(search_handle)
        search_handle.close()

        num_records_count = int(search_results["Count"])
        print("Found %i results" % num_records_count)

        count = min(records_to_download, num_records_count)
        print("Requested download for %i results" % count)

        # The context manager closes the output file even if a fetch fails
        with open(outfile, "w") as out_handle:
            for start in range(0, count, batch_size):
                end = min(count, start + batch_size)
                print("Going to download record %i to %i" % (start + 1, end))
                fetch_handle = Entrez.efetch(
                    db=db_name,
                    rettype="medline",
                    retmode="text",
                    retstart=start,
                    retmax=end - start,  # the last batch may be smaller than batch_size
                    webenv=search_results["WebEnv"],
                    query_key=search_results["QueryKey"]
                )
                data = fetch_handle.read()
                fetch_handle.close()
                out_handle.write(data)

    except Exception as e:
        print("Something went wrong or no records found!")
        print(e)
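The download loop pages through the stored result set with `retstart` and `retmax`. The sketch below isolates that arithmetic so it can be checked without contacting NCBI. The generator name `batch_windows` is hypothetical, and 2044 is used because it is the one result count in this notebook that is not a multiple of the batch size.

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        # The final window shrinks so that no more than `total` records are requested.
        yield start, min(batch_size, total - start)


for retstart, retmax in batch_windows(2044, 500):
    print("records %i to %i" % (retstart + 1, retstart + retmax))
```

NCBI asks clients without an API key to stay at or below three requests per second, so a short `time.sleep` between batches is a reasonable addition when many batches are fetched.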
 


### Prepare to search and download abstracts

In [6]:
# Extract abstract for ADE
search_extract_abstract_ncbi(key_words_to_search_ADE, "../data/abst_adverse_drug_event.txt")


Found 15564 results
Requested download for 2500 results
Going to download record 1 to 500
Going to download record 501 to 1000
Going to download record 1001 to 1500
Going to download record 1501 to 2000
Going to download record 2001 to 2500


In [75]:
# Extract abstract for congenital anomalies
search_extract_abstract_ncbi(key_words_to_search_CA, "../data/abst_congenital_anomalies.txt")

Found 66559 results
Requested download for 2500 results
Going to download record 1 to 500
Going to download record 501 to 1000
Going to download record 1001 to 1500
Going to download record 1501 to 2000
Going to download record 2001 to 2500


In [76]:
# Extract abstract for congenital anomalies caused by drug usage
search_extract_abstract_ncbi(key_words_to_search_CADS, "../data/abst_congenital_anomalies_DS.txt")

Found 2044 results
Requested download for 2044 results
Going to download record 1 to 500
Going to download record 501 to 1000
Going to download record 1001 to 1500
Going to download record 1501 to 2000
Going to download record 2001 to 2044


In [77]:
# Extract abstracts not in ADE or CA
search_extract_abstract_ncbi(key_words_to_search_others, "../data/abst_others.txt")

Found 2680061 results
Requested download for 2500 results
Going to download record 1 to 500
Going to download record 501 to 1000
Going to download record 1001 to 1500
Going to download record 1501 to 2000
Going to download record 2001 to 2500


### Parse data

For more on the Bio.Medline module, see https://biopython.org/docs/1.75/api/Bio.Medline.html
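The downloaded files use the MEDLINE text layout: each field starts with a tag of up to four characters followed by `- `, long values continue on indented lines, and a blank line separates records. The sketch below is a deliberately simplified standard-library reader for that layout, meant only to show the structure. It is not how `Bio.Medline` is implemented, the function name `parse_medline_text` is hypothetical, and the sample records are made up.

```python
def parse_medline_text(text):
    """Parse MEDLINE-formatted text into a list of dicts of lists (simplified)."""
    records = []
    current = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            last_tag = None
        elif line.startswith("      ") and last_tag is not None:
            # Indented continuation lines extend the previous field.
            current[last_tag][-1] += " " + line.strip()
        elif len(line) >= 6 and line[4:6] == "- ":
            # A new field: tag in columns 1-4, value after the "- " separator.
            last_tag = line[:4].strip()
            current.setdefault(last_tag, []).append(line[6:].strip())
    if current:
        records.append(current)
    return records


sample = (
    "PMID- 1\n"
    "TI  - A title that wraps\n"
    "      onto a second line.\n"
    "AU  - Doe J\n"
    "AU  - Roe R\n"
    "\n"
    "PMID- 2\n"
    "TI  - Record without authors.\n"
)
parsed = parse_medline_text(sample)
print(parsed)
```

Unlike this sketch, which stores every field as a list, `Bio.Medline` returns single-valued fields such as `TI` as strings, so the two are not interchangeable.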

In [77]:
#from Bio import Medline

# Convert a MEDLINE text file into tab-separated lines: PMID, title, abstract, running index
def parse_medline_abstract(infile, outfile):
    with open(infile) as handle:
        # Medline.parse is lazy, so materialise the records while the file is open
        records = list(Medline.parse(handle))

    count = 0
    list_record = None  # stays None if the input file holds no records
    with open(outfile, "w") as file_object:
        for record in records:
            list_record = str(record.get("PMID", "?")) + "\t" + str(record.get("TI", "?")) + "\t" + str(record.get("AB", "?"))
            file_object.write(list_record + "\t" + str(count) + "\n")
            count = count + 1
    print(count, list_record)  # print the record count and the last record
           

In [78]:
parse_medline_abstract('../data/abst_adverse_drug_event.txt','../data/abst_adverse_drug_event_parsed.txt')
parse_medline_abstract('../data/abst_congenital_anomalies.txt', '../data/abst_congenital_anomalies_parsed.txt')
parse_medline_abstract('../data/abst_congenital_anomalies_DS.txt', '../data/abst_cong_anoml_DS_parsed.txt')
parse_medline_abstract('../data/abst_others.txt','../data/abst_others_parsed.txt')

2500 33447029	SnO2-Doped ZnO/Reduced Graphene Oxide Nanocomposites: Synthesis, Characterization, and Improved Anticancer Activity via Oxidative Stress Pathway.	Background: Therapeutic selectivity and drug resistance are critical issues in cancer therapy. Currently, zinc oxide nanoparticles (ZnO NPs) hold considerable promise to tackle this problem due to their tunable physicochemical properties. This work was designed to prepare SnO2-doped ZnO NPs/reduced graphene oxide nanocomposites (SnO2-ZnO/rGO NCs) with enhanced anticancer activity and better biocompatibility than those of pure ZnO NPs. Materials and Methods: Pure ZnO NPs, SnO2-doped ZnO (SnO2-ZnO) NPs, and SnO2-ZnO/rGO NCs were prepared via a facile hydrothermal method. Prepared samples were characterized by field emission transmission electron microscopy (FETEM), energy dispersive spectroscopy (EDS), field emission scanning electron microscopy (FESEM), X-ray diffraction (XRD), ultraviolet-visible (UV-VIS) spectrometer, and dynam
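Each line of a parsed file holds four tab-separated fields: PMID, title, abstract, and a running index, with `?` standing in for a missing field. A minimal sketch of reading such lines back follows. The function name `read_parsed_rows` is hypothetical, and the sample rows are made up rather than taken from PubMed.

```python
import csv
import io

# Stand-in for the contents of a parsed file (made-up rows).
sample = "111\tFirst title.\tFirst abstract.\t0\n222\tSecond title.\t?\t1\n"


def read_parsed_rows(handle):
    """Read PMID, title, abstract and index from tab-separated lines."""
    # QUOTE_NONE keeps quotation marks inside abstracts as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for pmid, title, abstract, index in reader:
        rows.append({
            "pmid": pmid,
            "title": title,
            # "?" is the placeholder written when a record has no abstract.
            "abstract": None if abstract == "?" else abstract,
            "index": int(index),
        })
    return rows


rows = read_parsed_rows(io.StringIO(sample))
print(rows)
```

For a file on disk, the same function can be given `open(path, newline="")` instead of the `StringIO` object.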

In [79]:
!wc -l '../data/abst_adverse_drug_event_parsed.txt'

0 ../data/abst_adverse_drug_event_parsed.txt


In [80]:
!wc -l '../data/abst_congenital_anomalies_parsed.txt'

wc: ../data/abst_congenital_anomalies_parsed.txt: No such file or directory


In [81]:
!wc -l '../data/abst_cong_anoml_DS_parsed.txt'

wc: ../data/abst_cong_anoml_DS_parsed.txt: No such file or directory


In [73]:
!wc -l '../data/abst_others_parsed.txt'

2500 ../data/abst_others_parsed.txt
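The `wc -l` checks depend on a Unix shell and print an error when a file is missing. A portable alternative is sketched below, assuming the files are UTF-8 text. The function name `count_lines` is hypothetical, and the demonstration uses a temporary file rather than the notebook's data.

```python
import os
import tempfile


def count_lines(path):
    """Return the number of lines in `path`, or None if it does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Demonstration on a temporary file with three made-up lines.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo_parsed.txt")
    with open(demo, "w", encoding="utf-8") as handle:
        handle.write("a\tb\n" * 3)
    present = count_lines(demo)
    missing = count_lines(os.path.join(tmp, "no_such_file.txt"))

print(present, missing)
```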


### Do more with the MEDLINE file

In [57]:
help(records)

Help on Record in module Bio.Medline object:

class Record(builtins.dict)
 |  A dictionary holding information from a Medline record.
 |  
 |  All data are stored under the mnemonic appearing in the Medline
 |  file. These mnemonics have the following interpretations:
 |  
 |  Mnemonic  Description
 |  --------- ------------------------------
 |  AB        Abstract
 |  CI        Copyright Information
 |  AD        Affiliation
 |  IRAD      Investigator Affiliation
 |  AID       Article Identifier
 |  AU        Author
 |  FAU       Full Author
 |  CN        Corporate Author
 |  DCOM      Date Completed
 |  DA        Date Created
 |  LR        Date Last Revised
 |  DEP       Date of Electronic Publication
 |  DP        Date of Publication
 |  EDAT      Entrez Date
 |  GS        Gene Symbol
 |  GN        General Note
 |  GR        Grant Number
 |  IR        Investigator Name
 |  FIR       Full Investigator Name
 |  IS        ISSN
 |  IP        Issue
 |  TA        Journal Title Abbreviatio

In [67]:
#from Bio import Medline

with open("../data/abst_adverse_drug_event.txt") as handle:
    for record in Medline.parse(handle):
        print(record['PMID'], record['TI'], record['AU'])

33517299 Adverse Events and Economic Burden Among Patients Receiving Systemic Treatment for Mantle Cell Lymphoma: A Real-World Retrospective Cohort Study. ['Kabadi SM', 'Byfield SD', 'LE L', 'Olufade T']
33514507 Association between human papillomavirus vaccination and serious adverse events in South Korean adolescent girls: nationwide cohort study. ['Yoon D', 'Lee JH', 'Lee H', 'Shin JY']
33509333 Pulmonary Manifestations of Rheumatoid Arthritis. ['Christensen KJ', 'Malesker MA', 'Jagan N', 'Moore DR']
33503163 Adverse drug reactions in patients with COVID-19 in Brazil: analysis of spontaneous notifications of the Brazilian pharmacovigilance system. ['Melo JRR', 'Duarte EC', 'Moraes MV', 'Fleck K', 'Silva ASDNE', 'Arrais PSD']
33488016 Cutaneous Cell-Mediated Delayed Hypersensitivity to Intravitreal Bevacizumab. ['Fam A', 'Finger PT']
33471645 Self-Reported Allergy to Thyroid Replacement Therapy: A Multicenter Retrospective Chart Review. ['Chamorro-Pareja N', 'Carrillo-Martin I', 'Hae

KeyError: 'AU'
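The `KeyError` is raised because not every record carries an `AU` field. Since `Record` is a `dict` subclass, as the help output shows, `.get` with a default avoids the error. The sketch below uses plain dictionaries with made-up values as stand-ins for `Bio.Medline` records, and the helper name `describe` is hypothetical.

```python
# Plain dicts standing in for Bio.Medline records (made-up values).
records = [
    {"PMID": "1", "TI": "A title.", "AU": ["Doe J", "Roe R"]},
    {"PMID": "2", "TI": "A record without authors."},
]


def describe(record):
    """Format PMID, title and authors, tolerating missing fields."""
    authors = record.get("AU", [])
    return "%s %s %s" % (
        record.get("PMID", "?"),
        record.get("TI", "?"),
        ", ".join(authors) or "(no authors listed)",
    )


lines = [describe(r) for r in records]
for line in lines:
    print(line)
```

The same pattern applies inside the `Medline.parse` loop: replacing `record['AU']` with `record.get('AU', [])` lets the loop run to the end of the file.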


### Credits

1. https://pubmed.ncbi.nlm.nih.gov/

2. https://www.ncbi.nlm.nih.gov/books/NBK25501/

3. https://www.biostars.org/p/111313/

4. https://biopython.org/