# Setup

In [16]:
from Bio import Entrez
import pandas as pd
import json # for pretty printing

In [17]:
Entrez.email = "giorgiocoal@gmail.com"

# PubMed Article ID Search and Retrieval

## Batch Article ID Search per Date Range

For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect, which contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved.

To work around this limit, I search in batches of at most 9999 articles each, using the publication date range to split the query.
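
As a rough sketch of the date-range idea, the search term for each batch can be generated rather than typed by hand. The helper name `build_date_range_term` and the `windows` list are illustrative assumptions, not part of the original notebook; the split points must be chosen so that each window stays under the 10,000-record cap.

```python
from datetime import date


def build_date_range_term(query, start, end):
    """Return a PubMed term restricting `query` to publication dates in [start, end]."""
    fmt = "%Y/%m/%d"
    return (
        f'(({query}) AND ("{start.strftime(fmt)}"[Date - Publication] : '
        f'"{end.strftime(fmt)}"[Date - Publication]))'
    )


# Hypothetical split points: one window per batch.
windows = [
    (date(1900, 1, 1), date(2008, 12, 31)),
    (date(2009, 1, 1), date(2023, 3, 22)),
]

terms = [
    build_date_range_term('"toxoplasma gondii"[Title/Abstract]', start, end)
    for start, end in windows
]
for term in terms:
    print(term)
```

Each generated string can then be passed as the `term` argument of `Entrez.esearch`.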

Fetching the 9935 most recent articles, published from **2009** to **2023**:

In [18]:
NUM_ARTICLES = 10000
search_record = {}
handle = Entrez.esearch(db = "pubmed", 
                        retmax = NUM_ARTICLES, 
                        retstart = 0, 
                        term = '(("toxoplasma gondii"[Title/Abstract]) AND ("2009/01/01"[Date - Publication] : "2023/03/22"[Date - Publication]))', 
                        idtype = "acc",
                        retmode = "xml")
search_record[">2009"] = Entrez.read(handle)
handle.close()

In [19]:
print(search_record[">2009"].keys())
print(search_record[">2009"]["QueryTranslation"])
print(len(search_record[">2009"]["IdList"]))

"toxoplasma gondii"[Title/Abstract] AND 2009/01/01:2023/03/22[Date - Publication]
9935


Fetching the 7770 older articles, published **before 2009**:

In [20]:
NUM_ARTICLES = 10000
handle = Entrez.esearch(db = "pubmed", 
                        retmax = NUM_ARTICLES, 
                        retstart = 0, 
                        term = '(("toxoplasma gondii"[Title/Abstract]) AND ("1900/01/01"[Date - Publication] : "2008/12/31"[Date - Publication]))', 
                        idtype = "acc",
                        retmode = "xml")
search_record["<2009"] = Entrez.read(handle)
handle.close()

In [21]:
print(search_record["<2009"].keys())
print(search_record["<2009"]["WarningList"])
print(len(search_record["<2009"]["IdList"]))

{'PhraseIgnored': [], 'OutputMessage': ['Restrictions achieved. start and count adjusted to 0, 9999'], 'QuotedPhraseNotFound': []}
7770


## Merging the Batches and Removing Duplicates

Now I'll merge the ID batches into a single list of more than 10,000 IDs.

In [22]:
search_record_full = list(search_record[">2009"]["IdList"]) + list(search_record["<2009"]["IdList"])

Number of articles to fetch

In [23]:
# Number of articles
print(len(search_record_full))
# Number of unique articles
print(len(set(search_record_full)))

17705
17639


Removing the duplicates

In [24]:
search_record_full = list(set(search_record_full))

In [25]:
len(search_record_full)

17639
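
Note that converting to a `set` does not preserve the order in which the IDs were returned. If that order matters, an order-preserving alternative is possible; the sketch below uses a hypothetical helper `dedupe_keep_order` and made-up sample IDs.

```python
def dedupe_keep_order(ids):
    """Drop repeated IDs while keeping the first occurrence of each, in order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(ids))


sample_ids = ["3", "1", "3", "2", "1"]  # illustrative values, not real PMIDs
print(dedupe_keep_order(sample_ids))    # ['3', '1', '2']
```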

# Fetch Bibliographic Information from retrieved IDs

`Entrez.efetch` also limits the number of IDs that can be retrieved in a single request, so the list of IDs must be sliced during fetching.
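
The slicing can be generalized to any number of batches. The following is a minimal sketch: the helper `chunked` and the batch size of 4 are assumptions chosen for illustration only.

```python
def chunked(ids, size):
    """Yield successive slices of `ids` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


example_ids = [str(n) for n in range(10)]  # stand-in for the real list of PMIDs
for batch in chunked(example_ids, 4):
    # Comma-separated string in the form expected by the `id` argument.
    id_string = ",".join(batch)
    print(id_string)
```

Each `id_string` could be passed as the `id` argument of `Entrez.efetch`, and the parsed results concatenated afterwards.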

Fetch detailed information about every article in the list

In [26]:
# Fetch the records in two halves to stay within the per-request ID limit
records = {}
half = len(search_record_full) // 2
handle = Entrez.efetch(db = "pubmed", 
                       # Convert list to string with comma as separator 
                       # to pass to Entrez.efetch
                       id = ','.join(search_record_full[:half]),
                       retmode = "xml")
records['1stHalf'] = Entrez.read(handle)
handle.close()
handle = Entrez.efetch(db = "pubmed", 
                       # Convert list to string with comma as separator 
                       # to pass to Entrez.efetch
                       id = ','.join(search_record_full[half:]),
                       retmode = "xml")
records['2ndHalf'] = Entrez.read(handle)
handle.close()

Record dictionary details:

In [27]:
# First Half
print(records['1stHalf'].keys())

print(len(records['1stHalf']["PubmedArticle"]))
print(type(records['1stHalf']["PubmedArticle"]))
print(type(records['1stHalf']["PubmedArticle"][1]))

# Second Half
print(len(records['2ndHalf']["PubmedArticle"]))
print(type(records['2ndHalf']["PubmedArticle"]))
print(type(records['2ndHalf']["PubmedArticle"][1]))

dict_keys(['PubmedArticle', 'PubmedBookArticle'])
8814
<class 'list'>
<class 'Bio.Entrez.Parser.DictionaryElement'>
8816
<class 'list'>
<class 'Bio.Entrez.Parser.DictionaryElement'>


Merging the two record dictionaries into a single one:

In [28]:
full_records = {}
full_records['PubmedArticle'] = records['1stHalf']["PubmedArticle"] + records['2ndHalf']["PubmedArticle"]
full_records['PubmedBookArticle'] = records['1stHalf']["PubmedBookArticle"] + records['2ndHalf']["PubmedBookArticle"]

In [29]:
print(len(full_records["PubmedArticle"]))
print(type(full_records["PubmedArticle"]))
print(type(full_records["PubmedArticle"][1]))

17630
<class 'list'>
<class 'Bio.Entrez.Parser.DictionaryElement'>


In [53]:
print(json.dumps(full_records['PubmedArticle'][1], indent = 2))

{
  "MedlineCitation": {
    "OtherID": [],
    "OtherAbstract": [],
    "KeywordList": [],
    "GeneralNote": [],
    "CitationSubset": [
      "IM"
    ],
    "SpaceFlightMission": [],
    "PMID": "12197139",
    "DateCompleted": {
      "Year": "2002",
      "Month": "09",
      "Day": "17"
    },
    "DateRevised": {
      "Year": "2006",
      "Month": "11",
      "Day": "15"
    },
    "Article": {
      "ArticleDate": [],
      "ELocationID": [],
      "Language": [
        "eng"
      ],
      "Journal": {
        "ISSN": "0022-3395",
        "JournalIssue": {
          "Volume": "88",
          "Issue": "4",
          "PubDate": {
            "Year": "2002",
            "Month": "Aug"
          }
        },
        "Title": "The Journal of parasitology",
        "ISOAbbreviation": "J Parasitol"
      },
      "ArticleTitle": "Development and evaluation of an enzyme-linked immunosorbent assay with recombinant SAG2 for diagnosis of Toxoplasma gondii infection in cats.",
      "P

# Parsing retrieved articles and building a dataframe

## JSON Parsing

We create multiple dataframes containing information about:
-  Retrieved Article
-  ~~Journal that published the articles~~

First of all we'll extract some basic information about the retrieved articles:
- PubMed ID
- Title
- Abstract Text
- Dates
  - Completed Date
  - Revised Date
  - Published Year
- Language
- Authors List
- PublicationTypeList
- Journal Country

In [None]:
# Create an empty dictionary mapping each PubMed ID to its parsed info
id_article = {}
counter = 0
for record in full_records['PubmedArticle']:
    print(counter)
    counter += 1
    # Root element to access all data about a given article
    root = record['MedlineCitation']['Article']
    # Get pubmed id to be used for dictionary key access
    uid = record['MedlineCitation']['PMID']
    print(uid)
    # Create empty dictionary for each given article id
    id_article[uid] = {}
    # Title
    try:
        id_article[uid]['Title'] = root['ArticleTitle']
    except:
        id_article[uid]['Title'] = None
    # Country
    try:
        id_article[uid]['Journal Country'] = record['MedlineCitation']['MedlineJournalInfo']['Country']
    except:
        id_article[uid]['Journal Country'] = None
    # Abstract text different sections like results, conclusions, 
    # that are in separate elements of the list are joined together
    try:
        id_article[uid]['Abstract'] = "\t".join(root['Abstract']['AbstractText']) if 'Abstract' in root.keys() else None
    except:
        id_article[uid]['Abstract'] = None
    # Article date (submission?) as YYYY/MM/DD
    try: 
        id_article[uid]['ArticleDate'] = "/".join(root['ArticleDate'][0].values()) if len(root['ArticleDate']) != 0 else None 
    except:
        id_article[uid]['ArticleDate'] = None
    # Completed date as YYYY/MM/DD
    try: 
        id_article[uid]['CompletedDate'] = "/".join(record['MedlineCitation']['DateCompleted'].values()) if len(record['MedlineCitation']['DateCompleted']) != 0 else None
    except:
        id_article[uid]['CompletedDate'] = None
    # Revised date as YYYY/MM/DD
    try:
        id_article[uid]['RevisedDate'] = "/".join(record['MedlineCitation']['DateRevised'].values()) if len(record['MedlineCitation']['DateRevised']) != 0 else None
    except:
        id_article[uid]['RevisedDate'] = None
    # Publication year
    try:
        id_article[uid]['PublicationYear'] = record['MedlineCitation']['Article']['Journal']['JournalIssue']['PubDate']["Year"]
    except:
        id_article[uid]['PublicationYear'] = None
    # Language(s)
    try:
        id_article[uid]['Language'] = " | ".join(root['Language'])
    except:
        id_article[uid]['Language'] = None
    # Authors list
    try:
        # Keep only authors that have both a last name and initials
        id_article[uid]["AuthorList"] = " | ".join([i['LastName'] + ", " + i['Initials']\
                                                for i in root['AuthorList'] if 'LastName' in i.keys() and 'Initials' in i.keys()]) if 'AuthorList' in root.keys() else None
    except:
        id_article[uid]["AuthorList"] = None
    # List of publication types
    try: 
        id_article[uid]['PublicationTypeList'] = " | ".join(root['PublicationTypeList'])
    except:
        id_article[uid]['PublicationTypeList'] = None
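
The repeated `try`/`except` blocks could be condensed with a small lookup helper. This is only a sketch: `dig` and `format_date` are hypothetical names, and the toy record reuses the PMID and completion date from the JSON dump shown earlier.

```python
def dig(mapping, *keys, default=None):
    """Follow `keys` through nested dicts/lists; return `default` if a step is missing."""
    current = mapping
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


def format_date(date_dict):
    """Build 'YYYY/MM/DD' from explicit keys instead of relying on dict ordering."""
    if not date_dict:
        return None
    parts = [date_dict.get(key) for key in ("Year", "Month", "Day")]
    return "/".join(parts) if all(parts) else None


# Toy record shaped like the parsed Entrez output.
toy = {
    "MedlineCitation": {
        "PMID": "12197139",
        "DateCompleted": {"Year": "2002", "Month": "09", "Day": "17"},
        "Article": {"ArticleDate": []},
    }
}

print(dig(toy, "MedlineCitation", "PMID"))                         # 12197139
print(format_date(dig(toy, "MedlineCitation", "DateCompleted")))   # 2002/09/17
print(dig(toy, "MedlineCitation", "Article", "ArticleDate", 0))    # None
```

Catching only lookup errors, rather than using a bare `except`, lets unrelated bugs surface instead of silently producing `None`.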
        

Example of the parsed data:

In [36]:
print(json.dumps(id_article["2605834"], indent = 2))

{
  "Title": "[Acute acquired toxoplasmosis presenting as polymyositis and chorioretinitis in a Japanese male].",
  "Journal Country": "Japan",
  "Abstract": "A 55-year-old Japanese male who developed acute polymyositis and chorioretinitis due to toxoplasmosis is described. The patients was well until one month prior to the present admission, when he had an onset of painful swelling of lymphnodes in the posterior cervical region, proximal muscle weakness, myalgia and a partial defect in the visual field of the right eye. He admitted that he had had a chance to eat half-cooked mutton while he had visited Saudi Arabia 40 days before. He was unable to go up and down the stairs at the peak of the illness. Serum CPK was 2050 u/l (N = 5-50) on January 11, 1989. These symptoms improved spontaneously except for the visual field defect. He was admitted to our hospital on January 31, 1989. On admission, neurological examination was unremarkable except for retinal exudate in the right eye which a

Create the dataframe

In [32]:
id_article_df = pd.DataFrame.from_dict(id_article, orient = "index")
id_article_df.head()

Unnamed: 0,Title,Journal Country,Abstract,ArticleDate,CompletedDate,RevisedDate,PublicationYear,Language,AuthorList,PublicationTypeList
2605834,[Acute acquired toxoplasmosis presenting as po...,Japan,A 55-year-old Japanese male who developed acut...,,1990/02/09,2006/11/15,1989,jpn,"Yamada, T | Nakagawa, Y | Komiya, T | Sakuma, ...",Case Reports | English Abstract | Journal Article
23391103,Experimental vaginal infection of goats with s...,United States,The objective was to characterize the transmis...,2013/02/07,2013/10/22,2013/08/05,2013,eng,"Wanderley, FS | Porto, WJ | Câmara, DR | da Cr...",Journal Article | Randomized Controlled Trial ...
23559345,Prevalence and risk factors for Toxoplasma gon...,Brazil,To determine the prevalence of immunoglobulin ...,,2013/10/22,2019/03/25,2013,eng,"Moura, FL | Amendoeira, MR | Bastos, OM | Matt...","Journal Article | Research Support, Non-U.S. G..."
12685205,Fine needle aspiration cytologic diagnosis of ...,Switzerland,Fine needle aspiration (FNA) cytologic diagnos...,,2003/05/30,2018/02/17,2003,eng,"Pathan, SK | Francis, IM | Das, DK | Mallik, M...",Case Reports | Journal Article
28887145,Role of an estradiol regulatory factor-hydroxy...,England,Toxoplasma gondii is an apicomplexan parasite ...,2017/09/05,2017/11/13,2017/12/08,2017,eng,"Zhang, X | Liu, J | Li, M | Fu, Y | Zhang, T |...","Journal Article | Research Support, Non-U.S. G..."


## Abstract Text Cleaning

In [33]:
import re 

In [43]:
# Copy abstract to new column
id_article_df['PreprocessedAbstract'] = id_article_df['Abstract']

Convert all None elements to empty strings (to be able to use re functions)

In [44]:
id_article_df['PreprocessedAbstract'] = id_article_df['PreprocessedAbstract'].fillna('')

Remove `<i>` and `</i>` tags from Title and Abstract

In [45]:
def remove_tags(text):
    # Strip both the opening and the closing italic tag
    return text.replace('<i>', '').replace('</i>', '')

id_article_df['Title'] = id_article_df['Title'].map(remove_tags)
id_article_df['PreprocessedAbstract'] = id_article_df['PreprocessedAbstract'].map(remove_tags)
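
If other inline markup turns up in the text, a single regular expression can strip several tags at once. The tag list below (`i`, `b`, `u`, `sub`, `sup`) is an assumption about what may occur, and `strip_inline_tags` is a hypothetical helper.

```python
import re

# Matches opening or closing versions of a few simple inline tags (assumed list).
INLINE_TAG = re.compile(r"</?(?:i|b|u|sub|sup)>")


def strip_inline_tags(text):
    """Remove the listed inline tags while keeping the text they wrap."""
    return INLINE_TAG.sub("", text)


print(strip_inline_tags("<i>Toxoplasma gondii</i> infection, CO<sub>2</sub>"))
# Toxoplasma gondii infection, CO2
```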

Remove punctuation

In [46]:
def remove_punctuation(text):
    # Raw string so that \w and \s are not treated as string escapes
    return re.sub(r'[^\w\s]', '', text)
id_article_df['PreprocessedAbstract'] = id_article_df['PreprocessedAbstract'].map(remove_punctuation)
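
Deleting punctuation outright fuses hyphenated words ("55-year-old" becomes "55yearold") and can leave double spaces. A possible variant, sketched here under the assumption that keeping word boundaries is preferable, replaces punctuation with a space and then collapses whitespace; `remove_punctuation_keep_gaps` is a hypothetical name.

```python
import re


def remove_punctuation_keep_gaps(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    spaced = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", spaced).strip()


sample = "A 55-year-old male (n = 13)."
print(re.sub(r"[^\w\s]", "", sample))        # A 55yearold male n  13
print(remove_punctuation_keep_gaps(sample))  # A 55 year old male n 13
```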

Lowercase the text

In [47]:
def to_lower(text):
    return text.lower()
id_article_df['PreprocessedAbstract'] = id_article_df['PreprocessedAbstract'].map(to_lower)

In [48]:
id_article_df['PreprocessedAbstract'][100]

'a light and transmission electron microscopic study was performed in skeletal muscles from mice experimentally infected with toxoplasma gondii parasite cysts were not observed capillary endothelial cytoplasm abnormalities included proliferation of organelles decrease of pynocytic vesicles degenerative changes and necrosis in some capillaries the lumen was reduced or absent pericytes also were altered in all animals n  13 the basement membrane was normal the cellular infiltrate consisted of macrophages lymphocytes mastocytes and eosinophils the alterations observed in muscle microvasculature in absence of toxoplasma gondii cysts could be due to a hostimmune response to the parasite'

Remove square brackets from Title

In [49]:
def remove_squared_brackets(text):
    return text.replace('[', '').replace(']', '')

id_article_df['Title'] = id_article_df['Title'].map(remove_squared_brackets)

In [50]:
id_article_df.head()

Unnamed: 0,Title,Journal Country,Abstract,ArticleDate,CompletedDate,RevisedDate,PublicationYear,Language,AuthorList,PublicationTypeList,PreprocessedAbstract
2605834,Acute acquired toxoplasmosis presenting as pol...,Japan,A 55-year-old Japanese male who developed acut...,,1990/02/09,2006/11/15,1989,jpn,"Yamada, T | Nakagawa, Y | Komiya, T | Sakuma, ...",Case Reports | English Abstract | Journal Article,a 55yearold japanese male who developed acute ...
23391103,Experimental vaginal infection of goats with s...,United States,The objective was to characterize the transmis...,2013/02/07,2013/10/22,2013/08/05,2013,eng,"Wanderley, FS | Porto, WJ | Câmara, DR | da Cr...",Journal Article | Randomized Controlled Trial ...,the objective was to characterize the transmis...
23559345,Prevalence and risk factors for Toxoplasma gon...,Brazil,To determine the prevalence of immunoglobulin ...,,2013/10/22,2019/03/25,2013,eng,"Moura, FL | Amendoeira, MR | Bastos, OM | Matt...","Journal Article | Research Support, Non-U.S. G...",to determine the prevalence of immunoglobulin ...
12685205,Fine needle aspiration cytologic diagnosis of ...,Switzerland,Fine needle aspiration (FNA) cytologic diagnos...,,2003/05/30,2018/02/17,2003,eng,"Pathan, SK | Francis, IM | Das, DK | Mallik, M...",Case Reports | Journal Article,fine needle aspiration fna cytologic diagnosis...
28887145,Role of an estradiol regulatory factor-hydroxy...,England,Toxoplasma gondii is an apicomplexan parasite ...,2017/09/05,2017/11/13,2017/12/08,2017,eng,"Zhang, X | Liu, J | Li, M | Fu, Y | Zhang, T |...","Journal Article | Research Support, Non-U.S. G...",toxoplasma gondii is an apicomplexan parasite ...


## Convert AuthorList and PublicationTypeList into lists

In [51]:
# Define a function that splits a string into a list of strings using the '|' separator
def string_to_list(text):
    return text.split('|')

# Apply the function to the 'AuthorList' and 'PublicationTypeList' columns via apply()
id_article_df['AuthorList'] = id_article_df['AuthorList'].fillna('')
id_article_df['AuthorList'] = id_article_df['AuthorList'].apply(string_to_list)

id_article_df['PublicationTypeList'] = id_article_df['PublicationTypeList'].fillna('')
id_article_df['PublicationTypeList'] = id_article_df['PublicationTypeList'].apply(string_to_list)
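
A plain `split('|')` keeps the spaces around each separator and turns an empty string into `['']`. A stricter variant is sketched below; `split_clean` is a hypothetical helper, not the function used in this notebook.

```python
def split_clean(text, sep="|"):
    """Split on `sep`, trim each item, and drop empty items."""
    return [part.strip() for part in text.split(sep) if part.strip()]


print(split_clean("Yamada, T | Nakagawa, Y | Komiya, T"))
# ['Yamada, T', 'Nakagawa, Y', 'Komiya, T']
print(split_clean(""))
# []
```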

In [52]:
id_article_df.head()

Unnamed: 0,Title,Journal Country,Abstract,ArticleDate,CompletedDate,RevisedDate,PublicationYear,Language,AuthorList,PublicationTypeList,PreprocessedAbstract
2605834,Acute acquired toxoplasmosis presenting as pol...,Japan,A 55-year-old Japanese male who developed acut...,,1990/02/09,2006/11/15,1989,jpn,"[Yamada, T , Nakagawa, Y , Komiya, T , Saku...","[Case Reports , English Abstract , Journal A...",a 55yearold japanese male who developed acute ...
23391103,Experimental vaginal infection of goats with s...,United States,The objective was to characterize the transmis...,2013/02/07,2013/10/22,2013/08/05,2013,eng,"[Wanderley, FS , Porto, WJ , Câmara, DR , d...","[Journal Article , Randomized Controlled Tria...",the objective was to characterize the transmis...
23559345,Prevalence and risk factors for Toxoplasma gon...,Brazil,To determine the prevalence of immunoglobulin ...,,2013/10/22,2019/03/25,2013,eng,"[Moura, FL , Amendoeira, MR , Bastos, OM , ...","[Journal Article , Research Support, Non-U.S....",to determine the prevalence of immunoglobulin ...
12685205,Fine needle aspiration cytologic diagnosis of ...,Switzerland,Fine needle aspiration (FNA) cytologic diagnos...,,2003/05/30,2018/02/17,2003,eng,"[Pathan, SK , Francis, IM , Das, DK , Malli...","[Case Reports , Journal Article]",fine needle aspiration fna cytologic diagnosis...
28887145,Role of an estradiol regulatory factor-hydroxy...,England,Toxoplasma gondii is an apicomplexan parasite ...,2017/09/05,2017/11/13,2017/12/08,2017,eng,"[Zhang, X , Liu, J , Li, M , Fu, Y , Zhang...","[Journal Article , Research Support, Non-U.S....",toxoplasma gondii is an apicomplexan parasite ...


## Save the dataset

In [53]:
id_article_df.to_csv("Data/toxoplasma_gondii_pubmed.csv", index_label = "pubmed_id", encoding = 'utf-8')
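
Because CSV stores every cell as text, the list columns are written as their string representation (for example `"['Case Reports ', ' Journal Article']"`). When reading the file back, they need to be parsed again. A minimal sketch, with `parse_list_cell` as a hypothetical helper:

```python
import ast


def parse_list_cell(cell):
    """Turn the string form of a Python list back into a list."""
    if not isinstance(cell, str) or not cell.strip():
        return []
    value = ast.literal_eval(cell)  # safely evaluates literals only
    return value if isinstance(value, list) else [value]


print(parse_list_cell("['Case Reports ', ' Journal Article']"))
# ['Case Reports ', ' Journal Article']
```

Such a function could be supplied through the `converters` argument of `pd.read_csv`, for instance `converters={"AuthorList": parse_list_cell, "PublicationTypeList": parse_list_cell}`.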

In [89]:
print(id_article_df.columns)

Index(['Title', 'Journal Country', 'Abstract', 'ArticleDate', 'CompletedDate',
       'RevisedDate', 'PublicationYear', 'Language', 'AuthorList',
       'PublicationTypeList', 'PreprocessedAbstract'],
      dtype='object')
