# How to Parse PubMed Search Text Files Using Data Element Fields #


This article follows up on a piece I published a few months ago about building a dataset of PubMed-listed publications on cardiovascular disease research. 

Please read the original article for the background context before proceeding: https://medium.com/towards-data-science/building-a-pubmed-dataset-b1267408417c 

The original dataset of PubMed-listed publications on cardiovascular disease research, created for my Master’s thesis, "Factors Associated with Impactful Scientific Publications in NIH-Funded Heart Disease Research," required details such as the journal name, the first author's institutional affiliation, and their country. To preserve this information for further parsing, I saved the results of my PubMed advanced search queries [1] by selecting the Abstract format in the display options and choosing the PubMed format in the "Save Citations" menu. This creates a plain-text file.

The flowchart below outlines the steps to parse the information described above from the PubMed format file. A detailed explanation follows.

![Parsing_API_1](Parsing_API_1.jpg)

Because search results in the PubMed format cannot be saved as CSV or loaded directly into a table, the file has to be parsed to extract the journal title (JT), the first author's institutional affiliation (AD), and the country.

Below is an example of a single entry in a PubMed format file.

![PubMed_dataset_4](PubMed_dataset_4.jpg)
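To make the layout of an entry concrete, here is a minimal sketch of how its lines could be grouped by field tag. It assumes that each field starts with a short tag followed by `- ` and that wrapped values continue on lines indented with spaces. The function name `parse_record` and the sample record are made up for illustration; this is not the script used for the thesis.

```python
from collections import defaultdict

def parse_record(lines):
    """Group the lines of one PubMed-format record by field tag."""
    fields = defaultdict(list)
    tag = None
    for line in lines:
        if line.startswith(' ') and tag is not None:
            # Continuation of a wrapped field: append to the latest value
            fields[tag][-1] += ' ' + line.strip()
        elif '- ' in line[:6]:
            # New field: the tag sits before the first "- "
            tag, value = line.split('- ', 1)
            tag = tag.strip()
            fields[tag].append(value.strip())
    return dict(fields)

# Made-up record for illustration only (not a real PubMed entry)
sample = [
    "PMID- 12345678",
    "AD  - Department of Medicine, Example University, Example City, Example",
    "      Country.",
    "AD  - Second Example Institute.",
    "JT  - Lancet (London, England)",
]
record = parse_record(sample)
print(record["AD"][0])  # affiliation of the first listed author
```

Keeping every value of a repeated tag in a list makes it easy to pick only the first `AD` entry, which is the one the parsing script relies on.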

I wrote the parsing script in Python 3.10.1. To identify the first author's institution, the script sends each affiliation string in an Application Programming Interface (API) request to the Research Organization Registry (ROR) API [2]. ROR matching was necessary because the AD field names research institutions inconsistently and mixes in extra details such as addresses and department names. Given a full affiliation string in the API call, ROR affiliation matching finds the research organization mentioned in it and returns the result in JSON format, from which the script takes the institution name and the country. The PMID, the affiliation string, and the journal title are read directly from the file using the tags defined in the PubMed Data Element (Field) Descriptions (PMID, AD, and JT). When the script finishes, the output file contains one line per publication, with the parsed values separated by a vertical bar (`|`). This format makes it easy to convert the text file into a CSV table in a JupyterLab notebook.
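The part of the ROR reply that matters here can be illustrated offline. The sketch below uses a hand-written miniature dictionary that contains only the keys the parsing script reads (`number_of_results`, `items`, `organization`, `name`, `country`, `country_name`); it is not a real API reply, and the helper name `extract_match` is hypothetical.

```python
def extract_match(response):
    """Return (institution name, country) from a decoded ROR affiliation reply."""
    items = response.get('items', [])
    if response.get('number_of_results', 0) > 0 and items:
        # Use the first candidate in the list, as the parsing script does
        org = items[0]['organization']
        return org['name'], org['country']['country_name']
    return 'not found', 'not found'

# Hand-written stand-in for a decoded JSON reply (illustrative values only)
sample_reply = {
    'number_of_results': 1,
    'items': [
        {'organization': {'name': 'Example University',
                          'country': {'country_name': 'Exampleland'}}}
    ],
}
print(extract_match(sample_reply))
```

Separating the extraction from the HTTP request keeps this step easy to check without network access.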

"""
import requests
import urllib.parse

# Output file: one pipe-delimited line per publication
outfile = open('Heart_NHLBI_2002_oneline.txt', 'w', encoding="utf-8")

# Input file: search results saved from PubMed in the PubMed format
infile = open('Heart_disease_publications_NHLBI_2002.txt', 'r', encoding="utf-8")
# Fields of the record currently being read
institution_info = ''
pmid = ''
journal = ''
def determine_ror_id(institution_info):
    # Match an affiliation string against ROR and return (institution name, country)
    response = None
    try:
        response = requests.get('https://api.ror.org/organizations?affiliation=' + urllib.parse.quote(institution_info), timeout=30).json()
        number_of_results = response['number_of_results']
        if number_of_results > 0:
            # Use the first candidate in the list returned by ROR
            institution_name = response['items'][0]['organization']['name']
            country = response['items'][0]['organization']['country']['country_name']
        else:
            institution_name = 'not found'
            country = 'not found'
            print(institution_info, response)
    except Exception:
        # Network failure, invalid JSON, or an unexpected response layout:
        # log the affiliation and carry on with the next record
        institution_name = 'not found'
        country = 'not found'
        print(institution_info)
        print(response)
    return (institution_name, country)

# Read the input file one line at a time
while True:
    line = infile.readline()
    # readline() returns an empty string at the end of the file
    if not line:
        break
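    if line.startswith('PMID'):
        # A new record starts here: write out the previous one first
        # (records without an AD field are skipped)
        if institution_info:
            institution_name, country = determine_ror_id(institution_info.strip())

            outfile.write(pmid.strip() + '|' + institution_info.strip() + '|' + institution_name + '|' + country + '|' + journal.strip() + '\n')
            institution_info = ''
            institution_name = ''
            country = ''
        # Do not carry the journal over to a record that has no JT field
        journal = ''

        # Split on the first '- ' only, so the value itself stays intact
        line_tokens = line.split('- ', 1)
        pmid = line_tokens[1].strip()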

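    if line.startswith('AD'):
        # Keep only the first AD field of a record: the first author's affiliation
        if not institution_info:
            # '|' is the output delimiter, so remove it from the value
            line = line.replace('|',' ')
            # Split on the first '- ' only: affiliations may contain '- ' themselves
            line_tokens = line.split('- ', 1)
            institution_info = line_tokens[1].strip()
            line = infile.readline()
            # Continuation lines of a wrapped field start with spaces
            while line and line.startswith(' '):
                line = line.replace('|',' ')
                institution_info = institution_info + ' ' + line.strip()
                line = infile.readline()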

    if line.startswith('JT'):
        line_tokens = line.split('- ', 1)
        journal = line_tokens[1].strip().lower()
        # Drop a leading article, but leave titles that merely begin
        # with the letters "the" (such as "theranostics") untouched
        if journal.startswith('the '):
            journal = journal[len('the '):]
        # Drop a trailing parenthetical or subtitle
        if '(' in journal:
            journal = journal.split('(', 1)[0]
        elif ' :' in journal:
            journal = journal.split(' :', 1)[0]
                
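# The loop writes a record only when the next PMID line appears,
# so the last record in the file has to be written here
if institution_info:
    institution_name, country = determine_ror_id(institution_info.strip())
    outfile.write(pmid.strip() + '|' + institution_info.strip() + '|' + institution_name + '|' + country + '|' + journal.strip() + '\n')

infile.close()
outfile.close()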
"""

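The conversion of the pipe-delimited output into a CSV table can be sketched with Python's standard `csv` module. The column names and the sample line below are assumptions made for illustration; the order follows the order in which the script writes its fields.

```python
import csv
import io

# Assumed column names, in the order the parsing script writes the fields
COLUMNS = ['pmid', 'affiliation', 'institution', 'country', 'journal']

def oneline_to_csv(infile, outfile):
    """Copy pipe-delimited lines to a CSV file with a header row."""
    # QUOTE_NONE: treat quotation marks in affiliations as ordinary characters
    reader = csv.reader(infile, delimiter='|', quoting=csv.QUOTE_NONE)
    writer = csv.writer(outfile, lineterminator='\n')
    writer.writerow(COLUMNS)
    for row in reader:
        if row:  # skip blank lines
            writer.writerow(row)

# Made-up output line for illustration only
sample = io.StringIO(
    "12345678|Dept. of Medicine, Example University|Example University|Exampleland|lancet\n"
)
result = io.StringIO()
oneline_to_csv(sample, result)
print(result.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by file handles opened with `open(..., encoding="utf-8", newline="")`. In a notebook, `pandas.read_csv` with `sep='|'`, `header=None`, and a `names` list is an alternative way to load the same file into a table.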

The Jupyter Notebook used for this article, along with the parsing script, a sample input file, and an example of the expected output file, can be found on [GitHub](https://github.com/drozenshteyn/Parsing-PubMed-Files-and-Sending-API-Requests).

The full MS thesis referenced here can also be found on [GitHub](https://github.com/drozenshteyn/Master-s-Thesis).

Note: I used GitHub embeds to publish this article.

Thank you for reading,

Diana

## References ##

1. U.S. National Library of Medicine, “Advanced search results - PubMed,” pubmed.ncbi.nlm.nih.gov. Available: https://pubmed.ncbi.nlm.nih.gov/advanced/
2. Research Organization Registry, “ROR,” ror.org. Available: https://ror.org/