# Extract data from PubMed

This notebook shows how to extract information from PubMed records. We will extract:

1. Publication title
2. Publication year
3. Full names of authors
4. Affiliation of each author

By applying minor code modifications, additional information can be extracted, such as keywords or even the full abstract.
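
As a minimal sketch of such a modification, the snippet below pulls the abstract and keywords out of a `PubmedArticle` element. The helper name `extract_abstract_and_keywords` and the sample XML are illustrative assumptions, not part of this notebook; the sample is hand-written and much smaller than real efetch output.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written record for illustration; real efetch output has many more fields.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>An example title.</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
      <KeywordList>
        <Keyword>stem cells</Keyword>
        <Keyword>splicing</Keyword>
      </KeywordList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def extract_abstract_and_keywords(article):
    """Return (abstract, keywords) for one PubmedArticle element (hypothetical helper)."""
    parts = article.findall("MedlineCitation/Article/Abstract/AbstractText")
    # itertext() also collects text nested inside inline markup such as <i> or <sub>
    abstract = " ".join("".join(p.itertext()).strip() for p in parts)
    keywords = [
        "".join(k.itertext()).strip()
        for k in article.findall("MedlineCitation/KeywordList/Keyword")
    ]
    return abstract, keywords


root = ET.fromstring(SAMPLE_XML)
for article in root.findall("PubmedArticle"):
    abstract, keywords = extract_abstract_and_keywords(article)
    print(abstract)
    print(keywords)
```

Records without an abstract or keyword list simply yield an empty string and an empty list, so the helper can be called on every article without extra checks.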

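The starting point is an **example list** of PubMed IDs. A small list of 3 IDs is included (commented out) in the first code cell; as written, the notebook builds the list by searching PubMed for each name in a list of last authors.

Along the way, records linking **research institutes** and **publications** hold the details about **authors** and their **affiliations**. The final network has **investigators** as nodes, with an edge between two investigators who co-authored a publication.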

-----------------------------------------------------------------------------------

### 1. Import packages, define functions and variables

In [13]:
import xml.etree.ElementTree as ET
from urllib.request import urlopen
import time
import pandas as pd
import urllib
from xml.dom import minidom
import xmltodict
import os
from tqdm import tnrange, tqdm_notebook
counter = 0     # Next integer key to assign in the `edges` dict (filled in section 2)
edges = dict()  # key -> (pmid, title, authors, affiliation, year, institute, contribution score)
# example_list = ['26046436', '32963239', '28793255']
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 500)

In [2]:
output_file = '/home/bay001/publication_list.tsv'

In [49]:
last_author_list = [
    'Yeo+GW', 'Ahrens+ET', 'Carson+DA', 'Cheresh+DA', 'Chien+S', 
    'Christman+KL', 'Engler+AJ', 'Goldstein+LSB', 'Guan+KL', 'Halpain+S',
    'Hevner+RF', 'Jamieson+C', 'Kaufman+DS', 'Kwon+EJ', 'Laurent+LC', 
    'Marsala+M', 'Muotri+AR', 'Parast+MM', 'Reya+T', 'Rich+JN', 
    'Sander+M', 'Wilkinson+MF', 'Willert+K', 'Wechsler-Reya+RJ', 
    'Snyder+EY', 'Rao+A', 'Gage+FH', 'Aguado+BA', 'Cook-Andersen+H', 
    'Coufal+NG', 'Gur-Cohen+S', 'Shtrahman+M', 'Soncin+F', 'Mercola+M', 
    'La+Spada+A'
]

def esearch(term, db='pubmed'):
    """
    Queries NCBI using the esearch utility. The PubMed ('pubmed') database is searched by default, and the term is matched against the author field.
    """
    term = term.replace(' ', '%20')
    url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db={db}&term={term}[author]&usehistory=y&retmax=5000'
    print(url)
    response = urllib.request.urlopen(url)
    return response.read()


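def get_pubs_from_author(author):
    """Return the list of PubMed IDs found for `author` (empty if none)."""
    try:
        ids = xmltodict.parse(esearch(author))['eSearchResult']['IdList']['Id']
    except TypeError:
        # IdList is parsed as None when the search returns no hits
        print(author)
        return []
    # xmltodict yields a bare string rather than a list when there is exactly one Id
    return [ids] if isinstance(ids, str) else ids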

example_list = []
if not os.path.exists(output_file):

    for author in last_author_list:
        example_list += get_pubs_from_author(author)

example_list = list(set(example_list))
print(len(example_list))
example_list[:3]

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Yeo+GW[author]&usehistory=y&retmax=5000
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Ahrens+ET[author]&usehistory=y&retmax=5000
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Carson+DA[author]&usehistory=y&retmax=5000
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Cheresh+DA[author]&usehistory=y&retmax=5000
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Chien+S[author]&usehistory=y&retmax=5000
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Christman+KL[author]&usehistory=y&retmax=5000
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Engler+AJ[author]&usehistory=y&retmax=5000
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Goldstein+LSB[author]&usehistory=y&retmax=5000
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db

['20001453', '24190326', '21828135']
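
The search step can also be sketched without `xmltodict`, under a few assumptions: the helper names `build_esearch_url` and `parse_id_list` are hypothetical, the sample responses are hand-written (not real NCBI output), and author names are passed with spaces (`'Yeo GW'`) rather than the pre-encoded `'Yeo+GW'` form used above, so that `urlencode` can do the escaping.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(author, db="pubmed", retmax=5000):
    """Build an esearch URL for an author query, letting urlencode do the escaping."""
    params = {"db": db, "term": f"{author}[author]", "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def parse_id_list(xml_bytes):
    """Return the IDs in an esearch response as a list (possibly empty)."""
    root = ET.fromstring(xml_bytes)
    return [el.text for el in root.findall("IdList/Id")]


# Hand-written samples shaped like esearch responses (not real output).
SAMPLE = b"<eSearchResult><Count>2</Count><IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>"
EMPTY = b"<eSearchResult><Count>0</Count><IdList/></eSearchResult>"

print(build_esearch_url("Yeo GW"))
print(parse_id_list(SAMPLE))
print(parse_id_list(EMPTY))
```

Because `findall` always returns a list, the zero-hit and single-hit cases need no special handling here.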

In [50]:
# Get the normalized institute name from affiliation
def getInstitute(affiliation):
    if 'Burnham' in affiliation:
        return 'Sanford Burnham Prebys'
    elif 'Scripps' in affiliation:
        return 'The Scripps Research Institute'
    elif 'Salk' in affiliation:
        return 'Salk Institute'
    elif 'University of California' in affiliation:
        return 'UC San Diego'
    elif 'La Jolla Institute for Allergy and Immunology' in affiliation:
        return 'LJIAI'
    else:
        return 'Other'
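
The same normalization can be written as a lookup table, which may be easier to extend. This is only a sketch: `INSTITUTE_RULES` and `get_institute` are hypothetical names, and the rules simply mirror the substring checks above, in the same order.

```python
# (substring to look for, normalized name), checked in order; the first match wins.
INSTITUTE_RULES = [
    ("Burnham", "Sanford Burnham Prebys"),
    ("Scripps", "The Scripps Research Institute"),
    ("Salk", "Salk Institute"),
    ("University of California", "UC San Diego"),
    ("La Jolla Institute for Allergy and Immunology", "LJIAI"),
]


def get_institute(affiliation, rules=INSTITUTE_RULES, default="Other"):
    """Return the normalized institute for an affiliation string (hypothetical helper)."""
    for needle, name in rules:
        if needle in affiliation:
            return name
    return default


print(get_institute("Laboratory of Genetics, Salk Institute for Biological Studies, La Jolla, CA, USA."))
print(get_institute("McMaster University, Hamilton, ON, Canada."))
```

Note that, as in the original function, the `'University of California'` rule matches any affiliation containing that phrase, whatever the campus, so the resulting `UC San Diego` label is an approximation.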

### 2. Extract information from PubMed element tree

In the last part of this block of code, we merge all the edges in the **edges** dictionary that have the same source and target nodes and add the merge count to the dictionary. This reduces the number of edges between a given institute and a given publication to 1.
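
The merge logic can be illustrated in isolation. The sketch below is a simplified stand-in, not the notebook's code: it keys records directly on `(pmid, institute)` and keeps only the author string and the count, and all names and data in it are made up.

```python
def merge_author(edges, pmid, institute, author):
    """Add `author` to the single record kept for (pmid, institute)."""
    key = (pmid, institute)
    if key in edges:
        authors, score = edges[key]
        # Same publication and institute: extend the author string, bump the count
        edges[key] = (authors + ", " + author, score + 1)
    else:
        edges[key] = (author, 1)


edges_demo = {}
for pmid, institute, author in [
    ("100", "UC San Diego", "A Smith"),
    ("100", "Salk Institute", "B Jones"),
    ("100", "UC San Diego", "C Lee"),
]:
    merge_author(edges_demo, pmid, institute, author)

print(edges_demo)
```

The count stored next to the author string plays the role of the 'contribution score': the number of authors from one institute on one publication.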

In [76]:
if not os.path.exists(output_file):
    edge_idx = {}
    progress = tnrange(len(example_list))
    for item in example_list:
        efetch = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=%s" % (item)
        try:
            handle = urlopen(efetch)
            data = handle.read()
            root = ET.fromstring(data)
            time.sleep(0.5)
            for article in root.findall("PubmedArticle"):
                pmid = article.find("MedlineCitation/PMID").text
                year = article.find("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year")
                if year is None: year = 'NA'
                else: year = year.text
                aulist = article.findall("MedlineCitation/Article/AuthorList/Author")
                title = article.find("MedlineCitation/Article/ArticleTitle")
                for author in aulist:
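                    # Compare with None explicitly: truth-testing an Element checks for children and is deprecated
                    if author.find('AffiliationInfo/Affiliation') is not None: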

                        last_name = author[0].text
                        fore_name = author[1].text 
                        if fore_name is None:
                            print("this pub does not have affiliation: {} (authors: {})".format(title.text, [a.text for a in aulist]))
                            fore_name = ["", ""]
                        if last_name is None:
                            print("this pub does not have affiliation: {} (authors: {})".format(title.text, [a.text for a in aulist]))
                            last_name = ""
                        affiliation = author.find('AffiliationInfo')[0].text
                        # if "San Diego" in affiliation or 'La Jolla' in affiliation:
                        institute = getInstitute(affiliation)

                        # Merge edges and compute the 'contribution score'

                        lookupKey = pmid + '_' + institute
                        if lookupKey in edge_idx:
                            oldRec = edges[edge_idx[lookupKey]]
                            try:
                                newRec = (oldRec[0],oldRec[1],oldRec[2] + ', ' + fore_name[0] + ' ' + last_name,oldRec[3],oldRec[4],oldRec[5], oldRec[6]+1)
                            except TypeError:

                                # newRec is not assigned when the concatenation fails, so report only the inputs
                                print("fore name", fore_name)
                                print("last_name", last_name)
                                print('OLD REC', oldRec)
                                raise
                            edges[edge_idx[lookupKey]] = newRec
                        else:
                            edges[counter] = (pmid, title.text, fore_name[0] + ' ' + last_name, affiliation, year, institute, 1)
                            edge_idx[lookupKey] = counter
                            counter += 1
        except Exception as e:
            print("problem with URL: {} ({})".format(efetch, e))
        progress.update(1)


this pub does not have affiliation: Oxidative stress in tumour-bearing fore-stomach and distant normal organs of Swiss albino mice. (authors: [None, None, None])
this pub does not have affiliation: Intracardiac ablation for atrioventricular nodal reentry tachycardia using a 6 mm distal electrode cryoablation catheter: Prospective, multicenter, North American study (ICY-AVNRT STUDY). (authors: [None, None, None, None, None, None, None, None, None, None, None, None, None, None])
this pub does not have affiliation: Chemopreventive effects of mustard (Brassica compestris) on chemically induced tumorigenesis in murine forestomach and uterine cervix. (authors: [None, None, None, None, None])
this pub does not have affiliation: Chemopreventive effects of Cuminum cyminum in chemically induced forestomach and uterine cervix tumors in murine model systems. (authors: [None, None, None, None, None])
this pub does not have affiliation: Amelanotic Melanoma in the Vicinity of Acquired Melanocytic Nev

[None, None, None, None, None]
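
The messages above are printed whenever `author[1].text` is `None`, which depends on the order and content of the children of each `<Author>` element. A possible alternative, sketched here with a hypothetical helper `describe_author` and hand-written sample XML, is to look fields up by tag name instead of by position.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the second author stands in for an entry without personal names.
SAMPLE_AUTHORS = """
<AuthorList>
  <Author>
    <LastName>Doe</LastName>
    <ForeName>Jane Q</ForeName>
    <Initials>JQ</Initials>
    <AffiliationInfo><Affiliation>Example University, Example City.</Affiliation></AffiliationInfo>
  </Author>
  <Author>
    <CollectiveName>Example Consortium</CollectiveName>
  </Author>
</AuthorList>
"""


def describe_author(author):
    """Return ('<initial> <last name>', affiliation or None) for one Author element."""
    last = author.findtext("LastName", default="")
    fore = author.findtext("ForeName", default="")
    affiliation = author.findtext("AffiliationInfo/Affiliation")
    # fore[:1] is the first initial, or '' when there is no fore name
    name = f"{fore[:1]} {last}".strip()
    return name, affiliation


for author in ET.fromstring(SAMPLE_AUTHORS).findall("Author"):
    print(describe_author(author))
```

With this approach a missing name or affiliation shows up as an empty string or `None` rather than as an exception or a misattributed field.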

### 3. Create Pandas Dataframe

In [77]:
df = pd.DataFrame.from_dict(data = edges,
                            orient='index',
                            columns = ['pmid', 'title', 'author', 'affiliation',
                                       'year', 'institute', 'contribution score'])
# Authors are joined with ', ', so strip both sides to drop the leading space
df['last author'] = df['author'].apply(lambda x: x.split(',')[-1].strip())
df.to_csv(output_file, sep='\t')

In [108]:
df = pd.read_csv(output_file, sep='\t', index_col=0)
df.head()

Unnamed: 0,pmid,title,author,affiliation,year,institute,contribution score,last author
0,35798029,Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission.,"S Karthikeyan, P De Hoff, G Humphrey, A Birmingham, K Jepsen, S Farmer, H Tubb, T Valles, C Tribelhorn, R Tsai, S Aigner, S Sathe, N Moshiri, B Henson, A Mark, A Hakim, N Baer, T Barber, P Belda-Ferre, M Chacón, W Cheung, E Cresini, E Eisner, A Lastrella, E Lawrence, C Marotz, T Ngo, T Ostrander, A Plascencia, R Salido, P Seaver, E Smoot, D McDonald, R Neuhard, A Scioscia, A Satterlund, E Simmons, D Abelman, D Brenner, J Bruner, A Buckley, M Ellison, J Gattas, S Gonias, M Hale, F Hawkins, L ...","Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.",2022.0,UC San Diego,77,R Knight
1,35798029,Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission.,"J Levy, C Aceves, C Anderson, K Gangavarapu, E Hufbauer, E Kurzban, J Lee, N Matteson, E Parker, S Perkins, K Ramesh, R Robles-Sikisaka, M Schwab, E Spencer, S Wohl, L Nicholson, I Mchardy, M Zeller, K Andersen","Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, USA.",2022.0,The Scripps Research Institute,19,K Andersen
2,35798029,Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission.,"D Dimmock, C Hobbs, O Bakhtar, A Harding, A Mendoza, A Bolze, D Becker, E Cirulli, M Isaksson, K Schiabor Barrett, N Washington, J Malone, A Schafer, N Gurfield, S Stous, B Austin, D MacCannell, S Kingsmore, W Lee, S Shah, E McDonald, A Yu","Rady Children's Institute for Genomic Medicine, San Diego, CA, USA.",2022.0,Other,22,A Yu
3,35781533,The splicing factor RBM17 drives leukemic stem cell maintenance by evading nonsense-mediated decay of pro-leukemic factors.,"L Liu, A Vujovic, N Deshpande, G Anande, H Chen, J Xu, M Minden, A Unnikrishnan, K Hope, Y Lu","Department of Medicine, Faculty of Health Sciences, McMaster University, Hamilton, ON, Canada.",2022.0,Other,10,Y Lu
4,35781533,The splicing factor RBM17 drives leukemic stem cell maintenance by evading nonsense-mediated decay of pro-leukemic factors.,"S Sathe, G Yeo","Department of Cellular and Molecular Medicine, Stem Cell Program and Institute for Genomic Medicine, University of California at San Diego, San Diego, CA, USA.",2022.0,UC San Diego,2,G Yeo


### Add. Expand author list

What we really want now are 'interactions' between any two investigators (interaction = co-authored a paper together).


In [109]:
# One row per author: split the comma-separated string and explode it (the index is kept)
s = df['author'].str.split(', ').explode().dropna()
s.name = 'author' # needs a name to join
del df['author']
df = df.join(s)

In [110]:
df.head()

Unnamed: 0,pmid,title,affiliation,year,institute,contribution score,last author,author
0,35798029,Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission.,"Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.",2022.0,UC San Diego,77,R Knight,S Karthikeyan
0,35798029,Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission.,"Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.",2022.0,UC San Diego,77,R Knight,P De Hoff
0,35798029,Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission.,"Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.",2022.0,UC San Diego,77,R Knight,G Humphrey
0,35798029,Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission.,"Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.",2022.0,UC San Diego,77,R Knight,A Birmingham
0,35798029,Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission.,"Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.",2022.0,UC San Diego,77,R Knight,K Jepsen


In [111]:
df[df['author']=='K Willert'].head(1)

Unnamed: 0,pmid,title,affiliation,year,institute,contribution score,last author,author
415,34667113,A FZD7-specific Antibody-Drug Conjugate Induces Ovarian Tumor Regression in Preclinical Models.,"Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, California.",2022.0,UC San Diego,11,K Willert,K Willert


### Add. Filter

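Keep only the rows whose author is an SCRM-affiliated investigator. The names are listed here as they appear in the `author` column (first initial plus last name), which doesn't always match the search term used above.
Publications older than 2012 are filtered out in a later cell.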

In [112]:
real_author_list = [
 'B Aguado',
 'E Ahrens',
 'H Cook-Andersen',
 'D Carson',
 'D Cheresh',
 'S Chien',
 'K Christman',
 'N Coufal',
 'A Engler',
 'F Gage',
 'L Goldstein',
 'K Guan',
 'S Gur-Cohen',
 'S Halpain',
 'R Hevner',
 'C Jamieson',
 'D Kaufman',
 'E Kwon',
 'A La Spada',
 'L Laurent',
 'M Marsala',
 'M Mercola',
 'A Muotri',
 'M Parast',
 'A Rao',
 'T Reya',
 'J Rich',
 'M Sander',
 'M Shtrahman',
 'E Snyder',
 'F Soncin',
 'R Wechsler-Reya',
 'M Wilkinson',
 'K Willert',
 'G Yeo',
]


df = df[df['author'].isin(real_author_list)]
# df = df[df['year'] >= 2012]
df.head()

Unnamed: 0,pmid,title,affiliation,year,institute,contribution score,last author,author
0,35798029,Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission.,"Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.",2022.0,UC San Diego,77,R Knight,L Laurent
0,35798029,Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission.,"Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.",2022.0,UC San Diego,77,R Knight,G Yeo
4,35781533,The splicing factor RBM17 drives leukemic stem cell maintenance by evading nonsense-mediated decay of pro-leukemic factors.,"Department of Cellular and Molecular Medicine, Stem Cell Program and Institute for Genomic Medicine, University of California at San Diego, San Diego, CA, USA.",2022.0,UC San Diego,2,G Yeo,G Yeo
6,35778567,Aberrant NOVA1 function disrupts alternative splicing in early stages of amyotrophic lateral sclerosis.,"Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA.",2022.0,UC San Diego,7,G Yeo,G Yeo
7,35767654,MECP2-related pathways are dysregulated in a cortical organoid model of myotonic dystrophy.,"Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA 92093, USA.",2022.0,UC San Diego,11,G Yeo,G Yeo


In [113]:
df.to_csv('publication_list_filtered_pi.tsv', sep='\t')

In [117]:
# df['year'].fillna(2023, inplace=True)
# df['year_int'] = df['year'] # .apply(lambda x: int(x.replace('NA','2023')))
# df.sort_values('year', ascending=False).head()
df = df[df['year']>=2012]

In [118]:
def get_interactions(df):
    interaction = []
    for pmid in set(df['pmid']):
    
        for author1 in df[df['pmid']==pmid]['author']:
            for author2 in df[df['pmid']==pmid]['author']:
                if author1 != author2:
                    if sorted([author1, author2])+[pmid] not in interaction:
                        interaction.append(sorted([author1, author2])+[pmid])
    interaction = pd.DataFrame(interaction)
    interaction.columns = ['author1', 'author2', 'pmid']
    return interaction

interactions = get_interactions(df)
interactions.head()

Unnamed: 0,author1,author2,pmid
0,E Snyder,L Laurent,27213850
1,J Rich,R Wechsler-Reya,25493560
2,G Yeo,L Laurent,35332213
3,H Cook-Andersen,M Wilkinson,30679197
4,L Laurent,M Parast,25034944
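
The pairing step can also be expressed with `itertools.combinations`, which yields each unordered pair once and so avoids the duplicate check. The sketch below works on plain `(pmid, author)` tuples rather than a dataframe; `pairwise_interactions` is a hypothetical name and the PMIDs are made up.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_interactions(rows):
    """rows: iterable of (pmid, author). Return a list of (author1, author2, pmid)."""
    authors_by_pmid = defaultdict(set)
    for pmid, author in rows:
        authors_by_pmid[pmid].add(author)  # the set removes repeated authors
    result = []
    for pmid in sorted(authors_by_pmid):
        # Sorting first guarantees author1 < author2 in every pair
        for author1, author2 in combinations(sorted(authors_by_pmid[pmid]), 2):
            result.append((author1, author2, pmid))
    return result


rows_demo = [
    (1, "G Yeo"), (1, "L Laurent"), (1, "G Yeo"),
    (2, "M Parast"), (2, "L Laurent"), (2, "G Yeo"),
    (3, "K Willert"),
]
for row in pairwise_interactions(rows_demo):
    print(row)
```

Unlike `get_interactions`, this version returns rows ordered by PMID, so row order will differ from the output shown above even though the set of pairs is the same.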


In [119]:
df.sort_values(by='year', ascending=True).head()

Unnamed: 0,pmid,title,affiliation,year,institute,contribution score,last author,author
7488,23236291,Deciphering the transcriptional-regulatory network of flocculation in Schizosaccharomyces pombe.,"Institute for Biocomplexity and Informatics, University of Calgary, Calgary, Alberta, Canada.",2012.0,Other,1,E Kwon,E Kwon
7929,22818276,Laparoscopic vs. open liver resection for malignant liver disease. A systematic review.,"Department of Surgery, Ward 31, Foresterhill, Aberdeen Royal Infirmary, Aberdeen AB25 2ZA, United Kingdom. ahsan.rao@nhs.net",2012.0,Other,1,A Rao,A Rao
6578,23186707,Transplantation in the future.,"Laboratory of Genetics, Salk Institute for Biological Studies, La Jolla, CA, USA. gage@salk.edu",2012.0,Salk Institute,1,F Gage,F Gage
5254,22682913,"The two-component model of memory development, and its potential implications for educational settings.","Center for Lifespan Psychology, Max Planck Institute for Human Development, Lentzeallee 94, 14195 Berlin, Germany.",2012.0,Other,1,M Sander,M Sander
18866,22433990,Wnt signaling from membrane to nucleus: β-catenin caught in a loop.,"Westmead Institute for Cancer Research, The University of Sydney, Westmead Millennium Institute at Westmead Hospital, Westmead, NSW 2145, Australia.",2012.0,Other,1,C Jamieson,C Jamieson


### Use this space to double check our work.

In [120]:
interactions[
    ((interactions['author1']=='G Yeo') & (interactions['author2']=='L Laurent')) | \
    ((interactions['author1']=='L Laurent') & (interactions['author2']=='G Yeo'))
]

Unnamed: 0,author1,author2,pmid
2,G Yeo,L Laurent,35332213
12,G Yeo,L Laurent,34726486
43,G Yeo,L Laurent,33564781
47,G Yeo,L Laurent,35703436
48,G Yeo,L Laurent,35703437
77,G Yeo,L Laurent,34909793
81,G Yeo,L Laurent,34508652
88,G Yeo,L Laurent,33861950
107,G Yeo,L Laurent,35411350
110,G Yeo,L Laurent,35575492
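
The same check can be generalized to all pairs by counting shared publications. This is a sketch under the assumption that the interactions are available as `(author1, author2, pmid)` tuples with the two names already sorted; `count_shared_publications` is a hypothetical name, and the three demo rows are copied from the outputs above.

```python
from collections import Counter


def count_shared_publications(interactions):
    """interactions: iterable of (author1, author2, pmid). Return a Counter keyed by pair."""
    return Counter((author1, author2) for author1, author2, _pmid in interactions)


demo = [
    ("G Yeo", "L Laurent", "35332213"),
    ("G Yeo", "L Laurent", "34726486"),
    ("J Rich", "R Wechsler-Reya", "25493560"),
]
counts = count_shared_publications(demo)
print(counts.most_common(1))
```

Such a count could serve as an edge weight if the network is later collapsed to one edge per pair of investigators.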


### 4. Create NiceCX network from Pandas

Here we create a NiceCX network from the `interactions` dataframe built above.
When creating the network, we specify which Pandas columns to use as source and target nodes (`author1` and `author2`). The arguments for target node attributes, edge attributes, and a default edge interaction are left commented out in the call, so the resulting network has no node or edge attributes.

The last 2 lines of code display the network in the notebook via the cyjupyter widget.


In [121]:
import ndex2
from cyjupyter import Cytoscape

nice_cx = ndex2.create_nice_cx_from_pandas(
    interactions, source_field='author1', target_field='author2', 
    source_node_attr=[], 
    #target_node_attr=['title', 'year'], 
    #edge_attr=['author', 'affiliation', 'contribution score'], 
    #edge_interaction='contributed to', 
    source_represents=None, 
    target_represents=None
)
nice_cx.print_summary()

# Display the network in the notebook using the cyjupyter widget
nice_cx_viz = nice_cx.to_cx()
Cytoscape(data=nice_cx_viz, format='cx')

Name: created from pandas by ndex2.create_nice_cx_from_pandas()
Nodes: 32
Edges: 156
Node Attributes: 0
Edge Attributes: 0

Generating CX


Cytoscape(data=[{'numberVerification': [{'longNumber': 281474976710655}]}, {'metaData': [{'name': 'nodes', 'el…

### 5. Upload to NDEx

This last step uploads the network to your NDEx account. You need to provide your NDEx account credentials (**user** and **password**) in order to upload the network.
The code will also generate a clickable URL that you can use to open a browser tab and view your network.

In [122]:
server = 'www.ndexbio.org'

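# Set credentials to access your NDEx account.
# Prompt for them instead of hard-coding secrets in the notebook.
from getpass import getpass
user = input('NDEx user: ')
password = getpass('NDEx password: ')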

# Upload the network
result = nice_cx.upload_to(server, user, password)

# Generate a clickable link to view your network in the browser directly from the notebook.
# Please note that the browser might ask you to log in to your NDEx account in order to view the network.
base_url = 'https://www.ndexbio.org/viewer/networks/'
print(f"View your network: {base_url}{result.split('/')[-1]}")


Generating CX
View your network: https://www.ndexbio.org/viewer/networks/60bfb94d-0ecf-11ed-ac45-0ac135e8bacf
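
One way to keep credentials out of the notebook is to read them from environment variables. The sketch below assumes two variable names, `NDEX_USER` and `NDEX_PASSWORD`, which are illustrative choices and not an NDEx convention; the helper `load_ndex_credentials` is likewise hypothetical.

```python
import os


def load_ndex_credentials(env=os.environ):
    """Return (user, password) from the given mapping, or raise if either is missing."""
    user = env.get("NDEX_USER")
    password = env.get("NDEX_PASSWORD")
    if not user or not password:
        raise RuntimeError("Set NDEX_USER and NDEX_PASSWORD before uploading")
    return user, password


# Demo with an explicit mapping so the example runs without real credentials.
demo_env = {"NDEX_USER": "alice", "NDEX_PASSWORD": "not-a-real-password"}
user_demo, password_demo = load_ndex_credentials(demo_env)
print(user_demo)
```

In the notebook, calling `load_ndex_credentials()` with no argument would read from the real environment, and the returned pair could be passed to `nice_cx.upload_to(server, user, password)` as above.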


### 6. Next steps
 
Your network is now saved in your NDEx account and its visibility set to PRIVATE, so you are the only one who can see it. You can perform additional operations on the network directly in NDEx; these include:
 
 - Adding/editing network attributes (title, description, version, etc)
 - Changing the network visibility
 - Importing it in Cytoscape for visual styling or further analysis
 - Requesting a DOI
 - Querying to extract sub-networks of interest

In [127]:
df[(df['author']=='E Kwon') & (df['affiliation'].str.contains('San Diego'))]

Unnamed: 0,pmid,title,affiliation,year,institute,contribution score,last author,author
5306,34401968,Pharmacokinetic Analysis of Peptide-Modified Nanoparticles with Engineered Physicochemical Properties in a Mouse Model of Traumatic Brain Injury.,"Department of Nanoengineering, University of California San Diego, La Jolla , CA , USA.",2021.0,UC San Diego,4,E Kwon,E Kwon
8065,34401968,Pharmacokinetic Analysis of Peptide-Modified Nanoparticles with Engineered Physicochemical Properties in a Mouse Model of Traumatic Brain Injury.,"Department of Nanoengineering, University of California San Diego, La Jolla , CA , USA.",2021.0,UC San Diego,4,E Kwon,E Kwon
11292,34870408,Targeting the Extracellular Matrix in Traumatic Brain Injury Increases Signal Generation from an Activity-Based Nanosensor.,"Department of Bioengineering, University of California-San Diego, La Jolla, California 92093, United States.",2021.0,UC San Diego,3,E Kwon,E Kwon
15133,32100994,An Activity-Based Nanosensor for Traumatic Brain Injury.,"Department of Bioengineering, University of California, San Diego, La Jolla, California 92093, United States.",2020.0,UC San Diego,5,E Kwon,E Kwon


-----------------------------------------------------------------------------------

###### Questions/comments:   rpillich@ucsd.edu