# Graph Relationships Among Researchers

We are going to create graphs describing relationships between researchers based on co-authorship. In this notebook we use [Biopython](http://biopython.org/) to query PubMed and retrieve citation information for articles published by various researchers.

Feel free to create your own list of researchers (including yourself!).



### Uncomment and run the cell below if you need to install biopython

In [None]:
#!conda install biopython -y

In [1]:
from Bio import Entrez
import networkx as nx
import os
DATADIR = os.getcwd()
print(os.path.exists(DATADIR))
from IPython.display import Image
import getpass
import gzip
import pickle

True


### An Example List of BMI Faculty

Because names are not unique identifiers, querying PubMed by name can be challenging. For example, I try to publish as "Brian E Chapman", but some of my papers appear under "Brian Chapman". The list below is copied from a spreadsheet, with some tweaking to put each name in its most common published form. Because the spreadsheet stores each name as LASTNAME, a tab, then FIRSTNAME, the code rearranges the names into FIRSTNAME LASTNAME form.
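One way to soften the name-ambiguity problem is to search for several spellings at once. The sketch below is an illustration, not part of the original notebook: `name_variants` and `author_query` are hypothetical helpers, and the assumption that PubMed matches a natural-order full name under its `[Author]` field tag should be checked against a real search before relying on it.

```python
def name_variants(full_name):
    """Return the name as given, plus a variant without middle initials."""
    parts = full_name.split()
    variants = [" ".join(parts)]
    last = len(parts) - 1
    # Drop single-letter tokens (with or without a period) that sit strictly
    # between the first and last token, e.g. the "E" in "Brian E Chapman".
    trimmed = [p for i, p in enumerate(parts)
               if i in (0, last) or len(p.rstrip(".")) > 1]
    short = " ".join(trimmed)
    if short not in variants:
        variants.append(short)
    return variants


def author_query(full_name):
    """Join the variants into one search term (assumes the [Author] tag)."""
    return " OR ".join("%s[Author]" % v for v in name_variants(full_name))


print(author_query("Brian E Chapman"))
```

Names such as "R Scott Evans" or "Guilherme Del Fiol" pass through unchanged, because only single-letter middle tokens are removed. Broadening the query this way also raises the chance of picking up papers by a different person with the same name.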


In [None]:
faculty = [tuple(s.split("\t")) for s in 
"""AbdelRahman	Samir E
Adler	Frederick R
Bray	Bruce E
Camp	Nicola J
Chapman	Brian E
Chapman	Wendy W
Conway	Michael A
Cummins	Mollie R
Del Fiol	Guilherme
Drews	Frank A
Egger	Marlene J
Eilbeck	Karen
Evans	R Scott
Facelli	Julio C
Gibson	Bryan S
Gouripeddi	Ramkiran
Haug	Peter J
Huff	Stanley M
Hurdle	John F
Kawamoto	Kensaku
Lee	Younghee
Narus	Scott P
Nebeker	Jonathan
Parker	Dennis L
Piccolo	Stephen
Quinlan	Aaron
Samore	Matthew H
Sauer	Brian C
Staes	Catherine J
Sward	Katherine A
Weir	Charlene R
Yandell	Mark
Dean	J Michael
Gesteland	Per H
Gundlapalli	Adi V
Jackson	Brian R
Lincoln	Michael J
Morris	Alan H
Xu	Wu""".split("\n")]
faculty = ["%s %s"%(f[1],f[0]) for f in faculty]


### Here is a shorter, alternative list
#### Edit as needed. Running this cell overrides the full list above.

In [2]:
faculty = ["Brian E Chapman", "David Gur", "Wendy W Chapman", "Peter J Haug", "Dennis L Parker", "Matthew H Samore"]

### Get the pubmed IDs matching query

In [3]:
email_string = input("Enter your e-mail: ").strip()

Enter your e-mail: brian.chapman@utah.edu


In [4]:
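def search(query, email=''):
    # NCBI asks for a contact e-mail with every Entrez request.
    Entrez.email = email
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='100',  # return at most 100 PubMed IDs
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    handle.close()
    return results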

### Fetch papers corresponding to ids

In [20]:
def fetch_details(id_list, email="brian.chapman@utah.edu"):
    # efetch expects the PubMed IDs as one comma-separated string.
    ids = ','.join(id_list)
    Entrez.email = email
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    handle.close()
    return results

In [7]:
help(Entrez.efetch)

Help on function efetch in module Bio.Entrez:

efetch(db, **keywords)
    Fetch Entrez results which are returned as a handle.
    
    EFetch retrieves records in the requested format from a list of one or
    more UIs or from user's environment.
    
    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
    
    Return a handle to the results.
    
    Raises an IOError exception if there's a network error.
    
    Short example:
    
    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.efetch(db="nucleotide", id="AY851612", rettype="gb", retmode="text")
    >>> print(handle.readline().strip())
    LOCUS       AY851612                 892 bp    DNA     linear   PLN 10-APR-2007
    >>> handle.close()
    
    This will automatically use an HTTP POST rather than HTTP GET if there
    are over 200 identifiers as recommended by the NCBI.
    
    databas

### Get Co-authorship

Entrez returns a lot of information, so we pare it down to just the author names. We need exception handling because the returned papers don't always have the fields we want.

In [8]:
def get_coauthor_lists(papers):
    """Map each article title to its list of "ForeName LastName" strings."""
    paper_authors = {}
    for p in papers:
        try:
            tmp = p['MedlineCitation']
            alist = []
            for a in tmp['Article']['AuthorList']:
                try:
                    s = "%s %s"%(a['ForeName'],a['LastName'])
                    alist.append(s)
                except Exception:
                    # Skip authors that lack a ForeName or LastName.
                    pass
            paper_authors[tmp['Article']['ArticleTitle']] = alist
        except Exception:
            # Skip papers that lack the expected fields.
            pass
    return paper_authors
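To see which records get dropped and why, it can help to run the same extraction on hand-made data. The sketch below is a variant that uses `dict.get` and membership tests instead of exceptions. The helper `authors_of` and the mock records are made up for illustration and are plain dictionaries, not real Entrez output. Unlike the cell above, it keeps a paper that has a title but no author list, with an empty list of names.

```python
def authors_of(paper):
    """Return (title, names) for one PubmedArticle-like dictionary."""
    article = paper.get("MedlineCitation", {}).get("Article", {})
    names = []
    for author in article.get("AuthorList", []):
        # Group authors carry a CollectiveName instead of ForeName/LastName.
        if "ForeName" in author and "LastName" in author:
            names.append("%s %s" % (author["ForeName"], author["LastName"]))
    return article.get("ArticleTitle"), names


# Hand-made records that mimic the nesting of a parsed PubMed response.
mock_papers = [
    {"MedlineCitation": {"Article": {
        "ArticleTitle": "A made-up title",
        "AuthorList": [
            {"ForeName": "Ada", "LastName": "Example"},
            {"CollectiveName": "Hypothetical Study Group"},
        ]}}},
    {"MedlineCitation": {"Article": {"ArticleTitle": "No author list"}}},
    {"PubmedData": {}},  # no citation at all
]

paper_authors = {}
for p in mock_papers:
    title, names = authors_of(p)
    if title is not None:
        paper_authors[title] = names

print(paper_authors)
```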

In [9]:

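def get_faculty_coauthors(faculty, email=''):
    # Search PubMed for the name, fetch the matching records, then extract authors.
    id_list = search(faculty, email=email)['IdList']
    papers = fetch_details(id_list, email=email)["PubmedArticle"]
    return get_coauthor_lists(papers)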

In [11]:
Entrez.email = "brian.chapman@utah.edu"
handle = Entrez.esearch(db='pubmed', 
                        sort='relevance', 
                        retmax='100',
                        retmode='xml', 
                        term="Brian E. Chapman")

In [13]:
type(handle)

_io.TextIOWrapper

In [14]:
results = Entrez.read(handle)


In [18]:
results.keys()

dict_keys(['Count', 'RetMax', 'RetStart', 'IdList', 'TranslationSet', 'TranslationStack', 'QueryTranslation'])

In [19]:
results["IdList"]

['29135365', '27175226', '24556644', '26165778', '23920642', '23304270', '22081224', '21459155', '20693106', '21031012', '18581389', '16189435', '15854841', '17354729', '15848266', '15016383', '15671008', '16685891', '15063861', '15360860', '14626297', '12807805', '12815647', '12490516']

In [21]:
result_details = fetch_details(results["IdList"])

### What is `result_details`?

In [22]:
type(result_details)

Bio.Entrez.Parser.StructureElement

In [27]:
isinstance(result_details, dict)
len(result_details)

2

In [26]:
result_details.keys()

dict_keys(['PubmedArticle', 'PubmedBookArticle'])

In [29]:
type(result_details['PubmedArticle'])

list

In [30]:
type(result_details['PubmedArticle'][0])

Bio.Entrez.Parser.DictionaryElement

In [31]:
result_details['PubmedArticle'][0].keys()


dict_keys(['MedlineCitation', 'PubmedData'])

In [32]:
result_details['PubmedArticle'][0]['PubmedData']

{'History': [DictElement({'Year': '2017', 'Month': '11', 'Day': '15', 'Hour': '6', 'Minute': '0'}, attributes={'PubStatus': 'pubmed'}), DictElement({'Year': '2018', 'Month': '4', 'Day': '27', 'Hour': '6', 'Minute': '0'}, attributes={'PubStatus': 'medline'}), DictElement({'Year': '2017', 'Month': '11', 'Day': '15', 'Hour': '6', 'Minute': '0'}, attributes={'PubStatus': 'entrez'})], 'PublicationStatus': 'ppublish', 'ArticleIdList': [StringElement('29135365', attributes={'IdType': 'pubmed'}), StringElement('10.1148/radiol.2017171115', attributes={'IdType': 'doi'})]}

In [36]:
result_details['PubmedArticle'][0]['MedlineCitation']

DictElement({'CitationSubset': ['AIM', 'IM'], 'OtherID': [], 'OtherAbstract': [], 'KeywordList': [], 'SpaceFlightMission': [], 'GeneralNote': [], 'PMID': StringElement('29135365', attributes={'Version': '1'}), 'DateCompleted': {'Year': '2018', 'Month': '04', 'Day': '26'}, 'DateRevised': {'Year': '2018', 'Month': '04', 'Day': '26'}, 'Article': DictElement({'ELocationID': [StringElement('10.1148/radiol.2017171115', attributes={'EIdType': 'doi', 'ValidYN': 'Y'})], 'Language': ['eng'], 'ArticleDate': [DictElement({'Year': '2017', 'Month': '11', 'Day': '13'}, attributes={'DateType': 'Electronic'})], 'Journal': {'ISSN': StringElement('1527-1315', attributes={'IssnType': 'Electronic'}), 'JournalIssue': DictElement({'Volume': '286', 'Issue': '3', 'PubDate': {'Year': '2018', 'Month': '03'}}, attributes={'CitedMedium': 'Internet'}), 'Title': 'Radiology', 'ISOAbbreviation': 'Radiology'}, 'ArticleTitle': 'Deep Learning to Classify Radiology Free-Text Reports.', 'Pagination': {'MedlinePgn': '845-85

In [None]:
# search() expects a single query string, so look up one researcher at a time.
id_list = search(faculty[0], email=email_string)["IdList"]
#examine_data = fetch_details(id_list, email=email_string)

### Author:Co-author dictionary

In [None]:
coauthors_with_ext = {f: get_faculty_coauthors(f, email=email_string) for f in faculty}

In [None]:
with gzip.open("researchers_pubmed.pickle.gzip", "wb") as f0:
    pickle.dump(coauthors_with_ext, f0)
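Reading the file back uses the same two modules in reverse. The sketch below round-trips a small made-up dictionary through a temporary file so that it runs without the PubMed results. The file name and sample data are placeholders, not the notebook's real output.

```python
import gzip
import os
import pickle
import tempfile

# Made-up stand-in with the same nesting as coauthors_with_ext.
sample = {"Researcher A": {"Made-up title": ["Researcher A", "Researcher B"]}}

path = os.path.join(tempfile.mkdtemp(), "sample.pickle.gzip")

# Write: gzip.open in binary mode gives pickle a compressed file object.
with gzip.open(path, "wb") as f0:
    pickle.dump(sample, f0)

# Read: open the same file with "rb" and unpickle it.
with gzip.open(path, "rb") as f0:
    restored = pickle.load(f0)

print(restored == sample)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.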

In [None]:
!ls -l
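With the dictionary saved, a natural next step toward the co-authorship graph is to count how many papers each pair of authors shares. The sketch below uses only the standard library. The function `coauthor_edges` and the sample data are hypothetical, and it assumes the `{researcher: {title: [authors]}}` layout built above, with article titles serving as paper identifiers.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(coauthors):
    """Count shared papers per author pair."""
    seen_titles = set()
    weights = Counter()
    for papers in coauthors.values():
        for title, authors in papers.items():
            # The same paper can come back for two researchers; count it once.
            if title in seen_titles:
                continue
            seen_titles.add(title)
            # Sorting makes (a, b) and (b, a) the same dictionary key.
            for a, b in combinations(sorted(set(authors)), 2):
                weights[(a, b)] += 1
    return weights


# Made-up data for illustration only.
sample = {
    "Researcher A": {
        "Paper 1": ["Researcher A", "Researcher B"],
        "Paper 2": ["Researcher A", "Researcher B", "Researcher C"],
    },
    "Researcher B": {
        "Paper 1": ["Researcher A", "Researcher B"],
        "Paper 3": ["Researcher B", "Researcher C"],
    },
}

edges = coauthor_edges(sample)
for (a, b), w in sorted(edges.items()):
    print(a, "--", b, "shared papers:", w)
```

Each pair and its count could then be loaded into a `networkx.Graph` with `G.add_edge(a, b, weight=w)`. Deduplicating by title is a simplification, since two distinct papers can share a title.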