# Intro and Background

In 2018 I published a piece of software called MELODI (http://melodi.biocompute.org.uk/). Essentially, it compares the text from two sets of publications, based around a [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/) search, and identifies enriched terms common to both. I realised that a set of text could also be based on a person, making it possible to identify enriched terms for a person and terms shared across two people. At the same time the JGI launched a competition to analyse the [University of Bristol's PURE data](https://research-information.bris.ac.uk) in a novel way, which I entered using some of the ideas from the MELODI work. This led to the production of AXON (http://axon.biocompute.org.uk/) and an AXON instance covering the University of Bristol's academic research output (http://axon-bristol.biocompute.org.uk/). Maintaining this and keeping it up to date was not feasible, however, as I currently work at the Integrative Epidemiology Unit and this is not really epidemiology.

Still, I think the ideas and (some of) the code might be of interest to others.


### Setup

Possibly the most important aspect of the data for this project is ensuring robust and unique identifiers. For individuals this can be achieved using ORCID identifiers (https://orcid.org/) and for publications we can use PubMed identifiers (https://www.ncbi.nlm.nih.gov/pubmed/).  
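
Since everything downstream relies on these identifiers being right, it can be worth sanity-checking ORCID iDs before using them. The sketch below assumes the ISO 7064 MOD 11-2 check digit that ORCID describes for its iDs; the function name `is_valid_orcid` is hypothetical and is not part of this project's scripts.

```python
import re

# Four dash-separated blocks of four characters; the final character may be X
ORCID_PATTERN = re.compile(r'^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$')

def is_valid_orcid(orcid):
    """Return True if the string is a well-formed ORCID iD with a matching check digit."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace('-', '')
    total = 0
    # MOD 11-2: fold the first 15 digits into a running total
    for ch in digits[:15]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = 'X' if result == 10 else str(result)
    return digits[15] == expected

print(is_valid_orcid('0000-0001-7328-4233'))
```

A check like this only catches typos and malformed strings; it says nothing about whether the iD belongs to the person you think it does.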

First, let's create some directories for data and output:

In [1]:
import os
import config
os.makedirs('output',exist_ok=True)
os.makedirs('data',exist_ok=True)

Check the Python executable:

In [2]:
import sys
sys.executable

'/Users/be15516/anaconda3/envs/jgi-data-week-workshop/bin/python'

The result should be something like `/xxx/xxx/anaconda3/envs/jgi-data-week-workshop/bin/python`

### PubMed

PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) comprises more than 29 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.

We can use some simple commands to get PubMed data. First, let's import the pubmed function:

In [12]:
from scripts.pubmed_functions import get_pubmed_data_entrez

With this we can retrieve data for a PubMed ID, e.g. 123:

In [6]:
pubData=get_pubmed_data_entrez(['123'])

Read existing downloaded pubmed data from output/pubmed.tsv
2
123 is done
Nothing to do


This has fetched some summary data for the publication with ID 123 and added it to the file `output/pubmed.tsv`.

Run it again; this time it will use the local file:

In [7]:
pubData=get_pubmed_data_entrez(['123'])

Read existing downloaded pubmed data from output/pubmed.tsv
151
123 is done
Nothing to do
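
The second run has nothing to do because the record is already stored locally. The internals of `get_pubmed_data_entrez` are not shown in this notebook, so the following is only a sketch of that general caching pattern, not the actual implementation. The names `fetch_with_cache`, `fetch_remote` and `fake_fetch`, and the two-column tab-separated layout, are all hypothetical.

```python
import csv
import os
import tempfile

def fetch_with_cache(pmids, cache_path, fetch_remote):
    """Return {pmid: title}, calling fetch_remote only for IDs missing from the cache file."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, newline='') as handle:
            for row in csv.reader(handle, delimiter='\t'):
                if len(row) == 2:
                    cache[row[0]] = row[1]
    missing = [p for p in pmids if p not in cache]
    if missing:
        # fetch_remote is a hypothetical callable: list of IDs -> {id: title}
        new_records = fetch_remote(missing)
        with open(cache_path, 'a', newline='') as handle:
            writer = csv.writer(handle, delimiter='\t')
            for pmid, title in new_records.items():
                writer.writerow([pmid, title])
        cache.update(new_records)
    return {p: cache[p] for p in pmids if p in cache}

def fake_fetch(pmids):
    """Stand-in for a real PubMed request so the sketch runs offline."""
    fake_fetch.calls += 1
    return {p: 'title for ' + p for p in pmids}
fake_fetch.calls = 0

with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, 'pubmed.tsv')
    first = fetch_with_cache(['123'], cache_file, fake_fetch)
    second = fetch_with_cache(['123'], cache_file, fake_fetch)

# Both calls return the same record, but only the first one hits fake_fetch
print(first == second, fake_fetch.calls)
```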


### ORCID 

ORCID (https://orcid.org/) provides a persistent digital identifier that distinguishes each researcher from every other. Through integration in key research workflows, such as manuscript and grant submission, it supports automated links between researchers and their professional activities, helping to ensure that their work is recognised.

Let's get some info from an ORCID account

In [8]:
import requests

def get_ids_from_orcid_public_api(orcid):
    # Request the works listed on this ORCID record from the public API as JSON
    resp = requests.get("http://pub.orcid.org/"+orcid+"/works/", 
                        headers={'Accept':'application/orcid+json'})
    results = resp.json()
    pubData = []
    if 'group' in results:
        # Each group collects the entries for one work
        for result in results['group']:
            pubDic={}
            if 'external-ids' in result:
                # Keep whichever of the PMID and DOI this work lists
                for e in result['external-ids']['external-id']:
                    if e['external-id-type']=='pmid':
                        pmid = e['external-id-value']
                        pubDic['pmid']=pmid
                    elif e['external-id-type']=='doi':
                        doi = e['external-id-value']
                        pubDic['doi']=doi
            # Skip works that have neither identifier
            if len(pubDic)>0:
                pubData.append(pubDic)
    else:
        print('no data found')
    return pubData

In [9]:
orcidData=get_ids_from_orcid_public_api('0000-0001-7328-4233')

In [10]:
print(orcidData)

[{'doi': '10.1038/s41467-019-08923-6'}, {'doi': '10.1038/s41598-018-26050-y'}, {'doi': '10.1093/ije/dyx251', 'pmid': '29342271'}, {'doi': '10.1093/nar/gkx1072', 'pmid': '29156009'}, {'doi': '10.1093/gigascience/gix035', 'pmid': '28486658'}, {'doi': '10.1101/118513'}, {'pmid': '27863423', 'doi': '10.18632/oncotarget.13387'}, {'pmid': '27663502'}, {'pmid': '25813983', 'doi': '10.1038/ncomms7548'}, {'doi': '10.1186/s13058-015-0593-0', 'pmid': '26070602'}, {'doi': '10.1093/bioinformatics/btt466', 'pmid': '23940251'}, {'pmid': '22281184', 'doi': '10.1186/1756-0500-5-68'}, {'doi': '10.1111/j.1365-3024.2011.01342.x', 'pmid': '22044053'}, {'doi': '10.1016/j.ijpara.2011.03.009', 'pmid': '21550347'}, {'doi': '10.1126/science.1147046', 'pmid': '18174420'}, {'doi': '10.1098/rsbl.2003.0130', 'pmid': '15252980'}]


From this list of dictionaries we can easily get both PubMed IDs and DOIs:

In [14]:
pubMedIDs = set()
doiIDs = set()
for i in orcidData:
    if 'pmid' in i:
        pubMedIDs.add(i['pmid'])
    if 'doi' in i:
        doiIDs.add(i['doi'])
print(len(pubMedIDs),'PMIDs')
print(len(doiIDs),'DOIs')

13 PMIDs
15 DOIs


Then using the same function as before we can get the PubMed data using the PubMed IDs:

In [15]:
#get the publication data using the PMIDs
pubData1=get_pubmed_data_entrez(list(pubMedIDs))
print(len(pubData1),'publication records returned')

Read existing downloaded pubmed data from output/pubmed.tsv
151
18174420 is done
27863423 is done
28486658 is done
25813983 is done
22044053 is done
21550347 is done
23940251 is done
29156009 is done
27663502 is done
15252980 is done
29342271 is done
26070602 is done
22281184 is done
Nothing to do
13 publication records returned


Often, a record in an ORCID account will not contain a PubMed identifier. In this case we can convert DOIs to PMIDs using an ID converter API (https://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/):

In [17]:
from scripts.pubmed_functions import doi_to_pmid
doi_pmid=doi_to_pmid(list(doiIDs))

In [15]:
doi_pmid

['26070602',
 '18174420',
 '23940251',
 '28486658',
 '22281184',
 '27863423',
 '29342271',
 '15252980',
 '29156009',
 '30837455',
 '29777112']
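
The source of `doi_to_pmid` is not shown here, so the sketch below only illustrates how a request to the ID converter might be built and its response read. The endpoint path in `IDCONV_URL`, the `ids` and `format` parameters, and the `records`/`pmid` field names are assumptions to check against the NCBI documentation linked above. `build_idconv_url` and `pmids_from_idconv` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the ID converter API documentation
IDCONV_URL = 'https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/'

def build_idconv_url(dois):
    """Build a request URL asking for a JSON response for a batch of DOIs."""
    query = urlencode({'ids': ','.join(dois), 'format': 'json'})
    return IDCONV_URL + '?' + query

def pmids_from_idconv(payload):
    """Pull PMIDs out of a parsed JSON response, skipping records without one."""
    pmids = []
    for record in payload.get('records', []):
        pmid = record.get('pmid')
        if pmid:
            pmids.append(str(pmid))
    return pmids

print(build_idconv_url(['10.1093/ije/dyx251', '10.1101/118513']))
```

Note that fewer PMIDs came back above than DOIs went in, so any code consuming the result should expect some DOIs to remain unresolved.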

Now we can create a single list of PMIDs and get all publication data

In [18]:
allPMIDs = list(set(list(pubMedIDs)+list(doi_pmid)))

In [19]:
pubData2=get_pubmed_data_entrez(allPMIDs)
print(len(pubData2),'publication records returned')

Read existing downloaded pubmed data from output/pubmed.tsv
151
18174420 is done
27863423 is done
28486658 is done
29777112 is done
25813983 is done
22044053 is done
21550347 is done
30837455 is done
23940251 is done
29156009 is done
27663502 is done
15252980 is done
29342271 is done
26070602 is done
22281184 is done
Nothing to do
15 publication records returned


We can wrap all this up in a single function to go from ORCID to PubMed data:

In [1]:
from scripts.common_functions import orcid_to_pubmedData

pubData=orcid_to_pubmedData(['0000-0001-7328-4233','0000-0003-0924-3247'])

1 Getting ORCID data for 0000-0001-7328-4233
13 PMIDs
15 DOIs
Read existing downloaded pubmed data from output/pubmed.tsv
17
18174420 is done
27863423 is done
28486658 is done
29777112 is done
25813983 is done
22044053 is done
21550347 is done
30837455 is done
23940251 is done
29156009 is done
27663502 is done
15252980 is done
29342271 is done
26070602 is done
22281184 is done
Nothing to do
2 Getting ORCID data for 0000-0003-0924-3247
107 PMIDs
161 DOIs
Read existing downloaded pubmed data from output/pubmed.tsv
17
29342271 is done
27663502 is done
Processing ['28423762', '28552196', '27114411', '29462323', '27591082', '23977022', '29093763', '28137713', '21455730', '27040690', '21534939', '28002404', '28929496', '21282362', '24162466', '28458444', '25329069', '20829508', '22100073', '27000383', '23620363', '28361446', '27128313', '26781229', '27036880', '21030955', '25011450', '22253814', '29846171', '25869828', '12711693', '23001569', '15863668', '28456096', '25743335', '29848354', '

In [2]:
print(len(pubData))

136
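
The body of `orcid_to_pubmedData` lives in `scripts/common_functions` and is not reproduced here. As a rough illustration of how the earlier steps could be chained, the sketch below gathers PMIDs for several ORCID iDs using two callables passed in by the caller. `collect_pmids` is a hypothetical name, and unlike the cells above it only converts DOIs for records that lack a PMID, which is a design choice of this sketch rather than the notebook's behaviour.

```python
def collect_pmids(orcids, get_ids, convert_dois):
    """Gather a de-duplicated, sorted list of PMIDs for several ORCID iDs.

    get_ids: callable, ORCID iD -> list of dicts with optional 'pmid' / 'doi' keys
    convert_dois: callable, list of DOIs -> list of PMIDs
    """
    all_pmids = set()
    for orcid in orcids:
        records = get_ids(orcid)
        # PMIDs listed directly on the ORCID record
        all_pmids.update(r['pmid'] for r in records if 'pmid' in r)
        # DOIs that still need converting because no PMID was listed
        dois = [r['doi'] for r in records if 'doi' in r and 'pmid' not in r]
        if dois:
            all_pmids.update(convert_dois(dois))
    return sorted(all_pmids)
```

Passing `get_ids_from_orcid_public_api` and `doi_to_pmid` as the two callables would connect it to the functions used earlier, and the result could then be handed to `get_pubmed_data_entrez`.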


### A 'real life' data set

As mentioned above, the key is a robust mapping from individual or group identifiers to text. ORCID is one option, but for a large group we need a way to collect the ORCID iDs automatically.

The University of Bristol uses the PURE architecture for housing and distributing research material. As part of this, users can add their ORCID IDs. For example - https://research-information.bristol.ac.uk/en/persons/benjamin-l-elsworth(b4014828-88e9-4861-ae1d-5c369b6ae35a).html

Extracting the ORCID ID from here is fairly simple:

In [3]:
import requests
import re

url = 'https://research-information.bristol.ac.uk/en/persons/benjamin-l-elsworth(b4014828-88e9-4861-ae1d-5c369b6ae35a).html'
res = requests.get(url)
orcid = re.findall('orcid.org/(.*?)".*', res.text)
print('orcid',orcid)


orcid ['0000-0001-7328-4233']
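
The pattern above captures everything between `orcid.org/` and the next double quote, so it would also pick up non-iD links. A stricter variant, sketched below, only matches the four-block iD shape and drops repeats. `extract_orcids` is a hypothetical helper, and the `sample` string is a made-up snippet rather than PURE's real markup.

```python
import re

# Match only the 16-character iD shape (last character may be X)
ORCID_IN_URL = re.compile(r'orcid\.org/([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X])')

def extract_orcids(html):
    """Return the distinct ORCID iDs linked in a page, in order of first appearance."""
    seen = []
    for match in ORCID_IN_URL.findall(html):
        if match not in seen:
            seen.append(match)
    return seen

# Made-up HTML for illustration only
sample = ('<a href="https://orcid.org/0000-0001-7328-4233">ORCID</a> '
          '<a href="https://orcid.org/0000-0001-7328-4233">again</a>')
print(extract_orcids(sample))
```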


Wonderful, but what is that strange ID in the URL above - **b4014828-88e9-4861-ae1d-5c369b6ae35a** ?

These are actually the PURE identifiers for each person at the University. So, if we go to the persons page (https://research-information.bristol.ac.uk/en/persons/search.html) we can, in theory, get these for everyone at the University. 

In [4]:
import requests
import re

url = 'http://research-information.bristol.ac.uk/en/persons/search.html?filter=academic&page=1&pageSize=10'
res = requests.get(url)
pDic={}
# Raw string so the escaped brackets and dot are passed to the regex engine unchanged
matches = re.findall(r'persons/(.*?)\((.*?)\)\.html', res.text)
#print(matches)
for m in matches:
    # The URL slug gives the name, the bracketed part gives the PURE UUID
    name = m[0].replace('-',' ').title()
    uuid = m[1]
    pDic[uuid]=name
for p in pDic:
    print(p,pDic[p])

e56cc335-3cbf-4b24-aca1-21d9ead1004d Byron Adams
4803d085-611f-4365-99cb-cbdf7ede80c0 Chris J Adams
182d22e2-d9a2-4b95-9cf0-5ce91da1266f Josephine C Adams
a6d8796d-4ecf-43a9-bba8-66a8595d3bc9 Jeremy Adcock
9e4f879b-4f8f-4ae8-a5a0-f86aa750e0b4 Martin Addy
045d849f-a6d0-4e4c-be02-517c320c9a97 Foluke I Adebisi
834a45cf-8319-4cd0-829a-5a1c9be678cc A E Ades
8b156807-c78e-4cd6-98fa-26e873ff4b78 Marco Adinolfi
67d038ce-752b-44ff-99bc-bd1eac103147 Marinella Afshin
d29e7842-e9a3-4826-bfa5-37105cef192d Maryam Afzal


Now, this kind of scraping is not ideal, but it is effective. To save time, and to avoid getting into trouble with the PURE team at the University, we've extracted data for all academics with a listed ORCID. This includes the following:

| Description | File | 
| --- |---|
| PURE Person UUID and Person Name | [data/pure_people.txt](data/pure_people.txt) | 
| PURE Person UUID and ORCID ID | [data/pure_person_to_orcid.txt](data/pure_person_to_orcid.txt) |
| PURE Person UUID and Organisation UUID | [data/pure_person_to_org.txt](data/pure_person_to_org.txt) |
| PURE Organisation UUID and Organisation Name | [data/pure_org_to_name.txt](data/pure_org_to_name.txt) |
 
From here we can start looking at enriched terms for each person and organisation.
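
As a starting point, the mapping files can be joined on the PURE person UUID. The sketch below assumes each file is tab-separated with two columns and no header row, which may not match the real files. `read_pairs` and `names_to_orcids` are hypothetical helpers, and in-memory strings stand in for `data/pure_people.txt` and `data/pure_person_to_orcid.txt`.

```python
import csv
import io

def read_pairs(handle):
    """Read a two-column tab-separated file into a dict keyed on the first column."""
    return {row[0]: row[1] for row in csv.reader(handle, delimiter='\t') if len(row) >= 2}

def names_to_orcids(people_handle, orcid_handle):
    """Join person names to ORCID iDs via the shared PURE person UUID."""
    names = read_pairs(people_handle)
    orcids = read_pairs(orcid_handle)
    return {names[uuid]: orcid for uuid, orcid in orcids.items() if uuid in names}

# In-memory stand-ins using the UUID and ORCID iD seen earlier
people = io.StringIO('b4014828-88e9-4861-ae1d-5c369b6ae35a\tBenjamin L Elsworth\n')
orcids = io.StringIO('b4014828-88e9-4861-ae1d-5c369b6ae35a\t0000-0001-7328-4233\n')
linked = names_to_orcids(people, orcids)
print(linked)
```

With real files, the `io.StringIO` objects would be replaced by `open(...)` handles. Keying the result on name assumes names are unique, so keeping the UUID as the key is safer for a whole university.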