# Intro and Background

In 2018 I published MELODI (http://melodi.biocompute.org.uk/), a tool that compares two sets of publications and identifies the enriched terms common to both, based on [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/) searches. I realised that the same kind of search could be based on people: identifying enriched terms for one person, and the terms shared by two people. At the same time the JGI launched a competition to analyse the [University of Bristol's PURE data](https://research-information.bris.ac.uk) in an interesting way, which I entered using the same ideas as MELODI. This led to AXON (http://axon.biocompute.org.uk/) and an AXON instance built on the University of Bristol's academic research output (http://axon-bristol.biocompute.org.uk/). However, maintaining it and keeping it up to date was not feasible, as I currently work at the Integrative Epidemiology Unit and this is not really epidemiology.

Still, I think the ideas and (some of) the code might be of interest to others.


### Setup

Possibly the most important aspect of the data for this project is ensuring robust and unique identifiers. For individuals this can be achieved using ORCID identifiers (https://orcid.org/) and for publications we can use PubMed identifiers (https://www.ncbi.nlm.nih.gov/pubmed/).  
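Since everything downstream depends on these identifiers being correct, it can help to validate ORCID iDs before using them. The sketch below is not part of the original project code; it checks the `0000-0000-0000-000X` layout and the final check character, which ORCID defines using ISO 7064 MOD 11-2. The function names are hypothetical.

```python
import re

# Four groups of four characters; only the last character may be 'X'
ORCID_LAYOUT = re.compile(r'\d{4}-\d{4}-\d{4}-\d{3}[\dX]')

def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return 'X' if result == 10 else str(result)

def is_valid_orcid(orcid):
    """Return True if the string has the ORCID layout and a matching check character."""
    if not ORCID_LAYOUT.fullmatch(orcid):
        return False
    digits = orcid.replace('-', '')
    return orcid_check_character(digits[:15]) == digits[15]

# The ORCID iD used later in this notebook
print(is_valid_orcid('0000-0001-7328-4233'))
```

A check like this catches typos in hand-curated lists, but it cannot tell whether a well-formed iD belongs to the intended person.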

First create some directories for data and output

In [None]:
import os
import config
os.makedirs('output',exist_ok=True)
os.makedirs('data',exist_ok=True)

Check python executable

In [None]:
import sys
sys.executable

### PubMed

Import the pubmed functions

In [None]:
from scripts.pubmed_functions import *

Retrieve some data using a pubmed ID

In [None]:
pubData=get_pubmed_data_entrez(['123'])
#print(pubData)

Run it again; this time it will use the local file

In [None]:
pubData=get_pubmed_data_entrez(['123'])
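The second call is served from a local file rather than the network. How `get_pubmed_data_entrez` does this is not shown here, so the following is only a minimal sketch of the general pattern, assuming one JSON file per ID. The function name `fetch_with_cache`, the `fetch_fn` callback and the `data/cache` directory are all hypothetical.

```python
import json
import os

def fetch_with_cache(ids, fetch_fn, cache_dir='data/cache'):
    """Return {id: record}, calling fetch_fn only for IDs that have no local file.

    fetch_fn is assumed to take a list of IDs and return {id: record}.
    """
    os.makedirs(cache_dir, exist_ok=True)
    results = {}
    missing = []
    for item_id in ids:
        path = os.path.join(cache_dir, f'{item_id}.json')
        if os.path.exists(path):
            # Already downloaded: read the stored record
            with open(path) as f:
                results[item_id] = json.load(f)
        else:
            missing.append(item_id)
    if missing:
        # One remote call for everything not yet on disk
        for item_id, record in fetch_fn(missing).items():
            with open(os.path.join(cache_dir, f'{item_id}.json'), 'w') as f:
                json.dump(record, f)
            results[item_id] = record
    return results
```

Keeping one file per ID means a partially failed run can be resumed without re-downloading what was already retrieved.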

### ORCID 

Let's get some info from an ORCID account

In [None]:
from scripts.common_functions import *

In [None]:
orcidData=get_ids_from_orcid_public_api('0000-0001-7328-4233')

In [None]:
print(orcidData)

Get PubMed IDs

In [None]:
pubMedIDs = set()
doiIDs = set()
for i in orcidData:
    if 'pmid' in i:
        pubMedIDs.add(i['pmid'])
    if 'doi' in i:
        doiIDs.add(i['doi'])
print(len(pubMedIDs))
print(len(doiIDs))

In [None]:
print(len(pubMedIDs))
pubData1=get_pubmed_data_entrez(list(pubMedIDs))
print(len(pubData1),'publication records returned')

We can convert DOIs to PMIDs using the ID converter API - https://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/

In [None]:
doi_pmid=doi_to_pmid(list(doiIDs))

In [None]:
doi_pmid
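The implementation of `doi_to_pmid` is not shown in this notebook. As a rough illustration of what such a conversion involves, the sketch below builds a query and parses a JSON reply. The parameter names (`ids`, `format`) and the response fields (`records`, `doi`, `pmid`) are assumptions that should be checked against the ID converter documentation linked above, and both function names are hypothetical.

```python
def build_idconv_params(dois):
    """Build query parameters for a batch of DOIs (parameter names are assumed)."""
    return {'ids': ','.join(dois), 'format': 'json'}

def parse_idconv_records(payload):
    """Map DOI -> PMID from a decoded JSON reply, skipping DOIs with no PMID.

    Assumes the reply looks like {'records': [{'doi': ..., 'pmid': ...}, ...]}.
    """
    mapping = {}
    for record in payload.get('records', []):
        doi = record.get('doi')
        pmid = record.get('pmid')
        if doi and pmid:
            mapping[doi] = str(pmid)
    return mapping

# Hand-written example payload, not real API output
sample = {'records': [{'doi': '10.1000/example-a', 'pmid': 123},
                      {'doi': '10.1000/example-b'}]}
print(parse_idconv_records(sample))
```

Not every DOI has a PubMed record, so the mapping can be smaller than the input list; it is worth logging which DOIs were dropped.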

Now we can create a single list of PMIDs and get all publication data

In [None]:
allPMIDs = list(set(list(pubMedIDs)+list(doi_pmid)))

In [None]:
pubData2=get_pubmed_data_entrez(allPMIDs)
print(len(pubData2),'publication records returned')

In [None]:
pubData=orcid_to_pubmed(['0000-0001-7328-4233','0000-0003-0924-3247'])

In [None]:
print(len(pubData))

And do the same thing but reading from our demo data:

In [None]:
orcidData=set()
with open(config.demo) as f:
    for line in f:
        person,orcid,group = line.rstrip().split('\t')
        orcidData.add(orcid)
pubData=orcid_to_pubmed(list(orcidData))

### A 'real life' data set

As mentioned, the key is to build a robust mapping from individual and group IDs to text. ORCID is one option, but ideally we would extract ORCID data automatically for a large group.

The University of Bristol uses the PURE architecture for housing and distributing research material. As part of this, users can add their ORCID IDs. For example - https://research-information.bristol.ac.uk/en/persons/benjamin-l-elsworth(b4014828-88e9-4861-ae1d-5c369b6ae35a).html

Extracting the ORCID ID from here is fairly simple:

In [9]:
import requests
import re

url = 'https://research-information.bristol.ac.uk/en/persons/benjamin-l-elsworth(b4014828-88e9-4861-ae1d-5c369b6ae35a).html'
res = requests.get(url)
# Capture whatever follows "orcid.org/" up to the next double quote
orcid = re.findall(r'orcid\.org/(.*?)".*', res.text)
print('orcid',orcid)


orcid ['0000-0001-7328-4233']
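The pattern above captures anything between `orcid.org/` and the next quote, so it relies on the page markup staying the same. A slightly stricter variant, sketched below on a hand-written HTML snippet rather than the live page, matches only the ORCID layout and removes duplicates. The function name `extract_orcids` is hypothetical.

```python
import re

# Match only strings shaped like an ORCID iD after "orcid.org/"
ORCID_IN_URL = re.compile(r'orcid\.org/(\d{4}-\d{4}-\d{4}-\d{3}[\dX])')

def extract_orcids(html):
    """Return the distinct ORCID iDs found in the HTML, in order of appearance."""
    seen = []
    for match in ORCID_IN_URL.findall(html):
        if match not in seen:
            seen.append(match)
    return seen

# Hand-written snippet standing in for a profile page
sample_html = ('<a href="https://orcid.org/0000-0001-7328-4233">ORCID</a> '
               '<a href="https://orcid.org/0000-0001-7328-4233">again</a>')
print(extract_orcids(sample_html))
```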


Wonderful, but what is that strange ID in the URL above - **b4014828-88e9-4861-ae1d-5c369b6ae35a** ?

These are actually the PURE identifiers for each person at the University. So, if we go to the persons page (https://research-information.bristol.ac.uk/en/persons/search.html) we can, in theory, get these for everyone at the University. 

In [7]:
import requests
import re

url = 'http://research-information.bristol.ac.uk/en/persons/search.html?filter=academic&page=1&pageSize=10'
res = requests.get(url)
pDic={}
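# Each match is a (name-slug, PURE UUID) pair taken from a person's page link
matches = re.findall(r'persons/(.*?)\((.*?)\)\.html', res.text)
#print(matches)
for slug, uuid in matches:
    # Turn a slug such as 'chris-j-adams' into 'Chris J Adams'
    pDic[uuid] = slug.replace('-', ' ').title()
for p in pDic:
    print(p,pDic[p])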

e56cc335-3cbf-4b24-aca1-21d9ead1004d Byron Adams
4803d085-611f-4365-99cb-cbdf7ede80c0 Chris J Adams
182d22e2-d9a2-4b95-9cf0-5ce91da1266f Josephine C Adams
a6d8796d-4ecf-43a9-bba8-66a8595d3bc9 Jeremy Adcock
9e4f879b-4f8f-4ae8-a5a0-f86aa750e0b4 Martin Addy
045d849f-a6d0-4e4c-be02-517c320c9a97 Foluke I Adebisi
834a45cf-8319-4cd0-829a-5a1c9be678cc A E Ades
8b156807-c78e-4cd6-98fa-26e873ff4b78 Marco Adinolfi
67d038ce-752b-44ff-99bc-bd1eac103147 Marinella Afshin
d29e7842-e9a3-4826-bfa5-37105cef192d Maryam Afzal


Now, this kind of scraping is not ideal, but it is effective. To save time, and to avoid getting into trouble with the PURE team at the University, we've extracted data for all academics into `data/pure`. This includes the following:

 - PURE Person UUID and Person Name: (xxx)
 - PURE Person UUID and ORCID ID: (xxx)
 - PURE Person UUID and Organisation UUID: (xxx)
 - PURE Organisation UUID and Organisation Name: (xxx)
 
From here we can start looking at enriched terms for each person and organisation.