# Intro and Background

In 2018 I published a piece of software called MELODI (http://melodi.biocompute.org.uk/). Essentially, it compares the text from two sets of publications and identifies the enriched terms that the two sets share, based around a [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/) search. I realised that a set of text could also be based on a person, making it possible to identify enriched terms for a person and terms shared between two people. At the same time the JGI launched a competition to analyse the [University of Bristol's PURE data](https://research-information.bris.ac.uk) in a novel way, which I entered using some of the ideas from the MELODI work. This led to the production of AXON (http://axon.biocompute.org.uk/) and an AXON instance covering the University of Bristol's academic research output (http://axon-bristol.biocompute.org.uk/). However, maintaining this and keeping it up to date was not feasible, as I am currently working at the Integrative Epidemiology Unit, and this is not really epidemiology.

Still, I think the ideas and (some of) the code might be of interest to others.


### Setup

Possibly the most important aspect of the data for this project is ensuring robust and unique identifiers. For individuals this can be achieved using ORCID identifiers (https://orcid.org/) and for publications we can use PubMed identifiers (https://www.ncbi.nlm.nih.gov/pubmed/).  
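
Because everything downstream hangs on these identifiers, it can help to sanity-check them before use. The sketch below checks the shape and check digit of an ORCID iD using the ISO 7064 MOD 11-2 scheme that ORCID documents. The helper name `is_valid_orcid` is my own and is not part of this project's scripts.

```python
import re

# Hypothetical helper, not part of this project's scripts.
ORCID_PATTERN = re.compile(r'\d{4}-\d{4}-\d{4}-\d{3}[\dX]')

def is_valid_orcid(orcid):
    """Return True if the ORCID iD is well formed and its check digit matches."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace('-', '')
    total = 0
    # ISO 7064 MOD 11-2 computed over the first 15 digits
    for ch in digits[:15]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = 'X' if check == 10 else str(check)
    return digits[15] == expected

print(is_valid_orcid('0000-0001-7328-4233'))  # True
print(is_valid_orcid('0000-0001-7328-4234'))  # False (last digit altered)
```

A check like this only catches typos and malformed IDs; it does not confirm that the ID belongs to the person you expect.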

First, let's import the config and create a directory for output:

In [None]:
import os

#this file (config.py) lists the names of files used throughout
import config

#make a directory for output from the notebooks
os.makedirs('output',exist_ok=True)

Check which Python executable is in use:

In [None]:
import sys
sys.executable

Result should be something like `/xxx/xxx/anaconda3/envs/jgi-data-week-workshop/bin/python`

##### Pandas

We will also be using Pandas (https://pandas.pydata.org/) for various things 

>pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

In [None]:
import pandas as pd

### PubMed

PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) comprises more than 29 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.

We can use some simple commands to get PubMed data. First, let's import the pubmed function:

In [None]:
from scripts.pubmed_functions import get_pubmed_data_efetch

Using this, we can retrieve some data using a PubMed ID, e.g. 123:

In [None]:
pubData=get_pubmed_data_efetch(['123'])

This has fetched some summary data for the publication with ID 123 and added it to the file `output/pubmed.tsv`.

Run it again; this time it will use the local file:

In [None]:
pubData=get_pubmed_data_efetch(['123'])
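
The fetch-once-then-reuse behaviour can be sketched roughly as follows. This is not the code in `scripts/pubmed_functions.py`, which is not shown here. The names `fetch_with_cache` and `fake_fetch` and the two-column layout are assumptions made for illustration, and the fetch step is a stand-in that makes no network calls.

```python
import csv
import os
import tempfile

FIELDS = ['pmid', 'title']  # assumed columns, for illustration only

def load_cache(path):
    """Read the TSV cache into a dict keyed by PMID (empty if the file is absent)."""
    if not os.path.exists(path):
        return {}
    with open(path, newline='') as fh:
        return {row['pmid']: dict(row) for row in csv.DictReader(fh, delimiter='\t')}

def save_cache(path, cache):
    """Write every cached record back to the TSV file."""
    with open(path, 'w', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, delimiter='\t')
        writer.writeheader()
        writer.writerows(cache.values())

def fetch_with_cache(ids, path, fetch_fn):
    """Return records for ids, calling fetch_fn only for IDs missing from the cache."""
    cache = load_cache(path)
    missing = [i for i in ids if i not in cache]
    if missing:
        for record in fetch_fn(missing):
            cache[record['pmid']] = record
        save_cache(path, cache)
    return [cache[i] for i in ids if i in cache]

calls = []  # records which IDs were actually "fetched"

def fake_fetch(ids):
    # stand-in for a real PubMed request
    calls.append(list(ids))
    return [{'pmid': i, 'title': 'title for ' + i} for i in ids]

cache_file = os.path.join(tempfile.mkdtemp(), 'pubmed.tsv')
first = fetch_with_cache(['123'], cache_file, fake_fetch)   # triggers a fetch
second = fetch_with_cache(['123'], cache_file, fake_fetch)  # served from the file
print(len(calls), 'fetch call(s) for two lookups')
```

Keeping the cache on disk means repeated runs of the notebook do not hit the remote service again for IDs that have already been retrieved.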

### ORCID 

ORCID (https://orcid.org/) provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized. 

Let's get some info from an ORCID account

In [None]:
import requests

#create a function to get publication IDs from an ORCID account
def get_ids_from_orcid_public_api(orcid):
    resp = requests.get("http://pub.orcid.org/"+orcid+"/works/", 
                        headers={'Accept':'application/orcid+json'})
    results = resp.json()
    pubData = []
    if 'group' in results:
        for i, result in enumerate( results['group']):
            pubDic={}
            if 'external-ids' in result:
                for e in result['external-ids']['external-id']:
                    if e['external-id-type']=='pmid':
                        pmid = e['external-id-value']
                        pubDic['pmid']=pmid
                    elif e['external-id-type']=='doi':
                        doi = e['external-id-value']
                        pubDic['doi']=doi
            if len(pubDic)>0:
                pubData.append(pubDic)
    else:
        print('no data found')
    return pubData

In [None]:
orcidData=get_ids_from_orcid_public_api('0000-0001-7328-4233')

In [None]:
#convert the list of dictionaries to a dataframe
df=pd.DataFrame(orcidData)
print(df)

From this list of dictionaries we can easily get both PubMed IDs and DOIs:

In [None]:
#process PubMed IDs and DOIs separately
pubMedIDs = set()
doiIDs = set()
for i in orcidData:
    if 'pmid' in i:
        pubMedIDs.add(i['pmid'])
    if 'doi' in i:
        doiIDs.add(i['doi'])
print(len(pubMedIDs),'PMIDs')
print(len(doiIDs),'DOIs')

Then using the same function as before we can get the PubMed data using the PubMed IDs:

In [None]:
#get the publication data using the PMIDs
pubData1=get_pubmed_data_efetch(list(pubMedIDs))
print(len(pubData1),'publication records returned')

Often, a record in an ORCID account will not contain a PubMed identifier. In this case we can convert DOIs to PMIDs using an ID converter API - https://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/

In [None]:
from scripts.pubmed_functions import doi_to_pmid
doi_pmid=doi_to_pmid(list(doiIDs))

In [None]:
print(doi_pmid)

Now we can create a single list of PMIDs and get all publication data

In [None]:
allPMIDs = list(set(list(pubMedIDs)+list(doi_pmid)))

In [None]:
pubData2=get_pubmed_data_efetch(allPMIDs)
print(len(pubData2),'publication records returned')

We can wrap all this up in a single function to go from ORCID to PubMed data:

In [None]:
from scripts.common_functions import orcid_to_pubmedData

pubData=orcid_to_pubmedData(['0000-0001-7328-4233','0000-0003-0924-3247'])

In [None]:
print(len(pubData))

### A 'real life' data set

As mentioned above, the key is to generate a robust mapping from individual/group IDs to text. ORCID is one option, but really we need a way to gather ORCID data automatically for a large group.

The University of Bristol uses the PURE architecture for housing and distributing research material. As part of this, users can add their ORCID IDs. For example - https://research-information.bristol.ac.uk/en/persons/benjamin-l-elsworth(b4014828-88e9-4861-ae1d-5c369b6ae35a).html

Extracting the ORCID ID from here is fairly simple:

In [None]:
import requests
import re

url = 'https://research-information.bristol.ac.uk/en/persons/benjamin-l-elsworth(b4014828-88e9-4861-ae1d-5c369b6ae35a).html'
res = requests.get(url)
#capture the text between 'orcid.org/' and the next double quote
orcid = re.findall(r'orcid\.org/(.*?)".*', res.text)
print('orcid',orcid)
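
The pattern above captures everything up to the next double quote, so it would also pick up any non-ID link under `orcid.org/`. A tighter variant, sketched here on a made-up HTML fragment rather than the live page, matches only the ORCID iD shape (four hyphen-separated blocks of four characters, the last of which may be `X`). The `sample_html` string is an assumption about what such a page might contain.

```python
import re

# Invented fragment for illustration; the real page markup may differ.
sample_html = (
    '<a href="https://orcid.org/0000-0001-7328-4233" class="orcid">ORCID</a> '
    '<a href="https://orcid.org/help">help</a>'
)

# Match only strings shaped like an ORCID iD after 'orcid.org/'
orcid_re = re.compile(r'orcid\.org/(\d{4}-\d{4}-\d{4}-\d{3}[\dX])')
found = sorted(set(orcid_re.findall(sample_html)))
print(found)  # ['0000-0001-7328-4233']
```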


Wonderful, but what is that strange ID in the URL above - **b4014828-88e9-4861-ae1d-5c369b6ae35a** ?

These are actually the PURE identifiers for each person at the University. So, if we go to the persons page (https://research-information.bristol.ac.uk/en/persons/search.html) we can, in theory, get these for everyone at the University. 

In [None]:
url = 'http://research-information.bristol.ac.uk/en/persons/search.html?filter=academic&page=1&pageSize=10'
res = requests.get(url)
pDic={}
#each match is a (name-slug, uuid) tuple taken from a person's profile link
matches = re.findall(r'persons/(.*?)\((.*?)\)\.html', res.text)
#print(matches)
for slug, uuid in matches:
    name = slug.replace('-',' ').title()
    pDic[uuid]=name
for p in pDic:
    print(p,pDic[p])

Now, this kind of scraping is not ideal, but it is effective. To save time, and to avoid getting in trouble with the PURE team at the University, we've extracted data for all academics with a listed ORCID. This includes the following:

| Description | File | 
| --- |---|
| PURE Person UUID and Person Name | [data/pure_people.txt](data/pure_people.txt) | 
| PURE Person UUID and ORCID ID | [data/pure_person_to_orcid.txt](data/pure_person_to_orcid.txt) |
| PURE Person UUID and Organisation UUID | [data/pure_person_to_org.txt](data/pure_person_to_org.txt) |
| PURE Organisation UUID and Organisation Name | [data/pure_org_to_name.txt](data/pure_org_to_name.txt) |
 
From here we can start looking at enriched terms for each person and organisation.
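
As a rough sketch of how the four files could be joined into one table, the example below uses small in-memory stand-ins. I am assuming tab-separated, headerless, two-column files here; the real files may use a different delimiter or include a header row, and the column names and rows are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the four files listed above.
people = io.StringIO("uuid-1\tAda Example\nuuid-2\tBob Example\n")
person_orcid = io.StringIO("uuid-1\t0000-0001-7328-4233\nuuid-2\t0000-0003-0924-3247\n")
person_org = io.StringIO("uuid-1\torg-1\nuuid-2\torg-1\n")
org_name = io.StringIO("org-1\tExample School\n")

def read_pairs(handle, columns):
    """Read a two-column, headerless TSV (assumed layout) with the given column names."""
    return pd.read_csv(handle, sep='\t', header=None, names=columns, dtype=str)

# Chain the joins: person -> ORCID -> organisation UUID -> organisation name
df = (read_pairs(people, ['person_uuid', 'person_name'])
      .merge(read_pairs(person_orcid, ['person_uuid', 'orcid_id']), on='person_uuid')
      .merge(read_pairs(person_org, ['person_uuid', 'org_uuid']), on='person_uuid')
      .merge(read_pairs(org_name, ['org_uuid', 'org_name']), on='org_uuid'))
print(df[['person_name', 'orcid_id', 'org_name']])
```

With the default inner joins, people missing from any one file drop out of the result, which may or may not be what you want.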

## QC

So far, we haven't really checked any of the data. This is something we should do, as everything downstream will be affected by the quality of the data at this point. One thing we can do is look at the publication text.

In [None]:
import matplotlib.pyplot as plt

pubmedToInfo = pd.read_csv('data/pubmed.tsv',sep='\t')
print(pubmedToInfo.shape)
print(pubmedToInfo.head())

In [None]:
textData=pubmedToInfo['title'].str.len()+pubmedToInfo['abstract'].str.len()
textData.plot.hist(bins = 100)

Perhaps we should remove publications with a very short title+abstract?

In [None]:
(textData<50).value_counts()

It seems that all title+abstract are > 50 characters, so we will keep them all.
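
One caveat worth checking: if a record has no abstract, `str.len()` gives a missing value for it, the summed length is missing too, and the `< 50` comparison does not flag it, so such records would pass unnoticed. A minimal sketch on a made-up frame (the rows are invented; the column names follow `pubmedToInfo`) that treats missing text as empty:

```python
import pandas as pd

# Invented rows for illustration; the second record has no abstract.
demo = pd.DataFrame({
    'title': ['A reasonably long and descriptive publication title', 'Short'],
    'abstract': ['An abstract with enough text to pass the threshold.', None],
})

# Naive length: a missing abstract makes the whole sum missing
naive = demo['title'].str.len() + demo['abstract'].str.len()
# Safer length: treat missing text as an empty string
safe = demo['title'].fillna('').str.len() + demo['abstract'].fillna('').str.len()

print((naive < 50).sum(), 'short record(s) found without handling missing text')
print((safe < 50).sum(), 'short record(s) found after filling missing text')
```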

We can also look at the distribution of publication year, excluding 0 (as that was included to cover missing data):

In [None]:
pubYearData=pubmedToInfo[pubmedToInfo['year']>0]['year']
pubYearData.plot.hist(bins = 50)

Lastly, the number of publications per person:

In [None]:
#ORCID to PubMed identifiers
orcidToPubmed = pd.read_csv('data/orcid.tsv',sep='\t')
print(orcidToPubmed.shape)
print(orcidToPubmed.head())

In [None]:
orcidToPubmed['orcid_id'].value_counts().plot.hist(bins = 50)

There are no people with zero publications from their ORCID accounts, so there is no need to filter.

In [None]:
(orcidToPubmed['orcid_id'].value_counts()==0).sum()
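
Note that `value_counts()` only counts ORCID IDs that appear in the table, so it cannot report a zero by itself. To find people with no publications at all, the counts need to be compared against a full list of ORCID IDs (for example, one taken from `data/pure_person_to_orcid.txt`). A small sketch with invented data, where `all_orcids` stands in for that full list:

```python
import pandas as pd

# Invented example data; 'orcid_id' follows the column name used above.
all_orcids = ['0000-0001-7328-4233', '0000-0003-0924-3247', '0000-0002-1825-0097']
orcid_to_pubmed = pd.DataFrame({
    'orcid_id': ['0000-0001-7328-4233', '0000-0001-7328-4233', '0000-0003-0924-3247'],
    'pmid': ['111', '222', '333'],
})

# Reindex against the full list so absent IDs show up with a count of 0
counts = (orcid_to_pubmed['orcid_id'].value_counts()
          .reindex(all_orcids, fill_value=0))
print(counts[counts == 0].index.tolist())  # ['0000-0002-1825-0097']
```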