<a href="https://colab.research.google.com/github/aCampello/Download_all_files_slack/blob/master/Getting_Publications_from_EPMC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we'll learn how to query the EPMC API using the `wellcomeml` set of utils.

EPMC stands for Europe PubMed Central. It is a central repository for academic publications in the life sciences. You can access EPMC directly [here](https://europepmc.org).

`wellcomeml` is an awesome (unbiased opinion) Python library of utils for text processing, querying external data for academic publications, and visualising text data. For more info and documentation: https://github.com/wellcometrust/wellcomeml.

# 🔧 Initial set-up

Installing wellcomeml and its core dependencies:

In [None]:
pip install wellcomeml

Collecting wellcomeml
  Downloading wellcomeml-2.0.1-py3-none-any.whl (74 kB)
Collecting flake8
  Downloading flake8-4.0.1-py2.py3-none-any.whl (64 kB)
Collecting boto3
  Downloading boto3-1.20.12-py3-none-any.whl (131 kB)
Collecting twine
  Downloading twine-3.6.0-py3-none-any.whl (35 kB)
Collecting black
  Downloading black-21.11b1-py3-none-any.whl (155 kB)
Collecting mypy-extensions>=0.4.3
  Downloading mypy_extensions-0.4.3-py2.py3-none-any.whl (4.5 kB)
Collecting pathspec<1,>=0.9.0
  Downloading pathspec-0.9.0-py2.py3-none-any.whl (31 kB)
Collecting regex>=2021.4.4
  Downloading regex-2021.11.10-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (749 kB)
Collec

In [None]:
import wellcomeml

In [None]:
wellcomeml.__version__

'2.0.1'

# 📞 Defining an EPMC API Client



An EPMC API client hits the base endpoint "https://www.ebi.ac.uk/europepmc/webservices/rest" and retrieves information for a single identifier (PMID), a group of PMIDs, or a query. It takes several parameters, for example to control retries with exponential backoff when a query fails. This is particularly relevant for long queries that need many pages of results.
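To make the retry idea concrete, here is a minimal sketch of exponential backoff in plain Python. It is not `wellcomeml`'s implementation: `call_with_backoff`, `base_delay` and the `flaky_request` stand-in are hypothetical names used only for illustration.

```python
import time


def call_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on exception with exponentially growing waits.

    Hypothetical helper for illustration; not wellcomeml's implementation.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)


# Usage: a stand-in request that fails twice, then succeeds
calls = {"n": 0}
waits = []  # record the waits instead of actually sleeping


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return {"status": "ok"}


result = call_with_backoff(flaky_request, max_retries=3, sleep=waits.append)
print(result, waits)
```

Doubling the wait after each failure gives a struggling server progressively more room to recover, which matters most when one query fans out into many paged requests.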

In [None]:
from wellcomeml.io import EPMCClient

In [None]:
epmc_client = EPMCClient(max_retries=3)

Open a session with EPMC, then fetch a paper and its references.

In [None]:
session = epmc_client.requests_session()

In [None]:
paper = epmc_client.search_by_pmid(session, pmid=24287784)



In [None]:

paper

{'abstractText': 'A theoretical approach aiming at the prediction of segregation of dopant atoms on nanocrystalline systems is discussed here. It considers the free energy minimization argument in order to provide the most likely dopant distribution as a function of the total doping level. For this, it requires as input (i) a fixed polyhedral geometry with defined facets, and (ii) a set of functions that describe the surface energy as a function of dopant content for different crystallographic planes. Two Sb-doped SnO2 nanocrystalline systems with different morphology and dopant content were selected as a case study, and the calculation of the dopant distributions expected for them is presented in detail. The obtained results were compared to previously reported characterization of this system by a combination of HRTEM and surface energy calculations, and both methods are shown to be equivalent. Considering its application pre-requisites, the present theoretical approach can provide a 

In [None]:
references = epmc_client.get_references(session, pub_id=24287784)

# 🤔 More involved queries

The EPMC API is powerful enough to answer more complicated questions about funding data. Next we'll see how we would approach two of them:
- What was the most common topic among the papers published by Wellcome Trust grantees in 2019?
- How many publications by Wellcome Trust grantees also involved authors co-funded by other organisations in 2019?

** The analysis below is merely illustrative. It does not account for all subtleties of the data and should not be quoted externally for any purpose other than this workshop.

In [None]:
wellcome_pubs = epmc_client.search(
    session, 
    query='pub_year:2019 and grant_agency:"Wellcome Trust"',
    page_size=1000,
    only_first=False
)
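With `only_first=False` the client gathers every page of results, not just the first one. As a rough sketch of what such a loop can look like, assuming the REST search response carries a `nextCursorMark` field and a `resultList.result` list (worth checking against the EPMC REST documentation), the code below pages through canned responses. `fetch_all` and `fetch_page` are hypothetical names, not part of `wellcomeml`.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect results from every page of a cursor-paginated search.

    fetch_page(cursor_mark, page_size) is a hypothetical callable that
    returns the decoded JSON of one response.
    """
    results = []
    cursor = "*"  # assumed starting cursor value
    while True:
        page = fetch_page(cursor, page_size)
        batch = page.get("resultList", {}).get("result", [])
        results.extend(batch)
        next_cursor = page.get("nextCursorMark")
        # Stop on an empty page or when the cursor no longer advances
        if not batch or not next_cursor or next_cursor == cursor:
            return results
        cursor = next_cursor


# Usage with canned pages standing in for HTTP responses
fake_pages = {
    "*": {"nextCursorMark": "c1", "resultList": {"result": [{"id": "1"}, {"id": "2"}]}},
    "c1": {"nextCursorMark": "c2", "resultList": {"result": [{"id": "3"}]}},
    "c2": {"nextCursorMark": "c2", "resultList": {"result": []}},
}

all_results = fetch_all(lambda cursor, size: fake_pages[cursor], page_size=2)
print([r["id"] for r in all_results])
```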

We'll transform the query results into a pandas DataFrame, and from there on it's a simple data wrangling exercise! :)

In [None]:
import pandas as pd

wellcome_pubs_df = pd.DataFrame(wellcome_pubs)

Answering the topic question with medical subject headings (MeSH)

---



In [None]:
all_mesh = wellcome_pubs_df['meshHeadingList'].apply(
    lambda row: (
        [x['descriptorName'] for x in row['meshHeading']]
        if pd.notna(row)
        else []
    )
)

In [None]:
all_mesh.explode().value_counts()[:20]

Humans                        5938
Female                        3112
Male                          2916
Animals                       2229
Adult                         1644
Middle Aged                   1260
Aged                           857
Young Adult                    853
Mice                           843
Adolescent                     801
Child                          648
Risk Factors                   454
Child, Preschool               424
Mutation                       395
Infant                         387
Cohort Studies                 355
Brain                          355
Magnetic Resonance Imaging     354
United Kingdom                 349
Signal Transduction            335
Name: meshHeadingList, dtype: int64
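The top of this list is dominated by very broad headings such as Humans, Female and Male, which say little about the topic of a paper. The sketch below repeats the same extraction on a toy DataFrame and then sets aside a small, hand-picked set of broad terms. The toy data and the `broad_terms` set are our own assumptions for illustration, not an official MeSH list.

```python
import pandas as pd

# Toy stand-in for wellcome_pubs_df: one nested dict per paper, NaN when missing
toy_df = pd.DataFrame({
    "meshHeadingList": [
        {"meshHeading": [{"descriptorName": "Humans"}, {"descriptorName": "Brain"}]},
        {"meshHeading": [{"descriptorName": "Humans"}, {"descriptorName": "Mice"}]},
        float("nan"),
    ]
})

toy_mesh = toy_df["meshHeadingList"].apply(
    lambda row: (
        [x["descriptorName"] for x in row["meshHeading"]]
        if isinstance(row, dict)
        else []
    )
)

# Illustrative, incomplete set of broad headings to set aside (our assumption)
broad_terms = {"Humans", "Female", "Male", "Animals", "Adult"}

# explode() turns empty lists into NaN, so drop those before counting
counts = toy_mesh.explode().dropna().value_counts()
specific = counts[~counts.index.isin(broad_terms)]
print(specific)
```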

Answering the co-funding question

In [None]:
agencies = wellcome_pubs_df['grantsList'].apply(
    lambda row: (
        [x['agency'] for x in row['grant']]
        if pd.notna(row)
        else []
    )
)


In [None]:
(agencies.apply(len) > 1).mean()

0.9098232266726337

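So about 91% of WT publications in 2019 list more than one grant! Note that this counts grants rather than distinct funders, so a paper with two Wellcome Trust grants is included too. Treat the figure as an upper bound on co-funding.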
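Since `agencies.apply(len) > 1` counts grants, a stricter check is to count distinct agency names per paper. Here is a minimal sketch on toy data (the `toy_agencies` series is made up, and we have not run this on the real query results):

```python
import pandas as pd

# Toy stand-in for the `agencies` series: one list of agency names per paper
toy_agencies = pd.Series([
    ["Wellcome Trust", "Wellcome Trust"],            # two grants, one funder
    ["Wellcome Trust", "Medical Research Council"],  # two funders
    ["Wellcome Trust"],                              # single grant
    [],                                              # no grant data
])

# Share of papers with more than one grant (what len(...) > 1 measures)
multi_grant_share = (toy_agencies.apply(len) > 1).mean()

# Share of papers with more than one distinct agency
multi_agency_share = (toy_agencies.apply(lambda a: len(set(a))) > 1).mean()

print(multi_grant_share, multi_agency_share)
```

On the toy data the two shares differ (0.5 versus 0.25), which shows how much repeated grants from a single funder can inflate the first number.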