# OpenAlex Cited References
### STI 2023
#### Eric Schares, Iowa State University; [eschares.github.io](https://eschares.github.io)
#### Sandra Mierz; [https://github.com/smierz](https://github.com/smierz) 
---

## Part 1: Pull the Data
In this notebook, we demonstrate how to use the OpenAlex API to find all journal articles published by Iowa State University authors in 2021 and gather their references.


### Data collection
The data collection consists of querying the OpenAlex API for
1. **all publications** from Iowa State University published in 2021 and
2. the **cited references** listed in their bibliographies.


### Data storage
Once the data is retrieved, it is stored in local files to avoid repeatedly requesting the same information. This not only saves time but also reduces the burden on the OpenAlex API.

* Metadata about **publications** is stored in a CSV file called "publications.csv"; metadata about **references** is stored in a Parquet file called "references.parquet".
  * To minimize the amount of data storage needed, only a subset of attributes relevant to the data analysis is extracted from the metadata of both publications and references and preserved.  
  * The following figure shows the selected attributes and their corresponding names within the local files.
 <br><br><img src="../assets/publications_and_references.png" alt="structure of data files" width="1000">

<br>

* The **connections between publications and their references** are stored in a dedicated .csv file called "pub2ref.csv". 
  * Each entry is a pair of OpenAlex IDs: one identifies the publication, the other identifies one of its references.
  * This file serves as a so-called [join table](https://web.csulb.edu/colleges/coe/cecs/dbdesign/dbdesign.php?page=manymany.php) and allows us to efficiently join the publications with their corresponding references into one combined dataset.
  <br><br><img src="../assets/pub2ref.png" alt="structure of pub2ref" width="800">
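
To make the join-table idea concrete, here is a small sketch with made-up data. The IDs `P1`, `R1`, and so on are placeholders, not real OpenAlex IDs, and `merge` is only one of several ways to perform the join in pandas.

```python
import pandas as pd

# Toy data with placeholder IDs (not real OpenAlex IDs)
publications = pd.DataFrame({
    "publication_id": ["P1", "P2"],
    "publication_title": ["Paper one", "Paper two"],
})
references = pd.DataFrame({
    "reference_id": ["R1", "R2"],
    "reference_title": ["Cited work one", "Cited work two"],
})
# Join table: one row per (publication, reference) pair; R1 is cited twice
pub2ref = pd.DataFrame({
    "publication_id": ["P1", "P1", "P2"],
    "reference_id": ["R1", "R2", "R1"],
})

# publications -> pub2ref -> references
joined = (
    publications
    .merge(pub2ref, on="publication_id", how="left")
    .merge(references, on="reference_id", how="left")
)
print(joined)
```

Because `R1` is stored only once in `references`, its metadata is not duplicated on disk even though two publications cite it.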


So, now that we have charted the course, let us set sail!

---

In [1]:
# needed dependencies
import requests
import pandas as pd
import pyarrow

---
### 1. All publications from Iowa State University published in 2021

First, we are going to download the metadata about all publications from Iowa State University that were published in 2021.

To do this we **build an OpenAlex URL** specifying the following criteria:
* Since we want to query for metadata about "publications", the entity type should be [`works`](https://docs.openalex.org/api-entities/works).
* The `works` need to further match the following [criteria](https://docs.openalex.org/api-entities/works/filter-works):
  * have at least one authorship affiliated with Iowa State University: `institutions.ror:https://ror.org/04rswrd78`
  * were published between 01.01.2021 and 31.12.2021: `from_publication_date:2021-01-01`, `to_publication_date:2021-12-31`
  * are journal articles: `type:journal-article`
  * are not paratext: `is_paratext:false`

In [2]:
# input
ror_id = 'https://ror.org/04rswrd78'
from_publication_date = '2021-01-01'
to_publication_date = '2021-12-31'
work_type = 'journal-article'
is_paratext = False
email = 'eschares@iastate.edu'

From the parts we put the URL together as follows:
* Starting point is the base URL of the OpenAlex API: https://api.openalex.org/
* We append the entity type to it: https://api.openalex.org/works
* All criteria need to go into the query parameter filter that is added after a question mark: https://api.openalex.org/works?filter=
* To construct the filter value we take the criteria we specified and concatenate them using commas as separators: https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:False,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-12-31
* Additionally, to get into the [polite pool](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication#the-polite-pool), we add the `mailto` parameter to the URL and set it to the email specified in the input values.

In [3]:
def build_url_for_publications_for_institution_between_dates(ror_id, from_pub_date, to_pub_date, work_type, is_paratext, mailto):
  # specify endpoint
  endpoint = 'works'

  # build the 'filter' parameter
  filters = (
      f'institutions.ror:{ror_id}',
      f'is_paratext:{is_paratext}',
      f'type:{work_type}', 
      f'from_publication_date:{from_pub_date}',
      f'to_publication_date:{to_pub_date}'
  )

  # put the URL together
  return f'https://api.openalex.org/{endpoint}?filter={",".join(filters)}&mailto={mailto}'


filtered_works_url = build_url_for_publications_for_institution_between_dates(ror_id, from_publication_date, to_publication_date, work_type, is_paratext, email)
print(f'complete URL:\n{filtered_works_url}')

complete URL:
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:False,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-12-31&mailto=eschares@iastate.edu
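
The criteria listed earlier write the paratext filter in lowercase (`is_paratext:false`), while formatting a Python `bool` in an f-string yields `False`. The URL above shows the capitalised form being sent, and the run evidently succeeded. As a cautious sketch, the hypothetical helpers `format_filter_value` and `build_filter` below render booleans in lowercase before joining the criteria with commas.

```python
def format_filter_value(value):
    """Render a Python value the way it appears in the filter examples."""
    # bool is handled explicitly: str(False) is 'False', not 'false'
    if isinstance(value, bool):
        return str(value).lower()
    return str(value)


def build_filter(criteria):
    """Join key:value pairs with commas into one filter string."""
    return ",".join(
        f"{key}:{format_filter_value(value)}" for key, value in criteria.items()
    )


criteria = {
    "institutions.ror": "https://ror.org/04rswrd78",
    "is_paratext": False,
    "type": "journal-article",
    "from_publication_date": "2021-01-01",
    "to_publication_date": "2021-12-31",
}
print(build_filter(criteria))
```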


---

With the URL specified, we can start downloading the list of publications. Since the OpenAlex API paginates its results, we need to **use [cursor paging](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging)** to get the complete list:


In [4]:
def get_metadata_using_cursor_paging(openalex_url):
    session = requests.Session()

    # url with a placeholder for cursor
    openalex_url_with_cursor = openalex_url + '&per_page=200&cursor={}'

    # loop through pages
    cursor = '*'
    while cursor:
        # set cursor value and request page from OpenAlex
        url = openalex_url_with_cursor.format(cursor)
        print(url)
        page_with_results = session.get(url).json()

        # update cursor to meta.next_cursor
        cursor = page_with_results['meta']['next_cursor']

        # return page results to user to process
        results = page_with_results['results']
        yield results
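
The `while cursor:` loop above relies on the last page reporting an empty `next_cursor` (`None` after JSON parsing). The sketch below replays the same logic offline against a hypothetical `fake_fetch` function with three made-up pages, so the stopping behaviour can be checked without calling the API.

```python
def iterate_pages(fetch_page):
    """Yield the 'results' list of every page until no cursor is returned."""
    cursor = "*"  # '*' requests the first page
    while cursor:
        page = fetch_page(cursor)
        cursor = page["meta"]["next_cursor"]
        yield page["results"]


# Hypothetical stand-in for the API: three made-up pages keyed by cursor
FAKE_PAGES = {
    "*": {"meta": {"next_cursor": "c1"}, "results": ["W1", "W2"]},
    "c1": {"meta": {"next_cursor": "c2"}, "results": ["W3"]},
    "c2": {"meta": {"next_cursor": None}, "results": []},
}


def fake_fetch(cursor):
    return FAKE_PAGES[cursor]


all_results = [item for page in iterate_pages(fake_fetch) for item in page]
print(all_results)
```

In this toy run the final page is empty, so callers should be prepared for an empty `results` list as the last item.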

---

From each page we extract the `results` section which holds a partial list of our requested publications in the form of `work` objects.

Each [work object](https://docs.openalex.org/api-entities/works/work-object) in the `results` section comes with a lot of metadata describing it.  
For our analysis we are only interested in a subset of attributes: [id](https://docs.openalex.org/api-entities/works/work-object#id), [doi](https://docs.openalex.org/api-entities/works/work-object#title), [publication_year](https://docs.openalex.org/api-entities/works/work-object#publication_year), [title](https://docs.openalex.org/api-entities/works/work-object#title-1), [host_venue](https://docs.openalex.org/api-entities/works/work-object#host_venue-deprecated) (display_name, publisher, issn_l) and number of [referenced_works](https://docs.openalex.org/api-entities/works/work-object#referenced_works).

We will use the following function to **extract these attributes from a `work` object**:

In [5]:
def extract_selected_fields(openalex_work):
    return (  openalex_work['id'],
              openalex_work['doi'],
              openalex_work['publication_year'],
              openalex_work['title'],
              openalex_work['host_venue']['display_name'],
              openalex_work['host_venue']['publisher'],
              openalex_work['host_venue']['issn_l'],
              len(openalex_work['referenced_works'])
            )
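
The documentation link above labels `host_venue` as deprecated. As a hedged sketch, the hypothetical helper `extract_venue_fields` (not part of the original notebook) reads the venue from `host_venue` when it is present and otherwise falls back to `primary_location.source`. The fallback field names (`display_name`, `host_organization_name`, `issn_l`) are assumptions about the newer schema and should be checked against the current work object documentation. Toy dictionaries stand in for API responses.

```python
def extract_venue_fields(openalex_work):
    """Return (journal name, publisher, ISSN-L), tolerating missing keys."""
    venue = openalex_work.get("host_venue")
    if venue:
        return (venue.get("display_name"), venue.get("publisher"), venue.get("issn_l"))

    # Assumed newer layout: primary_location -> source
    location = openalex_work.get("primary_location") or {}
    source = location.get("source") or {}
    return (
        source.get("display_name"),
        source.get("host_organization_name"),
        source.get("issn_l"),
    )


# Toy records (made-up values) in the old and the assumed new layout
old_style = {"host_venue": {"display_name": "Journal A",
                            "publisher": "Publisher A",
                            "issn_l": "0000-0001"}}
new_style = {"primary_location": {"source": {"display_name": "Journal B",
                                             "host_organization_name": "Publisher B",
                                             "issn_l": "0000-0002"}}}
no_venue = {"primary_location": None}

for toy_work in (old_style, new_style, no_venue):
    print(extract_venue_fields(toy_work))
```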

Additionally we want to **extract each publication's [`referenced_works`](https://docs.openalex.org/api-entities/works/work-object#referenced_works)**.  
The `referenced_works` attribute consists of a list of OpenAlex IDs which we will resolve in the next step. 
For now we will store a pair of the publication's OpenAlex ID and the referenced work's OpenAlex ID for each referenced work.

In [6]:
def extract_references(work):
    return [(work['id'], ref) for ref in work['referenced_works']]
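
A quick illustration of the pairs this produces, using a toy work with placeholder IDs. The function is repeated only so that the example runs on its own.

```python
def extract_references(work):
    # same one-liner as above, repeated so the example is self-contained
    return [(work['id'], ref) for ref in work['referenced_works']]


toy_work = {
    'id': 'https://openalex.org/W0000000001',  # placeholder ID
    'referenced_works': [
        'https://openalex.org/W0000000002',    # placeholder ID
        'https://openalex.org/W0000000003',    # placeholder ID
    ],
}
pairs = extract_references(toy_work)
for pair in pairs:
    print(pair)

# A work without references simply contributes no rows to pub2ref
empty_pairs = extract_references({'id': 'https://openalex.org/W0000000004', 'referenced_works': []})
print(empty_pairs)
```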

Once we have downloaded all the data and extracted all the relevant fields, we are going to **store the data in local files**, so we don't have to request it again from the OpenAlex API:
* we store the metadata about publications in a csv file called "publications.csv"
* we store the connections between a publication and its references in a separate file called "pub2ref.csv"

In [7]:
data_folder = '../files/ISU_2021_fullyear'

def store_in_file(data, column_names, filename):
    data_in_df = pd.DataFrame(data, columns=column_names)
    if filename.endswith("csv"):
        data_in_df.to_csv(filename, index=False)
    else:
        data_in_df.to_parquet(filename)
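
The pandas writers generally expect the target folder to exist already. A minimal sketch for creating it first, using only the standard library; `ensure_parent_folder` is a hypothetical helper name, and a temporary directory stands in for the project's data folder.

```python
import tempfile
from pathlib import Path


def ensure_parent_folder(filename):
    """Create the folder that will contain `filename` if it is missing."""
    folder = Path(filename).parent
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Demonstration in a throwaway directory instead of the real data folder
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "files" / "ISU_2021_fullyear" / "publications.csv"
    folder = ensure_parent_folder(target)
    folder_exists = folder.is_dir()
    print(folder_exists)
```

Calling such a helper at the top of `store_in_file` would be one way to use it; `exist_ok=True` makes repeated calls harmless.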

---
Let's see **how many publications in OpenAlex** match the given criteria and how many requests we need to make to the OpenAlex API to query them all using [paging](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging):  
*Note: we are setting the `per_page` parameter to its maximum value 200 to reduce the number of API calls we will have to make.*

In [8]:
api_response = requests.get(filtered_works_url)
parsed_response = api_response.json()

count = parsed_response['meta']['count']
print(f"number of publications: {count}")

per_page = 200
number_of_pages_needed = int(count / per_page) + (count % per_page > 0)
print(f"number of requests needed (with per_page set to {per_page}): {number_of_pages_needed}")

number of publications: 3350
number of requests needed (with per_page set to 200): 17
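
The expression `int(count / per_page) + (count % per_page > 0)` is a ceiling division: 3350 / 200 = 16.75, which means 16 full pages plus one partial page. The sketch below shows an equivalent formulation with `math.ceil`; `pages_needed` is a hypothetical helper name.

```python
import math


def pages_needed(count, per_page):
    """Number of pages required to hold `count` items."""
    return math.ceil(count / per_page)


# 16 full pages of 200 items plus one partial page
print(pages_needed(3350, 200))

# An exact multiple needs no extra page
print(pages_needed(400, 200))
```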


That's a reasonable amount of data and way below the [rate limit](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication) of 100,000 API calls per day, so we're good to go.

---

Let's put all the pieces together and time the process:
* request each page for the publication list using cursor paging
* extract selected attributes from each publication
* extract the connections between a publication and its referenced works
* store publication metadata in a file called "publications.csv" within the project's data folder
* store connections from publications to their references in a file called "pub2ref.csv" within the project's data folder

In [9]:
%%time

# get all publications
publications = []
pub2ref = []

results_per_page = get_metadata_using_cursor_paging(filtered_works_url)
for results in results_per_page:
    for work in results:
        publications.append(extract_selected_fields(work))
        pub2ref.extend(extract_references(work))

# store publications
store_in_file(publications, 
                  ['publication_id', 'publication_doi', 'publication_year', 'publication_title', 'publication_journal', 'publication_publisher', 'publication_journal_issn', 'num_cited_references'], 
                  f'{data_folder}/publications.csv')

# store connections from publications to their references
store_in_file(pub2ref, ['publication_id', 'reference_id'], f'{data_folder}/pub2ref.csv')

https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:False,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-12-31&mailto=eschares@iastate.edu&per_page=200&cursor=*
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:False,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-12-31&mailto=eschares@iastate.edu&per_page=200&cursor=IlsyMiwgJ2h0dHBzOi8vb3BlbmFsZXgub3JnL1czMTM5MTU4OTgwJ10i
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:False,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-12-31&mailto=eschares@iastate.edu&per_page=200&cursor=IlsxNCwgJ2h0dHBzOi8vb3BlbmFsZXgub3JnL1czMTE5NzQyNTc5J10i
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:False,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-12-31&mail

---

### 2. All the cited references listed in their bibliographies
To retrieve the cited references for each publication, we can make use of the `pub2ref` table which stores the OpenAlex ID of each reference. 
However, it's important to note that multiple publications may cite the same reference, which can result in redundant API calls.
Therefore, we should first **remove any duplicate entries**:

In [10]:
ref_ids = [p2r[1] for p2r in pub2ref]
print(f'number of references in pub2ref: {len(ref_ids)}')
                                                   
unique_ref_ids = list(dict.fromkeys(ref_ids))
print(f'number of unique references in pub2ref: {len(unique_ref_ids)}')

number of references in pub2ref: 142947
number of unique references in pub2ref: 124032
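
`dict.fromkeys` drops repeated IDs while keeping the first occurrence of each one in its original position, an ordering that a plain `set` would not guarantee. A toy illustration with placeholder IDs, followed by the arithmetic for the figures printed above:

```python
ref_ids = ["W3", "W1", "W3", "W2", "W1"]  # placeholder IDs with repeats

# Keys of a dict are unique and keep insertion order
unique_ref_ids = list(dict.fromkeys(ref_ids))
print(unique_ref_ids)

# Figures reported above: total entries vs. unique references
total, unique = 142947, 124032
duplicates = total - unique
share = round(100 * duplicates / total, 1)
print(f"{duplicates} duplicate entries ({share}% of all entries)")
```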


--- 

To fetch the metadata belonging to multiple OpenAlex IDs efficiently, we can follow the [**approach outlined in the OurResearch blog**](https://blog.ourresearch.org/fetch-multiple-dois-in-one-openalex-api-request/).  
This involves constructing a URL querying the OpenAlex API for up to 50 OpenAlex IDs in a single API request. Note that we also set the `per_page` parameter to 50 to match the number of results (default of `per_page` is 25, which would take two API calls otherwise).

In [11]:
def build_url_for_references(openalex_ids, per_page, mailto):
    # specify endpoint
    endpoint = 'works'

    # build the 'filter' parameter
    openalex_only_ids = [openalex_id.replace("https://openalex.org/", "") for openalex_id in openalex_ids]
    filters = f'openalex:{"|".join(openalex_only_ids)}'
    
    # put the URL together
    return f'https://api.openalex.org/{endpoint}?filter={filters}&per_page={per_page}&mailto={mailto}'

To use this approach we also need to **slice the list of OpenAlex IDs into pieces of 50 IDs**, which we will put into the URL that we use to **query the OpenAlex API**. 

In [12]:
def get_references(reference_ids, mailto):
    chunk_size = 50
    session = requests.Session()
    
    for i in range(0, len(reference_ids), chunk_size):
        ref_ids_slice = reference_ids[i:i + chunk_size]
        references_url = build_url_for_references(ref_ids_slice, chunk_size, mailto)
        
        page_with_results = session.get(references_url)
        results = page_with_results.json()['results']
        yield results
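
The slicing and the OR-joined filter can be tried out without any network access. The sketch below uses placeholder IDs and two hypothetical helpers, `chunked` and `build_openalex_id_filter`, that mirror the logic of the functions above.

```python
def chunked(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def build_openalex_id_filter(openalex_ids):
    """Strip the URL prefix and join the short IDs with '|' (logical OR)."""
    short_ids = [
        openalex_id.replace("https://openalex.org/", "")
        for openalex_id in openalex_ids
    ]
    return "openalex:" + "|".join(short_ids)


# 120 placeholder IDs are split into slices of 50, 50 and 20
toy_ids = [f"https://openalex.org/W{n}" for n in range(1, 121)]
chunk_sizes = [len(chunk) for chunk in chunked(toy_ids, 50)]
print(chunk_sizes)

example_filter = build_openalex_id_filter(toy_ids[:3])
print(example_filter)
```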

---
Let's see **how many requests** we need to make to the OpenAlex API to retrieve all references:

In [13]:
count = len(unique_ref_ids)
per_page = 50
number_of_pages_needed = int(count / per_page) + (count % per_page > 0)
print(f"number of requests needed (with per_page set to {per_page}): {number_of_pages_needed}")

number of requests needed (with per_page set to 50): 2481


While the number is still way below the [rate limit](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication) of 100,000 API calls per day, it might take a while to make all of these requests.  
To get a sense of progress, we will put in a `print` statement that notifies us after every 100 requests.

---

We put the parts together again like this:
* we take all of the references' OpenAlex IDs from `pub2ref` and remove duplicates
* we divide the remaining OpenAlex IDs into slices of size 50 and pipe them together using the OR operator as described in the OurResearch blog post
* with the URLs we call the OpenAlex API to get the references' metadata
* from the references we extract the same fields as we did before for publications (see function `extract_selected_fields(openalex_work)`)
* we store the metadata for references in a compressed file called "references.parquet" within the project's data folder

In [14]:
%%time

# call OpenAlex API 
references = []
results_per_page = get_references(unique_ref_ids, email)
for i, results in enumerate(results_per_page):
    if i % 100 == 0: print(f'{i} requests sent')
    for work in results:
        # extract fields
        references.append(extract_selected_fields(work))

# store references
store_in_file(references, 
                  ['reference_id','reference_doi','reference_year','reference_title','reference_journal','reference_publisher','reference_journal_issn','reference_citation_count'],
                  f'{data_folder}/references.parquet')

0 requests sent
100 requests sent
200 requests sent
300 requests sent
400 requests sent
500 requests sent
600 requests sent
700 requests sent
800 requests sent
900 requests sent
1000 requests sent
1100 requests sent
1200 requests sent
1300 requests sent
1400 requests sent
1500 requests sent
1600 requests sent
1700 requests sent
1800 requests sent
1900 requests sent
2000 requests sent
2100 requests sent
2200 requests sent
2300 requests sent
2400 requests sent

CPU times: user 55.3 s, sys: 1.46 s, total: 56.8 s
Wall time: 32min 7s


---

## Data download complete

Now we can use the stored files to analyze only the list of **publications** ...

In [15]:
# publications only
pubs_df = pd.read_csv(f'{data_folder}/publications.csv')
pubs_df.head(10)

Unnamed: 0,publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references
0,https://openalex.org/W3108936441,https://doi.org/10.1080/15548627.2020.1797280,2021,Guidelines for the use and interpretation of a...,Autophagy,Landes Bioscience,1554-8627,4075
1,https://openalex.org/W3100777112,https://doi.org/10.1016/j.ymssp.2020.107398,2021,1D convolutional neural networks and applicati...,Mechanical Systems and Signal Processing,Elsevier BV,0888-3270,63
2,https://openalex.org/W3016123475,https://doi.org/10.1016/j.ymssp.2020.107077,2021,A review of vibration-based damage detection i...,Mechanical Systems and Signal Processing,Elsevier BV,0888-3270,155
3,https://openalex.org/W2084148186,https://doi.org/10.1063/1.124228,2021,Optical photonic crystals fabricated from coll...,Applied Physics Letters,American Institute of Physics,0003-6951,9
4,https://openalex.org/W3154987841,https://doi.org/10.1016/j.tplants.2021.03.010,2021,Designing Future Crops: Genomics-Assisted Bree...,Trends in Plant Science,Elsevier BV,1360-1385,167
5,https://openalex.org/W3110874572,https://doi.org/10.1109/tpwrs.2020.3041774,2021,Definition and Classification of Power System ...,IEEE Transactions on Power Systems,Institute of Electrical and Electronics Engineers,0885-8950,78
6,https://openalex.org/W3128448984,https://doi.org/10.1038/s41579-020-00502-7,2021,An evolving view on biogeochemical cycling of ...,Nature Reviews Microbiology,Nature Portfolio,1740-1526,204
7,https://openalex.org/W3189525949,https://doi.org/10.1126/science.abg5289,2021,"De novo assembly, annotation, and comparative ...",Science,American Association for the Advancement of Sc...,0036-8075,168
8,https://openalex.org/W3089919509,https://doi.org/10.5194/essd-13-889-2021,2021,An extended time series (2000–2018) of global ...,Earth System Science Data,Copernicus Publications,1866-3508,64
9,https://openalex.org/W3138381431,https://doi.org/10.1021/acsenergylett.1c00445,2021,Challenges for and Pathways toward Li-Metal-Ba...,ACS energy letters,American Chemical Society,2380-8195,21


<br>
<hr>

or only their **references** ...

In [16]:
# references only
refs_df = pd.read_parquet(f'{data_folder}/references.parquet')
refs_df.head(10)

Unnamed: 0.1,Unnamed: 0,reference_id,reference_doi,reference_year,reference_title,reference_journal,reference_publisher,reference_journal_issn,reference_citation_count
0,0,https://openalex.org/W2057189219,https://doi.org/10.1021/ac00204a006,1990,Electrochemical behavior of reversible redox s...,Analytical Chemistry,American Chemical Society,0003-2700,244
1,1,https://openalex.org/W2955113482,https://doi.org/10.1534/genetics.119.302134,2019,Heritability in Plant Breeding on a Genotype-D...,Genetics,Genetics Society of America,0016-6731,56
2,2,https://openalex.org/W2057122136,https://doi.org/10.1016/j.eneco.2007.09.004,2008,Metal volatility in presence of oil and intere...,Energy Economics,Elsevier BV,0140-9883,258
3,3,https://openalex.org/W2039927654,https://doi.org/10.1016/j.finel.2003.11.001,2004,Popular benchmark problems for geometric nonli...,Finite Elements in Analysis and Design,Elsevier BV,0168-874X,302
4,4,https://openalex.org/W2315171359,https://doi.org/10.15252/embr.201540500,2016,The antiobesity factor <scp>WDTC</scp> 1 suppr...,EMBO Reports,Nature Portfolio,1469-221X,25
5,5,https://openalex.org/W2030471735,https://doi.org/10.1002/(sici)1520-6297(199701...,1997,Vertical coordination in the US pork industry:...,Agribusiness,Wiley-Blackwell,0742-4477,61
6,6,https://openalex.org/W2890160214,https://doi.org/10.1017/wsc.2018.56,2018,Extractable and Germinable Seedbank Methods Pr...,Weed Science,Cambridge University Press,0043-1745,14
7,7,https://openalex.org/W2787021517,https://doi.org/10.1103/physrevlett.120.192001,2018,First Extraction of Transversity from a Global...,Physical Review Letters,American Physical Society,0031-9007,66
8,8,https://openalex.org/W2730115213,https://doi.org/10.1007/s11528-017-0215-z,2018,Online Course Design in Higher Education: A Re...,TechTrends,Springer Science+Business Media,1559-7075,40
9,9,https://openalex.org/W2061016601,https://doi.org/10.1109/epec.2011.6070241,2011,New microprocessor based relay to monitor and ...,Electrical Power and Energy Conference,,,51


<br>
<hr>

or if we want to get the original connected dataset we can **join publications with their references** using the pub2ref table:

In [17]:
# publications and their references
pub2ref_df = pd.read_csv(f'{data_folder}/pub2ref.csv')

df = pubs_df.join(pub2ref_df.set_index('publication_id'), on='publication_id')
df = df.join(refs_df.set_index('reference_id'), on='reference_id')
df.head(10)

Unnamed: 0.1,publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references,reference_id,Unnamed: 0,reference_doi,reference_year,reference_title,reference_journal,reference_publisher,reference_journal_issn,reference_citation_count
0,https://openalex.org/W3108936441,https://doi.org/10.1080/15548627.2020.1797280,2021,Guidelines for the use and interpretation of a...,Autophagy,Landes Bioscience,1554-8627,4075,https://openalex.org/W1919353,63604.0,https://doi.org/10.1093/jnen/64.2.113,2005.0,Extensive Involvement of Autophagy in Alzheime...,Journal of Neuropathology and Experimental Neu...,Oxford University Press,0022-3069,1218.0
0,https://openalex.org/W3108936441,https://doi.org/10.1080/15548627.2020.1797280,2021,Guidelines for the use and interpretation of a...,Autophagy,Landes Bioscience,1554-8627,4075,https://openalex.org/W21038192,101734.0,https://doi.org/10.4161/auto.1.1.1512,2005.0,Early Secretory Pathway Gene<i>TRS85</i>is Req...,Autophagy,Landes Bioscience,1554-8627,65.0
0,https://openalex.org/W3108936441,https://doi.org/10.1080/15548627.2020.1797280,2021,Guidelines for the use and interpretation of a...,Autophagy,Landes Bioscience,1554-8627,4075,https://openalex.org/W55875826,98332.0,,1990.0,Non-selective autophagy.,Seminars in Cell Biology,Saunders,1043-4682,44.0
0,https://openalex.org/W3108936441,https://doi.org/10.1080/15548627.2020.1797280,2021,Guidelines for the use and interpretation of a...,Autophagy,Landes Bioscience,1554-8627,4075,https://openalex.org/W68089729,60647.0,https://doi.org/10.1007/978-1-59745-157-4_2,2008.0,Fine Structure of the Autophagosome,Humana Press eBooks,Humana Press,,87.0
0,https://openalex.org/W3108936441,https://doi.org/10.1080/15548627.2020.1797280,2021,Guidelines for the use and interpretation of a...,Autophagy,Landes Bioscience,1554-8627,4075,https://openalex.org/W77505154,29744.0,,1996.0,Estimation of peroxidative damage. A critical ...,Pathologie Biologie,Elsevier BV,0369-8114,68.0
0,https://openalex.org/W3108936441,https://doi.org/10.1080/15548627.2020.1797280,2021,Guidelines for the use and interpretation of a...,Autophagy,Landes Bioscience,1554-8627,4075,https://openalex.org/W102256019,44033.0,https://doi.org/10.1007/978-1-61779-328-8_21,2011.0,Monitoring Mitophagy in Neuronal Cell Cultures,Humana Press eBooks,Humana Press,,44.0
0,https://openalex.org/W3108936441,https://doi.org/10.1080/15548627.2020.1797280,2021,Guidelines for the use and interpretation of a...,Autophagy,Landes Bioscience,1554-8627,4075,https://openalex.org/W115026104,101339.0,https://doi.org/10.1016/s0076-6879(08)03203-5,2008.0,Chapter 3 The Quantitative Pho8Δ60 Assay of No...,Methods in Enzymology,Academic Press,0076-6879,128.0
0,https://openalex.org/W3108936441,https://doi.org/10.1080/15548627.2020.1797280,2021,Guidelines for the use and interpretation of a...,Autophagy,Landes Bioscience,1554-8627,4075,https://openalex.org/W126188432,27665.0,https://doi.org/10.1007/978-1-59745-157-4_4,2008.0,LC3 and Autophagy,Humana Press eBooks,Humana Press,,1115.0
0,https://openalex.org/W3108936441,https://doi.org/10.1080/15548627.2020.1797280,2021,Guidelines for the use and interpretation of a...,Autophagy,Landes Bioscience,1554-8627,4075,https://openalex.org/W148258150,87209.0,https://doi.org/10.1007/978-0-387-74021-8_9,2007.0,Origin and Evolution of Self-Consumption: Auto...,Advances in Experimental Medicine and Biology,Springer Nature,0065-2598,33.0
0,https://openalex.org/W3108936441,https://doi.org/10.1080/15548627.2020.1797280,2021,Guidelines for the use and interpretation of a...,Autophagy,Landes Bioscience,1554-8627,4075,https://openalex.org/W149241939,122816.0,https://doi.org/10.1016/s0076-6879(08)01409-2,2008.0,Chapter Nine Lysosomes in Apoptosis,Elsevier eBooks,Elsevier,,69.0
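
In the joined output above, columns such as `reference_year` appear as floats (`2005.0`). A likely explanation is that the left join leaves `NaN` wherever a `reference_id` has no row in the references table, which forces integer columns to float. The sketch below reproduces this with made-up data and uses `merge` with `indicator=True` to count unmatched references; it is an illustration, not part of the original notebook.

```python
import pandas as pd

pub2ref = pd.DataFrame({
    "publication_id": ["P1", "P1", "P2"],
    "reference_id": ["R1", "R2", "R3"],   # R3 has no metadata row below
})
references = pd.DataFrame({
    "reference_id": ["R1", "R2"],
    "reference_year": [2005, 2008],
})

# indicator=True adds a '_merge' column describing where each row came from
joined = pub2ref.merge(references, on="reference_id", how="left", indicator=True)

# The unmatched row gets NaN, so the integer column is converted to float
print(joined["reference_year"].dtype)

unmatched = int((joined["_merge"] == "left_only").sum())
print(f"references without metadata: {unmatched}")
```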


<br>

This completes the data collection. Head over to **Part 2**, where the data exploration and analysis take place!