# Investigating OpenAlex data: cited references
#### Eric Schares, Iowa State University; [eschares.github.io](eschares.github.io) 
 
---

<div style='background:#e7edf7'>
    This notebook queries the OpenAlex API to get a set of publications, pulls the cited references from their bibliographies, and answers the questions:
    <blockquote>
        <b><i>How many articles do our authors cite? When were those articles published? How recent are they?</i></b>
    </blockquote>
   </div>

 
**Context**

We would like to better understand how campus researchers use journal content. Analyzing which years our authors cite and how many papers they cite gives us a better feel for how content is being used. We can use this information as we make journal cancellation and renewal decisions.

- **Part 1**. Pull the Data from OpenAlex API
- **Part 2**. Plot the Data
 - **2.1**. Number of references
 - **2.2**. Years of references

---
# Part 1. Pull the Data
#### (Skip to Part 2 if you already have the data saved)

In [1]:
import pandas as pd
import requests
import plotly.express as px

## To modify for your own use, edit the input parameters:
- [ROR ID](https://ror.org/search?query=iowa+state) for your own institution 
- Date range: from_publication_date - to_publication_date
- Email address to get into OpenAlex's polite pool for faster response times

In [2]:
# input
ror_id = "https://ror.org/04rswrd78"
from_publication_date = "2021-01-01"
to_publication_date = "2021-01-01"
email = "eschares@iastate.edu"

In [3]:
def build_url(ror_id, from_pub_date, to_pub_date, email):
  # specify endpoint
  endpoint = 'works'

  # build the 'filter' parameter
  filters = (
      f'institutions.ror:{ror_id}',
      'is_paratext:false',
      'type:journal-article', 
      f'from_publication_date:{from_pub_date}',
      f'to_publication_date:{to_pub_date}'
  )

  # put the URL together
  return f'https://api.openalex.org/{endpoint}?filter={",".join(filters)}&mailto={email}'


filtered_works_url = build_url(ror_id, from_publication_date, to_publication_date, email)
print(f'complete URL:\n{filtered_works_url}')

complete URL:
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-01-01&mailto=eschares@iastate.edu


Send the API call and get a response

In [4]:
api_response = requests.get(filtered_works_url)
parsed_response = api_response.json()

How many publication ("parent") results?
And how many OpenAlex pages will this take at the given `per_page`?

In [5]:
count = parsed_response['meta']['count']
print(f"result count: {count}")

per_page = 200
number_of_pages_needed = int(count / per_page) + (count % per_page > 0) # shorter way to calculate math.ceil
print(f"number of pages needed: {number_of_pages_needed}")

result count: 230
number of pages needed: 2
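
The page count above is a ceiling division. A minimal sketch of an equivalent integer-only form follows; `pages_needed` is a hypothetical helper name, not part of the notebook.

```python
import math

def pages_needed(count, per_page=200):
    """Number of pages required to hold `count` results at `per_page` per page."""
    # -(-a // b) is integer ceiling division; it never goes through a float
    return -(-count // per_page)

# Agrees with math.ceil on the float quotient for the count seen above
assert pages_needed(230) == math.ceil(230 / 200)

print(pages_needed(230))  # 2
print(pages_needed(400))  # 2
print(pages_needed(0))    # 0
```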


## Main loop - send a request, go through each page, on each page go through each result, and pull out the pieces we want
#### ---- Warning! ----

This can take quite a bit of time to run depending on the number of records and number of cited references you're asking for. Code takes about 1 minute per 300 total references, or 0.2s/reference. Assume ~45 references per paper to estimate, or 9s/paper.
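
As a rough worked example of that estimate, the sketch below multiplies the figures quoted above (about 45 references per paper, 0.2 s per reference). The function name and its defaults are illustrative assumptions; actual run times depend on your connection and on OpenAlex response times.

```python
def estimate_runtime_seconds(num_papers, refs_per_paper=45, seconds_per_ref=0.2):
    """Back-of-the-envelope run time, using the rule of thumb quoted above."""
    return num_papers * refs_per_paper * seconds_per_ref

# 230 is the result count returned for the example query in this notebook
seconds = estimate_runtime_seconds(230)
print(f"{seconds:.0f} s, about {seconds / 60:.1f} min")  # 2070 s, about 34.5 min
```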

In [6]:
## GET ONLY PUBLICATIONS AND STORE THEIR REFERENCED WORKS 
def get_publications(works_url):
    api_calls_total = 0
    session = requests.Session()

    # we will store publications and connection pub2ref separately
    publications = []
    pub2ref = []

    # url with a placeholder for cursor
    works_url_with_cursor = works_url + '&cursor={}&per_page=200'

    # loop through pages
    cursor = '*'
    while cursor:
        # set cursor value and request page from OpenAlex
        url = works_url_with_cursor.format(cursor)
        print(url)
        page_with_results = session.get(url).json()
        api_calls_total += 1

        # loop through partial list of results
        results = page_with_results['results']
        for work in results:
            publications.append((
              work['id'],  # keep the OpenAlex ID
              work['doi'],
              work['publication_year'],
              work['title'],
              work['host_venue']['display_name'],
              work['host_venue']['publisher'],
              work['host_venue']['issn_l'],
              len(work['referenced_works'])
              #kicked out concepts for now
            ))

            for ref in work['referenced_works']:
                pub2ref.append((
                    work['id'],
                    ref
                ))

        # update cursor to meta.next_cursor
        cursor = page_with_results['meta']['next_cursor']
      
    print(f"number of api calls for publications: {api_calls_total}")
    return publications, pub2ref
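
The loop in `get_publications` follows a cursor-paging pattern: start with `*`, read `meta.next_cursor` from each response, and stop when it is empty. The sketch below isolates that pattern so it can be tried without network access. `iterate_pages`, `fetch_page`, and the fake responses are hypothetical stand-ins for `session.get(url).json()`, not OpenAlex output.

```python
def iterate_pages(fetch_page):
    """Yield every result, following cursors until none is returned."""
    cursor = '*'  # '*' requests the first page
    while cursor:
        page = fetch_page(cursor)
        yield from page['results']
        # .get() tolerates a missing key; None (or '') ends the loop
        cursor = page['meta'].get('next_cursor')

# Made-up two-page response keyed by cursor value
fake_pages = {
    '*': {'results': ['W1', 'W2'], 'meta': {'next_cursor': 'abc'}},
    'abc': {'results': ['W3'], 'meta': {'next_cursor': None}},
}

works = list(iterate_pages(fake_pages.__getitem__))
print(works)  # ['W1', 'W2', 'W3']
```

Separating the paging from the per-record processing also makes it easier to write each page to disk as it arrives.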

In [7]:
def get_references(pub2ref):
    api_calls_total = 0
    session = requests.Session()

    references = []

    # url with a placeholder for cursor
    references_url = "https://api.openalex.org/works?filter=cited_by:{list_of_ids}&mailto=" + email  # reuse the email set in the input cell
    references_url_with_cursor = references_url + '&cursor={cursor}&per_page=200'

    # filter for publications that have at least one reference
    pubs_with_refs = list(set(p[0].replace("https://openalex.org/","") for p in pub2ref))

    # take chunk of 50 publications
    chunk_size = 50
    for i in range(0, len(pubs_with_refs), chunk_size):
        publications_slice = pubs_with_refs[i:i + chunk_size]
        list_of_ids = "|".join(publications_slice)

        # loop through pages
        cursor = '*'
        while cursor:
            # set cursor value and request page from OpenAlex
            url = references_url_with_cursor.format(list_of_ids=list_of_ids, cursor=cursor)
            print(url)
            page_with_results = session.get(url).json()
            api_calls_total += 1
      
            # loop through partial list of results
            results = page_with_results['results']
            for work in results:
                references.append((
                    work['id'],  # keep the OpenAlex ID
                    work['doi'],
                    work['publication_year'],
                    work['title'],
                    work['host_venue']['display_name'],
                    work['host_venue']['publisher'],
                    work['host_venue']['issn_l'],
                    work['cited_by_count']
                    #kicked out concepts for now
                 ))

            # update cursor to meta.next_cursor
            cursor = page_with_results['meta']['next_cursor']

    print(f"number of api calls for references: {api_calls_total}")
    return references
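
`get_references` sends publication IDs in batches of 50, joined with `|`, which OpenAlex's filter syntax reads as OR. A minimal sketch of just the batching step, using made-up IDs; `chunk_filter_values` is a hypothetical helper name.

```python
def chunk_filter_values(ids, chunk_size=50):
    """Yield pipe-joined strings holding at most `chunk_size` IDs each."""
    for start in range(0, len(ids), chunk_size):
        yield "|".join(ids[start:start + chunk_size])

ids = [f"W{n}" for n in range(1, 121)]  # 120 made-up IDs
chunks = list(chunk_filter_values(ids))

print(len(chunks))               # 3
print(chunks[2].count("|") + 1)  # 20 IDs in the last, partial batch
```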

In [9]:
%%time
## MAIN

# get all publications
filtered_works_url = build_url(ror_id, from_publication_date, to_publication_date, email)
print("URL for publications: " + filtered_works_url)
publications, pub2ref = get_publications(filtered_works_url)

# store publications
print(f"retrieved {len(publications)} publications")
pubs_only = pd.DataFrame(publications, columns =['publication_id',
                                                 'publication_doi', 
                                                 'publication_year',
                                                 'publication_title',
                                                 'publication_journal',
                                                 'publication_publisher',
                                                 'publication_journal_issn',
                                                 'num_cited_references'
                                                 ])

pubs_only.to_csv('publications.csv', index=False)

# store connection pub2ref
pub2ref_df = pd.DataFrame(pub2ref, columns=['publication_id', 'reference_id'])

# .csv format probably okay here, human readable
pub2ref_df.to_csv('pub2ref.csv', index=False)
pub2ref_df

# get references
references = get_references(pub2ref)

# store references
print(f"retrieved {len(references)} references")
refs_only = pd.DataFrame(references, columns =['reference_id',
                                                 'reference_doi', 
                                                 'reference_year',
                                                 'reference_title',
                                                 'reference_journal',
                                                 'reference_publisher',
                                                 'reference_journal_issn',
                                                 'reference_citation_count'
                                                 ])

# .csv format probably okay here, human readable
refs_only.to_csv('references.csv', index=False)

URL for publications: https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-01-01&mailto=eschares@iastate.edu
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-01-01&mailto=eschares@iastate.edu&cursor=*&per_page=200
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-01-01&mailto=eschares@iastate.edu&cursor=IlswLCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzMyMDIwNzIxODcnXSI=&per_page=200
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-01-01&mailto=eschares@iastate.edu&cursor=IlswLCAnaHR0cHM6Ly9vcGVu

### ❗ **It would be better to write to a file continuously (append) instead of all at once at the end: you may run out of memory!**
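
One way to do that, sketched with the standard library `csv` module: append each page's rows as they arrive and write the header only once. `append_rows` and the demo file name are hypothetical; the notebook itself writes each CSV in a single step.

```python
import csv
import os
import tempfile

def append_rows(path, header, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)

# Demo in a temporary directory, with made-up IDs
demo_path = os.path.join(tempfile.mkdtemp(), 'pub2ref_demo.csv')
header = ['publication_id', 'reference_id']
append_rows(demo_path, header, [('W1', 'W10'), ('W1', 'W11')])  # first page
append_rows(demo_path, header, [('W2', 'W12')])                 # second page

with open(demo_path, newline='', encoding='utf-8') as f:
    demo_rows = list(csv.reader(f))
print(len(demo_rows))  # 4: one header line plus three data rows
```

Called once per page inside the paging loop, this keeps only one page of results in memory at a time.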

# Part 2: Plot the data
### Skip to here if you already have the df's saved.
Once the OpenAlex responses have been parsed into pandas DataFrames and saved, you can start to plot.

❗❗❗❗❗

* Any analysis you previously ran on publications alone (the summary table) now runs on publications.csv.
* Any analysis you previously ran on publications and references in one table now runs on the tables joined via pub2ref.
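
The three-table layout can be tried on toy data first. The sketch below uses made-up IDs and years and `DataFrame.merge` with `how='left'`, so a link whose reference record is missing stays in the result with empty reference columns.

```python
import pandas as pd

# Made-up stand-ins for publications.csv, references.csv and pub2ref.csv
pubs = pd.DataFrame({'publication_id': ['P1', 'P2'],
                     'publication_year': [2021, 2021]})
refs = pd.DataFrame({'reference_id': ['R1', 'R2'],
                     'reference_year': [2015, 1999]})
links = pd.DataFrame({'publication_id': ['P1', 'P1', 'P2'],
                      'reference_id': ['R1', 'R2', 'R3']})

# Left joins keep every pub-to-ref link, even if a side table lacks the ID
joined = (links
          .merge(pubs, on='publication_id', how='left')
          .merge(refs, on='reference_id', how='left'))

print(joined)
print(joined['reference_year'].isna().sum())  # 1: R3 has no reference record
```

Counting the empty rows after the join is a quick check on how many cited references were not retrieved.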

In [35]:
pubs_df = pd.read_csv('publications.csv')
pub2ref_df = pd.read_csv('pub2ref.csv')
refs_df = pd.read_csv('references.csv')

# join tables on id fields - that's why it is important to keep unique openalex ids!
df = pub2ref_df.join(pubs_df.set_index('publication_id'), on='publication_id')
pub_id_col = df.pop('reference_id') # move reference_id column to end
df['reference_id'] = pub_id_col     # move reference_id column to end
df = df.join(refs_df.set_index('reference_id'), on='reference_id')
df

Unnamed: 0,publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references,reference_id,reference_doi,reference_year,reference_title,reference_journal,reference_publisher,reference_journal_issn,reference_citation_count
0,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W107656301,https://doi.org/10.1016/b978-0-444-56334-7.000...,2012,Other membrane processes,Membrane Processes in Biotechnology and Pharma...,,,1
1,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1513798492,https://doi.org/10.1126/science.aaa5058,2015,Sub–10 nm polyamide nanofilms with ultrafast s...,Science,American Association for the Advancement of Sc...,0036-8075,1011
2,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1748568996,https://doi.org/10.1016/j.memsci.2015.09.059,2016,Identifying facile and accurate methods to mea...,Journal of Membrane Science,Elsevier,0376-7388,64
3,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1994098525,https://doi.org/10.1016/s1089-3156(99)00020-3,1999,Molecular dynamics simulation study of the mec...,Computational and Theoretical Polymer Science,Elsevier,1089-3156,69
4,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1994820621,https://doi.org/10.1016/j.desal.2013.09.024,2014,Molecular simulations of polyamide reverse osm...,Desalination,Elsevier,0011-9164,47
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7551,https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4244729525,https://doi.org/10.2307/j.ctv9b2wsq.9,2019,LETTER 63,"Letters, 61-90",,,1
7552,https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4247637984,https://doi.org/10.1021/cen-v041n033.p082,1963,INDUSTRY,Chemical & Engineering News,American Chemical Society,0009-2347,1
7553,https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4248717949,https://doi.org/10.3386/w28056,2020,Intangible Value,,,,6
7554,https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4249706454,https://doi.org/10.1080/10948009409389748,1994,Communications policy,Communication booknotes,Taylor & Francis,0748-657X,1


In [36]:
# table with multiindex - connection pub to ref visualized
df_grouped = df.set_index(['publication_id',
                            'publication_doi',
                            'publication_year',
                            'publication_title',
                            'publication_journal',
                            'publication_publisher',
                            'publication_journal_issn',
                            'num_cited_references',
                            'reference_id'])
df_grouped

publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references,reference_id,reference_doi,reference_year,reference_title,reference_journal,reference_publisher,reference_journal_issn,reference_citation_count
https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity enhances water transport in desalination membranes,Science,American Association for the Advancement of Science (AAAS),0036-8075,49,https://openalex.org/W107656301,https://doi.org/10.1016/b978-0-444-56334-7.000...,2012,Other membrane processes,Membrane Processes in Biotechnology and Pharma...,,,1
https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity enhances water transport in desalination membranes,Science,American Association for the Advancement of Science (AAAS),0036-8075,49,https://openalex.org/W1513798492,https://doi.org/10.1126/science.aaa5058,2015,Sub–10 nm polyamide nanofilms with ultrafast s...,Science,American Association for the Advancement of Sc...,0036-8075,1011
https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity enhances water transport in desalination membranes,Science,American Association for the Advancement of Science (AAAS),0036-8075,49,https://openalex.org/W1748568996,https://doi.org/10.1016/j.memsci.2015.09.059,2016,Identifying facile and accurate methods to mea...,Journal of Membrane Science,Elsevier,0376-7388,64
https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity enhances water transport in desalination membranes,Science,American Association for the Advancement of Science (AAAS),0036-8075,49,https://openalex.org/W1994098525,https://doi.org/10.1016/s1089-3156(99)00020-3,1999,Molecular dynamics simulation study of the mec...,Computational and Theoretical Polymer Science,Elsevier,1089-3156,69
https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity enhances water transport in desalination membranes,Science,American Association for the Advancement of Science (AAAS),0036-8075,49,https://openalex.org/W1994820621,https://doi.org/10.1016/j.desal.2013.09.024,2014,Molecular simulations of polyamide reverse osm...,Desalination,Elsevier,0011-9164,47
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4244729525,https://doi.org/10.2307/j.ctv9b2wsq.9,2019,LETTER 63,"Letters, 61-90",,,1
https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4247637984,https://doi.org/10.1021/cen-v041n033.p082,1963,INDUSTRY,Chemical & Engineering News,American Chemical Society,0009-2347,1
https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4248717949,https://doi.org/10.3386/w28056,2020,Intangible Value,,,,6
https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4249706454,https://doi.org/10.1080/10948009409389748,1994,Communications policy,Communication booknotes,Taylor & Francis,0748-657X,1


In [37]:
# everything you did with summary_df before
summary_df = pubs_df

summary_df = summary_df.sort_values(by='num_cited_references')
summary_df = summary_df.reset_index(drop=True)
summary_df


Unnamed: 0,publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references
0,https://openalex.org/W4244705303,https://doi.org/10.2139/ssrn.3949336,2021,Effect of Pooling Family Oral Fluids on the Pr...,Social Science Research Network,Social Science Electronic Publishing,1556-5068,0
1,https://openalex.org/W3082738057,https://doi.org/10.1109/tsg.2020.3020790,2021,Real-Time Area Angle Monitoring Using Synchrop...,IEEE Transactions on Smart Grid,Institute of Electrical and Electronics Engineers,1949-3053,0
2,https://openalex.org/W3164221434,https://doi.org/10.13031/trans.14161,2021,Comparison of Dry Matter Loss Rates from Stati...,Transactions of the ASABE,American Society of Agricultural and Biologica...,2151-0032,0
3,https://openalex.org/W3180645738,https://doi.org/10.1159/000517937,2021,Preface to the Special Issue on Sexual Develop...,Sexual Development,S. Karger AG,1661-5425,0
4,https://openalex.org/W3182526752,https://doi.org/10.3389/fpls.2021.720709,2021,Corrigendum: Polysaccharide Biosynthesis: Glyc...,Frontiers in Plant Science,Frontiers Media SA,1664-462X,0
...,...,...,...,...,...,...,...,...
225,https://openalex.org/W3091715310,https://doi.org/10.1016/j.agwat.2020.106466,2021,Standard single and basal crop coefficients fo...,Agricultural Water Management,Elsevier,0378-3774,149
226,https://openalex.org/W3039069446,https://doi.org/10.1016/j.jobe.2020.101582,2021,Methodologies to mitigate wind-induced vibrati...,Journal of building engineering,Elsevier,2352-7102,154
227,https://openalex.org/W3205262116,https://doi.org/10.3389/fncel.2021.772868,2021,Differential Impact of Severity and Duration o...,Frontiers in Cellular Neuroscience,Frontiers Media SA,1662-5102,157
228,https://openalex.org/W3112388281,https://doi.org/10.1093/ajcn/nqaa302,2021,NIH Workshop Report: sensory nutrition and dis...,The American Journal of Clinical Nutrition,Oxford University Press,0002-9165,176


In [38]:
# everything you did with df before..
df['publication_doi'].value_counts()

https://doi.org/10.1002/jcpy.1201                  183
https://doi.org/10.1093/ajcn/nqaa302               176
https://doi.org/10.3389/fncel.2021.772868          158
https://doi.org/10.1016/j.jobe.2020.101582         154
https://doi.org/10.1016/j.agwat.2020.106466        149
                                                  ... 
https://doi.org/10.1016/j.nuclphysa.2020.121933      3
https://doi.org/10.1080/15434303.2020.1867555        3
https://doi.org/10.1080/15434303.2020.1862122        2
https://doi.org/10.1177/0098628320959946             2
https://doi.org/10.1386/ijia_00033_1                 1
Name: publication_doi, Length: 173, dtype: int64