# OpenAlex Cited References
### STI 2023
#### Eric Schares, Iowa State University; [eschares.github.io](https://eschares.github.io) 
#### Sandra Mierz; [https://github.com/smierz](https://github.com/smierz) 
---

<div style='background:#e7edf7'>
    This notebook queries the OpenAlex API to get a set of publications, pulls the cited references from their bibliographies, and answers the questions:
    <blockquote>
        <b><i>How many articles do our authors cite? When were those articles published? How recent are they?</i></b>
    </blockquote>
   </div>

 
**Context**

We would like to better understand how campus researchers use journal content. Analyzing which years our authors cite and how many papers they cite gives us a better feel for how content is being used. We can use this information as we make journal cancellation and renewal decisions.

- **Part 1**. Pull the Data from the OpenAlex API
- **Part 2**. Plot the Data
 - **2.1**. Number of references
 - **2.2**. Years of references

---
# Part 1. Pull the Data
#### (Skip to Part 2 if you already have the data saved)

In [6]:
import pandas as pd
import requests
import plotly.express as px
import datetime

## To modify for your own use, edit the input parameters:
- [ROR ID](https://ror.org/search?query=iowa+state) for your own institution 
- Date range: from_publication_date - to_publication_date
- Email address to get into OpenAlex's polite pool for faster response times

In [16]:
# input
ror_id = "https://ror.org/04rswrd78"
from_publication_date = "2021-01-06"
to_publication_date = "2021-01-06"
email = "eschares@iastate.edu"

In [17]:
def build_url(ror_id, from_pub_date, to_pub_date, email):
  # specify endpoint
  endpoint = 'works'

  # build the 'filter' parameter
  filters = (
      f'institutions.ror:{ror_id}',
      'is_paratext:false',
      'type:journal-article', 
      f'from_publication_date:{from_pub_date}',
      f'to_publication_date:{to_pub_date}'
  )

  # put the URL together
  return f'https://api.openalex.org/{endpoint}?filter={",".join(filters)}&mailto={email}'


filtered_works_url = build_url(ror_id, from_publication_date, to_publication_date, email)
print(f'complete URL:\n{filtered_works_url}')

complete URL:
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-06,to_publication_date:2021-01-06&mailto=eschares@iastate.edu


Send the API call and get a response

In [18]:
api_response = requests.get(filtered_works_url)
parsed_response = api_response.json()

How many publication ("parent") results?  
And how many OpenAlex pages will this take at the given `per_page`?

In [19]:
count = parsed_response['meta']['count']
print(f"result count: {count}")

per_page = 200
number_of_pages_needed = int(count / per_page) + (count % per_page > 0) # ceiling division: same result as math.ceil(count / per_page)
print(f"number of pages needed: {number_of_pages_needed}")

result count: 6
number of pages needed: 1
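
The page count above is a ceiling division. As a minimal sketch (the helper name `pages_needed` is hypothetical and not part of the notebook), the same arithmetic can be written with `math.ceil`:

```python
import math


def pages_needed(count: int, per_page: int = 200) -> int:
    """Return how many pages of size `per_page` are needed to hold `count` results."""
    return math.ceil(count / per_page)


# A few boundary cases: empty result set, partial page, exactly one page, one over
for count in (0, 6, 200, 201):
    print(count, pages_needed(count))
```

Note that this counts pages of results, not API calls. With cursor paging the loop may send one extra request to discover that no results remain, which would be consistent with the run log in this notebook reporting 2 API calls for 1 page of publications.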


## Main loop - send a request, go through each page, on each page go through each result, and pull out the pieces we want
#### ---- Warning! ----

This can take quite a bit of time to run depending on the number of records and number of cited references you're asking for.

If the estimated time is very long (on the order of hours), shorten the date range passed to `build_url` and run smaller chunks. Save the dataframes separately, then reassemble them into one combined dataframe using `new_df = pd.concat([df1, df2])`. Note that `pd.concat` takes a list of dataframes.
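
One way to run smaller chunks is to split the overall date range into consecutive windows and call `build_url` once per window. The sketch below is an illustration only: `split_date_range` is a hypothetical helper, and the 30-day window size is an arbitrary assumption.

```python
from datetime import date, timedelta


def split_date_range(start: str, end: str, days: int = 30):
    """Yield (from_date, to_date) ISO strings that cover start..end inclusive,
    in consecutive, non-overlapping windows of at most `days` days."""
    current = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while current <= last:
        # The window ends `days - 1` days later, or at the overall end date
        window_end = min(current + timedelta(days=days - 1), last)
        yield current.isoformat(), window_end.isoformat()
        current = window_end + timedelta(days=1)


chunks = list(split_date_range("2021-01-01", "2021-03-15", days=30))
for from_date, to_date in chunks:
    print(from_date, to_date)
```

Each `(from_date, to_date)` pair can then be passed as `from_publication_date` and `to_publication_date`, with the resulting dataframes saved separately and combined afterwards.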

In [23]:
## GET ONLY PUBLICATIONS AND STORE THEIR REFERENCED WORKS 
def get_publications(works_url):
    api_calls_total = 0
    session = requests.Session()

    # we will store publications and connection pub2ref separately
    publications = []
    pub2ref = []

    # url with a placeholder for cursor
    works_url_with_cursor = works_url + '&cursor={}&per_page=200'

    # loop through pages
    cursor = '*'
    while cursor:
        # set cursor value and request page from OpenAlex
        url = works_url_with_cursor.format(cursor)
        #print(url)
        page_with_results = session.get(url).json()
        api_calls_total += 1

        # loop through partial list of results
        results = page_with_results['results']
        for work in results:
            publications.append((
              work['id'],  # keep the OpenAlex ID
              work['doi'],
              work['publication_year'],
              work['title'],
              work['host_venue']['display_name'],
              work['host_venue']['publisher'],
              work['host_venue']['issn_l'],
              len(work['referenced_works'])
              #kicked out concepts for now
            ))

            for ref in work['referenced_works']:
                pub2ref.append((
                    work['id'],
                    ref
                ))

        # update cursor to meta.next_cursor
        cursor = page_with_results['meta']['next_cursor']
      
    print(f"number of api calls for publications: {api_calls_total}")
    return publications, pub2ref
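
The function above indexes `work['host_venue']` directly, which raises an error if that key is absent or `None`. As far as I know, OpenAlex has since deprecated `host_venue` in favor of `primary_location`, so a tolerant accessor may be useful when rerunning this notebook. The sketch below is an assumption-laden illustration: `venue_fields` is a hypothetical helper, and the fallback field names (`primary_location.source.display_name`, `host_organization_name`, `issn_l`) should be checked against the current OpenAlex documentation.

```python
def venue_fields(work: dict) -> tuple:
    """Return (journal name, publisher, ISSN-L) for a work, using None for missing values.

    Prefers the 'host_venue' object used in this notebook; falls back to
    'primary_location' -> 'source' (assumed field names, verify against the API docs).
    """
    host_venue = work.get("host_venue") or {}
    if host_venue:
        return (
            host_venue.get("display_name"),
            host_venue.get("publisher"),
            host_venue.get("issn_l"),
        )
    source = (work.get("primary_location") or {}).get("source") or {}
    return (
        source.get("display_name"),
        source.get("host_organization_name"),
        source.get("issn_l"),
    )


# Made-up records, for illustration only
old_style = {"host_venue": {"display_name": "Journal A", "publisher": "Pub A", "issn_l": "1111-1111"}}
new_style = {"primary_location": {"source": {"display_name": "Journal B",
                                             "host_organization_name": "Pub B",
                                             "issn_l": "2222-2222"}}}
print(venue_fields(old_style))
print(venue_fields(new_style))
print(venue_fields({"primary_location": None}))
```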

In [24]:
def get_references(pub2ref):
    api_calls_total = 0
    session = requests.Session()

    references = []

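    # url with placeholders for the ID list and cursor; uses the `email` input parameter
    # instead of a hardcoded address ({{list_of_ids}} is left for .format() below)
    references_url = f"https://api.openalex.org/works?filter=cited_by:{{list_of_ids}}&mailto={email}"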
    references_url_with_cursor = references_url + '&cursor={cursor}&per_page=200'

    # filter for publications that have at least one reference
    pubs_with_refs = list(set(p[0].replace("https://openalex.org/","") for p in pub2ref))

    # take chunk of 50 publications
    chunk_size = 50
    for i in range(0, len(pubs_with_refs), chunk_size):
        publications_slice = pubs_with_refs[i:i + chunk_size]
        list_of_ids = "|".join(publications_slice)

        # loop through pages
        cursor = '*'
        while cursor:
            # set cursor value and request page from OpenAlex
            url = references_url_with_cursor.format(list_of_ids=list_of_ids, cursor=cursor)
            #print(url)
            page_with_results = session.get(url).json()
            api_calls_total += 1
      
            # loop through partial list of results
            results = page_with_results['results']
            for work in results:
                references.append((
                    work['id'],  # keep the OpenAlex ID
                    work['doi'],
                    work['publication_year'],
                    work['title'],
                    work['host_venue']['display_name'],
                    work['host_venue']['publisher'],
                    work['host_venue']['issn_l'],
                    work['cited_by_count']
                    #kicked out concepts for now
                 ))

            # update cursor to meta.next_cursor
            cursor = page_with_results['meta']['next_cursor']

    print(f"number of api calls for references: {api_calls_total}")
    return references
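
Both functions above follow the same cursor-paging pattern: start with `cursor = '*'`, read `results`, then continue with `meta.next_cursor` until it comes back empty. A minimal sketch of that loop as a reusable generator follows. The names `iterate_results`, `fake_fetch`, and `FAKE_PAGES` are hypothetical, and the fake pages stand in for real API responses so the example runs offline.

```python
from typing import Callable, Iterator


def iterate_results(fetch_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every result across all pages.

    `fetch_page(cursor)` must return a parsed response shaped like
    {'results': [...], 'meta': {'next_cursor': <str or None>}}.
    """
    cursor = "*"
    while cursor:
        page = fetch_page(cursor)
        yield from page["results"]
        cursor = page["meta"]["next_cursor"]


# Fake responses for illustration; the last page is empty with no next cursor
FAKE_PAGES = {
    "*": {"results": [{"id": "W1"}, {"id": "W2"}], "meta": {"next_cursor": "abc"}},
    "abc": {"results": [{"id": "W3"}], "meta": {"next_cursor": "def"}},
    "def": {"results": [], "meta": {"next_cursor": None}},
}
calls = []


def fake_fetch(cursor: str) -> dict:
    calls.append(cursor)  # record each simulated API call
    return FAKE_PAGES[cursor]


ids = [work["id"] for work in iterate_results(fake_fetch)]
print(ids, len(calls))
```

In the notebook, `fetch_page` would be something like `lambda c: session.get(url_template.format(c)).json()`, where `url_template` is whichever URL-with-cursor string the function builds.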

### Let's run the whole thing and time it

In [31]:
## MAIN 
#%%time

start = datetime.datetime.now()
print("Started at: ", start)

# get all publications
filtered_works_url = build_url(ror_id, from_publication_date, to_publication_date, email)
#print("URL for publications: " + filtered_works_url)
publications, pub2ref = get_publications(filtered_works_url)

# test case had num_cited_references.sum() of 8369, but of those only 8144 were unique. ~3% duplication
# took 125 seconds for 8144 references, or pulling 65/second
# turn off printing every reference URL and you get 84/second

# store publications
print(f"retrieved {len(publications)} publications")
pubs_only = pd.DataFrame(publications, columns =['publication_id',
                                                 'publication_doi', 
                                                 'publication_year',
                                                 'publication_title',
                                                 'publication_journal',
                                                 'publication_publisher',
                                                 'publication_journal_issn',
                                                 'num_cited_references'
                                                 ])

# report totals and estimate how long the reference pull will take (assumes ~84 references/second)
print(f"\nRetrieved {len(publications)} publications and {pubs_only['num_cited_references'].sum()} references (some could be duplicates)")

print(f"If all were unique, I estimate this will take {round(pubs_only['num_cited_references'].sum() / 84, 2)} seconds, or {round((pubs_only['num_cited_references'].sum() / 84) / 60, 2)} minutes\n")

pubs_only.to_csv('publications.csv', index=False)


# store connection pub2ref
pub2ref_df = pd.DataFrame(pub2ref, columns=['publication_id', 'reference_id'])

# .csv format probably okay here, human readable
pub2ref_df.to_csv('pub2ref.csv', index=False)
pub2ref_df

# get references
references = get_references(pub2ref)

# store references
print(f"retrieved {len(references)} references")
refs_only = pd.DataFrame(set(references), columns =['reference_id',
                                                 'reference_doi', 
                                                 'reference_year',
                                                 'reference_title',
                                                 'reference_journal',
                                                 'reference_publisher',
                                                 'reference_journal_issn',
                                                 'reference_citation_count'
                                                 ])

# using parquet file format since it can be a big file, smaller size but not human readable
refs_only.to_parquet('references.parquet')

# also write a .csv so the data can be inspected by eye
refs_only.to_csv('references.csv', index=False)

end = datetime.datetime.now()
print("\nEnded at: ", end)

print(f"Took {end-start}")

Started at:  2023-02-27 12:17:35.711010
number of api calls for publications: 2
retrieved 6 publications

Retrieved 6 publications and 309 references (some could be duplicates)
If all were unique, I estimate this will take 3.68 seconds, or 0.06 minutes

number of api calls for references: 3
retrieved 309 references

Ended at:  2023-02-27 12:17:36.875750
Took 0:00:01.164740
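
The time estimate printed above comes from simple arithmetic on the test-case figures quoted in the code comments (8369 cited references, 8144 unique, 125 seconds, and an observed 84 references/second with URL printing turned off). A short worked version, where `estimate_seconds` is a hypothetical helper name:

```python
def estimate_seconds(num_references: int, refs_per_second: float = 84) -> float:
    """Rough runtime estimate, assuming every reference is unique."""
    return num_references / refs_per_second


# Figures from the test case described in the notebook's comments
total_refs, unique_refs, elapsed_seconds = 8369, 8144, 125

duplication = (total_refs - unique_refs) / total_refs
rate = unique_refs / elapsed_seconds

print(f"duplication: {duplication:.1%}")                              # 2.7%
print(f"rate: {rate:.0f} references/second")                          # 65
print(f"estimate for 309 references: {estimate_seconds(309):.2f} s")  # 3.68
```

The actual run finished faster than the estimate (about 1.2 seconds), so the 84 references/second figure is best treated as a rough guide rather than a guarantee.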
