<h1>OpenAlex Cited References - Sonification and Auditory Displays</h1>
<p>Notebook that queries the OpenAlex database of journal articles and citations, with the goal of producing a citation network analysis.</p>
<p>Produced by Chris Harrison. Adapted from: https://github.com/eschares/OpenAlex-CitedReferences/tree/main with thanks to Eric Schares, Iowa State University; [eschares.github.io](eschares.github.io) and Sandra Mierz; [https://github.com/smierz](https://github.com/smierz)</p>

In [None]:
# needed dependencies
import requests
import pandas as pd
import pyarrow

In [None]:
email = "ui@openalex.org" #needed for OpenAlex API query
data_folder = '../data/openAlex/test'

<h2>Creating the API query</h2>
    <p>For now, we build the query manually on the OpenAlex website and then choose the option "Show API query". I manually changed the value of per_page to 200, the maximum number of items per page, which keeps the number of pages we have to loop through as small as possible.</p>

In [None]:
api_query = "https://api.openalex.org/works?page=1&filter=title_and_abstract.search:Sonification&sort=relevance_score:desc&per_page=200&mailto=ui@openalex.org"

### Set up the query
The OpenAlex API returns results in pages, and basic paging only reaches the first 10,000 results. Therefore, we use [cursor paging](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging) to access as many records as we want.

In [None]:
def get_metadata_using_cursor_paging(openalex_url):
    session = requests.Session()

    # take original API url (which already sets per_page=200) and add a placeholder for cursor
    openalex_url_with_cursor = openalex_url + '&cursor={}'

    # loop through pages
    cursor = '*'
    while cursor:
        # set cursor value and request page from OpenAlex
        url = openalex_url_with_cursor.format(cursor)
        print(url)
        page_with_results = session.get(url).json()

        # update cursor to meta.next_cursor
        cursor = page_with_results['meta']['next_cursor']

        # return page results to user to process
        results = page_with_results['results']
        yield results
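
As a rough sketch of the same cursor loop with a little more defensiveness, the variant below passes the cursor through `params`, sets a timeout, and raises on HTTP errors. The names `iter_cursor_pages`, `FakeResponse`, and `FakeSession` are hypothetical and not part of the original notebook; the fake session only stands in for `requests.Session` so the loop can be tried offline with canned pages.

```python
def iter_cursor_pages(session, url, timeout=30):
    """Yield each non-empty page of results until no next cursor is returned."""
    cursor = "*"
    while cursor:
        response = session.get(url, params={"cursor": cursor}, timeout=timeout)
        response.raise_for_status()  # stop early on HTTP errors
        page = response.json()
        cursor = page["meta"].get("next_cursor")
        if page["results"]:
            yield page["results"]


class FakeResponse:
    """Hypothetical stand-in for a requests.Response object."""
    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass

    def json(self):
        return self._payload


class FakeSession:
    """Hypothetical offline stand-in for requests.Session with canned pages."""
    def __init__(self):
        self.pages = {
            "*": {"meta": {"next_cursor": "abc"}, "results": [{"id": "W1"}, {"id": "W2"}]},
            "abc": {"meta": {"next_cursor": "def"}, "results": [{"id": "W3"}]},
            "def": {"meta": {"next_cursor": None}, "results": []},
        }

    def get(self, url, params=None, timeout=None):
        return FakeResponse(self.pages[params["cursor"]])


# placeholder URL; the fake session ignores it
pages = list(iter_cursor_pages(FakeSession(), "https://api.openalex.org/works"))
```

With a real `requests.Session()` in place of `FakeSession()`, the same generator could be used with `api_query`.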

### Extract the metadata we want

Each [work object](https://docs.openalex.org/api-entities/works/work-object) in the `results` section (there is one work object per article or other output returned by the query) comes with a lot of metadata describing the work. For our project we will look at a subset of these fields: <br>
1. [id](https://docs.openalex.org/api-entities/works/work-object#id), <br>
2. [doi](https://docs.openalex.org/api-entities/works/work-object#title), <br>
3. [publication_year](https://docs.openalex.org/api-entities/works/work-object#publication_year) (self-explanatory), <br>
4. [title](https://docs.openalex.org/api-entities/works/work-object#title-1),<br>
5. [primary_location](https://docs.openalex.org/api-entities/works/work-object#primary_location) (display_name, publisher, issn_l, type of work in the primary location etc.), <br>
6. [referenced_works](https://docs.openalex.org/api-entities/works/work-object#referenced_works) (list of works that this work cites, using the OpenAlex IDs).

For now we will simply count the referenced works, but note that it might be interesting to check that this count equals the explicit attribute [referenced_works_count](https://docs.openalex.org/api-entities/works/work-object#type).
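
A minimal sketch of that consistency check could look like the following. The helper name `reference_count_matches` and the `sample_work` record are made up for illustration; the only assumption about the API is that a work carries both `referenced_works` and `referenced_works_count`.

```python
def reference_count_matches(work):
    """Return True when the explicit count agrees with the length of the ID list."""
    referenced = work.get("referenced_works") or []
    return len(referenced) == work.get("referenced_works_count")


# made-up record that mimics the relevant part of a work object
sample_work = {
    "id": "https://openalex.org/W1",
    "referenced_works": ["https://openalex.org/W2", "https://openalex.org/W3"],
    "referenced_works_count": 2,
}

is_consistent = reference_count_matches(sample_work)
```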

We will use the following function to extract the information we are interested in for each work:

In [None]:
def extract_selected_fields(openalex_work):
    return (  openalex_work['id'],
              openalex_work['doi'],
              openalex_work['publication_year'],
              openalex_work['title'],
             # openalex_work['primary_location']['source']['type'],  # type of the primary location's source; disabled because this lookup does not work correctly yet
              len(openalex_work['referenced_works'])  # number of referenced works
            )
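
One possible reason the commented-out lookup fails is that `primary_location` or its `source` may be null for some works, in which case chained indexing raises a `TypeError`. The helper below is a hedged sketch of a null-safe lookup; `get_nested` is a hypothetical name, and the two records are invented for illustration.

```python
def get_nested(record, *keys, default=None):
    """Walk nested dicts key by key, returning `default` if any level is missing or null."""
    current = record
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
    return default if current is None else current


# invented records: one with a full primary_location, one where it is null
work_with_source = {"primary_location": {"source": {"type": "journal"}}}
work_without_location = {"primary_location": None}

source_type = get_nested(work_with_source, "primary_location", "source", "type")
missing_type = get_nested(work_without_location, "primary_location", "source", "type")
```

Such a helper could replace the commented-out line inside `extract_selected_fields`, provided the list of column names passed to `store_in_file` gains a matching entry.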

### Get all of the cited works
Additionally we want a list of all of the [referenced_works](https://docs.openalex.org/api-entities/works/work-object#referenced_works) for each work. This comes in the convenient form of a list of OpenAlex IDs, so we can get a list of cited OpenAlex IDs for each of the IDs of the citing work. Let's set up a function to do this.

In [None]:
def extract_references(work):
    return [(work['id'], ref) for ref in work['referenced_works']]
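
To make the output shape concrete, here is the same function applied to an invented work record (`toy_work` is not real OpenAlex data). Each tuple is one citing-to-cited edge.

```python
def extract_references(work):
    """Return one (citing_id, cited_id) tuple per referenced work."""
    return [(work["id"], ref) for ref in work["referenced_works"]]


# invented record with two outgoing citations
toy_work = {
    "id": "https://openalex.org/W1",
    "referenced_works": ["https://openalex.org/W10", "https://openalex.org/W11"],
}

edges = extract_references(toy_work)
```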

### Storing the data locally
To avoid having to re-run the API query, let's write a small function that stores the data as CSV or, for any other file name, as Parquet.

In [None]:
def store_in_file(data, column_names, filename):
    data_in_df = pd.DataFrame(data, columns=column_names)
    if filename.endswith("csv"):
        data_in_df.to_csv(filename, index=False)
    else:
        data_in_df.to_parquet(filename)

### How many works match the query?
Let's see **how many publications in OpenAlex** match the API query and how many requests we need to make to the OpenAlex API to query them all using [paging](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging):  
*Note: we are setting the `per_page` parameter to its maximum value 200 to reduce the number of API calls we will have to make.*

It would be a good idea to compare the count returned by this API query with the count shown for the original query on the website, to make sure they are consistent.

In [None]:
api_response = requests.get(api_query)
parsed_response = api_response.json()

count = parsed_response['meta']['count']
print(f"number of publications: {count}")

per_page = 200
number_of_pages_needed = int(count / per_page) + (count % per_page > 0)
print(f"number of requests needed (with per_page set to {per_page}): {number_of_pages_needed}")
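
The page calculation above is a ceiling division. As a small sketch, it can also be written with integer arithmetic only; `pages_needed` is a hypothetical helper name.

```python
def pages_needed(count, per_page):
    """Ceiling division: the number of pages required to hold `count` items."""
    return (count + per_page - 1) // per_page


# a few boundary cases around the page size of 200
examples = {n: pages_needed(n, 200) for n in (0, 1, 200, 201, 1000)}
```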

## Running the query

Let's put all the pieces together and time the process:
* request each page for the publication list using cursor paging
* extract selected attributes from each publication
* extract the connections between a publication and its referenced works
* store publication metadata in a file called "publications.csv" within the project's data folder
* store connections from publications to their references in a file called "pub2ref.csv" within the project's data folder

In [None]:
%%time

# get all publications
publications = [] #list of publications and their data
pub2ref = [] #connection between publication and its references

results_per_page = get_metadata_using_cursor_paging(api_query)
for results in results_per_page:
    for work in results:
        publications.append(extract_selected_fields(work))
        pub2ref.extend(extract_references(work))

# store publications
store_in_file(publications, 
                  ['publication_id', 'publication_doi', 'publication_year', 'publication_title', 'num_cited_references'], 
                  f'{data_folder}/publications.csv')

# store connections from publications to their references
store_in_file(pub2ref, ['publication_id', 'reference_id'], f'{data_folder}/pub2ref.csv')

**This might already be in the form I want, but note that duplicates have been removed!**

In [None]:
pub2ref_df = pd.read_csv(f'{data_folder}/pub2ref.csv')
print("Number of connections:", len(pub2ref_df))
pub2ref_df.head(10)
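
Since `pub2ref` is effectively an edge list, a first network-style question (which references are cited most often within the set?) can be answered by counting the second element of each tuple. This is only a sketch on invented edges, using the standard library's `Counter`.

```python
from collections import Counter

# invented (citing, cited) pairs in the same shape as pub2ref
toy_pub2ref = [("W1", "W10"), ("W1", "W11"), ("W2", "W10"), ("W3", "W10")]

# count how often each reference appears as a citation target
cited_counts = Counter(cited for _, cited in toy_pub2ref)
most_cited = cited_counts.most_common(1)
```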

----
To retrieve the cited references for each publication, we can make use of the `pub2ref` table which stores the OpenAlex ID of each reference. 
However, it's important to note that multiple publications may cite the same reference, which can result in redundant API calls.
Therefore, we should first **remove any duplicate entries**:

In [None]:
ref_ids = [p2r[1] for p2r in pub2ref]
print(f'number of references in pub2ref: {len(ref_ids)}')
                                                   
unique_ref_ids = list(dict.fromkeys(ref_ids))
print(f'number of unique references in pub2ref: {len(unique_ref_ids)}')
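
A small illustration of why `dict.fromkeys` is used here: it removes duplicates while keeping the order of first appearance, which a plain `set` does not guarantee. The IDs are invented.

```python
# invented reference IDs with repeats
toy_ref_ids = ["W3", "W1", "W3", "W2", "W1"]

# dict keys are unique and keep insertion order
toy_unique_ids = list(dict.fromkeys(toy_ref_ids))
```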

--- 

To fetch the metadata belonging to multiple OpenAlex IDs efficiently, we can follow the [**approach outlined in the OurResearch blog**](https://blog.ourresearch.org/fetch-multiple-dois-in-one-openalex-api-request/).  
This involves constructing a URL querying the OpenAlex API for up to 50 OpenAlex IDs in a single API request. Note that we also set the `per_page` parameter to 50 to match the number of results (default of `per_page` is 25, which would take two API calls otherwise).

In [None]:
def build_url_for_references(openalex_ids, per_page, mailto):
    # specify endpoint
    endpoint = 'works'

    # build the 'filter' parameter
    openalex_only_ids = [openalex_id.replace("https://openalex.org/", "") for openalex_id in openalex_ids]
    filters = f'openalex:{"|".join(openalex_only_ids)}'
    
    # put the URL together
    return f'https://api.openalex.org/{endpoint}?filter={filters}&per_page={per_page}&mailto={mailto}'
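
To see what the function produces, the sketch below repeats it and builds a URL for two invented IDs. The address `you@example.org` is a placeholder, not a real contact.

```python
def build_url_for_references(openalex_ids, per_page, mailto):
    """Build a works query that filters on several OpenAlex IDs joined with '|'."""
    endpoint = "works"
    only_ids = [openalex_id.replace("https://openalex.org/", "") for openalex_id in openalex_ids]
    filters = "openalex:" + "|".join(only_ids)
    return f"https://api.openalex.org/{endpoint}?filter={filters}&per_page={per_page}&mailto={mailto}"


example_url = build_url_for_references(
    ["https://openalex.org/W1", "https://openalex.org/W2"],  # invented IDs
    50,
    "you@example.org",  # placeholder address
)
```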

To use this approach we also need to **slice the list of OpenAlex IDs into pieces of 50 IDs**, which we will put into the URL that we use to **query the OpenAlex API**. 

In [None]:
def get_references(reference_ids, mailto):
    chunk_size = 50
    session = requests.Session()
    
    for i in range(0, len(reference_ids), chunk_size):
        ref_ids_slice = reference_ids[i:i + chunk_size]
        references_url = build_url_for_references(ref_ids_slice, chunk_size, mailto)
        
        page_with_results = session.get(references_url)
        results = page_with_results.json()['results']
        yield results
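
The slicing step on its own can be sketched without any network access. `chunked` is a hypothetical helper name, and the 120 IDs are generated just to show that the last slice may be shorter than 50.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 120 invented IDs split into slices of 50
toy_ids = [f"W{n}" for n in range(1, 121)]
slice_sizes = [len(chunk) for chunk in chunked(toy_ids, 50)]
```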

---
Let's see **how many requests** we need to make to the OpenAlex API to retrieve all references:

In [None]:
count = len(unique_ref_ids)
per_page = 50
number_of_pages_needed = int(count / per_page) + (count % per_page > 0)
print(f"number of requests needed (with per_page set to {per_page}): {number_of_pages_needed}")

To get a sense of progress, we will add a `print` statement that notifies us after every 30 requests.

---

We put the parts together again like this:
* we take all of the references' OpenAlex IDs from `pub2ref` and remove duplicates
* we divide the remaining OpenAlex IDs into slices of size 50 and pipe them together using the OR operator as described in the OurResearch blog post
* with the URLs we call the OpenAlex API to get the references' metadata
* from the references we extract the same fields as we did before for publications (see function `extract_selected_fields(openalex_work)`)
* we store the metadata for references in a compressed file called "references.parquet" within the project's data folder

In [None]:
%%time

# call OpenAlex API 
references = []
results_per_page = get_references(unique_ref_ids, email)
for i, results in enumerate(results_per_page):
    if i % 30 == 0: print(f'{i} requests sent')
    for work in results:
        # extract fields
        references.append(extract_selected_fields(work))

# store references
store_in_file(references, 
                  # the last field is len(referenced_works), i.e. how many works each reference itself cites
                  ['reference_id','reference_doi','reference_year','reference_title','reference_num_cited_references'],
                  f'{data_folder}/references.parquet')

---

## Data download complete

Now we can use the stored files to analyze only the list of **publications** ...
or only their **references** ...

In [None]:
# references only
refs_df = pd.read_parquet(f'{data_folder}/references.parquet')
print("Size of references data frame:",refs_df.shape)
refs_df.head(10)

In [None]:
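# publications and their references
pubs_df = pd.read_csv(f'{data_folder}/publications.csv')  # load the stored publications
pub2ref_df = pd.read_csv(f'{data_folder}/pub2ref.csv')

# attach each publication's reference IDs, then the metadata of those references
df = pubs_df.join(pub2ref_df.set_index('publication_id'), on='publication_id')
df = df.join(refs_df.set_index('reference_id'), on='reference_id')
df.head(10)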
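
As a toy illustration of this two-step join, the sketch below uses `merge` on three tiny invented tables instead of the stored files. Using `merge` with `how="left"` is a swapped-in alternative to `join`, not the notebook's own method; a left join keeps publications even when a reference's metadata is missing.

```python
import pandas as pd

# invented stand-ins for publications, pub2ref, and references
pubs = pd.DataFrame({"publication_id": ["W1", "W2"], "publication_title": ["A", "B"]})
edges = pd.DataFrame({"publication_id": ["W1", "W1", "W2"], "reference_id": ["W10", "W11", "W10"]})
refs = pd.DataFrame({"reference_id": ["W10", "W11"], "reference_title": ["X", "Y"]})

# publications -> edge list -> reference metadata
linked = (
    pubs.merge(edges, on="publication_id", how="left")
        .merge(refs, on="reference_id", how="left")
)
```

The result has one row per citing/cited pair, with the publication's columns repeated for each of its references.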

<br>
<hr>

or if we want to get the original connected dataset we can **join publications with their references** using the pub2ref table:

In [None]:
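# start from the stored publications, attach their reference IDs, then the reference metadata
pubs_df = pd.read_csv(f'{data_folder}/publications.csv')
df = pubs_df.join(pub2ref_df.set_index('publication_id'), on='publication_id',lsuffix='_left',rsuffix='_right')
df = df.join(refs_df.set_index('reference_id'), on='reference_id',lsuffix='_left',rsuffix='_right')
df.head(10)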