# OpenAlex Cited References - Sonification and Auditory Displays
Notebook to query the OpenAlex database of journal articles and citations, with the goal of producing a citation network analysis.

Produced by Chris Harrison. Adapted from https://github.com/eschares/OpenAlex-CitedReferences/tree/main, with thanks to Eric Schares, Iowa State University ([eschares.github.io](https://eschares.github.io)) and Sandra Mierz ([https://github.com/smierz](https://github.com/smierz)).

In [38]:
# needed dependencies
import requests
import pandas as pd
import pyarrow

In [39]:
email = "ui@openalex.org" #needed for OpenAlex API query
data_folder = '../data/openAlex/test'

<h2>Creating the API query</h2>
    <p>For now, we build the query manually on the OpenAlex website and then choose the option "Show API query". I manually changed the value of per_page to 200. This is the maximum number of items per page, so it keeps the number of page requests we have to loop through to a minimum.</p>

In [40]:
api_query = "https://api.openalex.org/works?page=1&filter=title_and_abstract.search:Sonification&sort=relevance_score:desc&per_page=200&mailto=ui@openalex.org"
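
The same query string can also be assembled from its parts rather than copied from the website. The following is a minimal sketch: `build_works_query` and `BASE_URL` are hypothetical names that do not appear elsewhere in this notebook, and `page=1` is left out on the assumption that the cursor paging used later makes it unnecessary.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/works"  # hypothetical constant for this sketch


def build_works_query(search_term, email, per_page=200):
    """Assemble a works query URL from its parts (hypothetical helper)."""
    params = {
        "filter": f"title_and_abstract.search:{search_term}",
        "sort": "relevance_score:desc",
        "per_page": per_page,
        "mailto": email,
    }
    # keep ':' and '@' unescaped so the URL stays readable
    return f"{BASE_URL}?{urlencode(params, safe=':@')}"


example_query = build_works_query("Sonification", "ui@openalex.org")
print(example_query)
```

Building the URL this way would let the `email` variable defined above be reused instead of being typed into the query by hand.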

### Set up the query
The OpenAlex API returns results in pages. Therefore, we need to use [cursor paging](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging) to access as many records as we want.

In [4]:
def get_metadata_using_cursor_paging(openalex_url):
    session = requests.Session()

    # take original API url and add a placeholder for cursor
    openalex_url_with_cursor = openalex_url + '&per_page=200&cursor={}'

    # loop through pages
    cursor = '*'
    while cursor:
        # set cursor value and request page from OpenAlex
        url = openalex_url_with_cursor.format(cursor)
        print(url)
        page_with_results = session.get(url).json()

        # update cursor to meta.next_cursor
        cursor = page_with_results['meta']['next_cursor']

        # return page results to user to process
        results = page_with_results['results']
        yield results
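
The cursor loop can be tried out without touching the network. The sketch below reproduces only the loop logic; `iterate_pages`, `fetch_page` and `fake_pages` are hypothetical stand-ins, and the fake responses assume that the last page carries an empty result list and a null `next_cursor`.

```python
def iterate_pages(fetch_page):
    """Yield one list of results per page until no cursor is returned.

    fetch_page is a hypothetical callable that takes a cursor string and
    returns a dict shaped like an OpenAlex response.
    """
    cursor = "*"  # '*' asks for the first page
    while cursor:
        page = fetch_page(cursor)
        cursor = page["meta"]["next_cursor"]  # None ends the loop
        yield page["results"]


# Fake responses standing in for the real API (made-up cursors and IDs)
fake_pages = {
    "*": {"meta": {"next_cursor": "abc"}, "results": ["W1", "W2"]},
    "abc": {"meta": {"next_cursor": "def"}, "results": ["W3"]},
    "def": {"meta": {"next_cursor": None}, "results": []},
}

collected = [
    work
    for results in iterate_pages(fake_pages.__getitem__)
    for work in results
]
print(collected)
```

Under that assumption the generator yields one more page than there are pages of data, because the final request is the one that reports the missing cursor.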

### Extract the metadata we want

Each [work object](https://docs.openalex.org/api-entities/works/work-object) in the `results` section (we get one work object per article/output matched by the query) comes with a lot of metadata describing the work. For our project we will look at a subset of these: <br>
1. [id](https://docs.openalex.org/api-entities/works/work-object#id), <br>
2. [doi](https://docs.openalex.org/api-entities/works/work-object#doi), <br>
3. [publication_year](https://docs.openalex.org/api-entities/works/work-object#publication_year), (self-explanatory) <br>
4. [title](https://docs.openalex.org/api-entities/works/work-object#title-1),<br>
5. [primary_location](https://docs.openalex.org/api-entities/works/work-object#primary_location) (display_name, publisher, issn_l, type of work in the primary location etc.) <br>
6. [referenced_works](https://docs.openalex.org/api-entities/works/work-object#referenced_works) (list of works that this work cites, using the OpenAlex IDs).
For now we will simply count the referenced works, but note it might be interesting to check that this equals the explicit attribute: [referenced_works_count](https://docs.openalex.org/api-entities/works/work-object#referenced_works_count).

We will use the following function to extract the information we are interested in for each work:

In [16]:
def extract_selected_fields(openalex_work):
    return (  openalex_work['id'],
              openalex_work['doi'],
              openalex_work['publication_year'],
              openalex_work['title'],
             # openalex_work['primary_location']['source']['title'],# this is the name of the source (e.g. the journal) -> not calling this correctly at the moment
             # openalex_work['primary_location']['source']['type'],# this is the type of the source -> not calling this correctly at the moment
              len(openalex_work['referenced_works']) #this counts the number of referenced works
            )
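
One possible way to read the two source fields that are commented out above is sketched below. It rests on two assumptions that should be checked against the OpenAlex documentation: that the source name is stored under `display_name` rather than `title`, and that `primary_location` or its `source` can be null for some works. `extract_source_fields` and the toy works are hypothetical.

```python
def extract_source_fields(openalex_work):
    """Return (source name, source type), using None where data is missing."""
    # 'or {}' turns a missing or null entry into an empty dict
    primary_location = openalex_work.get("primary_location") or {}
    source = primary_location.get("source") or {}
    return source.get("display_name"), source.get("type")


# Toy work objects (made up) covering the cases the helper guards against
with_source = {
    "primary_location": {
        "source": {"display_name": "Example Journal", "type": "journal"}
    }
}
without_source = {"primary_location": {"source": None}}
no_location = {"primary_location": None}

print(extract_source_fields(with_source))
print(extract_source_fields(without_source))
print(extract_source_fields(no_location))
```

Returning `None` instead of raising keeps the row count equal to the number of works, so the missing values can be dealt with later in the DataFrame.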

### Get all of the cited works
Additionally, we want a list of all of the [referenced_works](https://docs.openalex.org/api-entities/works/work-object#referenced_works) for each work. This comes in the convenient form of a list of OpenAlex IDs, so for each citing work we can pair its ID with the ID of every work it cites. Let's set up a function to do this.

In [17]:
def extract_references(work):
    return [(work['id'], ref) for ref in work['referenced_works']]
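
To see the shape of the output, the function can be applied to a toy work. The IDs below are made up for illustration and the function is repeated so the example runs on its own.

```python
def extract_references(work):
    # one (citing ID, cited ID) pair per referenced work
    return [(work['id'], ref) for ref in work['referenced_works']]


# Toy works with made-up OpenAlex IDs
toy_work = {
    "id": "https://openalex.org/W1",
    "referenced_works": ["https://openalex.org/W2", "https://openalex.org/W3"],
}
work_without_references = {"id": "https://openalex.org/W4", "referenced_works": []}

edges = extract_references(toy_work)
print(edges)
print(extract_references(work_without_references))
```

Each tuple is one directed edge from the citing work to the cited work. A work with no references contributes no edges.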

### Storing the data locally
To avoid having to re-run the API query, let's make a small function to store the data.

In [18]:
def store_in_file(data, column_names, filename):
    data_in_df = pd.DataFrame(data, columns=column_names)
    if filename.endswith(".csv"):
        data_in_df.to_csv(filename, index=False)
    else:
        data_in_df.to_parquet(filename)

### How many works match the query?
Let's see **how many publications in OpenAlex** match the API query and how many requests we need to make to the OpenAlex API to query them all using [paging](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging):  
*Note: we are setting the `per_page` parameter to its maximum value 200 to reduce the number of API calls we will have to make.*

It would be a good idea to compare the count returned by this API query with the count shown for the original query on the website, for consistency.

In [19]:
api_response = requests.get(api_query)
parsed_response = api_response.json()

count = parsed_response['meta']['count']
print(f"number of publications: {count}")

per_page = 200
number_of_pages_needed = int(count / per_page) + (count % per_page > 0)
print(f"number of requests needed (with per_page set to {per_page}): {number_of_pages_needed}")

number of publications: 4856
number of requests needed (with per_page set to 200): 25
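
The page count is a ceiling division of the number of works by the page size. A small sketch with a hypothetical helper, `pages_needed`, checked against the figures printed above:

```python
def pages_needed(count, per_page=200):
    """Ceiling division: pages required to hold `count` items."""
    # floor division on the negated count rounds up
    return -(-count // per_page)


print(pages_needed(4856))  # 4856 works at 200 per page
print(pages_needed(400))   # an exact multiple needs no extra page
```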


## Running the query

Let's put all the pieces together and time the process:
* request each page for the publication list using cursor paging
* extract selected attributes from each publication
* extract the connections between a publication and its referenced works
* store publication metadata in a file called "publications.csv" within the project's data folder
* store connections from publications to their references in a file called "pub2ref.csv" within the project's data folder

You will see one line of output for each request (the number of requests needed was calculated above).

In [14]:
%%time

# get all publications
publications = [] #list of publications and their data
pub2ref = [] #connection between publication and its references

results_per_page = get_metadata_using_cursor_paging(api_query)
counter = 0
for results in results_per_page:
    for work in results:
        publications.append(extract_selected_fields(work))
        pub2ref.extend(extract_references(work))
    counter += 1
    print('Query:',counter,'/',number_of_pages_needed)

# store publications
store_in_file(publications, 
                  ['publication_id', 'publication_doi', 'publication_year', 'publication_title', 'num_cited_references'], 
                  f'{data_folder}/publications.csv')

# store connections from publications to their references
store_in_file(pub2ref, ['publication_id', 'reference_id'], f'{data_folder}/pub2ref.csv')

https://api.openalex.org/works?page=1&filter=title_and_abstract.search:Sonification&sort=relevance_score:desc&per_page=200&mailto=ui@openalex.org&per_page=200&cursor=*
1 / 25
https://api.openalex.org/works?page=1&filter=title_and_abstract.search:Sonification&sort=relevance_score:desc&per_page=200&mailto=ui@openalex.org&per_page=200&cursor=IlsyMzMuMjU1MjMsIDk5LjAsIDMwLCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzI0MDMzMDA4NzQnXSI=
2 / 25
https://api.openalex.org/works?page=1&filter=title_and_abstract.search:Sonification&sort=relevance_score:desc&per_page=200&mailto=ui@openalex.org&per_page=200&cursor=IlsxNzkuNzA5NjQsIDk4LjAsIDE1LCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzIzMDQ5MDM4MDInXSI=
3 / 25
https://api.openalex.org/works?page=1&filter=title_and_abstract.search:Sonification&sort=relevance_score:desc&per_page=200&mailto=ui@openalex.org&per_page=200&cursor=IlsxNDkuMjc4MjMsIDk2LjAsIDcsICdodHRwczovL29wZW5hbGV4Lm9yZy9XMjI5Mjk5ODc1MCddIg==
4 / 25
https://api.openalex.org/works?page=1&filter=title_and_abstract

**This might already be in the form I want. However, it may be worth joining each publication_id back to its original metadata so we can interpret the communities in the next analysis, done in sonification_citation_analysis.**

In [37]:
pub2ref_df = pd.read_csv(f'{data_folder}/pub2ref.csv')
full_list = pd.concat([pub2ref_df['publication_id'],pub2ref_df['reference_id']])
n = len(pd.unique(full_list))
print("Number of connections (or 'edges') for our citation analysis:",pub2ref_df.shape)
print("Number of unique papers across all pairs of citations:",n)
pub2ref_df.head(10)

Number of connections (or 'edges') for our citation analysis: (74391, 2)
Number of unique papers across all pairs of citations: 48984


,publication_id,reference_id
0,https://openalex.org/W2184854514,https://openalex.org/W9119819
1,https://openalex.org/W2184854514,https://openalex.org/W35899286
2,https://openalex.org/W2184854514,https://openalex.org/W627963485
3,https://openalex.org/W2184854514,https://openalex.org/W768552817
4,https://openalex.org/W2184854514,https://openalex.org/W1499933681
5,https://openalex.org/W2184854514,https://openalex.org/W1549940068
6,https://openalex.org/W2184854514,https://openalex.org/W1565485882
7,https://openalex.org/W2184854514,https://openalex.org/W1574930552
8,https://openalex.org/W2184854514,https://openalex.org/W1577214156
9,https://openalex.org/W2184854514,https://openalex.org/W1577234506
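
The join suggested in the note above could look roughly like the following sketch. It uses small made-up DataFrames in place of publications.csv and pub2ref.csv, and every `toy_` name is hypothetical. The column names follow the ones written by `store_in_file` earlier.

```python
import pandas as pd

# Toy stand-ins for publications.csv and pub2ref.csv (made-up IDs and titles)
toy_publications = pd.DataFrame({
    "publication_id": ["https://openalex.org/W1", "https://openalex.org/W2"],
    "publication_title": ["Toy paper A", "Toy paper B"],
    "publication_year": [2019, 2021],
})
toy_pub2ref = pd.DataFrame({
    "publication_id": ["https://openalex.org/W1", "https://openalex.org/W1",
                       "https://openalex.org/W2"],
    "reference_id": ["https://openalex.org/W2", "https://openalex.org/W9",
                     "https://openalex.org/W9"],
})

# attach metadata of the citing work to each edge
toy_edges = toy_pub2ref.merge(toy_publications, on="publication_id", how="left")

# attach metadata of the cited work where we happen to have it
toy_cited = toy_publications.rename(columns={
    "publication_id": "reference_id",
    "publication_title": "reference_title",
    "publication_year": "reference_year",
})
toy_edges = toy_edges.merge(toy_cited, on="reference_id", how="left")
print(toy_edges)
```

A left join keeps every edge. Cited works that were not returned by the query (here W9) simply get missing values in the `reference_` columns, which is expected since most references fall outside the query results.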
