# Investigating OpenAlex data: cited references
#### Eric Schares, Iowa State University & Chantal Ripp, University of Ottawa

[eschares.github.io](https://eschares.github.io)

Final Project for the [Science of Science Summer School 2022](https://s4.scienceofscience.org/)
 

<div style='background:#e7edf7'>
    This notebook will query the OpenAlex API to get a set of publications, pull the cited references in their bibliographies, and answer the questions:
    <blockquote>
        <b><i>How many articles do our authors cite? When were those articles published? How recent are they?</i></b>
    </blockquote>
   </div>

 
**Context**

We would like to better understand how campus researchers use journal content. Analyzing which years our authors cite and how many papers they cite gives us a better feel for how content is being used. We can use this information as we make journal cancellation and renewal decisions.

- **Part 1**. Pull the Data from the OpenAlex API
- **Part 2**. Plot the Data
  - **2.1**. Number of references
  - **2.2**. Years of references
- **Part 3**. Results
- **Part 4**. Conclusion

---
# Part 1. Pull the Data
#### (Skip to Part 2 if you already have the data saved)

In [1]:
import pandas as pd
import requests
import plotly.express as px

### Control the number of OpenAlex results per page

In [15]:
per_page = 25

## To modify for your own use, edit the `build_filter` function to change:
- [ROR ID](https://ror.org/search?query=iowa+state) for your own institution (line 3)
- Date range (line 6)
- Email address to get into OpenAlex's polite pool for faster response times (line 7)

In [93]:
def build_filter(page):
    # build the 'filter' parameter
    filter_by_institution_id = 'institutions.ror:https://ror.org/04rswrd78'   # ROR ID for Iowa State
    filter_by_paratext = 'is_paratext:false'   # not cover, ToC, issue information, etc
    filter_by_type = 'type:journal-article'
    filter_by_publication_date = 'from_publication_date:2021-01-06,to_publication_date:2021-01-06&per-page='+str(per_page)
    my_email = 'mailto=eschares@iastate.edu'
    page = 'page='+str(page)

    all_filters = (filter_by_institution_id, filter_by_paratext, filter_by_type, filter_by_publication_date)
    filter_param = f'filter={",".join(all_filters)}'
    filter_param = filter_param + '&' + my_email + '&' + page
    #print(f'filter query parameter:\n  {filter_param}')

    # put the URL together
    total_url = f'https://api.openalex.org/works?{filter_param}'
    #print(f'complete URL:\n  {total_url}')
    return total_url

In [94]:
filtered_works_url = build_filter(1)
filtered_works_url

'https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-06,to_publication_date:2021-01-06&per-page=25&mailto=eschares@iastate.edu&page=1'

Send the API call and get a response

In [95]:
api_response = requests.get(filtered_works_url)
parsed_response = api_response.json()

How many publication ("parent") results? (Note: If larger than 10,000 this approach with pagination will not work)

In [96]:
parsed_response['meta']['count']

6
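
For result sets above that 10,000-record limit, OpenAlex documents cursor paging: send `cursor=*` on the first request, then pass back the `meta.next_cursor` value from each response until it comes back empty. The sketch below is a minimal illustration of that loop, not part of the original notebook. The names `iter_works` and `fetch` are hypothetical, and `fetch` is assumed to be any callable that takes a URL and a dict of query parameters and returns the parsed JSON.

```python
def iter_works(fetch, filter_string, email, per_page=25):
    """Yield every work matching filter_string by following OpenAlex cursors.

    fetch is a hypothetical callable: fetch(url, params) -> parsed JSON dict.
    With requests it could be:
        lambda url, params: requests.get(url, params=params).json()
    """
    cursor = '*'   # '*' requests the first page of a cursor-paged query
    while cursor:
        params = {'filter': filter_string, 'per-page': per_page,
                  'mailto': email, 'cursor': cursor}
        page = fetch('https://api.openalex.org/works', params)
        for work in page['results']:
            yield work
        # next_cursor is None once there are no more results
        cursor = page['meta'].get('next_cursor')
```

Because the function is a generator, results can be processed one at a time without holding every page in memory.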

So how many OpenAlex pages will this take at the given per_page?

In [97]:
parsed_response['meta']['count'] / per_page

0.24

Figure out how many pages we need to ask for

In [98]:
# If it takes 1.5 pages to give me all my records (for example), we'll need to ask it for 2
# To ask for 2, we'll need to set the range() to 3, since Python counts to stop-1

number_of_pages_needed = int(parsed_response['meta']['count'] / per_page) + 2
# need plus 2 to account for fractional page AND that python range stops 1 before end

# BUT if the total number of records is cleanly divisible by the per-page (say, 33.0 pages)
# we need to go back and remove one since the int() just drops the .0 and doesn't round
if (parsed_response['meta']['count'] % per_page) == 0:
    number_of_pages_needed -= 1
    
number_of_pages_needed

2
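
The same page count can be reached with a ceiling division, which avoids the special case for exact multiples. This is an alternative sketch, not the notebook's own code; `pages_needed` is a hypothetical helper name. Note that it returns the number of pages itself, so it is one less than `number_of_pages_needed` above, which is the stop value for `range()`.

```python
import math

def pages_needed(total_records, per_page):
    """Number of result pages required to cover total_records."""
    return math.ceil(total_records / per_page)

total_pages = pages_needed(6, 25)                 # 6 records at 25 per page
page_numbers = list(range(1, total_pages + 1))    # OpenAlex pages start at 1
print(total_pages, page_numbers)                  # prints: 1 [1]
```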

## Main loop - send a request, go through each page, on each page go through each result, and pull out the pieces we want
#### ---- Warning! ----

This can take quite a bit of time to run depending on the number of records and number of cited references you're asking for. Code takes about 1 minute per 300 total references, or 0.2s/reference. Assume ~40 references per paper to estimate, or 8s/paper.

In [174]:
print(f"Estimated running time: {(parsed_response['meta']['count'] * 40 * .2) / 60} minutes")

Estimated running time: 0.8 minutes


If the estimated time is very long (~hours), shorten your time frame in the `build_filter` function to run smaller chunks. Save the dfs separately, then reassemble them into one combined dataframe using `new_df = pd.concat([df1, df2])` (note that `pd.concat` takes a list of dataframes).
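
As a small illustration of reassembling chunks, the sketch below combines two toy dataframes. The column values are made up and stand in for dataframes saved from separate runs.

```python
import pandas as pd

# Two toy chunks standing in for DataFrames saved from separate runs
df1 = pd.DataFrame({'publication_doi': ['doi-a', 'doi-a'],
                    'referenced_year': [2016, 2008]})
df2 = pd.DataFrame({'publication_doi': ['doi-b'],
                    'referenced_year': [2013]})

# pd.concat takes a list of DataFrames;
# ignore_index=True renumbers the rows 0..n-1 in the combined frame
new_df = pd.concat([df1, df2], ignore_index=True)
print(new_df.shape)   # prints: (3, 2)
```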

In [100]:
# clear the lists 
publication_titles = []
publication_year = []
publication_journal = []
publication_publisher = []
publication_doi = []

reference_years = []
reference_titles = []
reference_journal = []
reference_publisher = []
reference_doi = []

summary_publication = []
summary_doi = []
summary_num_references = []

#go through all the pages of results
for page in range(1,number_of_pages_needed):   # offset already taken care of, don't need to +1 for range()
    print(f'Page {page}\n')                    # so if number_of_pages_needed is 3, variable 'page' will get 1 and 2
        
    # have to build the filter every time to ask for a new page
    filtered_works_url = build_filter(page)
    print(filtered_works_url)
    
    # Send it and get a response
    api_response = requests.get(filtered_works_url)
    parsed_response = api_response.json()

    # deal with fractional pages, need to keep track of how many leftover on the last page to avoid indexing error
    if page < (number_of_pages_needed-1):
        how_many_records = per_page
    else:  # figure out how many left on last page and only pass that many
        how_many_records = parsed_response['meta']['count'] - ((page-1) * per_page)
                                    # total records         -  pagesdone * records-per-page
        #print(f"page {page} num needed {number_of_pages_needed} how many records {how_many_records} total {parsed_response['meta']['count']} per page {per_page}")
        
    for j in range(how_many_records):  # controls number of papers to look at per page; 0 to per_page-1
        print(f"{parsed_response['results'][j]['doi']}, {len(parsed_response['results'][j]['referenced_works'])} references")
        summary_publication.append(parsed_response['results'][j]['title'])
        summary_doi.append(parsed_response['results'][j]['doi'])
        summary_num_references.append(len(parsed_response['results'][j]['referenced_works']))
        
        for i in parsed_response['results'][j]['referenced_works']:   #number of referenced works in the paper j
            splat = i.split('/')
            entity = splat[3]  #W1537479324

            single_work = requests.get('https://api.openalex.org/works/'+entity)
            parsed_single_work = single_work.json()

            #print(parsed_single_work['publication_year'])
            reference_years.append(parsed_single_work['publication_year'])
            reference_titles.append(parsed_single_work['title'])
            reference_journal.append(parsed_single_work['host_venue']['display_name'])
            reference_publisher.append(parsed_single_work['host_venue']['publisher'])
            reference_doi.append(parsed_single_work['doi'])

            publication_year.append(parsed_response['results'][j]['publication_year'])
            publication_titles.append(parsed_response['results'][j]['title'])
            publication_journal.append(parsed_response['results'][j]['host_venue']['display_name'])
            publication_publisher.append(parsed_response['results'][j]['host_venue']['publisher'])
            publication_doi.append(parsed_response['results'][j]['doi'])

Page 1

https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-06,to_publication_date:2021-01-06&per-page=25&mailto=eschares@iastate.edu&page=1
https://doi.org/10.1146/annurev-fluid-010719-060201, 102 references
https://doi.org/10.1039/d0sc05338d, 45 references
https://doi.org/10.2196/25535, 50 references
https://doi.org/10.1063/9.0000104, 19 references
https://doi.org/10.1063/5.0034936, 27 references
https://doi.org/10.1007/s00441-020-03333-3, 58 references
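
The leftover-record bookkeeping in the loop above can also be avoided by iterating over whatever the page actually returned. The sketch below shows that idea on a toy response shaped like `parsed_response`; the titles, DOIs, and work IDs are made up, and `summary_rows` is a hypothetical name.

```python
# Toy page of results shaped like parsed_response (all values are made up)
parsed_response = {
    'meta': {'count': 2},
    'results': [
        {'title': 'Paper A', 'doi': 'https://doi.org/10.0000/a',
         'referenced_works': ['https://openalex.org/W1', 'https://openalex.org/W2']},
        {'title': 'Paper B', 'doi': 'https://doi.org/10.0000/b',
         'referenced_works': []},
    ],
}

summary_rows = []
for work in parsed_response['results']:   # only as many works as the page holds
    # keep the final path segment of each reference URL, e.g. 'W1'
    reference_ids = [url.rsplit('/', 1)[-1] for url in work['referenced_works']]
    summary_rows.append({'publication_title': work['title'],
                         'publication_doi': work['doi'],
                         'num_cited_references': len(reference_ids)})

print(summary_rows)
```

A list of dicts like `summary_rows` can be passed straight to `pd.DataFrame`, which keeps the columns aligned without separate length checks.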


Check that the lengths are the same so we can put them into a dataframe later

In [102]:
print(f'Length of referenced years: {len(reference_years)}')
print(f'Length of referenced titles: {len(reference_titles)}')
print(f'Length of referenced journals: {len(reference_journal)}')

print(f'Length of publication year: {len(publication_year)}')
print(f'Length of publication titles: {len(publication_titles)}')
print(f'Length of publication journals: {len(publication_journal)}')

print(f'\nLength of summary titles: {len(summary_publication)}')
print(f'Length of summary dois: {len(summary_doi)}')
print(f'Length of summary num refs {len(summary_num_references)}')

Length of referenced years: 301
Length of referenced titles: 301
Length of referenced journals: 301
Length of publication year: 301
Length of publication titles: 301
Length of publication journals: 301

Length of summary titles: 6
Length of summary dois: 6
Length of summary num refs 6


### Put the lists together into two pandas DataFrames

In [103]:
d = {'publication':publication_titles,
     'publication_doi':publication_doi,
     'publication_year':publication_year,
     'publication_journal':publication_journal,
     'referenced_title':reference_titles,
     'reference_doi':reference_doi,
     'referenced_year':reference_years,
     'referenced_journal':reference_journal
    }
df = pd.DataFrame(data=d)
df.head(3)

Unnamed: 0,publication,publication_doi,publication_year,publication_journal,referenced_title,reference_doi,referenced_year,referenced_journal
0,X-Ray Flow Visualization in Multiphase Flows,https://doi.org/10.1146/annurev-fluid-010719-0...,2021,Annual Review of Fluid Mechanics,Industrial tomography using three different ga...,https://doi.org/10.1016/j.flowmeasinst.2015.10...,2016,Flow Measurement and Instrumentation
1,X-Ray Flow Visualization in Multiphase Flows,https://doi.org/10.1146/annurev-fluid-010719-0...,2021,Annual Review of Fluid Mechanics,Quantitative measurement of gas hold-up distri...,https://doi.org/10.1016/j.cej.2007.08.014,2008,Chemical Engineering Journal
2,X-Ray Flow Visualization in Multiphase Flows,https://doi.org/10.1146/annurev-fluid-010719-0...,2021,Annual Review of Fluid Mechanics,Gamma‐Ray Computed Tomography for Imaging of M...,https://doi.org/10.1002/cite.201200250,2013,Chemie Ingenieur Technik


In [105]:
d2 = {'publication_title':summary_publication,
      'publication_doi':summary_doi,
      'num_cited_references':summary_num_references    
}
summary_df = pd.DataFrame(data=d2)
summary_df

Unnamed: 0,publication_title,publication_doi,num_cited_references
0,X-Ray Flow Visualization in Multiphase Flows,https://doi.org/10.1146/annurev-fluid-010719-0...,102
1,Synthetic glycosidases for the precise hydroly...,https://doi.org/10.1039/d0sc05338d,45
2,Accurately Differentiating Between Patients Wi...,https://doi.org/10.2196/25535,50
3,Effect of coil positioning and orientation of ...,https://doi.org/10.1063/9.0000104,19
4,How do neuroglial cells respond to ultrasound ...,https://doi.org/10.1063/5.0034936,27
5,Pre-exposure to hydrogen sulfide modulates the...,https://doi.org/10.1007/s00441-020-03333-3,58


### Save them so you don't need to run the full loop again once you have it

In [106]:
# using parquet file format since it can be a big file, smaller size but not human readable
df.to_parquet('my_file.parquet')

In [175]:
# .csv format probably okay here, human readable
summary_df.to_csv('summary_file.csv', index=False)

---
---

# Part 2: Plot the data
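### Skip to here if you already have the df's saved.
Once you have the OpenAlex response run and parsed into pandas df's, you can start to plot. The cells below load `summary_file_7697.csv` and `my_file_7697.parquet`, which come from a separate, larger run (230 publications, 7,697 references) rather than the six-publication example above.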

## 2.1 Look at summary data first - just Title, DOI, and number of references

In [145]:
summary_df = pd.read_csv('summary_file_7697.csv', usecols=['publication_title','num_cited_references'], )
summary_df = summary_df.sort_values(by='num_cited_references')
summary_df = summary_df.reset_index(drop=True)
summary_df

Unnamed: 0,publication_title,num_cited_references
0,Optimization for L1-Norm Error Fitting via Dat...,0
1,Permittivity Extraction from Synthetic Apertur...,0
2,Effects of Splits Content on Dry Matter Loss R...,0
3,Corrigendum: Polysaccharide Biosynthesis: Glyc...,0
4,Preface to the Special Issue on Sexual Develop...,0
...,...,...
225,Differential Impact of Severity and Duration o...,157
226,Together We Rise: How Social Movements Succeed,183
227,NIH Workshop Report: sensory nutrition and dis...,183
228,Supernova 2018cuf: A Type IIP Supernova with a...,190


Average and median number of references per paper

In [147]:
summary_df.describe()

Unnamed: 0,num_cited_references
count,230.0
mean,33.465217
std,39.026316
min,0.0
25%,0.0
50%,23.0
75%,48.75
max,200.0


OpenAlex reports 0 references for some papers, even though manual inspection shows that those papers do contain references.

In [148]:
# number of publications with 0 reported references
summary_df.loc[summary_df['num_cited_references']==0].shape[0]

61

OpenAlex reports no references for about 27% of these records (61 of 230).

In [150]:
# fraction of publications with 0 reported references
summary_df.loc[summary_df['num_cited_references']==0].shape[0] / summary_df.shape[0]

0.26521739130434785

### Aside: OpenCitation data 

From Crossref's April 9, 2020 [blog post](https://www.crossref.org/blog/free-public-data-file-of-112-million-crossref-records/):

"References (i.e. authors’ cited sources) are also optional metadata. Nearly 50 million records include references and, of those, nearly 30 million have open references that are included in the data file. “Limited” and “Closed” references are not included in the data file. (EDIT 6th June 2022 - all references are now open by default with the March 2022 board vote to remove any restrictions on reference distribution)."

Initiative for Open Citations [(I4OC)](https://i4oc.org/) works to "promote the unrestricted availability of scholarly citation data."

![image.png](attachment:image.png)

---
### Make plots

In [151]:
# make all numbers same color except for 0 references
color_dict = {num:'blue' for num in summary_df['num_cited_references'] if num != 0}
color_dict[0]='lightgray'

In [152]:
fig = px.histogram(summary_df, x='num_cited_references', nbins=50,
             color='num_cited_references',
             color_discrete_map=color_dict,
             title=f'Histogram of the Number of Cited References in {summary_df.shape[0]} Publications<br>Num_references=0 shown in light gray'
)
fig.update_layout(showlegend=False)

In [153]:
# Remove publications with 0 reported references
summary_df_no_zeros = summary_df.loc[summary_df['num_cited_references']!=0]
summary_df_no_zeros

Unnamed: 0,publication_title,num_cited_references
61,Reorienting Perspectives: Why I Do Not Teach a...,1
62,"One Fish, Two Fish; Red Fish (or Green Fish?):...",2
63,Iowa State University’s English placement test...,2
64,Upholding Language Assessment Quality during t...,3
65,"Soybean Gall Midge (Diptera: Cecidomyiidae), a...",3
...,...,...
225,Differential Impact of Severity and Duration o...,157
226,Together We Rise: How Social Movements Succeed,183
227,NIH Workshop Report: sensory nutrition and dis...,183
228,Supernova 2018cuf: A Type IIP Supernova with a...,190


In [154]:
summary_df_no_zeros.describe()

Unnamed: 0,num_cited_references
count,169.0
mean,45.544379
std,39.021203
min,1.0
25%,19.0
50%,35.0
75%,57.0
max,200.0


In [155]:
px.histogram(summary_df_no_zeros, x='num_cited_references', nbins=50,
             text_auto=True,
             title=f'Histogram of the Number of Cited References in {summary_df_no_zeros.shape[0]} Publications<br>Num_references=0 *removed*')

In [156]:
px.ecdf(summary_df, x='num_cited_references', ecdfnorm='percent',
       title=f'Cumulative Distribution of the Number of Cited References in {summary_df.shape[0]} Publications')

In [157]:
px.ecdf(summary_df_no_zeros, x='num_cited_references', ecdfnorm='percent',
       title=f'Cumulative Distribution of the Number of Cited References in {summary_df_no_zeros.shape[0]} Publications<br>Num_references=0 *removed*')

---

## 2.2 Look further at the years those references were published

In [158]:
df = pd.read_parquet('my_file_7697.parquet')
df.head(3)

Unnamed: 0,publication,publication_doi,publication_year,publication_journal,referenced_title,reference_doi,referenced_year,referenced_journal
0,Crisis Management and Corporate Apology: The E...,https://doi.org/10.1177/2329488417735646,2021,International journal of business communication,The handbook of crisis communication,https://doi.org/10.1002/9781444314885,2010,Published in <b>2010</b> in Chichester UK Mald...
1,Crisis Management and Corporate Apology: The E...,https://doi.org/10.1177/2329488417735646,2021,International journal of business communication,The elaboration likelihood model of persuasion,https://doi.org/10.1016/s0065-2601(08)60214-2,1986,Advances in Experimental Social Psychology
2,Crisis Management and Corporate Apology: The E...,https://doi.org/10.1177/2329488417735646,2021,International journal of business communication,The negative communication dynamic,https://doi.org/10.1108/13632540710843913,2007,Journal of Communication Management


In [159]:
# Check distribution of number of references
df['publication_doi'].value_counts()

https://doi.org/10.1140/epjc/s10052-020-8227-9    200
https://doi.org/10.3847/1538-4357/abc417          190
https://doi.org/10.1093/ajcn/nqaa302              183
https://doi.org/10.1002/jcpy.1201                 183
https://doi.org/10.3389/fncel.2021.772868         157
                                                 ... 
https://doi.org/10.1111/1556-4029.14564             3
https://doi.org/10.1002/bes2.1812                   3
https://doi.org/10.1177/0098628320959946            2
https://doi.org/10.1080/15434303.2020.1862122       2
https://doi.org/10.1386/ijia_00033_1                1
Name: publication_doi, Length: 169, dtype: int64

### Oldest Reference is:

In [160]:
df.loc[df['referenced_year']==df['referenced_year'].min()]

Unnamed: 0,publication,publication_doi,publication_year,publication_journal,referenced_title,reference_doi,referenced_year,referenced_journal
4945,Sectional model of a prairie buffer strip in a...,https://doi.org/10.1002/agg2.20133,2021,"Agrosystems, geosciences & environment",The Relation Between the Rainfall and the Disc...,https://doi.org/10.1061/taceat.0000694,1889,Transactions of the American Society of Civil ...


### Make plots

In [161]:
px.histogram(df, x='referenced_year', nbins=200, 
             title=f'Histogram of Cited Year<br>{summary_df.shape[0]} Publications and {df.shape[0]} References')

In [162]:
px.histogram(df, x='referenced_year', nbins=200, histnorm='probability density',
            title=f'Probability Density of the Cited Year<br>{summary_df.shape[0]} Publications and {df.shape[0]} References')

In [164]:
fig5 = px.ecdf(df, x='referenced_year', ecdfnorm='percent',markers=True, lines=False,
        color_discrete_map={'red':'red', 'blue':'blue'},
       title=f'Cumulative Distribution of Year of Citation<br>{df.shape[0]} references'
)
fig5.update_layout(showlegend=False)

## Track one DOI of interest

### Add 'color' column to control the colors and change color of one DOI to track it on the plot

In [165]:
# Change DOI in this line
red_doi = 'https://doi.org/10.1177/2329488417735646'

#https://doi.org/10.1007/s11425-018-1550-5   # publication with oldest average reference at 36 years
#https://doi.org/10.1386/ijia_00033_1  - 0 average year, referenced 1 work, which is itself?
#https://doi.org/10.1016/j.trgeo.2020.100410 - 10.25 average year
#https://doi.org/10.1177/2329488417735646 - 16 average year

df['color'] = 'blue'
red_title = df.loc[df['publication_doi']==red_doi, 'publication'].iloc[0]
red_title

'Crisis Management and Corporate Apology: The Effects of Causal Attribution and Apology Type on Publics’ Cognitive and Affective Responses:'

In [166]:
# Change color for that DOI to red
filt = (df['publication_doi'] == red_doi)
df.loc[filt,'color'] = 'red'

In [167]:
# Double check the number that you changed to red, should match number of references in that DOI
df['color'].value_counts()

blue    7656
red       41
Name: color, dtype: int64

In [168]:
fig2 = px.histogram(df, x='referenced_year', color='color', nbins=200,
             title=f'Years when Cited References were published<br>Red: "{red_title}"',
             hover_data={'color':False,
                         'referenced_title':True},
             color_discrete_map={'red':'red', 'blue':'blue'},
             category_orders={"color":['blue','red']}
)
fig2.update_layout(showlegend=False)

In [169]:
fig3 = px.box(df, x='referenced_year', points='all', color='color', notched=True,
       title=f'Years when Cited References were published<br>Red: "{red_title}"',
       hover_data={'color':False,
                    'referenced_title':True,
                   'publication_year':True,      
                   'publication':True},
       color_discrete_map={'red':'red', 'blue':'blue'},
       category_orders={"color":['blue','red']}
)
fig3.update_layout(showlegend=False)

In [170]:
fig4 = px.ecdf(df, x='referenced_year', color='color', ecdfnorm='percent',markers=True, lines=False,
        color_discrete_map={'red':'red', 'blue':'blue'},
       title=f'Cumulative Distribution of Year of Citation in {df.shape[0]} References<br>Red: "{red_title}"'
)
fig4.update_layout(showlegend=False)

## Calculate the year delta, or how many years old a reference was when it got cited

In [171]:
df['year_delta'] = df['publication_year'] - df['referenced_year']

In [172]:
df.head(3)

Unnamed: 0,publication,publication_doi,publication_year,publication_journal,referenced_title,reference_doi,referenced_year,referenced_journal,color,year_delta
0,Crisis Management and Corporate Apology: The E...,https://doi.org/10.1177/2329488417735646,2021,International journal of business communication,The handbook of crisis communication,https://doi.org/10.1002/9781444314885,2010,Published in <b>2010</b> in Chichester UK Mald...,red,11
1,Crisis Management and Corporate Apology: The E...,https://doi.org/10.1177/2329488417735646,2021,International journal of business communication,The elaboration likelihood model of persuasion,https://doi.org/10.1016/s0065-2601(08)60214-2,1986,Advances in Experimental Social Psychology,red,35
2,Crisis Management and Corporate Apology: The E...,https://doi.org/10.1177/2329488417735646,2021,International journal of business communication,The negative communication dynamic,https://doi.org/10.1108/13632540710843913,2007,Journal of Communication Management,red,14


In [174]:
df['year_delta'].describe()

count    7697.000000
mean       11.203196
std        10.959907
min        -1.000000
25%         4.000000
50%         8.000000
75%        14.000000
max       132.000000
Name: year_delta, dtype: float64

### Group by publication, get one number per publication that shows the average age of its references

In [175]:
df2 = df.groupby('publication')['year_delta'].mean().to_frame(name='avg_year_delta')
df2.sample(5)  # 5 random results

Unnamed: 0_level_0,avg_year_delta
publication,Unnamed: 1_level_1
Assessing benefits of artificial drainage on soybean yield in the North Central US region,14.041667
Characterization of MOCVD regrown p-GaN and the interface properties for vertical GaN power devices,10.057143
Dietary nucleotide supplementation as an alternative to in-feed antibiotics in weaned piglets,11.857143
"Known Distribution of the Soybean Cyst Nematode, <i>Heterodera glycines</i>, in the United States and Canada in 2020",9.75
Nonlinear Multiple Models Adaptive Secondary Voltage Control of Microgrids,7.823529


In [176]:
df2 = df2.sort_values(by='avg_year_delta')
df2

Unnamed: 0_level_0,avg_year_delta
publication,Unnamed: 1_level_1
Reorienting Perspectives: Why I Do Not Teach a Course Titled ‘Islamic Architecture’,0.000000
From the Editors: Introduction to Managing Supply Chains Beyond Covid‐19 ‐ Preparing for the Next Global Mega‐Disruption,0.142857
Mechanical and fracture properties of steel fiber-reinforced geopolymer concrete,2.120000
Iowa State University’s English placement test of oral communication in times of COVID-19,2.500000
Upholding Language Assessment Quality during the COVID-19 Pandemic: Some Final Thoughts and Questions,2.666667
...,...
"Glass transition temperature studies of planetary ball milled glasses: Accessing the rapidly cooled glassy state in Na4P2S7-xOx, 0 ≤ x ≤ 7, Oxy-thio phosphate glasses",24.742857
Impacts of agricultural price support policy on price variability and welfare: Evidence from China's soybean market,25.783784
"Soybean Gall Midge (Diptera: Cecidomyiidae), a New Species Causing Injury to Soybean in the United States",27.666667
Identification and Biology of Common Caterpillars in U.S. Soybean,30.111111


In [177]:
df2['avg_year_delta'].describe()

count    169.000000
mean      11.779482
std        5.233166
min        0.000000
25%        8.342857
50%       11.102041
75%       14.363636
max       36.071429
Name: avg_year_delta, dtype: float64

In [178]:
px.ecdf(df2, x='avg_year_delta',
       title=f'Cumulative Distribution of the Average Age of Reference by Publication<br>{df2.shape[0]} Publications<br>'
)

---
# Part 3. Results

### Number of References
In this sample of 230 publications, 61 reported 0 references.

If we ignore those, the median number of references in the set of 169 publications was **35**, and the mean was **45.5**. This shows how the 4 papers that have many references (> 180) pull up the mean.

### Individual References
Looking at all individual references, the median age is 8 years (so a 2021 publication typically cites a paper from 2013). The mean age is about 11 years, again showing how a few very old papers pull the mean upward (the maximum age is 132 years, for a reference from 1889!).

By 2004, 20% of the references in this set had been published. By 2018, 80% had been published.
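
Cumulative shares like these can be read from the data with pandas quantiles instead of from the ECDF plot. The sketch below uses a small made-up series in place of `df['referenced_year']`, so its numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for df['referenced_year'] (values are made up)
referenced_year = pd.Series([1990, 2000, 2005, 2010, 2013, 2015,
                             2017, 2018, 2019, 2020, 2021])

# Year by which 20%, 50%, and 80% of the references had been published
year_quantiles = referenced_year.quantile([0.2, 0.5, 0.8])
print(year_quantiles)

# The reverse question: what share of references was published by 2013?
share_by_2013 = (referenced_year <= 2013).mean()
print(share_by_2013)
```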

### Summarize by Publication

Grouping by publication gives an average reference age for each publication. The mean of those per-publication averages is about 11.8 years, and the median is about 11.1 years.

80% of papers have an average year_delta of 15 years or less. The remaining 20% of papers have average reference ages spread between 15 and 36 years.

___
# Part 4. Conclusion

OpenAlex is an open scholarly metadata index / Knowledge Graph that does not require payment to use. This is in contrast to various other large scientometric data sets that are proprietary.

However, OpenAlex can only report the data it has and depends greatly on the sources of its data. Some publishers choose not to submit reference data to CrossRef, and it therefore does not show up in OpenAlex data. In the past, publishers could submit reference data to CrossRef but opt to keep it closed; this changed in June 2022 so that *any* reference data submitted to CrossRef will now be made open.

These shortcomings limit the data and the findings here. Any conclusions should be taken as a minimum, as there are undoubtedly missing references and publication data.

#### Future Work
I thought that restricting the scope to look at only one year of publications from one university would limit the data enough to not overwhelm me. However, I found that even pulling 1 day of publications (Jan 1, 2021) gave me 230 results and about 7,700 cited references, taking 35 minutes to pull from OpenAlex.

Scaling this up to pull a full year of 3,500 publications and their projected 115,600 references would take an estimated 7.5 hours of active API time.

Building further to pull multiple years and look across time would take another ~8 hours of API time per year of publications. 

This is more involved than I had thought. It remains possible, but will require more careful consideration of how and when to run the data pulls in chunks. However, once I have the data saved, the plots seen above will still work.
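
One possible way to cut the API time is to look up referenced works in batches rather than one request per reference. The sketch below only builds the request URLs. It assumes that OpenAlex's documented OR syntax (values joined with `|` inside a filter) can be applied to the `openalex_id` filter and that 50 values per request are accepted; both should be checked against the current API documentation. `batch_urls` and the example email are hypothetical.

```python
def batch_urls(reference_urls, batch_size=50, email='you@example.org'):
    """Build one OpenAlex request URL per batch of referenced-work IDs.

    Assumes the API accepts up to batch_size IDs joined with '|' in an
    openalex_id filter (an assumption to verify against the API docs).
    """
    # keep the final path segment of each URL, e.g. 'W1537479324'
    ids = [url.rsplit('/', 1)[-1] for url in reference_urls]
    urls = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        urls.append('https://api.openalex.org/works?filter=openalex_id:'
                    + '|'.join(batch)
                    + f'&per-page={batch_size}&mailto={email}')
    return urls

# 102 made-up reference IDs become 3 requests instead of 102
example_refs = [f'https://openalex.org/W{i}' for i in range(102)]
example_urls = batch_urls(example_refs)
print(len(example_urls))   # prints: 3
```

If the assumption holds, each response would carry the year, title, and DOI for a whole batch of references, so the per-reference request in the main loop would no longer be the bottleneck.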