# Investigating OpenAlex data: cited references
#### Eric Schares, Iowa State University; [eschares.github.io](eschares.github.io) 
 
---

<div style='background:#e7edf7'>
    This notebook queries the OpenAlex API to get a set of publications, pulls the cited references from their bibliographies, and answers the questions:
    <blockquote>
        <b><i>How many articles do our authors cite? When were those articles published? How recent are they?</i></b>
    </blockquote>
   </div>

 
**Context**

We would like to better understand how campus researchers use journal content. Analyzing which years our authors cite and how many papers they cite gives us a better feel for how content is being used. We can use this information as we make journal cancellation and renewal decisions.

- **Part 1**. Pull the Data from OpenAlex API
- **Part 2**. Plot the Data
  - **2.1**. Number of references
  - **2.2**. Years of references

---
# Part 1. Pull the Data
#### (Skip to Part 2 if you already have the data saved)

In [1]:
import pandas as pd
import requests
import plotly.express as px

## To modify for your own use, edit the input parameters:
- [ROR ID](https://ror.org/search?query=iowa+state) for your own institution 
- Date range: from_publication_date - to_publication_date
- Email address to get into OpenAlex's polite pool for faster response times

In [2]:
# input
ror_id = "https://ror.org/04rswrd78"
from_publication_date = "2021-01-01"
to_publication_date = "2021-01-01"
email = "eschares@iastate.edu"

In [3]:
def build_url(ror_id, from_pub_date, to_pub_date, email):
  # specify endpoint
  endpoint = 'works'

  # build the 'filter' parameter
  filters = (
      f'institutions.ror:{ror_id}',
      'is_paratext:false',
      'type:journal-article', 
      f'from_publication_date:{from_pub_date}',
      f'to_publication_date:{to_pub_date}'
  )

  # put the URL together
  return f'https://api.openalex.org/{endpoint}?filter={",".join(filters)}&mailto={email}'


filtered_works_url = build_url(ror_id, from_publication_date, to_publication_date, email)
print(f'complete URL:\n{filtered_works_url}')

complete URL:
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-01-01&mailto=eschares@iastate.edu


Send the API call and get a response

In [4]:
api_response = requests.get(filtered_works_url)
parsed_response = api_response.json()

How many publication ("parent") results?  
And how many OpenAlex pages will this take at the given `per_page`?

In [5]:
count = parsed_response['meta']['count']
print(f"result count: {count}")

per_page = 200
number_of_pages_needed = int(count / per_page) + (count % per_page > 0) # shorter way to calculate math.ceil
print(f"number of pages needed: {number_of_pages_needed}")

result count: 230
number of pages needed: 2
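
As a quick check on the page arithmetic above, the integer expression can be compared against `math.ceil` for a few counts. This is a minimal sketch; `pages_needed` is a hypothetical helper name that is not used elsewhere in this notebook.

```python
import math

def pages_needed(count, per_page=200):
    # same integer arithmetic as the cell above
    return int(count / per_page) + (count % per_page > 0)

# compare against math.ceil for counts on and around the page boundary
for count in (0, 1, 200, 230, 400, 401):
    assert pages_needed(count) == math.ceil(count / 200)

print(pages_needed(230))
```

The boundary cases (exactly 200 or 400 results) are the ones worth checking, since an off-by-one there would request an empty extra page.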


In [6]:
# this will find the first concept labeled as Level 0 (most general, no ancestors)
# there may be multiple Level 0's, but this works b/c it will take the first one when ordered by score
# https://docs.openalex.org/about-the-data/concept

# Should we find and return the first Level 1 concept too?
# What about when no Level 0? Just the lowest Level, with ties broken by highest score?
# But that makes it hard to compare - if Level 0 Engineering on most, but some papers get Level 1 Mechanical, won't group
# But that Mechanical wouldn't have had ANY Level 0, so it would have said None and not grouped anyway...

def find_level_zero_concept(concept_list):
    for concept in concept_list:
        if concept['level']==0:
            #print(concept['display_name'])
            return concept['display_name']
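
The comments above ask what to do when a paper has no Level 0 concept. One possible answer, sketched here under assumptions, is to fall back to the lowest level present and break ties by the highest score. `find_most_general_concept` is a hypothetical helper, the concept lists are made up, and it assumes each concept dict carries `level` and `score` keys. It does not resolve the grouping concern raised in the comments.

```python
def find_most_general_concept(concept_list):
    """Hypothetical fallback: return the display name of the concept with the
    lowest level, breaking ties by the highest score; None if the list is empty."""
    if not concept_list:
        return None
    # sort key: lowest level first, then highest score
    best = min(concept_list, key=lambda c: (c['level'], -c['score']))
    return best['display_name']

# made-up concept lists for illustration only
with_level_zero = [
    {'display_name': 'Mechanical engineering', 'level': 1, 'score': 0.9},
    {'display_name': 'Engineering', 'level': 0, 'score': 0.6},
    {'display_name': 'Physics', 'level': 0, 'score': 0.4},
]
without_level_zero = [
    {'display_name': 'Mechanical engineering', 'level': 1, 'score': 0.7},
    {'display_name': 'Turbine', 'level': 2, 'score': 0.8},
]

print(find_most_general_concept(with_level_zero))     # Engineering
print(find_most_general_concept(without_level_zero))  # Mechanical engineering
```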

## Main loop - send a request, go through each page, on each page go through each result, and pull out the pieces we want
#### ---- Warning! ----

This can take quite a bit of time to run depending on the number of records and number of cited references you're asking for.

If the estimated time is very long (~hours), shorten the date range in the input parameters (`from_publication_date` and `to_publication_date`) and run smaller chunks. Save each DataFrame separately, then reassemble them into one combined DataFrame using `new_df = pd.concat([df1, df2])`.
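
A minimal sketch of the reassembly step, using two made-up DataFrames in place of real saved chunks. Note that `pd.concat` takes a list of DataFrames. The `drop_duplicates` call is an added assumption: it only matters if the chunks' date ranges overlap.

```python
import pandas as pd

# two made-up chunks standing in for DataFrames saved from separate runs
df1 = pd.DataFrame({'publication_id': ['W1', 'W2'], 'num_cited_references': [10, 0]})
df2 = pd.DataFrame({'publication_id': ['W2', 'W3'], 'num_cited_references': [0, 25]})

# pd.concat takes a list of DataFrames; ignore_index renumbers the rows
new_df = pd.concat([df1, df2], ignore_index=True)

# drop records that appear in more than one chunk (e.g. overlapping date ranges)
new_df = new_df.drop_duplicates(subset='publication_id').reset_index(drop=True)
print(new_df)
```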

In [7]:
## GET ONLY PUBLICATIONS AND STORE THEIR REFERENCED WORKS 
def get_publications(works_url):
    api_calls_total = 0
    session = requests.Session()

    # we will store publications and connection pub2ref separately
    publications = []
    pub2ref = []

    # url with a placeholder for cursor
    works_url_with_cursor = works_url + '&cursor={}&per_page=200'

    # loop through pages
    cursor = '*'
    while cursor:
        # set cursor value and request page from OpenAlex
        url = works_url_with_cursor.format(cursor)
        print(url)
        page_with_results = session.get(url).json()
        api_calls_total += 1

        # loop through partial list of results
        results = page_with_results['results']
        for work in results:
            publications.append((
              work['id'],  # keep the OpenAlex ID
              work['doi'],
              work['publication_year'],
              work['title'],
              work['host_venue']['display_name'],
              work['host_venue']['publisher'],
              work['host_venue']['issn_l'],
              len(work['referenced_works'])
              #kicked out concepts for now
            ))

            for ref in work['referenced_works']:
                pub2ref.append((
                    work['id'],
                    ref
                ))

        # update cursor to meta.next_cursor
        cursor = page_with_results['meta']['next_cursor']
      
    print(f"number of api calls for publications: {api_calls_total}")
    return publications, pub2ref
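
The loop above indexes `work['host_venue'][...]` directly, so a record whose `host_venue` is missing or null would raise an error partway through a long run. A null-safe accessor is one way to guard against that. This is a sketch under the assumption that such records can occur; `nested_get` is a hypothetical helper and the records below are made up.

```python
def nested_get(record, *keys, default=None):
    """Walk nested dicts, returning `default` if any level is missing or None."""
    current = record
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
    return default if current is None else current

# made-up records for illustration
complete = {'id': 'W1', 'host_venue': {'display_name': 'Science', 'issn_l': '0036-8075'}}
null_venue = {'id': 'W2', 'host_venue': None}
no_venue = {'id': 'W3'}

print(nested_get(complete, 'host_venue', 'display_name'))        # Science
print(nested_get(null_venue, 'host_venue', 'display_name'))      # None
print(nested_get(no_venue, 'host_venue', 'issn_l', default=''))  # empty string
```

Inside the loop, `work['host_venue']['display_name']` could then be written as `nested_get(work, 'host_venue', 'display_name')`.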

In [8]:
def get_references(pub2ref):
    api_calls_total = 0
    session = requests.Session()

    references = []

    # url with a placeholder for cursor
    references_url = "https://api.openalex.org/works?filter=cited_by:{list_of_ids}&mailto=eschares@iastate.edu"
    references_url_with_cursor = references_url + '&cursor={cursor}&per_page=200'

    # filter for publications that have at least one reference
    pubs_with_refs = list(set(p[0].replace("https://openalex.org/","") for p in pub2ref))

    # take chunk of 50 publications
    chunk_size = 50
    for i in range(0, len(pubs_with_refs), chunk_size):
        publications_slice = pubs_with_refs[i:i + chunk_size]
        list_of_ids = "|".join(publications_slice)

        # loop through pages
        cursor = '*'
        while cursor:
            # set cursor value and request page from OpenAlex
            url = references_url_with_cursor.format(list_of_ids=list_of_ids, cursor=cursor)
            print(url)
            page_with_results = session.get(url).json()
            api_calls_total += 1
      
            # loop through partial list of results
            results = page_with_results['results']
            for work in results:
                references.append((
                    work['id'],  # keep the OpenAlex ID
                    work['doi'],
                    work['publication_year'],
                    work['title'],
                    work['host_venue']['display_name'],
                    work['host_venue']['publisher'],
                    work['host_venue']['issn_l'],
                    work['cited_by_count']
                    #kicked out concepts for now
                 ))

            # update cursor to meta.next_cursor
            cursor = page_with_results['meta']['next_cursor']

    print(f"number of api calls for references: {api_calls_total}")
    return references
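
The chunking step in `get_references` can be looked at in isolation. The sketch below reproduces only the slicing and the `|` join (no API calls), using made-up IDs; `chunk_ids` is a hypothetical helper name.

```python
def chunk_ids(ids, chunk_size=50):
    """Yield pipe-joined strings of at most `chunk_size` IDs each."""
    for i in range(0, len(ids), chunk_size):
        yield "|".join(ids[i:i + chunk_size])

# 120 made-up work IDs
ids = [f"W{n}" for n in range(120)]
chunks = list(chunk_ids(ids))

print(len(chunks))               # 3 chunks: 50 + 50 + 20
print(chunks[2].count("|") + 1)  # 20 IDs in the last chunk
```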

### Let's run the whole thing and time it

In [9]:
%%time
## MAIN

# get all publications
filtered_works_url = build_url(ror_id, from_publication_date, to_publication_date, email)
print("URL for publications: " + filtered_works_url)
publications, pub2ref = get_publications(filtered_works_url)

# store publications
print(f"retrieved {len(publications)} publications")
pubs_only = pd.DataFrame(publications, columns =['publication_id',
                                                 'publication_doi', 
                                                 'publication_year',
                                                 'publication_title',
                                                 'publication_journal',
                                                 'publication_publisher',
                                                 'publication_journal_issn',
                                                 'num_cited_references'
                                                 ])

pubs_only.to_csv('publications.csv', index=False)

# store connection pub2ref
pub2ref_df = pd.DataFrame(pub2ref, columns=['publication_id', 'reference_id'])

# .csv format probably okay here, human readable
pub2ref_df.to_csv('pub2ref.csv', index=False)
pub2ref_df

# get references
references = get_references(pub2ref)

# store references
print(f"retrieved {len(references)} references")
refs_only = pd.DataFrame(references, columns =['reference_id',
                                                 'reference_doi', 
                                                 'reference_year',
                                                 'reference_title',
                                                 'reference_journal',
                                                 'reference_publisher',
                                                 'reference_journal_issn',
                                                 'reference_citation_count'
                                                 ])

# using parquet file format since it can be a big file, smaller size but not human readable
refs_only.to_parquet('references.parquet')

URL for publications: https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-01-01&mailto=eschares@iastate.edu
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-01-01&mailto=eschares@iastate.edu&cursor=*&per_page=200
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-01-01&mailto=eschares@iastate.edu&cursor=IlswLCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzMyMDIwNzIxODcnXSI=&per_page=200
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/04rswrd78,is_paratext:false,type:journal-article,from_publication_date:2021-01-01,to_publication_date:2021-01-01&mailto=eschares@iastate.edu&cursor=IlswLCAnaHR0cHM6Ly9vcGVu

### ❗ **It would be better to write results to a file continuously (append to a CSV, or write multiple files to a directory for Parquet) instead of saving everything at the end; otherwise you may run out of memory!**
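
A minimal sketch of the append-to-CSV idea, assuming each page of results is written as soon as it arrives. `append_rows` is a hypothetical helper, and the rows and file path are made up for illustration.

```python
import os
import tempfile
import pandas as pd

def append_rows(rows, columns, path):
    """Append one page of rows to a CSV, writing the header only for a new file."""
    page_df = pd.DataFrame(rows, columns=columns)
    page_df.to_csv(path, mode='a', index=False, header=not os.path.exists(path))

# made-up pages standing in for successive API responses
columns = ['reference_id', 'reference_year']
path = os.path.join(tempfile.mkdtemp(), 'references.csv')
append_rows([('W1', 2012), ('W2', 2015)], columns, path)
append_rows([('W3', 1999)], columns, path)

result = pd.read_csv(path)
print(result)
```

With this pattern only one page of results is held in memory at a time, at the cost of one small file write per page.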

---
---

# Part 2: Plot the data
### Skip to here if you already have the df's saved.
Once the OpenAlex responses have been parsed into pandas DataFrames, you can start to plot.

❗❗❗❗❗

* Any analysis previously done on the summary table is now done on `publications.csv`.
* Any analysis previously done on a single table of publications and references is now done on the joined tables (via `pub2ref`).

In [10]:
pubs_df = pd.read_csv('publications.csv')
pub2ref_df = pd.read_csv('pub2ref.csv')
refs_df = pd.read_parquet('references.parquet')

# summary before is pubs_df now
summary_df = pubs_df

# join tables on id fields - that's why it is important to keep unique openalex ids!
df = pub2ref_df.join(pubs_df.set_index('publication_id'), on='publication_id')
pub_id_col = df.pop('reference_id') # move reference_id column to end
df['reference_id'] = pub_id_col     # move reference_id column to end
df = df.join(refs_df.set_index('reference_id'), on='reference_id')
df

Unnamed: 0,publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references,reference_id,reference_doi,reference_year,reference_title,reference_journal,reference_publisher,reference_journal_issn,reference_citation_count
0,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W107656301,https://doi.org/10.1016/b978-0-444-56334-7.000...,2012,Other membrane processes,Membrane Processes in Biotechnology and Pharma...,,,1
1,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1513798492,https://doi.org/10.1126/science.aaa5058,2015,Sub–10 nm polyamide nanofilms with ultrafast s...,Science,American Association for the Advancement of Sc...,0036-8075,1011
2,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1748568996,https://doi.org/10.1016/j.memsci.2015.09.059,2016,Identifying facile and accurate methods to mea...,Journal of Membrane Science,Elsevier,0376-7388,64
3,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1994098525,https://doi.org/10.1016/s1089-3156(99)00020-3,1999,Molecular dynamics simulation study of the mec...,Computational and Theoretical Polymer Science,Elsevier,1089-3156,69
4,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1994820621,https://doi.org/10.1016/j.desal.2013.09.024,2014,Molecular simulations of polyamide reverse osm...,Desalination,Elsevier,0011-9164,47
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7551,https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4244729525,https://doi.org/10.2307/j.ctv9b2wsq.9,2019,LETTER 63,"Letters, 61-90",,,1
7552,https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4247637984,https://doi.org/10.1021/cen-v041n033.p082,1963,INDUSTRY,Chemical & Engineering News,American Chemical Society,0009-2347,1
7553,https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4248717949,https://doi.org/10.3386/w28056,2020,Intangible Value,,,,6
7554,https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4249706454,https://doi.org/10.1080/10948009409389748,1994,Communications policy,Communication booknotes,Taylor & Francis,0748-657X,1
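
One thing worth checking before relying on the joined table: `get_references` queries in chunks of 50 publications and appends every result, so a reference cited by publications in different chunks may appear in `refs_df` more than once, and the join would then repeat rows. The sketch below shows the effect and one possible fix on made-up data. Whether duplicates actually occur in a given run is an assumption to verify, for example with `refs_df['reference_id'].duplicated().sum()`.

```python
import pandas as pd

# made-up data: reference W9 was returned twice (once per chunk)
refs = pd.DataFrame({'reference_id': ['W9', 'W8', 'W9'],
                     'reference_year': [2012, 2015, 2012]})
links = pd.DataFrame({'publication_id': ['W1', 'W1', 'W2'],
                      'reference_id': ['W9', 'W8', 'W9']})

# joining against a table with repeated IDs multiplies the matching rows
joined_raw = links.join(refs.set_index('reference_id'), on='reference_id')

# dropping repeated reference IDs first keeps one row per publication-reference link
refs_unique = refs.drop_duplicates(subset='reference_id')
joined_clean = links.join(refs_unique.set_index('reference_id'), on='reference_id')

print(len(links), len(joined_raw), len(joined_clean))
```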


In [29]:
# table with multiindex - connection pub to ref visualized
df_grouped = df.set_index(['publication_id',
                            'publication_doi',
                            'publication_year',
                            'publication_title',
                            'publication_journal',
                            'publication_publisher',
                            'publication_journal_issn',
                            'num_cited_references',
                            'reference_id'])
df_grouped

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0,reference_doi,reference_year,reference_title,reference_journal,reference_publisher,reference_journal_issn,reference_citation_count,year_delta
publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references,reference_id,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity enhances water transport in desalination membranes,Science,American Association for the Advancement of Science (AAAS),0036-8075,49,https://openalex.org/W107656301,https://doi.org/10.1016/b978-0-444-56334-7.000...,2012,Other membrane processes,Membrane Processes in Biotechnology and Pharma...,,,1,9
https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity enhances water transport in desalination membranes,Science,American Association for the Advancement of Science (AAAS),0036-8075,49,https://openalex.org/W1513798492,https://doi.org/10.1126/science.aaa5058,2015,Sub–10 nm polyamide nanofilms with ultrafast s...,Science,American Association for the Advancement of Sc...,0036-8075,1011,6
https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity enhances water transport in desalination membranes,Science,American Association for the Advancement of Science (AAAS),0036-8075,49,https://openalex.org/W1748568996,https://doi.org/10.1016/j.memsci.2015.09.059,2016,Identifying facile and accurate methods to mea...,Journal of Membrane Science,Elsevier,0376-7388,64,5
https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity enhances water transport in desalination membranes,Science,American Association for the Advancement of Science (AAAS),0036-8075,49,https://openalex.org/W1994098525,https://doi.org/10.1016/s1089-3156(99)00020-3,1999,Molecular dynamics simulation study of the mec...,Computational and Theoretical Polymer Science,Elsevier,1089-3156,69,22
https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity enhances water transport in desalination membranes,Science,American Association for the Advancement of Science (AAAS),0036-8075,49,https://openalex.org/W1994820621,https://doi.org/10.1016/j.desal.2013.09.024,2014,Molecular simulations of polyamide reverse osm...,Desalination,Elsevier,0011-9164,47,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4244729525,https://doi.org/10.2307/j.ctv9b2wsq.9,2019,LETTER 63,"Letters, 61-90",,,1,2
https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4247637984,https://doi.org/10.1021/cen-v041n033.p082,1963,INDUSTRY,Chemical & Engineering News,American Chemical Society,0009-2347,1,58
https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4248717949,https://doi.org/10.3386/w28056,2020,Intangible Value,,,,6,1
https://openalex.org/W4210364030,https://doi.org/10.2139/ssrn.3783526,2021,Fundamental Anomalies,Social Science Research Network,Social Science Electronic Publishing,1556-5068,40,https://openalex.org/W4249706454,https://doi.org/10.1080/10948009409389748,1994,Communications policy,Communication booknotes,Taylor & Francis,0748-657X,1,27


## 2.1 Look at summary data first - just Title, DOI, and number of references

In [11]:
#summary_df = pd.read_csv('summary_file.csv', usecols=['publication_title','publication_doi','num_cited_references'])
summary_df = summary_df.sort_values(by='num_cited_references')
summary_df = summary_df.reset_index(drop=True)
summary_df

Unnamed: 0,publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references
0,https://openalex.org/W4244705303,https://doi.org/10.2139/ssrn.3949336,2021,Effect of Pooling Family Oral Fluids on the Pr...,Social Science Research Network,Social Science Electronic Publishing,1556-5068,0
1,https://openalex.org/W3082738057,https://doi.org/10.1109/tsg.2020.3020790,2021,Real-Time Area Angle Monitoring Using Synchrop...,IEEE Transactions on Smart Grid,Institute of Electrical and Electronics Engineers,1949-3053,0
2,https://openalex.org/W3164221434,https://doi.org/10.13031/trans.14161,2021,Comparison of Dry Matter Loss Rates from Stati...,Transactions of the ASABE,American Society of Agricultural and Biologica...,2151-0032,0
3,https://openalex.org/W3180645738,https://doi.org/10.1159/000517937,2021,Preface to the Special Issue on Sexual Develop...,Sexual Development,S. Karger AG,1661-5425,0
4,https://openalex.org/W3182526752,https://doi.org/10.3389/fpls.2021.720709,2021,Corrigendum: Polysaccharide Biosynthesis: Glyc...,Frontiers in Plant Science,Frontiers Media SA,1664-462X,0
...,...,...,...,...,...,...,...,...
225,https://openalex.org/W3091715310,https://doi.org/10.1016/j.agwat.2020.106466,2021,Standard single and basal crop coefficients fo...,Agricultural Water Management,Elsevier,0378-3774,149
226,https://openalex.org/W3039069446,https://doi.org/10.1016/j.jobe.2020.101582,2021,Methodologies to mitigate wind-induced vibrati...,Journal of building engineering,Elsevier,2352-7102,154
227,https://openalex.org/W3205262116,https://doi.org/10.3389/fncel.2021.772868,2021,Differential Impact of Severity and Duration o...,Frontiers in Cellular Neuroscience,Frontiers Media SA,1662-5102,157
228,https://openalex.org/W3112388281,https://doi.org/10.1093/ajcn/nqaa302,2021,NIH Workshop Report: sensory nutrition and dis...,The American Journal of Clinical Nutrition,Oxford University Press,0002-9165,176


Average and median number of references per paper

In [12]:
summary_df.describe()

Unnamed: 0,publication_year,num_cited_references
count,230.0,230.0
mean,2021.0,32.852174
std,0.0,35.929879
min,2021.0,0.0
25%,2021.0,1.25
50%,2021.0,24.0
75%,2021.0,47.75
max,2021.0,182.0


OpenAlex reports 0 references for some papers, even though manual inspection shows that those papers do cite references.

In [13]:
# number of publications with 0 reported references
summary_df.loc[summary_df['num_cited_references']==0].shape[0]

57

OpenAlex is missing reference data for about 25% of these records (57 of 230).

In [14]:
# fraction of publications with 0 reported references
summary_df.loc[summary_df['num_cited_references']==0].shape[0] / summary_df.shape[0]

0.24782608695652175

### Aside: OpenCitation data 

From Crossref's April 9, 2020 [blog post](https://www.crossref.org/blog/free-public-data-file-of-112-million-crossref-records/): 

"References (i.e. authors’ cited sources) are also optional metadata. Nearly 50 million records include references and, of those, nearly 30 million have open references that are included in the data file. “Limited” and “Closed” references are not included in the data file. (EDIT 6th June 2022 - all references are now open by default with the March 2022 board vote to remove any restrictions on reference distribution)."

Initiative for Open Citations [(I4OC)](https://i4oc.org/) works to "promote the unrestricted availability of scholarly citation data."

![image.png](attachment:image.png)

---
### Make plots

In [15]:
# make all numbers same color except for 0 references
color_dict = {num:'blue' for num in summary_df['num_cited_references'] if num != 0}
color_dict[0]='lightgray'

In [16]:
fig = px.histogram(summary_df, x='num_cited_references', nbins=50,
             color='num_cited_references',
             color_discrete_map=color_dict,
             title=f'Histogram of the Number of Cited References in {summary_df.shape[0]} Publications<br>Num_references=0 shown in light gray'
)
fig.update_layout(showlegend=False)

In [17]:
# Remove publications with 0 reported references
summary_df_no_zeros = summary_df.loc[summary_df['num_cited_references']!=0]
summary_df_no_zeros

Unnamed: 0,publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references
57,https://openalex.org/W3107958010,https://doi.org/10.1386/ijia_00033_1,2021,Reorienting Perspectives: Why I Do Not Teach a...,International journal of Islamic architecture,Intellect,2045-5895,1
58,https://openalex.org/W3115756258,https://doi.org/10.1080/15434303.2020.1862122,2021,Iowa State University’s English placement test...,Language Assessment Quarterly,Taylor & Francis,1543-4303,2
59,https://openalex.org/W3088002767,https://doi.org/10.1177/0098628320959946,2021,"One Fish, Two Fish; Red Fish (or Green Fish?):...",Teaching of Psychology,SAGE,0098-6283,2
60,https://openalex.org/W3131335974,https://doi.org/10.1093/jipm/pmab001,2021,"Soybean Gall Midge (Diptera: Cecidomyiidae), a...",Journal of Integrated Pest Management,Oxford University Press,2155-7470,3
61,https://openalex.org/W3112660193,https://doi.org/10.1016/j.nuclphysa.2020.121933,2021,Probing Jet Modification in Small Systems via ...,Nuclear Physics,Elsevier,0375-9474,3
...,...,...,...,...,...,...,...,...
225,https://openalex.org/W3091715310,https://doi.org/10.1016/j.agwat.2020.106466,2021,Standard single and basal crop coefficients fo...,Agricultural Water Management,Elsevier,0378-3774,149
226,https://openalex.org/W3039069446,https://doi.org/10.1016/j.jobe.2020.101582,2021,Methodologies to mitigate wind-induced vibrati...,Journal of building engineering,Elsevier,2352-7102,154
227,https://openalex.org/W3205262116,https://doi.org/10.3389/fncel.2021.772868,2021,Differential Impact of Severity and Duration o...,Frontiers in Cellular Neuroscience,Frontiers Media SA,1662-5102,157
228,https://openalex.org/W3112388281,https://doi.org/10.1093/ajcn/nqaa302,2021,NIH Workshop Report: sensory nutrition and dis...,The American Journal of Clinical Nutrition,Oxford University Press,0002-9165,176


In [18]:
summary_df_no_zeros.describe()

Unnamed: 0,publication_year,num_cited_references
count,173.0,173.0
mean,2021.0,43.676301
std,0.0,35.259987
min,2021.0,1.0
25%,2021.0,20.0
50%,2021.0,35.0
75%,2021.0,56.0
max,2021.0,182.0


In [19]:
px.histogram(summary_df_no_zeros, x='num_cited_references', nbins=50,
             text_auto=True,
             title=f'Histogram of the Number of Cited References in {summary_df_no_zeros.shape[0]} Publications<br>Num_references=0 *removed*')

In [20]:
px.ecdf(summary_df, x='num_cited_references', ecdfnorm='percent',
       title=f'Cumulative Distribution of the Number of Cited References in {summary_df.shape[0]} Publications')

In [21]:
px.ecdf(summary_df_no_zeros, x='num_cited_references', ecdfnorm='percent',
       title=f'Cumulative Distribution of the Number of Cited References in {summary_df_no_zeros.shape[0]} Publications<br>Num_references=0 *removed*')

---

## 2.2 Look further at the years those references were published

In [22]:
#df = pd.read_parquet('my_file_7697.parquet')
df.head(3)

Unnamed: 0,publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references,reference_id,reference_doi,reference_year,reference_title,reference_journal,reference_publisher,reference_journal_issn,reference_citation_count
0,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W107656301,https://doi.org/10.1016/b978-0-444-56334-7.000...,2012,Other membrane processes,Membrane Processes in Biotechnology and Pharma...,,,1
1,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1513798492,https://doi.org/10.1126/science.aaa5058,2015,Sub–10 nm polyamide nanofilms with ultrafast s...,Science,American Association for the Advancement of Sc...,0036-8075,1011
2,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1748568996,https://doi.org/10.1016/j.memsci.2015.09.059,2016,Identifying facile and accurate methods to mea...,Journal of Membrane Science,Elsevier,0376-7388,64


In [23]:
# Check distribution of number of references
df['publication_doi'].value_counts()

https://doi.org/10.1002/jcpy.1201                183
https://doi.org/10.1016/j.agwat.2020.106466      183
https://doi.org/10.1093/ajcn/nqaa302             176
https://doi.org/10.3389/fncel.2021.772868        158
https://doi.org/10.1016/j.agwat.2020.106196      157
                                                ... 
https://doi.org/10.1111/1556-4029.14564            3
https://doi.org/10.1002/bes2.1812                  3
https://doi.org/10.1177/0098628320959946           2
https://doi.org/10.1080/15434303.2020.1862122      2
https://doi.org/10.1386/ijia_00033_1               1
Name: publication_doi, Length: 173, dtype: int64

### Oldest Reference is:

In [25]:
df.loc[df['reference_year']==df['reference_year'].min()]

Unnamed: 0,publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references,reference_id,reference_doi,reference_year,reference_title,reference_journal,reference_publisher,reference_journal_issn,reference_citation_count
6897,https://openalex.org/W3126952487,https://doi.org/10.1002/agg2.20133,2021,Sectional model of a prairie buffer strip in a...,"Agrosystems, geosciences & environment",Wiley,2639-6696,21,https://openalex.org/W2409768010,https://doi.org/10.1061/taceat.0000694,1889,The Relation Between the Rainfall and the Disc...,Transactions of the American Society of Civil ...,American Society of Civil Engineers,0066-0604,200


### Make plots

In [None]:
px.histogram(df, x='reference_year', nbins=200, 
             title=f'Histogram of Cited Year<br>{summary_df.shape[0]} Publications and {df.shape[0]} References')

In [None]:
px.histogram(df, x='reference_year', nbins=200, histnorm='probability density',
            title=f'Probability Density of the Cited Year<br>{summary_df.shape[0]} Publications and {df.shape[0]} References')

In [None]:
fig5 = px.ecdf(df, x='reference_year', ecdfnorm='percent',markers=True, lines=False,
        color_discrete_map={'red':'red', 'blue':'blue'},
       title=f'Cumulative Distribution of Year of Citation<br>{df.shape[0]} references'
)
fig5.update_layout(showlegend=False)

## Track one DOI of interest

### Add 'color' column to control the colors and change color of one DOI to track it on the plot

In [None]:
# Change DOI in this line
red_doi = 'https://doi.org/10.1177/2329488417735646'

#https://doi.org/10.1007/s11425-018-1550-5   # publication with oldest average reference at 36 years
#https://doi.org/10.1386/ijia_00033_1  - 0 average year, referenced 1 work, which is itself?
#https://doi.org/10.1016/j.trgeo.2020.100410 - 10.25 average year
#https://doi.org/10.1177/2329488417735646 - 16 average year

df['color'] = 'blue'
red_title = df.loc[df['publication_doi']==red_doi, 'publication_title'].iloc[0]
red_title

In [None]:
# Change color for that DOI to red
filt = (df['publication_doi'] == red_doi)
df.loc[filt,'color'] = 'red'

In [None]:
# Double check the number that you changed to red, should match number of references in that DOI
df['color'].value_counts()

In [None]:
fig2 = px.histogram(df, x='reference_year', color='color', nbins=200,
             title=f'Years when Cited References were published<br>Red: "{red_title}"',
             hover_data={'color':False,
                         'reference_title':True},
             color_discrete_map={'red':'red', 'blue':'blue'},
             category_orders={"color":['blue','red']}
)
fig2.update_layout(showlegend=False)

In [None]:
fig3 = px.box(df, x='reference_year', points='all', color='color', notched=True,
       title=f'Years when Cited References were published<br>Red: "{red_title}"',
       hover_data={'color':False,
                   'reference_title':True,
                   'publication_year':True,
                   'publication_title':True},  # df has no 'publication' column; the title is assumed to be the intended field
       color_discrete_map={'red':'red', 'blue':'blue'},
       category_orders={"color":['blue','red']}
)
fig3.update_layout(showlegend=False)

In [None]:
# df has one row per cited reference, so df.shape[0] counts references, not publications
fig4 = px.ecdf(df, x='reference_year', color='color', ecdfnorm='percent', markers=True, lines=False,
        color_discrete_map={'red':'red', 'blue':'blue'},
       title=f'Cumulative Distribution of Year of Citation in {df.shape[0]} References<br>Red: "{red_title}"'
)
fig4.update_layout(showlegend=False)

## Calculate the year delta, or how many years old a reference was when it got cited

In [26]:
df['year_delta'] = df['publication_year'] - df['reference_year']

In [27]:
df.head(3)

Unnamed: 0,publication_id,publication_doi,publication_year,publication_title,publication_journal,publication_publisher,publication_journal_issn,num_cited_references,reference_id,reference_doi,reference_year,reference_title,reference_journal,reference_publisher,reference_journal_issn,reference_citation_count,year_delta
0,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W107656301,https://doi.org/10.1016/b978-0-444-56334-7.000...,2012,Other membrane processes,Membrane Processes in Biotechnology and Pharma...,,,1,9
1,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1513798492,https://doi.org/10.1126/science.aaa5058,2015,Sub–10 nm polyamide nanofilms with ultrafast s...,Science,American Association for the Advancement of Sc...,0036-8075,1011,6
2,https://openalex.org/W3114025680,https://doi.org/10.1126/science.abb8518,2021,Nanoscale control of internal inhomogeneity en...,Science,American Association for the Advancement of Sc...,0036-8075,49,https://openalex.org/W1748568996,https://doi.org/10.1016/j.memsci.2015.09.059,2016,Identifying facile and accurate methods to mea...,Journal of Membrane Science,Elsevier,0376-7388,64,5


In [28]:
df['year_delta'].describe()

count    7856.000000
mean       11.257256
std        11.065128
min        -1.000000
25%         4.000000
50%         8.000000
75%        14.000000
max       132.000000
Name: year_delta, dtype: float64

### Group by publication, get one number per publication that shows the average age of its references

In [None]:
df2 = df.groupby('publication_title')['year_delta'].mean().to_frame(name='avg_year_delta')
df2.sample(5)  # 5 random results

In [None]:
df2 = df2.sort_values(by='avg_year_delta')
df2

In [None]:
df2['avg_year_delta'].describe()

In [None]:
px.ecdf(df2, x='avg_year_delta',
       title=f'Cumulative Distribution of the Average Age of Reference by Publication<br>{df2.shape[0]} Publications'
)

---
# Part 3. Results

### Number of References
In this sample of 230 publications, 61 reported 0 references.

If we ignore those, the remaining 169 publications cite a median of **35** references and a mean of **45**. The gap shows how a few papers with very long reference lists (4 cite more than 180) pull the mean up.
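
The effect of a long tail on the mean can be reproduced on a small scale. The sketch below uses made-up reference counts (the `toy` frame is hypothetical, not the notebook's data) and assumes a `num_cited_references` column like the one in the saved file:

```python
import pandas as pd

# Hypothetical per-publication reference counts, for illustration only
toy = pd.DataFrame({
    'publication_id': ['W1', 'W2', 'W3', 'W4', 'W5', 'W6'],
    'num_cited_references': [0, 20, 30, 40, 50, 210],
})

# Drop publications that report zero references
with_refs = toy[toy['num_cited_references'] > 0]

# One very long reference list moves the mean but not the median
median_refs = with_refs['num_cited_references'].median()
mean_refs = with_refs['num_cited_references'].mean()
print(f'median = {median_refs}, mean = {mean_refs}')
```

Running the same two lines on the real per-publication counts should reproduce the median and mean quoted above.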

### Individual References
Looking at all individual references, the median age is 8 years (so, for a 2021 publication, the median reference is a paper from 2013). The mean age is 11 years, again showing how a very old paper can pull the mean up (the oldest reference is 132 years old, from 1889!).

20% of the references in this set were published in 2018 or later, and 80% were published in 2004 or later.
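
Shares like these can be computed directly rather than read off the cumulative plot, which also makes the direction of the comparison explicit. A minimal sketch on hypothetical years (the `toy` frame is invented for illustration; only the `reference_year` column name comes from the notebook):

```python
import pandas as pd

# Hypothetical reference years, for illustration only
toy = pd.DataFrame({
    'reference_year': [1889, 2004, 2010, 2015, 2018, 2019, 2019, 2020, 2020, 2021]
})

# Share of references published in or before a given year (what the ECDF plots)
share_by_2004 = (toy['reference_year'] <= 2004).mean()

# Share of references published in or after a given year (counting back from today)
share_since_2018 = (toy['reference_year'] >= 2018).mean()

print(f'published by 2004: {share_by_2004:.0%}, published since 2018: {share_since_2018:.0%}')
```

Swapping `toy` for `df` would give the corresponding shares for the real data.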

### Summarize by Publication

Grouping by publication gives each publication a single number: the average age of its references. The mean (and median) of those per-publication averages is about 11 years.

80% of publications have an average year_delta of 15 years or less. The remaining 20% of publications are scattered across 20+ years.
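
The per-publication summary can also be checked numerically. The sketch below is one possible variant, not the notebook's own cell: it groups on `publication_id` instead of `publication_title`, on the assumption that IDs are unique while two papers could share a title. The `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical reference-level rows, for illustration only
toy = pd.DataFrame({
    'publication_id': ['W1', 'W1', 'W2', 'W2', 'W3'],
    'year_delta': [2, 4, 10, 20, 40],
})

# One number per publication: the average age of its references
avg = toy.groupby('publication_id')['year_delta'].mean()

# Share of publications whose average reference age is 15 years or less
share_15_or_less = (avg <= 15).mean()
print(avg)
print(f'{share_15_or_less:.0%} of publications average 15 years or less')
```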

---
# Part 4. Conclusion

OpenAlex is an open index of scholarly metadata (a knowledge graph), in contrast to other large scientometric data sets, which are proprietary.

However, OpenAlex can only report the data it has, and it depends heavily on its sources. Some publishers choose not to deposit reference data with Crossref, so those references do not appear in OpenAlex. In the past, publishers could deposit reference data with Crossref but opt to keep it closed; this changed in June 2022, and *any* reference data deposited with Crossref is now made open.

These gaps limit the data and the findings here. The counts above should be read as lower bounds, as there are undoubtedly missing references and publication data.

#### Future Work
I thought that restricting the scope to one year of publications from one university would limit the data enough to not overwhelm me. However, I found that pulling even one day of publications (Jan 1, 2021) gave me 230 results and 7,600 cited references, and took 35 minutes of API time.

Scaling this up to pull a full year of 3,500 publications and their projected 115,600 references would take an estimated 7.5 hours of active API time.

Building further to pull multiple years and look across time would take another ~8 hours of API time per year of publications. 

This is more involved than I had thought. It remains possible, but will require more careful consideration of how and when to run the data pulls in chunks. However, once I have the data saved, the plots seen above will still work.
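
One way to break a year into chunks is by calendar month. The sketch below only generates the date ranges and makes no API calls; `month_chunks` is a hypothetical helper, not part of the notebook.

```python
import calendar
from datetime import date

def month_chunks(year):
    """Return (first_day, last_day) ISO date strings for each month of a year."""
    chunks = []
    for month in range(1, 13):
        # monthrange returns (weekday of first day, number of days in month)
        last_day = calendar.monthrange(year, month)[1]
        chunks.append((date(year, month, 1).isoformat(),
                       date(year, month, last_day).isoformat()))
    return chunks

chunks = month_chunks(2021)
for start, end in chunks[:3]:
    print(start, end)
```

Each pair could then be passed as `from_publication_date` and `to_publication_date` to `build_url` from Part 1, with each chunk's results saved to disk before the next pull starts, so that an interrupted run does not lose earlier work.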