# Script to get all results from OpenAlex's API

We'll create a script to download all works from a given institution (UFRJ) in OpenAlex using its 'works' API.

The 'cursor paging' feature will be used to do so. More on that [here](https://github.com/ourresearch/openalex-api-tutorials/blob/main/notebooks/getting-started/paging.ipynb).

Note that OpenAlex has quite a few API's, each dealing with different types of data: 

![OpenAlex's entities](https://docs.openalex.org/~gitbook/image?url=https%3A%2F%2F334408415-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FpHVuV3Ib5KXeBKft4Kcl%252Fuploads%252Fgit-blob-f2467ba820f38bcd9dc58a791415c8bd1fbcafec%252Fentities.png%3Falt%3Dmedia&width=768&dpr=1&quality=100&sign=60a05a56&sv=2)

In [6]:
import requests #We'll need the requests library to make API requests

In [73]:
#Let's see some information about the institution of interest (UFRJ) using the API
#We'll use its ROR identifier to do so (https://ror.org/03490as77)

ufrj_info = requests.get('https://api.openalex.org/institutions?filter=ror:03490as77').json()
ufrj_info['results']

[{'id': 'https://openalex.org/I122140584',
  'ror': 'https://ror.org/03490as77',
  'display_name': 'Universidade Federal do Rio de Janeiro',
  'country_code': 'BR',
  'type': 'education',
  'type_id': 'https://openalex.org/institution-types/education',
  'lineage': ['https://openalex.org/I122140584'],
  'homepage_url': 'https://ufrj.br',
  'image_url': 'https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file/Minerva%20UFRJ.jpg',
  'image_thumbnail_url': 'https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file/Minerva%20UFRJ.jpg&width=300',
  'display_name_acronyms': ['UFRJ'],
  'display_name_alternatives': ['Federal University of Rio de Janeiro'],
  'repositories': [],
  'works_count': 144605,
  'cited_by_count': 2243183,
  'summary_stats': {'2yr_mean_citedness': 1.8833812260536398,
   'h_index': 356,
   'i10_index': 45261},
  'ids': {'openalex': 'https://openalex.org/I122140584',
   'ror': 'https://ror.org/03490as77',
   'mag': '122140584',
   'grid': 'grid

In [40]:
#That's a lot of information! Let's get only the most important fields for now

def get_relevant_institution_fields(ror):
    data = requests.get(f'https://api.openalex.org/institutions?filter=ror:{ror}').json()
    results = data['results'][0]
    print(f"Institution name: {results['display_name']}")
    print(f"Total works: {results['works_count']}")
    print(f"Works api url: {results['works_api_url']}")

In [41]:
get_relevant_institution_fields('03490as77')

Institution name: Universidade Federal do Rio de Janeiro
Total works: 144605
Works api: https://api.openalex.org/works?filter=institutions.id:I122140584


In [47]:
#Let's use the works' api url to retrieve first page of works from the institution

data = requests.get(f'https://api.openalex.org/works?filter=institutions.id:I122140584&per-page=1').json()
metadata = data['meta']
metadata

{'count': 142411,
 'db_response_time_ms': 140,
 'page': 1,
 'per_page': 1,
 'groups_count': None}

In [64]:
#There are less docs in the 'works' request than in the 'institution' request
#We'll see if searching through ror will get more results
#Searching through raw-affiliation name is not recommended (https://docs.openalex.org/api-entities/works/search-works#why-cant-i-search-by-name-of-related-entity-author-name-institution-name-etc.), but we'll do it anyway

def get_number_of_results(work_api_url):
    return( requests.get(work_api_url).json()['meta']['count'] )

#Using ror
ror_url = 'https://api.openalex.org/works?filter=institutions.ror:03490as77'
print(f'ror: { get_number_of_results(ror_url) }')

raw_affiliation_strings_url = 'https://api.openalex.org/works?filter=raw_affiliation_strings.search:"Universidade Federal do Rio de Janeiro"'
print(f'affiliation_name: { get_number_of_results(ror_url) }')

ror: 142408
affiliation_name: 142408


In [77]:
requests.get('https://api.openalex.org/institutions/I122140584').json()['works_count']

144605

In [83]:
#Since there is no difference between the methods used to recover data, we'll just check that this happens for other institutions

#https://api.openalex.org/institutions/I122140584

def get_institution_counts(id):
    inst_results = requests.get(f'https://api.openalex.org/institutions/{id}').json()
    work_results = requests.get(f'https://api.openalex.org/works?filter=institutions.id:{id}').json()
    final_results = {
        'Institution name': inst_results['display_name'],
        'Total itens': inst_results['works_count'],
        'Total itens (works api)': work_results['meta']['count']
    }
    print(final_results)
                        
institutions_ids = ['i122140584','i1294671590', 'i27837315', 'i185261750', 'i136199984',
                'i19820366', 'i17974374']
                
for i in institutions_ids:
    get_institution_counts(i)

{'Institution name': 'Universidade Federal do Rio de Janeiro', 'Total itens': 144605, 'Total itens (works api)': 142411}
{'Institution name': 'Centre National de la Recherche Scientifique', 'Total itens': 1064412, 'Total itens (works api)': 1057182}
{'Institution name': 'University of Michigan–Ann Arbor', 'Total itens': 925883, 'Total itens (works api)': 921971}
{'Institution name': 'University of Michigan–Ann Arbor', 'Total itens': 925883, 'Total itens (works api)': 921971}
{'Institution name': 'University of Toronto', 'Total itens': 489575, 'Total itens (works api)': 487790}
{'Institution name': 'Harvard University', 'Total itens': 667798, 'Total itens (works api)': 666674}
{'Institution name': 'Chinese Academy of Sciences', 'Total itens': 836331, 'Total itens (works api)': 835871}
{'Institution name': 'Universidade de São Paulo', 'Total itens': 403434, 'Total itens (works api)': 400904}


In [None]:
#So, in general, the number of works returned by the 'works' api is lower than the number listed by the 'institutions' api
#Then, let's use the 'works' api to get all of the UFRJ's publications

##CONTINUAR DAQUI!!!!


url = f'https://api.openalex.org/works?filter=author.ror:I122140584&cursor={}'


# url with a placeholder for cursor
example_url_with_cursor = 'https://api.openalex.org/works?filter=author.id:A5048491430&cursor={}'

cursor = '*'

# loop through pages
while cursor:
    
    # set cursor value and request page from OpenAlex
    url = example_url_with_cursor.format(cursor)
    print("\n" + url)
    page_with_results = requests.get(url).json()
    
    # loop through partial list of results
    results = page_with_results['results']
    for i,work in enumerate(results):
        openalex_id = work['id'].replace("https://openalex.org/", "")
        print(openalex_id, end='\t' if (i+1)%5!=0 else '\n')

    # update cursor to meta.next_cursor
    cursor = page_with_results['meta']['next_cursor']