# Dimensions data retrieval

## Web App

To obtain Dimensions data, we used the Dimensions [Web App](https://app.dimensions.ai/discover/publication) with a Dimensions Analytics subscription.

[Our query](https://app.dimensions.ai/discover/publication?or_facet_research_org=grid.8536.8&not_facet_year=2023) to obtain all UFRJ publications up to 2022 can be summarized as:

```
Research Organization: "Federal University of Rio de Janeiro" AND
Publication Year: NOT (2023)
```

This yielded 89649 records (access date: 11/02/2023), which exceeds the platform's maximum number of records per download (50k per download with a Dimensions Analytics subscription).

Thus, we split this data into two subsets containing documents published in the following periods: 
- [Up to 2014](https://app.dimensions.ai/discover/publication?or_facet_research_org=grid.8536.8&not_facet_year=2023&not_facet_year=2022&not_facet_year=2020&not_facet_year=2021&not_facet_year=2019&not_facet_year=2018&not_facet_year=2017&not_facet_year=2016&not_facet_year=2015)
- [2015-2022](https://app.dimensions.ai/discover/publication?or_facet_research_org=grid.8536.8&or_facet_year=2022&or_facet_year=2021&or_facet_year=2020&or_facet_year=2019&or_facet_year=2018&or_facet_year=2017&or_facet_year=2016&or_facet_year=2015)
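
The two links above differ only in their year facets, so they can be generated instead of assembled by hand. The sketch below rebuilds the 2015-2022 link with Python's standard library. The parameter names `or_facet_research_org` and `or_facet_year` are copied from the links above; the helper name `build_facet_url` is hypothetical, and we assume the Web App keeps accepting this URL scheme, which is not a documented interface.

```python
from urllib.parse import urlencode

BASE_URL = "https://app.dimensions.ai/discover/publication"


def build_facet_url(grid_id, years):
    """Hypothetical helper: build a Web App link restricted to the given years."""
    # One organization facet followed by one year facet per requested year
    params = [("or_facet_research_org", grid_id)]
    params += [("or_facet_year", year) for year in years]
    return f"{BASE_URL}?{urlencode(params)}"


# Years listed from 2022 down to 2015, matching the order used in the link above
url_2015_2022 = build_facet_url("grid.8536.8", range(2022, 2014, -1))
print(url_2015_2022)
```

The "Up to 2014" link uses `not_facet_year` instead, so the same helper would need a different parameter name to reproduce it.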

Those subsets were downloaded separately as `.csv` files on 11/02/2023.

## API

The [Bibliometrix R package](https://cran.r-project.org/web/packages/bibliometrix/index.html) parses Dimensions results that come directly from the Web App. Thus, downloading from the Web App was selected as the main approach for data retrieval.

However, this approach lacks some fields that can be retrieved through the [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/). Here, we will use the API to obtain an extra dataset containing all available publication fields from Dimensions, as it could prove useful during downstream analyses.

Resources:
    
- [Documentation](https://docs.dimensions.ai/dsl/)
- [API Lab](https://api-lab.dimensions.ai/)
- [API Lab (github repository)](https://github.com/digital-science/dimensions-api-lab)

This notebook will use the [dimcli library](https://github.com/digital-science/dimcli) (v.0.9.9.1) to access the Dimensions Analytics API.

Queries can be performed using:
- [Dimcli Magic Commands](https://api-lab.dimensions.ai/cookbooks/1-getting-started/4-Dimcli-magic-commands.html#Dimcli-%E2%80%98magic%E2%80%99-commands): useful in Jupyter notebooks

- Dimcli library methods, such as `.query()` and `.query_iterative()`. In this case, it's possible to create variables and pass them to the query through f-strings or any other string substitution method. It's also possible to [pass lists](https://api-lab.dimensions.ai/cookbooks/1-getting-started/6-Working-with-lists.html#3.-Making-a-list-from-the-results-of-a-query) to the queries using the `json.dumps()` function from the `json` module. We'll focus on this approach for querying.
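
As a minimal sketch of the list-passing approach, the snippet below only builds and prints a query string; it does not contact the API. `json.dumps()` serializes a Python list with double quotes and square brackets, which is the list syntax the query language expects after `in`. The second GRID id is a made-up placeholder for illustration, not a real organization.

```python
import json

# First id is UFRJ; the second is a hypothetical placeholder
grid_ids = ["grid.8536.8", "grid.0000.0"]

# json.dumps turns the list into '["grid.8536.8", "grid.0000.0"]'
query = f"""
search publications
where research_orgs.id in {json.dumps(grid_ids)}
return publications
limit 1
"""
print(query)
```

The resulting string could then be passed to `.query()` after logging in.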

### Initial setup

In [1]:
#Importing relevant libraries
import dimcli
import json
import pandas as pd
import seaborn as sns

In [6]:
#Checking dimcli version
dimcli.__version__

'0.9.9.1'

In [7]:
#Logging in via dsl.ini file - Reference: https://api-lab.dimensions.ai/cookbooks/1-getting-started/1-Using-the-Dimcli-library-to-query-the-API.html#More-secure-method:-storing-a-private-credentials-file
dimcli.login()

Searching config file credentials for default 'live' instance..

Dimcli - Dimensions API Client (v0.9.9.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.5
Method: dsl.ini file


In [8]:
#Alternatively, to log in by entering your API key, uncomment and run the following lines:
#import getpass
#api_key = getpass.getpass("Provide your dimensions API key here: ")
#dimcli.login(key=api_key, endpoint="https://app.dimensions.ai/api/dsl")

In [9]:
#Creating the dimcli.core object (necessary for querying)
dsl = dimcli.Dsl()
type(dsl)

dimcli.core.api.Dsl

### Querying Dimensions API

#### How to retrieve records?

Our objective is to retrieve all publications in Dimensions from the Federal University of Rio de Janeiro (UFRJ) - grid.8536.8.

To recover only UFRJ publications through the API, we can filter queries by [GRID identifier](https://www.grid.ac/) using the field `research_orgs.id`.

In [10]:
#Saving UFRJ's GRID id into a variable
ufrj_grid = 'grid.8536.8'

In [11]:
#Performing a simplified query (returning only 1 record and default fields) to count the number of records recovered

dsl.query(f"""
  search publications
  where research_orgs.id = {json.dumps(ufrj_grid)} and
  year < 2023
  return publications
  limit 1
""", verbose=False).count_total

89649

Everything seems to be okay with the query, since the number of results matches the one obtained from the Dimensions web app.

We want to retrieve as much information from Dimensions as possible. For that, we'll include all available Publication Fields in the return statement.

First, we'll build the return term of a query containing all fields listed in the [Dimensions documentation](https://docs.dimensions.ai/dsl/datasource-publications.html) (accessed 23-Jan-2023). The line that would drop the "abstract" field is left commented out in the function below, so "abstract" is part of the return term shown in the output.

In [12]:
def get_all_pub_fields(html, table_index):
    ''' Given the: 
    (i) location of a html (e.g. https://docs.dimensions.ai/dsl/datasource-publications.html) with a table of all Dimensions fields for the 'Publication' datasource; and
    (ii) index/position of said table (starting from 0);
    Get the return term of a query with all publication fields (i.e. "publications [field_1+field_2+...+field_n]")
    '''
    pubs_source = pd.read_html(html) #Parsing tables in html file
    field_df = pubs_source[table_index] #Selecting the table by index
    field_list = field_df["Field"].to_list() #Selecting only the 'Field' column and converting it to a list
    #field_list.remove('abstract') #The 'abstract' field isn't going to be particularly useful and will greatly increase filesize, so we're going to remove it
    #print(field_list)
    return f'publications [{"+".join(field_list)}]' #Getting the 'return term' for our queries with all publication fields

all_pub_fields = get_all_pub_fields('data/publications_datasource.html', 0)
print(all_pub_fields)

publications [abstract+acknowledgements+altmetric+altmetric_id+arxiv_id+authors+authors_count+book_doi+book_series_title+book_title+category_bra+category_for+category_for_2008+category_for_2020+category_hra+category_hrcs_hc+category_hrcs_rac+category_icrp_cso+category_icrp_ct+category_rcdc+category_sdg+category_uoa+clinical_trial_ids+concepts+concepts_scores+date+date_inserted+date_online+date_print+dimensions_url+doi+editors+field_citation_ratio+funder_countries+funders+funding_section+id+isbn+issn+issue+journal+journal_lists+journal_title_raw+linkout+mesh_terms+open_access+pages+pmcid+pmid+proceedings_title+publisher+recent_citations+reference_ids+referenced_pubs+relative_citation_ratio+research_org_cities+research_org_countries+research_org_country_names+research_org_names+research_org_state_codes+research_org_state_names+research_orgs+researchers+resulting_publication_doi+score+source_title+subtitles+supporting_grant_ids+times_cited+title+type+volume+year]


Finally, we want to download the data. However, Dimensions Analytics API requests are limited to 50000 records per query. To work around this, we will download the records year by year and concatenate the yearly dataframes into a single dataframe covering the whole dataset.

Since we want to retrieve data by year, we'll use a query to recover a list of the years in which UFRJ had at least one publication available in the Dimensions database.

In [13]:
def get_pubyears(org_id, min_year=0, max_year=2022):
    '''This function receives the org_id (GRID id) of a given organization, as well as
    a minimum and maximum value for delimiting the years of interest
    and returns a sorted list with the years for which the organization has publications in Dimensions
    '''
    pubyears =  dsl.query(f"""
      search publications
      where research_orgs.id = {json.dumps(org_id)} and
      year >= {min_year} and year <= {max_year}
      return year
      limit 1000
      """, verbose=False).as_dataframe().id.to_list()
    return sorted(pubyears)

In [14]:
pubyears = get_pubyears(org_id=ufrj_grid, min_year=0, max_year=2022)
print(pubyears)

[1926, 1941, 1948, 1949, 1951, 1952, 1954, 1955, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022]


After getting the years of interest, we'll download all the data in loops, obtaining the final dataframe containing the whole dataset.
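
Splitting by year only helps if no single year exceeds the 50000-record limit mentioned above. The sketch below shows one way to check that before downloading. The helper name `years_over_limit` and the counts are made up for illustration; real per-year counts would have to come from the API.

```python
MAX_RECORDS_PER_QUERY = 50_000  # limit stated in the text above


def years_over_limit(counts_by_year, limit=MAX_RECORDS_PER_QUERY):
    """Hypothetical helper: return the sorted years whose record count exceeds the limit."""
    return sorted(year for year, count in counts_by_year.items() if count > limit)


# Made-up counts, for illustration only
example_counts = {2020: 6200, 2021: 61000, 2022: 6000}
flagged = years_over_limit(example_counts)
print(flagged)
```

Any year returned by this check would need a finer split, for example by publication type, before its records could be retrieved completely.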

In [15]:
def get_dimensions_data(org_id, pubyears, return_term):
    '''This function splits your query in several sub-queries (data is split by year)
    to obtain a final dataframe containing all records.
    Parameter types:
    pubyears is a list of years (ints)
    org_id and return_term are strings
    Note: This only recovers 50k records/query, so it is not recommended when at least one year has more than 50k records
    '''
    final_df = pd.DataFrame() #Initializing variable with empty dataframe
    for year in pubyears:
        print(f"Getting records for year {year}...")
        year_df =  dsl.query_iterative(f"""
          search publications
          where research_orgs.id = {json.dumps(org_id)} and
          year = {year}
          return {return_term}
        """, limit=500, verbose=False).as_dataframe() #Returns ALL the data
        if final_df.empty: #If final_df is empty (first loop), it receives the first year_df
            final_df = year_df
        else: #Else, year_df is added to final_df
            final_df = pd.concat([final_df, year_df])
    return final_df

In [16]:
#Getting the data
dimensions_data = get_dimensions_data(org_id=ufrj_grid,
                                    pubyears=pubyears,
                                    return_term=all_pub_fields)

Getting records for year 1926...
Getting records for year 1941...
Getting records for year 1948...
Getting records for year 1949...
Getting records for year 1951...
Getting records for year 1952...
Getting records for year 1954...
Getting records for year 1955...
Getting records for year 1957...
Getting records for year 1958...
Getting records for year 1959...
Getting records for year 1960...
Getting records for year 1961...
Getting records for year 1962...
Getting records for year 1963...
Getting records for year 1964...
Getting records for year 1966...
Getting records for year 1967...
Getting records for year 1968...
Getting records for year 1969...
Getting records for year 1970...
Getting records for year 1971...
Getting records for year 1972...
Getting records for year 1973...
Getting records for year 1974...
Getting records for year 1975...
Getting records for year 1976...
Getting records for year 1977...
Getting records for year 1978...
Getting records for year 1979...
Getting re

In [17]:
#Viewing final dataframe
dimensions_data

Unnamed: 0,id,title,altmetric_id,authors,authors_count,concepts,concepts_scores,date,date_inserted,date_print,...,funder_countries,funders,funding_section,relative_citation_ratio,acknowledgements,supporting_grant_ids,arxiv_id,resulting_publication_doi,field_citation_ratio,clinical_trial_ids
0,pub.1107133025,Conferencias de therapeutica clinica,0,"[{'affiliations': [{'city': 'Rio de Janeiro', ...",1,"[Clinicas, Conferencia]","[{'concept': 'Clinicas', 'relevance': 0.11}, {...",1926-04-01,2018-09-24,1926-04,...,,,,,,,,,,
0,pub.1007820230,Observações sobre o conteúdo gastrico das aves...,0,"[{'affiliations': [{'city': 'Rio de Janeiro', ...",3,,,1941-01-01,2017-08-31,1941,...,,,,,,,,,,
0,pub.1069914961,ONTOGENETIC EVOLUTION IN FROGS,51964645,"[{'affiliations': [{'city': 'Rio de Janeiro', ...",1,"[ontogenetic evolution, frogs, evolution]","[{'concept': 'ontogenetic evolution', 'relevan...",1948-03-01,2017-08-31,1948-03,...,,,,,,,,,,
0,pub.1013447457,Anfíbios anuros da coleção Adolpho Lutz do Ins...,0,"[{'affiliations': [{'city': 'Rio de Janeiro', ...",1,,,1949-12-01,2017-08-31,1949-12,...,,,,,,,,,,
0,pub.1027023497,CIV.—A new genus of Hallodapini and two new sp...,0,"[{'affiliations': [{'city': None, 'city_id': N...",2,"[species, Distant, genus, new genus, new species]","[{'concept': 'species', 'relevance': 0.053}, {...",1951-11-01,2017-08-31,1951-11,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5760,pub.1137759711,Effect of replacement of milk by block freeze ...,0,"[{'affiliations': [{'city': 'Florianópolis', '...",12,"[air bubble diameter, Newtonian fluid behavior...","[{'concept': 'air bubble diameter', 'relevance...",2022-01-01,2021-05-06,,...,"[{'id': 'BR', 'name': 'Brazil'}]","[{'acronym': 'CAPES', 'city_name': 'Brasília',...",,,The authors are grateful to CNPq (National Cou...,,,,,
5761,pub.1137759710,Application of skimmed milk freeze concentrate...,0,"[{'affiliations': [{'city': 'Florianópolis', '...",12,"[rheological properties, physicochemical prope...","[{'concept': 'rheological properties', 'releva...",2022-01-01,2021-05-06,,...,"[{'id': 'BR', 'name': 'Brazil'}]","[{'acronym': 'CNPq', 'city_name': 'Brasília', ...",,,The authors are grateful to CAPES (Coordinatio...,,,,,
5762,pub.1137275960,Early trauma and schizophrenia onset: prelimin...,108377409,"[{'affiliations': [{'city': 'São Luís', 'city_...",10,[Early Trauma Inventory Self Report-Short Form...,[{'concept': 'Early Trauma Inventory Self Repo...,2022-01-01,2021-04-17,2022,...,"[{'id': 'BR', 'name': 'Brazil'}]","[{'acronym': 'CAPES', 'city_name': 'Brasília',...",,,,,,,,
5763,pub.1134919629,Fusarium and Fusariosis,0,"[{'affiliations': [{'city': 'São Paulo', 'city...",2,"[fatal outcome, disseminated skin lesions, pos...","[{'concept': 'fatal outcome', 'relevance': 0.5...",2022-01-01,2021-01-29,2022,...,,,,,The authors would like to thank Dr. Marcio Nas...,,,,,


In [18]:
#Saving the entire dataframe to a file
dimensions_data.to_csv('../../data/dimensions/dimensions_from_api.csv', index=None)

Note: Download date of both API and Web App data: 23-01-2023.

Record counts by date:
- 23-01-2023: 89592 records
- 11-02-2023: 89649 records

Since the number of records had almost no increase between 23-01-2023 and 11-02-2023 (time frame for data retrieval), we chose to keep this dataset instead of downloading it again.
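
The size of that increase can be worked out directly from the two counts reported above:

```python
records_jan_23 = 89592  # count on 23-01-2023
records_feb_11 = 89649  # count on 11-02-2023

new_records = records_feb_11 - records_jan_23
relative_increase = new_records / records_jan_23

print(f"{new_records} new records ({relative_increase:.3%} increase)")
```

This gives 57 new records, an increase of well under 0.1%.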