# Dataverse API Experiments - Part 1

An Interactive Notebook by Tiffany Chan (University of Victoria Libraries)
tjychan@uvic.ca

## Searching Scholars Portal (Search API)

We'll use the [search API](https://guides.dataverse.org/en/latest/api/search.html) to explore Scholars Portal with a simple search. Note that we're passing two parameters for our search:

1. **q**: this is the keyword we're searching. In our example, this is "trees."
2. **per_page**: the maximum number of results we want to return. If this is blank, Dataverse defaults to 10. In our example, this is 5. (Note: this is an integer and not a string).

The parameters are declared in a dictionary: "a list of key-value pairs within braces" (from the [Python documentation](https://docs.python.org/3/library/stdtypes.html#dict)). For example, the first pair consists of a key ('q') and a value of 'trees'.

In [None]:
search_params = {
    'q': 'trees',
    'per_page': 5 
}

Now we execute the search using Python's requests library.

In [111]:
import requests
import json

search_url = 'https://dataverse.scholarsportal.info/api/search'

# Get the first 5 results for a keyword search for "trees"
resp = requests.get(search_url, params=search_params)

# Print the search results with indents so it looks neater
results = resp.json()['data']['items']
print(json.dumps(results, indent=4))

See these results in your browser at https://dataverse.scholarsportal.info/api/search?q=trees&per_page=5
[
    {
        "name": "Data from: Time to get moving: assisted gene flow of forest trees",
        "type": "dataset",
        "url": "https://doi.org/10.5683/SP2/BQBI7S",
        "global_id": "doi:10.5683/SP2/BQBI7S",
        "description": "Abstract Geographic variation in trees has been investigated since the mid-18th century. Similar patterns of clinal variation have been observed along latitudinal and elevational gradients in common garden experiments for many temperate and boreal species. These studies convinced forest managers that a \u2018local is best\u2019 seed source policy was usually safest for reforestation. In recent decades, experimental design, phenotyping methods, climatic data and statistical analyses have improved greatly and refined but not radically changed knowledge of clines. The maintenance of local adaptation despite high gene flow suggests selection for l


You can see the same results in your browser at https://dataverse.scholarsportal.info/api/search?q=trees&per_page=5. Python simply gives us a way to chain together ("concatenate") different parameters into a search string that Dataverse understands, then store the response so we can do things with it later.
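To see what that concatenation looks like without sending a request, here is a minimal sketch using `urlencode` from Python's standard library. For simple parameters like ours it should produce the same query string that `requests` builds; the variable names `query_string`, `full_url`, and `filter_string` are only for illustration.

```python
from urllib.parse import urlencode

search_url = 'https://dataverse.scholarsportal.info/api/search'
search_params = {
    'q': 'trees',
    'per_page': 5
}

# urlencode joins each key=value pair with '&' and escapes special characters
query_string = urlencode(search_params)
full_url = f'{search_url}?{query_string}'
print(full_url)

# A value containing a colon, quotes, and spaces is escaped automatically
filter_string = urlencode({'fq': 'subject_ss:"Arts and Humanities"'})
print(filter_string)
```

The first line printed should match the browser URL above. The second shows why it is easier to let Python do the escaping than to type `%3A` and `%22` by hand.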

If you're using Chrome, you can download the [JSON Formatter extension](https://chrome.google.com/webstore/detail/json-formatter/bcjindcccaagfpapjjmafapmmgkkhgoa/related?hl=en) to see a prettier version of the JSON response.
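Another way to make the response easier to read is to print only the fields you care about. The sketch below assumes each item is a dictionary with `name`, `type`, and `url` keys, as in the output above. `summarize_items` is a hypothetical helper, and `sample_items` holds one item copied (and trimmed) from that output so the example runs without a network connection.

```python
# One item copied from the search output above, trimmed to four fields
sample_items = [
    {
        'name': 'Data from: Time to get moving: assisted gene flow of forest trees',
        'type': 'dataset',
        'url': 'https://doi.org/10.5683/SP2/BQBI7S',
        'global_id': 'doi:10.5683/SP2/BQBI7S',
    }
]

def summarize_items(items):
    """Return a (name, type, url) tuple for each search result."""
    # .get() returns None instead of raising an error if a key is missing
    return [(item.get('name'), item.get('type'), item.get('url')) for item in items]

for name, item_type, url in summarize_items(sample_items):
    print(f'{item_type}: {name} ({url})')
```

In the notebook, you could pass `results` to the helper in place of `sample_items`.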

### Try it Yourself

There are many other search parameters we can use. The full list is available on the [Search API](https://guides.dataverse.org/en/latest/api/search.html) page. To test your understanding, here is a list of searches that you can execute with Python.

Using the search API as a reference, try editing the parameters in the code block below and then rerunning it to see the results. The first is done for you as an example.

1. Datasets with the keyword "climate" and a subject of Arts and Humanities, sorted alphabetically by title.
2. Results with a keyword term of Cheese Factories by Army Survey Establishment (author). (**Hint**: Adding `'show_facets': 'true'` to a request will help you see what filters (`fq`) are possible.)
3. All tabular data files (\*.tab) uploaded between 2020 and 2022 to the [UVic Research Data Collection](https://dataverse.scholarsportal.info/dataverse/uvic-research), sorted reverse chronologically (i.e. most recent first). (**Hint**: You'll need the `subtree` parameter.)

In [463]:
search_params = {
    'q': 'climate',
    'type': 'dataset',
    'sort': 'name',
    'order': 'asc',
    'fq': 'subject_ss:"Arts and Humanities"' # Note the double quotes around "Arts and Humanities"
}

resp = requests.get(search_url, params=search_params)
print(resp.url)
results = resp.json()['data']['items']
print(json.dumps(results, indent=2))

https://dataverse.scholarsportal.info/api/search?q=climate&type=dataset&sort=name&order=asc&fq=subject_ss%3A%22Arts+and+Humanities%22
[
  {
    "name": "#climatemarch tweets April 19-May 3, 2017",
    "type": "dataset",
    "url": "https://doi.org/10.5683/SP/KZZVZW",
    "global_id": "doi:10.5683/SP/KZZVZW",
    "description": "681,668 tweet ids for #climate collected with Documenting the Now's twarc from January 22-26, 2017. Tweets can be \u201crehydrated\u201d with Documenting the Now\u2019s twarc (https://github.com/DocNow/twarc). twarc.py hydrate climatemarch_tweet_ids.txt > climatemarch.json.",
    "published_at": "2017-05-04T02:18:51Z",
    "publisher": "Web Archives for Historical Research Group Dataverse",
    "citationHtml": "Ruest, Nick, 2017, \"#climatemarch tweets April 19-May 3, 2017\", <a href=\"https://doi.org/10.5683/SP/KZZVZW\" target=\"_blank\">https://doi.org/10.5683/SP/KZZVZW</a>, Scholars Portal Dataverse, V1",
    "identifier_of_dataverse": "wahr",
    "name_of_da

## Working with Download Statistics (Metrics API)

Let's generate a graph of the top 5 most downloaded datasets in the University of Victoria (UVic) Dataverse, using Dataverse's documentation on [the Metrics API](https://guides.dataverse.org/en/latest/api/metrics.html) for reference.

According to the docs, we can use the url [/api/info/metrics/uniquedownloads](https://dataverse.scholarsportal.info/api/info/metrics/uniquedownloads) to get the number of unique downloads:

>The use case for this metric (uniquedownloads) is to more fairly assess which datasets are getting downloaded/used by only counting each users who downloads any file from a dataset as one count (versus downloads of multiple files or repeat downloads counting as multiple counts which adds a bias for large datasets and/or use patterns where a file is accessed repeatedly for new analyses)

Scholars Portal seems to work a bit differently than described in the Dataverse documentation. For example, the unique downloads are only available as a CSV attachment/file (i.e. not JSON). The [/api/info/metrics/filedownloads](https://dataverse.scholarsportal.info/api/info/metrics/filedownloads) URL returns an empty string so we can't use that either.

Here is the pseudocode (a plain-language recipe or list of steps) for the code that will generate the graph:
1. **Download the CSV with download statistics.** The CSV has two columns: one with the dataset's persistent identifier (PID) and one with the number of downloads (expressed as an integer or whole number).
2. **Use the PID to get the title of the dataset (by making a request to the API).**
3. **Generate a simple bar graph with this information.**

In [461]:
# Get and print the CSV from the Metrics API
dataverse_alias = 'uvic'
text_response = requests.get(f'https://dataverse.scholarsportal.info/api/info/metrics/uniquedownloads?parentAlias={dataverse_alias}').text
print(text_response)

pid,count
"doi:10.5683/SP2/GCTYCU",34
"doi:10.5683/SP2/7UOOVR",33
"doi:10.5683/SP2/GDIZRV",32
"doi:10.5683/SP2/AAGZDG",17
"doi:10.5683/SP3/VEIBVL",17
"doi:10.5683/SP2/3BI57S",13
"doi:10.5683/SP2/MRHP4Y",12
"doi:10.5683/SP2/J1H3U4",11
"doi:10.5683/SP2/5AMBHV",10
"doi:10.5683/SP2/OOVOQR",10
"doi:10.5683/SP2/RQR8LN",8
"doi:10.5683/SP2/NHAYFN",5
"doi:10.5683/SP2/8KUABH",5
"doi:10.5683/SP2/SVALAK",4
"doi:10.5683/SP2/YXUAIC",4
"doi:10.5683/SP2/GKJPIQ",4
"doi:10.5683/SP2/URURKC",4
"doi:10.5683/SP2/1L8NKY",4
"doi:10.5683/SP2/D7I7CC",3
"doi:10.5683/SP2/NTRNB7",3
"doi:10.5683/SP2/KFIH8X",3
"doi:10.5683/SP2/RFWOJD",3
"doi:10.5683/SP2/SPOAWK",3
"doi:10.5683/SP2/DLGXYO",3
"doi:10.5683/SP2/5LJYXO",3
"doi:10.5683/SP2/P58E1E",3
"doi:10.5683/SP2/3RUIQK",3
"doi:10.5683/SP2/6BATWK",2
"doi:10.5683/SP2/TTJNIU",2
"doi:10.5683/SP2/KS3BLD",2
"doi:10.5683/SP2/ZXZQNN",2
"doi:10.5683/SP2/KD94VW",2
"doi:10.5683/SP2/NBG11U",2
"doi:10.5683/SP2/FPFKUN",2
"doi:10.5683/SP2/TZCHKE",2
"doi:10.5683/SP2/2EGZVX",2
"doi:10.

The above script will work with any dataverse. To try it with a different dataverse, go to that dataverse's page and edit the `dataverse_alias` to be whatever follows dataverse/ in the URL. For example, to use the Queen's University Dataverse (https://dataverse.scholarsportal.info/dataverse/queens), edit the `dataverse_alias = 'uvic'` to be `dataverse_alias = 'queens'` instead.

This also works for nested dataverses, e.g. [Ocean Networks Canada](https://dataverse.scholarsportal.info/dataverse/oceannetworkscanada) (`oceannetworkscanada`), which is within the UVic Dataverse.
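Once you have the CSV text, picking out the most downloaded datasets is a matter of parsing and sorting. Here is a minimal sketch, assuming the two-column `pid,count` layout shown above. `top_downloads` is a hypothetical helper, and `sample_csv` holds the first few rows of the output above so the example runs offline. The API's CSV appears to be sorted already, but sorting explicitly avoids relying on that.

```python
import csv

# First rows of the CSV printed above; counts will differ when you rerun the request
sample_csv = '''pid,count
"doi:10.5683/SP2/GCTYCU",34
"doi:10.5683/SP2/7UOOVR",33
"doi:10.5683/SP2/GDIZRV",32
"doi:10.5683/SP2/AAGZDG",17
"doi:10.5683/SP3/VEIBVL",17
"doi:10.5683/SP2/3BI57S",13
'''

def top_downloads(csv_text, n=5):
    """Return the n (pid, count) pairs with the highest counts."""
    rows = csv.DictReader(csv_text.splitlines())
    # Convert counts from strings to integers so they sort numerically
    pairs = [(row['pid'], int(row['count'])) for row in rows]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]

for pid, count in top_downloads(sample_csv):
    print(pid, count)
```

In the notebook, you could call `top_downloads(text_response)` on the live response instead of the sample.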

### Advanced: Chaining API calls together

It would be nice to know the titles of the most downloaded datasets as well as the DOIs. Looking at [the documentation](https://guides.dataverse.org/en/latest/api/native-api.html) ("Get JSON Representation of a Dataset"), we can make a second API call to Dataverse to get the title of a dataset, given that we know its DOI or PID.

In [462]:
import csv

# Parse the CSV from above into a list of dictionaries (one per row)
rows = list(csv.DictReader(text_response.splitlines(), delimiter=','))
headers = ['DOI', '# of downloads','title']
print('{:<22}  {:<5}  {:<20}'.format(*headers))
# Print the first 10 results with their titles
for row in rows[0:10]:
    pid = row['pid']
    # Make a request to find the title
    # (see "Get JSON Representation of a Dataset" at the link below)
    # https://guides.dataverse.org/en/latest/api/native-api.html
    params = {'persistentId': pid}
    resp = requests.get('https://dataverse.scholarsportal.info/api/datasets/:persistentId/', params=params).json()
    # This assumes the title is the first field in the citation metadata block
    title = resp['data']['latestVersion']['metadataBlocks']['citation']['fields'][0]['value']
    # Shorten titles if they are 10 words or longer
    if len(title.split(' ')) > 9:
        row['title'] = ' '.join(title.split(' ')[0:9] + ['...'])
    else:
        row['title'] = title
    print('{:<28}  {:<8}  {:<15}'.format(*row.values()))

DOI                     # of downloads  title               
doi:10.5683/SP2/GCTYCU        34        2018 Canadian Physician Survey
doi:10.5683/SP2/7UOOVR        33        Ten years (2006-2016) of oceanographic temperature, salinity, pressure, density ...
doi:10.5683/SP2/GDIZRV        32        2017 National Survey of Canadian Nurses: Use of Digital ...
doi:10.5683/SP2/AAGZDG        17        Identifying deep sea fish from videos of a benthic ...
doi:10.5683/SP3/VEIBVL        17        University Student Mental Health [Student_Mental_Health_2021-10-10]
doi:10.5683/SP2/3BI57S        13        2016 National Survey of Community-Based Pharmacists: Use of Digital ...
doi:10.5683/SP2/MRHP4Y        12        2014 National Survey of Canadian Nurses
doi:10.5683/SP2/J1H3U4        11        2014 National Survey of Canadian Community Pharmacists: Use of ...
doi:10.5683/SP2/5AMBHV        10        Six years (2009-2015) of oceanographic temperature, salinity, pressure, density ...
doi:10.5683/SP2/OO

Note that these results are current as of the time you call the API, so you might get a different result if you run the same script tomorrow!
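The loop above relies on the title being the first entry in the citation fields. A slightly more defensive sketch is shown below. It assumes each field carries a `typeName` key, which is how I understand the dataset JSON to be structured; check a real response before relying on it. `find_title` and `shorten` are hypothetical helpers, and `sample_response` is a made-up, heavily trimmed stand-in for a real response.

```python
def find_title(dataset_json):
    """Return the value of the citation field named 'title', or None if absent."""
    fields = dataset_json['data']['latestVersion']['metadataBlocks']['citation']['fields']
    for field in fields:
        # Assumption: each field has a 'typeName' key identifying it
        if field.get('typeName') == 'title':
            return field['value']
    return None

def shorten(title, max_words=9):
    """Keep the first max_words words and append '...' if the title is longer."""
    words = title.split(' ')
    if len(words) > max_words:
        return ' '.join(words[:max_words] + ['...'])
    return title

# Hypothetical response: the 'subject' entry is invented to show that field order no longer matters
sample_response = {
    'data': {'latestVersion': {'metadataBlocks': {'citation': {'fields': [
        {'typeName': 'subject', 'value': ['Other']},
        {'typeName': 'title', 'value': '2018 Canadian Physician Survey'},
    ]}}}}
}

print(shorten(find_title(sample_response)))
```

Inside the loop, `title = find_title(resp)` could stand in for the long chain of keys, at the cost of having to handle a `None` result.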