# Consortium Organisation DOI usage report

With the introduction of Consortium Members, it is no longer easy to get a DOI usage report that includes only our partners, because https://stats.datacite.org/ does not allow filtering by that level of the hierarchy. Thankfully, all the information on the stats website is also available via the API, so we can build our own report.

## Contents

- [Setup](#Setup)
- [Fetch statistics](#Fetch-statistics)
- [Selecting Consortium Organisations](#Selecting-Consortium-Organisations)
- [Final report](#Final-report)
- [Save the results](#Save-the-results)

## Setup

First we import the Python modules we'll need:

In [1]:
import httpx
import pandas as pd
import asyncio

We increase the number of rows that Pandas displays, so that we can see all the Consortium Organisations at once when we show the final table (we'll use `DataFrame.head` when we just want to check an intermediate result).

In [2]:
pd.set_option('display.max_rows', 120)

Configure the main parameters for the rest of the notebook. `API_BASE` can be changed to use the test system instead of production; `CONSORTIUM_ID` can be changed to another consortium (lowercase).

In [3]:
API_BASE = "https://api.datacite.org"
# API_BASE = "https://api.test.datacite.org"
CONSORTIUM_ID = "blco"

Configure the HTTP client we'll use. Reusing a single client means we only have to set our options once, and it also enables optimisations such as connection pooling that we don't need to manage ourselves. Using `AsyncClient` here (instead of plain `Client`) will be useful later when we need to make a large number of similar API calls in parallel, but it means we need to use `await` each time we make an HTTP request. We increase the read timeout because DataCite API calls can take about 10s to complete, and we reduce the number of simultaneous connections so that we don't overload the service (if we send too many requests at once, we may be blocked to protect the service).

In [4]:
client = httpx.AsyncClient(
    timeout=httpx.Timeout(5.0, read=60.0), 
    limits=httpx.Limits(max_keepalive_connections=5, max_connections=10),
)
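
To illustrate what a cap on simultaneous connections achieves, here is a minimal sketch that uses only the standard library. It uses `asyncio.Semaphore` as a stand-in for the client's connection pool (it is not how `httpx` implements its limits), and `MAX_CONCURRENT`, `worker` and the sleep-based "request" are all made up for the illustration.

```python
import asyncio

MAX_CONCURRENT = 3  # hypothetical cap, playing the role of max_connections


async def worker(sem, state):
    # Only MAX_CONCURRENT workers can be inside this block at the same time
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stands in for a slow API call
        state["active"] -= 1


async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"active": 0, "peak": 0}
    # Start ten "requests" at once; the semaphore queues the excess
    await asyncio.gather(*(worker(sem, state) for _ in range(10)))
    return state["peak"]


peak = asyncio.run(main())
print(peak)
```

Inside a notebook, where an event loop is already running, `await main()` would be used instead of `asyncio.run(main())`.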

## Fetch statistics

We can fetch *all* the statistics with a single API call, and use `pandas.json_normalize` to convert the JSON response into a `DataFrame`. The endpoint to use is `/providers/totals`, because the API still refers to Direct Member and Consortium Organisation accounts as "providers". We use `DataFrame.set_index` to make the provider ID the index, replacing the sequential numbers generated by default.

In [5]:
r = await client.get(f"{API_BASE}/providers/totals")
r.raise_for_status()
totals_data = r.json()
totals = pd.json_normalize(totals_data) \
    .set_index('id')
totals.head()

| id | title | count | states | temporal.this_month | temporal.this_year | temporal.last_year | temporal.two_years_ago |
|---|---|---|---|---|---|---|---|
| cern | CERN - European Organization for Nuclear Research | 2876193 | [{'id': 'findable', 'title': 'Findable', 'coun... | [{'id': '2021', 'title': '2021', 'count': 2635}] | [{'id': '2021', 'title': '2021', 'count': 5677... | [{'id': '2020', 'title': '2020', 'count': 6554... | [{'id': '2019', 'title': '2019', 'count': 8333... |
| figshare | figshare | 2334292 | [{'id': 'findable', 'title': 'Findable', 'coun... | [{'id': '2021', 'title': '2021', 'count': 2101}] | [{'id': '2021', 'title': '2021', 'count': 3438... | [{'id': '2020', 'title': '2020', 'count': 4497... | [{'id': '2019', 'title': '2019', 'count': 3862... |
| tawj | University of Tartu | 1930980 | [{'id': 'findable', 'title': 'Findable', 'coun... | [{'id': '2021', 'title': '2021', 'count': 25605}] | [{'id': '2021', 'title': '2021', 'count': 1218... | [{'id': '2020', 'title': '2020', 'count': 21431}] | [{'id': '2019', 'title': '2019', 'count': 11704}] |
| stdp | ETH Zurich | 1906730 | [{'id': 'findable', 'title': 'Findable', 'coun... | [{'id': '2021', 'title': '2021', 'count': 485}] | [{'id': '2021', 'title': '2021', 'count': 1195... | [{'id': '2020', 'title': '2020', 'count': 1182... | [{'id': '2019', 'title': '2019', 'count': 1814... |
| sage | SAGE Publishing | 1793436 | [{'id': 'registered', 'title': 'Registered', '... | [{'id': '2021', 'title': '2021', 'count': 20}] | [{'id': '2021', 'title': '2021', 'count': 81923}] | [{'id': '2020', 'title': '2020', 'count': 12728}] | [{'id': '2019', 'title': '2019', 'count': 13019}] |
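
The shape of this table follows from how `pandas.json_normalize` treats nesting: nested dictionaries are flattened into dotted column names, while lists are left in place as single cells. The sketch below shows this on one made-up record; the ID `aaaa`, the title and all the counts are invented, and only the key names mirror the output above.

```python
import pandas as pd

# Hypothetical record shaped like one entry of the /providers/totals response
sample = [
    {
        "id": "aaaa",
        "title": "Example Org",
        "count": 3,
        "states": [{"id": "findable", "title": "Findable", "count": 3}],
        "temporal": {
            "this_year": [{"id": "2021", "title": "2021", "count": 2}],
            "last_year": [{"id": "2020", "title": "2020", "count": 1}],
        },
    }
]

df = pd.json_normalize(sample).set_index("id")

# The nested "temporal" dict becomes dotted columns; the lists stay as lists
print(list(df.columns))
print(type(df.loc["aaaa", "states"]))
```

Because the counts are still buried inside lists of dictionaries, they cannot be sorted or summed until they are pulled out into columns of their own.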


Because of the structure of the returned JSON, we need to do a bit of extra work to extract the counts into columns on their own. We do this with `DataFrame.apply`, which runs a function (that we've defined) on each row of the dataset:

In [6]:
def extract_counts(row):
    result = {}
    for st in row.states:
        result[st['id']] = st['count']
    for period in ['this_year', 'last_year', 'two_years_ago']:
        data = row['temporal.' + period][0]
        result['count.' + data['id']] = data['count']
       
    return result

totals = totals[['title', 'count']].join(totals.apply(extract_counts, axis=1, result_type='expand'))
totals.head()

| id | title | count | findable | registered | count.2021 | count.2020 | count.2019 |
|---|---|---|---|---|---|---|---|
| cern | CERN - European Organization for Nuclear Research | 2876193 | 2438006.0 | 438187.0 | 567784.0 | 655451.0 | 833365.0 |
| figshare | figshare | 2334292 | 2197806.0 | 136486.0 | 343853.0 | 449740.0 | 386275.0 |
| tawj | University of Tartu | 1930980 | 1930964.0 | 16.0 | 1218803.0 | 21431.0 | 11704.0 |
| stdp | ETH Zurich | 1906730 | 1906664.0 | 66.0 | 119565.0 | 118294.0 | 181454.0 |
| sage | SAGE Publishing | 1793436 | 333716.0 | 1459720.0 | 81923.0 | 12728.0 | 13019.0 |
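
To see what the extraction does to a single record, here is a plain-Python sketch of the same logic working on a dictionary rather than a `DataFrame` row. The function name `extract_counts_from_record` and the sample values are made up, and the guard against an empty period list is an extra assumption rather than something the API is known to need.

```python
def extract_counts_from_record(record):
    """Flatten one provider record (a plain dict) into {column name: count}."""
    result = {}
    # One column per DOI state, e.g. "findable" or "registered"
    for state in record["states"]:
        result[state["id"]] = state["count"]
    # One column per reporting period, named after the year it covers
    for period in ["this_year", "last_year", "two_years_ago"]:
        entries = record["temporal"].get(period, [])
        if entries:  # assumption: skip a period that has no entry
            result["count." + entries[0]["id"]] = entries[0]["count"]
    return result


# Hypothetical record; all values are invented for illustration
record = {
    "states": [
        {"id": "findable", "title": "Findable", "count": 10},
        {"id": "registered", "title": "Registered", "count": 2},
    ],
    "temporal": {
        "this_year": [{"id": "2021", "title": "2021", "count": 5}],
        "last_year": [{"id": "2020", "title": "2020", "count": 4}],
        "two_years_ago": [],
    },
}

counts = extract_counts_from_record(record)
print(counts)
```

When different records produce different keys, the missing cells become `NaN` once the results are combined into a `DataFrame`, which is most likely why the count columns above are shown as floats.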


That's much better! We still have *all* the providers though, so now we need to select only those in our consortium.

## Selecting Consortium Organisations

The Consortium Organisations (COs) under a given Consortium Lead are available as part of the Lead's provider record, so we fetch that record and parse the JSON response:

In [7]:
r = await client.get(f"{API_BASE}/providers/{CONSORTIUM_ID}")
r.raise_for_status()
provider_data = r.json()

Now we make a Python set containing only the IDs of the Consortium Organisations, and use that to select only the totals we're interested in.

In [8]:
co_ids = {x['id'] for x in provider_data['data']['relationships']['consortiumOrganizations']['data']}
co_totals = totals[totals.index.isin(co_ids)]
co_totals.head()

| id | title | count | findable | registered | count.2021 | count.2020 | count.2019 |
|---|---|---|---|---|---|---|---|
| urks | Imperial College London | 214857 | 209547.0 | 5310.0 | 1465.0 | 2813.0 | 8254.0 |
| mqkk | Archaeology Data Service | 87799 | 87778.0 | 21.0 | 4546.0 | 25822.0 | 7430.0 |
| lpsw | University of Cambridge | 71693 | 71569.0 | 124.0 | 11851.0 | 15272.0 | 12456.0 |
| jhwt | Digital Repository of Ireland | 54062 | 54061.0 | 1.0 | 2681.0 | 17495.0 | 5676.0 |
| duqf | Science and Technology Facilities Council | 18750 | 18746.0 | 4.0 | 914.0 | 2110.0 | 5796.0 |


The `/providers/totals` endpoint only returns a row for providers that have at least one DOI, but we also want to see which of our COs have 0 DOIs, so we will need to add these manually; reindexing the dataset with the full list of CO identifiers will add blank rows for these.

In [9]:
co_totals = co_totals.reindex(co_ids)
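
The combined effect of selecting with `isin` and then reindexing can be seen on a small made-up dataset. In this sketch the IDs, titles and counts are all hypothetical, and `sorted` is used only to make the row order predictable, since a Python set has no defined order.

```python
import pandas as pd

# Hypothetical totals: three providers that each have at least one DOI
totals_sample = pd.DataFrame(
    {"title": ["Org A", "Org B", "Org C"], "count": [10, 5, 1]},
    index=pd.Index(["aaaa", "bbbb", "cccc"], name="id"),
)

# Hypothetical consortium: "dddd" has no DOIs, so it has no row in the totals
member_ids = {"aaaa", "cccc", "dddd"}

# Step 1: keep only the rows belonging to consortium members
selected = totals_sample[totals_sample.index.isin(member_ids)]

# Step 2: reindex with every member ID, which adds an all-NaN row for "dddd"
padded = selected.reindex(sorted(member_ids))
print(padded)
```

The row for `dddd` has no title and no count yet, which is exactly the gap the next steps fill in.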

We only have the IDs for these, so we have to make a separate API call for each to fetch its name. If a lookup does not return a successful response, we fall back to the placeholder title `<deleted>`. We will use Python's `async`/`await` support to run many of these calls in parallel: we build a list of the tasks, and then use `asyncio.gather` to handle actually running them and collecting the results.

In [10]:
async def get_title(org_id, client):
    title = '<deleted>'
    r = await client.get(f"{API_BASE}/providers/{org_id}")
    if r.status_code == httpx.codes.OK:
        data = r.json()
        title = data['data']['attributes']['name']
    return {'id': org_id, 'title': title}

get_title_tasks = [get_title(org_id, client)
                   for org_id
                   in co_totals[co_totals.title.isna()].index]
extra_names = await asyncio.gather(*get_title_tasks)

extra_names = pd.DataFrame(extra_names).set_index('id').title
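
The build-a-list-then-gather pattern can be tried without touching the network. The sketch below swaps the API call for a made-up coroutine, `fake_get_title`, backed by a hypothetical lookup table, so only the concurrency pattern itself is real.

```python
import asyncio

# Hypothetical lookup table standing in for the provider API
FAKE_NAMES = {"aaaa": "Org A", "bbbb": "Org B"}


async def fake_get_title(org_id):
    await asyncio.sleep(0.01)  # simulate network latency
    # Unknown IDs get the same placeholder title as in the real function
    return {"id": org_id, "title": FAKE_NAMES.get(org_id, "<deleted>")}


async def main():
    # Calling the coroutine function only creates the tasks; nothing runs yet
    tasks = [fake_get_title(org_id) for org_id in ["aaaa", "bbbb", "zzzz"]]
    # gather runs them concurrently and returns results in the order given
    return await asyncio.gather(*tasks)


results = asyncio.run(main())
print(results)
```

`asyncio.gather` returns the results in the same order as the tasks were passed in, regardless of which finishes first. In a notebook, `await main()` replaces `asyncio.run(main())` because an event loop is already running.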

Now we can update our dataset with these extra names, and finally fill the remaining blank cells with 0 to give our final report.

In [11]:
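# Assign the result back instead of updating the column in place,
# so that this also works when pandas Copy-on-Write is enabled
co_totals['title'] = co_totals['title'].fillna(extra_names)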
co_totals = co_totals.fillna(0)
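
Here is a small sketch of the two-step fill on made-up data. It fills the missing title from a `Series` of names, which pandas matches up by index label, and assigns the result back to the column; the IDs and names are hypothetical.

```python
import pandas as pd

# Hypothetical report with one organisation that has no DOIs yet
report = pd.DataFrame(
    {"title": ["Org A", None], "count": [10.0, None]},
    index=pd.Index(["aaaa", "dddd"], name="id"),
)

# Hypothetical names fetched separately, keyed by organisation ID
names = pd.Series({"dddd": "Org D"}, name="title")

# Step 1: fill missing titles by index label, leaving existing titles alone
report["title"] = report["title"].fillna(names)

# Step 2: any cell that is still blank is a count, so it becomes 0
report = report.fillna(0)
print(report)
```

The order of the two steps matters: filling with 0 first would overwrite the blank titles before the real names could be applied.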

## Final report

In [12]:
co_totals

| id | title | count | findable | registered | count.2021 | count.2020 | count.2019 |
|---|---|---|---|---|---|---|---|
| zzsa | The Open University | 7001.0 | 6916.0 | 85.0 | 318.0 | 6438.0 | 112.0 |
| cdcx | SOAS University of London | 2864.0 | 2864.0 | 0.0 | 161.0 | 2663.0 | 0.0 |
| qsdg | University of Edinburgh | 4273.0 | 4269.0 | 4.0 | 522.0 | 1017.0 | 282.0 |
| urks | Imperial College London | 214857.0 | 209547.0 | 5310.0 | 1465.0 | 2813.0 | 8254.0 |
| hbvj | University of Aberystwyth | 48.0 | 43.0 | 5.0 | 3.0 | 9.0 | 15.0 |
| hugz | University of Aberdeen | 292.0 | 291.0 | 1.0 | 33.0 | 104.0 | 129.0 |
| huwv | Marine Biological Association | 2178.0 | 2170.0 | 8.0 | 28.0 | 70.0 | 1398.0 |
| tnae | Oxford Brookes University | 1318.0 | 1317.0 | 1.0 | 197.0 | 226.0 | 209.0 |
| nmku | Lincoln repository | 17.0 | 17.0 | 0.0 | 5.0 | 3.0 | 9.0 |
| mbye | Sheffield Hallam University | 40.0 | 39.0 | 1.0 | 4.0 | 2.0 | 9.0 |


## Save the results

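As a final step, we'll export the results as an Excel spreadsheet for use elsewhere. Writing `.xlsx` files with `DataFrame.to_excel` needs an Excel writer engine such as `openpyxl` to be installed. The filename includes the consortium ID and today's date: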

In [13]:
import datetime as dt

today = dt.datetime.now()
co_totals.to_excel(f'{CONSORTIUM_ID}-totals-{today:%Y%m%d}.xlsx')
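
The date in the filename comes from a format specification inside the f-string, which is applied to the `datetime` object in the same way as `strftime`. A quick sketch with a fixed, made-up date shows the resulting name:

```python
import datetime as dt

consortium_id = "blco"
fixed_date = dt.datetime(2021, 3, 5)  # hypothetical date, fixed for the example

# %Y%m%d gives a zero-padded year, month and day with no separators
filename = f"{consortium_id}-totals-{fixed_date:%Y%m%d}.xlsx"
print(filename)  # blco-totals-20210305.xlsx
```

Putting the year first means that files from successive runs sort chronologically by name.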