<a href="https://colab.research.google.com/github/digital-science/dimensions-api-lab/blob/master/3-workshops/2019-09-Rome-University-ISSI-conference/7-Indicators.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open Dimensions API Lab In Google Colab"/></a>

# 7. Indicators  

This notebook shows how to generate indicators programmatically from raw Dimensions API data, using Jupyter Notebooks.


In [1]:
import dimcli
from dimcli.shortcuts import dslquery, dslqueryall
import pandas as pd
from pandas.io.json import json_normalize
import plotly_express as px
from plotly.offline import init_notebook_mode # needed for exports 
init_notebook_mode(connected=True)

Dimensions data sources are mined for organization identifiers using GRID, the [Global Research Identifier Database](https://grid.ac/). We can therefore use GRID IDs to search across all sources in Dimensions.

The GRID ID for Rome La Sapienza is [grid.7841.a](https://grid.ac/institutes/grid.7841.a). 


In [2]:
GRIDID = "grid.7841.a"

## Example: the H-index

##### Background

> The [h-index](https://en.wikipedia.org/wiki/H-index) is an author-level metric that attempts to measure both the productivity and citation impact of the publications of a scientist or scholar. The index is based on the set of the scientist's most cited papers and the number of citations that they have received in other publications.

A more precise definition:

> The h-index is defined as the maximum value of h such that the given author/journal has published h papers that have each been cited at least h times.

How to calculate it:

> Formally, if f is the function that corresponds to the number of citations for each publication, we compute the h-index as follows. First we order the values of f from the largest to the lowest value. Then, we look for the last position in which f is greater than or equal to the position (we call h this position). For example, if we have a researcher with 5 publications A, B, C, D, and E with 10, 8, 5, 4, and 3 citations, respectively, the h-index is equal to 4 because the 4th publication has 4 citations and the 5th has only 3. In contrast, if the same publications have 25, 8, 5, 3, and 3 citations, then the index is 3 because the fourth paper has only 3 citations ([wikipedia](https://en.wikipedia.org/wiki/H-index))
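
The procedure quoted above can be restated compactly. This is only a sketch of the same definition: it assumes $N$ publications and writes $f(i)$ for the citation count of the publication at position $i$ once the counts are sorted from largest to smallest (the symbols $N$ and $i$ are introduced here for illustration).

$$
h = \max \{\, i \in \{1, \dots, N\} : f(i) \ge i \,\}
$$

with $h = 0$ when no position satisfies the condition. For the first example, $f = (10, 8, 5, 4, 3)$: $f(4) = 4 \ge 4$ but $f(5) = 3 < 5$, so $h = 4$. For the second, $f = (25, 8, 5, 3, 3)$: $f(3) = 5 \ge 3$ but $f(4) = 3 < 4$, so $h = 3$.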

#### Selecting a researcher

Let's take a researcher from the previous tutorial, e.g. [Shahram Rahatlou ur.013334067161.01](https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013334067161.01), and save the ID in a variable that can be referenced later.

> Try modifying the researcher ID below to get different results! 

In [3]:
RESEARCHER = "ur.013334067161.01"

#### Generic function for calculating the H-Index 

The H-Index function takes a list of citation counts, sorted from highest to lowest, and returns the h-index value as explained above: 

In [4]:
def the_H_function(sorted_citations_list, n=1):
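    """From a list of integers [n1, n2, ...] holding publication citation counts,
    sorted from highest to lowest, return the h-index: the largest 1-based
    list position p whose citation count is >= p.

    e.g.
    >>> the_H_function([10, 8, 5, 4, 3])
    4
    >>> the_H_function([25, 8, 5, 3, 3])
    3
    >>> the_H_function([1000, 20])
    2
    """
    # n is the 1-based position of the first remaining item in the original list
    if sorted_citations_list and sorted_citations_list[0] >= n:
        return the_H_function(sorted_citations_list[1:], n+1)
    else:
        # list exhausted, or this publication has fewer than n citations
        return n-1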

The H-index function is generic and can take any list of numbers representing publication citations, provided the list is sorted in descending order. 
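
As an alternative to the recursive function above, here is a minimal iterative sketch. It is not part of the original notebook: the name `h_index` is hypothetical, and unlike `the_H_function` it sorts its input itself, so the caller does not need to pre-sort the list.

```python
def h_index(citations):
    """Return the h-index for an iterable of citation counts, in any order."""
    # Rank publications from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            # The publication at this rank still has enough citations.
            h = position
        else:
            # Counts only decrease from here on, so no later rank can qualify.
            break
    return h


print(h_index([10, 8, 5, 4, 3]))  # 4
print(h_index([3, 25, 5, 8, 3]))  # 3 (same counts as the second example, unsorted)
```

Because it uses a loop rather than recursion, this version is not limited by Python's recursion depth for very high h-index values.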

### Input: researcher citation data from Dimensions

To pass real-world data to the H-Index function, we can use the Dimensions API to extract the citation counts of all publications by a researcher, like this: 

In [5]:
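def get_pubs_citations(researcher_id):
    """Return the times_cited values of a researcher's publications, most cited first."""
    # explicit 'desc' so the list is ordered as the_H_function expects
    q = """search publications where researchers.id = "{}" return publications[times_cited] sort by times_cited desc limit 1000"""
    pubs = dslquery(q.format(researcher_id))
    # publications without a times_cited value count as 0 citations
    return list(pubs.as_dataframe().fillna(0)['times_cited'])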

Finally, we combine the two functions to calculate the H-Index for a specific researcher:

In [6]:
print("H_index is:", the_H_function(get_pubs_citations(RESEARCHER)))

Returned Publications: 1000 (total = 1102)
H_index is: 94

Note that only 1000 of the 1102 publications are returned. Because the query sorts by `times_cited` with the most cited publications first, the omitted ones are the least cited and cannot change the result as long as the h-index is below 1000.


---
# Activities

* Try extracting the H-Index for a group of researchers you are interested in (e.g. `search researchers where ...`) 
* Try comparing researchers' total citation counts with their H-Index. How can this be visualized? What data do we need? 
* Think of other indicators that one could build using Dimensions data
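
As a possible starting point for the first two activities, the sketch below builds a small table of total citations and h-index per researcher. Everything in it is illustrative: the researcher labels and citation counts are made up, and `h_index` and `summarize` are hypothetical helper names. With real data, each list would come from a call such as `get_pubs_citations(researcher_id)`.

```python
def h_index(citations):
    """Return the h-index for an iterable of citation counts, in any order."""
    ranked = sorted(citations, reverse=True)
    # With counts sorted in descending order, the ranks where count >= rank
    # form a prefix of the list, so counting them yields the h-index.
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def summarize(citations_by_researcher):
    """Build one summary row (a dict) per researcher."""
    rows = []
    for researcher_id, citations in citations_by_researcher.items():
        rows.append({
            "researcher": researcher_id,
            "publications": len(citations),
            "total_citations": sum(citations),
            "h_index": h_index(citations),
        })
    return rows


# Made-up citation counts, for illustration only.
sample = {
    "researcher_a": [10, 8, 5, 4, 3],
    "researcher_b": [25, 8, 5, 3, 3],
    "researcher_c": [1000, 20],
}

for row in summarize(sample):
    print(row)
```

The resulting rows can be loaded with `pd.DataFrame(rows)` and plotted, for instance as a scatter plot of `total_citations` against `h_index`. In this sample, `researcher_c` has by far the most citations but the lowest h-index, which is the kind of contrast the comparison is meant to surface.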

---
# Want to learn more?

Check out the [Dimensions API Lab](https://digital-science.github.io/dimensions-api-lab/) website, which contains many tutorials and reusable Jupyter notebooks for scholarly data analytics. 