<a href="https://colab.research.google.com/github/digital-science/dimensions-api-lab/blob/master/2-sample-applications/Calculating-Indicators/Calculating-the-H-Index-of-a-researcher.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open Dimensions API Lab In Google Colab"/></a>

# Calculating the H-index of a researcher

This notebook shows how to use Python and the [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/) to calculate the H-index of a researcher. 

#### Background

> The [h-index](https://en.wikipedia.org/wiki/H-index) is an author-level metric that attempts to measure both the productivity and citation impact of the publications of a scientist or scholar. The index is based on the set of the scientist's most cited papers and the number of citations that they have received in other publications.

A more precise definition:

> The h-index is defined as the maximum value of h such that the given author/journal has published h papers that have each been cited at least h times.

How to calculate it:

> Formally, if f is the function that corresponds to the number of citations for each publication, we compute the h-index as follows. First we order the values of f from the largest to the lowest value. Then, we look for the last position in which f is greater than or equal to the position (we call h this position). For example, if we have a researcher with 5 publications A, B, C, D, and E with 10, 8, 5, 4, and 3 citations, respectively, the h-index is equal to 4 because the 4th publication has 4 citations and the 5th has only 3. In contrast, if the same publications have 25, 8, 5, 3, and 3 citations, then the index is 3 because the fourth paper has only 3 citations ([Wikipedia](https://en.wikipedia.org/wiki/H-index)).
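
The two examples quoted above can be traced position by position. The following is a minimal sketch, not part of the notebook's own code: `h_index_trace` is a hypothetical helper name introduced only for illustration. It sorts the citation counts from largest to smallest and marks each position whose citation count is at least as large as the position itself.

```python
def h_index_trace(citations):
    """Return (h, rows), where rows lists (position, citations, qualifies)."""
    ranked = sorted(citations, reverse=True)  # largest to smallest
    rows = [(pos, c, c >= pos) for pos, c in enumerate(ranked, start=1)]
    # Because ranked never increases while the position always does,
    # the qualifying positions form a prefix, so counting them gives h.
    h = sum(1 for _, _, qualifies in rows if qualifies)
    return h, rows


for example in ([10, 8, 5, 4, 3], [25, 8, 5, 3, 3]):
    h, rows = h_index_trace(example)
    for pos, c, qualifies in rows:
        print(f"position {pos}: {c} citations, qualifies: {qualifies}")
    print("h-index:", h)
```

For the first list the last qualifying position is 4, and for the second it is 3, matching the values given in the quoted explanation.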

#### Prerequisites

In [0]:
username = ""  #@param {type: "string"}
password = ""  #@param {type: "string"}
endpoint = "https://app.dimensions.ai"  #@param {type: "string"}

#
!pip install dimcli -U --quiet 
import dimcli
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

#
# load common libraries
import pandas as pd

DimCli v0.6.1 - Succesfully connected to <https://app.dimensions.ai> (method: manual login)


#### Selecting a researcher

Let's pick a researcher, for example [George Church (ID: ur.01115626315.03)](https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01115626315.03), and save the researcher ID in a variable that we can reference later.

> Try modifying the researcher ID below to get different results! 

In [0]:
RESEARCHER = "ur.01115626315.03"

### Calculating the H-Index 

The H-index function takes a list of citation counts, sorted from largest to smallest, and returns the h-index value as explained above:

In [0]:
def the_H_function(sorted_citations_list):
    """Return the h-index for a list of publication citation counts.

    The list must be sorted from largest to smallest. The result is the
    last 1-based position whose citation count is >= the position itself.

    >>> the_H_function([10, 8, 5, 4, 3])
    4
    >>> the_H_function([25, 8, 5, 3, 3])
    3
    >>> the_H_function([1000, 20])
    2
    """
    h = 0
    # a loop avoids hitting Python's recursion limit on long lists
    for position, citations in enumerate(sorted_citations_list, start=1):
        if citations < position:
            break
        h = position
    return h

The H-index function is generic: it accepts any list of numbers representing publication citations, provided the list is sorted from largest to smallest.
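
The ordering of the input matters, and a short sketch may help show why. Here `scan_h` is a hypothetical stand-in written for this illustration only (it is not the notebook's function): it walks the list position by position and stops at the first position whose citation count is too small, so it only gives the right answer when the list is already sorted from largest to smallest.

```python
def scan_h(citations):
    """Position-by-position scan; correct only for descending input."""
    h = 0
    for position, count in enumerate(citations, start=1):
        if count < position:
            break
        h = position
    return h


unsorted_counts = [3, 25, 5, 8, 3]

# Scanning the list as-is overestimates the h-index for this input.
print(scan_h(unsorted_counts))

# Sorting from largest to smallest first gives the expected value.
print(scan_h(sorted(unsorted_counts, reverse=True)))
```

With these counts the unsorted scan reports 4, while the sorted scan reports 3, which is the value the definition gives for the same five publications.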

### Getting citations data from Dimensions

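To pass real-world data to the H-index function, we can use the Dimensions API to extract the citation counts of a researcher's publications. The query below sorts the results by `times_cited` and returns at most 1000 publications, so researchers with more publications than that would need additional queries.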

Lastly, we combine the two functions to calculate the H-Index value. 

In [0]:
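def get_pubs_citations(researcher_id):
    """Return the citation counts of a researcher's publications (at most 1000)."""
    print("Checking researcher: ", researcher_id)
    # results are sorted by times_cited so they can be passed straight to the_H_function
    q = """search publications where researchers.id = "{}" return publications[times_cited] sort by times_cited limit 1000"""
    pubs = dsl.query(q.format(researcher_id))
    # publications with no times_cited value are counted as having 0 citations
    return list(pubs.as_dataframe().fillna(0)['times_cited'])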

#
#
print("===\nH_index is:", the_H_function(get_pubs_citations(RESEARCHER)))

Checking researcher:  ur.01115626315.03
Returned Publications: 532 (total = 532)
===
H_index is: 131
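
The cell above uses `fillna(0)` so that publications without a citation count are treated as having zero citations. As a rough pure-Python sketch of the same idea, the example below assumes each publication record is a dictionary with an optional `times_cited` key. The helper name `citations_from_records` and the sample records (including their IDs) are made up for illustration and are not taken from the API.

```python
def citations_from_records(records):
    """Extract citation counts, largest first, from publication records.

    A missing or null 'times_cited' value is treated as 0 citations.
    """
    counts = [int(rec.get("times_cited") or 0) for rec in records]
    return sorted(counts, reverse=True)


# Illustrative records only; real API results may contain other fields.
sample = [
    {"id": "pub.1", "times_cited": 12},
    {"id": "pub.2"},                       # key missing
    {"id": "pub.3", "times_cited": None},  # explicit null
    {"id": "pub.4", "times_cited": 3},
]

print(citations_from_records(sample))
```

The resulting list is sorted from largest to smallest, so it has the shape the H-index function expects.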


---
## Want to learn more?

Check out the [Dimensions API Lab](https://digital-science.github.io/dimensions-api-lab/) website, which contains many tutorials and reusable Jupyter notebooks for scholarly data analytics. 