# Welcome to Datasets on DataCite - An initial bibliometric investigation

This notebook serves as supplemental material to the ISSI presentation "Datasets on DataCite - An initial bibliometric investigation" by Anton Ninkov, Kathleen Gregory, Isabella Peters, and Stefanie Haustein. It lets the user regenerate the results shown in that presentation with the most current outputs from DataCite's GraphQL API.

The first three sections (installs, preparing queries, managing data output) prepare the data for display. Section 4 shows the results of the data pulled from DataCite. Here, users can see the number of datasets with specific licenses (top 50), languages (top 50), subjects (all fields of science), and publication years (for the last 10 years). The "perc" column indicates what percentage of the total number of datasets in DataCite have the attribute listed in the row.

At the end of the notebook, in section 5, a treemap is created to visualize the distribution of datasets that have a field of science listed. This is a recreation of the treemap from the presentation with the most current data from DataCite.

Special thanks to Kristian Garza from DataCite, who supported this work along the way. This notebook is based, in part, on his notebook with similar functions (https://github.com/datacite/pidgraph-notebooks-python/tree/master/mdc-dataset-discipline).

Special thanks as well to Amir Haghighati from Western University's Insight Lab, who provided technical assistance in the development of this notebook, specifically the treemap.

## 1. Installs
This first section installs and imports all the libraries required to run this notebook.

In [154]:
%%capture
# Install required Python packages
!pip install dfply altair altair_saver vega altair_viewer gql dash==1.16.3 

In [155]:
import json
import numpy as np
from dfply import *
import altair.vega.v5 as alt
from altair_saver import save
import altair.vegalite.v4 as lite
import pandas as pd
import plotly.graph_objects as go


## 2. Prepare for the DataCite GraphQL API Queries
Next, we prepare the client and queries to make the GraphQL calls. We will also create functions to request the data from these calls.

In [156]:
import requests
from IPython.display import display, Markdown
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport

_transport = RequestsHTTPTransport(
    url='https://api.datacite.org/graphql',
    use_json=True,
)

client = Client(
    transport=_transport,
    fetch_schema_from_transport=True,
)

In [157]:
query_params = {
    "query" : "subjects.subjectScheme:\"Fields of Science and Technology (FOS)\"",
}


fOSQuery = gql("""query getOutputs($query: String)
{
  datasets(query: $query) {
    totalCount
  }
}
""")

citationsQuery = gql("""query getOutputs($query: String)
{
  datasets(query: $query,  hasCitations:1) {
    totalCount
  }
}
""")

generalQuery = gql("""query
{
  datasets(facetCount:50) {
    totalCount
    fieldsOfScience{
      title
      count
    }
    published{
      title
      count
    }
    licenses{
        title
        count
    }
    languages{
        title
        count
    }
  }
}
""")

In [158]:
def get_data(type):
    """Gets the data from the GraphQL API into an object.

    Parameters:
    type (string): Controlled vocabulary for type of data
                   ("citations", "fos", or "general")

    Returns:
    object: The "datasets" object of the response

    """
    # variable_values expects a dictionary, so query_params is passed as-is
    if type == "citations":
        return client.execute(citationsQuery, variable_values=query_params)["datasets"]
    elif type == "fos":
        return client.execute(fOSQuery, variable_values=query_params)["datasets"]
    elif type == "general":
        # generalQuery declares no variables, so none are sent
        return client.execute(generalQuery)["datasets"]
    else:
        # No other queries are defined in this notebook
        raise ValueError(f"Unknown data type: {type}")
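
For readers curious about what such a call amounts to on the wire, a GraphQL request over HTTP is conventionally a JSON body with a `query` key and an optional `variables` key. The sketch below only builds and prints that body for the field-of-science count; it does not contact the API. The helper `build_payload` and the constant `GRAPHQL_URL` are illustrative names that do not appear elsewhere in this notebook.

```python
import json

# Same endpoint the gql client above is configured with.
GRAPHQL_URL = "https://api.datacite.org/graphql"


def build_payload(query_text, variables=None):
    """Build the JSON-serialisable body of a GraphQL-over-HTTP POST request."""
    payload = {"query": query_text}
    if variables:
        payload["variables"] = variables
    return payload


fos_query_text = """query getOutputs($query: String)
{
  datasets(query: $query) {
    totalCount
  }
}
"""

# Variables are a plain dictionary keyed by the names declared in the query.
variables = {
    "query": 'subjects.subjectScheme:"Fields of Science and Technology (FOS)"'
}

payload = build_payload(fos_query_text, variables)
print(json.dumps(payload, indent=2))

# Sending it would look roughly like:
#   requests.post(GRAPHQL_URL, json=payload)
# That line is left as a comment so the sketch runs without network access.
```

Using the gql client, as this notebook does, additionally fetches the schema from the server, which a hand-built request would not do.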

## 3. Manage the data output
Next, we define the functions that process the data and transform it for display, and then we run the API calls.

In [159]:
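def transform_distributions(dataframe, total):
    """Adds a "perc" column: each row's count as a percentage of total."""
    if dataframe is None:
        return pd.DataFrame()
    # dfply pipe: X refers to the dataframe being piped in
    return dataframe >> mutate(perc=(X['count'] / total) * 100)
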

In [160]:
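def processTable(data, type):
    """Builds the table for one facet (e.g. "licenses"); returns None if the facet is empty."""
    if len(data[type]) == 0:
        return None
    table = pd.DataFrame(data[type], columns=data[type][0].keys())
    # Percentages are relative to the total number of datasets on DataCite
    return transform_distributions(table, data['totalCount'])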
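
To make the percentage calculation concrete, here is a minimal sketch of the same arithmetic in plain Python, without pandas or dfply. The `mock_response` dictionary and its numbers are invented for illustration and only mimic the shape of the `datasets` object returned by the API. `add_percentages` is a hypothetical helper, not a function used by this notebook.

```python
# Invented numbers that mimic the shape of the API's `datasets` object.
mock_response = {
    "totalCount": 200,
    "licenses": [
        {"title": "CC-BY-4.0", "count": 50},
        {"title": "CC0-1.0", "count": 25},
    ],
    "languages": [],
}


def add_percentages(data, facet):
    """Return the facet rows with an added `perc` key, or None if the facet is empty."""
    rows = data[facet]
    if len(rows) == 0:
        return None
    total = data["totalCount"]
    # perc is the share of *all* datasets, not only of those within the facet.
    return [{**row, "perc": row["count"] / total * 100} for row in rows]


for row in add_percentages(mock_response, "licenses"):
    print(f'{row["title"]}: {row["count"]} datasets ({row["perc"]:.2f}%)')
```

Because the denominator is the total number of datasets rather than the facet's own sum, the percentages in a table need not add up to 100.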

In [161]:
citations = get_data("citations")
fos = get_data("fos")
general = get_data("general")

## 4. Display the results
First, the total number of datasets in DataCite is displayed.
Next, the number and percentage of datasets with a field of science listed are shown, followed by the number and percentage with at least one citation.

Finally, four tables are listed, one for each of: field of science, published year, license, and language. These tables display the title, count, and percentage of the total number of datasets on DataCite.

In [162]:
fig = go.Figure(go.Indicator(
    mode = "number+delta",
    value = general["totalCount"],
    title= {'text': f"Total Number of Datasets on DataCite"},
    domain = {'x': [0, 1], 'y': [0, 1]}))
fig.update_layout(paper_bgcolor = "lightgray")
fig.show()

In [163]:

perc1 = 100*(fos["totalCount"]/general["totalCount"])

fig = go.Figure(go.Indicator(
    mode = "number+delta",
    value = fos["totalCount"],
    title= {'text': f"Datasets with a Field of Science listed ({perc1:.2f}%)"},
    domain = {'x': [0, 1], 'y': [0, 1]}))
fig.update_layout(paper_bgcolor = "lightgray")
fig.show()

In [164]:
perc2 = 100*(citations["totalCount"]/general["totalCount"])


fig = go.Figure(go.Indicator(
    mode = "number+delta",
    value = citations["totalCount"],
    title= {'text': f"Datasets with a citation listed ({perc2:.2f}%)"},
    domain = {'x': [0, 1], 'y': [0, 1]}))
fig.update_layout(paper_bgcolor = "lightgray")
fig.show()

In [165]:
processTable(general, "fieldsOfScience")

| | title | count | perc |
|---|---|---|---|
| 0 | Biological sciences | 304128 | 3.066576 |
| 1 | Earth and related environmental sciences | 93753 | 0.945328 |
| 2 | Health sciences | 80860 | 0.815325 |
| 3 | Chemical sciences | 65202 | 0.657443 |
| 4 | Computer and information sciences | 63899 | 0.644305 |
| 5 | Clinical medicine | 62105 | 0.626216 |
| 6 | Sociology | 40336 | 0.406715 |
| 7 | Mathematics | 33849 | 0.341305 |
| 8 | Physical sciences | 18410 | 0.185631 |
| 9 | Psychology | 16363 | 0.164991 |


In [166]:
processTable(general, "published")

| | title | count | perc |
|---|---|---|---|
| 0 | 2021 | 1591542 | 16.047795 |
| 1 | 2020 | 1502280 | 15.147751 |
| 2 | 2019 | 1039047 | 10.476892 |
| 3 | 2018 | 1049955 | 10.586879 |
| 4 | 2017 | 865636 | 8.728358 |
| 5 | 2016 | 478005 | 4.819808 |
| 6 | 2015 | 1059811 | 10.686259 |
| 7 | 2014 | 494211 | 4.983216 |
| 8 | 2013 | 194158 | 1.957729 |
| 9 | 2012 | 331789 | 3.345486 |


In [167]:
processTable(general, "licenses")

| | title | count | perc |
|---|---|---|---|
| 0 | CC-BY-4.0 | 679910 | 6.855651 |
| 1 | CC-BY-3.0 | 286479 | 2.888618 |
| 2 | CC-BY-NC-4.0 | 261621 | 2.63797 |
| 3 | CC0-1.0 | 154277 | 1.555602 |
| 4 | CC-BY-NC-SA-4.0 | 20630 | 0.208016 |
| 5 | CC-BY-SA-4.0 | 5791 | 0.058392 |
| 6 | CC-BY-NC-3.0 | 5721 | 0.057686 |
| 7 | MIT | 4499 | 0.045364 |
| 8 | CC-BY-NC-ND-4.0 | 3790 | 0.038215 |
| 9 | CC-BY-NC-ND-3.0 | 577 | 0.005818 |


In [168]:
processTable(general, "languages")

| | title | count | perc |
|---|---|---|---|
| 0 | English | 5488577 | 55.342277 |
| 1 | German | 203565 | 2.052581 |
| 2 | French | 160380 | 1.617139 |
| 3 | Italian | 159084 | 1.604072 |
| 4 | Irish | 32635 | 0.329064 |
| 5 | Dutch | 16967 | 0.171081 |
| 6 | Spanish | 7327 | 0.073879 |
| 7 | Danish | 3232 | 0.032589 |
| 8 | Portuguese | 2043 | 0.0206 |
| 9 | Thai | 1997 | 0.020136 |


## 5. Treemap
Finally, in this section we generate and display a treemap showing the breakdown of datasets with a field of science. This treemap is based on the field of science table listed in section 4. Fields of science are grouped under their higher-level (parent) OECD field of science (i.e., natural sciences, engineering and technology, medical and health sciences, social sciences, agricultural sciences, humanities). If the user clicks on one of these six higher-level cells of the treemap, the lower-level fields for that field are displayed. Hovering over a box shows the number of datasets contained within that specific field.

Note: the size of each cell represents the number of datasets that have that field of science listed. However, hovering over a higher-level cell displays the number of datasets that list that higher-level field itself, not all datasets that fall within it. To calculate that total, one would need to add together the lower-level fields under that higher-level field.
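
The summation described in this note can be sketched as follows. The `items` list here is invented for illustration: it copies the shape of the list built in the next cell (string ids, with each child pointing to its parent's id), but the ids and values are assumptions and do not come from `fosData.json` or from DataCite. `children_totals` is a hypothetical helper.

```python
from collections import defaultdict

# Invented items in the same shape as the treemap's `items` list.
items = [
    {"id": "0", "label": "all", "parent": "", "val": 0},
    {"id": "1.0", "label": "Natural sciences", "parent": "0", "val": 5},
    {"id": "1.1", "label": "Mathematics", "parent": "1.0", "val": 30},
    {"id": "1.6", "label": "Biological sciences", "parent": "1.0", "val": 70},
    {"id": "5.0", "label": "Social sciences", "parent": "0", "val": 2},
    {"id": "5.4", "label": "Sociology", "parent": "5.0", "val": 40},
]


def children_totals(items):
    """Sum the values of the direct children of each higher-level field."""
    labels = {item["id"]: item["label"] for item in items}
    totals = defaultdict(int)
    for item in items:
        parent = item["parent"]
        # Skip the root ("") and the higher-level fields themselves (parent "0").
        if parent not in ("", "0"):
            totals[labels[parent]] += item["val"]
    return dict(totals)


print(children_totals(items))
```

A dataset that lists more than one lower-level field would be counted once per field in such a sum, so the result is best read as a count of field assignments rather than of unique datasets.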

In [152]:
def readjson():
    df = pd.read_json('fosData.json')
    return df
    
processTableData = processTable(general,"fieldsOfScience")
fosData = readjson()
items = [{'id': '0', 'label': 'all', 'parent': '', 'val': 0}]
def makeParents():
    for row in fosData.itertuples():
        currentPar = 0
        count = 0
        if(row.fosId % 1 != 0.0):
            currentPar = row.fosId // 1.0
        for tableData in processTableData.itertuples():
            if (str(tableData.title).lower().strip() == str(row.fosLabel).lower().strip()):
                count = tableData.count
        item = {
            'id': str(row.fosId),
            'label': row.fosLabel,
            'parent': str(currentPar),
            'val': count
        }
        items.append(item)
        
makeParents()

In [153]:
df = pd.DataFrame(items)
fig = go.Figure()

fig.add_trace(go.Treemap(
    ids = df.id,
    labels = df.label,
    values = df.val,
    parents = df.parent,
    maxdepth=2,
    root_color="lightgrey"
))

fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))

fig.show()