<a href="https://colab.research.google.com/github/digital-science/dimensions-api-lab/blob/master/3-workshops/2019-04-Technical-University-of-Denmark/12-Joining-Dimensions-data-to-Wikidata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open Dimensions API Lab In Google Colab"/></a>

# Joining Dimensions data to Wikidata

GRID identifiers are available in Wikidata. By using SPARQL to query Wikidata alongside the Dimensions API, it is possible to join information in Dimensions to other attributes of institutions held in Wikidata. In this example, Wikidata is used to help us understand why some universities have such high numbers of papers with no external authors.

In [5]:
import pandas as pd
from dimcli.shortcuts import dslquery_json as dslquery

In example 8, we used Dimensions to find institutions with very high numbers of internal publications (publications with no external authors). In this example we use Wikidata to see whether these numbers correlate with high numbers of students.

## 1) Query Wikidata, and put the results in a dataframe

In [6]:
#pip (or pip3) install sparqlwrapper
#https://rdflib.github.io/sparqlwrapper/

from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""Select ?grid ?inception ?students
where {
    ?inst wdt:P2427 ?grid;
          wdt:P2196 ?students;
          wdt:P571 ?inception .
}""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

cols = results['head']['vars']

out = []
for row in results['results']['bindings']:
    item = []
    for c in cols:
        item.append(row.get(c, {}).get('value'))
    out.append(item)
    
wddf = pd.DataFrame(out, columns=cols).set_index('grid')

wddf.head()





grid,inception,students
grid.497287.7,1999-01-01T00:00:00Z,100
grid.461653.3,1979-01-01T00:00:00Z,110
grid.448855.0,2010-01-01T00:00:00Z,109
grid.466243.1,1925-01-01T00:00:00Z,104
grid.465925.9,1805-01-01T00:00:00Z,43
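
The SPARQL JSON results format carries every value as a string, so `students` and `inception` need converting before they can be sorted or compared numerically. The following is a minimal sketch of one way to do that. The helper name `clean_wikidata`, the extra `inception_year` column, and the choice to keep the largest student count per GRID ID are assumptions for illustration, not part of the original notebook. The sample reuses two rows from the output above plus one made-up duplicate row (95 students) to show the de-duplication. The founding year is read from the first four characters of the timestamp rather than parsed as a datetime, because dates before the late 1600s can fall outside the range of pandas' default nanosecond timestamps.

```python
import pandas as pd

def clean_wikidata(df):
    """Return a copy with numeric student counts and a numeric founding year."""
    out = df.copy()
    # Student counts arrive as strings; unparseable values become NaN.
    out["students"] = pd.to_numeric(out["students"], errors="coerce")
    # Take the year from the ISO timestamp, e.g. "1999-01-01T00:00:00Z" -> 1999.
    out["inception_year"] = pd.to_numeric(
        out["inception"].str.slice(0, 4), errors="coerce"
    )
    # Assumption: if Wikidata returns several rows for one GRID ID,
    # keep the row with the largest student count.
    out = out.sort_values("students", ascending=False)
    return out[~out.index.duplicated(keep="first")]

sample = pd.DataFrame(
    {
        # The third row is a hypothetical duplicate, added for illustration.
        "grid": ["grid.497287.7", "grid.461653.3", "grid.461653.3"],
        "inception": [
            "1999-01-01T00:00:00Z",
            "1979-01-01T00:00:00Z",
            "1979-01-01T00:00:00Z",
        ],
        "students": ["100", "110", "95"],
    }
).set_index("grid")

cleaned = clean_wikidata(sample)
print(cleaned)
```

In the notebook, the same helper could be applied as `wddf = clean_wikidata(wddf)` before joining.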


## 2) Get internal collaboration information on institutions from Dimensions

In [8]:
dsldf = pd.DataFrame(
        dslquery("""
            search publications
                where year > "2012"
                and count(research_orgs) = 1
            return research_orgs limit 200
        """)['research_orgs']
    ). \
    set_index('id')
dsldf.index.name = 'grid'

Execution time: 1.1190330982208252


In [9]:
dsldf.head()

grid,acronym,count,country_name,name
grid.11899.38,USP,29893,Brazil,University of Sao Paulo
grid.12527.33,THU,25896,China,Tsinghua University
grid.13402.34,ZJU,25513,China,Zhejiang University
grid.17063.33,,25437,Canada,University of Toronto
grid.16821.3c,SJTU,25139,China,Shanghai Jiao Tong University


## 3) Join the data together
Wikidata has student numbers for only some of these institutions, but among the rows that do match, a high number of students appears to go together with a high number of internal publications.

In [10]:
pd.merge(dsldf, wddf, on='grid', how='left')

grid,acronym,count,country_name,name,inception,students
grid.11899.38,USP,29893,Brazil,University of Sao Paulo,1934-01-01T00:00:00Z,96364
grid.12527.33,THU,25896,China,Tsinghua University,,
grid.13402.34,ZJU,25513,China,Zhejiang University,1897-05-21T00:00:00Z,39000
grid.17063.33,,25437,Canada,University of Toronto,,
grid.16821.3c,SJTU,25139,China,Shanghai Jiao Tong University,,
grid.214458.e,UM,23737,United States,University of Michigan,,
grid.26999.3d,UT,21234,Japan,University of Tokyo,1877-04-12T00:00:00Z,28253
grid.21107.35,JHU,20518,United States,Johns Hopkins University,,
grid.4991.5,,20507,United Kingdom,University of Oxford,1096-01-01T00:00:00Z,19791
grid.168010.e,SU,20287,United States,Stanford University,1891-01-01T00:00:00Z,16336


---
# Want to learn more?

Check out the [Dimensions API Lab](https://digital-science.github.io/dimensions-api-lab/) website, which contains many tutorials and reusable Jupyter notebooks for scholarly data analytics. 