# Building an Organizations Collaboration Network using the Dimensions API

This notebook shows how to analyse organizations collaboration data. Starting from a research organization, we will extract information about other organizations that collaborated with it, based on shared publications data. 

To keep the analysis focused, we also select a topic and a time frame. Applying these extra constraints reduces the number of shared publications retrieved and makes the overall extraction faster. 

At the end of the tutorial we will generate a 'collaborations network diagram', in which the nodes represent the organizations working together, and the edges represent the number of publications they have in common. An example of the resulting network diagram [can be seen here](http://api-sample-data.dimensions.ai/dataviz-exports/3-Organizations-Collaboration-Network/network_2_levels_grid.412125.1.html).

## 1. Prerequisites: load libraries and log in

In [2]:
# @markdown # Get the API library and login 
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai" #@param {type: "string"}


!pip install dimcli plotly tqdm pyvis -U --quiet 
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

#
# load common libraries
import time
import sys
import json
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
from tqdm.notebook import tqdm as progress
import networkx as nx

#
# charts libs
# import plotly_express as px
import plotly.express as px
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports 
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

DimCli v0.6.3 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)


## 2. Choose an Organization and a keyword (topic)

For the purpose of this exercise, we will use [grid.412125.1](https://grid.ac/institutes/grid.412125.1) (King Abdulaziz University, Saudi Arabia). 

> You can try using a different GRID ID to see how results change, e.g. by [browsing for another GRID organization](https://grid.ac/institutes).


In [3]:
GRIDID = "grid.412125.1" #@param {type:"string"}
    
#@markdown The start/end year of the publications used to extract collaborators
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}

#@markdown ---
#@markdown A keyword used to filter publications search
TOPIC = "nanotechnology" #@param {type:"string"}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

#
# gen link to Dimensions
#
try:
  gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
  # fall back to an empty name if the lookup fails
  gridname = ""
from IPython.display import display, HTML
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} &#x29c9;</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {}'.format(YEAR_START, YEAR_END)))
display(HTML('Topic: "{}" <br /><br />'.format(TOPIC)))


## 3. Building a one-degree network of collaborating institutions

We can use the [publications API](https://docs.dimensions.ai/dsl/data-sources.html#publications) to find the top 10 collaborating institutions based on the parameters above, via a single query. 

The `get_collaborators` function below fills out a templated query with the relevant bits and runs it. Then it transforms the results into a pandas dataframe, which will make it easier to process the data later on. 

A couple of things to note: 

* The resulting dataframe contains two extra columns: a) `id_from`, which is the 'seed' institution we start from; b) `level`, an optional parameter representing the network depth of the query (we'll see later how it is used with recursive querying).
* The query returns 11 records - that's because the first one is normally the seed GRID (due to internal collaborations) which we will omit from the results.
* Lastly, it's important to note that one could easily add more constraints to the query, e.g. research areas via FOR codes, or a threshold based on citation counts. The possibilities are endless!  

In [4]:
query = """search publications {}
               where year in [{}:{}] 
               and research_orgs.id="{}"
            return research_orgs limit 11"""

def get_collaborators(orgid, level=1, printquery=False):
    if TOPIC:
        TOPIC_CLAUSE = f"""for "{TOPIC}" """
    else:
        TOPIC_CLAUSE = ""
    searchstring = query.format(TOPIC_CLAUSE, YEAR_START, YEAR_END, orgid)
    if printquery: print(searchstring)
    df = dsl.query(searchstring, verbose=False).as_dataframe()
    df['id_from'] = [orgid] * len(df)
    df['level'] = [level] * len(df)
    return df

For example, let's try it out with our GRID ID:

In [5]:
get_collaborators(GRIDID, printquery=True)

search publications for "nanotechnology" 
               where year in [2000:2016] 
               and research_orgs.id="grid.412125.1"
            return research_orgs limit 11


Unnamed: 0,acronym,city_name,count,country_name,id,latitude,linkout,longitude,name,state_name,types,id_from,level
0,KAU,Jeddah,1069,Saudi Arabia,grid.412125.1,21.493889,[http://www.kau.edu.sa/home_english.aspx],39.25028,King Abdulaziz University,,[Education],grid.412125.1,1
1,MIT,Cambridge,53,United States,grid.116068.8,42.35982,[http://web.mit.edu/],-71.09211,Massachusetts Institute of Technology,Massachusetts,[Education],grid.412125.1,1
2,,Cambridge,49,United States,grid.38142.3c,42.377052,[http://www.harvard.edu/],-71.11665,Harvard University,Massachusetts,[Education],grid.412125.1,1
3,NU,Boston,46,United States,grid.261112.7,42.33983,[http://www.northeastern.edu/],-71.08918,Northeastern University,Massachusetts,[Education],grid.412125.1,1
4,AMU,Aligarh,39,India,grid.411340.3,27.91737,[http://www.amu.ac.in/],78.07785,Aligarh Muslim University,Uttar Pradesh,[Education],grid.412125.1,1
5,QAU,Islamabad,35,Pakistan,grid.412621.2,33.747223,[http://www.qau.edu.pk/],73.138885,Quaid-i-Azam University,,[Education],grid.412125.1,1
6,KSU,Riyadh,35,Saudi Arabia,grid.56302.32,24.723982,[http://ksu.edu.sa/en/],46.64584,King Saud University,,[Education],grid.412125.1,1
7,,Ismailia,34,Egypt,grid.33003.33,30.622778,[http://scuegypt.edu.eg/ar/],32.275,Suez Canal University,,[Education],grid.412125.1,1
8,,Elâzığ,33,Turkey,grid.411320.5,38.6799,[https://yeni.firat.edu.tr/],39.202843,Fırat University,,[Education],grid.412125.1,1
9,,New Delhi,33,India,grid.411818.5,28.561607,[http://jmi.ac.in/],77.28015,National Islamic University,,[Education],grid.412125.1,1
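
The query template used by `get_collaborators` can take further constraints, as noted in the list above. Here is a minimal sketch of one way to do that, adding an optional citation threshold. `build_query` and `min_citations` are hypothetical names introduced here for illustration, and the `times_cited` field name is an assumption that should be checked against the publications data source documentation before running the query.

```python
def build_query(orgid, year_start, year_end, topic="", min_citations=None, limit=11):
    """Return a DSL search string (hypothetical helper, not part of dimcli)."""
    topic_clause = f'for "{topic}" ' if topic else ""
    filters = [
        f"year in [{year_start}:{year_end}]",
        f'research_orgs.id="{orgid}"',
    ]
    if min_citations is not None:
        # ASSUMPTION: 'times_cited' is the citation-count field for publications
        filters.append(f"times_cited >= {min_citations}")
    where_clause = " and ".join(filters)
    return (f"search publications {topic_clause}"
            f"where {where_clause} "
            f"return research_orgs limit {limit}")

example_query = build_query("grid.412125.1", 2000, 2016,
                            topic="nanotechnology", min_citations=10)
print(example_query)
```

Building the `where` clause from a list of filters keeps each constraint independent, so further ones can be appended without touching the rest of the template.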


## 4. Building a network of any size 

What if we want to retrieve the collaborators of the collaborators? 

In other words, what if we want to generate a larger network, which includes the institutions linked to the collaborating institutions of King Abdulaziz University? If we think of our collaboration data as a [graph structure](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)) with nodes and edges, we can see that the `get_collaborators` function defined above is limited, because it only retrieves the objects directly linked to the 'seed' GRID. Instead, we want to run the same analysis for any GRID ID in our results, **iteratively**, so as to generate an N-degree network where N is chosen by us. 

For this purpose, we can set up a [recursive](https://en.wikipedia.org/wiki/Recursion_(computer_science)) function. This function essentially repeats the `get_collaborators` function as many times as needed. A few key points to note: 
* The `maxlevel` parameter determines how big our network should be (1 = neighbours only, 2 = collaborators of neighbours, etc.) 
* We pause 1 second after each iteration to avoid hitting the normal Analytics API quota (~30 requests per minute)
* The function can generate lots of data! E.g. calling this function with `maxlevel=5` will lead to 10k queries! (note: you can get a rough estimate of the queries via the formula *10 to the power of maxlevel-1*. That's because 10 is the number of orgs we extract per iteration, and maxlevel is the number of iterations, minus the first one which generates no extra queries).   
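
The rough estimate above counts only the queries issued at the deepest level. A slightly fuller sketch sums the queries over all levels, which gives an upper bound on the total. `estimate_queries` is a hypothetical helper, and the figure of about 2 seconds per query is an assumption derived from the ~30 requests per minute quota.

```python
def estimate_queries(maxlevel, per_node=10):
    """Upper bound on API calls: one query per organization at each level."""
    return sum(per_node ** depth for depth in range(maxlevel))

for n in range(1, 6):
    total = estimate_queries(n)
    # ASSUMPTION: ~2 seconds per query, the pace implied by ~30 requests per minute
    print(f"maxlevel={n}: up to {total} queries, roughly {total * 2 / 60:.1f} minutes")
```

The real number is usually lower, since organizations that show up under several parents still count once per parent here, and some organizations return fewer than 10 collaborators.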


In [6]:
def looper(seed, maxlevel=1, thislevel=1):
    "Recursive function for building an organization collaboration network"
    collaborators = get_collaborators(seed, thislevel)
    time.sleep(1)
    print("--" * thislevel, seed, " :: level =", thislevel)
    if thislevel < maxlevel:
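        # skip the original seed and the organization just queried (each one appears in its own results)
        gridslist = list(collaborators[~collaborators['id'].isin([GRIDID, seed])]['id'])
        extra = [looper(x, maxlevel, thislevel+1) for x in gridslist]
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        return pd.concat([collaborators, *extra], sort=False)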
    else:
        # finally
        return collaborators

Let's try this out. 

We can construct a 2-degree collaboration network starting from King Abdulaziz University. We are extracting 10 organizations per node, so our network will have up to ~100 nodes at the end!   

In [7]:
collaborators = looper(GRIDID, maxlevel=2)
# change column order for readability purposes
collaborators.rename(columns={"id": "id_to"}, inplace=True)
collaborators = collaborators[['id_from', 'id_to', 'level', 'count', 'name', 'acronym', 'city_name', 'state_name', 'country_name', 'latitude', 'longitude', 'linkout',  'types' ]]
collaborators.head()

-- grid.412125.1  :: level = 1
---- grid.116068.8  :: level = 2
---- grid.38142.3c  :: level = 2
---- grid.261112.7  :: level = 2
---- grid.411340.3  :: level = 2
---- grid.412621.2  :: level = 2
---- grid.56302.32  :: level = 2
---- grid.33003.33  :: level = 2
---- grid.411320.5  :: level = 2
---- grid.411818.5  :: level = 2
---- grid.412144.6  :: level = 2




Unnamed: 0,id_from,id_to,level,count,name,acronym,city_name,state_name,country_name,latitude,longitude,linkout,types
0,grid.412125.1,grid.412125.1,1,1069,King Abdulaziz University,KAU,Jeddah,,Saudi Arabia,21.493889,39.25028,[http://www.kau.edu.sa/home_english.aspx],[Education]
1,grid.412125.1,grid.116068.8,1,53,Massachusetts Institute of Technology,MIT,Cambridge,Massachusetts,United States,42.35982,-71.09211,[http://web.mit.edu/],[Education]
2,grid.412125.1,grid.38142.3c,1,49,Harvard University,,Cambridge,Massachusetts,United States,42.377052,-71.11665,[http://www.harvard.edu/],[Education]
3,grid.412125.1,grid.261112.7,1,46,Northeastern University,NU,Boston,Massachusetts,United States,42.33983,-71.08918,[http://www.northeastern.edu/],[Education]
4,grid.412125.1,grid.411340.3,1,39,Aligarh Muslim University,AMU,Aligarh,Uttar Pradesh,India,27.91737,78.07785,[http://www.amu.ac.in/],[Education]
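
With deeper networks, the same organization can appear under several parents and would then be queried more than once. One alternative, sketched below, is a breadth-first crawl that remembers which organizations it has already visited. This is a different technique from the recursive `looper`, not a drop-in replacement. `crawl` and `fetch` are hypothetical names, and the toy dictionary stands in for real API responses.

```python
from collections import deque

def crawl(seed, fetch, maxlevel=1):
    """Breadth-first crawl that queries each organization at most once.

    `fetch(org_id)` is a hypothetical callable returning a list of
    (collaborator_id, shared_publications) tuples for one organization.
    """
    edges = []
    visited = {seed}
    queue = deque([(seed, 1)])
    while queue:
        org, level = queue.popleft()
        for other, count in fetch(org):
            if other == org:
                continue  # skip internal collaborations
            edges.append((org, other, level, count))
            if level < maxlevel and other not in visited:
                visited.add(other)
                queue.append((other, level + 1))
    return edges

# toy data standing in for API responses (not real counts)
fake = {"A": [("A", 9), ("B", 5), ("C", 3)],
        "B": [("B", 7), ("A", 5), ("C", 2)],
        "C": [("C", 4), ("A", 3), ("B", 2)]}
calls = []

def fetch(org):
    calls.append(org)  # record every simulated query
    return fake[org]

edges = crawl("A", fetch, maxlevel=2)
print(calls)
```

In the notebook, `fetch` could wrap `get_collaborators` and return the `id` and `count` columns as pairs, with the one-second pause kept inside it.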


## 5. Visualizing the network 

In order to get an overview of the network data we can build a visualization using the [pyvis](https://pyvis.readthedocs.io/en/latest/tutorial.html) library. In particular, to quickly identify the key players in the network, we can build a visualization where the size of the nodes is proportional to their proximity to our 'seed' organization, and the width of the edges is proportional to the strength of the collaboration (= how many publications two orgs have in common). 

A custom version of pyvis is already included in [dimcli.core.extras](https://github.com/digital-science/dimcli/blob/master/dimcli/core/extras.py) and is called `NetworkViz` (note: this custom version only fixes a bug that prevents pyvis graphs from being displayed online with Google Colab). 

This is what the code below does:

* After creating a `NetworkViz` object, we fill it in using the `add_node` and `add_edge` methods. The full list of attributes for nodes and edges is described in the [pyvis documentation](https://pyvis.readthedocs.io/en/latest/tutorial.html).
* We generate colors for the chart, using the built-in [plotly color scales](https://plot.ly/python/builtin-colorscales/). Try changing them!
* The `repulsion` parameter is set to 300, but for bigger charts you may want to increase that.
* Tip: by experimenting with the way node sizes/colors are generated from the underlying data, it is possible to highlight different dimensions, e.g. the *countries* or *types* of the organizations. 
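
As an illustration of the tip above, the sketch below assigns one color per country instead of one per level. `color_by_category` is a hypothetical helper and the three hex values are illustrative; any list of color strings would work the same way.

```python
def color_by_category(values, palette):
    """Assign one palette color per distinct value, cycling if colors run out."""
    categories = sorted(set(values))
    return {cat: palette[i % len(palette)] for i, cat in enumerate(categories)}

# illustrative hex colors; a plotly color list could be passed instead
palette = ["#636EFA", "#EF553B", "#00CC96"]
countries = ["Saudi Arabia", "United States", "India", "United States", "Pakistan"]
country_colors = color_by_category(countries, palette)
print(country_colors)
```

Inside the node loop, the color could then be looked up with `country_colors[row['country_name']]`. Colors repeat when there are more countries than palette entries, so a longer palette is preferable for large networks.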


In [8]:

# load custom version of pyvis 
from dimcli.core.extras import NetworkViz

# set up dataviz
g = NetworkViz(notebook=True, width="100%", height="800px")
g.toggle_hide_edges_on_drag(False)
g.barnes_hut()
g.repulsion(300)
# g.show_buttons() # in html-standalone mode, this command shows viz controls


#
# create nodes and edges
#

# remove duplicates from nodes 
nodes = collaborators.drop_duplicates(subset ="id_to", keep = 'first')
# remove internal collaborations stats 
edges = collaborators[(collaborators['id_to'] != collaborators['id_from'])]

# reuse plotly color palette
palette = px.colors.diverging.Temps


#
# add nodes
#

# node size shrinks with distance from the seed; the maximum is computed once, outside the loop
maxsize = int(nodes['level'].max()) + 1

for index, row in nodes.iterrows():

    # calc size and color based on level
    if row['id_to'] == GRIDID:
        size = maxsize
        color = palette[0]
    else:
        size = maxsize - int(row['level'])
        # wrap around so deeper networks do not run past the end of the palette
        color = palette[(int(row['level']) * 2) % len(palette)]

    g.add_node(
        n_id = row['id_to'],
        label = row['name'],
        title = f"<h4>{row['name']}<br>{row['city_name']}, {row['country_name']}<br> - {row['id_to']}</h4>",
        value = size,
        color = color,
        borderWidthSelected = 5,
        shape = "dot",
    )


# store the max value for normalization operations later
edges_maxcount = edges['count'].max()

#
# add edges
#

for index, row in edges.iterrows():
  g.add_edge(row['id_from'], row['id_to'], 
             value = float(row['count']) / edges_maxcount,
             label=int(row['count']), 
             arrows="none"
            )


# add tooltips with adjacent links info
neighbor_map = g.get_adj_list() 
for node in g.nodes:
    neigh = neighbor_map[node["id"]]
    labels = [nodes[nodes['id_to'] == x].iloc[0]['name'] for x in neigh]
    node["title"] += "Links:<li>" + "</li><li>".join(labels) + "</li>"
 
    
g.show(f"network_{GRIDID}.html")
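
Besides the diagram, the key players can also be listed numerically. A minimal sketch is to sum the shared-publication counts on each organization's edges and sort the totals. `rank_by_shared_publications` is a hypothetical helper, and the sample edges reuse the first three level-1 counts from the table in section 3.

```python
from collections import defaultdict

def rank_by_shared_publications(edge_list):
    """Sum edge weights per node and return nodes sorted by total, descending."""
    totals = defaultdict(int)
    for source, target, count in edge_list:
        totals[source] += count
        totals[target] += count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

sample_edges = [
    ("grid.412125.1", "grid.116068.8", 53),
    ("grid.412125.1", "grid.38142.3c", 49),
    ("grid.412125.1", "grid.261112.7", 46),
]
ranking = rank_by_shared_publications(sample_edges)
print(ranking[:3])
```

In the notebook, the edge list could be taken from the `id_from`, `id_to` and `count` columns of the `edges` dataframe. An edge that appears in both directions would be counted twice unless it is deduplicated first.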

## 6. Conclusions

In this tutorial we have demonstrated how to generate an organization 'collaborations network diagram' using the Dimensions API. Starting from a research organization, we extracted information about other collaborating organizations, based on shared publications data, a topic and a time-frame. 

An example of the resulting network diagram [can be seen here](http://api-sample-data.dimensions.ai/dataviz-exports/3-Organizations-Collaboration-Network/network_2_levels_grid.412125.1.html).

Here are some ideas for further experimentation:

* try changing the initial `publications` query to include other parameters. The [publications API](https://docs.dimensions.ai/dsl/data-sources.html#publications) is rich, so there are many ways to fine-tune your analysis
* try increasing the number of iterations using the `maxlevel` parameter 
* try customizing the resulting network diagram, e.g. to highlight nodes and edges based on different criteria like countries or years. 