<a href="https://colab.research.google.com/github/digital-science/dimensions-api-lab" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open Dimensions API Lab In Google Colab"/></a>

# Part 4: Institutions

## Install Dimensions Library and login

In [None]:
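try:
  from google.colab import files
  %load_ext google.colab.data_table
  COLAB_ENV = True
  !pip install dimcli plotly_express  -U
  !mkdir -p data # to save temp data (no error if the folder already exists)
except ImportError:
  # google.colab is not available: we are not running in Colab
  COLAB_ENV = False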


# common libraries
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0 (replaces pandas.io.json.json_normalize)
import time
from tqdm.notebook import tqdm  # replaces the deprecated tqdm.tqdm_notebook
import plotly_express as px
from getpass import getpass
# FINALLY..
import dimcli
from dimcli.shortcuts import *

# set up for exports
if not COLAB_ENV:
  from plotly.offline import init_notebook_mode # needed for exports 
  init_notebook_mode(connected=True)


##
# LOG IN 
##

USERNAME = "m.pasin@digital-science.com"  #@param {type: "string"}

if not USERNAME:
  print("====\nERROR: Please enter a valid Dimensions API username")
else:
  password = getpass('====\nEnter password here')
  print('=> username is', USERNAME)
  print('=> password is', "*" * len(password))
  dimcli.login(USERNAME, password)
  dsl = dimcli.Dsl()


# Institutions Contributing to a Journal

Starting from our original publications dataset, we now look at the institutions linked to the journal. We want to obtain:

* the full list of institutions linked to the journal (including those without a GRID ID, for subsequent analysis)
* the publications count for each institution
* the authors count for each institution

Let's reload the affiliations data from Part 1 of the tutorial.
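
A note on the `Unnamed: 0` column that shows up when the file is loaded: this is most likely the row index that was written to the CSV when the file was saved (an assumption, since the export step lives in Part 1). The sketch below reproduces the effect on a small made-up table and shows that `index_col=0` is one way to read that column back as the index instead.

```python
import io
import pandas as pd

# Made-up stand-in for the Part 1 export (illustrative rows only).
original = pd.DataFrame({
    "aff_id": ["grid.1", "grid.2"],
    "pub_id": ["pub.A", "pub.B"],
})

# By default, to_csv writes the row index as a first column with no header.
csv_text = original.to_csv()

# Reading it back without options turns that column into 'Unnamed: 0'.
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the index instead.
index_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(index_read.columns))
```

The extra column is harmless for the analysis in this notebook, so the cell that loads the file can stay as it is.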



In [3]:
affiliations = pd.read_csv("data/1.publications_authors_affiliations.csv")
affiliations

Unnamed: 0,aff_city,aff_city_id,aff_country,aff_country_code,aff_id,aff_name,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
0,Besançon,3033123.0,France,FR,grid.493090.7,Université Bourgogne Franche-Comté,,,pub.1121383028,,Pierre,Vabres
1,Dijon,3021372.0,France,FR,grid.5613.1,University of Burgundy,,,pub.1121383028,,Pierre,Vabres
2,Besançon,3033123.0,France,FR,grid.493090.7,Université Bourgogne Franche-Comté,,,pub.1121383028,,Arthur,Sorlin
3,Dijon,3021372.0,France,FR,grid.5613.1,University of Burgundy,,,pub.1121383028,,Arthur,Sorlin
4,Ithaca,5122432.0,United States,US,grid.5386.8,Cornell University,New York,US-NY,pub.1121383028,,Stanislav S.,Kholmanskikh
5,Amiens,3037854.0,France,FR,grid.134996.0,Centre Hospitalier Universitaire D' Amiens,,,pub.1121383028,,Bénédicte,Demeer
6,Dijon,3021372.0,France,FR,grid.5613.1,University of Burgundy,,,pub.1121383028,,Judith,St-Onge
7,Besançon,3033123.0,France,FR,grid.493090.7,Université Bourgogne Franche-Comté,,,pub.1121383028,,Judith,St-Onge
8,Montreal,6077243.0,Canada,CA,grid.63984.30,McGill University Health Centre,Quebec,CA-QC,pub.1121383028,,Judith,St-Onge
9,Dijon,3021372.0,France,FR,grid.5613.1,University of Burgundy,,,pub.1121383028,,Yannis,Duffourd


## Some stats about affiliations

* count the affiliation statements in total
* count the affiliation statements that have a GRID ID
* count the unique GRID IDs
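
To make the difference between these three numbers concrete, here is a minimal sketch on a made-up table. The rows and the `toy` variable are purely illustrative; an empty `aff_id` stands for an affiliation without a GRID ID.

```python
import pandas as pd

# Toy affiliation statements (made-up rows, for illustration only).
toy = pd.DataFrame({
    "aff_id": ["grid.1", "grid.1", "grid.2", "", "grid.2", ""],
    "pub_id": ["pub.A", "pub.B", "pub.A", "pub.A", "pub.C", "pub.C"],
})

# Keep only the statements that carry a GRID ID.
with_grid = toy[toy["aff_id"] != ""]

total_statements = len(toy)                    # every author-affiliation row
statements_with_grid = len(with_grid)          # rows that have a GRID ID
unique_grids = with_grid["aff_id"].nunique()   # distinct institutions

print(total_statements, statements_with_grid, unique_grids)  # 6 4 2
```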

In [None]:
#
# segment the affiliations dataset
affiliations = affiliations.fillna('') 
gridaffiliations = affiliations.query("aff_id != ''")
non_gridaffiliations = affiliations.query("aff_id == ''")
#
# save
gridaffiliations.to_csv("data/4.Affiliations-with-grid.csv", index=False)
non_gridaffiliations.to_csv("data/4.Affiliations-without-grid.csv", index=False)

In [6]:
# build a summary barchart

df = pd.DataFrame({
    'measure' : ['Affiliations in total (non unique)', 'Affiliations with a GRID ID', 'Affiliations with a GRID ID (unique)'],
    'count' : [len(affiliations), len(gridaffiliations), gridaffiliations['aff_id'].nunique()],
})
px.bar(df, x="measure", y="count", title="Affiliations stats")

## Enriching the unique affiliations (GRIDs list) with pubs count and authors count

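We want a table with the following columns:

* GRID ID
* city
* country
* country code
* name
* tot_pubs
* tot_affiliations

NOTE: tot_affiliations counts affiliation statements, not unique authors. An author who appears with the same institution on several publications is counted once per publication.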
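
The sketch below illustrates the two measures on a small made-up table, using the same `groupby(...).transform(...)` pattern. The `tot_authors` column is a hypothetical extra that is not part of this tutorial: it counts distinct author names, which is only an approximation because different people can share a name.

```python
import pandas as pd

# Made-up rows: one row per (publication, author, affiliation) statement.
toy = pd.DataFrame({
    "aff_id":     ["grid.1", "grid.1", "grid.1", "grid.2"],
    "pub_id":     ["pub.A",  "pub.A",  "pub.B",  "pub.A"],
    "first_name": ["Ada",    "Bob",    "Ada",    "Cy"],
    "last_name":  ["Ng",     "Roy",    "Ng",     "Lee"],
})

# Rows per GRID ID: an author is counted once per publication.
toy["tot_affiliations"] = toy.groupby("aff_id")["aff_id"].transform("count")

# Distinct publications per GRID ID.
toy["tot_pubs"] = toy.groupby("aff_id")["pub_id"].transform("nunique")

# Hypothetical extra measure (not in the tutorial): distinct author names.
toy["author"] = toy["first_name"] + " " + toy["last_name"]
tot_authors = toy.groupby("aff_id")["author"].nunique()

# One row per institution, with the three measures side by side.
summary = (toy.drop_duplicates("aff_id")
              .set_index("aff_id")[["tot_affiliations", "tot_pubs"]])
summary["tot_authors"] = tot_authors
print(summary)
```

For `grid.1` the three values differ (3 statements, 2 publications, 2 author names), which is why tot_affiliations should not be read as a headcount.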

In [7]:
gridaffiliations.head(5)

Unnamed: 0,aff_city,aff_city_id,aff_country,aff_country_code,aff_id,aff_name,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
0,Besançon,3033120.0,France,FR,grid.493090.7,Université Bourgogne Franche-Comté,,,pub.1121383028,,Pierre,Vabres
1,Dijon,3021370.0,France,FR,grid.5613.1,University of Burgundy,,,pub.1121383028,,Pierre,Vabres
2,Besançon,3033120.0,France,FR,grid.493090.7,Université Bourgogne Franche-Comté,,,pub.1121383028,,Arthur,Sorlin
3,Dijon,3021370.0,France,FR,grid.5613.1,University of Burgundy,,,pub.1121383028,,Arthur,Sorlin
4,Ithaca,5122430.0,United States,US,grid.5386.8,Cornell University,New York,US-NY,pub.1121383028,,Stanislav S.,Kholmanskikh


In [8]:
#
# work on an explicit copy, so that adding columns does not trigger
# pandas' SettingWithCopyWarning (gridaffiliations was created by filtering)
gridaffiliations = gridaffiliations.copy()
#
# group by GRID ID and add new column with affiliations count
gridaffiliations["tot_affiliations"] = gridaffiliations.groupby('aff_id')['aff_id'].transform('count')
#
# add new column with publications count, for each GRID
gridaffiliations["tot_pubs"] = gridaffiliations.groupby(['aff_id'])['pub_id'].transform('nunique')
# 
# remove unnecessary columns
gridaffiliations = gridaffiliations.drop(['aff_city_id', 'pub_id', 'researcher_id', 'first_name', 'last_name'], axis=1).reset_index(drop=True)
#
# remove duplicate rows, leaving one row per GRID ID
gridaffiliations = gridaffiliations.drop_duplicates()
#
# update columns order
gridaffiliations = gridaffiliations[[ 'aff_id', 'aff_name','aff_city', 
                                     'aff_country', 'aff_country_code',  'aff_state',
                                     'aff_state_code', 'tot_affiliations',  'tot_pubs']]
#
# sort by affiliations count, then publications count (largest first)
gridaffiliations = gridaffiliations.sort_values(['tot_affiliations', 'tot_pubs'], ascending=False)
#
#
# That's it! Let's see the result
gridaffiliations.head()


Unnamed: 0,aff_id,aff_name,aff_city,aff_country,aff_country_code,aff_state,aff_state_code,tot_affiliations,tot_pubs
438,grid.66859.34,Broad Institute,Cambridge,United States,US,Massachusetts,US-MA,1192,210
454,grid.38142.3c,Harvard University,Cambridge,United States,US,Massachusetts,US-MA,978,245
118,grid.5335.0,University of Cambridge,Cambridge,United Kingdom,GB,,,780,145
184,grid.10306.34,Wellcome Sanger Institute,Cambridge,United Kingdom,GB,,,683,132
463,grid.32224.35,Massachusetts General Hospital,Boston,United States,US,Massachusetts,US-MA,511,132


In [None]:
# save the data
gridaffiliations.to_csv("data/4.aggregated-affiliations-with-grid.csv", index=False)

In [None]:
# download the data 
if COLAB_ENV:
  files.download("data/4.aggregated-affiliations-with-grid.csv")

## A Couple of Data Visualizations

In [11]:
threshold = 100

px.scatter(gridaffiliations[:threshold], x="tot_pubs", y="tot_affiliations", color="aff_country",
           hover_name="aff_name", hover_data=['aff_id', 'aff_name', 'aff_city', 'aff_country', 'tot_affiliations', 'tot_pubs'], 
           title=f"Top {threshold} Institutions: authors & publications")

In [12]:
threshold = 500

px.scatter(gridaffiliations[:threshold], x="aff_city", y="tot_pubs", color="aff_country", size='tot_affiliations',
           hover_name="aff_name", hover_data=['aff_id', 'aff_name', 'aff_city', 'aff_country', 'tot_affiliations', 'tot_pubs'], 
           title=f"Top {threshold} Institutions: cities and countries")