# Extracting Author Order from Publications Data

This Python notebook shows how to use the [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/), in particular the [publications source](https://docs.dimensions.ai/dsl/datasource-publications.html), to analyse the order in which authors are listed on publications.

These are the steps:

* First we extract a dataset of interest from Dimensions' publications database
* Second, we process the structured author data to turn the implicit authorship order into an explicit number
* Third, we mark first and last authors via a new 'author category' column


In [1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Apr 20, 2023
==


## Prerequisites

This notebook assumes you have installed the [Dimcli](https://pypi.org/project/dimcli/) library and have followed the steps in the ['Getting Started' tutorial](https://api-lab.dimensions.ai/cookbooks/1-getting-started/1-Using-the-Dimcli-library-to-query-the-API.html).

In [2]:
!pip install dimcli plotly tqdm -U --quiet 

import dimcli
from dimcli.utils import *

import os, sys, time, json
from tqdm.notebook import tqdm as progressbar

import pandas as pd
import numpy as np

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')  
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..


==
Logging in..
Dimcli - Dimensions API Client (v1.0.2)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.6
Method: dsl.ini file


## 1. Extracting a dataset from Dimensions 

We use three separate queries to extract:

* authors information
* publications metadata
* research organizations information 


**NOTE** Other approaches are also possible, e.g. extracting all the data with a single query and then using Python to select only the fields of interest. For the purposes of this tutorial, using separate queries is the most straightforward way to achieve our goal.
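
As a rough illustration of the single-query alternative mentioned in the note, the sketch below flattens nested publication records into one row per author using plain pandas. The `records` list is made-up sample data that only mimics the general shape of a publications result (an `id`, a `title`, a `year` and an `authors` list); it is not real API output, and the keys inside `authors` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical records, shaped loosely like publications with nested authors.
# These are NOT real API results.
records = [
    {"id": "pub.A", "title": "First paper", "year": 2022,
     "authors": [{"first_name": "Ada", "last_name": "Lovelace"},
                 {"first_name": "Alan", "last_name": "Turing"}]},
    {"id": "pub.B", "title": "Second paper", "year": 2022,
     "authors": [{"first_name": "Grace", "last_name": "Hopper"}]},
]

# One row per author; the publication-level fields are repeated on each row.
authors_flat = pd.json_normalize(
    records,
    record_path="authors",
    meta=["id", "title", "year"],
)
authors_flat = authors_flat.rename(columns={"id": "pub_id"})

# Select only the fields of interest.
authors_flat = authors_flat[["pub_id", "title", "year", "last_name", "first_name"]]
print(authors_flat)
```

With this approach the selection of fields happens in Python rather than in the query, at the cost of transferring more data per record.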


In [3]:
#
# the main query string selects publications based on a) pub year, b) specific organization IDs and c) concept
# you can update this query based on your preferences
#

main_query = """
search publications 
    where year in [2022:2022] 
    and research_orgs in ["grid.21925.3d","grid.147455.6","grid.25879.31","grid.29857.31"]
    and concepts = "oncology"
 return publications
"""
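
If you expect to change the year, organizations or concept often, one option is to assemble the query string from Python variables. The helper below is a hypothetical convenience (`build_main_query` is not part of Dimcli); it only reproduces the structure of the query string shown above, and it assumes the API accepts a space after the commas in the organization list.

```python
import json

def build_main_query(year_from, year_to, org_ids, concept):
    """Assemble a query string with the same structure as main_query.

    Hypothetical helper for illustration only; not part of Dimcli.
    """
    # json.dumps yields a double-quoted, comma-separated list, e.g. ["a", "b"]
    orgs = json.dumps(list(org_ids))
    return (
        "\nsearch publications\n"
        f"    where year in [{year_from}:{year_to}]\n"
        f"    and research_orgs in {orgs}\n"
        f"    and concepts = {json.dumps(concept)}\n"
        " return publications\n"
    )

demo_query = build_main_query(2022, 2022, ["grid.21925.3d", "grid.147455.6"], "oncology")
print(demo_query)
```

The returned string can be combined with a field list in the same way as `main_query` in the cells that follow.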

In [4]:
# use the main query but extract only author information
# key columns used later: researcher_id, pub_id, current_organization_id
Authors = dsl.query_iterative(main_query + "[id+authors]").as_dataframe_authors()
Authors.head()

Starting iteration with limit=1000 skip=0 ...
0-120 / 120 (4.36s)
===
Records extracted: 120


| | affiliations | corresponding | current_organization_id | first_name | last_name | orcid | raw_affiliation | researcher_id | pub_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [{'city': 'Philadelphia', 'city_id': 4560349, ... | | grid.25879.31 | Andrew | Schlafly | | [Perelman School of Medicine, University of Pe... | ur.012676303143.43 | pub.1154094821 |
| 1 | [{'city': 'Jacksonville', 'city_id': 4160021, ... | True | grid.25879.31 | Ronnie | Sebro | | [Center for Augmented Intelligence, Mayo Clini... | ur.0660765735.77 | pub.1154094821 |
| 2 | [{'city': 'Madison', 'city_id': 5261457, 'coun... | | grid.14003.36 | Jessica R. | Schumacher | [0000-0002-6740-9498] | [Department of Surgery, University of Wisconsi... | ur.0661627033.29 | pub.1153677611 |
| 3 | [{'city': 'Madison', 'city_id': 5261457, 'coun... | | grid.14003.36 | Alyssa A. | Wiener | | [Department of Surgery, University of Wisconsi... | ur.015612367333.32 | pub.1153677611 |
| 4 | [{'city': 'Madison', 'city_id': 5261457, 'coun... | | grid.410427.4 | Caprice C. | Greenberg | | [Department of Surgery, University of Wisconsi... | ur.012326542557.13 | pub.1153677611 |


In [5]:
# use the main query but extract only pubs metadata
Pubs = dsl.query_iterative(main_query + "[id+title+year+times_cited]").as_dataframe() 
Pubs.head()

Starting iteration with limit=1000 skip=0 ...
0-120 / 120 (1.88s)
===
Records extracted: 120


| | id | title | times_cited | year |
|---|---|---|---|---|
| 0 | pub.1154094821 | Does NIH funding differ between medical specia... | 0 | 2022 |
| 1 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 |
| 2 | pub.1153575321 | Quality and Safety Considerations in Intensity... | 0 | 2022 |
| 3 | pub.1153525111 | Data standards in pediatric oncology: Past, pr... | 0 | 2022 |
| 4 | pub.1153522196 | Assessments of Somatic Variant Classification ... | 0 | 2022 |


In [6]:
# use the main query but extract only research orgs infos
RORGS = dsl.query_iterative(main_query + "[unnest(research_orgs)]").as_dataframe()
RORGS.head()

Starting iteration with limit=1000 skip=0 ...
0-120 / 120 (1.03s)
120-120 / 120 (4.07s)
===
Records extracted: 599


| | research_orgs.city_name | research_orgs.country_name | research_orgs.id | research_orgs.latitude | research_orgs.linkout | research_orgs.longitude | research_orgs.name | research_orgs.state_name | research_orgs.types | research_orgs.acronym |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Philadelphia | United States | grid.25879.31 | 39.952457 | [http://www.upenn.edu/] | -75.19322 | University of Pennsylvania | Pennsylvania | [Education] | |
| 1 | Jacksonville | United States | grid.417467.7 | 30.289337 | [https://www.mayoclinic.org/patient-visitor-gu... | -81.437775 | Mayo Clinic | Florida | [Healthcare] | |
| 2 | Madison | United States | grid.14003.36 | 43.076694 | [http://www.wisc.edu/] | -89.41244 | University of Wisconsin–Madison | Wisconsin | [Education] | UW |
| 3 | Rochester | United States | grid.66875.3a | 44.02407 | [http://www.mayoclinic.org/patient-visitor-gui... | -92.46631 | Mayo Clinic | Minnesota | [Healthcare] | |
| 4 | Madison | United States | grid.412639.b | 43.076946 | [https://cancer.wisc.edu/] | -89.43147 | UW Carbone Cancer Center | Wisconsin | [Healthcare] | UWCCC |


## 2. Combining the results 

We merge the results from the queries above into a single table containing only the columns we want. 
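
The merge pattern can be tried out on toy data first. The sketch below uses three tiny made-up dataframes (all IDs and names are invented) and adds two optional safeguards that the notebook code does not use: `validate="many_to_one"`, which raises an error if the lookup table has duplicate keys, and `indicator=True`, which records whether each author row found a matching organization.

```python
import pandas as pd

# Invented sample data standing in for the Authors, Pubs and RORGS dataframes.
authors = pd.DataFrame({
    "pub_id": ["pub.A", "pub.A", "pub.B"],
    "researcher_id": ["ur.1", "ur.2", "ur.3"],
    "current_organization_id": ["grid.X", "grid.Y", None],
})
pubs = pd.DataFrame({"id": ["pub.A", "pub.B"],
                     "title": ["First paper", "Second paper"]})
orgs = pd.DataFrame({"rorg_id": ["grid.X", "grid.Y"],
                     "name": ["Org X", "Org Y"]})

# A left merge keeps every author row, in its original order.
aut_pub = authors.merge(pubs, left_on="pub_id", right_on="id",
                        how="left", validate="many_to_one")

# The _merge column added by indicator=True flags authors without an organization.
combined = aut_pub.merge(orgs, left_on="current_organization_id", right_on="rorg_id",
                         how="left", validate="many_to_one", indicator=True)
print(combined[["pub_id", "researcher_id", "title", "name", "_merge"]])
```

Keeping the row count unchanged across both merges matters here, because the author numbering that follows relies on each author appearing exactly once per publication.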

Additionally, we calculate each author's position in the authorship order and add a category column that marks 'first', 'last' and 'penultimate' authors.
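
To make the numbering logic concrete, here is a minimal sketch on invented data. It assumes, as the notebook does, that rows arrive in authorship order within each publication. It uses `transform("max")` and `np.select` as compact alternatives to the merge and nested `np.where` calls in the cell below; the resulting labels are intended to be the same.

```python
import numpy as np
import pandas as pd

# Invented author rows, already in authorship order within each publication.
rows = pd.DataFrame({
    "pub_id": ["pub.A", "pub.A", "pub.A", "pub.A", "pub.B", "pub.C", "pub.C"],
    "author_name": ["A1", "A2", "A3", "A4", "B1", "C1", "C2"],
})

# Position of each author within its publication (1-based).
rows["author_number"] = rows.groupby("pub_id").cumcount() + 1
# Total number of authors of the publication, repeated on every row.
rows["authors_tot"] = rows.groupby("pub_id")["author_number"].transform("max")

# Conditions are checked in order, so 'first' wins over 'last' for single-author papers.
conditions = [
    rows["author_number"] == 1,
    rows["author_number"] == rows["authors_tot"],
    rows["authors_tot"] - rows["author_number"] == 1,
]
labels = ["FirstAuthor", "LastAuthor", "Penultimate"]
rows["AuthorCategory"] = np.select(conditions, labels, default="")
print(rows)
```

Note that in a two-author paper nobody is labelled 'Penultimate', because the first-author condition takes precedence.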


In [7]:
#
# Authors becomes the "main table" because it has both the pub ID and the researcher ID
# Then use Authors->Pubs to look up title, year, times cited      on Authors.pub_id = Pubs.id
# Then use Authors->RORGS to look up org name, type and country   on Authors.current_organization_id = RORGS.rorg_id
#


##prep RORGS for merge

RORGS = RORGS.dropna(subset = ['research_orgs.id'])
RORGS = RORGS.rename(columns = {'research_orgs.id':'rorg_id'})
RORGS = RORGS.drop_duplicates(subset=['rorg_id', 'research_orgs.name'], keep='last')


##Combine all three dataframes into one

AutPub = pd.merge(
    left=Authors,
    right=Pubs,
    left_on='pub_id',
    right_on='id',
    how='left'
)

final = pd.merge(
    left=AutPub,
    right=RORGS,
    left_on='current_organization_id',
    right_on='rorg_id',
    how='left'
)

final["author_name"] = final["last_name"] + ", " + final["first_name"]
# number the authors in their original order within each publication;
# this only works if the dataframe has not been re-sorted since extraction
final['author_number'] = final.groupby(['pub_id']).cumcount() + 1
final = final.drop(columns=['affiliations', 'corresponding', 'raw_affiliation', 'id', 'first_name', 'last_name','research_orgs.latitude','research_orgs.longitude','research_orgs.acronym'])

# Get the total number of authors per pub ID and join it back to the combined table

AuthorCount = final.groupby(['pub_id'])['author_number'].max()

final = pd.merge(
    left=final,
    right=AuthorCount,
    left_on='pub_id',
    right_on='pub_id',
    how='left'
)

final = final.rename(columns={'author_number_x': 'author_number', 'author_number_y': 'authors_tot'})


# Assign a category to first, last and penultimate authors

final['AuthorCategory'] = np.where(
     final['author_number']==1, 'FirstAuthor', 
         np.where(
            final['author_number']==final['authors_tot'],"LastAuthor",
             np.where(
                (final['authors_tot']-final['author_number'])==1,"Penultimate",""
             )
         )
)

final.head(20)


| | current_organization_id | orcid | researcher_id | pub_id | title | times_cited | year | research_orgs.city_name | research_orgs.country_name | rorg_id | research_orgs.linkout | research_orgs.name | research_orgs.state_name | research_orgs.types | author_name | author_number | authors_tot | AuthorCategory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | grid.25879.31 | | ur.012676303143.43 | pub.1154094821 | Does NIH funding differ between medical specia... | 0 | 2022 | Philadelphia | United States | grid.25879.31 | [http://www.upenn.edu/] | University of Pennsylvania | Pennsylvania | [Education] | Schlafly, Andrew | 1 | 2 | FirstAuthor |
| 1 | grid.25879.31 | | ur.0660765735.77 | pub.1154094821 | Does NIH funding differ between medical specia... | 0 | 2022 | Philadelphia | United States | grid.25879.31 | [http://www.upenn.edu/] | University of Pennsylvania | Pennsylvania | [Education] | Sebro, Ronnie | 2 | 2 | LastAuthor |
| 2 | grid.14003.36 | [0000-0002-6740-9498] | ur.0661627033.29 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Madison | United States | grid.14003.36 | [http://www.wisc.edu/] | University of Wisconsin–Madison | Wisconsin | [Education] | Schumacher, Jessica R. | 1 | 14 | FirstAuthor |
| 3 | grid.14003.36 | | ur.015612367333.32 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Madison | United States | grid.14003.36 | [http://www.wisc.edu/] | University of Wisconsin–Madison | Wisconsin | [Education] | Wiener, Alyssa A. | 2 | 14 | |
| 4 | grid.410427.4 | | ur.012326542557.13 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Augusta | United States | grid.410427.4 | [http://www.augusta.edu/] | Augusta University | Georgia | [Education] | Greenberg, Caprice C. | 3 | 14 | |
| 5 | grid.14003.36 | [0000-0002-4517-1204] | ur.0632670166.10 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Madison | United States | grid.14003.36 | [http://www.wisc.edu/] | University of Wisconsin–Madison | Wisconsin | [Education] | Hanlon, Bret | 4 | 14 | |
| 6 | grid.240614.5 | | ur.0671641425.86 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Buffalo | United States | grid.240614.5 | [https://www.roswellpark.org/] | Roswell Park Comprehensive Cancer Center | New York | [Healthcare] | Edge, Stephen B. | 5 | 14 | |
| 7 | grid.66875.3a | | ur.01264057027.05 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Rochester | United States | grid.66875.3a | [http://www.mayoclinic.org/patient-visitor-gui... | Mayo Clinic | Minnesota | [Healthcare] | Ruddy, Kathryn J. | 6 | 14 | |
| 8 | grid.65499.37 | [0000-0002-4722-4824] | ur.012333143317.98 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Boston | United States | grid.65499.37 | [http://www.dana-farber.org/] | Dana-Farber Cancer Institute | Massachusetts | [Facility] | Partridge, Ann H. | 7 | 14 | |
| 9 | grid.66875.3a | [0000-0002-2234-7430] | ur.0654547635.88 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Rochester | United States | grid.66875.3a | [http://www.mayoclinic.org/patient-visitor-gui... | Mayo Clinic | Minnesota | [Healthcare] | Le-Rademacher, Jennifer G. | 8 | 14 | |
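
One possible next step with a table like this is to count how often each organization holds a first or last author position. The sketch below runs on a small invented dataframe (`final_demo`, with made-up organization names) that reuses two column names from the table above; the same two lines could be tried on `final`, although that has not been run here.

```python
import pandas as pd

# Invented rows that reuse the column names of the combined table.
final_demo = pd.DataFrame({
    "research_orgs.name": ["Org X", "Org X", "Org Y", "Org Y", "Org X", "Org Y"],
    "AuthorCategory": ["FirstAuthor", "", "LastAuthor", "FirstAuthor",
                       "FirstAuthor", "Penultimate"],
})

# Keep only first and last author rows, then count them per organization.
key_positions = final_demo[final_demo["AuthorCategory"].isin(["FirstAuthor", "LastAuthor"])]
summary = pd.crosstab(key_positions["research_orgs.name"],
                      key_positions["AuthorCategory"])
print(summary)
```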


---
## Where to go from here

In this [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/) tutorial we have seen how, using the [publications source](https://docs.dimensions.ai/dsl/datasource-publications.html), it is possible to extract and analyse information about authors and their order of authorship.

This only scratches the surface of the possible applications of publications data, but hopefully it'll give you a few basic tools to get started building your own application. 

For more tutorials, see the [API LAB homepage](https://api-lab.dimensions.ai/).