# Journal Profiling Part 1: Getting the Data

This Python notebook shows how to use the [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/) to extract [publications data](https://docs.dimensions.ai/dsl/datasource-publications.html) for a specific journal, as well as its authors and affiliations.

This tutorial is the first of a series that uses the data extracted in order to generate a 'journal profile' report. See the [API Lab homepage](https://api-lab.dimensions.ai/) for the other tutorials in this series.


In this notebook we are going to:

* extract all publications data for a given journal
* have a quick look at the publications' authors and affiliations 
* review how many authors have been disambiguated with a Dimensions Researcher ID
* produce a dataset of non-disambiguated authors that can be used for manual disambiguation 

## Prerequisites

This notebook assumes you have installed the [Dimcli](https://pypi.org/project/dimcli/) library and are familiar with the *Getting Started* tutorial.


In [1]:
!pip install dimcli plotly tqdm -U --quiet 

import dimcli
from dimcli.shortcuts import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://github.com/digital-science/dimcli#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  USERNAME = getpass.getpass(prompt='Username: ')
  PASSWORD = getpass.getpass(prompt='Password: ')    
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
else:
  USERNAME, PASSWORD  = "", ""
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()

==
Logging in..
Dimcli - Dimensions API Client (v0.7.4.2)
Connected to: https://app.dimensions.ai - DSL v1.27
Method: dsl.ini file


A helper function to store the data we are going to extract:

In [3]:
# create output data folder
FOLDER_NAME = "journal-profile-data"
os.makedirs(FOLDER_NAME, exist_ok=True)

def save(df, filename_dot_csv):
    # write the dataframe to the output folder, without the index column
    df.to_csv(os.path.join(FOLDER_NAME, filename_dot_csv), index=False)

## Selecting a Journal and Extracting All Publications Metadata

In [4]:
#@title Select a journal from the dropdown
#@markdown If the journal isn't there, you can try typing in the exact name instead.

journal_title = "Nature Genetics" #@param ['Nature', 'Nature Communications', 'Nature Biotechnology', 'Nature Medicine', 'Nature Genetics', 'Nature Neuroscience', 'Nature Structural & Molecular Biology', 'Nature Methods', 'Nature Cell Biology', 'Nature Immunology', 'Nature Reviews Drug Discovery', 'Nature Materials', 'Nature Physics', 'Nature Reviews Neuroscience', 'Nature Nanotechnology', 'Nature Reviews Genetics', 'Nature Reviews Urology', 'Nature Reviews Molecular Cell Biology', 'Nature Precedings', 'Nature Reviews Cancer', 'Nature Photonics', 'Nature Reviews Immunology', 'Nature Reviews Cardiology', 'Nature Reviews Gastroenterology & Hepatology', 'Nature Reviews Clinical Oncology', 'Nature Reviews Endocrinology', 'Nature Reviews Neurology', 'Nature Chemical Biology', 'Nature Reviews Microbiology', 'Nature Geoscience', 'Nature Reviews Rheumatology', 'Nature Climate Change', 'Nature Reviews Nephrology', 'Nature Chemistry', 'Nature Digest', 'Nature Protocols', 'Nature Middle East', 'Nature India', 'Nature China', 'Nature Plants', 'Nature Microbiology', 'Nature Ecology & Evolution', 'Nature Astronomy', 'Nature Energy', 'Nature Human Behaviour', 'AfCS-Nature Molecule Pages', 'Human Nature', 'Nature Reviews Disease Primers', 'Nature Biomedical Engineering', 'Nature Reports Stem Cells', 'Nature Reviews Materials', 'Nature Sustainability', 'Nature Catalysis', 'Nature Electronics', 'Nature Reviews Chemistry', 'Nature Metabolism', 'Nature Reviews Physics', 'Nature Machine Intelligence', 'NCI Nature Pathway Interaction Database', 'Nature Reports: Climate Change'] {allow-input: true}
start_year = 2015  #@param {type: "number"}
#@markdown ---

# PS 
# To get titles from the API one can do this:
# > %dsldf search publications where journal.title~"Nature" and publisher="Springer Nature" return journal limit 100
# > ", ".join([f"'{x}'" for x in list(dsl_last_results.title)]) 
#

q_template = """search publications where 
    journal.title="{}" and 
    year>={} 
    return publications[basics+altmetric+times_cited]"""
q = q_template.format(journal_title, start_year)
print("DSL Query:\n----\n", q, "\n----")
# q is already fully formatted above, so it is passed to the API as-is
pubs = dsl.query_iterative(q, limit=500)


DSL Query:
----
 search publications where 
    journal.title="Nature Genetics" and 
    year>=2015 
    return publications[basics+altmetric+times_cited] 
----
Starting iteration with limit=500 skip=0 ...
0-500 / 1541 (5.75s)
500-1000 / 1541 (5.24s)
1000-1500 / 1541 (3.31s)
1500-1541 / 1541 (0.89s)
===
Records extracted: 1541
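
The log above shows the 1541 records arriving in four batches of at most 500. The sketch below reproduces that skip/limit arithmetic in plain Python. `page_windows` is a hypothetical helper written for illustration only; it is not part of Dimcli and makes no claim about how `query_iterative` works internally.

```python
def page_windows(total, limit):
    """Return the (start, end) record windows produced by skip/limit paging."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # each window starts at a multiple of `limit` and is capped at `total`
    return [(skip, min(skip + limit, total)) for skip in range(0, total, limit)]

# Same totals as in the log above: 1541 records, 500 per request
for start, end in page_windows(1541, 500):
    print(f"{start}-{end} / 1541")
```

A smaller `limit` means more requests but smaller responses, which can help if single requests time out.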


Save the data as a CSV file in case we want to reuse it later.

In [5]:
dfpubs = pubs.as_dataframe()
save(dfpubs,"1_publications.csv")
# preview the publications 
dfpubs.head(10)

Unnamed: 0,id,author_affiliations,type,pages,times_cited,title,year,altmetric,journal.id,journal.title,issue,volume
0,pub.1130541833,"[[{'first_name': 'Andrea', 'last_name': 'Lunar...",article,1-1,0,Author Correction: A co-clinical approach iden...,2020,1.0,jour.1103138,Nature Genetics,,
1,pub.1129832914,"[[{'first_name': 'Robert', 'last_name': 'Hänse...",article,878-883,1,Landscape of G-quadruplex DNA structural regio...,2020,167.0,jour.1103138,Nature Genetics,9.0,52.0
2,pub.1130496379,,article,865-865,0,Crop genomes and beyond,2020,12.0,jour.1103138,Nature Genetics,9.0,52.0
3,pub.1130496620,"[[{'first_name': 'Dalen', 'last_name': 'Chan',...",article,868-869,0,RNA post-transcriptional modification speaks t...,2020,16.0,jour.1103138,Nature Genetics,9.0,52.0
4,pub.1130497175,"[[{'first_name': 'Ivano', 'last_name': 'Mocavi...",article,866-867,0,RNA closing the Polycomb circle,2020,12.0,jour.1103138,Nature Genetics,9.0,52.0
5,pub.1129017794,"[[{'first_name': 'Yicheng', 'last_name': 'Long...",article,931-938,3,RNA is essential for PRC2 chromatin occupancy ...,2020,224.0,jour.1103138,Nature Genetics,9.0,52.0
6,pub.1130146552,"[[{'first_name': 'Wangxin', 'last_name': 'Guo'...",article,908-918,0,Single-cell transcriptomics identifies a disti...,2020,42.0,jour.1103138,Nature Genetics,9.0,52.0
7,pub.1130293541,"[[{'first_name': 'Xihao', 'last_name': 'Li', '...",article,969-983,0,Dynamic incorporation of multiple in silico fu...,2020,48.0,jour.1103138,Nature Genetics,9.0,52.0
8,pub.1130003230,"[[{'first_name': 'Yuan', 'last_name': 'Li', 'c...",article,870-877,1,N6-Methyladenosine co-transcriptionally direct...,2020,35.0,jour.1103138,Nature Genetics,9.0,52.0
9,pub.1130497144,"[[{'first_name': 'Giulio', 'last_name': 'Carav...",article,898-907,0,Subclonal reconstruction of tumors by using ma...,2020,73.0,jour.1103138,Nature Genetics,9.0,52.0
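
One thing to keep in mind when reusing the saved file: a nested column such as `author_affiliations` is written to CSV as text, so it should come back as a string rather than as lists and dictionaries. The sketch below shows one way to restore it with the standard library, assuming the cells contain only plain Python literals. The toy rows and the `parse_nested` helper are invented for illustration.

```python
import ast
import csv
import io

def parse_nested(cell):
    """Turn a stringified Python literal back into lists/dicts; empty cells become []."""
    return ast.literal_eval(cell) if cell.strip() else []

# Toy rows standing in for 1_publications.csv (invented values)
nested = [[{"first_name": "Ada", "last_name": "Lovelace"}]]
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "author_affiliations"])
writer.writerow(["pub.0001", str(nested)])  # nested data is stored as its string form
writer.writerow(["pub.0002", ""])           # a publication with no author data

# Read the CSV back: every cell is now a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
parsed = [parse_nested(row["author_affiliations"]) for row in rows]
print(parsed[0][0][0]["last_name"])
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for data read from a file.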


Extract the authors data 

In [6]:
# preview the authors data 
authors = pubs.as_dataframe_authors()
save(authors,"1_publications_authors.csv")
authors.head(10)

Unnamed: 0,first_name,last_name,corresponding,orcid,affiliations,pub_id
0,Andrea,Lunardi,,,"[{'name': 'Cancer Genetics Program, Beth Israe...",pub.1130541833
1,Ugo,Ala,,,"[{'name': 'Cancer Genetics Program, Beth Israe...",pub.1130541833
2,Mirjam T.,Epping,,,"[{'name': 'Cancer Genetics Program, Beth Israe...",pub.1130541833
3,Leonardo,Salmena,,,"[{'name': 'Cancer Genetics Program, Beth Israe...",pub.1130541833
4,John G.,Clohessy,,,"[{'name': 'Cancer Genetics Program, Beth Israe...",pub.1130541833
5,Kaitlyn A.,Webster,,,"[{'name': 'Cancer Genetics Program, Beth Israe...",pub.1130541833
6,Guocan,Wang,,,"[{'name': 'Cancer Genetics Program, Beth Israe...",pub.1130541833
7,Roberta,Mazzucchelli,,,"[{'id': 'grid.7010.6', 'name': 'Marche Polytec...",pub.1130541833
8,Maristella,Bianconi,,,"[{'id': 'grid.7010.6', 'name': 'Marche Polytec...",pub.1130541833
9,Edward C.,Stack,,,"[{'id': 'grid.65499.37', 'name': 'Dana-Farber ...",pub.1130541833


Extract the affiliations data 

In [7]:
affiliations = pubs.as_dataframe_authors_affiliations()
save(affiliations,"1_publications_affiliations.csv")
affiliations.head(10)

Unnamed: 0,aff_name,aff_id,aff_city,aff_city_id,aff_country,aff_country_code,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
0,"Cancer Genetics Program, Beth Israel Deaconess...",,,,,,,,pub.1130541833,,Andrea,Lunardi
1,"Cancer Genetics Program, Beth Israel Deaconess...",,,,,,,,pub.1130541833,,Ugo,Ala
2,University of Turin,grid.7605.4,Turin,3165520.0,Italy,IT,,,pub.1130541833,,Ugo,Ala
3,"Cancer Genetics Program, Beth Israel Deaconess...",,,,,,,,pub.1130541833,,Mirjam T.,Epping
4,"Cancer Genetics Program, Beth Israel Deaconess...",,,,,,,,pub.1130541833,,Leonardo,Salmena
5,Memorial Sloan Kettering Cancer Center,grid.51462.34,New York,5128580.0,United States,US,New York,US-NY,pub.1130541833,,Leonardo,Salmena
6,Memorial Sloan Kettering Cancer Center,grid.51462.34,New York,5128580.0,United States,US,New York,US-NY,pub.1130541833,,Leonardo,Salmena
7,"Cancer Genetics Program, Beth Israel Deaconess...",,,,,,,,pub.1130541833,,John G.,Clohessy
8,Memorial Sloan Kettering Cancer Center,grid.51462.34,New York,5128580.0,United States,US,New York,US-NY,pub.1130541833,,John G.,Clohessy
9,Memorial Sloan Kettering Cancer Center,grid.51462.34,New York,5128580.0,United States,US,New York,US-NY,pub.1130541833,,John G.,Clohessy


## Some stats about authors

* count the total number of author entries (not unique)
* count how many of those entries have a researcher ID
* count how many unique researcher IDs there are in total

In [None]:
researchers = authors.query("researcher_id!=''")
#
df = pd.DataFrame({
    'measure' : ['Authors in total (non unique)', 'Authors with a researcher ID', 'Authors with a researcher ID (unique)'],
    'count' : [len(authors), len(researchers), researchers['researcher_id'].nunique()],
})
px.bar(df, x="measure", y="count", title=f"Author stats for {journal_title} (from {start_year})")
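
To make the three measures concrete, here is a minimal plain-Python sketch computing the same counts on a handful of invented records. `author_stats` and the toy data are hypothetical. Note one assumption that differs from the cell above: the sketch treats both an empty string and `None` as "no researcher ID", whereas the notebook query only checks for an empty string.

```python
# Toy author records (invented); a missing researcher_id means "not disambiguated"
toy_authors = [
    {"last_name": "Rossi", "researcher_id": "ur.01"},
    {"last_name": "Rossi", "researcher_id": "ur.01"},  # same researcher, second paper
    {"last_name": "Chen", "researcher_id": "ur.02"},
    {"last_name": "Wang", "researcher_id": ""},
    {"last_name": "Smith", "researcher_id": None},
]

def author_stats(records):
    """Count author entries, entries with an ID, and distinct IDs."""
    ids = [r["researcher_id"] for r in records if r.get("researcher_id")]
    return {
        "total": len(records),         # every author entry, duplicates included
        "with_id": len(ids),           # entries linked to a researcher ID
        "unique_ids": len(set(ids)),   # distinct researchers behind those entries
    }

stats = author_stats(toy_authors)
print(stats)
```

The gap between `with_id` and `unique_ids` reflects researchers who appear on more than one publication.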

In [None]:
# save the researchers data to a file
save(researchers, "1_authors_with_researchers_id.csv")

## A quick look at authors *without* a Dimensions Researcher ID

We're not going to try to disambiguate them here, but still it's good to have a quick look at them... 

It looks like the most common surname is `Wang`, while the most common first name is an empty value.

In [None]:
authors_without_id = authors.query("researcher_id==''")
authors_without_id[['first_name', 'last_name']].describe()

The ten most frequent surnames among these authors all appear to be Asian, which is a well-known problem in author disambiguation.

In [None]:
authors_without_id['last_name'].value_counts()[:10]

### Any common patterns? 

If we try to group the data by name+surname we can see some interesting patterns 

* some entries are things which are not persons (presumably the results of bad source data in Dimensions, eg from the publisher) 
* there are some apparently meaningful name+surname combinations with a lot of hits
* not many Asian names in the top ones 



In [None]:
authors_without_id = authors_without_id.groupby(["first_name", "last_name"]).size().reset_index().rename(columns={0: "frequency"})
authors_without_id.sort_values("frequency", ascending=False, inplace=True)
authors_without_id.head(20)
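
The same name+surname grouping can be sketched without pandas using `collections.Counter`, which may be handy for a quick check outside the notebook. The names below are invented and `name_frequencies` is a hypothetical helper.

```python
from collections import Counter

# Toy (first_name, last_name) pairs; an empty first name mimics non-person entries
toy_names = [
    ("", "Example Consortium"),
    ("Wei", "Wang"),
    ("", "Example Consortium"),
    ("Wei", "Wang"),
    ("", "Example Consortium"),
    ("Li", "Wang"),
]

def name_frequencies(pairs):
    """Return ((first_name, last_name), frequency) tuples, most frequent first."""
    return Counter(pairs).most_common()

frequencies = name_frequencies(toy_names)
for (first_name, last_name), frequency in frequencies:
    print(repr(first_name), last_name, frequency)
```

Entries with an empty first name and a high frequency are good candidates for the "not a person" pattern described above.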

### Creating an export for manual curation

For the next tasks, we will focus on the disambiguated authors as the Researcher ID links will let us carry out useful analyses.

Still, we can **save the authors with missing IDs** and try some manual disambiguation later. To this end, adding a simple Google search URL to each row helps to make sense of the data quickly.

In [None]:
from dimcli.shortcuts import google_url

authors_without_id['search_url'] = authors_without_id.apply(lambda x: google_url(x['first_name'] + " " + x['last_name']), axis=1)

authors_without_id.head(20)
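
If the `google_url` shortcut is not available, a similar link can be built with the standard library. This is a minimal sketch: `search_url` is a hypothetical stand-in, and the URL format is an assumption about what a plain Google search link looks like, not a description of what Dimcli generates.

```python
from urllib.parse import quote_plus

def search_url(first_name, last_name):
    """Build a Google search link for an author name, skipping empty name parts."""
    query = " ".join(part for part in (first_name, last_name) if part)
    # quote_plus encodes spaces as '+' and escapes characters unsafe in a URL
    return "https://www.google.com/search?q=" + quote_plus(query)

print(search_url("Mirjam T.", "Epping"))
print(search_url("", "Wang"))
```

Skipping empty parts avoids a stray leading space in the query when the first name is missing, which the earlier cells suggest is common in this dataset.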

In [None]:
# save the data
save(authors_without_id, "1_authors_without_researchers_id.csv")

That's it! 

Now let's go and open this in [Google Sheets](https://docs.google.com/spreadsheets/)...

In [None]:
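# for colab users: upload the data to a new Google Sheet
# (same Colab check as in the setup cell; COLAB_ENV is not defined in this notebook)
if 'google.colab' in sys.modules: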
    from google.colab import auth
    auth.authenticate_user()

    import gspread
    from gspread_dataframe import get_as_dataframe, set_with_dataframe
    from oauth2client.client import GoogleCredentials

    gc = gspread.authorize(GoogleCredentials.get_application_default())

    title = 'Authors_without_IDs'
    sh = gc.create(title)
    worksheet = gc.open(title).sheet1
    set_with_dataframe(worksheet, authors_without_id)
    spreadsheet_url = "https://docs.google.com/spreadsheets/d/%s" % sh.id
    print(spreadsheet_url)