<a href="https://colab.research.google.com/github/digital-science/dimensions-api-lab" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open Dimensions API Lab In Google Colab"/></a>

# Part 5: competing journals

## Install Dimensions Library and login

In [0]:
try:
  from google.colab import files
  %load_ext google.colab.data_table
  COLAB_ENV = True
  !pip install dimcli plotly_express  -U
  !mkdir data # to save temp data 
except:
  COLAB_ENV = False


# common libraries
import pandas as pd
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0
import time
import json
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
import plotly_express as px
from getpass import getpass
# FINALLY..
import dimcli
from dimcli.shortcuts import *


##
# LOG IN 
##

USERNAME = "m.pasin@digital-science.com"  #@param {type: "string"}

if not USERNAME:
  print("====\nERROR: Please enter a valid Dimensions API username")
else:
  password = getpass('====\nEnter password here')
  print('=> username is', USERNAME)
  print('=> password is', "*" * len(password))
  dimcli.login(USERNAME, password)
  dsl = dimcli.Dsl()


# Competing Journals

From our researchers master list, we now want to extract the following:

* full list of publications for a 5 year period
* full list of journals with counts of how many publications per journal 

This new dataset will let us draw some conclusions about which journals compete with the one we selected at the beginning.



#### First let's reload the data obtained in previous steps

In [0]:
#
journal_id = "jour.1103138" # Nature Genetics
start_year = 2015 
#
researchers = pd.read_csv("data/2.researchers_impact_metrics.csv")
#

In [4]:
print("Total researchers:", len(researchers))
researchers.head(5)

Total researchers: 17440


| | researcher_id | pubs | full_name | citations_mean | altmetric_mean | last_pub_year | url |
|---|---|---|---|---|---|---|---|
| 0 | ur.0723426172.10 | 49 | Kari Stefansson | 84.571429 | 261.346939 | 2019 | https://app.dimensions.ai/discover/publication... |
| 1 | ur.01277776417.51 | 35 | Unnur Thorsteinsdottir | 70.342857 | 168.485714 | 2019 | https://app.dimensions.ai/discover/publication... |
| 2 | ur.0641525362.39 | 30 | Gonçalo R Abecasis | 106.9 | 146.6 | 2019 | https://app.dimensions.ai/discover/publication... |
| 3 | ur.0637651205.48 | 27 | Daniel F Gudbjartsson | 55.37037 | 153.518519 | 2019 | https://app.dimensions.ai/discover/publication... |
| 4 | ur.01313145634.66 | 27 | Andres Metspalu | 159.592593 | 356.518519 | 2019 | https://app.dimensions.ai/discover/publication... |


### Two possible approaches in this case

The easy way is to use the 'journals' facet on publications data.

This approach is quick and good enough to get a feel for the top competitors. However:
* the results will end up including duplicates
* there might be a long tail of journals with very few publications that are lost

In [10]:
%%dsldf 
search publications where researchers.id in ["ur.01277776417.51", "ur.0637651205.48"] 
and year >= 2015 and journal.id != "jour.1103138" return journal limit 1000

Returned Journal: 38


| | count | id | title |
|---|---|---|---|
| 0 | 32 | jour.1043282 | Nature Communications |
| 1 | 14 | jour.1293558 | bioRxiv |
| 2 | 6 | jour.1018957 | Nature |
| 3 | 6 | jour.1034974 | PLoS Genetics |
| 4 | 4 | jour.1102504 | Human Molecular Genetics |
| 5 | 3 | jour.1091325 | European Heart Journal |
| 6 | 3 | jour.1300829 | Communications Biology |
| 7 | 2 | jour.1017423 | Diabetes |
| 8 | 2 | jour.1024947 | BMC Medical Genetics |
| 9 | 2 | jour.1045271 | Translational Psychiatry |


**Second approach**: pull all the publication records and count the publications per journal programmatically.

This takes longer and requires more processing, but it is much more precise.

In [11]:
%%dsldf 
search publications where researchers.id in ["ur.01277776417.51", "ur.0637651205.48"] 
and year >= 2015 and journal is not empty 
and journal.id != "jour.1103138" return publications[id+journal] limit 1000

Returned Publications: 109 (total = 109)


| | id | journal.id | journal.title |
|---|---|---|---|
| 0 | pub.1113477375 | jour.1043282 | Nature Communications |
| 1 | pub.1112876837 | jour.1043282 | Nature Communications |
| 2 | pub.1111326269 | jour.1043282 | Nature Communications |
| 3 | pub.1113901526 | jour.1043282 | Nature Communications |
| 4 | pub.1115929591 | jour.1009646 | Molecular Genetics and Metabolism |
| 5 | pub.1113239657 | jour.1293558 | bioRxiv |
| 6 | pub.1110156056 | jour.1102504 | Human Molecular Genetics |
| 7 | pub.1111648957 | jour.1346339 | Science |
| 8 | pub.1112983752 | jour.1101548 | European Neuropsychopharmacology |
| 9 | pub.1110758233 | jour.1034974 | PLoS Genetics |
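
As a rough illustration of what "counting programmatically" could look like, the sketch below builds a small DataFrame from a few of the rows previewed above and counts publications per journal with pandas. The variable names `pubs` and `counts` are assumptions made for this example; they are not part of the notebook.

```python
import pandas as pd

# A handful of rows copied from the preview above, using the same column names
pubs = pd.DataFrame({
    "id": ["pub.1113477375", "pub.1112876837", "pub.1113239657", "pub.1110758233"],
    "journal.id": ["jour.1043282", "jour.1043282", "jour.1293558", "jour.1034974"],
    "journal.title": ["Nature Communications", "Nature Communications",
                      "bioRxiv", "PLoS Genetics"],
})

# Count each publication once, then tally publications per journal
counts = (
    pubs.drop_duplicates(subset="id")
        .groupby(["journal.id", "journal.title"])
        .size()
        .reset_index(name="total")
        .sort_values("total", ascending=False)
)
print(counts)
```

Because the counting happens on publication-level rows, duplicates can be removed by publication ID before any totals are computed.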


## Using the second approach to extract all publications/journals information

This approach takes more time (roughly 30 minutes or more), but the results should be precise and complete.



In [0]:
# our list of researchers
llist = list(researchers['researcher_id'])
#
# the query
q2 = """search publications where researchers.id in {} 
and year >= {} and journal is not empty and journal.id != "{}" 
return publications[id+journal]"""
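
To make the template easier to follow, here is a minimal sketch of how it might be filled in for one batch of researchers. The `chunks` helper is a hypothetical stand-in written for this example (the notebook itself uses `chunks_of` from `dimcli.shortcuts`), and the three sample IDs are taken from the researchers preview above.

```python
import json

# Same query template as in the cell above
q2 = """search publications where researchers.id in {}
and year >= {} and journal is not empty and journal.id != "{}"
return publications[id+journal]"""

def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements (hypothetical helper)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

sample_ids = ["ur.01277776417.51", "ur.0637651205.48", "ur.0723426172.10"]

# json.dumps renders a Python list as a double-quoted, bracketed list,
# which is the form used for the `in [...]` filter in the queries above
queries = [
    q2.format(json.dumps(chunk), 2015, "jour.1103138")
    for chunk in chunks(sample_ids, 2)
]
print(queries[0])
```

Batching keeps each query to a manageable size; the batch size of 2 here is only for illustration.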

In [0]:
#
# helper to extract and simplify nested journal data
def get_journal_data(x):
  # flatten the nested journal dict into top-level keys (modifies x in place);
  # return None when the journal field is not a dict
  if isinstance(x['journal'], dict):
    x['journal_id'] = x['journal']['id']
    x['title'] = x['journal']['title']
    del x['journal']
    return x
  return None
#
#
# iterate 200 researchers per loop
out = []
for chunk in tqdm(list(chunks_of(llist, 200))):
    res = dslqueryall(q2.format(json.dumps(chunk), start_year, journal_id))
    for x in res.publications:
        if get_journal_data(x):
            out.append(x)
    time.sleep(1)
#
# save to a df
journals = pd.DataFrame(out)
#
# remove duplicate rows: the same publication (same pub ID) can be
# returned by more than one chunk of researchers
journals = journals.drop_duplicates()
journals.to_csv("data/5.journals-via-publications-RAW.csv", index=False)
#
# now drop pub_id column
journals = journals.drop(['id'], axis=1)
#
# add total column 
journals['total'] = journals.groupby('journal_id')['journal_id'].transform('count')
#
# keep one row per journal now that every row carries its total
journals = journals.drop_duplicates() 
#
# sort by total count
journals = journals.sort_values('total', ascending=False)
#
# save
journals.to_csv("data/5.journals-via-publications.csv", index=False)
print("======\nDone")
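
The sketch below illustrates, on made-up data, why duplicates are dropped before counting: a publication co-authored by researchers who fall into different chunks is returned once per chunk. All IDs and titles here are hypothetical placeholders, not real Dimensions records.

```python
import pandas as pd

# Hypothetical results from two chunks; pub.1 has authors in both chunks
chunk_a = [
    {"id": "pub.1", "journal_id": "jour.A", "title": "Journal A"},
    {"id": "pub.2", "journal_id": "jour.B", "title": "Journal B"},
]
chunk_b = [
    {"id": "pub.1", "journal_id": "jour.A", "title": "Journal A"},  # repeated
    {"id": "pub.3", "journal_id": "jour.A", "title": "Journal A"},
]

toy = pd.DataFrame(chunk_a + chunk_b)

# Counting straight away double-counts pub.1
naive = toy["journal_id"].value_counts()

# Dropping identical rows first counts each publication once
deduped = toy.drop_duplicates()["journal_id"].value_counts()

print(naive["jour.A"], deduped["jour.A"])
```

The order matters: once the publication ID column is dropped, repeated rows can no longer be told apart from genuinely distinct publications in the same journal.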

In [20]:
#preview the data 
journals.head(5)

| | journal_id | title | total |
|---|---|---|---|
| 177 | jour.1293558 | bioRxiv | 9691 |
| 822 | jour.1312191 | Journal of Clinical Oncology | 7239 |
| 263 | jour.1037553 | PLoS ONE | 6905 |
| 2 | jour.1045337 | Scientific Reports | 5626 |
| 1177 | jour.1034467 | Alzheimer's & Dementia | 3336 |


In [0]:
# download the data
if COLAB_ENV:
  files.download("data/5.journals-via-publications.csv")

# Visualization

In [30]:

threshold = 500

px.bar(journals[:threshold], x="title", y="total", hover_name="title", 
           hover_data=['journal_id', 'title', 'total' ], 
           title=f"Top competitors for {journal_id} (based on data from {start_year})")