# What Journals Have Been Citing My Organization? 

This notebook shows how to use the [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/) to discover which academic journals most frequently cite publications from a selected research organization.

We start from a [GRID](https://grid.ac/) identifier (representing a research organization in Dimensions) and then select all publications citing research where at least one author is or was affiliated with that GRID ID. We then group these publications by source (journal) and sort them by frequency.

For the purpose of this exercise, we will look at the **University of Bologna** in Italy: [grid.6292.f](https://grid.ac/institutes/grid.6292.f) and publications from 2018.

In [67]:
GRID_ORG = "grid.6292.f"
YEAR = 2018

> **Customize this notebook:** simply change the GRID ID to try out this example for your institution. You can also add more constraints to the `publications` query below, e.g. filtering by subject area, in order to obtain a different segmentation of these results.

In [65]:
# data analysis libraries
import json
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; pandas.io.json.json_normalize is deprecated
from tqdm import tqdm
# Dimensions API query helper
import dimcli
from dimcli.shortcuts import chunks_of
dsl = dimcli.Dsl()

## 1. Getting all publications from the GRID org

First we extract all publications where at least one of the authors is affiliated with `GRID_ORG`.

By saying `return publications[reference_ids]` we select only the field we are interested in.

In [68]:
publications = dsl.query_iterative(f"""
search publications 
    where research_orgs.id = "{GRID_ORG}" and year in [{YEAR}] and reference_ids is not empty
    return publications[reference_ids] sort by id 
""")

1000 / 4765
2000 / 4765
3000 / 4765
4000 / 4765
4765 / 4765


Let's remove duplicates from the references data and save it into a set.

In [69]:
references = set()

for publication in publications.publications:
    for ref_id in publication['reference_ids']:
        references.add(ref_id)

In [81]:
len(references)

162291
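
The same de-duplication can be written as a single set comprehension. This is a minimal sketch on made-up records standing in for `publications.publications`; the `sample_publications` name and the IDs are illustrative only, and the `.get` default is an assumption that guards against records lacking a `reference_ids` key.

```python
# Toy records standing in for `publications.publications` (IDs are made up)
sample_publications = [
    {"reference_ids": ["pub.1", "pub.2"]},
    {"reference_ids": ["pub.2", "pub.3"]},
    {},  # a record without any references
]

# Flatten every reference list into one set; duplicates collapse automatically
unique_refs = {
    ref_id
    for publication in sample_publications
    for ref_id in publication.get("reference_ids", [])
}

print(len(unique_refs))  # 3
```

A set is a good fit here because only membership matters, not order; it is converted to a list later, when the IDs are split into batches.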

## 2. Getting all referencing publications

In the next step we extract all publications corresponding to these references. This query will return JSON data which can be further analyzed e.g. to count the unique number of journals they were published in.

E.g.:

```
'publications': [
  {'journal': {'id': 'jour.1295784',
    'title': 'IEEE Transactions on Cognitive and Developmental Systems'},
   'publisher': 'Institute of Electrical and Electronics Engineers (IEEE)',
   'year': 2018,
   'id': 'pub.1061542201',
   'issn': ['2379-8920', '2379-8939']},
  {'journal': {'id': 'jour.1043581', 'title': 'International Geology Review'},
   'publisher': 'Taylor & Francis',
   'year': 2018,
   'id': 'pub.1087302818',
   'issn': ['0020-6814', '1938-2839']}, etc..
```

This is the query template we use.

In [71]:
query = """search publications where journal is not empty and id in {} 
return publications[journal+issn+id+year+publisher] limit 1000"""

Note the `{}` placeholder, which is where we insert a list of publication IDs on each iteration. Sending the IDs in batches keeps each query from becoming too long (batches of up to 400 IDs are a good way to avoid API errors).

In [None]:
pub_journals = []
for chunk in tqdm(list(chunks_of(list(references), 400))):
    pub_journals += (dsl.query(query.format(json.dumps(chunk))).publications)
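
To make the batching step concrete, here is a rough sketch of what happens to the IDs before each query is sent. The `chunked` helper is a hypothetical plain-Python stand-in for `chunks_of`, not dimcli's implementation, and the IDs are made up; no API call is made.

```python
import json

def chunked(items, size):
    """Yield successive slices of `items` holding at most `size` elements (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

query_template = """search publications where journal is not empty and id in {}
return publications[journal+issn+id+year+publisher] limit 1000"""

ids = [f"pub.{n}" for n in range(1, 1001)]  # 1000 made-up publication IDs
batches = list(chunked(ids, 400))
print([len(batch) for batch in batches])  # [400, 400, 200]

# json.dumps renders a Python list as a double-quoted array: ["pub.1", "pub.2"]
first_query = query_template.format(json.dumps(batches[0][:2]))
print(first_query)
```

Because each batch holds at most 400 IDs and the query sets `limit 1000`, a single request can return every matching record in the batch.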

## 3. Grouping References by Journals
We are going to analyze the referencing publications, in particular by grouping them by source journal, after loading the JSON data into a pandas DataFrame.

> [pandas](https://pandas.pydata.org/pandas-docs/stable/) is a popular software library written for the Python programming language for data manipulation and analysis.

In [83]:
df = json_normalize(pub_journals)
df.head(5)

|   | id | issn | journal.id | journal.title | publisher | year |
|---|---|---|---|---|---|---|
| 0 | pub.1104882405 | [0962-8436, 1471-2970] | jour.1032123 | Philosophical Transactions of the Royal Societ... | The Royal Society | 2018 |
| 1 | pub.1106384345 | [0885-3185, 1531-8257] | jour.1096585 | Movement Disorders | Wiley | 2018 |
| 2 | pub.1103832990 | [0100-879X, 1414-431X] | jour.1009158 | Brazilian Journal of Medical and Biological Re... | FapUNIFESP (SciELO) | 2018 |
| 3 | pub.1092079480 | [1551-3203, 1941-0050] | jour.1045705 | IEEE Transactions on Industrial Informatics | Institute of Electrical and Electronics Engine... | 2018 |
| 4 | pub.1092550561 | [0001-5148, 1398-9995] | jour.1358083 | Allergy | Wiley | 2018 |
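
The `journal.id` and `journal.title` columns come from the nested `journal` object in each record, which `json_normalize` flattens using dotted column names. A small sketch on made-up records, assuming pandas 1.0 or later (where the function is available as `pd.json_normalize`):

```python
import pandas as pd

# Made-up records with the same nesting as the API response
records = [
    {"id": "pub.1", "year": 2018, "journal": {"id": "jour.1", "title": "Journal A"}},
    {"id": "pub.2", "year": 2018, "journal": {"id": "jour.2", "title": "Journal B"}},
]

# Nested dictionaries become columns named "<parent>.<child>"
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```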


Dataframes provide many ways to analyze the data further.

### How many unique journals do we have?

In [84]:
df['journal.id'].describe()

count           153549
unique           11985
top       jour.1037553
freq              1622
Name: journal.id, dtype: object

### What are the most frequent journals?

We set a threshold of more than 100 citations to identify the most relevant journals, and sort them by the total number of publications citing research from our chosen GRID ID.

In [86]:
counts = df['journal.title'].value_counts()
counts[counts>100]

PLoS ONE                                                             1622
The Astrophysical Journal                                            1490
Physical Review D                                                    1454
Monthly Notices of the Royal Astronomical Society                    1327
Proceedings of the National Academy of Sciences                      1151
Nature                                                               1097
Science                                                               998
Physical Review Letters                                               943
Journal of High Energy Physics                                        935
Journal of the American Chemical Society                              661
Blood                                                                 651
Astronomy and Astrophysics                                            633
Physics Letters B                                                     622
New England Journal of Medicine       
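
Counting on `journal.title` alone would merge any distinct journals that happen to share a title. One possible variant, sketched here on made-up data, groups on `journal.id` as well; the `toy` DataFrame and the threshold of 1 are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Made-up data with the same two columns as the real DataFrame
toy = pd.DataFrame({
    "journal.id": ["jour.1", "jour.1", "jour.2", "jour.3", "jour.3", "jour.3"],
    "journal.title": ["Journal A", "Journal A", "Journal B",
                      "Journal C", "Journal C", "Journal C"],
})

# Count rows per (id, title) pair, most frequent first
per_journal = (
    toy.groupby(["journal.id", "journal.title"])
    .size()
    .sort_values(ascending=False)
)

# Keep only journals above the (illustrative) threshold
top = per_journal[per_journal > 1]
print(top)
```

The result is a Series indexed by both identifier and title, so it can be filtered and exported in the same way as `counts`.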

In [88]:
# finally we can save the data to a CSV file
counts[counts>100].to_csv("top_journals_citing_" + GRID_ORG + ".csv")





---
# Want to learn more?


Make sure you check out the [Dimensions API Lab](https://digital-science.github.io/dimensions-api-lab/) website, which contains many tutorials and reusable Jupyter notebooks for scholarly data analytics.