## Citation Overrepresentation Tool
This tool extracts the last authors and journals of the papers you have cited and builds frequency tables so you can see whom you cite most often. It then attempts to map authors to institutions via Google Scholar or the ORCID database. You can use it to find which journals, labs, and universities receive most of your attention.

1. Extract your citations as a .bib file. Either export all references from a single folder in your citation manager or, if you use Mendeley or Zotero, extract them straight from a Word document using [this tool](https://rintze.zelle.me/ref-extractor/).
2. Upload your .bib file to the binder.
3. Run each cell (Shift+Enter or Play button).
4. Optional: extract institutions from Google Scholar.
5. Optional: get an ORCID API key (instructions below) to extract institutions from ORCID.
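
The counting idea behind the tool can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own code: the citation list is made-up example data and `tally_last_authors_and_journals` is a hypothetical helper name.

```python
from collections import Counter

# Illustrative citations only: (authors in listed order, journal).
citations = [
    (["A. Example", "B. Senior"], "Journal One"),
    (["C. Sample", "B. Senior"], "Journal Two"),
    (["D. Person", "E. Other"], "Journal One"),
]

def tally_last_authors_and_journals(citations):
    """Count how often each last author and each journal appears."""
    last_authors = Counter(authors[-1] for authors, _ in citations if authors)
    journals = Counter(journal for _, journal in citations)
    return last_authors, journals

last_authors, journals = tally_last_authors_and_journals(citations)
print(last_authors.most_common(10))
print(journals.most_common(10))
```

The notebook itself does the same thing with a pandas data frame, which makes it easier to group by first and last name together.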

In [None]:
#imports
!pip install pybtex
!pip install scholarly

from pybtex.database.input import bibtex
import glob
import pandas as pd
import requests
import re
from scholarly import scholarly

In [None]:
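#read the first .bib file found in the working directory
ID = glob.glob('*.bib')
if not ID:
    raise FileNotFoundError("No .bib file found. Please upload your .bib file & try again.")
parser = bibtex.Parser()
try:
    bib_data = parser.parse_file(ID[0])
except UnicodeDecodeError:
    raise ValueError("Your .bib file has non-UTF8 characters in it (like smart quotes). Please remove them & try again.")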

In [None]:
#extract last author & journal name from each citation
authors = list()
for key in bib_data.entries:
    entry = bib_data.entries[key]
    author = entry.persons.get('author')
    if not author:
        continue  #skip entries with no author field
    #join name parts so multi-part names (e.g. "Mary Ann") stay intact
    first_name = ' '.join(str(part) for part in author[-1].rich_first_names)
    last_name = ' '.join(str(part) for part in author[-1].rich_last_names)

    #entries without a journal field are labelled 'Book'
    journal = entry.fields.get('journal', 'Book')
    authors.append([first_name, last_name, journal])

In [None]:
#build data frame & print
auth_df = pd.DataFrame(authors, columns=['First Name','Last Name', 'Journal'])
print('Overcited Authors')
names_grouped = auth_df.groupby(['First Name','Last Name']).size()
print(names_grouped.sort_values(ascending=False).head(10))
print('\nOvercited Journals')
print(auth_df.groupby(['Journal']).size().sort_values(ascending=False).head(10))

### Optional: Get institutions from Google Scholar
Using the [scholarly package](https://pypi.org/project/scholarly/), query Google Scholar for institutions. Limitations: this assumes the first hit is correct, and you may need to use the method get_proxy if you make too many requests.
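
Affiliation strings from Google Scholar often start with a job title, which the notebook strips with a regular expression before counting. A small sketch of that clean-up step on its own, using made-up affiliation strings and a hypothetical helper name `clean_affiliation`:

```python
import re

def clean_affiliation(affiliation):
    """Remove everything from 'Professor ' through the last comma that follows it."""
    return re.sub(r'Professor .+, ', '', affiliation)

# Illustrative inputs only.
print(clean_affiliation('Professor of Vision Science, UC Berkeley'))
print(clean_affiliation('Example University'))
```

The pattern is greedy, so an affiliation with several commas keeps only the text after the last one. Strings that do not contain "Professor " pass through unchanged, and other titles are not removed.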

In [None]:
#match each author to an institution
#NB this section calls Google Scholar once for each author, so it takes a while
inst_list = list()
for name in names_grouped.index:
    search_query = scholarly.search_author(name[0]+' '+name[1])
    try:
        author = next(search_query)
        #recent scholarly releases return a dict; older ones return an object
        affil = author['affiliation'] if isinstance(author, dict) else author.affiliation
        inst_list.append(re.sub(r'Professor .+, ', '', affil))
    except Exception:
        #no hit, no affiliation, or the request was blocked
        inst_list.append('Undetermined')

In [None]:
#sum author counts per institution and print
inst_series = pd.Series(names_grouped.values, index = inst_list).groupby(level=0).sum()
print('Overcited Institutions')
print(inst_series.sort_values(ascending=False).head(10))

### Optional: Get institutions from ORCID
Using the ORCID API, query ORCID for institutions. Limitations: this assumes the first hit is correct, and many authors have empty ORCID records.
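
Because many records are empty, the institution lookup has to tolerate missing fields. A minimal sketch of a defensive lookup, assuming the record layout that this notebook reads from the v2.1 record endpoint; the sample dictionaries and the helper name `first_employer` are hypothetical:

```python
def first_employer(record):
    """Return the first listed employer name, or 'Undetermined' if absent."""
    try:
        jobs = record['activities-summary']['employments']['employment-summary']
        return jobs[0]['organization']['name']
    except (KeyError, IndexError, TypeError):
        return 'Undetermined'

# Hypothetical records, shaped like the fields the notebook reads.
filled = {'activities-summary': {'employments': {'employment-summary': [
    {'organization': {'name': 'Example University'}}]}}}
empty = {'activities-summary': {'employments': {'employment-summary': []}}}

print(first_employer(filled))
print(first_employer(empty))
```

Catching the three lookup errors in one place means a record with no employment section at all is treated the same way as one with an empty list.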

Get an [ORCID API client ID & secret](https://support.orcid.org/hc/en-us/articles/360006897174).

You can learn more about how to [search for an ORCID](https://members.orcid.org/api/tutorial/search-orcid-registry) and [find info about an author given their ORCID](https://members.orcid.org/api/tutorial/read-orcid-records) in the API documentation.
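
As a rough sketch of the search step, the query URL can be assembled with `urllib.parse.quote` so that spaces and other special characters in names are percent-encoded. The URL shape mirrors the one this notebook sends to the public v3.0 search endpoint; `build_search_url` and the example name are hypothetical.

```python
from urllib.parse import quote

SEARCH_URL = 'https://pub.orcid.org/v3.0/search/'

def build_search_url(given, family, rows=1):
    """Build a search URL that matches on family name and given names."""
    query = 'family-name:' + quote(family) + '+AND+given-names:' + quote(given)
    return SEARCH_URL + '?q=' + query + '&rows=' + str(rows)

# Illustrative name only.
print(build_search_url('Mary Ann', 'Smith'))
```

Setting `rows=1` asks for a single hit, which is why the tool's results depend on the first match being the right person.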

In [None]:
#input your client ID & secret
ORCIDAPI_ID = 'YOUR ACCOUNT ID HERE'
ORCIDAPI_key = 'YOUR ACCOUNT KEY HERE'

In [None]:
# build a request for a token
payload = {'client_id': ORCIDAPI_ID,
           'client_secret': ORCIDAPI_key,
           'scope': '/read-public',
           'grant_type': 'client_credentials'}
url = 'https://orcid.org/oauth/token'
headers = {'Accept': 'application/json'}
response = requests.post(url, data=payload, headers=headers, timeout=None)
response.raise_for_status()
token = response.json()['access_token']

# set up headers for searches; the token is sent as a Bearer authorization header
headers = {'Accept': 'application/vnd.orcid+json',
           'Authorization': 'Bearer ' + token}

In [None]:
#find ORCIDs
#NB this section calls the API once for each author, so it takes a while
orcid_list = list()
for name in names_grouped.axes[0]:
    given = re.sub(' ','%20',name[0])
    family = re.sub(' ','%20',name[1])

    #build search
    url = "https://pub.orcid.org/v3.0/search/?q=" \
        + "family-name:" + family + "+AND+given-names:" + given \
        + "&rows=1"
    auth_id = requests.get(url, headers=headers, timeout=None)

    #get first returned ORCID, or an empty string when nothing matches
    result = auth_id.json().get('result')
    if result:
        orcid_list.append(result[0]['orcid-identifier']['path'])
    else:
        orcid_list.append('')

In [None]:
# get institution from ORCID
#NB this section calls the API once for each author, so it takes a while
inst_list = list()
for orcid in orcid_list:
    if len(orcid)>0:
        url = "https://pub.orcid.org/v2.1/" + orcid + "/record"
        orcid_request = requests.get(url, headers=headers, timeout=None)
        affil = orcid_request.json()['activities-summary']['employments']['employment-summary']
        #use the first listed employment, if there is one
        if len(affil)>0:
            inst_list.append(affil[0]['organization']['name'])
        else:
            inst_list.append('Undetermined')
    else:
        inst_list.append('Undetermined')

In [None]:
#sum author counts per institution and print
inst_series = pd.Series(names_grouped.values, index = inst_list).groupby(level=0).sum()
print('Overcited Institutions')
print(inst_series.sort_values(ascending=False).head(10))