# Scopus Retrieval
After downloading a CSV from Scopus of the articles selected by journal ISSN, loop through the articles by DOI and retrieve each one's authors, affiliations, and references using the pybliometrics library. The useful columns from the original CSV are DOI, title, and document type. Note that this only works if you are connected through your institution's VPN or network and the institution has a full-service API subscription.

https://pybliometrics.readthedocs.io/en/stable/classes/AbstractRetrieval.html#pybliometrics.scopus.AbstractRetrieval
https://dev.elsevier.com/sc_abstract_retrieval_views.html 

Observations:
1. Affiliations are often attributed to the wrong institution, even when the correct institution appears in the text. This is likely caused by misuse of tesseract or by not specifying how affiliations should be collected.
   * If an author has multiple affiliations, they are not recorded separately, even when commas separate them. Only a single institution is attached to the author, and it is often the wrong one or not the first affiliation.
   * If there are multiple authors and the affiliations are listed in order without a direct link between each author and their affiliations, every affiliation can be assigned to every author. E.g., if Jane Doe and John Doe belong to Stanford and Harvard respectively, both Jane and John are recorded as belonging to both Stanford and Harvard.
   * Conclusion: there is value in running this ourselves through MTurk to separate out each affiliation.
2. References are in a good format. We must check them against the tesseract-collected references to see what Scopus is missing.
   * References to works outside the Scopus corpus are also present, which means Scopus doesn't only include references to articles found in its corpus.
   * The article DOIs are the actual DOIs, but the source links point to a page on Scopus. This is not useful for consolidation because it is not a direct link to the article's JSTOR pseudo-DOI. Resolving a Scopus link to a JSTOR link would require additional automation, so consolidate against JSTOR with a title and author match.
3. Abstracts are fine
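
The title and author match suggested in observation 2 could be sketched roughly as follows. This is only an illustration, not the matching actually used here: the helper names `normalize` and `is_same_article`, the 0.9 similarity threshold, and the record layout (dicts with `title` and `first_author` keys) are all assumptions.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalize(text):
    # Hypothetical helper: drop accents, punctuation, and case differences
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

def is_same_article(scopus_rec, jstor_rec, threshold=0.9):
    # Assumed record layout: {"title": ..., "first_author": ...}
    title_score = SequenceMatcher(
        None, normalize(scopus_rec["title"]), normalize(jstor_rec["title"])
    ).ratio()
    same_author = (
        normalize(scopus_rec["first_author"]) == normalize(jstor_rec["first_author"])
    )
    # Require both a near-identical title and the same first author
    return title_score >= threshold and same_author
```

Normalizing first keeps footnote markers such as the dagger in "Segmented housing search†" from blocking a match. The threshold would need tuning against real Scopus and JSTOR records.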

In [187]:
from pybliometrics.scopus import AbstractRetrieval
import pandas as pd
import json
pd.set_option('display.max_columns', None)

In [188]:
import pybliometrics.scopus as pyblio

Set up some utility functions

In [None]:
def open_json(filepath):
    # The with-block closes the file automatically, so no explicit close() is needed
    with open(filepath, encoding='utf-8') as json_data:
        return json.load(json_data)
        
def save_json(fn, data):
    with open(fn, 'w', encoding='utf-8') as f:
        print("dumping ...")
        json.dump(data, f, ensure_ascii=True, indent=2)

Set base path and read in downloaded scopus file

In [17]:
base_path="/Users/sijiawu/Downloads/"
j_name="aer" #change this to ecta, jpe, res, qje to cycle through the journals

Read in the data

In [212]:
journal=pd.read_csv(base_path+j_name+"_scopus.csv")

In [213]:
journal.head(3)

Unnamed: 0,Authors,Author full names,Author(s) ID,Title,Year,Source title,Volume,Issue,Art. No.,Page start,Page end,Page count,Cited by,DOI,Link,Affiliations,Authors with affiliations,Abstract,References,Publisher,Document Type,Publication Stage,Open Access,Source,EID
0,Lachowska M.; Mas A.; Woodbury S.A.,"Lachowska, Marta (56104480700); Mas, Alexandre...",56104480700; 36803875800; 6701836505,Sources of displaced workers' long- term earni...,2020,American Economic Review,110,10,,3231,3236,5.0,29,10.1257/aer.20180652,https://www.scopus.com/inward/record.uri?eid=2...,W. E. Upjohn Institute for Employment Research...,"Lachowska M., W. E. Upjohn Institute for Emplo...",We estimate the magnitudes of reduced earnings...,"Abowd John, Creecy Robert H., Kramarz Francis,...",American Economic Association,Article,Final,All Open Access; Green Open Access,Scopus,2-s2.0-85092547682
1,Piazzesi M.; Schneider M.; Stroebel J.,"Piazzesi, Monika (6506641986); Schneider, Mart...",6506641986; 57202768608; 56070941400,Segmented housing search†,2020,American Economic Review,110,3,,720,759,39.0,38,10.1257/aer.20141772,https://www.scopus.com/inward/record.uri?eid=2...,"Stanford, NBER, United States; New York Univer...","Piazzesi M., Stanford, NBER, United States; Sc...",We study housing markets with multiple segment...,"Albrecht J., Anderson A., Smith E., Vroman S.,...",American Economic Association,Article,Final,All Open Access; Green Open Access,Scopus,2-s2.0-85085354518
2,Oprea R.,"Oprea, Ryan (24479977800)",24479977800,What Makes a Rule Complex?,2020,American Economic Review,110,12,,3913,3951,38.0,19,10.1257/AER.20191717,https://www.scopus.com/inward/record.uri?eid=2...,"Economics Department, University of California...","Oprea R., Economics Department, University of ...",We study the complexity of rules by paying exp...,"Abeler Johannes, Jager Simon, Complex Tax Ince...",American Economic Association,Article,Final,,Scopus,2-s2.0-85098227762


In [214]:
journal.columns

Index(['Authors', 'Author full names', 'Author(s) ID', 'Title', 'Year',
       'Source title', 'Volume', 'Issue', 'Art. No.', 'Page start', 'Page end',
       'Page count', 'Cited by', 'DOI', 'Link', 'Affiliations',
       'Authors with affiliations', 'Abstract', 'References', 'Publisher',
       'Document Type', 'Publication Stage', 'Open Access', 'Source', 'EID'],
      dtype='object')

## Define the function to retrieve and process data


In [148]:
def retrieveStructuredRefsAndProcess(doi):
    # Request the FULL view, which this notebook relies on for references
    ab = AbstractRetrieval(doi, view="FULL")

    def as_dicts(items):
        # Fields can come back as None; otherwise convert each namedtuple to a dict
        return None if items is None else [item._asdict() for item in items]

    return {"authors": as_dicts(ab.authors),
            "references": as_dicts(ab.references),
            "affiliations": as_dicts(ab.affiliation),
            "authorgroup": as_dicts(ab.authorgroup)}

This container will hold all the retrieved data

In [215]:
ref_holder={}

Let's run a loop over the DOIs of the scopus file and retrieve each paper's references if there is a DOI.

In [None]:
for i in journal.index:
    doi=journal.loc[i, "DOI"]
    # To resume an interrupted run, uncomment and set the index to restart from
    # if i<2131:
    #     continue
    if not pd.isna(doi):
        print(f"{doi} {i}")
        ref_holder[doi]=retrieveStructuredRefsAndProcess(doi)
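
Since a long run can be interrupted by a network error or an exhausted API quota, a resumable variant of this loop may be useful. The sketch below is an assumption, not part of the original workflow: `retrieve_all` is a hypothetical helper that takes the retrieval function as a parameter, skips DOIs that are already in the container, and records failures instead of stopping.

```python
def retrieve_all(dois, fetch, holder=None):
    """Call fetch(doi) for every DOI not already in holder; collect failures."""
    holder = {} if holder is None else holder
    failed = {}
    for doi in dois:
        if doi in holder:
            continue  # already retrieved in an earlier run
        try:
            holder[doi] = fetch(doi)
        except Exception as err:  # deliberately broad so one bad DOI does not end the run
            failed[doi] = str(err)
    return holder, failed
```

In this notebook it might be called as `ref_holder, failed = retrieve_all(journal["DOI"].dropna(), retrieveStructuredRefsAndProcess, ref_holder)`, after which `failed` lists the DOIs to retry.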

Keep only some of the columns from the file exported from the Scopus dashboard.

In [216]:
journal_r=journal[['Author full names', 'Title', 'Year',
       'Source title', 'Volume', 'Issue', 'Art. No.', 'Page start', 'Page end',
       'Page count', 'Cited by', 'DOI', 'Abstract', 'Publisher',
       'Document Type', 'Publication Stage', 'Open Access', 'Source', 'EID']].to_dict('records')

Combine the journal data from the Scopus CSV with the data retrieved through the API.

In [None]:
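for record in journal_r:
    doi=record['DOI']
    if not pd.isna(doi):
        print(doi)
        # Merge the CSV columns into the API result (dict union needs Python 3.9+);
        # .get avoids a KeyError for DOIs that were never retrieved
        ref_holder[doi]=ref_holder.get(doi, {})|record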

Save the data

In [210]:
save_json("scopus_"+j_name+".json", ref_holder) # change the journal name accordingly

dumping ...


Run each Scopus file through this notebook, provided the file has DOIs and you have API credits remaining.
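
To cycle through the journals without editing `j_name` by hand, the input and output paths can be derived in a loop. This is a small sketch that follows the file naming used above; `journal_paths` and the `JOURNALS` list are illustrative names, and the base path is the one set earlier in this notebook.

```python
JOURNALS = ["aer", "ecta", "jpe", "res", "qje"]  # abbreviations listed earlier

def journal_paths(base_path, j_name):
    # Mirrors the notebook's naming: <j_name>_scopus.csv in, scopus_<j_name>.json out
    return base_path + j_name + "_scopus.csv", "scopus_" + j_name + ".json"

for j_name in JOURNALS:
    csv_path, json_path = journal_paths("/Users/sijiawu/Downloads/", j_name)
    print(csv_path, "->", json_path)
```

Each iteration would then read `csv_path`, run the retrieval and merge steps, and save to `json_path`.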