### This file runs a simple Scopus query and outputs Scopus IDs and DOIs to a TSV file.

Getting information at the article level is a two-stage process.  The first stage runs a search and gets a list of article IDs that match it.  The second runs each article ID as a query against Scopus (or another index, if there are DOIs) to get all the metadata.

The file [scopussearch.html](scopussearch.html) has documentation for all the ways to search Scopus.

**Libraries** Import all the libraries we are going to need.
* **apikey** is a separate file that has a variable apikey = _your api key_
* **requests** makes submitting a query over HTTP straightforward
  * [requests documentation](https://requests.readthedocs.io/en/latest/user/quickstart/)
* **pprint** formats output nicely while debugging
* **json** helps to convert JSON data from Scopus into formats we can output and put into other programs
* **time** supplies the date and time stamp used in the results filename

In [1]:
from apikey import apikey # this imports your elsevier api key
import requests
import pprint
import json
import time


**Constants** here are some things that will remain constant throughout the program

In [2]:
stemurl ="http://api.elsevier.com/content/search/scopus" # the standard entrypoint for the Scopus API

**Query** This is the query we are going to run against Scopus.  See [scopus search fields documentation](scopussearchfields.html) for more things you can search for.  For DOIs, `doi('10.26021/10274')` would do the trick

In [3]:
query = "all('Oamaru')" #searching for the phrase 'Oamaru'

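Build the full query and print it next to the entry point, to check what will be sent to Scopus.  The printed string is not itself a valid URL; Requests assembles the real one in the next step.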

In [4]:
fullquery = query #add anything to the query at the last minute
print(stemurl+fullquery) #this is all the information being sent to Scopus

http://api.elsevier.com/content/search/scopusall('Oamaru')


The next block uses the **Requests** library to make the call to the Scopus API.  First we create a dictionary, `payload`, holding all our options, and send that to the main URL.  Requests handles all the URL serialisation and escaping.

At this point we are only going to get _one_ result (see `'count': 1` in the payload).  This lets us run and test our query before letting it loose on Scopus and using up all our query quota.
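As a rough illustration of the serialisation step, the sketch below uses the standard library's `urllib.parse.urlencode` as a stand-in for what Requests does internally (the exact output of Requests may differ in detail). The name `example_payload` and its cut-down contents are assumptions made for this sketch; the API key is deliberately left out.

```python
from urllib.parse import urlencode

# Same entry point as the notebook; example_payload is a hypothetical,
# cut-down version of the real payload (no API key).
stemurl = "http://api.elsevier.com/content/search/scopus"
example_payload = {"query": "all('Oamaru')", "count": 1}

# urlencode percent-escapes the brackets and quote marks in the query.
encoded = urlencode(example_payload)
example_url = stemurl + "?" + encoded
print(example_url)
```

Compare this with the plain string concatenation printed earlier, which has no `?` separator and no escaping.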

In [5]:
payload = {'query': fullquery, 'apiKey':apikey, 'httpAccept':'application/json', 'cursor':'*', 'count': 1} #create a python dictionary that holds all the parameters that will go into the query
r = requests.get(stemurl, params=payload) # this uses requests to get data
text = json.loads(r.text) # parse the JSON text of the response into a python dictionary object so we can have a squizz

**Debug** Let's have a look at what we got from the API to see if it looks like what we want.  We put the parsed JSON object through prettyprint.  To see more of the response, comment out the active ```results``` line and uncomment one of the other lines that begin ```results``` (remove the `#` hash mark).

In [6]:
pp = pprint.PrettyPrinter(indent=4) # create a PrettyPrinter object
#results = text #everything including headers
#results = text['search-results']['entry']# just the entries
results = text['search-results']['entry'][0]['prism:doi']# just the DOI of the first entry
pp.pprint(results) # let's see the result.

# print(r.headers['X-RateLimit-Remaining']) # If you're worried about running out of queries with Scopus (2000 a day?) uncomment this and it will tell you how many queries you can make.

'10.1007/s10347-020-00619-4'


Let's see how many total results there were from our query.

In [7]:
resultsTotal = int(text['search-results']['opensearch:totalResults'])
print(resultsTotal)

679


Let's grab the Scopus ID and DOI of each entry.  They are in ```text['search-results']['entry'][?]['dc:identifier']``` and ```text['search-results']['entry'][?]['prism:doi']``` (uncomment one of the fuller ```results``` lines in the debug cell above to look at the data structure returned).  We put them in a results file, named with the date and time, in a directory called 'results'.
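To make the data structure concrete, here is a minimal sketch run against a hypothetical, cut-down response. Only the keys used in this notebook are shown. The Scopus ID values and the cursor string are placeholders rather than real data, and `extract_ids` is a helper name invented for this example.

```python
# A hypothetical, cut-down response: placeholder IDs and cursor, not real data.
sample = {
    "search-results": {
        "opensearch:totalResults": "679",
        "cursor": {"@next": "PLACEHOLDER_CURSOR"},
        "entry": [
            {"dc:identifier": "SCOPUS_ID:0000000000",
             "prism:doi": "10.1007/s10347-020-00619-4"},
            {"dc:identifier": "SCOPUS_ID:0000000001"},  # an entry with no DOI
        ],
    }
}

def extract_ids(response):
    """Return (scopus_id, doi) pairs; 'doi error' marks a missing DOI."""
    entries = response["search-results"].get("entry", [])
    return [(e["dc:identifier"], e.get("prism:doi", "doi error")) for e in entries]

pairs = extract_ids(sample)
for scopus_id, doi in pairs:
    print(scopus_id + "\t" + doi)
```

Using `dict.get` with a default avoids a `KeyError` when an entry has no `prism:doi` key.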

In [8]:
import os

os.makedirs("results", exist_ok=True) # make sure the results directory exists
resultsfilename = "results/queryresults-" + time.strftime("%Y%m%d-%H%M%S") + '.tsv' # file name built from the date and time

firstentry = text['search-results']['entry'][0]
firstentryid = firstentry['dc:identifier']
firstentryDOI = firstentry.get('prism:doi', 'doi error') # not every entry has a DOI
with open(resultsfilename, "a") as resultsfile: # the file is closed automatically
    resultsfile.write('scopusID\tDOI\n') # header row
    resultsfile.write(firstentryid + '\t' + firstentryDOI + '\n')

We already have our first result, so the number remaining is one less than the total, and we can start our main query (if everything is OK).  We'll get 200 results at a time (that's as many as Scopus lets you get).
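The arithmetic behind the paging can be sketched as follows. The function name `pages_needed` and its parameters are made up for this example; the figures are the ones used in this notebook (679 total results, one already fetched, 200 per request).

```python
import math

def pages_needed(results_total, already_fetched=1, page_size=200):
    """How many further requests are needed to fetch the remaining results."""
    remaining = max(results_total - already_fetched, 0)
    return math.ceil(remaining / page_size)

pages = pages_needed(679)
print(pages)
```

This is a quick way to estimate how much of the query quota a search will use before starting the main loop.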

In [9]:
del(payload['count'])

In [10]:
resultsRemaining = resultsTotal - 1 # we already have the first one
while resultsRemaining > 0:

    payload['cursor'] = text['search-results']['cursor']['@next'] # Scopus embeds the cursor for the next page in the results; we can dig it out and use it here.
    payload['count'] = 200

    r = requests.get(stemurl, params=payload) # this uses requests to get data
    remainingrequests = r.headers['X-RateLimit-Remaining'] # how many more requests can we make with this key?

    text = json.loads(r.text) # parse the JSON text of the response into a python dictionary object

    with open(resultsfilename, "a") as resultsfile: # open the results file once per page
        for entry in text['search-results']['entry']:
            entryid = entry['dc:identifier']
            entrydoi = entry.get('prism:doi', 'doi error') # not every entry has a DOI
            resultsfile.write(entryid + '\t' + entrydoi + '\n') # the Scopus identifier, a tab, then the DOI
    resultsRemaining = resultsRemaining - 200
    print(str(resultsRemaining), str(remainingrequests)) # results left to fetch (goes negative after a short last page), queries left for this key


478 19601
278 19600
78 19599
-122 19598
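For the second stage, the results file has to be read back in. The sketch below is one possible way to do that with the standard `csv` module. It assumes that identifiers carry a `SCOPUS_ID:` prefix and that a missing DOI was recorded as `doi error`; `read_results` is a hypothetical helper, and the sample rows use placeholder IDs rather than real data.

```python
import csv
import io

def read_results(handle):
    """Read scopusID/DOI rows from a tab-separated file handle.

    Returns (numeric_id, doi_or_None) tuples. The header row is skipped.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    rows = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        scopus_id = row[0].strip().split(":", 1)[-1]  # drop an assumed 'SCOPUS_ID:' prefix
        doi = row[1].strip()
        rows.append((scopus_id, None if doi == "doi error" else doi))
    return rows

# Placeholder data standing in for a real results file.
sample_tsv = (
    "scopusID\tDOI\n"
    "SCOPUS_ID:0000000000\t10.1007/s10347-020-00619-4\n"
    "SCOPUS_ID:0000000001\tdoi error\n"
)
rows = read_results(io.StringIO(sample_tsv))
print(rows)
```

With a real file, pass `open(resultsfilename, newline="")` instead of the `io.StringIO` object.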
