# Scraping Elsevier Metadata

This notebook pulls article metadata through Scopus' literature search. It can technically retrieve abstracts from anywhere in the Scopus database, but we have limited it to Elsevier journals because Elsevier is the only publisher whose full text we can access. Specifically, it sets up a way to collect PII (Publisher Item Identifier) numbers automatically.

To manually test queries, go to https://www.scopus.com/search/form.uri?display=advanced

Elsevier maintains a list of all its active journals in a single Excel spreadsheet: https://www.elsevier.com/__data/promis_misc/sd-content/journals/jnlactivesubject.xls

This scraping tool is built around `pybliometrics`, a Python package that wraps the Scopus API and handles its quirks.

In [2]:
import pybliometrics
from pybliometrics.scopus import ScopusSearch
from pybliometrics.scopus.exception import Scopus429Error
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
import os
import multiprocessing
from os import system, name
import json
import time
from IPython.display import clear_output
from pybliometrics.scopus import config

To use `pybliometrics`, you need a config file on your computer. The easiest way to create one is the built-in command `pybliometrics.scopus.utils.create_config()`. It prompts for an API key, so have one ready before you run it. You can get a key after a quick registration at https://dev.elsevier.com/documentation/SCOPUSSearchAPI.wadl.

In [31]:
# In addition to the imports, pybliometrics must be configured the first time it is ever run.
# My API key: <YOUR_API_KEY>
# pybliometrics.scopus.utils.create_config()

Please enter your API Key, obtained from http://dev.elsevier.com/myapikey.html: 
 <YOUR_API_KEY>
API Keys are sufficient for most users.  If you have to use Authtoken authentication, please enter the token, otherwise press Enter: 
 


Configuration file successfully created at C:\Users\Jonathan/.scopus/config.ini


You should note the above filepath location, as it's important to have this filepath for later function calls. 

In [3]:
#Note your config path for pybliometrics: C:\Users\Jonathan/.scopus/config.ini
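
To confirm that the key landed where expected, the config file can be read with the standard-library `configparser`. This is a minimal sketch: the `Authentication` section and `APIKey` option names are taken from the key-swapping code used later in this notebook, `read_api_key` is a hypothetical helper, and the demo writes a throwaway file rather than touching your real config.

```python
import configparser
import os
import tempfile


def read_api_key(config_path):
    """Return the APIKey stored in an INI-style config file, or None if absent."""
    parser = configparser.ConfigParser()
    parser.read(config_path)  # a missing file is silently ignored
    return parser.get("Authentication", "APIKey", fallback=None)


# Demo on a temporary file so the real ~/.scopus/config.ini is untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.ini")
    with open(demo_path, "w") as handle:
        handle.write("[Authentication]\nAPIKey = YOUR_API_KEY\n")
    demo_key = read_api_key(demo_path)
    missing_key = read_api_key(os.path.join(tmp, "does_not_exist.ini"))

print(demo_key)
```

To check your own setup, pass the path printed by `create_config()` instead of the temporary file.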

### Let's get into it - Time to walk through the algorithm!

The algorithm narrows each search by three parameters:

1. Year
2. Journal
3. Keyword search

So, we'll have to select a set of these parameters to fine-tune our search to get articles that'll be useful to us. 

One of the first quick parameters that will help is to filter down the number of journals that we'll be searching through, and then organize them into a dataframe so we can continue to work through the data in later methods.

We'll first go through all the methods that we have, then we'll show you exactly how to use the methods with some examples.

The following method, `make_jlist`, creates a dataframe that only contains journals mentioning certain keywords in their 'Full_Category' column. Those keywords are passed directly to the method, though some default keywords can be used.

In [15]:
def make_jlist(jlist_url = 'https://www.elsevier.com/__data/promis_misc/sd-content/journals/jnlactivesubject.xls', 
               journal_strings = ['chemistry','energy','molecular','atomic','chemical','biochem'
                                  ,'organic','polymer','chemical engineering','biotech','colloid']):
    """
    This method creates a dataframe of relevant journals to query. The dataframe contains two columns:
    (1) The names of the Journals
    (2) The issns of the Journals
    
    As inputs, the URL for a journal list and a list of keyword strings to subselect the journals by is required.
    These values currently default to Elsevier's journals and some chemical keywords.
    """
    
    # This creates a dataframe of the active journals and their subjects from elsevier
    active_journals = pd.read_excel(jlist_url)
    # This makes the dataframe column names a smidge more intuitive.
    active_journals.rename(columns = {'Display Category Full Name':'Full_Category','Full Title':'Journal_Title'}, inplace = True)
    
    active_journals.Full_Category = active_journals.Full_Category.str.lower() # lowercase topics for searching
    active_journals = active_journals.drop_duplicates(subset = 'Journal_Title') # drop any duplicate journals
    active_journals = shuffle(active_journals, random_state = 42) 

    # Keep only journals whose topic description contains one of the desired keywords
    active_journals = active_journals[active_journals['Full_Category'].str.contains('|'.join(journal_strings), na=False)]

    # Select only the title and the journal's identification number (ISSN)
    journal_frame = active_journals[['Journal_Title','ISSN']]
    # Remove any journals that still appear more than once
    journal_frame = journal_frame.drop_duplicates(subset = 'Journal_Title')
    
    return journal_frame
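
The filtering step in `make_jlist` works because `'|'.join(journal_strings)` produces a regular-expression alternation, which `str.contains` treats as "match any of these substrings". Here is a small standard-library sketch of the same idea, using made-up category strings rather than rows from the Elsevier spreadsheet:

```python
import re

journal_strings = ["chemistry", "polymer", "colloid"]
pattern = "|".join(journal_strings)  # "chemistry|polymer|colloid"

# Hypothetical category descriptions, lowercased as in make_jlist.
categories = [
    "organic chemistry",
    "polymers and plastics",
    "immunology",
    "colloid and surface science",
]

# Keep every category in which at least one keyword appears as a substring.
matches = [c for c in categories if re.search(pattern, c)]
print(matches)
```

Because matching is by substring, a keyword such as `chemistry` also matches `biochemistry`. Keywords that contain regex metacharacters would need `re.escape` first.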

### The following method builds the keyword search portion of a query. There is an example below that can be copy-pasted into the Scopus Advanced Search.
This method is a helper function that you shouldn't need to call directly. It joins several terms into the form the online search expects, which would be tedious to type by hand.

In [7]:
def build_search_terms(kwds):
    """
    This builds the keyword search portion of the query string. 
    """
    combined_keywords = ""
    for i in range(len(kwds)):
        if i != len(kwds)-1:
            combined_keywords += kwds[i] + ' OR '
        else:
            combined_keywords += kwds[i] + ' '
    
    return combined_keywords
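
For manual testing it can help to make the grouping explicit. The sketch below is a hypothetical variant, not part of this notebook's pipeline, that wraps the OR-joined keywords in parentheses so the keyword group reads as one unit when further `AND` clauses are appended.

```python
def build_grouped_terms(kwds):
    """Join keywords with OR and wrap them in parentheses (hypothetical helper)."""
    if not kwds:
        return ""
    return "(" + " OR ".join(kwds) + ") "


grouped = build_grouped_terms(["polymer", "organic", "molecular"])
query = grouped + "AND PUBYEAR IS 2019 AND ISSN(00404020)"
print(query)
```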

In [None]:
# Here is a model test query
# test = ScopusSearch(verbose = True, query = 'polymer OR organic OR molecular AND PUBYEAR IS 2019 AND ISSN(00404020)')

### The following method builds the entire query to be put into pybliometrics
The query requires a pretty specific format, so we are using a helper function to make it less obnoxious to deal with.

In [8]:
def build_query_dict(term_list, issn_list, year_list):
    """
    This method takes the list of journals and creates a nested dictionary
    containing all accessible queries, in each year, for each journal,
    for a given keyword search on Scopus.
    
    Parameters
    ----------
    term_list(list, required): the list of search terms looked for in papers by the api.
    
    issn_list(list, required): the list of journal issn's to be queried. Can be created by getting the '.values'
    of a 'journal_list' dataframe that has been created from the 'make_jlist' method.
    
    year_list(list, required): the list of years which will be searched through
    
    """
    search_terms = build_search_terms(term_list)
    dict1 = {}
    #This loop goes through and sets up a dictionary key with an ISSN number
    for issn in issn_list:
        
        issn_terms = ' AND ISSN(' + issn + ')'
        dict2 = {}
        #This loop goes and attaches all the years to the outer loop's key.
        for year in year_list:
            
            year_terms = "AND PUBYEAR IS " + str(year)
            querystring = search_terms + year_terms + issn_terms

            dict2[year] = querystring

        dict1[issn] = dict2

    return dict1
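
The nested dictionary is keyed first by ISSN and then by year. The sketch below uses a hand-written dictionary in that same layout (the ISSNs are borrowed from examples in this notebook, and `flatten_queries` is a hypothetical helper) to show how it can be flattened into `(issn, year, query)` tuples, which is convenient for tracking progress or dividing work.

```python
def flatten_queries(query_dict):
    """Turn {issn: {year: query}} into a list of (issn, year, query) tuples."""
    return [
        (issn, year, query)
        for issn, by_year in query_dict.items()
        for year, query in by_year.items()
    ]


# Hand-written dictionary in the layout returned by build_query_dict.
query_dict = {
    "00404020": {
        2019: "polymer OR organic AND PUBYEAR IS 2019 AND ISSN(00404020)",
        2020: "polymer OR organic AND PUBYEAR IS 2020 AND ISSN(00404020)",
    },
    "00404039": {
        2019: "polymer OR organic AND PUBYEAR IS 2019 AND ISSN(00404039)",
    },
}

flat = flatten_queries(query_dict)
for issn, year, query in flat:
    print(issn, year, query)
```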

Ok, we can either run this with a single process, or we can multiprocess our way to victory. Either way, the first thing we need to do is define a set of functions that will follow our list of journals, as well as a set of outlined years, and search for a list of terms within those journals and years.

### Here is a method to clear the cache. It is rarely needed, because 1.1 million cached publications took up only about 2 GB of disk space
BE CAREFUL WITH THIS. It permanently deletes files from whatever directory you point it at, so double-check the path before running it. It can still be useful if your cache grows large enough to crowd your disk.

In [10]:
def clear_cache(cache_path):
    """
    Be very careful with this method. It permanently deletes files in cache_path.
    """
    
    # Only delete if the cache path contains the expected substring and the file names have the expected length
    
    if '.scopus/scopus_search/' in cache_path:
        for file in os.listdir(cache_path):
            
            # Make sure the deleted files match the standard length of pybliometrics cache file names
            if len(file) == len('8805245317ccb15059e3cfa219be2dd4'):
                os.remove(os.path.join(cache_path, file))
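
One way to lower the risk is a dry-run mode that reports what would be deleted before anything is removed. The sketch below is a hypothetical variant (`clear_cache_safely` and its `dry_run` flag are not part of this notebook). It keeps the 32-character file-name check used above and is demonstrated on a temporary directory rather than the real cache.

```python
import tempfile
from pathlib import Path

HASH_LENGTH = 32  # length of the cache file names the notebook checks for


def clear_cache_safely(cache_dir, dry_run=True):
    """List (or, with dry_run=False, delete) hash-named files directly inside cache_dir."""
    cache_dir = Path(cache_dir)
    targets = [
        p for p in cache_dir.iterdir()
        if p.is_file() and len(p.name) == HASH_LENGTH
    ]
    if not dry_run:
        for p in targets:
            p.unlink()
    return sorted(p.name for p in targets)


# Demo in a throwaway directory rather than the real cache.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / ("a" * 32)).write_text("cached result")
    (demo / "notes.txt").write_text("keep me")

    would_delete = clear_cache_safely(demo)            # dry run: nothing removed
    deleted = clear_cache_safely(demo, dry_run=False)  # actually deletes
    remaining = sorted(p.name for p in demo.iterdir())

print(would_delete, remaining)
```

Running once with the default `dry_run=True` and reading the returned names is a cheap sanity check before deleting for real.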

### The method below loops through the entire journal list and collects article metadata (i.e., not full text), including the PII
Unfortunately, collecting full text is not possible with this API. A separate script, `Elsevier_fulltext_api.py`, takes this metadata and retrieves the full-length article.

Fields worth saving, since the search results already include them:
1. Author names
2. Author keywords
3. Cited by count
4. Title
5. PII
6. DOI
7. Description

In [11]:
def get_piis(term_list, journal_frame, year_list, cache_path, output_path, keymaster=False, fresh_keys=None, config_path='/Users/DavidCJ/.scopus/config.ini'):
    """
    This should be a standalone method that receives a list of journals (issns), a keyword search,
    an output path and a path to clear the cache. It should be mappable to multiple parallel processes. 
    """
    if output_path[-1] != '/':
        raise Exception('Output file path must end with /')
    
    if '.scopus/scopus_search' not in cache_path:
        raise Exception('Cache path is not a sub-directory of the scopus_search. Make sure cache path is correct.')
    
    # Two lists whose values correspond to each other    
    issn_list = journal_frame['ISSN'].values
    # Remove colons and replace slashes and spaces in names for file storage purposes.
    # Building a new list applies every replacement and leaves journal_frame unmodified.
    journal_list = [title.replace(':','').replace('/','_').replace(' ','_')
                    for title in journal_frame['Journal_Title'].values]
    
            
    
    # Build the dictionary that can be used to sequentially query elsevier for different journals and years
    query_dict = build_query_dict(term_list,issn_list,year_list)
    
    # Must write to memory, clear cache, and clear a dictionary upon starting every new journal
    for i in range(len(issn_list)):
        # At the start of every journal, clear the standard output screen
        os.system('cls' if os.name == 'nt' else 'clear')
        paper_counter = 0

        issn_dict = {}
        for j in range(len(year_list)):
            # for every year in every journal, query the keywords
            print(f'{journal_list[i]} in {year_list[j]}.')
            
            # Want the sole 'keymaster' process to handle 429 responses by swapping the key. 
            if keymaster:
                try:
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]])
                except Scopus429Error:
                    print('entered scopus 429 error loop... replacing key')
                    if not fresh_keys:
                        raise  # no spare keys left to swap in
                    newkey = fresh_keys.pop(0)
                    config["Authentication"]["APIKey"] = newkey
                    time.sleep(5)
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]])
                    print('key swap worked!!')
            # If this process isn't the keymaster, try a query. 
            # If it excepts, wait a few seconds for keymaster to replace key and try again.
            else:
                try:
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]])
                except Scopus429Error:
                    print('Non key master is sleeping for 15... ')
                    time.sleep(15)
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]]) # at this point, the scopus 429 error should be fixed... 
                    print('Non key master slept, query has now worked.')
            
            # store relevant information from the results into a dictionary pertaining to that query
            year_dict = {}
            if query_results.results is not None:
                # some of the query results might be of type None 
                
                
                for k in range(len(query_results.results)):
                    paper_counter += 1
                    
                    result_dict = {}
                    result = query_results.results[k]

                    result_dict['pii'] = result.pii
                    result_dict['doi'] = result.doi
                    result_dict['title'] = result.title
                    result_dict['num_authors'] = result.author_count
                    result_dict['authors'] = result.author_names
                    result_dict['description'] = result.description
                    result_dict['citation_count'] = result.citedby_count
                    result_dict['keywords'] = result.authkeywords
                    
                    year_dict[k] = result_dict

            # Store all of the results for this year in the dictionary for this journal.
            # If the query returned None, this stores the empty dictionary.
            issn_dict[year_list[j]] = year_dict
        
        
        # Store all of the results for this journal in a folder as a json file
        os.makedirs(f'{output_path}{journal_list[i]}', exist_ok = True)
        with open(f'{output_path}{journal_list[i]}/{journal_list[i]}.json','w') as file:
            json.dump(issn_dict, file)
        
        with open(f'{output_path}{journal_list[i]}/{journal_list[i]}.txt','w') as file2:
            file2.write(f'This file contains {paper_counter} publications.')
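
The keymaster branch above retries only once after swapping keys. A possible generalization, sketched here without any network access, keeps swapping until a key works or the list runs out. Everything in the sketch is a stand-in: `RateLimitError` plays the role of `Scopus429Error`, and `run_query` and `set_key` are hypothetical callables that would wrap `ScopusSearch` and the config update.

```python
class RateLimitError(Exception):
    """Stand-in for Scopus429Error so the sketch runs without network access."""


def query_with_key_rotation(run_query, query, keys, set_key):
    """Call run_query(query); on a rate-limit error, install the next key and retry.

    Raises RuntimeError once every spare key has been used.
    """
    spare_keys = list(keys)  # copy so the caller's list is not mutated
    while True:
        try:
            return run_query(query)
        except RateLimitError:
            if not spare_keys:
                raise RuntimeError("All API keys are exhausted.")
            set_key(spare_keys.pop(0))


# Demo with a fake backend: the first key is "used up", the second works.
state = {"key": "KEY_A"}


def fake_run_query(query):
    if state["key"] == "KEY_A":
        raise RateLimitError
    return f"results for {query!r} using {state['key']}"


def fake_set_key(new_key):
    state["key"] = new_key


result = query_with_key_rotation(
    fake_run_query, "ISSN(00404020)", ["KEY_B", "KEY_C"], fake_set_key
)
print(result)
```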

****

## Example Time!

Ok, now that we've shown all the methods, let's investigate their usage. We'll walk through linearly, so feel free to use these cells to figure out and run your own scraping efforts.

First things first: we call the `make_jlist` method with the keywords we want to search by and receive a dataframe of our downselected journals. You will get a warning from this call, but it's not a big deal; it is a quirk of the `pandas.read_excel` function.

In [16]:
journal_list = make_jlist(jlist_url = 'https://www.elsevier.com/__data/promis_misc/sd-content/journals/jnlactivesubject.xls', 
               journal_strings = ['chemistry','synthesis','molecular','chemical','organic','polymer','materials'])



In [17]:
#Print out to show you the structure of the journal dataframe
journal_list.head()

| | Journal_Title | ISSN |
|---|---|---|
| 1574 | Current Research in Physiology | 26659441 |
| 6609 | Tetrahedron Letters | 00404039 |
| 2071 | Environmental Nanotechnology, Monitoring & Man... | 22151532 |
| 2781 | Immunobiology | 01712985 |
| 4695 | Matrix Biology | 0945053X |


In [27]:
# example of how to use the dictionary builder (term_list is defined in a cell further down)
issn_list = journal_list['ISSN'].values
dictionary = build_query_dict(term_list, issn_list, range(1995,2021))
#This shows a specific journal ISSN, and specific year selected. 
dictionary['00404020'][2015]

'polymer OR organic OR molecular OR molecule OR chemistry OR synthesis AND PUBYEAR IS 2015 AND ISSN(00404020)'

In [19]:
# List of API keys to rotate through (placeholders shown). Please grab your own; they are quick to get from the Elsevier developer site.
apikeylist = ['<YOUR_API_KEY_1>', '<YOUR_API_KEY_2>', '<YOUR_API_KEY_3>',
              '<YOUR_API_KEY_4>', '<YOUR_API_KEY_5>', '<YOUR_API_KEY_6>']

Ok, now we'll go ahead and set up the other search terms and cache paths, so we can run our full `get_piis` method. 

In [18]:
cache_path = '/Users/Jonathan/.scopus/scopus_search/COMPLETE/'
term_list = ['polymer','organic','molecular', 'chemistry', 'synthesis']

Ok, with search terms in hand, and a downselected journal list, we're ready to go and scrape papers!

In [None]:
get_piis(term_list,journal_list,range(1995,2021),cache_path=cache_path,output_path = '/Users/Jonathan/Desktop/pyblio_test/', keymaster = True, fresh_keys = apikeylist)
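
Each journal's JSON file nests results by year and then by result index. The sketch below shows one way to pull the PIIs back out. It uses made-up records in the layout that `get_piis` writes, and `collect_piis` is a hypothetical helper. Note that JSON turns the integer year and index keys into strings.

```python
import json


def collect_piis(issn_dict):
    """Gather non-empty PIIs from a {year: {index: record}} dictionary."""
    piis = []
    for year_records in issn_dict.values():
        for record in year_records.values():
            if record.get("pii"):
                piis.append(record["pii"])
    return piis


# Made-up records in the layout get_piis writes; real files hold more fields.
sample = {
    2019: {0: {"pii": "PII-EXAMPLE-1", "doi": None}},
    2020: {
        0: {"pii": None, "doi": None},
        1: {"pii": "PII-EXAMPLE-2", "doi": None},
    },
}

# Round-trip through JSON, as happens when the file is written and re-read.
loaded = json.loads(json.dumps(sample))
year_keys = list(loaded.keys())  # integer keys come back as strings
piis = collect_piis(loaded)
print(year_keys, piis)
```

With a real output file, `json.load` on the opened `.json` file would replace the round-trip line.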

### Further development work (not yet working)


In [13]:
def multiprocess(term_list, journal_frame, year_list, cache_path, output_path, keymaster = False, fresh_keys = None, config_path = '/Users/Jonathan/.scopus/config.ini', split_ratio = 2):
    """Split journal_frame into split_ratio chunks and run get_piis on each chunk in its own process."""
    split_list = np.array_split(journal_frame, split_ratio)
    processes = []
    for k in range(split_ratio):
        print("Before multiprocessing")
        p = multiprocessing.Process(target = get_piis, args = [term_list, split_list[k], year_list, cache_path, output_path, keymaster, fresh_keys, config_path])
        print("after multiprocessing")
        p.start()
        processes.append(p)
    for process in processes:
        process.join()

In [14]:
multiprocess(term_list, journal_list, range(1995,2021), cache_path=cache_path, output_path = '/Users/Jonathan/Desktop/pyblio_test2/', keymaster = True, fresh_keys = apikeylist, config_path =  '/Users/Jonathan/.scopus/config.ini', split_ratio = 2)
# split_ratio = 3
# split_list = np.array_split(journal_frame, split_ratio)
# for k in range(split_ratio):
#     p = multiprocessing.Process(target = get_piis, args = [term_list,split_list[k],range(1995,2021),cache_path,'/Volumes/My Passport/Davids Stuff/pyblio_test3/', True, fresh_keys])
#     p.start()

Before multiprocessing
after multiprocessing


BrokenPipeError: [Errno 32] Broken pipe

In [35]:
# First, we need to split our list of journals in half
df1, df2 = np.array_split(journal_list,2)

In [36]:
p1 = multiprocessing.Process(target = get_piis, args = [term_list,df1,range(1995,2021),cache_path,'/Volumes/My Passport/Davids Stuff/pyblio_test3/'])
p2 = multiprocessing.Process(target = get_piis, args = [term_list,df2,range(1995,2021),cache_path,'/Volumes/My Passport/Davids Stuff/pyblio_test3/',True,apikeylist])
#p3 = multiprocessing.Process(target = get_piis, args = [term_list,df3,range(1995,2021),cache_path,'/Volumes/My Passport/Davids Stuff/pyblio_test2/'])
#p4 = multiprocessing.Process(target = get_piis, args = [term_list,df4,range(1995,2021),cache_path,'/Volumes/My Passport/Davids Stuff/pyblio_test2/'])

p1.start()
p2.start()
#p3.start()
#p4.start()

# starttime=time.time()
# while True:
#     clear_cache(cache_path)
#     clear_output()
#     time.sleep(20.0 - ((time.time() - starttime) % 20.0)) 

p1.join()
p2.join()
#p3.join()
#p4.join()

BrokenPipeError: [Errno 32] Broken pipe
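
Since each query spends most of its time waiting on the network, a thread pool is one alternative that may be worth trying. Threads share the notebook's memory, so nothing has to be sent to a child process. The sketch below is an illustration under stated assumptions, not a tested replacement: `fake_scrape` stands in for `get_piis`, and `split_evenly` mimics the near-equal chunking of `np.array_split` for a plain list. Whether concurrent calls are safe with the pybliometrics cache and shared config has not been verified here.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, n_chunks):
    """Split items into n_chunks consecutive, near-equal chunks (earlier chunks get the extras)."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def fake_scrape(issn_chunk):
    """Placeholder for get_piis: returns how many journals it was given."""
    return len(issn_chunk)


issns = ["00404020", "00404039", "22151532", "01712985", "0945053X"]
chunks = split_evenly(issns, 2)

# pool.map preserves the order of the input chunks in its results.
with ThreadPoolExecutor(max_workers=2) as pool:
    counts = list(pool.map(fake_scrape, chunks))

print(chunks, counts)
```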

### The code below counts how many publications are stored in an output directory

In [14]:
def absoluteFilePaths(directory):
    for dirpath,_,filenames in os.walk(directory):
        for f in filenames:
            yield os.path.abspath(os.path.join(dirpath, f))
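
The same directory walk can be written with `pathlib`, which also makes it easy to keep only the summary `.txt` files. This is a hedged sketch on a temporary directory: `summary_files` and the file names are hypothetical, and the `._` prefix filter assumes those are the hidden sidecar files that tend to appear on external drives.

```python
import tempfile
from pathlib import Path


def summary_files(directory):
    """Return sorted .txt paths under directory, skipping files whose names start with '._'."""
    return sorted(
        p for p in Path(directory).rglob("*.txt")
        if not p.name.startswith("._")
    )


with tempfile.TemporaryDirectory() as tmp:
    journal_dir = Path(tmp) / "Some_Journal"
    journal_dir.mkdir()
    (journal_dir / "Some_Journal.txt").write_text("This file contains 12 publications.")
    (journal_dir / "._Some_Journal.txt").write_text("metadata")
    (journal_dir / "Some_Journal.json").write_text("{}")

    found = [p.name for p in summary_files(tmp)]

print(found)
```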

In [None]:
file2 = open('/Volumes/My Passport/Davids Stuff/pyblio_test/Gene: X/Gene: X.txt','r')

In [None]:
file2.readline()

In [15]:
def count_pubs(output_path):
    """Sum the publication counts recorded in every summary .txt file under output_path."""
    count = 0
    for path in absoluteFilePaths(output_path):
        if path.endswith('.txt') and '._' not in path:
            # Each summary file reads 'This file contains N publications.'
            with open(path,'r') as file:
                string = file.read()
            a = sum([int(s) for s in string.split() if s.isdigit()])
            count+=a

    return count

In [16]:
count_pubs('/Users/Jonathan/Desktop/pyblio_test/')