# Scraping MetaData and getting Fulltexts

This tutorial notebook uses functions from `pybliometrics` and `Elsevier_fulltext_api` to show how to get the fulltexts of the articles you pulled from Scopus.

The first part of the notebook pulls article metadata via Scopus' literature search. It can technically be used to scrape abstracts from anywhere within Scopus' database, but we've specifically limited it to Elsevier journals, as Elsevier is the only publisher whose fulltexts we have access to. Specifically, this part sets up a way to pull PII identification numbers automatically.

To manually test queries, go to https://www.scopus.com/search/form.uri?display=advanced

Elsevier maintains a list of all its active journals in a single Excel spreadsheet: https://www.elsevier.com/__data/promis_misc/sd-content/journals/jnlactivesubject.xls

The second part of the notebook uses the metadata generated from the first part and gets the fulltexts out of that.

In [1]:
import pybliometrics
from pybliometrics.scopus import ScopusSearch
from pybliometrics.scopus.exception import Scopus429Error
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
import os
import multiprocessing
from os import system, name
import json
import time
from IPython.display import clear_output
from pybliometrics.scopus import config
from elsapy.elsclient import ElsClient
from elsapy.elsdoc import FullDoc  # needed by get_doc in the second part of the notebook

To get the articles, you first need an API key from Scopus, which you then add to your local config file. You can get an API key from https://dev.elsevier.com/documentation/SCOPUSSearchAPI.wadl with a quick registration.

Once you have your API key, you need to add it to your computer using the following command:

`import pybliometrics`

`pybliometrics.scopus.utils.create_config()`

This will prompt you to enter an API key which you obtained from the Scopus website. Once you're done with that you are good to download the articles using the following functions.

**Note**: While downloading articles from Scopus, make sure you are connected to the UW VPN (All Internet Traffic) using the BIG-IP Edge Client. Without it, you might get a Scopus authorization error.

The config path for `pybliometrics` is: `/Users/nisarg/.scopus/config.ini` (Would vary as per your local path)
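
Before running any queries, it can help to confirm that the key actually landed in the config file. The sketch below parses config text with the standard-library `configparser`. It assumes the file has an `[Authentication]` section with an `APIKey` entry (the same entry the key-swapping code in `get_piis` writes to); `parse_api_key` is a hypothetical helper, not part of `pybliometrics`, and the key shown is a placeholder.

```python
import configparser


def parse_api_key(config_text):
    """Return the APIKey entry of the [Authentication] section, or None."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section("Authentication"):
        return None
    # Option names are matched case-insensitively by configparser.
    return parser.get("Authentication", "APIKey", fallback=None)


# Example config contents (placeholder key, not a real one).
example = "[Authentication]\nAPIKey = your-key-here\n"
print(parse_api_key(example))
```

To check your own file, you could pass the contents of your config path to the helper, for example `parse_api_key(open(path).read())`.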

### Let's get into it - Time to walk through the algorithm!

List of things by which the algorithm will parse searches:

1. Year
2. Journal
3. Keyword search

So, we'll have to select a set of these parameters to fine-tune our search to get articles that'll be useful to us. 

One of the first quick parameters that will help is to filter down the number of journals that we'll be searching through, and then organize them into a dataframe so we can continue to work through the data in later methods.

We'll first go through all the methods that we have, then we'll show you exactly how to use the methods with some examples.

The following method, `make_jlist`, creates a dataframe that only contains journals mentioning certain keywords in their 'Full_Category' column. Those keywords are passed directly to the method, though some default keywords can be used.

In [2]:
def make_jlist(jlist_url = 'https://www.elsevier.com/__data/promis_misc/sd-content/journals/jnlactivesubject.xls', 
               journal_strings = ['chemistry','energy','molecular','atomic','chemical','biochem'
                                  ,'organic','polymer','chemical engineering','biotech','colloid']):
    """
    This method creates a dataframe of relevant journals to query. The dataframe contains two columns:
    (1) The names of the Journals
    (2) The issns of the Journals
    
    As inputs, the URL for a journal list and a list of keyword strings to subselect the journals by is required.
    These values currently default to Elsevier's journals and some chemical keywords.
    """
    
    # This creates a dataframe of the active journals and their subjects from elsevier
    active_journals = pd.read_excel(jlist_url)
    # This makes the dataframe column names a smidge more intuitive.
    active_journals.rename(columns = {'Display Category Full Name':'Full_Category','Full Title':'Journal_Title'}, inplace = True)
    
    active_journals.Full_Category = active_journals.Full_Category.str.lower() # lowercase topics for searching
    active_journals = active_journals.drop_duplicates(subset = 'Journal_Title') # drop any duplicate journals
    active_journals = shuffle(active_journals, random_state = 42) 

    # new dataframe containing only journals whose topic description contains the desired keywords
    # (na=False treats journals with a missing category as non-matching)
    active_journals = active_journals[active_journals['Full_Category'].str.contains('|'.join(journal_strings), na=False)]

    # Select down to only the title and the individual identification number called ISSN
    journal_frame = active_journals[['Journal_Title','ISSN']]
    # Remove journals that were matched by more than one keyword.
    journal_frame = journal_frame.drop_duplicates(subset = 'Journal_Title')
    
    return journal_frame
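
The filter in `make_jlist` joins the keywords with `|`, so the combined string acts as a regular expression and matches substrings (`'biochem'` also hits "biochemistry"). Below is a minimal pure-Python sketch of the same idea on made-up category strings. The helper `matches_any` and the sample rows are illustrative only, and unlike the notebook's method it escapes the keywords so they are always treated literally.

```python
import re


def matches_any(category, keywords):
    """True if any keyword occurs as a substring of the lowercased category."""
    # re.escape keeps characters such as '+' or '.' from acting as regex syntax.
    pattern = "|".join(re.escape(k) for k in keywords)
    return re.search(pattern, category.lower()) is not None


# Made-up (title, category) rows standing in for the journal spreadsheet.
journals = [
    ("Journal A", "Physical Sciences; Chemistry; Biochemistry"),
    ("Journal B", "Social Sciences; Economics"),
    ("Journal C", "Materials Science; Polymers and Plastics"),
]
keywords = ["chemistry", "polymer"]
selected = [title for title, cat in journals if matches_any(cat, keywords)]
print(selected)  # ['Journal A', 'Journal C']
```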

### The following method builds the keyword search portion of a query. There is an example below that can be copy-pasted into the Scopus advanced Search.
This method is a helper function, and you really shouldn't need to interact with it. It helps to combine several terms in a way that would be unnatural for us to type, but is necessary for online searching. 

In [3]:
def build_search_terms(kwds):
    """
    This builds the keyword search portion of the query string. 
    """
    # Join the keywords with ' OR ' and keep the trailing space that the query builder relies on.
    # An empty keyword list still yields an empty string.
    combined_keywords = (' OR '.join(kwds) + ' ') if kwds else ''
    
    return combined_keywords
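
One thing to watch: as far as we understand Scopus' advanced-search syntax, words separated by a space are combined with AND, so a multi-word term such as `mild steel` is not searched as a phrase unless it is quoted. The sketch below is one possible variant that wraps multi-word terms in double quotes before joining with OR. The helper name `join_terms` is hypothetical and not one of this notebook's functions.

```python
def join_terms(kwds):
    """Join keywords with OR, quoting any keyword that contains a space."""
    quoted = [f'"{k}"' if " " in k else k for k in kwds]
    return " OR ".join(quoted)


print(join_terms(["corrosion", "mild steel", "coating"]))
# corrosion OR "mild steel" OR coating
```

It is worth pasting the resulting string into the Scopus advanced search form to confirm it behaves as expected before launching a long run.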

### The following method builds the entire query to be put into pybliometrics
The query requires a pretty specific format, so we are using a helper function to make it less obnoxious to deal with.

In [4]:
def build_query_dict(term_list, issn_list, year_list):
    """
    This method takes the list of journals and creates a nested dictionary
    containing all accessible queries, in each year, for each journal,
    for a given keyword search on Scopus.
    
    Parameters
    ----------
    term_list(list, required): the list of search terms looked for in papers by the api.
    
    issn_list(list, required): the list of journal issn's to be queried. Can be created by getting the '.values'
    of a 'journal_list' dataframe that has been created from the 'make_jlist' method.
    
    year_list(list, required): the list of years which will be searched through
    
    """
    search_terms = build_search_terms(term_list)
    dict1 = {}
    #This loop goes through and sets up a dictionary key with an ISSN number
    for issn in issn_list:
        
        issn_terms = ' AND ISSN(' + str(issn) + ')'  # str() in case the ISSN was read in as a number
        dict2 = {}
        #This loop goes and attaches all the years to the outer loop's key.
        for year in year_list:
            
            year_terms = "AND PUBYEAR IS " + str(year)
            querystring = search_terms + year_terms + issn_terms

            dict2[year] = querystring

        dict1[issn] = dict2

    return dict1
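
The returned dictionary is keyed first by ISSN and then by year. Here is a small sketch of how it could be flattened into a list of `(issn, year, query)` tuples, for example to hand out work to several processes. The `flatten_queries` helper and the toy dictionary are made up for illustration.

```python
def flatten_queries(query_dict):
    """Turn {issn: {year: query}} into a list of (issn, year, query) tuples."""
    return [
        (issn, year, query)
        for issn, by_year in query_dict.items()
        for year, query in by_year.items()
    ]


# Toy dictionary in the same shape that build_query_dict returns.
toy = {
    "00404020": {
        2014: "acid OR base AND PUBYEAR IS 2014 AND ISSN(00404020)",
        2015: "acid OR base AND PUBYEAR IS 2015 AND ISSN(00404020)",
    },
}
for issn, year, query in flatten_queries(toy):
    print(issn, year, query)
```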

The following method collects the article metadata, including the PII, by looping through every journal and year and querying Scopus with our `term_list`.

In [5]:
def get_piis(term_list, journal_frame, year_list, cache_path, output_path, keymaster=False, fresh_keys=None, config_path='/Users/nisarg/.scopus/config.ini'):
    """
    This should be a standalone method that receives a list of journals (ISSNs), a keyword search,
    an output path and a path to clear the cache. It should be mappable to multiple parallel processes. 
    """
    if output_path[-1] != '/':
        raise Exception('Output file path must end with /')
    
    if '.scopus/scopus_search' not in cache_path:
        raise Exception('Cache path is not a sub-directory of the scopus_search. Make sure cache path is correct.')
    
    # Two lists whose values correspond to each other
    issn_list = journal_frame['ISSN'].values
    journal_list = journal_frame['Journal_Title'].values
    # Remove colons and replace slashes and spaces in names for file storage purposes.
    # The replacements are chained so a title containing several of these characters is fully cleaned.
    for j in range(len(journal_list)):
        journal_list[j] = journal_list[j].replace(':','').replace('/','_').replace(' ','_')
    
            
    
    # Build the dictionary that can be used to sequentially query elsevier for different journals and years
    query_dict = build_query_dict(term_list,issn_list,year_list)
    
    # Must write to memory, clear cache, and clear a dictionary upon starting every new journal
    for i in range(len(issn_list)):
        # At the start of every journal, clear the standard output screen
        os.system('cls' if os.name == 'nt' else 'clear')
        paper_counter = 0

        issn_dict = {}
        for j in range(len(year_list)):
            # for every year in every journal, query the keywords
            print(f'{journal_list[i]} in {year_list[j]}.')
            
            # Want the sole 'keymaster' process to handle 429 responses by swapping the key. 
            if keymaster:
                try:
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]])
                except Scopus429Error:
                    print('entered scopus 429 error loop... replacing key')
                    newkey = fresh_keys.pop(0)
                    config["Authentication"]["APIKey"] = newkey
                    time.sleep(5)
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]])
                    print('key swap worked!!')
            # If this process isn't the keymaster, try a query. 
            # If it excepts, wait a few seconds for keymaster to replace key and try again.
            else:
                try:
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]])
                except Scopus429Error:
                    print('Non key master is sleeping for 15... ')
                    time.sleep(15)
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]]) # at this point, the scopus 429 error should be fixed... 
                    print('Non key master slept, query has now worked.')
            
            # store relevant information from the results into a dictionary pertaining to that query
            year_dict = {}
            if query_results.results is not None:
                # some of the query results might be of type None 
                
                
                for k in range(len(query_results.results)):
                    paper_counter += 1
                    
                    result_dict = {}
                    result = query_results.results[k]

                    result_dict['pii'] = result.pii
                    result_dict['doi'] = result.doi
                    result_dict['title'] = result.title
                    result_dict['num_authors'] = result.author_count
                    result_dict['authors'] = result.author_names
                    result_dict['description'] = result.description
                    result_dict['citation_count'] = result.citedby_count
                    result_dict['keywords'] = result.authkeywords
                    
                    year_dict[k] = result_dict

                # Store all of the results for this year in the dictionary containing to a certain journal
                issn_dict[year_list[j]] = year_dict
            else:
                # if it was a None type, we will just store the empty dictionary as json
                issn_dict[year_list[j]] = year_dict
        
        
        # Store all of the results for this journal in a folder as json file
        os.mkdir(f'{output_path}{journal_list[i]}')
        with open(f'{output_path}{journal_list[i]}/{journal_list[i]}.json','w') as file:
            json.dump(issn_dict, file)
        
        with open(f'{output_path}{journal_list[i]}/{journal_list[i]}.txt','w') as file2:
            file2.write(f'This file contains {paper_counter} publications.')
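
The key-swap and sleep-then-retry branches in `get_piis` each retry exactly once. A more general pattern is to retry a few times with a pause in between. The sketch below is generic: `run_query` stands for any zero-argument callable (for example a `lambda` wrapping `ScopusSearch`), and `RateLimited` is a stand-in exception defined here only so the example runs without network access. In the notebook you would pass `Scopus429Error` as `error_type` instead.

```python
import time


class RateLimited(Exception):
    """Stand-in for a rate-limit error such as Scopus429Error."""


def retry(run_query, error_type, retries=3, wait_seconds=0.0):
    """Call run_query, retrying up to `retries` extra times on error_type."""
    for attempt in range(retries + 1):
        try:
            return run_query()
        except error_type:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            time.sleep(wait_seconds)


# Fake query that fails twice and then succeeds.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "results"


print(retry(flaky_query, RateLimited))  # results
```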

### Example for getting the PII and the metadata from the journals


First things first: we need to call the `make_jlist` method, pass it the keywords we want to select journals by, and receive a dataframe of our downselected set of journals. You will get a warning from this method call, but it's not a big deal; it comes from the underlying `pandas.read_excel` function.

In [6]:
journal_frame = make_jlist(jlist_url = 'https://www.elsevier.com/__data/promis_misc/sd-content/journals/jnlactivesubject.xls', 
               journal_strings = ['chemistry','synthesis','molecular','chemical','organic','polymer','materials'])



In [7]:
journal_frame.head() #This shows the journal titles and their ISSN from where we will get the metadata for the articles

| | Journal_Title | ISSN |
|---|---|---|
| 6506 | Solid State Communications | 381098 |
| 2451 | Forensic Chemistry | 24681709 |
| 4086 | Journal of Molecular and Cellular Cardiology | 222828 |
| 4098 | Journal of Molecular Spectroscopy | 222852 |
| 707 | Biophysical Chemistry | 3014622 |


Now we will define `cache_path`, the folder where `pybliometrics` caches search results, and `term_list`, the list of terms used to search for articles.

To clear the cache, you can find the `clear_cache` function in the `pybliometrics` notebook.

In [8]:
cache_path = '/Users/nisarg/.scopus/scopus_search/COMPLETE/'
term_list = ['deposition', 'corrosion', 'inhibit', 'corrosive', 'resistance', 'protect', 'acid', 'base', 'coke', 'coking', 'anti', \
             'layer', 'steel', 'mild steel', 'coating', 'degradation', 'oxidation', \
             'film', 'photo-corrosion', 'hydrolysis', 'Schiff']

In [9]:
# example of how to use the dictionary builder
issn_list = journal_frame['ISSN'].values
dictionary = build_query_dict(term_list, issn_list, range(1995,2021))
#This shows a specific journal ISSN, and specific year selected. 
dictionary['00404020'][2015]

'deposition OR corrosion OR inhibit OR corrosive OR resistance OR protect OR acid OR base OR coke OR coking OR anti OR layer OR steel OR mild steel OR coating OR degradation OR oxidation OR film OR photo-corrosion OR hydrolysis OR Schiff AND PUBYEAR IS 2015 AND ISSN(00404020)'

In [10]:
#Here you'll need to add the API keys which you generated following the steps mentioned at the beginning of the notebook
apikeylist = ['6bcdddd0c63296684f85245fe26ef03d','060a5b0160e1ecc6b361060633700981','28fff643126ba570a7a6315537bb9dde', 
              '095d720842e4a6103e699e2913da406f']
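
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. One option, sketched below, is to read them from an environment variable instead. The variable name `SCOPUS_API_KEYS` and the helper `keys_from_env` are assumptions made for this example, not something `pybliometrics` defines, and the values shown are placeholders.

```python
import os


def keys_from_env(var_name="SCOPUS_API_KEYS", environ=None):
    """Split a comma-separated environment variable into a list of keys."""
    environ = os.environ if environ is None else environ
    raw = environ.get(var_name, "")
    return [key.strip() for key in raw.split(",") if key.strip()]


# Demonstration with a fake environment (placeholder values, not real keys).
fake_env = {"SCOPUS_API_KEYS": "key-one, key-two,key-three"}
print(keys_from_env(environ=fake_env))  # ['key-one', 'key-two', 'key-three']
```

The resulting list could then be passed as `fresh_keys` in place of a hard-coded list.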

In [None]:
#Now we'll run the function and get the piis.
get_piis(term_list,journal_frame,range(1995,2021),cache_path=cache_path, \
         output_path = '/Users/nisarg/Desktop/summer research/test/', keymaster = True, fresh_keys = apikeylist)

Now we have freshly collected PIIs from our function. The next part of the notebook shows how to obtain Fulltexts using those PIIs.

The functions below are taken from the `Elsevier_fulltext_api` notebook. They show how to load the data from the metacorpus path and obtain the Fulltexts for the articles.
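
Each journal's JSON file is nested by year and then by result index, with one dictionary per paper (after a round trip through JSON, both the years and the indices are strings). Below is a minimal sketch of pulling the PIIs back out of that structure. The `collect_piis` helper and the sample record are hypothetical; the record only mimics some of the fields written by `get_piis`.

```python
import json


def collect_piis(journal_dict):
    """Return all non-empty PIIs from a {year: {index: record}} dictionary."""
    piis = []
    for year, papers in journal_dict.items():
        for record in papers.values():
            if record.get("pii"):
                piis.append(record["pii"])
    return piis


# A made-up file body in the same shape as the JSON written per journal.
sample = json.loads("""
{
  "2015": {
    "0": {"pii": "S0000000000000001", "doi": null, "title": "Example A"},
    "1": {"pii": null, "doi": null, "title": "Example B"}
  },
  "2016": {}
}
""")
print(collect_piis(sample))  # ['S0000000000000001']
```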

In [13]:
def load_journal_json(absolute_path):
    """
    This method loads data collected on a single journal by the pybliometrics metadata collection module into a dictionary.
    
    Parameters
    ----------
    absolute_path(str, required) - The path to the .json file containing metadata procured by the pybliometrics module. 
    """
    with open(absolute_path) as json_file:
        data = json.load(json_file)
    
    return data

In [11]:
def get_doc(dtype,identity):
    """
    This method retrieves a 'Doc' object from the Elsevier API. The doc object contains metadata and full-text information
    about a publication associated with a given PII. 
    
    Parameters:
    -----------
    dtype(str,required): The type of identification string being used to access the document. (Almost always PII in our case.)
    
    identity: The actual identification string/ PII that will be used to query. 
    """
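    if dtype == 'pii':
        doc = FullDoc(sd_pii = identity)
    elif dtype == 'doi':
        doc = FullDoc(doi = identity)
    else:
        # without this branch, an unknown dtype would fail later with a NameError on `doc`
        raise ValueError(f"dtype must be 'pii' or 'doi', got {dtype!r}")

    # `client` is the module-level ElsClient created in the example cells of this notebook
    if doc.read(client):
        #print ("doc.title: ", doc.title)
        doc.write()
    else:
        print("Read document failed.")

    return doc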

In [12]:
def get_docdata(doc):
    """
    This method attempts to get certain pieces of metadata from an elsapy doc object. 
    
    Parameters:
    -----------
    
    doc(elsapy object, required): elsapy doc object being searched for
    
    Returns:
    --------
    text(str): The full text from the original publication.
    
    auths(list): The list of authors from the publication.
    """
    try:
        text = doc.data['originalText']  # grab original full text
    except Exception:
        # covers doc being None as well as a missing 'originalText' entry
        text = 'no text in doc'

    try:
        auths = authorize(doc) # a list of authors
    except Exception:
        auths = []
    
    return text, auths
    

In [14]:
def authorize(doc):
    #this method takes a doc object and returns a list of authors for the doc
    auths = []
    for auth in doc.data['coredata']['dc:creator']:
        auths.append(auth['$'])
    
    return auths
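
`get_docdata` and `authorize` rely on two entries of `doc.data`: `'originalText'` and `['coredata']['dc:creator']`. The sketch below runs the same lookups against a plain dictionary, so the logic can be tried without an API call. The sample dictionary is invented and only contains the keys used in this notebook; a real response has many more fields, and `extract_text_and_authors` is a hypothetical helper name.

```python
def extract_text_and_authors(data):
    """Mirror get_docdata on a plain dict: return (text, authors)."""
    text = data.get("originalText", "no text in doc")
    creators = data.get("coredata", {}).get("dc:creator", [])
    authors = [creator["$"] for creator in creators]
    return text, authors


# Invented stand-in for doc.data.
fake_data = {
    "originalText": "Full text goes here.",
    "coredata": {"dc:creator": [{"$": "Doe, J."}, {"$": "Roe, R."}]},
}
print(extract_text_and_authors(fake_data))
print(extract_text_and_authors({}))
```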

In [16]:
def get_fulltexts(directory_list, directory_path, output_path, pnum):
    """
    This method takes a list of directories containing 'meta' corpus information from the pybliometrics module and adds full-text information to those files. 
    
    Parameters:
    -----------
    directory_list(list, required): A list of directories which this method will enter and add full-text information to.
    
    directory_path(str, required): The folder that contains those directories, i.e. the output folder of `get_piis`.
    
    output_path(str, required): The folder in which the new full text corpus will be placed. 
    
    pnum(int, required): The number of the process running this method. It is only used to label printed messages.
    
    Note: `get_doc` uses a module-level `client` (an `ElsClient`), which must be defined before this method is called.
    """
    #client = client
    
    for directory in directory_list:
        os.mkdir(f'{output_path}/{directory}')
        # put a file in the directory that lets us know we've been in that directory
        with open(f'{output_path}/{directory}/marker.txt', 'w'):
            pass

        json_file = f'{directory_path}/{directory}/{directory}.json'
        # a dictionary of metadata, accessed via j_dict[year][pub_number]
        j_dict = load_journal_json(json_file)
        rem_list = ['num_authors', 'description', 'citation_count', 'keywords']

        # info.csv keeps track of errors, one record per line
        with open(f'{output_path}/{directory}/info.csv', 'w') as info:
            info.write('type,file,year,pub\n') # header

            for year in j_dict:
                if not j_dict[year]:
                    # the year was empty
                    info.write(f'year empty,{json_file},{year},{np.nan}\n')
                    continue

                for pub in j_dict[year]:
                    pii = j_dict[year][pub]['pii'] # the pii identification number used to get the full text

                    try:
                        doc = get_doc('pii', pii) # don't know if doc retrieval will fail
                        print(f'Process {pnum} got doc for {directory}, {year}')
                    except Exception as e:
                        print(f'EXCEPTION: DOC RETRIEVAL. Process {pnum}')
                        print(f'Exception was {e}')
                        doc = None
                        info.write(f'doc retrieval,{json_file},{year},{pub}\n')

                    text, auths = get_docdata(doc) # doesn't crash even if doc = None

                    # compare by value: `is` tests identity and is unreliable for strings and lists
                    if text == 'no text in doc':
                        info.write(f'no text in doc,{json_file},{year},{pub}\n')
                    elif not auths:
                        info.write(f'no auths in doc,{json_file},{year},{pub}\n')

                    j_dict[year][pub]['authors'] = auths
                    j_dict[year][pub]['fulltext'] = text # the real magic

                    for key in rem_list:
                        j_dict[year][pub].pop(key, None) # tolerate records that lack a key

        j_file = f'{output_path}/{directory}/{directory}.json'
        
        with open(j_file,'w') as file:
            json.dump(j_dict,file)
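
Because `get_fulltexts` takes a list of directories and a process number, the work can be divided by giving each process its own slice of the list. Here is a minimal sketch of that split. `split_evenly` is a hypothetical helper, and the directory names are placeholders.

```python
def split_evenly(items, n_parts):
    """Split items into n_parts lists whose lengths differ by at most one."""
    return [items[i::n_parts] for i in range(n_parts)]


directories = ["journal_a", "journal_b", "journal_c", "journal_d", "journal_e"]
for pnum, chunk in enumerate(split_evenly(directories, 2)):
    print(pnum, chunk)
# 0 ['journal_a', 'journal_c', 'journal_e']
# 1 ['journal_b', 'journal_d']
```

Each chunk, together with its `pnum`, could then be handed to `get_fulltexts` in its own `multiprocessing.Process`; whether that is worthwhile depends on how many API keys and how much quota you have.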

### Example for obtaining Fulltexts using the PIIs of the articles

In [17]:
#directory_path is the output path you gave to the get_piis function. 
#It stores the metadata files for the journals. output_path is where you want your Fulltexts to be downloaded
directory_path = '/Users/nisarg/Desktop/summer research/CI_pii'
output_path = '/Users/nisarg/Desktop/summer research/CI_fulltexts'

In [18]:
client = ElsClient('6bcdddd0c63296684f85245fe26ef03d')

In [19]:
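#files is the list of all the journal folders which exist in your directory_path
#(only directories are kept, so stray files such as .DS_Store are skipped)
files = [f for f in os.listdir(directory_path) if os.path.isdir(os.path.join(directory_path, f))]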

Now you have everything you need to obtain the Fulltexts, so we can just use the function and get the Fulltexts we need.

In [None]:
# The last argument is the process number, which is only used to label printed messages.
get_fulltexts(files, directory_path, output_path, pnum=0)
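
Once the run finishes, it is worth checking how many records actually received a full text, since failed retrievals are stored with the placeholder string `'no text in doc'`. Below is a sketch of such a check on an already-loaded journal dictionary. `count_fulltexts` and the sample data are illustrative only.

```python
def count_fulltexts(journal_dict, placeholder="no text in doc"):
    """Return (with_text, without_text) counts for a full-text journal dict."""
    with_text = without_text = 0
    for papers in journal_dict.values():
        for record in papers.values():
            if record.get("fulltext", placeholder) == placeholder:
                without_text += 1
            else:
                with_text += 1
    return with_text, without_text


# Made-up records in the shape written by get_fulltexts.
sample = {
    "2015": {
        "0": {"pii": "placeholder-1", "fulltext": "Some article text."},
        "1": {"pii": "placeholder-2", "fulltext": "no text in doc"},
    },
    "2016": {},
}
print(count_fulltexts(sample))  # (1, 1)
```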