# Metadata Scraping for Springer Nature Journals

URL that points to an excel sheet with all of the Springer Nature journals: https://media.springernature.com/full/springer-cms/rest/v1/content/17737828/data/v2

In [1]:
import pybliometrics
from pybliometrics.scopus import ScopusSearch
from pybliometrics.scopus.exception import Scopus429Error
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
import os
import multiprocessing
from os import system, name
import json
import time
from IPython.display import clear_output
from pybliometrics.scopus import config

In [5]:
#This is the full list of journals from Springer, but it doesnt have any topic contents.
journal_frame = pd.read_excel("https://media.springernature.com/full/springer-cms/rest/v1/content/17737828/data/v2", header = 5)
#Remove excess columns
journal_frame.drop(columns = ['product_title_sort', 'Format', 'product_id', 'Primary Language', 'Vols Qty', 'Scheduled Vol Nos', 'Single Issues per volume', 'Comments'], inplace = True)
#There doesn't seem to be any journal duplicates, so we don't have to drop any duplicates.
#Display the current dataframe
journal_frame.head(40)

HTTPError: HTTP Error 308: Permanent Redirect

OK, so some small problems. If the Imprint isn't Springer, then the format of the URLs isn't right for me to hard-code requests from a Selenium extracter, though if I get access to a fulltext API scraper, this won't matter anymore. Also, if it's a Nature imprint, the format of the DOIs change. We'll have to filter out things that don't have either Nature or Springer in them I think. It's notable that anything that is co-published with Springer also has the same format with any regular Springer journals. 

---This is actually kinda funny. The only category this doesn't cover is biomedcentral, so we have covered at least 2/3 of the journals this way. I think we could hardcode format all three of them. 

In [15]:
#Ok, figuring out the new excel spreadsheet format:
journal_dataframe = pd.read_excel('Springer_Nature_Titles.xlsx')
journal_dataframe.drop(columns = ["Publication Type", "Editor", "Title ID", "Language", "MARC21 Record Date", "MARC21 Change Date", "Title URL", "Coverage Notes", "Publisher"], inplace = True)
journal_dataframe.head()

Unnamed: 0,Journal Title,Electronic ISSN,Print ISSN,DOI,Imprint,Subject Classification,Subject Collections
0,International Journal on Software Tools for Te...,1433-2787,1433-2779,http://doi.org/10.1007/10009.1433-2787,Springer,Computer Science; Software Engineering; Softwa...,Computer Science
1,Artificial Life and Robotics,1614-7456,1433-5298,http://doi.org/10.1007/10015.1614-7456,Springer,Computer Science; Computation by Abstract Devi...,Computer Science
2,Ecosystems,1435-0629,1432-9840,http://doi.org/10.1007/10021.1435-0629,Springer,Life Sciences; Ecology; Plant Sciences; Zoolog...,Biomedical and Life Sciences
3,Hernia,1248-9204,1265-4906,http://doi.org/10.1007/10029.1248-9204,Springer,Medicine & Public Health; Abdominal Surgery,Medicine
4,International Journal on Document Analysis and...,1433-2825,1433-2833,http://doi.org/10.1007/10032.1433-2825,Springer,Computer Science; Image Processing and Compute...,Computer Science


In [3]:
#We'll build a method to drop everything in the frames that isn't a Springer of Nature imprint


Starting list of filtered SpringerNature Journals: https://adminportal.springernature.com/metadata/journals?hash=1e6e01cb8630a46fe4ebea343207187f0c462d56
This isn't easily directly downloaded, but if you download it and then put it in your directory you should be in a good place. 

Alright, once we've removed unacceptable journal entries, we'll need to start building functions to make calls to Scopus via Pybliometrics, and save the results of our search.

In [32]:
#First method is to build a smart journal list according to what's available and our search terms. 
def build_list(search_terms = ['molecule','chemistry','materials','synth']):
    """This method puts together a formatted dataframe from a specific list of search terms and the excel spreadsheet
    provided by Springer-Nature. It's really only designed for Springer-Nature published articles."""
    #Read in the excel spreadsheet directly from this directory
    journal_dataframe = pd.read_excel('Springer_Nature_Titles.xlsx')
    #delete any extraneous columns from our dataframe
    journal_dataframe.drop(columns = ["Publication Type", "Editor", "Title ID", "Language", "MARC21 Record Date", "MARC21 Change Date", "Title URL", "Coverage Notes", "Publisher"], inplace = True)
    #Reorder our journal list so we don't introduce bias later
    journal_dataframe = shuffle(journal_dataframe, random_state = 12)
    #Check to see if the journals in some way contain the intended search terms
    journal_dataframe = journal_dataframe[journal_dataframe['Subject Classification'].str.contains('|'.join(search_terms))]
    #After searching, drop all columns that aren't the title and ISSN
    journal_dataframe = journal_dataframe[['Journal Title', 'Electronic ISSN']]
    #Update the name of that one column dataframe from Electronic ISSN to just ISSN
    journal_dataframe.rename(columns = {"Electronic ISSN": "ISSN"}, inplace = True)
    
    return journal_dataframe
    

In [33]:
df = build_list()
df.head(20)

Unnamed: 0,Journal Title,ISSN
1476,Cancer Nanotechnology,1868-6966
3060,"Journal of Ocular Biology, Diseases, and Infor...",1936-8445
2600,Nature Geoscience,1752-0908
1168,Journal of Structural and Functional Genomics,1570-0267
2052,3D Printing in Medicine,2365-6271
2485,Moscow University Biological Sciences Bulletin,1934-791X
1722,European Biophysics Journal,1432-1017
1698,Archives of Microbiology,1432-072X
2945,Somatic Cell and Molecular Genetics,1572-9931
356,Molecular Biotechnology,1559-0305


Alright, we have a journal list, but now we need to actually start searching for stuff. We'll need a couple helper functions to get through that though, so those will be what we work through next.

In [34]:
def build_search_terms(kwds):
    """
    This builds the keyword search portion of the query string. 
    """
    combined_keywords = ""
    for i in range(len(kwds)):
        if i != len(kwds)-1:
            combined_keywords += kwds[i] + ' OR '
        else:
            combined_keywords += kwds[i] + ' '
    
    return combined_keywords

In [35]:
def build_query_dict(term_list, issn_list, year_list):
    """
    This method takes the list of journals and creates a nested dictionary
    containing all accessible queries, in each year, for each journal,
    for a given keyword search on sciencedirect.
    
    Parameters
    ----------
    term_list(list, required): the list of search terms looked for in papers by the api.
    
    issn_list(list, required): the list of journal issn's to be queried. Can be created by getting the '.values'
    of a 'journal_list' dataframe that has been created from the 'make_jlist' method.
    
    year_list(list, required): the list of years which will be searched through
    
    """
    search_terms = build_search_terms(term_list)
    dict1 = {}
    #This loop goes through and sets up a dictionary key with an ISSN number
    for issn in issn_list:
        
        issn_terms = ' AND ISSN(' + issn + ')'
        dict2 = {}
        #This loop goes and attaches all the years to the outer loop's key.
        for year in year_list:
            
            year_terms = "AND PUBYEAR IS " + str(year)
            querystring = search_terms + year_terms + issn_terms

            dict2[year] = querystring

        dict1[issn] = dict2

    return dict1

In [36]:
def clear_cache(cache_path):
    """
    Be very careful with this method. It can delete your entire computer if you let it. 
    """
    
    # if the cache path contains the proper substring, and if the files we are deleting are of the propper length, delete the files
    
    if '.scopus/scopus_search/' in cache_path:
        for file in os.listdir(cache_path):
            
            # Making sure the deleted files match the standard length of pybliometrics cache output
            if len(file) == len('8805245317ccb15059e3cfa219be2dd4'):
                os.remove(cache_path + file)

Ok, with those helper functions in place, we're ready to actually build a wrapper and calling class. 


# One thing that this still needs is a way to handle duplicate entries without crashing
Ideally, it would check to see if something is present in the database before querying the Scopus dataframe, so we don't accidentally waste API Key queries. 

In [72]:
def get_metadata(term_list, journal_search, year_list, cache_path, output_path, keymaster=False, fresh_keys=None, config_path='/Users/DavidCJ/.scopus/config.ini'):
    """
    This should be a standalone method that recieves a list of journals (issns), a keyword search,
    an output path and a path to clear the cache. It should be mappable to multiple parallel processes. 
    """
    if output_path[-1] is not '/':
        raise Exception('Output file path must end with /')
    
    if '.scopus/scopus_search' not in cache_path:
        raise Exception('Cache path is not a sub-directory of the scopus_search. Make sure cache path is correct.')
    
    #Generate a dataframe from the excel spreadsheet
    journal_frame = build_list(journal_search)
    
    # Two lists who's values correspond to each other    
    issn_list = journal_frame['ISSN'].values
    journal_list = journal_frame['Journal Title'].values
    # Find and replaces slashes and spaces in names for file storage purposes
    for j in range(len(journal_list)):
        if ':' in journal_list[j]:
            journal_list[j] = journal_list[j].replace(':','')
        elif '/' in journal_list[j]:
            journal_list[j] = journal_list[j].replace('/','_') 
        elif ' ' in journal_list[j]:
            journal_list[j] = journal_list[j].replace(' ','_') 
    
    # Build the dictionary that can be used to sequentially query elsevier for different journals and years
    query_dict = build_query_dict(term_list,issn_list,year_list)
    
    # Must write to memory, clear cache, and clear a dictionary upon starting every new journal
    for i in range(len(issn_list)):
        # At the start of every year, clear the standard output screen
        os.system('cls' if os.name == 'nt' else 'clear')
        paper_counter = 0

        issn_dict = {}
        for j in range(len(year_list)):
            # for every year in every journal, query the keywords
            print(f'{journal_list[i]} in {year_list[j]}.')
            
            # Want the sole 'keymaster' process to handle 429 responses by swapping the key. 
            #If you have more keys, then you want to be able to switch keys if Scopus starts
            #passing back 429 errors, which are over-quota errors.
            if keymaster:
                try:
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]])
                except Scopus429Error:
                    print('entered scopus 429 error loop... replacing key')
                    newkey = fresh_keys.pop(0)
                    config["Authentication"]["APIKey"] = newkey
                    time.sleep(5)
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]])
                    print('key swap worked!!')
            # If this process isn't the keymaster, try a query. 
            # If it excepts, wait a few seconds for keymaster to replace key and try again.
            #If we don't have several API keys, we don't have the option to switch keys
            #Because of that, we will use the below code block. 
            else:
                try:
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]])
                except Scopus429Error:
                    print('Non key master is sleeping for 15... ')
                    time.sleep(15)
                    query_results = ScopusSearch(verbose = True,query = query_dict[issn_list[i]][year_list[j]]) # at this point, the scopus 429 error should be fixed... 
                    print('Non key master slept, query has now worked.')
            
            # store relevant information from the results into a dictionary pertaining to that query
            year_dict = {}
            if query_results.results is not None:
                # some of the query results might be of type None 
                
                
                for k in range(len(query_results.results)):
                    paper_counter += 1
                    
                    result_dict = {}
                    result = query_results.results[k]

                    result_dict['pii'] = result.pii
                    result_dict['doi'] = result.doi
                    result_dict['title'] = result.title
                    result_dict['num_authors'] = result.author_count
                    result_dict['authors'] = result.author_names
                    result_dict['description'] = result.description
                    result_dict['citation_count'] = result.citedby_count
                    result_dict['keywords'] = result.authkeywords
                    
                    year_dict[k] = result_dict

                # Store all of the results for this year in the dictionary containing to a certain journal
                issn_dict[year_list[j]] = year_dict
            else:
                # if it was a None type, we will just store the empty dictionary as json
                issn_dict[year_list[j]] = year_dict        
        
        # Store all of the results for this journal in a folder as json file
        os.mkdir(f'{output_path}{journal_list[i]}')
        with open(f'{output_path}{journal_list[i]}/{journal_list[i]}.json','w') as file:
            json.dump(issn_dict, file)
        
        with open(f'{output_path}{journal_list[i]}/{journal_list[i]}.txt','w') as file2:
            file2.write(f'This file contains {paper_counter} publications.')

In [73]:
apikeylist = ['646199a6755da12c28f3fdfe59bbfe55','f23e69765c41a3a6e042eb9baf73bd77','f6dafc105b5adfe25105eb658aa80b7c', 
              '	e9f7c3a33c7bf1b790372d25a8fbb5a1', '2e57cbb3c25fa9e446a8fd0e58be91e9', '1bed2480701164024b1a644843c76099']

In [74]:
get_metadata(['Chemistry', 'Synthesis'], journal_search = ['Chemistry', 'Synthesis'], year_list = [2018,2019,2020], cache_path = '/Users/Jonathan/.scopus/scopus_search/COMPLETE/',
             output_path = '/Users/Jonathan/Desktop/pyblio_test2/', keymaster = True, fresh_keys = apikeylist, config_path = "Users\Jonathan/.scopus/config.ini")

Journal_of_Nanobiotechnology in 2018.
Journal_of_Nanobiotechnology in 2019.
Journal_of_Nanobiotechnology in 2020.


FileExistsError: [WinError 183] Cannot create a file when that file already exists: '/Users/Jonathan/Desktop/pyblio_test2/Journal_of_Nanobiotechnology'