# 01 - Scrape Scopus Pages

This notebook contains the code to scrape the Scopus database and generate a corpus of academic journal abstracts. It starts with a tab-delimited file listing the ~35,000 journals (!) included in Scopus, and uses this list to guide an extensive series of queries for abstracts.

NOTE: Scopus is a subscription service. As of 12/2016 I have Scopus access through my graduate program at UPenn. This code assumes that you have access and have recently logged in using your credentials.

I ran this code locally on my MacBook Pro.

This notebook performs the following steps:

### Parse the Scopus-provided journal list.
* Limits the list to English-language journals.
* Removes medical / dental journals.
* Removes multi-topic / interdisciplinary journals.
* Sorts by impact factor, so only abstracts from relatively high-impact journals will be scraped.
    
### Use the cleaned-up list of Scopus journals to automate web queries. 

* Loops over a given range of years, and over the top N journals by impact in each year.
* Generates a Scopus URL for each journal / year.
* Uses Selenium to click the control that shows 200 articles per page (the maximum) and to display their abstracts.
* Saves the webpage as an HTML file, and updates a log file used later to index the saved HTML.
* If there are more than 200 articles, uses Selenium to click a button to advance to the next page. Continues as needed until all abstracts for the journal have been captured (up to 10 pages, i.e. 2000 abstracts).
    
    
NOTE: There appears to be a memory leak in Selenium. After running queries for several hundred journals, performance decreases notably, and by that point all 8 GB on my MacBook is consumed. However, killing and re-instantiating the webdriver object appears to address the problem by freeing the leaked memory. This procedure is used when looping over the journals / years.
    


### Imports

In [1]:
import pandas as pd
import string
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import os
import datetime

### Helper functions for filtering the journal list

In [2]:

# Function to clean up column names in the Scopus journal file, 
# as well as names of topics.
def clean_name (s):
    s = s.upper()  #Make upper case
    s = s.replace (':',' ')
    s = s.replace ('-',' ')
    s = s.replace ("'", '')
    s = s.replace ('(', '_')
    s = s.replace (')', '')
    s = s.replace (',','')
    s = '_'.join (s.split())  # Collapse runs of whitespace into single underscores
    return s

######################################################

# Function to return whether a journal is likely published in English, judged by
# the publisher's country.  German and Dutch journals are generally in English.  Journals from France
# are also typically published in English, but there seem to be enough exceptions
# to justify excluding them.
def is_english_journal (s):
    try: return s.upper() in ['UNITED STATES','UNITED KINGDOM','AUSTRALIA','CANADA','NETHERLANDS','GERMANY']
    except: return False
        
######################################################

# Converts the numeric topic codes to readable text topics (separated by semicolons)
def build_text_classification_codes (s, topic_dict):
    code_list = [ i for i in s.strip().split(';')]  # List of string numerical values
    topic_list = [topic_dict[i] for i in code_list]
#    print topic_list    
    return ';'.join (topic_list)

######################################################

# Exclude medical, dental, etc. journals.  These have structured abstracts and are challenging to parse.  
# Their 'section header' words ('Results', 'Methods', 'Conclusions', etc.) seem to bias results, but 
# these headers vary enough between medical journals that it is not trivial to simply remove them.
# Future work may include medical journals, but for now they are excluded from the project.
def remove_medicine (s):
    code_list = [ i.replace (';',' ').strip() for i in s.split() ]  # List of string numerical values
    code_list = filter (lambda x: int (x) < 2700 or int (x) > 2799, code_list) # 2700 --> 2799 are medical topics
    code_list = filter (lambda x: int (x) < 2900 or int (x) > 2999, code_list) # 2900 -> 2999 are nursing
    code_list = filter (lambda x: int (x) < 3000 or int (x) > 3099, code_list) # 3000 -> 3099 are pharmacology
    code_list = filter (lambda x: int (x) < 3400 or int (x) > 3699, code_list) # 3400 -> 3699 are veterinary / dental / health services
    if len (code_list) < 1: return '_NONE_'  # Will be removed from the dataframe
    else: return ';'.join (code_list)

######################################################

# Returns True if 1000 is one of the topic codes, otherwise returns False.  Examples include
# Nature, Science, PNAS.  
# (1000 --> an 'interdisciplinary' journal)
def is_interdisciplinary (s):
    code_list = [ i for i in s.strip().split(';')]  # List of string numerical values
    return '1000' in code_list

######################################################

# A multitopic journal is defined as having more than one topic code (this is distinct from 
# interdisciplinary, which has its own dedicated topic code.)
def is_multitopic (s):
    code_list = [ i for i in s.strip().split(';')]  # List of string numerical values
    return len (code_list) > 1

######################################################

# Given list of topic codes, this function rounds them all down to next-lowest 100-value.
# This makes the topics more general.
def def_build_simple_topics (s):
    code_list = [ int(i) for i in s.strip().split(';')]  # List of integer topic codes
    code_list = [i - i%100 for i in code_list]
    code_list = list(set(code_list)) # Uniquify
    code_list = [str(i) for i in code_list]
    return ';'.join (code_list)

######################################################

# Function to generate 'electronic ISSN or E_ISSN'
# The scopus query works most reliably using an electronic ISSN rather
# than a text journal name.  Some journals have an E_ISSN given in the 
# scopus journal list.  For those that do not, we use this function to 
# generate the E_ISSN from a given print ISSN.
def generate_e_issn (print_issn, e_issn):
    if e_issn == 'NONE':
        try:
            e_issn = '0' * (8-len(print_issn)) + print_issn
            e_issn = e_issn[0:4] + '-' + e_issn[4:]
        except: e_issn = 'NONE'
    return e_issn

#print generate_e_issn ('7434618','14773848')
#print generate_e_issn ('255858','NONE')
#print generate_e_issn (np.nan, 'NONE')
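
The zero-padding step in `generate_e_issn` can be checked in isolation. The sketch below is a standalone illustration written for Python 3, not the notebook's own code: `format_issn` is a hypothetical helper name, and it uses `str.zfill` in place of the manual `'0' * (8-len(...))` padding. It covers only the branch that builds an ISSN from a print ISSN. The two inputs are the print ISSNs from the commented-out test calls above.

```python
def format_issn(raw):
    """Left-pad a digit string to 8 characters and hyphenate it as NNNN-NNNN.

    Hypothetical helper for illustration; mirrors the padding branch only.
    """
    digits = str(raw).strip()
    padded = digits.zfill(8)          # e.g. '255858' -> '00255858'
    return padded[:4] + '-' + padded[4:]

print(format_issn('255858'))   # 0025-5858
print(format_issn('7434618'))  # 0743-4618
```

Note that `generate_e_issn` returns an existing E_ISSN unchanged, so values taken directly from the Scopus file are not re-hyphenated by this step.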



### Function to generate Scopus query URL

In [3]:
def build_url (e_issn, yr = 2015, language = 'english'):
    # The sample URL was generated with the Scopus advanced search feature.  ISSN is the electronic
    # journal code, in this case for JACS.  Language is English, the year is 2015, and the
    # document type is articles (i.e. not conference proceedings, etc.)
    
    r=    'https://www.scopus.com/results/results.uri?sort=plf-f&src=s'
    r=r + '&sid=DD71260EDC8DA0916BED214D5039CFAC.wsnAw8kcdt7IPYLO0V48gA%3a130&sot=a&sdt=a&sl=71'
    r=r + '&s=ISSN%28'
    r=r +  str (e_issn)
    r=r + '%29+AND+PUBYEAR+%3d+'
    r=r +  str (yr)
    r=r +  '+AND+DOCTYPE%28ar%29+AND+LANGUAGE%28'
    r=r +  language
    r=r +  '%29&origin=searchadvanced&editSaveSearch='
    r=r +  '&txGid=DD71260EDC8DA0916BED214D5039CFAC.wsnAw8kcdt7IPYLO0V48gA%3a13'
    
    return r

#build_url (15205126)

'https://www.scopus.com/results/results.uri?sort=plf-f&src=s&sid=DD71260EDC8DA0916BED214D5039CFAC.wsnAw8kcdt7IPYLO0V48gA%3a130&sot=a&sdt=a&sl=71&s=ISSN%2815205126%29+AND+PUBYEAR+%3d+2015+AND+DOCTYPE%28ar%29+AND+LANGUAGE%28english%29&origin=searchadvanced&editSaveSearch=&txGid=DD71260EDC8DA0916BED214D5039CFAC.wsnAw8kcdt7IPYLO0V48gA%3a13'
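
For reference, the percent-encoded `s=` parameter in the URL above can be decoded and rebuilt with the standard library. This is a minimal sketch for Python 3 (`urllib.parse`), not part of the original notebook. `build_query` is a hypothetical helper, and the session-specific parameters (`sid`, `txGid`) are deliberately left out, so it only illustrates how the search expression is encoded and is not a replacement for `build_url`.

```python
from urllib.parse import quote_plus, unquote_plus

# The s= parameter exactly as it appears in the sample URL.
encoded_sample = ('ISSN%2815205126%29+AND+PUBYEAR+%3d+2015'
                  '+AND+DOCTYPE%28ar%29+AND+LANGUAGE%28english%29')

def build_query(e_issn, yr=2015, language='english'):
    """Return the plain-text search expression (hypothetical helper)."""
    return 'ISSN({}) AND PUBYEAR = {} AND DOCTYPE(ar) AND LANGUAGE({})'.format(
        e_issn, yr, language)

query = build_query(15205126)

# Decoding the sample gives back the readable expression ...
print(unquote_plus(encoded_sample))
# ... and encoding the expression gives a string equivalent to the sample.
print(quote_plus(query))
```

`quote_plus` emits `%3D` in upper case where the sample has `%3d`; the hex digits in a percent-encoding are case-insensitive, so the two forms are equivalent.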

### Miscellaneous helper functions for Scopus queries

In [20]:
# Replace commas in dataframe with underscores

def replace_commas(s):
    try: return s.replace(',','_')
    except: return s
    
######################################################

# Compute number of queries needed for given journal / year combination.
# Scopus returns only 200 articles per query, so essentially this 
# is a ceiling division by 200, capped at 10 queries (2000 articles).

def calc_n_queries (n):   # n = number of articles in journal for year
    if n < 1: return 0
    else: return min (1 + ((n-1) // 200), 10)

#####################################################

# Build filename for an HTML page.

# j is the journal index, e_issn is the journal tag, q is the query number (1..10), yr is the year
def build_html_filename (j, e_issn, q, yr):
    return str (yr) + '_' + str (j).zfill (5) + '_' + e_issn + '_' + str (q).zfill(2)+'.html'
    

#####################################################

# Set the Scopus website to show 200 articles (max #), and to display abstracts.

def config_abstract_view (driver):
    
    #First set to view 200 docs:
    n_res_dropdown = driver.find_element_by_id('resultsPerPage-button')
    n_res_dropdown.send_keys('200\n')
    
    # Now make the abstracts visible
    try:
        abstract_toggle = driver.find_element_by_link_text ('Show all abstracts')
        abstract_toggle.click()
    except:    
        pass #abstracts were already visible, no problem.
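
The paging arithmetic and file-naming scheme defined in this cell can be checked without a browser. The sketch below restates the two pure functions so that it runs on its own under Python 3. The `page_size` and `max_pages` parameters are additions for illustration (the notebook hard-codes 200 and 10), and the journal index 7 is a made-up value.

```python
def calc_n_queries(n, page_size=200, max_pages=10):
    """Pages needed for n articles: ceiling division, capped at max_pages."""
    if n < 1:
        return 0
    return min(1 + (n - 1) // page_size, max_pages)

def build_html_filename(j, e_issn, q, yr):
    """Same naming scheme as above: year, journal index, E_ISSN, page number."""
    return '{}_{}_{}_{}.html'.format(yr, str(j).zfill(5), e_issn, str(q).zfill(2))

# Article counts around the page boundaries, plus one above the 2000 cap.
for n in [0, 1, 200, 201, 2000, 5000]:
    print(n, calc_n_queries(n))

print(build_html_filename(7, '1520-5126', 1, 2015))  # 2015_00007_1520-5126_01.html
```

Because the year comes first and the journal index is zero-padded, the saved files sort by year and then by journal rank in a directory listing.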

### Functions to initialize and write to the Scopus query log file.

This log file contains one line for each page returned by the Scopus site.  It is used to track success / failures in the query, and has information to guide the (later) process of extracting HTML from the saved pages and saving the results to Mongo.

In [5]:
# This log is a comma-delimited file, and should be readable as a dataframe.
# f_log      = file handle to the output log
# yr         = publication year of the current query
# fn_html    = file name for html file of just-written page
# e_issn     = e_issn of current journal
# n_articles = total # of articles for the journal
# b_err      = Boolean, True if the Scopus query failed
# df_jlist   = the whole dataframe for journals.  The row corresponding to the E_ISSN will be
#              copied to the log

def write_to_log_file (f_log, yr, fn_html, e_issn, n_articles, b_err, df_jlist):
    t = datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S") # current time
    df_row = df_jlist[df_jlist.E_ISSN == e_issn]   # extract row from dataframe
    df_row = df_row.to_csv(header=False, index=False)
    s = ','.join ([t, str (yr), fn_html, str(n_articles), str(b_err)])
    s =  s + ',' + df_row 
    f_log.write (s)
    f_log.flush()
    return
    
######################################################
    
# Function to open a new log file and write its header line  
def init_log_file (fp_log_out, df_jlist):
    if os.path.exists (fp_log_out):
        f_log = open (fp_log_out, 'a')
    else:
        f_log = open (fp_log_out, 'w')
        s = 'TIMESTAMP,YEAR,HTML_FILE,N_ARTICLES,ERR,'
        s = s + ','.join (df_jlist.columns) + '\n'
        f_log.write (s)
    return f_log

######################################################

# Function to log a chromedriver restart
def write_to_log_file_chromedriver_restart (f_log):
    t = datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S") # current time
    s = t + ',' + 'CHROMEDRIVER KILL / RESTART\n'
    f_log.write (s)
    f_log.flush()
    return
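
As an illustration of how the log might be consumed later, the sketch below reads a small in-memory sample and keeps only the pages that were saved without error. It is written for Python 3 and is not part of the original notebook. The sample rows, timestamps, and journal names are invented, and the header is trimmed to two journal columns (`JOURNAL_TITLE`, `E_ISSN`), whereas the real log carries every column of the journal dataframe.

```python
import csv
import io

# Invented sample in the layout written by init_log_file / write_to_log_file.
sample_log = io.StringIO(
    'TIMESTAMP,YEAR,HTML_FILE,N_ARTICLES,ERR,JOURNAL_TITLE,E_ISSN\n'
    '2016-12-01_10:00:00,2015,2015_00000_1520-5126_01.html,350,False,EXAMPLE JOURNAL A,1520-5126\n'
    '2016-12-01_10:00:05,CHROMEDRIVER KILL / RESTART\n'
    '2016-12-01_10:00:09,2015,ERR_NO_VALID_HTML_DUMP,120,True,EXAMPLE JOURNAL B,0000-0000\n'
)

good_pages = []
for row in csv.DictReader(sample_log):
    # Restart rows have only two fields, so ERR is None for them and
    # they are skipped along with the rows where ERR is 'True'.
    if row['ERR'] == 'False':
        good_pages.append(row['HTML_FILE'])

print(good_pages)  # ['2015_00000_1520-5126_01.html']
```

The restart lines are shorter than the data lines, so any reader of this file needs to tolerate rows with missing fields.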

### Function to make a scopus query (up to 200 docs), save the HTML, and update the log file

The following is a sample Scopus query, for the Journal of the American Chemical Society.  It should return a query result if you are properly configured for Scopus access through your institution.

**https://www.scopus.com/results/results.uri?sort=plf-f&src=s&sid=DD71260EDC8DA0916BED214D5039CFAC.wsnAw8kcdt7IPYLO0V48gA%3a130&sot=a&sdt=a&sl=71&s=ISSN%2815205126%29+AND+PUBYEAR+%3d+2015+AND+DOCTYPE%28ar%29+AND+LANGUAGE%28english%29&origin=searchadvanced&editSaveSearch=&txGid=DD71260EDC8DA0916BED214D5039CFAC.wsnAw8kcdt7IPYLO0V48gA%3a13**

In [21]:
# dir_out_html = directory where the HTML files are saved
# f_log = file handle to the log file
# j = index for current journal 
# e_issn = e_issn for current journal
# n_docs = total number of docs for current journal
# q = current query number for current journal (1..10)
# yr = publication year of the current query
# n_query = total number of queries needed for current journal
# df_jlist = big dataframe of journal info
# driver = the selenium browser driver object
# err:  Will be True if an error occurred prior to calling the function.  If
#       err == True, no query / html save will be done here.  However, the log is still 
#       written to make a record of the error.

def save_scopus_query (dir_out_html, f_log, j, e_issn, n_docs, q, yr, n_query, df_jlist, driver, err):
        
        if err == False:      
            try:
                print 'Running query ' + str(q) + ' of ' + str (n_query)
                if q == 1:
                    config_abstract_view (driver)   # Make abstracts visible, set to 200 / page
                src = driver.page_source 
                fn_html = build_html_filename (j, e_issn, q, yr)
                time.sleep(sleep_time_query)
                fp_html = os.path.join (dir_out_html, fn_html)
                print fp_html
                with open (fp_html, 'w') as f:
                    f.write (src.encode ('UTF-8'))    
            except: err = True
                
        if err == True:  fn_html = 'ERR_NO_VALID_HTML_DUMP'
                
        write_to_log_file (f_log, yr, fn_html, e_issn, n_docs, b_err = err, df_jlist = df_jlist)    
        
        return err

### Function to update restart counter, and kill / restart chromedriver if needed

In [22]:
def process_chromedriver_reset (driver, q_since_reset, browser_restart_interval, f_log, delay = 3):
    q_since_reset = (q_since_reset + 1) % browser_restart_interval
    
    if q_since_reset == 0:
        print 'KILLING CHROMEDRIVER'
        driver.quit()   # quit() shuts down the driver session; close() only closes the current window
        time.sleep (delay)
        print 'RESTARTING CHROMEDRIVER'
        driver = webdriver.Chrome (chromedriver)
        time.sleep(delay)
        write_to_log_file_chromedriver_restart (f_log)
    
    return driver, q_since_reset
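
The restart schedule produced by the modular counter can be simulated without Selenium. The sketch below is a Python 3 illustration, not the notebook's code: `advance_counter` is a hypothetical stand-in that returns a flag instead of restarting a browser, and the interval of 3 is chosen only to keep the trace short (the notebook uses 100).

```python
def advance_counter(q_since_reset, restart_interval):
    """Advance the counter and report whether a restart is due (hypothetical helper)."""
    q_since_reset = (q_since_reset + 1) % restart_interval
    return q_since_reset, q_since_reset == 0

restarts = []
count = 0
for query_number in range(1, 8):      # simulate 7 queries
    count, restart_due = advance_counter(count, 3)
    if restart_due:
        restarts.append(query_number)

print(restarts)  # [3, 6]
```

The counter wraps to zero on every Nth query, so a restart follows queries N, 2N, 3N, and so on, regardless of which journal or year the loop is on.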

# MAIN -- Filter Journal List

In [11]:
# Define filepaths
fp_jlist_in = '/Users/bryanfry/projects/proj_asksci/files_in/scopus_journal_list_tab_delimited.txt'
fp_topics_in = '/Users/bryanfry/projects/proj_asksci/files_in/scopus_topic_codes.csv'
fp_jlist_out = '/Users/bryanfry/projects/proj_asksci/files_out/filtered_journal_list.txt'

# Read the csv files
df_j = pd.read_csv (fp_jlist_in, sep = '\t')
df_t = pd.read_csv (fp_topics_in)

# Clean up the column names in the journal and topics list dataframes
df_j.columns = [clean_name(i) for i in df_j.columns]
df_t.columns = [clean_name(i) for i in df_t.columns]

# Build missing E_ISSN values (electronic journal ID codes)
df_j.E_ISSN =   df_j.E_ISSN.fillna('NONE') # Fill missing E_ISSN with 'NONE'

# String algebra to generate correct E_ISSN for journals where it is absent
df_j.E_ISSN = [generate_e_issn (i,j) for i,j in zip (df_j.PRINT_ISSN, df_j.E_ISSN)]

# Clean up the topic names
df_t.DESCRIPTION = df_t.DESCRIPTION.apply (clean_name)

# Basic filtering of the journal list
df_j = df_j [df_j.SOURCE_TYPE.apply(string.upper) == 'JOURNAL'] # Limit to journals (not books, etc)
df_j = df_j [df_j.ACTIVE_OR_INACTIVE.apply(string.upper) == 'ACTIVE'] #Limit to journals active in Scopus
df_j = df_j [df_j.PUBLISHERS_COUNTRY.apply (is_english_journal) == True] # Limit to likely english language journals

# Sort on 2015_SNIP, descending (this is a topic-normalized impact factor)
df_j = df_j.sort_values (['2015_SNIP'], ascending=False)

# Remove medicine / nursing / pharmacology / veterinary / dental topics, then remove journals for which no topics remain.
len_org = len (df_j)
df_j.ALL_CLASSIFICATION_CODES = df_j.ALL_CLASSIFICATION_CODES.apply (remove_medicine)
df_j= df_j[df_j.ALL_CLASSIFICATION_CODES != '_NONE_']
print 'NUMBER OF JOURNALS REMOVED FOR MEDICAL / VET / NURSING TOPICS = ' + str (len_org - len (df_j))

# Build column for interdisciplinary journal - True or False
df_j['IS_INTERDISCIPLINARY'] = df_j.ALL_CLASSIFICATION_CODES.apply (is_interdisciplinary)

# Build column for 'simple classification codes' -- all topic codes rounded to next-lowest mult of 100.
df_j['ALL_SIMPLE_CODES'] = df_j.ALL_CLASSIFICATION_CODES.apply (def_build_simple_topics)

# Make a dictionary with the topic codes as keys and subjects as values
topic_dict = {str(df_t.CODE.iloc[i]):df_t.DESCRIPTION.iloc[i] for i in range (len(df_t))}

# Add a field indicating multi-topic journal (more than one 100-level topic)
df_j['IS_MULTITOPIC'] = df_j.ALL_SIMPLE_CODES.apply (is_multitopic)

# Finally, add new columns that include text topics separated by semicolons.
df_j['ALL_CLASSIFICATION_TOPICS'] = [build_text_classification_codes(i, topic_dict) for i in df_j.ALL_CLASSIFICATION_CODES]
df_j['ALL_SIMPLE_TOPICS'] = [build_text_classification_codes(i, topic_dict) for i in df_j.ALL_SIMPLE_CODES]


df_j.to_csv (fp_jlist_out, sep='\t', index = False)


print 'TOTAL # Journal = ' + str (len (df_j))
print 'TOTAL # of Multidisciplinary Journals is = ' + str (len (df_j[df_j.IS_INTERDISCIPLINARY == True]))
print 'TOTAL # of NOT multi-topic Journals is = ' + str (len (df_j[df_j.IS_MULTITOPIC == False]))
print 'DONE'

NUMBER OF JOURNALS REMOVED FOR MEDICAL / VET / NURSING TOPICS = 2982
TOTAL # Journal = 11413
TOTAL # of Multidisciplinary Journals is = 34
TOTAL # of NOT multi-topic Journals is = 6286
DONE
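
The topic-code filtering applied in this cell can be illustrated on a small made-up list of codes. The sketch below is written for Python 3 and works on integer lists rather than the semicolon-separated strings used in the dataframe. `drop_medical` and `simplify` are hypothetical names, and the four input codes are invented for the example. The excluded ranges are the ones listed in `remove_medicine`.

```python
# Ranges excluded by remove_medicine: medicine, nursing, pharmacology,
# and veterinary / dental / health services.
MEDICAL_RANGES = [(2700, 2799), (2900, 2999), (3000, 3099), (3400, 3699)]

def drop_medical(codes):
    """Keep only the codes that fall outside every excluded range."""
    return [c for c in codes
            if not any(lo <= c <= hi for lo, hi in MEDICAL_RANGES)]

def simplify(codes):
    """Round each code down to a multiple of 100 and remove duplicates."""
    return sorted({c - c % 100 for c in codes})

codes = [1503, 1507, 2705, 3612]      # invented example codes
kept = drop_medical(codes)
print(kept)             # [1503, 1507]
print(simplify(kept))   # [1500]
```

In this example the journal keeps two detailed codes but only one 100-level topic, so it would not count as multi-topic under `is_multitopic`. Unlike `def_build_simple_topics`, this sketch sorts the result to make the output order predictable.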


# Main -- Query Scopus and save webpages

In [None]:
# Point to the Selenium Chromedriver
chromedriver = '/Users/bryanfry/chromedriver'
os.environ ['webdriver.chrome.driver'] = chromedriver

# Input file with cleaned journal list
fp_jlist_in = '/Users/bryanfry/projects/proj_asksci/files_out/filtered_journal_list.txt' 

# All saved webpages, and the query log file, will be saved to the following directory.
# Directory will be created if it does not already exist.
dir_out = '/Users/bryanfry/projects/proj_asksci/files_out/scopus_query_out'

#Log file, updated with each saved scopus HTML page
fp_log_out = os.path.join (dir_out, '_SCOPUS_QRY_LOG.csv')

sleep_time_init = 30  #Initial time to sleep before first query (sec)
sleep_time_query = 1  #Time to sleep after each query (sec)
#end_journal = 2500
end_journal = 10  # Index of the last journal to run (inclusive).  The journals will be run in order of decreasing 2015_SNIP
start_journal = 0  #Initial journal in the list, used to restart mid-run. 0 --> first journal.
yr_list = [2015, 2013, 2011, 2009, 2007]  # Updated to support multiple years
browser_restart_interval = 100 # Every N queries, the chromedriver will be killed and restarted.... this
                              # aims to address the memory leak issue seen in long runs

# Create Scopus query output directory, if it does not exist:
if not os.path.exists (dir_out): os.makedirs(dir_out)
    
n_cum_docs = 0 # Cumulative count of documents
q = 0  # Total query count
q_since_reset = 0  # Queries since last chromedriver kill/restart

df_jlist = pd.read_csv (fp_jlist_in, sep = '\t')  # Read *.TXT with journal info (tab-delimited)
df_jlist = df_jlist [df_jlist.IS_MULTITOPIC == False]  # Eliminate Multi-topic Journals

f_log = init_log_file (fp_log_out, df_jlist)  # Initialize log file (one line per scopus query)

df_jlist.E_ISSN =   df_jlist.E_ISSN.fillna('NONE') # Fill missing E_ISSN with 'NONE'

# String algebra to generate correct E_ISSN for journals where it is absent
df_jlist.E_ISSN = [generate_e_issn (i,j) for i,j in zip (df_jlist.PRINT_ISSN, df_jlist.E_ISSN)]

# If both electronic and print ISSN values were not given in the original file, the E_ISSN field will now contain
# 'NONE'.  Remove journals where this is the case.
df_jlist = df_jlist [df_jlist.E_ISSN != 'NONE']

# replace any commas in the journal dataframe with underscores.  There are some
# commas in various fields (publisher names, etc.) and they may confuse the comma-delimited
# file format
for c in df_jlist.columns:
    df_jlist[c] = df_jlist[c].apply (replace_commas)

    
df_jlist = df_jlist.sort_values (by=['2015_SNIP'], ascending=False) #Sort by impact factor

driver = webdriver.Chrome (chromedriver)
time.sleep (sleep_time_init)
end_journal = min (end_journal, len (df_jlist)-1) # Do not allow end_journal to exceed # of valid journals in dataset.


# LOOP ON YEARS
for yr in yr_list:
    # Now we will loop on the top N journals, making a scopus query for each one.
    for idx, e_issn in enumerate (df_jlist.E_ISSN[start_journal: (end_journal+1)]):
        j = idx + start_journal  # journal index, offset by the starting point
        err = False   # New Journal, reset the error flag
        print 'E_ISSN = ' + e_issn + '                  J# = ' + str (j)
        print 'JOURNAL = ' + df_jlist.JOURNAL_TITLE[df_jlist.E_ISSN == e_issn].tolist()[0]
        print 'YEAR = ' + str(yr)
        url = build_url (e_issn, yr = yr, language = 'english')  # construct URL for the page
        try:
            driver.get (url)  # Load the page
            # Read in the total number of articles for the journal / year
            count_display = driver.find_element_by_class_name ('resultsCount')
            n_docs = int (count_display.text.replace (',',''))
            n_cum_docs = n_cum_docs + n_docs
            print 'CUMULATIVE DOC COUNT (ACROSS ALL YEARS) = ' + str (n_cum_docs)
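        except:
            err = True
            n_docs = 0   # No valid count was read; do not reuse the previous journal's value

        # Compute the number of Scopus pages needed to get the abstracts (max of 2000 abstracts)
        # Loop to execute the queries
        n_query = calc_n_queries (n_docs)
        if n_query > 0 or err:   # On error, still make one call so the failure is written to the log
            q = 1
            err = save_scopus_query (dir_out, f_log, j, e_issn, n_docs, q, yr, n_query, df_jlist, driver, err) #save html, update log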
            q = q + 1
            driver, q_since_reset = process_chromedriver_reset (driver, q_since_reset, \
                                                                browser_restart_interval, f_log, delay = 3)
            # If necessary, click the "next page" button and make next query
            while q <= n_query and err == False:

                # Find and click the "Next page" button
                try:
                    next_page = driver.find_element_by_class_name('nextPage')
                    next_page.click()
                except: err = True

                err = save_scopus_query (dir_out, f_log, j, e_issn, n_docs, q, yr, n_query, df_jlist, driver, err) #save html, update log; stop paging if the save fails
                q = q + 1
                driver, q_since_reset = process_chromedriver_reset (driver, q_since_reset, \
                                                                    browser_restart_interval, f_log, delay = 3)
                
f_log.close()    
driver.quit()   # End the driver session, not just the current window
print '\n#### DONE #####'