# Populate SURFACE from BibTeX

**Duncan A. Brown<sup>1</sup>**

**<sup>1</sup>Department of Physics, Syracuse University, Syracuse, NY 13244, USA**

This script populates SURFACE with publications downloaded as a BibTeX file from ORCID, INSPIRE, or ADS.

## License

This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 United States License](http://creativecommons.org/licenses/by-sa/3.0/us/).

## Instructions

1. Create a Python virtual environment with
```sh
virtualenv ~/surface
```
and then activate it with
```sh
source ~/surface/bin/activate
```
2. Install the required Python packages
```sh
pip install --upgrade pip
pip install --upgrade setuptools
pip install jupyter
pip install selenium
pip install bibtexparser
pip install bs4
```
3. Install the [headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome) driver using [Homebrew](https://brew.sh/)
```sh
brew install --cask chromedriver
```
4. Follow the [export works instructions](https://support.orcid.org/hc/en-us/articles/360006971453-Exporting-works-into-a-BibTeX-file) to download your works from ORCID as a BibTeX file. Save it as `works.bib` in the same directory as this notebook.
5. Execute this notebook. The first three cells ask for your NetID, your password, and your ORCID, so enter them when prompted.

## Enter NetID, password, and ORCID

In [1]:
print "Enter NetID"
netid = raw_input()

Enter NetID
dabrown


In [2]:
import getpass
print "Enter NetID Password"
password = getpass.getpass()

Enter NetID Password
········


In [3]:
print "Enter your ORCID with the URL (i.e. in the form https://orcid.org/0000-0002-9180-5765)"
orcid = raw_input()

Enter your ORCID with the URL (i.e. in the form https://orcid.org/0000-0002-9180-5765)
https://orcid.org/0000-0002-9180-5765


## Import modules and launch the Chrome headless driver

In [4]:
# __future__ imports must come before any other statement
from __future__ import unicode_literals
import os
import re
import sys
import time
import pickle
import urllib
import urllib2
import bibtexparser
from bibtexparser.bwriter import BibTexWriter
from bibtexparser.bibdatabase import BibDatabase
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup as Soup
from collections import defaultdict
import AddressBook as ab

In [5]:
def save_obj(obj, name ):
    with open(name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name ):
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)
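
The two helpers above are what let the notebook remember author emails between runs. The following is a minimal sketch of that round trip, assuming a temporary directory and made-up data. `load_cache` and the sample author are hypothetical and only illustrate the load-or-start-empty pattern used later for `author_emails`.

```python
import os
import pickle
import tempfile


def save_obj(obj, name):
    """Pickle obj to '<name>.pkl', as the notebook helper does."""
    with open(name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


def load_obj(name):
    """Load the object stored in '<name>.pkl'."""
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)


def load_cache(name):
    """Return the cached dict, or an empty dict if no cache file exists yet."""
    try:
        return load_obj(name)
    except (IOError, OSError):
        return {}


# Hypothetical data in the (email, institution) shape used by author_emails.
cache_name = os.path.join(tempfile.mkdtemp(), 'author_emails_demo')
first_run = load_cache(cache_name)      # no file yet, so an empty dict
first_run['Doe, Jane'] = ('jane@example.org', 'Example University')
save_obj(first_run, cache_name)
second_run = load_cache(cache_name)     # now read back from disk
```

Catching only file errors, rather than every exception, means a corrupted cache file surfaces as an error instead of silently discarding the stored addresses.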

In [6]:
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
time.sleep(5)

In [7]:
try:
    author_emails = load_obj('author_emails')
except IOError:
    # No cache file yet (first run), so start with an empty cache
    author_emails = {}

## Load in a BibTeX file created by exporting works from ORCID

In [8]:
try:
    del bib_database
except:
    pass
with open('works.bib') as bibtex_file:
    bib_database = bibtexparser.load(bibtex_file)
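
Before opening a browser session it can help to see which entries the notebook will attempt at all. Later cells skip entries whose author list ends in `others` and entries without a `doi` field. The sketch below applies the same two checks to plain dictionaries shaped like the items in `bib_database.entries`. The `preflight` function and the sample entries are hypothetical.

```python
def preflight(entries):
    """Split BibTeX entries into those that can be tried and those skipped."""
    ready, skipped = [], []
    for entry in entries:
        authors = entry.get('author', '').split(' and ')
        truncated = authors[-1].strip().lower() == 'others'
        if truncated or 'doi' not in entry:
            skipped.append(entry['ID'])
        else:
            ready.append(entry['ID'])
    return ready, skipped


# Hypothetical entries, shaped like the dicts in bib_database.entries.
sample = [
    {'ID': 'Doe:2020abc', 'author': 'Doe, Jane and Roe, Richard',
     'doi': '10.1000/example'},
    {'ID': 'Big:2021xyz', 'author': 'Big, A. and others',
     'doi': '10.1000/big'},
    {'ID': 'NoDoi:2019', 'author': 'Solo, Han'},
]
ready, skipped = preflight(sample)
```

In the notebook, the same call would be `preflight(bib_database.entries)`.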

## Open a connection to SURFACE

This will bounce us through the SAML2 authentication to Syracuse University's IdP and back to SURFACE.

In [9]:
driver.get('https://shibidp.syr.edu/idp/profile/cas/login?service=https%3A%2F%2Fsurface.syr.edu%2Fcgi%2Flogin.cgi%3Fauth_server%3D21%26return_to%3Dhttps%253A%252F%252Fsurface.syr.edu%252Fcgi%252Fir_submit.cgi%253Fcontext%253Dphy')

In [10]:
driver.find_element_by_id('username').send_keys(netid)
driver.find_element_by_id('password').send_keys(password)
driver.find_element_by_name('_eventId_proceed').click()
time.sleep(5)
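
The fixed `time.sleep(5)` calls either wait longer than necessary or not long enough on a slow connection. One alternative is to poll until the page is ready. Selenium provides `WebDriverWait` in `selenium.webdriver.support.ui` for this purpose. The sketch below shows the same idea with a small generic helper so that it runs without a browser. `wait_until` and `fake_lookup` are hypothetical names, not part of the notebook or of Selenium.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate() until it returns a truthy value or the timeout passes.

    Returns the truthy value, or None if the timeout expires first.
    Exceptions raised by predicate() count as "not ready yet".
    """
    deadline = time.time() + timeout
    while True:
        try:
            result = predicate()
        except Exception:
            result = None   # e.g. the element is not on the page yet
        if result:
            return result
        if time.time() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for something like: lambda: driver.find_element_by_id('title')
attempts = []


def fake_lookup():
    """Fail twice, then succeed, to imitate a page that is still loading."""
    attempts.append(1)
    if len(attempts) < 3:
        raise LookupError('not ready')
    return 'element'


found = wait_until(fake_lookup, timeout=5.0, interval=0.01)
missing = wait_until(lambda: None, timeout=0.05, interval=0.01)
```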

## Functions to query the OS X address book to find emails

In [11]:
def _make_list(cdw):
    """Make a list from CoreDataWrapper"""
    if not cdw:
        return []
    values = []
    for i in range(cdw.count()):
        values.append(unicode(cdw.valueAtIndex_(i)))
    return values


# map dict keys to AB properties and conversion funcs
_text_property_map = {
    # output key :  (property, conversion func)
    'first_name': (ab.kABFirstNameProperty, None),
    'last_name': (ab.kABLastNameProperty, None),
    'company': (ab.kABOrganizationProperty, None),
    'emails': (ab.kABEmailProperty, _make_list),
}


def ab_person_to_dict(person):
    """Convert ABPerson to Python dict"""
    d = {}
    d['emails'] = []
    for key in _text_property_map:
        prop, func = _text_property_map[key]
        value = person.valueForProperty_(prop)
        if func:
            value = func(value)
        if not value:
            value = ''
        d[key] = value
    return d


def search(query):
    """Search Mac Address Book and return pythonified results"""

    address_book = ab.ABAddressBook.sharedAddressBook()
    # build search criteria
    criteria = []
    for key in _text_property_map:
        prop, func = _text_property_map[key]
        criteria.append(
            ab.ABPerson.searchElementForProperty_label_key_value_comparison_(
            prop,
            None, None,
            query,
            ab.kABContainsSubStringCaseInsensitive)
        )

    search = ab.ABSearchElement.searchElementForConjunction_children_(
            ab.kABSearchOr, criteria)

    # search Address Book
    people = address_book.recordsMatchingSearchElement_(search)

    results = defaultdict(set)
    for person in [ab_person_to_dict(person) for person in people]:
        key = '{} {}'.format(person.get('first_name', ''),
                             person.get('last_name', '')).strip()
        for addr in person['emails']:
            results[key].add(addr)

    return results
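
`search()` returns a mapping from a contact's full name to a set of email addresses. The sketch below shows one way to choose an address from such a result by first name, assuming made-up contacts. `pick_email` is a hypothetical helper. Unlike the `.pop()` call used later in the notebook, it sorts the addresses so the choice is repeatable and the set is left unchanged.

```python
from collections import defaultdict


def pick_email(people, first_name):
    """Return one address for the first contact whose name has first_name.

    people maps 'First Last' to a set of addresses, as search() returns.
    Returns None when nothing matches, so the caller can prompt instead.
    """
    for name in sorted(people):
        if first_name.lower() in name.lower() and people[name]:
            return sorted(people[name])[0]
    return None


# Hypothetical search() result for the last name 'Doe'.
people = defaultdict(set)
people['Jane Doe'].update(['jane@example.org', 'jdoe@example.com'])
people['John Doe'].add('john@example.net')

jane = pick_email(people, 'Jane')
nobody = pick_email(people, 'Alice')
```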

## Function to get the URL of a PDF from arXiv

In [12]:
def get_arxiv_pdf_from_doi(doi):
        
    response = urllib2.urlopen('https://arxiv.org/search/advanced?advanced=1'
                               '&terms-0-operator=AND&terms-0-term={}'
                               '&terms-0-field=doi&classification-physics_archives=all'
                               '&date-filter_by=all_dates&date-year=&date-from_date='
                               '&date-to_date=&date-date_type=submitted_date'
                               '&abstracts=show&size=50'
                               '&order=-announced_date_first'.format(urllib.quote_plus(doi)))
    
    html = response.read()
    page = Soup(html,'html.parser')
    
    pdf_link = None
    for a in page.find_all('a'):
        try:
            href = str(a['href'])
            if 'https://arxiv.org/pdf/' in href:
                pdf_link = href
        except:
            pass
        
    return pdf_link
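
The function above does two things: it URL-encodes the DOI into the search query, and it keeps the last link on the results page that points at an arXiv PDF. The sketch below reproduces both steps offline so they can be checked without a network request. It swaps BeautifulSoup for the standard-library `HTMLParser`. The helper names, the sample HTML, and the arXiv identifier in it are all hypothetical.

```python
try:
    from urllib import quote_plus            # Python 2
    from HTMLParser import HTMLParser
except ImportError:
    from urllib.parse import quote_plus      # Python 3
    from html.parser import HTMLParser


class PdfLinkFinder(HTMLParser):
    """Collect href values that point at arXiv PDFs."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        if 'https://arxiv.org/pdf/' in href:
            self.links.append(href)


def doi_query_fragment(doi):
    """URL-encode a DOI for use as the terms-0-term query parameter."""
    return 'terms-0-term={}&terms-0-field=doi'.format(quote_plus(doi))


def last_pdf_link(html):
    """Return the last arXiv PDF link in html, or None if there is none."""
    finder = PdfLinkFinder()
    finder.feed(html)
    return finder.links[-1] if finder.links else None


# Hand-written HTML standing in for a search results page.
sample_html = (
    '<a href="https://arxiv.org/abs/0000.00000">abs</a>'
    '<a href="https://arxiv.org/pdf/0000.00000">pdf</a>'
    '<a>no href here</a>'
)
fragment = doi_query_fragment('10.1103/PhysRevD.86.024024')
link = last_pdf_link(sample_html)
```

Anchors without an `href` attribute are ignored here, which is what the `try`/`except` around `a['href']` achieves in the notebook version.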

In [13]:
def populate_authors(driver, bibkey):
    
    # Build the author list from the entry itself instead of relying on a global
    authors = bibkey['author'].split(' and ')
    
    driver.find_element_by_css_selector('a.remove').click()
    
    author_index = 0
    
    for a in authors:
        author_index +=1
        a = re.sub("{|}", "", a).strip()
                
        if ',' in a:
            # "Last, First Middle": split on the first comma only
            l, others = a.split(',', 1)
            names = others.split()
        else:
            # "First Middle Last": the final word is the last name
            names = a.split()
            l = names.pop() if names else ''
        
        f = names[0] if names else ''
        # Any further given names go in the middle-name field
        m = ' '.join(names[1:])
        
        # str.strip() returns a new string, so keep the results
        f = f.strip()
        m = m.strip()
        l = l.strip()

        if a in author_emails:
            e = author_emails[a][0]
            i = author_emails[a][1]
        else:
            e = None
            people = search(l)
            if len(people):
                try:
                    e = [value for key, value in people.items() if f.lower() in key.lower()][0].pop()
                except:
                    e = None
                
            if e is None:
                print "I need to know the email address for {}".format(a)
                e = raw_input()
            
            try:
                if e.split('@')[1] == 'syr.edu':
                    i = "Syracuse University"
                else:
                    i = None
            except:
                i = None
                
            if i is None:
                print "I need to know the institution for {}".format(a)
                i = raw_input()
            
            author_emails[a] = (e, i)
            
        if author_index == 1:
            driver.find_element_by_xpath('//*[@title="Show/hide details"]').click()
        else:
            ap = driver.find_element_by_id('ap_picker')
            ap.send_keys(a)
            
            time.sleep(5)
            c = ap.get_attribute('aria-owns')
            driver.find_element_by_id(c).click()
            
        time.sleep(2)        
        driver.find_element_by_name('email_{}'.format(author_index)).send_keys(e)
        driver.find_element_by_name('fname_{}'.format(author_index)).send_keys(f)
        driver.find_element_by_name('mname_{}'.format(author_index)).send_keys(m)
        driver.find_element_by_name('lname_{}'.format(author_index)).send_keys(l)
        
        ip = driver.find_element_by_name('institution_{}'.format(author_index))
        ip.send_keys(i)
        
        try:
            time.sleep(2)
            c = ip.get_attribute('aria-owns')
            driver.find_element_by_id(c).click()
        except:
            pass
    
    driver.find_element_by_id('min_{}'.format(author_index)).click()        
    
    print "OK to continue?"
    x = raw_input()

    return True
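
The name handling inside `populate_authors` can be pulled out into a small pure function, which makes it possible to test without a browser. The following is a minimal sketch under that assumption. `split_author_name` is a hypothetical helper, and placing all extra given names in the middle-name field is a design choice made here, not something SURFACE requires.

```python
import re


def split_author_name(raw):
    """Split a BibTeX author string into (first, middle, last).

    Handles 'Last, First Middle' and 'First Middle Last'. Given names
    after the first are joined into the middle field.
    """
    name = re.sub(r'[{}]', '', raw).strip()
    if ',' in name:
        # Split on the first comma only, so extra commas do not raise
        last, given = [part.strip() for part in name.split(',', 1)]
        given_parts = given.split()
    else:
        parts = name.split()
        last = parts[-1] if parts else ''
        given_parts = parts[:-1]
    first = given_parts[0] if given_parts else ''
    middle = ' '.join(given_parts[1:])
    return first, middle, last


examples = [split_author_name(n) for n in
            ('Owen, Benjamin J.', 'Lee Lindblom', '{Brown}, Duncan A.')]
```

Every field is always a string, so the later `send_keys` calls never receive `None`.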

In [14]:
def populate_entry(driver, bibkey):
    
    if 'doi' in bibkey:
        # unquote lives in urllib; urllib2 only re-exports it
        doi = urllib.unquote(bibkey['doi'])
    else:
        print "No DOI found for entry, skipping {}".format(bibkey['title'])
        return False, bibkey
    
    pdf_link = get_arxiv_pdf_from_doi(doi)
    
    print "Uploading {}".format(doi)
    
    driver.get('https://shibidp.syr.edu/idp/profile/cas/login?service=https%3A%2F%2Fsurface.syr.edu%2Fcgi%2Flogin.cgi%3Fauth_server%3D21%26return_to%3Dhttps%253A%252F%252Fsurface.syr.edu%252Fcgi%252Fir_submit.cgi%253Fcontext%253Dphy')
    
    try:
        driver.find_element_by_css_selector('a.cc-btn.cc-dismiss').click()
    except:
        pass
    
    time.sleep(5)
    driver.find_element_by_name('accept_agreement').click()
    driver.find_element_by_name('agreement_button').click()
    time.sleep(5)

    driver.find_element_by_id('title').send_keys(re.sub("{|}", "", bibkey['title']))
    
    if not populate_authors(driver, bibkey):
        print "Skipping {} due to incomplete author information".format(bibkey['title'])
        return False, bibkey
    
    ifr = driver.find_element_by_id('orcid_ifr')
    driver.switch_to.frame(ifr)
    driver.find_element_by_id('tinymce').send_keys(orcid)
    driver.switch_to.default_content()
    
    driver.find_element_by_id('publication_date_year').send_keys(bibkey['year'])
    
    time.sleep(2)
    driver.find_element_by_id('keywords').send_keys('gravitational wave astronomy astrophysics')
    
    select_box = driver.find_element_by_id('disciplines')
    options = [x for x in select_box.find_elements_by_tag_name("option")]
    for o in options:
        o.click()
    driver.find_element_by_xpath(u'//*[@value="« Remove"]').click()
    
    driver.find_element_by_id('ygtvlabelel9').click()
    time.sleep(2)
    driver.find_element_by_id('ygtvlabelel12').click()
    time.sleep(2)
    driver.find_element_by_id('ygtvlabelel22').click()
    time.sleep(2)
    driver.find_element_by_xpath(u'//*[@value="Select »"]').click()
    
    if pdf_link:
        ifr = driver.find_element_by_id('comments_ifr')
        driver.switch_to.frame(ifr)
        driver.find_element_by_id('tinymce').send_keys(pdf_link)
        driver.switch_to.default_content()

    driver.find_element_by_id('link_full_text').click()
    driver.find_element_by_xpath(u'//*[@name="source_fulltext_url"]').send_keys(
        'https://doi.org/{}'.format(doi))
    
    ifr = driver.find_element_by_id('doi_ifr')
    driver.switch_to.frame(ifr)
    driver.find_element_by_id('tinymce').send_keys(doi)
    driver.switch_to.default_content()

    driver.find_element_by_id('source').send_keys('submission')
    
    x = raw_input('OK to submit? Hit return to continue or type anything else to cancel')
    if x.strip() == '':
        driver.find_element_by_xpath(u'//*[@name="submit_paper"]').click()
        return True, bibkey
    else:
        return False, bibkey
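
Most of `populate_entry` is typing values into the form, but the values themselves come from three small transformations: braces are stripped from the title, the DOI is URL-decoded, and the full-text link is built from the DOI. The sketch below collects those steps in one place so they can be inspected before anything is submitted. `form_fields` and the sample entry are hypothetical.

```python
import re

try:
    from urllib import unquote          # Python 2
except ImportError:
    from urllib.parse import unquote    # Python 3


def form_fields(entry):
    """Return title, DOI, link and year for an entry, or None without a DOI."""
    if 'doi' not in entry:
        return None
    doi = unquote(entry['doi'])
    return {
        'title': re.sub(r'[{}]', '', entry['title']),
        'doi': doi,
        'source_fulltext_url': 'https://doi.org/{}'.format(doi),
        'year': entry.get('year', ''),
    }


# Made-up entry with BibTeX braces and a percent-encoded DOI.
sample_entry = {
    'ID': 'Doe:2020abc',
    'title': '{A} study of {GW} signals',
    'doi': '10.1000%2Fexample',
    'year': '2020',
}
fields = form_fields(sample_entry)
no_doi = form_fields({'ID': 'NoDoi:2019', 'title': 'Untitled'})
```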

In [15]:
rejected_publications = []
accepted_publications = []

for b in bib_database.entries:
    authors = b['author'].split(' and ')
    
    if authors[-1].strip().lower() == 'others':
        print "Skipping {} due to author information".format(b['ID'])
        rejected_publications.append(b)
        
    else:
        try:
            e, r = populate_entry(driver, b)
            if e:
                accepted_publications.append(r)
            else:
                rejected_publications.append(r)
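        except Exception as err:
            # Report the failure instead of hiding it, then move on
            print "Skipping {} due to error: {}".format(b['ID'], err)
            rejected_publications.append(b)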

Skipping Aartsen:2014mfp due to author information
Skipping Aasi:2012fw due to author information
Skipping Aasi:2012rja due to author information
Skipping Aasi:2012wd due to author information
Skipping Aasi:2013jjl due to author information
Skipping Aasi:2013jya due to author information
Skipping Aasi:2013lva due to author information
Skipping Aasi:2013sia due to author information
Skipping Aasi:2013sna due to author information
Skipping Aasi:2013vna due to author information
Skipping Aasi:2014bqj due to author information
Skipping Aasi:2014ent due to author information
Skipping Aasi:2014erp due to author information
Skipping Aasi:2014iia due to author information
Skipping Aasi:2014iwa due to author information
Skipping Aasi:2014jkh due to author information
Skipping Aasi:2014jln due to author information
Skipping Aasi:2014ksa due to author information
Skipping Aasi:2014mqd due to author information
Skipping Aasi:2014mtf due to author information
Skipping Aasi:2014qak due to author inf

Uploading 10.1103/PhysRevD.86.024024
OK to continue?

OK to submit?
Skipping Kumar:2013gwa due to author information
Skipping LIGO:2012aa due to author information
Uploading 10.1103/PhysRevD.78.124020
I need to know the institution for Lindblom, Lee
Caltech
I need to know the email address for Owen, Benjamin J.
Penn State
I need to know the institution for Owen, Benjamin J.
Penn State
OK to continue?

OK to submit?
Uploading 10.1086/588246
I need to know the email address for Miller, M.Coleman
miller@astro.umd.edu
I need to know the institution for Miller, M.Coleman
University of Maryland
OK to continue?

OK to submit?
Skipping Mandel:2013ara due to author information
Skipping Margutti:2017cjl due to author information
Skipping Martynov:2016fzi due to author information
Skipping Monitor:2017mdv due to author information
Skipping Nicholl:2017ahq due to author information
Skipping others:2016ifn due to author information
Uploading 10.1088/0264-9381/24/12/S06
OK to continue?

OK to submit

In [16]:
writer = BibTexWriter()

rejected_bib = bibtexparser.bibdatabase.BibDatabase()
rejected_bib.entries = rejected_publications

accepted_bib = bibtexparser.bibdatabase.BibDatabase()
accepted_bib.entries = accepted_publications

with open('rejected_publications.bib', 'w') as bibfile:
    bibfile.write(writer.write(rejected_bib).encode('utf-8'))
    
with open('accepted_publications.bib', 'w') as bibfile:
    bibfile.write(writer.write(accepted_bib).encode('utf-8'))

In [17]:
save_obj(author_emails,'author_emails')