# Web Scraping of NeurIPS 2019 Conference Using Python and Selenium
### David Lowe
### January 17, 2020

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the Selenium module.

INTRODUCTION: INTRODUCTION: The Conference on Neural Information Processing Systems (NeurIPS) covers a wide range of topics in neural information processing systems and research for the biological, technological, mathematical, and theoretical applications. Neural information processing is a field that benefits from a combined view of biological, physical, mathematical, and computational sciences. This web scraping script will automatically traverse through the entire web page and collect all links to the PDF and PPTX documents. The script will also download the documents as part of the scraping process.

Starting URLs: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019

## Section 1. Prepare Environment

In [1]:
import os
import shutil
import smtplib
import sys
from email.message import EmailMessage
from datetime import datetime
from random import randint
from time import sleep
# from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

In [2]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose and debug flags to print detailed messages for debugging (setting True will activate!)
verbose = True
debug = False

# Set up the flag to send status emails (setting to True will send the status emails!)
notifyStatus = False

# Set up the mountStorage flag to mount G Drive for storing files (setting True will mount the drive!)
mountStorage = False

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = True

In [3]:
# Colab-Specific Setup - Mount Google Drive for storing downloaded files
if (mountStorage):
    from google.colab import drive
    drive.mount('/content/gdrive')

In [4]:
# Set up the email notification function
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    if sender==None or receiver==None or gateway==None or smtpuser==None or password==None:
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP(gateway, 587)
    server.starttls()
    server.login(smtpuser, password)
    server.send_message(msg)
    server.quit()

In [5]:
def download_to_local(doc_path):
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to retrieve " + doc_path)
    sleep(waitTime)
    local_file = doc_path.split('/')[-1]
    if (os.path.isfile(local_file) == False):
        with requests.get(doc_path, stream=True) as r:
            with open(local_file, 'wb') as f:
                shutil.copyfileobj(r.raw, f)
        print('Downladed file: ' + local_file)
    else:
        print('Skipped existing file: ' + local_file)

In [6]:
def download_to_gdrive(doc_path):
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to retrieve " + doc_path)
    sleep(waitTime)
    local_file = doc_path.split('/')[-1]
    gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
    dest_file = gdrivePrefix + local_file
    with requests.get(doc_path, stream=True) as r:
        with open(dest_file, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    print('Downladed file: ' + dest_file)

In [7]:
if (notifyStatus): email_notify("Phase 1 Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Section 1. Perform the Scraping and Processing

In [8]:
if (notifyStatus): email_notify("Phase 2 Perform the Scraping and Processing has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [9]:
# Specifying the URL of desired web page to be scrapped
website_url = "https://papers.nips.cc"
starting_url = website_url + "/book/advances-in-neural-information-processing-systems-32-2019"

In [10]:
# Access and test the starting URL
firefox_options = Options()
firefox_options.headless = False
home_page_browser = webdriver.Firefox(options=firefox_options)
print('Attempting to access the web page: ' + starting_url)
home_page_browser.get(starting_url)

# Gather all document links from the starting URL
main_page_container = home_page_browser.find_element_by_class_name('main-container')
collection = main_page_container.find_elements_by_tag_name('li')
num_presentations = 0
num_documents = 0

Attempting to access the web page: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019


In [11]:
doc_page_browser = webdriver.Firefox(options=firefox_options)
for item in collection:
    if (debug): print(item)
    presentation_element = item.find_element_by_tag_name('a')
    doc_title = presentation_element.text
    if (verbose): print("Found title:", doc_title)
    doc_link = presentation_element.get_attribute('href')
    if (verbose): print("Found link:", doc_link)
    author_group = item.find_elements_by_class_name('author')
    author_list = []
    for each_author in author_group:
        author_list.append(each_author.text)
    authors = '|'.join(author_list)
    if (verbose): print("Found authors:", authors)

    doc_page_browser.get(doc_link)
    doc_page_container = doc_page_browser.find_element_by_class_name('main-container')
    doc_abstract = doc_page_container.find_element_by_class_name('abstract')
    if (verbose): print("Found abstract:", doc_abstract)
    artifact_list = doc_page_container.find_elements_by_tag_name('a')
    for artifact_item in artifact_list:
        if (artifact_item.text == "[PDF]"):
            doc_path = artifact_item.get_attribute('href')
            if (verbose): print("Found PDF:", doc_path)
            if (executeDownload):
                if (mountStorage):
                    download_to_gdrive(doc_path)
                else:
                    download_to_local(doc_path)
            num_documents = num_documents + 1
        if (artifact_item.text == "[Supplemental]"):
            doc_path = artifact_item.get_attribute('href')
            if (verbose): print("Found Supplemental:", doc_path)
            if (executeDownload):
                if (mountStorage):
                    download_to_gdrive(doc_path)
                else:
                    download_to_local(doc_path)
            num_documents = num_documents + 1
    num_presentations = num_presentations + 1
doc_page_browser.quit()

Found title: Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
Found link: https://papers.nips.cc/paper/8296-multimodal-model-agnostic-meta-learning-via-task-aware-modulation
Found authors: Risto Vuorio|Shao-Hua Sun|Hexiang Hu|Joseph J. Lim
Found abstract: <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="787e38a3-34d8-4360-a162-3d8a18757d62", element="6949c871-148f-49ef-be92-3e44d5871de8")>
Found PDF: https://papers.nips.cc/paper/8296-multimodal-model-agnostic-meta-learning-via-task-aware-modulation.pdf
Waiting 3 seconds to retrieve https://papers.nips.cc/paper/8296-multimodal-model-agnostic-meta-learning-via-task-aware-modulation.pdf
Downladed file: 8296-multimodal-model-agnostic-meta-learning-via-task-aware-modulation.pdf
Found Supplemental: https://papers.nips.cc/paper/8296-multimodal-model-agnostic-meta-learning-via-task-aware-modulation-supplemental.zip
Waiting 5 seconds to retrieve https://papers.nips.cc/paper/8296-multimodal-model-agnostic-meta-

In [12]:
home_page_browser.quit()
print('Finished finding all available documents on the presentation pages!')
print('Total presentations processed:', num_presentations)
print('Total documents downloaded:', num_documents)

Finished finding all available documents on the presentation pages!
Total presentations processed: 1427
Total documents downloaded: 2854


In [13]:
if (notifyStatus): email_notify("Phase 2 Perform the Scraping and Processing completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [14]:
print ('Total time for the script:',(datetime.now() - startTimeScript))

Total time for the script: 4:06:10.293176
