# Web Scraping of NeurIPS Conference 2019 Using Python and BeautifulSoup
### David Lowe
### January 10, 2020

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: The Conference on Neural Information Processing Systems (NeurIPS) covers a wide range of topics in neural information processing, spanning biological, technological, mathematical, and theoretical research. Neural information processing is a field that benefits from a combined view of biological, physical, mathematical, and computational sciences. This web scraping script traverses the proceedings index page, follows the link to each paper's page, and collects the link to the paper's PDF document (the code matches only links labeled `[PDF]`). The script also downloads the documents as part of the scraping process.

Starting URLs: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019

## Section 0. Prepare Environment

In [1]:
# Colab-Specific Setup - Refresh Linux package repositories and set up additional Linux and Python tools
# !apt-get update
# !apt install chromium-chromedriver
# !pip install -q pymysql selenium

In [2]:
import numpy as np
import pandas as pd
import os
import shutil
import smtplib
import sys
from email.message import EmailMessage
from datetime import datetime
import requests
from requests.exceptions import HTTPError
from requests.exceptions import ConnectionError
from bs4 import BeautifulSoup
from random import randint
from time import sleep
# from selenium import webdriver

In [3]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose and debug flags to print detailed messages for debugging (setting True will activate!)
verbose = False
debug = False

# Set up the flag to send status emails (setting to True will send the status emails!)
notifyStatus = False

# Set up the mountStorage flag to mount G Drive for storing files (setting True will mount the drive!)
mountStorage = False

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = True

In [4]:
# Colab-Specific Setup - Mount Google Drive for storing downloaded files
if (mountStorage):
    from google.colab import drive
    drive.mount('/content/gdrive')

In [5]:
# Set up the email notification function
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    if any(value is None for value in (sender, receiver, gateway, smtpuser, password)):
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP(gateway, 587)
    server.starttls()
    server.login(smtpuser, password)
    server.send_message(msg)
    server.quit()
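
The function above reads its configuration, builds the message, and sends it in one step, so it can only be exercised against a live SMTP gateway. As a minimal sketch, the message construction could be split out so it can be checked offline. The helper name `build_notification` and the example addresses are assumptions for illustration and are not part of the original notebook.

```python
from email.message import EmailMessage

def build_notification(msg_text, sender, receiver):
    """Hypothetical helper: assemble the status email without sending it."""
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    return msg

# Placeholder addresses, used only to show the call
msg = build_notification("Phase 0 completed!", "sender@example.com", "receiver@example.com")
print(msg['Subject'])
```

The returned object can be handed to `server.send_message(msg)` exactly as in `email_notify`.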

In [6]:
def access_url(url):
    # Creating an html document from the URL
    uastring = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0"
    headers = {'User-Agent': uastring}
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to retrieve the next URL.")
    sleep(waitTime)
    try:
        s = requests.Session()
        resp = s.get(url, headers=headers)
        if (debug): print(resp.text)
        # requests does not raise HTTPError on its own; this turns 4xx/5xx responses into one
        resp.raise_for_status()
    except HTTPError as e:
        print('The server could not serve up the web page!')
        sys.exit("Script processing cannot continue!!!")
    except ConnectionError as e:
        print('The server could not be reached due to connection issues!')
        sys.exit("Script processing cannot continue!!!")

    # Reaching this point means the request succeeded, so the caller never receives None
    print('Successfully accessed the web page: ' + url)
    bsoup_obj = BeautifulSoup(resp.text, 'lxml')
    return(bsoup_obj)
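
Because `access_url` ends the script on a connection failure, a single transient error late in a long run discards the remaining work. One possible mitigation, sketched here under assumptions, is to retry a request a few times with a growing delay. The helper `fetch_with_retries` and the stand-in `flaky_fetch` are hypothetical names, and the sketch assumes the wrapped function raises an exception on failure rather than calling `sys.exit` as `access_url` does.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, base_delay=2.0, sleeper=time.sleep):
    """Hypothetical helper: call fetch(url), doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # in real use, narrow this to the requests exceptions
            if attempt == attempts - 1:
                raise  # out of attempts, so let the caller see the error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay} seconds.")
            sleeper(delay)

# Demonstration with a stand-in fetcher that fails twice and then succeeds
calls = []
def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated outage")
    return "<html></html>"

delays = []  # record the delays instead of actually sleeping
result = fetch_with_retries(flaky_fetch, "https://example.com", sleeper=delays.append)
```

Passing the sleep function as a parameter keeps the demonstration instant; a real run would leave the default `time.sleep` in place.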

In [7]:
def download_to_local(doc_path):
    local_file = doc_path.split('/')[-1]
    # Skip the wait and the request entirely when the file is already on disk
    if os.path.isfile(local_file):
        print('Skipped existing file: ' + local_file)
        return
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to retrieve " + doc_path)
    sleep(waitTime)
    with requests.get(doc_path, stream=True) as r:
        # Stop on 4xx/5xx responses instead of saving an error page as a PDF
        r.raise_for_status()
        with open(local_file, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    print('Downloaded file: ' + local_file)
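
Since `download_to_local` skips any file that already exists, a download interrupted partway through leaves a truncated file that later runs will treat as complete. A common remedy, sketched below, is to write to a temporary name and rename it only after the copy finishes. The helper `save_atomically` and the `.part` suffix are assumptions for illustration; in the notebook the source object would be `r.raw`.

```python
import io
import os
import shutil
import tempfile

def save_atomically(src, dest_path):
    """Hypothetical helper: copy a file-like object to dest_path via a temporary file."""
    part_path = dest_path + '.part'
    try:
        with open(part_path, 'wb') as f:
            shutil.copyfileobj(src, f)
        # os.replace is an atomic rename when both paths are on the same filesystem
        os.replace(part_path, dest_path)
    except BaseException:
        # Remove the partial file so it is never mistaken for a finished download
        if os.path.exists(part_path):
            os.remove(part_path)
        raise

# Demonstration with in-memory bytes standing in for a streamed response
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'paper.pdf')
save_atomically(io.BytesIO(b'%PDF-1.5 demo bytes'), demo_path)
```

With this approach the `os.path.isfile` check only ever sees files that were written in full.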

In [8]:
def download_to_gdrive(doc_path):
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to retrieve " + doc_path)
    sleep(waitTime)
    local_file = doc_path.split('/')[-1]
    gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
    dest_file = gdrivePrefix + local_file
    with requests.get(doc_path, stream=True) as r:
        # Stop on 4xx/5xx responses instead of saving an error page as a PDF
        r.raise_for_status()
        with open(dest_file, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    print('Downloaded file: ' + dest_file)
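
Both download functions derive the file name with `doc_path.split('/')[-1]`, which keeps any query string attached to the URL and relies on string concatenation for the destination. The sketch below shows one way to derive the name from the URL path only and to join it with a destination folder. The helper names `local_name_from_url` and `destination_for`, and the example URL, are hypothetical.

```python
import posixpath
from pathlib import Path
from urllib.parse import unquote, urlparse

def local_name_from_url(url):
    """Hypothetical helper: file name taken from the URL path, ignoring any query string."""
    path = urlparse(url).path
    return unquote(posixpath.basename(path))

def destination_for(url, dest_dir='.'):
    """Hypothetical helper: full destination path for a download inside dest_dir."""
    return Path(dest_dir) / local_name_from_url(url)

# Made-up URL with a query string, used only to show the behavior
example_url = "https://papers.nips.cc/paper/0000-example-paper.pdf?download=1"
name = local_name_from_url(example_url)
dest = destination_for(example_url, '/content/gdrive/My Drive/Colab_Downloads')
print(name)
```

A single download function that accepts the destination folder as a parameter could then replace the two near-identical functions above.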

In [9]:
if (notifyStatus): email_notify("Phase 0 Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Section 1. Perform the Scraping and Processing

In [10]:
if (notifyStatus): email_notify("Phase 1 Perform the Scraping and Processing has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [11]:
# Specifying the URL of the web page to be scraped
starting_url = "https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019"
website_url = "https://papers.nips.cc"

In [12]:
# Access and test the starting URL
web_page = access_url(starting_url)

# Gather all document links from the starting URL (Two Levels)
collection = web_page.find_all('li')
i = 0

# Delete the first li element as it is not a regular list item we need
collection.pop(0)

for item in collection:
    if (verbose): print(item)
    doc_title = item.a.string
    author_group = item.find_all('a', {'class':'author'})
    author_list = []
    for each_author in author_group:
        author_list.append(each_author.string)
    # Separate the author names with commas so they do not run together
    authors = ', '.join(author_list)
    doc_link = website_url + item.a['href']

    doc_page = access_url(doc_link)
    artifact_list = doc_page.find('div', class_="main wrapper clearfix").find_all('a')
    for artifact_item in artifact_list:
        if artifact_item.string == "[PDF]":
            doc_path = website_url + artifact_item['href']
            if (executeDownload):
                if (mountStorage):
                    download_to_gdrive(doc_path)
                else:
                    download_to_local(doc_path)
    i = i + 1

print('Finished finding all available documents on the web pages!')
print('Number of documents processed:', i)

Waiting 3 seconds to retrieve the next URL.
Successfully accessed the web page: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019
Waiting 4 seconds to retrieve the next URL.
Successfully accessed the web page: https://papers.nips.cc/paper/8296-multimodal-model-agnostic-meta-learning-via-task-aware-modulation
Waiting 3 seconds to retrieve https://papers.nips.cc/paper/8296-multimodal-model-agnostic-meta-learning-via-task-aware-modulation.pdf
Downloaded file: 8296-multimodal-model-agnostic-meta-learning-via-task-aware-modulation.pdf
Waiting 5 seconds to retrieve the next URL.
Successfully accessed the web page: https://papers.nips.cc/paper/8297-vilbert-pretraining-task-agnostic-visiolinguistic-representations-for-vision-and-language-tasks
Waiting 3 seconds to retrieve https://papers.nips.cc/paper/8297-vilbert-pretraining-task-agnostic-visiolinguistic-representations-for-vision-and-language-tasks.pdf
Downloaded file: 8297-vilbert-pretraining-task-agnostic-
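
The loop in cell 12 reads a title, a set of author links, and a paper link from each `li` element. The parsing step can be tried offline against a small HTML fragment, as in the sketch below. The fragment, the paper titles, and the author names are invented stand-ins for the real page structure, `parse_listing` is a hypothetical helper, and the built-in `html.parser` is used instead of `lxml` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Invented fragment that imitates the assumed shape of the listing page
LISTING_HTML = """
<ul>
  <li><a href="/paper/1-first-paper">First Paper</a>
      <a href="/author/a-1" class="author">Ada Author</a>,
      <a href="/author/b-2" class="author">Bob Builder</a></li>
  <li><a href="/paper/2-second-paper">Second Paper</a>
      <a href="/author/c-3" class="author">Cy Coder</a></li>
</ul>
"""

def parse_listing(html, base_url):
    """Hypothetical helper: return one record per list item that links to a paper."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for item in soup.find_all('li'):
        title_link = item.find('a')
        # Skip list items without a link instead of assuming a fixed position
        if title_link is None or not title_link.has_attr('href'):
            continue
        authors = [a.get_text(strip=True) for a in item.find_all('a', class_='author')]
        records.append({
            'title': title_link.get_text(strip=True),
            'authors': ', '.join(authors),
            'link': base_url + title_link['href'],
        })
    return records

records = parse_listing(LISTING_HTML, 'https://papers.nips.cc')
for record in records:
    print(record['title'], '|', record['authors'], '|', record['link'])
```

Checking each item for a link is one alternative to dropping the first `li` by position, which depends on the page layout staying the same.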

In [13]:
if (notifyStatus): email_notify("Phase 1 Perform the Scraping and Processing completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [14]:
print('Total time for the script:', datetime.now() - startTimeScript)

Total time for the script: 2:58:23.972607
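
Much of this run time is likely the deliberate waiting: each paper triggers one wait in `access_url` and one in a download function, and `randint(2,5)` averages 3.5 seconds. The sketch below estimates the total time spent sleeping. The function `expected_wait` is hypothetical, and the paper count of 1,400 is a placeholder chosen for illustration, not a figure taken from the notebook.

```python
from datetime import timedelta

def expected_wait(num_papers, min_wait=2, max_wait=5):
    """Hypothetical helper: expected total sleep time for a full scraping run."""
    # randint(min_wait, max_wait) is inclusive and uniform, so its mean is the midpoint
    mean_wait = (min_wait + max_wait) / 2
    # One wait for the starting page, then two per paper (paper page and PDF)
    num_waits = 1 + 2 * num_papers
    return timedelta(seconds=num_waits * mean_wait)

estimate = expected_wait(1400)  # placeholder paper count
print(estimate)
```

Under these assumptions the sleeping alone accounts for a run measured in hours, with the actual transfers adding to it.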
