# Web Scraping of NeurIPS Conference Proceedings 2020 Using Python and BeautifulSoup
### David Lowe
### November 26, 2021

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: The Conference on Neural Information Processing Systems (NeurIPS) covers a wide range of topics in neural information processing, with research spanning biological, technological, mathematical, and theoretical applications. Neural information processing is a field that benefits from a combined view of biological, physical, mathematical, and computational sciences. This web scraping script automatically traverses the proceedings index page, visits the abstract page of each listed paper, and collects the link to each paper's PDF document.

Starting URL: https://proceedings.neurips.cc/paper/2020

## Section 0. Prepare Environment

In [1]:
import numpy as np
import pandas as pd
import os
import sys
import shutil
from datetime import datetime
import requests
from requests.exceptions import HTTPError
from requests.exceptions import ConnectionError
from bs4 import BeautifulSoup
from random import randint
from time import sleep
# from selenium import webdriver

In [2]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose and debug flags to print detailed messages for debugging (setting True will activate!)
verbose = True
debug = False

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = False

In [3]:
def access_url(url):
    # Create a BeautifulSoup object from the HTML document at the URL
    uastring = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0"
    headers = {'User-Agent': uastring}
    # Add a random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to retrieve the next URL.")
    sleep(waitTime)
    try:
        s = requests.Session()
        resp = s.get(url, headers=headers)
        if (debug): print(resp.text)
        # requests raises HTTPError only when raise_for_status() is called
        resp.raise_for_status()
    except HTTPError as e:
        print('The server could not serve up the web page!')
        sys.exit("Script processing cannot continue!!!")
    except ConnectionError as e:
        print('The server could not be reached due to connection issues!')
        sys.exit("Script processing cannot continue!!!")

    # Reaching this point means the request succeeded, so parse the page
    print('Successfully accessed the web page: ' + url)
    bsoup_obj = BeautifulSoup(resp.text, 'lxml')
    return bsoup_obj
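
A note on the `HTTPError` handler in `access_url`: `requests` does not raise `HTTPError` by itself when a server answers with a 4xx or 5xx status. The exception is raised only when `raise_for_status()` is called on the response. The minimal sketch below illustrates this without touching the network by building a `Response` object by hand. The helper name `status_is_error` and the hand-built response are illustrative assumptions, not part of the original script.

```python
import requests
from requests.exceptions import HTTPError


def status_is_error(status_code):
    """Return True if raise_for_status() raises HTTPError for this status code."""
    # Hypothetical stand-in for a real server reply; no request is sent.
    resp = requests.Response()
    resp.status_code = status_code
    try:
        resp.raise_for_status()
    except HTTPError:
        return True
    return False


print(status_is_error(200))  # prints: False
print(status_is_error(404))  # prints: True
print(status_is_error(503))  # prints: True
```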

In [4]:
def download_to_local(doc_path):
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to retrieve " + doc_path)
    sleep(waitTime)
    local_file = doc_path.split('/')[-1]
    if not os.path.isfile(local_file):
        with requests.get(doc_path, stream=True) as r:
            with open(local_file, 'wb') as f:
                shutil.copyfileobj(r.raw, f)
        print('Downloaded file:', local_file, '\n')
    else:
        print('Skipped existing file:', local_file, '\n')
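
`download_to_local` derives the local file name by taking the text after the last `/` in the URL. That works for the NeurIPS paper links shown in this notebook, but it would also keep a query string or fragment if a URL carried one. A slightly more defensive sketch, using only the standard library, might look like this. The helper name `local_name_from_url` and the `example.com` URL are hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def local_name_from_url(url):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    path = urlsplit(url).path
    return unquote(posixpath.basename(path))


paper_url = "https://proceedings.neurips.cc/paper/2020/file/0004d0b59e19461ff126e3a08a814c33-Paper.pdf"
print(local_name_from_url(paper_url))
# prints: 0004d0b59e19461ff126e3a08a814c33-Paper.pdf

# Hypothetical URL with a query string and fragment
print(local_name_from_url("https://example.com/docs/report.pdf?download=1#page=2"))
# prints: report.pdf
```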

## Section 1. Perform the Scraping and Processing

In [5]:
# Specifying the URL of the desired web page to be scraped
starting_url = "https://proceedings.neurips.cc/paper/2020"
website_url = "https://proceedings.neurips.cc"

In [6]:
# Access and test the starting URL
web_page = access_url(starting_url)

# Gather all document links from the starting URL (Two Levels)
collection = web_page.find_all('li')

# Delete the first two `li` elements as they are not the paper items we need
collection.pop(0)
collection.pop(0)

print('Number of items collected:', len(collection))

Waiting 3 seconds to retrieve the next URL.
Successfully accessed the web page: https://proceedings.neurips.cc/paper/2020
Number of items collected: 1898
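
Removing the first two `li` elements by position depends on the page layout staying exactly as it was when this notebook ran. One possible alternative is to keep only the list items whose link points at an abstract page. The sketch below runs on a small hand-written HTML sample (the markup, titles, authors, and hash values are made up for illustration) and uses Python's built-in `html.parser`, so it does not need `lxml`.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the listing structure; not real page content.
sample_html = """
<ul>
  <li><a href="/">Home</a></li>
  <li><a href="/paper/2020">2020</a></li>
  <li><a href="/paper/2020/hash/aaaa-Abstract.html">First Title</a> <i>Author One, Author Two</i></li>
  <li><a href="/paper/2020/hash/bbbb-Abstract.html">Second Title</a> <i>Author Three</i></li>
</ul>
"""

base_url = "https://proceedings.neurips.cc"
soup = BeautifulSoup(sample_html, "html.parser")

papers = []
for item in soup.find_all("li"):
    link = item.find("a", href=True)
    # Keep only items that link to an abstract page
    if link is None or not link["href"].endswith("-Abstract.html"):
        continue
    author_tag = item.find("i")
    papers.append({
        "title": link.get_text(strip=True),
        "authors": author_tag.get_text(strip=True) if author_tag else "",
        "url": urljoin(base_url, link["href"]),
    })

print(len(papers))
for paper in papers:
    print(paper["title"], "|", paper["authors"], "|", paper["url"])
```

Filtering on the `-Abstract.html` suffix matches the abstract links that appear in this notebook's output. Whether it holds for other years of the proceedings is an assumption that would need checking.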


In [7]:
i = 0
for item in collection:
    if debug: print(item)
    doc_title = item.a.string
    authors = item.i.string
    doc_link = website_url + item.a['href']
    if verbose: print('Found abstract link at:', doc_link)
    if verbose: print('Found collaborating authors:', authors)

    doc_page = access_url(doc_link)
    artifact_list = doc_page.find_all('a', class_="btn")
    for artifact_item in artifact_list:
        if artifact_item.string == "Paper":
            doc_path = website_url + artifact_item['href']
            if verbose: print('Found paper at:', doc_path, '\n')
            if (executeDownload):
                download_to_local(doc_path)
    i = i + 1

print('Finished finding all available documents on the web pages!')
print('Number of abstracts processed:', i)

Found abstract link at: https://proceedings.neurips.cc/paper/2020/hash/0004d0b59e19461ff126e3a08a814c33-Abstract.html
Found collaborating authors: Seongmin Ok
Waiting 4 seconds to retrieve the next URL.
Successfully accessed the web page: https://proceedings.neurips.cc/paper/2020/hash/0004d0b59e19461ff126e3a08a814c33-Abstract.html
Found paper at: https://proceedings.neurips.cc/paper/2020/file/0004d0b59e19461ff126e3a08a814c33-Paper.pdf 

Found abstract link at: https://proceedings.neurips.cc/paper/2020/hash/00482b9bed15a272730fcb590ffebddd-Abstract.html
Found collaborating authors: Sangnie Bhardwaj, Ian Fischer, Johannes Ballé, Troy Chinen
Waiting 4 seconds to retrieve the next URL.
Successfully accessed the web page: https://proceedings.neurips.cc/paper/2020/hash/00482b9bed15a272730fcb590ffebddd-Abstract.html
Found paper at: https://proceedings.neurips.cc/paper/2020/file/00482b9bed15a272730fcb590ffebddd-Paper.pdf 

Found abstract link at: https://proceedings.neurips.cc/paper/2020/hash/

In [8]:
print('Total time for the script:', (datetime.now() - startTimeScript))

Total time for the script: 1:57:10.263904
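
The reported run time is largely explained by the random delays built into `access_url`. As a rough back-of-the-envelope check, assume that the index page and each of the 1898 collected items were requested exactly once and that no files were downloaded (`executeDownload` was `False`). Under those assumptions, the expected waiting time alone can be estimated as follows:

```python
from datetime import timedelta

pages_requested = 1 + 1898          # index page plus one abstract page per item (assumed)
mean_wait_seconds = (2 + 5) / 2     # randint(2, 5) is uniform over 2, 3, 4, 5

expected_wait = timedelta(seconds=pages_requested * mean_wait_seconds)
observed_total = timedelta(hours=1, minutes=57, seconds=10)

print("Expected time spent waiting:", expected_wait)
print("Remaining time for requests and parsing:", observed_total - expected_wait)
```

Under these assumptions the waits account for roughly 1 hour 51 minutes of the total, leaving only a few minutes for the HTTP requests and parsing. The timer also covers any idle time between running the cells, so the split is approximate.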
