# Web Scraping of NeurIPS Proceedings Using Python and BeautifulSoup
### David Lowe
### December 23, 2018

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python scraping code uses the BeautifulSoup module to parse the pages.

INTRODUCTION: The Neural Information Processing Systems Conference (NeurIPS) hosts its collection of papers at https://papers.nips.cc/. This web scraping script automatically traverses the listing page and the individual paper pages of the 2015 conference, collects the links to the PDF documents, and downloads those documents as part of the scraping process. Along the way it records each paper's title, authors, page link, and abstract.

Starting URL: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-28-2015
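
The traversal described above can be sketched on a small inline sample rather than the live site. The HTML below is a hypothetical stand-in for the listing page (the real markup may differ), `parse_listing` is an illustrative name that does not appear in the notebook, and the built-in `html.parser` is used here instead of `html5lib` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first <li> is not a paper, mirroring the listing page.
SAMPLE_LISTING = """
<ul>
  <li><a href="/book/example">Book link (not a paper)</a></li>
  <li>
    <a href="/paper/0000-example-paper">An Example Paper</a>
    <a href="/author/a-author" class="author">A. Author</a>,
    <a href="/author/b-author" class="author">B. Author</a>
  </li>
</ul>
"""

def parse_listing(html, base_url):
    """Return one record per paper entry found in the listing HTML."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find_all("li")[1:]  # skip the first entry, which is not a paper
    records = []
    for item in items:
        # Author links carry class="author"; the first link is the paper title.
        authors = [a.get_text(strip=True) for a in item.find_all("a", class_="author")]
        records.append({
            "title": item.a.get_text(strip=True),
            "authors": ", ".join(authors),
            "doc_link": base_url + item.a["href"],
        })
    return records

records = parse_listing(SAMPLE_LISTING, "https://papers.nips.cc")
print(records)
```

Each record's `doc_link` is the page the script would visit next to look for the PDF link and the abstract.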

## Loading Libraries and Packages

In [29]:
import numpy as np
import pandas as pd
import os
import sys
import shutil
import smtplib
from email.message import EmailMessage
from datetime import datetime
import urllib.request
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
from random import randint
from time import sleep

startTimeScript = datetime.now()

## Setting up the email notification function

In [30]:
def email_notify(msg_text):
    sender = "luozhi2488@gmail.com"
    receiver = "dave@contactdavidlowe.com"
    with open('../email_credential.txt') as f:
        # The with-block closes the file; strip the trailing newline from the password
        password = f.readline().strip()
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.starttls()
    server.login(sender, password)
    server.send_message(msg)
    server.quit()

In [31]:
email_notify("The web scraping process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Setting up the necessary parameters

In [32]:
# Specifying the URL of the desired web page to be scraped
starting_url = "https://papers.nips.cc/book/advances-in-neural-information-processing-systems-28-2015"
website_url = "https://papers.nips.cc"

# Creating an html document from the URL
uastring = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36"
req = urllib.request.Request(
    starting_url,
    data=None,
    headers={'User-Agent': uastring}
)

try:
    session = urllib.request.urlopen(req)
except HTTPError as e:
    print('The server could not serve up the page!')
    print(e)
    sys.exit(1)
except URLError as e:
    print('The server could not be reached!')
    print(e)
    sys.exit(1)

try:
    webpage = BeautifulSoup(session.read(), 'html5lib')
    main_title = webpage.body.h2
except AttributeError as e:
    print('Page title could not be found - Might indicate problem!')
    sys.exit(1)
else:
    print('Successfully accessed the web page: ' + main_title.string)

Successfully accessed the web page: Advances in Neural Information Processing Systems 28 (NIPS 2015)
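
The request and error-handling pattern in the cell above is repeated for every paper page, so it could be wrapped in one helper. This is a minimal sketch, not the notebook's code: `fetch_page`, the shortened `UA_STRING`, and the `timeout` value are assumptions. It returns `None` on failure instead of calling `sys.exit(1)`, which leaves the decision to stop or skip to the caller.

```python
import urllib.request
from urllib.error import HTTPError, URLError

# Assumption: a shortened browser-style User-Agent; the notebook uses a longer string.
UA_STRING = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

def fetch_page(url, user_agent=UA_STRING, timeout=30):
    """Return the raw bytes served at `url`, or None if the request fails."""
    req = urllib.request.Request(url, data=None, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as session:
            return session.read()
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("The server could not serve up the page!")
        print(e)
    except URLError as e:
        print("The server could not be reached!")
        print(e)
    return None
```

The returned bytes can be passed straight to `BeautifulSoup(...)`, as the notebook does with `session.read()`.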


## Performing the Scraping and Processing

In [33]:
email_notify("The web page loading and item extraction process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

# Gather all links to the document pages and delete the very first element which is not a paper listing
collection = webpage.find_all('li')
collection.pop(0)

# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['title','authors','doc_link', 'abstract'])
i = 0

for item in collection:
    doc_title = "[Not Found]"
    authors = "[Not Found]"
    doc_link = "[Not Found]"
    abstract = "[Not Found]"

    doc_title = item.a.string
    author_group = item.find_all('a', {'class':'author'})
    author_list = []
    for each_author in author_group:
        author_list.append(each_author.string)
    # Separate the names with commas so they do not run together
    authors = ', '.join(author_list)
    doc_link = website_url + item.a['href']

    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(3,8)
    sleep(waitTime)
    print("Waited " + str(waitTime) + " seconds to retrieve the next URL.")
    req = urllib.request.Request(
        doc_link,
        data=None,
        headers={'User-Agent': uastring}
    )

    try:
        session = urllib.request.urlopen(req)
    except HTTPError as e:
        print('The server could not serve up the page!')
        print(e)
        sys.exit(1)
    except URLError as e:
        print('The server could not be reached!')
        print(e)
        sys.exit(1)

    try:
        docpage = BeautifulSoup(session.read(), 'html5lib')
        docpage_title = docpage.body.h2
    except AttributeError as e:
        print('Page title could not be found - Might indicate problem!')
        sys.exit(1)
        
    artifact_list = docpage.find('div', class_="main wrapper clearfix").find_all('a')
    for artifact_item in artifact_list:
        if artifact_item.string == "[PDF]":
            doc_path = website_url + artifact_item['href']
            dest_file = os.path.basename(doc_path)
            print('Grabbing document: ' + doc_path + " as " + dest_file)
            with urllib.request.urlopen(doc_path) as in_resp, open(dest_file, 'wb') as out_file:
                shutil.copyfileobj(in_resp, out_file)

    abstract = docpage.find('p', class_="abstract").string
    df.loc[i] = [doc_title, authors, doc_link, abstract]
    i = i + 1

Waited 3 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5677-double-or-nothing-multiplicative-incentive-mechanisms-for-crowdsourcing.pdf as 5677-double-or-nothing-multiplicative-incentive-mechanisms-for-crowdsourcing.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5941-learning-with-symmetric-label-noise-the-importance-of-being-unhinged.pdf as 5941-learning-with-symmetric-label-noise-the-importance-of-being-unhinged.pdf
Waited 6 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/6019-algorithmic-stability-and-uniform-generalization.pdf as 6019-algorithmic-stability-and-uniform-generalization.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/6035-adaptive-low-complexity-sequential-inference-for-dirichlet-process-mixture-models.pdf as 6035-adaptive-low-complexity-sequential-inference-for-dirichlet-process-mixture-models.pdf
Waited 6

Waited 8 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5996-optimal-ridge-detection-using-coverage-risk.pdf as 5996-optimal-ridge-detection-using-coverage-risk.pdf
Waited 3 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5742-top-k-multiclass-svm.pdf as 5742-top-k-multiclass-svm.pdf
Waited 5 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5807-policy-evaluation-using-the-return.pdf as 5807-policy-evaluation-using-the-return.pdf
Waited 5 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5998-orthogonal-nmf-through-subspace-exploration.pdf as 5998-orthogonal-nmf-through-subspace-exploration.pdf
Waited 6 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5930-stochastic-online-greedy-learning-with-semi-bandit-feedbacks.pdf as 5930-stochastic-online-greedy-learning-with-semi-bandit-feedbacks.pdf
Waited 4 seconds to retrieve t

Waited 8 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf as 5782-character-level-convolutional-networks-for-text-classification.pdf
Waited 7 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5662-robust-feature-sample-linear-discriminant-analysis-for-brain-disorders-diagnosis.pdf as 5662-robust-feature-sample-linear-discriminant-analysis-for-brain-disorders-diagnosis.pdf
Waited 7 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5721-black-box-optimization-of-noisy-functions-with-unknown-smoothness.pdf as 5721-black-box-optimization-of-noisy-functions-with-unknown-smoothness.pdf
Waited 5 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5906-recovering-communities-in-the-general-stochastic-block-model-without-knowing-the-parameters.pdf as 5906-recovering-communities-in-the-general-stocha

Waited 6 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5981-bounding-the-cost-of-search-based-lifted-inference.pdf as 5981-bounding-the-cost-of-search-based-lifted-inference.pdf
Waited 8 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5890-gradient-free-hamiltonian-monte-carlo-with-efficient-kernel-exponential-families.pdf as 5890-gradient-free-hamiltonian-monte-carlo-with-efficient-kernel-exponential-families.pdf
Waited 5 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5931-linear-multi-resource-allocation-with-semi-bandit-feedback.pdf as 5931-linear-multi-resource-allocation-with-semi-bandit-feedback.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5785-unsupervised-learning-by-program-synthesis.pdf as 5785-unsupervised-learning-by-program-synthesis.pdf
Waited 6 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc

Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5845-deep-visual-analogy-making.pdf as 5845-deep-visual-analogy-making.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5704-matrix-completion-from-fewer-entries-spectral-detectability-and-rank-estimation.pdf as 5704-matrix-completion-from-fewer-entries-spectral-detectability-and-rank-estimation.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5833-online-learning-with-adversarial-delays.pdf as 5833-online-learning-with-adversarial-delays.pdf
Waited 8 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5838-multi-layer-feature-reduction-for-tree-structured-group-lasso-via-hierarchical-projection.pdf as 5838-multi-layer-feature-reduction-for-tree-structured-group-lasso-via-hierarchical-projection.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.n

Waited 8 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5798-learning-with-group-invariant-features-a-kernel-perspective.pdf as 5798-learning-with-group-invariant-features-a-kernel-perspective.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5720-regularized-em-algorithms-a-unified-framework-and-statistical-guarantees.pdf as 5720-regularized-em-algorithms-a-unified-framework-and-statistical-guarantees.pdf
Waited 6 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5745-distributionally-robust-logistic-regression.pdf as 5745-distributionally-robust-logistic-regression.pdf
Waited 7 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/6005-adaptive-stochastic-optimization-from-sets-to-paths.pdf as 6005-adaptive-stochastic-optimization-from-sets-to-paths.pdf
Waited 6 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/571

Waited 7 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5800-tree-guided-mcmc-inference-for-normalized-random-measure-mixture-models.pdf as 5800-tree-guided-mcmc-inference-for-normalized-random-measure-mixture-models.pdf
Waited 6 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5897-streaming-min-max-hypergraph-partitioning.pdf as 5897-streaming-min-max-hypergraph-partitioning.pdf
Waited 3 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5818-collaboratively-learning-preferences-from-ordinal-data.pdf as 5818-collaboratively-learning-preferences-from-ordinal-data.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5769-biologically-inspired-dynamic-textures-for-probing-motion-perception.pdf as 5769-biologically-inspired-dynamic-textures-for-probing-motion-perception.pdf
Waited 3 seconds to retrieve the next URL.
Grabbing document: https://papers

Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5712-spectral-norm-regularization-of-orthonormal-representations-for-graph-transduction.pdf as 5712-spectral-norm-regularization-of-orthonormal-representations-for-graph-transduction.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf as 5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5706-mixed-robustaverage-submodular-partitioning-fast-algorithms-guarantees-and-applications.pdf as 5706-mixed-robustaverage-submodular-partitioning-fast-algorithms-guarantees-and-applications.pdf
Waited 3 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5676-tractable-learning-for-complex-probability-queries.pdf as 5676-tractable-learning-for-co

Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/6026-revenue-optimization-against-strategic-buyers.pdf as 6026-revenue-optimization-against-strategic-buyers.pdf
Waited 3 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5851-deep-convolutional-inverse-graphics-network.pdf as 5851-deep-convolutional-inverse-graphics-network.pdf
Waited 8 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/6017-sparse-and-low-rank-tensor-decomposition.pdf as 6017-sparse-and-low-rank-tensor-decomposition.pdf
Waited 3 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5730-minimax-time-series-prediction.pdf as 5730-minimax-time-series-prediction.pdf
Waited 5 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5713-differentially-private-learning-of-structured-discrete-distributions.pdf as 5713-differentially-private-learning-of-structured-discr

Waited 5 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5859-action-conditional-video-prediction-using-deep-networks-in-atari-games.pdf as 5859-action-conditional-video-prediction-using-deep-networks-in-atari-games.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5990-a-pseudo-euclidean-iteration-for-optimal-recovery-in-noisy-ica.pdf as 5990-a-pseudo-euclidean-iteration-for-optimal-recovery-in-noisy-ica.pdf
Waited 7 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5752-distributed-submodular-cover-succinctly-summarizing-massive-data.pdf as 5752-distributed-submodular-cover-succinctly-summarizing-massive-data.pdf
Waited 5 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5808-community-detection-via-measure-space-embedding.pdf as 5808-community-detection-via-measure-space-embedding.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing docume

Grabbing document: https://papers.nips.cc/paper/5929-fast-and-memory-optimal-low-rank-matrix-approximation.pdf as 5929-fast-and-memory-optimal-low-rank-matrix-approximation.pdf
Waited 6 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5989-learnability-of-influence-in-networks.pdf as 5989-learnability-of-influence-in-networks.pdf
Waited 6 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5909-learning-causal-graphs-with-small-interventions.pdf as 5909-learning-causal-graphs-with-small-interventions.pdf
Waited 3 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5841-information-theoretic-lower-bounds-for-convex-optimization-with-erroneous-oracles.pdf as 5841-information-theoretic-lower-bounds-for-convex-optimization-with-erroneous-oracles.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5877-fixed-length-poisson-mrf-adding-dependencies-to-the-mul

Waited 6 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5900-lifted-inference-rules-with-constraints.pdf as 5900-lifted-inference-rules-with-constraints.pdf
Waited 5 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5899-gradient-estimation-using-stochastic-computation-graphs.pdf as 5899-gradient-estimation-using-stochastic-computation-graphs.pdf
Waited 5 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5672-model-based-relative-entropy-stochastic-search.pdf as 5672-model-based-relative-entropy-stochastic-search.pdf
Waited 4 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5947-semi-supervised-learning-with-ladder-networks.pdf as 5947-semi-supervised-learning-with-ladder-networks.pdf
Waited 3 seconds to retrieve the next URL.
Grabbing document: https://papers.nips.cc/paper/5675-embedding-inference-for-structured-multilabel-prediction.pdf as 5675-embedding
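
The log above shows each PDF saved under the last path segment of its URL. A minimal sketch of that naming step follows, assuming a hypothetical helper `pdf_destination` and an optional target directory `dest_dir` (neither appears in the notebook). Parsing the URL first keeps any query string out of the file name, which a plain `os.path.basename` on the full URL would not do.

```python
import os
from urllib.parse import urlparse

def pdf_destination(doc_path, dest_dir="."):
    """Map a PDF URL to a local path named after the URL's last path segment."""
    # urlparse(...).path drops the scheme, host, and any ?query or #fragment part.
    file_name = os.path.basename(urlparse(doc_path).path)
    return os.path.join(dest_dir, file_name)

print(pdf_destination("https://papers.nips.cc/paper/5742-top-k-multiclass-svm.pdf"))
```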

## Organizing Data and Producing Outputs

In [34]:
out_file = df.to_json(orient='records')
with open('web-scraping-py-bsoup-nips-proceedings.json', 'w') as f:
    f.write(out_file)
email_notify("The web scraping process has completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
print('Total time for the script:', (datetime.now() - startTimeScript))

Total time for the script: 0:42:00.960649
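
Because `to_json(orient='records')` writes a JSON array with one object per row, the output file can be checked with the standard library alone. The sketch below is illustrative: `load_records` and `summarize` are hypothetical helpers, and the check for missing abstracts is an assumed sanity test, not part of the original script.

```python
import json

def load_records(path):
    """Read the JSON array written by the scraper into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def summarize(records):
    """Count the papers and how many of them lack an abstract."""
    missing = [r for r in records if not r.get("abstract")]
    return {"papers": len(records), "missing_abstract": len(missing)}

# Example usage, assuming the notebook has already produced the file:
# print(summarize(load_records("web-scraping-py-bsoup-nips-proceedings.json")))
```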
