# Web Scraping of File Download Using Python and BeautifulSoup
### David Lowe
### April 14, 2019

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python code uses the BeautifulSoup module.

INTRODUCTION: Occasionally I need to download a batch of documents from a single web page without clicking each download link one at a time. This web scraping script traverses the web page, collects all links to the PDF documents, and downloads each PDF as part of the scraping process.

Starting URL: https://www.knime.com/about/events/knime-spring-summit-2019-berlin

## Loading Libraries and Packages

In [1]:
# import numpy as np
# import pandas as pd
import os
import sys
import shutil
import smtplib
from email.message import EmailMessage
from datetime import datetime
import urllib.request
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
from random import randint
from time import sleep

startTimeScript = datetime.now()

## Setting up the email notification function

In [2]:
def email_notify(msg_text):
    sender = os.environ.get('MAIL_USERNAME')
    password = os.environ.get('MAIL_PASSWORD')
    receiver = os.environ.get('RECEIVER_MAIL')
    if sender is None or password is None or receiver is None:
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.starttls()
    server.login(sender, password)
    server.send_message(msg)
    server.quit()
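
The function above reads its configuration, builds the message, and talks to the SMTP server in one place. As a minimal sketch, the message-building step can be pulled out so it can be checked without a network connection. The helper names `timestamped` and `build_notification` and the example addresses are hypothetical and not part of the original notebook; the subject line and timestamp format are copied from it.

```python
from datetime import datetime
from email.message import EmailMessage

# Timestamp format copied from the notebook's email_notify() calls
STAMP_FORMAT = '%a %B %d, %Y %I:%M:%S %p'


def timestamped(text, now=None):
    """Append a formatted timestamp to text (hypothetical helper)."""
    if now is None:
        now = datetime.now()
    return text + " " + now.strftime(STAMP_FORMAT)


def build_notification(msg_text, sender, receiver):
    """Build the notification email without sending it (hypothetical helper)."""
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    return msg


# Example addresses are placeholders, not real accounts
note = build_notification(
    timestamped("The web scraping process has begun!", datetime(2019, 4, 14, 15, 30, 0)),
    "sender@example.com",
    "receiver@example.com",
)
print(note['Subject'])
print(note.get_content())
```

With this split, `email_notify` would only need to call `build_notification` and hand the result to `server.send_message`, and the three timestamped calls in the notebook could share `timestamped`.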

In [3]:
email_notify("The web scraping process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Setting up the necessary parameters

In [4]:
# Specifying the URL of the web page to be scraped
starting_url = "https://www.knime.com/about/events/knime-spring-summit-2019-berlin"
website_url = "https://www.knime.com"

# Creating an html document from the URL
uastring = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36"
req = urllib.request.Request(
    starting_url,
    data=None,
    headers={'User-Agent': uastring}
)

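try:
    session = urllib.request.urlopen(req)
except HTTPError as e:
    # The server responded, but with an error status (for example 404 or 500)
    print('The server could not serve up the web page! HTTP status:', e.code)
    sys.exit("Script Processing Aborted!!!")
except URLError as e:
    # No response at all (DNS failure, refused connection, and so on)
    print('The server could not be reached! Reason:', e.reason)
    sys.exit("Script Processing Aborted!!!")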

webpage = BeautifulSoup(session.read(), 'html5lib')
# Navigating to a missing tag returns None instead of raising, so test for it explicitly
main_title = webpage.body.h2 if webpage.body is not None else None
if main_title is None:
    print('Page title could not be found - Might indicate problems!')
    sys.exit("Script Processing Aborted!!!")
else:
    print('Successfully accessed the web page: ' + starting_url)

Successfully accessed the web page: https://www.knime.com/about/events/knime-spring-summit-2019-berlin
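
The request and error handling above can also be wrapped in a function, which keeps the `HTTPError` and `URLError` cases in one place and adds a timeout. The following is a minimal sketch under stated assumptions: `fetch_html`, its `opener` parameter, the stand-in openers, and the placeholder URL are all hypothetical and not part of the original notebook. The `opener` argument exists only so the function can be exercised without network access.

```python
import io
import urllib.request
from urllib.error import HTTPError, URLError


def fetch_html(url, user_agent, opener=urllib.request.urlopen, timeout=30):
    """Return (html_bytes, error_text); exactly one of the two is None.

    Hypothetical helper. `opener` is injectable so the function can be
    tried out without contacting a real server.
    """
    req = urllib.request.Request(url, headers={'User-Agent': user_agent})
    try:
        with opener(req, timeout=timeout) as resp:
            return resp.read(), None
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        return None, "HTTP error " + str(e.code)
    except URLError as e:
        return None, "Unreachable: " + str(e.reason)


# Stand-in openers that simulate a good response and two kinds of failure
def ok_opener(req, timeout=None):
    return io.BytesIO(b"<html><body><h2>Title</h2></body></html>")


def missing_opener(req, timeout=None):
    raise HTTPError(req.full_url, 404, "Not Found", None, io.BytesIO(b""))


def offline_opener(req, timeout=None):
    raise URLError("no route to host")


page_url = "https://www.example.com/events"  # placeholder URL
print(fetch_html(page_url, "demo-agent", ok_opener))
print(fetch_html(page_url, "demo-agent", missing_opener))
print(fetch_html(page_url, "demo-agent", offline_opener))
```

Returning an error string instead of calling `sys.exit` inside the helper leaves the decision to abort with the caller, which is one possible design choice rather than the only reasonable one.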


## Performing the Scraping and Processing

In [5]:
email_notify("The web page loading and item extraction process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [6]:
# Gather all links to the documents
collection = webpage.find_all('a')

for item in collection:
    if item.string == "[PDF]":
        # Adding random wait time so we do not hammer the website needlessly
        waitTime = randint(3,8)
        print("Waiting " + str(waitTime) + " seconds to retrieve the next document.")
        sleep(waitTime)
        doc_path = item['href']
        dest_file = os.path.basename(doc_path)
        print('Downloading document: ' + doc_path + " as " + dest_file)
        with urllib.request.urlopen(doc_path) as in_resp, open(dest_file, 'wb') as out_file:
            shutil.copyfileobj(in_resp, out_file)

Waiting 5 seconds to retrieve the next document.
Downloading document: https://files.knime.com/sites/default/files/01_opening_mb_final.pdf as 01_opening_mb_final.pdf
Waiting 3 seconds to retrieve the next document.
Downloading document: https://files.knime.com/sites/default/files/02_software-edu-partner-community_tg_final.pdf as 02_software-edu-partner-community_tg_final.pdf
Waiting 6 seconds to retrieve the next document.
Downloading document: https://files.knime.com/sites/default/files/20190320_Knime_WienEnergie.pdf as 20190320_Knime_WienEnergie.pdf
Waiting 4 seconds to retrieve the next document.
Downloading document: https://files.knime.com/sites/default/files/20190320_knime_summit_tobiphilippskatja_matthias_final_novideo.pdf as 20190320_knime_summit_tobiphilippskatja_matthias_final_novideo.pdf
Waiting 7 seconds to retrieve the next document.
Downloading document: https://files.knime.com/sites/default/files/02a_01_rewe_feature_generation_at_knime_spring_summit_2019_canbepublished.pdf as
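
The loop passes each `href` straight to `urlopen` and names the local file with `os.path.basename`. That works for this page because every link in the output is an absolute URL with no query string. As a minimal sketch of a more defensive variant, the hypothetical helper `resolve_download` below (not part of the original notebook) resolves relative links against the page URL and derives the file name from the path component only. The relative link in the second call is invented for illustration.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlparse


def resolve_download(page_url, href):
    """Return (absolute_url, local_filename) for a link found on page_url.

    Hypothetical helper: urljoin() leaves absolute links untouched and
    resolves relative ones against the page URL.
    """
    absolute_url = urljoin(page_url, href)
    # Use only the path component so query strings and fragments are dropped
    path = urlparse(absolute_url).path
    filename = posixpath.basename(unquote(path))
    return absolute_url, filename


page = "https://www.knime.com/about/events/knime-spring-summit-2019-berlin"

# An absolute link, as seen in the notebook output
print(resolve_download(page, "https://files.knime.com/sites/default/files/01_opening_mb_final.pdf"))

# Hypothetical relative link with a query string, to show the difference
print(resolve_download(page, "/sites/default/files/slides.pdf?download=1"))
```

Inside the loop, the two returned values would take the place of `doc_path` and `dest_file`.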

In [7]:
email_notify("The web scraping process has completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [8]:
print('Total time for the script:', datetime.now() - startTimeScript)

Total time for the script: 0:04:02.337139
