# Web Scraping of Machine Learning Mastery Blog Using Python and Selenium
### David Lowe
### January 24, 2020

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Dr. Jason Brownlee’s Machine Learning Mastery hosts its tutorial lessons at https://machinelearningmastery.com/blog. This exercise practices web scraping by gathering the blog entries from Machine Learning Mastery’s web pages. This iteration of the script follows the pagination links automatically, captures every blog entry it finds, and stores the captured information in a JSON output file.

Starting URL: https://machinelearningmastery.com/blog/

## Section 1. Prepare Environment

In [1]:
import os
import shutil
import smtplib
import sys
import pandas as pd
import requests
from email.message import EmailMessage
from datetime import datetime
from random import randint
from time import sleep
# from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

In [2]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose and debug flags to print detailed messages for debugging (setting True will activate!)
verbose = False
debug = False

# Set up the flag to send status emails (setting to True will send the status emails!)
notifyStatus = False

# Set up the flag to write the output to a JSON document (setting to True will create the document!)
writeToJSON = True

# Set up the mountStorage flag to mount G Drive for storing files (setting True will mount the drive!)
mountStorage = False

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = False

In [3]:
# Colab-Specific Setup - Mount Google Drive for storing downloaded files
if (mountStorage):
    from google.colab import drive
    drive.mount('/content/gdrive')

In [4]:
# Set up the email notification function
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    if any(value is None for value in (sender, receiver, gateway, smtpuser, password)):
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP(gateway, 587)
    server.starttls()
    server.login(smtpuser, password)
    server.send_message(msg)
    server.quit()
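
The function above both builds and sends the message, so it cannot be exercised without a live SMTP gateway. A minimal sketch of separating the two steps follows. The helper name `build_notification` and the example addresses are assumptions made for illustration and are not part of the original script.

```python
from email.message import EmailMessage


def build_notification(msg_text, sender, receiver):
    """Build the notification email without sending it (hypothetical helper)."""
    msg = EmailMessage()
    msg.set_content(msg_text)
    # Same subject line that email_notify() uses
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    return msg


# Example addresses are placeholders, not values from the original script
example = build_notification("Phase 1 completed!", "sender@example.com", "receiver@example.com")
print(example['Subject'])
```

With this split, the message content can be inspected locally, and only the `smtplib` calls depend on the environment variables.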

In [5]:
def download_to_local(doc_path):
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to retrieve " + doc_path)
    sleep(waitTime)
    local_file = doc_path.split('/')[-1]
    if not os.path.isfile(local_file):
        with requests.get(doc_path, stream=True) as r:
            with open(local_file, 'wb') as f:
                shutil.copyfileobj(r.raw, f)
        print('Downloaded file: ' + local_file)
    else:
        print('Skipped existing file: ' + local_file)
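
The download helpers derive the local file name with `doc_path.split('/')[-1]`, which keeps any query string and returns an empty name for URLs that end in a slash. A minimal sketch of a more defensive derivation, using only the standard library, is shown here. The helper name `local_name_from_url`, the `default` parameter, and the example URLs are assumptions for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def local_name_from_url(doc_url, default="index.html"):
    """Return a file name taken from the URL path, ignoring query and fragment."""
    url_path = urlparse(doc_url).path
    # URL paths always use forward slashes, so posixpath is used on every OS
    name = posixpath.basename(unquote(url_path))
    # Fall back to a default when the URL ends with a slash
    return name or default


print(local_name_from_url("https://example.com/files/report%20v2.pdf?download=1"))
print(local_name_from_url("https://example.com/blog/"))
```

Whether percent-decoding the name is desirable depends on the target file system, so treat this as one possible choice rather than a required one.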

In [6]:
def download_to_gdrive(doc_path):
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to retrieve " + doc_path)
    sleep(waitTime)
    local_file = doc_path.split('/')[-1]
    gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
    dest_file = gdrivePrefix + local_file
    with requests.get(doc_path, stream=True) as r:
        with open(dest_file, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    print('Downloaded file: ' + dest_file)

In [7]:
if (notifyStatus): email_notify("Phase 1 Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Section 2. Perform the Scraping and Processing

In [8]:
if (notifyStatus): email_notify("Phase 2 Perform the Scraping and Processing has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [9]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['blog_title','blog_url','blog_date','blog_author','blog_summary'])

In [10]:
# Specifying the URL of the desired web page to be scraped
website_url = "https://machinelearningmastery.com"
blog_page_url = website_url + "/blog/"

In [11]:
# Launch the Firefox browser and initialize the loop state
firefox_options = Options()
firefox_options.headless = False
home_page_browser = webdriver.Firefox(options=firefox_options)
num_blogs = 0
done = False

In [12]:
while not done:
    # Gather all blog links from the blog page
    print('Attempting to access the web page: ' + blog_page_url)
    home_page_browser.get(blog_page_url)
    main_page_container = home_page_browser.find_element_by_class_name('col-left')
    collection = main_page_container.find_elements_by_tag_name('article')

    for item in collection:
        blog_title = '[Not Found]'
        blog_link = '[Not Found]'
        blog_date = '[Not Found]'
        blog_author = '[Not Found]'
        blog_summary = '[Not Found]'
        if (debug): print(item)
        post_element = item.find_element_by_tag_name('a')
        blog_title = post_element.get_attribute('title')
        if (verbose): print("Found title:", blog_title)
        blog_link = post_element.get_attribute('href')
        if (verbose): print("Found link:", blog_link)
        post_meta = item.find_element_by_class_name('post-meta')
        blog_author = post_meta.find_element_by_tag_name('a').text
        if (verbose): print("Found author:", blog_author)
        blog_date = post_meta.find_element_by_tag_name('abbr').get_attribute('title')
        if (verbose): print("Found date:", blog_date)
        text_section = item.find_element_by_tag_name('section')
        blog_summary = text_section.find_element_by_tag_name('p').text
        if (verbose): print("Found summary:", blog_summary)
        df.loc[num_blogs] = [blog_title, blog_link, blog_date, blog_author, blog_summary]
        num_blogs = num_blogs + 1

    try:
        next_page = home_page_browser.find_element_by_css_selector("a[class='next page-numbers']")
        blog_page_url = next_page.get_attribute('href')
        # Adding random wait time so we do not hammer the website needlessly
        waitTime = randint(3,5)
        print("Waiting " + str(waitTime) + " seconds to retrieve the next URL.")
        sleep(waitTime)
    except Exception:
        # No "next" link was found, so this was the last page
        done = True

Attempting to access the web page: https://machinelearningmastery.com/blog/
Waiting 5 seconds to retrieve the next URL.
Attempting to access the web page: https://machinelearningmastery.com/blog/page/2/
Waiting 3 seconds to retrieve the next URL.
Attempting to access the web page: https://machinelearningmastery.com/blog/page/3/
Waiting 5 seconds to retrieve the next URL.
Attempting to access the web page: https://machinelearningmastery.com/blog/page/4/
Waiting 3 seconds to retrieve the next URL.
Attempting to access the web page: https://machinelearningmastery.com/blog/page/5/
Waiting 5 seconds to retrieve the next URL.
Attempting to access the web page: https://machinelearningmastery.com/blog/page/6/
Waiting 4 seconds to retrieve the next URL.
Attempting to access the web page: https://machinelearningmastery.com/blog/page/7/
Waiting 3 seconds to retrieve the next URL.
Attempting to access the web page: https://machinelearningmastery.com/blog/page/8/
Waiting 3 seconds to retrieve the n
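
The loop's stopping rule (keep following the `next page-numbers` link until a page has none) can be illustrated without launching a browser. The sketch below swaps Selenium for the standard library's `html.parser` and runs against hypothetical HTML fragments held in a dictionary, so the class `NextLinkParser`, the functions `find_next_page` and `crawl`, and the sample pages are all assumptions for illustration. It also matches class tokens individually, which is slightly more lenient than the exact-attribute CSS selector used in the script.

```python
from html.parser import HTMLParser


class NextLinkParser(HTMLParser):
    """Remember the href of the first <a> carrying both 'next' and 'page-numbers'."""

    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_url is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "next" in classes and "page-numbers" in classes:
            self.next_url = attr_map.get("href")


def find_next_page(html_text):
    """Return the next-page URL found in html_text, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html_text)
    return parser.next_url


def crawl(pages, start_url):
    """Visit pages (a dict of url -> html) until no next link exists."""
    visited = []
    url = start_url
    # The 'not in visited' check guards against a pagination cycle
    while url is not None and url not in visited:
        visited.append(url)
        url = find_next_page(pages.get(url, ""))
    return visited


# Hypothetical two-page site used only to exercise the loop
sample_pages = {
    "/blog/": '<a class="next page-numbers" href="/blog/page/2/">Next</a>',
    "/blog/page/2/": '<a class="prev page-numbers" href="/blog/">Previous</a>',
}
print(crawl(sample_pages, "/blog/"))
```

Returning `None` on the last page makes the end of pagination an ordinary value rather than an exception, which keeps unrelated errors from silently ending the crawl.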

In [13]:
home_page_browser.quit()
print('Finished finding all available blog pages!')
print('Total blogs processed:', num_blogs)

Finished finding all available blog pages!
Total blogs processed: 868


In [14]:
if (writeToJSON):
    out_file = df.to_json(orient='records')
    with open('web-scraping-py-selenium-mlmastery-blog-take4.json', 'w') as f:
        f.write(out_file)
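
With `orient='records'`, pandas writes a JSON array containing one object per DataFrame row, keyed by column name. A minimal sketch of reading such a file back with the standard `json` module follows. The helper names `load_blog_records` and `count_by_author` are assumptions for illustration and do not appear in the original script.

```python
import json


def load_blog_records(json_path):
    """Load the list of blog records written with orient='records'."""
    with open(json_path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_by_author(records):
    """Tally how many records each blog_author value has."""
    counts = {}
    for record in records:
        author = record.get("blog_author", "[Not Found]")
        counts[author] = counts.get(author, 0) + 1
    return counts
```

Assuming the output file from the cell above exists, `count_by_author(load_blog_records('web-scraping-py-selenium-mlmastery-blog-take4.json'))` would return a dictionary of post counts per author.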

In [15]:
if (notifyStatus): email_notify("Phase 2 Perform the Scraping and Processing completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [16]:
print('Total time for the script:', datetime.now() - startTimeScript)

Total time for the script: 0:07:15.801865
