# Web Scraping of Daines Analytics Blog Using BeautifulSoup
### David Lowe
### May 26, 2019

SUMMARY: The purpose of this project is to practice web scraping by gathering specific pieces of information from a website. The web scraping code was written in Python and leveraged the BeautifulSoup module.

INTRODUCTION: Daines Analytics hosts its blog at dainesanalytics.blog. This exercise gathers the blog entries from the Daines Analytics RSS feed. This iteration of the script automatically traverses the paged RSS feed, one page at a time, to capture all blog entries.

Starting URLs: https://dainesanalytics.blog/feed or https://dainesanalytics.blog/feed/?paged=1
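
The paged traversal described above can be sketched without touching the network. This is a minimal illustration, not the script itself: `iter_feed_pages`, `fake_status`, and the `max_pages` safety cap are hypothetical names introduced here, and the three-page feed is made up for the demonstration. The sketch assumes that the feed answers with a non-200 status once the page number runs past the last page.

```python
from typing import Callable, Iterator

FEED_BASE = "https://dainesanalytics.blog/feed/?paged="  # base URL from the text above


def iter_feed_pages(fetch_status: Callable[[str], int], max_pages: int = 1000) -> Iterator[str]:
    """Yield paged feed URLs until the server stops answering with HTTP 200."""
    page_num = 1
    while page_num <= max_pages:  # safety cap: an assumption, not part of the script
        url = FEED_BASE + str(page_num)
        if fetch_status(url) != 200:
            break  # first non-200 page is treated as the end of the feed
        yield url
        page_num += 1


def fake_status(url: str) -> int:
    """Hypothetical stand-in for a real HTTP call: pretend the feed has three pages."""
    page = int(url.rsplit("=", 1)[1])
    return 200 if page <= 3 else 404


pages = list(iter_feed_pages(fake_status))
print(pages)
```

Passing the status lookup in as a function keeps the stopping rule separate from the HTTP library, so the loop can be exercised offline before it is pointed at the real feed.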

## Loading Libraries and Packages

In [1]:
import numpy as np
import pandas as pd
import os
import shutil
import smtplib
import sys
from email.message import EmailMessage
from datetime import datetime
import requests
from requests.exceptions import HTTPError
from requests.exceptions import ConnectionError
from bs4 import BeautifulSoup
from random import randint
from time import sleep

startTimeScript = datetime.now()

## Setting up the email notification function

In [2]:
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    if sender is None or receiver is None or gateway is None or smtpuser is None or password is None:
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    # The context manager sends QUIT and closes the connection even if sending fails
    with smtplib.SMTP(gateway, 587) as server:
        server.starttls()
        server.login(smtpuser, password)
        server.send_message(msg)

In [3]:
email_notify("The web scraping process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
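
One way to check the notification text before any SMTP credentials are configured is to build the message separately from sending it. The sketch below is an illustration under assumptions: `build_notification` is a hypothetical helper that is not part of the script, and the example.com addresses and the fixed timestamp are placeholders.

```python
from datetime import datetime
from email.message import EmailMessage


def build_notification(msg_text: str, sender: str, receiver: str) -> EmailMessage:
    """Build the notification message without opening an SMTP connection."""
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    return msg


# Placeholder timestamp and addresses, used only for this demonstration
stamp = datetime(2019, 5, 26, 14, 30, 0).strftime('%a %B %d, %Y %I:%M:%S %p')
note = build_notification("The web scraping process has begun! " + stamp,
                          "sender@example.com", "receiver@example.com")
print(note)
```

Printing the message shows the headers and body exactly as they would be handed to `send_message`, which makes formatting mistakes in the timestamp easy to spot.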

## Setting up the necessary parameters

In [4]:
# Set up the verbose flag to print detailed messages for debugging (only YES will activate!)
verbose = "!YES"

In [5]:
# Specifying the base URL of the paged RSS feed to be scraped
rssURL = "https://dainesanalytics.blog/feed/?paged="
pageNum = 1

# Setting a browser-like User-Agent header to send with every request
uastring = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36"
headers={'User-Agent': uastring}

## Performing the Scraping and Processing

In [6]:
email_notify("The page loading and item extraction process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [7]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['blog_title','author_name','post_date','blog_url','blog_text'])
i = 0

In [8]:
done = False
# Reuse a single session so connections are pooled across all requests
s = requests.Session()

while not done :
    websiteURL = rssURL + str(pageNum)
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to process URL: " + websiteURL)
    sleep(waitTime)

    try:
        resp = s.get(websiteURL, headers=headers)
        if (verbose=='YES'): print(resp.text)
    except HTTPError as e:
        print('The server could not serve up the web page!')
        sys.exit("Script processing cannot continue!!!")
    except ConnectionError as e:
        print('The server could not be reached due to connection issues!')
        sys.exit("Script processing cannot continue!!!")

    if (resp.status_code==requests.codes.ok):
        print('Successfully accessed the RSS page: ' + websiteURL)
        webPage = BeautifulSoup(resp.text, 'lxml-xml')

        blog_listing = webPage.find_all("item")
        if (verbose=='YES'): print(blog_listing)

        for blog_item in blog_listing :
            blog_title  = "[Not Found]"
            author_name = "[Not Found]"
            post_date   = "[Not Found]"
            blog_url    = "[Not Found]"
            blog_text   = "[Not Found]"

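            # Overwrite a default only when the tag exists and has text, so a missing tag cannot raise AttributeError
            title_tag   = blog_item.find('title')
            creator_tag = blog_item.find('dc:creator')
            date_tag    = blog_item.find('pubDate')
            link_tag    = blog_item.find('link')
            if title_tag is not None and title_tag.string:     blog_title  = str(title_tag.string)
            if creator_tag is not None and creator_tag.string: author_name = str(creator_tag.string)
            if date_tag is not None and date_tag.string:       post_date   = str(date_tag.string)
            if link_tag is not None and link_tag.string:       blog_url    = str(link_tag.string)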

            if blog_url != "[Not Found]" :
                # Adding random wait time so we do not hammer the website needlessly
                waitTime = randint(2,5)
                print("Waiting " + str(waitTime) + " seconds to process next blog URL: " + blog_url)
                sleep(waitTime)
                try:
                    resp = s.get(blog_url, headers=headers)
                except HTTPError as e:
                    print("Unable to retrieve the blog page via the URL!")
                    sys.exit("Script processing cannot continue!!!")
                except ConnectionError as e:
                    print('Unable to retrieve the blog page due to connection issues!')
                    sys.exit("Script processing cannot continue!!!")
                else:
                    if (resp.status_code==requests.codes.ok):
                        print('Successfully accessed the blog page: ' + blog_url)
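                        blogPage = BeautifulSoup(resp.text, 'lxml')
                        content_div = blogPage.find("div", class_="entry-content")
                        # Keep the "[Not Found]" default if the page has no entry-content div
                        if content_div is not None:
                            blog_text = content_div.get_text()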

                df.loc[i] = [blog_title, author_name, post_date, blog_url, blog_text]
                if (verbose=='YES'): print(blog_title, author_name, post_date, blog_url, blog_text)
                i = i + 1

        if ((pageNum % 5)==0) :
            email_notify("Finished parsing RSS page: " + websiteURL + " at "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

        pageNum = pageNum + 1
    else:
        print("No more pages to retrieve. The RSS feed processing has completed!")
        done = True

Waiting 2 seconds to process URL: https://dainesanalytics.blog/feed/?paged=1
Successfully accessed the RSS page: https://dainesanalytics.blog/feed/?paged=1
Waiting 3 seconds to process next blog URL: https://dainesanalytics.blog/2019/05/20/binary-classification-model-for-heart-disease-study-using-python-take-1/
Successfully accessed the blog page: https://dainesanalytics.blog/2019/05/20/binary-classification-model-for-heart-disease-study-using-python-take-1/
Waiting 5 seconds to process next blog URL: https://dainesanalytics.blog/2019/05/19/web-scraping-templates-for-python-with-beautifulsoup/
Successfully accessed the blog page: https://dainesanalytics.blog/2019/05/19/web-scraping-templates-for-python-with-beautifulsoup/
Waiting 5 seconds to process next blog URL: https://dainesanalytics.blog/2019/05/17/binary-classification-model-for-diabetes-readmission-prediction-using-r-take-3/
Successfully accessed the blog page: https://dainesanalytics.blog/2019/05/17/binary-classification-model
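
The blog-text extraction step can be tried offline against a small HTML snippet. This is a minimal sketch under assumptions: `extract_entry_text` and `SAMPLE_HTML` are hypothetical names introduced here, the markup is invented for illustration, and the built-in `html.parser` is used instead of the `lxml` parser from the script so that the example needs no extra parser package. It returns the same `[Not Found]` placeholder the script uses when a page has no `entry-content` div.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics a blog page with an entry-content div
SAMPLE_HTML = """
<html><body>
  <div class="site-header">Daines Analytics</div>
  <div class="entry-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


def extract_entry_text(html: str, default: str = "[Not Found]") -> str:
    """Return the text of the entry-content div, or the default if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="entry-content")
    if content is None:
        return default
    # Strip each text fragment and join the fragments with single spaces
    return content.get_text(separator=" ", strip=True)


found_text = extract_entry_text(SAMPLE_HTML)
missing_text = extract_entry_text("<html><body><p>No post here</p></body></html>")
print(found_text)
print(missing_text)
```

Checking for `None` before calling `get_text()` matters because `find` returns `None` when nothing matches, and calling a method on `None` would stop the whole run with an `AttributeError`.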

## Organizing Data and Producing Outputs

In [9]:
out_file = df.to_json(orient='records')
with open('web-scraping-py-bsoup-dainesanalytics-blog.json', 'w') as f:
    f.write(out_file)
print('Total number of records written to file:', len(df))
email_notify("The web scraping process has completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
print('Total time for the script:', (datetime.now() - startTimeScript))

Total number of records written to file: 215
Total time for the script: 0:14:27.763399
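
With `orient='records'`, the output file holds a JSON array with one object per blog entry. The sketch below shows one way to sanity-check such a file using only the standard library. It is an illustration, not output from the script: the two sample records, the temporary file, and the `EXPECTED_KEYS` name are made up, and only the five column names are taken from the dataframe above.

```python
import json
import os
import tempfile

EXPECTED_KEYS = {'blog_title', 'author_name', 'post_date', 'blog_url', 'blog_text'}

# Made-up records in the same shape as the dataframe rows
sample_records = [
    {'blog_title': 'Post A', 'author_name': 'Author', 'post_date': 'Mon, 20 May 2019',
     'blog_url': 'https://example.com/a', 'blog_text': 'Text A'},
    {'blog_title': 'Post B', 'author_name': 'Author', 'post_date': 'Sun, 19 May 2019',
     'blog_url': 'https://example.com/b', 'blog_text': '[Not Found]'},
]

# Write the sample to a temporary file so the real output file is never touched
path = os.path.join(tempfile.mkdtemp(), 'sample-blog.json')
with open(path, 'w') as f:
    json.dump(sample_records, f)

with open(path) as f:
    loaded = json.load(f)

# Every record should carry exactly the five expected fields
complete = all(set(record) == EXPECTED_KEYS for record in loaded)
# Entries whose text was never filled in still carry the placeholder
missing_text = [r['blog_url'] for r in loaded if r['blog_text'] == '[Not Found]']

print(len(loaded), 'records loaded; all fields present:', complete)
print('Entries without blog text:', missing_text)
```

Counting the placeholder entries gives a quick measure of how many blog pages could not be parsed during the run.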
