# Web Scraping of Machine Learning Mastery Articles Using BeautifulSoup
### David Lowe
### August 26, 2022

SUMMARY: This project aims to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: Dr. Jason Brownlee’s Machine Learning Mastery hosts its tutorial lessons at https://machinelearningmastery.com/blog. The purpose of this exercise is to practice web scraping by gathering the blog entries from Machine Learning Mastery’s web pages. This iteration of the script automatically traverses the web pages to capture all articles and store the captured information in a CSV output file for sorting and filtering.

Starting URL: https://machinelearningmastery.com/blog

## Task 1. Prepare Environment

In [1]:
import sys
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
from random import randint
from time import sleep

In [2]:
# Set the starting time for calculating script duration
startTimeScript = datetime.now()

# Specifying the website and browsing parameters
WEBSITE_URL = 'https://machinelearningmastery.com/blog'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36 Edg/104.0.1293.54'

## Task 2. Get Blog Article Headings

In [3]:
def access_web_page(url=WEBSITE_URL):
    headers = {'user-agent': USER_AGENT}
    try:
        sess = requests.Session()
        resp = sess.get(url, headers=headers)
        # print(resp.text)
    except requests.HTTPError as e:
        print('The server could not serve up the web page with error code:', e)
        sys.exit("Script processing cannot continue!!!")
    except requests.ConnectionError as e:
        print('The server could not be reached due to connection issues with error code:', e)
        sys.exit("Script processing cannot continue!!!")

    if resp.status_code==requests.codes.ok :
        print('Successfully accessed the web page: ' + url)
        web_page = BeautifulSoup(resp.text, 'lxml')
        return web_page

In [4]:
df_article = pd.DataFrame(columns=['title','url','date','author','category','summary'])
i = 0  # Number of article entries processed
j = 0  # Number of web pages processed
article_page = access_web_page()
done = False
max_pages = 121

while not done :
    article_list = article_page.find_all('article')
    for article_item in article_list:
        blog_title = article_item.header.h2.string
        blog_url = article_item.a["href"]
        blog_date = article_item.abbr.string
        blog_author = article_item.find(class_="fn").a.string
        blog_category = article_item.find(class_="categories").a.string
        blog_summary = article_item.section.p.string
        df_article.loc[i] = [blog_title, blog_url, blog_date,
                             blog_author,blog_category, blog_summary]
        i += 1
        # print(blog_title, blog_url, blog_date, blog_author,blog_category, blog_summary)

    j += 1
    print('Number of web pages processed so far:', j)
    print('Number of articles processed so far:', i)
    next_page_css = article_page.find(class_="next page-numbers")

    if (next_page_css is not None) and (j <= max_pages) :
        next_page_url = next_page_css["href"]
        # Adding a random wait time for accessing web pages
        waitTime = randint(4, 9)
        print("Waiting " + str(waitTime) + " seconds before retrieving the next URL.")
        sleep(waitTime)
        web_page = access_web_page(next_page_url)
    else :
        done = True

Successfully accessed the web page: https://machinelearningmastery.com/blog
Number of web pages processed so far: 1
Number of articles processed so far: 10
Waiting 8 seconds before retrieving the next URL.
Successfully accessed the web page: https://machinelearningmastery.com/blog/page/2/
Number of web pages processed so far: 2
Number of articles processed so far: 20
Waiting 7 seconds before retrieving the next URL.
Successfully accessed the web page: https://machinelearningmastery.com/blog/page/2/
Number of web pages processed so far: 3
Number of articles processed so far: 30
Waiting 6 seconds before retrieving the next URL.
Successfully accessed the web page: https://machinelearningmastery.com/blog/page/2/
Number of web pages processed so far: 4
Number of articles processed so far: 40
Waiting 9 seconds before retrieving the next URL.
Successfully accessed the web page: https://machinelearningmastery.com/blog/page/2/
Number of web pages processed so far: 5
Number of articles processed

### Task 3 Organize Data and Produce Outputs

In [5]:
# out_file = df_article.to_csv(index=False)
out_file = df_article.to_csv(index=False, line_terminator = '\r')
with open('py_webscraping_beautifulsoup_mlmastery_articles.csv', 'w', encoding="utf-8") as f:
    f.write(out_file)
print('Total number of articles found:', len(df_article))

Total number of articles found: 1220


In [6]:
print ('Total time for the script:',(datetime.now() - startTimeScript))

Total time for the script: 0:13:08.970881
