# Web Scraping of MyBidMatch Entries Using BeautifulSoup
### David Lowe
### December 1, 2021

Main URL: http://www.mybidmatch.com/go?sub=55AB9731-0E3E-4BC0-B0E3-56EF81DA7FD4

## Task 1. Prepare Environment

In [None]:
!pip install python-dotenv PyMySQL

Collecting python-dotenv
  Downloading python_dotenv-0.19.2-py2.py3-none-any.whl (17 kB)
Collecting PyMySQL
  Downloading PyMySQL-1.0.2-py3-none-any.whl (43 kB)
Installing collected packages: python-dotenv, PyMySQL
Successfully installed PyMySQL-1.0.2 python-dotenv-0.19.2


In [None]:
import pandas as pd
import os
import smtplib
import sys
import pymysql
import requests
from requests.exceptions import HTTPError
from requests.exceptions import ConnectionError
from email.message import EmailMessage
from datetime import date, datetime, timedelta
from random import randint
from time import sleep
from bs4 import BeautifulSoup
from dotenv import load_dotenv

In [None]:
startTimeScript = datetime.now()

## Task 2. Setting up the Basic Parameters and Functions

In [None]:
# Set up the verbose flag to print detailed messages for debugging (setting True will activate!)
verbose = False

# Set up the writeToDB flag to write records into the database (setting True will record!)
writeToDB = True

# Set up the writeJSON flag to write records into a JSON document (setting True will record!)
writeJSON = False

# The addDelay setting controls whether to add delays to slow down the scraping
addDelay = True

# Set up the parent directory location for loading the dotenv files
from google.colab import drive
drive.mount('/content/gdrive')
gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
env_path = '/content/gdrive/My Drive/Colab Notebooks/'
dotenv_path = env_path + "python_script.env"
load_dotenv(dotenv_path=dotenv_path)

Mounted at /content/gdrive


True

In [None]:
# Set up the target date so that only articles posted on that date are collected
# (set targetDate to None to process every available posting date)
targetDate = datetime.now().date()
# targetDate = date(2021, 11, 27)
processAll = targetDate is None
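
The way the target date and the `processAll` flag combine into a filter can be sketched as a small predicate. This is a minimal sketch rather than part of the original script: the helper name `should_process` is hypothetical, and the sample date string assumes the `'%A, %b %d, %Y'` format that the scraping loop parses.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical helper (not in the original script): decide whether a
# listing row should be processed, given the optional target date.
def should_process(number_articles: int, posting_date: date,
                   target_date: Optional[date]) -> bool:
    process_all = target_date is None  # no target date means "take everything"
    return number_articles > 0 and (process_all or posting_date == target_date)

# Sample date text, assumed to follow the '%A, %b %d, %Y' format
posting = datetime.strptime("Tuesday, Nov 30, 2021", "%A, %b %d, %Y").date()
print(should_process(3, posting, date(2021, 11, 30)))  # same day
print(should_process(3, posting, date(2021, 12, 1)))   # different day
print(should_process(3, posting, None))                # no target date
```

Keeping the rule in one function makes it easy to check the three cases (matching date, non-matching date, no target date) without touching the website.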

In [None]:
# Specifying the URL of the desired web page to be scraped
websiteURL = "http://www.mybidmatch.com"
startingURL = websiteURL + "/go?sub=55AB9731-0E3E-4BC0-B0E3-56EF81DA7FD4"

# Set up the User-Agent request header that is sent with every page request
uastring = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0"
headers={'User-Agent': uastring}
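
The script builds page links by concatenating the base URL with each `href`. A hedged alternative is `urllib.parse.urljoin` from the standard library, which also copes with links that are already absolute. The `href` values below are made-up examples, not links taken from the site.

```python
from urllib.parse import urljoin

website_url = "http://www.mybidmatch.com"  # same base URL as in the cell above

# A site-relative href (made-up identifier), similar to those in the listing tables
relative_link = urljoin(website_url, "/go?doc=EXAMPLE-ID")
# An absolute href is returned unchanged instead of being double-prefixed
absolute_link = urljoin(website_url, "https://example.com/notice")

print(relative_link)
print(absolute_link)
```

Plain concatenation works as long as every `href` starts with `/`; `urljoin` only matters if the site ever mixes in absolute links.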

In [None]:
# Define the function for storing the scraped records
def storeDB(posting_date, source_tag, agency_name, fsg_tag, article_title, search_keywords, notice_heading, department_url, notice_url, notice_text):
    print("Inserting record:", posting_date, '|', source_tag, '|', agency_name, '|', fsg_tag, '|', article_title, '|', search_keywords, '|', notice_heading, '|', department_url, '|', notice_url)
    try:
        cur.execute("INSERT INTO bsoup_mybidmatch_notices (posting_date, source_tag, agency_name, fsg_tag, article_title, search_keywords, notice_heading, department_url, notice_url, notice_text) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", (posting_date, source_tag, agency_name, fsg_tag, article_title, search_keywords, notice_heading, department_url, notice_url, notice_text))
        cur.connection.commit()
        print("Successfully inserted the record into the database.")
    except pymysql.MySQLError as e:
        # Catch database errors only and report the cause instead of hiding it
        print("Failed to insert the record into the database:", e)
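
The parameterized-insert pattern used by `storeDB` can be tried without a MySQL server. The sketch below swaps in the standard library's `sqlite3` with an in-memory database, so it is a stand-in and not the notebook's actual setup: the table name, the shortened column list, and the helper `store_record` are all hypothetical, and SQLite uses `?` placeholders where PyMySQL uses `%s`.

```python
import sqlite3
from datetime import date

# Stand-in for the MySQL connection: an in-memory SQLite database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical, shortened table layout (the real schema is not shown in this notebook)
cur.execute("CREATE TABLE notices (posting_date TEXT, agency_name TEXT, article_title TEXT)")

def store_record(cursor, posting_date, agency_name, article_title):
    """Insert one record; values are bound as parameters, never string-formatted."""
    try:
        cursor.execute(
            "INSERT INTO notices (posting_date, agency_name, article_title) VALUES (?, ?, ?)",
            (posting_date.isoformat(), agency_name, article_title),
        )
        cursor.connection.commit()
        return True
    except sqlite3.Error as err:
        print("Failed to insert the record:", err)
        return False

ok = store_record(cur, date(2021, 11, 30), "DEPT OF DEFENSE", "ProModel Simulation Consultation")
row_count = cur.execute("SELECT COUNT(*) FROM notices").fetchone()[0]
print(ok, row_count)
```

Binding values as parameters lets the driver handle quoting, which matters here because scraped titles and notice text can contain quotes and other special characters.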

In [None]:
if writeToDB:
    # Set up the database connection strings and environment
    db_host = os.environ.get('DB_HOST')
    db_user = os.environ.get('DB_USER')
    db_pass = os.environ.get('DB_PASS')
    db_name = os.environ.get('DB_NAME1')
    print("Trying to open a connection to host", db_host, "as user", db_user, "for database", db_name)

    # Connect to the database
    try:
        conn = pymysql.connect(host=db_host, user=db_user, password=db_pass, db=db_name, charset='utf8')
        cur = conn.cursor()
        cur.execute("USE %s" % (db_name))
        print("Successfully opened a connection to host", db_host, "as user", db_user, "for database", db_name)
    except pymysql.MySQLError as e:
        print("Unable to open a connection to host", db_host, "as user", db_user, "for database", db_name, "due to:", e)
        writeToDB = False

Trying to open a connection to host ec2-44-232-76-68.us-west-2.compute.amazonaws.com as user scrapinguser for database webscraping
Successfully opened a connection to host ec2-44-232-76-68.us-west-2.compute.amazonaws.com as user scrapinguser for database webscraping
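
If a key is missing from the dotenv file, `os.environ.get` returns `None` and the problem only surfaces when the connection attempt fails. A minimal sketch of an up-front check follows; the helper `load_db_settings` and the sample values are hypothetical, and only the four key names come from the cell above.

```python
import os

# Key names as used in the cell above
REQUIRED_KEYS = ("DB_HOST", "DB_USER", "DB_PASS", "DB_NAME1")

def load_db_settings(env=os.environ):
    """Return the four settings as a dict, or raise if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing database settings: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}

# Illustrative values only; real values would come from the dotenv file
sample_env = {"DB_HOST": "localhost", "DB_USER": "demo",
              "DB_PASS": "secret", "DB_NAME1": "demo_db"}
settings = load_db_settings(sample_env)
print(sorted(settings))
```

Passing the environment in as an argument keeps the check testable with a plain dictionary.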


## Task 3. Performing the Scraping and Processing

In [None]:
try:
    s = requests.Session()
    resp = s.get(startingURL, headers=headers)
    resp.raise_for_status()  # raise HTTPError on 4xx/5xx responses so the handler below can fire
    if (verbose): print(resp.text)
except HTTPError as e:
    print('The server could not serve up the web page!')
    sys.exit("Script processing cannot continue!!!")
except ConnectionError as e:
    print('The server could not be reached due to connection issues!')
    sys.exit("Script processing cannot continue!!!")

if (resp.status_code==requests.codes.ok):
    print('Successfully accessed the company web page: ' + startingURL)
    searchPage = BeautifulSoup(resp.text, 'lxml')
    if verbose: print(searchPage)

Successfully accessed the company web page: http://www.mybidmatch.com/go?sub=55AB9731-0E3E-4BC0-B0E3-56EF81DA7FD4
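
The script stops at the first connection problem. One possible refinement, sketched here under assumptions, is to retry a failed request a few times before giving up. The helper `with_retries` is hypothetical and takes any zero-argument callable; the demo uses a simulated fetch function instead of a real network call. Exceptions raised by `requests` derive from `OSError`, which is why that is the default type to retry on.

```python
from time import sleep

def with_retries(fetch, attempts=3, delay_seconds=0.0, retry_on=(OSError,)):
    """Call fetch() until it succeeds or the attempts run out, then re-raise."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as err:
            print("Attempt", attempt, "failed:", err)
            if attempt == attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(delay_seconds)

# Simulated fetch that fails twice before succeeding
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"

page = with_retries(flaky_fetch, attempts=3)
print(page, calls["count"])
```

In the notebook, the callable could be something like `lambda: s.get(startingURL, headers=headers)`, with a non-zero `delay_seconds` so the retries stay polite to the server.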


In [None]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['Posting_Date', 'Source_Tag', 'Agency_Name', 'FSG_Tag', 'Article_Title', 'Search_Keywords', 'Notice_Heading', 'Department_URL', 'Notice_URL', 'Notice_Text'])
i = 0

In [None]:
done = False

search_listing = searchPage.find("table", class_="data").find_all("tr")
if verbose: print(search_listing)

In [None]:
for search_item in search_listing :
    search_element = search_item.find_all("td")
    posting_date_text = search_element[0].string
    posting_date = datetime.strptime(posting_date_text, '%A, %b %d, %Y').date()
    number_articles = int(search_element[1].string)
    group_url = websiteURL + search_item.find('a').get('href')

    if (number_articles > 0) and ((posting_date == targetDate) or processAll):
        # Adding random wait time so we do not hammer the website needlessly
        if addDelay:
            waitTime = randint(2,5)
            print("Waiting " + str(waitTime) + " seconds before processing the next article grouping page...")
            sleep(waitTime)
        else:
            print("Processing the next article grouping page...")
        
        try:
            s = requests.Session()
            resp = s.get(group_url, headers=headers)
            resp.raise_for_status()  # raise HTTPError on 4xx/5xx responses so the handler below can fire
            if (verbose): print(resp.text)
        except HTTPError as e:
            print('The server could not serve up the web page!')
            sys.exit("Script processing cannot continue!!!")
        except ConnectionError as e:
            print('The server could not be reached due to connection issues!')
            sys.exit("Script processing cannot continue!!!")
        if (resp.status_code==requests.codes.ok):
            print('Successfully accessed the article grouping web page: ' + group_url)
            noticePage = BeautifulSoup(resp.text, 'lxml')
                
        notice_listing = noticePage.find("table", class_="data").find_all("tr")
        if verbose: print(notice_listing)

        for notice_item in notice_listing :
            notice_element = notice_item.find_all("td")
            source_tag = notice_element[1].string.strip()
            agency_name = notice_element[2].string.strip()
            if (notice_element[3].string is None):
                fsg_tag = "N/A"
            else:
                fsg_tag = notice_element[3].string.strip()
            article_title = notice_element[4].string.strip()
            search_keywords = notice_element[5].string.strip()
            notice_url = websiteURL + notice_item.find('a').get('href')

            # Adding random wait time so we do not hammer the website needlessly
            if addDelay:
                waitTime = randint(1,3)
                print("Waiting " + str(waitTime) + " seconds before processing the next notice page...")
                sleep(waitTime)
            else:
                print("Processing the next notice page...")

            try:
                s = requests.Session()
                resp = s.get(notice_url, headers=headers)
                resp.raise_for_status()  # raise HTTPError on 4xx/5xx responses so the handler below can fire
                if (verbose): print(resp.text)
            except HTTPError as e:
                print('The server could not serve up the web page!')
                sys.exit("Script processing cannot continue!!!")
            except ConnectionError as e:
                print('The server could not be reached due to connection issues!')
                sys.exit("Script processing cannot continue!!!")

            if (resp.status_code==requests.codes.ok):
                print('Successfully accessed the notice web page: ' + notice_url)
                detailPage = BeautifulSoup(resp.text, 'lxml')
                notice_heading = detailPage.find("h4").string
                if (notice_heading is None): notice_heading = detailPage.find("h4").contents[0]
                notice_text = detailPage.find("div", class_="art-box").prettify()
                links_in_detail = detailPage.find("div", class_="art-box").find_all('a')
                if len(links_in_detail) > 0 : department_url = links_in_detail[-1].get('href')
                else: department_url = None

            if verbose: print(posting_date, source_tag, agency_name, fsg_tag, article_title, search_keywords, notice_heading, department_url, notice_url, notice_text)
            df.loc[i] = [posting_date, source_tag, agency_name, fsg_tag, article_title, search_keywords, notice_heading, department_url, notice_url, notice_text]
            if writeToDB: storeDB(posting_date, source_tag, agency_name, fsg_tag, article_title, search_keywords, notice_heading, department_url, notice_url, notice_text)
            else: print("Found record:", posting_date, '|', source_tag, '|', agency_name, '|', fsg_tag, '|', article_title, '|', search_keywords, '|', notice_heading, '|', department_url, '|', notice_url)
            i = i + 1

Waiting 4 seconds before processing the next article grouping page...
Successfully accessed the article grouping web page: http://www.mybidmatch.com/go?doc=1CF3851A-8C7F-442B-BA07-9B13B5142EF1
Waiting 3 seconds before processing the next notice page...
Successfully accessed the notice web page: http://www.mybidmatch.com/article?doc=1CF3851A-8C7F-442B-BA07-9B13B5142EF1&seq=1
Inserting record: 2021-11-30 | procure | DEPT OF DEFENSE | D | ProModel Simulation Consultation | computer?; develop*; naics!541519; service?; software; | DEPT OF DEFENSE, DEPT OF THE ARMY, W6QM MICC-WEST POINT, KO DIRECTORATE OF CONTRACTIN,  WEST POINT NY 10996-1514  | https://beta.sam.gov/opp/0d2d18a94b49464487ed4a092ba8fe3a/view? | http://www.mybidmatch.com/article?doc=1CF3851A-8C7F-442B-BA07-9B13B5142EF1&seq=1
Successfully inserted the record into the database.
Waiting 2 seconds before processing the next notice page...
Successfully accessed the notice web page: http://www.mybidmatch.com/article?doc=1CF3851A-8C7
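
The `search_keywords` field arrives as a single semicolon-separated string, as the inserted record above shows. If the individual keywords are needed later, a split along these lines may help; the helper name `split_keywords` is hypothetical, and the sample string is copied from the log output.

```python
def split_keywords(raw: str) -> list:
    """Split a ';'-separated keyword string, dropping empty pieces."""
    return [part.strip() for part in raw.split(";") if part.strip()]

# Keyword string as it appears in the log output above
sample = "computer?; develop*; naics!541519; service?; software;"
keywords = split_keywords(sample)
print(keywords)
```

The trailing semicolon would otherwise produce an empty last element, which is why empty pieces are filtered out.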

In [None]:
print('Finished finding all available articles on the web pages!')
print('Number of articles processed:', i)

Finished finding all available articles on the web pages!
Number of articles processed: 38


In [None]:
if writeToDB:
    try:
        cur.close()
        conn.close()
        print("Successfully closed the connection to host", db_host, "as user", db_user, "for database", db_name)
    except pymysql.MySQLError:
        print("Unable to close the connection to host", db_host, "as user", db_user, "for database", db_name)

Successfully closed the connection to host ec2-44-232-76-68.us-west-2.compute.amazonaws.com as user scrapinguser for database webscraping


## Task 4. Organizing Data and Producing Outputs

In [None]:
if writeJSON:
    out_file = df.to_json(orient='records')
    with open('web-scraping-py-bsoup-mybidmatch.json', 'w') as f:
        f.write(out_file)
    print('Total number of records written to file:', len(df))
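
How `to_json` writes the `Posting_Date` column depends on pandas settings such as `date_format`. For comparison, here is a sketch of the same records-style output using only the standard library's `json` module, where `default=str` converts date objects to text. The record below is a shortened, illustrative stand-in for a DataFrame row.

```python
import json
from datetime import date

# Shortened, illustrative record; the real rows carry ten columns
records = [
    {"Posting_Date": date(2021, 11, 30),
     "Agency_Name": "DEPT OF DEFENSE",
     "Article_Title": "ProModel Simulation Consultation"},
]

# default=str is called for objects json cannot serialize, such as date
payload = json.dumps(records, default=str, indent=2)
print(payload)
```

Writing `payload` to a file works the same way as in the cell above, with `f.write(payload)` inside a `with open(...)` block.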

In [None]:
print('Total time for the script:', (datetime.now() - startTimeScript))

Total time for the script: 0:02:00.340148
