# Web Scraping of RealMoney Contributor Articles Using Selenium
### David Lowe
### October 2, 2020

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Real Money is a website dedicated to investment news and blog articles written by financial professionals. The website features numerous professionals with various trading specialties and expertise. The script automatically traverses the article listing pages for one site contributor, captures the high-level metadata of each article (title, author, publication time, URL, and description), and stores the records in a CSV output file. Only articles published on a Saturday are retained.

Starting URL: https://realmoney.thestreet.com/author/1619871/james-rev-shark-deporre/all.html

## Task 1. Prepare Environment

In [1]:
import os
import sys
import pandas as pd
# import shutil
# import re
import boto3
from datetime import datetime, date
from random import randint
from time import sleep
# from dotenv import load_dotenv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

In [2]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose and debug flags to print detailed messages for debugging (setting True will activate!)
verbose = True
debug = False

# Set up the flag to send status emails (setting to True will send the status emails!)
notifyStatus = False

# # Set up the parent directory location for loading the dotenv files
# useColab = False
# if useColab:
#     # Mount Google Drive locally for storing files
#     from google.colab import drive
#     drive.mount('/content/gdrive')
#     gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
#     env_path = '/content/gdrive/My Drive/Colab Notebooks/'
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# # Set up the dotenv file for retrieving environment variables
# useLocalPC = False
# if useLocalPC:
#     env_path = "/Users/david/PycharmProjects/"
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# Set up the flag to write the output to a CSV file (setting to True will create the file!)
writeOutput = True

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = False

In [3]:
# Set up the email notification function
def status_notify(msg_text):
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = os.environ.get('SNS_TOPIC_ARN')
    if (access_key is None) or (secret_key is None) or (aws_region is None) or (topic_arn is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    if response['ResponseMetadata']['HTTPStatusCode'] != 200 :
        print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])
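
The environment check in `status_notify` can also be expressed as a small helper that reports exactly which settings are missing. The following is only a minimal sketch: `missing_settings` and `REQUIRED_SNS_VARS` are hypothetical names that do not appear in the original script, and the helper accepts the environment as a plain mapping so it can be tried without AWS credentials.

```python
import os

# Hypothetical constant: the four variables that status_notify reads
REQUIRED_SNS_VARS = ("SNS_ACCESS_KEY", "SNS_SECRET_KEY", "SNS_AWS_REGION", "SNS_TOPIC_ARN")

def missing_settings(env=None, required=REQUIRED_SNS_VARS):
    """Return the names of required variables that are unset or empty."""
    if env is None:
        env = os.environ
    return [name for name in required if not env.get(name)]

# Illustrative values only; real credentials would come from the environment
example_env = {"SNS_ACCESS_KEY": "key", "SNS_SECRET_KEY": "secret"}
print(missing_settings(example_env))
```

Listing the missing names in the abort message would make a misconfigured run easier to diagnose than a generic "incomplete setup" error.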

In [4]:
if (notifyStatus): status_notify("Task 1 Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Task 2. Perform the Scraping and Processing

In [5]:
if (notifyStatus): status_notify("Task 2 Perform the Scraping and Processing has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [6]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['title','author','date_time','url','description'])
num_entries = 0

In [7]:
# Specifying the URL of the desired web page to be scraped
website_url = "https://realmoney.thestreet.com/author/1619871/james-rev-shark-deporre/"
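
The scraping loop in Task 2 addresses each listing page by appending `all.html?page=N` to this base URL. As a rough illustration of that pattern, the sketch below generates the page URLs up front; `listing_page_urls` is a hypothetical helper, not part of the original script, and the default page range simply mirrors the `page_num` and `max_page` values used in the loop.

```python
def listing_page_urls(base_url, first_page=0, last_page=99):
    """Yield listing URLs in the all.html?page=N form, inclusive of last_page."""
    for page_num in range(first_page, last_page + 1):
        yield f"{base_url}all.html?page={page_num}"

base = "https://realmoney.thestreet.com/author/1619871/james-rev-shark-deporre/"
for url in listing_page_urls(base, last_page=2):
    print(url)
```

Separating URL construction from page retrieval keeps the paging rule in one place and makes it easy to check without opening a browser.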

In [8]:
# Initialize the web browser
firefox_options = Options()
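# Run Firefox without a visible window; passing the "-headless" argument also works on
# Selenium releases where the Options.headless setter is no longer available
firefox_options.add_argument("-headless")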
web_page_browser = webdriver.Firefox(options=firefox_options)

In [9]:
page_num = 0
max_page = 99
done = False

while (page_num <= max_page) and (not done):
    web_page_url = website_url + "all.html?page=" + str(page_num)
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,4)
    if verbose: print("Waiting", waitTime, "seconds to retrieve the items on page", web_page_url)
    sleep(waitTime)
    try:
        web_page_browser.get(web_page_url)
        print('Successfully accessed web page:', web_page_url)
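    except Exception as err:
        # Stop paging when a page cannot be retrieved; otherwise the parsing code
        # below would run against whatever page the browser loaded previously
        print('ERROR: The server could not serve up the web page!', err)
        done = True
        break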

    news_listing = web_page_browser.find_elements(By.CLASS_NAME, "news-list-compact__block")
    if debug: print(news_listing)

    author = web_page_browser.find_element(By.CLASS_NAME, "author-email").get_attribute("data-author-name")
    if verbose: print('Found author name:', author)
    for news_item in news_listing:
        title = news_item.find_element(By.TAG_NAME, "h3").text
        news_url = news_item.find_element(By.TAG_NAME, "a").get_attribute("href")
        author_byline = news_item.find_element(By.CLASS_NAME, "news-list-compact__byline")
        datetime_str = author_byline.find_element(By.TAG_NAME, "time").get_attribute("datetime")
        date_time = datetime.strptime(datetime_str, '%Y-%m-%dT%H:%M:%S%z')
        description = news_item.find_element(By.TAG_NAME, "p").text
        # Keep only Saturday articles ('%w' gives the weekday as a digit: Sunday is '0', Saturday is '6')
        if datetime.strftime(date_time, '%w') == '6':
            if verbose: print('Found article URL:', news_url)
            last_entry = len(df)
            if debug: print('Inserting record number', last_entry, 'into the dataframe.')
            df.loc[last_entry] = [title,author,date_time,news_url,description]
    page_num = page_num + 1
    print('Finished processing web page:', web_page_url)

Waiting 4 seconds to retrieve the items on page https://realmoney.thestreet.com/author/1619871/james-rev-shark-deporre/all.html?page=0
Successfully accessed web page: https://realmoney.thestreet.com/author/1619871/james-rev-shark-deporre/all.html?page=0
Found author name: James "Rev Shark" DePorre
Finished processing web page: https://realmoney.thestreet.com/author/1619871/james-rev-shark-deporre/all.html?page=0
Waiting 4 seconds to retrieve the items on page https://realmoney.thestreet.com/author/1619871/james-rev-shark-deporre/all.html?page=1
Successfully accessed web page: https://realmoney.thestreet.com/author/1619871/james-rev-shark-deporre/all.html?page=1
Found author name: James "Rev Shark" DePorre
Found article URL: https://realmoney.thestreet.com/investing/active-vs-passive-trading-the-key-to-big-gains-is-stalking-your-trades-15432222
Finished processing web page: https://realmoney.thestreet.com/author/1619871/james-rev-shark-deporre/all.html?page=1
Waiting 3 seconds to retrie
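
The loop keeps an article only when its timestamp falls on a Saturday. The sketch below isolates that step so it can be checked without a browser. The helper names `parse_article_time` and `is_saturday` are hypothetical, and the sample string is an assumed example shaped like the `date_time` values in the dataframe output, not a value captured from the site.

```python
from datetime import datetime

def parse_article_time(datetime_str):
    """Parse a timestamp such as '2020-09-19T10:00:00-04:00'.

    The '%z' directive accepts a UTC offset containing a colon on Python 3.7+.
    """
    return datetime.strptime(datetime_str, "%Y-%m-%dT%H:%M:%S%z")

def is_saturday(moment):
    # weekday() counts Monday as 0, so Saturday is 5
    # (the equivalent strftime('%w') value is '6')
    return moment.weekday() == 5

stamp = parse_article_time("2020-09-19T10:00:00-04:00")
print(stamp.isoformat(), is_saturday(stamp))
```

Comparing `weekday()` against an integer avoids formatting the date to a string only to compare it with `'6'`, although both forms select the same day.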

In [10]:
web_page_browser.quit()

In [11]:
if (notifyStatus): status_notify("Task 2 Perform the Scraping and Processing completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Task 3. Finalize the Output

In [12]:
if (notifyStatus): status_notify("Task 3 Finalize the Output has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [13]:
# Spot-checking the dataframe before writing to file
df.head()

| | title | author | date_time | url | description |
|---|---|---|---|---|---|
| 0 | Active vs. Passive Trading: The Key to Big Gai... | James "Rev Shark" DePorre | 2020-09-19 10:00:00-04:00 | https://realmoney.thestreet.com/investing/acti... | Take control of your trades rather than just s... |
| 1 | How to Find Your 'Formula' for Making Great St... | James "Rev Shark" DePorre | 2020-09-12 10:00:00-04:00 | https://realmoney.thestreet.com/investing/stoc... | The power and beauty of the stock market is th... |
| 2 | Selling's Not the End of a Good Trade, It's Ju... | James "Rev Shark" DePorre | 2020-09-05 10:00:00-04:00 | https://realmoney.thestreet.com/investing/stoc... | The most powerful strategic tool you possess i... |
| 3 | Why You Should Be Both a Trader and an Investor | James "Rev Shark" DePorre | 2020-08-29 10:00:00-04:00 | https://realmoney.thestreet.com/investing/why-... | The benefit of using both approaches is that i... |
| 4 | It's Compounding, Not Buy-and-Hold Investing, ... | James "Rev Shark" DePorre | 2020-08-22 10:00:00-04:00 | https://realmoney.thestreet.com/investing/it-s... | The biggest challenge you will find in harness... |


In [14]:
# Spot-checking the dataframe before writing to file
df.tail()

| | title | author | date_time | url | description |
|---|---|---|---|---|---|
| 45 | The Biggest Change in the Market in the Last D... | James "Rev Shark" DePorre | 2019-11-09 10:00:00-05:00 | https://realmoney.thestreet.com/investing/the-... | The most productive change that a trader can m... |
| 46 | The Most Important Question in Trading: How Do... | James "Rev Shark" DePorre | 2019-11-02 10:00:00-04:00 | https://realmoney.thestreet.com/investing/the-... | The best way to improve trading results is to ... |
| 47 | Trading During Earnings Season | James "Rev Shark" DePorre | 2019-10-19 10:00:00-04:00 | https://realmoney.thestreet.com/investing/trad... | The most important thing to remember about tra... |
| 48 | Commission-Free Trading: Use It Wisely | James "Rev Shark" DePorre | 2019-10-12 10:00:00-04:00 | https://realmoney.thestreet.com/investing/stoc... | The key to trading free of cost effectively is... |
| 49 | The Secret to Long-Term Trading Success | James "Rev Shark" DePorre | 2019-10-05 10:00:00-04:00 | https://realmoney.thestreet.com/investing/stoc... | It is to work at it every day, day after day, ... |


In [15]:
if (writeOutput):
    out_file = df.to_csv(index=False)
    with open('web_scraping_py_selenium_realmoney_contributor_articles.csv', 'w', newline = '\n') as f:
        f.write(out_file)
    print("Number of records written to file:", len(df))

Number of records written to file: 50
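
For comparison, the same five-column layout can be produced without pandas. The sketch below uses the standard-library `csv` module instead of `DataFrame.to_csv`, so it is an alternative technique rather than what the script does. `records_to_csv`, `FIELDS`, and the sample record are hypothetical, and the sample values are placeholders rather than scraped data.

```python
import csv
import io

# Column order matches the dataframe defined in Task 2
FIELDS = ["title", "author", "date_time", "url", "description"]

def records_to_csv(records):
    """Serialize a list of dicts to CSV text, quoting fields only where needed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Placeholder record for illustration only
sample = [{
    "title": "Example title, with a comma",
    "author": 'James "Rev Shark" DePorre',
    "date_time": "2020-09-19 10:00:00-04:00",
    "url": "https://example.com/article",
    "description": "Placeholder description",
}]
print(records_to_csv(sample))
```

Fields that contain commas or double quotes, such as the author name, are quoted and escaped automatically, which is why the author column is wrapped in quotes in the CSV file.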


In [16]:
if (notifyStatus): status_notify("Task 3 Finalize the Output completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [17]:
print('Total time for the script:', (datetime.now() - startTimeScript))

Total time for the script: 0:32:58.672917
