This is a code breakdown for the land_scrape.py script located in this repository. For further context, please see my LinkedIn article https://www.linkedin.com/pulse/i-made-web-bot-instead-doing-my-homework-patrick-crosman and the accompanying YouTube video https://youtu.be/e7rZ24X8osA

The goal of this code breakdown is to be descriptive and accessible to any reader. If you are a tech nerd and you see a glaring issue in the way I have architected this solution, please make a pull request or reach out and let me know. (email address)

In [16]:
''' There are a whole bunch of handy features from Selenium needed to drive our browser and control the pace of execution. '''
from selenium import webdriver as wd
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

For the sake of readability in this breakdown, I will import the remaining libraries as they are used. Let's open up a browser and move it to the side.

In [17]:
''' open a new chrome browser using the webdriver service '''
chromedriver_service = Service('./chromedriver.exe')
browser = wd.Chrome(service=chromedriver_service)
browser.set_window_rect(x=930, y=0, width=1200, height=1125) # this is specific to my monitor

{'height': 1100, 'width': 1200, 'x': 930, 'y': 0}

Now that we have a browser, we can go to the events page on the Higher Landing website. For the sake of security, I have stored all credentials and sensitive information in a JSON file. Loading this JSON into a Python dictionary allows us to use key/value pairs to access private data.

In [18]:
import json
''' get secret test information from a json file '''
base_path = r"H:\2021-11-03-HIGHER-LANDING\\"
test_info_file_name = 'secret_test_info.json'
with open(base_path + test_info_file_name, 'r') as t_f:
    # json.load parses the open file directly into a dictionary
    test_info_dict = json.load(t_f)
print(test_info_dict.keys())

dict_keys(['base_path', 'user_name', 'user_password', 'required_sessions_file', 'events_url', 'locators'])
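
For readers who want to reproduce this step, here is a rough sketch of what such a file might look like and how it round-trips through the json module. The top-level keys are taken from the output above; every value, including the two locator entries, is an invented placeholder and not the real contents of secret_test_info.json.

```python
import json
import os
import tempfile

# Placeholder structure: the top-level keys match the printed dict_keys,
# but every value here is invented for illustration.
sample_test_info = {
    "base_path": "C:/example/path/",
    "user_name": "someone@example.com",
    "user_password": "not-a-real-password",
    "required_sessions_file": "schedule.txt",
    "events_url": "https://example.com/events",
    "locators": {
        "login_submit_button": "hypothetical-login-button-id",
        "user_name_field": "hypothetical-user-name-id",
    },
}

# Write the sample to a temporary file, then load it back with json.load.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "secret_test_info.json")
    with open(sample_path, "w") as out_file:
        json.dump(sample_test_info, out_file, indent=2)
    with open(sample_path, "r") as in_file:
        loaded_info = json.load(in_file)

# Nested values are reached by chaining lookups on the loaded dictionary.
print(loaded_info["locators"]["login_submit_button"])
```

Keeping the locators in their own nested dictionary means the script itself never hard-codes anything about the site's markup.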


The keys listed above will be used for data access and field location. Below, we __get__ the value of the events URL from our test info dictionary and then use the browser's __get__ method to load that URL.

In [19]:
''' open the browser to the events page of the course '''
url = test_info_dict.get('events_url')
browser.get(url)

This will take us to the login page, but it is good practice to make the browser wait until everything is loaded. Selenium's WebDriverWait class is used to achieve this. Let's wait, for up to 10 seconds, until the Log In button is ready to click.

In [20]:
''' wait until we can click on the log in button'''
waiter = WebDriverWait(browser, 10)
ready_for_input = waiter.until(EC.element_to_be_clickable((
    By.ID, test_info_dict.get('locators').get('login_submit_button'))))
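
If explicit waits are new to you, the idea is simply "keep checking until the condition is true or time runs out". The sketch below is a simplified, pure-Python stand-in for that idea and not Selenium's actual implementation. The function name wait_until, its parameters, and the fake condition are all hypothetical.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Simplified stand-in for the idea behind WebDriverWait.until; the name
    wait_until and its parameters are hypothetical, not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} seconds".format(timeout))
        time.sleep(poll_interval)

# Example: a fake "page" that only becomes ready on the third check.
checks = {"count": 0}

def fake_button_ready():
    checks["count"] += 1
    return "login-button" if checks["count"] >= 3 else None

ready_element = wait_until(fake_button_ready, timeout=2.0, poll_interval=0.01)
print(ready_element, checks["count"])
```

Like this sketch, waiter.until hands back whatever truthy value the condition produced, which is why the result can be stored in a variable such as ready_for_input.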

The By.ID selector is used multiple times here in conjunction with the locators from our test info dictionary. Selenium's find_element method returns an HTML element that we can interact with, for example by typing into it or clicking it.

In [21]:
''' enter the user name and password by using locators, all data is stored in test_info_dict '''
user_name_field = browser.find_element(By.ID, test_info_dict.get('locators').get('user_name_field'))
user_name_field.send_keys(test_info_dict.get('user_name'))
user_password_field = browser.find_element(By.ID, test_info_dict.get('locators').get('user_pass_field'))
user_password_field.send_keys(test_info_dict.get('user_password'))

''' locate the login button and click it '''
login_button = browser.find_element(By.ID, test_info_dict.get('locators').get('login_submit_button'))
login_button.click()
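
The chained test_info_dict.get('locators').get(...) lookups above return None when a key is misspelled, and the failure only shows up later inside Selenium. One possible way to tidy this up is a small helper like the one sketched below. The name get_locator and the sample data are hypothetical and are not part of land_scrape.py.

```python
def get_locator(test_info, name):
    """Return the locator stored under test_info['locators'][name].

    Raises KeyError with a readable message instead of silently returning
    None, which is what chained .get() calls do for a missing key.
    """
    locators = test_info.get("locators", {})
    if name not in locators:
        raise KeyError("no locator named {!r} in test info".format(name))
    return locators[name]

# Placeholder data; the id value is invented for illustration.
example_info = {"locators": {"user_name_field": "hypothetical-user-id"}}

print(get_locator(example_info, "user_name_field"))
```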

Since we loaded the events page URL directly, we will wait for the Upcoming Events link to be ready before clicking it.

In [22]:
''' wait until the Upcoming Events link is ready to be clicked '''
upcoming_events_locator = test_info_dict.get('locators').get('upcoming_events_link')
calendar_ready = waiter.until(EC.element_to_be_clickable((By.ID, upcoming_events_locator)))
upcoming_events_link = browser.find_element(By.ID, upcoming_events_locator)
upcoming_events_link.click()

Each event is listed with a truncated description and an RSVP button. We will wait until the first RSVP button is ready to be clicked.

In [23]:
''' wait until the first rsvp button is ready to be clicked '''
first_rsvp_button_locator = test_info_dict.get('locators').get('first_rsvp_button')
upcoming_events_page_ready = waiter.until(EC.element_to_be_clickable((By.ID, first_rsvp_button_locator)))

Now we will grab all of the event title links using a regular expression and BeautifulSoup's find_all function.

In [26]:
import re
from bs4 import BeautifulSoup
''' use the html parser from BeautifulSoup to find all the links matching a specific pattern  '''
page_soup = BeautifulSoup(browser.page_source, 'html.parser')
event_title_regex_pattern = test_info_dict.get('locators').get('event_title_links_regex')
event_title_links = page_soup.find_all("a", {"id": re.compile(event_title_regex_pattern)})
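
To see the same find_all call in isolation, here is a small sketch that runs against a hand-written HTML snippet. Both the markup and the regular expression are invented for illustration. The real page structure and the real pattern (stored in the JSON file) are not shown in this breakdown.

```python
import re
from bs4 import BeautifulSoup

# Invented markup and pattern; the real page and regex are not shown here.
sample_html = """
<a id="event_title_1" href="#">How to Pitch Yourself</a>
<a id="event_rsvp_1" href="#">RSVP</a>
<a id="event_title_2" href="#">Resume Clinic</a>
"""
sample_pattern = r"^event_title_\d+$"

sample_soup = BeautifulSoup(sample_html, "html.parser")
# Passing a compiled regex as the id filter keeps only tags whose id matches.
title_links = sample_soup.find_all("a", {"id": re.compile(sample_pattern)})

for link in title_links:
    print(link["id"], link.text)
```

Each result is a Tag object, so both the id attribute (link["id"]) and the visible text (link.text) are available, which is what the loop further down relies on.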

We can compare the link text to the required sessions and determine which events to attend.

In [28]:
''' read the contents of the required sessions file into string for easy comparison '''
with open(r"H:\2021-11-03-HIGHER-LANDING\schedule.txt", "r") as required_sessions_file:
    # the with block closes the file as soon as it has been read
    required_sessions_string = required_sessions_file.read()

''' an empty list which will be populated in the for loop below '''
events_to_attend_list = []
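
One thing to keep in mind is that checking a title against one long string with the in operator is a substring test, so a title that happens to be a prefix of a required title also matches. The sketch below shows that effect and a stricter whole-line comparison as a possible alternative. The schedule contents are made up, in the style of the titles on the events page, and this is not how land_scrape.py currently works.

```python
# Made-up schedule contents, in the style of titles on the events page.
sample_required_sessions = "How to Pitch Yourself\nResume Writing Studio Part II\n"

# Substring test: True whenever the title appears anywhere in the string.
substring_match = "Resume Writing Studio Part I" in sample_required_sessions
print(substring_match)  # True, a false positive

# Stricter alternative: compare against whole lines only.
required_titles = {
    line.strip() for line in sample_required_sessions.splitlines() if line.strip()
}
print("Resume Writing Studio Part I" in required_titles)   # False
print("Resume Writing Studio Part II" in required_titles)  # True
```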

This next chunk of code would be better split into three separate functions, but since you have come this far already, let's just work through it using the comments below. We will go through every link and grab the event time, Zoom link, and description from the event details page. I have also created a modular version of this script. (link here to modular)

In [35]:
''' go through all the events on the page and gather the event start time and details '''
for link in event_title_links:
    if link.text in required_sessions_string:
        ''' link.text, event_time, zoom_link, description  '''
        print(link.text)
        ''' use the id property from the current link to enter the event details page '''
        click_me = browser.find_element(By.ID, link['id'])
        click_me.click()
        rsvp_page_ready = waiter.until(EC.element_to_be_clickable((
            By.ID, test_info_dict.get('locators').get('event_rsvp_button'))))
        
        ''' parse the current page and search for the zoom link '''
        tmp_event_soup = BeautifulSoup(browser.page_source, 'html.parser')
        zoom_link = tmp_event_soup.find('p', string=re.compile('^https:.*zoom'))  # string= is the current name for the older text= argument
        
        ''' get the parent/containing element '''
        if zoom_link is not None:
            description_container_element = zoom_link.parent
            zoom_link_text = zoom_link.text
        else:
            # in the case where our first attempt did not locate the zoom link directly
            description_container_element = tmp_event_soup.find(class_='column wpc65 left')
            zoom_link_text = description_container_element.text.split('via Zoom:')[1]

        ''' this is not perfect; the result currently contains too many spaces '''
        description_text = description_container_element.text.split('Send calendar to email')[1].replace(u'\n', ' ')

        ''' collect the start time, date, and full event description '''
        event_start_datetime_locator = test_info_dict.get('locators').get('event_start_datetime')
        event_start_datetime = description_container_element.find(id=event_start_datetime_locator)
        event_start_datetime_formatted = event_start_datetime.text.replace('/', '-')
        
        ''' create a tuple to add to the events to attend list '''
        tmp_tuple = (link.text, event_start_datetime_formatted, zoom_link_text, description_text.replace(u'\xa0', ' '))
        #print(str(tmp_tuple))
        events_to_attend_list.append(tmp_tuple)
        
        ''' navigate back to previous page '''
        back_to_calendar_link = browser.find_element(By.ID, test_info_dict.get('locators').get('back_calendar_link'))
        back_to_calendar_link.click()
        tmp_calendar_ready = waiter.until(EC.element_to_be_clickable((By.ID, upcoming_events_locator)))
        tmp_upcoming_events_link = browser.find_element(By.ID, upcoming_events_locator)
        tmp_upcoming_events_link.click()
        tmp_upcoming_events_page_ready = waiter.until(EC.element_to_be_clickable((By.ID, first_rsvp_button_locator)))
    else:
        print("skipped {}".format(link.text))        

How to Pitch Yourself
skipped Appreciative Inquiry Clinic
skipped Resume Clinic
skipped Marketing Club
skipped Branding Clinic
skipped LinkedIn Clinic
Interview Preparation
Market Yourself Like a Pro Part I
Resume Writing Studio Part II
skipped Pitch Clinic
skipped Grizzly Research Clinic
Art of Blogging
skipped Knowing Your Brand I
skipped Marketing Club
skipped Interview Clinic
Higher Landing Grizzly Den
skipped Knowing Your Brand II
Higher Landing Grizzly Den
Emerging Sectors & Transferable Skills
skipped IP Clinic
skipped Resume Writing Studio I
skipped G2M Finalization Meeting
skipped Transferable Skills & Values Clinic
skipped Branding Clinic
Market Yourself Like a Pro II
Promoting Your Brand on LinkedIn
skipped Resume Clinic
How to Pitch Yourself
skipped Appreciative Inquiry Clinic
skipped Resume Writing Studio II
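
As noted in the loop, replacing newlines with spaces leaves runs of extra spaces in the description. A possible cleanup, sketched below as a separate helper, is to split on whitespace and re-join with single spaces. The function name clean_description and the sample text are hypothetical and not part of the script above.

```python
def clean_description(raw_text):
    """Collapse runs of whitespace (including non-breaking spaces) to single spaces."""
    # Replace non-breaking spaces explicitly, then let split() drop every
    # run of spaces, tabs, and newlines before re-joining with one space.
    return " ".join(raw_text.replace("\xa0", " ").split())

# Invented sample resembling text scraped from an event page.
sample_raw = "Join us\n\n   via Zoom:\xa0 for a   session\n on pitching. "
print(clean_description(sample_raw))
```

A helper like this would also be a natural first piece to pull out when splitting the loop into functions.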


OK, so now we have a list of the details for the required events. Let's quickly use pandas to output a CSV file.

In [36]:
import pandas as pd
''' use the events to attend list to create a pandas dataframe '''
events_to_attend_df = pd.DataFrame(events_to_attend_list, columns=["Title", "Start Time", "Link", "Description"])
events_to_attend_df.to_csv("{}-Required_Events.csv".format(base_path))
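
By default, to_csv also writes the DataFrame's numeric row index as an unnamed first column. If that column is not wanted, index=False leaves it out. The sketch below uses invented rows with the same four fields as the tuples built above, and returns the CSV as text so nothing is written to disk.

```python
import pandas as pd

# Invented rows with the same four-field shape as the tuples built above.
sample_events = [
    ("How to Pitch Yourself", "2021-11-08 10:00", "https://example.com/zoom/1", "Pitch basics"),
    ("Art of Blogging", "2021-11-09 13:00", "https://example.com/zoom/2", "Writing posts"),
]
sample_df = pd.DataFrame(sample_events, columns=["Title", "Start Time", "Link", "Description"])

# With no path argument, to_csv returns the CSV text instead of writing a file.
# index=False leaves out the numeric row index that is written by default.
csv_text = sample_df.to_csv(index=False)
print(csv_text)
```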

The next logical step here would be to use the Google Calendar API to create events, or really, anything. Thank you for your time! 