### Setup

Download Python if you have not already: https://www.python.org/downloads/

Download Jupyter Notebook at https://jupyter.org/install and follow the instructions under "Getting started with the classic Jupyter Notebook".

Open a terminal, type "jupyter notebook", then press Enter!
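
The import cell in the next section relies on several third-party packages. As a quick sanity check, the sketch below lists which of them Python cannot find yet. The `REQUIRED` list and the `missing_packages` helper are illustrative names, not part of the original notebook.

```python
import importlib.util

# Import names of the third-party packages used by the notebook's import cell.
REQUIRED = ["pandas", "pyairtable", "selenium", "webdriver_manager"]


def missing_packages(names):
    """Return the names from `names` that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing packages:", missing_packages(REQUIRED))
```

Anything reported as missing can be installed from a terminal with pip; the corresponding PyPI names are `pandas`, `pyairtable`, `selenium` and `webdriver-manager`.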

### Import Packages!

In [115]:
# Allows you to create and manipulate data tables
import pandas as pd
# Standard-library module for opening URLs (not used in the cells below)
import urllib.request
# Allows us to wait before moving on to another action
import time

# Airtable documentation: https://pyairtable.readthedocs.io/en/latest/getting-started.html
# Allows us to access Airtable, create tokens here https://airtable.com/create/tokens/new
from pyairtable import Table
# Don't share your API key: anyone who has it can read and change your Airtable data
with open("airtable_key.txt", "r") as f:
    # strip() removes the trailing newline that text editors usually add
    api_key = f.read().strip()

# Selenium is an automation framework that allows you to access web pages
from selenium import webdriver
# Not used in this section, but WebDriverWait lets you wait until something loads
from selenium.webdriver.support.ui import WebDriverWait
# EC provides ready-made conditions to wait for, such as an element appearing on the page
from selenium.webdriver.support import expected_conditions as EC
# TimeoutException is raised when a wait expires before its condition is met
from selenium.common.exceptions import TimeoutException
# By names the locator strategy (XPath, CSS selector, ID, ...) used to find elements
from selenium.webdriver.common.by import By

# Service wraps the chromedriver executable (Selenium 4 style)
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Options for the webdriver, such as using Chrome incognito and running headless (no visible browser window)
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
# Pass the options to the driver; otherwise the arguments above have no effect
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
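
A plain text file is one way to keep the key out of the notebook; an environment variable is another. The sketch below tries an environment variable first and falls back to the file. The `load_api_key` helper and the `AIRTABLE_API_KEY` variable name are assumptions made for illustration, not something pyairtable or the original notebook defines.

```python
import os
from pathlib import Path


def load_api_key(path="airtable_key.txt", env_var="AIRTABLE_API_KEY"):
    """Return the Airtable token from an environment variable or a local file."""
    # Hypothetical variable name; use whatever name you export in your shell.
    key = os.environ.get(env_var, "").strip()
    if not key:
        key_file = Path(path)
        if key_file.is_file():
            # strip() drops the trailing newline most editors add to the file
            key = key_file.read_text(encoding="utf-8").strip()
    if not key:
        raise RuntimeError(f"No API key found in {env_var} or {path}")
    return key
```

With this in place, `api_key = load_api_key()` could replace the file-reading lines, and the cell fails loudly when no key is available instead of sending an empty token to Airtable.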



### Get the data from both Airtable bases as lists of dictionaries

In [111]:
# A dictionary has the form {key: value, key: value, ...}
# A nested dictionary has the form {key: {key: value}}

# The base ID can be found in the Airtable URL
# For example, in https://airtable.com/appymSmq5Liwaf17Y/tblHCYebPG05igKPd/viw8yQuFr4pRttzZK?blocks=hide
# the base ID is the part that starts with "app": appymSmq5Liwaf17Y

at_venture_firms = Table(api_key, 'appp2lepQxmNkFrSB', 'venture_firms_table')
data_venture_firms = at_venture_firms.all()

at_jobs = Table(api_key, 'appymSmq5Liwaf17Y', 'jobs_table')
data_jobs = at_jobs.all()
data_venture_firms

[{'id': 'reckdYoCVU2nsCGel',
  'createdTime': '2023-03-31T15:49:25.000Z',
  'fields': {'Name': 'Greylock', 'URL': 'https://jobs.greylock.com/jobs'}},
 {'id': 'recpJUlFuSVotqrX1',
  'createdTime': '2023-03-31T15:49:25.000Z',
  'fields': {'Name': 'Khosla', 'URL': 'https://jobs.khoslaventures.com/jobs'}},
 {'id': 'recpLeGkaWO3XT8On',
  'createdTime': '2023-03-31T15:49:25.000Z',
  'fields': {'Name': '8VC', 'URL': 'https://jobs.8vc.com/jobs'}}]

### Get job website URLs from table

In [112]:
vc_list = []
for row in data_venture_firms:
    field = row["fields"]
    vc_list.append(field)
vc_list

[{'Name': 'Greylock', 'URL': 'https://jobs.greylock.com/jobs'},
 {'Name': 'Khosla', 'URL': 'https://jobs.khoslaventures.com/jobs'},
 {'Name': '8VC', 'URL': 'https://jobs.8vc.com/jobs'}]
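
Airtable leaves empty cells out of a record's `fields`, so a row with a blank URL would later raise a `KeyError` at `vc['URL']`. A minimal sketch of a more defensive version of the loop above follows; `extract_firms` and the second sample record are hypothetical and only illustrate the idea.

```python
def extract_firms(records):
    """Return the 'fields' dict of every record that has both a Name and a URL."""
    firms = []
    for record in records:
        fields = record.get("fields", {})
        # Skip rows where either cell is empty or missing
        if fields.get("Name") and fields.get("URL"):
            firms.append(fields)
    return firms


sample = [
    {"id": "reckdYoCVU2nsCGel",
     "fields": {"Name": "Greylock", "URL": "https://jobs.greylock.com/jobs"}},
    # Hypothetical incomplete row, not taken from the real table
    {"id": "recHypothetical01", "fields": {"Name": "No URL yet"}},
]
print(extract_firms(sample))
# [{'Name': 'Greylock', 'URL': 'https://jobs.greylock.com/jobs'}]
```

In the notebook this would be used as `vc_list = extract_firms(data_venture_firms)`.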

### Locate the job cards

Docs: https://selenium-python.readthedocs.io/locating-elements.html

In [113]:
# To find the right selector: open the job page, then right click a job card -> Inspect

def get_jobs(vc):
    name = vc['Name']
    url = vc['URL']
    driver.get(url)
    # Scroll to the bottom of the page so that more job cards can load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Give the page a few seconds to finish loading
    time.sleep(5)

    # Find the XPath that selects the data you need!
    results = driver.find_elements(By.XPATH, "//div[contains(@class, 'job-card')]")
    # insert_job is defined in the "Find the name and role" cell; run that cell first
    insert_job(results, name)
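
The fixed `time.sleep(5)` always waits five seconds, even when the cards appear sooner, and may still be too short on a slow connection. The usual alternative is to poll until the elements exist. The sketch below shows that polling idea in plain Python; `wait_until` is a hypothetical helper written for illustration, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

Inside `get_jobs` this could be called as `results = wait_until(lambda: driver.find_elements("xpath", "//div[contains(@class, 'job-card')]"))`, because `find_elements` returns an empty list until at least one card is present. Selenium ships the same pattern ready-made through the classes imported earlier: `WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[contains(@class, 'job-card')]")))`, which raises `TimeoutException` when the wait expires.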


In [123]:
get_jobs(vc_list[0])

Greylock Product Manager at Greylock
Greylock Software Engineer at Greylock
Greylock Product Designer at Greylock
Cato Networks Sales Development Representative, North Central at Cato Networks
Cato Networks Sales Development Representative, Australia at Cato Networks
Tome Content Marketing Manager at Tome
Adept Technical Sourcer at Adept
Lyra Health Senior Group Manager, Customer Success Implementation at Lyra Health
Clockwise Product Marketing Manager at Clockwise
Orb Solutions Engineer at Orb
Orb Product Manager at Orb
Orb Engineering Manager at Orb
Discord Senior Software Engineer, Safety Processing at Discord
Discord Senior Software Engineer, Creator Revenue at Discord
Discord Senior Product Marketing Manager, Insights & Strategy at Discord
Discord Group Product Marketing Manager, Privacy at Discord
Cribl Sr UX Researcher at Cribl
Cribl Customer Experience Enablement Specialist at Cribl
Magic Eden Senior Product Marketing Manager, Marketplace Strategy at Magic Eden
Magic Eden Senio

### Find the name and role

In [122]:
def insert_job(results, name):
    # Optional filters, currently unused:
    # accepted_locations = ["MA, USA", "CA, USA", "WA, USA", "NY, USA", "IL, USA", "Washington, DC, USA", "Remote"]
    # block_positions = ["Senior", "Sr", "Coordinator","principal","sr","Lead","SR", "Facilitator", "Site", "Inventory", "Line", 'Contractor', "Account", "Specialist", "Officer", "Site", "Technician", "Payroll", "Supervisor", "Risk", "Credit", "HR", "Accounting", "Assistant", "Executive", "Agent", "Shift", "Administrator"]

    for card in results:
        company = card.find_element(By.XPATH, './/div[@itemprop="hiringOrganization"]').text
        role = card.find_element(By.XPATH, './/meta[@itemprop="description"]').get_attribute("content")
        # Print each posting as it is found
        print(company, role)

        # Make sure the dictionary keys match the Airtable column names
        at_jobs.create({'Company': company, 'Role': role})
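
The commented-out `block_positions` list hints at filtering out roles by keyword, but the notebook never applies it. One possible way to do so is sketched below, assuming a case-insensitive whole-word match is what is wanted; `is_blocked` is a hypothetical helper and the short list is just a subset of the original one.

```python
import re


def is_blocked(role, block_positions):
    """Return True if any blocked keyword appears in `role` as a whole word."""
    for keyword in block_positions:
        # \b keeps "Lead" from matching inside a longer word such as "Leadership"
        if re.search(rf"\b{re.escape(keyword)}\b", role, flags=re.IGNORECASE):
            return True
    return False


block_positions = ["Senior", "Sr", "Lead"]  # a short subset of the list above
print(is_blocked("Sr UX Researcher at Cribl", block_positions))    # True
print(is_blocked("Product Manager at Greylock", block_positions))  # False
```

Inside the loop in `insert_job`, a line such as `if is_blocked(role, block_positions): continue` placed before `at_jobs.create(...)` would skip the unwanted roles.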


### Run the program!

In [124]:
for vc in vc_list:
    get_jobs(vc)

Greylock Product Manager at Greylock
Greylock Software Engineer at Greylock
Greylock Product Designer at Greylock
Cato Networks Sales Development Representative, North Central at Cato Networks
Cato Networks Sales Development Representative, Australia at Cato Networks
Tome Content Marketing Manager at Tome
Adept Technical Sourcer at Adept
Lyra Health Senior Group Manager, Customer Success Implementation at Lyra Health
Clockwise Product Marketing Manager at Clockwise
Orb Solutions Engineer at Orb
Orb Product Manager at Orb
Orb Engineering Manager at Orb
Discord Senior Software Engineer, Safety Processing at Discord
Discord Senior Software Engineer, Creator Revenue at Discord
Discord Senior Product Marketing Manager, Insights & Strategy at Discord
Discord Group Product Marketing Manager, Privacy at Discord
Cribl Sr UX Researcher at Cribl
Cribl Customer Experience Enablement Specialist at Cribl
Magic Eden Senior Product Marketing Manager, Marketplace Strategy at Magic Eden
Magic Eden Senio
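
Because `insert_job` calls `at_jobs.create` for every card, running `get_jobs(vc_list[0])` and then the full loop writes the Greylock postings to Airtable twice. A minimal sketch of one way to avoid this is to collect the rows first and drop repeats before writing; `dedupe_jobs` is a hypothetical helper, and the sample rows reuse postings from the output above.

```python
def dedupe_jobs(rows, existing=()):
    """Return rows whose (Company, Role) pair is new.

    A pair counts as already seen if it occurs in `existing` or earlier in `rows`.
    """
    seen = {(row.get("Company"), row.get("Role")) for row in existing}
    unique = []
    for row in rows:
        key = (row.get("Company"), row.get("Role"))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"Company": "Orb", "Role": "Product Manager at Orb"},
    {"Company": "Orb", "Role": "Product Manager at Orb"},
    {"Company": "Orb", "Role": "Engineering Manager at Orb"},
]
print(len(dedupe_jobs(rows)))  # 2
```

To use it, `insert_job` would append `{'Company': company, 'Role': role}` to a list instead of creating each record immediately. The rows already in the table can be read with `[r["fields"] for r in at_jobs.all()]`, and the de-duplicated list can then be written in one call with pyairtable's `at_jobs.batch_create(...)`.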