In [1]:
import pandas as pd
import requests
import bs4
import re

CSV = False

# Scraping HR Hosted Job Boards

Goal:

**Set up templates to scrape most common HR Software hosted job boards**


Steps:
1. Clean up data from manual entry
2. Subset data for specific HR Tools
3. Build scraping templates for each platform
    1. lever
    2. greenhouse
    3. workable
    4. breezy
    5. recruitee

In [2]:
df = pd.read_csv("data/name_url_updated.csv")

## 1. Cleaning up df from manual entry/manipulation

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,Company,URL,HRTool,OpenPositions-15.05.2021
0,0,Olark,https://www.olark.com/jobs,,No
1,1,Help Scout,https://jobs.lever.co/helpscout,Lever,
2,2,Close,https://jobs.lever.co/close.io/,Lever,
3,3,Prezly,https://careers.prezly.com/,,
4,4,Skillcrush,https://skillcrush.breezy.hr/,Breezy,


In [4]:
df.drop("Unnamed: 0", axis=1, inplace=True)

In [5]:
df.head()

Unnamed: 0,Company,URL,HRTool,OpenPositions-15.05.2021
0,Olark,https://www.olark.com/jobs,,No
1,Help Scout,https://jobs.lever.co/helpscout,Lever,
2,Close,https://jobs.lever.co/close.io/,Lever,
3,Prezly,https://careers.prezly.com/,,
4,Skillcrush,https://skillcrush.breezy.hr/,Breezy,


In [6]:
len(df)

97

In [7]:
#all company names are unique but not all urls - must investigate
df.nunique()

Company                     97
URL                         95
HRTool                       5
OpenPositions-15.05.2021     1
dtype: int64

In [8]:
df.loc[df["URL"].duplicated()]

Unnamed: 0,Company,URL,HRTool,OpenPositions-15.05.2021
80,Close.io,https://jobs.lever.co/close.io/,Lever,
91,On The Go Systems,https://www.onthegosystems.com/jobs/,,No


In [9]:
df.loc[df["URL"] == "https://jobs.lever.co/close.io/"]

Unnamed: 0,Company,URL,HRTool,OpenPositions-15.05.2021
2,Close,https://jobs.lever.co/close.io/,Lever,
80,Close.io,https://jobs.lever.co/close.io/,Lever,


In [10]:
df.loc[df["URL"] == "https://www.onthegosystems.com/jobs/"]

Unnamed: 0,Company,URL,HRTool,OpenPositions-15.05.2021
60,OnTheGoSystems,https://www.onthegosystems.com/jobs/,,No
91,On The Go Systems,https://www.onthegosystems.com/jobs/,,No


**Will de-dupe these rows - there was just a difference in naming conventions when I mered the company lists from the two different websites**
I prefer the last of both of these duplicates, since Close.io gives more information, and On The Go Systems with spaces is sticking to the spacing convention of other companies in the list

In [11]:
df.drop_duplicates(subset="URL", keep="last", inplace=True)

## 2. Create sf subsets with only specifc HRTools

In [12]:
df["HRTool"].value_counts()

Lever         10
Greenhouse    10
Workable       7
Breezy         6
Recruitee      4
Name: HRTool, dtype: int64

In [13]:
lever = df.loc[df["HRTool"] == "Lever"].copy().reset_index(drop=True)

In [14]:
greenhouse = df.loc[df["HRTool"] == "Greenhouse"].copy().reset_index(drop=True)

In [15]:
workable = df.loc[df["HRTool"] == "Workable"].copy().reset_index(drop=True)

In [16]:
breezy = df.loc[df["HRTool"] == "Breezy"].copy().reset_index(drop=True)

In [17]:
recruitee = df.loc[df["HRTool"] == "Recruitee"].copy().reset_index(drop=True)

## 3. Scraping Template
Goal Table Information

| Company Name | Job Title | Location (if avail) | Department | URL |
|--------------|-----------|---------------------|------------|-----|
| Help Scout | Data Analyst | Remote | Data Team | www.jobpostinghere.com |


### A. Lever

In [124]:
lever.head()

Unnamed: 0,Company,URL,HRTool,OpenPositions-15.05.2021
0,Help Scout,https://jobs.lever.co/helpscout,Lever,
1,Articulate,https://jobs.lever.co/articulate,Lever,
2,MangoLanguages,https://jobs.lever.co/mangolanguage/,Lever,
3,Guilded,https://jobs.lever.co/guilded/,Lever,
4,Buildkite,https://jobs.lever.co/Buildkite/,Lever,


In [133]:
url_lever = lever["URL"][0]

In [134]:
response_lever = requests.get(url_lever)
soup_lever = bs4.BeautifulSoup(response_lever.content, "html.parser")

**Get Job Title, Loc, Dept**

In [136]:
soup_lever.find_all("a", {"class": "posting-title"})

[<a class="posting-title" href="https://jobs.lever.co/helpscout/b63c26a3-9b78-4293-bac8-8f6b7149a130"><h5 data-qa="posting-name">Director of Brand</h5><div class="posting-categories"><span class="sort-by-location posting-category small-category-label" href="#">Remote</span><span class="sort-by-team posting-category small-category-label" href="#">Brand</span></div></a>,
 <a class="posting-title" href="https://jobs.lever.co/helpscout/b1200570-fbee-4a97-8210-0a290f1f25c3"><h5 data-qa="posting-name">Ops Engineer</h5><div class="posting-categories"><span class="sort-by-location posting-category small-category-label" href="#">Remote</span><span class="sort-by-team posting-category small-category-label" href="#">Engineering</span></div></a>,
 <a class="posting-title" href="https://jobs.lever.co/helpscout/18a5f09e-37d7-458c-b292-8ecc0e090c62"><h5 data-qa="posting-name">Senior Java Engineer</h5><div class="posting-categories"><span class="sort-by-location posting-category small-category-label" 

In [135]:
#confirming capture of all listings on 
len(soup_lever.find_all("a", {"class": "posting-title"}))

9

In [137]:
for x in soup_lever.find_all("a", {"class": "posting-title"}):
    print((x.get_text("<h5>")))

Director of Brand<h5>Remote<h5>Brand
Ops Engineer<h5>Remote<h5>Engineering
Senior Java Engineer<h5>Remote<h5>Engineering
Senior JavaScript Engineer<h5>Remote<h5>Engineering
Future Openings at Help Scout<h5>Remote<h5>Future Openings
Content Writer<h5>Remote<h5>Marketing
Front-end Developer<h5>Remote<h5>Marketing
Sales Manager, Business Development Representatives<h5>Remote<h5>Sales
Technical Support Specialist (formerly Customer Champion)<h5>Remote<h5>Support


**Get URL**

In [138]:
for x in soup_lever.find_all("a", {"class": "posting-title"}):
    print(x.get("href"))

https://jobs.lever.co/helpscout/b63c26a3-9b78-4293-bac8-8f6b7149a130
https://jobs.lever.co/helpscout/b1200570-fbee-4a97-8210-0a290f1f25c3
https://jobs.lever.co/helpscout/18a5f09e-37d7-458c-b292-8ecc0e090c62
https://jobs.lever.co/helpscout/5371bb13-a068-4d33-879e-6fe0badd6372
https://jobs.lever.co/helpscout/54a68d5c-6ffd-4873-a7a8-3a9a37a65a4c
https://jobs.lever.co/helpscout/8f17211d-1673-4311-8745-fa302618127b
https://jobs.lever.co/helpscout/5b317a4a-5138-4dba-b7f3-27774b878760
https://jobs.lever.co/helpscout/1925c5e8-6790-4baa-b84b-0788f109816b
https://jobs.lever.co/helpscout/da2812fc-a893-45ba-9cba-18283cd6349d


**Combining details and URL into list**

In [139]:
list_jobs_all = []
for x in soup_lever.find_all("a", {"class": "posting-title"}):
    list_job_details = (x.get_text("<h5>")).split("<h5>")
    list_job_details.append(x.get("href"))
    list_jobs_all.append(list_job_details)

In [140]:
list_jobs_all

[['Director of Brand',
  'Remote',
  'Brand',
  'https://jobs.lever.co/helpscout/b63c26a3-9b78-4293-bac8-8f6b7149a130'],
 ['Ops Engineer',
  'Remote',
  'Engineering',
  'https://jobs.lever.co/helpscout/b1200570-fbee-4a97-8210-0a290f1f25c3'],
 ['Senior Java Engineer',
  'Remote',
  'Engineering',
  'https://jobs.lever.co/helpscout/18a5f09e-37d7-458c-b292-8ecc0e090c62'],
 ['Senior JavaScript Engineer',
  'Remote',
  'Engineering',
  'https://jobs.lever.co/helpscout/5371bb13-a068-4d33-879e-6fe0badd6372'],
 ['Future Openings at Help Scout',
  'Remote',
  'Future Openings',
  'https://jobs.lever.co/helpscout/54a68d5c-6ffd-4873-a7a8-3a9a37a65a4c'],
 ['Content Writer',
  'Remote',
  'Marketing',
  'https://jobs.lever.co/helpscout/8f17211d-1673-4311-8745-fa302618127b'],
 ['Front-end Developer',
  'Remote',
  'Marketing',
  'https://jobs.lever.co/helpscout/5b317a4a-5138-4dba-b7f3-27774b878760'],
 ['Sales Manager, Business Development Representatives',
  'Remote',
  'Sales',
  'https://jobs.lev

**Success!! Will trasnform into functions/class once I have fleshed out my pipeline**

### B. Greenhouse

In [161]:
greenhouse.head()

Unnamed: 0,Company,URL,HRTool,OpenPositions-15.05.2021
0,Xapo,https://boards.greenhouse.io/xapo61/,Greenhouse,
1,Blockchain,https://boards.greenhouse.io/blockchain,Greenhouse,
2,Collage,https://boards.greenhouse.io/collagecom/,Greenhouse,
3,GitLab,https://boards.greenhouse.io/gitlab,Greenhouse,
4,Sourcegraph,https://boards.greenhouse.io/sourcegraph91,Greenhouse,


In [219]:
url_greenhouse = greenhouse["URL"][0]

In [220]:
response_greenhouse = requests.get(url_greenhouse)
soup_greenhouse = bs4.BeautifulSoup(response_greenhouse.content, "html.parser")

In [221]:
soup_greenhouse

<!DOCTYPE html>

<html lang="en">
<head prefix="og: http://ogp.me/ns#">
<title>Jobs at Xapo</title>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, minimum-scale=1.0" id="viewport" name="viewport">
<meta content="jben" id="rendered-by"/>
<link href="https://boards.cdn.greenhouse.io/assets/application-0b83f797e71a267c31193781fff2814dfb78746f800ca278a81be45c6664afeb.css" media="all" rel="stylesheet"/>
<link href="https://boards.cdn.greenhouse.io/assets/responsive-9bc84e316c0e62c281f938d9a4217ce6017c348b675e254d1a3c8c82f2d88f9b.css" media="all" rel="stylesheet"/>
<meta content="Xapo" property="og:title"/>
<meta content="Xapo is an international fintech startup on a mission to protect and grow its clients’ life savings.
We’re a fully distributed team that works remotely from 50+ countries around the world. We may come from many different cultures and backgrounds, but it’s our values, our resourcefulness, and our drive that makes us Xapiens

In [222]:
soup_greenhouse.find_all("section")

[<section class="level-0">
 <h3 id="4028838003">Compliance</h3>
 <div class="opening" data-department-4028838003="true" data-office-4019860003="true" department_id="4028838003" office_id="4019860003">
 <a data-mapped="true" href="/xapo61/jobs/4487792003">Anti-Financial Crime Operations and Quality Assurance Manager</a>
 <br/>
 <span class="location">Remote - Anywhere</span>
 </div>
 </section>,
 <section class="level-0">
 <h3 id="4027998003">Engineering</h3>
 <div class="opening" data-department-4027998003="true" data-office-4019860003="true" department_id="4027998003" office_id="4019860003">
 <a data-mapped="true" href="/xapo61/jobs/4409203003">Backend Developer Lead (Remote - Work from Anywhere)</a>
 <br/>
 <span class="location">Remote - Anywhere</span>
 </div><div class="opening" data-department-4027998003="true" data-office-4019860003="true" department_id="4027998003" office_id="4019860003">
 <a data-mapped="true" href="/xapo61/jobs/4433247003">Backend Developer Lead (Remote - Wor

Dom tree structure is a bit more complex than lever. I can gather all the data I need in the \<div class="opening">, as well as a department id number. I will need to make a dictionary of departments + their id numbers and then "translate" these numbers to actual departments 

**Extract all h3 ids to create department dictionary**

In [245]:
translation_dict = {}
for x in soup_greenhouse.find_all("section"):
    if x.find_all("h3"):
        translation_dict[x.find("h3")["id"]] = x.find("h3").get_text()
    else:
        translation_dict[x.find("h4")["id"]] = x.find("h4").get_text()

In [246]:
translation_dict

{'4028838003': 'Compliance',
 '4027998003': 'Engineering',
 '4028003003': 'Data',
 '4027995003': 'Finance ',
 '4031831003': 'Operations',
 '4027999003': 'Product',
 '4028000003': 'Design',
 '4027994003': 'Security',
 '4028001003': 'IT',
 '4030484003': 'Xapo Talent Community '}

**Extract job title, location, department id, and url**

In [272]:
#department id, split to account for listings with multiple dpt ids
soup_greenhouse.find("div", {"class":"opening"})["department_id"].split(",")

['4027998003']

In [255]:
#url
base_url = "https://boards.greenhouse.io"
base_url + soup_greenhouse.find("div", {"class":"opening"}).a["href"]

'https://boards.greenhouse.io/xapo61/jobs/4398685003'

In [261]:
#job title
soup_greenhouse.find("div", {"class":"opening"}).a.get_text()

'Director of Engineering (Remote - Work from Anywhere)'

In [262]:
#location
soup_greenhouse.find("div", {"class":"opening"}).span.get_text()

'Remote - Anywhere'

In [273]:
greenhouse_all_jobs = []
for x in soup_greenhouse.find_all("div", {"class":"opening"}):
    dept_ids = []
    for d_id in x["department_id"].split(","):
        dept_ids.append(translation_dict[d_id])
    job_details = [x.a.get_text(),
                  x.span.get_text(),
                  dept_ids,
                  base_url + x.a["href"]]
    greenhouse_all_jobs.append(job_details)

In [275]:
greenhouse_all_jobs

[['Director of Engineering (Remote - Work from Anywhere)',
  'Remote - Anywhere',
  ['Engineering'],
  'https://boards.greenhouse.io/xapo61/jobs/4398685003'],
 ['Front-End Web Developer (Remote - Work from Anywhere)',
  'Remote - Anywhere',
  ['Engineering'],
  'https://boards.greenhouse.io/xapo61/jobs/4423692003'],
 ['Head of Platform Engineering (Remote - Work from Anywhere)',
  'Remote - Anywhere',
  ['Engineering'],
  'https://boards.greenhouse.io/xapo61/jobs/4398679003'],
 ['Head of QA (Remote - Work from Anywhere)',
  'Remote - Anywhere',
  ['Engineering'],
  'https://boards.greenhouse.io/xapo61/jobs/4433105003'],
 ['Platform Engineer (Remote - Work from Anywhere)',
  'Remote - Anywhere',
  ['Engineering'],
  'https://boards.greenhouse.io/xapo61/jobs/4433231003'],
 ['Senior Backend Developer (Remote - Work from Anywhere)',
  'Remote - Anywhere',
  ['Engineering'],
  'https://boards.greenhouse.io/xapo61/jobs/4436980003'],
 ['SRE/DevOps Engineer (Remote - Work from Anywhere)',
  'R

### C. Workable

In [29]:
workable.loc[workable["Company"]== "SkyVerge", "OpenPositions-15.05.2021"] = "No"

In [39]:
workable

Unnamed: 0,Company,URL,HRTool,OpenPositions-15.05.2021
0,SkyVerge,https://apply.workable.com/skyverge/,Workable,No
1,Semaphore CI,https://apply.workable.com/semaphore/,Workable,
2,Zyte,https://apply.workable.com/zyte/,Workable,
3,Doist,https://apply.workable.com/doist/,Workable,
4,BestSelf,https://apply.workable.com/bestself/,Workable,No
5,Human Made,https://apply.workable.com/humanmade/,Workable,
6,Overleaf,https://apply.workable.com/overleaf/,Workable,


In [79]:
#website is a JS website, need to post to API in order to get response

#url_workable = workable["URL"][1]
# response_workable = requests.get(url_workable)
# soup_workable = bs4.BeautifulSoup(response_workable.content, "html.parser")

**Website is created with JS and API calls. Found XHR files to figure out how to call the API. Will need to manually go through and adjust all the workable URLs for API calls**

In [88]:

url_workable = "https://apply.workable.com/api/v3/accounts/semaphore/jobs"

In [110]:
json_workable = requests.post(url_workable).json()

In [122]:
json_workable

{'total': 5,
 'results': [{'id': 1757305,
   'shortcode': '4987E74E3A',
   'title': 'Director of Content',
   'remote': True,
   'location': {'country': None,
    'countryCode': None,
    'city': None,
    'region': None},
   'state': 'published',
   'isInternal': False,
   'code': '',
   'published': '2021-05-20T00:00:00.000Z',
   'language': 'en',
   'department': ['Marketing'],
   'accountUid': '8b8a8312-fdf7-4a9e-8997-f971ec171f4e',
   'approvalStatus': 'approved'},
  {'id': 1741253,
   'shortcode': '64B0A3AD0F',
   'title': 'Head of Customer Success',
   'remote': True,
   'location': {'country': None,
    'countryCode': None,
    'city': None,
    'region': None},
   'state': 'published',
   'isInternal': False,
   'code': '',
   'published': '2021-05-17T00:00:00.000Z',
   'type': 'full',
   'language': 'en',
   'department': ['Customer Success'],
   'accountUid': '8b8a8312-fdf7-4a9e-8997-f971ec171f4e',
   'approvalStatus': 'approved'},
  {'id': 1704004,
   'shortcode': '69009F8B

In [182]:
api_url = "https://apply.workable.com/api/v2/accounts/semaphore/jobs/"
location = ""
workable_list = []
for entry in json_workable["results"]:
    if entry["remote"]:
            location = "Remote"
    else: 
        location = entry["location"]["country"]
    workable_list.append([entry["title"], location, entry["department"], api_url + entry["shortcode"]])
print(workable_list)

[['Director of Content', 'Remote', ['Marketing'], 'https://apply.workable.com/api/v2/accounts/semaphore/jobs/4987E74E3A'], ['Head of Customer Success', 'Remote', ['Customer Success'], 'https://apply.workable.com/api/v2/accounts/semaphore/jobs/64B0A3AD0F'], ['Marketing Project Manager', 'Remote', ['Marketing'], 'https://apply.workable.com/api/v2/accounts/semaphore/jobs/69009F8BBA'], ['Technical Writer', 'Remote', ['Marketing'], 'https://apply.workable.com/api/v2/accounts/semaphore/jobs/C3C1B62E55'], ['Senior Product Designer', 'Remote', ['Design'], 'https://apply.workable.com/api/v2/accounts/semaphore/jobs/AE0FAFF24C']]


### D. Breezy

In [132]:
breezy

Unnamed: 0,Company,URL,HRTool,OpenPositions-15.05.2021
0,Skillcrush,https://skillcrush.breezy.hr/,Breezy,
1,Modern Tribe,https://modern-tribe.breezy.hr/,Breezy,
2,Skyscrapers,https://skyscrapers.breezy.hr/,Breezy,
3,Time Doctor,https://time-doctor.breezy.hr/,Breezy,
4,Dollar Flight Club,https://dollar-flight-club.breezy.hr/,Breezy,
5,Requis,https://requis.breezy.hr/,Breezy,


In [201]:
url_breezy = breezy["URL"][1]

In [202]:
response_breezy = requests.get(url_breezy)
soup_breezy = bs4.BeautifulSoup(response_breezy.content, "html.parser")

In [203]:
soup_breezy.find_all("li", {"class":"position transition"})

[<li class="position transition"><a href="/p/95cf7d97bdcc-future-openings-with-modern-tribe"><button class="button apply polygot button-right bzyButtonColor">%BUTTON_APPLY%</button><h2>Future Openings with Modern Tribe</h2><ul class="meta"><li class="location"><i class="fa fa-wifi"></i><span> Remote Worldwide</span></li><li class="type"><i class="fa fa-building"></i><span class="polygot">%LABEL_POSITION_TYPE_CONTRACT%</span></li></ul><button class="button apply polygot button-full bzyButtonColor">%BUTTON_APPLY%</button></a></li>,
 <li class="position transition"><a href="/p/2440c010f443-product-owner-with-wordpress-experience"><button class="button apply polygot button-right bzyButtonColor">%BUTTON_APPLY%</button><h2>Product Owner with WordPress Experience</h2><ul class="meta"><li class="location"><i class="fa fa-wifi"></i><span> Remote Worldwide</span></li><li class="type"><i class="fa fa-building"></i><span class="polygot">%LABEL_POSITION_TYPE_CONTRACT%</span></li></ul><button class=

In [204]:
breezy_list = []
dept = ""
for x in soup_breezy.find_all("li", {"class":"position transition"}):
    if x.find("li", {"class":"department"}):
        dept = x.find("li", {"class":"location"}).text
    else:
        dept = "Unknown" 
    breezy_list.append([x.h2.text, x.find("li", {"class":"location"}).text, 
                        dept, "https://modern-tribe.breezy.hr" + x.a.get("href")])
print(breezy_list)   

[['Future Openings with Modern Tribe', ' Remote Worldwide', 'Unknown', 'https://modern-tribe.breezy.hr/p/95cf7d97bdcc-future-openings-with-modern-tribe'], ['Product Owner with WordPress Experience', ' Remote Worldwide', 'Unknown', 'https://modern-tribe.breezy.hr/p/2440c010f443-product-owner-with-wordpress-experience'], ['Quality Assurance (QA) Analyst', ' Remote Worldwide', ' Remote Worldwide', 'https://modern-tribe.breezy.hr/p/b481f0511faf-quality-assurance-qa-analyst'], ['Visual Designer', ' Remote Worldwide', ' Remote Worldwide', 'https://modern-tribe.breezy.hr/p/618cf1fd64c3-visual-designer'], ['WordPress Frontend Engineer', 'Remote - %LABEL_POSITION_TYPE_REMOTE%', 'Remote - %LABEL_POSITION_TYPE_REMOTE%', 'https://modern-tribe.breezy.hr/p/c7f010ede237-wordpress-frontend-engineer']]


### E. Recruitee

In [205]:
recruitee

Unnamed: 0,Company,URL,HRTool,OpenPositions-15.05.2021
0,AULA,https://aulaeducation.recruitee.com/,Recruitee,
1,DockYard,https://dockyardinc1.recruitee.com/,Recruitee,
2,DuckDuckGo,https://duckduckgo.recruitee.com/,Recruitee,
3,Circular,https://trycircular.recruitee.com/,Recruitee,


In [206]:
url_recruitee = recruitee["URL"][0]
response_recruitee = requests.get(url_recruitee)
soup_recruitee = bs4.BeautifulSoup(response_recruitee.content, "html.parser")

In [249]:
recruitee_list = []
for x in soup_recruitee.find_all("a", {"class":"col-md-6"}):
    recruitee_list.append([x.h5.text.strip(), x.find("li", {"class":"location"}).text.strip(), 
                           x.find("div", {"class":"department"}).text.strip(), 
                           "https://aulaeducation.recruitee.com"+ x["href"]])
print(recruitee_list)

[['Senior User Researcher', 'Remote job', 'Product', 'https://aulaeducation.recruitee.com/o/senior-user-researcher'], ['Senior Software Engineer - Full Remote - EdTech Startup', 'Remote job', 'Product', 'https://aulaeducation.recruitee.com/o/senior-software-engineer-full-remote-edtech-startup'], ['Senior Data Analyst - Full Remote', 'Remote job', 'Product', 'https://aulaeducation.recruitee.com/o/senior-data-analyst-full-remote'], ['Learning Design Coach', 'Remote job', 'Learning', 'https://aulaeducation.recruitee.com/o/learning-design-coach']]


## Conclusions

**Achieved making MVP for every hr job platform that I have seen so far**

Next steps
1. Fully outline pipeline so that I can understand what I would like my database to look like and how I will use/connect each step
2. Create classes/functions in an IDE
3. Automate the pipeline so I can schedule scrapes
4. After development of the pipeline basics, move on to MVP of extracting the job posting url