# Scraping Google Career Webpages

## Prerequisites
* selenium
* one of the following (depending on which browser you're using)
  * firefox: [geckodriver](https://github.com/mozilla/geckodriver/releases/)
  * chrome/chromium: [chromedriver](http://chromedriver.chromium.org/)
  
## Useful Tutorials
* https://huilansame.github.io/huilansame.github.io/archivers/sleep-implicitlywait-wait
* https://wangxin1248.github.io/python/2018/09/python3-spider-8.html

In [1]:
class GoogleJob:
    """ Wraps job title, location, minimum/preferred qualifications and responsibilities """
    def __init__(self, title, location, minimum_qual, preferred_qual, responsibilities):
        self.title = title
        self.location = location
        self.minimum_qual = minimum_qual
        self.preferred_qual = preferred_qual
        self.responsibilities = responsibilities
        
    def __repr__(self):
        return '{} @ {}'.format(self.title, self.location)

## 1. Scraping a single page into a GoogleJob object
Use `scrape_job()` provided below on single job with its url.

Example target: https://careers.google.com/jobs/results/6163626811654144-front-end-software-engineer/?company=Google&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&q=software&sort_by=relevance

In [2]:
from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import selenium.webdriver.support.ui as ui

import pandas
import time
import csv

In [3]:
def _extract(class_name: str):
    """ Extracts the specified element by class name 
    :return: readable text in the element.
    """
    return driver.find_element_by_class_name(class_name).text

In [4]:
def scrape_job(url: str, wait: WebDriverWait):
    """ Scrape the job info from the specified Url. A broswer driver MUST be initialized beforehand.
    :param url: the url of a detailed google job page.
    :param wait: contains timeout.
    :return: a GoogleJob object.
    """
    driver.get(url)
    
    # Wait until all required elements are generated.
    wait.until(ec.presence_of_element_located((By.CLASS_NAME, 'gc-card__title')))
    wait.until(ec.presence_of_element_located((By.CLASS_NAME, 'gc-job-tags__location')))
    wait.until(ec.presence_of_element_located((By.CLASS_NAME, 'gc-job-qualifications')))
    wait.until(ec.presence_of_element_located((By.CLASS_NAME, 'gc-job-detail__section--responsibilities')))
    
    # Extract job information.
    title = _extract('gc-card__title')
    location = _extract('gc-job-tags__location')
    qualifications = _extract('gc-job-qualifications').split('\n\n')
    minimum_qual = qualifications[0].replace('Minimum qualifications:\n', '')
    preferred_qual = qualifications[1].replace('Preferred qualifications:\n', '')
    responsibilities = _extract('gc-job-detail__section--responsibilities').replace('Responsibilities\n', '', 1)
    
    return GoogleJob(title, location, minimum_qual, preferred_qual, responsibilities)

In [5]:
options = Options()
options.add_argument('-headless')
driver = Firefox(executable_path='/opt/firefox/geckodriver', options=options)

wait = WebDriverWait(driver, timeout=10)
job = scrape_job(r'https://careers.google.com/jobs/results/6163626811654144-front-end-software-engineer/?company=Google&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&q=software&sort_by=relevance', wait)
print(vars(job))

driver.quit()

{'title': 'Front End Software Engineer', 'location': 'Pittsburgh, PA, USA', 'minimum_qual': 'BA/BS degree or equivalent practical experience.\n1 year of work experience in software development.\nExperience with server-side web frameworks such as JSP or ASP.Net.\nDevelopment experience in C, C++ or Java and experience designing modular, object-oriented JavaScript.', 'preferred_qual': '4 years of relevant work experience, including web application experience or skills using AJAX, HTML, CSS or JavaScript.\nProgramming experience in GWT.\nExperience with user interface frameworks such as XUL, Flex, AJAX, and XAML.\nKnowledge of user interface design.', 'responsibilities': "Build next-generation web applications with a focus on the client side.\nRedesign UI's, Implement new UI's, and pick up Java as necessary.\nEngage with back-end systems."}


## Write the dict representation of a GoogleJob object to CSV

In [6]:
csv_file = 'google_jobs.csv'
job_dict = vars(job)

with open(csv_file, 'w') as f:
    w = csv.DictWriter(f, job_dict.keys())
    w.writeheader()
    w.writerow(job_dict)
    
print('File written: ' + csv_file)

File written: google_jobs.csv


## Csv to Pandas DataFrame

In [7]:
dat = pandas.read_csv(csv_file)
dat

Unnamed: 0,title,location,minimum_qual,preferred_qual,responsibilities
0,Front End Software Engineer,"Pittsburgh, PA, USA",BA/BS degree or equivalent practical experienc...,"4 years of relevant work experience, including...",Build next-generation web applications with a ...


---

## 2. Search & Scrape All Relevant Jobs
Use `scrape_jobs(keyword, wait)` provided below on all jobs relevant to a specific keyword.

Example: all jobs related to the keyword `software`.

In [18]:
def scrape_jobs(keyword: str, wait: WebDriverWait):
    """ Scrape info of all jobs related to the specified keyword
    :param keyword: google job search keyword.
    :param wait: contains timeout.
    :return: a list of GoogleJob objects.
    """
    # Open Google job search page.
    driver.get(r'https://careers.google.com/jobs/results/?company=Google&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&q=software&sort_by=relevance');
    
    # Type 'software' as the keyword in the searchbox and press RETURN.
    searchbox_name = 'q'
    driver.find_element_by_name(searchbox_name).send_keys(keyword)
    driver.find_element_by_name(searchbox_name).send_keys(Keys.RETURN)
    
    # Loop until there's no `next` hyperlink.
    urls = []
    
    while True:
        try:
            wait.until(ec.presence_of_element_located((By.ID, 'search-results')))
            wait.until(ec.presence_of_element_located((By.XPATH, "//a[@data-gtm-ref='job-results-card']")))
        
            result_pane = driver.find_element_by_id('search-results')
            cards = result_pane.find_elements_by_xpath("//a[@data-gtm-ref='job-results-card']")
        
            urls += [card.get_attribute('href') for card in cards]
        
            # If `next` cannot be found after `timeout` seconds, it will throw a TimeoutException
            # then we can break the loop.
            wait.until(ec.presence_of_element_located((By.XPATH, "//a[@data-gtm-ref='search-results-next-click']")))
        
            # Click on `next`
            driver.find_element_by_xpath("//a[@data-gtm-ref='search-results-next-click']").send_keys(Keys.RETURN)
        except Exception as e:
            print(e)
            break
    
    # For each url in urls, scrape_job() and get a GoogleJob object,
    # hence this could take a while.
    return [scrape_job(url, wait) for url in urls]

In [19]:
options = Options()
options.add_argument('-headless')
driver = Firefox(executable_path='/opt/firefox/geckodriver', options=options)

wait = WebDriverWait(driver, timeout=10)
jobs = scrape_jobs('software', wait)
driver.quit()

print('Collected ' + str(len(jobs)) + ' jobs.')

Message: The element reference of <a class="gc-card" href="/jobs/results/6696990677073920-software-engineer-android/?company=Google&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&page=2&q=softwaresoftware&sort_by=relevance"> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed

140
Collected 140 jobs.


## Write the info of all jobs to CSV

In [20]:
csv_file = 'google_jobs.csv'

with open(csv_file, 'w') as f:
    w = csv.DictWriter(f, vars(jobs[0]).keys())
    w.writeheader()
    
    for job in jobs:
        w.writerow(vars(job))
    
print('File written: ' + csv_file)

File written: google_jobs.csv


In [22]:
dat = pandas.read_csv(csv_file)
dat

Unnamed: 0,title,location,minimum_qual,preferred_qual,responsibilities
0,Front End Software Engineer,"Pittsburgh, PA, USA",BA/BS degree or equivalent practical experienc...,"4 years of relevant work experience, including...",Build next-generation web applications with a ...
1,Front End Software Engineer,"Pittsburgh, PA, USA",BS degree in Electrical Engineering or Compute...,MS degree in Electrical Engineering or Compute...,"Design, implement and launch complex HTML5 vid..."
2,Front End Software Engineer,"Pittsburgh, PA, USA",BA/BS in Computer Science or related technical...,Experience with one or more general purpose pr...,"Design, implement and launch highly-visible, p..."
3,Network Test Engineer,"Sunnyvale, CA, USA",BA/BS degree in Computer Science or equivalent...,"Master's degree or PhD, or equivalent practica...","Design, develop and execute test plans for net..."
4,"Software Engineer, Front End Development",Singapore,"Bachelor's degree in a technical field, or equ...","Development experience in designing modular, o...",Build next-generation web applications with a ...
5,Front End Software Engineer,"Pittsburgh, PA, USA","Bachelor's degree in Computer Science, Electri...",Experience working with hardware designers/rea...,Develop the whole software stack for consumer ...
6,Front End Software Engineer,"Pittsburgh, PA, USA","Bachelor's degree in Computer Science, Mathema...",Experience with machine learning libraries (e....,Work with data scientists to productionize exp...
7,"Wireless Software Engineer, Google Home","Taipei, Taiwan",Master's degree in Electrical Engineering or C...,PhD degree.\nExperience with wireless protocol...,"Participate in architecting, developing, testi..."
8,"Web Solutions Engineer, Google Cloud","Sunnyvale, CA, USA","BS degree in Computer Science, Math, or relate...",Experience with Google Cloud Platform and its ...,Develop frontend and backend systems from star...
9,Front End Software Engineer,"Pittsburgh, PA, USA",Bachelor's degree or equivalent practical expe...,Experience in providing infrastructure solutio...,Identify and problem solve complex systemic ch...
