# Scraping Google Career Webpages

## Prerequisites
* selenium
* one of the following (depending on which browser you're using)
  * firefox: [geckodriver](https://github.com/mozilla/geckodriver/releases/)
  * chrome/chromium: [chromedriver](http://chromedriver.chromium.org/)
  
## Useful Tutorials
* https://huilansame.github.io/huilansame.github.io/archivers/sleep-implicitlywait-wait
* https://wangxin1248.github.io/python/2018/09/python3-spider-8.html

## 1. Scraping a single page into a GoogleJob object
Use `scrape_job()`, defined below, to scrape a single job page given its URL.

Example target: https://careers.google.com/jobs/results/6163626811654144-front-end-software-engineer/?company=Google&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&q=software&sort_by=relevance

In [1]:
from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.common.exceptions import TimeoutException
import selenium.webdriver.support.ui as ui

import pandas
import time
import csv

In [2]:
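def _extract(class_name: str):
    """ Return the visible text of the first element with the given class name.
    Relies on the global `driver`, which must be initialized beforehand.
    :param class_name: a single CSS class name.
    :return: readable text in the element.
    """
    return driver.find_element(By.CLASS_NAME, class_name).text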

In [3]:
def scrape_job(url: str, wait: WebDriverWait, retry=3):
    """ Scrape the job info from the specified URL. A browser driver MUST be initialized beforehand.
    :param url: the url of a detailed google job page.
    :param wait: contains timeout.
    :param retry: times to retry.
    :return: a dict wrapping all info.
    """
    for i in range(0, retry):
        driver.get(url)
    
        # Wait until all required elements are generated.
        try:
            wait.until(ec.presence_of_element_located((By.CLASS_NAME, 'gc-job-detail__header')))
            wait.until(ec.presence_of_element_located((By.CLASS_NAME, '_1n-z _6hy- _1kdd')))
            wait.until(ec.presence_of_element_located((By.CLASS_NAME, 'gc-job-detail__section--qualifications')))
            wait.until(ec.presence_of_element_located((By.CLASS_NAME, 'gc-job-detail__section--responsibilities')))
            
            # Extract job information.
            title = driver.find_element(By.CLASS_NAME, 'gc-job-detail__header') \
                            .find_element(By.CLASS_NAME, 'gc-heading--beta').text
            location = driver.find_element(By.CLASS_NAME, 'gc-job-detail__tags') \
                            .find_element(By.CLASS_NAME, 'gc-job-tags__location').text
            qualifications = _extract('gc-job-detail__section--qualifications').split('\n\n')
            minimum_qual = qualifications[0].replace('Minimum qualifications:\n', '').replace('Qualifications\n', '', 1)
            preferred_qual = qualifications[1].replace('Preferred qualifications:\n', '') if len(qualifications) > 1 else ''
            responsibilities = _extract('gc-job-detail__section--responsibilities').replace('Responsibilities\n', '', 1)
            
            return {
                'title': title,
                'loc': location,
                'minimum_qual': minimum_qual,
                'preferred_qual': preferred_qual,
                'resp': responsibilities
            }
        except TimeoutException:
            return None
        except Exception:
            continue
    
    # If all retries have failed, return None.
    return None
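
The text handling inside `scrape_job()` can be tried without a browser. Below is a minimal sketch that applies the same `split`/`replace` steps to a made-up string. The helper name `split_qualifications` and the sample text are hypothetical; the sketch assumes the qualifications section has the shape the scraper expects (a minimum block and a preferred block separated by a blank line).

```python
def split_qualifications(section_text: str):
    """Split the qualifications section text into (minimum, preferred)."""
    parts = section_text.split('\n\n')
    minimum = (parts[0]
               .replace('Minimum qualifications:\n', '')
               .replace('Qualifications\n', '', 1))
    # Some postings may have no preferred block; fall back to an empty string.
    preferred = parts[1].replace('Preferred qualifications:\n', '') if len(parts) > 1 else ''
    return minimum, preferred


# Hypothetical sample, shaped like the text the scraper expects.
sample = (
    'Qualifications\n'
    'Minimum qualifications:\n'
    'BA/BS degree or equivalent practical experience.\n'
    '\n'
    'Preferred qualifications:\n'
    '4 years of relevant work experience.'
)

minimum, preferred = split_qualifications(sample)
print(minimum)
print(preferred)
```

Keeping this step in a separate function makes it possible to check the parsing against saved page text without starting Firefox.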

In [4]:
options = Options()
options.add_argument('-headless')
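# Note: `executable_path` was deprecated in Selenium 4; newer releases expect a `Service` object instead.
driver = Firefox(executable_path='/opt/firefox/geckodriver', options=options)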

wait = WebDriverWait(driver, timeout=10)
job = scrape_job(r'https://careers.google.com/jobs/results/4890468019273728-sensor-prototyping-engineer-consumer-hardware/?company=Google&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&page=48&q=software&sort_by=relevance', wait)

if job is not None:
    print(job)

driver.quit()

{'title': 'Sensor Prototyping Engineer, Consumer Hardware', 'loc': 'Taipei, Taiwan', 'minimum_qual': 'BA/BS degree in Electrical Engineering, Physics, Mechanical Engineering, Computer Science or related field, or equivalent practical experience.\nExperience in electronics system prototyping that covers component selection, schematic design, Printed Circuit Board (PCB) layout design, PCB bring up and Firmware (FW) development.\nExperience in microcontroller selection and its various digital communication protocols and interfaces (e.g USB, RS-232, GPIO, SPI, I2C and UART).', 'preferred_qual': "Master's degree in Electrical Engineering, Physics, Mechanical Engineering, Computer Science or related field.\nKnowledge of inertial, magnetic, optical and/or environmental sensors.\nKnowledge of flex/PCB surface mount and assembly process.\nExperience with scripting languages (e.g Python/MATLAB) and software development languages (e.g C/C++).", 'resp': 'Conduct sensor system prototyping, includin

---

## 2. Search & Scrape All Relevant Jobs
Use `scrape_jobs(keyword, wait, urls)`, defined below, to scrape every job that matches a specific keyword.

Example: all jobs related to the keyword `software`.
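
The search page is addressed through query parameters, the same ones visible in the example URLs in this notebook. As a rough sketch, the URL for a keyword and page can be assembled with `urllib.parse.urlencode`, which also escapes keywords containing spaces. `BASE_URL` and `build_search_url` are hypothetical names introduced here for illustration; the parameter values are copied from the example URLs.

```python
from urllib.parse import urlencode

BASE_URL = 'https://careers.google.com/jobs/results/'


def build_search_url(keyword: str, page: int = 1) -> str:
    """Build a search-results URL with a properly encoded query string."""
    # A list of pairs (not a dict) because `company` appears twice.
    params = [
        ('company', 'Google'),
        ('company', 'YouTube'),
        ('employment_type', 'FULL_TIME'),
        ('hl', 'en_US'),
        ('jlo', 'en_US'),
        ('page', page),
        ('q', keyword),
        ('sort_by', 'relevance'),
    ]
    return BASE_URL + '?' + urlencode(params)


print(build_search_url('software', page=86))
print(build_search_url('machine learning'))
```

Plain string formatting works for a single-word keyword such as `software`; encoding only matters once the keyword contains spaces or reserved characters.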

In [5]:
def _collect_urls(wait: WebDriverWait, urls: list, page_count, url_count):
    """ Collect all urls we have to scrape """
    for i in range(0, page_count):
        try:
            time.sleep(2)  # Give the page 2 seconds to load; the lookups below raise errors otherwise.
            
            wait.until(ec.presence_of_element_located((By.ID, 'search-results')))
            wait.until(ec.presence_of_element_located((By.XPATH, "//a[@data-gtm-ref='job-results-card']")))
            result_pane = driver.find_element(By.ID, 'search-results')
            # The leading '.' keeps the XPath search inside `result_pane`.
            cards = result_pane.find_elements(By.XPATH, ".//a[@data-gtm-ref='job-results-card']")
            
            urls += [card.get_attribute('href') for card in cards]
            print('\rCollecting urls... {}/{}'.format(len(urls), url_count), end='')
            
            # If `next` cannot be found after `timeout` seconds, it will throw 
            # a TimeoutException, then we can break the loop.
            wait.until(ec.presence_of_element_located((By.XPATH, "//a[@data-gtm-ref='search-results-next-click']")))
            driver.find_element(By.XPATH, "//a[@data-gtm-ref='search-results-next-click']").send_keys(Keys.RETURN)
        except Exception as e:
            print(e)
            break
    print()

In [17]:
def scrape_jobs(keyword: str, wait: WebDriverWait, urls: list, start=1):
    """ Scrape info of all jobs related to the specified keyword
    :param keyword: google job search keyword.
    :param wait: contains timeout.
    :param urls: urls cache.
    :param start: the number of the record to start scraping.
    """
    # Example: start = 1713 gives starting_page = 86 and starting_card_no = 13.
    # Subtracting 1 first keeps the last record of a page (e.g. start = 20) on that page.
    items_per_page = 20
    starting_page = (start - 1) // items_per_page + 1
    starting_card_no = start - (starting_page - 1) * items_per_page
    
    # Open Google job search page.
    driver.get(r'https://careers.google.com/jobs/results/?company=Google&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&page={}&q={}&sort_by=relevance'.format(starting_page, keyword))
    
    # Workaround: the match count could not be read from any search result page,
    # but it can be read from a job's detail page, so open the first result.
    wait.until(ec.presence_of_element_located((By.ID, 'search-results')))
    wait.until(ec.presence_of_element_located((By.XPATH, "//a[@data-gtm-ref='job-results-card']")))
    result_pane = driver.find_element(By.ID, 'search-results')
    cards = result_pane.find_elements(By.XPATH, ".//a[@data-gtm-ref='job-results-card']")
    driver.get(cards[0].get_attribute('href'))
    
    # Get `x` jobs matched and calculate how many pages we have to loop through.
    url_count_class_name = 'gc-jobs-matched__count--active'
    wait.until(ec.presence_of_element_located((By.CLASS_NAME, url_count_class_name)))
    url_count = int(driver.find_element(By.CLASS_NAME, url_count_class_name).text)
    page_count = (url_count // items_per_page) + 1
    driver.back()
    
    # Loop until there's no `next` hyperlink.
    print('Collecting urls...', end='')
    
    if len(urls) != url_count:
        urls.clear()
        _collect_urls(wait, urls, page_count, url_count)
    
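    # newline='' is what the csv module expects, so it can write line endings itself.
    with open('google_jobs.csv', 'w', newline='', encoding='utf-8') as f:
        w = csv.DictWriter(f, fieldnames=['title', 'loc', 'minimum_qual', 'preferred_qual', 'resp'])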
        w.writeheader()
        
        for i in range(start - 1, len(urls)):
            print('\rProcessing ({}/{}): {}'.format(i, len(urls), urls[i]), end='')
            job = scrape_job(urls[i], wait)
            
            if job is not None:
                w.writerow(job)
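
The page arithmetic used by `scrape_jobs()` can be checked in isolation. The sketch below maps a 1-based record number to its page and position, and computes the page count with ceiling division so that an exact multiple of the page size does not produce an extra page. `locate` and `page_count` are hypothetical helper names; the page size of 20 is the one the notebook assumes.

```python
ITEMS_PER_PAGE = 20  # same page size the notebook assumes


def locate(record_no: int, items_per_page: int = ITEMS_PER_PAGE):
    """Map a 1-based record number to (page, position on that page), both 1-based."""
    page = (record_no - 1) // items_per_page + 1
    position = (record_no - 1) % items_per_page + 1
    return page, position


def page_count(total: int, items_per_page: int = ITEMS_PER_PAGE) -> int:
    """Number of pages needed to hold `total` results (ceiling division)."""
    return -(-total // items_per_page)


print(locate(1713))      # (86, 13)
print(locate(20))        # (1, 20)
print(page_count(1716))  # 86
print(page_count(40))    # 2
```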

In [None]:
# We'll cache all urls we have to scrape later in this list.
urls = []

In [18]:
options = Options()
options.add_argument('-headless')
driver = Firefox(executable_path='/opt/firefox/geckodriver', options=options)

wait = WebDriverWait(driver, timeout=10)
scrape_jobs('software', wait, urls, start=1)
driver.quit()

Processing (1715/1716): https://careers.google.com/jobs/results/4794570533699584-account-representative-search-ads-360-english-portuguese/?company=Google&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&page=86&q=software&sort_by=relevance

## CSV to pandas DataFrame
Download the CSV [here](https://drive.google.com/file/d/1IPqyHeLukbMcabIBlejGucthB21lQT8j/view).

In [21]:
dat = pandas.read_csv('google_jobs.csv')
dat.head(10)

| Unnamed: 0 | title | loc | minimum_qual | preferred_qual | resp |
|---|---|---|---|---|---|
| 0 | Front End Software Engineer | Pittsburgh, PA, USA | BA/BS degree or equivalent practical experienc... | 4 years of relevant work experience, including... | Build next-generation web applications with a ... |
| 1 | Software Engineer, HTML5 Video, Google Cloud P... | Sunnyvale, CA, USA | BS degree in Electrical Engineering or Compute... | MS degree in Electrical Engineering or Compute... | Design, implement and launch complex HTML5 vid... |
| 2 | Front End Software Engineer, YouTube | San Bruno, CA, USA | BA/BS in Computer Science or related technical... | Experience with one or more general purpose pr... | Design, implement and launch highly-visible, p... |
| 3 | Software Engineer, Google Home | Shanghai, China | Bachelor's degree in Computer Science, Electri... | Experience working with hardware designers/rea... | Develop the whole software stack for consumer ... |
| 4 | Software Engineer, Front End Development | Singapore | Bachelor's degree in a technical field, or equ... | Development experience in designing modular, o... | Build next-generation web applications with a ... |
| 5 | Wireless Software Engineer, Google Home | Taipei, Taiwan | Master's degree in Electrical Engineering or C... | PhD degree.\nExperience with wireless protocol... | Participate in architecting, developing, testi... |
| 6 | Network Engineer, Software and Automation | Sydney NSW, Australia | BA/BS in Computer Science or related field or ... | Master's degree or PhD in Computer Science or ... | Engage in and improve the lifecycle of service... |
| 7 | Software Engineer, Cloud SQL | Sunnyvale, CA, USA | BS degree in Computer Science, similar technic... | 10 years of relevant work experience in softwa... | Work alongside the Technical Lead to drive lon... |
| 8 | Software Engineer, Infrastructure (English) | Tel Aviv-Yafo, Israel | Bachelor's degree in Computer Science, a relat... | Master’s degree.\nExperience with Unix/Linux o... | Design, develop, test, deploy, maintain and im... |
| 9 | Software Engineer | Seoul, South Korea | Bachelor's degree in Computer Science, similar... | Master’s degree or PhD in Engineering, Compute... | Design, develop, test, deploy, maintain and im... |
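
Several qualification fields in the output above contain embedded newlines (`\n`), and the locations contain commas. The following is a small standard-library sketch, using a made-up record and an in-memory buffer instead of `google_jobs.csv`, showing that such fields should survive a round trip through `csv.DictWriter` and `csv.DictReader` because the writer quotes them. The record contents are hypothetical.

```python
import csv
import io

FIELDS = ['title', 'loc', 'minimum_qual', 'preferred_qual', 'resp']

# Hypothetical record with a multi-line field, shaped like a scraped job.
job = {
    'title': 'Software Engineer',
    'loc': 'Taipei, Taiwan',
    'minimum_qual': 'BA/BS degree.\nExperience with Python.',
    'preferred_qual': '',
    'resp': 'Design, develop and test software.',
}

# newline='' lets the csv module control line endings itself.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(job)

# Read the data back and confirm the multi-line field is intact.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]['minimum_qual'])
```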
