# Scraping Google Career Webpages

## Prerequisites
* selenium
* one of the following drivers, depending on which browser you're using:
  * Firefox: [geckodriver](https://github.com/mozilla/geckodriver/releases/)
  * Chrome/Chromium: [chromedriver](http://chromedriver.chromium.org/)
  
## Useful Tutorials
* https://huilansame.github.io/huilansame.github.io/archivers/sleep-implicitlywait-wait
* https://wangxin1248.github.io/python/2018/09/python3-spider-8.html

In [83]:
class GoogleJob:
    """ Wraps job title, location, minimum/preferred qualifications and responsibilities """
    def __init__(self, title, location, minimum_qual, preferred_qual, responsibilities):
        self.title = title
        self.location = location
        self.minimum_qual = minimum_qual
        self.preferred_qual = preferred_qual
        self.responsibilities = responsibilities
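
A value holder like the class above can also be written as a dataclass, which generates `__init__`, `__repr__`, and `__eq__` automatically. The following is only a sketch: the name `GoogleJobRecord` is hypothetical (chosen so it does not shadow `GoogleJob`), and the field values are shortened excerpts from the sample posting.

```python
from dataclasses import dataclass, asdict


@dataclass
class GoogleJobRecord:
    """Hypothetical dataclass counterpart of GoogleJob (same five fields)."""
    title: str
    location: str
    minimum_qual: str
    preferred_qual: str
    responsibilities: str


# Shortened sample values, for illustration only.
record = GoogleJobRecord(
    title='Front End Software Engineer',
    location='Pittsburgh, PA, USA',
    minimum_qual='BA/BS degree or equivalent practical experience.',
    preferred_qual='Programming experience in GWT.',
    responsibilities='Engage with back-end systems.',
)

# asdict() plays the role that vars(job) plays for the plain class.
print(asdict(record))
```

`asdict(record)` returns the fields in declaration order, so it can be handed to `csv.DictWriter` in the same way as `vars(job)`.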

In [84]:
from selenium.webdriver.common.by import By

def extract(class_name: str):
    """ Returns the text of the first element with the given class name (uses the global `driver`) """
    return driver.find_element(By.CLASS_NAME, class_name).text

## Test on the following page
https://careers.google.com/jobs/results/6163626811654144-front-end-software-engineer/?company=Google&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&q=software&sort_by=relevance

In [85]:
import time

from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as expected
from selenium.webdriver.support.wait import WebDriverWait

In [86]:
url = r'https://careers.google.com/jobs/results/6163626811654144-front-end-software-engineer/?company=Google&company=YouTube&employment_type=FULL_TIME&hl=en_US&jlo=en_US&q=software&sort_by=relevance'

options = Options()
options.add_argument('-headless')

driver = Firefox(executable_path='/opt/firefox/geckodriver', options=options)
wait = WebDriverWait(driver, timeout=10)

driver.get(url)
time.sleep(3)  # Temporary fixed delay; an explicit wait on a page element would be more reliable.
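
The `wait` object created above is never used, and the fixed sleep either wastes time or is too short on a slow connection. One possible replacement, sketched here under assumptions, is an explicit wait on the element being scraped. The helper name `extract_when_visible` is hypothetical, and it is assumed that the element becoming visible is a good enough signal that the job details have rendered.

```python
def extract_when_visible(driver, class_name: str, timeout: float = 10) -> str:
    """Wait until the element with this class is visible, then return its text."""
    # Imported inside the function so the helper can be defined on its own,
    # before a Selenium session exists.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as expected
    from selenium.webdriver.support.wait import WebDriverWait

    locator = (By.CLASS_NAME, class_name)
    # until() polls the condition and raises TimeoutException if it never holds.
    element = WebDriverWait(driver, timeout).until(
        expected.visibility_of_element_located(locator)
    )
    return element.text
```

With this in place, a call such as `extract_when_visible(driver, 'gc-card__title')` could stand in for the sleep followed by `extract('gc-card__title')`.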

In [87]:
title = extract('gc-card__title')
location = extract('gc-job-tags__location')
qualifications = extract('gc-job-qualifications').split('\n\n')
minimum_qual = qualifications[0].replace('Minimum qualifications:\n', '')
preferred_qual = qualifications[1].replace('Preferred qualifications:\n', '')
responsibilities = extract('gc-job-detail__section--responsibilities').replace('Responsibilities\n', '', 1)

job = GoogleJob(title, location, minimum_qual, preferred_qual, responsibilities)
print(vars(job))

driver.quit()

{'title': 'Front End Software Engineer', 'location': 'Pittsburgh, PA, USA', 'minimum_qual': 'BA/BS degree or equivalent practical experience.\n1 year of work experience in software development.\nExperience with server-side web frameworks such as JSP or ASP.Net.\nDevelopment experience in C, C++ or Java and experience designing modular, object-oriented JavaScript.', 'preferred_qual': '4 years of relevant work experience, including web application experience or skills using AJAX, HTML, CSS or JavaScript.\nProgramming experience in GWT.\nExperience with user interface frameworks such as XUL, Flex, AJAX, and XAML.\nKnowledge of user interface design.', 'responsibilities': "Build next-generation web applications with a focus on the client side.\nRedesign UI's, Implement new UI's, and pick up Java as necessary.\nEngage with back-end systems."}
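
Indexing `qualifications[0]` and `qualifications[1]` assumes that both sections are present and appear in that order. A slightly more defensive sketch keys each section by its heading instead. The function name `split_qualifications` is hypothetical, and the layout (a heading line ending in a colon, one item per line, sections separated by a blank line) is assumed from the text handled above.

```python
def split_qualifications(text: str) -> dict:
    """Map each heading line to the list of item lines that follow it."""
    sections = {}
    for block in text.split('\n\n'):
        lines = [line.strip() for line in block.split('\n') if line.strip()]
        if not lines:
            continue  # skip empty blocks, e.g. when the text is empty
        heading, items = lines[0].rstrip(':'), lines[1:]
        sections[heading] = items
    return sections


# Shortened excerpt of the sample posting, for illustration only.
sample = (
    'Minimum qualifications:\n'
    'BA/BS degree or equivalent practical experience.\n'
    '1 year of work experience in software development.\n\n'
    'Preferred qualifications:\n'
    'Programming experience in GWT.\n'
    'Knowledge of user interface design.'
)
parsed = split_qualifications(sample)
print(parsed.get('Minimum qualifications', []))
```

Using `parsed.get(heading, [])` means a posting without a "Preferred qualifications" section yields an empty list rather than an `IndexError`.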


## Write the dict to a CSV File

In [88]:
import csv

csv_file = 'google_jobs.csv'
job_dict = vars(job)

# newline='' lets the csv module handle line endings itself
with open(csv_file, 'w', newline='') as f:
    w = csv.DictWriter(f, job_dict.keys())
    w.writeheader()
    w.writerow(job_dict)
    
print('File written to: ' + csv_file)

File written to: google_jobs.csv
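
The cell above writes a single job. When several pages are scraped, the same approach extends to a list of dicts via `writerows`. The sketch below uses two made-up rows and a temporary file (`google_jobs_demo.csv` is a hypothetical name), and reads the file back to check that multi-line fields such as the qualifications survive the round trip.

```python
import csv
import os
import tempfile

FIELDS = ['title', 'location', 'minimum_qual', 'preferred_qual', 'responsibilities']

# Two made-up rows standing in for scraped jobs; note the embedded newline.
rows = [
    {'title': 'Job A', 'location': 'City X', 'minimum_qual': 'line 1\nline 2',
     'preferred_qual': 'p', 'responsibilities': 'r'},
    {'title': 'Job B', 'location': 'City Y', 'minimum_qual': 'm',
     'preferred_qual': 'p', 'responsibilities': 'r'},
]

path = os.path.join(tempfile.mkdtemp(), 'google_jobs_demo.csv')

# newline='' on both write and read keeps embedded newlines intact.
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

with open(path, newline='', encoding='utf-8') as f:
    loaded = list(csv.DictReader(f))

print(len(loaded), repr(loaded[0]['minimum_qual']))
```

Fixing the column order in `FIELDS` rather than taking it from the first dict keeps the header stable even if a later row is built differently.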


## CSV to pandas DataFrame

In [89]:
import pandas

In [90]:
dat = pandas.read_csv(csv_file)
dat

|   | title | location | minimum_qual | preferred_qual | responsibilities |
|---|---|---|---|---|---|
| 0 | Front End Software Engineer | Pittsburgh, PA, USA | BA/BS degree or equivalent practical experienc... | 4 years of relevant work experience, including... | Build next-generation web applications with a ... |
