# Job Getter
Scrapes Indeed job postings, saves info to excel

## Imports
We need *requests* to get our data, *BeautifulSoup* to parse our HTML, and *selenium* to programmatically nagivate Chrome. Last, we will use *openpyxl* to save everything to an Excel file.

In [63]:
import re
import requests

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

from bs4 import BeautifulSoup
from openpyxl import Workbook

## Helper Functions

In [64]:
def clean_summary(arr):
    """Removes \n char from the html elts 
    with class .summary"""
    new_str = ""
    if len(arr) <= 1:
        clean_str = arr[0].strip('\n').strip()                        
        new_str += clean_str
        return new_str
    for s in arr:        
        try:
            try:
                clean_str = s.strip('\n').strip()                        
                new_str += clean_str
            except TypeError as err:
                pass               
        except:
            new_str += s.contents[0].string
    return new_str

In [65]:
def last_row_index(sheet):
    """receives Excel worksheet and returns an int that represents the 
    last occupied row of the first column of that sheet""" 
    try:
        curr_cell = 0
        for row in sheet.iter_rows(min_row=1,max_col=1):
            for cell in row:                           
                curr_cell = cell.row
        return curr_cell      
    except:
        print("Invalid excel sheet")
        return False

In [66]:
def make_job_req(url):
    """Makes a req, returns False if
    req status-code is not 200"""
    req = requests.get(url)
    if req.status_code != 200:
        print("Invalid HTTP request from given URL")
        return False
    return req

In [67]:
def make_job_divs(req):
    """Takes in a request object from Indeed,
    returns divs containg job data"""
    try:
        soup = BeautifulSoup(req.text, 'html.parser')
        jobs = new_soup.find_all("div", class_="result")
    except:
        print("Invalid request object")
        return False
    
    return jobs

In [68]:
def make_job_list(job_divs):
    """Converts job HTML data to a dict"""
    job_list = []
    for i, j in enumerate(job_divs):
        new_job = {}
        new_job_summary = job_divs[i].find_all('div', class_='summary')[0].contents
        new_job['id'] = i
        new_job['summary'] = clean_summary(new_job_summary)
        new_job['title'] = job_divs[i].find_all('a')[0]['title']
        new_job['link'] = 'https://indeed.com' + job_divs[0].find_all('a')[0]['href']

        job_list.append(new_job)
    
    return job_list

In [69]:
def write_jobs(sheet, jobs, file_name):
    """Writes job data to an Excel file"""
    try:
        last_row = last_row_index(ws) + 1
        for job in jobs:    
            curr_col = 1
            for val in job.values():
                d = ws.cell(row=(job['id']+last_row), column=curr_col)
                d.value = val        
                curr_col+=1        
    except:
        return "There was an err"
    
    wb.save(file_name)
    return True

## Selenium

I am using chrome, but there are other options. Check the Selenium docs.

In [70]:
driver = webdriver.Chrome('./chromedriver')

Simple test to make sure the request worked

In [71]:
driver.get('https://www.indeed.com/')
driver.title

'Job Search | Indeed'

Send the job you want to search as a string to the `input` with `name:"q"` then press *RETURN* to search for jobs.

In [72]:
curr_job = "web developer"

elem = driver.find_element_by_name("q")
elem.send_keys(curr_job)
elem.send_keys(Keys.RETURN)

Another simple test to see the the query parameter contains our job. *Indeed* will use your location for the "where".

In [73]:
driver.current_url

'https://www.indeed.com/jobs?q=web+developer&l=Dumont%2C+NJ'

## Requests
Now that we have our url, lets grab our HTML using `requests`

Check to make sure the req was successful

In [74]:
req_string = driver.current_url

new_req = requests.get("https://www.indeed.com/jobs?q=web+developer&l=Dumont%2C+NJ&start=10")
new_req.status_code

200

## BeautifulSoup
BeautifulSoup provides an awesome API for navigating HTML. This is how we will get the content from our search results and build our data. 

In [75]:
new_soup = BeautifulSoup(new_req.text, 'html.parser')

In [76]:
job_divs = new_soup.find_all("div", class_="result")

Let's take a look at how many jobs we are listed

In [77]:
len(job_divs)

18

## Build the Jobs List
Here we a building a list of jobs were each job in dictionary with the following keys:
- id
    - used to index the job in excel
- summary
    - a short summary of the job
- title
    - job title
- link
    - link to the job description

In [78]:
job_divs[0]

<div class="jobsearch-SerpJobCard unifiedRow row result" data-ci="333252521" data-empn="8292458881371076" data-jk="f299a53109ae3d1e" id="pj_f299a53109ae3d1e">
<style>
.jobcard_logo{margin:6px 0}.jobcard_logo img{width:auto;max-width:80px;max-height:30px}
</style>
<div class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0AwKv_lxr9QxZkKYQR3Tp6JPuvL6qupby2r5pYLIAGzD7VsCbadw5bOKKFYnQStTiderWv7r7DTDMKtEvImHoKmLGw_SL6tUynuqIAgBi62gRo5yP4JWUgNIcAku0S91iinxGKM78saMUam6i63m15pWfJNO97Y3D9_l3PNNlZL68A7DHLdPLYqUZms_l0qpYgdmqDHIW2bE_MhmOAOVnuGPlGN33ro3eD74qhxlDcXI3_2A7Jz02vVAoONMBYhfldjofZ6DxPHykI1-ukhxBT0qGaHbzrHwHZTh2HhTLYaynmkhxQ0GR14MiFkIphuXIWB118d6gkWDnojojsUWKj6Sj4rSPbGrfBAb8X3I03HrianWDUXFWjUYFUnVBn8O3OPRAEQfNQWCOgcfWUVEkBS6wFNkL4mEs8FCKmMRwMqtQgK3vuzsNbjYossVw32A7_-4a60WCIP57rCGn7pMKz5&amp;p=0&amp;fvj=1&amp;vjs=3" id="sja0" onclick="setRefineByCookie([]); sjoc('sja0', 1); convCtr('SJ'); rclk(this,jobmap[0],true,1);" onmousedown

In [82]:
new_job_list = []
for i, job in enumerate(job_divs):
    new_job = {}
    new_job_summary = job.find_all('div', class_='summary')[0].contents
    new_job['id'] = i
    new_job['summary'] = clean_summary(new_job_summary)    
    new_job['title'] = job.find_all('a', class_='jobtitle')[0]['title']
    new_job['link'] = 'https://indeed.com' + job_divs[0].find_all('a')[0]['href']
    
    new_job_list.append(new_job)
    
new_job_list[0]   

{'id': 0,
 'link': 'https://indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0AwKv_lxr9QxZkKYQR3Tp6JPuvL6qupby2r5pYLIAGzD7VsCbadw5bOKKFYnQStTiderWv7r7DTDMKtEvImHoKmLGw_SL6tUynuqIAgBi62gRo5yP4JWUgNIcAku0S91iinxGKM78saMUam6i63m15pWfJNO97Y3D9_l3PNNlZL68A7DHLdPLYqUZms_l0qpYgdmqDHIW2bE_MhmOAOVnuGPlGN33ro3eD74qhxlDcXI3_2A7Jz02vVAoONMBYhfldjofZ6DxPHykI1-ukhxBT0qGaHbzrHwHZTh2HhTLYaynmkhxQ0GR14MiFkIphuXIWB118d6gkWDnojojsUWKj6Sj4rSPbGrfBAb8X3I03HrianWDUXFWjUYFUnVBn8O3OPRAEQfNQWCOgcfWUVEkBS6wFNkL4mEs8FCKmMRwMqtQgK3vuzsNbjYossVw32A7_-4a60WCIP57rCGn7pMKz5&p=0&fvj=1&vjs=3',
 'summary': '',
 'title': 'ReactJS Developer'}

## openpyxl

In [83]:
wb = Workbook()
ws = wb.active

Create a sheet with the same title as the current job search term

In [84]:
ws.title = curr_job
wb.sheetnames

['web developer']

In [85]:
last_row = last_row_index(ws) +1
for job in new_job_list:    
    curr_col = 1
    for val in job.values():
        d = ws.cell(row=(job['id']+last_row), column=curr_col)
        d.value = val        
        curr_col+=1

In [86]:
wb.save('jobs.xlsx')

## Selenium again to get the next page

In [87]:
elem = driver.find_element_by_class_name("np")
elem.click()

In [88]:
# close pop up if it exist
try:
    close_pop_up = driver.find_element_by_id('popover-close-link')
    close_pop_up.click()
except:
    print('Pop up not there')

Pop up not there


Interesting. Looks like we may not need selenium anymore after all -- we can just modify the query parameter for *start*

In [89]:
driver.current_url

'https://www.indeed.com/jobs?q=web+developer&l=Dumont%2C+NJ&start=10'

In [90]:
new_req = make_job_req(driver.current_url)

In [91]:
job_divs = make_job_divs(new_req)
job_divs[0]

<div class="jobsearch-SerpJobCard unifiedRow row result" data-ci="333252521" data-empn="8292458881371076" data-jk="f299a53109ae3d1e" id="pj_f299a53109ae3d1e">
<style>
.jobcard_logo{margin:6px 0}.jobcard_logo img{width:auto;max-width:80px;max-height:30px}
</style>
<div class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0AwKv_lxr9QxZkKYQR3Tp6JPuvL6qupby2r5pYLIAGzD7VsCbadw5bOKKFYnQStTiderWv7r7DTDMKtEvImHoKmLGw_SL6tUynuqIAgBi62gRo5yP4JWUgNIcAku0S91iinxGKM78saMUam6i63m15pWfJNO97Y3D9_l3PNNlZL68A7DHLdPLYqUZms_l0qpYgdmqDHIW2bE_MhmOAOVnuGPlGN33ro3eD74qhxlDcXI3_2A7Jz02vVAoONMBYhfldjofZ6DxPHykI1-ukhxBT0qGaHbzrHwHZTh2HhTLYaynmkhxQ0GR14MiFkIphuXIWB118d6gkWDnojojsUWKj6Sj4rSPbGrfBAb8X3I03HrianWDUXFWjUYFUnVBn8O3OPRAEQfNQWCOgcfWUVEkBS6wFNkL4mEs8FCKmMRwMqtQgK3vuzsNbjYossVw32A7_-4a60WCIP57rCGn7pMKz5&amp;p=0&amp;fvj=1&amp;vjs=3" id="sja0" onclick="setRefineByCookie([]); sjoc('sja0', 1); convCtr('SJ'); rclk(this,jobmap[0],true,1);" onmousedown

In [92]:
another_job_list = make_job_list(job_divs)

KeyError: 'title'

In [56]:
write_jobs(ws, jobs=another_job_list, file_name="jobs.xlsx")

True

In [57]:
another_job_list

[{'id': 0,
  'link': 'https://indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0DnIQaT9_ideK20aGE-Gf5L-4mEV8Q0k7vqCcpbdP_vWWm99hxL5_fHY4xpZMP0Uhx5zsIIHC2JpM0ZxkCrbuPTXhftT08i6R8KTyzH84kb8ZTW1mFYdSyTp3t2dAwbTXA_-AAsx2CykH2HyxiFCiWdWS1f3NhKDL0fnzF6qqFEIOWnS9NPVKIl7OBxAwNYUvRpvl0TOLzLEHa24kK3X88ub5_07pNNaafkzbORsSI4vaiYcJNz2oCG-3ov0p4ZdL1uNlK_7XXdXOV6gKQOHVvk5MWLHAvHluOt2gXiCBcOjjrUdNiYX7egvyC8rFzUWThl6MHvNY_rxlAj_GLS0IQHNCT6ihfSOzInieqkDvSF5pC5HcLW2S0bE2W2xheOQP7p2A5qFHOMMb9WWtVoWAG_7pNup5wxBCpLoPfRlXjstyP8e5R0UozbgW2cGHsZXVTb0VmlDWPy5HbZAtQSpr7o_1szV9zig-iJ6jUR5dxfRHfkh6vSRFoHFSITvjZ_EwYWmBKxdgLge9pgV8D0uRd5&vjs=3&p=0&fvj=0',
  'summary': 'Collaborate with back-enddevelopersto design and build a robust API. Michael Kors is always interested in hearing from talented, globally-minded individuals...',
  'title': 'Front-end Developer'},
 {'id': 1,
  'link': 'https://indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0DnIQaT9_ideK20aGE-Gf5L-4mEV8Q0k7vqCcpbdP_vWWm99hxL5_fHY4xpZMP0Uhx5zsIIHC2JpM0ZxkCrbuPTXhftT08i6R