This script demonstrates how to use Selenium webdriver to automatically download and parse search results from indeed.com.
I will be using Firefox as my browser of choice.

First we need to install BeautifulSoup and Selenium libraries. We need BeautifulSoup to extract text enclosed by certain HTML tags.

In [3]:
import sys
!{sys.executable} -m pip install selenium
!{sys.executable} -m pip install bs4



Now we need to download the latest version of geckodriver so we can control Firefox windows from Python.

It can be downloaded from here:
https://github.com/mozilla/geckodriver/releases
    
Unzip the file and remember its location. In this example I am using "c:/geckodriver/geckodriver.exe"

In [4]:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns

Before we start the webdriver, let us specify some parameters that will make monitoring and debugging easier.

<b>options.headless = False</b>

Opens a Firefox window that is controlled by Python. We can use it like a regular Firefox window: scroll, click on links, view the page source, inspect elements, etc.

<b>pageLoadStrategy = 'normal'</b>

Default option. driver.get() returns only after the page has fully loaded; use 'eager' instead if you want to interact with a partially loaded page.

<b>driver.set_page_load_timeout(10)</b>

Set page load timeout to 10 seconds.

In [7]:
print('Starting Selenium Firefox Webdriver...')

from selenium import webdriver


from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.options import Options


caps = DesiredCapabilities().FIREFOX
caps['pageLoadStrategy'] = 'normal'

options = Options()
options.headless = False


driver = webdriver.Firefox(executable_path = 'c:/geckodriver/geckodriver.exe', options=options, capabilities=caps)
driver.set_page_load_timeout(10)
print('Done...')

Starting Selenium Firefox Webdriver...
Done...


We will be looking for Data Science jobs in Texas.

To do so, open a new browser window, go to indeed.com, enter the job title (Data Scientist) and the location (Texas), and click Find jobs.
The address bar should change to the following string:

https://www.indeed.com/jobs?q=data+scientist&l=Texas
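
The same address can also be assembled from the search terms instead of being copied from the address bar. The sketch below uses only the standard library; the helper name build_search_url is my own, and the q and l parameter names are simply read off the URL above.

```python
from urllib.parse import urlencode

def build_search_url(query, location, base='https://www.indeed.com/jobs'):
    """Build a search URL with the job title in 'q' and the location in 'l'."""
    # urlencode escapes spaces as '+' and other special characters as %XX
    params = {'q': query, 'l': location}
    return base + '?' + urlencode(params)

url = build_search_url('data scientist', 'Texas')
print(url)
```

This makes it easy to rerun the rest of the notebook for a different title or state by changing two arguments.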

In [8]:
url = 'https://www.indeed.com/jobs?q=data+scientist&l=Texas'

In [9]:
# Open url
driver.get(url)

Indeed is going to ask you for your location/registration. Politely decline.

In [133]:
try:
    # Find the "No, thanks" link and click on it.
    no_thanks = driver.find_element_by_link_text('No, thanks')
    no_thanks.click()
    print('Closing the pop-up asking for location/registration')
except Exception:
    # No pop-up was shown (or it could not be clicked), so carry on.
    pass

Great! We are remotely controlling a Firefox window from Python.
Now it is time to figure out the structure of the page. We see that each job listing is enclosed in a box.
Right-click the box and select "Inspect Element".

<img src="job_container.png">

Each box is an HTML container that has a multi-level structure:
    
Top level: <div id="..." class = "jobsearch-SerpJobCard unifiedRow row result clickcard" data-jk = "..." ...>

To locate each box we use find_elements_by_xpath method:

We are looking for a <b>div</b> container that has a <b>class</b> attribute with the value of <b>"jobsearch-SerpJobCard unifiedRow row result clickcard"</b>

In [14]:
xp = "//div[contains(@class, 'jobsearch-SerpJobCard unifiedRow row result clickcard')]"
job_elements = driver.find_elements_by_xpath(xp)

In [15]:
len(job_elements)

17
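
One thing to keep in mind: contains(@class, '...') is a plain substring test on the whole class attribute, so the XPath above only matches while the five class names appear in exactly that order. The sketch below illustrates the difference between matching the full string and matching a single class name. It does not use Selenium; it parses a made-up HTML snippet with the standard html.parser module, and the second and third div elements are invented for the comparison.

```python
from html.parser import HTMLParser

class DivClassCollector(HTMLParser):
    """Collect the class attribute of every <div> start tag."""

    def __init__(self):
        super().__init__()
        self.class_values = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.class_values.append(dict(attrs).get('class') or '')

# Hypothetical markup: only the first div carries the full class string
sample_html = """
<div class="jobsearch-SerpJobCard unifiedRow row result clickcard">A</div>
<div class="jobsearch-SerpJobCard unifiedRow row result">B</div>
<div class="sidebar row">C</div>
"""

collector = DivClassCollector()
collector.feed(sample_html)

full_string = 'jobsearch-SerpJobCard unifiedRow row result clickcard'

# Substring test on the whole attribute, like the XPath used above
substring_hits = [c for c in collector.class_values if full_string in c]

# Test for one class name among the space-separated class names
token_hits = [c for c in collector.class_values
              if 'jobsearch-SerpJobCard' in c.split()]

print(len(substring_hits), len(token_hits))
```

If the page ever drops or reorders one of the classes, a shorter expression such as //div[contains(@class, 'jobsearch-SerpJobCard')] may be the more forgiving choice.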

Now, in order to see the job description, we need to click on the box. The description will appear in the right half of the screen.

Let's simulate clicking on the first job listing.

In [16]:
job_elements[0].click()

The page shows only one job description at a time, so we need to simulate a click on each listing and extract the text.
Once again we right-click on the description and choose "Inspect Element". It has a very simple structure.

div id="vjs-desc"


In [None]:
This time we use Beautiful Soup, because we no longer need to interact with the page. Beautiful Soup can strip the HTML tags and clean up the text for us.

In [17]:
# Get page source
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')  # name the parser explicitly
description = soup.find('div', {"id": "vjs-desc"}) 

In [19]:
description.get_text()[0:500]

'IMMEDIATE NEED FOR A DATA ENGINEER/SCIENTIST DATA STRUCTURES - Dallas or San DiegoHealth IQ is adding a Data Engineer to its growing Business Intelligence team. As a Data Engineer, you will work with clients, team members, department heads and 3rd party data providers, to develop, maintain, and enhance our data engineering capabilities in support of our data and predictive analytic offerings to the insurance marketplace. The ideal candidate demonstrates a curious analytical mind with the ability'

In [None]:
# Get page source
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')  # name the parser explicitly

In [24]:
j=[]

while len(j)<20:
    xp = "//div[contains(@class, 'jobsearch-SerpJobCard unifiedRow row result clickcard')]"
    # find job containers on current page
    job_elements = driver.find_elements_by_xpath(xp)
    print('Jobs on current page:', len(job_elements))
    
    # click on every job posting on current page
    for element in job_elements:
        
        # make sure they are not bugging you with registration; politely decline.
        try:
            no_thanks = driver.find_element_by_link_text('No, thanks')
            no_thanks.click()
            print('Closing the pop-up asking for location/registration')
        except Exception:
            # no pop-up this time
            pass

        element.click()

        # job description
        src = driver.page_source
        soup = BeautifulSoup(src, 'html.parser')  # name the parser explicitly
        description = soup.find('div', {"id": "vjs-desc"}) 


        # company and locations

        cl = element.find_elements_by_class_name('sjcl')

        if len(cl)>0:
            company_location = cl[0].text
        else:
            company_location = None    


        # title
        t = element.find_elements_by_class_name('title')

        if len(t)>0:
            title = t[0].text
        else:
            title = None

        # salary
        s = element.find_elements_by_class_name('salarySnippet.holisticSalary')
        if len(s)>0:
            salary = s[0].text
        else:
            salary = None
        print(len(j),title)
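        # store None if the description pane was not found in the page source
        j.append((title, company_location, salary, description.get_text() if description is not None else None))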
        
    # click next to view more postings
    next_link = driver.find_elements_by_partial_link_text('Next')[-1]
    next_link.click()
    time.sleep(5)

Jobs on current page: 19
0 Senior Statistician and Software Developer
1 Quantitative Analyst Lead/Model Development
2 Lead Data Scientist-NLP Expert
3 Modeling Analyst
4 Decision Science Analyst
5 Machine Learning - Artificial Intelligence Architect
6 Data Scientist III (B3)
7 Artificial Intelligence Researcher
8 Senior Data Scientist
9 Clinical Data Scientist
10 NLP Engineer
11 Data Scientist / Statistician
12 Artificial Intelligence Engineer
13 Data Science Engineer
14 Geospatial Software Developer and Data Scientist
15 Data Analyst Specialist
16 Decision Science Analyst Senior
17 Decision Science Analyst Lead
18 Sr. Data Analyst
Jobs on current page: 17
19 Data Scientist Lead
20 Data Scientist Senior
21 Data Scientist I
22 Statistician and Software Developer
23 Data Scientist / AI Engineer
24 Artificial Intelligence Engineer
25 Quantitative Researcher
26 Data Scientist - Nationwide Opportunities
27 Data Scientist
28 Data Scientist
29 Data Scientist
30 Machine Learning Architectures 
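
The loop above reads the page source immediately after each click and pauses for a fixed 5 seconds after clicking Next. A fixed pause is either too long or too short, so a common alternative is to poll until the content we need is actually there. Selenium ships WebDriverWait for this purpose; the sketch below is a plain-Python stand-in that shows the idea without a browser. The helper wait_until and the fake lookup are my own illustrations, not part of the notebook.

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)

# Stand-in for a page element that only shows up on the third check
attempts = {'count': 0}

def fake_lookup():
    attempts['count'] += 1
    return 'description text' if attempts['count'] >= 3 else None

found = wait_until(fake_lookup, timeout=2.0, interval=0.01)
print(found, attempts['count'])
```

In the scraping loop the predicate could, for example, re-parse driver.page_source and return the vjs-desc element once it is present.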

In [25]:
JOBS = pd.DataFrame(j, columns = ['title', 'company_location','salary','description'])
JOBS_CL = JOBS['company_location'].str.split('\n', expand = True)
JOBS_CL.columns = ['company/rating', 'location']
JOBS = pd.concat([JOBS.drop(columns = 'company_location'), JOBS_CL], axis = 1)
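
The company/rating column still mixes the company name with its star rating. Below is a small sketch of how one raw company_location string could be split into three parts with plain Python. The sample strings are reconstructed from the output of this notebook, and the assumption that a rating always looks like a single digit, a dot, and a single digit at the end of the first line is mine.

```python
import re

def split_company_location(text):
    """Split 'Company 4.0\\nCity, ST' into (company, rating, location)."""
    if text is None:
        return None, None, None
    # First line: company and optional rating; remainder: location
    first, _, location = text.partition('\n')
    first = first.strip()
    match = re.match(r'^(.*?)\s+(\d\.\d)$', first)
    if match:
        company, rating = match.group(1), float(match.group(2))
    else:
        company, rating = first, None
    return company, rating, location.strip() or None

print(split_company_location('StataCorp 4.0\nCollege Station, TX 77845'))
print(split_company_location('Data Quantist\nAustin, TX'))
```

A company whose name itself ends in something like "2.0" would be split incorrectly, so the result is worth a quick manual check.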

In [34]:
JOBS.head(10)

Unnamed: 0,title,salary,description,company/rating,location
0,Senior Statistician and Software Developer,,StataCorp is seeking a person with a good unde...,StataCorp 4.0,"College Station, TX 77845"
1,Quantitative Analyst Lead/Model Development,,Purpose of Job We are currently seeking a tale...,USAA 3.9,"Austin, TX 78701 (Downtown area)"
2,Lead Data Scientist-NLP Expert,,Purpose of Job The candidate selected for this...,USAA 3.9,"San Antonio, TX 78206 (King William area)"
3,Modeling Analyst,,Purpose of Job We are currently seeking a tale...,USAA 3.9,"San Antonio, TX 78206 (King William area)"
4,Decision Science Analyst,,Purpose of Job We are currently seeking a tale...,USAA 3.9,"San Antonio, TX 78206 (King William area)"
5,Machine Learning - Artificial Intelligence Arc...,,"Interview Location: Georgetown, TX\n\nJob Titl...",Loram Maintenance of Way 3.2,"Georgetown, TX"
6,Data Scientist III (B3),,The Data Scientist job expectations (typically...,Applied Materials Inc. 3.9,"Austin, TX"
7,Artificial Intelligence Researcher,"$250,000 - $450,000 a year","As an Artificial Intelligence Researcher, you ...",Axiom Group 2.7,"Austin, TX"
8,Senior Data Scientist,,Job requirements/ skills:﻿\nMaster’s degree wi...,Data Quantist,"Austin, TX"
9,Clinical Data Scientist,,The Challenge:\nAre you excited at the prospec...,Booz Allen Hamilton 3.9,"San Antonio, TX 78205 (Downtown area)"
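
Most postings have no salary, and the ones that do store it as free text such as '$250,000 - $450,000 a year'. To compare salaries numerically the text has to be parsed first. The sketch below is one possible way to do that; the function name parse_salary, the returned dictionary layout, and the list of pay periods are my own assumptions, and only the yearly range format is confirmed by the data above.

```python
import re

def parse_salary(text):
    """Turn a salary string into {'low', 'high', 'period'}, or None."""
    if not isinstance(text, str):
        return None
    # Dollar amounts such as $250,000 or $35.50
    raw_amounts = re.findall(r'\$([\d,]+(?:\.\d+)?)', text)
    amounts = [float(a.replace(',', '')) for a in raw_amounts]
    if not amounts:
        return None
    # Pay period, e.g. "a year" or "an hour" (assumed wording)
    period = re.search(r'an?\s+(hour|day|week|month|year)', text)
    return {
        'low': min(amounts),
        'high': max(amounts),
        'period': period.group(1) if period else None,
    }

print(parse_salary('$250,000 - $450,000 a year'))
```

Applied to the whole column this could look like JOBS['salary'].apply(parse_salary), which leaves missing salaries as None.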
