This script demonstrates how to use Selenium WebDriver to automatically download and parse search results from indeed.com.
I will be using Firefox as my browser of choice.

First we need to install the Beautiful Soup and Selenium libraries. We need Beautiful Soup to extract the text enclosed by certain HTML tags.

In [3]:
import sys
!{sys.executable} -m pip install selenium
!{sys.executable} -m pip install bs4



Now we need to download the latest version of geckodriver so we can control Firefox windows from Python.

It can be downloaded from here:
https://github.com/mozilla/geckodriver/releases

Unzip the file and remember its location. In this example I am using "c:/geckodriver/geckodriver.exe".

In [4]:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns

Before we start the webdriver, let us specify some parameters that will make monitoring and debugging easier.

<b>options.headless = False</b>

Opens a Firefox window that is controlled by Python. We can use it as if it were a regular Firefox window: we can scroll, click on links, view the page source, inspect elements, etc.

<b>pageLoadStrategy = 'normal'</b>

Default option: driver.get() returns only after the page has finished loading (the load event has fired), so we interact with a fully loaded page.

<b>driver.set_page_load_timeout(10)</b>

Set page load timeout to 10 seconds.

In [7]:
print('Starting Selenium Firefox Webdriver...')

from selenium import webdriver


from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.options import Options


caps = DesiredCapabilities().FIREFOX
caps['pageLoadStrategy'] = 'normal'

options = Options()
options.headless = False


driver = webdriver.Firefox(executable_path = 'c:/geckodriver/geckodriver.exe', options=options, capabilities=caps)
driver.set_page_load_timeout(10)
print('Done...')

Starting Selenium Firefox Webdriver...
Done...


We will be looking for Data Science jobs in Texas.

To do so, we open a new browser window, go to indeed.com, enter the job title "Data Scientist" and the location "Texas", and click "Find jobs".
The address bar should change to the following string:

https://www.indeed.com/jobs?q=data+scientist&l=Texas

In [8]:
url = 'https://www.indeed.com/jobs?q=data+scientist&l=Texas'
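
The same address can also be assembled from the search terms instead of being copied from the address bar. The sketch below uses urlencode from the standard library. The helper name build_search_url is hypothetical, and the parameter names q and l are simply the ones visible in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.indeed.com/jobs'

def build_search_url(query, location):
    """Return the search URL for a job title and a location (hypothetical helper)."""
    # urlencode escapes spaces as '+', which matches the address-bar string
    return BASE_URL + '?' + urlencode({'q': query, 'l': location})

url = build_search_url('data scientist', 'Texas')
print(url)
```

This makes it easy to repeat the search for another title or state by changing only the two arguments.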

In [9]:
# Open url
driver.get(url)

Indeed is going to ask you for your location/registration. Politely decline.

In [133]:
try:
    # Find the "No, thanks" link and click on it.
    no_thanks = driver.find_element_by_link_text('No, thanks')
    no_thanks.click()
    print('Closing the pop-up asking for location/registration')
except Exception:
    # No pop-up link was found (or it could not be clicked), so carry on.
    pass

Great! We are remotely controlling a Firefox window from Python.
Now it is time to figure out the structure of the page. We see that each job listing is enclosed in a box.
Right-click the box and select "Inspect element".

<img src="job_container.png">

Each box is an HTML container that has a multi-level structure:
    
Top level: <div id="..." class = "jobsearch-SerpJobCard unifiedRow row result clickcard" data-jk = "..." ...>

To locate each box we use the find_elements_by_xpath method.

We are looking for a <b>div</b> container whose <b>class</b> attribute contains the string <b>"jobsearch-SerpJobCard unifiedRow row result clickcard"</b>

In [14]:
xp = "//div[contains(@class, 'jobsearch-SerpJobCard unifiedRow row result clickcard')]"
job_elements = driver.find_elements_by_xpath(xp)
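
Note that XPath contains() is a plain substring test on the attribute text, so the expression above matches only when the class names appear in exactly that order. The plain-Python sketch below involves no Selenium, and both helper names are hypothetical. It contrasts the substring test with an order-independent check on the individual class names.

```python
def xpath_style_contains(class_attr, needle):
    """Mimic XPath contains(@class, needle): a plain substring test."""
    return needle in class_attr

def has_all_classes(class_attr, wanted):
    """Order-independent check: every wanted class name is present."""
    return set(wanted.split()) <= set(class_attr.split())

needle = 'jobsearch-SerpJobCard unifiedRow row result clickcard'
same_order = 'jobsearch-SerpJobCard unifiedRow row result clickcard'
reordered = 'jobsearch-SerpJobCard row unifiedRow result clickcard'

print(xpath_style_contains(same_order, needle))  # True
print(xpath_style_contains(reordered, needle))   # False
print(has_all_classes(reordered, needle))        # True
```

If the order of the class names ever changes on the site, matching on a single class such as jobsearch-SerpJobCard may turn out to be more robust, although that is an assumption about the page rather than something tested here.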

In [15]:
len(job_elements)

17

To see the job description we need to click on the box; the description will then appear in the right half of the screen.

Let's simulate clicking on the first job listing.

In [16]:
job_elements[0].click()

The page loads and shows only one job description at a time, so we need to simulate a click on each listing and extract the text.
Once again we right-click on the description and choose "Inspect element". It has a very simple structure.

#<div id="vjs-desc">
#HTML TEXT
#</div>

This time we use Beautiful Soup, because we no longer need to interact with the page. Beautiful Soup can strip the HTML tags and clean up the text for us.

In [17]:
# Get page source
src = driver.page_source
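# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(src, 'html.parser')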
description = soup.find('div', {"id": "vjs-desc"}) 

In [19]:
description.get_text()[0:500]

'IMMEDIATE NEED FOR A DATA ENGINEER/SCIENTIST DATA STRUCTURES - Dallas or San DiegoHealth IQ is adding a Data Engineer to its growing Business Intelligence team. As a Data Engineer, you will work with clients, team members, department heads and 3rd party data providers, to develop, maintain, and enhance our data engineering capabilities in support of our data and predictive analytic offerings to the insurance marketplace. The ideal candidate demonstrates a curious analytical mind with the ability'
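
In the output above, "San Diego" and "Health IQ" run together, which suggests that the text of neighbouring elements was joined without any separator. get_text() accepts separator and strip arguments that address this. The HTML fragment below is made up for illustration; the real markup inside vjs-desc is not shown in this notebook.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only, not the real page markup
html = ('<div id="vjs-desc">'
        '<p>Dallas or San Diego</p>'
        '<p>Health IQ is adding a Data Engineer.</p>'
        '</div>')

soup = BeautifulSoup(html, 'html.parser')
desc = soup.find('div', {'id': 'vjs-desc'})

# Default: text nodes are concatenated with nothing in between
joined = desc.get_text()

# separator puts a space between text nodes, strip trims each of them
spaced = desc.get_text(separator=' ', strip=True)

print(joined)
print(spaced)
```

The second form keeps word boundaries intact, which matters if the descriptions are later split into words.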


In [134]:
# j accumulates (title, company_location, salary, description) tuples across
# runs of this cell; create it only if it does not exist yet.
if 'j' not in globals():
    j = []

while len(j) < 4500:
    xp = "//div[contains(@class, 'jobsearch-SerpJobCard unifiedRow row result clickcard')]"
    # find job containers on the current page
    job_elements = driver.find_elements_by_xpath(xp)
    print('Jobs on current page:', len(job_elements))

    # click on every job posting on the current page
    for element in job_elements:

        # make sure they are not bugging you with registration; politely decline
        try:
            no_thanks = driver.find_element_by_link_text('No, thanks')
            no_thanks.click()
            print('Closing the pop-up asking for location/registration')
        except Exception:
            # no pop-up link was found (or it could not be clicked)
            pass

        element.click()

        # job description (kept as a Beautiful Soup tag; may be None if not found)
        src = driver.page_source
        soup = BeautifulSoup(src, 'html.parser')
        description = soup.find('div', {"id": "vjs-desc"})

        # company and location
        cl = element.find_elements_by_class_name('sjcl')
        if len(cl) > 0:
            company_location = cl[0].text
        else:
            company_location = None

        # title (check the title list itself, not the company/location list)
        t = element.find_elements_by_class_name('title')
        if len(t) > 0:
            title = t[0].text
        else:
            title = None

        # salary
        s = element.find_elements_by_class_name('salarySnippet.holisticSalary')
        if len(s) > 0:
            salary = s[0].text
        else:
            salary = None

        print(len(j), title)
        j.append((title, company_location, salary, description))

    # click "Next" to view more postings; an empty list here raises IndexError,
    # which stops the loop
    next_link = driver.find_elements_by_partial_link_text('Next')[-1]
    next_link.click()
    # give the next page time to load
    time.sleep(5)

Jobs on current page: 18
509 Customer Service Representative
510 Customer Service Agents for Short Term, Remote Positions
511 NOW HIRING - Brookshire's is hiring for multiple positions
512 Find jobs with Flexible Hours!
513 PSE MAIL PROCESSING CLERK
514 Billing Administrator
515 Product Manager
516 Production Superintendent
517 RealContact Calling Client Concierge (Part-time/Remote)
518 GARDENER WANTED
519 Inventory Associate
520 FedEx Delivery Driver
521 Systems Analyst I, II or III
522 Client Support Rep
523 Cash Posting Manager
524 Find jobs with Flexible Hours!
525 Executive IT Support Specialist
526 Service Desk Representative
Jobs on current page: 15
527 Healthcare Specialist
528 Screener
529 Warehouse Associate Night Shift
530 Customer Support Representative (Temp-to-Perm)
531 Customer Service Representative
532 Shipping & Receiving Operators-Hiring All Shifts!!!
533 Customer Service Representative
534 Sales Utility Representative - PT
535 Front Desk Agent (DAL)
536 Inside Sales

725 Insurance Sales Professional
726 Life Insurance Sales Representative
727 Find jobs with Flexible Hours!
728 General Manager (Huntsville TX)
729 Electrical Estimator
730 Restoration Construction Sales Estimator - 411 - FEATURED
731 Concrete Polishing/ Epoxy Flooring crew leader
732 Pizza Hut General Manager.....Immediate Need
733 Life Insurance Sales Agent
734 Pizza Hut Assistant Manager
735 Estimator (Maintenance/Construction)
736 Estimator
Jobs on current page: 15
737 Inside Sales Representative
738 Roof/Restoration Consultant
739 Outside Sales Representative
740 Roofing sales rep
741 Roofing Sales
742 Roofing Contractor- Sales
743 Roofing Contractor
744 Concrete Construction
745 Mechanical Estimator
746 Outside Sales Representative
747 Pizza Hut General Manager
748 Outside Sales Representative
749 Project Manager
750 Sales Consultant
751 Bath & Kitchen Sales Rep
Jobs on current page: 15
752 Automotive Sales Consultant
753 Restoration Sales Representative
754 Outside Sales Rep (CS

946 Customer Experience Specialist - Part-Time
Jobs on current page: 15
947 Bilingual Customer Service Representative
948 Collections Specialist (spanish speaking)
949 Home Health Patient Staffing Coordinator - Entry Level Admin...
950 Bilingual Call Center Representative Remote
951 Front Office - Receptionist
952 Bilingual Collections Representative
953 Third Party Medical Collector - English/Spanish Bilingual
954 Bilingual Insurance Customer Service Representative
955 Spanish Bilingual Interpreter-Paid Training-Comprehensive Be...
956 Medical Receptionist
957 Veterinary Assistant
958 P417: Mental Health Case Manager
959 Customer Service Agent
960 Bilingual Customer Service
961 Account Manager
Jobs on current page: 15
962 Veterinary/Kennel Assistant
963 Bilingual Customer Service Representative
964 Bilingual Telemarketer
965 Bilingual CSR - Customer Service Representative
966 Customer Service Representative
967 Bilingual Customer Service Representative
968 Bilingual Call Center Resear

1151 P362: MH Case Manager
1152 Account Sales Executive
1153 ADT Sales Representative
1154 Acquisitions Specialist
1155 Icon Golf Sales Executive
1156 Team Leader
Jobs on current page: 15
1157 Sales Team Leader Trainee
1158 Delivery Driver (Amazon Contractor)
1159 Sales Representative
1160 Professional Housing Consultant
1161 In-Home Solar Specialist
1162 New Home Sales Consultant- Houston North
1163 Roofing Sales
1164 National Sales Manager (Equipment Sales)
1165 Sales Consultant
1166 Outreach Specialist
1167 Sales and Design Consultant
1168 Outside Sales Representative
1169 Sales Beast
1170 TJJD - Youth Development Coach III / IV / V - Halfway House...
1171 COVID SWAT Team - Long Term Care
Jobs on current page: 15
1172 Sales Oral Care Territory Manager - Fort Worth, Texas
1173 Outside Sales Representative
1174 Outside Sales Territory Manager
1175 Project Manager / Sales Representative
1176 Outside Sales Representative
1177 Roofing Sales/ Trainer
1178 Roofing sales rep
1179 Roofing Sa

1360 Pizza Delivery Driver
1361 Restaurant Delivery Driver
1362 DoorDash Delivery Driver - No Experience Needed
1363 Delivery Driver - McCart
1364 Join our Dominos Pizza E-Bike Delivery Team Cash Paid Daily...
1365 Pizza Hut Delivery Driver - William Cannon
1366 Van Driver
Jobs on current page: 15
1367 Teamleaders and Crewmembers needed (Huntsville, TX)
1368 Pizza Delivery Driver
1369 Delivery Driver - Stockyards
1370 Delivery Driver (part or full time) with DoorDash - Make up...
1371 Delivery Driver - Pipeline
1372 Dominos Delivery Driver - 921 10th St. - (Floresville) Store...
1373 Delivery Driver
1374 Delivery Driver
1375 Delivery Driver - Haslet(FW)
1376 Delivery Driver
1377 Pizza Driver - Cash Every Shift!
1378 Delivery Driver - Rendon(FW)
1379 Delivery Driver - Camp Bowie
1380 Restaurant Team Member
1381 Delivery Driver - 4111BR
Jobs on current page: 15
1382 Pizza Delivery Driver
1383 Team Member Position - Killeen
1384 Team Member
1385 Delivery Driver
1386 Delivery Driver - Atte

IndexError: list index out of range
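
An "IndexError: list index out of range" like the one above is raised whenever [0] or [-1] is applied to an empty list, for example when find_elements_by_partial_link_text('Next') finds no link on the last page of results. The sketch below shows two small guards. The helper names first_text and click_next are hypothetical, and FakeDriver is only a stand-in so that the helpers can be tried without a browser.

```python
from types import SimpleNamespace

def first_text(elements):
    """Return the text of the first element, or None if the list is empty."""
    return elements[0].text if elements else None

def click_next(driver):
    """Click the last 'Next' link; return False when there is none."""
    links = driver.find_elements_by_partial_link_text('Next')
    if not links:
        return False
    links[-1].click()
    return True

# Stand-in for the real webdriver (hypothetical, for demonstration only)
class FakeDriver:
    def __init__(self, links):
        self.links = links

    def find_elements_by_partial_link_text(self, text):
        return self.links

clicked = []
link = SimpleNamespace(click=lambda: clicked.append('next'))

print(first_text([SimpleNamespace(text='Data Scientist')]))  # Data Scientist
print(first_text([]))                                         # None
print(click_next(FakeDriver([link])))                         # True
print(click_next(FakeDriver([])))                             # False
```

With guards like these, the scraping loop could end with "if not click_next(driver): break" instead of stopping on an exception.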

In [9]:
#non_ds = pd.DataFrame(j, columns = ['title', 'company_location','salary','description'])
#non_ds.to_csv('non_ds.csv')

In [154]:
ds = pd.DataFrame(j, columns = ['title', 'company_location','salary','description'])
ds['target'] = 0

In [155]:
ds.iloc[509:1800,:].to_csv('non_ds2.csv')

In [18]:
ds.to_csv('ds.csv')

In [25]:
ds['target'] = 1

In [26]:
non_ds['target'] = 0

In [27]:
df = pd.concat([ds, non_ds])

In [30]:
df.reset_index(drop = True, inplace=True)
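
The labeling steps above can be condensed into a small self-contained sketch. The rows are toy values invented for illustration, not scraped results; target = 1 marks data science postings and target = 0 marks everything else, as in the cells above.

```python
import pandas as pd

columns = ['title', 'company_location', 'salary', 'description']

# Toy rows for illustration only (not scraped data)
ds = pd.DataFrame(
    [('Data Scientist', 'Company A, Austin, TX', 'toy salary', 'toy text a'),
     ('ML Engineer', 'Company B, Dallas, TX', 'toy salary', 'toy text b')],
    columns=columns)
non_ds = pd.DataFrame(
    [('Delivery Driver', 'Company C, Houston, TX', 'toy salary', 'toy text c')],
    columns=columns)

# Label the two groups
ds['target'] = 1
non_ds['target'] = 0

# ignore_index=True renumbers the rows, so no separate reset_index call is needed
df = pd.concat([ds, non_ds], ignore_index=True)
print(df[['title', 'target']])
```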