# Scraping Job Postings from LinkedIn

This code is adapted from Cohort 2's work and from the following article: https://maoviola.medium.com/a-complete-guide-to-web-scraping-linkedin-job-postings-ad290fcaa97f

### Data Source 

LinkedIn job board. This data collection focuses on job posts near Rancho Cordova, California from the past month. <br>
The code collects the **date, job title, company name, job description, and job criteria** of each post.

### Import Libraries

In [1]:
#Import packages
import time, os
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select

from bs4 import BeautifulSoup as bs
import requests
import re
import pickle

#Hide Warnings
import warnings
warnings.filterwarnings('ignore')

### Initialize Chromedriver for Selenium

This notebook uses chromedriver. If you have not installed it yet, you can download it from:
https://chromedriver.chromium.org/downloads. <br> 
Move the chromedriver binary to the **Applications** folder (`/Applications/chromedriver`) for the code below to work. <br>
<br>
The Selenium code below does the following: <br>
- Sets the location to Rancho Cordova
- Selects 'Past Month' under "Date Posted"
- Selects 'within 10 miles' under "Distance"

In [2]:
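# Set up chromedriver
# Note: this notebook uses the find_element_by_* methods, which were removed in Selenium 4.3;
# on newer versions use driver.find_element(By.XPATH, ...) with By from selenium.webdriver.common.by.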
chromedriver = "/Applications/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
source = 'https://www.linkedin.com/jobs'
driver.get(source)

# Get Location to Rancho Cordova
location_box_clear = driver.find_element_by_xpath('//*[@id="JOBS"]/section[2]/button')
location_box_clear.click()
location_box = driver.find_element_by_xpath('//*[@id="JOBS"]/section[2]/input')
location_box.click()
location_box.send_keys("Rancho Cordova, California, United States")
location_box.send_keys(Keys.RETURN)

In [4]:
# Select past month 
time_dropdown = driver.find_element_by_xpath('//*[@id="jserp-filters"]/ul/li[1]/div/div/button')
time_dropdown.click()
past_month_button = driver.find_element_by_xpath('//*[@id="jserp-filters"]/ul/li[1]/div/div/div/fieldset/div/div[3]/label')
past_month_button.click()
time_done_button = driver.find_element_by_xpath('//*[@id="jserp-filters"]/ul/li[1]/div/div/div/button')
time_done_button.click()

# Pause here; otherwise LinkedIn redirects to the sign-in page
time.sleep(5)

# within 10 miles
distance_dropdown = driver.find_element_by_xpath('//*[@id="jserp-filters"]/ul/li[2]/div/div/button')
distance_dropdown.click()
filter_10mi = driver.find_element_by_xpath('//*[@id="jserp-filters"]/ul/li[2]/div/div/div/fieldset/div/div[1]/label')
filter_10mi.click()
distance_done_button = driver.find_element_by_xpath('//*[@id="jserp-filters"]/ul/li[2]/div/div/div/button')
distance_done_button.click()

Additional filters are available for **Company, Salary, Location, Job Type, Experience Level, On-site/Remote**.

### How many job posts are associated with the job search?

In [5]:
#How many jobs are currently available within 10 miles of Rancho Cordova on LinkedIn
no_of_jobs = driver.find_element_by_css_selector('h1>span').get_attribute('innerText')

print('There are', no_of_jobs, 'jobs available within 10 miles of Rancho Cordova on LinkedIn over the past month.')

There are 14,000+ jobs available within 10 miles of Rancho Cordova on LinkedIn over the past month.
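The count comes back as display text such as `14,000+`, not as a number. If the value is needed for later comparisons (for example against the number of jobs actually collected), it could be converted along these lines. This is a minimal sketch; `parse_job_count` is a hypothetical helper that is not part of the original notebook.

```python
import re

def parse_job_count(text):
    """Turn display text like '14,000+' into (14000, True).

    The boolean reports whether the text ended with '+', meaning the
    site is showing a lower bound rather than an exact count.
    """
    cleaned = text.strip()
    is_lower_bound = cleaned.endswith("+")
    digits = re.sub(r"[^\d]", "", cleaned)  # drop commas, '+', spaces
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits), is_lower_bound

count, at_least = parse_job_count("14,000+")
print(count, at_least)
```

Keeping the "lower bound" flag separate avoids treating `14,000+` as an exact total.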


### Show all the jobs

The following code scrolls down and clicks "See more jobs" until all available job posts are showing.

In [6]:
#Browse all jobs for the search.
# Set pause time
SCROLL_PAUSE_TIME = 10
last_height = driver.execute_script("return document.body.scrollHeight")
while True: 
    #Scroll until hit the see more jobs button.
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    
    try:
        #Click the see more jobs button and then keep scrolling.
        driver.find_element_by_xpath('//*[@id="main-content"]/section/button').click()
        time.sleep(15)
        print("clicked loading button")
    except Exception:
        #Button not found or not clickable yet; wait and keep scrolling.
        time.sleep(15)
        print("no loading button")
        
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    
    # Stop the scrolling and button clicking if the page isn't loading more jobs
    if new_height == last_height:
        print("loading button stopped working")
        break
    last_height = new_height

# This can take a while

no loading button
no loading button
no loading button
no loading button
no loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
clicked loading button
loading button stopped working


***Clicking the button no longer loads more jobs in the webdriver.***
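The scroll-and-click loop above runs until the page height stops changing, which can take a long time with no upper bound. The sketch below wraps the same idea in a function with a cap on the number of rounds. It is an illustration under assumptions, not the notebook's method: `load_all_results`, `click_more`, and `max_rounds` are hypothetical names, and `driver` only needs an `execute_script()` method like the Selenium webdriver used above.

```python
import time

def load_all_results(driver, click_more, pause=10, max_rounds=200):
    """Scroll and click until the page height stops growing.

    driver     -- any object with execute_script(), e.g. a Selenium webdriver
    click_more -- hypothetical callable that clicks the "see more jobs"
                  button and raises an exception when it cannot
    Returns the number of rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        # Scroll to the bottom so the next batch (or the button) appears.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        try:
            click_more()
        except Exception:
            pass  # no button this round; scrolling alone may still load more
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new was loaded
        last_height = new_height
    return max_rounds
```

With the notebook's driver, `click_more` could be a small function that calls `driver.find_element_by_xpath('//*[@id="main-content"]/section/button').click()`.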

### Create a list of all jobs in the search

In [7]:
#Create a list of the jobs.
job_lists = driver.find_element_by_class_name('jobs-search__results-list')
jobs = job_lists.find_elements_by_tag_name('li')

In [8]:
#Check how many jobs were collected.
#If the count is much lower than expected, the time.sleep() pauses may need to be increased to:
#  - allow more loading time, or
#  - avoid triggering the site's restrictions.
print(len(jobs), 'were collected from the search')

998 were collected from the search


LinkedIn appears to cap the results at roughly this many listings (close to the number the previous cohort collected). <br>
One way around the cap is to add more filters so that each search returns fewer results.
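If several narrower searches are run to stay under the cap, the same posting can show up in more than one of them. The sketch below merges the batches and keeps one record per posting. It assumes that the link path (without the query string) identifies a posting, which has not been verified against LinkedIn; `canonical_link` and `merge_unique` are hypothetical helpers, and the example URLs are made up.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_link(url):
    """Drop the query string and fragment so tracking parameters do not
    make the same posting look like two different links."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def merge_unique(batches):
    """Combine lists of job records, keeping the first record per link."""
    seen = set()
    merged = []
    for batch in batches:
        for record in batch:
            key = canonical_link(record["Job Link"])
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

# Made-up records standing in for two filtered searches.
search_a = [{"Title": "Analyst", "Job Link": "https://www.linkedin.com/jobs/view/example-1?refId=abc"}]
search_b = [{"Title": "Analyst", "Job Link": "https://www.linkedin.com/jobs/view/example-1?refId=xyz"},
            {"Title": "Planner", "Job Link": "https://www.linkedin.com/jobs/view/example-2"}]
print(len(merge_unique([search_a, search_b])))
```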

### Getting Job basic information

The following steps can be combined into a single loop, but doing so carries a higher risk of the run being cut off partway by the website. Splitting the work into separate passes produces fewer errors and finishes in less time overall.

In [9]:
#Pull basic information from each job.
job_title = []
company_name = []
date = []
job_link = []

for job in jobs:
    job_title0 = job.find_element_by_css_selector('h3').get_attribute('innerText')
    job_title.append(job_title0)
 
    company_name0 = job.find_element_by_css_selector('h4').get_attribute('innerText')
    company_name.append(company_name0)
 
    date0 = job.find_element_by_css_selector('div>div>time').get_attribute('datetime')
    date.append(date0)
    
    job_link0 = job.find_element_by_css_selector('a').get_attribute('href')
    job_link.append(job_link0)

In [10]:
#See first 5 of each for verification.
print('Job Titles:',job_title[:5])
print(' ')
print('Company Names:',company_name[:5])
print(' ')
print('Date:', date[:5])

Job Titles: ['Platform Supply Analyst', 'Platform Supply Analyst', 'Supervisor Support Services', 'Communications Specialist', 'Planning Analyst']
 
Company Names: ['Intel Corporation', 'Intel Corporation', 'Kaiser Permanente', 'Dignity Health', 'Intel Corporation']
 
Date: ['2022-01-15', '2022-01-26', '2022-01-10', '2022-02-02', '2022-01-26']


In [11]:
# Create a dataframe of the collected data.
job_post_data = pd.DataFrame({'Date': date,
                              'Company': company_name,
                              'Title': job_title,
                              'Job Link': job_link})

In [12]:
job_post_data.head()

| | Date | Company | Title | Job Link |
|---|---|---|---|---|
| 0 | 2022-01-15 | Intel Corporation | Platform Supply Analyst | https://www.linkedin.com/jobs/view/platform-su... |
| 1 | 2022-01-26 | Intel Corporation | Platform Supply Analyst | https://www.linkedin.com/jobs/view/platform-su... |
| 2 | 2022-01-10 | Kaiser Permanente | Supervisor Support Services | https://www.linkedin.com/jobs/view/supervisor-... |
| 3 | 2022-02-02 | Dignity Health | Communications Specialist | https://www.linkedin.com/jobs/view/communicati... |
| 4 | 2022-01-26 | Intel Corporation | Planning Analyst | https://www.linkedin.com/jobs/view/planning-an... |


### Getting more job details

**Note**: The following code takes longer to run because of the time.sleep() calls, but the pauses help work around StaleElementReferenceException.

In [13]:
#Initial job description and criteria lists
jd = []
cl = []

#Class names of the description and criteria panels
jd_path = 'show-more-less-html__markup'
detail_path = 'description__job-criteria-list'

#Get job descriptions and criteria list
for job in jobs:

    #Open the job so its details load in the side panel
    job.click()

    try:
        jd0 = driver.find_element_by_class_name(jd_path).get_attribute('innerText')
        jd.append(jd0)
        details = driver.find_element_by_class_name(detail_path).get_attribute('innerText')
        cl.append(details)
        time.sleep(20)

    except Exception: # working around StaleElementReferenceException
        #Give the panel time to finish loading, then read both fields again
        time.sleep(15)
        jd0 = driver.find_element_by_class_name(jd_path).get_attribute('innerText')
        jd.append(jd0)
        details = driver.find_element_by_class_name(detail_path).get_attribute('innerText')
        cl.append(details)
        time.sleep(20)
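The wait-then-try-again pattern used for StaleElementReferenceException can be factored into a small helper so that the lookup code is written only once. The following is a generic sketch; `retry` and its parameters are hypothetical names and are not part of Selenium.

```python
import time

def retry(action, attempts=3, delay=15, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out.

    Sleeps `delay` seconds between tries and re-raises the last error
    if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)
```

In this notebook, `action` could be a function that reads the description text, and `exceptions` could be narrowed to `StaleElementReferenceException` from `selenium.common.exceptions` so that unrelated errors are not retried.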


In [14]:
# Verify description is correct
print(jd[0])
print(cl[0])

Job Description

In this position you will be part of Capacity and Supply Business Planning (CSBP), a team that is responsible for setting and executing Intel's product platform strategies, working with partners across both the internal and external supply chain organizations.

A Successful Candidate Will Demonstrate

Skills in effectively presenting to and obtaining direction from senior management.

Business acumen and understanding of supply line management processes and demand market models.

Experience in roles that have required a high degree of adaptability to change.

Advanced expertise in Excel (pivot tables, formulas, conditional formatting, etc)

Familiarity with Intel Architecture products and customers, an understanding of the broader market drivers and trends.

Responsibilities May Include

Ensure a high level of service to Intel's customers by understanding customer demand for our products and working closely with Intel's internal and external supply network to align sup

In [15]:
# Verify that the new lists are the same length as the df
print(len(cl))
print(len(jd))

998
1029


In [None]:
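# jd (1029) is longer than cl (998). Likely cause: in the loop above, jd.append() runs before the
# criteria lookup, so when that lookup raises, the except branch appends the same description again.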
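One way to keep the two lists the same length is to read both fields into local variables first and append them together afterwards, so that a failure partway through cannot leave an extra entry in one list. The sketch below shows only that idea: it omits the pauses and the retry used above, stores `None` for a job whose details could not be read, and `collect_details` is a hypothetical function name.

```python
def collect_details(driver, jobs, jd_class, criteria_class):
    """Return (descriptions, criteria) lists with one entry per job."""
    jd, cl = [], []
    for job in jobs:
        job.click()
        try:
            # Read both fields before storing either of them.
            description = driver.find_element_by_class_name(jd_class).get_attribute('innerText')
            criteria = driver.find_element_by_class_name(criteria_class).get_attribute('innerText')
        except Exception:
            # Keep the row so the lists stay aligned with `jobs`.
            description, criteria = None, None
        jd.append(description)
        cl.append(criteria)
    return jd, cl
```

Because every job contributes exactly one entry to each list, both can be assigned to the dataframe as columns, and rows with `None` can be revisited or dropped later.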

In [23]:
# job_post_data["job_description"] = jd
job_post_data["criteria"] = cl

In [24]:
job_post_data.head()

| | Date | Company | Title | Job Link | criteria |
|---|---|---|---|---|---|
| 0 | 2022-01-15 | Intel Corporation | Platform Supply Analyst | https://www.linkedin.com/jobs/view/platform-su... | `Seniority level\nMid-Senior level\nEmployment ...` |
| 1 | 2022-01-26 | Intel Corporation | Platform Supply Analyst | https://www.linkedin.com/jobs/view/platform-su... | `\n \n \n Seniority ...` |
| 2 | 2022-01-10 | Kaiser Permanente | Supervisor Support Services | https://www.linkedin.com/jobs/view/supervisor-... | `\n \n \n Seniority ...` |
| 3 | 2022-02-02 | Dignity Health | Communications Specialist | https://www.linkedin.com/jobs/view/communicati... | `\n \n \n Seniority ...` |
| 4 | 2022-01-26 | Intel Corporation | Planning Analyst | https://www.linkedin.com/jobs/view/planning-an... | `\n \n \n Seniority ...` |
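The criteria text mixes labels, values, and blank or whitespace-only lines. As a quick illustration of its structure, the sketch below pairs up the non-empty lines. It assumes that labels and values strictly alternate, which the preview above suggests but does not guarantee; `parse_criteria` is a hypothetical helper and the second label/value pair in the sample is made up.

```python
def parse_criteria(text):
    """Pair up alternating label/value lines from the criteria text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Even positions are labels, odd positions are their values.
    return dict(zip(lines[0::2], lines[1::2]))

sample = "\n  \n Seniority level\n   Mid-Senior level\n \n Example label\n  Example value\n"
print(parse_criteria(sample))
```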


In [25]:
job_post_data.to_csv('../Data/LinkedIn_Job_Postings_notcleaned.csv', index = False)

Another notebook will detail the cleaning process for the text columns.

In [26]:
# close out webdriver
driver.quit()