# Scraping Job Postings from LinkedIn 

This code was adapted from the following article: <br>
https://maoviola.medium.com/a-complete-guide-to-web-scraping-linkedin-job-postings-ad290fcaa97f

### What data will you be able to get?

##### All jobs from a defined job search. <br>
That search can be limited by any of the given LinkedIn options.<br>

##### Items that are on the job card. <br>
This code collects the posting date, job title, company name, job description, and job criteria for each posting.

In [1]:
#Import packages
from selenium import webdriver
import time
import pandas as pd

#Hide Warnings
import warnings
warnings.filterwarnings('ignore')

##### Define the search and copy the URL. <br>
Sign out of LinkedIn for this part. On LinkedIn, build the search for the job postings you need. This example covers all jobs posted within a 10-mile radius of Rancho Cordova, California.

In [2]:
# URL for LinkedIn Job search of jobs in Rancho Cordova with 10 mile radius. 
# Past Month
# (Not signed-in search)

# Replace this url with the one for your needed search.

url = 'https://www.linkedin.com/jobs/search?keywords=&location=Rancho%20Cordova%2C%20California%2C%20United%20States&locationId=&geoId=104863738&sortBy=R&f_TPR=r2592000&f_PP=104863738&distance=10&position=1&pageNum=0'
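The search URL can also be assembled from its parts, which makes it easier to change the location or time window. The sketch below is only a reading of the query string above: the parameter names are copied from that URL, `build_search_url` is a hypothetical helper, and treating `f_TPR=r2592000` as a 30-day window expressed in seconds is inferred from the "Past Month" filter, not from LinkedIn documentation.

```python
from urllib.parse import urlencode, quote

BASE_URL = 'https://www.linkedin.com/jobs/search'

def build_search_url(location, geo_id, distance_miles=10, days=30, keywords=''):
    """Assemble a not-signed-in job search URL (hypothetical helper)."""
    params = {
        'keywords': keywords,
        'location': location,
        'locationId': '',
        'geoId': geo_id,
        'sortBy': 'R',
        'f_TPR': f'r{days * 24 * 60 * 60}',  # assumed: posting window in seconds
        'f_PP': geo_id,
        'distance': distance_miles,
        'position': 1,
        'pageNum': 0,
    }
    # quote (not the default quote_plus) encodes spaces as %20, as in the URL above.
    return BASE_URL + '?' + urlencode(params, quote_via=quote)

url = build_search_url('Rancho Cordova, California, United States', 104863738)
print(url)
```

With these arguments the helper reproduces the URL used in this notebook. Any other combination should be checked in a browser first, since the parameter meanings are assumptions.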

##### Set up the web driver. <br>

This example uses ChromeDriver. If you have not installed it yet, download it from: <br>
https://chromedriver.chromium.org/downloads

In [3]:
#Set-up chromedriver
wd = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')

#Connect the webdriver to the job search.
wd.get(url)
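The calls in this notebook use the Selenium 3 API (`executable_path=` and the `find_element_by_*` methods). Later Selenium 4 releases removed these in favor of `find_element(by, value)`. If the cells below raise `AttributeError` on a newer install, a small shim along these lines may help. `find_css` is a hypothetical helper, not part of Selenium, and the sketch covers CSS selectors only.

```python
def find_css(node, selector):
    """Return the first element matching a CSS selector (hypothetical helper).

    `node` can be a driver or an element, on Selenium 3 or Selenium 4.
    """
    legacy = getattr(node, 'find_element_by_css_selector', None)
    if legacy is not None:
        # Selenium 3 style, as used throughout this notebook.
        return legacy(selector)
    # Selenium 4 style; 'css selector' is the value of By.CSS_SELECTOR.
    return node.find_element('css selector', selector)
```

For example, `find_css(wd, 'h1>span')` would stand in for `wd.find_element_by_css_selector('h1>span')`.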

##### How many job posts are associated with the job search? <br>
Note: LinkedIn may redirect to a login page. A bypass may be added in the future; for now, simply re-run the following cell.

In [4]:
#How many jobs are currently available within 10 miles of Rancho Cordova on LinkedIn

no_of_jobs = int(wd.find_element_by_css_selector('h1>span').get_attribute('innerText'))

print('There are', no_of_jobs, 'jobs available within 10 miles of Rancho Cordova on LinkedIn over the past month.')

There are 951 jobs available within 10 miles of Rancho Cordova on LinkedIn over the past month.
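The `int()` call in the cell above works when the header shows a plain number such as 951. For larger searches the header may include a thousands separator or a trailing plus sign, which would make `int()` raise `ValueError`. A more tolerant parse, assuming those are the only extra characters, could look like this (`parse_job_count` is a hypothetical helper):

```python
import re

def parse_job_count(text):
    """Extract the job count from header text such as '951' or '1,000+'."""
    digits = re.sub(r'[^0-9]', '', text)  # drop commas, plus signs, spaces
    if not digits:
        raise ValueError(f'No job count found in {text!r}')
    return int(digits)

print(parse_job_count('951'))     # 951
print(parse_job_count('1,000+'))  # 1000
```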


##### Show all the jobs. <br>
The following code scrolls to the bottom of the page and clicks the 'See more jobs' button until all jobs are showing.

In [5]:
#Browse all jobs for the search.
i = 2
while i <= int(no_of_jobs/25)+1: 
    #Scroll to the bottom of the page to reach the 'See more jobs' button.
    wd.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    i = i + 1
    try:
        #Click the 'See more jobs' button, then wait for the new results to load.
        wd.find_element_by_xpath("//button[@aria-label='Load more results']").click()
        time.sleep(20)
    except Exception:
        #Button not found or not clickable; wait before the next scroll.
        time.sleep(20)
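The loop bound above rests on the assumption, built into the code, that each load adds 25 postings. The arithmetic can be made explicit as follows. `loads_needed` is a hypothetical helper, and 25 is the page size assumed by the loop, not a documented LinkedIn constant.

```python
import math

def loads_needed(total_jobs, page_size=25):
    """Number of extra loads after the first page of results."""
    pages = math.ceil(total_jobs / page_size)
    return max(pages - 1, 0)

print(loads_needed(951))  # 38
```

For 951 jobs this gives 38 loads, the same number of iterations the loop above performs. At 20 seconds per iteration, that is roughly 13 minutes of waiting.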

##### Create a list of all jobs in the search.

In [6]:
#Create a list of the jobs.
job_lists = wd.find_element_by_class_name('jobs-search__results-list')
jobs = job_lists.find_elements_by_tag_name('li')

In [7]:
#Test that it collected all jobs.
#If the count dropped significantly, the time.sleep duration may need to be increased to:
#allow more loading time, or
#avoid setting off the site's restrictions.
print(len(jobs), 'were collected from the search')

957 were collected from the search


Sometimes the numbers do not match. That is not a problem at this step. Verification will happen after the details are pulled to make sure the lists line up correctly.

##### Create arrays for each attribute that is needed.

The following steps can be combined, but doing so raises the risk of the run being interrupted by the website. Splitting them up produces fewer errors and finishes in less time overall.

In [8]:
#Pull basic information from each job.
job_title = []
company_name = []
date = []

for job in jobs:
    job_title0 = job.find_element_by_css_selector('h3').get_attribute('innerText')
    job_title.append(job_title0)
 
    company_name0 = job.find_element_by_css_selector('h4').get_attribute('innerText')
    company_name.append(company_name0)
 
    date0 = job.find_element_by_css_selector('div>div>time').get_attribute('datetime')
    date.append(date0)

In [10]:
#See first 5 of each for verification.
print('Job Titles:',job_title[:5])
print(' ')
print('Company Names:',company_name[:5])
print(' ')
print('Date:', date[:5])

Job Titles: ['Admissions Outreach and Communications Specialist', 'Patient Advocate', 'Benefits Verification & Authorization Specialist', 'Manager, Customer Care', 'Public Policy and Government Affairs Manager (46.21)']
 
Company Names: ['California Northstate University', 'Dignity Health', 'CVS Health', 'Magellan Health', 'American States Water Company']
 
Date: ['2021-11-04', '2021-11-04', '2021-10-12', '2021-10-26', '2021-10-13']


In [None]:
#Initial job description and criteria lists.
jd = []
cl = []

In [None]:
#Get job criteria.

#Do in batches that end with longer timers 
#to work around StaleElementReferenceException.
batchsize = 10

for i in range(0, len(jobs), batchsize):
    
    batch = jobs[i:i+batchsize]

    for job in batch:
       
        job.click()
        
        detail_path = 'description__job-criteria-list'
        details = wd.find_element_by_class_name(detail_path).get_attribute('innerText')
        cl.append(details)
    
        time.sleep(5)
        
    time.sleep(30)
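Because of the timers, a pass like the one above takes a while, so it can help to estimate the wait before starting. The sketch below counts only the `time.sleep` calls, using the batch size and timers from the cell above. `estimated_sleep_seconds` is a hypothetical helper, and the real run takes longer because clicks and page loads are not included.

```python
import math

def estimated_sleep_seconds(n_jobs, batchsize=10, per_job=5, per_batch=30):
    """Total time spent in time.sleep for one pass over the job list."""
    n_batches = math.ceil(n_jobs / batchsize)
    return n_jobs * per_job + n_batches * per_batch

seconds = estimated_sleep_seconds(957)
print(seconds, 'seconds, about', round(seconds / 3600, 1), 'hours')
# 7665 seconds, about 2.1 hours
```

For the 957 jobs collected earlier, that is about 2.1 hours per pass. The criteria and description passes each incur this wait separately.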

In [None]:
#Verify criteria is correct.
print(cl[0])

In [None]:
#Get job descriptions.

#Do in batches that end with longer timers 
#to work around StaleElementReferenceException.
batchsize = 10

for i in range(0, len(jobs), batchsize):
    
    batch = jobs[i:i+batchsize]

    for job in batch:
        
        job.click()
    
        jd_path = 'show-more-less-html__markup'
        jd0 = wd.find_element_by_class_name(jd_path).get_attribute('innerText')
        jd.append(jd0)
    
        time.sleep(5)
        
    time.sleep(30)

In [None]:
#Verify job description is correct.
print(jd[0])

##### Verify that all lists are the same length.

In [None]:
print(len(job_title))
print(len(company_name))
print(len(date))
print(len(cl))
print(len(jd))
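Comparing the five printed numbers by eye works, but the check can also be automated so that a mismatch stops the run before the dataframe is built (`pd.DataFrame` raises `ValueError` when given lists of different lengths). A minimal sketch, with `check_lengths` as a hypothetical helper and short stand-in lists in place of the scraped ones:

```python
def check_lengths(columns):
    """Return {name: length}; raise ValueError if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths

# Stand-in data; in the notebook, pass job_title, company_name, date, cl, and jd.
sample = {
    'Title': ['Patient Advocate', 'Manager, Customer Care'],
    'Company': ['Dignity Health', 'Magellan Health'],
    'Date': ['2021-11-04', '2021-10-26'],
}
print(check_lengths(sample))  # {'Title': 2, 'Company': 2, 'Date': 2}
```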

##### Create and save a dataframe of the collected data.

In [28]:
job_post_data = pd.DataFrame({'Date': date,
                              'Company': company_name,
                              'Title': job_title,
                              'Job Description': jd,
                              'Criteria': cl})

job_post_data.to_csv('LinkedIn_Job_Postings.csv', index = False)

job_post_data.head()

##### Another notebook will detail the cleaning process for the text columns.