# Extracting job information from LinkedIn Jobs using BeautifulSoup and Selenium

When I was looking for jobs on the LinkedIn job board, I found that the built-in sorting did not offer enough functionality to help me pick the most suitable jobs to apply to. So I decided to extract job descriptions from LinkedIn and build my own job board. The first step is collecting the data from LinkedIn.

First, I looked at the LinkedIn API, but the endpoints it provides are very limited. So I decided to scrape the data with BeautifulSoup and Selenium. This notebook demonstrates how to set up an extraction pipeline, using a search for data analyst positions in Canada as the example. I fetch the data with a browser driven by Selenium to simulate human browsing behavior, then format and save it in a pandas DataFrame, and finally export it to a CSV file.

After running the code below, we get the following information about each job posting:

- Date
- Title
- Company Name
- Location
- Job Description
- Job Level
- Job Type
- Function
- Industry
- Job ID

You can also use this code for other job types and other locations. To do that, follow these steps:

1. Open this link in a Chrome incognito window.
2. Enter the job title and location in the search bar.
3. Add `sortBy=DD` as a query parameter right after the `location=` parameter (which shows your searched location) in the URL, for example `...&location=Canada&sortBy=DD`.
4. Copy the final link and assign it to the `url` variable in code block 2.
5. To fetch more jobs, set the `no_of_jobs` variable in code block 2 to the number of jobs you want, in multiples of 25 (50, 75, 100, and so on).
6. Once all the steps above are done, run the code.

For this example, we will only look at the 25 most recent jobs.
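
The URL edit described in the steps above can also be done in code. The following is a minimal sketch using only the standard library. The helper name `build_search_url` is hypothetical, and it assumes the search page accepts the `keywords`, `location`, and `sortBy` parameters seen in the example URLs in this notebook; it leaves out `geoId` and any other parameters.

```python
from urllib.parse import quote, urlencode

# Base address taken from the example URLs in this notebook.
BASE_URL = "https://www.linkedin.com/jobs/search/"


def build_search_url(keywords, location, newest_first=True):
    """Build a job search URL (hypothetical helper, not a LinkedIn API)."""
    params = {"keywords": keywords, "location": location}
    if newest_first:
        # sortBy=DD is the parameter the steps above add by hand.
        params["sortBy"] = "DD"
    # quote_via=quote encodes spaces as %20, matching the example URLs.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_search_url("data analyst", "Canada")
print(url)
```

The resulting string can be assigned to the `url` variable in code block 2 in place of a hand-edited link.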

In [48]:
# importing packages
import pandas as pd
import re

import bs4
from bs4 import BeautifulSoup
from datetime import date, timedelta, datetime
from IPython.display import clear_output
from random import randint
from requests import get
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
from time import time
start_time = time()

from warnings import warn

from tqdm import tqdm

In [49]:
# replace variables here.
# url = "https://www.linkedin.com/jobs/search/?f_TPR=r604800&geoId=101174742&keywords=data%20analyst&location=Canada&sortBy=DD"
url = "https://www.linkedin.com/jobs/search/?geoId=90009551&keywords=data%20analyst&location=Greater%20Toronto%20Area%2C%20Canada"
no_of_jobs = 10

In [50]:
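# this opens a new browser window with the url provided above

# https://stackoverflow.com/questions/29858752/error-message-chromedriver-executable-needs-to-be-available-in-the-path
# Selenium 4 takes the driver path through a Service object instead of a positional argument
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service("../chromedriver.exe"))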
driver.get(url)
sleep(3)
action = ActionChains(driver)

In [23]:
# to show more jobs. Depends on number of jobs selected
i = 2
while i <= (no_of_jobs/10): 
    # find_element_by_xpath was removed in Selenium 4; "xpath" is the value of By.XPATH
    driver.find_element("xpath", '/html/body/main/div/section/button').click()
    i = i + 1
    sleep(5)
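
The loop above derives the number of clicks from `no_of_jobs/10`, while the steps earlier talk about multiples of 25. A small helper makes that assumption explicit. This is only a sketch: `clicks_needed` and `batch_size` are hypothetical names, and how many jobs each click actually loads is an assumption that needs checking against the live page.

```python
import math


def clicks_needed(no_of_jobs, batch_size=25):
    """Return how many extra batches to load after the first page.

    batch_size is an assumed number of jobs per batch, not a documented value.
    """
    if no_of_jobs <= 0:
        return 0
    # The first batch is already on the page, so subtract one.
    return max(0, math.ceil(no_of_jobs / batch_size) - 1)


for wanted in (25, 50, 60, 100):
    print(wanted, clicks_needed(wanted))
```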

In [51]:
# parsing the visible webpage
pageSource = driver.page_source
lxml_soup = BeautifulSoup(pageSource, 'lxml')

# searching for all job containers
job_container = lxml_soup.find('ul', class_='jobs-search__results-list').find_all("li")

print('You are scraping information about {} jobs.'.format(len(job_container)))

You are scraping information about 16 jobs.


In [52]:
# setting up list for job information
job_id = []
post_title = []
company_name = []
post_date = []
job_location = []
job_desc = []
level = []
emp_type = []
functions = []
industries = []

# for loop for job title, company, id, location and date posted
for i, job in tqdm(enumerate(job_container)):
    
    # click current panel
    current_panel_xpath = '//*[@id="main-content"]/section[2]/ul/li[{}]/div'.format(i+1)
    # find_element_by_xpath was removed in Selenium 4; "xpath" is the value of By.XPATH
    driver.find_element("xpath", current_panel_xpath).click()
    
    sleep(2)
    
    # job title
    job_titles = job.find("h3", class_="base-search-card__title").text
    post_title.append(job_titles)
    
    # linkedin job id
    job_ids = job.find('a', href=True)['href']
    job_ids = re.findall(r'(?!-)([0-9]*)(?=\?)',job_ids)[0]
    job_id.append(job_ids)
    
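    # company name (read from the logo's alt text, which can be empty,
    # as the blank Company Name column in the output below shows)
    company_names = job.select_one('img')['alt']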
    company_name.append(company_names)
    
    # job location
    job_locations = job.find("span", class_="job-search-card__location").text
    job_location.append(job_locations)
    
    # posting date
    post_dates = job.select_one('time')['datetime']
    post_date.append(post_dates)
    
    # click show more
    showmore_xpath = '//button[@aria-label="Show more, visually expands previously read content above this button"]'
    driver.find_element("xpath", showmore_xpath).click()  # "xpath" is the value of By.XPATH
    
    
    
    # job description
    # jobdesc_xpath = '/html/body/div[1]/div/section/div[2]/section[2]/div/section/div'
    jobdesc_xpath = '//div[@class="description__text description__text--rich"]'
    job_descs = driver.find_element("xpath", jobdesc_xpath).text  # "xpath" is the value of By.XPATH
    job_desc.append(job_descs)
    
    
    # job description page lxml
    jd_page = BeautifulSoup(driver.page_source, 'lxml')
    
    # define a criteria collector
    criteria_map = { "level": '', "emp_type": "", "functions": "", "industries": "" }
    
    # fill the data into the collector
    for criterias in jd_page.find_all('li', {'class': 'description__job-criteria-item'}):

        if "Seniority level" in criterias.text:
            criteria_map["level"] = criterias.span.text.strip()

        if "Employment type" in criterias.text:
            criteria_map["emp_type"] = criterias.span.text.strip()
        
        if "Job function" in criterias.text:
            criteria_map["functions"] = criterias.span.text.strip()

        if "Industries" in criterias.text:
            criteria_map["industries"] = criterias.span.text.strip()
    
    
    level.append(criteria_map["level"])
    emp_type.append(criteria_map["emp_type"])
    functions.append(criteria_map["functions"])
    industries.append(criteria_map["industries"])

16it [00:41,  2.59s/it]
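
The job ID step in the loop above only works when the link contains a `?` right after the numeric ID. A slightly more defensive version is sketched below. The helper `extract_job_id` and the sample links are hypothetical (only the ID `2553709029` comes from the output in this notebook), and the sketch assumes the ID is the last run of digits in the URL path.

```python
import re
from urllib.parse import urlparse


def extract_job_id(href):
    """Return the trailing numeric ID of a job link, or None if absent."""
    # Drop the query string and fragment so they cannot confuse the match.
    path = urlparse(href).path
    match = re.search(r"(\d+)/?$", path)
    return match.group(1) if match else None


# Hypothetical link shapes, with and without a query string.
with_query = "https://ca.linkedin.com/jobs/view/data-analyst-remote-at-tucows-2553709029?refId=abc"
without_query = "https://ca.linkedin.com/jobs/view/data-analyst-remote-at-tucows-2553709029"

print(extract_job_id(with_query))
print(extract_job_id(without_query))
```

Returning `None` instead of raising lets the loop record a missing ID and carry on with the next posting.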


In [53]:
# to check if we have all information
print(len(job_id))
print(len(post_date))
print(len(company_name))
print(len(post_title))
print(len(job_location))
print(len(job_desc))
print(len(level))
print(len(emp_type))
print(len(functions))
print(len(industries))

16
16
16
16
16
16
16
16
16
16
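
Comparing ten printed lengths by eye works for a handful of lists, but the same check can be written as a single call that fails loudly on a mismatch. This is a sketch with made-up sample data; `check_equal_lengths` is a hypothetical helper, not part of pandas.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # All lengths are equal here; an empty mapping counts as length 0.
    return next(iter(lengths.values()), 0)


# Made-up sample data standing in for the scraped lists.
sample = {
    "job_id": ["1", "2"],
    "post_title": ["Data Analyst", "Business Analyst"],
}
print(check_equal_lengths(sample))
```

In the notebook, the dictionary passed in would hold the ten lists filled by the scraping loop.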


In [54]:
# creating a dataframe
job_data = pd.DataFrame({'Job ID': job_id,
'Date': post_date,
'Company Name': company_name,
'Post': post_title,
'Location': job_location,
'Description': job_desc,
'Level': level,
'Type': emp_type,
'Function': functions,
'Industry': industries
})

# cleaning description column
job_data['Description'] = job_data['Description'].str.replace('\n',' ')

print(job_data.info())
job_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Job ID        16 non-null     object
 1   Date          16 non-null     object
 2   Company Name  16 non-null     object
 3   Post          16 non-null     object
 4   Location      16 non-null     object
 5   Description   16 non-null     object
 6   Level         16 non-null     object
 7   Type          16 non-null     object
 8   Function      16 non-null     object
 9   Industry      16 non-null     object
dtypes: object(10)
memory usage: 1.4+ KB
None


| | Job ID | Date | Company Name | Post | Location | Description | Level | Type | Function | Industry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2553709029 | 2021-05-17 | | `\n \n Data Analyst - Remote\...` | `\n Toronto, Ontario, Canada\n` | Tucows (NASDAQ:TCX, TSX:TC) is on a mission to... | Entry level | Full-time | Information Technology | Information Technology and Services, Computer ... |
| 1 | 2552463476 | 2021-05-24 | | `\n \n Business Analyst\n ...` | `\n Toronto, Ontario, Canada\n` | The Role Working closely with project stakeho... | Associate | Full-time | Information Technology and Analyst | Veterinary and Information Technology and Serv... |
| 2 | 2542029589 | 2021-05-10 | | `\n \n Data Analyst\n \n...` | `\n Toronto, Ontario, Canada\n` | ContentFly (YC W21) is one of the fastest grow... | Mid-Senior level | Full-time | Analyst | Marketing and Advertising, Computer Software, ... |
| 3 | 2547021172 | 2021-05-20 | | `\n \n Data Analyst\n \n...` | `\n Mississauga, Ontario, Canada\n ...` | Must-Have • Analytical and problem-solving sk... | Mid-Senior level | Full-time | Information Technology | Information Technology and Services |
| 4 | 2562460575 | 2021-05-22 | | `\n \n Data Analyst\n \n...` | `\n Toronto, Ontario, Canada\n` | About PartnerStack PartnerStack helps compani... | Entry level | Full-time | Information Technology | Marketing and Advertising, Computer Software, ... |
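
The Post and Location values in the preview above still begin and end with newlines and padding, because the scraped text was stored without trimming. One way to tidy such strings is sketched below. `clean_text` is a hypothetical helper and the sample string is made up to resemble the preview.

```python
def clean_text(value):
    """Collapse runs of whitespace, including newlines, into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops
    # leading and trailing whitespace, so joining yields a trimmed string.
    return " ".join(value.split())


# Made-up sample resembling the untrimmed titles in the preview.
raw_title = "\n        \n        Data Analyst - Remote\n      "
print(clean_text(raw_title))  # Data Analyst - Remote
```

In the notebook, the same function could be applied to a column with `job_data['Post'].map(clean_text)`, or the text could be trimmed with `.strip()` at the point where it is scraped.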


In [55]:
job_data.to_csv('LinkedIn Job Data.csv', index=False)