# **WEB SCRAPING LINKEDIN JOBS**

Reference: https://maoviola.medium.com/a-complete-guide-to-web-scraping-linkedin-job-postings-ad290fcaa97f 

The code may differ slightly from the reference because of changes in newer Selenium releases.

Prerequisites:
- Have Python 3 installed. You can download it here: https://www.python.org/downloads/windows/
- Ensure pip or Anaconda is installed. Choose whichever you prefer.
- Have Jupyter Notebook installed: https://jupyter.org/install (if using pip) or https://anaconda.org/anaconda/jupyter (if using Anaconda)
- Have Selenium WebDriver installed: https://pypi.org/project/selenium/ (if using pip) or https://anaconda.org/conda-forge/selenium (if using Anaconda)
- Have pandas installed
- Have webdriver-manager installed (the first code cell imports it)
- Download the Chrome WebDriver: https://chromedriver.chromium.org/downloads (make sure it supports your Chrome version!)
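Before running the notebook, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library. The `REQUIRED` list and the `missing_packages` helper are illustrative names, and `openpyxl` is an assumption: pandas needs an Excel writer engine such as openpyxl to save `.xlsx` files.

```python
from importlib.util import find_spec

# Import names used by this notebook.
# Assumption: openpyxl is listed because pandas needs an Excel writer
# engine (such as openpyxl) for DataFrame.to_excel with .xlsx files.
REQUIRED = ["selenium", "pandas", "webdriver_manager", "openpyxl"]


def missing_packages(names=REQUIRED):
    """Return the top-level import names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Any name it reports can then be installed with pip or conda, whichever you chose above.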

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
from webdriver_manager.chrome import ChromeDriverManager

# 1. Opening Browser & Scroll The Job Listing

In [58]:
# Path to chromedriver.exe. The raw string (r'...') keeps the backslashes from being
# read as escape sequences, which avoids a unicode error. Change it to your own path.
chrome_driver_path = r'C:\Users\62851\project-web-scraping-linkedin\chromedriver.exe'

# Set up Chrome options
options = Options()
options.add_argument("start-maximized")

# Set up and configure the Chrome browser for automated browsing with Selenium
webdriver_service = Service(chrome_driver_path)
driver = webdriver.Chrome(options=options, service=webdriver_service)
wait = WebDriverWait(driver, 5)

#Define URL
linkedin_url = 'https://www.linkedin.com/jobs/search/?keywords=data%20scientist&location=Indonesia'

# Action Steps
driver.maximize_window()
driver.get(linkedin_url)  # Open web page

# Determine how many jobs we want to scrape, and calculate how many times we need to scroll down
no_of_jobs = 100
n_scroll = int(no_of_jobs / 25) + 1
print(n_scroll)

i = 1
driver.execute_script("window.scrollTo(0, 0);")  # Scroll to top
while i <= n_scroll:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # Scroll to the bottom of page
    time.sleep(3)  # Wait for 3 seconds to load the content
    try:
        button = driver.find_element(By.XPATH, "/html/body/div[1]/div/main/section[2]/button")
        button.click()
        time.sleep(2)
        print("Load more clicked")
    except Exception as e:
        print(f"No more load button found: {e}")
        break
    i += 1

print("Total jobs:")
jobs = driver.find_element(By.CLASS_NAME, "jobs-search__results-list").find_elements(By.TAG_NAME, 'li')  # Return a list
print(len(jobs))

5
No more load button found: Message: element not interactable
  (Session info: chrome=124.0.6367.208)

Total jobs:
70
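The search URL and the scroll count in the cell above are both simple to parameterise. The sketch below rebuilds the same URL from a keyword and a location, and reproduces the `int(no_of_jobs / 25) + 1` calculation. The helper names `build_search_url` and `scrolls_needed` are hypothetical, and the batch size of 25 is taken from the division in the cell, not from any documented LinkedIn behaviour.

```python
from urllib.parse import quote, urlencode

JOBS_PER_LOAD = 25  # batch size implied by the division in the cell above


def build_search_url(keywords, location):
    """Build the public job-search URL, encoding spaces as %20."""
    query = urlencode({"keywords": keywords, "location": location}, quote_via=quote)
    return f"https://www.linkedin.com/jobs/search/?{query}"


def scrolls_needed(no_of_jobs, per_load=JOBS_PER_LOAD):
    """Same arithmetic as int(no_of_jobs / 25) + 1 in the notebook."""
    return no_of_jobs // per_load + 1


print(build_search_url("data scientist", "Indonesia"))
print(scrolls_needed(100))  # 5, matching the notebook output
```

The extra `+ 1` gives one more scroll than strictly needed, which leaves room for a slow page load.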


In [56]:
job_id= []
job_title = []
company_name = []
location = []
date = []
job_link = []
for job in jobs:
    job_id0 = job.get_attribute('data-entity-urn')
    job_id.append(job_id0)

    job_title0 = job.find_element(By.CSS_SELECTOR,'h3').get_attribute('innerText')
    job_title.append(job_title0)

    company_name0 = job.find_element(By.CSS_SELECTOR,'h4').get_attribute('innerText')
    company_name.append(company_name0)

    location0 = job.find_element(By.CLASS_NAME,'job-search-card__location').get_attribute('innerText')
    location.append(location0)

    date0 = job.find_element(By.CSS_SELECTOR,'div>div>time').get_attribute('datetime')
    date.append(date0)

    job_link0 = job.find_element(By.CSS_SELECTOR,'a').get_attribute('href')
    job_link.append(job_link0)

# 2. Get Main Attributes of Each Job Listing

Important Notes:
- Check the HTML and CSS element paths regularly, because the page structure can change at any time.
- You can also group all the selectors that are likely to change in one place, so they are easy to update.
- This is not the most efficient code. It works, but it needs improvement, so feel free to refine it yourself.
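One way to follow the second note is to keep every page-dependent XPath in a single structure and read from it everywhere. The sketch below copies the XPaths used in this notebook. The names `DETAIL_ROOT`, `XPATHS`, `CRITERIA`, `criteria_xpath`, and `first_text` are hypothetical, and `driver` is assumed to be any object with a Selenium-style `find_element(by, value)` method.

```python
# All page-structure-dependent paths in one place (copied from this notebook).
DETAIL_ROOT = "/html/body/div[1]/div/section/div[2]/div"

XPATHS = {
    "load_more": "/html/body/div[1]/div/main/section[2]/button",
    # Tried in order: the first path that resolves wins.
    "description": [
        DETAIL_ROOT + "/section[1]/div/div/section",
        DETAIL_ROOT + "/section[2]/div/div/section/div",
    ],
}

# Position of each field in the job criteria list (li[1] .. li[4]).
CRITERIA = {"seniority": 1, "emp_type": 2, "job_func": 3, "industries": 4}


def criteria_xpath(field, is_benefit=False):
    """XPath of one criteria field; section 2 is used when pay info comes first."""
    section = 2 if is_benefit else 1
    return f"{DETAIL_ROOT}/section[{section}]/div/ul/li[{CRITERIA[field]}]/span"


def first_text(driver, xpaths, default=""):
    """Return innerText of the first XPath that resolves, else `default`."""
    for xpath in xpaths:
        try:
            # "xpath" is the string value of selenium's By.XPATH
            return driver.find_element("xpath", xpath).get_attribute("innerText")
        except Exception:
            continue
    return default
```

With this in place, a layout change only requires editing the constants at the top, and a call such as `first_text(driver, [criteria_xpath("seniority")])` replaces each repeated try/except block.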

In [63]:
jd = []
seniority = []
emp_type = []
job_func = []
industries = []
# Loop through each job
for item in range(len(jobs)):
    # Click the job card to open its details pane.
    # The XPaths are absolute, so search from `driver` rather than from a single `job` element.
    try:
        job_click_path = f'/html/body/div[1]/div/main/section[2]/ul/li[{item+1}]/a/div[1]/img'
        driver.find_element(By.XPATH, job_click_path).click()
    except Exception:
        # Fallback for cards with a different layout
        job_click_path = f'/html/body/div[1]/div/main/section[2]/ul/li[{item+1}]/div/a'
        driver.find_element(By.XPATH, job_click_path).click()
    
    time.sleep(3)  # Wait for the page to load

    # Fetching job description
    try:
        jd_path = '/html/body/div[1]/div/section/div[2]/div/section[1]/div/div/section'
        jd0 = driver.find_element(By.XPATH, jd_path).get_attribute('innerText')
    except Exception:
        jd_path = '/html/body/div[1]/div/section/div[2]/div/section[2]/div/div/section/div'
        jd0 = driver.find_element(By.XPATH, jd_path).get_attribute('innerText')

    # If the first section holds pay information ('Base pay range'),
    # the job description is in the second section instead
    is_benefit = 'Base pay range' in jd0
    if is_benefit:
        jd_path = '/html/body/div[1]/div/section/div[2]/div/section[2]/div/div/section/div'
        jd0 = driver.find_element(By.XPATH, jd_path).get_attribute('innerText')
    
    jd.append(jd0)
    
    jd_path2 = '/html/body/div[1]/div/section/div[2]/div/section[2]/div' if is_benefit else '/html/body/div[1]/div/section/div[2]/div/section[1]/div'

    # Fetching seniority
    try:
        seniority_path = jd_path2 + '/ul/li[1]/span'
        seniority0 = driver.find_element(By.XPATH, seniority_path).get_attribute('innerText')
        seniority.append(seniority0)
    except Exception:
        seniority.append('')  # Handling if seniority is not available

    # Fetching employment type
    try:
        emp_type_path = jd_path2 + '/ul/li[2]/span'
        emp_type0 = driver.find_element(By.XPATH, emp_type_path).get_attribute('innerText')
        emp_type.append(emp_type0)
    except Exception:
        emp_type.append('')  # Handling if employment type is not available

    # Fetching job function
    try:
        job_func_path = jd_path2 + '/ul/li[3]/span'
        job_func0 = driver.find_element(By.XPATH, job_func_path).get_attribute('innerText')
        job_func.append(job_func0)
    except Exception:
        job_func.append('')  # Handling if job function is not available

    # Fetching industries
    try:
        industries_path = jd_path2 + '/ul/li[4]/span'
        industries0 = driver.find_element(By.XPATH, industries_path).get_attribute('innerText')
        industries.append(industries0)
    except Exception:
        industries.append('')  # Handling if industries are not available

    # Adding a slight delay to avoid overwhelming the server and giving enough time for elements to load
    time.sleep(2)

# Print results for debugging
print("Job Descriptions:", jd)
print("Seniorities:", seniority)
print("Employment Types:", emp_type)
print("Job Functions:", job_func)
print("Industries:", industries)

Job Descriptions: ["A Data Scientist is a specialist in doing deep-dive analytics, with advanced-level proficiency in data analytics by applying data science methodology, creating model algorithms (Mathematics or Statistical Model), data mining technique and storytelling delivery skills.\n\n\n\n\nKey Responsibilities\n\nYou will work closely with Tribe Leaders, Product Managers & Data Team in identifying potential data science solutions to improve Lending and funding portfolio‘s performance\nConsolidate, verifying, analyzing, from operational data, transaction data, marketing data, and external data to develop hypotheses and building potential quick win use case or business improvement recommendation\nHand in hand with data engineering and IT team identifying potential available data, for developing the business DataMart, build the models, train the model, validate, test it and fine tuning on real world use case\nIn charge and comfortable in model presentation, articulating, communicat

In [64]:
job_data = pd.DataFrame({'ID': job_id,
                        'Date': date,
                        'Company': company_name,
                        'Title': job_title,
                        'Location': location,
                        'Description' : jd,
                        'Level': seniority,
                        'Type': emp_type,
                        'Function': job_func,
                        'Industry': industries
                        })
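`pd.DataFrame` raises a `ValueError` when the lists passed in a dict have different lengths, which can happen here if one of the scraping loops stops early. A small check before building the DataFrame makes the mismatch easier to spot. This is a sketch only: `check_equal_lengths` is a hypothetical helper, and the sample values are made up for illustration.

```python
def check_equal_lengths(columns):
    """Return the shared row count, or raise ValueError if the lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Illustrative data only; in the notebook you would pass the scraped lists,
# e.g. {'ID': job_id, 'Date': date, 'Description': jd, ...}
sample = {
    "Title": ["Data Scientist", "AI Developer (NLP)"],
    "Level": ["Entry level", ""],
}
print(check_equal_lengths(sample), "rows in every column")
```

If the check fails, the printed lengths show which list is short and therefore which loop stopped before reaching the last job.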

In [65]:
job_data.tail(10)

|  | ID | Date | Company | Title | Location | Description | Level | Type | Function | Industry |
|---|---|---|---|---|---|---|---|---|---|---|
| 60 |  | 2024-05-14 | Traveloka | Data Scientist - Experimentation | Jakarta, Indonesia | It's fun to work in a company where people tru... | Mid-Senior level | Full-time | Engineering and Information Technology | Software Development and Travel Arrangements |
| 61 |  | 2024-04-29 | Traveloka | Machine Learning Engineer (Product) | Jakarta Metropolitan Area | It's fun to work in a company where people tru... | Entry level | Full-time | Engineering and Information Technology | Software Development |
| 62 |  | 2024-04-12 | Shopee | Data Scientist, Business Intelligence | Jakarta, Indonesia | Job Description\n\n\nPrototyping, developing, ... | Mid-Senior level | Full-time | Engineering and Information Technology | Technology, Information and Internet |
| 63 |  | 2024-05-03 | SQE (S-Quantum Engine) | Senior Data Scientist (Pipeline) | Jakarta, Indonesia | The S-Quantum Engine is Sinarmas Financial Ser... | Mid-Senior level | Full-time | Information Technology | Technology, Information and Internet, Financia... |
| 64 |  | 2024-02-02 | ilmuOne Data | Data Scientist | Jakarta Metropolitan Area | We are an independent data analytics consultan... |  |  |  |  |
| 65 |  | 2024-03-04 | KOMPAS GRAMEDIA | Data Scientist KG Media | Jakarta, Jakarta, Indonesia | Capitalize structured /unstructured data and u... | Entry level | Full-time | Engineering and Information Technology | Media Production |
| 66 |  | 2024-05-13 | JALA | Data Scientist | Yogyakarta, Yogyakarta, Indonesia | Responsibilities\n\n\nApply your expertise in ... | Entry level | Full-time | Engineering and Information Technology | Technology, Information and Internet |
| 67 |  | 2024-04-02 | Akulaku Indonesia | Data Scientist | Jakarta, Jakarta, Indonesia | Responsibilities\n\n\nPerform qualitative and ... | Entry level | Full-time | Engineering and Information Technology | Financial Services |
| 68 |  | 2024-05-13 | KOMPAS GRAMEDIA | Data Scientist | Jakarta, Jakarta, Indonesia | Collaborate with different functional teams to... | Entry level | Full-time | Engineering and Information Technology | Media Production |
| 69 |  | 2024-03-05 | KECILIN | AI Developer (NLP) | Jakarta, Jakarta, Indonesia | Responsibilities\n\n\nDesign, develop, and dep... |  |  |  |  |


In [66]:
# Change the output file path to your desired location. Don't forget to include the file name and extension.
output_file_path = "C:\\Users\\62851\\project-web-scraping-linkedin\\datascience.xlsx"
job_data.to_excel(output_file_path, index = False)
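As an alternative to escaping backslashes by hand, the output path can be built with `pathlib`, which also makes it easy to check the extension and create the folder first. This is a minimal sketch: `excel_output_path` is a hypothetical helper, and the demonstration runs in a temporary folder so it does not touch your files.

```python
import tempfile
from pathlib import Path


def excel_output_path(folder, filename="datascience.xlsx"):
    """Build the output path, check the extension, and create the folder if needed."""
    path = Path(folder) / filename
    if path.suffix.lower() != ".xlsx":
        raise ValueError(f"Expected an .xlsx file name, got: {path.name}")
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a temporary folder; in the notebook you would pass your own folder
with tempfile.TemporaryDirectory() as tmp:
    demo_path = excel_output_path(Path(tmp) / "project-web-scraping-linkedin")
    print(demo_path.name)  # datascience.xlsx
```

In the notebook, the returned path can be passed straight to `job_data.to_excel(..., index=False)`, since pandas accepts path-like objects.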