# Web Scraping SSCASN BKN: From Interactive Elements to Meaningful Data

Okay, so here’s the deal. I’m using this notebook to scrape data from the SSCASN BKN website. Why? Because I’m trying to figure out which job formations have a higher chance of getting me through the CPNS 2025 selection process. Instead of manually sifting through pages and pages of information, why not let Python do the boring work?

This project is super personal and I’m basically building my own cheat sheet to make smarter decisions during the application process. Plus, I get to brush up on my web scraping skills while I’m at it. Win-win, right?

What’s the Plan?

The SSCASN website has everything I need, but it’s not exactly user-friendly for data nerds like me, especially the filters, which only allow one selection per load. So the challenges for automating the scrape with Python are:

- Dropdown menus that dynamically load options.
- Pagination, because of course, all the good stuff is spread across multiple pages.
- Dynamic tables that make scraping a bit tricky.
- The job description of each formation sits behind a separate link, so how do I gather all the data I need?

The goal? 

Automate the whole process so I can get clean data in one go, then analyze the data retrieved.

In [2]:
import pandas as pd
import math
import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

In [3]:
# Path to the chromedriver; adjust it as needed
path_chromedriver = os.path.join(os.getcwd(), "chromedriver-win64", "chromedriver.exe")

chrome_options = Options()
service = Service(executable_path=path_chromedriver)

In [4]:
# Let's load the Driver
driver = webdriver.Chrome(service=service, options=chrome_options)
url = "https://sscasn.bkn.go.id/"  # the website URL
driver.get(url)

This function was initially created for development purposes to reset the web driver back to the homepage of the SSCASN website. It was particularly useful for troubleshooting or ensuring the code ran smoothly by restarting the scraping process from a clean state. 

Now, it is also used as a convenient way to return to the homepage during operations.

In [5]:
def reset_scrap():
    """Reset the browser to the SSCASN homepage."""
    url = "https://sscasn.bkn.go.id/"
    driver.get(url)


reset_scrap()

Here I limit the filters to my desired criteria; adjust them as you need
(make sure the text matches exactly what is shown on the website).

In [6]:
jenjang = "S-1/Sarjana"
pengadaan = "CPNS"
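
Exact matching matters because the dropdown lookup in the filter function uses an XPath `contains()` search, which can return several options for one label (for example, "S-1 STATISTIK" is also a substring of "S-1 STATISTIK KEUANGAN"). Here is a minimal sketch of that selection step on plain strings. `pick_exact` is a hypothetical helper, not part of the notebook, and the `strip()` call is an extra safeguard I assumed would be harmless.

```python
def pick_exact(option_texts, label):
    """Return the index of the option whose text equals label, or None."""
    for index, text in enumerate(option_texts):
        # A substring match is not enough; only a full match counts
        if text.strip() == label:
            return index
    return None


# Option texts as a contains() search for "S-1 STATISTIK" might return them
candidates = ["S-1 STATISTIK KEUANGAN", "S-1 STATISTIK", "S-1 STATISTIK KOMPUTASI"]
print(pick_exact(candidates, "S-1 STATISTIK"))  # 1
print(pick_exact(candidates, "S-1 Statistik"))  # None, because case differs
```

If the label is off by even one character, nothing gets clicked and the search runs with the wrong filter, so it is worth copying the text straight from the site.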

Since I want to perform multiple searches, my plan is to loop through the prodi list and retrieve all the data for each prodi.

In [8]:
prodi_list = [
    "S-1 ILMU STATISTIK",
    "S-1 ILMU STATISTIKA",
    "S-1 KEPENDUDUKAN DAN STATISTIK",
    "S-1 KOMPUTASI STATISTIKA",
    "S-1 MATEMATIKA STATISTIKA",
    "S-1 STATISTIK",
    "S-1 STATISTIK (DATA ANALISIS)",
    "S-1 STATISTIK KEUANGAN",
    "S-1 STATISTIK KOMPUTASI",
    "S-1 STATISTIKA",
    "S-1 STATISTIKA BISNIS DAN INDUSTRI",
    "S-1 STATISTIKA DAN SAINS DATA",
    "S-1 STATISTIKA TERAPAN",
    "S-1/D-IV STATISTIK"
]

Here is the code that interacts with the filters, starting with an empty list to hold the results.

In [9]:
all_data = []  # collects every scraped row across all searches

In [16]:
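def interact_with_filter(jenjang, prodi):
    """Apply the filters, run the search, and return the number of result pages.

    Note: `pengadaan` is read from the notebook's global scope.
    """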

    # Select the "Jenjang Pendidikan"
    jenjang_pendidikan = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="pencarian"]/div/div/form/div[1]/div[1]/div/div/div/input'))
    )
    jenjang_pendidikan.click()

    options = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, f"//li[contains(text(), '{jenjang}')]"))
    )

    for option in options:
        if option.text == jenjang:
            option.click() 
            break
    time.sleep(1)  # Give the website time to load

    # Select the "Prodi"
    prodi_path = '//*[@id="pencarian"]/div/div/form/div[1]/div[2]/div/div/div/input'
    prodi_select = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, prodi_path))
    )
    prodi_select.click()

    options = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, f"//li[contains(text(),'{prodi}')]"))
    )

    for option in options:
        if option.text == prodi:
            option.click()
            break

    time.sleep(1)  # Give the website time to load

    # Select the pengadaan type (e.g. "CPNS")
    pengadaan_path = '//*[@id="pencarian"]/div/div/form/div[1]/div[4]/div/div/div/input'
    pengadaan_select = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, pengadaan_path))
    )
    pengadaan_select.click()

    options = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, f"//li[contains(text(),'{pengadaan}')]"))
    )

    for option in options:
        if option.text == pengadaan:
            option.click()
            break

    time.sleep(1)  # Again, give the website time to load 

    # Finally, click the Search button
    cari_path = '//*[@id="pencarian"]/div/div/form/div[1]/div[5]/a'
    driver.find_element(By.XPATH, cari_path).click()

    time.sleep(5)  # Give the website time to load the tables

    # Let's check for how many pages there are
    total_formasi_path = driver.find_element(By.XPATH, '//*[@id="daftarFormasi"]/div[2]/div/div/div[3]/ul/li[1]')

    text_halaman = total_formasi_path.text
    total_formasi = text_halaman.split(': ')[1].split()[0]
    total_page = math.ceil(int(total_formasi) / 10)
    
    return total_page
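
The page count at the end of `interact_with_filter` can be checked without a browser. The sketch below repeats the same parsing on a sample string. The label text "Total Data: 57" is an assumed example, since the real wording on the site may differ, and `pages_from_label` is a hypothetical helper.

```python
import math

ROWS_PER_PAGE = 10  # the notebook divides by 10, i.e. ten rows per page


def pages_from_label(label_text, rows_per_page=ROWS_PER_PAGE):
    """Parse the count after ': ' and convert it to a number of pages."""
    total_rows = int(label_text.split(": ")[1].split()[0])
    return math.ceil(total_rows / rows_per_page)


print(pages_from_label("Total Data: 57"))  # 6: five full pages plus one with 7 rows
print(pages_from_label("Total Data: 10"))  # 1
```

Rounding up with `math.ceil` is what keeps the last, partially filled page from being dropped.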

This function defines how to interact with the pagination, specifically the '>' button. 

Why does the code look like this? After analyzing the HTML structure, I discovered that the button's position in the pagination list changes depending on the current page. To handle this, I first create a base XPath format and then loop through the pages, picking the index that matches each case.

In [11]:
def klik(index):
    """Click the pagination button at position `index` in the pagination list."""

    button_xpath = f'//*[@id="daftarFormasi"]/div[2]/div/div/div[3]/ul/li[{index}]/button'
    button = driver.find_element(By.XPATH, button_xpath)
    button.click()

And here is the main agenda: extracting the data. We will extract it from a dynamic table that changes based on pagination and filters.

In [13]:
def extract_data():
    """Read every row of the results table on the current page into a list of dicts."""
    table_data = []
    table = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//table"))
    )

    rows = table.find_elements(By.XPATH, ".//tbody/tr")

    for row in rows:

        cells = row.find_elements(By.TAG_NAME, "td")
        row_data = {
            "Jabatan": cells[0].text,
            "Instansi": cells[1].text,
            "Unit Kerja": cells[2].text,
            "Formasi": cells[3].text,
            "(PPPK) Khusus disabilitas? (CPNS) Dapat Diisi Disabilitas?": cells[4].text,
            "Penghasilan (juta)": cells[5].text,
            "Jumlah Kebutuhan": cells[6].text,
            "Jumlah Lulus verifikasi": cells[7].text,
            "Link": cells[8].find_element(By.TAG_NAME, "a").get_attribute('href')
        }

        table_data.append(row_data)
    return table_data
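
The column mapping in `extract_data` can also be written as a `zip` over a list of headers, which keeps names and positions in one place. This is a sketch on plain strings rather than Selenium elements. `COLUMNS` and `row_to_dict` are hypothetical names, and the sample row is made up for illustration.

```python
COLUMNS = [
    "Jabatan",
    "Instansi",
    "Unit Kerja",
    "Formasi",
    "(PPPK) Khusus disabilitas? (CPNS) Dapat Diisi Disabilitas?",
    "Penghasilan (juta)",
    "Jumlah Kebutuhan",
    "Jumlah Lulus verifikasi",
]


def row_to_dict(cell_texts, link):
    """Pair the first eight cell texts with their headers and attach the link."""
    if len(cell_texts) < len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cell_texts)}")
    row = dict(zip(COLUMNS, cell_texts))
    row["Link"] = link
    return row


# Made-up sample row, for illustration only
sample = ["Statistisi", "Instansi A", "Unit A", "Umum", "Tidak", "5-7", "2", "40"]
record = row_to_dict(sample, "https://example.com/detail")
print(record["Jabatan"], record["Jumlah Kebutuhan"])
```

The length check is an addition of mine: it fails loudly on a short row instead of raising an `IndexError` halfway through a page.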

In [15]:
def loop_data(total_page):
    """Extract every result page, clicking the next button between pages."""
    for current_page in range(1, total_page + 1):
        all_data.extend(extract_data())

        # The last page has no next page to click
        if current_page == total_page:
            break

        # The position of the '>' button shifts with the current page
        if current_page <= 3:
            page_index = 10
        elif current_page == total_page - 3 or current_page == 4:
            page_index = 11
        elif current_page >= total_page - 2:
            page_index = 10
        else:
            page_index = 12
        klik(page_index)

    reset_scrap()
    time.sleep(3)
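
The branching in `loop_data` can be pulled out into a pure function, which makes the rules easy to check without a browser. The sketch below copies the conditions from the loop as they are. `next_button_index` is a hypothetical name, and whether these indices fit every page count depends on the site's pagination markup, which I have only assumed from the code above.

```python
def next_button_index(current_page, total_page):
    """Return the <li> position of the '>' button for the page currently shown."""
    if current_page <= 3:
        return 10
    if current_page == total_page - 3 or current_page == 4:
        return 11
    if current_page >= total_page - 2:
        return 10
    return 12


# Positions of the '>' button while walking through 10 pages of results
print([next_button_index(page, 10) for page in range(1, 10)])
# [10, 10, 10, 11, 12, 12, 11, 10, 10]
```

Printing the sequence for a few page counts is a quick way to spot a case where the button index would be wrong before running a long scrape.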

In [None]:
for prodi in prodi_list:
    pages = interact_with_filter(jenjang,prodi)
    loop_data(pages)

In [14]:
df = pd.DataFrame(all_data)
df.to_excel("extracted_data.xlsx", index=False) 
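
With the data exported, one simple way to start the analysis is to compare the number of verified applicants with the number of openings per formation. This is only a sketch under assumptions: it treats "Jumlah Lulus verifikasi" divided by "Jumlah Kebutuhan" as a rough competition measure, the rows below are made-up examples, and `applicants_per_seat` is a hypothetical helper.

```python
def applicants_per_seat(row):
    """Verified applicants divided by openings; None when there are no openings."""
    seats = int(row["Jumlah Kebutuhan"])
    applicants = int(row["Jumlah Lulus verifikasi"])
    return applicants / seats if seats else None


# Made-up rows shaped like the dictionaries produced by extract_data()
rows = [
    {"Jabatan": "Formasi A", "Jumlah Kebutuhan": "2", "Jumlah Lulus verifikasi": "40"},
    {"Jabatan": "Formasi B", "Jumlah Kebutuhan": "5", "Jumlah Lulus verifikasi": "25"},
]

# Skip rows without openings, then list the least crowded formations first
rankable = [row for row in rows if applicants_per_seat(row) is not None]
ranked = sorted(rankable, key=applicants_per_seat)
for row in ranked:
    print(row["Jabatan"], applicants_per_seat(row))
```

The scraped cells are strings, so real values may need extra cleaning before `int()` succeeds.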