# Web Scraping LPSE - Blacklist Detailed Data

---

For an introduction to Selenium, see [**this guide**](https://www.scrapingbee.com/blog/selenium-python/).

## Import modules

`%pip freeze > requirements.txt`

In [1]:
# Module for web scraping
from selenium import webdriver
# Module for data manipulation
import pandas as pd
from bs4 import BeautifulSoup
# Module for regular expression
import re

## Load the Chromedriver

Read how to download the WebDriver for Chrome [**here**](https://chromedriver.chromium.org/downloads).

In [2]:
# URL of the detail page for record 4537
detailed_link = 'https://inaproc.id/daftar-hitam#4537'

In [3]:
# Open the detail page in Chrome
DRIVER_PATH = '../bin/chromedriver'
driver = webdriver.Chrome(executable_path = DRIVER_PATH)
driver.get(detailed_link)

## Core Procedure

### 1 Prepare the column names

In [4]:
# Table body that holds the blacklist records
dataCollection = driver.find_element_by_id('injunctions').find_element_by_tag_name('tbody')
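
The `executable_path` argument and the `find_element_by_*` helpers used in this notebook were deprecated in Selenium 4 and removed in later 4.x releases. The sketch below shows how the same two steps could look with the Selenium 4 API. The function name `open_injunction_table` is hypothetical, and the code assumes that Selenium 4 and a matching ChromeDriver are installed.

```python
def open_injunction_table(url, driver_path):
    """Open `url` in Chrome and return (driver, tbody element) with the Selenium 4 API."""
    # Imported lazily so that defining the function does not require Selenium
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    # Selenium 4 receives the driver location through a Service object
    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    driver.get(url)
    # find_element(By.<strategy>, value) replaces the find_element_by_* helpers
    table_body = driver.find_element(By.ID, 'injunctions').find_element(By.TAG_NAME, 'tbody')
    return driver, table_body
```

With a browser available, it would be called as `driver, dataCollection = open_injunction_table(detailed_link, DRIVER_PATH)`.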

In [5]:
# Prepare blank dictionary for columns
third_column = {
    'Judul Pelanggaran': [],
    'Isi Pelanggaran': [],
    'Nama KLPD': [],
    'Nama Satker': [],
    'Masa Berlaku Sanksi': [],
    'Tanggal Penayangan': []
}

third_column

{'Judul Pelanggaran': [],
 'Isi Pelanggaran': [],
 'Nama KLPD': [],
 'Nama Satker': [],
 'Masa Berlaku Sanksi': [],
 'Tanggal Penayangan': []}
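
The same empty structure can also be built from a list of column names. This is only a sketch: `COLUMN_NAMES` and `empty_columns` are names introduced here for illustration. It also shows why `dict.fromkeys` with a list default is not a safe shortcut, since every key would then point to the same list.

```python
COLUMN_NAMES = [
    'Judul Pelanggaran',
    'Isi Pelanggaran',
    'Nama KLPD',
    'Nama Satker',
    'Masa Berlaku Sanksi',
    'Tanggal Penayangan',
]

# One independent empty list per column
empty_columns = {name: [] for name in COLUMN_NAMES}

# Pitfall: dict.fromkeys reuses a single list object for every key
shared = dict.fromkeys(COLUMN_NAMES, [])
shared['Nama KLPD'].append('x')  # the value now shows up under all keys
```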

### 2 Get the data

In [6]:
# Rows on the page, one per blacklist record
lengthRows = dataCollection.find_elements_by_class_name('item')

for row in range(len(lengthRows)):
    # Violation title and description of the current record
    valVioHeader = dataCollection.find_elements_by_class_name('header')[row].text
    valVioContent = dataCollection.find_elements_by_class_name('description')[row].text
    # Text of every second cell (index 1, 3, 5, ...) in the nested table
    elemVal = dataCollection.find_elements_by_tag_name('tbody')[row].find_elements_by_tag_name('td')[1::2]
    valList = [elem.text for elem in elemVal]
    # Key-value
    dict_val = {
        'vio_header': valVioHeader,
        'vio_content': valVioContent,
        'sub_institution': valList[0],
        'sub_name': valList[1],
        'sub_expire': valList[2],
        'sub_show': valList[3]
    }
    # Append each value to its column; both dictionaries list their keys in the same order
    for key, value in zip(third_column, dict_val.values()):
        third_column[key].append(value)
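
The loop above pairs the record keys with the column names purely by position. A minimal sketch of a more explicit alternative is a mapping from record key to column name, so that reordering either dictionary cannot shift values into the wrong column. `FIELD_TO_COLUMN` and `append_record` are hypothetical names; the sample values are copied from the output shown further down in this notebook.

```python
# Hypothetical mapping from the short record keys to the output column names
FIELD_TO_COLUMN = {
    'vio_header': 'Judul Pelanggaran',
    'vio_content': 'Isi Pelanggaran',
    'sub_institution': 'Nama KLPD',
    'sub_name': 'Nama Satker',
    'sub_expire': 'Masa Berlaku Sanksi',
    'sub_show': 'Tanggal Penayangan',
}

def append_record(columns, record):
    """Append each field of `record` to the column list it is mapped to."""
    for field, column in FIELD_TO_COLUMN.items():
        columns[column].append(record[field])

columns = {column: [] for column in FIELD_TO_COLUMN.values()}
record = {
    'vio_header': 'Peraturan LKPP No. 17 Tahun 2018 Pasal 3 huruf g',
    'vio_content': 'Penyedia yang tidak melaksanakan kontrak, tidak menyelesaikan pekerjaan, atau dilakukan pemutusan kontrak secara sepihak oleh PPK yang disebabkan oleh kesalahan Penyedia Barang/Jasa',
    'sub_institution': 'Kementerian Keuangan',
    'sub_name': 'KANWIL DITJEN PERBENDAHARAAN PROVINSI KALIMANTAN TENGAH',
    'sub_expire': '10 Sep 2021 s/d 10 Sep 2022',
    'sub_show': '17 Sep 2021',
}
append_record(columns, record)
```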

In [7]:
# Decree number (SK Penetapan)
lawList = []
for element in dataCollection.find_elements_by_tag_name('td'):
    string = element.text
    # Keep only cells whose text matches 'No : <decree number>'
    match = re.match(pattern = r'No+ : \S+\d+$', string = string)
    if match is None:
        continue
    # Append to list
    lawList.append(match[0])
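
To see what the pattern keeps and what it skips without opening a browser, it can be tried on plain strings. The sketch below uses cell texts taken from the output of this notebook; `DECREE_PATTERN` and `extract_decree_numbers` are hypothetical names introduced for the example.

```python
import re

# Same pattern as in the cell above, written as a raw string
DECREE_PATTERN = re.compile(r'No+ : \S+\d+$')

def extract_decree_numbers(texts):
    """Return the matched text for every string that fits the decree-number pattern."""
    return [m[0] for m in map(DECREE_PATTERN.match, texts) if m]

cell_texts = [
    'Kementerian Keuangan',
    '17 Sep 2021',
    'No : KEP-001/WPB.18/KPA/2021',
]
decrees = extract_decree_numbers(cell_texts)
```

Only the last string starts with `No : ` and ends in a digit, so it is the only one collected.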

In [8]:
# Add the decree numbers to the dictionary
third_column['SK Penetapan'] = lawList
third_column

{'Judul Pelanggaran': ['Peraturan LKPP No. 17 Tahun 2018 Pasal 3 huruf g'],
 'Isi Pelanggaran': ['Penyedia yang tidak melaksanakan kontrak, tidak menyelesaikan pekerjaan, atau dilakukan pemutusan kontrak secara sepihak oleh PPK yang disebabkan oleh kesalahan Penyedia Barang/Jasa'],
 'Nama KLPD': ['Kementerian Keuangan'],
 'Nama Satker': ['KANWIL DITJEN PERBENDAHARAAN PROVINSI KALIMANTAN TENGAH'],
 'Masa Berlaku Sanksi': ['10 Sep 2021 s/d 10 Sep 2022'],
 'Tanggal Penayangan': ['17 Sep 2021'],
 'SK Penetapan': ['No : KEP-001/WPB.18/KPA/2021']}

## Convert into JSON

In [9]:
# Dictionary for data
current_data = '4537'
dict_full = {
    current_data: third_column
}
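
The key `'4537'` is the same value that appears after `#` in `detailed_link`. Assuming the fragment always carries the record ID, it could be read from the URL instead of being typed by hand. The helper name `record_id_from_url` is hypothetical.

```python
from urllib.parse import urlparse

def record_id_from_url(url):
    """Return the part of the URL after '#', which is assumed to be the record ID."""
    return urlparse(url).fragment

current_id = record_id_from_url('https://inaproc.id/daftar-hitam#4537')
```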

In [10]:
# Data
dict_full

{'4537': {'Judul Pelanggaran': ['Peraturan LKPP No. 17 Tahun 2018 Pasal 3 huruf g'],
  'Isi Pelanggaran': ['Penyedia yang tidak melaksanakan kontrak, tidak menyelesaikan pekerjaan, atau dilakukan pemutusan kontrak secara sepihak oleh PPK yang disebabkan oleh kesalahan Penyedia Barang/Jasa'],
  'Nama KLPD': ['Kementerian Keuangan'],
  'Nama Satker': ['KANWIL DITJEN PERBENDAHARAAN PROVINSI KALIMANTAN TENGAH'],
  'Masa Berlaku Sanksi': ['10 Sep 2021 s/d 10 Sep 2022'],
  'Tanggal Penayangan': ['17 Sep 2021'],
  'SK Penetapan': ['No : KEP-001/WPB.18/KPA/2021']}}
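
At this point the data is still a Python dictionary. A minimal sketch of the actual JSON step, using the standard `json` module and a shortened copy of the record above (the variable names are illustrative):

```python
import json

# Shortened copy of the record shown above
blacklist = {
    '4537': {
        'Nama KLPD': ['Kementerian Keuangan'],
        'Masa Berlaku Sanksi': ['10 Sep 2021 s/d 10 Sep 2022'],
        'SK Penetapan': ['No : KEP-001/WPB.18/KPA/2021'],
    }
}

# ensure_ascii=False keeps any non-ASCII characters readable in the output
json_text = json.dumps(blacklist, ensure_ascii=False, indent=2)

# Parsing the text again gives back an equal dictionary
restored = json.loads(json_text)
```

To save the result, `json_text` can be written to a file opened with `encoding='utf-8'`.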

## Convert into data frame

In [11]:
# Create a data frame from the record stored under the current ID
df = pd.DataFrame(data = dict_full[current_data])

In [12]:
print('Dimension: {} rows and {} columns'.format(len(df), len(df.columns)))
df.head()

Dimension: 1 rows and 7 columns


| | Judul Pelanggaran | Isi Pelanggaran | Nama KLPD | Nama Satker | Masa Berlaku Sanksi | Tanggal Penayangan | SK Penetapan |
|---|---|---|---|---|---|---|---|
| 0 | Peraturan LKPP No. 17 Tahun 2018 Pasal 3 huruf g | Penyedia yang tidak melaksanakan kontrak, tida... | Kementerian Keuangan | KANWIL DITJEN PERBENDAHARAAN PROVINSI KALIMANT... | 10 Sep 2021 s/d 10 Sep 2022 | 17 Sep 2021 | No : KEP-001/WPB.18/KPA/2021 |
