1. **Standard Library Imports:**
    - `argparse`: Used to write user-friendly command-line interfaces. The module defines a few functions and classes to help you create a parser for command-line options, arguments, and subcommands.
    - `concurrent.futures` (`ThreadPoolExecutor`, `wait`): Provides a high-level interface for running callables asynchronously. `ThreadPoolExecutor` runs tasks concurrently in a pool of worker threads, and `wait` blocks until a set of futures has finished.
    - `datetime.date`: Provides date objects for representing and formatting calendar dates.
    - `io`: A core utility for working with streams and input/output operations.
    - `os`: Provides a way of using operating system-dependent functionality like reading or writing to the filesystem.
    - `re`: Provides support for regular expressions, allowing for string searching and manipulation.
    - `time`: Provides time-related functions, such as getting the current time or causing a delay in execution with `sleep()`.
    - `urllib`: A package for working with URLs. This notebook uses `urllib.request` to download PDF files and `urllib.error` for the exceptions raised when a download fails.

2. **Third-party Imports:**
    - `fitz` (PyMuPDF): A library used for handling and manipulating PDF files, providing functionalities like extracting text, images, or other document manipulations.
    - `pandas as pd`: A powerful data manipulation and analysis library. It provides data structures and functions needed to work with structured data seamlessly.
    - `requests`: A simple, yet elegant, HTTP library for making HTTP requests.
    - `BeautifulSoup` from `bs4`: A library for parsing HTML and XML documents to extract data.
    - `pdfminer.high_level`: Part of the PDFMiner library, used for text extraction from PDF documents.

In [1]:
# %%
# Standard library imports
import argparse
from concurrent.futures import ThreadPoolExecutor, wait
from datetime import date
import io
import os
import re
import time
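import urllib.error    # URLError, caught in download_pdf
import urllib.request  # urlopen; a bare "import urllib" does not load the submodules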

# Third-party imports
import fitz  # PyMuPDF
import pandas as pd
import requests
from bs4 import BeautifulSoup
from pdfminer import high_level

This cell defines three small helpers for the output folders: `create_path()` joins the current working directory with a folder name, `create_directory_if_not_exists()` creates that path when it is missing, and `ensure_directory_exists()` combines the two and returns the resulting path.

In [2]:
# %%
CURRENT_DIRECTORY = os.getcwd()

def ensure_directory_exists(folder_name):

    path = create_path(folder_name)
    create_directory_if_not_exists(path)
    return path

def create_path(folder_name):
    return os.path.join(CURRENT_DIRECTORY, folder_name)

def create_directory_if_not_exists(path):
    # exist_ok avoids a FileExistsError when several threads create the folder at once
    os.makedirs(path, exist_ok=True)
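
The same helpers can also be written with `pathlib`. The following is a minimal sketch, not the notebook's own code: the function name `ensure_directory` and the temporary base folder are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path


def ensure_directory(base_directory, folder_name):
    """Create base_directory/folder_name if needed and return it as a Path."""
    path = Path(base_directory) / folder_name
    # parents=True creates missing parent folders; exist_ok=True makes the
    # call a no-op when the folder is already there.
    path.mkdir(parents=True, exist_ok=True)
    return path


# Usage: calling the helper twice is safe and yields the same path.
with tempfile.TemporaryDirectory() as base:
    first = ensure_directory(base, "putusan")
    second = ensure_directory(base, "putusan")
    created = first.is_dir()

print(first == second, created)
```

Because creation is idempotent, the helper can be called from every worker thread without a separate existence check.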

This cell defines `fetch_html_with_retries(url)`, which makes up to 3 attempts to fetch a URL with `requests`, waiting 15 seconds between attempts, and parses the response with `BeautifulSoup` using the "lxml" parser. If every attempt fails it returns `None`, so callers need to check the result before using it.

In [3]:
# %%

MAX_RETRIES = 3
RETRY_DELAY = 15  # delay in seconds


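def fetch_html_with_retries(url):
    """Return the parsed page, or None if every attempt fails."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            # A timeout keeps a stalled connection from blocking the worker
            # indefinitely (60 seconds is an assumed value)
            response = requests.get(url, timeout=60)
            response.raise_for_status()  # treat HTTP 4xx/5xx responses as failures
            return BeautifulSoup(response.text, "lxml")
        except requests.RequestException:
            if attempt < MAX_RETRIES:
                time.sleep(RETRY_DELAY)
    return None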
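
One alternative design is to re-raise the last error once the attempts are exhausted, so that a failed fetch cannot be mistaken for an empty page. The sketch below is illustrative only: `retry_call` and `flaky_fetch` are hypothetical names, and the network call is replaced by a stand-in so the example runs offline. In the notebook, the callable passed in would wrap the `requests.get` call.

```python
import time


def retry_call(fetch, max_retries=3, delay=15, exceptions=(Exception,), sleep=time.sleep):
    """Call fetch() up to max_retries times and return its first successful result.

    Re-raises the last exception when every attempt fails, so the caller
    never receives an implicit None.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except exceptions:
            if attempt == max_retries:
                raise  # out of attempts: surface the real error
            sleep(delay)  # wait only when another attempt will follow


# Usage with a stand-in for a network call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = retry_call(flaky_fetch, max_retries=3, delay=0, exceptions=(ConnectionError,))
print(result, calls["count"])
```

Passing the exception types and the sleep function as parameters keeps the helper independent of `requests` and easy to test without real delays.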

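This cell defines `extract_detail_from_soup(soup, keyword)`, which finds the first `<td>` tag whose text contains the keyword and returns the stripped text of the tag that follows it. Note that `find_next()` returns the next tag in document order, not the next sibling, so it yields the adjacent cell only when the matched `<td>` has no child tags of its own. The function returns an empty string when no matching tag is found.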

In [4]:
def extract_detail_from_soup(soup, keyword):
    try:
        keyword_tag = soup.find(lambda tag: tag.name == "td" and keyword in tag.text)
        if keyword_tag:
            next_tag = keyword_tag.find_next()
            detail_text = next_tag.get_text().strip()
            return detail_text
        else:
            return ""
    except AttributeError:  # This catches potential errors like 'NoneType' object has no attribute 'find_next'
        return ""

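This cell defines `download_pdf(file_url, output_directory)`, which downloads a PDF from a URL, saves it in the output directory under the filename the server reports, and returns the content as a `BytesIO` object together with that filename. On a `URLError` or `IOError` it prints the error and returns `(None, None)`. A response without a filename header is not handled: `get_filename()` then returns `None` and the `.replace()` call raises an `AttributeError`.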

In [5]:
# %%
def download_pdf(file_url, output_directory):
    try:
        # Open the URL
        response = urllib.request.urlopen(file_url)

        # Extract the filename from headers
        filename = response.info().get_filename().replace("/", " ")

        # Read file content
        pdf_content = response.read()

        # Construct full file path
        file_path = os.path.join(output_directory, filename)

        # Write file content to a file
        with open(file_path, "wb") as pdf_file:
            pdf_file.write(pdf_content)

        # Return the content in BytesIO and the filename
        return io.BytesIO(pdf_content), filename

    except (urllib.error.URLError, IOError) as e:
        print(f"Error occurred: {e}")
        return None, None
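
A filename fallback could cover responses that carry no filename header. This is a minimal sketch under stated assumptions: `safe_filename` and its `default` value are hypothetical, and the example URLs are placeholders rather than real court links.

```python
import posixpath
from urllib.parse import unquote, urlparse


def safe_filename(header_filename, file_url, default="document.pdf"):
    """Pick a filesystem-safe name for a download.

    header_filename is the value of response.info().get_filename(), which is
    None when the server sends no Content-Disposition filename.
    """
    name = header_filename
    if not name:
        # Fall back to the last path segment of the URL.
        name = posixpath.basename(unquote(urlparse(file_url).path))
    if not name:
        name = default
    # Path separators would otherwise be interpreted as sub-directories.
    return name.replace("/", " ").replace("\\", " ")


print(safe_filename("putusan 12/Pdt.G/2024.pdf", "https://example.org/pdf/abc"))
print(safe_filename(None, "https://example.org/pdf/abc%20def.pdf"))
print(safe_filename(None, "https://example.org/"))
```

The helper only decides on a name; downloading and writing the file would stay in `download_pdf`.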

The function `clean_text(text)` strips boilerplate from the extracted PDF text. It collapses newlines and repeated spaces into single spaces, removes page markers such as "Halaman 3 dari 10", and deletes the watermark and disclaimer text that the Mahkamah Agung site stamps on every page.

In [6]:
# %%
def clean_text(text):

    UNWANTED_TEXTS = [
    "M a h ka m a h A g u n g R e p u blik In d o n esia\n",
    "Disclaimer\n",
    "Kepaniteraan Mahkamah Agung Republik Indonesia berusaha untuk selalu mencantumkan informasi paling kini dan akurat sebagai bentuk komitmen Mahkamah Agung untuk pelayanan publik, transparansi dan akuntabilitas\n",
    "pelaksanaan fungsi peradilan. Namun dalam hal-hal tertentu masih dimungkinkan terjadi permasalahan teknis terkait dengan akurasi dan keterkinian informasi yang kami sajikan, hal mana akan terus kami perbaiki dari waktu kewaktu.\n",
    "Dalam hal Anda menemukan inakurasi informasi yang termuat pada situs ini atau informasi yang seharusnya ada, namun belum tersedia, maka harap segera hubungi Kepaniteraan Mahkamah Agung RI melalui :\n",
    "Email : kepaniteraan@mahkamahagung.go.id    Telp : 021-384 3348 (ext.318)\n",
    "Direktori Putusan Mahkamah Agung Republik Indonesia",
    "putusan.mahkamahagung.go.id",
    "hkama ahkamah Agung Repub ahkamah Agung Republik Indonesia mah Agung Republik Indonesia blik Indonesi",
    "Disclaimer Kepaniteraan Mahkamah Agung Republik Indonesia berusaha untuk selalu mencantumkan informasi paling kini dan akurat sebagai bentuk komitmen Mahkamah Agung untuk pelayanan publik, transparansi dan akuntabilitas pelaksanaan fungsi peradilan. Namun dalam hal-hal tertentu masih dimungkinkan terjadi permasalahan teknis terkait dengan akurasi dan keterkinian informasi yang kami sajikan, hal mana akan terus kami perbaiki dari waktu kewaktu. Dalam hal Anda menemukan inakurasi informasi yang termuat pada situs ini atau informasi yang seharusnya ada, namun belum tersedia, maka harap segera hubungi Kepaniteraan Mahkamah Agung RI melalui : Email : kepaniteraan@mahkamahagung.go.id Telp : 021-384 3348 (ext.318)",
    ]

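    # First pass: the newline-terminated entries can only match before
    # whitespace is collapsed
    for unwanted_text in UNWANTED_TEXTS:
        text = text.replace(unwanted_text, "")

    # Collapse newlines and runs of whitespace into single spaces
    text = ' '.join(text.split())

    # Remove page footers such as 'Halaman | 12' (text is single-spaced by now)
    text = re.sub(r'Halaman \| \d+ ?', '', text)

    # Remove 'Halaman {number} dari {number}' patterns
    text = re.sub(r'Halaman \d+ dari \d+', '', text)

    # Replace any remaining 'Halaman {number}' marker with a newline
    text = re.sub(r'Halaman \d+', '\n', text)

    # Second pass: the single-line variants match only after normalization
    for unwanted_text in UNWANTED_TEXTS:
        text = text.replace(unwanted_text, "")
    return text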
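
The order of the two page-marker substitutions matters. The sketch below isolates that step on an invented sample string; `strip_page_markers` is a hypothetical helper, not part of the notebook.

```python
import re


def strip_page_markers(text):
    """Collapse whitespace, then remove 'Halaman N dari M' and 'Halaman N' markers."""
    # str.split() with no argument splits on any whitespace run, including newlines.
    text = " ".join(text.split())
    # The longer pattern must run first; otherwise 'Halaman 3' would be consumed
    # and a stray 'dari 10' would be left behind.
    text = re.sub(r"Halaman \d+ dari \d+", "", text)
    text = re.sub(r"Halaman \d+", "\n", text)
    return text


sample = "Menimbang,   bahwa\nHalaman 3 dari 10 penggugat\n\nHalaman 4 hadir"
cleaned = strip_page_markers(sample)
print(repr(cleaned))
```

Removing a marker leaves the spaces that surrounded it, so a final whitespace pass may be worth adding if exact spacing matters downstream.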

This cell defines `extract_and_clean_text(pdf_path)`, which opens a PDF with PyMuPDF (`fitz`), concatenates the text of every page, and passes the result through `clean_text()`.

In [7]:
def extract_and_clean_text(pdf_path):
    """Extracts and cleans text from a PDF file."""
    # fitz is already imported at the top of the notebook
    with fitz.open(pdf_path) as document:
        # Join the page texts in one step instead of repeated string concatenation
        text_content = "".join(page.get_text() for page in document)

    return clean_text(text_content)

This cell defines `extract_data`, which scrapes the metadata of one decision page, together with two helpers. `process_pdf` downloads the PDF linked from the page and extracts its text; `save_results` appends the metadata as a row to `Logging.csv` and writes the extracted text to a `.txt` file.

In [8]:
# %%
PDF_AVAILABLE_TEXT = "Ada PDF"
PDF_NOT_AVAILABLE_TEXT = "Tidak ada PDF"
LOGGING_FILE_NAME = "Logging.csv"

def extract_data(url, keyword_url):
    # Extract the decision details, then save the metadata to CSV and the PDF text to a .txt file
    path_output = ensure_directory_exists("putusan")  # create the folder if it is missing
    path_pdf = ensure_directory_exists("pdf-putusan")
    current_date = date.today().strftime("%Y-%m-%d")
    soup = fetch_html_with_retries(url)
    table = soup.find("table", {"class": "table"})
    judul = table.find("h2").text.strip()
    tahun = extract_detail_from_soup(table, "Tahun")
    tanggal_register = extract_detail_from_soup(table, "Tanggal Register")
    kaidah = extract_detail_from_soup(table, "Kaidah")

    link_pdf, text_pdf, file_name_pdf, has_pdf = process_pdf(soup, path_pdf)

    data = {
        "judul": judul,
        "tanggal_register": tanggal_register,
        "tahun": tahun,
        "kaidah": kaidah,
        "link": url,
        "link_pdf": link_pdf,
        "pdf_name": file_name_pdf,
        "has_pdf": has_pdf,
    }

    save_results(path_output, keyword_url, current_date, data, file_name_pdf, text_pdf)

def process_pdf(soup, path_pdf):
    try:
        link_pdf = soup.find("a", href=re.compile(r"/pdf/"))["href"]
        file_pdf, file_name_pdf = download_pdf(link_pdf, path_pdf)
        pdf_file_path = os.path.join(path_pdf, file_name_pdf)
        text_pdf = extract_and_clean_text(pdf_file_path)  # already runs clean_text()
        has_pdf = PDF_AVAILABLE_TEXT
    except Exception:  # no PDF link, failed download, or unreadable file
        link_pdf = ""
        text_pdf = ""
        file_name_pdf = ""
        has_pdf = PDF_NOT_AVAILABLE_TEXT
    return link_pdf, text_pdf, file_name_pdf, has_pdf

def save_results(path_output, keyword_url, current_date, data, file_name_pdf, text_pdf):
    result = pd.DataFrame([data])
    keyword_url = keyword_url.replace("/", " ")

    destination_csv = os.path.join(path_output, LOGGING_FILE_NAME)
    if not os.path.isfile(destination_csv):
        result.to_csv(destination_csv, header=True, index=False)
    else:
        result.to_csv(destination_csv, mode="a", header=False, index=False)

    destination_txt = os.path.join(path_output, f"{file_name_pdf}_{current_date}.txt")
    with open(destination_txt, 'w', encoding='utf-8') as file:
        file.write(text_pdf)
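
Because `save_results` is called from several worker threads, two workers can both find `Logging.csv` missing and both write a header row. One way to guard against this is to serialize the check and the write with a lock. The sketch below is an assumption-laden illustration: it swaps pandas for the standard-library `csv` module so that it runs on its own, and `append_row` and `_csv_lock` are hypothetical names.

```python
import csv
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

_csv_lock = threading.Lock()  # hypothetical module-level lock shared by all workers


def append_row(csv_path, row):
    """Append one dict as a CSV row, writing the header only for a new file."""
    with _csv_lock:  # serialize the existence check and the write
        is_new_file = not os.path.isfile(csv_path)
        with open(csv_path, "a", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(row))
            if is_new_file:
                writer.writeheader()
            writer.writerow(row)


# Usage: eight rows written from four threads end up under a single header.
with tempfile.TemporaryDirectory() as folder:
    log_path = os.path.join(folder, "Logging.csv")
    rows = [{"judul": f"Putusan {i}", "has_pdf": "Ada PDF"} for i in range(8)]
    with ThreadPoolExecutor(max_workers=4) as executor:
        list(executor.map(lambda r: append_row(log_path, r), rows))
    with open(log_path, newline="", encoding="utf-8") as handle:
        lines = handle.read().splitlines()

print(len(lines), lines[0])
```

The same lock could wrap the two `to_csv` branches in `save_results` if pandas is kept.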

This code defines three functions:

1. `build_search_link(base_url, page_number, sort)`: Constructs a search URL based on input parameters, including sorting details.

2. `run_process(base_url, page_number, sort_results)`: Utilizes the `build_search_link` function to get a search URL, fetches HTML content, finds decision links, and calls `extract_data` on them.

3. `run_scraper(keyword=None, url=None, sort_by_date=True, download_pdfs=True)`: The entry point. It takes a keyword or a search URL, reads the last page number from the pagination links, and submits one `run_process` task per results page to a `ThreadPoolExecutor` with 4 worker threads. The `download_pdfs` parameter is accepted but is not used by the code in this cell.

Together, these functions scrape every results page of a query concurrently, which shortens the run because most of the time is spent waiting on the network.
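
When a keyword is interpolated into the query string as-is, spaces or characters such as `&` change the meaning of the URL. The sketch below shows one way to encode the keyword with `urllib.parse.urlencode`; `build_keyword_search_link` and `SEARCH_ENDPOINT` are hypothetical names, and only the keyword case is covered, not the case where a full search URL is passed in.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://putusan3.mahkamahagung.go.id/search.html"


def build_keyword_search_link(keyword, page_number, sort=True):
    """Build a search URL for a free-text keyword, encoding it safely."""
    params = {"q": keyword, "page": page_number}
    if sort:
        # The same sort parameters the notebook appends by hand.
        params.update({"obf": "TANGGAL_PUTUS", "obm": "desc"})
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


print(build_keyword_search_link("wanprestasi & ganti rugi", 2))
```

`urlencode` turns spaces into `+` and `&` into `%26`, so the keyword stays a single query value.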

In [9]:
def build_search_link(base_url, page_number, sort):
    """Build the search link based on the given parameters."""
    search_link = f"{base_url}&page={page_number}" if base_url.startswith(
        "https") else f"https://putusan3.mahkamahagung.go.id/search.html?q={base_url}&page={page_number}"
    if sort:
        search_link += "&obf=TANGGAL_PUTUS&obm=desc"
    return search_link


def run_process(base_url, page_number, sort_results):
    search_link = build_search_link(base_url, page_number, sort_results)
    soup = fetch_html_with_retries(search_link)
    decision_links = soup.find_all("a", {"href": re.compile("/direktori/putusan")})
    for decision_link in decision_links:
        extract_data(decision_link["href"], base_url)


def run_scraper(keyword=None, url=None, sort_by_date=True, download_pdfs=True):
    """Main scraping function, accepts keyword or URL and sorting preferences."""
    if not keyword and not url:
        print("Please provide a keyword or URL")
        return

    # Create the output folders once, before the worker threads start writing to them
    ensure_directory_exists("putusan")
    ensure_directory_exists("pdf-putusan")
    search_link = url if url else f"https://putusan3.mahkamahagung.go.id/search.html?q={keyword}&page=1"

    soup = fetch_html_with_retries(search_link)
    last_page_number = int(soup.find_all("a", {"class": "page-link"})[-1]["data-ci-pagination-page"])

    base_url = url or keyword
    print(
        f"Scraping with {'url' if url else 'keyword'}: {base_url} - {20 * last_page_number} data - {last_page_number} page")

    futures = []
    with ThreadPoolExecutor(max_workers=4) as executor:
        for page_number in range(1, last_page_number + 1):
            futures.append(executor.submit(run_process, base_url, page_number, sort_by_date))
    wait(futures)
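
An exception raised inside a submitted task stays stored in its future until `result()` is called, so a page that fails can go unnoticed. The sketch below collects results with `as_completed` and records which pages failed. `scrape_page` is a hypothetical stand-in for `run_process` that simulates a failure on page 3, and `run_pages` is likewise an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_page(page_number):
    """Hypothetical stand-in for run_process; page 3 simulates a failure."""
    if page_number == 3:
        raise TimeoutError(f"page {page_number} timed out")
    return page_number


def run_pages(last_page_number, max_workers=4):
    """Run one task per page and report which pages succeeded or failed."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_page = {
            executor.submit(scrape_page, page): page
            for page in range(1, last_page_number + 1)
        }
        for future in as_completed(future_to_page):
            page = future_to_page[future]
            try:
                succeeded.append(future.result())  # re-raises the worker's exception
            except Exception as error:
                failed[page] = str(error)
    return sorted(succeeded), failed


succeeded, failed = run_pages(5)
print(succeeded, failed)
```

The list of failed pages can then be retried or logged instead of being silently dropped.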

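This cell calls `run_scraper` with a search URL for 2024 decisions (`t_put=2024`) of PN Makassar, a Kelas 1A/Kelas 1A Khusus court, in the Perdata Wanprestasi (civil breach of contract) category. The output shows that the site reported 31 result pages, that a connection timed out during the run, and that the run was then interrupted from the keyboard.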

In [10]:
# Download Kelas 1A/Kelas 1A Khusus PN Makassar (Perdata Wanprestasi)
run_scraper(url="https://putusan3.mahkamahagung.go.id/search.html?q=&jenis_doc=putusan&cat=3c40e48bbab311301a21c445b3c7fe57&jd=&tp=0&court=098111PN340+++++++++++++++++++++&t_put=2024&t_reg=&t_upl=&t_pr=")

Scraping with url: https://putusan3.mahkamahagung.go.id/search.html?q=&jenis_doc=putusan&cat=3c40e48bbab311301a21c445b3c7fe57&jd=&tp=0&court=098111PN340+++++++++++++++++++++&t_put=2024&t_reg=&t_upl=&t_pr= - 620 data - 31 page
Error occurred: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond


KeyboardInterrupt: 