This project automatically summarizes online news articles about Perum DAMRI collected from a range of sources.
The pipeline covers data scraping, preprocessing, MiniLM-based embedding, and summary generation using two approaches:

* Semantic Similarity (MiniLM)
* TF-IDF Extractive Summarization

# Data Preprocessing

## Data Loading and Initial Setup

In [1]:
import pandas as pd
from time import sleep
import random

In [2]:
input_csv = 'data/data_raw/Kelompok1_Link Berita_DAMRI - Data.csv'
output_csv = 'data/data_processed/scraped_articles.csv'

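df_urls = pd.read_csv(input_csv)
# Drop rows without a link first so URLs and sources stay aligned row by row
df_urls = df_urls.dropna(subset=['link'])
urls = df_urls['link'].astype(str).tolist()
sources = df_urls['sumber'].fillna('').astype(str).tolist()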

In [3]:
import re
from bs4 import BeautifulSoup

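def clean_text(html_text):
    """Strip HTML markup and common site boilerplate words from a text."""
    text = BeautifulSoup(html_text, "html.parser").get_text(separator=" ")
    text = re.sub(r'\s+', ' ', text)
    # Note: the pattern has no word boundaries, so it also matches inside longer words
    text = re.sub(r'Cookies|Setuju|Kebijakan Privasi|Iklan|Advertisement|ADVERTISEMENT|Copyright', '', text, flags=re.IGNORECASE)
    return text.strip()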
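
Because the boilerplate pattern in `clean_text` is matched case-insensitively and without word boundaries, it can also cut into ordinary words (for example, removing `setuju` from `disetujui` leaves `dii`). The following is a minimal sketch of a stricter variant, not the notebook's method: the helper name `strip_boilerplate` and the constant `BOILERPLATE` are hypothetical, and it uses only the standard `re` module.

```python
import re

# Same phrases as in clean_text, but anchored on word boundaries (assumed variant)
BOILERPLATE = r"\b(?:Cookies|Setuju|Kebijakan Privasi|Iklan|Advertisement|Copyright)\b"

def strip_boilerplate(text: str) -> str:
    """Remove whole boilerplate words only, then collapse leftover whitespace."""
    text = re.sub(BOILERPLATE, " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

# "disetujui" contains "setuju" but is left intact because of the word boundaries
print(strip_boilerplate("Advertisement Rute baru disetujui pemerintah"))
```

Collapsing whitespace after the removal, rather than before it, also avoids the double spaces that remain where a phrase was deleted.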


## Scraping News Articles with Trafilatura

In [None]:
import requests
import trafilatura

def scrape_content(url):
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 200:
            return {
                "url": url,
                "title": None,
                "content": None,
                "error": f"Status code: {response.status_code}"
            }
            
        # Extract the main article text from the raw HTML
        downloaded = trafilatura.extract(
            response.text,
            include_comments=False,
            include_links=False,
            include_tables=False,
            deduplicate=True,
        )

        if not downloaded or not downloaded.strip():
            return {
                "url": url,
                "title": None,
                "content": None,
                "error": "Empty content"
            }

        metadata = trafilatura.extract_metadata(response.text)
        title = metadata.title if metadata and metadata.title else None

        return {
            "url": url,
            "title": title,
            "content": downloaded.strip(),
            "error": None
        }

    except Exception as e:
        return {
            "url": url,
            "title": None,
            "content": None,
            "error": str(e)
        }


## Batch Scraping

In [5]:
results = []
for i, (url, sumber) in enumerate(zip(urls, sources), start=1):
    print(f"[{i}/{len(urls)}] Scraping: {url}")
    data = scrape_content(url)
    data['sumber'] = sumber
    results.append(data)
    sleep(random.uniform(1, 2))

[1/154] Scraping: https://jogjapolitan.harianjogja.com/read/2025/09/20/510/1228835/jadwal-damri-ke-bandara-yia-hari-ini-jogja-purworejo-kebumen
[2/154] Scraping: https://travel.detik.com/travel-news/d-7871241/jadwal-damri-bandara-soekarno-hatta-2025-rute-dan-tarifnya
[3/154] Scraping: https://www.kabarbumn.com/rilis-bumn/116454847/mulai-besok-damri-buka-rute-tanjung-baratbandara-soetta-tarifnya-terjangkau
[4/154] Scraping: https://www.beritatrans.com/artikel/254369/sebanyak-940-ribu-masyarakat-terbantu-mobilisasi-ke-bandara-soekarno-hatta-naik-damri/
[5/154] Scraping: https://metrobanten.co.id/damri-buka-rute-tanjung-barat-bandara-soetta-tarifnya-terjangkau/
[6/154] Scraping: https://www.kompas.tv/info-publik/615445/rute-dan-harga-tiket-damri-ke-bandara-soekarno-hatta-dari-jakarta-mulai-rp60-000
[7/154] Scraping: https://ekonomi.bisnis.com/read/20250129/98/1835381/jadwal-rute-dan-tarif-damri-bandara-soekarno-hatta-2025
[8/154] Scraping: https://www.tempo.co/ekonomi/damri-beri-diskon-ti

In [6]:
df_result = pd.DataFrame(results)
df_result.to_csv(output_csv, index=False, encoding='utf-8-sig')

In [7]:
print(f"Scraping finished! Results saved to: {output_csv}")
print(f"Articles scraped successfully: {df_result['content'].notna().sum()} of {len(df_result)}")


Scraping finished! Results saved to: data/data_processed/scraped_articles.csv
Articles scraped successfully: 125 of 154
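
The count above says how many articles failed but not why. As a rough sketch, the failures could be grouped by their `error` message. The helper `summarize_errors` and the sample rows below are hypothetical; the rows only mimic the shape of the dictionaries returned by `scrape_content`.

```python
from collections import Counter

def summarize_errors(results):
    """Count scrape outcomes by error message; successful rows are counted as 'ok'."""
    return Counter(r["error"] if r["error"] is not None else "ok" for r in results)

# Hypothetical sample shaped like the dicts returned by scrape_content
sample_results = [
    {"url": "https://example.com/a", "content": "text", "error": None},
    {"url": "https://example.com/b", "content": None, "error": "Status code: 403"},
    {"url": "https://example.com/c", "content": None, "error": "Status code: 403"},
    {"url": "https://example.com/d", "content": None, "error": "Status code: 404"},
]

for reason, count in summarize_errors(sample_results).most_common():
    print(f"{count:>3}  {reason}")
```

In the notebook the same function could be called on `results` directly, which would show whether the failures are mostly blocked requests, missing pages, or empty extractions.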


In [8]:
df_ok = df_result.loc[df_result["error"].isna()]
df_ok["sumber"].unique().tolist()

['Harian Jogja',
 'DetikTravel',
 'beritatrans.com',
 'metrobanten.co.id',
 'Kompas TV',
 'ekonomi.bisnis.com',
 'tempo.co',
 'bisnisnews.id',
 'detiktravel',
 'antaranews.com',
 'detik.news',
 'detik.com',
 'kompas.tv',
 'kompas.com',
 'harianterbit.com',
 'garuda.tv',
 'otodriver.com',
 'bisnis.com',
 'pasbana.com',
 'dutatv.com',
 'mediaindonesia.com',
 'beritasatu.com',
 'idntimes.com',
 'rentak.id',
 'kabarsdgs.com',
 'sekitarbandung.com',
 'kanalkalimantan.com',
 'tempo.com',
 'liputan6.com',
 'bantenraya.com',
 'tangerangraya.id',
 'antranews.com',
 'suarabahana.com',
 'explorebromo.com',
 'travel.detik.com',
 'infopublik.id',
 'traveloka.com',
 'jogjapolitan.harianjogja.com',
 'jatengprov.go.id',
 'rri.co.id',
 'damri.co.id',
 'mojok.com',
 'wartakita.org',
 'news.republika.co.id',
 'jogya.com',
 'indtimes.com',
 'beritakini.co.id',
 'tintapena.id',
 'denpasar.finansialinsight.com',
 'grandwisatabekasi.com',
 'pontianak.beritaenergi.id',
 'beritaterkini.co.id',
 'hariankepri.co

In [9]:
df_error = df_result.loc[df_result["error"].notna()]
df_error["sumber"].unique().tolist()

['kabarbumn.com',
 'kabarbandung.com',
 'bantenraya.com',
 'banyuwangikab.go.id',
 'rri.co.id',
 'medan.insiderindonesia.com',
 'kabar bumn',
 'glints',
 'radio prfm',
 'insider indonesia',
 'kota wisata']
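
The source labels in both lists are entered by hand and are not consistent (for example `DetikTravel` and `detiktravel`, or `tempo.co` and `tempo.com`). One possible way to get a uniform label is to derive it from the URL's host name instead. This is a sketch under that assumption; `source_from_url` is a hypothetical helper and uses only the standard library.

```python
from urllib.parse import urlparse

def source_from_url(url: str) -> str:
    """Return the lower-cased host name of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

print(source_from_url("https://www.kompas.tv/info-publik/615445/rute-dan-harga-tiket-damri-ke-bandara-soekarno-hatta-dari-jakarta-mulai-rp60-000"))
print(source_from_url("https://travel.detik.com/travel-news/d-7871241/jadwal-damri-bandara-soekarno-hatta-2025-rute-dan-tarifnya"))
```

Subdomains such as `travel.detik.com` are kept as they are; merging them into a single publisher label would need an extra mapping that is not shown here.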

In [10]:
df_ok = df_result.loc[df_result["error"].isna()].copy()
df_ok["content_cleaned"] = df_ok["content"].apply(clean_text)

## Text Cleaning and Tokenization

In [11]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

df_ok = df_ok.copy()

# sent_tokenize uses NLTK's English Punkt model by default; non-string values become empty lists
df_ok["content_cleaned_tokenize"] = df_ok["content_cleaned"].apply(
    lambda text: sent_tokenize(text) if isinstance(text, str) else []
)

[nltk_data] Downloading package punkt to /home/aliffatur/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/aliffatur/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [12]:
df_ok.content_cleaned_tokenize[1]

['- Jadwal Damri Bandara Soekarno Hatta 2025 1.',
 'Stasiun DAMRI Ciputat 2.',
 'Stasiun KA Gambir 3.',
 'Stasiun KCIC Halim 4.',
 'Stasiun DAMRI Kemayoran 5.',
 'Terminal Blok M 6.',
 'Terminal Lebak Bulus 7.',
 'Terminal Pasar Minggu 8.',
 'Terminal Rawamangun 9.',
 'Terminal Kampung Rambutan 10.',
 'Terminal Tanjung Priok 11.',
 'Stasiun DAMRI Merak 12.',
 'Botani Square Bogor 13.',
 'Cibinong City Mall 14.',
 'Mall Kelapa Gading 15.',
 'Intermark BSD 16.',
 'Hollywood Junction Cikarang 17.',
 'Grand Taruma Karawang 18.',
 'Terminal Kayuringin Bekasi Barat 20.',
 'Stasiun DAMRI Purwakarta 21.',
 'Stasiun DAMRI Sukabumi 22.',
 'Transpark Mall Bintaro 23.',
 'Terminal Depok Sawangan 24.',
 'Khalifah Station 25.',
 'Terminal Pulo Gebang 26.',
 'Senayan Park - Cara Pesan Tiket DAMRI Bus DAMRI merupakan Perusahaan Umum Djawatan Angkoetan Motor Repoeblik Indonesia (Perum DAMRI) milik Badan Usaha Milik Negara (BUMN).',
 'DAMRI menyediakan layanan angkutan penumpang, termasuk ke Bandara Soe
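
In the output above, the tokenizer splits after each list number, so a marker such as `2.` ends up at the end of the previous item instead of at the start of its own. A minimal sketch of a post-processing step that moves such trailing markers to the following sentence is shown below. The helper `reattach_enumerators` is hypothetical, and the heuristic assumes that a one- or two-digit number followed by a period at the end of a sentence is a list marker, which will not always be true.

```python
import re

# A one- or two-digit list marker such as "7." left at the end of a sentence
TRAILING_MARKER = re.compile(r"\s(\d{1,2}\.)$")

def reattach_enumerators(sentences):
    """Move trailing list markers to the start of the next sentence."""
    fixed, carry = [], ""
    for sentence in sentences:
        if carry:
            sentence = f"{carry} {sentence}"
        match = TRAILING_MARKER.search(sentence)
        if match:
            carry = match.group(1)
            sentence = sentence[:match.start()]
        else:
            carry = ""
        if sentence.strip():
            fixed.append(sentence.strip())
    if carry:
        # A marker with no following sentence is kept as its own item
        fixed.append(carry)
    return fixed

tokens = [
    "- Jadwal Damri Bandara Soekarno Hatta 2025 1.",
    "Stasiun DAMRI Ciputat 2.",
    "Stasiun KA Gambir 3.",
]
for item in reattach_enumerators(tokens):
    print(item)
```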

# Semantic Embedding with MiniLM

In [15]:
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

df_ok = df_ok.copy()
df_ok["text_for_embedding"] = df_ok["content_cleaned_tokenize"].apply(
    lambda sentences: " ".join(sentences) if isinstance(sentences, list) else str(sentences)
)

texts = df_ok["text_for_embedding"].tolist()
embeddings = model.encode(texts, convert_to_tensor=True, batch_size=16, show_progress_bar=True)

df_ok["embeddings"] = [emb for emb in embeddings]
df_ok[["sumber", "title", "embeddings"]].head()


Batches: 100%|██████████| 8/8 [00:00<00:00, 26.38it/s]


|   | sumber | title | embeddings |
|---|---|---|---|
| 0 | Harian Jogja | Jadwal DAMRI ke Bandara YIA Hari Ini, Jogja-Pu... | [tensor(-0.0294, device='cuda:0'), tensor(0.03... |
| 1 | DetikTravel | Jadwal DAMRI Bandara Soekarno Hatta 2025, Rute... | [tensor(0.0703, device='cuda:0'), tensor(0.027... |
| 3 | beritatrans.com | Sebanyak 940 Ribu Masyarakat Terbantu Mobilisa... | [tensor(0.0441, device='cuda:0'), tensor(0.046... |
| 4 | metrobanten.co.id | DAMRI Buka Rute Tanjung Barat–Bandara Soetta, ... | [tensor(-0.0207, device='cuda:0'), tensor(0.02... |
| 5 | Kompas TV | Rute dan Harga Tiket DAMRI ke Bandara Soekarno... | [tensor(-0.0157, device='cuda:0'), tensor(0.02... |


In [17]:
import numpy as np

sample_sentences = df_ok.iloc[0]["content_cleaned_tokenize"]
sentence_embeddings = model.encode(sample_sentences, convert_to_tensor=True)
print(sentence_embeddings)

tensor([[-0.0719,  0.0796,  0.0313,  ...,  0.0080, -0.0519, -0.0028],
        [ 0.0158,  0.0976, -0.0229,  ..., -0.0204,  0.0605, -0.0261],
        [ 0.0089,  0.0307, -0.0028,  ...,  0.0504,  0.0027,  0.0514],
        ...,
        [-0.0429,  0.0937, -0.0648,  ...,  0.0552, -0.0019, -0.0125],
        [ 0.0073,  0.0707,  0.0308,  ..., -0.0601, -0.0501,  0.0490],
        [-0.0261,  0.0899,  0.0012,  ...,  0.0027, -0.0796, -0.0312]],
       device='cuda:0')


In [20]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(sentence_embeddings.cpu().numpy())
print(similarity_matrix)

sentence_scores = similarity_matrix.mean(axis=1)


[[0.99999964 0.5365598  0.6917523  0.64707035 0.5630669  0.5037866
  0.5624476  0.5276336  0.47442123 0.630592  ]
 [0.5365598  1.         0.44861278 0.52361465 0.42342895 0.29007778
  0.32909    0.58526295 0.32921624 0.44610956]
 [0.6917523  0.44861278 1.         0.7002914  0.5852872  0.42488238
  0.46290728 0.5158551  0.4895502  0.608407  ]
 [0.64707035 0.52361465 0.7002914  0.9999999  0.53840744 0.41545653
  0.55981517 0.6064916  0.44713098 0.5840802 ]
 [0.5630669  0.42342895 0.5852872  0.53840744 1.0000005  0.43679118
  0.44893947 0.54449606 0.4467555  0.576164  ]
 [0.5037866  0.29007778 0.42488238 0.41545653 0.43679118 1.
  0.7443645  0.4021709  0.37432447 0.55190986]
 [0.5624476  0.32909    0.46290728 0.55981517 0.44893947 0.7443645
  1.         0.3837429  0.41487208 0.58231276]
 [0.5276336  0.58526295 0.5158551  0.6064916  0.54449606 0.4021709
  0.3837429  0.9999999  0.35750014 0.5033487 ]
 [0.47442123 0.32921624 0.4895502  0.44713098 0.4467555  0.37432447
  0.41487208 0.35750014

In [None]:
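top_n = 3
# Indices of the top_n sentences with the highest mean similarity
top_sentence_indices = np.argsort(sentence_scores)[-top_n:]

# Restore the original sentence order
top_sentence_indices.sort()

summary = " ".join([sample_sentences[i] for i in top_sentence_indices])
print(summary)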


Jadwal DAMRI ke Bandara YIA Hari Ini, Jogja-Purworejo-Kebumen  Harianjogja.com, JOGJA—Bus DAMRI adalah pilihan tepat untuk Anda yang ingin menuju ke Bandara YIA. Layanan transportasi DAMRI ini bisa Anda manfaatkan jika ingin menuju Bandara YIA. Desa Wisata Adat Osing Kemiren Banyuwangi Masuk Jaringan Terbaik Dunia  Berita Populer - Tarif Rp70.000, Ini Jadwal Bus DAMRI Jogja-Semarang PP - Pengolahan Sampah Menjadi Energi Listrik Dibutuhkan Untuk Atasi TPA - Ribuan Titik Jalan di Bantul Masih Gelap Rawan Kecelakaan - Buruh Jogja Beri Rapor Merah Setahun Kinerja Prabowo-Gibran - Jalur dan Rute Trans Jogja ke Prambanan, Goden, hingga Bantul
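
The selection step above scores each sentence by its mean cosine similarity to the other sentences and keeps the highest-scoring ones in document order. The toy example below illustrates that idea with hand-made 2-D vectors in place of MiniLM embeddings. It is a sketch in plain Python, not the notebook's code: `cosine` and `top_central_indices` are hypothetical helpers, and the self-similarity is left out of the mean, which does not change the ranking.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_central_indices(vectors, top_n=3):
    """Indices of the top_n vectors with the highest mean similarity to the others."""
    n = len(vectors)
    scores = []
    for i in range(n):
        others = [cosine(vectors[i], vectors[j]) for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return sorted(ranked)  # back to document order

# Toy vectors standing in for sentence embeddings
toy = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
print(top_central_indices(toy, top_n=2))
```

The middle vector lies between the other two, so it is the most central one. The same effect explains why generic sentences that resemble many others tend to be selected.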


# Extractive Summarization Based on Semantic Similarity

In [22]:
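def summarize_text(sentences, model, top_n=3):
    """Return the top_n most central sentences, joined in their original order."""
    if not sentences:
        return ""
    embeddings = model.encode(sentences, convert_to_tensor=True)
    sim = cosine_similarity(embeddings.cpu().numpy())
    # Score each sentence by its mean cosine similarity to all sentences
    scores = sim.mean(axis=1)
    top_idx = np.argsort(scores)[-top_n:]
    top_idx.sort()  # restore document order
    return " ".join([sentences[i] for i in top_idx])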

df_ok["summary"] = df_ok["content_cleaned_tokenize"].apply(
    lambda sents: summarize_text(sents, model, top_n=3)
)


In [23]:
df_ok.head()

|   | url | title | content | error | sumber | content_cleaned | content_cleaned_tokenize | text_for_embedding | embeddings | summary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://jogjapolitan.harianjogja.com/read/2025... | Jadwal DAMRI ke Bandara YIA Hari Ini, Jogja-Pu... | Advertisement\nJadwal DAMRI ke Bandara YIA Har... |  | Harian Jogja | Jadwal DAMRI ke Bandara YIA Hari Ini, Jogja-Pu... | [Jadwal DAMRI ke Bandara YIA Hari Ini, Jogja-P... | Jadwal DAMRI ke Bandara YIA Hari Ini, Jogja-Pu... | [tensor(-0.0294, device='cuda:0'), tensor(0.03... | Jadwal DAMRI ke Bandara YIA Hari Ini, Jogja-Pu... |
| 1 | https://travel.detik.com/travel-news/d-7871241... | Jadwal DAMRI Bandara Soekarno Hatta 2025, Rute... | - Jadwal Damri Bandara Soekarno Hatta 2025 1. ... |  | DetikTravel | - Jadwal Damri Bandara Soekarno Hatta 2025 1. ... | [- Jadwal Damri Bandara Soekarno Hatta 2025 1.... | - Jadwal Damri Bandara Soekarno Hatta 2025 1. ... | [tensor(0.0703, device='cuda:0'), tensor(0.027... | Stasiun DAMRI Kemayoran - Tarif: Rp 80 ribu - ... |
| 3 | https://www.beritatrans.com/artikel/254369/seb... | Sebanyak 940 Ribu Masyarakat Terbantu Mobilisa... | Oleh : Ahmad\nJAKARTA (BeritaTrans.com) – DAMR... |  | beritatrans.com | Oleh : Ahmad JAKARTA (BeritaTrans.com) – DAMRI... | [Oleh : Ahmad JAKARTA (BeritaTrans.com) – DAMR... | Oleh : Ahmad JAKARTA (BeritaTrans.com) – DAMRI... | [tensor(0.0441, device='cuda:0'), tensor(0.046... | Stasiun DAMRI Merak: tarif Rp140 ribu tersedia... |
| 4 | https://metrobanten.co.id/damri-buka-rute-tanj... | DAMRI Buka Rute Tanjung Barat–Bandara Soetta, ... | DAMRI Buka Rute Tanjung Barat–Bandara Soetta, ... |  | metrobanten.co.id | DAMRI Buka Rute Tanjung Barat–Bandara Soetta, ... | [DAMRI Buka Rute Tanjung Barat–Bandara Soetta,... | DAMRI Buka Rute Tanjung Barat–Bandara Soetta, ... | [tensor(-0.0207, device='cuda:0'), tensor(0.02... | DAMRI Buka Rute Tanjung Barat–Bandara Soetta, ... |
| 5 | https://www.kompas.tv/info-publik/615445/rute-... | Rute dan Harga Tiket DAMRI ke Bandara Soekarno... | JAKARTA, KOMPAS.TV - Jawatan Angkutan Motor Re... |  | Kompas TV | JAKARTA, KOMPAS.TV - Jawatan Angkutan Motor Re... | [JAKARTA, KOMPAS.TV - Jawatan Angkutan Motor R... | JAKARTA, KOMPAS.TV - Jawatan Angkutan Motor Re... | [tensor(-0.0157, device='cuda:0'), tensor(0.02... | Tiket transportasi DAMRI dari Jakarta ke Banda... |


In [None]:
df_ok[["url", "title", "content", "summary"]].to_csv("data/data_processed/data_extraction_minilm_summary.csv", sep=";")

# TF-IDF Based Summarization

In [None]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import re

nltk.download('punkt', quiet=True)

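def tfidf_summarize(text, num_sentences=3):
    """Return the num_sentences sentences with the highest summed TF-IDF weight."""
    if not isinstance(text, str) or text.strip() == "":
        return ""
    
    text = clean_text(text)
    sentences = nltk.sent_tokenize(text)
    if len(sentences) == 0:
        return ""
    # Short texts are returned unchanged
    if len(sentences) <= num_sentences:
        return " ".join(sentences)

    # Keep only ASCII letters, digits and basic punctuation for vectorization
    clean_sentences = [
        re.sub(r'\s+', ' ', re.sub(r'[^a-zA-Z0-9.,!? ]', '', s)).strip()
        for s in sentences
    ]

    # Note: 'english' is scikit-learn's built-in list and does not cover Indonesian stop words
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(clean_sentences)
    # Summing the weights per sentence tends to favor long sentences
    sentence_scores = tfidf_matrix.sum(axis=1).A1

    top_indices = np.argsort(sentence_scores)[-num_sentences:]
    top_indices.sort()  # restore document order

    summary = " ".join([sentences[i] for i in top_indices])
    summary = re.sub(r'\s+', ' ', summary).strip()
    return summary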

df_ok["tf_idf_summary"] = df_ok["content"].apply(lambda x: tfidf_summarize(x, num_sentences=3))

df_ok[["url", "title", "tf_idf_summary"]].head()


|   | url | title | tf_idf_summary |
|---|---|---|---|
| 0 | https://jogjapolitan.harianjogja.com/read/2025... | Jadwal DAMRI ke Bandara YIA Hari Ini, Jogja-Pu... | Keberangkatan dari Yogyakarta (Jogja) Sleman C... |
| 1 | https://travel.detik.com/travel-news/d-7871241... | Jadwal DAMRI Bandara Soekarno Hatta 2025, Rute... | Senayan Park - Cara Pesan Tiket DAMRI Bus DAMR... |
| 3 | https://www.beritatrans.com/artikel/254369/seb... | Sebanyak 940 Ribu Masyarakat Terbantu Mobilisa... | Purwakarta dengan presentase 9,13 persen dari ... |
| 4 | https://metrobanten.co.id/damri-buka-rute-tanj... | DAMRI Buka Rute Tanjung Barat–Bandara Soetta, ... | “Dengan kapasitas yang pas dan fasilitas yang ... |
| 5 | https://www.kompas.tv/info-publik/615445/rute-... | Rute dan Harga Tiket DAMRI ke Bandara Soekarno... | Menurut keterangan yang dirilis akun Instagram... |
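
Scoring a sentence by the sum of its TF-IDF weights rewards sentences simply for containing many terms, which is one likely reason long schedule listings appear in the summaries above. The sketch below contrasts the summed score with a per-term average on three toy sentences. It is a simplified, hand-rolled TF-IDF (whitespace tokens, raw counts, smoothed IDF, no stop words, no L2 normalization), so it is not identical to `TfidfVectorizer`; the function name `tfidf_sentence_scores` is hypothetical.

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences, normalize=True):
    """Score sentences by summed TF-IDF, optionally divided by the number of distinct terms."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = sum(
            count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        )
        if normalize and tf:
            total /= len(tf)
        scores.append(total)
    return scores

toy_sentences = [
    "damri buka rute",
    "damri tambah armada bus malam hari libur",
    "tarif naik",
]
summed = tfidf_sentence_scores(toy_sentences, normalize=False)
averaged = tfidf_sentence_scores(toy_sentences, normalize=True)
print("best by sum:    ", summed.index(max(summed)))
print("best by average:", averaged.index(max(averaged)))
```

With the summed score the longest sentence wins, while the average prefers the short sentence made up only of rare terms. Which behavior is preferable depends on the articles, so this is offered as a point of comparison rather than a replacement.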


In [55]:
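# Save the TF-IDF summaries to their own file (name chosen here) so the MiniLM summaries saved earlier are not overwritten
df_ok[["url", "title", "content", "tf_idf_summary"]].to_csv("data/data_processed/data_extraction_tfidf_summary.csv", sep=";")
df_ok.to_csv("data/data_processed/data_extraction_summary[full].csv", sep=";")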

In [39]:
print(df_ok.summary[0])
print(df_ok.tf_idf_summary[0])

Jadwal DAMRI ke Bandara YIA Hari Ini, Jogja-Purworejo-Kebumen  Harianjogja.com, JOGJA—Bus DAMRI adalah pilihan tepat untuk Anda yang ingin menuju ke Bandara YIA. Layanan transportasi DAMRI ini bisa Anda manfaatkan jika ingin menuju Bandara YIA. Desa Wisata Adat Osing Kemiren Banyuwangi Masuk Jaringan Terbaik Dunia  Berita Populer - Tarif Rp70.000, Ini Jadwal Bus DAMRI Jogja-Semarang PP - Pengolahan Sampah Menjadi Energi Listrik Dibutuhkan Untuk Atasi TPA - Ribuan Titik Jalan di Bantul Masih Gelap Rawan Kecelakaan - Buruh Jogja Beri Rapor Merah Setahun Kinerja Prabowo-Gibran - Jalur dan Rute Trans Jogja ke Prambanan, Goden, hingga Bantul
Keberangkatan dari Yogyakarta (Jogja) Sleman City Hall ke Bandara YIA Pukul 07.00 WIB-19.00 WIB Harga tiket Rp80.000 Berangkat setiap 60 menit, tiket tidak dapat dilakukan refund atau reschedule Terminal Condongcatur ke Bandara YIA Pukul 04.00 WIB-15.00 WIB Harga tiket Rp80.000 Berangkat setiap 60 menit, tiket tidak dapat dilakukan refund atau reschedul
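
The two summaries printed above can also be compared numerically. One simple option, shown here as a sketch, is the Jaccard overlap of their word sets; the helper `token_overlap` is hypothetical and is only a rough indicator of how much the two approaches agree, not a quality metric.

```python
import re

def token_overlap(summary_a: str, summary_b: str) -> float:
    """Jaccard overlap of the lower-cased word sets of two summaries."""
    tokens_a = set(re.findall(r"\w+", summary_a.lower()))
    tokens_b = set(re.findall(r"\w+", summary_b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(token_overlap("Bus DAMRI ke Bandara YIA", "Jadwal DAMRI ke Bandara"))
```

In the notebook it could be applied row by row to the `summary` and `tf_idf_summary` columns of `df_ok`.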