## News Scraper Implementation (scraping.py)

This script defines a class-based structure for scraping news article content from a list of URLs. It uses the `requests` library for fetching HTML and the `trafilatura` library for robust content extraction. Key features include configuration management, retry logic, rate limiting, logging, and checkpointing for resuming interrupted jobs.

### 1. Configuration (ScraperConfig)

- **Defines constants** for the scraping process, including:
  - `CHUNK_SIZE`/`checkpoint_interval`: How often to save progress (default 100).
  - `MIN_DELAY` and `MAX_DELAY`: Defines a random time delay (rate limiting) between requests (2.0 to 5.0 seconds).
  - `TIMEOUT`: Request timeout (15 seconds).
  - `MAX_RETRIES`: Number of times to retry a failed request (3).
  - `MIN_ARTICLE_LENGTH`: Minimum characters required for extracted text to be considered a valid article (200).
  - `USER_AGENTS`: A list of different user agents to rotate, helping to avoid bot detection.
  - `SKIP_PATTERNS`: A list of URL patterns (e.g., '/video/', '/foto/') that indicate non-article pages which should be skipped immediately.
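
The configuration described above could be held in a small dataclass. This is a minimal sketch, not the actual contents of scraping.py: the numeric defaults are taken from the list above, while the user-agent strings are placeholders and the helper `should_skip_url` is written here as a free function for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScraperConfig:
    """Illustrative config holder; numeric values mirror the defaults listed above."""
    CHUNK_SIZE: int = 100           # save a checkpoint every N URLs
    MIN_DELAY: float = 2.0          # seconds, lower bound of the random delay
    MAX_DELAY: float = 5.0          # seconds, upper bound of the random delay
    TIMEOUT: int = 15               # seconds per request
    MAX_RETRIES: int = 3
    MIN_ARTICLE_LENGTH: int = 200   # characters
    # Placeholder user agents (assumed; the real list is not shown in this notebook).
    USER_AGENTS: List[str] = field(default_factory=lambda: [
        "Mozilla/5.0 (X11; Linux x86_64) ExampleAgent/1.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleAgent/2.0",
    ])
    # '/video/' and '/foto/' come from the description above; the real list may be longer.
    SKIP_PATTERNS: List[str] = field(default_factory=lambda: ["/video/", "/foto/"])


def should_skip_url(url: str, config: ScraperConfig) -> bool:
    """Return True when the URL contains any configured skip pattern."""
    return any(pattern in url for pattern in config.SKIP_PATTERNS)
```

Using `default_factory` for the list fields avoids sharing one mutable list between config instances.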

### 2. Setup (Logging)

- Sets up standard logging to both the console and a file named `scraper.log` to track activity, warnings, and errors.
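
A dual console-and-file logger of this kind can be sketched as follows. The logger name `news_scraper` and the message format are assumptions; only the file name `scraper.log` comes from the description above.

```python
import logging


def setup_logging(log_file: str = "scraper.log") -> logging.Logger:
    """Return a logger that writes to both the console and `log_file`."""
    logger = logging.getLogger("news_scraper")  # logger name is an assumption
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers when the cell is re-run
        formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        handlers = (
            logging.StreamHandler(),                          # console
            logging.FileHandler(log_file, encoding="utf-8"),  # file
        )
        for handler in handlers:
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

The `if not logger.handlers` guard matters in a notebook, where re-running the setup cell would otherwise print every message twice.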

### 3. NewsScraper Class

The main class that encapsulates the scraping logic.

#### Key Methods

- **`__init__`**: Initializes the class, creates a persistent `requests.Session`, and initializes a `stats` dictionary to track success, failure, and skips, broken down by source domain.
- **`should_skip_url(self, url)`**: Checks if a URL contains any of the defined `SKIP_PATTERNS`.
- **`get_domain(self, url)`**: Extracts the clean domain name (netloc) from the URL.
- **`fetch_with_retry(self, url)`**:
  - Implements the core fetching mechanism.
  - Randomly selects a `User-Agent` for each request.
  - Uses a loop to retry the request up to `MAX_RETRIES` times upon `Timeout` or other `RequestException` errors.
  - Uses exponential backoff (`time.sleep(2 ** attempt)`) before retrying.
  - Raises an exception for bad HTTP status codes (`response.raise_for_status()`).
- **`extract_article(self, html, url)`**:
  - Uses `trafilatura.extract` to intelligently pull the main article text from the HTML content.
  - Gathers article metadata (title, author, date, description) using `trafilatura.extract_metadata`.
  - Returns `None` if the extraction fails or the content is shorter than `MIN_ARTICLE_LENGTH`.
- **`scrape_url(self, url)`**:
  - The single-URL processing pipeline.
  - Checks for skip patterns.
  - Calls `fetch_with_retry` and then `extract_article`.
  - Records the outcome (`success`, `failed`, or `skipped`) in the result dictionary and updates the internal `self.stats`.
- **`save_checkpoint(self, results, checkpoint_file)`**: Saves all collected results into a temporary CSV file (`_checkpoint.csv`) to protect progress.
- **`load_checkpoint(self, checkpoint_file)`**: Loads existing results and a set of already processed URLs from the checkpoint file, enabling the scraping job to **resume** where it left off.
- **`scrape_dataset(...)`**:
  - The main control function.
  - Loads existing progress via `load_checkpoint`.
  - Filters the input URLs to process only those not yet completed.
  - Iterates through the remaining URLs, calling `scrape_url`.
  - Applies a **random delay** between requests for rate limiting.
  - Saves a checkpoint every `checkpoint_interval` URLs (the main block sets this to 50; the config default is 100).
  - After completion, saves the final results to the specified `output_file` and removes the checkpoint file.
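
The retry behaviour described for `fetch_with_retry` can be sketched without any network access. In this sketch the `fetch` and `sleep` parameters are hypothetical injection points added so the logic can be tested; the real method uses its own `requests.Session` and catches `Timeout` / `RequestException` rather than a bare `Exception`. Whether the original sleeps after the final failed attempt is not stated, so this version assumes it does not.

```python
import random
import time
from typing import Callable, Optional, Sequence


def fetch_with_retry(
    url: str,
    fetch: Callable[[str, dict], str],
    user_agents: Sequence[str],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `fetch` up to `max_retries` times; return None if every attempt fails."""
    for attempt in range(max_retries):
        headers = {"User-Agent": random.choice(user_agents)}  # rotate per request
        try:
            return fetch(url, headers)
        except Exception:  # the real script catches requests' Timeout/RequestException
            if attempt < max_retries - 1:
                sleep(2 ** attempt)  # exponential backoff: 1 s, 2 s, 4 s, ...
    return None
```

With `requests`, `fetch` would typically wrap `session.get(url, headers=headers, timeout=15)` followed by `response.raise_for_status()` and return `response.text`.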

### 4. Main Execution Block (`if __name__ == "__main__":`)

- **Loads Data:** Reads the list of URLs from a source CSV file (`fidelity_check_model/data/NewsDataClean_fidelity.csv`) into a pandas DataFrame.
- **Initializes:** Creates instances of `ScraperConfig` and `NewsScraper`.
- **Runs Scraper:** Calls `scraper.scrape_dataset` with the loaded DataFrame and sets the checkpoint interval to 50.
- **Analysis:** Prints final statistics, showing the count of `success`, `failed`, and `skipped` articles, and breaks down the status counts by **domain**.
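
The resume-from-checkpoint flow of `scrape_dataset` can be illustrated with the standard library alone. This is a simplified sketch under several assumptions: it uses the `csv` module where the script uses pandas, keeps only two columns, and omits the random delay, the final output file, and the checkpoint cleanup. The names `scrape_all` and `FIELDS` are hypothetical.

```python
import csv
import os
from typing import Callable, Dict, Iterable, List

FIELDS = ["url", "status"]  # trimmed column set, for illustration only


def load_checkpoint(path: str) -> List[Dict[str, str]]:
    """Return previously saved rows, or an empty list when no checkpoint exists."""
    if not os.path.exists(path):
        return []
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def save_checkpoint(rows: List[Dict[str, str]], path: str) -> None:
    """Overwrite the checkpoint file with all rows collected so far."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def scrape_all(
    urls: Iterable[str],
    scrape_url: Callable[[str], Dict[str, str]],
    checkpoint_file: str,
    checkpoint_interval: int = 50,
) -> List[Dict[str, str]]:
    """Process only URLs missing from the checkpoint, saving every N results."""
    results = load_checkpoint(checkpoint_file)
    done = {row["url"] for row in results}          # URLs already processed
    pending = [u for u in urls if u not in done]    # work that remains
    for i, url in enumerate(pending, start=1):
        results.append(scrape_url(url))
        if i % checkpoint_interval == 0:
            save_checkpoint(results, checkpoint_file)
    return results
```

Because the checkpoint stores every processed URL regardless of status, failed and skipped URLs are not retried on resume in this sketch.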


# Read Data

In [None]:
import pyodbc
import pandas as pd
import re
import unicodedata

In [2]:
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=LAPTOP-H9LRGGLD;'
    'DATABASE=InflationNews;'
    'Trusted_Connection=yes;'
)

In [3]:
query = "SELECT * FROM final_scrape_articles"
df = pd.read_sql(query, conn)


# Simple EDA

In [4]:
df.head()

Unnamed: 0,url,status,domain,scraped_at,text,title,author,date,description,error
0,https://market.bisnis.com/read/20200629/235/12...,success,market.bisnis.com,2025-10-12T11:01:00.011643,"Bisnis.com, JAKARTA - Harga emas berjangka sem...","Harga Emas Semakin Dekati US$1.800, Level Tert...",Finna U Ulfah; Hafiyyan,2020-06-29,Pada perdagangan Senin (29/6/2020) hingga puku...,
1,https://market.bisnis.com/read/20200826/93/128...,success,market.bisnis.com,2025-10-12T11:01:13.405078,"Bisnis.com, JAKARTA - Nilai tukar rupiah terha...",Mantap! Rupiah Menguat Saat Mayoritas Mata Uan...,Rivki Maulana,2020-08-26,Nilai tukar rupiah terhadap dolar AS menguat 1...,
2,https://market.bisnis.com/read/20200831/94/128...,success,market.bisnis.com,2025-10-12T11:01:18.507975,"Bisnis.com, JAKARTA - Harga minyak mengalami p...","Selain Emas, Harga Minyak Juga Terbantu Lesuny...",Hafiyyan,2020-08-31,Dolar AS yang lemah telah mendukung harga miny...,
3,https://market.bisnis.com/read/20210910/235/14...,success,market.bisnis.com,2025-10-12T11:02:56.788542,"Bisnis.com, JAKARTA – Harga emas batangan 24 k...","Naik Tipis! Harga Emas 24 Karat di Pegadaian, ...",Lorenzo Anugrah Mahardhika; Aprianto Cahyo Nug...,2021-09-10,Harga emas 24 karat UBS ukuran terkecil yakni ...,
4,https://market.bisnis.com/read/20200926/235/12...,success,market.bisnis.com,2025-10-12T11:01:23.140415,"Bisnis.com, JAKARTA - Harga emas berjangka jat...","Dolar AS Dinilai Lebih Aman, Harga Emas Jatuh",Rivki Maulana,2020-09-26,"Dalam pekan ini, harga emas berjangka sudah ja...",


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9293 entries, 0 to 9292
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   url          9293 non-null   object
 1   status       9293 non-null   object
 2   domain       9293 non-null   object
 3   scraped_at   9293 non-null   object
 4   text         9226 non-null   object
 5   title        9225 non-null   object
 6   author       8505 non-null   object
 7   date         9226 non-null   object
 8   description  9223 non-null   object
 9   error        67 non-null     object
dtypes: object(10)
memory usage: 726.1+ KB


Explanation:
- `text` = the cleaned main article body that trafilatura extracts from the page content.
- `description` = a short blurb from the HTML metadata (typically `og:description`, `twitter:description`, or `meta name="description"`). It is not the body. It often overlaps with the first sentence of the article and can be missing or generic, depending on the site.
- `title`, `author`, and `date` also come from metadata, via the `trafilatura.extract_metadata` call used in scraping.py.
- We decided to keep `text` as the ground truth for NLP.
- We keep `description` only as a candidate summary, because it can be redundant: many sites reuse the first paragraph as the description, and some put marketing copy there instead.
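
The claim that `description` often repeats the opening of `text` can be checked with a rough token-overlap score. This is a sketch only: the helper names `first_sentence` and `token_overlap` are hypothetical, and the sentence splitter is a simple heuristic that was not used anywhere in this notebook.

```python
import re


def first_sentence(text: str) -> str:
    """Return the text up to the first sentence-ending punctuation mark."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)
    return parts[0] if parts else ""


def token_overlap(description: str, text: str) -> float:
    """Share of description tokens that also occur in the article's first sentence."""
    desc_tokens = set(re.findall(r"\w+", description.lower()))
    lead_tokens = set(re.findall(r"\w+", first_sentence(text).lower()))
    if not desc_tokens:
        return 0.0
    return len(desc_tokens & lead_tokens) / len(desc_tokens)
```

A score near 1.0 means the description adds little beyond the lead sentence. On the DataFrame it could be applied row-wise, for example with `df.apply(lambda r: token_overlap(str(r['description']), str(r['text'])), axis=1)`.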

In [6]:
df['status'].value_counts()

status
success    9226
failed       57
skipped      10
Name: count, dtype: int64

In [7]:
df['error'].value_counts()

error
failed to fetch                           53
matched skip pattern: /foto/               7
extraction failed or content too short     4
matched skip pattern: /photo/              3
Name: count, dtype: int64

As the counts show, every non-null value in the `error` column explains why a row's status is `failed` or `skipped`. These rows contain no article text, so it is better to remove them.

In [8]:
df = df[df['status'] == 'success'].copy()

In [9]:
df['status'].value_counts()

status
success    9226
Name: count, dtype: int64

After removing the failed and skipped links, 9226 articles remain.

In [10]:
df['error'].value_counts()

Series([], Name: count, dtype: int64)

In [11]:
df['domain'].value_counts()

domain
market.bisnis.com          2769
liputan6.com               2469
ekonomi.bisnis.com         1184
cnnindonesia.com            866
finansial.bisnis.com        674
bisnis.liputan6.com         291
koran.bisnis.com            172
sumatra.bisnis.com          149
kabar24.bisnis.com           99
surabaya.bisnis.com          70
industri.bisnis.com          61
sulawesi.bisnis.com          54
bali.bisnis.com              47
otomotif.bisnis.com          44
semarang.bisnis.com          41
tempo.co                     38
jakarta.bisnis.com           34
bandung.bisnis.com           32
foto.bisnis.com              27
kalimantan.bisnis.com        17
surabaya.liputan6.com        12
tekno.liputan6.com            8
premium.bisnis.com            8
teknologi.bisnis.com          7
hijau.bisnis.com              7
hot.liputan6.com              6
news.liputan6.com             6
otomotif.liputan6.com         5
tv.liputan6.com               4
papua.bisnis.com              4
infografik.bisnis.com         3
...

Although the domains vary, they all belong to four news sources:
- Bisnis.com
- liputan6.com
- tempo.co
- cnnindonesia.com
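
The grouping of subdomains into these four sources can be made explicit with a small mapping helper. This is an illustrative sketch: `to_source` and `SOURCES` are hypothetical names, and the suffix check is one possible rule, stricter than a plain substring match.

```python
SOURCES = ("bisnis.com", "liputan6.com", "tempo.co", "cnnindonesia.com")


def to_source(domain: str) -> str:
    """Map a (sub)domain such as 'market.bisnis.com' to its parent news source."""
    domain = domain.lower().strip()
    for source in SOURCES:
        # Accept the bare domain or any subdomain of it, but not look-alikes.
        if domain == source or domain.endswith("." + source):
            return source
    return "other"
```

For example, `df['domain'].map(to_source).value_counts()` would give one count per source, which should add up to the total number of rows.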

In [None]:
df.to_csv("final_scraping_articles.csv", index=False, encoding="utf-8-sig")

# Clean Column Text by each own domain

In [13]:
# regex=False matches the domain literally; otherwise '.' would match any character
df_bisnis = df[df['domain'].str.contains('bisnis.com', case=False, na=False, regex=False)].copy()
df_tempo = df[df['domain'].str.contains('tempo.co', case=False, na=False, regex=False)].copy()
df_liputan6 = df[df['domain'].str.contains('liputan6.com', case=False, na=False, regex=False)].copy()
df_cnn = df[df['domain'].str.contains('cnn', case=False, na=False, regex=False)].copy()
# case=False: ignore upper and lower case
# na=False: treat missing values as non-matches (False)

In [14]:
cols_to_drop = ['status', 'scraped_at', 'author', 'description', 'error']

df_bisnis = df_bisnis.drop(columns=cols_to_drop, errors='ignore')
df_tempo = df_tempo.drop(columns=cols_to_drop, errors='ignore')
df_liputan6 = df_liputan6.drop(columns=cols_to_drop, errors='ignore')
df_cnn = df_cnn.drop(columns=cols_to_drop, errors='ignore')

In [15]:
print(len(df_bisnis), len(df_tempo), len(df_liputan6), len(df_cnn))
print(len(df_bisnis)+len(df_tempo)+ len(df_liputan6)+len(df_cnn))

5513 38 2809 866
9226


## Bisnis

bisnis.com has many subdomains. We first look at sample `text` values from each subdomain to identify boilerplate and other template text that carries no information and can be deleted.

In [16]:
print(df_bisnis['domain'].value_counts())
print("Total Unique Domain: ",len(df_bisnis['domain'].value_counts()))

domain
market.bisnis.com          2769
ekonomi.bisnis.com         1184
finansial.bisnis.com        674
koran.bisnis.com            172
sumatra.bisnis.com          149
kabar24.bisnis.com           99
surabaya.bisnis.com          70
industri.bisnis.com          61
sulawesi.bisnis.com          54
bali.bisnis.com              47
otomotif.bisnis.com          44
semarang.bisnis.com          41
jakarta.bisnis.com           34
bandung.bisnis.com           32
foto.bisnis.com              27
kalimantan.bisnis.com        17
premium.bisnis.com            8
hijau.bisnis.com              7
teknologi.bisnis.com          7
papua.bisnis.com              4
infografik.bisnis.com         3
lifestyle.bisnis.com          3
entrepreneur.bisnis.com       3
banten.bisnis.com             2
traveling.bisnis.com          1
bola.bisnis.com               1
Name: count, dtype: int64
Total Unique Domain:  26


### Sampling

In [17]:
import pandas as pd

def sample_representative(df, text_col='text', domain_col='domain', random_state=42):
    samples = []
    for domain, group in df.groupby(domain_col):
        n = len(group)
        if n < 10:
            k = 1
        elif n < 100:
            k = 2
        elif n < 1000:
            k = 3
        else:
            k = 4
        sampled = group.sample(n=k, random_state=random_state)
        samples.append(sampled)
    df_repr = pd.concat(samples, ignore_index=True)
    df_repr = df_repr[[domain_col, text_col]]
    print("Representative sample per domain:")
    print(df_repr.groupby(domain_col).size())
    print("\nTotal samples:", len(df_repr)) 
    return df_repr

df_repr = sample_representative(df_bisnis, text_col='text', domain_col='domain')

Representative sample per domain:
domain
bali.bisnis.com            2
bandung.bisnis.com         2
banten.bisnis.com          1
bola.bisnis.com            1
ekonomi.bisnis.com         4
entrepreneur.bisnis.com    1
finansial.bisnis.com       3
foto.bisnis.com            2
hijau.bisnis.com           1
industri.bisnis.com        2
infografik.bisnis.com      1
jakarta.bisnis.com         2
kabar24.bisnis.com         2
kalimantan.bisnis.com      2
koran.bisnis.com           3
lifestyle.bisnis.com       1
market.bisnis.com          4
otomotif.bisnis.com        2
papua.bisnis.com           1
premium.bisnis.com         1
semarang.bisnis.com        2
sulawesi.bisnis.com        2
sumatra.bisnis.com         3
surabaya.bisnis.com        2
teknologi.bisnis.com       1
traveling.bisnis.com       1
dtype: int64

Total samples: 49


Next, we auto-detect phrases repeated across the samples using n-gram frequency (TF-IDF would be an alternative), to identify boilerplate text patterns quickly.

Why this approach?
Boilerplate often looks obvious to the eye but is subtle to an algorithm. Examples: 2025, Bisnis.com, all rights reserved, editor: ..., author: ...

Sampling one to four random articles per subdomain exposes the repeated scaffolding that trafilatura missed, so we can design one regex cleaner per source instead of over-fitting to a single article.

Large domains are represented well enough to reveal recurring structures, while small ones are not overweighted.
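
One way to separate scaffolding from topical phrases is to look at document frequency, meaning the share of articles that contain a phrase, instead of raw counts. The sketch below is a swapped-in variant for illustration, not the method used in this notebook (which relies on `CountVectorizer` totals); the function name `ngram_document_frequency` is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def ngram_document_frequency(texts: Iterable[str], n: int = 2) -> List[Tuple[str, float]]:
    """Return (n-gram, share of documents containing it), most widespread first."""
    doc_counts: Counter = Counter()
    total = 0
    for text in texts:
        total += 1
        tokens = re.findall(r"\w+", text.lower())
        # A set ensures each n-gram is counted at most once per document.
        grams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_counts.update(grams)
    if total == 0:
        return []
    return [(gram, count / total) for gram, count in doc_counts.most_common()]
```

A phrase that appears in nearly every article of a source, such as a dateline prefix, is a stronger boilerplate candidate than one that appears many times in a few articles. Frequent topical phrases can still rank high, so the result needs manual review.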

### Manual Inspection

In [18]:
df_repr.to_csv('bisnis_representation.csv',index=False, encoding="utf-8-sig")

In [19]:
def preview_domain_samples(df, domain_col='domain', text_col='text', n_samples=2, char_limit=800):
    domains = df[domain_col].unique()

    for domain in domains:
        print(f"\n=== {domain} ===")
        sample_texts = df[df[domain_col] == domain][text_col].head(n_samples).values
        for i, text in enumerate(sample_texts):
            snippet = str(text)[:char_limit].strip()
            print(f"\nSample {i+1}:\n{snippet}")
preview_domain_samples(df_repr, domain_col='domain', text_col='text', n_samples=2, char_limit=800)


=== bali.bisnis.com ===

Sample 1:
Bisnis.com, DENPASAR – Bali mewaspadai kenaikan inflasi yang mulai terjadi sejak Juni 2025, di mana tercatat inflasi mencapai 2,94% year-on-year (yoy). Bank Indonesia (BI) menyiapkan strategi untuk menahan laju inflasi di masa peak season wisatawan.
Kepala Perwakilan Bank Indonesia Provinsi Bali, Erwin Soeriadimadja, menjelaskan inflasi Bali ke depan perlu tetap mendapat perhatian karena lebih tinggi dibandingkan inflasi nasional baik bulanan maupun tahunan yang masing-masing tercatat 0,19% month-to-month (mtm) dan 1,87% (yoy).
"Untuk itu, diperlukan penguatan pengendalian inflasi melalui kolaborasi, inovasi, dan sinergi Tim Pengendalian Inflasi Daerah (TPID) khususnya dalam menyambut periode peak season kunjungan wisatawan mancanegara seiring periode summer holiday," kata Erwin dikutip da

Sample 2:
Bisnis.com, DENPASAR — Bank Indonesia memproyeksikan ekonomi NTB pada kuartal II/2024 diproyeksikan bisa tumbuh lebih tinggi dibandingkan kuartal I/2024

The output above shows some of the samples; we also read through every sample manually.

### Quantitative Frequency Inspection

This step only validates the findings from the manual inspection.

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

def get_top_ngrams(
    df, 
    text_col='text', 
    ngram_range=(2, 5), 
    min_df=2, 
    top_n=20, 
    stop_words=None
):
    vectorizer = CountVectorizer(
        ngram_range=ngram_range,
        min_df=min_df,
        stop_words=stop_words
    )

    X = vectorizer.fit_transform(df[text_col])
    freqs = zip(vectorizer.get_feature_names_out(), X.sum(axis=0).A1)
    common_phrases = sorted(freqs, key=lambda x: x[1], reverse=True)

    print(f"\nTop {top_n} n-grams (range={ngram_range}, min_df={min_df}):\n")
    for phrase, count in common_phrases[:top_n]:
        print(f"{phrase} → {count}")

    return common_phrases[:top_n]
top_phrases = get_top_ngrams(df_repr, text_col='text', ngram_range=(2, 5), min_df=2, top_n=100)


Top 100 n-grams (range=(2, 5), min_df=2):

bisnis com → 49
dolar as → 44
pertumbuhan ekonomi → 40
harga emas → 38
bisnis com jakarta → 31
com jakarta → 31
bank indonesia → 29
sementara itu → 23
baca juga → 22
kenaikan harga → 21
mata uang → 20
harga jual → 19
harga beli → 18
mengalami inflasi → 18
dengan harga → 17
selain itu → 17
dari sisi → 16
pada kuartal → 16
tahun ini → 16
di indonesia → 15
inflasi di → 15
ke level → 15
laju inflasi → 15
nilai tukar → 15
suku bunga → 15
yang lebih → 15
bahan makanan → 14
dan harga → 14
hari ini → 13
lebih tinggi → 13
masing masing → 13
pada bulan → 13
pada tahun → 13
persen ke → 13
setara dengan → 13
yang sama → 13
atau setara → 12
dan harga jual → 12
jawa tengah → 12
pukul 09 → 12
salah satu → 12
the fed → 12
daging ayam → 11
harga yang → 11
lebih rendah → 11
level us → 11
persen ke level → 11
seiring dengan → 11
badan pusat → 10
badan pusat statistik → 10
beli dolar → 10
di dunia → 10
di pasar → 10
ekonomi indonesia → 10
hal ini → 10
hal terseb

In [None]:
boilerplate_patterns_bisnis = [
    # Dateline / Publisher Prefix
    r'^(?:Bisnis\.com,?\s*)?[A-ZÁÉÍÓÚÜ\s]{2,30}-\s*',
    r'\bbisnis\s*\.com\b',
    r'\bbisnis\s*\.com\s*[A-Z][a-z]+',   # e.g. bisnis com jakarta

    # Press-Release Citations
    r'dikutip dari siaran pers\s\w+\s\(\d{1,2}/\d{1,2}/\d{4}\)\.',

    # Internal Linking / Read More
    r'Baca\s*Juga\s?:.*',
    r'Simak\s*juga\s?:.*',
    r'Cek\s*juga\s?:.*',
    r'Baca\s*selengkapnya\s?:.*',
    r'Lihat\s*berita\s*lainnya.*',

    # Time Stamps / Date-like boilerplate
    r'\bpukul\s*\d{1,2}[:.]?\d{0,2}\b',    # e.g. pukul 09, 09.30
    r'\b[IVX]{1,4}\s*20\d{2}\b',           # e.g. IV 2024 (no '|' inside a character class)

    # Call-to-Action / Social Media
    r'Cek\s*Berita\s*dan\s*Artikel.*Google\s*News.*',
    r'Ikuti\s*Saluran\s*WhatsApp\s*Bisnis\.com\s*di\s*Sini.*',
    r'Ikuti\s*kami\s*di.*',
    r'Follow.*Bisnis\.com.*',
    r'Gabung\s*grup\s*Telegram.*',

    # Editor / Author / Credits
    r'Editor\s?:\s?.*',
    r'Reporter\s?:\s?.*',
    r'Penulis\s?:\s?.*',
    r'Kontributor\s?:\s?.*',
    r'Sumber\s?:\s?.*',
    r'Data\s*diolah\s*Bisnis\.com.*',

    # Legal Disclaimer
    r'Bisnis\.com\s*tidak\s*bertanggung\s*jawab.*',
    r'Konten\s*ini\s*merupakan\s*tanggung\s*jawab.*',

    # Copyright / Footer
    r'©\s*\d{4}\s*Bisnis\.com.*',
    r'Hak\s*c?C?ipta\s*dilindungi.*',

    # Advertisement / Promo
    r'Klik\s*di\s*sini\s*untuk\s*berlangganan.*',
    r'Baca\s*berita\s*lainnya\s*di\s*kanal.*',

    # Hashtag / Campaign
    r'#\w+'
]

### Remove Boilerplates 

In [None]:
def normalize_text(text):
    # Normalize Unicode to NFKC (folds Latin ligatures, fractions, etc. into standard forms)
    text = unicodedata.normalize("NFKC", text)
    # Replace fancy dashes with a regular hyphen
    text = re.sub(r"[–—−]", "-", text)  # dash standardization 1
    text = re.sub(r"--", "-", text)  # dash standardization 2
    # Collapse spaces and tabs but keep line breaks, so that patterns ending in '.*'
    # remove only their own line instead of the rest of the article
    text = re.sub(r"[^\S\n]+", " ", text)
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()  # remove leading and trailing whitespace

def remove_boilerplate(text, patterns):
    # Replace every match of every pattern with an empty string
    for pattern in patterns:
        # IGNORECASE: match all case variations of the phrase
        text = re.sub(pattern, '', text, flags=re.IGNORECASE)
    # Line breaks are no longer needed once the patterns have run
    return re.sub(r'\s+', ' ', text).strip()

df_bisnis['text'] = df_bisnis['text'].apply(normalize_text)
df_bisnis['clean_text'] = df_bisnis['text'].apply(lambda t: remove_boilerplate(t, boilerplate_patterns_bisnis))
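
One property of the patterns that end in `.*` is worth verifying on a toy string: without the `DOTALL` flag, `.` does not match a newline, so such a pattern is only line-bounded while line breaks are still present in the text. The sample string below is made up for illustration.

```python
import re

raw = "Paragraf pertama.\nBaca Juga: Judul artikel lain\nParagraf kedua."
pattern = r"Baca\s*Juga\s?:.*"

# With line breaks intact, '.*' stops at the end of the "Baca Juga" line.
kept_lines = re.sub(pattern, "", raw, flags=re.IGNORECASE)
kept_lines = re.sub(r"\s+", " ", kept_lines).strip()

# If whitespace is collapsed first, the text is one line and '.*' runs to the end.
flattened = re.sub(r"\s+", " ", raw)
over_removed = re.sub(pattern, "", flattened, flags=re.IGNORECASE).strip()

print(kept_lines)    # Paragraf pertama. Paragraf kedua.
print(over_removed)  # Paragraf pertama.
```

Comparing the length of `clean_text` against `text` per row is a quick way to spot articles that lost far more than a line or two.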

In [23]:
df_bisnis.head()

Unnamed: 0,url,domain,text,title,date,clean_text
0,https://market.bisnis.com/read/20200629/235/12...,market.bisnis.com,"Bisnis.com, JAKARTA - Harga emas berjangka sem...","Harga Emas Semakin Dekati US$1.800, Level Tert...",2020-06-29,Harga emas berjangka semakin mendekati level U...
1,https://market.bisnis.com/read/20200826/93/128...,market.bisnis.com,"Bisnis.com, JAKARTA - Nilai tukar rupiah terha...",Mantap! Rupiah Menguat Saat Mayoritas Mata Uan...,2020-08-26,Nilai tukar rupiah terhadap dolar Amerika Seri...
2,https://market.bisnis.com/read/20200831/94/128...,market.bisnis.com,"Bisnis.com, JAKARTA - Harga minyak mengalami p...","Selain Emas, Harga Minyak Juga Terbantu Lesuny...",2020-08-31,Harga minyak mengalami penguatan seiring denga...
3,https://market.bisnis.com/read/20210910/235/14...,market.bisnis.com,"Bisnis.com, JAKARTA - Harga emas batangan 24 k...","Naik Tipis! Harga Emas 24 Karat di Pegadaian, ...",2021-09-10,Harga emas batangan 24 karat yang dijual di Pe...
4,https://market.bisnis.com/read/20200926/235/12...,market.bisnis.com,"Bisnis.com, JAKARTA - Harga emas berjangka jat...","Dolar AS Dinilai Lebih Aman, Harga Emas Jatuh",2020-09-26,Harga emas berjangka jatuh pada akhir pekan se...


## Liputan6

In [24]:
print(df_liputan6['domain'].value_counts())
print("Total Unique Domain: ",len(df_liputan6['domain'].value_counts()))

domain
liputan6.com              2469
bisnis.liputan6.com        291
surabaya.liputan6.com       12
tekno.liputan6.com           8
hot.liputan6.com             6
news.liputan6.com            6
otomotif.liputan6.com        5
tv.liputan6.com              4
citizen6.liputan6.com        2
jateng.liputan6.com          1
regional.liputan6.com        1
properti.liputan6.com        1
showbiz.liputan6.com         1
lifestyle.liputan6.com       1
jatim.liputan6.com           1
Name: count, dtype: int64
Total Unique Domain:  15


### Sampling

In [25]:
df_repr_2 = sample_representative(df_liputan6,text_col='text',domain_col='domain')

Representative sample per domain:
domain
bisnis.liputan6.com       3
citizen6.liputan6.com     1
hot.liputan6.com          1
jateng.liputan6.com       1
jatim.liputan6.com        1
lifestyle.liputan6.com    1
liputan6.com              4
news.liputan6.com         1
otomotif.liputan6.com     1
properti.liputan6.com     1
regional.liputan6.com     1
showbiz.liputan6.com      1
surabaya.liputan6.com     2
tekno.liputan6.com        1
tv.liputan6.com           1
dtype: int64

Total samples: 21


### Manual Inspection

In [26]:
df_repr_2.to_csv('liputan6_representation.csv',index=False, encoding="utf-8-sig")

In [27]:
preview_domain_samples(df_repr_2, domain_col='domain', text_col='text', n_samples=2, char_limit=800)


=== bisnis.liputan6.com ===

Sample 1:
Liputan6.com, Jakarta - Harga emas PT Aneka Tambang Tbk (Antam) atau emas Antam naik Rp 3.000 menjadi Rp 637 ribu per gram pada perdagangan Rabu (3/1/2018). Pada perdagangan sebelumnya atau pada Selasa kemarin, harga emas Antam berada di posisi Rp 634 ribu per gram.
Sedangkan harga pembelian kembali atau buyback naik Rp 2.000 menjadi Rp 567 ribu per gram. Harga buyback ini adalah jika Anda akan menjual emas, maka Antam akan membelinya di harga Rp 567 ribu per gram.
Advertisement
Baca Juga
Pembayaran buyback dengan volume di atas 1 kilogram (kg) akan dilakukan maksimal dua hari setelah transaksi dengan mengacu pada harga buyback hari transaksi.
Antam menjual emas dengan ukuran mulai 1 gram hingga 500 gram. Hingga pukul 08.19 WIB, sebagian besar ukuran emas Antam masih tersedia kecuali ukura

Sample 2:
Liputan6.com, Jakarta - Ditutup dengan inflasi Desember sebesar 0,65 persen (mtm), angka inflasi Jakarta sepanjang 2017 tetap terkendali. Inflasi DK

### Quantitative Frequency Inspection

In [28]:
top_phrases = get_top_ngrams(df_repr_2, text_col='text', ngram_range=(2, 5), min_df=2, top_n=100)


Top 100 n-grams (range=(2, 5), min_df=2):

liputan6 com → 24
kenaikan harga → 22
baca juga → 17
com jakarta → 16
liputan6 com jakarta → 16
daya beli → 13
pertumbuhan ekonomi → 13
ribu per → 13
jawa timur → 12
per kilogram → 12
saat ini → 12
ujar dia → 12
di jakarta → 10
hal ini → 10
mengalami kenaikan → 10
beli masyarakat → 9
daya beli masyarakat → 9
daya listrik → 9
kata dia → 9
di pasar → 8
rata rata → 8
di bawah → 7
di jawa → 7
dibandingkan dengan → 7
tidak ada → 7
tidak hanya → 7
angka inflasi → 6
bank indonesia → 6
di indonesia → 6
di rumah → 6
ekonomi nasional → 6
harga bahan → 6
harga cabai → 6
harga pangan → 6
juta hingga → 6
menjadi rp → 6
ribu per kilogram → 6
rumah tangga → 6
salah satu → 6
selain itu → 6
advertisement baca → 5
advertisement baca juga → 5
bahan pokok → 5
di jawa timur → 5
di masa → 5
ekonomi daerah → 5
gas kg → 5
ini akan → 5
lebih rendah → 5
menaikkan harga → 5
pemerintah pusat → 5
sekitar rp → 5
sumber daya → 5
tahun sebelumnya → 5
yang bisa → 5
yang dila

In [29]:
boilerplate_patterns_liputan6 = [
    # Dateline / Publisher Prefix
    r'^(?:Liputan6\.com,?\s*)?[A-ZÁÉÍÓÚÜ\s]{2,30}-\s*',
    r'\bliputan6\s*\.com\b',
    r'\bliputan6\s*\.com\s*[A-Z][a-z]+',  # e.g. liputan6 com jakarta

    # Internal links & ad placeholders
    r'advertisement\s*baca.*',
    r'Baca juga\s?:.*',

    # Video embeds / “Saksikan Video Pilihan …”
    r'saksikan\s*video.*',
    r'video\s*pilihan.*',
    r'pilihan\s*di\s*bawah.*',
    r'di\s*bawah\s*ini.*',

    # Social media / cross-promotion
    r'Ikuti\s*kanal\s*(WhatsApp|YouTube)\s*Liputan6\.com.*',
    r'Cek\s*Berita\s*dan\s*Artikel.*Google\s*News.*',

    # Author credits
    r'(Reporter|Penulis|Editor)\s?:\s?.*',

    # Source and data
    r'Sumber\s?:\s?.*',
    r'Data\sdiolah\sLiputan6\.com.*',

    # Disclaimers
    r'Liputan6\.com\s*t[iy]dak\s*bertanggung\s*jawab.*',

    # Copyright
    r'©\s*\d{4}\s*Liputan6\.com.*',
    r'Hak\s*Cipta\s*Dilindungi.*',

    # Subscription / Promo
    r'Klik\s*di\s*sini\s*untuk\s*berlangganan.*',

    # Hashtags / campaigns
    r'#\w+'
]


### Remove Boilerplates

In [30]:
df_liputan6['text'] = df_liputan6['text'].apply(normalize_text)
df_liputan6['clean_text'] = df_liputan6['text'].apply(lambda t: remove_boilerplate(t, boilerplate_patterns_liputan6))


In [31]:
df_liputan6.head()

Unnamed: 0,url,domain,text,title,date,clean_text
47,https://www.liputan6.com/news/read/5553069/dem...,liputan6.com,"Liputan6.com, Bojonegoro Saat ini Indonesia da...","Demi Pangan dan Petani Indonesia, Mentan Amran...",2024-03-18,", Bojonegoro Saat ini Indonesia dalam kondisi ..."
60,https://www.liputan6.com/news/read/5575791/eko...,liputan6.com,"Liputan6.com, Jakarta - Ketua Dewan Pertimbang...",Ekonomi Indonesia Dinilai Masih Kuat Hadapi An...,2024-04-18,Ketua Dewan Pertimbangan Kamar Dagang dan Indu...
78,https://www.liputan6.com/news/read/5749782/hip...,liputan6.com,"Liputan6.com, Jakarta - Himpunan Pengusaha KAH...",HIPKA Dorong Kadin Jabar Dukung Target 8 Perse...,2024-10-15,Himpunan Pengusaha KAHMI (HIPKA) yang didirika...
79,https://www.liputan6.com/otomotif/read/3795390...,liputan6.com,"Liputan6.com, Jakarta - Pemerintah Indonesia d...","Punya Nikel, Indonesia Bisa Menguasai Rantai P...",2018-11-30,Pemerintah Indonesia dalam waktu dekat bakal m...
81,https://www.liputan6.com/otomotif/read/4383411...,liputan6.com,"Liputan6.com, Jakarta - Langkah Indonesia untu...",Indonesia Makin Serius Bangun Industri Baterai...,2020-10-15,Langkah Indonesia untuk menguasai industri ken...


## CNN

### Sampling

In [32]:
df_repr_3 = df_cnn.sample(n=30,random_state=42)

CNN Indonesia has a single domain with no subdomains, so we can sample from it directly.

### Manual Inspection

In [33]:
df_repr_3.to_csv('cnn_representation.csv',index = False, encoding='utf-8-sig')

In [34]:
preview_domain_samples(df_repr_3, domain_col='domain', text_col='text', n_samples=10, char_limit=800)


=== cnnindonesia.com ===

Sample 1:
Jakarta, CNN Indonesia -- Harga jual
emas PT Aneka Tambang (Persero) Tbk atau
Antam berada di posisi Rp766 ribu per gram pada Jumat (3/1) atau naik Rp4.000 dari Rp762 ribu per gram pada Kamis (2/1). Harga pembelian kembali (
buyback) melejit Rp5.000 dari Rp678 ribu menjadi Rp683 ribu per gram pada hari ini.
Berdasarkan data Antam, harga jual emas berukuran 0,5 gram senilai Rp407,5 ribu, 2 gram Rp1,48 juta, 3 gram Rp2,2 juta, 5 gram Rp3,65 juta, 10 gram Rp7,23 juta, 25 gram Rp17,98 juta, dan 50 gram Rp35,88 juta. Kemudian, harga emas berukuran 100 gram senilai Rp71,7 juta, 250 gram Rp175,25 juta, 500 gram Rp357,8 juta, dan 1 kilogram Rp699,6 juta.
Harga jual emas tersebut sudah termasuk Pajak Penghasilan (PPh) 22 atas emas batangan sebesar 0,45 persen bagi pemegang Nomor Pokok Wajib Pajak (

Sample 2:
Harga jual emas PT Antam (Persero) Tbk berada di posisi Rp947 ribu per gram pada Kamis (19/8). Harganya naik tipis Rp1.000 dibandingkan perdagangan Rab

### Quantitative Frequency Inspection

In [35]:
top_phrases = get_top_ngrams(df_repr_3, text_col='text', ngram_range=(2, 5), min_df=2, top_n=100)


Top 100 n-grams (range=(2, 5), min_df=2):

harga emas → 55
dolar as → 40
suku bunga → 34
per troy → 32
per troy ons → 32
troy ons → 32
harga jual → 25
persen dan → 25
hari ini → 24
mata uang → 24
harga jual emas → 23
jual emas → 23
emas di → 22
harga emas di → 22
per gram → 22
ribu per → 22
harga bbm → 21
ribu per gram → 21
bunga acuan → 19
di perdagangan → 19
berada di → 18
gram senilai → 18
juta dan → 18
juta gram → 18
emas berukuran → 17
emas di perdagangan → 17
harga emas di perdagangan → 17
sebesar persen → 17
suku bunga acuan → 17
gram pada → 16
kenaikan harga → 16
minyak mentah → 16
per gram pada → 16
saat ini → 16
bank sentral → 15
di posisi → 15
kepada cnnindonesia → 15
per dolar → 15
per dolar as → 15
ribu per gram pada → 15
sementara itu → 15
the fed → 15
cnnindonesia com → 14
lebih tinggi → 14
nilai tukar → 14
pada hari → 14
kepada cnnindonesia com → 13
nilai tukar rupiah → 12
pertumbuhan ekonomi → 12
tukar rupiah → 12
yang tidak → 12
bank indonesia → 11
berada di posisi →

In [36]:
boilerplate_patterns_cnn = [
    # Dateline / Publisher Prefix
    r'^[A-ZÁÉÍÓÚÜ][a-zA-Z\s]+,\sCNN\sIndonesia\s[-–—]\s*',
    r'CNN\sIndonesia\s[-–—]\s*',

    # Advertisement placeholders
    r'advertisement\s*scroll.*',
    r'scroll\s*to\s*continue.*',
    r'to\s*continue\s*with\s*content.*',
    r'continue\s*with\s*content.*',

    # Attribution / Quotation boilerplate
    r'kepada\s*cnnindonesia(\.com)?',
    r'menurut\s*cnnindonesia(\.com)?',
    r'dikutip\s*dari\s*cnnindonesia(\.com)?',

    # Repetitive Branding
    r'\bcnn\s*indonesia\b',
    r'\bcnnindonesia\s*\.com\b',
    r'jakarta\s*cnn\s*indonesia',

    # Internal Links / Promos
    r'(Baca|Lihat)\s[Jj]uga\s?:.*',
    r'Ikuti\s*(kami|berita)\s.*CNN\sIndonesia.*',

    # Editor / Author Credits
    r'(Reporter|Penulis|Editor|Sumber)\s?:\s?.*',

    # Legal Disclaimer
    r'CNN\sIndonesia\s(tidak\sbertanggung\s*jawa[bp]|tidak\smenanggung).*',

    # Copyright / Footer
    r'©\s*\d{4}\s*CNN\sIndonesia.*',
    r'Hak\s*Cipta\s*Dilindungi.*',

    # Hashtag / Campaigns
    r'#\w+'
]

### Remove Boilerplates

In [37]:
df_cnn['text'] = df_cnn['text'].apply(normalize_text)
df_cnn['clean_text'] = df_cnn['text'].apply(lambda t: remove_boilerplate(t, boilerplate_patterns_cnn))

In [38]:
df_cnn.head()

| Unnamed: 0 | url | domain | text | title | date | clean_text |
|---|---|---|---|---|---|---|
| 40 | https://www.cnnindonesia.com/nasional/20231203... | cnnindonesia.com | Cawapres nomor urut 2, Gibran Rakabuming Raka ... | Gibran Blusukan ke Pasar Rawasari: Harga Cabai... | 2023-12-03 | Cawapres nomor urut 2, Gibran Rakabuming Raka ... |
| 62 | https://www.cnnindonesia.com/nasional/20200430... | cnnindonesia.com | Jakarta, CNN Indonesia - Wakil Ketua MPR dari ... | MPR Ingatkan Bahaya Inflasi Jika Ngotot Cetak ... | 2020-04-30 | Wakil Ketua MPR dari Fraksi Partai Demokrat, S... |
| 67 | https://www.cnnindonesia.com/ekonomi/201809270... | cnnindonesia.com | Jakarta, CNN Indonesia - The Federal Reserve (... | 'Pede' Ekonomi AS Tumbuh, Alasan The Fed Kerek... | 2018-09-27 | The Federal Reserve ( The Fed) percaya diri ek... |
| 77 | https://www.cnnindonesia.com/nasional/20231202... | cnnindonesia.com | Calon presiden nomor urut 3 Ganjar Pranowo men... | Pedagang Pasar Inpres NTT Curhat ke Ganjar Har... | 2023-12-02 | Calon presiden nomor urut 3 Ganjar Pranowo men... |
| 91 | https://www.cnnindonesia.com/ekonomi/201901080... | cnnindonesia.com | Jakarta, CNN Indonesia - Nilai tukar rupiah be... | Rupiah 'Gagah Perkasa' ke Rp14.049 per Dolar AS | 2019-01-08 | Nilai tukar rupiah berada di posisi Rp14.049 p... |


## Tempo

In [39]:
print(df_tempo['domain'].value_counts())
print("Total Unique Domain: ", len(df_tempo['domain'].value_counts()))

domain
tempo.co    38
Name: count, dtype: int64
Total Unique Domain:  1


### Sampling

In [40]:
df_repr_4 = df_tempo.sample(n=30, random_state=42)

In [41]:
df_repr_4.to_csv('tempo_representation.csv', index=False, encoding="utf-8-sig")
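
The `utf-8-sig` encoding passed to `to_csv` above differs from plain `utf-8` only in that it prepends a byte-order mark (BOM), which spreadsheet tools commonly use to detect UTF-8. A small standard-library check, using an arbitrary sample string, illustrates the difference:

```python
import codecs

text = "harga emas"                      # arbitrary sample string
with_bom = text.encode("utf-8-sig")      # BOM + UTF-8 bytes
plain = text.encode("utf-8")             # UTF-8 bytes only

print(with_bom.startswith(codecs.BOM_UTF8))
print(len(with_bom) - len(plain))        # size of the BOM in bytes
# Decoding with utf-8-sig strips the BOM again.
roundtrip = with_bom.decode("utf-8-sig")
```

Files written this way should be read back with `encoding="utf-8-sig"` so the BOM does not end up inside the first column name.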

In [42]:
preview_domain_samples(df_repr_4, domain_col='domain', text_col='text', n_samples=2, char_limit=800)


=== tempo.co ===

Sample 1:
Ringkasan Berita
Pertumbuhan ekonomi Indonesia makin lambat dan tertahan di kisaran 5 persen.
Kekeliruan strategi dalam mengarahkan investasi menyebabkan ekonomi tidak efisien.
Prabowo mengandalkan program populis untuk mendorong pertumbuhan ekonomi.
TREN perlambatan ekonomi Indonesia masih berlanjut. Pada 2022-2023, tingkat pertumbuhan ekonomi kita melambat dari 5,31 persen menjadi 5,05 persen. Tahun lalu, angkanya hanya 5,03 persen.
- Akses edisi mingguan dari Tahun 1971
- Akses penuh seluruh artikel Tempo+
- Baca dengan lebih sedikit gangguan iklan
- Fitur baca cepat di edisi Mingguan
- Anda Mendukung Independensi Jurnalisme Tempo

Sample 2:
TEMPO.CO, Jakarta - Ekonom IPB University Hermanto Siregar mengatakan kebijakan tarif yang diterapkan Presiden Amerika Serikat Donald Trump bisa berdampak buruk bagi Indonesia. Hal ini terjadi bila pemerintah gagal melakukan negosiasi dan tidak memiliki langkah mitigasi yang efektif.
Dalam kajiannya, Hermanto mengata

### Quantitative Frequency Inspection

In [43]:
top_phrases = get_top_ngrams(df_repr_4, text_col='text', ngram_range=(2, 5), min_df=2, top_n=100)


Top 100 n-grams (range=(2, 5), min_df=2):

harga emas → 58
mata uang → 30
amerika serikat → 27
agustus 2025 → 23
dolar as → 22
pertumbuhan ekonomi → 22
edisi mingguan → 20
nilai tukar → 19
bank indonesia → 18
suku bunga → 17
kata dia → 15
salah satu → 14
dengan lebih → 13
hari ini → 12
november 2024 → 12
atau sekitar → 11
di edisi → 11
kebijakan moneter → 11
september 2025 → 11
1971 akses → 10
1971 akses penuh → 10
1971 akses penuh seluruh → 10
1971 akses penuh seluruh artikel → 10
akses edisi → 10
akses edisi mingguan → 10
akses edisi mingguan dari → 10
akses edisi mingguan dari tahun → 10
akses penuh → 10
akses penuh seluruh → 10
akses penuh seluruh artikel → 10
akses penuh seluruh artikel tempo → 10
anda mendukung → 10
anda mendukung independensi → 10
anda mendukung independensi jurnalisme → 10
anda mendukung independensi jurnalisme tempo → 10
artikel tempo → 10
artikel tempo baca → 10
artikel tempo baca dengan → 10
artikel tempo baca dengan lebih → 10
baca cepat → 10
baca cepat di
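
`get_top_ngrams` is defined earlier in the notebook and its implementation is not shown here. As a dependency-free sketch of the same idea, the hypothetical `top_ngrams_sketch` below counts word n-grams over a list of texts, keeps those that occur in at least `min_df` documents, and ranks them by total count. The tokenization rule, the tie-breaking order, and the two sample documents are assumptions for illustration.

```python
import re
from collections import Counter

def top_ngrams_sketch(texts, ngram_range=(2, 5), min_df=2, top_n=100):
    """Hypothetical pure-Python approximation of an n-gram frequency count."""
    total = Counter()     # total occurrences across all documents
    doc_freq = Counter()  # number of documents containing each n-gram
    lo, hi = ngram_range
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        seen = set()
        for n in range(lo, hi + 1):
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                total[gram] += 1
                seen.add(gram)
        doc_freq.update(seen)
    # Keep only n-grams that appear in at least min_df documents.
    kept = [(g, c) for g, c in total.items() if doc_freq[g] >= min_df]
    # Sort by count (descending), then alphabetically for stable output.
    kept.sort(key=lambda gc: (-gc[1], gc[0]))
    return kept[:top_n]

# Made-up documents that share a paywall-style phrase.
docs = [
    "Akses penuh seluruh artikel. Harga emas naik.",
    "Akses penuh seluruh artikel. Harga emas turun, harga emas stabil.",
]
top = top_ngrams_sketch(docs, ngram_range=(2, 3), top_n=5)
for phrase, count in top:
    print(f"{phrase} → {count}")
```

Long runs of overlapping n-grams with identical counts, like the `akses edisi mingguan ...` block in the output above, are a typical signature of a fixed boilerplate string repeated verbatim across articles.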

In [44]:
boilerplate_patterns_tempo = [
    # Dateline / Publisher Prefix
    r'^(?:TEMPO\.CO,?\s*)?[A-ZÁÉÍÓÚÜ\s]{2,30}-\s*',

    # Paywall / Access Prompts
    r'akses\s*penuh.*',
    r'akses\s*edisi.*',
    r'akses\s*edisi\s*mingguan.*',
    r'dari\s*tahun\s*1971.*',
    r'seluruh\s*artikel\s*tempo.*',

    # Support / Donation Messaging
    r'anda\s*mendukung\s*independensi\s*jurnalisme\s*tempo.*',
    r'mendukung\s*independensi\s*jurnalisme.*',
    r'independensi\s*jurnalisme\s*tempo.*',

    # Reading Feature Prompts
    r'fitur\s*baca\s*cepat.*',
    r'baca\s*cepat.*',
    r'baca\s*dengan\s*lebih\s*sedikit\s*gangguan.*',
    r'dengan\s*lebih\s*sedikit\s*gangguan.*',
    r'gangguan\s*iklan.*',
    r'iklan\s*fitur\s*baca.*',

    # Weekly Edition / Promo
    r'edisi\s*mingguan.*',
    r'mingguan\s*anda.*',

    # Article Footer Branding
    r'artikel\s*tempo\s*baca.*',
    r'tempo\s*baca\s*dengan.*',

    # Source / Author Credits
    r'(Reporter|Penulis|Editor|Sumber)\s?:\s?.*',

    # Copyright
    r'©\s*\d{4}\s*TEMPO\.CO.*',
    r'Hak\s*Cipta\s*Dilindungi.*',

    # Hashtags / Campaigns
    r'#\w+'
]
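
The dateline pattern at the top of this list uses an uppercase-only character class, so whether it matches a title-case dateline such as `TEMPO.CO, Jakarta - ` depends on the flags that `remove_boilerplate` passes to the regex engine. The check below is a sketch under the assumption that matching is case-insensitive, which would be consistent with the cleaned output in this section. The `risky` string is a made-up example showing a possible side effect of that assumption.

```python
import re

dateline = r'^(?:TEMPO\.CO,?\s*)?[A-ZÁÉÍÓÚÜ\s]{2,30}-\s*'

# Prefix of a real article lead from the dataset (truncated).
text = "TEMPO.CO, Jakarta - Reformasi 1998 menjadi salah satu peristiwa"

strict = re.sub(dateline, "", text)                        # case-sensitive
relaxed = re.sub(dateline, "", text, flags=re.IGNORECASE)  # case-insensitive

# Made-up lead with a hyphen inside the first 30 characters.
risky = "Harga emas ter-update hari ini"
risky_relaxed = re.sub(dateline, "", risky, flags=re.IGNORECASE)

print(strict)
print(relaxed)
print(risky_relaxed)
```

If case-insensitive matching is indeed in use, any article whose opening words contain a hyphen within the first 30 letters and spaces may lose its first few words, so spot-checking `clean_text` against `text` for such leads is worthwhile.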

### Remove Boilerplate

In [45]:
df_tempo['text'] = df_tempo['text'].apply(normalize_text)
df_tempo['clean_text'] = df_tempo['text'].apply(
    lambda t: remove_boilerplate(t, boilerplate_patterns_tempo)
)

In [48]:
df_tempo.head()

| Unnamed: 0 | url | domain | text | title | date | clean_text |
|---|---|---|---|---|---|---|
| 703 | https://www.tempo.co/ekonomi/target-pertumbuha... | tempo.co | Ringkasan Berita Pencapaian target pertumbuhan... | Tantangan Eksternal Target Pertumbuhan Ekonomi... | 2025-01-19 | Ringkasan Berita Pencapaian target pertumbuhan... |
| 706 | https://www.tempo.co/ekonomi/palu-godam-krisis... | tempo.co | TEMPO.CO, Jakarta - Reformasi 1998 menjadi sal... | Palu Godam Krisis Moneter 1998 Menjelang Refor... | 2025-05-20 | Reformasi 1998 menjadi salah satu peristiwa se... |
| 708 | https://www.tempo.co/ekonomi/core-indonesia-ta... | tempo.co | KEBIJAKAN tarif resiprokal Amerika Serikat men... | CORE Indonesia: Tarif Resiprokal Trump Ancam E... | 2025-08-21 | KEBIJAKAN tarif resiprokal Amerika Serikat men... |
| 921 | https://www.tempo.co/ekonomi/subsidi-energi-de... | tempo.co | Ringkasan Berita Formulasi atau format subsidi... | Mengapa Subsidi Energi lewat Bantuan Langsung ... | 2024-11-07 | Ringkasan Berita Formulasi atau format subsidi... |
| 1589 | https://www.tempo.co/ekonomi/alasan-purbaya-in... | tempo.co | MENTERI Keuangan Purbaya Yudhi Sadewa mengungk... | Alasan Purbaya Ingin Alihkan Dana Pemerintah R... | 2025-09-11 | MENTERI Keuangan Purbaya Yudhi Sadewa mengungk... |


# Save Result

In [54]:
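import os

dfs = {
    "df_bisnis": df_bisnis,
    "df_liputan6": df_liputan6,
    "df_cnn": df_cnn,
    "df_tempo": df_tempo
}

# Make sure the output directory exists before writing.
os.makedirs("result", exist_ok=True)

for name, df in dfs.items():
    # Drop the raw text so only the cleaned version is saved.
    if 'text' in df.columns:
        df = df.drop(columns='text')
    df.to_csv(f"result/{name}.csv", index=False)

print("Saved")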

Saved
