# Jakarta Post Business News Scraper

A Jupyter notebook version of the Jakarta Post business news scraper.

## Features
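- Scrapes article listings from the business section (`/business/latest`)
- Configurable look-back window in days (`DAYS_BACK`)
- Optional full-content fetching (body text, author, category, tags)
- Duplicate prevention: URLs already in the output file are skipped
- Saves results to JSON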

## 1. Install Dependencies

Run this cell to install required packages:

In [None]:
# Install required packages
!pip install requests beautifulsoup4 lxml pandas -q  # pandas is used in section 7
print("Dependencies installed!")

## 2. Import Libraries

In [1]:
import json
import re
import os
from datetime import datetime, timedelta
from typing import Optional, List, Dict, Tuple, Set
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

print("Libraries imported successfully!")

Libraries imported successfully!


## 3. Configuration

Set your scraping parameters here:

In [2]:
# Configuration
DAYS_BACK = 2  # Number of days back to scrape
OUTPUT_FILE = "news_data.json"  # Output JSON file
BASE_URL = "https://www.thejakartapost.com/business/latest"  # Base URL to scrape
FETCH_CONTENT = True  # Set to True to fetch full article content (slower)

print(f"Configuration:")
print(f"  Days back: {DAYS_BACK}")
print(f"  Output file: {OUTPUT_FILE}")
print(f"  Fetch content: {FETCH_CONTENT}")

Configuration:
  Days back: 2
  Output file: news_data.json
  Fetch content: True
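
Inside the scraper, `DAYS_BACK` becomes a cutoff timestamp (`datetime.now() - timedelta(days=days_back)`). The arithmetic can be sketched as follows. The helper name `compute_cutoff` and the fixed `now` value are illustrative assumptions, not part of the notebook:

```python
from datetime import datetime, timedelta

def compute_cutoff(days_back: int, now: datetime) -> datetime:
    """Return the timestamp before which articles count as too old."""
    return now - timedelta(days=days_back)

# Fixed, illustrative "now"; the notebook itself uses datetime.now().
now = datetime(2026, 2, 18, 9, 20)
cutoff = compute_cutoff(2, now)
print(cutoff)  # 2026-02-16 09:20:00

# Dates taken from article URLs carry no time of day, so they sit at midnight.
article_date = datetime(2026, 2, 16)
print(article_date < cutoff)  # True: an article dated on the cutoff day counts as older
```

Because the cutoff keeps the current time of day, an article dated on the cutoff day itself compares as older. Comparing `.date()` values instead would make the boundary day inclusive.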


## 4. Scraper Class Definition

In [3]:
class JakartaPostScraper:
    def __init__(
        self,
        days_back: int = 2,
        output_file: str = "news_data.json",
        base_url: str = "https://www.thejakartapost.com/business/latest",
        fetch_content: bool = False
    ):
        self.days_back = days_back
        self.output_file = output_file
        self.base_url = base_url
        self.fetch_content = fetch_content
        self.cutoff_date = datetime.now() - timedelta(days=days_back)
        self.existing_urls = self._load_existing_urls()
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.0'
        })
        self.article_url_pattern = re.compile(r'/business/\d{4}/\d{2}/\d{2}/')
        self.scraped_articles: List[Dict] = []
        self.new_articles: List[Dict] = []
    
    def _load_existing_urls(self) -> Set[str]:
        if not os.path.exists(self.output_file):
            return set()
        try:
            with open(self.output_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
                return {item['url'] for item in data if 'url' in item}
        except (json.JSONDecodeError, KeyError, TypeError):  # TypeError: file is valid JSON but not a list of dicts
            return set()
    
    def _load_existing_data(self) -> List[Dict]:
        if not os.path.exists(self.output_file):
            return []
        try:
            with open(self.output_file, 'r', encoding='utf-8') as f:
                return json.load(f)
        except json.JSONDecodeError:
            return []
    
    def _save_data(self, data: List[Dict]):
        with open(self.output_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False, default=str)
        print(f"Saved {len(data)} articles to {self.output_file}")
    
    def _extract_date_from_url(self, url: str) -> Optional[datetime]:
        match = re.search(r'/business/(\d{4})/(\d{2})/(\d{2})/', url)
        if match:
            year, month, day = match.groups()
            try:
                return datetime(int(year), int(month), int(day))
            except ValueError:
                return None
        return None
    
    def _fetch_page(self, page_num: int) -> Optional[BeautifulSoup]:
        url = f"{self.base_url}?page={page_num}"
        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
            return BeautifulSoup(response.content, 'html.parser')
        except requests.RequestException as e:
            print(f"Error fetching page {page_num}: {e}")
            return None
    
    def _fetch_article_content(self, url: str) -> Dict:
        content_data = {
            'content': '',
            'author': '',
            'category': '',
            'tags': [],
            'fetch_error': None
        }
        
        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Extract author
            author_elem = soup.find('meta', attrs={'name': 'author'})
            if author_elem:
                content_data['author'] = str(author_elem.get('content', '')).strip()
            else:
                author_elem = soup.find('span', class_='tjp-meta__label')
                if author_elem:
                    content_data['author'] = author_elem.get_text(strip=True)
            
            # Extract category
            category_elem = soup.find('li', class_='tjp-breadcrumb__list-item active')
            if category_elem:
                content_data['category'] = category_elem.get_text(strip=True)
            
            # Extract content
            content_elem = soup.select_one('div.tjp-single__content')
            if content_elem:
                paragraphs = content_elem.find_all('p')
                if paragraphs:
                    content_text = '\n\n'.join([p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True)])
                    if len(content_text) > 50:
                        content_data['content'] = content_text
            
            # Fallback
            if not content_data['content']:
                article_area = soup.find('div', class_='tjp-single') or soup.find('article')
                if article_area:
                    paragraphs = article_area.find_all('p')
                    content_text = '\n\n'.join([p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True)])
                    if len(content_text) > 50:
                        content_data['content'] = content_text
            
            # Extract tags
            keywords_elem = soup.find('meta', attrs={'name': 'keywords'})
            if keywords_elem:
                keywords = keywords_elem.get('content', '')
                if keywords:
                    content_data['tags'] = [tag.strip() for tag in str(keywords).split(',') if tag.strip()]
            
        except Exception as e:
            content_data['fetch_error'] = str(e)
        
        return content_data
    
    def _extract_article_data(self, article_elem, skip_existing: bool = True) -> Optional[Dict]:
        try:
            link_elem = article_elem.find('a', href=True)
            if not link_elem:
                return None
            
            url = urljoin("https://www.thejakartapost.com", link_elem['href'])
            
            if not self.article_url_pattern.search(url):
                return None
            
            is_existing = url in self.existing_urls
            if is_existing and skip_existing:
                return None
            
            title_elem = article_elem.find('h2', class_='titleNews')
            if not title_elem:
                title_elem = article_elem.find(['h2', 'h3', 'h1'])
            if not title_elem:
                title_elem = link_elem
            
            title = title_elem.get_text(strip=True) if title_elem else "No title"
            
            article_date = self._extract_date_from_url(url)
            date_str = None
            
            date_elem = article_elem.find('span', class_='date')
            if date_elem:
                date_text = date_elem.get_text(strip=True)
                parsed_date = self._try_parse_date(date_text)
                if parsed_date:
                    article_date = parsed_date
            
            summary_elem = article_elem.find('p')
            summary = summary_elem.get_text(strip=True) if summary_elem else ""
            
            return {
                'title': title,
                'url': url,
                'date': article_date.isoformat() if article_date else date_str,
                'date_parsed': article_date,
                'summary': summary,
                'is_existing': is_existing,
                'scraped_at': datetime.now().isoformat()
            }
        except Exception as e:
            print(f"Error extracting article: {e}")
            return None
    
    def _try_parse_date(self, date_str: str) -> Optional[datetime]:
        date_formats = [
            '%Y-%m-%dT%H:%M:%S', '%Y-%m-%dT%H:%M:%S%z',
            '%Y-%m-%d %H:%M:%S', '%B %d, %Y', '%b %d, %Y',
            '%d %B %Y', '%d %b %Y', '%Y-%m-%d',
        ]
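        for fmt in date_formats:
            try:
                parsed = datetime.strptime(date_str.strip(), fmt)
            except ValueError:
                continue
            # Drop tzinfo (set by the %z format) so that comparisons with the
            # naive cutoff_date do not raise TypeError
            return parsed.replace(tzinfo=None)
        return None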
    
    def _find_articles_on_page(self, soup: BeautifulSoup) -> Tuple[List[Dict], List[Dict]]:
        all_articles = []
        new_articles = []
        processed_urls = set()
        
        list_news = soup.find_all('div', class_='listNews')
        
        for elem in list_news:
            article_data = self._extract_article_data(elem, skip_existing=False)
            if article_data and article_data['url'] not in processed_urls:
                all_articles.append(article_data)
                processed_urls.add(article_data['url'])
                if not article_data['is_existing']:
                    new_articles.append(article_data)
        
        if not all_articles:
            all_links = soup.find_all('a', href=True)
            for link in all_links:
                href = str(link.get('href', ''))
                if self.article_url_pattern.search(href):
                    parent = link.find_parent(['div', 'article', 'li', 'section'])
                    if parent:
                        article_data = self._extract_article_data(parent, skip_existing=False)
                        if article_data and article_data['url'] not in processed_urls:
                            all_articles.append(article_data)
                            processed_urls.add(article_data['url'])
                            if not article_data['is_existing']:
                                new_articles.append(article_data)
        
        return all_articles, new_articles
    
    def _should_stop_pagination(self, all_articles: List[Dict], page_num: int) -> bool:
        if not all_articles:
            print(f"  No articles found on page {page_num}, stopping")
            return True
        
        dated_articles = [a for a in all_articles if a.get('date_parsed')]
        
        if dated_articles and all(a['date_parsed'] < self.cutoff_date for a in dated_articles):
            print(f"  All articles on page {page_num} are older than cutoff")
            return True
        
        return False
    
    def scrape(self) -> List[Dict]:
        print(f"\nStarting scrape with {self.days_back} days back (cutoff: {self.cutoff_date.date()})\n")
        
        all_articles = self._load_existing_data()
        new_articles = []
        page_num = 1
        stop_pagination = False
        
        while not stop_pagination:
            print(f"Fetching page {page_num}...")
            soup = self._fetch_page(page_num)
            if not soup:
                print(f"  Failed to fetch page {page_num}")
                break
            
            page_all_articles, page_new_articles = self._find_articles_on_page(soup)
            
            print(f"  Found {len(page_all_articles)} total ({len(page_new_articles)} new) articles")
            
            for article in page_new_articles:
                new_articles.append(article)
                self.existing_urls.add(article['url'])
            
            if self._should_stop_pagination(page_all_articles, page_num):
                stop_pagination = True
            
            page_num += 1
            if page_num > 100:
                print("  Reached maximum page limit (100)")
                break
        
        if new_articles:
            if self.fetch_content:
                print(f"\nFetching content for {len(new_articles)} new articles...\n")
                for i, article in enumerate(new_articles, 1):
                    print(f"  [{i}/{len(new_articles)}] {article['title'][:50]}...")
                    content_data = self._fetch_article_content(article['url'])
                    article['content'] = content_data['content']
                    article['author'] = content_data['author']
                    article['category'] = content_data['category']
                    article['tags'] = content_data['tags']
                    if content_data['fetch_error']:
                        article['content_error'] = content_data['fetch_error']
            
            all_articles.extend(new_articles)
            self._save_data(all_articles)
            print(f"\nAdded {len(new_articles)} new articles")
        else:
            print("\nNo new articles found")
        
        self.scraped_articles = all_articles
        self.new_articles = new_articles
        return new_articles

print("Scraper class defined!")

Scraper class defined!
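
The class recognises article links and dates them from the `/business/YYYY/MM/DD/` segment of the URL. A standalone sketch of that step, mirroring `_extract_date_from_url`, is shown here. The function name and the example slugs are made up for illustration:

```python
import re
from datetime import datetime
from typing import Optional

# Same pattern that _extract_date_from_url uses.
ARTICLE_DATE_RE = re.compile(r'/business/(\d{4})/(\d{2})/(\d{2})/')

def extract_date_from_url(url: str) -> Optional[datetime]:
    """Return the date embedded in an article URL, or None if absent or invalid."""
    match = ARTICLE_DATE_RE.search(url)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:  # digits matched but do not form a real date, e.g. month 13
        return None

# Hypothetical URLs; only the path shape matters.
article = extract_date_from_url("https://www.thejakartapost.com/business/2026/02/18/example-slug.html")
listing = extract_date_from_url("https://www.thejakartapost.com/business/latest?page=2")
invalid = extract_date_from_url("https://www.thejakartapost.com/business/2026/13/40/example-slug.html")
print(article, listing, invalid)
```

A URL-derived date has no time component, so every article dated this way lands at midnight of its publication day.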


## 5. Run the Scraper

Execute the scraping process:

In [4]:
# Initialize and run scraper
scraper = JakartaPostScraper(
    days_back=DAYS_BACK,
    output_file=OUTPUT_FILE,
    base_url=BASE_URL,
    fetch_content=FETCH_CONTENT
)

# Run the scrape
new_articles = scraper.scrape()

print(f"\n{'='*60}")
print(f"SCRAPING COMPLETE!")
print(f"{'='*60}")
print(f"New articles added: {len(new_articles)}")
print(f"Total articles in file: {len(scraper.scraped_articles)}")


Starting scrape with 2 days back (cutoff: 2026-02-16)

Fetching page 1...
  Found 28 total (28 new) articles
Fetching page 2...
  Found 28 total (28 new) articles
  All articles on page 2 are older than cutoff

Fetching content for 56 new articles...

  [1/56] Bumi Resources Minerals says Palu operations unaff...
  [2/56] New state miner Perminas signs MoU on Gabon rare e...
  [3/56] As Global Attention Turns to the US: What Are Its ...
  [4/56] Coal must go for climate funds to flow, German amb...
  [5/56] Trump says Japan to invest in energy, industrial p...
  [6/56] Govt urged not to compromise too much for US trade...
  [7/56] Unlocking regional potential to boost economic gro...
  [8/56] Job threats, rogue bots: five hot issues in AI...
  [9/56] Dollar holds gains in thin trading as markets awai...
  [10/56] Idle oil and gas projects risk losing licenses, Ba...
  [11/56] Site selection for first nuclear plant expected by...
  [12/56] Asian markets sluggish as Lunar New Year holid

## 6. View Results

Display the scraped data:

In [5]:
# Display summary
print(f"\n{'='*60}")
print("SCRAPING SUMMARY")
print(f"{'='*60}\n")

print(f"Total new articles: {len(new_articles)}")
print(f"Total articles in database: {len(scraper.scraped_articles)}")

if new_articles:
    # Show date distribution
    from collections import Counter
    dates = [(a.get('date') or 'unknown')[:10] for a in new_articles]  # 'date' may be present but None
    date_counts = Counter(dates)
    
    print(f"\nArticles by date:")
    for date in sorted(date_counts.keys()):
        print(f"  {date}: {date_counts[date]} articles")
    
    # Content stats if available
    if FETCH_CONTENT:
        with_content = sum(1 for a in new_articles if a.get('content'))
        total_chars = sum(len(a.get('content', '')) for a in new_articles)
        print(f"\nContent stats:")
        print(f"  Articles with content: {with_content}/{len(new_articles)}")
        print(f"  Total content chars: {total_chars:,}")
        print(f"  Avg content length: {total_chars // len(new_articles) if new_articles else 0:,} chars")


SCRAPING SUMMARY

Total new articles: 56
Total articles in database: 56

Articles by date:
  2026-02-06: 3 articles
  2026-02-07: 2 articles
  2026-02-08: 3 articles
  2026-02-09: 6 articles
  2026-02-10: 9 articles
  2026-02-11: 5 articles
  2026-02-12: 5 articles
  2026-02-13: 5 articles
  2026-02-14: 2 articles
  2026-02-15: 3 articles
  2026-02-16: 4 articles
  2026-02-17: 4 articles
  2026-02-18: 5 articles

Content stats:
  Articles with content: 56/56
  Total content chars: 157,040
  Avg content length: 2,804 chars
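
The summary lists articles dated as far back as 2026-02-06 even though the cutoff was 2026-02-16. In the class as written, the cutoff only decides when pagination stops, and every article found on a fetched page is kept. If only articles inside the window are wanted, a post-filter along these lines might help. This is a sketch: `filter_recent` is a hypothetical helper, and the sample records are invented:

```python
from datetime import datetime
from typing import Dict, List

def filter_recent(articles: List[Dict], cutoff: datetime) -> List[Dict]:
    """Keep articles whose ISO 'date' falls on or after the cutoff's calendar day."""
    kept = []
    for article in articles:
        date_str = article.get('date')
        if not date_str:
            continue  # undated articles are dropped in this sketch
        if datetime.fromisoformat(date_str).date() >= cutoff.date():
            kept.append(article)
    return kept

# Invented records using the same 'date' format the scraper writes.
sample = [
    {'title': 'recent', 'date': '2026-02-18T00:00:00'},
    {'title': 'boundary', 'date': '2026-02-16T00:00:00'},
    {'title': 'old', 'date': '2026-02-06T00:00:00'},
    {'title': 'undated', 'date': None},
]
recent = filter_recent(sample, cutoff=datetime(2026, 2, 16, 9, 20))
print([a['title'] for a in recent])
```

Comparing calendar days rather than full timestamps is a design choice of this sketch: it keeps articles published on the cutoff day, which the timestamp comparison in `_should_stop_pagination` would treat as older.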


In [6]:
# Display sample article
if new_articles:
    print(f"\n{'='*60}")
    print("SAMPLE ARTICLE")
    print(f"{'='*60}\n")
    
    sample = new_articles[0]
    
    for key, value in sample.items():
        if key == 'content' and value:
            print(f"{key}:")
            print(f"  {value[:500]}..." if len(value) > 500 else f"  {value}")
        elif isinstance(value, list):
            print(f"{key}: {value}")
        elif isinstance(value, str) and len(value) > 100:
            print(f"{key}: {value[:100]}...")
        else:
            print(f"{key}: {value}")


SAMPLE ARTICLE

title: Bumi Resources Minerals says Palu operations unaffected by site closure
url: https://www.thejakartapost.com/business/2026/02/18/bumi-resources-minerals-says-palu-operations-unaf...
date: 2026-02-18T00:00:00
date_parsed: 2026-02-18 00:00:00
summary: The publicly listed miner, while confirming the closure of a site within subsidiary CPM's forest con...
is_existing: False
scraped_at: 2026-02-18T09:20:55.717217
content:
  T Bumi Resources Minerals (BRM) says the government's closure of a contested gold mine at its concession in Palu, Central Sulawesi, will not disrupt core production activities, as the site in question was not yet operational.

The publicly listed miner said in a statement on Monday that the forest area enforcement task force (Satgas PKH) had sealed off a location within a contract of work area managed by its subsidiary, Citra Palu Minerals (CPM).

The Satgas PKH team moved to seal the site after...
author: Divya Karyza
category: Companies
tags: ['B

## 7. Data Export & Analysis

Export or analyze the scraped data:

In [7]:
# Load and display all data
import pandas as pd

if os.path.exists(OUTPUT_FILE):
    with open(OUTPUT_FILE, 'r', encoding='utf-8') as f:
        all_data = json.load(f)
    
    # Convert to DataFrame for easier analysis
    df = pd.DataFrame(all_data)
    
    print(f"\nDataFrame shape: {df.shape}")
    print(f"\nColumns: {list(df.columns)}\n")
    display(df.head())
else:
    print("No data file found")


DataFrame shape: (56, 11)

Columns: ['title', 'url', 'date', 'date_parsed', 'summary', 'is_existing', 'scraped_at', 'content', 'author', 'category', 'tags']



| | title | url | date | date_parsed | summary | is_existing | scraped_at | content | author | category | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bumi Resources Minerals says Palu operations u... | https://www.thejakartapost.com/business/2026/0... | 2026-02-18T00:00:00 | 2026-02-18 00:00:00 | The publicly listed miner, while confirming th... | False | 2026-02-18T09:20:55.717217 | T Bumi Resources Minerals (BRM) says the gover... | Divya Karyza | Companies | [BRMS, mining, PKH] |
| 1 | New state miner Perminas signs MoU on Gabon ra... | https://www.thejakartapost.com/business/2026/0... | 2026-02-18T00:00:00 | 2026-02-18 00:00:00 | PT Perusahaan Mineral Nasional and Danantara m... | False | 2026-02-18T09:20:55.717752 | he country’s newly established state-owned min... | Ruth Dea Juwita | Companies | [Perminas, Danantara, rare-earths, investment] |
| 2 | As Global Attention Turns to the US: What Are ... | https://www.thejakartapost.com/business/2026/0... | 2026-02-18T00:00:00 | 2026-02-18 00:00:00 | With much of the global conversation currently... | False | 2026-02-18T09:20:55.718037 | ith much of the global conversation currently ... | Creative Desk | Companies | [adv-BC, adv-usaseanbusinesscouncil, AWS] |
| 3 | Coal must go for climate funds to flow, German... | https://www.thejakartapost.com/business/2026/0... | 2026-02-18T00:00:00 | 2026-02-18 00:00:00 | Private funds could flow elsewhere without coa... | False | 2026-02-18T09:20:55.718265 | ermany says billions of dollars in private cli... | Ruth Dea Juwita | Regulations | [Germany, JETP, energy-transition, coal-power-... |
| 4 | Trump says Japan to invest in energy, industri... | https://www.thejakartapost.com/business/2026/0... | 2026-02-18T00:00:00 | 2026-02-18 00:00:00 | United States President Donald Trump's adminis... | False | 2026-02-18T09:20:55.718472 | nited States President Donald Trump's administ... | David Lawder and Jarrett Renshaw | Economy | [US-Japan, trade-agreement, investment, oil-an... |


In [8]:
# Export to CSV (optional)
if os.path.exists(OUTPUT_FILE):
    csv_file = OUTPUT_FILE.replace('.json', '.csv')
    df.to_csv(csv_file, index=False, encoding='utf-8')
    print(f"Exported to {csv_file}")

Exported to news_data.csv
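
When a DataFrame column holds Python lists, as `tags` does here, `to_csv` should write each cell as the list's string form (for example `['BRMS', 'mining', 'PKH']`), which is awkward to parse back. One alternative is to join the tags into a single delimited cell before export. The sketch below uses only the standard library; the helper name `articles_to_csv`, the `;` separator, and the sample record are assumptions for illustration:

```python
import csv
import io
from typing import Dict, List

def articles_to_csv(articles: List[Dict], fields: List[str], tag_sep: str = ';') -> str:
    """Serialize articles to CSV text, joining each 'tags' list into one cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction='ignore', lineterminator='\n'
    )
    writer.writeheader()
    for article in articles:
        row = dict(article)  # copy so the caller's record is not modified
        row['tags'] = tag_sep.join(row.get('tags') or [])
        writer.writerow(row)
    return buffer.getvalue()

# Invented record; keys outside `fields` (here 'content') are ignored.
sample = [{'title': 'Example headline', 'tags': ['BRMS', 'mining', 'PKH'], 'content': 'ignored'}]
csv_text = articles_to_csv(sample, fields=['title', 'tags'])
print(csv_text)
```

Reading the file back then only needs `cell.split(';')` to recover the list, provided no tag contains the separator.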


## 8. Interactive Exploration

Search and filter the scraped articles:

In [9]:
# Search articles by keyword
keyword = ""  # Enter your search term here

if keyword and new_articles:
    matches = [
        a for a in new_articles 
        if keyword.lower() in a.get('title', '').lower() 
        or keyword.lower() in a.get('summary', '').lower()
        or (FETCH_CONTENT and keyword.lower() in a.get('content', '').lower())
    ]
    
    print(f"Found {len(matches)} articles matching '{keyword}':\n")
    for i, article in enumerate(matches[:5], 1):
        print(f"{i}. {article['title']}")
        print(f"   Date: {(article.get('date') or 'N/A')[:10]}")  # 'date' may be present but None
        print(f"   URL: {article['url']}\n")
else:
    print("Enter a keyword in the 'keyword' variable above to search")

Enter a keyword in the 'keyword' variable above to search
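
The same case-insensitive match can be wrapped in a reusable function so that several keywords can be tried without editing the cell. This is a sketch: `search_articles` is a hypothetical helper, and the sample records are invented:

```python
from typing import Dict, List, Sequence

def search_articles(
    articles: List[Dict],
    keyword: str,
    fields: Sequence[str] = ('title', 'summary', 'content'),
) -> List[Dict]:
    """Return articles whose listed text fields contain the keyword, ignoring case."""
    needle = keyword.strip().lower()
    if not needle:
        return []  # an empty keyword matches nothing rather than everything
    return [
        article for article in articles
        if any(needle in (article.get(field) or '').lower() for field in fields)
    ]

# Invented records; missing or None fields are treated as empty strings.
sample = [
    {'title': 'Example coal story', 'summary': ''},
    {'title': 'Example currency story', 'summary': 'Markets await data', 'content': None},
]
coal_hits = search_articles(sample, 'COAL')
print([a['title'] for a in coal_hits])
```

Fields that are absent from a record, such as `content` when `FETCH_CONTENT` is off, are skipped, so no separate flag check is needed.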


In [10]:
print("Done")

Done
