# TechInAsia Scraper: A Computational Narrative

This notebook walks step by step through a web scraping script that extracts articles from the artificial intelligence category of the TechInAsia website. The script uses Selenium to load dynamic pages and BeautifulSoup to parse the HTML. The scraped data is organized into a pandas DataFrame and saved as CSV files.

This notebook is structured into five main parts:

1.  **Setup and Configuration**: This section covers the necessary imports, logging setup, and the definition of data structures and configuration classes.
2.  **Web Driver and Scrolling**: Here, we initialize the Selenium WebDriver and implement a scrolling mechanism to load more content on the page.
3.  **Article Parsing**: This part focuses on extracting relevant information from each article element on the page.
4.  **Scraping Logic**: This section details the main scraping process, including retry mechanisms and data processing.
5.  **Execution and Summary**: Finally, we execute the scraper, display a sample of the results, and provide a summary of the scraping process.
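
The configuration class in part 1 layers caller overrides on top of a dictionary of defaults. The following is a minimal standalone sketch of that merge pattern; the class name `MiniConfig` and its reduced set of keys are illustrative assumptions, not part of the scraper.

```python
class MiniConfig:
    """Illustrative stand-in for ScraperConfig with a reduced set of keys."""
    DEFAULTS = {"num_articles": 50, "max_scrolls": 10, "min_delay": 1, "max_delay": 3}

    def __init__(self, **overrides):
        # Later keys win, so an override replaces the default of the same name.
        self.config = {**self.DEFAULTS, **overrides}
        # Expose every setting as an attribute (cfg.num_articles, ...).
        self.__dict__.update(self.config)


cfg = MiniConfig(num_articles=5)
print(cfg.num_articles, cfg.max_scrolls)  # 5 10
```

Because the merge builds a new dictionary, the class-level defaults are left unchanged between instances.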

In [14]:
# Code Cell
import logging
import time
from datetime import datetime
from typing import List, Dict, Optional
from dataclasses import dataclass, field
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
from random import uniform
import os
import re
from dateutil import parser
import argparse
import undetected_chromedriver as uc
from tqdm import tqdm

# Set up logging
log_dir = "logs"
os.makedirs(log_dir, exist_ok=True)
log_filename = os.path.join(log_dir, f"logs_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log")
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_filename),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

@dataclass
class Article:
    """Data class for storing article information"""
    article_id: Optional[str]
    title: str
    article_url: str
    source: Optional[str]
    source_url: Optional[str]
    image_url: Optional[str]
    posted_time: Optional[str]
    relative_time: str
    categories: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    scraped_at: str = field(default_factory=lambda: datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    posted_time_iso: Optional[str] = None

    def __post_init__(self):
        """Convert posted_time to ISO format"""
        if self.posted_time:
            try:
                dt = parser.parse(self.posted_time)
                self.posted_time_iso = dt.isoformat()
            except (ValueError, TypeError) as e:
                logger.warning(f"Failed to parse posted_time: {self.posted_time}, error: {e}")
                self.posted_time_iso = None

    def to_dict(self) -> Dict:
        """Convert the Article instance to a dictionary"""
        return {
            'article_id': self.article_id,
            'title': self.title,
            'article_url': self.article_url,
            'source': self.source,
            'source_url': self.source_url,
            'image_url': self.image_url,
            'posted_time': self.posted_time,
            'posted_time_iso': self.posted_time_iso,
            'relative_time': self.relative_time,
            'categories': ','.join(self.categories),
            'tags': ','.join(self.tags),
            'scraped_at': self.scraped_at
        }

class ScraperConfig:
    """Configuration management for the scraper"""
    DEFAULT_CONFIG = {
        'num_articles': 50,
        'max_scrolls': 10,
        'timeout': 20,
        'retry_count': 3,
        'scroll_pause_time': 1.5,
        'batch_size': 100,
        'base_url': 'https://www.techinasia.com/news',
        'output_dir': 'output',
        'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'min_delay': 1,
        'max_delay': 3
    }

    def __init__(self, **kwargs):
        """Initialize the configuration with default and custom values"""
        self.config = {**self.DEFAULT_CONFIG, **kwargs}
        self.__dict__.update(self.config)

class ScrollManager:
    """Manages page scrolling"""
    def __init__(self, driver, config):
        self.driver = driver
        self.config = config

    def scroll_page(self) -> bool:
        """Scroll the page and return True if new content was loaded"""
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(uniform(self.config.min_delay, self.config.max_delay))
        new_height = self.driver.execute_script("return document.body.scrollHeight")
        return new_height > last_height

## 2. Web Driver and Scrolling

This section focuses on setting up the Selenium WebDriver and managing page scrolling:

-   **WebDriver Initialization**: The `setup_driver` method configures Chrome options (headless mode, sandbox and GPU flags, and a custom user agent) and starts the browser through `undetected_chromedriver`, which is intended to reduce detection by anti-bot systems.
-   **Scroll Management**: The `ScrollManager` class, defined in the setup cell above, scrolls the page to load more content. Its `scroll_page` method scrolls to the bottom, waits for a random delay between `min_delay` and `max_delay` seconds, and reports whether the page height grew.
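
The height comparison in `scroll_page` can be exercised without a browser. The sketch below is an assumption-laden illustration: `FakeDriver` is a hypothetical stub (not a Selenium class) that returns a scripted sequence of page heights, and `scroll_once` repeats the same check with the delay omitted.

```python
class FakeDriver:
    """Hypothetical stand-in for a Selenium driver; not part of Selenium."""

    def __init__(self, heights):
        self._heights = list(heights)  # page height after each scroll
        self._index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[self._index]
        # Any other script is treated as a scroll to the bottom of the page.
        if self._index < len(self._heights) - 1:
            self._index += 1
        return None


def scroll_once(driver) -> bool:
    """Same height comparison as ScrollManager.scroll_page, without the delay."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    new_height = driver.execute_script("return document.body.scrollHeight")
    return new_height > last_height


driver = FakeDriver([1000, 1800, 1800])
results = [scroll_once(driver) for _ in range(3)]
print(results)  # [True, False, False]
```

One consequence of this design is that a page which loads content more slowly than the delay will look exhausted, so the delay bounds may need tuning for slow connections.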

In [15]:
# Code Cell
class TechInAsiaScraper:
    def __init__(self, config: Optional[Dict] = None):
        """Initialize the scraper with configuration"""
        self.config = ScraperConfig(**(config or {}))
        self.driver = None
        self.scroll_manager = None
        self._setup_output_directory()
        self.processed_article_ids = set()
        self.incomplete_articles = 0
        self.total_articles = 0

    def _setup_output_directory(self):
        """Create output directory if it doesn't exist"""
        os.makedirs(self.config.output_dir, exist_ok=True)

    def setup_driver(self):
        """Initialize a headless Chrome WebDriver via undetected_chromedriver"""
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-gpu')
        options.add_argument(f'user-agent={self.config.user_agent}')

        try:
            driver_manager = ChromeDriverManager()
            driver_path = driver_manager.install()
            self.driver = uc.Chrome(
                service=Service(driver_path),
                options=options,
                use_subprocess=True
            )
            self.scroll_manager = ScrollManager(self.driver, self.config)
            logger.info("WebDriver initialized successfully 🚀")
        except Exception as e:
            logger.error(f"Failed to initialize WebDriver: {e} 😞")
            raise

## 3. Article Parsing

This section details how individual article elements are parsed to extract relevant information:

-   **Data Extraction**: The `parse_article` method uses BeautifulSoup to find and extract the title, article URL, source, source URL, image URL, posted time, categories, and tags from each article element.
-   **Data Cleaning**: The `_clean_article_data` method normalizes the extracted data by replacing the `'N/A'` placeholder in `source`, `image_url`, and `source_url` with `None`.
-   **Validation**: The `_is_valid_article` method checks the one required field, `article_url`, and logs a warning when it is missing.
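
The URL handling and placeholder cleaning described above can be tried in isolation. This is a minimal sketch using only the standard library: the helper names `normalize_link` and `na_to_none` and the sample paths are hypothetical, while the regular expression and the base URL are the ones used in `parse_article`.

```python
import re
from typing import Optional, Tuple

BASE_URL = "https://www.techinasia.com"


def normalize_link(href: str) -> Tuple[str, Optional[str]]:
    """Return (absolute_url, article_id) for a relative or absolute href."""
    url = href if href.startswith("http") else f"{BASE_URL}{href}"
    # The ID is the last path segment: everything after the final slash.
    match = re.search(r"/([^/]+)$", href)
    return url, (match.group(1) if match else None)


def na_to_none(value: Optional[str]) -> Optional[str]:
    """Map the 'N/A' placeholder to None, as _clean_article_data does."""
    return None if value == "N/A" else value


print(normalize_link("/news/sample-article"))  # hypothetical path
print(normalize_link("/news/sample-article/"))  # trailing slash: no ID found
print(na_to_none("N/A"), na_to_none("CNBC"))
```

Note that with this pattern an href ending in a slash yields no ID, so such links would need the slash stripped first if they occur on the page.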

In [16]:
# Code Cell
def _is_valid_article(self, article: Article) -> bool:
    """Validate if the article has all required fields"""
    if not article.article_url:
        logger.warning(f"⚠️ Skipping article with missing article_url: {article.article_id}")
        return False
    return True

def _clean_article_data(self, article: Article) -> Article:
    """Clean and normalize article data"""
    if article.source == 'N/A':
        article.source = None
    if article.image_url == 'N/A':
        article.image_url = None
    if article.source_url == 'N/A':
        article.source_url = None
    return article

def parse_article(self, article_element) -> Optional[Article]:
    """Parse a single article element"""
    article_id = None
    title = 'N/A'
    article_url = None
    source = None
    source_url = None
    image_url = None
    posted_time = None
    relative_time = 'N/A'
    categories = []
    tags = []

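    try:
        logger.info("Parsing article element...")
        content_div = article_element.find('div', class_='post-content')
        if content_div is None:
            # Without the content block there is no title or link to extract
            logger.warning("⚠️ Skipping element without a 'post-content' div")
            return None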

        # Extract article information
        logger.info(f"  - Extracting title...")
        title_element = content_div.find('h3', class_='post-title')
        title = title_element.text.strip() if title_element else 'N/A'
        logger.info(f"  - Title extracted: {title}")

        # Get article URL and ID
        logger.info(f"  - Extracting article URL and ID...")
        article_links = [a for a in content_div.find_all('a') if 'post-source' not in a.get('class', [])]
        if article_links:
            href = article_links[0]['href']
            article_url = f"https://www.techinasia.com{href}" if not href.startswith('http') else href
            match = re.search(r'/([^/]+)$', href)
            article_id = match.group(1) if match else None
            logger.info(f"  - Article URL extracted: {article_url}, ID: {article_id}")
        else:
            logger.warning(f"  - No article links found.")

        # Get source information
        logger.info(f"  - Extracting source... 📰")
        source_element = content_div.find('span', class_='post-source-name')
        source = source_element.text.strip() if source_element else None
        logger.info(f"  - Source extracted: {source} ✅")

        source_link = content_div.find('a', class_='post-source')
        source_url = source_link.get('href') if source_link else None
        logger.info(f"  - Source URL extracted: {source_url} 🌐")

        # Get image information 🖼️
        logger.info(f"  - Extracting image URL... 🖼️")
        image_div = article_element.find('div', class_='post-image')
        if image_div:
            img_tag = image_div.find('img')
            image_url = img_tag.get('src') if img_tag else None
            logger.info(f"  - Image URL extracted: {image_url} 🖼️")
        else:
            logger.warning(f"  - No image div found. 🖼️")

        # Get time and categories/tags
        logger.info(f"  - Extracting time and categories/tags... ⏰")
        footer = article_element.find('div', class_='post-footer')
        time_element = footer.find('time') if footer else None
        posted_time = time_element.get('datetime') if time_element else None
        relative_time = time_element.text.strip() if time_element else 'N/A'
        logger.info(f"  - Time extracted: {posted_time} ⏰, Relative Time: {relative_time}")

        # Parse categories and tags
        if footer:
            tag_elements = footer.find_all('a', class_='post-taxonomy-link')
            for tag in tag_elements:
                tag_text = tag.text.strip('· ')
                if tag_text:
                    if tag.get('href', '').startswith('/category/'):
                        categories.append(tag_text)
                    elif tag.get('href', '').startswith('/tag/'):
                        tags.append(tag_text)
            logger.info(f"  - Categories: {categories}, Tags: {tags}")

        article = Article(
            article_id=article_id,
            title=title,
            article_url=article_url,
            source=source,
            source_url=source_url,
            image_url=image_url,
            posted_time=posted_time,
            relative_time=relative_time,
            categories=categories,
            tags=tags,
        )
        logger.info(f"Article parsing complete: {article_id} 🎉")
        return article

    except Exception as e:
        logger.warning(f"⚠️ Error parsing article: {article_id}, error: {e}")
        return None

## 4. Scraping Logic

This section outlines the main scraping process:

-   **Main Scraping Method**: The `scrape` method orchestrates the entire scraping process, including setting up the WebDriver, performing the scraping, and processing the scraped data.
-   **Retry Mechanism**: The `_scrape_with_retry` method retries the scraping process up to `retry_count` times, pausing between attempts with a linearly increasing delay (2 s after the first failure, 4 s after the second). The final failure is re-raised.
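-   **Performing Scraping**: The `_perform_scraping` method navigates to the target URL, waits for the first `post-card` element, then alternates between parsing the loaded articles and scrolling for more. It stops once `num_articles` articles are collected, `max_scrolls` is reached, or a scroll loads no new content, and it reports progress with a `tqdm` progress bar.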
-   **Data Processing**: The `_process_articles` method converts the scraped articles into a pandas DataFrame, removes duplicates, and saves the data in batches as CSV files.
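
The retry behavior can be sketched independently of Selenium. In the version below, the function name `retry` and the injectable `sleep` parameter are assumptions added so the example runs instantly; the delay schedule of `2 * (attempt + 1)` seconds matches `_scrape_with_retry`, which grows linearly rather than exponentially.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(func: Callable[[], T], retry_count: int = 3, base_delay: float = 2.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func until it succeeds or retry_count attempts have failed."""
    for attempt in range(retry_count):
        try:
            return func()
        except Exception:
            if attempt == retry_count - 1:
                raise  # out of attempts: surface the last error
            # Linear schedule: 2 s, 4 s, ...
            # An exponential schedule would use base_delay * 2 ** attempt.
            sleep(base_delay * (attempt + 1))
    raise ValueError("retry_count must be at least 1")


calls = {"n": 0}
delays = []


def flaky() -> str:
    """Fail twice, then succeed, to exercise the retry path."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # ok [2.0, 4.0]
```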

In [17]:
# Code Cell
def scrape(self) -> pd.DataFrame:
    """Main scraping method"""
    articles = []
    try:
        self.setup_driver()
        articles = self._scrape_with_retry()
        return self._process_articles(articles)
    finally:
        if self.driver:
            self.driver.quit()

def _scrape_with_retry(self) -> List[Article]:
    """Implement retry logic for scraping"""
    for attempt in range(self.config.retry_count):
        try:
            return self._perform_scraping()
        except Exception as e:
            if attempt == self.config.retry_count - 1:
                logger.error(f"Attempt {attempt + 1} failed: {e}, giving up")
                raise
            logger.warning(f"Attempt {attempt + 1} failed: {e}, retrying...")
            time.sleep(2 * (attempt + 1))  # Linear backoff: 2 s, 4 s, ...

def _perform_scraping(self) -> List[Article]:
    """Perform the actual scraping"""
    articles = []
    url = f"{self.config.base_url}?category=artificial-intelligence"

    logger.info(f"Navigating to URL: {url}")
    self.driver.get(url)
    wait = WebDriverWait(self.driver, self.config.timeout)
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'post-card')))
    logger.info(f"Page loaded successfully.")

    progress_bar = tqdm(total=self.config.num_articles, desc="Scraping Articles", unit="article")
    scroll_count = 0
    while len(articles) < self.config.num_articles and scroll_count < self.config.max_scrolls:
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
        article_elements = soup.find_all('article', class_='post-card')

        for article_element in article_elements:
            try:
                article = self.parse_article(article_element)
                if article and article.article_id not in self.processed_article_ids:
                    if self._is_valid_article(article):
                        cleaned_article = self._clean_article_data(article)
                        articles.append(cleaned_article)
                        self.processed_article_ids.add(article.article_id)
                        logger.info(f"Processed article {article.article_id} - {article.title}")
                        progress_bar.update(1)
                        progress_bar.set_postfix({"incomplete": self.incomplete_articles})
                    else:
                        self.incomplete_articles += 1
                    self.total_articles += 1
                    if len(articles) >= self.config.num_articles:
                        break
            except StaleElementReferenceException as e:
                logger.warning(f"StaleElementReferenceException: {e}, skipping article")
                continue
            except Exception as e:
                logger.error(f"Error during article processing: {e}")
                continue

        if not self.scroll_manager.scroll_page():
            break
        scroll_count += 1
        logger.info(f"Scrolled {scroll_count} times, found {len(articles)} articles")

    progress_bar.close()
    return articles

def _process_articles(self, articles: List[Article]) -> pd.DataFrame:
    """Process and clean scraped articles"""
    if not articles:
        return pd.DataFrame()

    # Convert articles to DataFrame
    df = pd.DataFrame([article.to_dict() for article in articles])

    # Remove duplicates
    df = df.drop_duplicates(subset=['article_url'], keep='first')

    # Save in batches
    self._save_batches(df)

    return df

def _save_batches(self, df: pd.DataFrame):
    """Save DataFrame in batches"""
    batch_size = self.config.batch_size
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    for i in range(0, len(df), batch_size):
        batch = df[i:i + batch_size]
        filename = f"{self.config.output_dir}/techinasia_ai_news_v1_batch_{i//batch_size}_{timestamp}.csv"
        try:
            batch.to_csv(filename, index=False)
            logger.info(f"Saved batch {i//batch_size + 1} to {filename}")
        except Exception as e:
            logger.error(f"Error saving batch {i//batch_size + 1} to {filename}: {e}")
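
The batch slicing in `_save_batches` can be illustrated without pandas. The sketch below is a simplification under stated assumptions: it operates on a plain list instead of a DataFrame, the function name `plan_batches` is hypothetical, and the timestamp is a made-up value. The file name pattern and the zero-based batch index in the name follow the code above.

```python
from typing import List, Sequence, Tuple


def plan_batches(rows: Sequence, batch_size: int, timestamp: str) -> List[Tuple[str, Sequence]]:
    """Split rows into consecutive batches and pair each with a file name."""
    plan = []
    for start in range(0, len(rows), batch_size):
        batch_index = start // batch_size  # 0, 1, 2, ...
        filename = f"techinasia_ai_news_v1_batch_{batch_index}_{timestamp}.csv"
        plan.append((filename, rows[start:start + batch_size]))
    return plan


plan = plan_batches(list(range(250)), batch_size=100, timestamp="20250115_120000")
for filename, batch in plan:
    print(filename, len(batch))
```

With 250 rows and a batch size of 100, this yields three files holding 100, 100, and 50 rows.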

## 5. Execution and Summary

This section executes the scraper and provides a summary of the scraping process:

-   **Main Function**: The `main` function parses command-line arguments, initializes the `TechInAsiaScraper`, and executes the scraping process.
-   **Sample Display**: After scraping, a sample of the scraped articles is displayed, including the title, source, posted time, categories, and tags.
-   **Summary Logging**: The `_log_summary` method logs summary statistics of the scraping process, including the total number of articles scraped, valid articles, incomplete articles, and duplicate articles.
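
The duplicate figure in the summary is derived rather than counted directly: it is the total minus the valid and incomplete counts. A small sketch of that arithmetic follows; the `ScrapeStats` class and the numbers are hypothetical and are not results from a run.

```python
from dataclasses import dataclass


@dataclass
class ScrapeStats:
    """Hypothetical container mirroring the counters used by _log_summary."""
    total: int
    valid: int
    incomplete: int

    @property
    def duplicates(self) -> int:
        # Whatever is neither valid nor incomplete is reported as a duplicate.
        return self.total - self.valid - self.incomplete


stats = ScrapeStats(total=50, valid=45, incomplete=3)  # made-up counts
print(stats.duplicates)  # 2
```

In `_perform_scraping` as written, `total_articles` is incremented only for articles whose ID has not been seen before, so the logged duplicate count will normally be zero unless that counter is moved ahead of the duplicate check.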

In [18]:
# Code Cell
def _log_summary(self):
    """Log summary statistics of the scraping process"""
    logger.info("📊 --- Scraping Summary ---")
    logger.info(f"📰 Total articles scraped: {self.total_articles}")
    logger.info(f"✅ Valid articles scraped: {len(self.processed_article_ids)}")
    logger.info(f"⚠️ Incomplete articles skipped: {self.incomplete_articles}")
    logger.info(f"🔄 Duplicate articles skipped: {self.total_articles - len(self.processed_article_ids) - self.incomplete_articles}")
    logger.info("📊 ------------------------")

def main():
    """Main function to run the scraper"""
    parser = argparse.ArgumentParser(description="TechInAsia Scraper")
    parser.add_argument('--num_articles', type=int, help='Number of articles to scrape', default=100)
    parser.add_argument('--max_scrolls', type=int, help='Maximum number of scrolls', default=15)
    parser.add_argument('--output_dir', type=str, help='Output directory', default='techinasia_output')
    parser.add_argument('--min_delay', type=float, help='Minimum delay between scrolls', default=1)
    parser.add_argument('--max_delay', type=float, help='Maximum delay between scrolls', default=3)
    args = parser.parse_args()

    config = {
        'num_articles': args.num_articles,
        'max_scrolls': args.max_scrolls,
        'output_dir': args.output_dir,
        'min_delay': args.min_delay,
        'max_delay': args.max_delay
    }

    logger.info("Starting TechInAsia scraper v1.5")
    scraper = TechInAsiaScraper(config)

    try:
        df = scraper.scrape()
        logger.info(f"Successfully scraped {len(df)} articles")

        # Display sample of results
        if not df.empty:
            print("\nSample of scraped articles:")
            display_columns = ['title', 'source', 'posted_time_iso', 'categories', 'tags']
            print(df[display_columns].head())

    except Exception as e:
        logger.error(f"Scraping failed: {e}")
    finally:
        scraper._log_summary()



In [22]:
# Code Cell - Test Run

import logging
from techinasia_scraper_v1_5 import TechInAsiaScraper  # Assuming your script is named techinasia_scraper_v1_5.py

# Get a logger for this test run
logger = logging.getLogger(__name__)

# Define a test configuration
test_config = {
    'num_articles': 75,  # Scrape up to 75 articles for this test
    'max_scrolls': 3,   # Limit the number of scrolls
    'output_dir': 'techinasia_output',  # Use a specific output directory for testing
    'min_delay': 0.5,
    'max_delay': 1
}

try:
    # Initialize the scraper with the test configuration
    scraper = TechInAsiaScraper(test_config)

    # Run the scraper
    df = scraper.scrape()

    # Check if any articles were scraped
    if not df.empty:
        print(f"✅ Test run successful! Scraped {len(df)} articles.")
        print("\nSample of scraped articles:")
        display_columns = ['title', 'source', 'posted_time_iso', 'categories', 'tags']
        print(df[display_columns].head())
    else:
        print("⚠️ Test run completed, but no articles were scraped.")

except Exception as e:
    logger.error(f"❌ Test run failed: {e}")
    print(f"❌ Test run failed: {e}")

finally:
    if 'scraper' in locals():
        scraper._log_summary()

Scraping Articles:  60%|██████    | 45/75 [00:03<00:02, 14.94article/s, incomplete=0]

✅ Test run successful! Scraped 45 articles.

Sample of scraped articles:
                                               title     source  \
0  Ytl, Sea launch Malaysia’s first AI-powered di...   Ryt Bank   
1    Taiwanese startup MetAI raises $4m seed funding      MetAI   
2  Saudi firm Halo AI secures $6m for global expa...  Arab News   
3  Israeli cybersecurity startup raises $36m seed...  Calcalist   
4  Biden signs order for AI data centers, US-made...       CNBC   

       posted_time_iso      categories      tags  
0  2025-01-15T11:08:00      AI,Fintech  Malaysia  
1  2025-01-15T11:00:00  AI,Investments            
2  2025-01-15T10:05:00  AI,Investments            
3  2025-01-15T10:00:00  AI,Investments            
4  2025-01-15T09:00:00              AI            



