# Odia News Article Scraper

## Objective
The primary purpose of this notebook is to perform a targeted web scrape of a single news article from an Odia-language newspaper website. It serves as a foundational component of the data collection pipeline, designed to extract, clean, and filter raw article text to produce a monolingual Odia text file suitable for downstream NLP tasks, including the creation of a parallel corpus.

## Methodology
This script combines two parsers, one for article metadata and one for the main content, so that each is extracted by the tool better suited to it.

* **Data Fetching:** Uses the `requests` library to download the raw HTML of the target URL.

* **Metadata Extraction:** Uses the `newspaper3k` library to parse the document and extract the article's title.

* **Content Extraction:** Uses `BeautifulSoup` to extract the main article body from the site's primary content container (`<div class="td-post-content">`), falling back to the full page text if that container is not found.

* **Language Filtering:** A custom function (`filter_odia_text`) keeps only characters in the Odia Unicode block (`U+0B00`–`U+0B7F`), a fixed set of punctuation marks, and whitespace. Everything else, including Latin letters and ASCII digits, is dropped, which removes English words and navigation text and yields a monolingual Odia corpus.
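
The language-filtering idea can be sketched with a regular expression. This is a minimal illustration, not the notebook's actual function: the name `keep_odia` and the reduced punctuation set are assumptions made for the example.

```python
import re

# Assumption: a regex-based variant of the character filter described above.
# It keeps Odia code points (U+0B00 to U+0B7F), whitespace, and a small,
# illustrative punctuation set; every other character is removed.
NON_ODIA = re.compile(r"[^\u0B00-\u0B7F\s.,!?;:()\"'-]")

def keep_odia(text: str) -> str:
    """Drop characters outside the Odia block, whitespace, and basic punctuation."""
    return NON_ODIA.sub("", text).strip()

sample = "ନମସ୍କାର! Hello, କେମିତି ଅଛନ୍ତି? 123"
filtered = keep_odia(sample)
print(filtered)
```

Punctuation that was attached to a removed word survives the filter. In the sample, the comma after "Hello" remains as a stray `,`, so a further cleanup pass may be worthwhile if such fragments matter downstream.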

## Workflow
The notebook executes the following sequential steps:

1. Mounts Google Drive to ensure persistent storage of the output.

2. Installs all necessary Python dependencies.

3. Fetches the HTML content from a specified article URL.

4. Parses the HTML to extract the title and main body text.

5. Cleans the extracted text by removing known boilerplate phrases and normalizing whitespace.

6. Filters the cleaned text, retaining only Odia-language characters and punctuation.

7. Saves the final, clean Odia text to a `.txt` file in the specified Google Drive directory.
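
Step 5 can be sketched as a small helper. The function name `clean_text` and the sample input are hypothetical; the sketch also strips each line, which is a slight variation on the notebook's code (the notebook only drops blank lines).

```python
def clean_text(raw: str, unwanted_phrases: list[str]) -> str:
    """Remove known boilerplate phrases, then drop empty lines."""
    for phrase in unwanted_phrases:
        raw = raw.replace(phrase, "")
    # Strip each line and keep only the non-empty ones.
    lines = [line.strip() for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)

# Hypothetical input: two Odia lines separated by blank lines and a footer phrase.
raw = "ପ୍ରଥମ ଧାଡ଼ି\n\n\n  All Right Reserved By Dharitri.Com  \nଦ୍ୱିତୀୟ ଧାଡ଼ି\n"
cleaned = clean_text(raw, ["All Right Reserved By Dharitri.Com"])
print(cleaned)
```

Running the phrase removal before the Odia filter matters: once Latin characters are stripped, the English boilerplate can no longer be matched as a whole phrase, and only its leftover punctuation would remain.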

## Input & Output
* **Input:** A single string variable, `article_url`, containing the URL to a news article.

* **Output:** A single `.txt` file (e.g., `dharitri_odia_article_filtered_100.txt`) saved to Google Drive, containing the article's title and its filtered, clean Odia text.
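
The output layout (a `Title:` header, a blank line, then the filtered text) can be written and read back as follows. This is a sketch under stated assumptions: `save_article`, `load_article`, and the temporary file name are hypothetical and are not part of the notebook.

```python
import tempfile
from pathlib import Path

def save_article(path: Path, title: str, body: str) -> None:
    """Write the title header, a blank line, then the filtered text."""
    path.write_text(f"Title: {title}\n\n{body}", encoding="utf-8")

def load_article(path: Path) -> tuple[str, str]:
    """Split a saved file back into (title, body) at the first blank line."""
    header, _, body = path.read_text(encoding="utf-8").partition("\n\n")
    return header.removeprefix("Title: "), body

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "sample_article.txt"  # hypothetical file name
    save_article(out_path, "ଶୀର୍ଷକ", "ପ୍ରଥମ ଧାଡ଼ି\nଦ୍ୱିତୀୟ ଧାଡ଼ି")
    title, body = load_article(out_path)

print(title)
print(len(body.splitlines()))
```

Splitting at the first blank line only works if the title itself contains no blank line, which holds for single-line headlines.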

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
!pip install -q newspaper3k



In [None]:
!pip install lxml_html_clean

Collecting lxml_html_clean
  Downloading lxml_html_clean-0.4.2-py3-none-any.whl.metadata (2.4 kB)
Downloading lxml_html_clean-0.4.2-py3-none-any.whl (14 kB)
Installing collected packages: lxml_html_clean
Successfully installed lxml_html_clean-0.4.2


In [None]:
import os
import sys
import requests
from bs4 import BeautifulSoup
from newspaper import Article

# Article Scraping
headers = {
    "User-Agent": """Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:138.0) Gecko/20100101 Firefox/138.0"""
}

# --- Target Article URL ---
article_url = "https://www.dharitri.com/kismis-will-make-you-beautiful/"

# --- Function to filter out non-Odia, non-punctuation, non-whitespace characters ---
def filter_odia_text(text):
  """
  Filters input text to retain only Odia script characters, specified punctuation, and whitespace.

  This function processes the input string to keep characters within the Unicode range for Odia
  (Oriya) script (U+0B00 to U+0B7F), a predefined set of punctuation marks, and whitespace characters.
  All other characters are removed, ensuring the output text is suitable for applications requiring
  clean Odia text while preserving sentence structure and readability.

  Args:
    text (str): The input string to be filtered.

  Returns:
    str: A string containing only Odia characters, allowed punctuation, and whitespace,
    with leading and trailing whitespace removed.

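  Example:
    >>> sample_text = "ନମସ୍କାର! Hello, କେମିତି ଅଛନ୍ତି? 123"
    >>> filter_odia_text(sample_text)
    'ନମସ୍କାର! , କେମିତି ଅଛନ୍ତି?'

    Note that the comma after the removed word "Hello" is kept, because
    punctuation is allowed regardless of the text it was attached to.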
  """
  odia_chars = []
  # Unicode range for Odia script (Oriya): U+0B00 to U+0B7F
  odia_start = 0x0B00
  odia_end = 0x0B7F

  # Common punctuation and whitespace to explicitly allow
  # This ensures sentence structure and readability for Odia text.
  allowed_common_chars = ".,!?;:()[]{}'\"-—/%&+=*#@₹" + os.linesep + "\t"

  for char in text:
    char_code = ord(char)
    if (
      odia_start <= char_code <= odia_end
      or char in allowed_common_chars
      or char.isspace()  # Also catches form feed, vertical tab, etc.
    ):
      odia_chars.append(char)
  return "".join(odia_chars).strip()

session = requests.Session()
article_title = ""
article_text = ""

try:
  print(f"Attempting to fetch article from: {article_url}")
  response = session.get(article_url, headers=headers, timeout=30)

  # Check if the request was successful (HTTP status code 200)
  if response.status_code == 200:
    # --- Use newspaper for title ---
    # Initialize the Article object with the URL
    article = Article(article_url)
    # Reuse the HTML already fetched above instead of downloading the page a second time.
    article.download(input_html=response.text)
    # Parse the article to extract title, text, authors, etc.
    article.parse()
    article_title = article.title

    # --- Use BeautifulSoup for main article text extraction ---
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the main article content div.
    article_content_div = soup.find('div', class_='td-post-content')

    raw_article_text = ""
    if article_content_div:
      # Extract all text from within this specific div
      # .get_text(separator='\n') helps preserve paragraph breaks
      raw_article_text = article_content_div.get_text(separator='\n').strip()
      # Clean up unwanted common phrases before Odia filtering
      unwanted_phrases = [
          "All Right Reserved By Dharitri.Com",
          "Enter your email to get our daily news in your inbox.",
          # Generic archive text that often appears in footers/sidebars
          "Archives Archives Select Month",
          # More specific month sequences that might appear due to parsing
          "June 2025 May 2025 April 2025 March 2025 February 2025 January 2025",
          "December 2024 November 2024 October 2024 September 2024 August 2024 July 2024 June 2024 May 2024 April 2024 March 2024 February 2024 January 2024",
          "December 2023 November 2023 October 2023 September 2023 August 2023 July 2023 June 2023 May 2023 April 2023 March 2023 February 2023 January 2023",
          "December 2022 November 2022 October 2022 September 2022 August 2022 July 2022 June 2022 May 2022 April 2022 March 2022 February 2022 January 2022",
          "December 2021 November 2021 October 2021 September 2021 August 2021 July 2021 June 2021 May 2021 April 2021 March 2021 February 2021 January 2021",
          "December 2020 November 2020 October 2020 September 2020 August 2020 July 2020 June 2020 May 2020 April 2020 March 2020 February 2020 January 2020",
          "December 2019 November 2019 October 2019 September 2019 August 2019 July 2019 June 2019 May 2019 April 2019 March 2019",
          "December 2018 November 2018 October 2018 August 2018 July 2018 June 2018 January 2018 December 2017 October 2017 January 2017"
      ]
      for phrase in unwanted_phrases:
        raw_article_text = raw_article_text.replace(phrase, '').strip()

      # Drop blank and whitespace-only lines so that each remaining line holds text
      raw_article_text = os.linesep.join([s for s in raw_article_text.splitlines() if s.strip()])

    else:
      print("Warning: Could not find the main article content div ('td-post-content'). Extracting entire page text.")
      raw_article_text = soup.get_text(separator='\n').strip() # Fallback to entire page text


    # Apply the Odia-specific filter
    article_text = filter_odia_text(raw_article_text)

    # --- Print Extracted Content ---
    print("\n--- Extracted Article Details ---")
    print(f"Title: {article_title}")
    print("\n--- Filtered Odia Article Text ---")
    if article_text:
      # print(article_text)
      print(f"\nTotal text length: {len(article_text)} characters")
    else:
      print("[No Odia-specific text extracted after filtering. The article might contain very little pure Odia text, or the filtering was too aggressive.]")

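    # Save the extracted content to a text file on Google Drive
    output_file_name = "/content/drive/MyDrive/Thesis/Data/dharitri_odia_article_filtered_100.txt"
    # Create the target directory if it does not exist yet
    os.makedirs(os.path.dirname(output_file_name), exist_ok=True)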
    with open(output_file_name, "w", encoding="utf-8") as f:
      f.write(f"Title: {article_title}\n\n")
      f.write(article_text)
    print(f"\nFiltered Odia content saved to '{output_file_name}'.")

  else:
    print(f"Failed to fetch article. Status code: {response.status_code} from {article_url}")
    sys.exit(1)
except requests.exceptions.ReadTimeout as e:
  # The request object carries no `timeout` attribute; the limit is the value passed to session.get() above.
  print(f"Error: The website took too long to respond (read timeout of 30 seconds): {e}")
  print("Consider increasing the 'timeout' value further if this persists, or try again later.")
  sys.exit(1)
except Exception as e:
  print(f"An unexpected error occurred while fetching or parsing the article from {article_url}: {e}")
  print("Common issues: URL is incorrect, network problems, or website structure changes.")
  sys.exit(1)

Attempting to fetch article from: https://www.dharitri.com/kismis-will-make-you-beautiful/

--- Extracted Article Details ---
Title: ସୁନ୍ଦର କରିବ କିସ୍‌ମିସ୍‌

--- Filtered Odia Article Text ---

Total text length: 6618 characters

Filtered Odia content saved to '/content/drive/MyDrive/Thesis/Data/dharitri_odia_article_filtered_100.txt'.
