# Module A — Dataset Construction & Indexing

1. 2,500 documents per language
2. 5 English and 5 Bangla newspapers
3. Crawling done using RSS feeds and BeautifulSoup
4. Metadata:
    - title
    - body
    - date
    - url
    - language
    - token count
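
The metadata fields listed above map onto one JSON record per article. The following is a minimal sketch of such a record, assuming whitespace tokenization for the token count (as the crawling code in this notebook does). The helper name `make_record` and all sample values are hypothetical and only illustrate the shape of a record.

```python
import json


def make_record(doc_id, title, body, url, date, language):
    """Build one document record with the metadata fields listed above."""
    return {
        "doc_id": doc_id,
        "title": title,
        "body": body,
        "url": url,
        "date": date,
        "language": language,
        # Whitespace tokenization: a rough count, not a linguistic one.
        "token_count": len(body.split()),
    }


# Illustrative values only; not taken from the crawled data.
record = make_record(
    doc_id="en_000000",
    title="Sample headline",
    body="This is a short illustrative body text.",
    url="https://example.com/sample-article",
    date="",
    language="en",
)
print(json.dumps(record, ensure_ascii=False, indent=2))
```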

In [1]:
!nvidia-smi

Mon Jan 12 17:47:40 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.97                 Driver Version: 555.97         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4060 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   32C    P8              1W /  140W |       0MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

True
NVIDIA GeForce RTX 4060 Laptop GPU


In [3]:
import feedparser
import json
from tqdm import tqdm
import os
from bs4 import BeautifulSoup
import html
import re

## HTML to Text

In [4]:
def html_to_text(s):
    if not s:
        return ""
    s = html.unescape(s)
    soup = BeautifulSoup(s, "html.parser")
    text = soup.get_text(" ", strip=True)
    text = re.sub(r"\s+", " ", text).strip()
    return text
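
To see what this cleaning step does without installing BeautifulSoup, here is a rough standard-library approximation. It is a sketch, not the function used in this notebook: `html_to_text_stdlib` and `_TextCollector` are hypothetical names, and `html.parser.HTMLParser` is swapped in for BeautifulSoup, so edge cases may differ.

```python
import html
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text_stdlib(s):
    """Unescape entities, drop tags, and collapse runs of whitespace."""
    if not s:
        return ""
    collector = _TextCollector()
    collector.feed(html.unescape(s))
    collector.close()
    # Join text nodes with a space, then normalize whitespace.
    return re.sub(r"\s+", " ", " ".join(collector.parts)).strip()


sample = "<p>Dhaka &amp; Chattogram<br>see   <b>heavy</b> rain</p>"
result = html_to_text_stdlib(sample)
print(result)
```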

## 1. Data Crawling 

### 1.1 Collect RSS feeds

In [5]:
def collect_from_rss_feeds(rss_feeds, max_docs=200, doc_prefix="en"):
    docs = []
    seen_urls = set()
    doc_i = 0

    for rss_url in tqdm(rss_feeds, desc="Processing RSS feeds"):
        feed = feedparser.parse(rss_url)

        for entry in feed.entries:
            url = getattr(entry, "link", "").strip()
            if not url or url in seen_urls:
                continue

            title = getattr(entry, "title", "").strip()
            summary_raw = getattr(entry, "summary", "").strip()
            summary_text = html_to_text(summary_raw)

            date = getattr(entry, "published", "")

            seen_urls.add(url)

            docs.append({
                "doc_id": f"{doc_prefix}_{doc_i:06d}",
                "title": html_to_text(title),
                "body": summary_text,
                "url": url,
                "date": date,
                "language": doc_prefix,
                "token_count": len(summary_text.split())
            })

            doc_i += 1

            if len(docs) >= max_docs:
                return docs

    return docs
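
The function above stores the `published` string exactly as the feed provides it. If a uniform date format is wanted later, one option is to normalize it at collection time. The sketch below assumes the feeds use RFC 822 style dates, which is common for RSS but not verified for every feed listed here; `normalize_rss_date` is a hypothetical helper name.

```python
from email.utils import parsedate_to_datetime


def normalize_rss_date(raw):
    """Return an ISO 8601 string for an RFC 822 date, or "" if unparseable."""
    if not raw:
        return ""
    try:
        return parsedate_to_datetime(raw).isoformat()
    except (TypeError, ValueError):
        # Different Python versions raise different errors for bad input.
        return ""


iso = normalize_rss_date("Mon, 12 Jan 2026 17:47:40 +0600")
bad = normalize_rss_date("not a date")
print(iso)
print(repr(bad))
```

Falling back to an empty string keeps the record shape consistent with the sitemap-based crawlers in this notebook, which store `"date": ""`.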

### 1.2 English Newspaper RSS Feed URLs

In [6]:
rss_feeds=[
    "https://www.thedailystar.net/rss.xml",
    "https://www.dhakatribune.com/feed/",
    "https://dailynewnation.com/feed/",
    "https://thebangladeshtoday.com/?feed=rss2",
    "https://dailyasianage.com/rss/feed.xml",
    "https://www.thedailystar.net/historical/front-page/rss.xml",
    "https://www.thedailystar.net/business/rss.xml",
    "https://www.thedailystar.net/science-tech/rss.xml",
    "https://www.thedailystar.net/sports/rss.xml",
    "https://www.thedailystar.net/opinion/rss.xml",
    "https://www.thedailystar.net/world/rss.xml",
    "https://www.thedailystar.net/country/rss.xml",
    "https://www.thedailystar.net/environment/rss.xml",
    "https://www.thedailystar.net/arts-culture/rss.xml",
    "https://www.thedailystar.net/magazine/rss.xml",
    "https://www.thedailystar.net/backpage/rss.xml",
    "https://www.thedailystar.net/star-weekend/rss.xml",
    "https://www.thedailystar.net/star-multimedia/rss.xml",
]

### 1.3 Collecting Data from English Newspapers into a JSON File

In [7]:
saving_path = r"E:\DM\Cross-Lingual-Information-Retrieval-System\data"
file_path = os.path.join(saving_path, "document_en.json")

if os.path.exists(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        existing_docs = json.load(f)
else:
    existing_docs = []

existing_urls = {doc["url"] for doc in existing_docs}

new_docs = collect_from_rss_feeds(rss_feeds)
new_docs = [doc for doc in new_docs if doc["url"] not in existing_urls]

start_id = len(existing_docs)
for i, doc in enumerate(new_docs):
    doc["doc_id"] = f"en_{start_id + i:06d}"

all_docs = existing_docs + new_docs

with open(file_path, "w", encoding="utf-8") as f:
    json.dump(all_docs, f, ensure_ascii=False, indent=2)

print(f"Added {len(new_docs)} new documents. Total: {len(all_docs)}")

Processing RSS feeds:  89%|████████▉ | 16/18 [00:09<00:01,  1.64it/s]


Added 188 new documents. Total: 7062
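
The load, de-duplicate, renumber, and save steps in this cell are repeated for the Bangla feeds, so they could be factored into one helper. The sketch below is one possible shape, not the notebook's own code: `append_new_docs` is a hypothetical name, and the demo records are made up and written to a temporary directory.

```python
import json
import os
import tempfile


def append_new_docs(file_path, new_docs, prefix):
    """Merge new_docs into the JSON file at file_path, skipping known URLs."""
    if os.path.exists(file_path):
        with open(file_path, "r", encoding="utf-8") as f:
            existing_docs = json.load(f)
    else:
        existing_docs = []

    existing_urls = {doc["url"] for doc in existing_docs}
    fresh = [doc for doc in new_docs if doc["url"] not in existing_urls]

    # Renumber so IDs continue from the current end of the file.
    start_id = len(existing_docs)
    for i, doc in enumerate(fresh):
        doc["doc_id"] = f"{prefix}_{start_id + i:06d}"

    all_docs = existing_docs + fresh
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(all_docs, f, ensure_ascii=False, indent=2)
    return len(fresh), len(all_docs)


# Demo on a temporary file with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "document_demo.json")
    first = append_new_docs(path, [{"url": "https://example.com/a"}], "en")
    second = append_new_docs(
        path,
        [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}],
        "en",
    )
    with open(path, "r", encoding="utf-8") as f:
        saved = json.load(f)
```

Because IDs are derived from the current length of the file, they stay unique only as long as records are never deleted from it.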


### 1.4 Bangla Newspaper RSS Feed URLs

In [8]:
rss_feeds=[
    "https://www.risingbd.com/rss/rss.xml",
    "https://bd-journal.com/feed/latest-rss.xml",
    "https://bangladeshdiplomat.com/feed",
    "https://www.jagonews24.com/rss/rss.xml",
    "https://bdpratidin.net/rss/latest-posts",
    "https://www.kalerkantho.com/rss.xml",
    "https://www.banglatribune.com/feed/",
    "https://bangla.thedailystar.net/rss.xml",
    "https://rss.app/feeds/MeTNrZ6WtYhicYRP.xml",
    
]

### 1.5 Collecting Data from Bangla Newspapers into a JSON File

In [9]:
saving_path = r"E:\DM\Cross-Lingual-Information-Retrieval-System\data"
file_path = os.path.join(saving_path, "document_bn.json")

if os.path.exists(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        existing_docs = json.load(f)
else:
    existing_docs = []

existing_urls = {doc["url"] for doc in existing_docs}

new_docs = collect_from_rss_feeds(rss_feeds)
new_docs = [doc for doc in new_docs if doc["url"] not in existing_urls]

start_id = len(existing_docs)
for i, doc in enumerate(new_docs):
    doc["doc_id"] = f"bn_{start_id + i:06d}"

all_docs = existing_docs + new_docs

with open(file_path, "w", encoding="utf-8") as f:
    json.dump(all_docs, f, ensure_ascii=False, indent=2)

print(f"Added {len(new_docs)} new documents. Total: {len(all_docs)}")

Processing RSS feeds:  44%|████▍     | 4/9 [00:05<00:06,  1.33s/it]


Added 175 new documents. Total: 10163


### 1.6 Sitemap-based Web Crawling for Bangla Newspapers (Section-filtered)

In [10]:
#!pip -q install requests beautifulsoup4 lxml trafilatura tqdm

In [11]:
import re
import time
import json
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

In [12]:
def get_sitemaps_from_robots(base_url, timeout=20):
    robots_url = base_url.rstrip("/") + "/robots.txt"
    r = requests.get(robots_url, timeout=timeout, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()
    sitemaps = []
    for line in r.text.splitlines():
        line = line.strip()
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps
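
The line-parsing part of this function can be checked offline by separating it from the HTTP request. The sketch below does that with a made-up `robots.txt` body; `sitemaps_from_robots_text` is a hypothetical helper name.

```python
def sitemaps_from_robots_text(robots_text):
    """Extract Sitemap: URLs from the text of a robots.txt file."""
    sitemaps = []
    for line in robots_text.splitlines():
        line = line.strip()
        if line.lower().startswith("sitemap:"):
            # maxsplit=1 keeps the colon inside "https://..." intact.
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps


# Made-up robots.txt content for illustration.
sample_robots = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news-sitemap.xml
"""
found = sitemaps_from_robots_text(sample_robots)
print(found)
```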



In [13]:
def parse_sitemap(xml_text):
    root = ET.fromstring(xml_text)
    tag = root.tag.lower()

    ns = ""
    if "}" in root.tag:
        ns = root.tag.split("}")[0] + "}"

    locs = []
    if tag.endswith("sitemapindex"):
        for sm in root.findall(f"{ns}sitemap"):
            loc = sm.find(f"{ns}loc")
            if loc is not None and loc.text:
                locs.append(loc.text.strip())
    elif tag.endswith("urlset"):
        for u in root.findall(f"{ns}url"):
            loc = u.find(f"{ns}loc")
            if loc is not None and loc.text:
                locs.append(loc.text.strip())

    return locs
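
A sitemap document is either an index of further sitemaps or a set of page URLs. The sketch below is a variant of the parser above that also reports which kind it found, so a caller would not have to guess from the `.xml` suffix of the first entry. `locs_by_kind` is a hypothetical name and the XML samples are made up.

```python
import xml.etree.ElementTree as ET

SITEMAP_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""

URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc> https://example.com/news/1 </loc></url>
  <url><loc></loc></url>
</urlset>"""


def locs_by_kind(xml_text):
    """Return (kind, locs) where kind is "index", "urlset", or "unknown"."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{namespace}localname".
    ns = root.tag.split("}")[0] + "}" if "}" in root.tag else ""
    local = root.tag[len(ns):].lower()
    child = {"sitemapindex": "sitemap", "urlset": "url"}.get(local)
    if child is None:
        return "unknown", []
    locs = []
    for node in root.findall(f"{ns}{child}"):
        loc = node.find(f"{ns}loc")
        if loc is not None and loc.text:
            locs.append(loc.text.strip())
    return ("index" if local == "sitemapindex" else "urlset"), locs


index_result = locs_by_kind(SITEMAP_INDEX)
urlset_result = locs_by_kind(URLSET)
print(index_result)
print(urlset_result)
```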

In [14]:
def collect_urls_from_sitemaps(base_url, max_urls=2000, timeout=25):
    sitemap_urls = get_sitemaps_from_robots(base_url, timeout=timeout)
    if not sitemap_urls:
        sitemap_urls = [base_url.rstrip("/") + "/sitemap.xml"]

    seen_sitemaps = set()
    seen_urls = set()

    queue = list(sitemap_urls)

    with tqdm(total=max_urls, desc=f"Sitemap URLs {base_url}") as pbar:
        while queue and len(seen_urls) < max_urls:
            sm_url = queue.pop(0)
            if sm_url in seen_sitemaps:
                continue
            seen_sitemaps.add(sm_url)

            try:
                r = requests.get(sm_url, timeout=timeout, headers={"User-Agent": "Mozilla/5.0"})
                r.raise_for_status()
                # Parse inside the try block so one malformed sitemap
                # does not abort the whole crawl.
                locs = parse_sitemap(r.text)
            except Exception:
                continue

            if locs and locs[0].endswith(".xml"):
                for nxt in locs:
                    if nxt not in seen_sitemaps:
                        queue.append(nxt)
            else:
                for u in locs:
                    if u.startswith(base_url) and u not in seen_urls:
                        seen_urls.add(u)
                        pbar.update(1)
                        if len(seen_urls) >= max_urls:
                            break

    return list(seen_urls)
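
The traversal above is a breadth-first walk over nested sitemaps. The sketch below isolates that walk so it can be exercised without network access. It rests on several assumptions: `fetch` is a hypothetical injected callable standing in for the HTTP request plus parsing, it returns an explicit kind instead of relying on the `.xml` suffix check, and `FAKE_SITE` is made-up data.

```python
from collections import deque


def walk_sitemaps(start_urls, fetch, max_urls):
    """Breadth-first walk over nested sitemaps.

    fetch(url) is a hypothetical callable returning (kind, locs), where kind
    is "index" for a sitemap index and "urlset" for a list of page URLs.
    """
    seen_sitemaps, seen_pages = set(), set()
    page_urls = []
    queue = deque(start_urls)
    while queue and len(page_urls) < max_urls:
        sm_url = queue.popleft()  # O(1), unlike list.pop(0)
        if sm_url in seen_sitemaps:
            continue
        seen_sitemaps.add(sm_url)
        try:
            kind, locs = fetch(sm_url)
        except Exception:
            continue  # skip unreachable or malformed sitemaps
        if kind == "index":
            queue.extend(u for u in locs if u not in seen_sitemaps)
        else:
            for u in locs:
                if u not in seen_pages:
                    seen_pages.add(u)
                    page_urls.append(u)
                    if len(page_urls) >= max_urls:
                        break
    return page_urls


# Made-up site structure: one index pointing at two URL sets.
FAKE_SITE = {
    "https://example.com/sitemap.xml": ("index", [
        "https://example.com/sitemap-a.xml",
        "https://example.com/sitemap-b.xml",
    ]),
    "https://example.com/sitemap-a.xml": ("urlset", [
        "https://example.com/news/1",
        "https://example.com/news/2",
    ]),
    "https://example.com/sitemap-b.xml": ("urlset", [
        "https://example.com/news/2",
        "https://example.com/sports/3",
    ]),
}

start = ["https://example.com/sitemap.xml"]
urls = walk_sitemaps(start, FAKE_SITE.__getitem__, max_urls=10)
capped = walk_sitemaps(start, FAKE_SITE.__getitem__, max_urls=2)
print(urls)
```

Returning a list in discovery order, rather than converting a set, keeps the result reproducible across runs.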


In [15]:
import trafilatura

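def extract_article_text(url, timeout=25):
    # NOTE: `timeout` is accepted for interface compatibility but is not used;
    # trafilatura.fetch_url is called with its default settings.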
    try:
        downloaded = trafilatura.fetch_url(url)
        if not downloaded:
            return None
        text = trafilatura.extract(
            downloaded,
            include_comments=False,
            include_tables=False
        )
        if not text:
            return None
        text = re.sub(r"\s+", " ", text).strip()
        return text
    except Exception:
        return None

In [16]:
from urllib.parse import urlparse
import os

def build_dataset_from_sitemap_section_append(
    section_url,
    language,
    existing_json_path,
    max_new_docs=200,
    sleep_sec=0.8
):
    p = urlparse(section_url)
    base_url = f"{p.scheme}://{p.netloc}"
    # Keep a root section as "" so the section filter below matches
    # f"{base_url}/" instead of f"{base_url}//".
    section_path = p.path.rstrip("/")

    if os.path.exists(existing_json_path):
        with open(existing_json_path, "r", encoding="utf-8") as f:
            existing_docs = json.load(f)
    else:
        existing_docs = []

    existing_urls = set(d.get("url", "").strip() for d in existing_docs if d.get("url"))
    existing_count = len(existing_docs)

    all_urls = collect_urls_from_sitemaps(base_url, max_urls=max_new_docs * 50)
    section_urls = [u for u in all_urls if f"{base_url}{section_path}/" in u]

    new_docs = []
    doc_i = existing_count

    for url in tqdm(section_urls, desc=f"Append from {section_url}"):
        if url in existing_urls:
            continue

        text = extract_article_text(url)
        time.sleep(sleep_sec)

        if not text:
            continue
        if len(text.split()) < 50:
            continue

        title = ""
        try:
            html = requests.get(url, timeout=20, headers={"User-Agent": "Mozilla/5.0"}).text
            soup = BeautifulSoup(html, "html.parser")
            h1 = soup.find("h1")
            if h1:
                title = h1.get_text(" ", strip=True)
        except Exception:
            pass

        if not title:
            title = text.split(".")[0][:120]

        d = {
            "doc_id": f"{language}_{doc_i:06d}",
            "title": title,
            "body": text,
            "url": url,
            "date": "",
            "language": language,
            "token_count": len(text.split())
        }

        existing_docs.append(d)
        existing_urls.add(url)
        new_docs.append(d)
        doc_i += 1

        if len(new_docs) >= max_new_docs:
            break

    with open(existing_json_path, "w", encoding="utf-8") as f:
        json.dump(existing_docs, f, ensure_ascii=False, indent=2)

    print("Existing docs (before):", existing_count)
    print("New docs added:", len(new_docs))
    print("Total docs (after):", len(existing_docs))
    if new_docs:
        print("Example new:", new_docs[0]["title"], "|", new_docs[0]["url"])
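
The section filter in this function is a substring test on the full URL. A stricter alternative is to compare the host and the path prefix separately, as sketched below. `in_section` is a hypothetical helper, not part of the notebook; the sample URLs are taken from the run log and section list in this notebook.

```python
from urllib.parse import urlparse


def in_section(url, section_url):
    """True if url is on the same host as section_url and below its path."""
    u, s = urlparse(url), urlparse(section_url)
    if u.netloc != s.netloc:
        return False
    section_path = s.path.rstrip("/")
    # A root section ("" after stripping) matches every non-empty path.
    return u.path.startswith(section_path + "/")


section = "https://www.dhakapost.com/sports"
checks = {
    "https://www.dhakapost.com/sports/cricket/416701": in_section(
        "https://www.dhakapost.com/sports/cricket/416701", section),
    "https://www.dhakapost.com/national/403027": in_section(
        "https://www.dhakapost.com/national/403027", section),
    "https://www.jugantor.com/sports/1": in_section(
        "https://www.jugantor.com/sports/1", section),
}
for candidate, ok in checks.items():
    print(ok, candidate)
```

Requiring the trailing slash after the section path means a path such as `/sportsnews/...` does not count as part of `/sports`.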


In [17]:
selected_urls = [
#prothom alo
"https://www.prothomalo.com/bangladesh",
"https://www.prothomalo.com/politics",
"https://www.prothomalo.com/world",
"https://www.prothomalo.com/business",
"https://www.prothomalo.com/sports",
"https://www.prothomalo.com/entertainment",

#dhaka post
"https://www.dhakapost.com/national",
"https://www.dhakapost.com/politics",
"https://www.dhakapost.com/economy",
"https://www.dhakapost.com/international",
"https://www.dhakapost.com/sports",
"https://www.dhakapost.com/entertainment",


#jugantor
"https://www.jugantor.com/national",
"https://www.jugantor.com/politics",
"https://www.jugantor.com/economics",
"https://www.jugantor.com/international",
"https://www.jugantor.com/entertainment",

]


In [18]:
for url in selected_urls:
    print("Processing Link: ", url)
    build_dataset_from_sitemap_section_append(
        section_url=url,
        language="bn",
        existing_json_path=r"E:\DM\Cross-Lingual-Information-Retrieval-System\data\document_bn.json",
        max_new_docs=500,
        sleep_sec=1.0,
    )

Processing Link:  https://www.prothomalo.com/bangladesh


Sitemap URLs https://www.prothomalo.com:  82%|████████▏ | 20463/25000 [01:01<00:13, 335.07it/s]
Append from https://www.prothomalo.com/bangladesh: 100%|██████████| 7473/7473 [5:38:30<00:00,  2.72s/it]   


Existing docs (before): 10163
New docs added: 38
Total docs (after): 10201
Example new: ১৯৭১ ও ২০২৪: ইতিহাসের ধারাবাহিকতা | https://www.prothomalo.com/bangladesh/egd9l7a0wf
Processing Link:  https://www.prothomalo.com/politics


Sitemap URLs https://www.prothomalo.com:  82%|████████▏ | 20552/25000 [00:23<00:05, 873.46it/s] 
Append from https://www.prothomalo.com/politics: 100%|██████████| 1009/1009 [16:52<00:00,  1.00s/it]


Existing docs (before): 10201
New docs added: 5
Total docs (after): 10206
Example new: ক্ষমতার ভারসাম্য প্রশ্নে আলোচনা এগোচ্ছে না | https://www.prothomalo.com/politics/pdilbzc2oz
Processing Link:  https://www.prothomalo.com/world


Sitemap URLs https://www.prothomalo.com:  82%|████████▏ | 20556/25000 [00:09<00:01, 2283.62it/s]
Append from https://www.prothomalo.com/world: 100%|██████████| 1690/1690 [30:14<00:00,  1.07s/it]


Existing docs (before): 10206
New docs added: 6
Total docs (after): 10212
Example new: ইয়েমেনের ঘটনা দেখিয়ে দিল সৌদি আরব ও আমিরাত পরস্পরকে কতটা অবিশ্বাস করে | https://www.prothomalo.com/world/middle-east/cq5a9m6hq2
Processing Link:  https://www.prothomalo.com/business


Sitemap URLs https://www.prothomalo.com:  82%|████████▏ | 20563/25000 [00:09<00:01, 2220.66it/s]
Append from https://www.prothomalo.com/business: 100%|██████████| 1075/1075 [17:14<00:00,  1.04it/s]


Existing docs (before): 10212
New docs added: 2
Total docs (after): 10214
Example new: বোয়িংয়ের বিরুদ্ধে মুখ খোলা আরেক ব্যক্তির মৃত্যু, চলতি বছর এ নিয়ে দুজন | https://www.prothomalo.com/business/world-business/7wqlgehenf
Processing Link:  https://www.prothomalo.com/sports


Sitemap URLs https://www.prothomalo.com:  82%|████████▏ | 20564/25000 [00:09<00:01, 2253.41it/s]
Append from https://www.prothomalo.com/sports: 100%|██████████| 1789/1789 [41:45<00:00,  1.40s/it]


Existing docs (before): 10214
New docs added: 7
Total docs (after): 10221
Example new: দাবি নিয়ে বিসিবিতে প্রথম, দ্বিতীয় ও তৃতীয় বিভাগের ক্রিকেটাররা | https://www.prothomalo.com/sports/cricket/1cgoi16rdl
Processing Link:  https://www.prothomalo.com/entertainment


Sitemap URLs https://www.prothomalo.com:  82%|████████▏ | 20568/25000 [00:09<00:01, 2223.56it/s]
Append from https://www.prothomalo.com/entertainment: 100%|██████████| 1239/1239 [23:15<00:00,  1.13s/it]


Existing docs (before): 10221
New docs added: 10
Total docs (after): 10231
Example new: পড়শীর বর কে এই নিলয় | https://www.prothomalo.com/entertainment/song/ptgsbeaxg4
Processing Link:  https://www.dhakapost.com/national


Sitemap URLs https://www.dhakapost.com: 100%|██████████| 25000/25000 [00:03<00:00, 7451.14it/s]
Append from https://www.dhakapost.com/national: 100%|██████████| 23090/23090 [7:11:49<00:00,  1.12s/it]   


Existing docs (before): 10231
New docs added: 278
Total docs (after): 10509
Example new: রাজধানীর শাহজাহানপুরে কলেজ শিক্ষার্থীর আত্মহত্যার অভিযোগ | https://www.dhakapost.com/national/403027
Processing Link:  https://www.dhakapost.com/politics


Sitemap URLs https://www.dhakapost.com: 100%|██████████| 25000/25000 [00:02<00:00, 9810.52it/s]
Append from https://www.dhakapost.com/politics: 100%|██████████| 18/18 [00:00<?, ?it/s]


Existing docs (before): 10509
New docs added: 0
Total docs (after): 10509
Processing Link:  https://www.dhakapost.com/economy


Sitemap URLs https://www.dhakapost.com: 100%|██████████| 25000/25000 [00:02<00:00, 12060.85it/s]
Append from https://www.dhakapost.com/economy: 100%|██████████| 20/20 [00:00<?, ?it/s]


Existing docs (before): 10509
New docs added: 0
Total docs (after): 10509
Processing Link:  https://www.dhakapost.com/international


Sitemap URLs https://www.dhakapost.com: 100%|██████████| 25000/25000 [00:02<00:00, 11237.70it/s]
Append from https://www.dhakapost.com/international: 100%|██████████| 20/20 [00:00<?, ?it/s]


Existing docs (before): 10509
New docs added: 0
Total docs (after): 10509
Processing Link:  https://www.dhakapost.com/sports


Sitemap URLs https://www.dhakapost.com: 100%|██████████| 25000/25000 [00:02<00:00, 11291.69it/s]
Append from https://www.dhakapost.com/sports: 100%|██████████| 42/42 [00:51<00:00,  1.21s/it]


Existing docs (before): 10509
New docs added: 3
Total docs (after): 10512
Example new: আগামীকাল থেকে শুরু হচ্ছে নারী ক্রিকেট লিগ | https://www.dhakapost.com/sports/cricket/416701
Processing Link:  https://www.dhakapost.com/entertainment


Sitemap URLs https://www.dhakapost.com: 100%|██████████| 25000/25000 [00:02<00:00, 9435.06it/s]
Append from https://www.dhakapost.com/entertainment: 100%|██████████| 22/22 [00:26<00:00,  1.20s/it]


Existing docs (before): 10512
New docs added: 0
Total docs (after): 10512
Processing Link:  https://www.jugantor.com/national


Sitemap URLs https://www.jugantor.com:   3%|▎         | 815/25000 [00:01<00:49, 487.65it/s]
Append from https://www.jugantor.com/national: 100%|██████████| 74/74 [02:00<00:00,  1.63s/it]


Existing docs (before): 10512
New docs added: 74
Total docs (after): 10586
Example new: চার দেশের বাংলাদেশ মিশনের প্রেস সচিবকে অব্যাহতি | https://www.jugantor.com/national/1051930
Processing Link:  https://www.jugantor.com/politics


Sitemap URLs https://www.jugantor.com:   3%|▎         | 815/25000 [00:01<00:32, 746.09it/s]
Append from https://www.jugantor.com/politics: 100%|██████████| 26/26 [00:43<00:00,  1.67s/it]


Existing docs (before): 10586
New docs added: 26
Total docs (after): 10612
Example new: দেশের স্বার্থে গণভোটে ‘হ্যাঁ’ ভোট দেন: জামায়াত আমির | https://www.jugantor.com/politics/1051891
Processing Link:  https://www.jugantor.com/economics


Sitemap URLs https://www.jugantor.com:   3%|▎         | 815/25000 [00:01<00:54, 445.93it/s]
Append from https://www.jugantor.com/economics: 100%|██████████| 9/9 [00:13<00:00,  1.48s/it]


Existing docs (before): 10612
New docs added: 9
Total docs (after): 10621
Example new: দেশের ইতিহাসে সর্বোচ্চ দামে স্বর্ণ | https://www.jugantor.com/economics/1052014
Processing Link:  https://www.jugantor.com/international


Sitemap URLs https://www.jugantor.com:   3%|▎         | 815/25000 [00:01<00:52, 464.20it/s]
Append from https://www.jugantor.com/international: 100%|██████████| 70/70 [01:58<00:00,  1.69s/it]


Existing docs (before): 10621
New docs added: 70
Total docs (after): 10691
Example new: ব্রিটিশ সরকারকে ‘ফ্যাসিস্ট’ বললেন ইলন মাস্ক | https://www.jugantor.com/international/1051494
Processing Link:  https://www.jugantor.com/entertainment


Sitemap URLs https://www.jugantor.com:   3%|▎         | 815/25000 [00:02<01:10, 343.36it/s]
Append from https://www.jugantor.com/entertainment: 100%|██████████| 47/47 [01:11<00:00,  1.52s/it]


Existing docs (before): 10691
New docs added: 47
Total docs (after): 10738
Example new: ‘সালমানের সঙ্গে অভিনয় স্বপ্নপূরণের মতো’ | https://www.jugantor.com/entertainment/1051629


### 1.7 Listing-page Web Crawling for English Newspapers (Section-filtered)

In [20]:
import re
import time
import json
import os
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
from urllib.parse import urljoin, urlparse
import trafilatura

In [21]:
def _norm_url(u):
    if not u:
        return ""
    u = u.strip().split("#", 1)[0]
    return u


In [22]:
def _same_domain(u, base_url):
    try:
        return urlparse(u).netloc == urlparse(base_url).netloc
    except Exception:
        return False


In [23]:
def _get_html(url, timeout=25):
    r = requests.get(url, timeout=timeout, headers={
        "User-Agent": "Mozilla/5.0",
        "Accept-Language": "en-US,en;q=0.9"
    })
    r.raise_for_status()
    return r.text


In [24]:
def _extract_article_text_requests(url, timeout=25):
    try:
        html = _get_html(url, timeout=timeout)
        text = trafilatura.extract(html, include_comments=False, include_tables=False)
        if not text:
            return None
        text = re.sub(r"\s+", " ", text).strip()
        return text
    except Exception:
        return None


In [25]:
def _extract_title_h1(html):
    try:
        soup = BeautifulSoup(html, "html.parser")
        h1 = soup.find("h1")
        if h1:
            return h1.get_text(" ", strip=True)
    except Exception:
        pass
    return ""


In [26]:
def _extract_links_from_listing(html, base_url):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = _norm_url(a.get("href", ""))
        if not href:
            continue
        abs_url = urljoin(base_url, href)
        links.append(abs_url)
    return links


In [27]:
def _looks_like_article(u, base_url):
    if not _same_domain(u, base_url):
        return False
    path = urlparse(u).path.lower().strip("/")
    if not path:
        return False
    bad = ["category", "tag", "author", "page", "search", "privacy", "terms", "contact", "about", "login", "signup"]
    if any(f"/{b}/" in "/" + path + "/" for b in bad):
        return False
    if path.endswith((".jpg", ".jpeg", ".png", ".gif", ".pdf", ".mp4")):
        return False
    if len(path.split("/")) == 1 and len(path) < 8:
        return False
    return True
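
The run log in this notebook reports `https://www.newagebd.net/articlelist/80/television` as an example of a saved document, which suggests that listing pages can pass this heuristic. The sketch below reproduces the same checks with a configurable exclusion list to show why. `looks_like_article` and the `excluded` parameter are hypothetical, and adding `"articlelist"` is a site-specific assumption rather than a general rule.

```python
from urllib.parse import urlparse

EXCLUDED_SEGMENTS = ["category", "tag", "author", "page", "search", "privacy",
                     "terms", "contact", "about", "login", "signup"]


def looks_like_article(u, base_url, excluded=EXCLUDED_SEGMENTS):
    """Same checks as _looks_like_article, with a configurable segment list."""
    if urlparse(u).netloc != urlparse(base_url).netloc:
        return False
    path = urlparse(u).path.lower().strip("/")
    if not path:
        return False
    # Reject URLs containing an excluded path segment.
    if any(f"/{b}/" in f"/{path}/" for b in excluded):
        return False
    if path.endswith((".jpg", ".jpeg", ".png", ".gif", ".pdf", ".mp4")):
        return False
    # Reject short single-segment paths such as "/world".
    if len(path.split("/")) == 1 and len(path) < 8:
        return False
    return True


base = "https://www.newagebd.net"
listing = "https://www.newagebd.net/articlelist/80/television"
passes_default = looks_like_article(listing, base)
# Hypothetical site-specific extension: also exclude "articlelist" segments.
passes_strict = looks_like_article(listing, base, EXCLUDED_SEGMENTS + ["articlelist"])
print(passes_default, passes_strict)
```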


In [28]:
def _find_next_page(html, current_url, base_url):
    soup = BeautifulSoup(html, "html.parser")

    rel_next = soup.find("link", rel=lambda x: x and "next" in x.lower())
    if rel_next and rel_next.get("href"):
        return urljoin(base_url, _norm_url(rel_next["href"]))

    a_next = soup.find("a", rel=lambda x: x and "next" in x.lower())
    if a_next and a_next.get("href"):
        return urljoin(base_url, _norm_url(a_next["href"]))

    for a in soup.find_all("a", href=True):
        txt = a.get_text(" ", strip=True).lower()
        if txt in ["next", "next >", "older", "older posts", "›", "»"]:
            return urljoin(base_url, _norm_url(a["href"]))

    return None


In [29]:
def build_dataset_from_section_pages_append(
    section_urls,
    language,
    existing_json_path,
    max_new_docs=200,
    max_pages_per_section=5,
    sleep_sec=1.0,
    min_tokens=50
):
    if os.path.exists(existing_json_path):
        with open(existing_json_path, "r", encoding="utf-8") as f:
            existing_docs = json.load(f)
    else:
        existing_docs = []

    existing_urls = set(d.get("url", "").strip() for d in existing_docs if d.get("url"))
    doc_i = len(existing_docs)

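    new_docs = []

    for section_url in section_urls:
        # Derive the base URL per section so that sections on different
        # domains are filtered against their own host, not the first one's.
        p = urlparse(section_url)
        base_url = f"{p.scheme}://{p.netloc}"
        current = section_url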
        for _ in range(max_pages_per_section):
            try:
                listing_html = _get_html(current)
            except Exception:
                break

            links = _extract_links_from_listing(listing_html, base_url)
            article_links = []
            for u in links:
                if _looks_like_article(u, base_url):
                    article_links.append(u)
            article_links = list(dict.fromkeys(article_links))

            for url in tqdm(article_links, desc=f"Scrape {current}"):
                if url in existing_urls:
                    continue

                text = _extract_article_text_requests(url)
                time.sleep(sleep_sec)

                if not text:
                    continue
                if len(text.split()) < min_tokens:
                    continue

                title = ""
                try:
                    html_article = _get_html(url)
                    title = _extract_title_h1(html_article)
                except Exception:
                    pass
                if not title:
                    title = text.split(".")[0][:120]

                d = {
                    "doc_id": f"{language}_{doc_i:06d}",
                    "title": title,
                    "body": text,
                    "url": url,
                    "date": "",
                    "language": language,
                    "token_count": len(text.split())
                }

                existing_docs.append(d)
                existing_urls.add(url)
                new_docs.append(d)
                doc_i += 1

                if len(new_docs) >= max_new_docs:
                    with open(existing_json_path, "w", encoding="utf-8") as f:
                        json.dump(existing_docs, f, ensure_ascii=False, indent=2)
                    print("New docs added:", len(new_docs))
                    print("Total docs:", len(existing_docs))
                    if new_docs:
                        print("Example new:", new_docs[0]["title"], "|", new_docs[0]["url"])
                    return

            nxt = _find_next_page(listing_html, current, base_url)
            if not nxt or nxt == current:
                break
            current = nxt

    with open(existing_json_path, "w", encoding="utf-8") as f:
        json.dump(existing_docs, f, ensure_ascii=False, indent=2)

    print("New docs added:", len(new_docs))
    print("Total docs:", len(existing_docs))
    if new_docs:
        print("Example new:", new_docs[0]["title"], "|", new_docs[0]["url"])
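
Both dataset builders in this notebook write the JSON file only when a run finishes, and the run log shows a single section taking several hours, so an interruption could lose everything collected in that run. One possible mitigation is to checkpoint periodically with an atomic write. The sketch below assumes a hypothetical helper `save_json_atomic` and a made-up checkpoint interval; it is not part of the original code.

```python
import json
import os
import tempfile


def save_json_atomic(docs, path):
    """Write docs to path via a temporary file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(docs, f, ensure_ascii=False, indent=2)
        # Replaces the target in one step when both are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo with made-up records in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "document_demo.json")
    docs = []
    for i in range(5):
        docs.append({"doc_id": f"bn_{i:06d}"})
        if len(docs) % 2 == 0:  # hypothetical checkpoint interval
            save_json_atomic(docs, target)
    save_json_atomic(docs, target)  # final save
    with open(target, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
    leftovers = [n for n in os.listdir(tmp) if n.endswith(".tmp")]
```

Writing to a temporary file first means a crash during the write leaves the previous version of the dataset untouched.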


In [30]:
selected_urls = [
"https://www.newagebd.net/articlelist/41/bangladesh",
"https://www.newagebd.net/articlelist/29/business-economy",
"https://www.newagebd.net/articlelist/31/world",
"https://www.newagebd.net/articlelist/22/sports",
"https://www.newagebd.net/articlelist/25/editorial",
"https://www.newagebd.net/articlelist/27/environment",
"https://dailynewnation.com/category/todays-news/national",
"https://dailynewnation.com/category/todays-news/business-economy",
"https://dailynewnation.com/category/todays-news/international",
"https://dailynewnation.com/category/todays-news/sports",
"https://dailynewnation.com/category/todays-news/entertainment",
"https://dailynewnation.com/category/news-buzz",
"https://www.daily-sun.com/bangladesh",
"https://www.daily-sun.com/business",
"https://www.daily-sun.com/world",
"https://www.daily-sun.com/sports",
"https://www.daily-sun.com/entertainment",
"https://www.daily-sun.com/technology",
"https://www.dhakatribune.com/latest-news",
"https://www.dhakatribune.com/politics",
"https://www.dhakatribune.com/business",
"https://www.dhakatribune.com/world",
"https://www.dhakatribune.com/sport",
"https://www.dhakatribune.com/showtime",
]

In [31]:
for url in selected_urls[0:1]:
    print("Processing Link: ", url)
    # Pass only the current section. The function prints its own summary,
    # so notebook-level variables from earlier cells are not printed here.
    build_dataset_from_section_pages_append(
        section_urls=[url],
        language="en",
        existing_json_path=r"E:\DM\Cross-Lingual-Information-Retrieval-System\data\document_en.json",
        # max_new_docs=500,
        max_pages_per_section=10,
        sleep_sec=1.0,
    )


Processing Link:  https://www.newagebd.net/articlelist/41/bangladesh


Scrape https://www.newagebd.net/articlelist/41/bangladesh: 100%|██████████| 86/86 [00:56<00:00,  1.52it/s]
Scrape https://www.newagebd.net/articlelist/41/bangladesh?page=2: 100%|██████████| 91/91 [00:58<00:00,  1.57it/s]
Scrape https://www.newagebd.net/articlelist/41/bangladesh?page=3: 100%|██████████| 91/91 [00:58<00:00,  1.57it/s]
Scrape https://www.newagebd.net/articlelist/41/bangladesh?page=4: 100%|██████████| 91/91 [00:57<00:00,  1.57it/s]
Scrape https://www.newagebd.net/articlelist/41/bangladesh?page=5: 100%|██████████| 91/91 [00:59<00:00,  1.52it/s]
Scrape https://www.newagebd.net/articlelist/41/politics?page=6: 100%|██████████| 91/91 [00:59<00:00,  1.52it/s]
Scrape https://www.newagebd.net/articlelist/41/politics?page=7: 100%|██████████| 91/91 [01:00<00:00,  1.50it/s]
Scrape https://www.newagebd.net/articlelist/41/politics?page=8: 100%|██████████| 91/91 [00:59<00:00,  1.53it/s]
Scrape https://www.newagebd.net/articlelist/41/politics?page=9: 100%|██████████| 91/91 [00:56<00:00, 

New docs added: 200
Total docs: 7262
Example new: Television | https://www.newagebd.net/articlelist/80/television





## Purpose

1. Gain exposure to real-world, messy data
2. Understand indexing fundamentals
3. Create a foundation for multilingual search
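
As a pointer toward the indexing goal, the sketch below builds a toy inverted index over records shaped like the ones collected in this notebook. It is a minimal illustration under stated assumptions, not the module's indexing method: tokens are lowercased whitespace splits, `build_inverted_index` is a hypothetical name, and the sample bodies are made up.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased whitespace token to the sorted doc_ids containing it."""
    index = defaultdict(set)
    for doc in docs:
        for token in doc["body"].lower().split():
            index[token].add(doc["doc_id"])
    return {token: sorted(ids) for token, ids in index.items()}


# Made-up records for illustration.
docs = [
    {"doc_id": "en_000000", "body": "flood warning issued in Sylhet"},
    {"doc_id": "en_000001", "body": "cricket league starts tomorrow"},
    {"doc_id": "en_000002", "body": "flood relief reaches Sylhet"},
]
index = build_inverted_index(docs)
print(index["flood"])
```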