# Smart WARC Duplicate Removal — Build-As-You-Learn

This notebook is a clean, step-by-step build of the project:
1) **Fetch a list from Common Crawl**  
2) **WARC read (charset-aware)**  
3) **HTML → clean text (Unicode-safe)**  
4) **Quick analysis** (domain counts, length stats)  
5) **Dedup** (exact hash)  
6) **Near-dup** (SimHash first; MinHash/LSH later)  
7) **Export** (JSONL: all + representatives + summary)

> We’ll keep each step small and well-documented. When we fully understand it, we’ll refactor into a VS Code package + CLI.
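
Steps 5 and 6 are the core of the project, so a minimal sketch of both ideas may help before the build starts. It assumes case and whitespace normalization for the exact hash and a 64-bit SimHash over word tokens. The names `exact_key`, `simhash`, and `hamming` are illustrative and not part of this notebook, and MD5 is used only as a convenient token hash.

```python
import hashlib
import re


def exact_key(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text; equal keys mean exact duplicates."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def simhash(text: str, bits: int = 64) -> int:
    """SimHash sketch: every token votes +1/-1 on each bit; keep the bits with a positive sum."""
    votes = [0] * bits
    for tok in re.findall(r"\w+", text.lower()):
        # First 8 bytes of the MD5 digest give a 64-bit token hash (an arbitrary choice).
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


a = "The quick brown fox jumps over the lazy dog"
b = "The quick brown fox jumped over the lazy dog"
print(exact_key(a) == exact_key(b))      # exact hashes differ for any text change
print(hamming(simhash(a), simhash(b)))   # near-duplicates tend to differ in few bits
```

Two pages are exact duplicates when their keys match. Near-duplicates are pages whose fingerprints differ in only a few bits; the cut-off is a tuning choice to settle once real data is available.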


In [5]:
import pandas as pd 
import numpy as np 
import requests 
import gzip 
import io 
from warcio.archiveiterator import ArchiveIterator

In [6]:
# Fetch list of crawls (public API)
colls = requests.get("https://index.commoncrawl.org/collinfo.json", timeout=30).json()


In [7]:
print(type(colls))


<class 'list'>


In [8]:
colls[:2000:2]

[{'id': 'CC-MAIN-2025-33',
  'name': 'August 2025 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2025-33/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2025-33-index',
  'from': '2025-08-02T22:09:07',
  'to': '2025-08-15T23:42:38'},
 {'id': 'CC-MAIN-2025-26',
  'name': 'June 2025 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2025-26/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2025-26-index',
  'from': '2025-06-12T11:28:40',
  'to': '2025-06-25T09:54:11'},
 {'id': 'CC-MAIN-2025-18',
  'name': 'April 2025 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2025-18/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2025-18-index',
  'from': '2025-04-17T13:50:10',
  'to': '2025-05-01T01:05:29'},
 {'id': 'CC-MAIN-2025-08',
  'name': 'February 2025 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2025-08/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2025-08-index',
  'from': '2025-02-06T11:42:25',
  'to': '2025

In [9]:
crawl_id = 'CC-MAIN-2008-2009'
base = f"https://data.commoncrawl.org/crawl-data/{crawl_id}/"

In [10]:
base

'https://data.commoncrawl.org/crawl-data/CC-MAIN-2008-2009/'

In [11]:
# Fetch the list of all WARC file paths for that crawl
r = requests.get(base +'warc.paths.gz',timeout=60, stream=True) 

In [12]:
r

<Response [404]>

- 400 Bad Request: the request itself was malformed
- 401 Unauthorized: authentication is required
- 403 Forbidden: you are not allowed to access the resource
- 404 Not Found: the resource does not exist
- 429 Too Many Requests: you are being rate-limited
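
As a small aid for reading responses like the one above, the standard library already knows the official phrase for each status code. The helper name `describe_status` is hypothetical; this is only a sketch for printing readable diagnostics.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return '<code> <official phrase>', or a note if the code is not a known HTTP status."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f"{code} (unknown status code)"
    return f"{code} {status.phrase}"


for code in (400, 401, 403, 404, 429):
    print(describe_status(code))
```

Calling `describe_status(r.status_code)` on a `requests` response gives a quick human-readable label before deciding whether to retry or change the URL.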

In [13]:
crawl_id = 'CC-MAIN-2025-33'
base = f"https://data.commoncrawl.org/crawl-data/{crawl_id}/"

In [14]:
base

'https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/'

In [15]:
r = requests.get(base + "warc.paths.gz", stream=True, timeout=60)
r

<Response [200]>

- 200 OK: the resource exists and was returned
- 201 Created: a new resource was created
- 202 Accepted: the request was received but is still being processed

In [16]:
gz = gzip.GzipFile(fileobj=io.BytesIO(r.content))

In [17]:
# Collect all WARC file URLs
paths = []
for line in gz:
    rel_path = line.decode('utf-8').strip()
    full_url = "https://data.commoncrawl.org/" + rel_path
    paths.append(full_url)

print("Total WARC URLs stored in paths:", len(paths))
print("First 3:", paths[:3])
print("Last 3:", paths[-3:])

Total WARC URLs stored in paths: 100000
First 3: ['https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00000.warc.gz', 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00001.warc.gz', 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00002.warc.gz']
Last 3: ['https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151579063.98/warc/CC-MAIN-20250815204238-20250815234238-00997.warc.gz', 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151579063.98/warc/CC-MAIN-20250815204238-20250815234238-00998.warc.gz', 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151579063.98/warc/CC-MAIN-20250815204238-20250815234238-00999.warc.gz']


In [18]:
print(len(paths))

100000
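
To experiment with the path-parsing logic without downloading anything, the same steps can be run on an in-memory gzip blob. This is a sketch: `parse_paths_gz` is a hypothetical helper and the two sample paths are made up for illustration.

```python
import gzip
import io

PREFIX = "https://data.commoncrawl.org/"


def parse_paths_gz(blob: bytes, prefix: str = PREFIX) -> list[str]:
    """Decompress a gzipped listing of relative paths and return full URLs."""
    urls = []
    with gzip.GzipFile(fileobj=io.BytesIO(blob)) as gz:
        for line in gz:
            rel_path = line.decode("utf-8").strip()
            if rel_path:  # skip blank lines
                urls.append(prefix + rel_path)
    return urls


# Made-up listing standing in for the real warc.paths.gz content
sample = gzip.compress(
    b"crawl-data/EXAMPLE/segments/0/warc/a.warc.gz\n"
    b"crawl-data/EXAMPLE/segments/0/warc/b.warc.gz\n"
)
print(parse_paths_gz(sample))
```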


### **Next step: check that each URL is reachable before downloading, to avoid wasting time**

In [19]:
url100 = paths[:100]
url100

['https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00000.warc.gz',
 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00001.warc.gz',
 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00002.warc.gz',
 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00003.warc.gz',
 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00004.warc.gz',
 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00005.warc.gz',
 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00006.warc.gz',

In [20]:
summary = []
for u in url100:
    try:
        h = requests.head(u, timeout=(10, 30), allow_redirects=True)  # requests.head
        size = int(h.headers.get("Content-Length", 0))                # Content-Length
        summary.append({
            "url": u,
            "status": h.status_code,
            "size_bytes": size,
            "size_mb": round(size / (1024*1024), 2),
        })
    except Exception as e:
        # add a tiny diagnostic to see the actual error once
        print("HEAD error for:", u, "|", repr(e))
        summary.append({"url": u, "status": "ERR", "size_bytes": 0, "size_mb": 0})
        
k = sum(1 for s in summary if s["status"] == 200)
print(f"Reachable: {k}/{len(summary)}\n")
for s in summary[:5]:  # show a few; you can remove the [:5] to see all
    print(s["status"], f"{s['size_mb']} MB", "→", s["url"])

Reachable: 100/100

200 922.99 MB → https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00000.warc.gz
200 917.84 MB → https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00001.warc.gz
200 911.54 MB → https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00002.warc.gz
200 956.82 MB → https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00003.warc.gz
200 911.15 MB → https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279521.11/warc/CC-MAIN-20250802220907-20250803010907-00004.warc.gz


In [21]:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

retry = Retry(
    total=5,                # up to 5 total retries
    connect=5,              # retries for connect errors
    read=5,                 # retries for read errors
    backoff_factor=1.2,     # exponential backoff: the delay roughly doubles after each consecutive failure
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET", "HEAD"]
)

adapter = HTTPAdapter(max_retries=retry, pool_connections=4, pool_maxsize=4)
session.mount("https://", adapter)
session.headers.update({"User-Agent": "warc-stream/0.1"})
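
To get a feel for how long the retries above may wait, the delays can be estimated by hand. The sketch below assumes urllib3's documented rule: no sleep before the first retry, then `backoff_factor * 2 ** (n - 1)` seconds after `n` consecutive failures, capped at a maximum. Jitter is ignored, and the 120-second cap is an assumed default, so check the urllib3 version in use. `backoff_schedule` is a hypothetical helper, not a urllib3 function.

```python
def backoff_schedule(backoff_factor: float, retries: int, backoff_max: float = 120.0) -> list[float]:
    """Estimate the sleep before each retry under the assumed urllib3 backoff rule."""
    delays = []
    for n in range(1, retries + 1):
        # n = number of consecutive failures so far; the first retry is immediate
        delay = 0.0 if n <= 1 else backoff_factor * (2 ** (n - 1))
        delays.append(min(backoff_max, delay))
    return delays


print(backoff_schedule(1.2, 5))
```

With `backoff_factor=1.2` and five retries, the estimated waits add up to well under a minute, which is a reasonable budget for a file that takes several minutes to stream.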


In [22]:
test_url = paths[0]   # first WARC URL
max_records = 10
seen = 0

with session.get(test_url, stream=True, timeout=(60, 600)) as resp:  # 60s connect, 600s read
    resp.raise_for_status()
    gz = gzip.GzipFile(fileobj=resp.raw)

    for record in ArchiveIterator(gz):
        if record.rec_type != "response":
            continue
        http = record.http_headers
        if not http:
            continue

        url   = record.rec_headers.get_header("WARC-Target-URI")
        code  = http.get_statuscode()
        ctype = (http.get_header("Content-Type") or "").lower()

        print(f"{seen:02d} | status={code} | ctype={ctype}\n   {url}")
        seen += 1
        if seen >= max_records:
            break

00 | status=200 | ctype=text/html; charset=utf-8
   http://0014housingrental.shop/
01 | status=200 | ctype=text/html
   http://010ganji.com/html/yingjianchanpin/chanpinfenleisi/150.html
02 | status=200 | ctype=text/html; charset=utf-8
   http://01dom.ru/sale/prodlenie_aktsii_na_keramicheskie_bloki_porotherm/
03 | status=200 | ctype=text/html
   http://0594jy.com/live/sepak/f4219021.html
04 | status=200 | ctype=text/html
   http://0cpm.org/
05 | status=200 | ctype=text/html; charset=utf-8
   http://100.ubc.ca/ubc-impact/nicole-eredics/
06 | status=200 | ctype=text/html; charset=utf-8
   http://12dim-trikal.tri.sch.gr/2024/05/27/
07 | status=200 | ctype=text/html; charset=big5
   http://1599888.gg33t.com/index.phtml?PUT=a_show&AID=272908&FID=1599888&R2=&CHANNEL=
08 | status=200 | ctype=text/html; charset=big5
   http://170248.hwe2.com/?FID=170248
09 | status=200 | ctype=text/html; charset=big5
   http://170248.hwe2.com/?PUT=a_show&AID=280092&FID=170248&R2=&CHANNEL=


In [23]:
def stream_warc_metadata(url, max_items=50, only_html=True, session=None):
    """
    Stream the WARC at `url` and yield small dicts of metadata for response records.
    - only_html=True → keep Content-Type that looks like text/html
    - max_items → stop after N items to keep runs fast
    """
    sess = session
    if sess is None:
        import requests
        sess = requests.Session()
        sess.headers.update({"User-Agent": "warc-stream/0.1"})

    seen = 0
    with sess.get(url, stream=True, timeout=(60, 600)) as resp:
        resp.raise_for_status()
        gz = gzip.GzipFile(fileobj=resp.raw)
        for rec in ArchiveIterator(gz):
            if rec.rec_type != "response" or not rec.http_headers:
                continue

            ctype = (rec.http_headers.get_header("Content-Type") or "").lower()
            if only_html and "text/html" not in ctype:
                continue

            meta = {
                "warc_target_uri": rec.rec_headers.get_header("WARC-Target-URI"),
                "warc_date":       rec.rec_headers.get_header("WARC-Date"),
                "warc_record_id":  rec.rec_headers.get_header("WARC-Record-ID"),
                "warc_ip":         rec.rec_headers.get_header("WARC-IP-Address"),
                "http_status":     rec.http_headers.get_statuscode(),
                "content_type":    ctype,
                "http_length":     rec.http_headers.get_header("Content-Length"),
            }
            yield meta

            seen += 1
            if seen >= max_items:
                break

In [24]:
# Reuse the resilient session you already built (named `session` in previous step)
test_url = paths[0]  # or any from your first 20
rows = list(stream_warc_metadata(test_url, max_items=20, only_html=True, session=session))

print(f" {len(rows)} metadata rows")
for r in rows[:5]:
    print(r["http_status"], r["content_type"], "→", r["warc_target_uri"])


 20 metadata rows
200 text/html; charset=utf-8 → http://0014housingrental.shop/
200 text/html → http://010ganji.com/html/yingjianchanpin/chanpinfenleisi/150.html
200 text/html; charset=utf-8 → http://01dom.ru/sale/prodlenie_aktsii_na_keramicheskie_bloki_porotherm/
200 text/html → http://0594jy.com/live/sepak/f4219021.html
200 text/html → http://0cpm.org/


In [25]:
from bs4 import BeautifulSoup
import re

def pick_charset(content_type_header: str | None) -> str:
    """Extract charset from Content-Type if present; default to utf-8."""
    if not content_type_header:
        return "utf-8"
    # Example: "text/html; charset=utf-8"
    m = re.search(r"charset=([-_a-zA-Z0-9]+)", content_type_header, flags=re.I)
    return (m.group(1).lower() if m else "utf-8")

def html_to_text(html: str) -> tuple[str, str]:
    """
    Return (title, text) from HTML using BeautifulSoup.
    Very small & safe: remove scripts/styles, collapse whitespace.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Title
    title = (soup.title.string or "").strip() if soup.title and soup.title.string else ""

    # Drop boilerplate
    for t in soup(["script", "style", "noscript"]):
        t.extract()

    # Text
    text = soup.get_text(separator="\n")
    # Normalize whitespace: strip lines, drop empty, collapse multiples
    lines = [ln.strip() for ln in text.splitlines()]
    text = "\n".join([ln for ln in lines if ln])

    return title, text


In [26]:
def extract_html_sample(url, max_items=30, max_bytes=2_000_000, session=None):
    """
    Stream a WARC from `url` and yield up to `max_items` HTML pages with clean text.
    - max_bytes caps how much we read per page (avoid huge PDFs disguised as html).
    """
    sess = session or requests.Session()
    sess.headers.update({"User-Agent": "warc-stream/0.1"})

    seen = 0
    with sess.get(url, stream=True, timeout=(60, 600)) as resp:
        resp.raise_for_status()
        gz = gzip.GzipFile(fileobj=resp.raw)

        for rec in ArchiveIterator(gz):
            if rec.rec_type != "response" or not rec.http_headers:
                continue

            ctype = (rec.http_headers.get_header("Content-Type") or "").lower()
            if "text/html" not in ctype:
                continue

            # Read the HTTP body bytes (cap to avoid surprises)
            stream = rec.content_stream()
            raw = stream.read(max_bytes)

            # Decode: prefer charset from header; fall back to utf-8 for unknown names.
            # With errors="replace", decoding never raises UnicodeDecodeError, so only
            # an unrecognized charset name (LookupError) needs handling.
            charset = pick_charset(ctype)
            try:
                html = raw.decode(charset, errors="replace")
            except LookupError:
                html = raw.decode("utf-8", errors="replace")

            title, text = html_to_text(html)
            item = {
                "url": rec.rec_headers.get_header("WARC-Target-URI"),
                "warc_date": rec.rec_headers.get_header("WARC-Date"),
                "content_type": ctype,
                "title": title,
                "text": text,
                "len_text": len(text),
            }
            yield item

            seen += 1
            if seen >= max_items:
                break
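
The decoding step in `extract_html_sample` can be tried in isolation. The sketch below, with the hypothetical helper `decode_html`, shows the two cases that matter: an unknown charset name raises `LookupError` and triggers the fallback, while invalid bytes are replaced with U+FFFD rather than raising.

```python
def decode_html(raw: bytes, charset: str, fallback: str = "utf-8") -> str:
    """Decode bytes with the declared charset; use `fallback` if the name is unknown."""
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        return raw.decode(fallback, errors="replace")


print(decode_html("café".encode("latin-1"), "latin-1"))  # declared charset is honored
print(decode_html(b"plain ascii", "bogus-charset"))      # unknown name -> fallback codec
print(repr(decode_html(b"\xff", "utf-8")))               # bad byte -> replacement character
```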

In [27]:
# Use the resilient `session` you built earlier.
rows = list(extract_html_sample(paths[0], max_items=5000, session=session))
print(f"{len(rows)} HTML pages.")
for r in rows[:5]:
    print(r["len_text"], r["title"][:60], "→", r["url"])


Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(html, "html.parser")


5000 HTML pages.
0  → http://0014housingrental.shop/
2635 ETF选择困难？易方达基金划分四大类助您轻松投资！_ → http://010ganji.com/html/yingjianchanpin/chanpinfenleisi/150.html
15663 Скидка до 23% на керамические блоки Porotherm → http://01dom.ru/sale/prodlenie_aktsii_na_keramicheskie_bloki_porotherm/
4827 青青草原综合久久,精品人成视频免费国产,色综合久久综合香蕉色老大 → http://0594jy.com/live/sepak/f4219021.html
52  → http://0cpm.org/


In [28]:
df = pd.DataFrame(rows)  
print("Columns:", df.columns.tolist())
df[["len_text", "title", "url"]].head(100)


Columns: ['url', 'warc_date', 'content_type', 'title', 'text', 'len_text']


Unnamed: 0,len_text,title,url
0,0,,http://0014housingrental.shop/
1,2635,ETF选择困难？易方达基金划分四大类助您轻松投资！_,http://010ganji.com/html/yingjianchanpin/chanp...
2,15663,Скидка до 23% на керамические блоки Porotherm,http://01dom.ru/sale/prodlenie_aktsii_na_keram...
3,4827,"青青草原综合久久,精品人成视频免费国产,色综合久久综合香蕉色老大",http://0594jy.com/live/sepak/f4219021.html
4,52,,http://0cpm.org/
...,...,...,...
95,9,CONTENTdm,http://archives.csuchico.edu/digital/collectio...
96,9,CONTENTdm,http://archives.csuchico.edu/digital/collectio...
97,18256,Collection browser,http://archives.rcpe.ac.uk/CalmView/TreeBrowse...
98,3075,ACCADEMIA NAZIONALE DEI LINCEI,http://archivi.lincei.it/index.php/information...


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   url           5000 non-null   object
 1   warc_date     5000 non-null   object
 2   content_type  5000 non-null   object
 3   title         5000 non-null   object
 4   text          5000 non-null   object
 5   len_text      5000 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 234.5+ KB


In [30]:
print("Total pages:", len(df))
print("Zero-length pages:", (df["len_text"] == 0).sum())
print("Short (<200 chars):", (df["len_text"] < 200).sum())
print("Median length:", int(df["len_text"].median()))


Total pages: 5000
Zero-length pages: 119
Short (<200 chars): 378
Median length: 2908
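
The overview also lists domain counts as part of the quick analysis. A minimal sketch with the standard library follows; it counts full hostnames rather than registered domains, and `domain_counts` is a hypothetical helper. On the real data it could be fed `df["url"]`.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_counts(urls) -> Counter:
    """Count pages per hostname (subdomains are counted separately)."""
    hosts = (urlsplit(u).hostname or "" for u in urls)
    return Counter(h for h in hosts if h)


sample_urls = [
    "http://170248.hwe2.com/?FID=170248",
    "http://170248.hwe2.com/?PUT=a_show&AID=280092&FID=170248&R2=&CHANNEL=",
    "http://0cpm.org/",
]
print(domain_counts(sample_urls).most_common(2))
```

Hostnames that dominate the counts are good candidates for templated or near-duplicate pages later in the pipeline.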


In [132]:
df['content_type'].value_counts()

content_type
text/html; charset=utf-8                           3896
text/html                                           678
text/html;charset=utf-8                             256
text/html; charset=iso-8859-1                        41
text/html; charset=windows-1251                      30
text/html; charset=big5                              14
text/html; charset=shift_jis                         13
text/html;charset=iso-8859-1                         12
text/html; charset=gbk                                6
text/html; charset=euc-jp                             6
text/html; charset=utf8                               5
text/html; charset=euc-kr                             5
text/html;charset=gbk                                 4
text/html; charset=windows-1250                       3
text/html; charset=gb2312                             3
text/html; charset=cp1251                             3
text/html; charset=utf-8;                             3
text/html; charset=windows-1252    

In [31]:
MIN_LEN = 200  # start with 200 minimum text length
df_filt = df[df["len_text"] >= MIN_LEN].copy()

print("length after filter:", len(df_filt), "/", len(df))
df_filt[["len_text", "title", "url"]].head(5)


length after filter: 4622 / 5000


Unnamed: 0,len_text,title,url
1,2635,ETF选择困难？易方达基金划分四大类助您轻松投资！_,http://010ganji.com/html/yingjianchanpin/chanp...
2,15663,Скидка до 23% на керамические блоки Porotherm,http://01dom.ru/sale/prodlenie_aktsii_na_keram...
3,4827,"青青草原综合久久,精品人成视频免费国产,色综合久久综合香蕉色老大",http://0594jy.com/live/sepak/f4219021.html
5,2000,Nicole Eredics - UBC Centennial,http://100.ubc.ca/ubc-impact/nicole-eredics/
6,3197,27 Μαΐου 2024 – 12o Δημοτικό Σχολείο Τρικάλων,http://12dim-trikal.tri.sch.gr/2024/05/27/


In [32]:
df.tail()

Unnamed: 0,url,warc_date,content_type,title,text,len_text
4995,https://dbgweb.com/catalog/product/kontron-269...,2025-08-02T22:53:42Z,text/html; charset=utf-8,KONTRON 26971 » Digital Brothers Group | CATALOG,KONTRON 26971 » Digital Brothers Group | CATAL...,1491
4996,https://dbhs.k12k.com/apps/events/2024/5/22/16...,2025-08-02T23:18:27Z,text/html;charset=utf-8,Final Exams (see schedule attached) | Dobyns-B...,Final Exams (see schedule attached) | Dobyns-B...,3401
4997,https://dblp.dagstuhl.de/pid/48/2402.html,2025-08-02T22:25:12Z,text/html; charset=utf-8,dblp: Domine Leenaerts,dblp: Domine Leenaerts\ndblp\nBlog\nStatistics...,60056
4998,https://dbnaked.com/models/shemale/B/Bon,2025-08-02T23:48:19Z,text/html; charset=utf-8,Bon - shemale porn star bio and photos @ dbNaked,Bon - shemale porn star bio and photos @ dbNak...,1727
4999,https://dbpedia.org/page/Category:Streets_in_P...,2025-08-03T00:02:37Z,text/html; charset=utf-8,"About: Streets in Plymouth, Devon","About: Streets in Plymouth, Devon\nBrowse usin...",1177


In [33]:
df.isna().sum()

url             0
warc_date       0
content_type    0
title           0
text            0
len_text        0
dtype: int64

In [34]:
df_test = df.copy()

In [35]:
import os
# Tell Transformers to ignore TensorFlow entirely
os.environ["TRANSFORMERS_NO_TF"] = "1"
# Optional: silence the Windows symlink warning
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"


In [36]:
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
pipe = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
    device=device,
    framework="pt",        # force PyTorch
)


  from .autonotebook import tqdm as notebook_tqdm
Device set to use cpu


In [37]:
text_series = df_test["text"].astype(str).str.strip()
title_series = df_test["title"].astype(str).str.strip()
inputs_series = text_series.where(text_series.ne(""), title_series)  # element-wise fallback

In [38]:
text_series

0                                                        
1       ETF选择困难？易方达基金划分四大类助您轻松投资！_\n首页\nw88win优德\n产品分类...
2       Скидка до 23% на керамические блоки Porotherm\...
3       青青草原综合久久,精品人成视频免费国产,色综合久久综合香蕉色老大\n91精品国产自产在线观看...
4       Redirection...\nYou should be\ntaken here\nimm...
                              ...                        
4995    KONTRON 26971 » Digital Brothers Group | CATAL...
4996    Final Exams (see schedule attached) | Dobyns-B...
4997    dblp: Domine Leenaerts\ndblp\nBlog\nStatistics...
4998    Bon - shemale porn star bio and photos @ dbNak...
4999    About: Streets in Plymouth, Devon\nBrowse usin...
Name: text, Length: 5000, dtype: object

In [39]:
title_series

0                                                        
1                              ETF选择困难？易方达基金划分四大类助您轻松投资！_
2           Скидка до 23% на керамические блоки Porotherm
3                        青青草原综合久久,精品人成视频免费国产,色综合久久综合香蕉色老大
4                                                        
                              ...                        
4995     KONTRON 26971 » Digital Brothers Group | CATALOG
4996    Final Exams (see schedule attached) | Dobyns-B...
4997                               dblp: Domine Leenaerts
4998     Bon - shemale porn star bio and photos @ dbNaked
4999                    About: Streets in Plymouth, Devon
Name: title, Length: 5000, dtype: object

In [40]:
inputs_series

0                                                        
1       ETF选择困难？易方达基金划分四大类助您轻松投资！_\n首页\nw88win优德\n产品分类...
2       Скидка до 23% на керамические блоки Porotherm\...
3       青青草原综合久久,精品人成视频免费国产,色综合久久综合香蕉色老大\n91精品国产自产在线观看...
4       Redirection...\nYou should be\ntaken here\nimm...
                              ...                        
4995    KONTRON 26971 » Digital Brothers Group | CATAL...
4996    Final Exams (see schedule attached) | Dobyns-B...
4997    dblp: Domine Leenaerts\ndblp\nBlog\nStatistics...
4998    Bon - shemale porn star bio and photos @ dbNak...
4999    About: Streets in Plymouth, Devon\nBrowse usin...
Name: text, Length: 5000, dtype: object

In [41]:
mask = inputs_series.ne("")


In [42]:
mask.nunique()

2

In [43]:
to_classify = inputs_series[mask].str.slice(0, 1000).tolist()  # truncate each text to 1,000 chars (rows are kept) to speed up inference


In [44]:
len(to_classify)

4881

In [45]:
df_test["languges"] = "unk"  # default for truly empty rows

if len(to_classify) > 0:
    preds = pipe(to_classify, top_k=1, truncation=True, batch_size=32)
    labels = [p[0]["label"] for p in preds]
    df_test.loc[mask, "languges"] = labels

In [47]:
print(df_test["languges"].value_counts())


languges
en     1821
zh      719
ru      392
ja      268
es      227
de      220
fr      189
pl      154
pt      132
ur      120
unk     119
hi      115
tr      101
nl       92
bg       88
it       63
vi       55
ar       41
sw       35
th       27
el       22
Name: count, dtype: int64


In [48]:
print(df_test[["url","title","languges"]].head(10))

                                                 url  \
0                     http://0014housingrental.shop/   
1  http://010ganji.com/html/yingjianchanpin/chanp...   
2  http://01dom.ru/sale/prodlenie_aktsii_na_keram...   
3         http://0594jy.com/live/sepak/f4219021.html   
4                                   http://0cpm.org/   
5       http://100.ubc.ca/ubc-impact/nicole-eredics/   
6         http://12dim-trikal.tri.sch.gr/2024/05/27/   
7  http://1599888.gg33t.com/index.phtml?PUT=a_sho...   
8                 http://170248.hwe2.com/?FID=170248   
9  http://170248.hwe2.com/?PUT=a_show&AID=280092&...   

                                           title languges  
0                                                     unk  
1                     ETF选择困难？易方达基金划分四大类助您轻松投资！_       zh  
2  Скидка до 23% на керамические блоки Porotherm       ru  
3               青青草原综合久久,精品人成视频免费国产,色综合久久综合香蕉色老大       zh  
4                                                      en  
5                Nicole

In [49]:
print(df_test)

                                                    url             warc_date  \
0                        http://0014housingrental.shop/  2025-08-02T23:15:49Z   
1     http://010ganji.com/html/yingjianchanpin/chanp...  2025-08-02T23:06:24Z   
2     http://01dom.ru/sale/prodlenie_aktsii_na_keram...  2025-08-02T22:29:13Z   
3            http://0594jy.com/live/sepak/f4219021.html  2025-08-02T23:18:39Z   
4                                      http://0cpm.org/  2025-08-02T23:57:45Z   
...                                                 ...                   ...   
4995  https://dbgweb.com/catalog/product/kontron-269...  2025-08-02T22:53:42Z   
4996  https://dbhs.k12k.com/apps/events/2024/5/22/16...  2025-08-02T23:18:27Z   
4997          https://dblp.dagstuhl.de/pid/48/2402.html  2025-08-02T22:25:12Z   
4998           https://dbnaked.com/models/shemale/B/Bon  2025-08-02T23:48:19Z   
4999  https://dbpedia.org/page/Category:Streets_in_P...  2025-08-03T00:02:37Z   

                  content_t

In [50]:
df_en = df_test[df_test["languges"].astype(str).str.lower() == "en"].copy()
print("Total:", len(df_test), "| English only:", len(df_en))


Total: 5000 | English only: 1821


In [51]:
text_series  = df_en["text"].astype(str).str.strip()
title_series = df_en["title"].astype(str).str.strip()
df_en["text_src"] = text_series.where(text_series.ne(""), title_series)

In [52]:
df_en['text_src']

4       Redirection...\nYou should be\ntaken here\nimm...
5       Nicole Eredics - UBC Centennial\nSkip to main ...
20      The Portal: The Alliance Fleet\nThe Portal\nTh...
24      Ashop Marketplace\nSearch\nAdvanced Search\nAE...
33      Sticky Notes Set in PP Box Archives - High Qua...
                              ...                        
4995    KONTRON 26971 » Digital Brothers Group | CATAL...
4996    Final Exams (see schedule attached) | Dobyns-B...
4997    dblp: Domine Leenaerts\ndblp\nBlog\nStatistics...
4998    Bon - shemale porn star bio and photos @ dbNak...
4999    About: Streets in Plymouth, Devon\nBrowse usin...
Name: text_src, Length: 1821, dtype: object

In [55]:
 !pip install -q beautifulsoup4 unidecode


### Text Cleaning Function

The `clean_text` function preprocesses raw text by:
- Removing HTML tags, URLs, and email addresses  
- Normalizing Unicode and converting accented characters to ASCII  
- Lowercasing all text  
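- Replacing numbers with a `<num>` placeholder (the symbol filter that runs next strips the angle brackets, so the output contains plain `num`)  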
- Removing unwanted symbols and extra spaces  

This produces clean and standardized text for NLP or machine learning tasks.
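
The order of the regex steps matters for the number placeholder: a symbol filter that does not allow `<` and `>` will reduce `<num>` to `num`. The sketch below isolates just those steps to show the effect; `replace_numbers` and its `keep_brackets` flag are hypothetical and only illustrate one way to preserve the brackets if that is preferred.

```python
import re


def replace_numbers(s: str, keep_brackets: bool = False) -> str:
    """Lowercase, replace digit runs with <num>, then filter symbols and collapse spaces."""
    s = re.sub(r"\d+", " <num> ", s.lower())
    if keep_brackets:
        pattern = r"[^a-z0-9'_/.,!?;:()<> -]+"   # whitelist also allows < and >
    else:
        pattern = r"[^a-z0-9'_/.,!?;:() -]+"     # angle brackets are stripped
    s = re.sub(pattern, " ", s)
    return re.sub(r"\s+", " ", s).strip()


print(replace_numbers("July 2024 race"))
print(replace_numbers("July 2024 race", keep_brackets=True))
```

Either form works for deduplication as long as it is applied consistently; the bracketed form simply keeps the placeholder distinguishable from the ordinary word "num".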


In [56]:
import re, html, unicodedata
from bs4 import BeautifulSoup
from unidecode import unidecode

_url_pat   = re.compile(r'https?://\S+|www\.\S+')
_email_pat = re.compile(r'\b\S+@\S+\.\S+\b')

def clean_text(s: str) -> str:
    if not isinstance(s, str) or not s.strip():
        return ""
    s = BeautifulSoup(s, "html.parser").get_text(" ", strip=True)
    s = html.unescape(s)
    s = unicodedata.normalize("NFKC", s)
    s = unidecode(s)
    s = _url_pat.sub(" ", s)
    s = _email_pat.sub(" ", s)
    s = s.lower()
    s = re.sub(r'\d+', ' <num> ', s)
    s = re.sub(r"[^a-z0-9'_/.,!?;:() -]+", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s


In [57]:
df_en["text_clean"]  = df_en["text_src"].apply(clean_text)


In [60]:
df_en["text_clean"].head(10)

4     redirection... you should be taken here immedi...
5     nicole eredics - ubc centennial skip to main c...
20    the portal: the alliance fleet the portal the ...
24    ashop marketplace search advanced search aed a...
33    sticky notes set in pp box archives - high qua...
42    agenda san diego wireless summit skip to main ...
53    july num a num race management skip to content...
55    about scotland bed and breakfast, self caterin...
56    one moment, please... loader please wait while...
59    a komorbid szorongasos es depresszios zavarok ...
Name: text_clean, dtype: object

In [61]:
df_en["title_clean"] = df_en["title"].astype(str).apply(clean_text)


In [62]:
df_en["title_clean"].head(10)

4                                                      
5                       nicole eredics - ubc centennial
20                       the portal: the alliance fleet
24                                    ashop marketplace
33    sticky notes set in pp box archives - high qua...
42                     agenda san diego wireless summit
53                       july num a num race management
55    about scotland bed and breakfast, self caterin...
56                                one moment, please...
59    a komorbid szorongasos es depresszios zavarok ...
Name: title_clean, dtype: object

In [63]:
df_en["len_clean"]   = df_en["text_clean"].str.len()


In [64]:
df_en["len_clean"]

4          52
5        1961
20      73806
24       2167
33       1604
        ...  
4995     1580
4996     3489
4997    56702
4998     1814
4999     1149
Name: len_clean, Length: 1821, dtype: int64

In [None]:
# df_en.reset_index()

In [66]:
df_en["len_clean"].median()

np.float64(3016.0)

In [67]:
print("Empty cleaned texts:", (df_en["len_clean"]==0).sum())


Empty cleaned texts: 0


In [73]:
df_en.head()

Unnamed: 0,url,warc_date,content_type,title,text,len_text,languges,text_src,text_clean,title_clean,len_clean
4,http://0cpm.org/,2025-08-02T23:57:45Z,text/html,,Redirection...\nYou should be\ntaken here\nimm...,52,en,Redirection...\nYou should be\ntaken here\nimm...,redirection... you should be taken here immedi...,,52
5,http://100.ubc.ca/ubc-impact/nicole-eredics/,2025-08-02T23:25:21Z,text/html; charset=utf-8,Nicole Eredics - UBC Centennial,Nicole Eredics - UBC Centennial\nSkip to main ...,2000,en,Nicole Eredics - UBC Centennial\nSkip to main ...,nicole eredics - ubc centennial skip to main c...,nicole eredics - ubc centennial,1961
20,http://2012portal.blogspot.com/2015/03/the-all...,2025-08-02T23:31:29Z,text/html; charset=utf-8,The Portal: The Alliance Fleet,The Portal: The Alliance Fleet\nThe Portal\nTh...,76132,en,The Portal: The Alliance Fleet\nThe Portal\nTh...,the portal: the alliance fleet the portal the ...,the portal: the alliance fleet,73806
24,http://336-166316.shop033.com/MarketPlace/Merc...,2025-08-02T23:17:25Z,text/html; charset=utf-8,Ashop Marketplace,Ashop Marketplace\nSearch\nAdvanced Search\nAE...,2201,en,Ashop Marketplace\nSearch\nAdvanced Search\nAE...,ashop marketplace search advanced search aed a...,ashop marketplace,2167
33,http://4ausa.com/product-category/sticky-notes...,2025-08-03T00:05:19Z,text/html; charset=utf-8,Sticky Notes Set in PP Box Archives - High Qua...,Sticky Notes Set in PP Box Archives - High Qua...,1569,en,Sticky Notes Set in PP Box Archives - High Qua...,sticky notes set in pp box archives - high qua...,sticky notes set in pp box archives - high qua...,1604


In [72]:
df_en['languges'].value_counts()

languges
en    1821
Name: count, dtype: int64

In [75]:
df_en.isna().sum()

url             0
warc_date       0
content_type    0
title           0
text            0
len_text        0
languges        0
text_src        0
text_clean      0
title_clean     0
len_clean       0
dtype: int64

In [79]:
import subprocess
import sys

try:
    import spacy
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "spacy"])
    import spacy

MODEL = "en_core_web_sm"
try:
    nlp = spacy.load(MODEL, disable=["parser","ner","textcat"])
except OSError:
    subprocess.check_call([sys.executable, "-m", "spacy", "download", MODEL])
    nlp = spacy.load(MODEL, disable=["parser","ner","textcat"])

stop = nlp.Defaults.stop_words

### Lemmatization with spaCy (NLP preprocessing)

This step normalizes the cleaned text into word-level features by:

- **Truncating** very long documents to 300k chars (avoids spaCy memory errors)
- **Batch-processing** with `nlp.pipe` for speed
- **Tokenizing & lemmatizing** (e.g., “running” → “run”), lowercasing, and removing punctuation, stopwords, and 1-char tokens
- **Saving outputs** to `tokens`, `lemmas`, and `len_tokens`, then printing quick sanity stats and a small preview

In [81]:
# Truncate very long docs before passing them to spaCy
MAX_CHARS_FOR_SPACY = 300_000       # tunable; this cap fit comfortably in RAM on the machine used here
texts = df_en["text_clean"].astype(str).tolist()
texts_trimmed = [t if len(t) <= MAX_CHARS_FOR_SPACY else t[:MAX_CHARS_FOR_SPACY] for t in texts]

# Batched lemmatization (normalizes word forms, e.g. "running" -> "run")
tokens_col, lemmas_col = [], []
for doc in nlp.pipe(texts_trimmed, batch_size=256):
    toks = [t.text.lower()   for t in doc if not (t.is_space or t.is_punct)]
    lems = [t.lemma_.lower() for t in doc if not (t.is_space or t.is_punct)]
    toks = [t for t in toks if len(t) > 1 and t not in stop]
    lems = [l for l in lems if len(l) > 1 and l not in stop]
    tokens_col.append(toks)
    lemmas_col.append(lems)

df_en["tokens"] = tokens_col
df_en["lemmas"] = lemmas_col
df_en["len_tokens"] = df_en["tokens"].apply(len)

print(
    "spaCy lemmatization done ✓",
    "\nDocs:", len(df_en),
    "\nTrimmed docs (>MAX_CHARS_FOR_SPACY):", sum(int(len(t) > MAX_CHARS_FOR_SPACY) for t in texts),
    "\nEmpty token lists:", int((df_en['len_tokens'] == 0).sum()),
    "\nMedian tokens/doc:", int(df_en['len_tokens'].median()),
)
display(df_en[["title_clean","tokens","lemmas"]].head(5))

spaCy lemmatization done ✓ 
Docs: 1821 
Trimmed docs (>MAX_CHARS_FOR_SPACY): 4 
Empty token lists: 0 
Median tokens/doc: 340


Unnamed: 0,title_clean,tokens,lemmas
4,,"[redirection, taken, immediately]","[redirection, immediately]"
5,nicole eredics - ubc centennial,"[nicole, eredics, ubc, centennial, skip, main,...","[nicole, eredic, ubc, centennial, skip, main, ..."
20,the portal: the alliance fleet,"[portal, alliance, fleet, portal, intelligence...","[portal, alliance, fleet, portal, intelligence..."
24,ashop marketplace,"[ashop, marketplace, search, advanced, search,...","[ashop, marketplace, search, advanced, search,..."
33,sticky notes set in pp box archives - high qua...,"[sticky, notes, set, pp, box, archives, high, ...","[sticky, note, set, pp, box, archive, high, qu..."


### **Result:** clean, normalized word features that improve similarity, deduplication, and clustering.


In [83]:
df_en['tokens'].isna().sum()

np.int64(0)

### TF-IDF Feature Extraction

This step converts the lemmatized text into numeric vectors for similarity and clustering:

- **Joins lemmas** back into strings to form the `corpus`
- Builds a **TF-IDF** vectorizer with:
  - `max_features=50_000` (cap vocabulary size)
  - `min_df=3` (ignore ultra-rare terms)
  - `ngram_range=(1, 2)` (unigrams + bigrams)
- **Fits & transforms** the corpus to produce:
  - `X_en` — sparse TF-IDF matrix `[n_docs × vocab]`
  - `terms` — the ordered vocabulary list (feature names)
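
To make the effect of `min_df` and `ngram_range` concrete, here is a minimal pure-Python sketch that mimics only the vocabulary-selection part of a TF-IDF vectorizer. The toy corpus and the helper names `ngrams` and `build_vocab` are illustrative assumptions; they are not part of scikit-learn or of this notebook's pipeline, and the sketch skips weighting entirely.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Yield space-joined n-grams for every n in [n_min, n_max]."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def build_vocab(docs, min_df=2, ngram_range=(1, 2)):
    """Keep terms that occur in at least `min_df` documents (document frequency)."""
    df = Counter()
    for doc in docs:
        # set(): a term counts once per document, however often it repeats
        df.update(set(ngrams(doc.split(), *ngram_range)))
    return sorted(t for t, c in df.items() if c >= min_df)

# Made-up documents, already lemmatized and space-joined
toy_docs = [
    "redirect notice page",
    "redirect notice link",
    "sticky note set",
]
vocab = build_vocab(toy_docs, min_df=2)
print(vocab)
```

Terms seen in only one document (including every bigram of the third document) are dropped, which is how `min_df=3` keeps one-off noise out of the real vocabulary.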



In [92]:
from sklearn.feature_extraction.text import TfidfVectorizer

def join_tokens(xs): 
    return " ".join(xs) if isinstance(xs, list) else str(xs)

corpus = df_en["lemmas"].apply(join_tokens).tolist()

TFIDF_MAX_FEATURES = 50_000
TFIDF_MIN_DF = 3
TFIDF_NGRAM_RANGE = (1, 2)  # unigrams + bigrams

tfidf = TfidfVectorizer(
    max_features=TFIDF_MAX_FEATURES,
    min_df=TFIDF_MIN_DF,
    ngram_range=TFIDF_NGRAM_RANGE,
)
X_en = tfidf.fit_transform(corpus)  # sparse matrix [n_docs x vocab]
terms = tfidf.get_feature_names_out()


In [93]:
X_en

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 525248 stored elements and shape (1821, 43697)>

### **Result:** compact content fingerprints used later for near-duplicate reranking (cosine) and topic clustering.


In [128]:
terms

array(['aa', 'aa num', 'aac', ..., 'zx', 'zydeco', 'zydeco caribbean'],
      shape=(43697,), dtype=object)

### Near-Duplicate Candidates (MinHash-LSH)

This step finds **high-recall candidate pairs** before precise scoring:

- **Shingle** each document's lemmas into 5-token shingles (`K=5`).
- Build **MinHash signatures** (`NUM_PERM=128`) over shingles.
- Insert into an **LSH index** with Jaccard threshold `0.85` (`JACCARD_T`) to retrieve likely matches quickly.
- For each doc, **query LSH** and keep pairs with estimated **Jaccard ≥ 0.85**.
- Output **`cand_pairs_df`** sorted by similarity with columns:
  - `i`, `j` — document indices
  - `minhash_jaccard_est` — MinHash Jaccard estimate

**Purpose:** fast, scalable candidate generation to be **reranked with TF-IDF cosine** in the next step.
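
MinHash approximates the Jaccard similarity between two shingle sets. As a reference point, the sketch below computes the exact Jaccard value on two made-up token lists. The helpers `shingle_set` and `jaccard` are hypothetical names for illustration; the pipeline itself relies on `datasketch` estimates rather than exact set arithmetic.

```python
def shingle_set(tokens, k=5):
    """Set of k-token shingles; documents shorter than k yield one shingle."""
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Two toy documents that differ only in their last token
doc_a = "one moment please wait while your request is verified".split()
doc_b = "one moment please wait while your request is checked".split()

sim = jaccard(shingle_set(doc_a), shingle_set(doc_b))
print(round(sim, 3))
```

A single changed token affects up to `k` shingles, so short pages tend to clear a 0.85 threshold only when they are almost identical.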


In [94]:
K = 5                 # token shingle length
NUM_PERM = 128       # number of MinHash permutations
JACCARD_T = 0.85     # LSH Jaccard threshold for candidate recall


In [96]:
import subprocess
import sys

try:
    from datasketch import MinHash, MinHashLSH
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "datasketch"])
    from datasketch import MinHash, MinHashLSH

In [97]:
def token_shingles(tokens, k=K):
    tokens = tokens if isinstance(tokens, list) else []
    if len(tokens) < k:
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i+k]) for i in range(len(tokens) - k + 1)]

# Build MinHash signatures
sigs = []
for toks in df_en["lemmas"]:
    m = MinHash(num_perm=NUM_PERM)
    for sh in token_shingles(toks):
        m.update(sh.encode("utf-8"))
    sigs.append(m)

In [98]:
lsh = MinHashLSH(threshold=JACCARD_T, num_perm=NUM_PERM)
for i, m in enumerate(sigs):
    lsh.insert(str(i), m)

# Query candidates and keep pairs with estimated Jaccard >= threshold
cand_pairs = []
for i, m in enumerate(sigs):
    for j_str in lsh.query(m):
        j = int(j_str)
        if j > i:
            jac = float(m.jaccard(sigs[j]))
            if jac >= JACCARD_T:
                cand_pairs.append((i, j, jac))

# Sort & stash for the next step (TF-IDF cosine rerank)
cand_pairs = sorted(cand_pairs, key=lambda t: (-t[2], t[0], t[1]))
cand_pairs_df = pd.DataFrame(cand_pairs, columns=["i", "j", "minhash_jaccard_est"])

print(f"Stage-1 candidates (k={K}, perms={NUM_PERM}, thr={JACCARD_T}): {len(cand_pairs_df)} pairs")
display(cand_pairs_df.head(10))

Stage-1 candidates (k=5, perms=128, thr=0.85): 1846 pairs


Unnamed: 0,i,j,minhash_jaccard_est
0,8,75,1.0
1,8,111,1.0
2,8,166,1.0
3,8,167,1.0
4,8,190,1.0
5,8,241,1.0
6,8,244,1.0
7,8,292,1.0
8,8,303,1.0
9,8,316,1.0


### Rerank with TF-IDF Cosine

This step filters the MinHash-LSH candidates down to **high-precision near-duplicates**:

- **Precomputes vector norms** of the TF-IDF matrix (`norms = ||X_en[i]||`) for fast cosine.
- Defines a **cosine threshold** `NEAR_TFIDF_SIM = 0.92`.
- Implements `cosine_ij(i, j)` that:
  - Gets sparse rows `vi`, `vj`
  - Computes dot product `vi.multiply(vj).sum()`
  - Divides by `||vi||·||vj||` to get cosine similarity
- **Loops over candidate pairs** from `cand_pairs_df`, computes cosine, and **keeps** only those with `cosine ≥ 0.92`.
- Adds `tfidf_cosine` to `cand_pairs_df` (for inspection) and builds **`near_pairs_df`** with:
  - `i`, `j` — doc indices
  - `tfidf_cosine` — precise similarity
  - `minhash_jaccard_est` — candidate-stage estimate
- Prints quick stats (pairs kept, unique docs involved) and shows a preview.



In [100]:
# Precompute TF-IDF vector norms for fast cosine
# norms[i] = ||X_en[i]||
norms = np.sqrt(X_en.power(2).sum(axis=1)).A1  # .A1 -> 1D array
norms

array([1., 1., 1., ..., 1., 1., 1.], shape=(1821,))
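
Every norm is 1 because `TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`). For unit-length vectors, cosine similarity reduces to a plain dot product, so the division by the norms is a safeguard rather than a necessity here. The following pure-Python sketch illustrates this on two made-up vectors; the helper names are assumptions for illustration only.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity with an explicit norm division."""
    den = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / den if den > 0 else 0.0

u = l2_normalize([3.0, 0.0, 4.0])
v = l2_normalize([3.0, 1.0, 4.0])

# For unit vectors the two numbers agree up to floating-point error
print(round(cosine(u, v), 6), round(dot(u, v), 6))
```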

In [101]:
NEAR_TFIDF_SIM = 0.92  # keep pairs with cosine >= 0.92

def cosine_ij(i: int, j: int) -> float:
    vi = X_en.getrow(i)
    vj = X_en.getrow(j)
    num = vi.multiply(vj).sum()
    den = norms[i] * norms[j]
    return float(num / den) if den > 0 else 0.0

# Compute cosine for each candidate and filter
cos_vals = []
keep_i, keep_j, keep_cos, keep_jac = [], [], [], []

for (i, j, jac) in cand_pairs_df[["i", "j", "minhash_jaccard_est"]].itertuples(index=False):
    c = cosine_ij(i, j)
    cos_vals.append(c)
    if c >= NEAR_TFIDF_SIM:
        keep_i.append(int(i))
        keep_j.append(int(j))
        keep_cos.append(float(c))
        keep_jac.append(float(jac))

# Add cosine column to the candidate frame (for inspection)
cand_pairs_df = cand_pairs_df.copy()
cand_pairs_df["tfidf_cosine"] = cos_vals

# Strong near-duplicate pairs after rerank
near_pairs_df = pd.DataFrame({
    "i": keep_i,
    "j": keep_j,
    "tfidf_cosine": keep_cos,
    "minhash_jaccard_est": keep_jac,
}).sort_values(["tfidf_cosine","minhash_jaccard_est"], ascending=False, ignore_index=True)

# Quick stats
uniq_docs = pd.unique(near_pairs_df[["i","j"]].values.ravel("K")).size if len(near_pairs_df) else 0
print(f"Stage-2 rerank done. Kept {len(near_pairs_df)} pairs (cosine >= {NEAR_TFIDF_SIM}).")
print(f"Unique docs involved: {uniq_docs} / {X_en.shape[0]}")

# Preview
display(near_pairs_df.head(10))

Stage-2 rerank done. Kept 1845 pairs (cosine >= 0.92).
Unique docs involved: 144 / 1821


Unnamed: 0,i,j,tfidf_cosine,minhash_jaccard_est
0,1568,1569,1.0,1.0
1,1568,1570,1.0,1.0
2,1568,1571,1.0,1.0
3,1568,1572,1.0,1.0
4,1568,1573,1.0,1.0
5,1568,1574,1.0,1.0
6,1569,1570,1.0,1.0
7,1569,1571,1.0,1.0
8,1569,1572,1.0,1.0
9,1569,1573,1.0,1.0


### **Result:** a precise set of near-duplicate pairs ready for **graph grouping** (connected components).


In [106]:
n = len(df_en)
df_en = df_en.reset_index(drop=True)
df_en.head()

Unnamed: 0,url,warc_date,content_type,title,text,len_text,languges,text_src,text_clean,title_clean,len_clean,tokens,lemmas,len_tokens
0,http://0cpm.org/,2025-08-02T23:57:45Z,text/html,,Redirection...\nYou should be\ntaken here\nimm...,52,en,Redirection...\nYou should be\ntaken here\nimm...,redirection... you should be taken here immedi...,,52,"[redirection, taken, immediately]","[redirection, immediately]",3
1,http://100.ubc.ca/ubc-impact/nicole-eredics/,2025-08-02T23:25:21Z,text/html; charset=utf-8,Nicole Eredics - UBC Centennial,Nicole Eredics - UBC Centennial\nSkip to main ...,2000,en,Nicole Eredics - UBC Centennial\nSkip to main ...,nicole eredics - ubc centennial skip to main c...,nicole eredics - ubc centennial,1961,"[nicole, eredics, ubc, centennial, skip, main,...","[nicole, eredic, ubc, centennial, skip, main, ...",244
2,http://2012portal.blogspot.com/2015/03/the-all...,2025-08-02T23:31:29Z,text/html; charset=utf-8,The Portal: The Alliance Fleet,The Portal: The Alliance Fleet\nThe Portal\nTh...,76132,en,The Portal: The Alliance Fleet\nThe Portal\nTh...,the portal: the alliance fleet the portal the ...,the portal: the alliance fleet,73806,"[portal, alliance, fleet, portal, intelligence...","[portal, alliance, fleet, portal, intelligence...",6763
3,http://336-166316.shop033.com/MarketPlace/Merc...,2025-08-02T23:17:25Z,text/html; charset=utf-8,Ashop Marketplace,Ashop Marketplace\nSearch\nAdvanced Search\nAE...,2201,en,Ashop Marketplace\nSearch\nAdvanced Search\nAE...,ashop marketplace search advanced search aed a...,ashop marketplace,2167,"[ashop, marketplace, search, advanced, search,...","[ashop, marketplace, search, advanced, search,...",285
4,http://4ausa.com/product-category/sticky-notes...,2025-08-03T00:05:19Z,text/html; charset=utf-8,Sticky Notes Set in PP Box Archives - High Qua...,Sticky Notes Set in PP Box Archives - High Qua...,1569,en,Sticky Notes Set in PP Box Archives - High Qua...,sticky notes set in pp box archives - high qua...,sticky notes set in pp box archives - high qua...,1604,"[sticky, notes, set, pp, box, archives, high, ...","[sticky, note, set, pp, box, archive, high, qu...",228


### Graph Grouping & Canonical Selection (Union-Find)

This step turns strong near-duplicate pairs into **groups** and marks a **canonical** doc per group.

- **Build similarity graph:** each pair `(i, j)` from `near_pairs_df` is an edge.
- **Union-Find:** `find/union` merges connected docs into the same component.
- **Groups:** collect members per root; keep only groups with **size ≥ 2** as near-duplicates.
- **Canonical policy:** pick the member with the **longest `len_clean`**; tie-break by **HTTPS** URL.
- **Annotate `df_en`:**
  - `dup_group` — connected-component id
  - `dup_group_size` — size of that component
  - `canonical_ix` — chosen canonical index (self if not in a group)
  - `is_canonical` — boolean flag
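
The grouping logic can be sketched on a toy graph before running it on the real pairs. The function `connected_groups` and the edge list below are hypothetical and exist only for illustration; the notebook's own cell works on `near_pairs_df` and a NumPy `parent` array.

```python
def connected_groups(n, edges):
    """Group node ids 0..n-1 into connected components with Union-Find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return groups

# Toy edges: 0~1 and 1~2 chain into one group even though 0~2 was never scored
toy_groups = connected_groups(6, [(0, 1), (1, 2), (4, 5)])
dup_groups = sorted(m for m in toy_groups.values() if len(m) >= 2)
print(dup_groups)
```

Because components are transitive, a group can contain two documents that were never directly compared, so large groups are worth a spot check.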


In [108]:
from collections import defaultdict

# Union-Find over edges that passed the TF-IDF cosine threshold
parent = np.arange(n)

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra

# add edges from near_pairs_df
for i, j, _cos, _jac in near_pairs_df[["i","j","tfidf_cosine","minhash_jaccard_est"]].itertuples(index=False):
    union(int(i), int(j))

# build groups
groups = defaultdict(list)
for idx in range(n):
    groups[find(idx)].append(idx)

# keep only groups with size >= 2 as near-duplicate groups
near_groups = {g: members for g, members in groups.items() if len(members) >= 2}

# --- Canonical selection: longest cleaned text, tie-break https ---
def content_len(k):
    v = df_en.iloc[k].get("len_clean")
    return int(v) if pd.notna(v) else len(str(df_en.iloc[k].get("text_clean","")))

def canon_score(k):
    url = str(df_en.iloc[k].get("url", ""))
    return (content_len(k), 1 if url.startswith("https") else 0)

canonical_map = {}
for g, members in near_groups.items():
    best = max(members, key=canon_score)
    for m in members:
        canonical_map[m] = best




In [109]:
# annotate dataframe
df_en["dup_group"]      = [find(i) for i in range(n)]
df_en["dup_group_size"] = df_en["dup_group"].map(lambda g: len(groups[g]))
df_en["canonical_ix"]   = [canonical_map.get(i, i) for i in range(n)]  # if not in a group, canonical=self
df_en["is_canonical"]   = (df_en.index == df_en["canonical_ix"])


In [110]:
# quick stats (the largest groups are previewed in the next cell)
num_groups = len(near_groups)
docs_in_groups = sum(len(m) for m in near_groups.values())
print(f"Near-duplicate groups (size ≥ 2): {num_groups}")
print(f"Docs that appear in such groups: {docs_in_groups} / {n}")



Near-duplicate groups (size ≥ 2): 24
Docs that appear in such groups: 144 / 1821


In [111]:
# peek top 3 groups by size
for g, members in sorted(near_groups.items(), key=lambda kv: len(kv[1]), reverse=True)[:3]:
    best = max(members, key=canon_score)
    print(f"\nGroup {g} | size={len(members)} | canonical={best}")
    for m in members[:5]:
        title = str(df_en.loc[m, 'title'])[:80].replace("\n"," ")
        print(f"  idx={m:4d}  canon={'*' if m==best else ' '}  {title}")


Group 8 | size=56 | canonical=642
  idx=   8  canon=   One moment, please...
  idx=  75  canon=   One moment, please...
  idx= 111  canon=   One moment, please...
  idx= 166  canon=   One moment, please...
  idx= 167  canon=   One moment, please...

Group 100 | size=20 | canonical=1533
  idx= 100  canon=   Redirect Notice
  idx= 101  canon=   Redirect Notice
  idx= 198  canon=   Redirect Notice
  idx= 199  canon=   Redirect Notice
  idx= 240  canon=   Redirect Notice

Group 20 | size=9 | canonical=805
  idx=  20  canon=   CONTENTdm
  idx=  21  canon=   CONTENTdm
  idx= 372  canon=   CONTENTdm
  idx= 805  canon=*  CONTENTdm
  idx= 893  canon=   CONTENTdm


### **Outputs:** summary counts and a preview of the largest groups with their canonical members.
### **Result:** each document is labeled with its duplicate group and canonical representative for reporting/QA.


In [112]:
df_en.head()

Unnamed: 0,url,warc_date,content_type,title,text,len_text,languges,text_src,text_clean,title_clean,len_clean,tokens,lemmas,len_tokens,dup_group,dup_group_size,canonical_ix,is_canonical
0,http://0cpm.org/,2025-08-02T23:57:45Z,text/html,,Redirection...\nYou should be\ntaken here\nimm...,52,en,Redirection...\nYou should be\ntaken here\nimm...,redirection... you should be taken here immedi...,,52,"[redirection, taken, immediately]","[redirection, immediately]",3,0,1,0,True
1,http://100.ubc.ca/ubc-impact/nicole-eredics/,2025-08-02T23:25:21Z,text/html; charset=utf-8,Nicole Eredics - UBC Centennial,Nicole Eredics - UBC Centennial\nSkip to main ...,2000,en,Nicole Eredics - UBC Centennial\nSkip to main ...,nicole eredics - ubc centennial skip to main c...,nicole eredics - ubc centennial,1961,"[nicole, eredics, ubc, centennial, skip, main,...","[nicole, eredic, ubc, centennial, skip, main, ...",244,1,1,1,True
2,http://2012portal.blogspot.com/2015/03/the-all...,2025-08-02T23:31:29Z,text/html; charset=utf-8,The Portal: The Alliance Fleet,The Portal: The Alliance Fleet\nThe Portal\nTh...,76132,en,The Portal: The Alliance Fleet\nThe Portal\nTh...,the portal: the alliance fleet the portal the ...,the portal: the alliance fleet,73806,"[portal, alliance, fleet, portal, intelligence...","[portal, alliance, fleet, portal, intelligence...",6763,2,1,2,True
3,http://336-166316.shop033.com/MarketPlace/Merc...,2025-08-02T23:17:25Z,text/html; charset=utf-8,Ashop Marketplace,Ashop Marketplace\nSearch\nAdvanced Search\nAE...,2201,en,Ashop Marketplace\nSearch\nAdvanced Search\nAE...,ashop marketplace search advanced search aed a...,ashop marketplace,2167,"[ashop, marketplace, search, advanced, search,...","[ashop, marketplace, search, advanced, search,...",285,3,1,3,True
4,http://4ausa.com/product-category/sticky-notes...,2025-08-03T00:05:19Z,text/html; charset=utf-8,Sticky Notes Set in PP Box Archives - High Qua...,Sticky Notes Set in PP Box Archives - High Qua...,1569,en,Sticky Notes Set in PP Box Archives - High Qua...,sticky notes set in pp box archives - high qua...,sticky notes set in pp box archives - high qua...,1604,"[sticky, notes, set, pp, box, archives, high, ...","[sticky, note, set, pp, box, archive, high, qu...",228,4,1,4,True


### Exact Duplicate Detection (SHA-256)

This step identifies **byte-equivalent** pages after cleaning:

- **Stabilize row ids:** `df_en = df_en.reset_index(drop=True)` for consistent indexing.
- **Hash normalized text:** compute `exact_hash = sha256(text_clean)` for each document.
- **Group by hash:** same `exact_hash` ⇒ same **exact-dup group** (`exact_groups`).
- **Annotate rows:** `exact_group_size` = size of the document’s exact-dup group (1 = unique).
- **Report & preview:** print total docs, number of groups (size ≥ 2), total docs in such groups, and list a few largest groups (showing indices/URLs).
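
A small sketch shows what an exact hash does and does not tolerate. The cleaner `toy_clean` is a stand-in assumption (lowercasing plus whitespace collapsing); the notebook's real cleaning step is defined earlier and may do more.

```python
import hashlib

def text_hash(s):
    """SHA-256 hex digest of a string's UTF-8 bytes."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def toy_clean(s):
    """Stand-in cleaner (assumption): lowercase and collapse whitespace."""
    return " ".join(s.lower().split())

a = toy_clean("One moment,\n please...")
b = toy_clean("one   moment, PLEASE...")
c = toy_clean("please... one moment,")   # same words, different order

print(text_hash(a) == text_hash(b))  # case and spacing differences vanish after cleaning
print(text_hash(a) == text_hash(c))  # a different word order gives a different hash
```

Any difference that survives cleaning, even a single character, produces a different hash, which is why the near-duplicate stages are still needed.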



In [113]:
import hashlib

In [114]:
# Stabilize the integer index so row labels match positions
df_en = df_en.reset_index(drop=True)

# Hash each cleaned text with SHA-256
df_en["exact_hash"] = df_en["text_clean"].apply(lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest())

# Same hash => same exact-dup group; keep only hashes shared by more than one doc
grp = df_en.reset_index().groupby("exact_hash")["index"].apply(list)
exact_groups = {h: idxs for h, idxs in grp.items() if len(idxs) > 1}

# Annotate each row with the size of its exact-dup group (1 = unique)
df_en["exact_group_size"] = df_en["exact_hash"].map(lambda h: len(exact_groups.get(h, [])) or 1)

# view biggest few groups
num_exact_groups = sum(1 for v in exact_groups.values() if len(v) >= 2)
docs_in_exact = sum(len(v) for v in exact_groups.values())

print("Docs:", len(df_en))
print("Exact-dup groups (size ≥ 2):", num_exact_groups)
print("Docs in exact-dup groups:", docs_in_exact)


Docs: 1821
Exact-dup groups (size ≥ 2): 18
Docs in exact-dup groups: 122


In [122]:
if num_exact_groups:
    top = sorted(exact_groups.items(), key=lambda kv: len(kv[1]), reverse=True)[:3]
    for h, idxs in top:
        print(f"\nHash {h[:12]}… | size={len(idxs)}")
        for i in idxs[:10]:
            print("  idx:", i, "|", (df_en.loc[i, "url"] or "")[:100])


Hash 9dbe9b076219… | size=56
  idx: 8 | http://acasalaromani.ro/category/retete-culinare-traditionale/retete-culinare-traditionale-muntenia/
  idx: 75 | http://carobniprstki.com/diy-slincki
  idx: 111 | http://doctorexpres.ro/index.php/tag/hamburgeri/
  idx: 166 | http://gebeligim.com/
  idx: 167 | http://geinomatome.com/ea4djvm-jpg/
  idx: 190 | http://hashtagbylily.com/
  idx: 241 | http://mardelrefrigeration.ca/
  idx: 244 | http://marlowhistory.uk/
  idx: 292 | http://psyssa.com/
  idx: 303 | http://resepmasakan.9wiki.net/tag/hotel-di-jogja

Hash 259d7d2a13e2… | size=20
  idx: 100 | http://cse.google.com.bn/url?sa=i&url=https://pensiuneacoral.ro/fr.php?cid=30%26kys=maternit%C3%A9+k
  idx: 101 | http://cse.google.gr/url?sa=i&url=http://rank-your.site/i/top-fiverr-seo-service-review//
  idx: 198 | http://images.google.com.cy/url?q=https%3A%2F%2Friverstonenetworks.com%2F
  idx: 199 | http://images.google.de/url?sa=t&url=https%3A%2F%2Fanasdream-realestate.es
  idx: 240 | http://maps.g

### **Result:** a precise map of **exact duplicates**: documents whose cleaned text is identical, regardless of the case, spacing, or markup differences already removed by cleaning (word order still matters).

In [116]:
# FINAL STEP: build & save the JSON report (pairs, groups, canonicals, clusters, params)

import os, json, math
from datetime import datetime
import numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.cluster import MiniBatchKMeans


# ---- Reuse knobs (set defaults if not already defined)
K                  = globals().get("K", 5)
NUM_PERM           = globals().get("NUM_PERM", 128)
JACCARD_T          = globals().get("JACCARD_T", 0.85)
NEAR_TFIDF_SIM     = globals().get("NEAR_TFIDF_SIM", 0.92)
TFIDF_MAX_FEATURES = globals().get("TFIDF_MAX_FEATURES", 50_000)
TFIDF_MIN_DF       = globals().get("TFIDF_MIN_DF", 3)
TFIDF_NGRAM_RANGE  = globals().get("TFIDF_NGRAM_RANGE", (1,2))
TOP_DOCS_PER_CLUSTER = 5

# ---- Build/confirm groups from the dataframe annotations
# near-dup groups (from previous union-find step). If not present, derive from dup_group.
if "near_groups" in globals():
    ngroups = {int(g): [int(x) for x in v] for g, v in near_groups.items()}
else:
    ngroups = defaultdict(list)
    for g, idxs in df_en.reset_index().groupby("dup_group")["index"]:
        idxs = idxs.tolist()
        if len(idxs) >= 2:
            ngroups[int(g)] = [int(x) for x in idxs]

# canonical map (if missing, self-canonical)
if "canonical_map" in globals():
    canon_map = {int(k): int(v) for k, v in canonical_map.items()}
else:
    canon_map = {int(i): int(i) for i in range(len(df_en))}

# exact-dup groups (from previous cell). If not present, derive now.
if "exact_groups" not in globals():
    grp = df_en.reset_index().groupby("exact_hash")["index"].apply(list)
    exact_groups = {h: idxs for h, idxs in grp.items() if len(idxs) > 1}

# ---- Topic clustering (MiniBatchKMeans) for reviewer context
n_docs = X_en.shape[0]
n_clusters = max(5, min(100, int(math.sqrt(max(2, n_docs)))))
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, batch_size=1024, n_init=10)
cluster_labels = kmeans.fit_predict(X_en)

terms = np.array(tfidf.get_feature_names_out())
cluster_terms = []
for c in kmeans.cluster_centers_:
    idx = np.argsort(c)[::-1][:10]
    cluster_terms.append(terms[idx].tolist())

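# representative docs per cluster, ranked by squared TF-IDF row norm
# Note: the rows are L2-normalized (all norms printed earlier are 1), so this ranking is close to arbitrary.
# A separate name avoids overwriting `norms`, which cosine_ij() relies on.
sq_norms = np.asarray((X_en.power(2)).sum(axis=1)).ravel()
cluster_samples = {}
for cid in range(n_clusters):
    members = np.where(cluster_labels == cid)[0]
    if members.size == 0:
        cluster_samples[cid] = []
    else:
        order = members[np.argsort(-sq_norms[members])][:TOP_DOCS_PER_CLUSTER]
        cluster_samples[cid] = order.tolist()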

# ---- Build JSON sections

# documents
docs_json = []
for i, row in df_en.reset_index(drop=True).iterrows():
    docs_json.append({
        "id": int(i),
        "url": row.get("url", None),
        "title": row.get("title", None),
        "len_clean": int(row.get("len_clean", len(str(row.get("text_clean","")))) or 0),
        "exact_hash": row.get("exact_hash", None),
        "exact_group_size": int(row.get("exact_group_size", 1)),
        "dup_group": int(row.get("dup_group", i)) if pd.notna(row.get("dup_group", i)) else int(i),
        "dup_group_size": int(row.get("dup_group_size", 1)),
        "canonical_ix": int(row.get("canonical_ix", i)),
        "is_canonical": bool(row.get("is_canonical", True)),
        "cluster": int(cluster_labels[i]),
    })

# near-duplicate pairs (if you computed rerank)
pairs_json = []
if "near_pairs_df" in globals() and len(near_pairs_df):
    for _, r in near_pairs_df.iterrows():
        i, j = int(r["i"]), int(r["j"])
        pairs_json.append({
            "i": i,
            "j": j,
            "tfidf_cosine": float(r["tfidf_cosine"]),
            "minhash_jaccard_est": float(r["minhash_jaccard_est"]),
            "url_i": df_en.iloc[i].get("url", None),
            "title_i": df_en.iloc[i].get("title", None),
            "url_j": df_en.iloc[j].get("url", None),
            "title_j": df_en.iloc[j].get("title", None),
        })

# near-duplicate groups with canonicals
near_groups_json = []
for gid, members in ngroups.items():
    canon = canon_map.get(members[0], members[0])
    # if canon not in this group because of previous defaulting, choose longest text
    if canon not in members:
        canon = max(members, key=lambda k: int(df_en.iloc[k].get("len_clean", len(str(df_en.iloc[k].get("text_clean",""))))))
    near_groups_json.append({
        "group_id": int(gid),
        "size": int(len(members)),
        "canonical": int(canon),
        "canonical_reason": "longest_clean_text_then_https",
        "members": [int(m) for m in sorted(members)],
        "sample_titles": [df_en.iloc[m].get("title", None) for m in members[:5]],
    })

# exact-duplicate groups
exact_groups_json = []
for h, members in exact_groups.items():
    exact_groups_json.append({
        "hash": h,
        "size": int(len(members)),
        "members": [int(m) for m in members[:]],
        "sample_titles": [df_en.iloc[m].get("title", None) for m in members[:5]],
    })

# clusters (topics)
clusters_json = []
for cid in range(n_clusters):
    samples = cluster_samples.get(cid, [])
    clusters_json.append({
        "cluster_id": int(cid),
        "size": int((cluster_labels == cid).sum()),
        "top_terms": cluster_terms[cid],
        "sample_docs": [
            {"id": int(i),
             "url": df_en.iloc[i].get("url", None),
             "title": df_en.iloc[i].get("title", None)}
            for i in samples
        ],
    })

# ---- Assemble report
report = {
    "meta": {
        "generated_at": datetime.utcnow().isoformat() + "Z",
        "num_docs": int(len(df_en)),
        "tfidf_vocab_size": int(len(terms)),
        "params": {
            "stage0_exact_hash_algo": "sha256",
            "stage1_minhash": {"k_token": int(K), "num_perm": int(NUM_PERM), "jaccard_threshold": float(JACCARD_T)},
            "stage2_rerank": {"tfidf_cosine_threshold": float(NEAR_TFIDF_SIM)},
            "tfidf": {
                "max_features": int(TFIDF_MAX_FEATURES),
                "min_df": int(TFIDF_MIN_DF),
                "ngram_range": list(TFIDF_NGRAM_RANGE),
            },
            "topic_clustering": {"n_clusters": int(n_clusters), "top_docs_per_cluster": int(TOP_DOCS_PER_CLUSTER)},
        },
        "counts": {
            "near_dup_pairs_after_rerank": int(len(pairs_json)),
            "near_dup_groups_ge2": int(sum(1 for v in ngroups.values() if len(v) >= 2)),
            "exact_dup_groups_ge2": int(sum(1 for v in exact_groups.values() if len(v) >= 2)),
        },
    },
    "documents": docs_json,
    "near_duplicate_pairs": pairs_json,
    "near_duplicate_groups": near_groups_json,
    "exact_duplicate_groups": exact_groups_json,
    "topic_clusters": clusters_json,
}

# ---- Save
out_path = "web_archive_dedup_report.json"
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(report, f, ensure_ascii=False, indent=2)

print("Saved report to:", os.path.abspath(out_path))
print("Summary:",
      "| docs:", len(df_en),
      "| pairs:", len(pairs_json),
      "| near-groups:", sum(1 for v in ngroups.values() if len(v) >= 2),
      "| exact-groups:", sum(1 for v in exact_groups.values() if len(v) >= 2),
      "| clusters:", n_clusters)


Saved report to: C:\Users\huyas\web_archive_dedup_report.json
Summary: | docs: 1821 | pairs: 1845 | near-groups: 24 | exact-groups: 18 | clusters: 42


In [125]:
# Reload the saved report (reuses `out_path` from the report cell instead of a hard-coded absolute path)
with open(out_path, "r", encoding="utf-8") as f:
    rep = json.load(f)

# Note: a notebook cell renders only its last expression; wrap the others in display() to see them.

# largest near-dup groups
sorted(rep["near_duplicate_groups"], key=lambda g: g["size"], reverse=True)[:3]

# sample of strong pairs
rep["near_duplicate_pairs"][:10]

# Top cluster overview
[(c["cluster_id"], c["size"], c["top_terms"][:5]) for c in rep["topic_clusters"][:5]]


[(0, 1, ['industry', 'num united', 'intern', 'united states', 'states']),
 (1,
  1,
  ['request verify',
   'wait request',
   'moment loader',
   'loader wait',
   'loader']),
 (2, 1, ['research', 'questionnaire', 'equipment', 'datum', 'num']),
 (3,
  56,
  ['moment loader',
   'loader wait',
   'request verify',
   'wait request',
   'loader']),
 (4, 2, ['num', 'num num', 'earthquake', 'manual se', 'earthquake manual'])]

In [133]:
out_csv = "df_en_final.csv"
df_en.to_csv(out_csv, index=False, encoding="utf-8")
print(f"Saved CSV to: {out_csv}")

Saved CSV to: df_en_final.csv
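
The CSV above keeps every row, duplicates included. A possible follow-up, sketched here under assumptions, is to export only the canonical representatives as JSONL. The records, the helper `write_jsonl`, and the file name `representatives.jsonl` are made up for illustration; on the real frame, something like `df_en[df_en["is_canonical"]].to_json(path, orient="records", lines=True, force_ascii=False)` would likely serve the same purpose.

```python
import json
import os
import tempfile

# Toy records standing in for rows of df_en (values are made up)
records = [
    {"id": 0, "url": "http://example.com/a", "canonical_ix": 0, "is_canonical": True},
    {"id": 1, "url": "http://example.com/b", "canonical_ix": 0, "is_canonical": False},
    {"id": 2, "url": "http://example.com/c", "canonical_ix": 2, "is_canonical": True},
]

def write_jsonl(rows, path):
    """Write one JSON object per line; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count

out_dir = tempfile.mkdtemp()
reps_path = os.path.join(out_dir, "representatives.jsonl")  # hypothetical file name

# Keep one representative per duplicate group (plus all ungrouped docs)
n_written = write_jsonl((r for r in records if r["is_canonical"]), reps_path)
print(n_written)
```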
