# Tutorial: Deduplicate Web Text with MinHash + LSH (Guided Lesson)

By the end of this lesson you will run a small, reliable pipeline that:
- extracts text blocks from HTML,
- normalizes text,
- computes MinHash signatures and LSH buckets,
- finds connected components to group near-duplicates,
- exports a deduplicated dataset and a duplicates sample.

We will take one safe path with clear actions, expected results, and checkpoints. No AWS required by default.


## Step 0: Prerequisites and Preflight

Action:
- Ensure Python 3.11+ and install pinned packages.
- Run the preflight cell to verify imports and versions.

Expected result:
- Each package version is printed, followed by `✅ Preflight OK`.

If it fails:
- Re-run install; ensure your virtual environment is active.


In [None]:
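%%bash
# Shell commands: run as a %%bash notebook cell, or paste into a terminal without the first line.
python - << 'PY'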
import sys
print(sys.version)
PY

python -m pip install --quiet 'daft[aws,pandas]==0.3.9' selectolax==0.3.23 python-dotenv==1.0.1 scipy==1.12.0 matplotlib==3.8.4 igraph==0.11.6 numpy==1.26.4 pandas==2.2.2


In [None]:
# Preflight: versions and imports
from __future__ import annotations
import importlib, sys
mods = {
    'daft': '0.3.9',
    'selectolax': '0.3.23',
    'scipy': '1.12.0',
    'matplotlib': '3.8.4',
    'igraph': '0.11.6',
    'numpy': '1.26.4',
    'pandas': '2.2.2',
}

ok = True
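for name, expected in mods.items():
    try:
        m = importlib.import_module(name)
    except ImportError as exc:
        # A missing package fails the preflight instead of crashing the cell.
        print(f"[fail] cannot import {name}: {exc}")
        ok = False
        continue
    v = getattr(m, '__version__', None)
    print(f"{name}=={v}")
    if v and expected and v != expected:
        # A version mismatch is reported as a warning only.
        print(f"[warn] expected {expected} but got {v} for {name}")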

if ok:
    print("✅ Preflight OK")
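
Some packages do not expose `__version__`; in that case the loop above prints `None` and skips the comparison. As a minimal sketch of an alternative, `importlib.metadata.version` reads the version from the installed package metadata. The helper name `check_versions` is hypothetical, and the sketch assumes each distribution name matches its import name, which may not hold for every package.

```python
from importlib import metadata


def check_versions(expected: dict[str, str]) -> dict[str, str]:
    """Return {name: status}, where status is 'ok', 'mismatch', or 'missing'."""
    status = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this name.
            status[name] = "missing"
            continue
        status[name] = "ok" if have == want else "mismatch"
    return status


# Example call with two of the pins from the install command.
report = check_versions({"numpy": "1.26.4", "pandas": "2.2.2"})
print(report)
```

Any entry reported as `missing` or `mismatch` is a cue to re-run the install step inside the active virtual environment.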


## Step 1: Load sample HTML data (no AWS)

Action:
- Load a tiny, in-memory HTML sample into a table with `WARC-Record-ID`, `warc_content`, and `WARC-Identified-Payload-Type`.

Expected result:
- Table has > 0 rows and the payload type is `text/html`.

Checkpoint:
- A preview prints the 3 sample rows.


In [None]:
import daft
from daft import col

HTML_DOCS = [
    ("warc:1", b"HTTP/1.1 200 OK\r\n\r\n<html><head><title>Example 1</title></head><body><h1>Alpha</h1><p>Hello world world!</p></body></html>", "text/html"),
    ("warc:2", b"HTTP/1.1 200 OK\r\n\r\n<html><head><title>Example 2</title></head><body><h1>Alpha</h1><p>Hello world!</p></body></html>", "text/html"),
    ("warc:3", b"HTTP/1.1 200 OK\r\n\r\n<html><head><title>Other</title></head><body><h2>Bravo</h2><p>Different content altogether.</p></body></html>", "text/html"),
]

df_warc = daft.from_pydict({
    "WARC-Record-ID": [x[0] for x in HTML_DOCS],
    "warc_content":   [x[1] for x in HTML_DOCS],
    "WARC-Identified-Payload-Type": [x[2] for x in HTML_DOCS],
}).collect()

assert df_warc.count_rows() > 0, "No rows loaded"
df_warc.show()


## Step 2: Extract text blocks from HTML

Action:
- Strip HTTP headers and extract visible text blocks.

Expected result:
- A table of text blocks with `block_id` and `block` columns, > 0 rows.

Checkpoint:
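- Show up to 3 sample blocks. Blocks shorter than 20 characters are dropped, so short paragraphs such as `Hello world!` in the sample do not appear.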


In [None]:
from selectolax.parser import HTMLParser

index_col = "block_id"
content_col = "block"

@daft.func()
def remove_http_headers(x: bytes) -> str:
    """Return the HTTP body: everything after the first blank line."""
    if x is None:
        return ""
    s = x.decode("utf-8", errors="ignore")
    # Split only at the first blank line so blank lines inside the body survive.
    for sep in ("\r\n\r\n", "\n\n"):
        _head, found, body = s.partition(sep)
        if found:
            return body
    return s

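@daft.func()
def extract_blocks(html: str) -> list[str]:
    tree = HTMLParser(html)
    # Drop non-visible content before reading text.
    for n in tree.css("script,style,noscript"):
        n.decompose()
    blocks = []
    # Containers (div, section, ...) are matched along with their children,
    # so nested text can be emitted more than once.
    for node in tree.css("title, article, main, p, h1, h2, h3, h4, h5, h6, li, div, section"):
        txt = node.text(separator=" ", strip=True)
        # Keep only blocks of at least 20 characters.
        if txt and len(txt) >= 20:
            blocks.append(txt)
    return blocks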

@daft.func()
def get_block_idx(blocks: list[str]) -> list[int]:
    return list(range(len(blocks)))

df_html = (
    df_warc
    .where(col("WARC-Identified-Payload-Type") == "text/html")
    .with_column("content_raw", remove_http_headers(col("warc_content")))
    .where(col("content_raw") != "")
)

df_text = (
    df_html
    .with_column("blocks", extract_blocks(col("content_raw")))
    .with_column("block_idx", get_block_idx(col("blocks")))
    .explode("blocks", "block_idx")
    .where(col("blocks") != "")
    .with_column(index_col, col("WARC-Record-ID") + "-" + col("block_idx"))
    .with_column(content_col, col("blocks"))
    .select("WARC-Record-ID", index_col, content_col)
).collect()

assert df_text.count_rows() > 0, "No blocks extracted"
df_text.show(3)
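
To make the header-stripping step concrete outside of Daft, here is a minimal standard-library sketch. It is illustrative only: the function name `split_http_body` and the sample bytes are hypothetical. It shows why splitting at the first blank line matters when the body itself contains blank lines.

```python
def split_http_body(raw: bytes) -> str:
    """Return the body of a raw HTTP response, or the whole text if no blank line is found."""
    text = raw.decode("utf-8", errors="ignore")
    for sep in ("\r\n\r\n", "\n\n"):
        # partition splits at the first occurrence only.
        _head, found, body = text.partition(sep)
        if found:
            return body
    return text


# Hypothetical response whose body contains its own blank line.
sample = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>one</p>\r\n\r\n<p>two</p>"
body = split_http_body(sample)
print(body)
```

Both paragraphs stay in `body`; a plain `split` followed by taking element 1 would keep only `<p>one</p>`.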


## Step 3: Normalize text

Action:
- Remove punctuation, lowercase the text, apply Unicode NFD normalization, and normalize whitespace, writing the result to `content_normalized`.

Expected result:
- A new `content_normalized` column holding lowercased text without punctuation.

Checkpoint:
- Show 3 rows comparing `block` vs `content_normalized`.


In [None]:
df_norm = df_text.with_column(
    "content_normalized",
    col(content_col).str.normalize(
        remove_punct=True,
        lowercase=True,
        nfd_unicode=True,
        white_space=True,
    ),
).collect()

assert "content_normalized" in df_norm.column_names(), "Normalization failed"
df_norm.select(index_col, content_col, "content_normalized").show(3)
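
For intuition about what the four flags do, the following standard-library sketch approximates the same kind of normalization on a single string. It is an assumption-laden stand-in, not Daft's implementation: the function name `normalize_text` is hypothetical, and the exact punctuation set and ordering used by `str.normalize` may differ.

```python
import unicodedata


def normalize_text(text: str) -> str:
    # NFD splits accented characters into a base letter plus a combining mark.
    text = unicodedata.normalize("NFD", text)
    # Drop every character whose Unicode category is punctuation (P*).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    text = text.lower()
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


print(normalize_text("Hello,   World!"))  # hello world
```

Normalizing before hashing means blocks that differ only in case, punctuation, or spacing produce the same shingles in the next step.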


## Step 4: MinHash signatures

Action:
- Compute a MinHash signature for each block with fixed parameters (`K = 64` hashes, `NGRAM_SIZE = 5`, `SEED = 42`).

Expected result:
- A `min_hashes` column in which each row holds a list of `K` hash values.

Checkpoint:
- Show 3 rows with `min_hashes[:8]` preview.


In [None]:
K = 64
SEED = 42
NGRAM_SIZE = 5

df_minhash = (
    df_norm
    .with_column(
        "min_hashes",
        col("content_normalized").minhash(
            num_hashes=K,
            ngram_size=NGRAM_SIZE,
            seed=SEED,
            hash_function="xxhash",
        ),
    )
).collect()

# Quick preview of the first 8 hashes per row
rows = df_minhash.select(index_col, "min_hashes").limit(3).to_pydict()
for i in range(len(rows[index_col])):
    print(rows[index_col][i], rows["min_hashes"][i][:8])

assert df_minhash.count_rows() > 0, "MinHash failed"
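
To show why matching signature positions indicate near-duplicates, here is a small pure-Python MinHash sketch. It is not what Daft computes: the hashing scheme (salted `hashlib.blake2b`), the word-level shingling, the sentinel for empty input, and all function names are assumptions of this sketch, so its values will not match the `min_hashes` column. The idea is the same, though: the fraction of positions where two signatures agree estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib


def word_shingles(text: str, n: int) -> set[str]:
    """Word n-grams; a text with at most n words yields a single shingle."""
    tokens = text.split()
    if len(tokens) <= n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 64, ngram_size: int = 5, seed: int = 42) -> list[int]:
    shingles = word_shingles(text, ngram_size)
    signature = []
    for i in range(num_hashes):
        # A distinct salt per position stands in for an independent hash function.
        salt = (seed + i).to_bytes(8, "little")
        hashes = [
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ]
        # Empty input has no shingles; -1 is a sentinel chosen for this sketch.
        signature.append(min(hashes) if hashes else -1)
    return signature


def estimate_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


base = "the quick brown fox jumps over the lazy dog near the quiet river bank"
near = "the quick brown fox jumps over the lazy dog near the quiet river shore"
other = "completely unrelated sentence about database indexing and query planning costs today"

sig_base = minhash_signature(base)
print(estimate_jaccard(sig_base, minhash_signature(near)))
print(estimate_jaccard(sig_base, minhash_signature(other)))
```

The first estimate should land well above the second, because `base` and `near` share most of their 5-word shingles while `base` and `other` share none. A larger `num_hashes` gives a tighter estimate at the cost of longer signatures.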
