# 01 - Exploring Common Crawl WARC Files

## Why Data Collection Matters for Legal AI

Every large language model starts with data. GPT, LLaMA, and other foundation
models are trained on massive text corpora, and **Common Crawl** is one of the
largest publicly available sources. It contains petabytes of web pages collected
over more than a decade.

For legal AI systems like CoCounsel, the composition of training data directly
affects the model's ability to reason about statutes, case law, and legal
concepts. Understanding *where* the data comes from -- and how little of a
general web crawl is actually legal content -- is the first step toward building
better legal AI.

### What is the WARC Format?

Common Crawl stores its data in **WARC (Web ARChive)** format, an ISO standard
(ISO 28500:2017) for archiving web content. A single WARC file bundles together:

- **`warcinfo`** records -- metadata about the crawl itself
- **`request`** records -- the HTTP request sent to the server
- **`response`** records -- the full HTTP response including headers and HTML body
- **`metadata`** records -- additional per-record details from the crawler (for example, fetch timing or detected character set)
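
To make the record structure concrete, here is a minimal sketch that builds and parses a simplified WARC record using only the standard library. It assumes the basic layout from the WARC specification: a version line, named header fields, a blank line, then `Content-Length` bytes of content followed by two CRLFs. Mandatory fields such as `WARC-Record-ID` and `WARC-Date` are omitted for brevity, and the helper names `build_record` and `parse_record` are hypothetical. In practice warcio handles all of this for us.

```python
import io


def build_record(rec_type: str, uri: str, block: bytes) -> bytes:
    """Serialize a simplified WARC record (illustrative; not spec-complete)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {rec_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(block)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Every record ends with two CRLFs after the content block.
    return head + block + b"\r\n\r\n"


def parse_record(stream: io.BufferedIOBase):
    """Read one record; return (headers, block), or None at end of stream."""
    version = stream.readline()
    if not version:
        return None
    headers = {}
    while True:
        line = stream.readline().strip()
        if not line:  # blank line ends the header section
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    # The content block is read by length, so it may itself contain CRLFs.
    block = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # skip the trailing CRLF CRLF
    return headers, block


# Made-up request/response pair for a placeholder URL.
http_request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
http_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
archive = io.BytesIO(
    build_record("request", "http://example.com/", http_request)
    + build_record("response", "http://example.com/", http_response)
)

parsed = []
while (rec := parse_record(archive)) is not None:
    parsed.append(rec)

for headers, block in parsed:
    print(headers["WARC-Type"], len(block), "bytes")
```

Note that the content block of a `response` record is the complete HTTP message, headers included; the WARC headers sit one level above it.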

A typical Common Crawl WARC file (loosely called a "segment" in this notebook)
is about 1 GB compressed and contains roughly 30,000-40,000 web pages. A full
crawl consists of tens of thousands of such files.

> **Note:** This notebook requires network access to download WARC data from
> Common Crawl. Cells that need network access are marked with a
> `[REQUIRES NETWORK]` label.

## Setup

Install and import the libraries we need:
- **warcio** -- reading and writing WARC files
- **beautifulsoup4** -- parsing HTML content
- **requests** -- downloading files from the web

In [None]:
# Install dependencies (uncomment if needed)
# %pip install warcio beautifulsoup4 requests

In [1]:
import io
import re
from collections import Counter
from pathlib import Path

import requests
from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

In [2]:
# ---------------------------------------------------------------------------
# Helper: download a WARC segment (streaming, with a byte limit so we don't
# pull down the entire 1 GB file during exploration).
# ---------------------------------------------------------------------------

SAMPLE_WARC_URL = (
    "https://data.commoncrawl.org/crawl-data/CC-MAIN-2026-04/"
    "segments/1768220467618.22/warc/"
    "CC-MAIN-20260112161239-20260112191239-00000.warc.gz"
)


def download_warc_sample(
    url: str = SAMPLE_WARC_URL,
    max_bytes: int = 5 * 1024 * 1024,  # 5 MB default -- enough for a few hundred pages
) -> io.BytesIO:
    """Stream the first `max_bytes` of a WARC file into an in-memory buffer."""
    print(f"Downloading first {max_bytes / 1024 / 1024:.1f} MB from:")
    print(f"  {url}")
    headers = {"Range": f"bytes=0-{max_bytes - 1}"}
    resp = requests.get(url, headers=headers, stream=True, timeout=30)
    resp.raise_for_status()
    buf = io.BytesIO(resp.content)
    print(f"Downloaded {len(resp.content):,} bytes.")
    return buf

## Exploring WARC Files

Let's download a small slice of a real Common Crawl WARC segment and see what's
inside. We limit the download to 5 MB so it finishes quickly.
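
Cutting the file at a byte offset still yields readable records because, as far as we know, Common Crawl compresses each WARC record as its own gzip member, so only the record that straddles the cut is lost. The sketch below illustrates the idea with stand-in data and the standard library only; `count_complete_gzip_members` is a hypothetical helper, not part of warcio.

```python
import gzip
import zlib


def count_complete_gzip_members(data: bytes) -> int:
    """Count gzip members that decompress fully, stopping at a truncated one."""
    complete = 0
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        try:
            decomp.decompress(data)
        except zlib.error:
            break  # corrupt or non-gzip bytes
        if not decomp.eof:
            break  # input ended in the middle of a member
        complete += 1
        data = decomp.unused_data  # bytes belonging to the next member
    return complete


# Five stand-in "records", each compressed as a separate gzip member.
members = [gzip.compress(f"record {i}\n".encode("utf-8") * 50) for i in range(5)]
blob = b"".join(members)
truncated = blob[:-10]  # simulate a byte-range cut inside the last member

print(count_complete_gzip_members(blob))
print(count_complete_gzip_members(truncated))
```

The second count is one lower than the first: every member before the cut is intact, and the partial one at the end is simply dropped.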

> **[REQUIRES NETWORK]** -- The cell below downloads data from
> `data.commoncrawl.org`.

In [3]:
# [REQUIRES NETWORK] -- Download a 5 MB sample of a WARC segment.
warc_buffer = download_warc_sample()

Downloading first 5.0 MB from:
  https://data.commoncrawl.org/crawl-data/CC-MAIN-2026-04/segments/1768220467618.22/warc/CC-MAIN-20260112161239-20260112191239-00000.warc.gz
Downloaded 5,242,880 bytes.


In [4]:
# Parse every record in our sample and collect basic stats.
records = []

warc_buffer.seek(0)
for record in ArchiveIterator(warc_buffer):
    rec_type = record.rec_type
    url = record.rec_headers.get_header("WARC-Target-URI") or ""
    content_length = int(
        record.rec_headers.get_header("Content-Length") or 0
    )
    # Read the payload so the iterator advances properly
    payload = record.content_stream().read()
    records.append(
        {
            "type": rec_type,
            "url": url,
            "content_length": content_length,
            "payload": payload,
        }
    )

print(f"Total records parsed: {len(records)}")

Total records parsed: 932


In [5]:
# How many records of each type?
type_counts = Counter(r["type"] for r in records)
print("Record types:")
for rtype, count in type_counts.most_common():
    print(f"  {rtype:12s}  {count:>5,}")

Record types:
  request         311
  response        310
  metadata        310
  warcinfo          1


In [6]:
# Extract HTML content from 'response' records and show a sample.
response_records = [r for r in records if r["type"] == "response"]
print(f"Response records: {len(response_records)}")

if response_records:
    sample = response_records[0]
    print(f"\nSample URL: {sample['url']}")
    print(f"Payload size: {len(sample['payload']):,} bytes")
    # Show the first 500 bytes of the payload, decoded as UTF-8 (a multi-byte
    # character cut at the boundary shows up as a replacement character)
    print("\n--- Raw payload (first 500 chars) ---")
    print(sample["payload"][:500].decode("utf-8", errors="replace"))

Response records: 310

Sample URL: http://028700.com/video/310005.html
Payload size: 109,599 bytes

--- Raw payload (first 500 chars) ---
<!DOCTYPE html>
<html>
	<head>
		<title>《全民诡异：开局掌握零元购·动态漫画-西施程美嘉无码》第1集在线播放-全集选集-高清视频在线播放-蘑菇视频</title>
		<meta name="keywords" content="全民诡异：开局掌握零元购·动态漫画-西施程美嘉无码正在播放,热播高清电影,免费电视剧,高清综艺,精彩短剧,日韩精品,国产经典,欧美精品,蘑菇视频" />
		<meta name="description" content="全民诡异：开局掌握零�


### Observations

Even from this small sample, you can see that:
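1. Each web page generates multiple WARC records (here a `request`, a
   `response`, and a `metadata` record).
2. The `response` record stores the full HTTP response, but warcio parses the
   HTTP headers separately (`record.http_headers`), so `content_stream()`
   returns only the body. That is why the payload above starts at
   `<!DOCTYPE html>`.
3. The raw content is noisy -- navigation menus, ads, JavaScript, and boilerplate
   surround the actual page text.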

## Filtering for Legal Content

Common Crawl is a *general* web crawl. Let's find out what fraction of the
pages we downloaded come from legal domains.

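We'll define a simple heuristic: a URL is "legal-related" if it contains any of
the substrings `gov`, `court`, `law`, or `legal`. This is deliberately
permissive: plain substring matching also catches unrelated URLs whose domain
or path merely contains one of these strings.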

In [7]:
LEGAL_KEYWORDS = ["gov", "court", "law", "legal"]


def is_legal_url(url: str) -> bool:
    """Return True if the URL likely points to a legal-domain page."""
    url_lower = url.lower()
    return any(kw in url_lower for kw in LEGAL_KEYWORDS)


legal_records = [r for r in response_records if is_legal_url(r["url"])]
total = len(response_records)
legal_count = len(legal_records)

print(f"Total response records : {total}")
print(f"Legal-domain records   : {legal_count}")
if total > 0:
    print(f"Percentage             : {legal_count / total * 100:.2f}%")
else:
    print("No response records found (is the WARC download empty?)")

Total response records : 310
Legal-domain records   : 4
Percentage             : 1.29%
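
Substring matching on the whole URL is easy to fool: `court`, for instance, also matches a domain such as `bittencourtadv.com`. A stricter variant, sketched here as one possible refinement rather than the notebook's method, matches only whole hostname labels and path tokens. The name `is_legal_url_strict` and the token sets are assumptions chosen for illustration.

```python
import re
from urllib.parse import urlsplit

GOV_LABELS = {"gov"}  # hostname labels that indicate a government site
LEGAL_TOKENS = {"court", "courts", "law", "laws", "legal"}


def is_legal_url_strict(url: str) -> bool:
    """Match legal keywords only as whole hostname labels or path tokens."""
    parts = urlsplit(url.lower())
    host_labels = (parts.hostname or "").split(".")
    if GOV_LABELS & set(host_labels):
        return True
    # Split hostname labels and path on anything that is not a letter or digit.
    tokens = re.split(r"[^a-z0-9]+", " ".join(host_labels) + " " + parts.path)
    return bool(LEGAL_TOKENS & set(tokens))


for url in [
    "http://bdlaws.minlaw.gov.bd/act-print-1246/section-print-47268.html",
    "http://bittencourtadv.com/2023/05/03/betting-company-mostbet-app-online-sports-betting-41/",
    "https://example.com/family-law/guide",
]:
    print(is_legal_url_strict(url), url)
```

The trade-off is recall: a hostname like `bdlaws` no longer matches on `law` and is kept here only because of its `gov` label.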


In [8]:
def extract_text_from_html(raw_payload: bytes) -> str:
    """Extract visible text from the HTML body of a response record.

    warcio's `record.content_stream()` already yields the body without the
    HTTP headers (the raw payload printed earlier starts at `<!DOCTYPE html>`),
    so the payload is parsed directly. Splitting on a blank line here would
    risk discarding the start of the page.
    """
    # errors="replace" substitutes undecodable bytes instead of raising.
    html = raw_payload.decode("utf-8", errors="replace")
    soup = BeautifulSoup(html, "html.parser")
    # Remove scripts, styles, and common boilerplate containers
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
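
Text extracted this way can still contain long runs of spaces and newlines where markup used to be. A small post-processing step, sketched here as a hypothetical helper `normalize_whitespace` that is not part of the notebook, collapses them before the text is measured or displayed.

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


messy = "Laws of Bangladesh \n\n      (   \n\t  Act No. 27  \n   )  "
print(normalize_whitespace(messy))
```

Applying this before computing text sizes also keeps layout whitespace from inflating the byte counts.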

In [9]:
# Show extracted text from legal-domain pages (if any were found).
if legal_records:
    for rec in legal_records[:3]:  # show up to 3
        text = extract_text_from_html(rec["payload"])
        print(f"URL: {rec['url']}")
        print("Extracted text (first 300 chars):")
        print(f"  {text[:300]}")
        print("-" * 72)
else:
    print(
        "No legal-domain pages found in this sample. This is expected --\n"
        "legal content is a tiny fraction of the general web.\n\n"
        "Try increasing max_bytes in download_warc_sample() or using a\n"
        "different WARC segment."
    )

URL: http://bdlaws.minlaw.gov.bd/act-print-1246/section-print-47268.html
Extracted text (first 300 chars):
  প্রিন্ট 12/01/2026 Laws of Bangladesh ক্যান্টনমেন্ট আইন, ২০১৮ (
                                
                                    

                                        ২০১৮ সনের ২৭ নং
                                        
                                            
                
------------------------------------------------------------------------
URL: http://bittencourtadv.com/2023/05/03/betting-company-mostbet-app-online-sports-betting-41/
Extracted text (first 300 chars):
  Betting Company Mostbet App Online Sports Betting 41 - Bittencourt Advocacia Skip to content Betting Company Mostbet App Online Sports Betting 41 Bittencourt Advocacia > Sem categoria > Betting Company Mostbet App Online Sports Betting 41 Betting Company Mostbet App Online Sports Betting 418 Mostbet
------------------------------------------------------------------------
URL: http://cample

### What Did We Learn?

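Even with a generous URL-based filter, legal content is a very small fraction
of Common Crawl: 4 of 310 pages (1.29%) in this sample, and at least one of
those (a betting post on a domain that happens to contain `court`) is not legal
text at all. Estimates from researchers working with Common Crawl suggest that
well under **0.1%** of all crawled pages contain substantive legal text (court
opinions, statutes, regulations).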

This has real consequences:
- A foundation model trained on Common Crawl has seen very little legal text
  relative to its total training data.
- The legal text it *has* seen may come from blogs, law firm marketing pages,
  or news articles -- not primary sources like judicial opinions.
- For a product like CoCounsel, supplementing with curated legal datasets
  is essential.

## Exercises

### Exercise (a): Filter by Legal Citations

Write a function that scans extracted text for legal citations. Look for
patterns matching common case reporters:

- **Federal reporters:** `F.2d`, `F.3d`, `F. Supp.`, `F. Supp. 2d`, `F. Supp. 3d`
- **Supreme Court:** `U.S.`, `S. Ct.`, `S.Ct.`, `L. Ed.`, `L. Ed. 2d`

For example, a regex like:
```python
CITATION_PATTERN = re.compile(
    r'\d+\s+(?:F\.\s*(?:2d|3d|Supp\.(?:\s*[23]d)?)|'
    r'U\.S\.|S\.\s*Ct\.|L\.\s*Ed\.(?:\s*2d)?)\s+\d+'
)
```

Apply this filter to all response records in your WARC sample. How many pages
contain at least one legal citation?
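
Before running the pattern over real pages, it helps to check it against a few strings whose expected matches are known. The sketch below wraps the regex from above in a hypothetical `find_citations` function; the second and third sample strings use made-up volume and page numbers.

```python
import re

CITATION_PATTERN = re.compile(
    r'\d+\s+(?:F\.\s*(?:2d|3d|Supp\.(?:\s*[23]d)?)|'
    r'U\.S\.|S\.\s*Ct\.|L\.\s*Ed\.(?:\s*2d)?)\s+\d+'
)


def find_citations(text: str) -> list[str]:
    """Return every reporter citation (volume, reporter, page) found in text."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return CITATION_PATTERN.findall(text)


samples = [
    "Roe v. Wade, 410 U.S. 113, 93 S. Ct. 705, 35 L. Ed. 2d 147 (1973)",
    "reported at 123 F. Supp. 2d 456 and affirmed, 347 F.3d 123",
    "requests under 5 U.S.C. 552 are not case citations",
]
for sample in samples:
    print(find_citations(sample))
```

The statutory reference in the last string is not matched because `U.S.` is followed by `C.` rather than a page number. Newer reporters such as `F.4th` are not covered by this pattern either.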

### Exercise (b): Estimate Corpus Size

Based on your sample:
1. What is the average text size (in bytes) of a legal-domain page after
   HTML extraction?
2. What percentage of pages in a WARC segment are legal?
3. A full WARC segment is ~1 GB compressed (~3 GB uncompressed) and contains
   ~35,000 pages. Using your percentages, estimate how many segments you would
   need to download to collect **1 GB of clean legal text**.

Think about what this implies for the cost (bandwidth, storage, compute) of
building a legal corpus from Common Crawl alone.

In [11]:
CITATION_PATTERN = re.compile(
    r'\d+\s+(?:F\.\s*(?:2d|3d|Supp\.(?:\s*[23]d)?)|'
    r'U\.S\.|S\.\s*Ct\.|L\.\s*Ed\.(?:\s*2d)?)\s+\d+'
)

citation_pages = []
for rec in response_records:
    text = extract_text_from_html(rec["payload"])
    matches = CITATION_PATTERN.findall(text)
    if matches:
        citation_pages.append({"url": rec["url"], "matches": matches, "text": text})

print(f"Pages with legal citations: {len(citation_pages)} out of {len(response_records)}")
for page in citation_pages[:5]:
    print(f"\nURL: {page['url']}")
    print(f"Citations found: {page['matches'][:5]}")


Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

  soup = BeautifulSoup(html, "html.parser")

Pages with legal citations: 0 out of 310


In [14]:
# Exercise (b): Estimate corpus size

# Step 1: Average text size of legal-domain pages
if legal_records:
    legal_texts = [extract_text_from_html(r["payload"]) for r in legal_records]
    legal_sizes = [len(t.encode("utf-8")) for t in legal_texts]
    avg_legal_size = sum(legal_sizes) / len(legal_sizes)
    print(f"Legal-domain pages found: {len(legal_records)}")
    print(f"Average text size after extraction: {avg_legal_size:,.0f} bytes ({avg_legal_size/1024:.1f} KB)")
else:
    avg_legal_size = 0
    print("No legal-domain pages found -- using estimates below")

# Step 2: What percentage of pages are legal?
# Guard against an empty sample to avoid dividing by zero.
legal_pct = len(legal_records) / len(response_records) * 100 if response_records else 0.0
print(f"\nLegal pages in our sample: {len(legal_records)}/{len(response_records)} = {legal_pct:.2f}%")

# Step 3: Estimate segments needed for 1 GB of clean legal text
pages_per_segment = 35_000  # approximate for a full 1 GB WARC segment
legal_pages_per_segment = pages_per_segment * (legal_pct / 100)

# Use our measured average, or fall back to a reasonable estimate
avg_size = avg_legal_size if avg_legal_size > 0 else 5_000  # 5 KB estimate
legal_bytes_per_segment = legal_pages_per_segment * avg_size

target = 1 * 1024 ** 3  # 1 GB
segments_needed = target / legal_bytes_per_segment if legal_bytes_per_segment > 0 else float("inf")

print(f"\n--- Estimates ---")
print(f"Pages per full segment:        ~{pages_per_segment:,}")
print(f"Legal pages per segment:       ~{legal_pages_per_segment:,.0f}")
print(f"Avg clean text per legal page: ~{avg_size:,.0f} bytes")
print(f"Legal text per segment:        ~{legal_bytes_per_segment/1024/1024:.2f} MB")
print(f"Segments needed for 1 GB:      ~{segments_needed:,.0f}")
print(f"Raw WARC data to download:     ~{segments_needed:,.0f} GB (1 GB per segment)")

Legal-domain pages found: 4
Average text size after extraction: 9,542 bytes (9.3 KB)

Legal pages in our sample: 4/310 = 1.29%

--- Estimates ---
Pages per full segment:        ~35,000
Legal pages per segment:       ~452
Avg clean text per legal page: ~9,542 bytes
Legal text per segment:        ~4.11 MB
Segments needed for 1 GB:      ~249
Raw WARC data to download:     ~249 GB (1 GB per segment)
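
The estimate above treats every URL match as genuine legal text, yet the sample includes at least one match that is not. The sketch below shows how the figure moves if only a fraction of matches are real. The function `segments_needed` and the `precision` values are illustrative assumptions, not measurements; the other inputs are the numbers printed above.

```python
def segments_needed(
    pages_per_segment: int,
    legal_fraction: float,
    avg_text_bytes: float,
    precision: float = 1.0,  # assumed share of URL matches that are truly legal
    target_bytes: int = 1024 ** 3,  # 1 GB of clean text
) -> float:
    """Estimate how many segments yield `target_bytes` of legal text."""
    usable_bytes = pages_per_segment * legal_fraction * precision * avg_text_bytes
    return float("inf") if usable_bytes <= 0 else target_bytes / usable_bytes


# 35,000 pages per segment, 4/310 matches, 9,542 bytes per page (from the run above).
for precision in (1.0, 0.5, 0.25):
    n = segments_needed(35_000, 4 / 310, 9_542, precision)
    print(f"precision={precision:.2f}  segments needed ~{n:,.0f}")
```

Because the required download scales inversely with precision, a filter that is right only half the time doubles the bandwidth, and with just four matching pages in the sample the uncertainty on that precision is large.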
