# 02 - CourtListener: Curated Legal Data

## Purpose-Built Legal Data vs. General Web Crawls

In the previous notebook we saw that legal content is a vanishingly small
fraction of Common Crawl's general web archive. Building a legal AI system on
raw web data alone would be like panning for gold in the ocean.

**CourtListener** (https://www.courtlistener.com) takes the opposite approach.
Operated by Free Law Project, it is an open platform that collects, curates,
and serves U.S. court opinions, oral arguments, and judicial records. The data
is:

- **Structured** -- each opinion comes with metadata (court, date, case name,
  citations)
- **Clean** -- text is extracted and normalized from court filing systems
- **Authoritative** -- sourced directly from PACER, court websites, and
  official repositories

### CourtListener REST API

CourtListener provides a free REST API at `https://www.courtlistener.com/api/rest/v4/`
that lets you search and retrieve opinions programmatically. Key endpoints:

| Endpoint | Description |
|----------|-------------|
| `/opinions/` | Full text of court opinions |
| `/clusters/` | Groups of opinions for a single case |
| `/courts/` | Court metadata |
| `/search/` | Full-text search across opinions |

In this notebook we will primarily work with **sample data bundled in this
repository** so everything runs offline. We also show how you would query the
live API for reference.

> **Network access:** Only the API example cells require network access. All
> analysis cells use the local sample data and run fully offline.

## Setup

In [None]:
import json
import re
from collections import Counter
from pathlib import Path

## Fetching Court Opinions

### Option A: Local Sample Data (offline)

This repository includes a small sample of court opinions in JSONL format.
Each line is a JSON object with fields: `id`, `case_name`, `court`,
`date_filed`, `text`, and `citations`.
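
Before loading the file, it can help to check that each record actually has the fields listed above. The following is a minimal sketch, not part of the repository: the helper name `missing_or_mistyped`, the expected types, and the example record are all assumptions made for illustration.

```python
import json

# Field names come from the description above; the expected types are assumptions.
REQUIRED_FIELDS = {
    "id": (int, str),
    "case_name": str,
    "court": str,
    "date_filed": str,
    "text": str,
    "citations": list,
}


def missing_or_mistyped(record: dict) -> list[str]:
    """Return a list of problems found in one opinion record (empty if none)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            found = type(record[field]).__name__
            problems.append(f"unexpected type for {field}: {found}")
    return problems


# Hypothetical record, invented for illustration only.
example_line = json.dumps({
    "id": 1,
    "case_name": "Example v. Sample Corp.",
    "court": "Example Court of Appeals",
    "date_filed": "2024-01-31",
    "text": "Opinion text goes here.",
    "citations": [],
})

record = json.loads(example_line)
print(missing_or_mistyped(record))  # []

del record["court"]
print(missing_or_mistyped(record))  # ['missing field: court']
```

Running a check like this once per line makes schema drift visible early, instead of surfacing later as a `KeyError` in an analysis cell.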

In [None]:
SAMPLE_PATH = Path("../../datasets/sample/court_opinions.jsonl")


def load_opinions(path: Path = SAMPLE_PATH) -> list[dict]:
    """Load court opinions from a JSONL file."""
    opinions = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                opinions.append(json.loads(line))
    return opinions


opinions = load_opinions()
print(f"Loaded {len(opinions)} opinions from {SAMPLE_PATH}")

In [None]:
# Display the first opinion as a formatted example.
sample = opinions[0]
print(f"Case Name  : {sample['case_name']}")
print(f"Court      : {sample['court']}")
print(f"Date Filed : {sample['date_filed']}")
print(f"Citations  : {sample['citations']}")
print("\nOpinion text (first 500 chars):")
print(f"  {sample['text'][:500]}")

In [None]:
# Display all opinions in a compact summary table.
print(f"{'ID':>6}  {'Date':10}  {'Court (short)':30}  {'Case Name (truncated)'}")
print("-" * 90)
for op in opinions:
    court_short = op["court"][:30]
    name_short = op["case_name"][:40]
    print(f"{op['id']:>6}  {op['date_filed']:10}  {court_short:30}  {name_short}")

### Option B: CourtListener REST API (requires network)

The cell below demonstrates how to query the live CourtListener API.
You can run it when you have network access, or simply read through the
code to understand the pattern.

> **[REQUIRES NETWORK]** -- The next cell makes HTTP requests to
> `courtlistener.com`.

In [None]:
# [REQUIRES NETWORK] -- Uncomment and run to query the live API.
#
# import requests
#
# API_BASE = "https://www.courtlistener.com/api/rest/v4"
#
# def fetch_opinions(
#     court: str = "scotus",
#     date_after: str = "2023-01-01",
#     page_size: int = 5,
# ) -> list[dict]:
#     """Fetch opinions from the CourtListener API."""
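#     # Opinions are filtered through their cluster and docket, and the filing
#     # date lives on the cluster (`date_created` is the database timestamp).
#     # Check these filter names against the live API docs before relying on them.
#     params = {
#         "cluster__docket__court": court,
#         "cluster__date_filed__gte": date_after,
#         "page_size": page_size,
#         "format": "json",
#     }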
#     resp = requests.get(f"{API_BASE}/opinions/", params=params, timeout=15)
#     resp.raise_for_status()
#     data = resp.json()
#     print(f"Total results available: {data['count']}")
#     return data["results"]
#
# api_opinions = fetch_opinions()
# for op in api_opinions:
#     # download_url may be null in the response, so fall back before slicing.
#     print(f"  [{op['id']}] {(op.get('download_url') or 'N/A')[:80]}")
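
The request pattern above returns a single page of results. A common way to collect more is to follow the link to the next page until none is left. The sketch below shows that loop in an offline-testable form. It assumes each response is a dict with a `results` list and a `next` URL (the `next` key is an assumption about the response shape, as are the helper name `iter_results` and the fake pages).

```python
from typing import Callable, Iterator, Optional


def iter_results(
    first_url: str,
    fetch_page: Callable[[str], dict],
    max_items: Optional[int] = None,
) -> Iterator[dict]:
    """Yield items from a paginated API, following `next` links page by page."""
    if max_items is not None and max_items <= 0:
        return
    url: Optional[str] = first_url
    yielded = 0
    while url:
        page = fetch_page(url)
        for item in page.get("results", []):
            yield item
            yielded += 1
            if max_items is not None and yielded >= max_items:
                return  # stop before requesting another page
        url = page.get("next")  # None (or missing) on the last page


# Offline stand-in for the network: two fake pages keyed by made-up URLs.
FAKE_PAGES = {
    "page-1": {"results": [{"id": 1}, {"id": 2}], "next": "page-2"},
    "page-2": {"results": [{"id": 3}], "next": None},
}

ids = [item["id"] for item in iter_results("page-1", FAKE_PAGES.__getitem__)]
print(ids)  # [1, 2, 3]
```

With network access, `fetch_page` could be something like `lambda url: requests.get(url, timeout=15).json()`. Passing the fetcher in as an argument keeps the pagination logic runnable without a connection.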

## Comparing Data Quality

How does curated legal data compare to what you might extract from Common
Crawl? Let's look at the differences side by side.

### Common Crawl Extract (simulated)

Below we simulate a typical raw web extract of a legal page. This represents
what you would get after downloading a WARC response and running it through
BeautifulSoup -- navigation chrome, ads, and formatting artifacts mixed in
with the substantive text.

In [None]:
# Simulated Common Crawl extract -- this is what raw web scraping looks like
# after HTML-to-text conversion.
SIMULATED_WEB_EXTRACT = """Home | About | Contact | Login
Search our database of over 10 million cases...
ADVERTISEMENT - Legal Services - Click Here
================================================
Henderson v. Meridian Health Systems, Inc.
No. 24-1847
United States Court of Appeals for the Seventh Circuit
Filed: March 15, 2024

Before the Court is the appeal of plaintiff James Henderson from the district
court's grant of summary judgment in favor of defendant Meridian Health Systems.
Henderson alleges that Meridian violated the Americans with Disabilities Act...
[content continues]
================================================
Related Cases | Share | Print | Download PDF
Footer: Copyright 2024 LegalCasesOnline.com | Privacy Policy | Terms of Use
Subscribe to our newsletter for weekly case updates!"""

print("=" * 72)
print("COMMON CRAWL EXTRACT (simulated)")
print("=" * 72)
print(SIMULATED_WEB_EXTRACT)

In [None]:
# CourtListener data -- clean, structured, authoritative.
cl_opinion = opinions[0]  # Henderson v. Meridian -- same case

print("=" * 72)
print("COURTLISTENER DATA")
print("=" * 72)
print(f"Case Name  : {cl_opinion['case_name']}")
print(f"Court      : {cl_opinion['court']}")
print(f"Date Filed : {cl_opinion['date_filed']}")
print(f"Citations  : {cl_opinion['citations']}")
print("\nOpinion text (first 400 chars):")
print(cl_opinion["text"][:400])

### Side-by-Side Comparison

| Dimension | Common Crawl | CourtListener |
|-----------|-------------|---------------|
| **Metadata** | Must be inferred from HTML/URL | Structured fields (court, date, case name) |
| **Text quality** | Noisy -- nav bars, ads, footers mixed in | Clean opinion text only |
| **Citations** | Embedded in raw text, must be parsed | Provided as a structured list |
| **Provenance** | Unknown third-party website | Official court records |
| **Coverage** | Unpredictable; depends on what was crawled | Systematic collection from court systems |
| **Freshness** | Crawl lag (weeks to months) | Near real-time updates |

For building a legal training corpus, curated data of the CourtListener kind
is the stronger foundation: metadata, citations, and provenance arrive with
the text instead of having to be reconstructed. Common Crawl still has value,
though: it captures secondary sources (law review articles, legal blogs, news
coverage) that can provide broader context.
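
To make the "must be inferred" rows of the table concrete, here is a rough sketch of what recovering metadata from a raw web extract involves. It runs on a shortened copy of the simulated page from above. The boilerplate patterns and the helpers `strip_boilerplate` and `infer_metadata` are illustrative assumptions tuned to this one page, not a general-purpose cleaner.

```python
import re

# Shortened copy of the simulated extract shown earlier in this notebook.
RAW_PAGE = """Home | About | Contact | Login
ADVERTISEMENT - Legal Services - Click Here
Henderson v. Meridian Health Systems, Inc.
No. 24-1847
United States Court of Appeals for the Seventh Circuit
Filed: March 15, 2024

Before the Court is the appeal of plaintiff James Henderson.
Related Cases | Share | Print | Download PDF
Subscribe to our newsletter for weekly case updates!"""

# Hand-written guesses at what site chrome looks like on this page.
BOILERPLATE = re.compile(
    r"\s\|\s|^ADVERTISEMENT\b|^Subscribe\b|^Footer:", re.IGNORECASE
)


def strip_boilerplate(raw: str) -> list[str]:
    """Keep non-empty lines that do not match any boilerplate pattern."""
    return [
        line for line in raw.splitlines()
        if line.strip() and not BOILERPLATE.search(line)
    ]


def infer_metadata(lines: list[str]) -> dict:
    """Guess metadata fields from free text using regular expressions."""
    meta = {"case_name": None, "docket_number": None, "date_filed": None}
    for line in lines:
        if meta["case_name"] is None and re.search(r"\sv\.\s", line):
            meta["case_name"] = line.strip()
        elif (m := re.match(r"No\.\s*(\S+)", line)):
            meta["docket_number"] = m.group(1)
        elif (m := re.match(r"Filed:\s*(.+)", line)):
            meta["date_filed"] = m.group(1).strip()
    return meta


clean_lines = strip_boilerplate(RAW_PAGE)
print(infer_metadata(clean_lines))
```

Even on this friendly input the result is fragile: the date comes back as free text rather than an ISO date, the court name is not captured at all, and any legitimate line containing ` | ` would be discarded. A curated source provides these fields directly.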

## Corpus Statistics

Let's compute basic statistics over our sample corpus to understand its
characteristics. These metrics are useful baselines when evaluating any
text dataset.

In [None]:
# --- Document count ---
doc_count = len(opinions)
print(f"Document count: {doc_count}")

In [None]:
# --- Average length (characters and words) ---
char_lengths = [len(op["text"]) for op in opinions]
word_lengths = [len(op["text"].split()) for op in opinions]

avg_chars = sum(char_lengths) / doc_count
avg_words = sum(word_lengths) / doc_count

print(f"Average length (characters): {avg_chars:,.0f}")
print(f"Average length (words)     : {avg_words:,.0f}")
print(f"Min words: {min(word_lengths):,}  |  Max words: {max(word_lengths):,}")
print()
print("Per-document breakdown:")
for op, wc, cc in zip(opinions, word_lengths, char_lengths):
    print(f"  [{op['id']}] {op['case_name'][:45]:45s}  {wc:>5} words  {cc:>6} chars")

In [None]:
# --- Vocabulary size (unique words) ---
# Minimal normalization: lowercase, then keep alphabetic tokens only (with an
# optional internal apostrophe). Digits and punctuation are dropped.
all_words = []
for op in opinions:
    tokens = re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?", op["text"].lower())
    all_words.extend(tokens)

vocab = set(all_words)
total_tokens = len(all_words)

print(f"Total tokens (words)  : {total_tokens:,}")
print(f"Vocabulary size       : {len(vocab):,} unique words")
print(f"Type-token ratio      : {len(vocab) / total_tokens:.4f}")

In [None]:
# --- Most common legal terms ---
# We filter out common English stop words to surface domain-specific vocabulary.
STOP_WORDS = {
    "the", "of", "and", "to", "a", "in", "that", "is", "for", "it", "was",
    "on", "with", "as", "at", "by", "an", "be", "this", "from", "or", "are",
    "not", "but", "its", "his", "her", "he", "she", "has", "have", "had",
    "been", "were", "which", "their", "we", "they", "will", "would", "can",
    "could", "may", "should", "shall", "do", "does", "did", "no", "if",
    "all", "more", "than", "other", "into", "also", "any", "such", "when",
    "who", "what", "there", "each", "about", "up", "out", "so", "said",
    "under", "after", "before", "between", "over", "because", "our", "you",
}

legal_word_counts = Counter(
    w for w in all_words if w not in STOP_WORDS and len(w) > 2
)

print("Most common terms (excluding stop words):")
print()
for word, count in legal_word_counts.most_common(25):
    bar = "|" * min(count, 50)
    print(f"  {word:20s}  {count:>4}  {bar}")

In [None]:
# --- Citation analysis ---
all_citations = []
for op in opinions:
    all_citations.extend(op.get("citations", []))

# Counter tallies every citation in one pass instead of rescanning the list.
citation_counts = Counter(all_citations)

print(f"Total citations across corpus: {len(all_citations)}")
print(f"Unique citations             : {len(citation_counts)}")
print()
print("All citations:")
for cite in sorted(citation_counts):
    print(f"  [{citation_counts[cite]}x] {cite}")

### Summary Statistics

Even with just 5 sample opinions, we can observe several characteristics
of legal text:

1. **Long documents** -- Court opinions are substantially longer than typical
   web pages, often running to several thousand words.
2. **Specialized vocabulary** -- Terms like "court", "defendant", "plaintiff",
   "motion", and "commission" appear frequently.
3. **Dense citation networks** -- Opinions reference other opinions, creating
   a graph structure that is valuable for understanding legal reasoning.
4. **Formal register** -- The language is precise and formulaic, quite
   different from informal web text.
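
The citation network in point 3 can be made concrete with a small index. The sketch below assumes the `citations` field lists the authorities an opinion cites. The records, ids, and citation strings are placeholders invented for illustration, and `build_cited_by` and `shared_citations` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical records: ids and citation strings are placeholders.
toy_opinions = [
    {"id": "A", "citations": ["cite-x", "cite-y"]},
    {"id": "B", "citations": ["cite-x"]},
    {"id": "C", "citations": []},
]


def build_cited_by(opinions: list[dict]) -> dict[str, set]:
    """Map each citation string to the set of opinion ids that cite it."""
    cited_by = defaultdict(set)
    for op in opinions:
        for cite in op.get("citations", []):
            cited_by[cite].add(op["id"])
    return dict(cited_by)


def shared_citations(opinions: list[dict]) -> dict[tuple, set]:
    """For each pair of opinions, collect the citations they have in common."""
    pairs = {}
    for i, left in enumerate(opinions):
        for right in opinions[i + 1:]:
            common = set(left.get("citations", [])) & set(right.get("citations", []))
            if common:
                pairs[(left["id"], right["id"])] = common
    return pairs


cited_by = build_cited_by(toy_opinions)
for cite, ids in sorted(cited_by.items()):
    print(f"{cite}: cited by {sorted(ids)}")

print(shared_citations(toy_opinions))  # {('A', 'B'): {'cite-x'}}
```

Opinions that share many citations tend to address related questions, so even this simple overlap measure gives a first view of the graph structure.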

## Exercises

### Exercise (a): Fetch Opinions from a Specific Court and Date Range

Using the CourtListener API (requires network access), write code to:
1. Fetch the 10 most recent opinions from the Second Circuit (`ca2`).
2. Filter for opinions filed after `2024-01-01`.
3. Display the case name, date, and first 200 characters of each opinion.

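Starter code (it queries `/clusters/`, which returns case-level metadata;
retrieving the opinion text for step 3 is left to you):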
```python
import requests

API_BASE = "https://www.courtlistener.com/api/rest/v4"

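# Clusters are filtered by court through their docket, and "filed after"
# means the cluster's filing date, not the database creation timestamp.
# Check these filter names against the live API docs before relying on them.
params = {
    "docket__court": "ca2",
    "date_filed__gte": "2024-01-01",
    "page_size": 10,
    "ordering": "-date_filed",
    "format": "json",
}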

resp = requests.get(f"{API_BASE}/clusters/", params=params, timeout=15)
resp.raise_for_status()
results = resp.json()["results"]

for r in results:
    print(r["case_name"], r["date_filed"])
```

### Exercise (b): Compute Corpus Statistics on a Larger Dataset

Using either the local sample data or opinions fetched from the API, compute:

1. **Document count** -- How many opinions are in the corpus?
2. **Average length** -- Mean number of characters and words per opinion.
3. **Vocabulary size** -- Number of unique words after lowercasing.
4. **Top 20 legal terms** -- Most frequent words after removing stop words.

Compare your results to the statistics we computed above. If you used API data,
how do the numbers differ from our 5-document sample? What does this tell you
about the representativeness of small samples?