# Docket-Level Analysis

**Problem**: How can we build clear, digestible summaries and insights from tens of thousands of comments within a single docket?

This notebook demonstrates:
- Comment volume and timeline analysis
- Top theme and keyword extraction
- Commenter type breakdown
- Sentiment distribution estimation

In [3]:
import duckdb
import pandas as pd
from collections import Counter
import re

R2_BASE_URL = "https://pub-5fc11ad134984edf8d9af452dd1849d6.r2.dev"

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
print("✓ Ready")

✓ Ready


In [4]:
# Select a docket to analyze
docket_id = "EPA-HQ-OAR-2021-0317"  # Change to your target docket

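# Get docket info. dockets.parquet can hold several rows for the same
# docket_id (differing in modify_date), so expect more than one row back.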
docket_info = conn.execute(f"""
    SELECT docket_id, agency_code, title, docket_type, modify_date
    FROM read_parquet('{R2_BASE_URL}/dockets.parquet')
    WHERE docket_id = '{docket_id}'
""").fetchdf()
docket_info

| | docket_id | agency_code | title | docket_type | modify_date |
|---|---|---|---|---|---|
| 0 | EPA-HQ-OAR-2021-0317 | EPA | Standards of Performance for New, Reconstructe... | Rulemaking | 2024-05-20T13:28:23Z |
| 1 | EPA-HQ-OAR-2021-0317 | EPA | Standards of Performance for New, Reconstructe... | Rulemaking | 2023-08-02T14:59:24Z |
| 2 | EPA-HQ-OAR-2021-0317 | EPA | Standards of Performance for New, Reconstructe... | Rulemaking | 2023-08-15T14:08:58Z |
| 3 | EPA-HQ-OAR-2021-0317 | EPA | Standards of Performance for New, Reconstructe... | Rulemaking | 2023-04-13T16:53:27Z |
| 4 | EPA-HQ-OAR-2021-0317 | EPA | Standards of Performance for New, Reconstructe... | Rulemaking | 2023-12-07T08:19:59Z |
| 5 | EPA-HQ-OAR-2021-0317 | EPA | Standards of Performance for New, Reconstructe... | Rulemaking | 2024-05-07T15:36:23Z |
| 6 | EPA-HQ-OAR-2021-0317 | EPA | Standards of Performance for New, Reconstructe... | Rulemaking | 2023-05-09T11:05:01Z |
| 7 | EPA-HQ-OAR-2021-0317 | EPA | Standards of Performance for New, Reconstructe... | Rulemaking | 2023-04-10T14:41:11Z |
| 8 | EPA-HQ-OAR-2021-0317 | EPA | Standards of Performance for New, Reconstructe... | Rulemaking | 2024-03-29T16:27:25Z |
| 9 | EPA-HQ-OAR-2021-0317 | EPA | Standards of Performance for New, Reconstructe... | Rulemaking | 2023-03-28T07:42:11Z |
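
The lookup returns ten rows for one docket ID, and they differ only in `modify_date`. If a single "current" record is wanted, one option is to keep the row with the latest timestamp. The sketch below is a minimal illustration, not part of the original notebook: `latest_snapshot` is a hypothetical helper, and it assumes `modify_date` always uses the UTC format shown in the output.

```python
from datetime import datetime


def latest_snapshot(rows):
    """Return the row with the most recent modify_date.

    `rows` is a list of dicts, e.g. docket_info.to_dict("records").
    Assumption: modify_date is a UTC string like "2024-05-20T13:28:23Z".
    """
    if not rows:
        raise ValueError("no rows to choose from")
    return max(
        rows,
        key=lambda r: datetime.strptime(r["modify_date"], "%Y-%m-%dT%H:%M:%SZ"),
    )


# Three of the rows from the output above, reduced to the relevant fields.
rows = [
    {"docket_id": "EPA-HQ-OAR-2021-0317", "modify_date": "2024-05-20T13:28:23Z"},
    {"docket_id": "EPA-HQ-OAR-2021-0317", "modify_date": "2023-08-02T14:59:24Z"},
    {"docket_id": "EPA-HQ-OAR-2021-0317", "modify_date": "2024-05-07T15:36:23Z"},
]
print(latest_snapshot(rows)["modify_date"])  # 2024-05-20T13:28:23Z
```

Whether the latest row is really the authoritative one depends on how `dockets.parquet` was built, which this notebook does not document.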


## 1. Comment Volume Overview

In [5]:
# Basic statistics
stats = conn.execute(f"""
    SELECT
        COUNT(*) as total_comments,
        COUNT(DISTINCT comment) as unique_texts,
        MIN(posted_date) as first_comment,
        MAX(posted_date) as last_comment,
        AVG(LENGTH(comment)) as avg_length
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE docket_id = '{docket_id}'
""").fetchdf()

print(f"Docket Analysis: {docket_id}")
print("=" * 50)
print(f"Total comments: {stats['total_comments'].iloc[0]:,}")
print(f"Unique texts: {stats['unique_texts'].iloc[0]:,}")
print(f"Comment period: {stats['first_comment'].iloc[0]} to {stats['last_comment'].iloc[0]}")
print(f"Average comment length: {stats['avg_length'].iloc[0]:.0f} chars")

Docket Analysis: EPA-HQ-OAR-2021-0317
==================================================
Total comments: 3,639
Unique texts: 1,120
Comment period: 2021-11-17T05:00:00Z to 2025-07-23T04:00:00Z
Average comment length: 335 chars
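
The gap between total comments and unique texts gives a rough measure of how much of the docket is repeated text, such as form-letter campaigns. The sketch below works that out from the two numbers printed above. `duplicate_share` is a hypothetical helper name. Note that `COUNT(*)` includes rows with a NULL `comment` while `COUNT(DISTINCT comment)` ignores them, so the result slightly overstates repetition when NULLs are present.

```python
def duplicate_share(total_comments, unique_texts):
    """Fraction of comments that are extra copies of an already-seen text."""
    if total_comments <= 0:
        raise ValueError("total_comments must be positive")
    return 1 - unique_texts / total_comments


share = duplicate_share(3639, 1120)
print(f"{share:.1%} of comments are additional copies of another text")
# 69.2% of comments are additional copies of another text
```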


## 2. Timeline Analysis

In [6]:
# Daily comment volume
timeline = conn.execute(f"""
    SELECT 
        CAST(posted_date AS DATE) as date,
        COUNT(*) as comments
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE docket_id = '{docket_id}'
      AND posted_date IS NOT NULL
    GROUP BY CAST(posted_date AS DATE)
    ORDER BY date
""").fetchdf()

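# Simple ASCII chart of the 30 most recent days that have comments.
# Bars are scaled to the busiest day in the whole docket, not just these 30.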
max_comments = timeline['comments'].max()
print("Daily Comment Volume:")
print("-" * 60)
for _, row in timeline.tail(30).iterrows():
    bar_length = int(40 * row['comments'] / max_comments)
    print(f"{row['date']} | {'█' * bar_length} {row['comments']}")

Daily Comment Volume:
------------------------------------------------------------
2023-03-06 00:00:00 | ██ 32
2023-03-08 00:00:00 | █ 20
2023-03-28 00:00:00 | █████ 70
2023-03-30 00:00:00 | ██ 25
2023-04-03 00:00:00 |  1
2023-04-07 00:00:00 | █████████ 114
2023-04-10 00:00:00 | █████████████████████ 264
2023-04-12 00:00:00 | ████████████████████████████████████████ 489
2023-04-13 00:00:00 | ████ 55
2023-05-09 00:00:00 |  1
2023-06-08 00:00:00 |  2
2023-07-11 00:00:00 |  2
2023-07-13 00:00:00 |  1
2023-07-20 00:00:00 |  4
2023-07-27 00:00:00 | ███ 44
2023-08-02 00:00:00 |  1
2023-08-15 00:00:00 |  5
2023-08-24 00:00:00 |  1
2023-12-04 00:00:00 |  4
2023-12-07 00:00:00 |  2
2023-12-19 00:00:00 |  2
2024-05-07 00:00:00 |  6
2024-05-09 00:00:00 | █ 18
2024-05-15 00:00:00 | ███ 40
2024-05-20 00:00:00 |  2
2024-08-28 00:00:00 |  4
2024-09-04 00:00:00 |  1
2024-09-09 00:00:00 |  1
2024-10-23 00:00:00 |  1
2025-07-23 00:00:00 |  1
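
Two cosmetic issues show up in the chart: the labels carry a redundant `00:00:00`, and days with very few comments get no bar at all. A possible variant is sketched below. `ascii_bars` is a hypothetical helper that takes `(day, count)` pairs, for example built from `timeline.itertuples(index=False)`, and it only relies on `strftime`, which both `datetime.date` and pandas timestamps provide.

```python
from datetime import date


def ascii_bars(daily_counts, width=40):
    """Render (day, count) pairs as text bars scaled to the largest count."""
    if not daily_counts:
        return []
    peak = max(count for _, count in daily_counts)
    lines = []
    for day, count in daily_counts:
        bar_len = round(width * count / peak) if peak else 0
        if count > 0:
            bar_len = max(bar_len, 1)  # keep small but nonzero days visible
        lines.append(f"{day.strftime('%Y-%m-%d')} | {'█' * bar_len} {count}")
    return lines


for line in ascii_bars([(date(2023, 4, 12), 489), (date(2023, 5, 9), 1)]):
    print(line)
```

Forcing at least one block for nonzero days trades a little accuracy for readability. Drop that line if strict proportionality matters more.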


## 3. Comment Length Distribution

In [7]:
# Categorize by length (proxy for effort/detail)
length_dist = conn.execute(f"""
    SELECT
        CASE 
            WHEN LENGTH(comment) < 100 THEN 'Very short (<100 chars)'
            WHEN LENGTH(comment) < 500 THEN 'Short (100-500 chars)'
            WHEN LENGTH(comment) < 2000 THEN 'Medium (500-2000 chars)'
            WHEN LENGTH(comment) < 10000 THEN 'Long (2000-10000 chars)'
            ELSE 'Very long (10000+ chars)'
        END as length_category,
        COUNT(*) as count
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE docket_id = '{docket_id}'
      AND comment IS NOT NULL
    GROUP BY length_category
    ORDER BY count DESC
""").fetchdf()

length_dist

| | length_category | count |
|---|---|---|
| 0 | Very short (<100 chars) | 2676 |
| 1 | Medium (500-2000 chars) | 448 |
| 2 | Short (100-500 chars) | 332 |
| 3 | Long (2000-10000 chars) | 181 |
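
Raw counts sorted by size are harder to read than shares listed from shortest to longest. The sketch below mirrors the SQL `CASE` boundaries in Python and converts the counts above into fractions in that order. The names `length_category`, `CATEGORY_ORDER`, and `category_shares` are illustrative and not part of the notebook.

```python
CATEGORY_ORDER = [
    "Very short (<100 chars)",
    "Short (100-500 chars)",
    "Medium (500-2000 chars)",
    "Long (2000-10000 chars)",
    "Very long (10000+ chars)",
]


def length_category(n_chars):
    """Same thresholds as the CASE expression: upper bounds are exclusive."""
    for label, upper in zip(CATEGORY_ORDER, (100, 500, 2000, 10000)):
        if n_chars < upper:
            return label
    return CATEGORY_ORDER[-1]


def category_shares(counts):
    """Map each category, shortest first, to its fraction of all comments."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comments to summarize")
    return {label: counts.get(label, 0) / total for label in CATEGORY_ORDER}


# Counts copied from the query output above.
counts = {
    "Very short (<100 chars)": 2676,
    "Short (100-500 chars)": 332,
    "Medium (500-2000 chars)": 448,
    "Long (2000-10000 chars)": 181,
}
for label, share in category_shares(counts).items():
    print(f"{label:<26} {share:6.1%}")
```

Roughly three quarters of the comments in this docket fall in the shortest bucket, which the count-sorted table shows less directly.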


## 4. Sample Unique vs Duplicate Comments

In [8]:
# Sample of unique, substantive comments (likely individual submissions)
unique_comments = conn.execute(f"""
    WITH comment_counts AS (
        SELECT comment, COUNT(*) as cnt
        FROM read_parquet('{R2_BASE_URL}/comments.parquet')
        WHERE docket_id = '{docket_id}'
          AND comment IS NOT NULL
          AND LENGTH(comment) > 200
        GROUP BY comment
    )
    SELECT c.comment_id, c.title, LEFT(c.comment, 300) as comment_preview
    FROM read_parquet('{R2_BASE_URL}/comments.parquet') c
    JOIN comment_counts cc ON c.comment = cc.comment
    WHERE c.docket_id = '{docket_id}'
      AND cc.cnt = 1
    LIMIT 5
""").fetchdf()

print("Sample Unique Comments (likely individual):")
for _, row in unique_comments.iterrows():
    print(f"\n--- {row['comment_id']} ---")
    print(row['comment_preview'][:200] + "...")

Sample Unique Comments (likely individual):

--- EPA-HQ-OAR-2021-0317-0206 ---
With energy costs rising almost daily this new regulation needs to be looked at thoroughly. We can control harmful emissions better with innovation instead of more regulation. At this point in time I ...

--- EPA-HQ-OAR-2021-0317-0202 ---
Hello,<br/><br/>I am new to this process but as I see the world changing and the risk of our own human extinction increasing everyday, I feel the need to start getting more involved. I&rsquo;m not a s...

--- EPA-HQ-OAR-2021-0317-0208 ---
The EPA has long been an enforcement agency of the radical left. Time and time again their &quot;recommendations&quot; have been nothing more than the furthering of a radical left-wing agenda on clima...

--- EPA-HQ-OAR-2021-0317-0207 ---
I believe methane gas is produced from decomposition of organic material and is thus naturally occurring. Is it considered &quot;green&quot; energy?  It is produced in our bodies and in the bodies of ...
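
Matching on the exact `comment` string treats two copies of a form letter as different texts if they differ in spacing, letter case, or leftover markup such as the `<br/>` and `&rsquo;` visible in the previews. One way to catch more of these is to normalize the text before grouping. The sketch below is a minimal illustration using only the standard library. `normalize` and `group_duplicates` are hypothetical helpers, and the normalization rules are an assumption about what counts as "the same comment".

```python
import html
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def normalize(text):
    """Strip tags, decode HTML entities, collapse whitespace, lowercase."""
    text = TAG_RE.sub(" ", text)   # remove tags first so decoded '<' survives
    text = html.unescape(text)     # &rsquo; -> ’, &quot; -> "
    return WS_RE.sub(" ", text).strip().lower()


def group_duplicates(comments):
    """Group (comment_id, text) pairs whose normalized text is identical.

    Returns only the groups that contain more than one comment ID.
    """
    groups = defaultdict(list)
    for comment_id, text in comments:
        groups[normalize(text)].append(comment_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up example texts, not taken from the docket.
comments = [
    ("a", "Protect our  air!<br/>"),
    ("b", "protect our air!"),
    ("c", "Something else entirely"),
]
print(group_duplicates(comments))  # [['a', 'b']]
```

This still misses campaign letters where signers edit a sentence or add a personal note. Catching those would need fuzzy matching, which is outside this sketch.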


## 5. Key Phrases (Simple Extraction)

In [9]:
# Take up to 500 non-null comments for keyword extraction.
# There is no ORDER BY, so this is an arbitrary slice, not a random sample.
sample = conn.execute(f"""
    SELECT comment
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE docket_id = '{docket_id}'
      AND comment IS NOT NULL
    LIMIT 500
""").fetchdf()

# Simple word frequency (excluding common words).
# The regex below keeps only words of 4+ letters, so the shorter entries
# in this set never match anything; they are harmless but redundant.
stopwords = {
    'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
    'of', 'with', 'by', 'from', 'is', 'are', 'was', 'were', 'be', 'been',
    'this', 'that', 'these', 'those', 'i', 'we', 'you', 'it', 'they', 'my',
    'our', 'your', 'as', 'not', 'have', 'has', 'will', 'would', 'should',
    'could', 'can', 'all', 'any', 'if', 'so', 'do', 'does',
}

all_words = []
for text in sample['comment'].dropna():
    words = re.findall(r'\b[a-z]{4,}\b', text.lower())
    all_words.extend([w for w in words if w not in stopwords])

print("Top 20 Keywords:")
for word, count in Counter(all_words).most_common(20):
    print(f"  {word}: {count}")

Top 20 Keywords:
  methane: 574
  attached: 279
  pollution: 245
  emissions: 221
  climate: 214
  wells: 177
  health: 170
  rules: 154
  more: 150
  rsquo: 148
  proposed: 146
  change: 125
  industry: 122
  flaring: 115
  their: 110
  also: 110
  protect: 103
  span: 101
  monitoring: 99
  energy: 99
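
Two entries in this list, `rsquo` and `span`, are not words from the comments. They come from HTML residue: the entity `&rsquo;` and `<span>` tags both contain a run of four or more lowercase letters, so the regex counts them. Filler words such as `more`, `their`, and `also` also get through because they are missing from the stopword set. The sketch below shows one way to address both, under the assumption that stripping tags and decoding entities before tokenizing is acceptable. `top_keywords` and `EXTRA_STOPWORDS` are hypothetical names, and the extra stopwords are only the three seen in the output above.

```python
import html
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"\b[a-z]{4,}\b")

# Filler words that appeared in the top-20 output above.
EXTRA_STOPWORDS = {"more", "their", "also"}


def top_keywords(texts, stopwords, n=20):
    """Count 4+ letter words after removing HTML tags and entities."""
    counts = Counter()
    for text in texts:
        cleaned = html.unescape(TAG_RE.sub(" ", text)).lower()
        counts.update(w for w in WORD_RE.findall(cleaned) if w not in stopwords)
    return counts.most_common(n)


# Made-up example texts, not taken from the docket.
texts = [
    "I&rsquo;m worried about <span>methane</span> pollution",
    "Methane pollution harms their health",
]
for word, count in top_keywords(texts, {"about"} | EXTRA_STOPWORDS, n=5):
    print(f"  {word}: {count}")
```

In the notebook this could be called as `top_keywords(sample['comment'].dropna(), stopwords | EXTRA_STOPWORDS)`. The resulting counts will differ from the list above, since the residue tokens drop out.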
