# Entity Resolution

**Problem**: How can we identify and unify organization names across dockets, accounting for aliases and inconsistent naming conventions?

**Stakeholder Quotes:**
- "Make it easier to determine which organizations submitted comments."
- "We need entity disambiguation."

This notebook demonstrates:
- Extracting organization names from comment titles
- Identifying variations of the same entity
- Building a canonical entity mapping
- Tracking organizations across dockets

In [5]:
import duckdb
import pandas as pd
from collections import Counter
import re

R2_BASE_URL = "https://pub-5fc11ad134984edf8d9af452dd1849d6.r2.dev"

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
print("✓ Ready")

✓ Ready


## 1. Extract Commenter Names

Comment titles often contain the submitter's name or organization, but many are generic labels, so start by sampling titles to see which naming patterns occur.

In [6]:
# Sample comment titles to understand naming patterns
titles = conn.execute(f"""
    SELECT title, docket_id, posted_date
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE title IS NOT NULL
      AND LENGTH(title) > 10
    LIMIT 100
""").fetchdf()

print("Sample comment titles:")
for t in titles['title'].head(20):
    print(f"  - {t[:80]}")

Sample comment titles:
  - SACWIS Public Comments
  - Comment on FR Doc # E7-24860
  - Comment on FR Doc # E7-24860
  - Comment on FR Doc # E7-24860
  - AFCARS PUBLIC COMMENTS
  - Comment on FR Doc # 2010-23583
  - Comment on AFCARS NPRM
  - Comment on FR Doc # 2010-23583
  - SACWIS Public Comments
  - AFCARS NPRM
  - Comment on FR Doc # 2010-23583
  - AFCARS PUBLIC COMMENT
  - Comment on FR Doc # 2010-23583
  - Comment on FR Doc # E8-28812
  - Comment on AFCARS NPRM
  - Comment on FR Doc # 2010-23583
  - SACWIS Public Comments
  - Comment on AFCARS NPRM
  - Comments Regarding Regulations For Tribal Title IV-E
  - Comment on FR Doc # 2010-23583


## 2. Find Frequent Commenters

Find titles that recur across many dockets. Once generic titles are filtered out, the most frequent ones are often organization names.

In [7]:
# Most frequent comment titles (often org names)
frequent = conn.execute(f"""
    SELECT 
        title,
        COUNT(*) as comment_count,
        COUNT(DISTINCT docket_id) as docket_count
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE title IS NOT NULL
      AND title NOT LIKE 'Comment%'
      AND title NOT LIKE 'Anonymous%'
      AND LENGTH(title) > 5
    GROUP BY title
    HAVING COUNT(DISTINCT docket_id) > 5
    ORDER BY docket_count DESC
    LIMIT 30
""").fetchdf()

print("Frequent commenters across multiple dockets:")
frequent

Frequent commenters across multiple dockets:


| | title | comment_count | docket_count |
|---|---|---|---|
| 0 | Submitted Electronically via eRulemaking Portal | 684529 | 694 |
| 1 | Boeing Commercial Airplane | 424 | 387 |
| 2 | Advocates for Highway and Auto Safety - Comments | 869 | 364 |
| 3 | The Boeing Company | 385 | 352 |
| 4 | United Airlines | 279 | 261 |
| 5 | Public Comment | 11544 | 258 |
| 6 | Boeing Commercial Airplanes | 259 | 245 |
| 7 | Air Line Pilots Association, International | 258 | 243 |
| 8 | Air Line Pilots Association, Int'l | 244 | 236 |
| 9 | American Airlines | 259 | 229 |
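
Several rows above look like spellings of the same submitter. The following is a minimal sketch of rolling comment counts up under a canonical name. `ALIASES` and `rollup` are hypothetical names, and the alias pairs are assumptions based only on how the titles read. The sketch sums `comment_count` only: `docket_count` is a distinct count, so adding it across variants would double-count any docket that appears under both spellings.

```python
from collections import defaultdict

# Rows copied from the output above: (title, comment_count).
rows = [
    ("Boeing Commercial Airplane", 424),
    ("Boeing Commercial Airplanes", 259),
    ("Air Line Pilots Association, International", 258),
    ("Air Line Pilots Association, Int'l", 244),
    ("United Airlines", 279),
]

# Hypothetical hand-written alias map: variant title -> canonical name.
ALIASES = {
    "Boeing Commercial Airplane": "Boeing Commercial Airplanes",
    "Air Line Pilots Association, Int'l": "Air Line Pilots Association, International",
}

def rollup(rows, aliases):
    """Sum comment counts per canonical name; unmapped titles map to themselves."""
    totals = defaultdict(int)
    for title, comment_count in rows:
        totals[aliases.get(title, title)] += comment_count
    return dict(totals)

totals = rollup(rows, ALIASES)
for name, count in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{count:>5}  {name}")
```

To get a correct merged docket count, the canonical name would have to be applied before the `COUNT(DISTINCT docket_id)` aggregation rather than after it.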


## 3. Identify Organization Patterns

Look for common organizational suffixes and patterns.

In [8]:
# Find titles containing org indicators
org_patterns = ['Association', 'Institute', 'Foundation', 'Corporation', 'Inc.', 
                'LLC', 'Council', 'Coalition', 'Alliance', 'Federation', 'Union',
                'Chamber', 'Society', 'Board', 'Commission', 'Agency']

pattern_clause = " OR ".join([f"title LIKE '%{p}%'" for p in org_patterns])

orgs = conn.execute(f"""
    SELECT 
        title,
        COUNT(*) as comments,
        COUNT(DISTINCT docket_id) as dockets,
        COUNT(DISTINCT agency_code) as agencies
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE ({pattern_clause})
    GROUP BY title
    HAVING COUNT(*) > 3
    ORDER BY dockets DESC
    LIMIT 50
""").fetchdf()

print(f"Found {len(orgs)} organizations with multiple comments:")
orgs

Found 50 organizations with multiple comments:


| | title | comments | dockets | agencies |
|---|---|---|---|---|
| 0 | Comment from Air Line Pilots Association, Int'l | 551 | 528 | 4 |
| 1 | Comment from Advanced Medical Technology Assoc... | 323 | 256 | 6 |
| 2 | Air Line Pilots Association, International | 258 | 243 | 4 |
| 3 | Air Line Pilots Association, Int'l | 244 | 236 | 2 |
| 4 | American Trucking Associations - Comments | 276 | 207 | 8 |
| 5 | Comment submitted by American Chemistry Counci... | 413 | 178 | 1 |
| 6 | Aircraft Owners and Pilots Association | 193 | 167 | 3 |
| 7 | Air Line Pilots Association | 152 | 145 | 3 |
| 8 | Comment from U.S. Chamber of Commerce | 161 | 141 | 31 |
| 9 | National Transportation Safety Board | 160 | 141 | 5 |
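
Many of the titles above wrap the organization name in boilerplate such as "Comment from", "Comment submitted by", or a trailing "- Comments". The following is a rough sketch of stripping that boilerplate before comparing names. `extract_org` is a hypothetical helper, and the two patterns cover only the wrappers visible in this sample, so other dockets will likely need more.

```python
import re

# Boilerplate observed in the sample output; illustrative, not exhaustive.
_PREFIX = re.compile(r"^(?:comment\s+submitted\s+by|comment\s+from)\s+", re.IGNORECASE)
_SUFFIX = re.compile(r"\s*-\s*comments?$", re.IGNORECASE)

def extract_org(title: str) -> str:
    """Strip a leading 'Comment from' / 'Comment submitted by' and a trailing '- Comments'."""
    name = _PREFIX.sub("", title.strip())
    name = _SUFFIX.sub("", name)
    return name.strip()

for t in [
    "Comment from Air Line Pilots Association, Int'l",
    "American Trucking Associations - Comments",
    "Comment from U.S. Chamber of Commerce",
    "Aircraft Owners and Pilots Association",
]:
    print(f"{t!r} -> {extract_org(t)!r}")
```

After this step, "Comment from Air Line Pilots Association, Int'l" and "Air Line Pilots Association, Int'l" reduce to the same string, which removes one source of duplicate rows before any fuzzy matching is attempted.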


## 4. Fuzzy Name Matching

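Find variations of the same entity name. The query below is a cheap first pass: it groups titles by a shared 30-character prefix rather than computing a true string-similarity score.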

In [9]:
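# Simple approach: group titles that share their first 30 characters.
# This catches "Air Line Pilots Association, International" vs
# "Air Line Pilots Association, Int'l". Titles of 30 characters or fewer
# (e.g. "American Petroleum Institute") are excluded by the LENGTH filter.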

prefix_matches = conn.execute(f"""
    SELECT 
        LEFT(title, 30) as name_prefix,
        COUNT(DISTINCT title) as variations,
        COUNT(*) as total_comments,
        ARRAY_AGG(DISTINCT title) as all_variations
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE title IS NOT NULL
      AND LENGTH(title) > 30
    GROUP BY LEFT(title, 30)
    HAVING COUNT(DISTINCT title) > 1
    ORDER BY variations DESC
    LIMIT 20
""").fetchdf()

print("Entity name variations (potential duplicates to merge):")
for _, row in prefix_matches.iterrows():
    print(f"\n{row['name_prefix']}... ({row['variations']} variations, {row['total_comments']} comments)")
    for v in row['all_variations'][:5]:
        print(f"  - {v[:60]}")

Entity name variations (potential duplicates to merge):

BIS Decision Memo - BIS-2018-0... (81186 variations, 88647 comments)
  - BIS Decision Memo - BIS-2018-0006-166085
  - BIS Decision Memo - BIS-2018-0006-119814
  - BIS Decision Memo - BIS-2018-0006-87820
  - BIS Decision Memo - BIS-2018-0006-117474
  - BIS Decision Memo - BIS-2018-0006-114758

Planned Parenthood Mass Mailin... (36488 variations, 37638 comments)
  - Planned Parenthood Mass Mailing, Donna, Prinzmetal - OR
  - Planned Parenthood Mass Mailing, Michael, Smar - IL
  - Planned Parenthood Mass Mailing, Emily, Schneider - OH
  - Planned Parenthood Mass Mailing, Philip, Steier - NE
  - Planned Parenthood Mass Mailing, Cynthia, Wick - MD

[TF] Template Form Comment - (... (9789 variations, 43333 comments)
  - [TF] Template Form Comment - (no last name), superNEScube - 
  - [TF] Template Form Comment - (no last name), Peter - First R
  - [TF] Template Form Comment - (no last name), Gareth - First 
  - [TF] Template Form Comme
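
The largest prefix groups above are templated titles (decision memos, mass mailings, form comments) rather than spelling variants of one organization, so prefix grouping alone does not yield a canonical mapping. As a sketch of one alternative, the code below normalizes names and then clusters them greedily with `difflib.SequenceMatcher` from the standard library. `ABBREVIATIONS`, `normalize`, `build_canonical_map`, and the 0.9 threshold are all assumptions chosen for illustration, not values tuned on this dataset.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; extend it from variants seen in the data.
ABBREVIATIONS = {"int'l": "international", "intl": "international", "assoc": "association"}

def normalize(name: str) -> str:
    """Lowercase, expand known abbreviations, and drop punctuation."""
    tokens = re.findall(r"[a-z0-9']+", name.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(t.replace("'", "") for t in tokens)

def build_canonical_map(names, threshold=0.9):
    """Greedy clustering: map each name to the first earlier representative
    whose normalized form is at least `threshold` similar."""
    canonical = {}         # original name -> canonical name
    representatives = []   # (normalized form, canonical name)
    for name in names:
        norm = normalize(name)
        for rep_norm, rep_name in representatives:
            if SequenceMatcher(None, norm, rep_norm).ratio() >= threshold:
                canonical[name] = rep_name
                break
        else:
            # No sufficiently similar representative: start a new cluster.
            representatives.append((norm, name))
            canonical[name] = name
    return canonical

names = [
    "Air Line Pilots Association, International",
    "Air Line Pilots Association, Int'l",
    "Boeing Commercial Airplanes",
    "Boeing Commercial Airplane",
    "United Airlines",
]
mapping = build_canonical_map(names)
for original, canon in mapping.items():
    print(f"{original!r} -> {canon!r}")
```

The pairwise comparison is quadratic in the number of names, so on the full title set it would make sense to compare only within a block (for example, the 30-character prefix groups from the query above) and to review merges by hand before treating the map as canonical.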

## 5. Track Entity Across Dockets

In [11]:
# Track a specific organization's commenting activity
org_name = "Sierra Club"  # Change this to track different orgs

org_activity = conn.execute(f"""
    SELECT 
        c.docket_id,
        d.title as docket_title,
        d.agency_code,
        c.posted_date,
        c.title as comment_title
    FROM read_parquet('{R2_BASE_URL}/comments.parquet') c
    LEFT JOIN read_parquet('{R2_BASE_URL}/dockets.parquet') d
        ON c.docket_id = d.docket_id
    -- The pattern is bound as a parameter so names containing quotes cannot break the query
    WHERE LOWER(c.title) LIKE ?
    ORDER BY c.posted_date DESC
    LIMIT 25
""", [f"%{org_name.lower()}%"]).fetchdf()

print(f"{org_name} commenting activity:")
org_activity

Sierra Club commenting activity:


| | docket_id | docket_title | agency_code | posted_date | comment_title |
|---|---|---|---|---|---|
| 0 | EPA-R09-OAR-2025-1938 | Air Plan Approval; California; San Joaquin Val... | EPA | 2026-01-14T05:00:00Z | Comment submitted by Committee for a Better Ar... |
| 1 | EPA-HQ-OW-2025-0322 | Updated Definition of Waters of the United States | EPA | 2026-01-14T05:00:00Z | Comment submitted by Lehigh Valley Group of th... |
| 2 | EPA-HQ-OW-2025-0322 | Updated Definition of Waters of the United States | EPA | 2026-01-12T05:00:00Z | Comment submitted by Sierra Club et al. (Part ... |
| 3 | EPA-HQ-OW-2025-0322 | Updated Definition of Waters of the United States | EPA | 2026-01-12T05:00:00Z | Comment submitted by Sierra Club et al. (Part ... |
| 4 | EPA-HQ-OW-2025-0322 | Updated Definition of Waters of the United States | EPA | 2026-01-06T05:00:00Z | Comment submitted by Illinois Chapter of the S... |
| 5 | EPA-HQ-OW-2025-0322 | Updated Definition of Waters of the United States | EPA | 2026-01-06T05:00:00Z | Comment submitted by Piasa Palisades Group, Si... |
| 6 | EPA-HQ-OW-2025-0322 | Updated Definition of Waters of the United States | EPA | 2026-01-06T05:00:00Z | Mass Comment Campaign sponsored by Sierra Club |
| 7 | EPA-HQ-OW-2002-0072 | Process for Designing a Watershed Initiative | EPA | 2026-01-02T05:00:00Z | Comment submitted by Cumberland (Kentucky) Cha... |
| 8 | EPA-HQ-OW-2002-0072 | Process for Designing a Watershed Initiative | EPA | 2026-01-02T05:00:00Z | Comment submitted by Cumberland (Kentucky) Cha... |
| 9 | EPA-HQ-OPPT-2020-0549 | Reporting and Recordkeeping for Perfluoroalkyl... | EPA | 2025-12-31T05:00:00Z | Comment submitted by Sierra Club |
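
The substring match returns more than comments filed directly by the organization: the rows above also include local chapters and groups, "et al." filings, and a sponsored mass campaign. The following is a rough sketch of labeling each match so these cases can be counted separately. `match_kind` and its label names are hypothetical, the rules are heuristics inferred from the titles shown, and the chapter example uses a made-up title because the real ones are truncated in the output.

```python
import re

def match_kind(title: str, org: str) -> str:
    """Label how `org` appears in a comment title (heuristic; first rule wins)."""
    t = title.lower()
    o = org.lower()
    if o not in t:
        return "no_match"
    if "mass comment campaign" in t:
        return "campaign"
    if re.search(re.escape(o) + r"\s+et al", t):
        return "coalition"
    if re.search(r"\b(?:chapter|group)\b", t):
        return "chapter"
    return "direct"

examples = [
    "Comment submitted by Sierra Club",
    "Comment submitted by Sierra Club et al.",
    "Mass Comment Campaign sponsored by Sierra Club",
    "Comment submitted by Example Chapter of the Sierra Club",  # hypothetical title
]
for title in examples:
    print(f"{match_kind(title, 'Sierra Club'):<10} {title}")
```

Applied to the `comment_title` column of `org_activity`, labels like these would let direct filings be counted apart from chapter and coalition filings when tracking an organization across dockets.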


## 6. Build Entity Classification

Classify entities by type (industry, trade association, advocacy, academic, government), with "individual" as the fallback when no keyword matches.

In [12]:
# Simple keyword-based classification
classifications = {
    'industry': ['Corporation', 'Inc.', 'LLC', 'Corp.', 'Company', 'Industries'],
    'trade_association': ['Association', 'Federation', 'Chamber', 'Council'],
    'advocacy': ['Sierra Club', 'EDF', 'NRDC', 'Citizens', 'Action', 'Watch'],
    'academic': ['University', 'College', 'Institute', 'Professor', 'PhD'],
    'government': ['Department', 'Agency', 'Commission', 'State of', 'City of']
}

# Build classification query
case_clauses = []
for entity_type, keywords in classifications.items():
    conditions = " OR ".join([f"title LIKE '%{kw}%'" for kw in keywords])
    case_clauses.append(f"WHEN {conditions} THEN '{entity_type}'")

case_stmt = "CASE " + " ".join(case_clauses) + " ELSE 'individual' END"

entity_breakdown = conn.execute(f"""
    SELECT 
        {case_stmt} as entity_type,
        COUNT(*) as comment_count,
        COUNT(DISTINCT docket_id) as dockets
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE title IS NOT NULL
    GROUP BY entity_type
    ORDER BY comment_count DESC
""").fetchdf()

print("Comment breakdown by entity type:")
entity_breakdown

Comment breakdown by entity type:


| | entity_type | comment_count | dockets |
|---|---|---|---|
| 0 | individual | 24379915 | 53985 |
| 1 | industry | 172278 | 14152 |
| 2 | trade_association | 123473 | 15875 |
| 3 | government | 56682 | 8042 |
| 4 | academic | 29246 | 6860 |
| 5 | advocacy | 13079 | 3390 |
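
Because "individual" is the `ELSE` branch, it absorbs every title without a keyword, including generic ones such as "Comment on FR Doc # ..." and "Submitted Electronically via eRulemaking Portal" seen earlier. The following is a minimal Python sketch of the same first-match-wins logic with an extra bucket for generic titles. `GENERIC_PREFIXES`, the `"unlabeled"` bucket, and `classify` are assumptions added for illustration; the keyword lists are copied from the cell above.

```python
# Keyword lists copied from the classification cell; dict order sets precedence,
# mirroring the order of the WHEN clauses in the generated CASE statement.
CLASSIFICATIONS = {
    "industry": ["Corporation", "Inc.", "LLC", "Corp.", "Company", "Industries"],
    "trade_association": ["Association", "Federation", "Chamber", "Council"],
    "advocacy": ["Sierra Club", "EDF", "NRDC", "Citizens", "Action", "Watch"],
    "academic": ["University", "College", "Institute", "Professor", "PhD"],
    "government": ["Department", "Agency", "Commission", "State of", "City of"],
}

# Hypothetical list of generic title prefixes taken from the earlier outputs.
GENERIC_PREFIXES = ("Comment on FR Doc", "Public Comment",
                    "Submitted Electronically", "Anonymous")

def classify(title: str) -> str:
    """Case-sensitive substring match (like SQL LIKE '%kw%'); first match wins."""
    for entity_type, keywords in CLASSIFICATIONS.items():
        if any(kw in title for kw in keywords):
            return entity_type
    if title.startswith(GENERIC_PREFIXES):
        return "unlabeled"   # generic title: submitter cannot be inferred
    return "individual"

for t in ["The Boeing Company",
          "Air Line Pilots Association, International",
          "Comment on FR Doc # E7-24860",
          "American Petroleum Institute (API)"]:
    print(f"{classify(t):<18} {t}")
```

The last example lands in "academic" because of the keyword "Institute", which shows the limits of keyword rules: precedence and keyword choice decide the label, so the breakdown above is best read as a rough estimate.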
