# Campaign Detection

**Problem**: How can we detect duplicate or template-driven comment submissions, including coordinated campaigns and potential bot activity?

This notebook demonstrates techniques for identifying:
- Duplicate comments (exact matches)
- Template-driven comments (high similarity)
- Coordinated submission patterns (time-based clustering)
- Potential bot activity (unusual submission patterns)

In [1]:
import duckdb
import pandas as pd

R2_BASE_URL = "https://pub-5fc11ad134984edf8d9af452dd1849d6.r2.dev"

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
print("✓ Ready")

✓ Ready


## 1. Find Exact Duplicate Comments

Identify comments whose text is identical, character for character, within a single docket.

In [2]:
# Find dockets with many comments for analysis
conn.execute(f"""
    SELECT docket_id, agency_code, COUNT(*) as comment_count
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    GROUP BY docket_id, agency_code
    ORDER BY comment_count DESC
    LIMIT 20
""").fetchdf()

Unnamed: 0,docket_id,agency_code,comment_count
0,DOI-2017-0002,DOI,743653
1,CEQ-2019-0003,CEQ,720590
2,ATF-2023-0002,ATF,369840
3,ETA-2019-0005,ETA,325696
4,BOEM-2022-0031,BOEM,324103
5,ATF-2021-0001,ATF,249202
6,FWS-HQ-ES-2025-0034,FWS,242893
7,ED-2021-OCR-0166,ED,238945
8,FS-2025-0001,FS,223671
9,BLM-2018-0001,BLM,222494


In [3]:
# Analyze a specific docket for duplicates
docket_id = "EPA-HQ-OAR-2021-0317"  # Change to a docket with many comments

duplicates = conn.execute(f"""
    SELECT 
        comment,
        COUNT(*) as duplicate_count,
        MIN(posted_date) as first_posted,
        MAX(posted_date) as last_posted
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE docket_id = '{docket_id}'
      AND comment IS NOT NULL
      AND LENGTH(comment) > 50
    GROUP BY comment
    HAVING COUNT(*) > 1
    ORDER BY duplicate_count DESC
    LIMIT 20
""").fetchdf()

# The query is capped by LIMIT 20, so this count is at most 20
print(f"Found {len(duplicates)} unique comment texts that appear multiple times")
duplicates

Found 20 unique comment texts that appear multiple times


Unnamed: 0,comment,duplicate_count,first_posted,last_posted
0,"Dear Administrator Regan, <br/><br/>Thank you ...",10,2023-03-28T04:00:00Z,2023-04-07T04:00:00Z
1,"As pro-life Christians, we want the air we bre...",8,2021-12-10T05:00:00Z,2022-02-04T05:00:00Z
2,Docket No. EPA-HQ-OAR-2021-0317<br/><br/>I urg...,7,2021-11-23T05:00:00Z,2021-12-21T05:00:00Z
3,"Dear Administrator Regan, <br/><br/>Thank you ...",6,2023-03-28T04:00:00Z,2023-03-30T04:00:00Z
4,Oil and gas producers should not be allowed to...,4,2022-02-03T05:00:00Z,2022-02-04T05:00:00Z
5,Communities across the U.S. are being impacted...,3,2022-02-04T05:00:00Z,2022-02-04T05:00:00Z
6,Thank for proposing a rule to reduce methane ...,3,2023-03-28T04:00:00Z,2023-03-28T04:00:00Z
7,Thank you for the opportunity to testify. My n...,2,2021-12-10T05:00:00Z,2022-02-02T05:00:00Z
8,"As an advocate for our national parks, communi...",2,2023-03-28T04:00:00Z,2023-03-30T04:00:00Z
9,I applaud the EPA for proposing an updated met...,2,2023-03-28T04:00:00Z,2023-03-28T04:00:00Z
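
Grouping on the raw `comment` column treats two submissions as different if they differ only in markup, letter case, or spacing. The following is a minimal sketch of one way to loosen that, assuming it is acceptable to strip HTML tags such as `<br/>`, collapse whitespace, and lowercase the text before comparing. The helper names (`normalize_comment`, `comment_fingerprint`, `count_normalized_duplicates`) and the sample strings are hypothetical illustrations, not part of the dataset or of the query above.

```python
import hashlib
import html
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag matcher, e.g. <br/>
WS_RE = re.compile(r"\s+")       # any run of whitespace


def normalize_comment(text: str) -> str:
    """Strip tags, unescape entities, collapse whitespace, lowercase."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return WS_RE.sub(" ", text).strip().lower()


def comment_fingerprint(text: str) -> str:
    """SHA-256 of the normalized text, usable as a grouping key."""
    return hashlib.sha256(normalize_comment(text).encode("utf-8")).hexdigest()


def count_normalized_duplicates(comments):
    """Return {fingerprint: count} for fingerprints seen more than once."""
    counts = Counter(comment_fingerprint(c) for c in comments)
    return {fp: n for fp, n in counts.items() if n > 1}


# Hypothetical sample: three variants of one letter plus an unrelated comment
sample = [
    "Dear Administrator Regan, <br/><br/>Thank you for acting.",
    "Dear Administrator Regan,\n\nThank you for acting.  ",
    "dear administrator regan, thank you for acting.",
    "A different comment entirely.",
]
dupes = count_normalized_duplicates(sample)
print(f"{len(dupes)} duplicated text(s) after normalization")
```

Normalization is a trade-off: it merges copies that differ only cosmetically, but it can also merge comments whose differences in case or markup were intentional, so the raw and normalized counts are best reported side by side.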


## 2. Detect Template-Based Comments

Find comments that open with the same text but vary slightly in the rest of the message.

In [4]:
# Group by the first 100 characters to find template patterns
templates = conn.execute(f"""
    SELECT 
        LEFT(comment, 100) as comment_start,
        COUNT(*) as count,
        COUNT(DISTINCT comment) as variations
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE docket_id = '{docket_id}'
      AND comment IS NOT NULL
      AND LENGTH(comment) > 100
    GROUP BY LEFT(comment, 100)
    HAVING COUNT(*) > 5
    ORDER BY count DESC
    LIMIT 15
""").fetchdf()

print("Comments sharing the same opening 100 characters (potential templates):")
templates

Comments sharing the same opening 100 characters (potential templates):


Unnamed: 0,comment_start,count,variations
0,Docket No. EPA-HQ-OAR-2021-0317<br/><br/>I urg...,26,20
1,"EPA Administrator Michael Regan,<br/><br/>I am...",12,12
2,"Dear Administrator Regan, <br/><br/>Thank you ...",10,1
3,"As pro-life Christians, we want the air we bre...",9,2
4,"As a person of faith and conscience, I am writ...",7,7
5,Oil and gas producers should not be allowed to...,6,3
6,"Dear EPA EPA,<br/><br/>I am deeply concerned w...",6,6
7,"Dear Administrator Regan, <br/><br/>Thank you ...",6,1
8,Im writing to urge EPA to strengthen the propo...,6,6
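
A shared 100-character prefix misses templates where the submitter edits or prepends an opening sentence. One common alternative, shown here as a hedged sketch and not as the method this notebook uses, is to compare comments by the Jaccard similarity of their word shingles (overlapping runs of `k` words). The function names, the `k=3` shingle size, the `0.5` threshold, and the sample comments are all assumptions chosen for illustration.

```python
import re


def word_shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(comments, threshold: float = 0.5, k: int = 3):
    """Compare every pair of comments; cost grows quadratically with count."""
    shingles = [word_shingles(c, k) for c in comments]
    pairs = []
    for i in range(len(comments)):
        for j in range(i + 1, len(comments)):
            score = jaccard(shingles[i], shingles[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Hypothetical sample: a template, a personalized copy, and an unrelated comment
sample = [
    "I urge the EPA to strengthen the proposed methane rule "
    "for oil and gas producers",
    "As a parent, I urge the EPA to strengthen the proposed methane rule "
    "for oil and gas producers",
    "Please reconsider this regulation because it harms small businesses",
]
pairs = similar_pairs(sample)
for i, j, score in pairs:
    print(f"comments {i} and {j}: Jaccard similarity {score:.2f}")
```

The all-pairs loop is workable for a docket of a few thousand comments. For the largest dockets listed earlier, an approximate technique such as MinHash with locality-sensitive hashing would likely be needed instead.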


## 3. Time-Based Submission Analysis

Identify potential coordinated campaigns by looking for days on which an unusually high number of comments were posted.

In [5]:
# Comments per day for a docket
daily_submissions = conn.execute(f"""
    SELECT 
        CAST(posted_date AS DATE) as date,
        COUNT(*) as comments
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE docket_id = '{docket_id}'
      AND posted_date IS NOT NULL
    GROUP BY CAST(posted_date AS DATE)
    ORDER BY date
""").fetchdf()

# Find spike days (potential campaign activity)
mean_comments = daily_submissions['comments'].mean()
std_comments = daily_submissions['comments'].std()
spike_threshold = mean_comments + 2 * std_comments

spikes = daily_submissions[daily_submissions['comments'] > spike_threshold]
print(f"Average daily comments: {mean_comments:.1f}")
print(f"Spike threshold (mean + 2σ): {spike_threshold:.1f}")
print("\nSpike days (potential coordinated campaigns):")
spikes

Average daily comments: 36.0
Spike threshold (mean + 2σ): 188.3

Spike days (potential coordinated campaigns):


Unnamed: 0,date,comments
18,2022-02-02,236
20,2022-02-04,286
53,2023-01-26,266
69,2023-02-27,198
77,2023-04-10,264
78,2023-04-12,489
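
The daily query returns only dates on which at least one comment was posted, so days with no comments never enter the mean or the standard deviation. The sketch below shows how filling those gaps with zeros can change the threshold. It assumes that a missing date really means zero comments (and not, say, a gap in the data), and it uses only the standard library. The function names and the sample counts are hypothetical. `statistics.stdev` is the sample standard deviation, which matches the pandas default used above.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def fill_missing_days(daily_counts: dict) -> dict:
    """Cover every calendar day from first to last date, using 0 for gaps."""
    start, end = min(daily_counts), max(daily_counts)
    filled = {}
    day = start
    while day <= end:
        filled[day] = daily_counts.get(day, 0)
        day += timedelta(days=1)
    return filled


def find_spikes(daily_counts: dict, n_sigma: float = 2.0):
    """Return (threshold, {date: count}) for days above mean + n_sigma * std."""
    counts = list(daily_counts.values())
    threshold = mean(counts) + n_sigma * stdev(counts)
    spikes = {d: c for d, c in daily_counts.items() if c > threshold}
    return threshold, spikes


# Hypothetical counts: only days with at least one comment are present
observed = {
    date(2023, 4, 3): 10,
    date(2023, 4, 4): 12,
    date(2023, 4, 7): 9,
    date(2023, 4, 10): 11,
    date(2023, 4, 12): 200,
}

raw_threshold, raw_spikes = find_spikes(observed)
filled = fill_missing_days(observed)
filled_threshold, filled_spikes = find_spikes(filled)

print(f"Without zero-fill: threshold {raw_threshold:.1f}, spikes {len(raw_spikes)}")
print(f"With zero-fill:    threshold {filled_threshold:.1f}, spikes {len(filled_spikes)}")
```

With very few observed days, a single large day inflates the standard deviation enough that it may not exceed its own threshold. Note also that `posted_date` records when a comment was posted to the docket, which may differ from when it was submitted, so spikes can reflect batch posting by the agency as well as campaign activity.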


## 4. Campaign Summary

Aggregate findings into a campaign detection report.

In [6]:
# Summary statistics for a docket
stats = conn.execute(f"""
    SELECT
        COUNT(*) as total_comments,
        COUNT(DISTINCT comment) as unique_texts,
        COUNT(*) - COUNT(DISTINCT comment) as likely_duplicates,
        ROUND(100.0 * (COUNT(*) - COUNT(DISTINCT comment)) / COUNT(*), 1) as duplicate_pct
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE docket_id = '{docket_id}'
      AND comment IS NOT NULL
""").fetchdf()

print(f"Campaign Detection Summary for {docket_id}")
print("=" * 50)
print(f"Total comments: {stats['total_comments'].iloc[0]:,}")
print(f"Unique comment texts: {stats['unique_texts'].iloc[0]:,}")
print(f"Likely duplicates: {stats['likely_duplicates'].iloc[0]:,}")
print(f"Duplicate percentage: {stats['duplicate_pct'].iloc[0]}%")

Campaign Detection Summary for EPA-HQ-OAR-2021-0317
==================================================
Total comments: 3,637
Unique comment texts: 1,120
Likely duplicates: 2,517
Duplicate percentage: 69.2%
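
The figures above follow directly from the query: 3,637 - 1,120 = 2,517 redundant copies, and 2,517 / 3,637 ≈ 69.2%. This measure counts every copy after the first, so the first instance of each repeated text is treated as an original. A complementary measure is the share of comments that belong to any repeated group at all. The sketch below computes both on a small hypothetical list; the function name `duplicate_summary` and the sample values are illustrative assumptions.

```python
from collections import Counter


def duplicate_summary(comments):
    """Summarize exact-duplicate statistics for a list of comment texts."""
    counts = Counter(comments)
    total = sum(counts.values())
    unique = len(counts)
    # Same quantity as COUNT(*) - COUNT(DISTINCT comment) in the SQL above
    extra_copies = total - unique
    # Comments whose text appears more than once, first copies included
    in_repeated_groups = sum(n for n in counts.values() if n > 1)
    return {
        "total": total,
        "unique": unique,
        "extra_copies": extra_copies,
        "extra_copies_pct": round(100.0 * extra_copies / total, 1),
        "in_repeated_groups": in_repeated_groups,
        "in_repeated_groups_pct": round(100.0 * in_repeated_groups / total, 1),
    }


# Hypothetical sample: "A" appears three times, "B" twice, "C" once
summary = duplicate_summary(["A", "A", "A", "B", "B", "C"])
for key, value in summary.items():
    print(f"{key}: {value}")
```

Neither number shows that a comment is inauthentic on its own. Form letters are a common and legitimate way for individuals to participate, so these percentages are best read as a measure of how template-driven a docket is.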
