knownbots


KnownBots is a high-performance Go library for verifying search engine crawlers and identifying legitimate bots. It protects your web services from bot impersonation by validating User-Agent strings and IP addresses through RDNS lookups and IP range verification.

Why KnownBots?

The Problem: Malicious actors can easily spoof User-Agent strings to impersonate legitimate search engine bots (Googlebot, Bingbot, etc.) to bypass rate limits, scrape content, or exploit bot-specific logic.

The Solution: KnownBots verifies each bot claim in multiple stages:

  1. Matching User-Agent markers (case-sensitive word boundaries)
  2. Verifying IP ownership through reverse DNS lookups or official IP ranges
  3. Caching results to avoid expensive DNS queries on subsequent requests

Key Features

πŸš€ High Performance

  • Lock-free reads via atomic.Pointer[T] for bot configuration and RDNS cache
  • Zero-allocation hot paths using netip.Prefix for IP matching
  • Byte-level indexing for O(1) bot lookup (150-300ns for 40 bots vs 640ns linear scan)
  • Copy-on-Write caching optimized for read-heavy workloads (1-20 writes/day)
  • Embedded bots - 57 built-in configs compiled into binary (no file I/O at startup)
  • Optional UA classification - Disabled by default for maximum performance
  • Logging control - Disable log output via knownbots.EnableLog = false

πŸ”’ Security First

  • Case-sensitive matching prevents forgery attempts (official bots use fixed casing)
  • Word boundary validation prevents partial matches (e.g., "MyGooglebot" won't match)
  • LRU fail cache for fast rejection of known-bad IPs (1000 entry limit)
  • Browser detection distinguishes legitimate users from suspicious bot-like patterns (opt-in)

πŸ“¦ Production Ready

  • Persistent RDNS cache survives restarts (file-based storage)
  • Background scheduler automatically refreshes IP ranges from official URLs
  • Graceful degradation (cache persistence failures don't affect runtime)
  • Comprehensive tests with benchmarks for 3-40 bot scenarios

🌍 Extensible

  • YAML-based configuration for easy bot additions (no code changes)
  • Pluggable verification supports both IP ranges and RDNS verification
  • Official source integration automatically downloads and updates IP lists

Installation

go get github.com/cnlangzi/knownbots

Requirements: Go 1.21+

Quick Start

Basic Usage

package main

import (
    "fmt"
    "log"

    "github.com/cnlangzi/knownbots"
)

func main() {
    // Initialize validator (starts background scheduler)
    v, err := knownbots.New()
    if err != nil {
        log.Fatal(err)
    }
    defer v.Close()

    // Verify a bot claim
    result := v.Validate(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "66.249.66.1",
    )

    fmt.Printf("Status: %s\n", result.Status)      // "verified"
    fmt.Printf("IsBot: %t\n", result.IsBot)        // true
    fmt.Printf("IsVerified: %t\n", result.IsVerified) // true
    fmt.Printf("Bot Name: %s\n", result.Name)      // "googlebot"
}

HTTP Middleware Example

// Unexported key type avoids collisions with other packages' context values
type botResultKey struct{}

func BotVerificationMiddleware(v *knownbots.Validator) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ua := r.Header.Get("User-Agent")
            ip := r.RemoteAddr // Note: RemoteAddr is "host:port"; in production, split it or extract the client IP from X-Forwarded-For

            result := v.Validate(ua, ip)

            // Block fake bots (claims to be a bot but IP not verified)
            if result.IsBot && !result.IsVerified {
                http.Error(w, "Forbidden: Bot verification failed", http.StatusForbidden)
                return
            }

            // Add verification metadata to the request context
            ctx := context.WithValue(r.Context(), botResultKey{}, result)
            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }
}

Configuration Options

v, err := knownbots.New(
    knownbots.WithRoot("./custom-bots"),    // Custom bot config directory
    knownbots.WithFailLimit(5000),          // Failed lookup cache size
    knownbots.WithClassifyUA(),             // Enable UA classification (disabled by default)
)

// Disable logging to reduce console pollution (e.g., in benchmarks)
knownbots.EnableLog = false

Configuration

Directory Structure

bots/
β”œβ”€β”€ conf.d/              # Bot configurations (YAML)
β”‚   β”œβ”€β”€ googlebot.yaml
β”‚   β”œβ”€β”€ bingbot.yaml
β”‚   └── ...
β”œβ”€β”€ googlebot/           # Bot-specific data (auto-created)
β”‚   β”œβ”€β”€ rdns.txt        # Persistent RDNS cache
β”‚   └── ips.txt         # Downloaded IP ranges
└── ...

Bot Configuration (YAML)

name: googlebot
ua: "Googlebot"                           # EXACT casing required (case-sensitive)
urls:                                      # Official IP list URLs (auto-downloaded)
  - "https://www.gstatic.com/ipranges/google.json"
custom:                                    # Static CIDR ranges (always checked)
  - "66.249.64.0/19"
asn:                                       # ASN numbers for verification (optional)
  - 15169
domains:                                   # Verified RDNS domains
  - "googlebot.com"
  - "google.com"
rdns: true                             # Enable RDNS verification (false = IP-only)

Important:

  • User-Agent markers (ua) are case-sensitive. Official bots use fixed casing (e.g., "Googlebot", never "googlebot"). This prevents forgery attempts where attackers alter casing to bypass detection.
  • Set rdns: false for bots that only need IP range verification (faster, no DNS queries)
  • ASN verification is optional and provides faster IP ownership verification (~35ns) compared to RDNS (~450ns) for bots with official ASN registrations

Parser Selection

Choose the correct parser based on the IP list format:

| Format | JSON Example | Parser |
| --- | --- | --- |
| Google-style | {"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}]} | google |
| OpenAI-style | {"prefixes": [{"prefix": "1.2.3.4/24"}]} | openai |
| Plain text | 1.2.3.4/24 or 172.16.0.5 | txt |
| GitHub-style | {"hooks": ["1.2.3.4/24"], "web": [...]} | github |
| Stripe-style | {"WEBHOOKS": ["3.18.12.63"]} | stripe |

User-Agent Matching Rules

  1. Case-sensitive: Use exact casing from official documentation

    • βœ… Correct: ua: "Googlebot" or ua: "bingbot"
    • ❌ Wrong: ua: "googlebot" or ua: "BINGBOT"
  2. Match type: Word boundary matching (not substring)

    • ua: "Googlebot" matches: Googlebot/2.1, Mozilla/5.0 (compatible; Googlebot/2.1; ...)
    • ua: "Googlebot" does NOT match: MyGooglebot, GooglebotPro
  3. Special bots: Some bots don't use Mozilla prefix

    • ua: "GPTBot" (OpenAI)
    • ua: "curl" (CLI tool)
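The word-boundary rule above can be sketched in a few lines. This is a simplified illustration of the technique, not the library's actual implementation: a marker matches only when the characters on both sides are non-alphanumeric (or the string edge):

```go
package main

import "fmt"

// isBoundary reports whether a byte is a token separator
// (anything other than a letter or digit).
func isBoundary(b byte) bool {
	return !(b >= 'a' && b <= 'z' || b >= 'A' && b <= 'Z' || b >= '0' && b <= '9')
}

// matchesMarker performs case-sensitive, word-boundary matching:
// "Googlebot" matches "Googlebot/2.1" but not "MyGooglebot".
func matchesMarker(ua, marker string) bool {
	for i := 0; i+len(marker) <= len(ua); i++ {
		if ua[i:i+len(marker)] != marker {
			continue
		}
		leftOK := i == 0 || isBoundary(ua[i-1])
		end := i + len(marker)
		rightOK := end == len(ua) || isBoundary(ua[end])
		if leftOK && rightOK {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(matchesMarker("Mozilla/5.0 (compatible; Googlebot/2.1)", "Googlebot")) // true
	fmt.Println(matchesMarker("MyGooglebot/1.0", "Googlebot"))                         // false
	fmt.Println(matchesMarker("googlebot/2.1", "Googlebot"))                           // false (case-sensitive)
}
```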

How It Works

Verification Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Incoming Request                         β”‚
β”‚                  (User-Agent + IP Address)                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
                  β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚  UA Matches Bot?   │──No──▢ Classify UA Type
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        (Browser/Suspicious/Unknown)
                  β”‚ Yes                         β”‚
                  β–Ό                             β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        Return: IsBot=false
         β”‚  Check IP Ranges   β”‚        (legitimate browser)
         β”‚  (CIDR matching)   β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
                  β”œβ”€ Hit ──▢ Return: verified
                  β”‚
                  β”œβ”€ Miss + asn empty ──▢ Check RDNS
                  β”‚
                  β”œβ”€ Miss + asn defined ──▢ Check ASN
                  β”‚                              β”‚
                  β”‚                              β”œβ”€ Hit ──▢ Return: verified
                  β”‚                              β”‚
                  β”‚                              └─ Miss ──▢ Check RDNS
                  β”‚
                  β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚   Bot.RDNS=true?   │──No──▢ Return: failed
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        (IP-only bot, no DNS check)
                  β”‚ Yes
                  β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚  Check Fail Cache  │──Hit──▢ Return: failed
         β”‚  (LRU, 1000 IPs)   β”‚        (known fake bot)
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚ Miss
                  β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚ Check RDNS Cache   │──Hit──▢ Domain match?
         β”‚  (persistent)      β”‚         Yes: verified
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         No: failed
                  β”‚ Miss
                  β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚ Perform RDNS Lookup│──▢ Domain match?
         β”‚  (50-200ms delay)  β”‚     Yes: verified + cache
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     No: failed + fail cache

Background Scheduler (Every 24h)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Background Scheduler                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                   β”‚          β”‚
        β–Ό                   β–Ό          β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ Refresh  β”‚      β”‚ Update ASN   β”‚ β”‚ Prune &  β”‚
  β”‚ IP Lists β”‚      β”‚ Data         β”‚ β”‚ Save     β”‚
  β”‚ (HTTP)   β”‚      β”‚ (RIPE API)   β”‚ β”‚ RDNS     β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ Cache    β”‚
        β”‚                   β”‚        β”‚ (rdns=true) β”‚
        β–Ό                   β–Ό        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  Update memory      Update cache          β”‚
  Persist to file    Persist to file       β–Ό
                     (per-bot dir)    Remove invalid
                                        Persist to file

Performance

Benchmarks (40 bots, Intel i5-1038NG7 @ 2.00GHz)

| Operation | Time/op | Allocs/op | Notes |
| --- | --- | --- | --- |
| UA matching (hit first) | 165ns | 0 | Byte index + word boundary check |
| UA matching (hit middle) | 300ns | 0 | Worst case: mid-list match |
| UA matching (miss) | 640ns | 0 | Full scan + browser classification |
| Validate (IP range hit) | 227ns | 0 | Radix tree CIDR matching |
| Validate (ASN hit) | 35ns | 1 | O(1) Patricia tree lookup |
| Validate (RDNS hit) | 450ns | 0 | Cache lookup + domain match |
| Validate (cold lookup) | 50-200ms | 1-2 | DNS query (first time only) |

Key Insight: Verification priority is IP ranges β†’ ASN β†’ RDNS. ASN verification (~35ns) is faster than RDNS cache lookup (~450ns) and ideal for bots with official ASN registrations.

Scalability

| Bot Count | Index Benefit | Recommended Index |
| --- | --- | --- |
| < 20 bots | Minimal (2x) | Single byte (current) |
| 20-50 bots | Significant (4-5x) | Single byte (current) |
| > 50 bots | Critical (10x+) | Consider 3-char prefix |

Current implementation is optimized for 3-50 bots (covers 99% of use cases).

API Reference

Types

type Validator struct { /* ... */ }

type Result struct {
    Name       string       // Bot name (e.g., "googlebot")
    Status     ResultStatus // "verified" | "failed" | "unknown"
    IsBot      bool         // True if UA matches any bot or looks bot-like
    IsVerified bool         // True if IP ownership verified
}

type ResultStatus string
const (
    StatusVerified ResultStatus = "verified" // Bot confirmed (UA + IP match)
    StatusFailed   ResultStatus = "failed"   // Bot suspected but IP invalid
    StatusUnknown  ResultStatus = "unknown"  // Not a known bot
)

Methods

// New creates a validator with background scheduler
func New(opts ...Option) (*Validator, error)

// Validate verifies User-Agent and IP address
func (v *Validator) Validate(ua, ip string) Result

// Close stops background scheduler
func (v *Validator) Close() error

Options

// WithRoot sets custom bot directory (default: "./bots")
func WithRoot(dir string) Option

// WithFailLimit sets failed lookup cache size (default: 1000)
func WithFailLimit(limit int) Option

Real-World Use Cases

1. Rate Limiting

// Apply different rate limits for verified bots vs browsers
result := validator.Validate(ua, ip)
if result.IsVerified {
    limiter = rateLimits.Bot  // Generous: 10/sec
} else if result.IsBot {
    limiter = rateLimits.FakeBot  // Strict: 1/min
} else {
    limiter = rateLimits.Browser  // Normal: 5/sec
}

2. Analytics Exclusion

// Exclude verified bots from user analytics
result := validator.Validate(ua, ip)
if !result.IsBot || !result.IsVerified {
    analytics.Track(userID, event)
}

3. SEO Testing

// Allow verified Googlebot to bypass feature flags
result := validator.Validate(ua, ip)
if result.Name == "googlebot" && result.IsVerified {
    features.EnableAll()  // Show production content for indexing
}

4. Content Protection

// Block fake bots from scraping paywalled content
result := validator.Validate(ua, ip)
if result.IsBot && !result.IsVerified {
    return http.StatusForbidden  // Suspected scraper
}

Supported Bots (Built-in Configs)

Current built-in configurations:

  • Googlebot (Google Search)
  • Bingbot (Microsoft Bing)
  • facebookexternalhit (Facebook/Meta link previews)
  • GPTBot (OpenAI)
  • Applebot (Apple Search and Siri)
  • GitHub (GitHub webhooks)
  • Stripe (Stripe webhooks)
  • UptimeRobot (Uptime monitoring)

Need more bots? Add YAML configs to bots/conf.d/ - no code changes required!

Common bots to add:

  • Yandex (YandexBot)
  • Baidu (Baiduspider)
  • DuckDuckGo (DuckDuckBot)
  • Twitter (Twitterbot)
  • Slack (Slackbot)

See bots/conf.d/googlebot.yaml for configuration examples.

Testing

# Run all tests
go test ./...

# Run only unit tests (skip integration tests)
go test -short ./...

# Run benchmarks
go test -bench=. -benchmem

# Run specific test
go test -v -run ^TestValidator$

# Coverage report
go test -cover ./...

Integration Tests: The project includes integration tests that verify parsing of real API responses from:

  • Googlebot: 307 prefixes
  • Bingbot: 28 prefixes
  • GPTBot: 21 prefixes
  • GitHub: 50 prefixes
  • Stripe: 12 IPs
  • UptimeRobot: 116 prefixes
  • Applebot: 12 prefixes

Architecture Decisions

Why atomic.Pointer[T] instead of RWMutex?

Bot configurations change rarely (on reload/schedule, 1-20x/day) but are read on every request (1000s/sec). atomic.Pointer[T] provides:

  • Lock-free reads - single atomic load, no lock acquisition overhead
  • Readers never block - writes don't wait for readers, readers don't wait for writes (Copy-on-Write)
  • Consistent performance - no priority inversion or cache line contention from lock operations

Consistent sub-microsecond performance for read-heavy workloads.
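The pattern can be sketched as follows. This is a minimal, self-contained illustration of atomic.Pointer[T] with copy-on-write; the type and function names are illustrative, not the library's:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type botConfigs map[string]string // bot name -> UA marker (illustrative)

var current atomic.Pointer[botConfigs]

// lookup is the hot path: a single atomic load, no lock.
func lookup(name string) (string, bool) {
	cfgs := current.Load()
	if cfgs == nil {
		return "", false
	}
	ua, ok := (*cfgs)[name]
	return ua, ok
}

// update is the rare path: copy the map, modify the copy, swap the pointer.
// Readers holding the old pointer keep a consistent snapshot.
func update(name, ua string) {
	for {
		old := current.Load()
		next := botConfigs{}
		if old != nil {
			for k, v := range *old {
				next[k] = v // copy existing entries
			}
		}
		next[name] = ua
		if current.CompareAndSwap(old, &next) { // publish atomically
			return
		}
		// Another writer won the race; retry against the new snapshot.
	}
}

func main() {
	update("googlebot", "Googlebot")
	ua, ok := lookup("googlebot")
	fmt.Println(ua, ok) // Googlebot true
}
```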

Why case-sensitive UA matching?

Official bots use fixed casing ("Googlebot", never "googlebot"). Case variations indicate forgery. Case-sensitive matching:

  1. Rejects fakes at first stage (no expensive DNS queries)
  2. 4x faster than case-insensitive (16ns vs 67ns)
  3. Improves both security and performance

Why Copy-on-Write cache?

RDNS cache sees 1-20 new IPs per day but 1000s of reads per second (99.99% read ratio). Copy-on-Write with atomic swap provides:

  • Zero-allocation reads (no locking)
  • Safe concurrent access
  • Simple implementation (vs lock-free data structures)

Why byte-level index?

Linear bot list scan is fast for 3 bots (52ns) but degrades to 640ns at 40 bots. Single-character index provides 4-5x speedup for 20-50 bots at minimal memory cost (<1KB).
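The idea can be illustrated with a first-byte bucket map. This sketch is an assumption about the technique, not the library's exact code, and it omits the word-boundary check for brevity:

```go
package main

import (
	"fmt"
	"strings"
)

// buildIndex buckets markers by their first byte, so each UA position
// only compares against markers that could possibly start there.
func buildIndex(markers []string) map[byte][]string {
	idx := make(map[byte][]string)
	for _, m := range markers {
		idx[m[0]] = append(idx[m[0]], m)
	}
	return idx
}

// findMarker scans the UA once; at each position it checks only the
// bucket for that byte instead of the full bot list.
func findMarker(idx map[byte][]string, ua string) string {
	for i := 0; i < len(ua); i++ {
		for _, m := range idx[ua[i]] {
			if strings.HasPrefix(ua[i:], m) {
				return m
			}
		}
	}
	return ""
}

func main() {
	idx := buildIndex([]string{"Googlebot", "bingbot", "GPTBot", "Applebot"})
	fmt.Println(findMarker(idx, "Mozilla/5.0 (compatible; Googlebot/2.1)")) // Googlebot
}
```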

Why Bot method encapsulation?

IP and ASN lifecycle operations (load, refresh, persist) are shared between initialization and the background scheduler. Encapsulating these as Bot methods:

  • Eliminates duplicate code - initBot and runScheduler both call the same loadCachedIPs, refreshIPs, initializeASN, and refreshASN methods
  • Centralizes state - IPTree and ASN cache pointers live on the Bot struct, making ownership clear
  • Improves testability - Each lifecycle method can be unit tested in isolation
  • Enables future extensions - New verification methods (e.g., BGP feeds) can follow the same pattern

Example Bot methods:

func (b *Bot) loadCachedIPs(path string)   // Load cached prefixes from file
func (b *Bot) refreshIPs(http *http.Client, root string) // Download and persist new prefixes
func (b *Bot) initializeASN(store *asn.Store)  // Load ASN cache with fallback to API
func (b *Bot) refreshASN(store *asn.Store)     // Refresh ASN prefixes from API

Adding New Bots

Adding a new bot requires no code changes - just create a YAML configuration file.

Step 1: Choose Verification Method

| Method | When to Use | Example |
| --- | --- | --- |
| URL + Parser | Bot has official JSON/TXT IP list | Googlebot, Bingbot, GPTBot |
| ASN | Bot has official ASN registration | Cloudflare (AS13335), Google (AS15169) |
| RDNS Only | No official IP list, verify via DNS | Baidu, Yandex |

Step 2: Create Configuration File

Create bots/conf.d/newbot.yaml:

# Case 1: Bot with official JSON IP list (RECOMMENDED)
kind: SearchEngine        # Category: SearchEngine, SocialMedia, Tool, etc.
name: newbot              # Unique identifier (used in results)
parser: google            # Parser: google, openai, txt, github, stripe
ua: "NewBot"              # User-Agent fragment (case-sensitive!)
urls:
  - "https://example.com/bot-ips.json"

# Case 2: Bot with ASN verification (fastest option)
kind: SearchEngine
name: newbot
ua: "NewBot"
asn:
  - 12345                # ASN number (fetched from RIPE API)

# Case 3: Bot with RDNS verification only (no official IP list)
kind: SearchEngine
name: newbot
ua: "NewBot"
domains:
  - "newbot.example.com"
rdns: true

Step 3: Configure Parser

Choose the correct parser based on the IP list format:

Google-style (ipv4Prefix/ipv6Prefix fields):

{"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}, {"ipv6Prefix": "2001:db8::/32"}]}

Parser: google

OpenAI-style (prefix field):

{"prefixes": [{"prefix": "1.2.3.4/24"}]}

Parser: openai

Plain text (one CIDR or individual IP per line):

1.2.3.4/24
5.6.7.8/24
172.16.0.5

Parser: txt (converts individual IPs to /32 or /128 CIDR notation)

GitHub-style (hooks, web, api string arrays):

{"hooks": ["192.30.252.0/22"], "web": ["192.30.252.0/22"], "api": ["192.30.252.0/22"]}

Parser: github

Stripe-style (WEBHOOKS array with individual IPs):

{"WEBHOOKS": ["3.18.12.63", "3.130.192.231", "13.235.14.237"]}

Parser: stripe (converts individual IPs to /32 or /128 CIDR notation)
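The plain-text normalization described above (bare IPs become /32 or /128 prefixes) can be sketched with net/netip. This is an illustrative reimplementation under that assumption, not the library's actual parser:

```go
package main

import (
	"bufio"
	"fmt"
	"net/netip"
	"strings"
)

// parseTxt reads one CIDR or bare IP per line; bare IPs are normalized
// to single-address prefixes (/32 for IPv4, /128 for IPv6).
func parseTxt(body string) ([]netip.Prefix, error) {
	var prefixes []netip.Prefix
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		if p, err := netip.ParsePrefix(line); err == nil {
			prefixes = append(prefixes, p) // already CIDR
			continue
		}
		addr, err := netip.ParseAddr(line)
		if err != nil {
			return nil, fmt.Errorf("bad line %q: %w", line, err)
		}
		// BitLen is 32 for IPv4, 128 for IPv6
		prefixes = append(prefixes, netip.PrefixFrom(addr, addr.BitLen()))
	}
	return prefixes, sc.Err()
}

func main() {
	ps, err := parseTxt("1.2.3.0/24\n172.16.0.5\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(ps) // [1.2.3.0/24 172.16.0.5/32]
}
```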

Step 4: Restart Validator

To apply new bot configurations, restart your application or recreate the Validator:

// Create a new validator with updated bots
v, err := knownbots.New(knownbots.WithRoot("./bots"))
if err != nil {
    log.Fatal(err)
}
defer v.Close()

Step 5: Verify

result := v.Validate(
    "Mozilla/5.0 (compatible; NewBot/1.0; +https://example.com/bot)",
    "1.2.3.4",
)

fmt.Printf("Status: %s\n", result.Status)      // "verified"
fmt.Printf("IsBot: %t\n", result.IsBot)        // true
fmt.Printf("IsVerified: %t\n", result.IsVerified) // true

Example Configurations

Googlebot (official JSON, fast verification):

kind: SearchEngine
name: googlebot
parser: google
ua: "Googlebot"
urls:
  - "https://www.gstatic.com/ipranges/google.json"

Bingbot (official JSON):

kind: SearchEngine
name: bingbot
parser: google
ua: "bingbot"
urls:
  - "https://www.bing.com/toolbox/bingbot.json"

GPTBot (OpenAI uses Google-style JSON):

kind: AiTraining
name: gptbot
parser: google
ua: "GPTBot"
urls:
  - "https://openai.com/gptbot.json"

Applebot (official JSON from developer.apple.com):

kind: SearchEngine
name: applebot
parser: google
ua: "Applebot"
urls:
  - "https://search.developer.apple.com/applebot.json"

GitHub Webhooks:

kind: Tool
name: github
parser: github
ua: "GitHub-Hookshot"
urls:
  - "https://api.github.com/meta"

Stripe Webhooks:

kind: Tool
name: stripe
parser: stripe
ua: "Stripe"
urls:
  - "https://stripe.com/files/ips/ips_webhooks.json"

UptimeRobot (plain text with individual IPs):

kind: Monitoring
name: uptimerobot
parser: txt
ua: "UptimeRobot"
urls:
  - "https://uptimerobot.com/inc/files/ips/IPv4.txt"

Baidu (RDNS only, no official IP list):

kind: SearchEngine
name: baiduspider
ua: "Baiduspider"
domains:
  - "baidu.com"
  - "baidu.jp"
rdns: true

Yandex (RDNS only):

kind: SearchEngine
name: yandexbot
ua: "YandexBot"
domains:
  - "yandex.com"
  - "yandex.ru"
rdns: true

Common Mistakes

| Mistake | Problem | Solution |
| --- | --- | --- |
| Wrong casing | "googlebot" won't match "Googlebot/2.1" | Use exact casing: "Googlebot" |
| Wrong parser | JSON not parsed correctly | Match parser to JSON structure |
| Missing rdns: true | RDNS verification not performed | Add rdns: true for DNS-based bots |
| Empty custom: [] | Unnecessary configuration | Omit empty fields |

Testing New Bot Config

# Run tests to verify bot parsing
go test -v ./...

# Run specific parser test
go test -v -run TestGoogleParser ./parser/

# Validate IP list format
curl -s https://example.com/bot-ips.json | jq '.prefixes[0]'

Contributing

Contributions are welcome! Whether you want to add new bots, fix bugs, or improve documentation.

Ways to Contribute

  1. Add new bot configurations - Most contributions are just YAML files in bots/conf.d/
  2. Fix parser issues - Handle new or different IP list formats
  3. Improve documentation - Fix typos, clarify instructions, add examples
  4. Report bugs - Open issues with minimal reproduction steps
  5. Suggest features - Open discussions about new functionality

Submitting Pull Requests

  1. Fork the repository on GitHub
  2. Create a feature branch: git checkout -b add-newbot
  3. Add your bot configuration to bots/conf.d/newbot.yaml
  4. Test your changes:
    go test -short ./...
    go test -v -run TestNewBot ./parser/
  5. Commit using the Conventional Commits style:
    git commit -m "feat: add NewBot configuration

    - Add NewBot YAML configuration
    - Verify User-Agent matching
    - Test IP parsing with official API"
  6. Push and create a Pull Request

Bot Configuration Guidelines

When adding a new bot configuration:

  1. Verify the User-Agent from official documentation

    • Use exact casing (e.g., "Googlebot", not "googlebot")
    • Check for word boundary matching requirements
  2. Find the official IP list URL

    • Most major bots publish JSON/TXT IP lists
    • Prefer official sources over third-party aggregators
  3. Choose the correct parser

    • Match the parser to the actual JSON structure
    • Test with real API response before submitting
  4. Test thoroughly

    • Run go test -short ./... to verify no regressions
    • Check integration tests pass for new bot if applicable

Code Style

  • Follow standard Go conventions
  • Run go fmt ./... before committing
  • Run go vet ./... to catch potential issues
  • Add tests for new functionality

License

MIT License

Author

Dayi Chen - GitHub

Acknowledgments

  • Inspired by Google's official bot verification documentation
  • Performance patterns influenced by Go stdlib's sync/atomic and net/netip designs
  • Special thanks to all contributors and users providing feedback

⭐ Star this project if you find it useful!

πŸ“ Questions? Open an issue or start a discussion!

πŸ› Found a bug? Please report it with minimal reproduction steps!
