KnownBots is a high-performance Go library for verifying search engine crawlers and identifying legitimate bots. It protects your web services from bot impersonation by validating User-Agent strings and IP addresses through RDNS lookups and IP range verification.
The Problem: Malicious actors can easily spoof User-Agent strings to impersonate legitimate search engine bots (Googlebot, Bingbot, etc.) to bypass rate limits, scrape content, or exploit bot-specific logic.
The Solution: KnownBots performs multi-layer verification by:
- Matching User-Agent markers (case-sensitive word boundaries)
- Verifying IP ownership through reverse DNS lookups or official IP ranges
- Caching results to avoid expensive DNS queries on subsequent requests
- Lock-free reads via `atomic.Pointer[T]` for bot configuration and RDNS cache
- Zero-allocation hot paths using `netip.Prefix` for IP matching
- Byte-level indexing for O(1) bot lookup (150-300ns for 40 bots vs 640ns linear scan)
- Copy-on-Write caching optimized for read-heavy workloads (1-20 writes/day)
- Embedded bots - 57 built-in configs compiled into binary (no file I/O at startup)
- Optional UA classification - Disabled by default for maximum performance
- Logging control - Disable log output via `knownbots.EnableLog = false`
- Case-sensitive matching prevents forgery attempts (official bots use fixed casing)
- Word boundary validation prevents partial matches (e.g., "MyGooglebot" won't match)
- LRU fail cache for fast rejection of known-bad IPs (1000 entry limit)
- Browser detection distinguishes legitimate users from suspicious bot-like patterns (opt-in)
- Persistent RDNS cache survives restarts (file-based storage)
- Background scheduler automatically refreshes IP ranges from official URLs
- Graceful degradation (cache persistence failures don't affect runtime)
- Comprehensive tests with benchmarks for 3-40 bot scenarios
- YAML-based configuration for easy bot additions (no code changes)
- Pluggable verification supports both IP ranges and RDNS verification
- Official source integration automatically downloads and updates IP lists
go get github.com/cnlangzi/knownbots

Requirements: Go 1.21+
package main
import (
"fmt"
"log"
"github.com/cnlangzi/knownbots"
)
func main() {
// Initialize validator (starts background scheduler)
v, err := knownbots.New()
if err != nil {
log.Fatal(err)
}
defer v.Close()
// Verify a bot claim
result := v.Validate(
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
"66.249.66.1",
)
fmt.Printf("Status: %s\n", result.Status) // "verified"
fmt.Printf("IsBot: %t\n", result.IsBot) // true
fmt.Printf("IsVerified: %t\n", result.IsVerified) // true
fmt.Printf("Bot Name: %s\n", result.Name) // "googlebot"
}

func BotVerificationMiddleware(v *knownbots.Validator) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
ua := r.Header.Get("User-Agent")
ip, _, _ := net.SplitHostPort(r.RemoteAddr) // r.RemoteAddr is "host:port"; in production, extract the client IP from X-Forwarded-For
result := v.Validate(ua, ip)
// Block fake bots (claims to be bot but IP not verified)
if result.IsBot && !result.IsVerified {
http.Error(w, "Forbidden: Bot verification failed", http.StatusForbidden)
return
}
// Add verification metadata to request context
ctx := context.WithValue(r.Context(), "botVerified", result)
next.ServeHTTP(w, r.WithContext(ctx))
})
}
}

v, err := knownbots.New(
knownbots.WithRoot("./custom-bots"), // Custom bot config directory
knownbots.WithFailLimit(5000), // Failed lookup cache size
knownbots.WithClassifyUA(), // Enable UA classification (disabled by default)
)
// Disable logging to reduce console pollution (e.g., in benchmarks)
knownbots.EnableLog = false

bots/
├── conf.d/              # Bot configurations (YAML)
│   ├── googlebot.yaml
│   ├── bingbot.yaml
│   └── ...
├── googlebot/           # Bot-specific data (auto-created)
│   ├── rdns.txt         # Persistent RDNS cache
│   └── ips.txt          # Downloaded IP ranges
└── ...
name: googlebot
ua: "Googlebot" # EXACT casing required (case-sensitive)
urls: # Official IP list URLs (auto-downloaded)
- "https://www.gstatic.com/ipranges/google.json"
custom: # Static CIDR ranges (always checked)
- "66.249.64.0/19"
asn: # ASN numbers for verification (optional)
- 15169
domains: # Verified RDNS domains
- "googlebot.com"
- "google.com"
rdns: true             # Enable RDNS verification (false = IP-only)

Important:
- User-Agent markers (`ua`) are case-sensitive. Official bots use fixed casing (e.g., "Googlebot", never "googlebot"). This prevents forgery attempts where attackers alter casing to bypass detection.
- Set `rdns: false` for bots that only need IP range verification (faster, no DNS queries).
- ASN verification is optional and provides faster IP ownership verification (~35ns) compared to RDNS (~450ns) for bots with official ASN registrations.
Choose the correct parser based on the IP list format:
| Format | JSON Example | Parser |
|---|---|---|
| Google-style | `{"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}]}` | `google` |
| OpenAI-style | `{"prefixes": [{"prefix": "1.2.3.4/24"}]}` | `openai` |
| Plain text | `1.2.3.4/24` or `172.16.0.5` | `txt` |
| GitHub-style | `{"hooks": ["1.2.3.4/24"], "web": [...]}` | `github` |
| Stripe-style | `{"WEBHOOKS": ["3.18.12.63"]}` | `stripe` |
- Case-sensitive: Use exact casing from official documentation
  - ✅ Correct: `ua: "Googlebot"` or `ua: "bingbot"`
  - ❌ Wrong: `ua: "googlebot"` or `ua: "BINGBOT"`
- Match type: Word boundary matching, not substring matching (see the sketch below)
  - `ua: "Googlebot"` matches: `Googlebot/2.1`, `Mozilla/5.0 (compatible; Googlebot/2.1; ...)`
  - `ua: "Googlebot"` does NOT match: `MyGooglebot`, `GooglebotPro`
- Special bots: Some bots don't use the Mozilla prefix
  - `ua: "GPTBot"` (OpenAI)
  - `ua: "curl"` (CLI tool)
┌───────────────────────────────────────────────────────────────┐
│                       Incoming Request                        │
│                   (User-Agent + IP Address)                   │
└──────────────────┬────────────────────────────────────────────┘
                   │
                   ▼
         ┌────────────────────┐
         │  UA Matches Bot?   │──No──▶ Classify UA Type
         └─────────┬──────────┘        (Browser/Suspicious/Unknown)
                   │ Yes                       │
                   ▼                           ▼
         ┌────────────────────┐        Return: IsBot=false
         │  Check IP Ranges   │        (legitimate browser)
         │  (CIDR matching)   │
         └─────────┬──────────┘
                   │
                   ├─ Hit ──▶ Return: verified
                   │
                   ├─ Miss + asn empty ──▶ Check RDNS
                   │
                   ├─ Miss + asn defined ──▶ Check ASN
                   │                           │
                   │                           ├─ Hit ──▶ Return: verified
                   │                           │
                   │                           └─ Miss ──▶ Check RDNS
                   │
                   ▼
         ┌────────────────────┐
         │   Bot.RDNS=true?   │──No──▶ Return: failed
         └─────────┬──────────┘        (IP-only bot, no DNS check)
                   │ Yes
                   ▼
         ┌────────────────────┐
         │  Check Fail Cache  │──Hit──▶ Return: failed
         │  (LRU, 1000 IPs)   │         (known fake bot)
         └─────────┬──────────┘
                   │ Miss
                   ▼
         ┌────────────────────┐
         │  Check RDNS Cache  │──Hit──▶ Domain match?
         │   (persistent)     │          Yes: verified
         └─────────┬──────────┘          No: failed
                   │ Miss
                   ▼
         ┌────────────────────┐
         │ Perform RDNS Lookup│──▶ Domain match?
         │ (50-200ms delay)   │    Yes: verified + cache
         └────────────────────┘    No: failed + fail cache
┌───────────────────────────────────────────────────────────────┐
│                     Background Scheduler                      │
└──────────────────┬────────────────────────────────────────────┘
                   │
         ┌─────────┴─────────┬─────────────────┐
         │                   │                 │
         ▼                   ▼                 ▼
   ┌──────────┐      ┌──────────────┐   ┌──────────────┐
   │ Refresh  │      │ Update ASN   │   │ Prune & Save │
   │ IP Lists │      │ Data         │   │ RDNS Cache   │
   │ (HTTP)   │      │ (RIPE API)   │   │ (rdns=true)  │
   └──────────┘      └──────────────┘   └──────────────┘
         │                   │                 │
         ▼                   ▼                 ▼
   Update memory       Update cache      Remove invalid
   Persist to file     Persist to file   Persist to file
   (per-bot dir)
| Operation | Time/op | Allocs/op | Notes |
|---|---|---|---|
| UA matching (hit first) | 165ns | 0 | Byte index + word boundary check |
| UA matching (hit middle) | 300ns | 0 | Worst case: mid-list match |
| UA matching (miss) | 640ns | 0 | Full scan + browser classification |
| Validate (IP range hit) | 227ns | 0 | Radix tree CIDR matching |
| Validate (ASN hit) | 35ns | 1 | O(1) Patricia tree lookup |
| Validate (RDNS hit) | 450ns | 0 | Cache lookup + domain match |
| Validate (cold lookup) | 50-200ms | 1-2 | DNS query (first time only) |
Key Insight: Verification priority is IP ranges β ASN β RDNS. ASN verification (~35ns) is faster than RDNS cache lookup (~450ns) and ideal for bots with official ASN registrations.
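As a rough, self-contained sketch of that priority chain (all types and lookups here are stand-ins, not the knownbots API):

```go
package main

import (
	"fmt"
	"net/netip"
)

type bot struct {
	prefixes []netip.Prefix        // CIDR ranges (official lists + custom)
	asnHit   func(netip.Addr) bool // stand-in for the O(1) ASN table lookup
	rdns     bool                  // fall back to reverse DNS?
}

func (b *bot) verify(addr netip.Addr, rdnsHit func(netip.Addr) bool) string {
	// 1. CIDR ranges first: the cheapest zero-allocation path.
	for _, p := range b.prefixes {
		if p.Contains(addr) {
			return "verified"
		}
	}
	// 2. ASN next, but only when the bot config defines ASNs.
	if b.asnHit != nil && b.asnHit(addr) {
		return "verified"
	}
	// 3. RDNS last: only if enabled, since a cold lookup costs 50-200ms.
	if !b.rdns {
		return "failed"
	}
	if rdnsHit(addr) {
		return "verified"
	}
	return "failed"
}

func main() {
	b := &bot{
		prefixes: []netip.Prefix{netip.MustParsePrefix("66.249.64.0/19")},
		rdns:     true,
	}
	addr := netip.MustParseAddr("66.249.66.1")
	fmt.Println(b.verify(addr, func(netip.Addr) bool { return false })) // verified
}
```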
| Bot Count | Index Benefit | Recommended Index |
|---|---|---|
| < 20 bots | Minimal (2x) | Single byte (current) |
| 20-50 bots | Significant (4-5x) | Single byte (current) |
| > 50 bots | Critical (10x+) | Consider 3-char prefix |
Current implementation is optimized for 3-50 bots (covers 99% of use cases).
type Validator struct { /* ... */ }
type Result struct {
Name string // Bot name (e.g., "googlebot")
Status ResultStatus // "verified" | "failed" | "unknown"
IsBot bool // True if UA matches any bot or looks bot-like
IsVerified bool // True if IP ownership verified
}
type ResultStatus string
const (
StatusVerified ResultStatus = "verified" // Bot confirmed (UA + IP match)
StatusFailed ResultStatus = "failed" // Bot suspected but IP invalid
StatusUnknown ResultStatus = "unknown" // Not a known bot
)

// New creates a validator with background scheduler
func New(opts ...Option) (*Validator, error)
// Validate verifies User-Agent and IP address
func (v *Validator) Validate(ua, ip string) Result
// Close stops background scheduler
func (v *Validator) Close() error

// WithRoot sets custom bot directory (default: "./bots")
func WithRoot(dir string) Option
// WithFailLimit sets failed lookup cache size (default: 1000)
func WithFailLimit(limit int) Option

// Apply different rate limits for verified bots vs browsers
result := validator.Validate(ua, ip)
if result.IsVerified {
limiter = rateLimits.Bot // Generous: 10/sec
} else if result.IsBot {
limiter = rateLimits.FakeBot // Strict: 1/min
} else {
limiter = rateLimits.Browser // Normal: 5/sec
}

// Exclude verified bots from user analytics
result := validator.Validate(ua, ip)
if !result.IsBot || !result.IsVerified {
analytics.Track(userID, event)
}

// Allow verified Googlebot to bypass feature flags
result := validator.Validate(ua, ip)
if result.Name == "googlebot" && result.IsVerified {
features.EnableAll() // Show production content for indexing
}

// Block fake bots from scraping paywalled content
result := validator.Validate(ua, ip)
if result.IsBot && !result.IsVerified {
return http.StatusForbidden // Suspected scraper
}

Current built-in configurations:
- Googlebot (Google Search)
- Bingbot (Microsoft Bing)
- facebookexternalhit (Facebook/Meta link previews)
- GPTBot (OpenAI)
- Applebot (Apple Search and Siri)
- GitHub (GitHub webhooks)
- Stripe (Stripe webhooks)
- UptimeRobot (Uptime monitoring)
Need more bots? Add YAML configs to bots/conf.d/ - no code changes required!
Common bots to add:
- Yandex (YandexBot)
- Baidu (Baiduspider)
- DuckDuckGo (DuckDuckBot)
- Twitter (Twitterbot)
- Slack (Slackbot)
See bots/conf.d/googlebot.yaml for configuration examples.
# Run all tests
go test ./...
# Run only unit tests (skip integration tests)
go test -short ./...
# Run benchmarks
go test -bench=. -benchmem
# Run specific test
go test -v -run ^TestValidator$
# Coverage report
go test -cover ./...

Integration Tests: The project includes integration tests that verify parsing of real API responses from:
- Googlebot: 307 prefixes
- Bingbot: 28 prefixes
- GPTBot: 21 prefixes
- GitHub: 50 prefixes
- Stripe: 12 IPs
- UptimeRobot: 116 prefixes
- Applebot: 12 prefixes
Bot configurations change rarely (on reload/schedule, 1-20x/day) but are read on every request (1000s/sec). atomic.Pointer[T] provides:
- Lock-free reads - single atomic load, no lock acquisition overhead
- Readers never block - writes don't wait for readers, readers don't wait for writes (Copy-on-Write)
- Consistent performance - no priority inversion or cache line contention from lock operations
Consistent sub-microsecond performance for read-heavy workloads.
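A minimal sketch of the pattern (illustrative names, not the library's internals):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type botConfig struct {
	markers []string
}

type registry struct {
	cfg atomic.Pointer[botConfig]
}

// load is the hot path: a single atomic load, no locks, no allocation.
func (r *registry) load() *botConfig {
	return r.cfg.Load()
}

// replace is the rare write path: build a fresh immutable snapshot,
// then publish it with one atomic store. Readers holding the old
// pointer keep using it safely until they finish.
func (r *registry) replace(markers []string) {
	next := &botConfig{markers: append([]string(nil), markers...)}
	r.cfg.Store(next)
}

func main() {
	r := &registry{}
	r.replace([]string{"Googlebot", "bingbot"})
	fmt.Println(r.load().markers) // [Googlebot bingbot]
}
```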
Official bots use fixed casing ("Googlebot", never "googlebot"). Case variations indicate forgery. Case-sensitive matching:
- Rejects fakes at first stage (no expensive DNS queries)
- 4x faster than case-insensitive (16ns vs 67ns)
- Improves both security and performance
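A toy benchmark (not the project's benchmark suite, where the 16ns/67ns figures come from) shows where the gap originates: the case-insensitive path must fold every byte, and the naive version below also allocates:

```go
package knownbots_test

import (
	"strings"
	"testing"
)

const ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

func BenchmarkCaseSensitive(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = strings.Contains(ua, "Googlebot") // byte equality only
	}
}

func BenchmarkCaseInsensitive(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = strings.Contains(strings.ToLower(ua), "googlebot") // folds and allocates
	}
}
```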
RDNS cache sees 1-20 new IPs per day but 1000s of reads per second (99.99% read ratio). Copy-on-Write with atomic swap provides:
- Zero-allocation reads (no locking)
- Safe concurrent access
- Simple implementation (vs lock-free data structures)
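A sketch of the idea, assuming a simple ip-to-hostname map (field and type names are illustrative, not the library's):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type rdnsCache struct {
	entries atomic.Pointer[map[string]string] // ip -> verified hostname
	mu      sync.Mutex                        // serializes the rare writers
}

func newRDNSCache() *rdnsCache {
	c := &rdnsCache{}
	empty := map[string]string{}
	c.entries.Store(&empty)
	return c
}

// get is the hot path: one atomic load plus a map read, no locks.
func (c *rdnsCache) get(ip string) (string, bool) {
	host, ok := (*c.entries.Load())[ip]
	return host, ok
}

// put copies the current map, adds the entry, and swaps the pointer.
// With 1-20 writes/day, the O(n) copy cost is negligible.
func (c *rdnsCache) put(ip, host string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := *c.entries.Load()
	next := make(map[string]string, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[ip] = host
	c.entries.Store(&next)
}

func main() {
	c := newRDNSCache()
	c.put("66.249.66.1", "crawl-66-249-66-1.googlebot.com")
	fmt.Println(c.get("66.249.66.1"))
}
```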
Linear bot list scan is fast for 3 bots (52ns) but degrades to 640ns at 40 bots. Single-character index provides 4-5x speedup for 20-50 bots at minimal memory cost (<1KB).
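The index can be pictured as grouping markers by first byte, so each UA position is only compared against plausible candidates. A simplified model (the real layout and word-boundary handling differ):

```go
package main

import (
	"fmt"
	"strings"
)

type botIndex struct {
	byFirst [256][]string // markers grouped by their first byte
}

func newBotIndex(markers []string) *botIndex {
	idx := &botIndex{}
	for _, m := range markers {
		if m != "" {
			idx.byFirst[m[0]] = append(idx.byFirst[m[0]], m)
		}
	}
	return idx
}

// lookup returns the first marker found in ua, scanning each byte once
// and only comparing candidates that share the current first byte.
func (idx *botIndex) lookup(ua string) (string, bool) {
	for i := 0; i < len(ua); i++ {
		for _, m := range idx.byFirst[ua[i]] {
			if strings.HasPrefix(ua[i:], m) {
				return m, true
			}
		}
	}
	return "", false
}

func main() {
	idx := newBotIndex([]string{"Googlebot", "GPTBot", "bingbot"})
	fmt.Println(idx.lookup("Mozilla/5.0 (compatible; bingbot/2.0)")) // bingbot true
}
```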
IP and ASN lifecycle operations (load, refresh, persist) are shared between initialization and the background scheduler. Encapsulating these as Bot methods:
- Eliminates duplicate code - `initBot` and `runScheduler` both call the same `loadCachedIPs`, `refreshIPs`, `initializeASN`, and `refreshASN` methods
- Centralizes state - IPTree and ASN cache pointers live on the `Bot` struct, making ownership clear
- Improves testability - Each lifecycle method can be unit tested in isolation
- Enables future extensions - New verification methods (e.g., BGP feeds) can follow the same pattern
Example Bot methods:
func (b *Bot) loadCachedIPs(path string) // Load cached prefixes from file
func (b *Bot) refreshIPs(http *http.Client, root string) // Download and persist new prefixes
func (b *Bot) initializeASN(store *asn.Store) // Load ASN cache with fallback to API
func (b *Bot) refreshASN(store *asn.Store)              // Refresh ASN prefixes from API

Adding a new bot requires no code changes - just create a YAML configuration file.
| Method | When to Use | Example |
|---|---|---|
| URL + Parser | Bot has official JSON/TXT IP list | Googlebot, Bingbot, GPTBot |
| ASN | Bot has official ASN registration | Cloudflare (AS13335), Google (AS15169) |
| RDNS Only | No official IP list, verify via DNS | Baidu, Yandex |
Create bots/conf.d/newbot.yaml:
# Case 1: Bot with official JSON IP list (RECOMMENDED)
kind: SearchEngine # Category: SearchEngine, SocialMedia, Tool, etc.
name: newbot # Unique identifier (used in results)
parser: google # Parser: google, openai, txt, github, stripe
ua: "NewBot" # User-Agent fragment (case-sensitive!)
urls:
- "https://example.com/bot-ips.json"
# Case 2: Bot with ASN verification (fastest option)
kind: SearchEngine
name: newbot
ua: "NewBot"
asn:
- 12345 # ASN number (fetched from RIPE API)
# Case 3: Bot with RDNS verification only (no official IP list)
kind: SearchEngine
name: newbot
ua: "NewBot"
domains:
- "newbot.example.com"
rdns: true

Choose the correct parser based on the IP list format:
Google-style (ipv4Prefix/ipv6Prefix fields):
{"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}, {"ipv6Prefix": "2001:db8::/32"}]}Parser: google
OpenAI-style (prefix field):
{"prefixes": [{"prefix": "1.2.3.4/24"}]}Parser: openai
Plain text (one CIDR or individual IP per line):
1.2.3.4/24
5.6.7.8/24
172.16.0.5
Parser: `txt` (converts individual IPs to /32 or /128 CIDR notation)
GitHub-style (hooks, web, api string arrays):
{"hooks": ["192.30.252.0/22"], "web": ["192.30.252.0/22"], "api": ["192.30.252.0/22"]}Parser: github
Stripe-style (WEBHOOKS array with individual IPs):
{"WEBHOOKS": ["3.18.12.63", "3.130.192.231", "13.235.14.237"]}Parser: stripe (converts individual IPs to /32 or /128 CIDR notation)
To apply new bot configurations, restart your application or recreate the Validator:
// Create a new validator with updated bots
v, err := knownbots.New(knownbots.WithRoot("./bots"))
if err != nil {
log.Fatal(err)
}
defer v.Close()

result := v.Validate(
"Mozilla/5.0 (compatible; NewBot/1.0; +https://example.com/bot)",
"1.2.3.4",
)
fmt.Printf("Status: %s\n", result.Status) // "verified"
fmt.Printf("IsBot: %t\n", result.IsBot) // true
fmt.Printf("IsVerified: %t\n", result.IsVerified) // trueGooglebot (official JSON, fast verification):
kind: SearchEngine
name: googlebot
parser: google
ua: "Googlebot"
urls:
- "https://www.gstatic.com/ipranges/google.json"Bingbot (official JSON):
kind: SearchEngine
name: bingbot
parser: google
ua: "bingbot"
urls:
- "https://www.bing.com/toolbox/bingbot.json"GPTBot (OpenAI uses Google-style JSON):
kind: AiTraining
name: gptbot
parser: google
ua: "GPTBot"
urls:
- "https://openai.com/gptbot.json"Applebot (official JSON from developer.apple.com):
kind: SearchEngine
name: applebot
parser: google
ua: "Applebot"
urls:
- "https://search.developer.apple.com/applebot.json"GitHub Webhooks:
kind: Tool
name: github
parser: github
ua: "GitHub-Hookshot"
urls:
- "https://api.github.com/meta"Stripe Webhooks:
kind: Tool
name: stripe
parser: stripe
ua: "Stripe"
urls:
- "https://stripe.com/files/ips/ips_webhooks.json"UptimeRobot (plain text with individual IPs):
kind: Monitoring
name: uptimerobot
parser: txt
ua: "UptimeRobot"
urls:
- "https://uptimerobot.com/inc/files/ips/IPv4.txt"Baidu (RDNS only, no official IP list):
kind: SearchEngine
name: baiduspider
ua: "Baiduspider"
domains:
- "baidu.com"
- "baidu.jp"
rdns: true

Yandex (RDNS only):
kind: SearchEngine
name: yandexbot
ua: "YandexBot"
domains:
- "yandex.com"
- "yandex.ru"
rdns: true

| Mistake | Problem | Solution |
|---|---|---|
| Wrong casing | `"googlebot"` won't match `"Googlebot/2.1"` | Use exact casing: `"Googlebot"` |
| Wrong parser | JSON not parsed correctly | Match parser to JSON structure |
| Missing `rdns: true` | RDNS verification not performed | Add `rdns: true` for DNS-based bots |
| Empty `custom: []` | Unnecessary configuration | Omit empty fields |
# Run tests to verify bot parsing
go test -v ./...
# Run specific parser test
go test -v -run TestGoogleParser ./parser/
# Validate IP list format
curl -s https://example.com/bot-ips.json | jq '.prefixes[0]'

Contributions are welcome - whether you want to add new bots, fix bugs, or improve documentation.
- Add new bot configurations - Most contributions are just YAML files in `bots/conf.d/`
- Fix parser issues - Handle new or different IP list formats
- Improve documentation - Fix typos, clarify instructions, add examples
- Report bugs - Open issues with minimal reproduction steps
- Suggest features - Open discussions about new functionality
- Fork the repository on GitHub
- Create a feature branch: `git checkout -b add-newbot`
- Add your bot configuration to `bots/conf.d/newbot.yaml`
- Test your changes: `go test -short ./...` and `go test -v -run TestNewBot ./parser/`
- Commit using Conventional Commits style: `git commit -m "feat: add NewBot configuration"`, describing the YAML addition, User-Agent verification, and IP parsing tests in the body
- Push and create a Pull Request
When adding a new bot configuration:
- Verify the User-Agent from official documentation
  - Use exact casing (e.g., "Googlebot", not "googlebot")
  - Check for word boundary matching requirements
- Find the official IP list URL
  - Most major bots publish JSON/TXT IP lists
  - Prefer official sources over third-party aggregators
- Choose the correct parser
  - Match the parser to the actual JSON structure
  - Test with a real API response before submitting
- Test thoroughly
  - Run `go test -short ./...` to verify no regressions
  - Check that integration tests pass for the new bot, if applicable
- Follow standard Go conventions
- Run `go fmt ./...` before committing
- Run `go vet ./...` to catch potential issues
- Add tests for new functionality
Dayi Chen - GitHub
- Inspired by Google's official bot verification documentation
- Performance patterns influenced by Go stdlib's `sync/atomic` and `net/netip` designs
- Special thanks to all contributors and users providing feedback
Star this project if you find it useful!
Questions? Open an issue or start a discussion!
Found a bug? Please report it with minimal reproduction steps!