A Go library to fetch data from Archive of Our Own (AO3).

- Current backend: Chrome DevTools automation via `go-rod` + parsing with `goquery`.
- Roadmap backend: pure HTTP scraper using `net/http` + `go-colly` + `goquery` (no headless browser).

The project was renamed from `ao3api-rod` to `ao3api` to support multiple backends.
Early WIP. APIs can change. Use responsibly and respect AO3's Terms of Service and rate limits.
- Go 1.21+
- For the current `go-rod` backend:
  - A local Chrome/Chromium or a remote browser reachable via the DevTools protocol
  - Optionally, an exported cookies file (parsed via `CookieMonster`)
```bash
go get github.com/capoverflow/ao3api
```
Initialize the browser, log in (via cookies or credentials), navigate, and scrape.
```go
package main

import (
	"log"

	"github.com/capoverflow/ao3api/internals/author"
	"github.com/capoverflow/ao3api/internals/base"
	"github.com/capoverflow/ao3api/internals/fanfic"
	"github.com/capoverflow/ao3api/internals/models"
)

func main() {
	cfg := models.RodConfig{
		Headless: true,
		// If you have a remote Chrome endpoint:
		// RemoteURL: "ws://127.0.0.1:9222/devtools/browser/<id>",
		Login: models.Login{
			// Choose ONE login method:
			// 1) Use exported cookies (JSON/SQLite supported by CookieMonster)
			CookiesPath: "/path/to/cookies",
			// 2) Or username/password
			// Username: "user",
			// Password: "pass",
		},
	}

	base.Init(cfg)
	page := base.Page

	// Scrape a work page
	page.MustNavigate("https://archiveofourown.org/works/<WORK_ID>").MustWaitLoad()
	work := fanfic.GetFanfic(page)
	work, _ = fanfic.GetFanficChapters(work, page)
	log.Printf("Work: %+v\n", work)

	// Scrape an author's dashboard
	a := models.Author{AuthorParams: models.AuthorParams{Author: "<AO3_USERNAME>"}}
	a = author.GetAuthorDashboard(a, page)
	log.Printf("Author: %+v\n", a)
}
```
- Cookies file: set `Login.CookiesPath` (parsed to DevTools cookies via `utils.ConvertHTTPCookieToRodCookie`).
- Username/password: set `Login.Username` and `Login.Password`.
- Fanfic metadata from a work page: title, authors, dates, language, words, chapter count, stats, tags, download links.
- Chapters list from a work: chapter IDs, names, dates.
- Author dashboard: pseuds, fandoms (with counts), works list with metadata and tags.
- Cookie helpers to convert between `net/http` and `rod` cookie types.
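
For a sense of what that conversion involves, here is a minimal sketch mapping `net/http` cookies onto go-rod's `proto.NetworkCookieParam`. It is illustrative only and is not the library's `utils.ConvertHTTPCookieToRodCookie` implementation; only the common fields are copied.

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/go-rod/rod/lib/proto"
)

// toRodCookies copies the common fields of net/http cookies into the DevTools
// cookie parameters that go-rod accepts. Sketch only; field coverage is partial.
func toRodCookies(in []*http.Cookie, fallbackDomain string) []*proto.NetworkCookieParam {
	out := make([]*proto.NetworkCookieParam, 0, len(in))
	for _, c := range in {
		domain := c.Domain
		if domain == "" {
			domain = fallbackDomain // exported cookies often omit the domain
		}
		out = append(out, &proto.NetworkCookieParam{
			Name:     c.Name,
			Value:    c.Value,
			Domain:   domain,
			Path:     c.Path,
			Secure:   c.Secure,
			HTTPOnly: c.HttpOnly,
		})
	}
	return out
}

func main() {
	cookies := []*http.Cookie{{Name: "session", Value: "<value>", Path: "/"}}
	fmt.Printf("%+v\n", toRodCookies(cookies, "archiveofourown.org")[0])
}
```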
APIs are exposed under:

- `internals/base`: `Init(models.RodConfig)` initializes the browser and session.
- `internals/auth`: login via cookies or credentials; utilities to save cookies.
- `internals/fanfic`: `GetFanfic`, `GetFanficChapters` (comments WIP).
- `internals/author`: `GetAuthorDashboard`.
- `internals/utils`: helpers and cookie conversion.
- `internals/models`: data structures (`Work`, `Chapter`, `Author`, etc.).
- Define a unified `Client` interface (e.g., `WorkByID`, `AuthorDashboard`, `Chapters`, `Comments`, `Search`/`Browse`); sketched below.
- Implement selectable backends:
  - RodBackend: the current `go-rod` + `goquery` implementation for JS-required flows.
  - HTTPBackend: `go-colly` or pure `net/http` + `goquery` for faster, headless-free scraping.
- Seamless backend selection via a config flag (`Backend=rod|http`) with optional auto-fallback (`rod` → `http`) when JS isn't needed.
- Shared session layer: cookie jar + login abstraction reused across backends.
- Pluggable middlewares: rate limiting, retries, proxy rotation, custom User-Agent.
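
As a rough illustration of such a unified interface, here is a sketch that reuses the existing `internals/models` types. The method signatures, and the `models.Comment` type, are assumptions rather than a committed API.

```go
// Package ao3 is the planned top-level package; everything here is a sketch.
package ao3

import (
	"context"

	"github.com/capoverflow/ao3api/internals/models"
)

// Client is an illustrative version of the unified interface from the roadmap.
// Both the rod backend and the planned HTTP backend would implement it.
type Client interface {
	// WorkByID fetches a work's metadata by its numeric AO3 ID.
	WorkByID(ctx context.Context, id string) (models.Work, error)
	// Chapters lists the chapters of a work.
	Chapters(ctx context.Context, workID string) ([]models.Chapter, error)
	// AuthorDashboard fetches an author's pseuds, fandoms, and works.
	AuthorDashboard(ctx context.Context, username string) (models.Author, error)
	// Comments pages through a work's comments (assumes a models.Comment type
	// that does not exist yet).
	Comments(ctx context.Context, workID string, page int) ([]models.Comment, error)
	// Search covers the Search/Browse listings.
	Search(ctx context.Context, query string) ([]models.Work, error)
	// Close shuts down the rod browser or releases HTTP resources.
	Close(ctx context.Context) error
}
```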
- Provide a top-level constructor and options API: `ao3.NewClient(ctx, opts ...Option) (Client, error)`; sketched below.
- Functional options: `WithBackend`, `WithLogin`, `WithCookiesPath`, `WithRemoteURL`, `WithHeadless`, `WithHTTPTransport`, `WithProxy`, `WithRateLimit`, `WithRetry`, `WithUserAgent`.
- Config model (illustrative):
  - `Backend` selection (rod/http).
  - `Rod` settings: `RemoteURL`, `Headless`.
  - `HTTP` settings: base URL, transport, cookie jar.
  - `Login`: `Username`, `Password`, `CookiesPath`, or preloaded cookies.
  - `RateLimit`: requests-per-second, burst, jitter.
  - `Retry`: max attempts, backoff strategy.
  - `Proxy`: static or rotating proxy list.
  - `UserAgent`: override UA string.
- Lifecycle methods: `Client.Close(ctx)` to cleanly shut down the rod browser or flush HTTP resources.
- Session persistence: optional cookie save/load between runs.
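
Here is a minimal sketch of how the constructor, a subset of the functional options, and the illustrative config could fit together, building on the `Client` sketch above. Field names, defaults, and the fallback error are assumptions, not the final API.

```go
package ao3

import (
	"context"
	"errors"
)

// Backend selects which implementation NewClient wires up.
type Backend string

const (
	BackendRod  Backend = "rod"
	BackendHTTP Backend = "http"
)

// Config is an illustrative shape for the planned configuration.
type Config struct {
	Backend     Backend
	RemoteURL   string  // rod: remote DevTools endpoint
	Headless    bool    // rod: run Chrome headless
	CookiesPath string  // shared session: exported cookies file
	Username    string
	Password    string
	UserAgent   string
	RateLimit   float64 // requests per second
}

// Option mutates the Config before the client is built.
type Option func(*Config)

func WithBackend(b Backend) Option     { return func(c *Config) { c.Backend = b } }
func WithHeadless(h bool) Option       { return func(c *Config) { c.Headless = h } }
func WithCookiesPath(p string) Option  { return func(c *Config) { c.CookiesPath = p } }
func WithRateLimit(rps float64) Option { return func(c *Config) { c.RateLimit = rps } }
func WithUserAgent(ua string) Option   { return func(c *Config) { c.UserAgent = ua } }

// NewClient applies the options over defaults and would return the selected
// backend; construction of the concrete backend is elided in this sketch.
func NewClient(ctx context.Context, opts ...Option) (Client, error) {
	cfg := Config{Backend: BackendRod, Headless: true, RateLimit: 1}
	for _, opt := range opts {
		opt(&cfg)
	}
	_ = cfg // a rod- or HTTP-backed Client would be built from cfg here
	return nil, errors.New("sketch only: backend construction not implemented")
}
```

Usage would then look like `c, err := ao3.NewClient(ctx, ao3.WithBackend(ao3.BackendHTTP), ao3.WithRateLimit(0.5))`, with the remaining options following the same pattern.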
- Queue-based crawling with dedupe and politeness controls:
  - Seeds: works, authors, tags, series, bookmarks.
  - Controls: max depth, max pages, allowed paths, per-host RPS, jitter, concurrency.
- Robust retry/backoff for 429/5xx; respect AO3 rate limits.
- URL and WorkID-level deduplication; checkpointing and resume.
- Fetcher/Parser separation so both backends can feed the crawler.
- Outputs via pluggable sinks: channel callbacks, JSONL writer, or a user-provided interface (see the sketch below).
- Incremental parsing for large author libraries; batched writes.
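
To make the sink and politeness ideas concrete, here is a small hypothetical sketch; none of these names exist in the codebase yet.

```go
// Package crawl is a hypothetical home for the planned crawler.
package crawl

import (
	"context"
	"math/rand"
	"time"

	"github.com/capoverflow/ao3api/internals/models"
)

// Sink receives parsed works as the crawler produces them; a JSONL writer,
// a channel adapter, or any user-provided type can implement it.
type Sink interface {
	Write(ctx context.Context, w models.Work) error
	Flush(ctx context.Context) error
}

// Controls captures the politeness settings listed in the roadmap.
type Controls struct {
	MaxDepth    int
	MaxPages    int
	PerHostRPS  float64       // steady request rate per host
	Jitter      time.Duration // random extra delay per request
	Concurrency int
}

// Wait sleeps long enough to respect PerHostRPS plus a random jitter, so
// consecutive requests never hit AO3 back to back.
func (c Controls) Wait() {
	if c.PerHostRPS <= 0 {
		c.PerHostRPS = 1
	}
	delay := time.Duration(float64(time.Second) / c.PerHostRPS)
	if c.Jitter > 0 {
		delay += time.Duration(rand.Int63n(int64(c.Jitter)))
	}
	time.Sleep(delay)
}
```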
- M1: Define the `Client` interface and shared models; stabilize selectors.
- M2: Implement the HTTP backend for read-only endpoints (works, chapters, author dashboards); see the sketch below.
- M3: Introduce `ao3.NewClient` + functional options; unify login/session handling.
- M4: Crawler MVP with seeds, dedupe, politeness, and pluggable sinks.
- M5: Comments pagination + parsing improvements across backends.
- M6: Docs, examples, and CI tests for both backends.
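
To give a feel for what the M2 HTTP backend involves, here is a minimal sketch that fetches a work page with `net/http` and extracts the title with `goquery`. The CSS selector and User-Agent string are assumptions, and AO3's markup may change.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Substitute a real numeric work ID for <WORK_ID>.
	req, err := http.NewRequest("GET", "https://archiveofourown.org/works/<WORK_ID>", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Identify the client; AO3 expects polite, low-volume scraping.
	req.Header.Set("User-Agent", "ao3api example")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("unexpected status: %s", resp.Status)
	}

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// "h2.title.heading" is an assumption about the current work-page markup.
	title := strings.TrimSpace(doc.Find("h2.title.heading").First().Text())
	fmt.Println("Title:", title)
}
```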
- Be mindful of scraping etiquette: add randomized delays (`utils.RandSleep`), avoid heavy concurrent requests, and cache locally.
- AO3 content and DOM can change; selectors may need updates over time.
TBD