IRI extraction, normalization, and clustering.
Iriq pulls IRIs out of free text, parses them, normalizes them into canonical shape-aware forms, classifies their path and query components, and clusters similar identifiers — surfacing what's stable vs. unique.
Ships as both a command-line tool (iriq) and a library (Ruby and
Go — same behavior, enforced by parity tests).
The CLI is available three ways. Pick whichever fits your workflow:
# Homebrew (recommended)
brew install dpep/tools/iriq
# RubyGems — installs the CLI shim and the library
gem install iriq
# Go — installs the CLI binary into $GOBIN
go install github.com/dpep/iriq/cmd/iriq@latest
For library use, depend on whichever runtime you're working in:
# Gemfile
gem "iriq"
import "github.com/dpep/iriq"
$ iriq https://foo.com/users/123
# parse
original: https://foo.com/users/123
kind: url
scheme: https
host: foo.com
path_segments: ["users", "123"]
canonical: https://foo.com/users/123
# normalize
https://foo.com/users/{user_id}
$ iriq -n https://foo.com/users/123
https://foo.com/users/{user_id}
$ cat access.log | iriq # extract → URL list (or clusters at scale)
$ cat access.log | iriq --stats # rolling aggregates
$ iriq ./access.log -n # file auto-detected → normalize each found URL
Full CLI reference is below under CLI.
# Ruby
iri = Iriq.parse("https://foo.com/users/123")
iri.scheme # => "https"
iri.host # => "foo.com"
iri.path_segments # => ["users", "123"]
iri.canonical # => "https://foo.com/users/123"
Iriq.normalize("https://foo.com/users/123")
# => "https://foo.com/users/{user_id}"
Iriq.explain("https://foo.com/users/123/orders/456")
# => [
# { value: "users", type: :literal, variable: false, hint: nil },
# { value: "123", type: :integer_id, variable: true, hint: "user_id" },
# { value: "orders", type: :literal, variable: false, hint: nil },
# { value: "456", type: :integer_id, variable: true, hint: "order_id" },
# ]
// Go (same surface)
iri, _ := iriq.Parse("https://foo.com/users/123")
iri.Scheme // "https"
iri.Host // "foo.com"
iri.PathSegments // []string{"users", "123"}
iri.Canonical() // "https://foo.com/users/123"
norm, _ := iriq.Normalize("https://foo.com/users/123")
// "https://foo.com/users/{user_id}"
The Ruby gem is the reference implementation; Go mirrors its API and is kept in sync via JSON fixtures plus a CLI parity harness. See CLAUDE.md for the dev process.
Pass hints: false to Iriq.normalize (or PathShape) for mechanical
placeholders ({integer_id} instead of {user_id}).
When a variable segment follows a literal one, Iriq derives a hint by
singularizing the literal and suffixing _id (or _uuid for UUIDs). This is
what produces {user_id} from /users/123 and {order_id} from
/orders/456. Singularization uses Iriq::Inflector, which delegates to a
swappable adapter:
# Default: ActiveSupport::Inflector if `active_support/inflector` is loadable,
# otherwise a built-in adapter with rules adapted from ActiveSupport.
Iriq::Inflector.singularize("categories") # => "category"
Iriq::Inflector.singularize("people") # => "person"
# Override:
Iriq::Inflector.adapter = MyAdapter # must respond to .singularize(String)
Iriq::Inflector.reset_adapter!

| Input | Notes |
|---|---|
| https://foo.com/users/123 | Standard URL |
| foo.com/users/456 | Scheme-less; https:// is assumed |
| urn:isbn:0451450523 | URN; scheme and nss are populated |
| https://例え.テスト/こんにちは | Unicode IRI; display form preserved |
| HTTPS://Foo.com:443/A | Scheme + host lowercased; default port dropped |
| https://foo.com/a/./b/../c | Dot segments normalized |
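The last two rows can be approximated with Ruby's standard library. This is a minimal sketch, not Iriq's parser: `canonical_sketch` and `remove_dot_segments` are hypothetical helper names, only URL-style input is handled, query strings and fragments are ignored, and stdlib `URI.parse` rejects non-ASCII hosts, so the Unicode row is out of scope here.

```ruby
require "uri"

# Collapse "." and ".." path segments (a simplified take on RFC 3986, section 5.2.4).
def remove_dot_segments(path)
  return path if path.empty?

  kept = []
  path.split("/", -1).each do |segment|
    case segment
    when "."  then next
    when ".." then kept.pop if kept.size > 1 # never pop the leading "" (the root)
    else kept << segment
    end
  end
  result = kept.join("/")
  result.empty? ? "/" : result
end

# Hypothetical helper: lowercase scheme and host, drop the default port,
# collapse dot segments. Query and fragment are ignored for brevity.
def canonical_sketch(input)
  text = input.strip
  # Assume https:// when no explicit "scheme://" prefix is present.
  text = "https://#{text}" unless text.match?(%r{\A[a-z][a-z0-9+.-]*://}i)
  uri = URI.parse(text)
  port = uri.port == uri.default_port ? "" : ":#{uri.port}"
  "#{uri.scheme.downcase}://#{uri.host.downcase}#{port}#{remove_dot_segments(uri.path)}"
end

puts canonical_sketch("HTTPS://Foo.com:443/A")      # https://foo.com/A
puts canonical_sketch("https://foo.com/a/./b/../c") # https://foo.com/a/c
puts canonical_sketch("foo.com/users/456")          # https://foo.com/users/456
```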
Iriq::SegmentClassifier returns one of:
- :literal - plain word (users, orders, Profile, こんにちは)
- :integer_id - pure digits below the timestamp range (1, 123, 42)
- :uuid - f47ac10b-58cc-4372-a567-0e02b2c3d479
- :date - 2024-05-23
- :timestamp - ISO 8601, or 10/13-digit UNIX epoch
- :hash - 32+ hex chars (md5 / sha)
- :slug - my-cool-post, my_cool_post
- :opaque_id - short alphanumeric mix that doesn't fit elsewhere
Heuristics are deterministic and ordered — the first matching rule wins.
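An ordered, first-match-wins classifier can be sketched as a list of (type, pattern) pairs. The rule order and regular expressions below are assumptions chosen for illustration, not the ones Iriq::SegmentClassifier uses; `RULES_SKETCH` and `classify_sketch` are hypothetical names.

```ruby
# Ordered rules: the first pattern that matches decides the type.
# Order matters: "2024-05-23" would also satisfy the slug pattern,
# but the date rule is checked first.
RULES_SKETCH = [
  [:uuid,       /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/],
  [:date,       /\A\d{4}-\d{2}-\d{2}\z/],
  [:timestamp,  /\A(?:\d{10}|\d{13}|\d{4}-\d{2}-\d{2}T\d{2}:\d{2}(?::\d{2})?(?:Z|[+-]\d{2}:\d{2})?)\z/],
  [:integer_id, /\A\d+\z/],
  [:hash,       /\A\h{32,}\z/],
  [:slug,       /\A[a-z0-9]+(?:[-_][a-z0-9]+)+\z/],
  [:literal,    /\A\p{L}+\z/],
].freeze

def classify_sketch(segment)
  RULES_SKETCH.each { |type, pattern| return type if segment.match?(pattern) }
  :opaque_id # fallback when nothing else fits
end

%w[users 123 2024-05-23 1716422400 my-cool-post a1b2c3].each do |segment|
  puts "#{segment} => #{classify_sketch(segment)}"
end
```

Because evaluation stops at the first match, moving a broad rule (such as the slug pattern) above a narrow one changes results, which is why a fixed order keeps the output deterministic.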
clusterer = Iriq::Clusterer.new
clusterer.add("https://foo.com/users/123")
clusterer.add("https://foo.com/users/456")
clusterer.add("https://foo.com/users/789/orders/1")
clusterer.clusters.map(&:shape)
# => ["/users/{user_id}", "/users/{user_id}/orders/{order_id}"]
clusterer.clusters.first.segment_stats
# => [
# { position: 0, stable: true, values: { "users" => 2 } },
# { position: 1, stable: false, values: { "123" => 1, "456" => 1 } },
# ]
clusterer.explain("https://foo.com/users/999")
# => [
# { value: "users", type: :literal, variable: false, hint: nil, stable: true },
# { value: "999", type: :integer_id, variable: true, hint: "user_id", stable: false },
# ]
The clusterer combines classifier output with what it has actually observed:
a position the classifier would call variable but that is empirically
constant across all members of the cluster is reported with
stable: true, variable: false.
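The "empirically constant" check can be pictured as a per-position tally over the members of one cluster. This is a sketch under the assumption that all members share the same number of segments; `segment_stats_sketch` is a hypothetical helper, not the Iriq::Clusterer implementation.

```ruby
# For paths that share one shape, count the values seen at each position.
# A position is "stable" when exactly one distinct value was observed.
def segment_stats_sketch(member_paths)
  rows = member_paths.map { |path| path.split("/").reject(&:empty?) }
  return [] if rows.empty?

  rows.first.each_index.map do |position|
    values = rows.map { |segments| segments[position] }.tally
    { position: position, stable: values.size == 1, values: values }
  end
end

stats = segment_stats_sketch(["/v/2/items/1", "/v/2/items/7"])
stats.each { |row| puts row.inspect }
```

In this input, position 1 holds the digits "2" in every member, so it comes out stable even though a classifier looking at the segment alone would call it an integer ID.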
For processing many identifiers — possibly an unbounded stream — use
Iriq::Corpus. It maintains rolling aggregates and per-(host, prefix)
frequency stats so classification improves as more data comes in.
corpus = Iriq::Corpus.new
iris.each do |iri|
obs = corpus.observe(iri)
obs.fingerprint # deterministic shape: "https://foo.com/users/{user_id}"
obs.cluster # the Iriq::Cluster this fell into
obs.explanation # per-segment annotations with corpus-informed classification
end
corpus.host_counts # { "foo.com" => 1234, "bar.com" => 7 }
corpus.path_length_counts # { 2 => 800, 3 => 434 }
corpus.fingerprint_counts # shape → count
corpus.raw_shape_counts # hint-free shape → count
corpus.clusters # Iriq::Cluster instances
Iriq.normalize("https://foo.com/users/me")
# => "https://foo.com/users/me" # mechanical: "me" is a literal
corpus.normalize("https://foo.com/users/me")
# => depends on what the corpus has seen
If many /users/{integer_id} paths flow in alongside a handful of
/users/me, the cluster /users/me is preserved (mechanical clustering
keeps literal routes distinct). If many distinct literal handles
(/users/alice, /users/bob, /users/carol, ...) flow in, the corpus
promotes that position to a {user} placeholder:
%w[alice bob carol dave erin frank gina hank ivan jane].each do |name|
corpus.observe("https://foo.com/users/#{name}/profile")
end
corpus.normalize("https://foo.com/users/alice/profile")
# => "https://foo.com/users/{user}/profile"
Each row of corpus.explain(...) (and observation.explanation) carries a
classification: symbol on top of the deterministic fields:
| Classification | Meaning |
|---|---|
| :stable_literal | Literal value dominates this position |
| :variable_identifier | Classifier said variable (uuid, integer, etc.) |
| :rare_literal | Literal seen here, but not dominant |
| :corpus_inferred_variable | Classifier said literal, but position has high entropy |
| :ambiguous | Insufficient signal: never seen, or mixed |
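One way to picture how these labels could fall out of per-position counts is shown below. Everything here is an assumption for illustration: the thresholds (`min_total`, `dominance`, `spread`), the use of a distinct-value ratio as a crude stand-in for entropy, and the helper name `classify_position_sketch` are not taken from Iriq::Corpus.

```ruby
# counts: value => number of times it was seen at this position.
# classifier_variable: whether the deterministic classifier called the segment variable.
def classify_position_sketch(value, counts, classifier_variable:,
                             min_total: 5, dominance: 0.8, spread: 0.5)
  return :variable_identifier if classifier_variable

  total = counts.values.sum
  return :ambiguous if total < min_total || !counts.key?(value)

  share          = counts[value].to_f / total      # how dominant this value is
  top_share      = counts.values.max.to_f / total  # how dominant the leader is
  distinct_ratio = counts.size.to_f / total        # crude proxy for entropy

  return :stable_literal           if share >= dominance
  return :corpus_inferred_variable if distinct_ratio >= spread
  return :rare_literal             if top_share >= dominance

  :ambiguous
end

handles = %w[alice bob carol dave erin frank gina hank ivan jane].tally
puts classify_position_sketch("alice", handles, classifier_variable: false)
puts classify_position_sketch("users", { "users" => 98, "accounts" => 2 }, classifier_variable: false)
```

With ten distinct handles seen once each, the position looks high-entropy and the first call reports a corpus-inferred variable; with one value covering 98% of observations, the second call reports a stable literal.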
Iriq::Extractor powers pipe mode in the CLI. It picks up explicit-scheme
URLs (http, https, ftp, ws, wss, urn) and scheme-less foo.com/path-style
URLs (small TLD allow-list, path required). It trims trailing sentence
punctuation iteratively and preserves balanced parentheses:
https://en.wikipedia.org/wiki/Ruby_(programming_language) stays intact, while
(see https://foo.com) drops the outer paren.
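The trimming rule can be sketched as a loop that removes one trailing character at a time. This is an illustration of the idea, not Iriq::Extractor's code: the punctuation set and the helper name `trim_trailing_sketch` are assumptions.

```ruby
TRAILING_PUNCTUATION_SKETCH = ".,;:!?'\""

# Strip sentence punctuation from the end, one character per pass.
# A trailing ")" is removed only when it has no matching "(" inside the URL.
def trim_trailing_sketch(candidate)
  url = candidate.dup
  loop do
    last = url[-1]
    if last && TRAILING_PUNCTUATION_SKETCH.include?(last)
      url.chop!
    elsif last == ")" && url.count(")") > url.count("(")
      url.chop!
    else
      return url
    end
  end
end

puts trim_trailing_sketch("https://en.wikipedia.org/wiki/Ruby_(programming_language)")
puts trim_trailing_sketch("https://foo.com).")
```

The first call returns its input unchanged because the parentheses balance; the second drops both the period and the unmatched paren.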
Iriq.extract("Visit https://foo.com today, also hit foo.com/users.")
# => [#<Iriq::Identifier https://foo.com>,
# #<Iriq::Identifier https://foo.com/users>]
# Disable scheme-less:
Iriq::Extractor.new(scheme_less: false).extract("hit foo.com/users today")
# => []
Known limitations (intentional):
- Comma is a URL boundary, so query strings like ?q=37.7,-122.4 truncate. Trade-off picked to keep CSV-shaped text working.
- No HTML entity decoding (&amp; stays as-is).
- Scheme-less mode skips bare hostnames without a path (too noisy in prose).

Memory use is bounded:
- Per-position value_counts is capped (max_values_per_position, default 1000). Once full, total keeps growing but only existing keys count up.
- Cluster examples are capped at Iriq::Cluster::MAX_EXAMPLES.
- No raw IRI strings are retained outside the bounded cluster examples.
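The cap on per-position value counts behaves roughly like the counter below. This is a sketch of the described behavior only; `CappedCounterSketch` is a hypothetical class, not Iriq::PositionStats.

```ruby
# A frequency counter that stops admitting new keys once it is full.
# `total` always grows; only keys admitted before the cap keep counting.
class CappedCounterSketch
  attr_reader :total, :counts

  def initialize(max_values: 1000)
    @max_values = max_values
    @counts = Hash.new(0)
    @total = 0
  end

  def add(value)
    @total += 1
    @counts[value] += 1 if @counts.key?(value) || @counts.size < @max_values
    self
  end
end

counter = CappedCounterSketch.new(max_values: 2)
%w[a b c a].each { |value| counter.add(value) }
puts counter.counts.inspect # {"a"=>2, "b"=>1}
puts counter.total          # 4
```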
Iriq::Corpus.new(max_values_per_position: 200)

| Class | Responsibility |
|---|---|
| Iriq::Parser | String → Identifier |
| Iriq::Identifier | Structured fields + canonical reconstruction |
| Iriq::SegmentClassifier | Single segment → type symbol |
| Iriq::PathShape | Segments → /users/{user_id} route shape |
| Iriq::SegmentHints | Derives user_id-style hints from neighbors |
| Iriq::Inflector | Singularization with swappable adapter (AS or built-in) |
| Iriq::Normalizer | Identifier → canonical, shape-aware string |
| Iriq::Explanation | Per-segment {value, type, variable, hint} rows |
| Iriq::Cluster | One host + shape group, with examples & stats |
| Iriq::Clusterer | Many identifiers → Cluster set + explain |
| Iriq::PositionStats | Capped value/type frequencies for one position |
| Iriq::Observation | What Corpus#observe returns |
| Iriq::Corpus | Streaming observer with rolling aggregates + learning |
| Iriq::Extractor | Pulls IRIs out of free text (scheme-anchored) |
Installing the gem installs an iriq executable. Two main modes:
Single input — combined parse + normalize summary; trim with section
flags (-p, -n).
$ iriq foo.com/users/456
# parse
original: foo.com/users/456
kind: url
scheme: https
host: foo.com
path_segments: ["users", "456"]
canonical: https://foo.com/users/456
# normalize
https://foo.com/users/{user_id}
$ iriq -n https://foo.com/users/123
https://foo.com/users/{user_id}
Piped stdin — extraction runs by default. Output auto-switches: small inputs get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the cluster view via an ephemeral corpus. Section flags work too — emit one normalized URL / parsed record per extracted IRI.
$ cat short.txt | iriq
[2] https://github.com/dpep/iriq
[1] https://foo.com/users
$ cat short.txt | iriq -n # normalized URL per line
https://github.com/dpep/iriq
https://foo.com/users
$ cat access.log | iriq # ≥ 10 IRIs → cluster view
[190] docs.example.com /users/{user_id}
[186] app.example.com /users/{user_id}
...
$ cat README.md | iriq --stats # rolling aggregates
$ cat README.md | iriq cluster # force cluster view
$ cat README.md | iriq --corpus c.json # persist into a corpus
--corpus PATH makes the corpus survive across invocations (atomic JSON
file). Once it has data, -n becomes corpus-informed:
$ for n in alice bob carol dave erin frank gina hank ivan jane; do
iriq --corpus c.json https://foo.com/users/$n/profile >/dev/null
done
$ iriq -n --corpus c.json https://foo.com/users/zoe/profile
https://foo.com/users/{user}/profile # mechanical would keep "zoe"
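An "atomic JSON file" is commonly achieved by writing to a temporary file in the same directory and renaming it over the target, so readers see either the old file or the new one. The sketch below shows that general pattern; it is an assumption about the technique, not a description of how the iriq CLI persists its corpus, and `atomic_write_json_sketch` is a hypothetical helper.

```ruby
require "json"

# Write JSON to a sibling temp file, flush it to disk, then rename over the target.
# A rename within one filesystem replaces the destination in a single step on POSIX systems.
def atomic_write_json_sketch(path, data)
  tmp_path = "#{path}.#{Process.pid}.tmp"
  File.open(tmp_path, "w") do |file|
    file.write(JSON.generate(data))
    file.fsync # flush buffered data to disk before the rename
  end
  File.rename(tmp_path, path)
ensure
  # Clean up the temp file if anything failed before the rename.
  File.delete(tmp_path) if tmp_path && File.exist?(tmp_path)
end
```

Keeping the temp file next to the target matters: a rename across filesystems is not a single-step replacement.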
Flags:
| Flag | Effect |
|---|---|
| -p, --parse | Show parsed fields |
| -n, --normalize | Show the shape-normalized form |
| -j, --json | Emit JSON |
| -N, --no-hints | Use {integer_id} etc. instead of {user_id} |
| --no-scheme-less | Skip foo.com/path-style extraction (explicit-scheme only) |
| --corpus PATH | Load/create a JSON corpus at PATH; observe and save |
| --stats | Print rolling aggregates |
| -V, --version | Print version |
A positional argument that doesn't parse as an IRI but IS an existing
file is read and extracted from automatically — iriq ./access.log and
iriq /var/log/foo.log Just Work. (Bare filenames like README.md
may still parse as a URL; pipe with cat to disambiguate.)
Exit codes: 0 success, 1 usage error, 2 parse error.
Measured on the deterministic IriGenerator fixture (Ruby 3.4.9, single
thread):
| Operation | Throughput |
|---|---|
| Iriq.parse | ~260k URLs/s |
| Iriq.normalize | ~148k URLs/s |
| Iriq.explain | ~205k URLs/s |
| Iriq.extract (prose) | ~9.6 MB/s |
| Corpus#observe | ~80k URLs/s |
| Corpus save/load (10k) | ~135 ms |
Linear scaling holds through 100k observations; per-observation retained
memory amortizes to ~100 bytes at that scale. Memoization caches are
bounded by CACHE_MAX = 10_000 (cleared when full) — overhead is a few
hundred KB regardless of corpus size.
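A "cleared when full" memoization cache can be sketched in a few lines. This illustrates the bounding strategy described above, with a hypothetical class name; it is not Iriq's cache code.

```ruby
# Memoize block results by key; when the store reaches max_entries,
# drop everything and start over instead of tracking recency.
class ClearWhenFullCacheSketch
  def initialize(max_entries)
    @max_entries = max_entries
    @store = {}
  end

  def fetch(key)
    return @store[key] if @store.key?(key)

    @store.clear if @store.size >= @max_entries
    @store[key] = yield(key)
  end

  def size
    @store.size
  end
end

cache = ClearWhenFullCacheSketch.new(2)
cache.fetch("a") { |key| key.upcase }
cache.fetch("b") { |key| key.upcase }
cache.fetch("c") { |key| key.upcase } # store was full, so it is cleared first
puts cache.size # 1
```

Clearing wholesale is cheaper than LRU bookkeeping and still caps memory, at the cost of recomputing hot keys after each reset.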
Re-run anytime with:
bundle exec script/benchmark.rb # throughput
bundle exec script/memory.rb # retained memory + cache footprints
This is an MVP. Iriq does not:
- Implement RFC 3986, RFC 3987, or the WHATWG URL standard fully.
- Convert between Unicode (IRI) and punycode (URI) — the display form is preserved as-is.
- Percent-encode or decode path/query bytes. Bytes are kept as written.
- Validate scheme-specific structure beyond URL vs. URN.
- Resolve relative references against a base URL.
- Round-trip canonical back to the exact original byte-for-byte (whitespace is stripped, default ports are dropped, dot segments are collapsed).
For richer IRI handling, see addressable. Iriq's focus is the analysis
side: classification, normalization, and clustering — not a complete URL
implementation.
A Go implementation lives under go/ — same public surface, same
behavior, ~10× faster CLI on extraction-heavy workloads. The Ruby gem is
the reference; the Go port stays in sync via golden JSON fixtures
(spec/fixtures/) and a CLI parity harness (script/cli_parity.sh), both
checked in CI.
import "github.com/dpep/iriq/go/iriq"
iri, _ := iriq.Parse("https://foo.com/users/123")
norm, _ := iriq.Normalize("https://foo.com/users/123")
// "https://foo.com/users/{user_id}"
See go/README.md for the full API table and porting workflow.
Yes please :)
- Fork it
- Create your feature branch (git checkout -b my-feature)
- Ensure the tests pass (bundle exec rspec)
- If you changed library behavior, port the change to Go (or open an issue) and regenerate fixtures: bundle exec ruby script/generate_fixtures.rb
- Commit your changes (git commit -am 'awesome new feature')
- Push your branch (git push origin my-feature)
- Create a Pull Request