
dpep/iriq


Iriq


IRI extraction, normalization, and clustering.

Iriq pulls IRIs out of free text, parses them, normalizes them into canonical shape-aware forms, classifies their path and query components, and clusters similar identifiers — surfacing what's stable vs. unique.

Ships as both a command-line tool (iriq) and a library (Ruby and Go — same behavior, enforced by parity tests).

Install

The CLI is available three ways. Pick whichever fits your workflow:

# Homebrew (recommended)
brew install dpep/tools/iriq

# RubyGems — installs the CLI shim and the library
gem install iriq

# Go — installs the CLI binary into $GOBIN
go install github.com/dpep/iriq/cmd/iriq@latest

For library use, depend on whichever runtime you're working in:

# Gemfile
gem "iriq"

// Go
import "github.com/dpep/iriq"

CLI quick start

$ iriq https://foo.com/users/123
# parse
original:      https://foo.com/users/123
kind:          url
scheme:        https
host:          foo.com
path_segments: ["users", "123"]
canonical:     https://foo.com/users/123

# normalize
https://foo.com/users/{user_id}

$ iriq -n https://foo.com/users/123
https://foo.com/users/{user_id}

$ cat access.log | iriq             # extract → URL list (or clusters at scale)
$ cat access.log | iriq --stats     # rolling aggregates
$ iriq ./access.log -n              # file auto-detected → normalize each found URL

Full CLI reference is below under CLI.

Library quick start

# Ruby
iri = Iriq.parse("https://foo.com/users/123")
iri.scheme         # => "https"
iri.host           # => "foo.com"
iri.path_segments  # => ["users", "123"]
iri.canonical      # => "https://foo.com/users/123"

Iriq.normalize("https://foo.com/users/123")
# => "https://foo.com/users/{user_id}"

Iriq.explain("https://foo.com/users/123/orders/456")
# => [
#      { value: "users",  type: :literal,    variable: false, hint: nil        },
#      { value: "123",    type: :integer_id, variable: true,  hint: "user_id"  },
#      { value: "orders", type: :literal,    variable: false, hint: nil        },
#      { value: "456",    type: :integer_id, variable: true,  hint: "order_id" },
#    ]

// Go (same surface)
iri, _ := iriq.Parse("https://foo.com/users/123")
iri.Scheme         // "https"
iri.Host           // "foo.com"
iri.PathSegments   // []string{"users", "123"}
iri.Canonical()    // "https://foo.com/users/123"

norm, _ := iriq.Normalize("https://foo.com/users/123")
// "https://foo.com/users/{user_id}"

The Ruby gem is the reference implementation; Go mirrors its API and is kept in sync via JSON fixtures plus a CLI parity harness. See CLAUDE.md for the dev process.

Pass hints: false to Iriq.normalize (or PathShape) for mechanical placeholders ({integer_id} instead of {user_id}).

RESTful hints

When a variable segment follows a literal one, Iriq derives a hint by singularizing the literal and suffixing _id (or _uuid for UUIDs). This is what produces {user_id} from /users/123 and {order_id} from /orders/456. Singularization uses Iriq::Inflector, which delegates to a swappable adapter:

# Default: ActiveSupport::Inflector if `active_support/inflector` is loadable,
# otherwise a built-in adapter with rules adapted from ActiveSupport.

Iriq::Inflector.singularize("categories")  # => "category"
Iriq::Inflector.singularize("people")      # => "person"

# Override:
Iriq::Inflector.adapter = MyAdapter        # must respond to .singularize(String)
Iriq::Inflector.reset_adapter!
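
As a rough illustration of the adapter contract and the hint rule described above, the sketch below defines a hypothetical NaiveSingularizer module and a hypothetical hint_for helper. Neither name comes from Iriq, and the singularization rules are deliberately minimal; the real Iriq::Inflector and Iriq::SegmentHints may well behave differently.

```ruby
# Hypothetical adapter: anything that responds to .singularize(String) fits
# the contract described above. The rules here are a tiny illustrative subset.
module NaiveSingularizer
  IRREGULAR = { "people" => "person", "children" => "child" }.freeze

  def self.singularize(word)
    return IRREGULAR[word] if IRREGULAR.key?(word)
    return word.sub(/ies\z/, "y") if word.end_with?("ies")
    return word.chomp("s") if word.end_with?("s") && !word.end_with?("ss")

    word
  end
end

# Hypothetical helper mirroring the described rule: singularize the preceding
# literal, then suffix _uuid for UUIDs and _id for everything else.
def hint_for(previous_literal, type, inflector: NaiveSingularizer)
  suffix = type == :uuid ? "uuid" : "id"
  "#{inflector.singularize(previous_literal)}_#{suffix}"
end

puts hint_for("users", :integer_id)       # user_id
puts hint_for("orders", :uuid)            # order_uuid
puts hint_for("categories", :integer_id)  # category_id
```

Passing the inflector as a keyword argument keeps the helper testable with any object that implements singularize, which is the same idea as swapping the adapter.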

Supported inputs

| Input | Notes |
| --- | --- |
| https://foo.com/users/123 | Standard URL |
| foo.com/users/456 | Scheme-less; https:// is assumed |
| urn:isbn:0451450523 | URN; scheme and nss are populated |
| https://例え.テスト/こんにちは | Unicode IRI; display form preserved |
| HTTPS://Foo.com:443/A | Scheme + host lowercased; default port dropped |
| https://foo.com/a/./b/../c | Dot segments normalized |
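
To make the canonicalization rows concrete, here is a minimal sketch built on Ruby's standard URI library. The helper names (canonicalize, remove_dot_segments) and the DEFAULT_PORTS table are assumptions for illustration, not Iriq's parser. The standard library rejects non-ASCII input, so the Unicode row is not covered here.

```ruby
require "uri"

# Assumed default-port table for this sketch only.
DEFAULT_PORTS = { "http" => 80, "https" => 443 }.freeze

# Collapse "." and ".." path segments; never pops past the leading slash.
def remove_dot_segments(path)
  out = []
  path.split("/", -1).each do |seg|
    case seg
    when "."  then next
    when ".." then out.pop if out.size > 1
    else out << seg
    end
  end
  out.join("/")
end

# Assume https:// for scheme-less input, lowercase scheme and host,
# drop the default port, and normalize dot segments.
def canonicalize(str)
  str = "https://#{str}" unless str.match?(%r{\A[a-z][a-z0-9+.-]*://}i)
  uri = URI.parse(str)
  scheme = uri.scheme.downcase
  host = uri.host.downcase
  port = uri.port == DEFAULT_PORTS[scheme] ? "" : ":#{uri.port}"
  "#{scheme}://#{host}#{port}#{remove_dot_segments(uri.path)}"
end

puts canonicalize("HTTPS://Foo.com:443/A")       # https://foo.com/A
puts canonicalize("foo.com/users/456")           # https://foo.com/users/456
puts canonicalize("https://foo.com/a/./b/../c")  # https://foo.com/a/c
```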

Segment classification

Iriq::SegmentClassifier returns one of:

  • :literal — plain word (users, orders, Profile, こんにちは)
  • :integer_id — pure digits below the timestamp range (1, 123, 42)
  • :uuid (e.g. f47ac10b-58cc-4372-a567-0e02b2c3d479)
  • :date (e.g. 2024-05-23)
  • :timestamp — ISO 8601, or 10/13-digit UNIX epoch
  • :hash — 32+ hex chars (md5 / sha)
  • :slug (e.g. my-cool-post, my_cool_post)
  • :opaque_id — short alphanumeric mix that doesn't fit elsewhere

Heuristics are deterministic and ordered — the first matching rule wins.
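
The "first matching rule wins" idea can be sketched as an ordered table of patterns. The rule order, the regular expressions, and the classify helper below are assumptions made for illustration; Iriq::SegmentClassifier may order or define its rules differently, and ISO 8601 timestamps are left out to keep the sketch short.

```ruby
# Ordered rule table: earlier entries take precedence over later ones.
# Patterns are illustrative guesses, not Iriq's actual heuristics.
RULES = [
  [:uuid,       /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/],
  [:date,       /\A\d{4}-\d{2}-\d{2}\z/],
  [:timestamp,  /\A(?:\d{10}|\d{13})\z/],   # UNIX epoch, seconds or millis
  [:integer_id, /\A\d+\z/],
  [:hash,       /\A\h{32,}\z/],
  [:literal,    /\A\p{L}+\z/],
  [:slug,       /\A[\p{L}\d]+(?:[-_][\p{L}\d]+)+\z/]
].freeze

# Return the type of the first rule whose pattern matches the segment.
def classify(segment)
  RULES.each { |type, pattern| return type if pattern.match?(segment) }
  :opaque_id
end

p classify("users")         # :literal
p classify("123")           # :integer_id
p classify("1716422400")    # :timestamp
p classify("2024-05-23")    # :date
p classify("my-cool-post")  # :slug
p classify("a1b2c3")        # :opaque_id
```

Ordering matters here: 2024-05-23 would also satisfy the slug pattern, and a 10-digit epoch would also satisfy the integer pattern, so the more specific rules have to come first.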

Clustering

clusterer = Iriq::Clusterer.new
clusterer.add("https://foo.com/users/123")
clusterer.add("https://foo.com/users/456")
clusterer.add("https://foo.com/users/789/orders/1")

clusterer.clusters.map(&:shape)
# => ["/users/{user_id}", "/users/{user_id}/orders/{order_id}"]

clusterer.clusters.first.segment_stats
# => [
#      { position: 0, stable: true,  values: { "users" => 2 } },
#      { position: 1, stable: false, values: { "123" => 1, "456" => 1 } },
#    ]

clusterer.explain("https://foo.com/users/999")
# => [
#      { value: "users", type: :literal,    variable: false, hint: nil,       stable: true  },
#      { value: "999",   type: :integer_id, variable: true,  hint: "user_id", stable: false },
#    ]

The clusterer combines classifier output with what it has actually observed: a position the classifier would call variable but that is empirically constant across all members of the cluster will be reported with stable: true, variable: false.
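
The grouping and the per-position stability check can be approximated in a few lines. This is a sketch under simplifying assumptions: it works on bare paths, treats only all-digit segments as variable, and uses a generic {id} placeholder. The helper names are hypothetical and do not reflect Iriq::Clusterer internals.

```ruby
def segments(path)
  path.split("/").reject(&:empty?)
end

# Hint-free shape: all-digit segments become {id}, everything else is kept.
def shape_of(segs)
  "/" + segs.map { |s| s.match?(/\A\d+\z/) ? "{id}" : s }.join("/")
end

# Group paths that share a shape.
def cluster_paths(paths)
  paths.map { |path| segments(path) }.group_by { |segs| shape_of(segs) }
end

# A position is stable when every member of the cluster has the same value.
def segment_stats(members)
  members.first.each_index.map do |i|
    values = members.map { |segs| segs[i] }.tally
    { position: i, stable: values.size == 1, values: values }
  end
end

clusters = cluster_paths(["/users/123", "/users/456", "/users/789/orders/1"])
p clusters.keys
# ["/users/{id}", "/users/{id}/orders/{id}"]
p segment_stats(clusters["/users/{id}"])
```

Because stability is computed from observed values rather than from the classifier, a cluster whose members all share the same numeric segment would report that position as stable, which matches the behavior described above.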

Corpus (streaming + learning)

For processing many identifiers — possibly an unbounded stream — use Iriq::Corpus. It maintains rolling aggregates and per-(host, prefix) frequency stats so classification improves as more data comes in.

corpus = Iriq::Corpus.new

iris.each do |iri|
  obs = corpus.observe(iri)
  obs.fingerprint   # deterministic shape: "https://foo.com/users/{user_id}"
  obs.cluster       # the Iriq::Cluster this fell into
  obs.explanation   # per-segment annotations with corpus-informed classification
end

corpus.host_counts          # { "foo.com" => 1234, "bar.com" => 7 }
corpus.path_length_counts   # { 2 => 800, 3 => 434 }
corpus.fingerprint_counts   # shape → count
corpus.raw_shape_counts     # hint-free shape → count
corpus.clusters             # Iriq::Cluster instances

Deterministic vs. corpus-informed normalization

Iriq.normalize("https://foo.com/users/me")
# => "https://foo.com/users/me"   # mechanical: "me" is a literal

corpus.normalize("https://foo.com/users/me")
# => depends on what the corpus has seen

If many /users/{integer_id} paths flow in alongside a handful of /users/me, the cluster /users/me is preserved (mechanical clustering keeps literal routes distinct). If many distinct literal handles (/users/alice, /users/bob, /users/carol, ...) flow in, the corpus promotes that position to a {user} placeholder:

%w[alice bob carol dave erin frank gina hank ivan jane].each do |name|
  corpus.observe("https://foo.com/users/#{name}/profile")
end

corpus.normalize("https://foo.com/users/alice/profile")
# => "https://foo.com/users/{user}/profile"
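
One way to picture the promotion step is a per-prefix tally of distinct values with a cutoff. The TinyCorpus class, its PROMOTE_AT threshold, and the naive chomp-based singularization below are all assumptions for this sketch; Iriq::Corpus keys its stats by host and prefix and may use a different promotion criterion entirely.

```ruby
class TinyCorpus
  PROMOTE_AT = 10 # hypothetical threshold, not taken from Iriq

  def initialize
    # prefix segments => { value => count } for the position after that prefix
    @seen = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  end

  def observe(path)
    segs = split(path)
    segs.each_with_index { |seg, i| @seen[segs.first(i)][seg] += 1 }
  end

  # Replace a segment with a placeholder once its position has shown
  # PROMOTE_AT or more distinct values after the same prefix.
  def normalize(path)
    segs = split(path)
    out = segs.each_with_index.map do |seg, i|
      distinct = @seen.fetch(segs.first(i), {}).size
      if i.positive? && distinct >= PROMOTE_AT
        "{#{segs[i - 1].chomp('s')}}"
      else
        seg
      end
    end
    "/" + out.join("/")
  end

  private

  def split(path)
    path.split("/").reject(&:empty?)
  end
end

corpus = TinyCorpus.new
puts corpus.normalize("/users/me") # /users/me (nothing observed yet)
%w[alice bob carol dave erin frank gina hank ivan jane].each do |name|
  corpus.observe("/users/#{name}/profile")
end
puts corpus.normalize("/users/zoe/profile") # /users/{user}/profile
```

A limitation of this sketch is that prefixes are keyed on raw segments, so positions after a promoted segment never aggregate across members.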

Explainability

Each row of corpus.explain(...) (and observation.explanation) carries a classification: symbol on top of the deterministic fields:

| Classification | Meaning |
| --- | --- |
| :stable_literal | Literal value dominates this position |
| :variable_identifier | Classifier said variable (uuid, integer, etc.) |
| :rare_literal | Literal seen here, but not dominant |
| :corpus_inferred_variable | Classifier said literal, but position has high entropy |
| :ambiguous | Insufficient signal: never seen, or mixed |
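
The five labels can be read as a decision over a position's value counts. The sketch below uses Shannon entropy for the "high entropy" test; the thresholds, the helper names, and the exact decision order are assumptions made for illustration and are not taken from Iriq.

```ruby
# Hypothetical thresholds for this sketch only.
MIN_OBSERVATIONS = 5
DOMINANT_SHARE   = 0.5
HIGH_ENTROPY     = 3.0 # bits

# Shannon entropy (in bits) of a { value => count } table.
def shannon_entropy(counts)
  total = counts.values.sum.to_f
  counts.values.sum do |count|
    prob = count / total
    -prob * Math.log2(prob)
  end
end

# value:               the segment being explained
# classifier_variable: whether the deterministic classifier called it variable
# counts:              observed { value => count } for this position
def classify_position(value, classifier_variable, counts)
  return :variable_identifier if classifier_variable

  total = counts.values.sum
  return :ambiguous if total < MIN_OBSERVATIONS
  return :stable_literal if counts.fetch(value, 0).to_f / total >= DOMINANT_SHARE
  return :corpus_inferred_variable if shannon_entropy(counts) >= HIGH_ENTROPY

  counts.key?(value) ? :rare_literal : :ambiguous
end

names = %w[alice bob carol dave erin frank gina hank ivan jane]
p classify_position("users", false, { "users" => 100 })          # :stable_literal
p classify_position("alice", false, names.to_h { |n| [n, 1] })   # :corpus_inferred_variable
p classify_position("me", false, { "list" => 90, "me" => 3 })    # :rare_literal
```

Ten equally likely values give an entropy of log2(10), about 3.32 bits, which is why the ten-name example crosses the assumed 3.0-bit threshold.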

Extracting IRIs from text

Iriq::Extractor is what powers pipe-mode in the CLI. It picks up explicit-scheme URLs (http, https, ftp, ws, wss, urn) and foo.com/path-style scheme-less URLs (small TLD allow-list, required path). It trims trailing sentence punctuation iteratively and preserves balanced parens: https://en.wikipedia.org/wiki/Ruby_(programming_language) stays intact, while (see https://foo.com) drops the outer paren.

Iriq.extract("Visit https://foo.com today, also hit foo.com/users.")
# => [#<Iriq::Identifier https://foo.com>,
#     #<Iriq::Identifier https://foo.com/users>]

# Disable scheme-less:
Iriq::Extractor.new(scheme_less: false).extract("hit foo.com/users today")
# => []
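
The iterative trimming with balanced-paren handling can be sketched as a small loop. The trim_url and extract_urls helpers, the punctuation set, and the http/https-only pattern are assumptions for illustration; Iriq::Extractor supports more schemes and scheme-less matching and is not shown here.

```ruby
# Characters treated as trailing sentence punctuation in this sketch.
TRAILING_PUNCTUATION = ".,;:!?\"'".freeze

# Strip trailing punctuation repeatedly; drop a closing paren only when it
# has no matching opening paren inside the candidate.
def trim_url(candidate)
  url = candidate.dup
  loop do
    if !url.empty? && TRAILING_PUNCTUATION.include?(url[-1])
      url.chop!
    elsif url.end_with?(")") && url.count(")") > url.count("(")
      url.chop!
    else
      break
    end
  end
  url
end

# Whitespace and commas end a candidate, then the trimmer cleans it up.
def extract_urls(text)
  text.scan(%r{https?://[^\s,]+}).map { |raw| trim_url(raw) }
end

p extract_urls("Visit https://foo.com today, also hit https://foo.com/users.")
# ["https://foo.com", "https://foo.com/users"]
p extract_urls("(see https://foo.com).")
# ["https://foo.com"]
p extract_urls("https://en.wikipedia.org/wiki/Ruby_(programming_language)")
# ["https://en.wikipedia.org/wiki/Ruby_(programming_language)"]
```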

Known limitations (intentional):

  • Comma is a URL boundary, so query strings like ?q=37.7,-122.4 truncate. Trade-off picked to keep CSV-shaped text working.
  • No HTML entity decoding (&amp; stays as-is).
  • Scheme-less mode skips bare hostnames without a path (too noisy in prose).

Memory bounds

  • Per-position value_counts is capped (max_values_per_position, default 1000) — once full, total keeps growing but only existing keys count up.
  • Cluster examples are capped at Iriq::Cluster::MAX_EXAMPLES.
  • No raw IRI strings are retained outside the bounded cluster examples.

Iriq::Corpus.new(max_values_per_position: 200)
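
The capping behavior in the first bullet (total keeps growing, only existing keys count up once full) can be sketched as follows. CappedCounter is a hypothetical class written for this example and is not Iriq::PositionStats.

```ruby
# Frequency table with a hard limit on the number of distinct keys.
class CappedCounter
  attr_reader :total, :counts

  def initialize(max_values)
    @max_values = max_values
    @counts = {}
    @total = 0
  end

  def add(value)
    @total += 1                      # every observation counts toward total
    if @counts.key?(value)
      @counts[value] += 1            # existing keys keep counting up
    elsif @counts.size < @max_values
      @counts[value] = 1             # new keys are admitted only below the cap
    end
  end
end

counter = CappedCounter.new(2)
%w[a b c a].each { |value| counter.add(value) }
p counter.total   # 4
p counter.counts  # {"a"=>2, "b"=>1}
```

The gap between total and the sum of counts tells you how many observations fell outside the retained keys.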

Object model

| Class | Responsibility |
| --- | --- |
| Iriq::Parser | String → Identifier |
| Iriq::Identifier | Structured fields + canonical reconstruction |
| Iriq::SegmentClassifier | Single segment → type symbol |
| Iriq::PathShape | Segments → /users/{user_id} route shape |
| Iriq::SegmentHints | Derives user_id-style hints from neighbors |
| Iriq::Inflector | Singularization with swappable adapter (AS or built-in) |
| Iriq::Normalizer | Identifier → canonical, shape-aware string |
| Iriq::Explanation | Per-segment {value, type, variable, hint} rows |
| Iriq::Cluster | One host + shape group, with examples & stats |
| Iriq::Clusterer | Many identifiers → Cluster set + explain |
| Iriq::PositionStats | Capped value/type frequencies for one position |
| Iriq::Observation | What Corpus#observe returns |
| Iriq::Corpus | Streaming observer with rolling aggregates + learning |
| Iriq::Extractor | Pulls IRIs out of free text (scheme-anchored) |

CLI

Installing the gem installs an iriq executable. Two main modes:

Single input — combined parse + normalize summary; trim with section flags (-p, -n).

$ iriq foo.com/users/456
# parse
original:      foo.com/users/456
kind:          url
scheme:        https
host:          foo.com
path_segments: ["users", "456"]
canonical:     https://foo.com/users/456

# normalize
https://foo.com/users/{user_id}

$ iriq -n https://foo.com/users/123
https://foo.com/users/{user_id}

Piped stdin — extraction runs by default. Output auto-switches: small inputs get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the cluster view via an ephemeral corpus. Section flags work too — emit one normalized URL / parsed record per extracted IRI.

$ cat short.txt | iriq
[2] https://github.com/dpep/iriq
[1] https://foo.com/users

$ cat short.txt | iriq -n                     # normalized URL per line
https://github.com/dpep/iriq
https://foo.com/users

$ cat access.log | iriq                       # ≥ 10 IRIs → cluster view
[190] docs.example.com  /users/{user_id}
[186] app.example.com   /users/{user_id}
...

$ cat README.md | iriq --stats                # rolling aggregates
$ cat README.md | iriq cluster                # force cluster view
$ cat README.md | iriq --corpus c.json        # persist into a corpus

--corpus PATH makes the corpus survive across invocations (atomic JSON file). Once it has data, -n becomes corpus-informed:

$ for n in alice bob carol dave erin frank gina hank ivan jane; do
    iriq --corpus c.json https://foo.com/users/$n/profile >/dev/null
  done

$ iriq -n --corpus c.json https://foo.com/users/zoe/profile
https://foo.com/users/{user}/profile         # mechanical would keep "zoe"
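
The README only states that the corpus file is an atomic JSON file. A common way to get that property is to write to a temporary file and rename it over the target; the sketch below shows that pattern. The helper names and the temp-file naming are assumptions, and this is not Iriq's actual persistence code.

```ruby
require "json"
require "tmpdir"

# Write JSON to a sibling temp file, then rename it over the target.
# On POSIX systems a rename within one filesystem replaces the target
# atomically, so readers never see a half-written file.
def atomic_write_json(path, data)
  tmp = "#{path}.#{Process.pid}.tmp"
  File.write(tmp, JSON.generate(data))
  File.rename(tmp, path)
end

# Load the file if it exists, otherwise start from an empty hash.
def load_json(path)
  File.exist?(path) ? JSON.parse(File.read(path)) : {}
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "c.json")
  state = load_json(path)                       # {} on first run
  state["observations"] = state.fetch("observations", 0) + 1
  atomic_write_json(path, state)
  p load_json(path)                             # {"observations"=>1}
end
```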

Flags:

| Flag | Effect |
| --- | --- |
| -p, --parse | Show parsed fields |
| -n, --normalize | Show the shape-normalized form |
| -j, --json | Emit JSON |
| -N, --no-hints | Use {integer_id} etc. instead of {user_id} |
| --no-scheme-less | Skip foo.com/path-style extraction (explicit-scheme only) |
| --corpus PATH | Load/create a JSON corpus at PATH; observe and save |
| --stats | Print rolling aggregates |
| -V, --version | Print version |

A positional argument that doesn't parse as an IRI but IS an existing file is read and extracted from automatically — iriq ./access.log and iriq /var/log/foo.log Just Work. (Bare filenames like README.md may still parse as a URL; pipe with cat to disambiguate.)

Exit codes: 0 success, 1 usage error, 2 parse error.

Performance

Measured on the deterministic IriGenerator fixture (Ruby 3.4.9, single thread):

| Operation | Throughput |
| --- | --- |
| Iriq.parse | ~260k URLs/s |
| Iriq.normalize | ~148k URLs/s |
| Iriq.explain | ~205k URLs/s |
| Iriq.extract (prose) | ~9.6 MB/s |
| Corpus#observe | ~80k URLs/s |
| Corpus save/load (10k) | ~135 ms |

Linear scaling holds through 100k observations; per-observation retained memory amortizes to ~100 bytes at that scale. Memoization caches are bounded by CACHE_MAX = 10_000 (cleared when full) — overhead is a few hundred KB regardless of corpus size.
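
A "cleared when full" memoization cache is simple to sketch. The BoundedMemo class below is hypothetical and only illustrates the bounding strategy named above; where Iriq applies its caches and how it keys them is not shown.

```ruby
# Memo cache that is emptied wholesale once it reaches its size limit.
# Cheaper than LRU bookkeeping, at the cost of occasional cold restarts.
class BoundedMemo
  def initialize(max)
    @max = max
    @cache = {}
  end

  # Return the cached value for key, computing it with the block on a miss.
  def fetch(key)
    return @cache[key] if @cache.key?(key)

    @cache.clear if @cache.size >= @max
    @cache[key] = yield(key)
  end

  def size
    @cache.size
  end
end

memo = BoundedMemo.new(2)
memo.fetch("a") { |k| k.upcase }
memo.fetch("b") { |k| k.upcase }
p memo.size                       # 2
memo.fetch("c") { |k| k.upcase }  # cache is full, so it is cleared first
p memo.size                       # 1
```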

Re-run anytime with:

bundle exec script/benchmark.rb       # throughput
bundle exec script/memory.rb          # retained memory + cache footprints

Limitations (intentional)

This is an MVP. Iriq does not:

  • Implement RFC 3986, RFC 3987, or the WHATWG URL standard fully.
  • Convert between Unicode (IRI) and punycode (URI) — the display form is preserved as-is.
  • Percent-encode or decode path/query bytes. Bytes are kept as written.
  • Validate scheme-specific structure beyond URL vs. URN.
  • Resolve relative references against a base URL.
  • Round-trip canonical back to the exact original byte-for-byte (whitespace is stripped, default ports are dropped, dot segments are collapsed).

For richer IRI handling, see addressable. Iriq's focus is the analysis side: classification, normalization, and clustering — not a complete URL implementation.


Go port

A Go implementation lives under go/ — same public surface, same behavior, ~10× faster CLI on extraction-heavy workloads. The Ruby gem is the reference; the Go port stays in sync via golden JSON fixtures (spec/fixtures/) and a CLI parity harness (script/cli_parity.sh), both checked in CI.

import "github.com/dpep/iriq/go/iriq"

iri, _ := iriq.Parse("https://foo.com/users/123")
norm, _ := iriq.Normalize("https://foo.com/users/123")
// "https://foo.com/users/{user_id}"

See go/README.md for the full API table and porting workflow.


Contributing

Yes please :)

  1. Fork it
  2. Create your feature branch (git checkout -b my-feature)
  3. Ensure the tests pass (bundle exec rspec)
  4. If you changed library behavior, port the change to Go (or open an issue) and regenerate fixtures: bundle exec ruby script/generate_fixtures.rb
  5. Commit your changes (git commit -am 'awesome new feature')
  6. Push your branch (git push origin my-feature)
  7. Create a Pull Request
