simhash

Small, production-quality Go module for fast 64-bit SimHash on HTML documents.

It includes:

  • Core streaming SimHash hasher (Hasher64)
  • Visible-text HTML tokenization (TokenizeHTMLText)
  • DOM-structure tokenization (TokenizeHTMLDOM)
  • Comparison helpers (Hamming64, Similarity64)

Install

go get github.com/evanleleux/simhash

Quickstart

package main

import (
	"fmt"
	"log"

	"github.com/evanleleux/simhash"
)

func main() {
	a := []byte(`<html><body><h1>Checkout</h1><p>Pay securely</p></body></html>`)
	b := []byte(`<html><body><h1>Checkout</h1><p>Pay securely today</p></body></html>`)

	// Fingerprint the visible text of each document.
	h1, err := simhash.FingerprintHTMLText64(a)
	if err != nil {
		log.Fatal(err)
	}
	h2, err := simhash.FingerprintHTMLText64(b)
	if err != nil {
		log.Fatal(err)
	}

	// Compare the two 64-bit fingerprints.
	fmt.Printf("h1=0x%016x h2=0x%016x\n", h1, h2)
	fmt.Printf("hamming=%d similarity=%.4f\n", simhash.Hamming64(h1, h2), simhash.Similarity64(h1, h2))
}
}

API

  • FingerprintTokens64(tokens TokenStream, opts ...Option) (uint64, error)
  • FingerprintHTMLText64(html []byte, opts ...Option) (uint64, error)
  • FingerprintHTMLDOM64(html []byte, opts ...Option) (uint64, error)
  • Hamming64(a, b uint64) int
  • Similarity64(a, b uint64) float64
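
For intuition about the two comparison helpers, here is a minimal standalone sketch using only the standard library. The lowercase names are illustrative and are not the package's code. In particular, defining similarity as 1 - distance/64 is an assumption; the package may normalize differently.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming64 counts the bit positions at which a and b differ.
func hamming64(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// similarity64 maps a Hamming distance onto [0, 1].
// Assumption: similarity = 1 - distance/64; the real helper may differ.
func similarity64(a, b uint64) float64 {
	return 1 - float64(hamming64(a, b))/64
}

func main() {
	a, b := uint64(0xFF00), uint64(0xFF0F)
	// The two values differ only in their low four bits.
	fmt.Printf("hamming=%d similarity=%.4f\n", hamming64(a, b), similarity64(a, b))
	// prints: hamming=4 similarity=0.9375
}
```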

Hasher64 supports streaming accumulation without building token slices:

h := simhash.NewHasher64()
h.AddStringToken("checkout", 1)
h.AddStringToken("payment", 1)
fp := h.Sum64()
_ = fp
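
To show why streaming accumulation needs no token slice, the sketch below keeps one signed counter per bit and folds each token in as it arrives. It is a generic SimHash accumulator, not the package's Hasher64: the sketchHasher type is hypothetical, and hash/fnv stands in for the token hash so the example needs only the standard library.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sketchHasher is a hypothetical stand-in for a streaming SimHash accumulator.
type sketchHasher struct {
	counts [64]int64 // signed vote per bit position
}

// add hashes the token, then adds weight to every set bit position
// and subtracts it from every clear one.
func (h *sketchHasher) add(token string, weight int64) {
	f := fnv.New64a() // assumption: FNV-1a as a stdlib stand-in hash
	f.Write([]byte(token))
	v := f.Sum64()
	for i := 0; i < 64; i++ {
		if v&(1<<uint(i)) != 0 {
			h.counts[i] += weight
		} else {
			h.counts[i] -= weight
		}
	}
}

// sum64 sets bit i of the fingerprint when the votes for position i are positive.
func (h *sketchHasher) sum64() uint64 {
	var fp uint64
	for i, c := range h.counts {
		if c > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

func main() {
	var h sketchHasher
	h.add("checkout", 1)
	h.add("payment", 1)
	fmt.Printf("0x%016x\n", h.sum64())
}
```

Because the state is just 64 counters, memory use is constant and the result does not depend on token order.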

HTML Tokenization Behavior

Visible text

  • Parses with golang.org/x/net/html (no regex parsing)
  • Ignores text under <script> and <style>
  • Best-effort hidden filtering (hidden, inline display:none, visibility:hidden) by default
  • Collapses whitespace by tokenizing into word tokens

DOM structure

  • Emits path tokens like html/body/div/form/input
  • Ignores attributes by default (stable across changing classes/IDs)
  • Uses configurable max depth (default 8)
  • Can focus on form tags with WithDOMFormOnly(true)
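
The path tokens described above can be illustrated with a small standalone sketch. It is not the package's tokenizer: encoding/xml in lenient mode stands in for golang.org/x/net/html so the example has no external dependencies, which means it only copes with reasonably well-formed markup. The domPaths helper is hypothetical, and skipping elements deeper than the limit (rather than truncating their paths) is an assumption about what the depth cap means.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// domPaths emits one slash-joined tag path per element and skips elements
// nested deeper than maxDepth. Attributes are ignored entirely.
func domPaths(markup string, maxDepth int) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(markup))
	dec.Strict = false                 // tolerate HTML-style markup
	dec.AutoClose = xml.HTMLAutoClose  // treat void elements such as <input> as closed
	dec.Entity = xml.HTMLEntity        // accept HTML named entities

	var stack, paths []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return paths, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			stack = append(stack, strings.ToLower(t.Name.Local))
			if len(stack) <= maxDepth {
				paths = append(paths, strings.Join(stack, "/"))
			}
		case xml.EndElement:
			if len(stack) > 0 {
				stack = stack[:len(stack)-1]
			}
		}
	}
}

func main() {
	doc := `<html><body><div class="a"><form><input/></form></div></body></html>`
	paths, err := domPaths(doc, 8)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

Changing the class attribute on the div leaves every emitted path unchanged, which is the stability property the list above refers to.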

Options

  • WithHashFunc(HashFunc64) to override hashing function
  • WithWeightFunc(WeightFunc) to override token weighting
  • WithMaxTextBytes(n int) to cap visible text bytes processed
  • WithDOMMaxDepth(depth int) to cap emitted DOM depth
  • WithIgnoreHidden(enabled bool) to toggle hidden-node filtering
  • WithLowercaseTags(enabled bool) to toggle lowercasing tag names
  • WithDOMFormOnly(enabled bool) to emit only form-related DOM paths
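
The opts ...Option parameters follow the shape of Go's functional-options pattern. The sketch below shows how such options typically compose. The config struct and the lowercase names are hypothetical, and the package's internals may look different. The defaults mirror what this README states: max depth 8, hidden filtering on, form-only off.

```go
package main

import "fmt"

// config is a hypothetical settings struct, not the package's real type.
type config struct {
	domMaxDepth  int
	ignoreHidden bool
	domFormOnly  bool
}

// option mutates a config; options are applied in the order given.
type option func(*config)

func withDOMMaxDepth(depth int) option {
	return func(c *config) { c.domMaxDepth = depth }
}

func withIgnoreHidden(enabled bool) option {
	return func(c *config) { c.ignoreHidden = enabled }
}

func withDOMFormOnly(enabled bool) option {
	return func(c *config) { c.domFormOnly = enabled }
}

// newConfig starts from the documented defaults and applies each option.
func newConfig(opts ...option) config {
	c := config{domMaxDepth: 8, ignoreHidden: true}
	for _, o := range opts {
		o(&c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", newConfig())
	fmt.Printf("%+v\n", newConfig(withDOMMaxDepth(4), withDOMFormOnly(true)))
}
```

With this pattern, callers that pass no options get the defaults, and a later option overrides an earlier one that sets the same field.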

Default token hash is github.com/cespare/xxhash/v2.

Threshold Guidance

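For 64-bit SimHash, near-duplicate detection often starts around Hamming distance <= 3, but this is dataset-dependent. For reference, two unrelated fingerprints differ in about 32 of 64 bits on average, so a useful threshold sits well below that. Tune thresholds on your corpus and objective (precision vs recall).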
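
One way to tune a threshold is to sweep candidate values over pairs you have labeled by hand and watch how precision and recall trade off. The sketch below assumes a hypothetical labeledPair record and uses synthetic fingerprints chosen only to make the arithmetic easy to follow. They are not measured data.

```go
package main

import (
	"fmt"
	"math/bits"
)

// labeledPair is a hypothetical evaluation record: two fingerprints plus a
// manual judgment of whether the underlying pages are near-duplicates.
type labeledPair struct {
	a, b uint64
	dup  bool
}

// precisionRecall treats pairs with Hamming distance <= k as predicted duplicates.
func precisionRecall(pairs []labeledPair, k int) (precision, recall float64) {
	var tp, fp, fn int
	for _, p := range pairs {
		pred := bits.OnesCount64(p.a^p.b) <= k
		switch {
		case pred && p.dup:
			tp++
		case pred && !p.dup:
			fp++
		case !pred && p.dup:
			fn++
		}
	}
	if tp+fp > 0 {
		precision = float64(tp) / float64(tp+fp)
	}
	if tp+fn > 0 {
		recall = float64(tp) / float64(tp+fn)
	}
	return precision, recall
}

func main() {
	// Synthetic fingerprints for illustration only.
	pairs := []labeledPair{
		{0x0, 0x1, true},     // distance 1
		{0x0, 0x7, true},     // distance 3
		{0x0, 0x3F, true},    // distance 6
		{0x0, 0x1F, false},   // distance 5
		{0x0, 0xFFFF, false}, // distance 16
	}
	for _, k := range []int{1, 3, 6} {
		p, r := precisionRecall(pairs, k)
		fmt.Printf("k=%d precision=%.2f recall=%.2f\n", k, p, r)
	}
}
```

In this toy data, raising k recovers more true duplicates but eventually admits a false positive, which is the trade-off to measure on a real corpus.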

Important Note

Do not shingle before SimHash in this workflow. Shingling is mainly useful for MinHash/Jaccard style similarity; this package is intended for direct token streams into SimHash.

Example Program

Run:

go run ./cmd/example [fileA.html fileB.html]

Without args, it compares:

  • https://evanleleux.dev/simhash/page-01 through https://evanleleux.dev/simhash/page-10

Default output includes per-page hashes plus adjacent-page similarity comparisons.

You can also pass two local file paths or URLs for direct pair comparison.
