README.txt




Usage: steno-similar [OPTIONS] DBFILE1 DBFILE2
Options:
  -m float
    	match threshold (0=no matching, 1=all ngrams matched) (default 0.4)
  -n int
    	ngram size (default 3)
  -s int
    	ignore articles shorter than this number of words (default 100)
  -v	verbose output
  -l string
    	language (en/ru/es)
eg:

  $ steno-similar -m 0.5 -v poop1.db poop2.db


Indexing algorithm:

1) initialise the ngram table (to hold a list of doc IDs for each possible ngram)

2) For each document in the source:

    a) lowercase text, remove punctuation, remove stopwords, apply stemming
      eg:
      doc 1: "National Socialism is a bit silly."
      doc 2: "Nihilists! Fuck me. I mean, say what you want about the tenets of National Socialism, Dude, at least it's an ethos."

      becomes:
      doc 1: "nation social silly"
      doc 2: "nihili fuck mean tenet nation social dud least ethos"

    b) split into ngrams
      eg (with ngramSize=2):
      doc 1: "nation social", "social silly"
      doc 2: "nihili fuck", "fuck mean", "mean tenet", "tenet nation", "nation social", "social dud", "dud least", "least ethos"

    c) go through each ngram and append the doc id to its slot in the ngram table.
       So, after indexing the two example docs above, the ngram table looks like this:

        "nation social"  => [1,2]
        "social silly" => [1]
        "nihili fuck"  => [2]
        "fuck mean"  => [2]
        "mean tenet"  => [2]
        "tenet nation"  => [2]
        "social dud"  => [2]
        "dud least"  => [2]
        "least ethos"  => [2]

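The indexing steps above can be sketched in Go roughly as follows. This is
only an illustration, not the code in main.go: the names ngrams, Index and
Add are made up for the example, the text is assumed to have already been
through step 2a (lowercased, stopwords removed, stemmed), and listing each
doc at most once per ngram is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams joins each run of n consecutive words into one space-separated key.
// It returns nil if there are fewer than n words.
func ngrams(words []string, n int) []string {
	if n <= 0 || len(words) < n {
		return nil
	}
	out := make([]string, 0, len(words)-n+1)
	for i := 0; i+n <= len(words); i++ {
		out = append(out, strings.Join(words[i:i+n], " "))
	}
	return out
}

// Index is a hypothetical ngram table: ngram => IDs of docs containing it.
type Index map[string][]int

// Add appends docID to the slot of every distinct ngram in grams.
func (idx Index) Add(docID int, grams []string) {
	seen := make(map[string]bool)
	for _, g := range grams {
		if seen[g] {
			continue // assumption: a doc is listed once per ngram
		}
		seen[g] = true
		idx[g] = append(idx[g], docID)
	}
}

func main() {
	idx := Index{}
	// Input is the already-normalised text from step 2a.
	idx.Add(1, ngrams(strings.Fields("nation social silly"), 2))
	idx.Add(2, ngrams(strings.Fields("nihili fuck mean tenet nation social dud least ethos"), 2))

	fmt.Println(idx["nation social"]) // [1 2]
	fmt.Println(idx["social silly"])  // [1]
}
```

Keeping each doc ID at most once per ngram means the matching stage can
count one hit per shared ngram without deduplicating again.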

Matching algorithm:

To match a document ("the query doc") against the index:

eg doc 3: "National socialism is fucking mean."

1) split up the document text into ngrams, as above.
   => doc 3: "nation social", "social fuck", "fuck mean"

2) get a list of _all_ the potential match docs which contain any of those ngrams
   =>
        "nation social"  => [1,2]
        "social fuck" => []  (empty - wasn't in original corpus)
        "fuck mean"  => [2]
        so the potential match docs are: [1,2]

3) for each potential match doc, calculate a match factor:

   a) count how many of the query doc ngrams appear in this potential match doc
   => docid 1:  1 match
   => docid 2:  2 matches

   b) divide by the number of ngrams in the query doc to get match factor
   => docid 1: 1/3 = 0.33333...
   => docid 2: 2/3 = 0.66666...

   c) discard any documents below the match threshold
   => with the default threshold of 0.4, docid 1 (0.33) is discarded
      and docid 2 (0.67) is kept

   d) perform a diff to compare the original text of the two documents
      (NOT the ngrams!) and display this in the HTML report

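The match factor calculation can be sketched in Go along these lines. Again
this is an illustration rather than the code in main.go: matchFactors and
aboveThreshold are hypothetical names, the index is assumed to list each doc
at most once per ngram, and docs exactly at the threshold are assumed to be
kept. The diff and HTML report step is not covered.

```go
package main

import (
	"fmt"
	"sort"
)

// matchFactors returns, for every doc that shares at least one ngram with
// the query, the fraction of the query's ngrams found in that doc.
func matchFactors(index map[string][]int, query []string) map[int]float64 {
	factors := make(map[int]float64)
	if len(query) == 0 {
		return factors
	}
	hits := make(map[int]int)
	for _, g := range query {
		for _, id := range index[g] { // ngrams not in the index add nothing
			hits[id]++
		}
	}
	for id, n := range hits {
		factors[id] = float64(n) / float64(len(query))
	}
	return factors
}

// aboveThreshold returns the doc IDs whose factor is at least threshold,
// in ascending order.
func aboveThreshold(factors map[int]float64, threshold float64) []int {
	var kept []int
	for id, f := range factors {
		if f >= threshold {
			kept = append(kept, id)
		}
	}
	sort.Ints(kept)
	return kept
}

func main() {
	// Only the index entries relevant to the doc 3 example are shown.
	index := map[string][]int{
		"nation social": {1, 2},
		"fuck mean":     {2},
	}
	query := []string{"nation social", "social fuck", "fuck mean"}

	factors := matchFactors(index, query)
	fmt.Printf("%.2f %.2f\n", factors[1], factors[2]) // 0.33 0.67
	fmt.Println(aboveThreshold(factors, 0.4))         // [2]
}
```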





