Skip to content

dainiusjocas/lucene-text-analysis

Repository files navigation

Clojars Project cljdoc badge Tests

lucene-text-analysis

Library to inspect the output of the Lucene text analysis pipeline.

Supports 3 ways of analyzing text:

  • string to list of strings;
  • String to list of tokens (similar to the Elasticsearch/Opensearch _analyze API);
  • string to GraphViz program to draw a Lucene TokenStream as a graph.

Quickstart

Dependencies:

{:deps {lt.jocas/lucene-text-analysis {:mvn/version "1.0.21"}}}

Code:

(require '[lucene.custom.text-analysis :as analysis])

(analysis/text->token-strings "Test TEXT")
;; => ["test" "text"]

(analysis/text->tokens "Test TEXT")
;; => 
[#lucene.custom.text_analysis.TokenRecord{:token "test",
                                          :type "<ALPHANUM>",
                                          :start_offset 0,
                                          :end_offset 4,
                                          :position 0,
                                          :positionLength 1}
 #lucene.custom.text_analysis.TokenRecord{:token "text",
                                          :type "<ALPHANUM>",
                                          :start_offset 5,
                                          :end_offset 9,
                                          :position 1,
                                          :positionLength 1}]

(analysis/text->graph "Test TEXT")
;; =>
"digraph tokens {
   graph [ fontsize=30 labelloc=\"t\" label=\"\" splines=true overlap=false rankdir = \"LR\" ];
   // A2 paper size
   size = \"34.4,16.5\";
   edge [ fontname=\"Helvetica\" fontcolor=\"red\" color=\"#606060\" ]
   node [ style=\"filled\" fillcolor=\"#e8e8f0\" shape=\"Mrecord\" fontname=\"Helvetica\" ]
 
   0 [label=\"0\"]
   -1 [shape=point color=white]
   -1 -> 0 []
   0 -> 1 [ label=\"test / Test\"]
   1 [label=\"1\"]
   1 -> 2 [ label=\"text / TEXT\"]
   -2 [shape=point color=white]
   2 -> -2 []
 }
 "

Every function accepts a Lucene Analyzer as the second argument.

Use cases

  • Do ASCII folding person names:

With helper library:

lt.jocas/lucene-custom-analyzer {:mvn/version "1.0.14"}
(require '[lucene.custom.analyzer :as custom-analyzer])

(lucene.custom.text-analysis/text->token-strings 
  "Thomas Müller" 
  (custom-analyzer/create {:token-filters [{:asciiFolding {}}]}))
;; => ["Thomas" "Muller"]

How to draw a graph image?

The example assumes that the GraphViz dot program is installed:

clojure -M --eval '(require `lucene.custom.text-analysis)(println (lucene.custom.text-analysis/text->graph "one two three"))' | dot -Tpng -o docs/assets/images/token-graph.png

Results in an image

Token Graph

Development

Compile Java classes:

clojure -T:build compile-java

Start your REPL.

License

Copyright © 2023 Dainius Jocas.

Distributed under The Apache License, Version 2.0.

About

(Micro)Library to inspect the output of the Lucene analyzers.

Topics

Resources

License

Stars

Watchers

Forks

Sponsor this project

 

Packages

No packages published