A custom Lucene analyzer that allows multiple tokens to be indexed for a single term. It supports storing terms with mixed-case letters and terms with punctuation. In theory, this should produce more accurate results, because it lets Lucene perform a more exact search.
Released under the MIT License.
$ git clone https://github.com/eXist-db/exist.git
$ cd exist
$ git checkout master
$ mvn -DskipTests package
We will refer to the eXist-db directory as $EXIST_HOME.
You can set it using:
Linux/macOS:
$ export EXIST_HOME=/your/path/to/eXist-db
Windows:
$ set EXIST_HOME=C:\your\path\to\eXist-db
Copy the analyzer jar into the lib directory of the eXist-db distribution.
Linux/macOS:
$ cp PLMultiAnalyzer-1.0.0-SNAPSHOT.jar $EXIST_HOME/exist-distribution/target/exist-distribution-[version]-dir/lib
Windows:
$ copy PLMultiAnalyzer-1.0.0-SNAPSHOT.jar %EXIST_HOME%\exist-distribution\target\exist-distribution-[version]-dir\lib
In $EXIST_HOME/exist-distribution/target/exist-distribution-[version]-dir/etc/startup.xml, add the following to the dependencies:
<dependencies>
    ... <!-- other dependencies -->
    <dependency>
        <groupId>com.evolvedbinary.lucene.analyzer</groupId>
        <artifactId>ohAnalyzer</artifactId>
        <version>1.0.0-SNAPSHOT</version>
        <relativePath>PLMultiAnalyzer-1.0.0-SNAPSHOT.jar</relativePath> <!-- must exactly match the name of the jar in the lib folder -->
    </dependency>
    ... <!-- other dependencies -->
</dependencies>
Run the startup script.
Linux/macOS:
$ $EXIST_HOME/exist-distribution/target/exist-distribution-[version]-dir/bin/startup.sh
Windows:
$ %EXIST_HOME%\exist-distribution\target\exist-distribution-[version]-dir\bin\startup.bat
When creating the index configuration, specify the analyzer class as com.evolvedbinary.lucene.analyzer.OhAnalyzer.
The analyzer takes two parameters:
minimumTermLength: the minimum length of any decomposed term; shorter decomposed terms are discarded. Set to 0 to indicate no minimum.
punctuationDictionary: the dictionary of punctuation characters to use for decomposition.
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:wiki="http://exist-db.org/xquery/wiki" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:atom="http://www.w3.org/2005/Atom">
        <!-- Lucene index is configured below -->
        <lucene>
            <analyzer class="com.evolvedbinary.lucene.analyzer.OhAnalyzer">
                <param name="punctuationDictionary" type="char[]">
                    <value>'</value>
                    <value>-</value>
                    <value>’</value>
                </param>
                <param name="minimumTermLength" type="int" value="2" />
            </analyzer>
            <text qname="doc"/>
        </lucene>
    </index>
</collection>
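To make the two parameters more concrete, here is a rough Java sketch of how a term might be decomposed at the configured punctuation characters. It is an illustration under stated assumptions, not the analyzer's actual code: the class DecompositionSketch and the method decompose are hypothetical names, and the sketch assumes that the full term is kept alongside its decomposed pieces. Case handling is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the source of OhAnalyzer.
final class DecompositionSketch {

    // Returns the original term followed by the pieces obtained by splitting it
    // at any character in punctuationDictionary. Pieces shorter than
    // minimumTermLength are discarded; 0 means no minimum.
    static List<String> decompose(String term, char[] punctuationDictionary, int minimumTermLength) {
        List<String> tokens = new ArrayList<>();
        tokens.add(term); // assumption: the full term is kept with case and punctuation intact
        StringBuilder piece = new StringBuilder();
        boolean split = false;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (isPunctuation(c, punctuationDictionary)) {
                split = true;
                addPiece(tokens, piece, minimumTermLength);
            } else {
                piece.append(c);
            }
        }
        if (split) {
            // Flush the trailing piece only if the term was actually split.
            addPiece(tokens, piece, minimumTermLength);
        }
        return tokens;
    }

    private static boolean isPunctuation(char c, char[] punctuationDictionary) {
        for (char p : punctuationDictionary) {
            if (p == c) {
                return true;
            }
        }
        return false;
    }

    // Adds the current piece if it is non-empty and long enough, then resets the buffer.
    private static void addPiece(List<String> tokens, StringBuilder piece, int minimumTermLength) {
        if (piece.length() > 0 && piece.length() >= minimumTermLength) {
            tokens.add(piece.toString());
        }
        piece.setLength(0);
    }

    public static void main(String[] args) {
        // Same dictionary as the configuration above: ' - and the right single quote (U+2019).
        char[] dictionary = {'\'', '-', '\u2019'};
        System.out.println(decompose("O'Neil-Smith", dictionary, 2)); // prints [O'Neil-Smith, Neil, Smith]
        System.out.println(decompose("O'Neil-Smith", dictionary, 0)); // prints [O'Neil-Smith, O, Neil, Smith]
    }
}
```

With minimumTermLength set to 2, as in the configuration above, the one-character piece "O" is dropped. With 0, every non-empty piece is kept.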