Souped-up versions of various rule-based Lucene/Elasticsearch token filters. Uber filters!

Part of a much bigger closed-source project, this version only provides database access for the various filter rules. If there is interest, I can add refreshable rules (which will only affect newly indexed content), S3 access, and more.

Currently, only Elasticsearch 5.5+ is supported. The main source is easily ported to other versions, but integration testing support improved with Elasticsearch 5.5.

Token filters

Token type name	Description
uber_keyword_marker	Keyword Marker Token Filter (lacks support for patterns)
uber_stemmer_override	Stemmer Override Token Filter
uber_stop	Stop Token Filter
uber_synonym	Synonym Token Filter

All filters are identical to their standard counterparts, but add a 'query' parameter. If the 'query' parameter is not provided, the token filter uses the standard parameters of its standard counterpart. Any SQL SELECT statement supported by your database can be used.

Building

There is no downloadable version of the plugin for two reasons:

It is difficult to release a plugin for each minor version of Elasticsearch. Elasticsearch only runs plugins built for its exact version.
Each database requires a different Java driver, so it would be impossible to include them all.

Edit the gradle.properties file to set the correct Elasticsearch version:

For Elasticsearch versions 5.5.0 through 5.5.2, please use the 5.5 branch.
For Elasticsearch versions 5.5.3 or 5.6+, please use the master/5.6 branch.

You can create the plugin jar either with a standard Java driver jar included, or without one if you wish to bundle the driver yourself. Certain drivers, such as the one for MS SQL Server, are not available in the standard Maven repositories.

Create a jar with no driver and no tests: gradle assemble
Create a jar with no driver and (non-database) tests: gradle build
Create a jar with the derby driver and full tests: gradle build -DdbType=test

You can add a driver jar at build time by using the dbType parameter:

gradle build -DdbType=mysql

Currently supported types are postgres, mysql and (networked) derby. The version is specified in the gradle.properties file. Pull requests welcome for other databases.

You can also add the database dependency directly via the dbDependency parameter:

gradle build -DdbDependency=mysql:mysql-connector-java:5.1.44

If you wish to add the Java driver jars manually, you can use any archiving tool on the resulting zip file. Elasticsearch does not use uberjars, so each jar can be added separately. Beware of jar hell issues. You have been warned!

Running gradle clean is highly recommended before doing integration tests.

Installation

After the jar has been built and any additional jars have been added manually, run the Elasticsearch plugin installer:

$ES_HOME/bin/elasticsearch-plugin install <path to zip file>

The zip file is usually found at $PLUGIN_HOME/build/distributions/uber-filters-1.0.zip.

You will be prompted to accept the additional security permissions required for database access.

Due to limitations stemming from the Elasticsearch security model, Java drivers are loaded via the Class.forName() method and cannot be loaded via JDBC4/DriverManager/ServiceLoader. Therefore, the name of the driver class must be specified in the Elasticsearch config. Only one database is allowed, so the settings are defined in the Elasticsearch config (elasticsearch.yml) and not in the token filter setup.

Required settings

uber_filters.jdbc.driver
uber_filters.jdbc.url

Example

uber_filters.jdbc.driver: "com.mysql.jdbc.Driver"
uber_filters.jdbc.url: "jdbc:mysql://localhost/test"
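
As a rough illustration of what explicit driver loading looks like in plain JDBC, the sketch below loads a driver class by name with Class.forName() and then opens a connection from a URL. It is not the plugin's actual code: the class JdbcSettingsSketch and the method open are hypothetical names, and the two arguments simply stand in for the values of uber_filters.jdbc.driver and uber_filters.jdbc.url.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper, for illustration only; not part of the plugin's API.
public class JdbcSettingsSketch {

    /**
     * Loads the driver class by name, then opens a connection to the URL.
     * driverClassName stands in for uber_filters.jdbc.driver,
     * url stands in for uber_filters.jdbc.url.
     */
    public static Connection open(String driverClassName, String url)
            throws ClassNotFoundException, SQLException {
        // Loading the class runs the driver's static initializer, which is
        // where JDBC drivers conventionally register with DriverManager.
        Class.forName(driverClassName);
        return DriverManager.getConnection(url);
    }
}
```

With the MySQL settings above, a call would look like JdbcSettingsSketch.open("com.mysql.jdbc.Driver", "jdbc:mysql://localhost/test"). If the driver jar was not bundled with the plugin, the call fails with a ClassNotFoundException.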

Examples

PUT /mytest
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "mystop": {
            "type": "uber_stop",
            "query": "select distinct stopword from stopwords"
          },
          "mykeyword": {
            "type": "uber_keyword_marker",
            "query": "select distinct keyword_marker from keyword_markers"
          },
          "mystemmeroverride": {
            "type": "uber_stemmer_override",
            "query": "select distinct stemmer_override from stemmer_overrides"
          },
          "mysynonym": {
            "type": "uber_synonym",
            "query": "select distinct synonym from synonyms"
          }          
        }        
      }
    }
  }
}

The standard parameters are used if the query is not specified or causes an exception:

PUT /mytest
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "mystop": {
            "type": "uber_stop"
          },
          "mysynonym": {
            "type": "uber_synonym",
            "query": "select distinct synonym from synonyms",
            "synonyms": [
              "i-pod, i pod => ipod",
              "universe, cosmos"
            ]
          }
        }
      }
    }
  }
}
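
The fallback behaviour described above can be sketched in a few lines of Java. This is a minimal illustration, not the plugin's implementation: RuleFallbackSketch, resolveRules and queryLoader are hypothetical names, and the idea that the query yields one rule per row is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper, for illustration only; not the plugin's actual code.
public class RuleFallbackSketch {

    /**
     * Returns the rules produced by queryLoader when a query is configured
     * and succeeds; otherwise returns the standard (inline) rules.
     */
    public static List<String> resolveRules(String query,
                                            Callable<List<String>> queryLoader,
                                            List<String> standardRules) {
        if (query == null || query.trim().isEmpty()) {
            // No 'query' parameter: behave like the standard filter.
            return standardRules;
        }
        try {
            // Assumption: the loader runs the SQL and returns one rule per row.
            return queryLoader.call();
        } catch (Exception e) {
            // Query failed: fall back to the standard parameters.
            return standardRules;
        }
    }

    public static void main(String[] args) {
        List<String> inline = Arrays.asList("i-pod, i pod => ipod", "universe, cosmos");
        // Simulate a database that is unavailable when the filter is created.
        List<String> rules = resolveRules(
                "select distinct synonym from synonyms",
                () -> { throw new java.sql.SQLException("database unavailable"); },
                inline);
        System.out.println(rules);
    }
}
```

In this sketch a failing query silently yields the inline synonyms, which mirrors the mysynonym filter above, where both 'query' and 'synonyms' are set.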

NOTICE

The database needs to be up and running with the correct content whenever the token filter is created. Creation can occur when:

a new index referencing the filter is created
a closed index is re-opened
the Elasticsearch node is restarted

Obviously, the database needs to be accessible from each Elasticsearch node. The flexibility of having the rules in a database does not come cheap! :)

TODO

Elasticsearch 6 support
jdbc.url should not be per node, but per token filter
support for synonym graph
keyword marker pattern support. Easy enough, but it might be easier to create a new type altogether (à la synonym graph)
configurable strict mode that prevents the filter from falling back to the standard rules if the SQL query fails
S3 support

Pull requests welcome.

brusic/uber-filters
