This little plugin uses the FSTSuggester from lucene to create suggestions from a certain field for a specified term instead of returning index data.
THIS IS NOT PRODUCTION READY! DO NOT USE IT.
This is my first attempt with elasticsearch. I am not too deep into elasticsearch internals, nor do I have deep knowledge of lucene. So please forgive this code.
Feel free to comment, improve and help. I am thankful for any insights, no matter whether they concern elasticsearch, lucene or any other mistakes I have surely made.
Oh and in case you have not read it above:
THIS IS NOT PRODUCTION READY! DO NOT USE IT.
In case you want to contact me, drop me a mail at alexander@reelsen.net
If you do not want to work on the repository, just use the standard elasticsearch plugin command (inside your elasticsearch/bin directory)
bin/plugin -url https://github.com/downloads/spinscale/elasticsearch-suggest-plugin/elasticsearch-suggest-0.0.4-0.19.0.zip -install suggest
If you want to work on the repository
- Clone this repo with git clone git://github.com/spinscale/elasticsearch-suggest-plugin.git
- Run:
gradle clean assemble zip
This does not run any unit tests, as they take some time. If you want to run them, run gradle clean build zip instead.
- Install the plugin:
/path/to/elasticsearch/bin/plugin -install elasticsearch-suggest -url file:///$PWD/build/distributions/elasticsearch-suggest-$version.zip
Fire up curl like this, in case you have a products index and the right fields. If not, read below how to set up a clean elasticsearch in order to support suggestions.
# curl -X POST 'localhost:9200/products1/product/_suggest?pretty=1' -d '{ "field": "ProductName.suggest", "term": "tischwäsche", "size": "10" }'
{
"suggest" : [ "tischwäsche", "tischwäsche 100",
"tischwäsche aberdeen", "tischwäsche acryl", "tischwäsche ambiente",
"tischwäsche aquarius", "tischwäsche atlanta", "tischwäsche atlas",
"tischwäsche augsburg", "tischwäsche aus", "tischwäsche austria" ]
}
As you can see, this queries the products index for the field ProductName.suggest with the specified term and size.
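If you would rather send the same request from plain Java without the plugin's client classes, the following is a minimal sketch. It assumes Java 11 or newer (for `java.net.http`), and the class and method names (`SuggestHttpSketch`, `quote`, `body`, `suggest`) are made up for this example and are not part of the plugin. The JSON escaping is deliberately minimal and only handles quotes and backslashes.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of the plugin: sends the same JSON body as the curl call.
final class SuggestHttpSketch {

    // Minimal JSON string escaping: handles quotes and backslashes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds the request body with the three parameters used in the curl example.
    static String body(String field, String term, int size) {
        return "{\"field\":" + quote(field)
                + ",\"term\":" + quote(term)
                + ",\"size\":" + size + "}";
    }

    // POSTs to <baseUrl>/<index>/<type>/_suggest and returns the raw JSON response.
    static String suggest(String baseUrl, String index, String type,
                          String field, String term, int size)
            throws IOException, InterruptedException {
        URI uri = URI.create(baseUrl + "/" + index + "/" + type + "/_suggest");
        HttpRequest request = HttpRequest.newBuilder(uri)
                .POST(HttpRequest.BodyPublishers.ofString(body(field, term, size)))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }
}
```

Calling `SuggestHttpSketch.suggest("http://localhost:9200", "products1", "product", "ProductName.suggest", "tischwäsche", 10)` should then be equivalent to the curl command, provided a node with the plugin is running locally.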
You might want to check out the included unit tests as well. I use a shingle filter in my examples; take a look at the files in the src/test/resources directory.
Furthermore, the suggest data is not updated whenever you index a new product, but only periodically. The default is to update the index every 10 minutes, but you can change that in your elasticsearch.yml configuration:
suggest:
  refresh_interval: 600s
In this case the suggest indexes are refreshed every 10 minutes, which is also the default. You can use values like “10s”, “10ms” or “10m”, as with most other time-based configuration settings in elasticsearch.
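To make the time value format a bit more concrete, here is a small sketch of how such strings map to milliseconds. This is not the parser elasticsearch uses internally. `RefreshIntervalSketch` and `toMillis` are hypothetical names, and the sketch only supports the suffixes ms, s, m and h.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; elasticsearch ships its own time value parsing.
final class RefreshIntervalSketch {

    // Converts values like "10ms", "10s", "10m" or "1h" to milliseconds.
    static long toMillis(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.length() < 2) {
            throw new IllegalArgumentException("Unsupported time value: " + value);
        }
        // "ms" has to be checked first, because it also ends with "s".
        if (v.endsWith("ms")) {
            return Long.parseLong(v.substring(0, v.length() - 2));
        }
        long amount = Long.parseLong(v.substring(0, v.length() - 1));
        switch (v.charAt(v.length() - 1)) {
            case 's': return TimeUnit.SECONDS.toMillis(amount);
            case 'm': return TimeUnit.MINUTES.toMillis(amount);
            case 'h': return TimeUnit.HOURS.toMillis(amount);
            default:
                throw new IllegalArgumentException("Unsupported time value: " + value);
        }
    }
}
```

With this reading, `600s` and `10m` describe the same interval of 600000 milliseconds.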
If you want to refresh your FST suggesters manually instead of waiting for 10 minutes just issue a POST request to the “/_suggestRefresh” URL.
# curl -X POST 'localhost:9200/_suggestRefresh'
Inject the NodeClientWithSuggest via Guice and use it
    private NodeClientWithSuggest client;

    @Inject
    public ConstructorOfYourClass(NodeClientWithSuggest client) {
        this.client = client;
    }

    public List<String> getMySuggestions(String term, String field, String index, Integer size, Float similarity) {
        // Build the request for the given index and fill in the query parameters
        SuggestRequest request = new SuggestRequest(index);
        request.term(term);
        request.field(field);
        request.size(size);
        request.similarity(similarity);

        // Execute the request and block until the response arrives
        SuggestResponse response = client.suggest(request).actionGet();
        return response.suggestions();
    }
Refresh works like this – there is no information in the response:
    NodesSuggestRefreshRequest refreshRequest = new NodesSuggestRefreshRequest();
    NodesSuggestRefreshResponse response = client.suggestRefresh(refreshRequest).actionGet();
You can also use the included builders
    List<String> suggestions = new SuggestRequestBuilder(client)
        .field(field)
        .term(term)
        .size(size)
        .similarity(similarity)
        .execute().actionGet().suggestions();

    SuggestRefreshRequestBuilder builder = new SuggestRefreshRequestBuilder(client);
    builder.execute().actionGet();
- Shay for giving feedback
- Ensure that the _all field can be suggested on as well, see spinscale#1
- Check Shay's response and incorporate comments
- Build the service as shard level service like the ShardGetService
- https://groups.google.com/group/elasticsearch/browse_thread/thread/7c8693c6640cbc49/d448e8880552e999?lnk=gst#d448e8880552e999
- Allow clearing of fst suggesters at least per index or per field
- Make it generally less hacky
- 2012-03-07: Updated to work with elasticsearch 0.19.0
- 2012-02-10: Created SuggestRequestBuilder and SuggestRefreshRequestBuilder classes – results in easy to use request classes (check the examples and tests)
- 2011-12-29: The refresh interval can now be chosen as time based value like any other elasticsearch configuration
- 2011-12-29: Instead of having all nodes sleeping the same time and updating the suggester asynchronously, the master node now triggers the update for all slaves
- 2011-12-20: Added transport action (and REST action) to trigger reloading of all FST suggesters
- 2011-12-11: Fixed the biggest issues: Searchers are released now and do not leak
- 2011-12-11: Indexing is now done periodically
- 2011-12-11: Found a way to get the injector from the node, so I can build my tests without using HTTP requests
This HOWTO will help you to set up a clean elasticsearch installation with the correct index settings and mappings, so you can use the plugin as easily as possible.
We will set up elasticsearch, index some products and query those for suggestions.
- Get elasticsearch, install it
- Get this plugin, install it
- Add a suggest and a lowercase analyzer to your elasticsearch/config/elasticsearch.yml config file
index:
  analysis:
    analyzer:
      lowercase_analyzer:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase]
      suggest_analyzer:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, shingle]
- Start elasticsearch
- Now a mapping has to be created. You can either create it via configuration in a file or during index creation. We will create an index with a mapping now
curl -X PUT localhost:9200/products -d '{
  "mappings" : {
    "product" : {
      "properties" : {
        "ProductId": { "type": "string", "index": "not_analyzed" },
        "ProductName" : {
          "type" : "multi_field",
          "fields" : {
            "ProductName": { "type": "string", "index": "not_analyzed" },
            "lowercase": { "type": "string", "analyzer": "lowercase_analyzer" },
            "suggest" : { "type": "string", "analyzer": "suggest_analyzer" }
          }
        }
      }
    }
  }
}'
- Now let's add some products
for i in 1 2 3 4 5 6 7 8 9 10 100 101 1000; do
  json=$(printf '{"ProductId": "%s", "ProductName": "%s" }' "$i" "My Product $i")
  curl -X PUT localhost:9200/products/product/$i -d "$json"
done
- Queries the not analyzed field, returns 10 matches (default), always the full product name:
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName", "term": "My" }'
- Queries the not analyzed field, returns nothing (because lowercase):
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName", "term": "my" }'
- Queries the lowercase field, returns only the occurring word (which is pretty bad for suggests):
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.lowercase", "term": "m" }'
- Queries the suggest field, returns two words (this is the default length of the shingle filter), in this case “my” and “my product”
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "my" }'
- Queries the suggest field, returns ten product names, since the term matches the second word, plus another entry due to the shingle
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "product" }'
- Queries the suggest field, returns all products with “product 1” in the shingle
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "product 1" }'
- The same query as above, but limits the result set to two
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "product 1", "size": 2 }'
- And last but not least, typo finding. The same query without the similarity parameter set returns nothing:
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "proudct", "similarity": 0.7 }'
The similarity is a float between 0.0 and 1.0. If it is not specified, 1.0 is used, which means the term must match exactly. I’ve found 0.7 ok for cases where two letters were exchanged, but your mileage may vary, as I tested merely on German product names.
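To get a feeling for what a threshold like 0.7 means, here is a rough sketch based on a normalised Levenshtein distance (1 minus the edit distance divided by the length of the longer word). Whether the plugin computes similarity in exactly this way is an assumption on my side, so treat the numbers as an illustration only. `SimilaritySketch` is a hypothetical class and not part of the plugin.

```java
// Hypothetical illustration; the plugin's actual similarity measure may differ.
final class SimilaritySketch {

    // Classic Levenshtein edit distance, computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalised similarity in [0.0, 1.0]; 1.0 means the strings are identical.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }
}
```

Under this assumption, “proudct” and “product” differ by two edits over seven letters, which gives a similarity of roughly 0.71 and therefore just passes a threshold of 0.7.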
With the tests I did, a shingle filter yielded the best results. Please check http://www.elasticsearch.org/guide/reference/index-modules/analysis/shingle-tokenfilter.html for more information about setup, like the default tokenization of two terms.
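If you have not worked with shingles before, the following plain Java sketch imitates what the suggest_analyzer roughly produces for a product name: every single word plus every pair of neighbouring words. It is not Lucene's ShingleFilter. `ShingleSketch` is a hypothetical class, and it assumes lowercasing, whitespace tokenization and a maximum shingle size of two with single words kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of shingling; the real work is done by Lucene's analysis chain.
final class ShingleSketch {

    // Returns each word followed by the shingles starting at that word,
    // up to maxShingleSize words per shingle.
    static List<String> shingles(String text, int maxShingleSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int len = 1; len <= maxShingleSize && start + len <= words.length; len++) {
                if (len > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + len - 1]);
                out.add(sb.toString());
            }
        }
        return out;
    }
}
```

For “My Product 1” and a maximum size of two this yields my, my product, product, product 1 and 1, which is why a term like “product 1” can be suggested even though it does not start the product name.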
Now test with your data and come up with improvements to this configuration. I am happy to hear about your specific configuration for successful suggestion queries.