delete the existing index

In [12]:
curl -XDELETE 'localhost:9200/my_index?pretty'

{
  "acknowledged" : true
}


create the token filter and analyzer

In [2]:
curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
    "settings": {
        "number_of_shards": 1, 
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}
'

{
  "acknowledged" : true,
  "shards_acknowledged" : true
}
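
to make the filter concrete, here is a minimal shell sketch of what an edge n-gram filter with min_gram 1 and max_gram 20 does to each token. the function names `edge_ngrams` and `autocomplete_tokens` are made up for illustration, and splitting on whitespace is only a rough stand-in for the standard tokenizer (which also handles punctuation and unicode word boundaries), so treat this as an approximation rather than what Elasticsearch runs internally.

```shell
# Emit the edge n-grams (prefixes) of one token, shortest first.
# usage: edge_ngrams TOKEN [MIN_GRAM] [MAX_GRAM]
edge_ngrams() {
    token=$1
    min=${2:-1}
    max=${3:-20}
    # never ask for a prefix longer than the token itself
    if [ "${#token}" -lt "$max" ]; then max=${#token}; fi
    i=$min
    while [ "$i" -le "$max" ]; do
        printf '%s\n' "$token" | cut -c1-"$i"
        i=$((i + 1))
    done
}

# Rough stand-in for the "autocomplete" analyzer (an assumption, not the
# real implementation): lowercase, split on whitespace, expand each word.
autocomplete_tokens() {
    for word in $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]'); do
        edge_ngrams "$word" 1 20
    done
}

autocomplete_tokens "quick brown"
```

this should print q, qu, qui, quic, quick, b, br, bro, brow, brown, one per line, which can be compared with the tokens that the _analyze call returns.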


test the new analyzer

In [3]:
curl -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}
'

{
  "tokens" : [
    {
      "token" : "q",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "qu",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "qui",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quic",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "b",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "br",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
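    {
      "token" : "bro",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brow",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    }
  ]
}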

apply the analyzer to a field

In [4]:
curl -XPUT 'localhost:9200/my_index/_mapping/my_type?pretty' -H 'Content-Type: application/json' -d'
{
    "my_type": {
        "properties": {
            "name": {
                "type":     "string",
                "analyzer": "autocomplete"
            }
        }
    }
}
'

{
  "acknowledged" : true
}


add some test data

In [5]:
curl -XPOST 'localhost:9200/my_index/my_type/_bulk?pretty' -H 'Content-Type: application/json' -d'
{ "index": { "_id": 1            }}
{ "name": "Brown foxes"    }
{ "index": { "_id": 2            }}
{ "name": "Yellow furballs" }
'

{
  "took" : 122,
  "errors" : false,
  "items" : [
    {
      "index" : {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "1",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 2,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "2",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 2,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    }
  ]
}


use the match query to test it out

In [6]:
curl -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}
'

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.7102298,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "1",
        "_score" : 1.7102298,
        "_source" : {
          "name" : "Brown foxes"
        }
      },
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "2",
        "_score" : 0.2688388,
        "_source" : {
          "name" : "Yellow furballs"
        }
      }
    ]
  }
}


why was "Yellow furballs" returned? use the validate-query API with explain to shed some light

In [7]:
curl -XGET 'localhost:9200/my_index/my_type/_validate/query?explain&pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}
'

{
  "valid" : true,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "explanations" : [
    {
      "index" : "my_index",
      "valid" : true,
      "explanation" : "+(Synonym(name:b name:br name:bro name:brow name:brown) Synonym(name:f name:fo)) #(#_type:my_type)"
    }
  ]
}


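it's because of the letter f! the query string is run through the autocomplete analyzer too, so "fo" is expanded into f and fo, and f is also one of the edge n-grams indexed for "furballs". what we really want is to match only the full terms the user has entered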

to do so, we'll want to change the analyzer at search time (indexing will still behave the same)

In [8]:
curl -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {
            "name": {
                "query":    "brown fo",
                "analyzer": "standard" 
            }
        }
    }
}
'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 2.044134,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "1",
        "_score" : 2.044134,
        "_source" : {
          "name" : "Brown foxes"
        }
      }
    ]
  }
}
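
the difference between the two search-time analyzers can be checked by hand. the sketch below is only an approximation under stated assumptions: `prefixes`, `analyze` and `shared_terms` are hypothetical helper names, both analyzers are reduced to "lowercase and split on whitespace", and only the autocomplete one expands each word into its prefixes. since the match query combines terms with OR by default, a single shared term is enough for a document to be returned.

```shell
# prefixes of a single word, one per line (edge n-grams with min_gram 1)
prefixes() {
    i=1
    while [ "$i" -le "${#1}" ]; do
        printf '%s\n' "$1" | cut -c1-"$i"
        i=$((i + 1))
    done
}

# approximate analyzers: both lowercase and split on whitespace;
# only "autocomplete" expands each word into its prefixes
# usage: analyze autocomplete|standard "some text"
analyze() {
    for w in $(printf '%s' "$2" | tr '[:upper:]' '[:lower:]'); do
        if [ "$1" = autocomplete ]; then
            prefixes "$w"
        else
            printf '%s\n' "$w"
        fi
    done
}

# terms shared by the query (analyzed with $1) and the document
# (always indexed with the autocomplete analyzer)
# usage: shared_terms SEARCH_ANALYZER "query" "document"
shared_terms() {
    for q in $(analyze "$1" "$2"); do
        analyze autocomplete "$3" | grep -Fx -- "$q" || true
    done
}

shared_terms autocomplete "brown fo" "Yellow furballs"   # prints: f
shared_terms standard     "brown fo" "Yellow furballs"   # prints nothing
shared_terms standard     "brown fo" "Brown foxes"       # prints: brown, fo
```

with the autocomplete analyzer on the query side, the lone prefix f is what pulls in "Yellow furballs"; with the standard analyzer the query terms stay brown and fo, and only "Brown foxes" has both among its indexed prefixes.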


we can also set the search analyzer in the mapping, so that it does not have to be specified in every query

In [9]:
curl -XPUT 'localhost:9200/my_index/my_type/_mapping?pretty' -H 'Content-Type: application/json' -d'
{
    "my_type": {
        "properties": {
            "name": {
                "type":            "string",
                "analyzer":  "autocomplete", 
                "search_analyzer": "standard" 
            }
        }
    }
}
'

{
  "acknowledged" : true
}


now let's check our query

In [10]:
curl -XGET 'localhost:9200/my_index/my_type/_validate/query?explain&pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}
'

{
  "valid" : true,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "explanations" : [
    {
      "index" : "my_index",
      "valid" : true,
      "explanation" : "+(name:brown name:fo) #(#_type:my_type)"
    }
  ]
}


to make search completion even more efficient, one can use a finite state transducer (a big graph of the possible completions), which is what Elasticsearch's completion suggester is built on

for the postcode example, one could use the keyword tokenizer to turn the whole string into a single token (as if it were not analyzed), and then apply the edge n-gram token filter at index time (delete the index first)

In [13]:
curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
    "settings": {
        "analysis": {
            "filter": {
                "postcode_filter": {
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 8
                }
            },
            "analyzer": {
                "postcode_index": { 
                    "tokenizer": "keyword",
                    "filter":    [ "postcode_filter" ]
                },
                "postcode_search": { 
                    "tokenizer": "keyword"
                }
            }
        }
    }
}
'

{
  "acknowledged" : true,
  "shards_acknowledged" : true
}
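
as a rough illustration of what these two analyzers produce, here is a minimal shell sketch. the helper names and the sample postcode "W1V 3DG" are assumptions made for the example, not something taken from the index above: `postcode_index_tokens` mimics the keyword tokenizer followed by an edge n-gram filter capped at 8 characters, and `postcode_search_token` mimics the keyword tokenizer alone.

```shell
# index side: the whole input is ONE token (keyword tokenizer), which is
# then expanded into its prefixes, capped at MAX_GRAM characters
# usage: postcode_index_tokens "POSTCODE" [MAX_GRAM]
postcode_index_tokens() {
    max=${2:-8}
    if [ "${#1}" -lt "$max" ]; then max=${#1}; fi
    i=1
    while [ "$i" -le "$max" ]; do
        printf '%s\n' "$1" | cut -c1-"$i"
        i=$((i + 1))
    done
}

# search side: the input is emitted unchanged as a single token
postcode_search_token() {
    printf '%s\n' "$1"
}

# hypothetical postcode; note that the space is kept inside the prefixes
postcode_index_tokens "W1V 3DG"

# a partial entry matches when it equals one of the indexed prefixes
postcode_index_tokens "W1V 3DG" \
    | grep -Fxq -- "$(postcode_search_token "W1V 3")" && echo match
```

because neither analyzer has a lowercase filter, matching in this setup is case-sensitive, and anything typed beyond 8 characters would not match any indexed prefix.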
