<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Tuning-Analyzer-to-Improve-Recall" data-toc-modified-id="Tuning-Analyzer-to-Improve-Recall-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Tuning Analyzer to Improve Recall</a></span></li><li><span><a href="#Example-of-Creating-Index-that-uses-English-Analyzer" data-toc-modified-id="Example-of-Creating-Index-that-uses-English-Analyzer-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Example of Creating Index that uses English Analyzer</a></span></li><li><span><a href="#Dealing-with-Delimiters-(Acronyms-&amp;-Phone-Numbers)" data-toc-modified-id="Dealing-with-Delimiters-(Acronyms-&amp;-Phone-Numbers)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Dealing with Delimiters (Acronyms &amp; Phone Numbers)</a></span></li><li><span><a href="#Capturing-Meaning-and-Modeling-Specificity-with-Synonyms" data-toc-modified-id="Capturing-Meaning-and-Modeling-Specificity-with-Synonyms-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Capturing Meaning and Modeling Specificity with Synonyms</a></span></li><li><span><a href="#Modeling-Specificity-with-Paths" data-toc-modified-id="Modeling-Specificity-with-Paths-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Modeling Specificity with Paths</a></span></li><li><span><a href="#Tokenize-the-World" data-toc-modified-id="Tokenize-the-World-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Tokenize the World</a></span></li></ul></div>

Feature creation happens on both the query and the document. When analysis is performed properly, it can greatly improve the relevance of our search results. Note that meaningful features don't have to be text; they can also be geographic locations, images, etc.


There are three major stages to an analyzer:

- Character filtering: This gives us the chance to modify the entire piece of text, e.g. stripping HTML tags.
- Tokenization: This step chops the original text into a stream of tokens, e.g. by splitting on whitespace.
- Token filtering: This step modifies the token stream by changing, removing or inserting tokens, e.g. stemming or removing stop words.
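
To make the three stages concrete, here is a minimal pure-Python sketch of an analyzer. It is not how Elasticsearch implements analysis; the function names and the tiny stop word list are made up for illustration.

```python
import re

# tiny illustrative stop word list (an assumption, not the built-in one)
STOP_WORDS = {"a", "an", "and", "or", "the", "to"}


def char_filter(text):
    """Character filtering: replace anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenization: chop the text into tokens on whitespace."""
    return text.split()


def token_filter(tokens):
    """Token filtering: lowercase every token, then drop stop words."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in STOP_WORDS]


def analyze(text):
    # the stages always run in this order
    return token_filter(tokenize(char_filter(text)))


print(analyze("<b>Love</b> the Bomb"))
```

The same function has to be applied to both the query and the document, otherwise the two sides produce features that cannot match.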

## Tuning Analyzer to Improve Recall

In [1]:
import json
import requests


settings = {
    'settings': {
        'analysis': {
            'analyzer': {
                # create an analyzer called "standard_clone"
                'standard_clone': {
                    'tokenizer': 'standard',
                    'filter': ['lowercase', 'stop']
                }
            }
        }
    }
}
headers = {'Content-Type': 'application/json'}


requests.delete("http://localhost:9200/my_library")
requests.put("http://localhost:9200/my_library", data=json.dumps(settings), headers=headers)

<Response [200]>

In [2]:
data = {
    'analyzer': 'standard_clone',
    'text': 'Dr. Strangelove: Or How I Learned to Stop Worrying and Love the Bomb'
}
response = requests.get('http://localhost:9200/my_library/_analyze', 
                        data=json.dumps(data), headers=headers)

# apart from the token, it also returns information such as offset and position
# but here we are only interested in the token
result = json.loads(response.text)
tokens1 = set(token['token'] for token in result['tokens'])
tokens1

{'bomb',
 'dr',
 'how',
 'i',
 'learned',
 'love',
 'stop',
 'strangelove',
 'worrying'}

In [3]:
# no match between the two queries
data = {
    'analyzer': 'standard_clone',
    'text': "mr. weirdlove: don't worry, I'm learning to start loving bombs"
}
response = requests.get('http://localhost:9200/my_library/_analyze', 
                        data=json.dumps(data), headers=headers)

result = json.loads(response.text)
tokens2 = set(token['token'] for token in result['tokens'])
tokens2 & tokens1

set()

Using the standard analyzer can lead to high precision but poor recall, because the user's query must use exactly the same words that occur in the document. Let's see how we can avoid that.

In [18]:
settings = {
    'settings': {
        'analysis': {

            # syntax for accessing the available stemmers
            # https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html
            'filter': {
                # normalize the words, e.g. walking, walked -> walk
                # the stemmer uses heuristics to map words to their root form.
                # Stemming is often desirable as it makes words represent
                # their meaning by collapsing multiple representations of the same word
                # into a single form. By performing stemming, our search engine's recall
                # usually improves by quite a bit,
                # but there can be times when the heuristics lead to undesirable
                # results, e.g. the word Maine (a state in the U.S.) will get normalized
                # to "main". We can avoid this by specifying a list of protected keywords
                'english_stemmer': {
                    'type': 'stemmer',
                    'name': 'english'
                },
                # removes the trailing possessive 's from words, e.g. bob's -> bob
                'english_possessive_stemmer': {
                    'type': 'stemmer',
                    'name': 'possessive_english'
                },
                # remove the built-in list of english stop words,
                # '_english_' is a keyword that selects the built-in list
                'english_stop': {
                    'type': 'stop',
                    'stopwords': '_english_'
                },
                # protect words from being modified by downstream stemmer
                # https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html
                'english_keywords': {
                    'type': 'keyword_marker',
                    'keywords': ['maine']
                }
            },

            'analyzer': {
                'english_clone': {
                    'type': 'custom',
                    'tokenizer': 'standard',
                    'filter': [
                        'lowercase',
                        'english_possessive_stemmer',
                        'english_stop',
                        'english_keywords',  # note the ordering matters, define the keyword before stemming
                        'english_stemmer'
                    ]
                }
            }
        }
    }
}
requests.delete("http://localhost:9200/my_library")
requests.put("http://localhost:9200/my_library", data=json.dumps(settings), headers=headers)

<Response [200]>

In [19]:
# check if the stemming worked
data = {
    'analyzer': 'english_clone',
    'text': 'flowers flower flowered flower'
}

response = requests.get('http://localhost:9200/my_library/_analyze', 
                        data=json.dumps(data), headers=headers)

result = json.loads(response.text)
','.join('[' + token['token'] + ']' for token in result['tokens'])

'[flower],[flower],[flower],[flower]'

In [20]:
# check to see the keywords are protected
data = {
    'analyzer': 'english_clone',
    'text': 'maine main'
}

response = requests.get('http://localhost:9200/my_library/_analyze', 
                        data=json.dumps(data), headers=headers)

result = json.loads(response.text)
set(token['token'] for token in result['tokens'])

{'main', 'maine'}
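
The reason the ordering matters can be sketched in a few lines of plain Python. The `toy_stem` function below is a hypothetical stand-in for a real stemmer (it just strips one common suffix), and the "marking" step only imitates what `keyword_marker` does conceptually: it flags tokens so that the stemmer that runs afterwards leaves them alone.

```python
def toy_stem(token):
    """Hypothetical stand-in for a stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text, protected=frozenset()):
    tokens = text.lower().split()
    # keyword marking step: flag protected tokens before stemming
    marked = [(token, token in protected) for token in tokens]
    # stemming step: flagged tokens pass through untouched
    return [token if is_keyword else toy_stem(token)
            for token, is_keyword in marked]


# without protection both words collapse into the same token
print(analyze("maine main"))
# with protection "maine" survives the stemmer
print(analyze("maine main", protected={"maine"}))
```

If the marking step ran after the stemmer, "maine" would already have been rewritten by the time we tried to protect it.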

In [21]:
# now, check to see if there's a match between the two queries
data = {
    'analyzer': 'english_clone',
    'text': 'Dr. Strangelove: Or How I Learned to Stop Worrying and Love the Bomb'
}
response = requests.get('http://localhost:9200/my_library/_analyze', 
                        data=json.dumps(data), headers=headers)

# apart from the token, it also returns information such as offset and position
# but here we are only interested in the token
result = json.loads(response.text)
tokens1 = set(token['token'] for token in result['tokens'])

data = {
    'analyzer': 'english_clone',
    'text': "mr. weirdlove: don't worry, I'm learning to start loving bombs"
}
response = requests.get('http://localhost:9200/my_library/_analyze', 
                        data=json.dumps(data), headers=headers)

result = json.loads(response.text)
tokens2 = set(token['token'] for token in result['tokens'])
tokens2 & tokens1

{'bomb', 'learn', 'love', 'worri'}

## Example of Creating Index that uses English Analyzer

In [22]:
requests.delete('http://localhost:9200/my_library')
settings = {
    'settings': {
        'number_of_shards': 1,
        'number_of_replicas': 1,
        'index': {
            'analysis': {
                'analyzer': {
                    'default': {
                        'type': 'english'
                    }
                }
            }
        }
    }
}

headers = {'Content-Type': 'application/json'}
response = requests.put('http://localhost:9200/my_library', data=json.dumps(settings), headers=headers)
response

<Response [200]>

In [23]:
# index some sample documents,
# some documents are very much about apples,
# some are apple-ish, while some rarely mention them
documents = [
    {'title': 'apples apple'},
    {'title': 'apple apple apple apple apple'},
    {'title': 'apple apple apple banana banana'},
    {'title': 'apple banana blueberry coconut'}
]

for idx, document in enumerate(documents):
    url = 'http://localhost:9200/my_library/_doc/%s' % (idx + 1)
    response = requests.put(url, data=json.dumps(document), headers=headers)

In [24]:
def search(query):
    url = 'http://localhost:9200/my_library/_search'
    response = requests.get(url, data=json.dumps(query), headers=headers)
    search_hits = json.loads(response.text)['hits']

    print('Num\tRelevance Score\tTitle')
    for idx, hit in enumerate(search_hits['hits']):
        print('%s\t%s\t%s' % (idx + 1, hit['_score'], hit['_source']['title']))


user_search = 'apple banana'
query = {
    'query': {
        'match': {
            'title': user_search
        }
    }
}
search(query)

Num	Relevance Score	Title
1	1.0476142	apple apple apple banana banana
2	0.7985077	apple banana blueberry coconut
3	0.18038376	apple apple apple apple apple
4	0.16857682	apples apple


## Dealing with Delimiters (Acronyms & Phone Numbers)

In [25]:
# acronyms are a case where handling delimiters inappropriately can
# lead to poor results,
# e.g. I.B.M versus IBM
# our analyzer should normalize the various forms of an acronym so that
# the resulting tokens remain the same

# word_delimiter splits words into subwords and performs
# optional transformations on the subword groups
# https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
settings = {
    'settings': {
        'analysis': {
            'filter': {
                'acronyms': {
                    'type': 'word_delimiter',
                    'generate_word_parts': False,
                    'generate_number_parts': False,
                    'catenate_all': True,
                    # preserve_original guards against cases
                    # where acronyms collide with actual english words
                    # e.g. N.E.W -> new
                    'preserve_original': True
                }
            },
            'analyzer': {
                'standard_with_acronyms': {
                    'type': 'custom',
                    'tokenizer': 'standard',
                    'filter': ['lowercase', 'acronyms']
                }
            }
        }
    }
}
requests.delete('http://localhost:9200/my_library')
requests.put('http://localhost:9200/my_library', data=json.dumps(settings), headers=headers)

<Response [200]>

In [26]:
data = {
    'analyzer': 'standard_with_acronyms',
    'text': 'I.B.M. IBM ibm'
}
response = requests.get('http://localhost:9200/my_library/_analyze', 
                        data=json.dumps(data), headers=headers)

# the resulting tokens include both i.b.m and ibm,
# i.b.m is kept because we set preserve_original to True
result = json.loads(response.text)
','.join('[' + token['token'] + ']' for token in result['tokens'])

'[i.b.m],[ibm],[ibm],[ibm]'
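
The effect of `catenate_all` together with `preserve_original` can be imitated in plain Python. This is only a toy model of the filter's behaviour for this one example: `acronym_filter` is a made-up helper, and the whitespace split with trailing period removal is a rough stand-in for the standard tokenizer.

```python
import re


def acronym_filter(token, preserve_original=True):
    """Toy imitation of word_delimiter with catenate_all enabled."""
    # glue the alphanumeric runs back together, e.g. i.b.m -> ibm
    catenated = "".join(re.findall(r"[a-z0-9]+", token))
    if preserve_original and catenated != token:
        return [token, catenated]
    return [catenated]


def analyze(text):
    # stand-in for the standard tokenizer plus the lowercase filter:
    # split on whitespace and drop trailing periods
    tokens = [raw.lower().rstrip(".") for raw in text.split()]
    return [out for token in tokens for out in acronym_filter(token)]


print(analyze("I.B.M. IBM ibm"))
```

Every spelling of the acronym ends up emitting the token `ibm`, which is what lets a query for any one form match documents that use another.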

In [27]:
# for phone numbers such as 1-800-867-5309,
# we wish to preserve the last 7 digits (the local number),
# and the last 10 digits (long-distance number), so a user
# can search for the phone number using any of the patterns

# pattern_capture uses regexes to emit a token for every capture group,
# the link provides an example using emails
# https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-capture-tokenfilter.html
settings = {
    'settings': {
        'analysis': {
            'filter': {
                'phone_num_filter': {
                    'type': 'word_delimiter',
                    'catenate_all': True,
                    'generate_number_parts': False
                },
                'phone_num_parts': {
                    'type': 'pattern_capture',
                    'patterns': ['(\\d{7}$)', '(\\d{10}$)'],
                    'preserve_original': True
                }
            },
            'analyzer': {
                'phone_num': {
                    # here, we use the keyword tokenizer, the use-case
                    # is where we're dealing with a phone number field,
                    # instead of a text field that happens to contain phone numbers
                    # https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-tokenizer.html
                    'tokenizer': 'keyword',
                    'filter': ['phone_num_filter', 'phone_num_parts']
                }
            }
        }
    }
}
requests.delete('http://localhost:9200/my_library')
requests.put('http://localhost:9200/my_library', data=json.dumps(settings), headers=headers)

<Response [200]>

In [28]:
data = {
    'analyzer': 'phone_num',
    'text': '1(800)867-5309'
}
response = requests.get('http://localhost:9200/my_library/_analyze', 
                        data=json.dumps(data), headers=headers)

# with phone numbers, capturing the meaningful subsets of digits is
# one way of trying to capture the user's intent. Doing this is
# essentially acknowledging the fact that users will search for numbers
# by entering local numbers or national numbers
result = json.loads(response.text)
','.join('[' + token['token'] + ']' for token in result['tokens'])

'[18008675309],[8008675309],[8675309]'
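
The same idea can be tried out with Python's `re` module before touching the index settings. The sketch below is a toy version of the `phone_num` analyzer, not a replica of the filters: `phone_tokens` is a hypothetical helper, and the order of the patterns was picked to mirror the output above.

```python
import re


def phone_tokens(raw):
    """Toy version of the phone_num analyzer."""
    # like word_delimiter with catenate_all: keep the digits, glued together
    digits = "".join(re.findall(r"\d+", raw))
    tokens = [digits]  # the full number is always kept
    # like pattern_capture: one extra token per trailing-digit pattern
    for pattern in (r"(\d{10})$", r"(\d{7})$"):
        match = re.search(pattern, digits)
        if match and match.group(1) not in tokens:
            tokens.append(match.group(1))
    return tokens


print(phone_tokens("1(800)867-5309"))
print(phone_tokens("867-5309"))
```

A user who types only the local number produces a single token, and that token is one of the three emitted for the full number, so the two still match.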

## Capturing Meaning and Modeling Specificity with Synonyms

In [29]:
# e.g. dress shoes, whenever the term dress immediately precedes shoes,
# the phrase refers to a specific concept
# https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
# https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
settings = {
    'settings': {
        'analysis': {
            'filter': {
                'english_stop': {
                    'type': 'stop',
                    'stopwords': '_english_'
                },
                'english_stemmer': {
                    'type': 'stemmer',
                    'name': 'english'
                },
                'english_possessive_stemmer': {
                    'type': 'stemmer',
                    'name': 'possessive_english'
                },
                # whenever we see either dress shoe or dress shoes, map that to
                # the tokens specified at the right hand side of the =>
                # And here, we are defining the index time and query time
                # analyzer separately, since when we search for shoe, we want
                # both shoe and dress shoe to show up, however, when we search
                # for dress shoe, we only want dress shoe to show up. In other words,
                # this analysis is asymmetric
                'retail_syn_filter_index': {
                    'type': 'synonym',
                    'synonyms': ['dress shoe, dress shoes => dress_shoe, shoe']
                },
                'retail_syn_filter_search': {
                    'type': 'synonym',
                    'synonyms': ['dress shoe, dress shoes => dress_shoe']
                }
            },
            'analyzer': {
                'retail_analyzer_index': {
                    'tokenizer': 'standard',
                    # it's important to place the synonym filter before stemming, which is a
                    # more drastic form of normalization, and after lowercasing and the possessive
                    # stemmer, so that text such as Dress shoes will still get matched
                    'filter': [
                        'lowercase',
                        'english_possessive_stemmer',
                        'english_stop',
                        'retail_syn_filter_index',
                        'english_stemmer'
                    ]
                },
                'retail_analyzer_search': {
                    'tokenizer': 'standard',
                    'filter': [
                        'lowercase',
                        'english_possessive_stemmer',
                        'english_stop',
                        'retail_syn_filter_search',
                        'english_stemmer'
                    ]
                }
            }
        }
    },
    'mappings': {
        '_doc': {
            'properties': {
                # for the 'desc', description field
                'desc': {
                    'type': 'text',
                    'analyzer': 'retail_analyzer_index',
                    'search_analyzer': 'retail_analyzer_search'
                }
            }
        }
    }
}
requests.delete('http://localhost:9200/my_library')
requests.put('http://localhost:9200/my_library', data=json.dumps(settings), headers=headers)

<Response [200]>

In [30]:
documents = [
    {'desc': 'bob brand dress shoes are the bomb'},  # dress shoe
    {'desc': 'this little black dress is sure to impress'},  # dress
    {'desc': 'tennis shoes... you know, for tennis'}  # tennis shoe
]

for idx, document in enumerate(documents):
    url = 'http://localhost:9200/my_library/_doc/%s' % (idx + 1)
    response = requests.put(url, data=json.dumps(document), headers=headers)

In [31]:
# the search gives us the pertinent result, where it returns
# only dress shoes when searching for dress shoes while returning
# both dress shoes and tennis shoes when searching for shoes.
user_searches = ['dress shoes', 'shoes']

url = 'http://localhost:9200/my_library/_search'
for user_search in user_searches:
    print('\nsearched for:', user_search)
    query = {
        'query': {
            'match': {
                'desc': user_search
            }
        }
    }
    response = requests.get(url, data=json.dumps(query), headers=headers)
    search_hits = json.loads(response.text)['hits']['hits']
    for search_hit in search_hits:
        print(search_hit['_source']['desc'])


searched for: dress shoes
bob brand dress shoes are the bomb

searched for: shoes
bob brand dress shoes are the bomb
tennis shoes... you know, for tennis
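
The asymmetry between index time and search time can be reproduced without Elasticsearch. The following sketch uses a made-up `toy_stem` helper and a hard-coded pair of synonym rules; it only imitates the behaviour of the two analyzers for these three documents.

```python
import re

SYNONYM_BIGRAMS = {("dress", "shoe"), ("dress", "shoes")}


def toy_stem(token):
    """Hypothetical stemmer stand-in: drop a plural 's' (but not 'ss')."""
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text, replacement):
    tokens = re.findall(r"\w+", text.lower())
    out, i = [], 0
    while i < len(tokens):
        # synonym step runs before stemming, as in the settings above
        if tuple(tokens[i:i + 2]) in SYNONYM_BIGRAMS:
            out.extend(replacement)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return {toy_stem(token) for token in out}


def index_analyze(text):
    return analyze(text, ["dress_shoe", "shoe"])


def search_analyze(text):
    return analyze(text, ["dress_shoe"])


docs = [
    "bob brand dress shoes are the bomb",
    "this little black dress is sure to impress",
    "tennis shoes... you know, for tennis",
]


def search(query):
    # a document matches when it shares at least one token with the query
    return [doc for doc in docs if search_analyze(query) & index_analyze(doc)]


print(search("dress shoes"))
print(search("shoes"))
```

Because the query "dress shoes" is reduced to the single token `dress_shoe`, it can no longer match the plain `dress` or `shoe` tokens of the other documents.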


We can apply the same idea to model specificity, e.g. we can index `fuji => fuji, apple, fruit`. In this case we are linking a topic to its "parent" topics, i.e. fuji is one kind of apple, so when a user searches for apple, the query will match not only apple documents but also fuji documents. This pattern is a tradeoff: we improve recall at the expense of precision.
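
A minimal sketch of this parent expansion, using a plain dictionary in place of a synonym filter. The `PARENTS` table and the sample documents are invented for illustration.

```python
# hypothetical parent expansions, applied at index time only
PARENTS = {
    "fuji": ["apple", "fruit"],
    "gala": ["apple", "fruit"],
    "apple": ["fruit"],
}


def index_tokens(text):
    tokens = set()
    for token in text.lower().split():
        tokens.add(token)
        # also index every broader topic the token belongs to
        tokens.update(PARENTS.get(token, []))
    return tokens


docs = ["fuji", "gala", "apple pie", "banana bread"]
index = {doc: index_tokens(doc) for doc in docs}


def search(term):
    # the query term is not expanded, only the documents are
    return [doc for doc in docs if term in index[doc]]


print(search("apple"))
print(search("fuji"))
```

Searching for the broad term returns the specific documents too, while searching for the specific term stays narrow. That is where the extra recall, and the lost precision, comes from.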

## Modeling Specificity with Paths

In [36]:
# for a filesystem search engine, when a user is searching for a document
# in the path 'fruit/apples', the search result should return documents
# from the child directories such as 'fruit/apples/fuji', 'fruit/apples/gala'
# https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html
settings = {
    'settings': {
        'analysis': {
            'analyzer': {
                'path_analyzer': {
                    'tokenizer': 'path_hierarchy'
                }
            }
        }
    },
    'mappings': {
        '_doc': {
            'properties': {
                'inventory_dir' : {
                    'type': 'text',
                    'analyzer': 'path_analyzer'
                }
            }
        }
    }
}
requests.delete('http://localhost:9200/my_library')
requests.put('http://localhost:9200/my_library', data=json.dumps(settings), headers=headers)

<Response [200]>

In [37]:
documents = [
    # because of the path hierarchy tokenizer, the
    # '/fruit/apples/fuji' will emit the following three
    # terms '/fruit', '/fruit/apples' and '/fruit/apples/fuji'
    {'desc': 'crisp, sweet-flavored, long shelf-life',
     'inventory_dir': '/fruit/apples/fuji'},
    {'desc': 'sweat, pleasant apple',
     'inventory_dir': '/fruit/apples/gala'},
    {'desc': 'edible, seed-bearing portion of plants',
     'inventory_dir': '/fruit'}
]

for idx, document in enumerate(documents):
    url = 'http://localhost:9200/my_library/_doc/%s' % (idx + 1)
    response = requests.put(url, data=json.dumps(document), headers=headers)

In [48]:
# query filter answers a yes or no question, thus no scores are
# computed for this type of query
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html
query = {
    'query': {
        'bool': {
            'filter': [
                {'term': {'inventory_dir': '/fruit/apples'}}
            ]
        }
    }
}
url = 'http://localhost:9200/my_library/_search'
response = requests.get(url, data=json.dumps(query), headers=headers)

search_hits = json.loads(response.text)['hits']['hits']
for search_hit in search_hits:
    source = search_hit['_source']
    print(source['inventory_dir'])
    print(source['desc'])

/fruit/apples/gala
sweat, pleasant apple
/fruit/apples/fuji
crisp, sweet-flavored, long shelf-life
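
To see why the unanalyzed term `/fruit/apples` matches both documents, it helps to write out what the tokenizer emits. The function below is a simplified sketch that assumes absolute paths and a `/` delimiter; it does not cover the tokenizer's other options.

```python
def path_hierarchy(path, delimiter="/"):
    """Emit one token per ancestor of the path, shortest first."""
    parts = [part for part in path.split(delimiter) if part]
    return [
        delimiter + delimiter.join(parts[: depth + 1])
        for depth in range(len(parts))
    ]


paths = ["/fruit/apples/fuji", "/fruit/apples/gala", "/fruit"]
index = {path: set(path_hierarchy(path)) for path in paths}


def term_filter(term):
    # a term filter is an exact, yes or no lookup on the indexed tokens
    return [path for path in paths if term in index[path]]


print(path_hierarchy("/fruit/apples/fuji"))
print(term_filter("/fruit/apples"))
```

The `/fruit` document is left out because it only emits the token `/fruit`, so a lookup for `/fruit/apples` finds nothing to match there.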


## Tokenize the World

We can apply search to anything from which we can extract meaningful, discrete features as the data flows through the analysis process. Apart from text, we can also consider tokenizing geographic information, images, etc. to turn our search engine into a more general-purpose similarity system.
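
As a rough illustration of tokenizing something other than text, the sketch below turns a latitude/longitude pair into a handful of grid-cell tokens, from coarse to fine. The scheme, the `grid_tokens` helper and the coordinates are all invented for this example; real systems typically use more careful encodings.

```python
import math


def grid_tokens(lat, lon, levels=3):
    """Emit one grid-cell token per level, from coarse to fine."""
    tokens = []
    for level in range(levels):
        # level 0 uses 1 degree cells, level 1 uses 0.1 degree cells, ...
        scale = 10 ** level
        cell_lat = math.floor(lat * scale)
        cell_lon = math.floor(lon * scale)
        tokens.append(f"{level}:{cell_lat}:{cell_lon}")
    return tokens


# two hypothetical points roughly a kilometre apart
point_a = grid_tokens(40.7580, -73.9855)
point_b = grid_tokens(40.7484, -73.9857)

print(point_a)
print(point_b)
print(sorted(set(point_a) & set(point_b)))
```

Nearby points share their coarse tokens and differ only in the finest ones, so the number of shared tokens acts as a crude measure of proximity, much like shared word tokens do for text.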