
# Relevance Feedback

Ways to steer the search conversation:

- Explain to users how their query is being interpreted, or help them understand why a particular document is a match.
- Correct mistakes such as typos and misspellings.
- Suggest other searches that will provide better results.

Common forms of relevance feedback:

- Search as you type.
- Search completion.
- Post-search suggestion.


Search completion is usually presented as a drop-down menu that aids users as they type. When building completions from past search queries:

- With too few queries, we may have insufficient data to build a satisfactory completion experience.
- With too many queries, we'll have to prioritize what's important from a large, diverse set of completion candidates.
- We should also consider whether old search traffic becomes obsolete for our application.
- We should ensure that a search completion doesn't lead to a search query that returns zero results.
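
As a rough illustration of the considerations above, the sketch below builds completion candidates from a search log. The log format (tuples of query text, search date, and result count), the function name `build_completion_candidates`, and all thresholds are assumptions made for this example; the notebook's dataset does not include a query log.

```python
from collections import Counter
from datetime import date


def build_completion_candidates(query_log, today, min_count=2,
                                max_age_days=90, top_n=5):
    """query_log: iterable of (query_text, search_date, num_results) tuples
    (a hypothetical format chosen for this sketch)."""
    counts = Counter()
    for text, searched_on, num_results in query_log:
        if num_results == 0:
            continue  # never complete into a query that returns nothing
        if (today - searched_on).days > max_age_days:
            continue  # treat old search traffic as obsolete
        counts[text.strip().lower()] += 1

    # keep only queries seen often enough, most frequent first
    frequent = [(q, c) for q, c in counts.items() if c >= min_count]
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [q for q, _ in frequent[:top_n]]


log = [
    ('star trek', date(2019, 1, 10), 8),
    ('Star Trek', date(2019, 1, 12), 8),
    ('star wars', date(2019, 1, 11), 6),
    ('star wars', date(2019, 1, 13), 6),
    ('star wars', date(2019, 1, 14), 6),
    ('star trec', date(2019, 1, 12), 0),  # zero results, dropped
    ('star trec', date(2019, 1, 13), 0),
    ('stargate', date(2018, 1, 1), 3),    # too old, dropped
    ('stargate', date(2018, 1, 2), 3),
]
candidates = build_completion_candidates(log, today=date(2019, 1, 15))
```

With this toy log, `candidates` contains only the two queries that are recent, frequent, and known to return results.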

In [1]:
import json
import requests

In [2]:
def extract():
    with open('../tmdb.json') as f:
        return json.loads(f.read())
    
    
movies = extract()

# we can look up a sample movie id to get a sense of what
# the data looks like
# movie_ids = ['93837', '8193', '8195', '5', '8202', '11']
movies['93837']

{'poster_path': '/mfMndRWFbzXbTx0g3rHUXFAxyOh.jpg',
 'production_countries': [{'iso_3166_1': 'US',
   'name': 'United States of America'}],
 'revenue': 0,
 'overview': 'When the FBI hires her to go undercover at a college sorority, Molly Morris (Miley Cyrus) must transform herself from a tough, streetwise private investigator to a refined, sophisticated university girl to help protect the daughter of a one-time Mobster. With several suspects on her list, Molly unexpectedly discovers that not everyone is who they appear to be, including herself.',
 'video': False,
 'id': 93837,
 'genres': [{'id': 28, 'name': 'Action'}, {'id': 35, 'name': 'Comedy'}],
 'title': 'So Undercover',
 'tagline': "Meet the FBI's new secret weapon",
 'vote_count': 55,
 'homepage': '',
 'belongs_to_collection': None,
 'original_language': 'en',
 'status': 'Released',
 'spoken_languages': [{'iso_639_1': 'en', 'name': 'English'}],
 'imdb_id': 'tt1766094',
 'adult': False,
 'backdrop_path': '/o4Tt60z94Hbgk8adeZG9WE4S

In [3]:
class ElasticSearchUtils:

    def __init__(self, index_name='tmdb', base_url='http://localhost:9200'):
        self.base_url = base_url
        self.index_name = index_name
        self.index_url = self.base_url + '/' + self.index_name
        self.index_type_name = '_doc'
        self.index_type_url = self.index_url + '/' + self.index_type_name
        self.headers = {'Content-Type': 'application/json'}

    def reindex(self, movies, analysis_settings, mapping_settings=None):
        """
        Reindex takes analyzer and field mappings, recreates the index, and then reindexes
        TMDB movies using the _bulk index API. There are other ways for modifying the configuration
        of the index besides dropping and restarting, however for convenience and because our data
        isn't truly that large, we'll just delete and start from scratch when we need to.
        """
        response = requests.delete(self.index_url)
        print('deleted {} index: '.format(self.index_name), response.status_code)

        # create the index with explicit settings
        # We need to explicitly set number of shards to 1 to eliminate the impact of 
        # distributed IDF on our small collection
        # See also 'Relevance is Broken!'
        # http://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html
        settings = {
            'settings': {
                'index': {
                    'number_of_replicas': 1,
                    'number_of_shards': 1
                } 
            }
        }
        if analysis_settings is not None:
            settings['settings']['analysis'] = analysis_settings
        
        if mapping_settings is not None:
            settings['mappings'] = mapping_settings

        response = requests.put(self.index_url, data=json.dumps(settings), headers=self.headers)
        print('Created {} index: '.format(self.index_name), response.status_code)

        self._bulk_index(movies)

    def _bulk_index(self, movies):
        bulk_index_cmd = ''
        for movie_id, movie in movies.items():
            index_cmd = {
                'index': {
                    '_index': self.index_name,
                    '_type': self.index_type_name,
                    '_id': movie_id
                }
            }
            bulk_index_cmd += (json.dumps(index_cmd) + '\n' + json.dumps(movie) + '\n')

        response = requests.post(self.base_url + '/_bulk',
                                 data=bulk_index_cmd,
                                 headers=self.headers)
 
        print('Bulk index into {} index:'.format(self.index_name), response.status_code)

    def search(self, query, verbose=False):
        search_url = self.index_type_url + '/_search'
        response = requests.get(search_url, data=json.dumps(query), headers=self.headers)

        search_hits = json.loads(response.text)['hits']['hits']
        for idx, hit in enumerate(search_hits):
            source = hit['_source']
            print("%s\t%s\t%s" % (idx + 1, hit['_score'], source['title']))
            
            if verbose:
                cast_names = []
                cast_characters = []
                for cast in source['cast']:
                    cast_names.append(cast['name'])
                    cast_characters.append(cast['character'])

                director_names = [director['name'] for director in source['directors']]

                print('director: ', director_names)
                print('cast: ', cast_names)
                print('character: ', cast_characters)
                print('overview:', source['overview'])
                if '_explanation' in hit:
                    result = ElasticSearchUtils.flatten_explain(hit['_explanation'])
                    print(result)

                print('=============================================')
   
    @staticmethod          
    def flatten_explain(explain_json, depth=0):
        
        # getting rid of potential next line character to make things prettier
        description = explain_json['description'].replace('\n', '')
        result = ' ' * (depth * 2) + '%s, %s\n' % (explain_json['value'], description)
        if 'details' in explain_json:
            for detail in explain_json['details']:
                result += ElasticSearchUtils.flatten_explain(detail, depth=depth + 1)

        return result

    def validate(self, query):
        url = self.index_type_url + '/_validate/query?explain'
        response = requests.get(url, data=json.dumps(query), headers=self.headers)
        return json.loads(response.text)

In [4]:
# explanation of analyzer's role
# https://qbox.io/blog/elasticsearch-english-analyzer-customize
analysis_settings = {
    'filter': {
        'bigram_filter': {
            'type': 'shingle',
            'max_shingle_size': 2,
            'min_shingle_size': 2,
            'output_unigrams': False
        },
        'english_stemmer': {
            'type': 'stemmer',
            'name': 'english'
        }
    },
    'analyzer': {
        'english_bigram': {
            'type': 'custom',
            'tokenizer': 'standard',
            'filter': ['lowercase', 'english_stemmer', 'bigram_filter']
        },
        # for search completion, we should preserve readability during analysis
        # so no stemming is performed
        'completion_analyzer': {
            'tokenizer': 'standard',
            'filter': [
                'standard',
                'lowercase',
                'bigram_filter'
            ]
        }
    }
}

mapping_settings = {
    '_doc': {
        'properties': {
            'title': {
                'type': 'text',
                'analyzer': 'english',
                'copy_to': 'completion'
            },
            'completion': {
                'type': 'text',
                'analyzer': 'completion_analyzer'
            }
        }
    }
}

es_utils = ElasticSearchUtils()
es_utils.reindex(movies, analysis_settings, mapping_settings)

deleted tmdb index:  200
Created tmdb index:  200
Bulk index into tmdb index: 200


## Match Phrase Prefix Query

In [5]:
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html
user_search = 'star tr'
query = {
    'query': {
        'match_phrase_prefix': {
            'title': user_search
        }
    }
}
es_utils.search(query)

1	29.129747	Star Trek: Generations
2	25.174505	Star Trek: The Motion Picture
3	25.174505	Star Trek: First Contact
4	22.16494	Star Trek II: The Wrath of Khan
5	22.16494	Star Trek III: The Search for Spock
6	22.16494	Star Trek IV: The Voyage Home
7	22.16494	Star Trek V: The Final Frontier
8	22.16494	Star Trek VI: The Undiscovered Country


## Completion

The completion suggester provides search-as-you-type functionality. It is a specialized search index that is stored in parallel with the normal search index.
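
To make the idea of a separate prefix-lookup structure concrete, here is a deliberately simplified stand-in. Elasticsearch's completion suggester uses an in-memory finite state transducer; the sketch below swaps that for a sorted list searched with `bisect`, and the class name `PrefixIndex` is made up for this example.

```python
import bisect


class PrefixIndex:
    """Toy prefix lookup: a sorted list of lowercased inputs plus weights."""

    def __init__(self, entries):
        # entries: iterable of (input_text, weight)
        self._entries = sorted((text.lower(), weight, text)
                               for text, weight in entries)
        self._keys = [key for key, _, _ in self._entries]

    def suggest(self, prefix, size=5):
        prefix = prefix.lower()
        # all keys sharing the prefix sit in one contiguous run
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key, weight, text in self._entries[start:]:
            if not key.startswith(prefix):
                break
            matches.append((weight, text))
        # highest weight first, ties broken alphabetically
        matches.sort(key=lambda m: (-m[0], m[1]))
        return [text for _, text in matches[:size]]


index = PrefixIndex([('Star Trek', 3), ('Star Wars', 5),
                     ('Stargate', 1), ('Alien', 4)])
suggestions = index.suggest('star')
```

The lookup only ever matches from the start of an input, which is why the hotel example that follows indexes several inputs per document.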

In [6]:
# https://www.elastic.co/blog/you-complete-me
# https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
# https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html


# name_suggest is a field that will be indexed as type 'completion'.
# we index the suggest field by providing the mandatory 'input' field,
# which is the text that will match the query, and an optional 'weight' field
# that allows us to control the ordering of the suggestions returned
hotel1 = {
    'name': 'Mercure Hotel Munich',
    'city': 'Munich',
    'name_suggest': [
        {'input': 'Mercure Hotel Munich'},
        {'input': 'Mercure Munich'}
    ]
}
hotel2 = {
    'name': 'Hotel Monaco',
    'city': 'Munich',
    'name_suggest': [
        {'input': 'Monaco Munich'},
        {'input': 'Hotel Monaco'}
    ]
}
hotel3 = {
    'name': 'Courtyard by Marriot Munich City',
    'city': 'Munich',
    'name_suggest': [
        {'input': 'Courtyard by Marriot Munich City'},
        {'input': 'Marriot Munich City'}
    ]
}

hotels = {
    '1': hotel1,
    '2': hotel2,
    '3': hotel3
}

In [7]:
mapping_settings = {
    '_doc': {
        'properties': {
            'name': {'type': 'text'},
            'city': {'type': 'text'},
            'name_suggest': {'type': 'completion'}
        }
    }
}

es_utils = ElasticSearchUtils(index_name='hotels')
es_utils.reindex(hotels, analysis_settings=None, mapping_settings=mapping_settings)

deleted hotels index:  200
Created hotels index:  200
Bulk index into hotels index: 200


In [9]:
url = es_utils.index_url + '/_search'
query = {
    'suggest' : {
        'hotel_suggest': {
            'text' : 'm',
            'completion' : {
                'field' : 'name_suggest'
            }
        }
    }
}

# every hotel has at least one input that starts with 'm' (Mercure ...,
# Monaco ..., Marriot ...), so all three hotels are returned; the 'text'
# of each option shows which input matched the prefix
response = requests.post(url, data=json.dumps(query), headers=es_utils.headers)
json.loads(response.text)['suggest']['hotel_suggest'][0]['options']

[{'text': 'Mercure Hotel Munich',
  '_index': 'hotels',
  '_type': '_doc',
  '_id': '1',
  '_score': 1.0,
  '_source': {'name': 'Mercure Hotel Munich',
   'city': 'Munich',
   'name_suggest': [{'input': 'Mercure Hotel Munich'},
    {'input': 'Mercure Munich'}]}},
 {'text': 'Monaco Munich',
  '_index': 'hotels',
  '_type': '_doc',
  '_id': '2',
  '_score': 1.0,
  '_source': {'name': 'Hotel Monaco',
   'city': 'Munich',
   'name_suggest': [{'input': 'Monaco Munich'}, {'input': 'Hotel Monaco'}]}},
 {'text': 'Marriot Munich City',
  '_index': 'hotels',
  '_type': '_doc',
  '_id': '3',
  '_score': 1.0,
  '_source': {'name': 'Courtyard by Marriot Munich City',
   'city': 'Munich',
   'name_suggest': [{'input': 'Courtyard by Marriot Munich City'},
    {'input': 'Marriot Munich City'}]}}]

## Correcting Typos

If the search engine receives a query that contains an obvious typo, or if the original query returns zero results, it can replace the user's query with the correction the user most likely intended and notify the user that the "corrected" query was searched instead. If the query looks like it might contain a typo but it is ambiguous whether it really does, the search engine should retrieve results for the user's original query and also suggest an alternative query that the user can click on to run.

Regardless of which approach we take, we should clearly convey what we're doing in the UI so the user won't become disoriented.
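
The two behaviours can be sketched as a small decision function. It assumes candidates shaped like the options of Elasticsearch's phrase suggester (dicts with `text`, `score`, and optionally `collate_match`), ordered best first. The function name `decide_correction` and the rule for what counts as an "obvious" typo (the best candidate scoring at least `dominance_ratio` times the runner-up) are assumptions for illustration, not an established heuristic.

```python
def decide_correction(num_hits, options, dominance_ratio=2.0):
    """Return (action, query) where action is 'replace', 'suggest' or 'keep'."""
    # ignore candidates that are known to match no document
    viable = [o for o in options if o.get('collate_match', True)]
    if not viable:
        return ('keep', None)

    best = viable[0]
    runner_up_score = viable[1]['score'] if len(viable) > 1 else 0.0
    # assumption: a candidate that clearly dominates signals an obvious typo
    obvious = best['score'] >= dominance_ratio * runner_up_score

    if num_hits == 0 or obvious:
        # search the corrected query and tell the user we did so
        return ('replace', best['text'])
    # ambiguous: keep the original results, offer a "did you mean" link
    return ('suggest', best['text'])


ambiguous = [{'text': 'star trek', 'score': 0.5},
             {'text': 'star tree', 'score': 0.4}]
decision_with_hits = decide_correction(12, ambiguous)
decision_no_hits = decide_correction(0, ambiguous)
```

Whichever action is returned, the UI should state what happened, e.g. "Showing results for ..." for a replacement or "Did you mean ...?" for a suggestion.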

In [10]:
# https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html
# https://qbox.io/blog/how-to-build-did-you-mean-feature-with-elasticsearch-phrase-suggester
mapping_setting = {
    '_doc': {
        'properties': {
            'title': {
                'type': 'text',
                'analyzer': 'english',
                # or we can use fields to index the same field in different ways for different purposes
                'copy_to': 'suggestion'
            },
            # the genres field will be used to perform aggregation later, hence it is
            # mapped as keyword, i.e. indexed as an exact value without analysis
            # https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html
            'genres': {
                'properties': {
                    'name': {
                        'type': 'keyword'
                    }
                }
            }
        }
    }
}
es_utils = ElasticSearchUtils()
es_utils.reindex(movies, analysis_settings=None, mapping_settings=mapping_setting)

deleted tmdb index:  200
Created tmdb index:  200
Bulk index into tmdb index: 200


In [11]:
url = es_utils.index_url + '/_search'
query = {
    'suggest': {
        'text': 'star trec',
        'simple_phrase': {
            'phrase': {'field': 'suggestion'}
        }
    }
}

# our suggestions are listed under the options list
response = requests.post(url, data=json.dumps(query), headers=es_utils.headers)
json.loads(response.text)

{'took': 37,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': 0, 'max_score': 0.0, 'hits': []},
 'suggest': {'simple_phrase': [{'text': 'star trec',
    'offset': 0,
    'length': 9,
    'options': [{'text': 'star trek', 'score': 0.005635708},
     {'text': 'star they', 'score': 0.0026878999},
     {'text': 'star true', 'score': 0.0026878999},
     {'text': 'star three', 'score': 0.0022539154},
     {'text': 'star tracy', 'score': 0.0022539154}]}]}}

In [12]:
url = es_utils.index_url + '/_search'
query = {
    # we can specify the query along with the suggestion, so
    # we don't have to make two separate requests
    'query': {
        'match': {
            'title': 'star trec'
        }
    },
    'suggest': {
        'text': 'star trec',
        'simple_phrase': {
            'phrase': {
                'field': 'suggestion',
                
                # collate runs the given query once per suggestion to check whether
                # any document matches it, and by specifying prune = True, every
                # suggestion is kept in the payload with an additional
                # 'collate_match' flag that tells us whether there's a match or not.
                'collate': {
                    'query': {
                        'source': {
                            # notice that we are using match_phrase here rather than match;
                            # with match, a suggestion such as 'star three' would remain in the
                            # suggestions because some document contains the word star or
                            # three, even though none includes the phrase 'star three'
                            'match_phrase': {
                                # the special suggestion parameter will be replaced
                                # with the text of each suggestion
                                '{{field_name}}': '{{suggestion}}'
                            }
                        }
                    },
                    'params': {'field_name': 'title'},
                    'prune': True
                }
            }
        }
    }
}

# our suggestions are listed under the options list
response = requests.post(url, data=json.dumps(query), headers=es_utils.headers)
result = json.loads(response.text)['suggest']
result

{'simple_phrase': [{'text': 'star trec',
   'offset': 0,
   'length': 9,
   'options': [{'text': 'star trek',
     'score': 0.005635708,
     'collate_match': True},
    {'text': 'star they', 'score': 0.0026878999, 'collate_match': True},
    {'text': 'star true', 'score': 0.0026878999, 'collate_match': False},
    {'text': 'star three', 'score': 0.0022539154, 'collate_match': False},
    {'text': 'star tracy', 'score': 0.0022539154, 'collate_match': False}]}]}

## Faceting

Most e-commerce search engines have a facet panel, often on the left, that lets the user filter by attributes such as brand or gender. Giving users more options to guide themselves not only increases the likelihood that they find what they are actually looking for and make the purchase, it also means we can fret less about complex ranking.

In [13]:
url = es_utils.index_url + '/_search'

query = {
    'aggs': {
        'genres': {
            'terms': {
                'field': 'genres.name'
            }
        }
    },
    'size': 0
}

response = requests.post(url, data=json.dumps(query), headers=es_utils.headers)
result = json.loads(response.text)['aggregations']
result

{'genres': {'doc_count_error_upper_bound': 0,
  'sum_other_doc_count': 341,
  'buckets': [{'key': 'Drama', 'doc_count': 374},
   {'key': 'Comedy', 'doc_count': 277},
   {'key': 'Thriller', 'doc_count': 274},
   {'key': 'Action', 'doc_count': 264},
   {'key': 'Adventure', 'doc_count': 216},
   {'key': 'Crime', 'doc_count': 166},
   {'key': 'Science Fiction', 'doc_count': 146},
   {'key': 'Romance', 'doc_count': 143},
   {'key': 'Fantasy', 'doc_count': 108},
   {'key': 'Family', 'doc_count': 97}]}}

In [14]:
# the user can then use the facet to perform various filtering,
# note that in the UI, we also want to let the user know that
# they are performing filtering on the search result
# https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-post-filter.html
url = es_utils.index_url + '/_search'

query = {
    'query': {
        'bool': {
            'filter': [
                {
                    'term': {
                        'genres.name': 'Science Fiction'
                    }
                }
            ]
        }
    },
    'aggs': {
        'genres': {
            'terms': {
                'field': 'genres.name'
            }
        }
    },
    'size': 0
}

response = requests.post(url, data=json.dumps(query), headers=es_utils.headers)
result = json.loads(response.text)['aggregations']
result

{'genres': {'doc_count_error_upper_bound': 0,
  'sum_other_doc_count': 33,
  'buckets': [{'key': 'Science Fiction', 'doc_count': 146},
   {'key': 'Action', 'doc_count': 88},
   {'key': 'Thriller', 'doc_count': 69},
   {'key': 'Adventure', 'doc_count': 65},
   {'key': 'Drama', 'doc_count': 29},
   {'key': 'Comedy', 'doc_count': 26},
   {'key': 'Fantasy', 'doc_count': 23},
   {'key': 'Mystery', 'doc_count': 16},
   {'key': 'Horror', 'doc_count': 15},
   {'key': 'Crime', 'doc_count': 13}]}}
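
In an application, the facet values a user has ticked need to be turned into a request like the one above. The helper below is a minimal sketch of that step; the function name `build_faceted_query` and its arguments are made up for this example. It only builds the request body as a dictionary and reuses the `bool`/`filter`, `terms` and `aggs` constructs already shown.

```python
def build_faceted_query(selected, facets, user_query=None, text_field='title'):
    """
    selected: {field: [chosen values]}, e.g. {'genres.name': ['Science Fiction']}
    facets:   {aggregation name: field}, e.g. {'genres': 'genres.name'}
    """
    bool_query = {}

    # one 'terms' filter per facet; values within a facet are OR-ed,
    # separate facets are AND-ed by the surrounding bool filter
    filters = [{'terms': {field: list(values)}}
               for field, values in sorted(selected.items()) if values]
    if filters:
        bool_query['filter'] = filters
    if user_query:
        bool_query['must'] = [{'match': {text_field: user_query}}]

    return {
        'query': {'bool': bool_query} if bool_query else {'match_all': {}},
        'aggs': {name: {'terms': {'field': field}}
                 for name, field in facets.items()},
    }


facet_query = build_faceted_query(
    selected={'genres.name': ['Science Fiction']},
    facets={'genres': 'genres.name'},
    user_query='star')
```

The resulting dictionary can be posted to the `_search` endpoint in the same way as the queries above.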

## Alternate Sorting

A search engine usually sorts results by relevance, the so-called best match, but we can offer additional sort options; a common one is sorting by price. Be careful that the results remain relevant when sorting by price from low to high: e.g. a user searches for fitbit, but the first result is a fitbit wristband simply because it is the cheapest match.
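
One way to guard against this, sketched below under stated assumptions, is to drop weakly matching hits before re-sorting. The hits mimic Elasticsearch's `_score`/`_source` layout, but the product data, the `price` field, and the "at least half of the top score" cutoff are all hypothetical; a sensible cutoff depends on the application.

```python
def sort_by_price_with_relevance_floor(hits, min_score_ratio=0.5):
    """Keep hits scoring at least min_score_ratio of the best hit's score,
    then sort the survivors by price from low to high."""
    if not hits:
        return []
    top_score = max(hit['_score'] for hit in hits)
    relevant = [hit for hit in hits
                if hit['_score'] >= min_score_ratio * top_score]
    return sorted(relevant, key=lambda hit: hit['_source']['price'])


# hypothetical e-commerce hits for the query 'fitbit'
hits = [
    {'_score': 12.0, '_source': {'title': 'Fitbit tracker (model A)', 'price': 129.95}},
    {'_score': 11.0, '_source': {'title': 'Fitbit tracker (model B)', 'price': 99.95}},
    {'_score': 3.0, '_source': {'title': 'Replacement wristband for Fitbit', 'price': 9.99}},
]
by_price = [hit['_source']['title']
            for hit in sort_by_price_with_relevance_floor(hits)]
```

Here the wristband falls below the relevance floor, so the cheapest tracker is listed first instead.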

## What Information to Present

What to present depends on the application, e.g. title, image, short description, price (for e-commerce), or location and date/time (for event-based search).

For text-heavy documents, highlighting the text that caused the document to match the user's query is an important form of relevance feedback. Users appreciate being able to read the matches in the context they appeared in, to get a sense of whether the document is a good fit before clicking through to dig deeper.

In [15]:
# TODO didn't dig that deep into this, could come back to this
# https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html
url = es_utils.index_url + '/_search'
query = {
    'query': {
        'match': {
            'title': 'star trek'
        }
    },
    'highlight': {
        'fields': {
            # we specify the field we would like to highlight
            # and the type of highlight we would like to use
            'title': {
                'type': 'plain'
            }
        }
    },
    'size': 5
}

# issue the search request with highlighting enabled
response = requests.post(url, data=json.dumps(query), headers=es_utils.headers)

# for each search hit, the highlighted fragments are stored under the 'highlight' key
search_hits = json.loads(response.text)['hits']['hits']
search_hits[0]['highlight']

{'title': ['<em>Star</em> <em>Trek</em>: Generations']}

## Grouping Similar Documents

- Documents that are similar can be presented together to reduce the user's cognitive burden.
- Documents that are near-duplicates should not be shown to the user.
- https://github.com/o19s/relevant-search-book/blob/master/ipython/Chapter%208%20(Providing%20Relevance%20Feedback).ipynb/Search%20Result%20Listing.ipynb
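
As a small illustration of suppressing near-duplicates, the sketch below groups hits whose titles are identical after normalization and keeps the first hit of each group. Using a normalized title as the grouping key is an assumption made for this example; real applications usually need a stronger similarity signal.

```python
import re


def normalized_title(hit):
    # lowercase and collapse punctuation/whitespace so that
    # 'Star Trek: Generations' and 'star trek generations' compare equal
    return re.sub(r'[^a-z0-9]+', ' ', hit['_source']['title'].lower()).strip()


def group_hits(hits, key_fn=normalized_title):
    """Group hits by key, preserving the ranking order of first appearance."""
    groups = {}
    for hit in hits:
        groups.setdefault(key_fn(hit), []).append(hit)
    return list(groups.values())


hits = [
    {'_source': {'title': 'Star Trek: Generations'}},
    {'_source': {'title': 'star trek generations'}},
    {'_source': {'title': 'Star Trek: First Contact'}},
]
groups = group_hits(hits)
# show only the best-ranked representative of each group
deduplicated = [group[0]['_source']['title'] for group in groups]
```

The same `group_hits` helper could group similar (rather than duplicate) documents by passing a different `key_fn`.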

## No Results

Sometimes a user issues a request that ends up having no results. In this scenario, it's better to fall back to alternate results than to show nothing, e.g. show popular documents or revise the user's query to the closest suggestion, and communicate to the user what happened.
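
A minimal sketch of such a fallback chain follows. The function name `results_with_fallback`, the `run_search` callable (any function mapping a query string to a list of hits), and the message wording are assumptions for illustration only.

```python
def results_with_fallback(query, hits, run_search, suggestion=None, popular=()):
    """Return (results, message); message is None when no fallback was needed."""
    if hits:
        return hits, None

    # first fallback: retry with the closest suggested query
    if suggestion:
        corrected_hits = run_search(suggestion)
        if corrected_hits:
            message = 'No results for "{}"; showing results for "{}" instead.'.format(
                query, suggestion)
            return corrected_hits, message

    # last resort: show popular documents
    message = 'No results for "{}"; here are some popular titles.'.format(query)
    return list(popular), message


# stand-in for a real search call, used only to exercise the sketch
fake_index = {'star trek': ['Star Trek: Generations']}
results, message = results_with_fallback(
    'star trec', [], lambda q: fake_index.get(q, []),
    suggestion='star trek', popular=['Interstellar'])
```

Returning the message alongside the results keeps the explanation to the user tied to whichever fallback was actually used.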