<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Term-Centric-Search" data-toc-modified-id="Term-Centric-Search-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Term Centric Search</a></span><ul class="toc-item"><li><span><a href="#Copy-Fields" data-toc-modified-id="Copy-Fields-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Copy Fields</a></span></li></ul></li></ul></div>

# Term Centric Search

Field-centric search ranks documents by field-level criteria, such as whether the search string matches one particular field or matches several of the fields. Term-centric search, on the other hand, places the search terms front and center.

Instead of searching every field with the full search string, term-centric search acts on the search string like a term-by-term matchmaker, trying to find each term's ideal match. That is, for each term we go through every field and find the field that matches the term best; only then do we combine the per-term scores.

The reason that we should be considering term-centric search is to solve for two potential problems:

**A failure to give a higher rank to documents that match more search terms.**

Imagine the following scenario:

```python
# If we are to index two documents:
doc1 = {'title': 'albino', 'body': 'elephant'}
doc2 = {'title': 'elephant', 'body': 'elephant'}

# then we issue a 'most_fields' type query 'albino elephant' over the title and body field.
```

In the field-centric search scenario, both documents would be ranked equally, since we ship the entire search string to each field for scoring before combining the results. There is no difference between a match in which only `elephant` matches both fields and a match in which `albino` matches one field and `elephant` matches another.

**Signal discordance.** Relevance scores become unintuitive when they are computed on constituent parts rather than on the unit the user has in mind. For example, instead of storing a document's text as a whole, our source data model might split it into fields such as title, intro, conclusion, and appendix. If we send the search query to each of these fields separately, we create a signal discordance: the signals we score on no longer reflect the user's intent.

Term-centric search aims to solve the albino elephant problem and signal discordance by taking a top-down view of search: breaking up the search string into terms, and querying each term one by one against a set of fields. Note that we'll soon see that term-centric search is not without its problems, and a hybrid approach may be preferred.

The following link describes potential issues with field-centric search: https://www.elastic.co/guide/en/elasticsearch/guide/master/field-centric.html
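The difference between the two strategies can be illustrated with a toy scorer. This is only a sketch under simplifying assumptions, not how Elasticsearch scores documents: the helpers `tokenize`, `field_centric_score` and `term_centric_score` are hypothetical, and every term match counts as 1 instead of using TF-IDF or BM25.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def field_centric_score(doc, query):
    """Score the whole query against each field, then sum the field scores."""
    terms = tokenize(query)
    score = 0
    for text in doc.values():
        field_tokens = set(tokenize(text))
        # each query term found in this field contributes 1 to the field score
        score += sum(term in field_tokens for term in terms)
    return score


def term_centric_score(doc, query):
    """For each term, keep only its best-matching field, then sum over terms."""
    score = 0
    for term in tokenize(query):
        score += max(int(term in tokenize(text)) for text in doc.values())
    return score


doc1 = {'title': 'albino', 'body': 'elephant'}
doc2 = {'title': 'elephant', 'body': 'elephant'}
query = 'albino elephant'

for name, doc in [('doc1', doc1), ('doc2', doc2)]:
    print(name, field_centric_score(doc, query), term_centric_score(doc, query))
```

Under these assumptions the field-centric scorer cannot tell the two documents apart, while the term-centric scorer prefers `doc1` because it matches both search terms.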

In [1]:
import json
import requests

In [2]:
def extract():
    with open('tmdb.json') as f:
        return json.loads(f.read())
    
    
movies = extract()

# we can look up a sample movie id to get a sense of what
# the data looks like
# movie_ids = ['93837', '8193', '8195', '5', '8202', '11']
movies['93837']

{'poster_path': '/mfMndRWFbzXbTx0g3rHUXFAxyOh.jpg',
 'production_countries': [{'iso_3166_1': 'US',
   'name': 'United States of America'}],
 'revenue': 0,
 'overview': 'When the FBI hires her to go undercover at a college sorority, Molly Morris (Miley Cyrus) must transform herself from a tough, streetwise private investigator to a refined, sophisticated university girl to help protect the daughter of a one-time Mobster. With several suspects on her list, Molly unexpectedly discovers that not everyone is who they appear to be, including herself.',
 'video': False,
 'id': 93837,
 'genres': [{'id': 28, 'name': 'Action'}, {'id': 35, 'name': 'Comedy'}],
 'title': 'So Undercover',
 'tagline': "Meet the FBI's new secret weapon",
 'vote_count': 55,
 'homepage': '',
 'belongs_to_collection': None,
 'original_language': 'en',
 'status': 'Released',
 'spoken_languages': [{'iso_639_1': 'en', 'name': 'English'}],
 'imdb_id': 'tt1766094',
 'adult': False,
 'backdrop_path': '/o4Tt60z94Hbgk8adeZG9WE4S

In [3]:
class ElasticSearchUtils:

    def __init__(self, index_name='tmdb', base_url='http://localhost:9200'):
        self.base_url = base_url
        self.index_name = index_name
        self.index_url = self.base_url + '/' + self.index_name
        self.index_type_name = '_doc'
        self.index_type_url = self.index_url + '/' + self.index_type_name
        self.headers = {'Content-Type': 'application/json'}

    def reindex(self, movies, analysis_settings, mapping_settings=None):
        """
        Reindex takes analyzer and field mappings, recreates the index, and then reindexes
        TMDB movies using the _bulk index API. There are other ways for modifying the configuration
        of the index besides dropping and restarting, however for convenience and because our data
        isn't truly that large, we'll just delete and start from scratch when we need to.
        """
        response = requests.delete(self.index_url)
        print('deleted TMDB index: ', response.status_code)

        # create the index with explicit settings
        # We need to explicitly set number of shards to 1 to eliminate the impact of 
        # distributed IDF on our small collection
        # See also 'Relevance is Broken!'
        # http://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html
        settings = {
            'settings': {
                'index': {
                    'number_of_replicas': 1,
                    'number_of_shards': 1
                },
                'analysis': analysis_settings
            }
        }
        if mapping_settings is not None:
            settings['mappings'] = mapping_settings

        response = requests.put(self.index_url, data=json.dumps(settings), headers=self.headers)
        print('Created TMDB index: ', response.status_code)

        self._bulk_index(movies)

    def _bulk_index(self, movies):
        bulk_index_cmd = ''
        for movie_id, movie in movies.items():
            index_cmd = {
                'index': {
                    '_index': self.index_name,
                    '_type': self.index_type_name,
                    '_id': movie_id
                }
            }
            bulk_index_cmd += (json.dumps(index_cmd) + '\n' + json.dumps(movie) + '\n')

        response = requests.post(self.base_url + '/_bulk',
                                 data=bulk_index_cmd,
                                 headers=self.headers)
 
        print('Bulk index into TMDB index:', response.status_code)

    def search(self, query, verbose=False):
        search_url = self.index_type_url + '/_search'
        response = requests.get(search_url, data=json.dumps(query), headers=self.headers)

        search_hits = json.loads(response.text)['hits']['hits']
        for idx, hit in enumerate(search_hits):
            source = hit['_source']
            print("%s\t%s\t%s" % (idx + 1, hit['_score'], source['title']))
            
            if verbose:
                cast_names = []
                cast_characters = []
                for cast in source['cast']:
                    cast_names.append(cast['name'])
                    cast_characters.append(cast['character'])

                director_names = [director['name'] for director in source['directors']]

                print('director: ', director_names)
                print('cast: ', cast_names)
                print('character: ', cast_characters)
                print('overview:', source['overview'])
                if '_explanation' in hit:
                    result = ElasticSearchUtils.flatten_explain(hit['_explanation'])
                    print(result)

                print('=============================================')
   
    @staticmethod          
    def flatten_explain(explain_json, depth=0):
        
        # getting rid of potential next line character to make things prettier
        description = explain_json['description'].replace('\n', '')
        result = ' ' * (depth * 2) + '%s, %s\n' % (explain_json['value'], description)
        if 'details' in explain_json:
            for detail in explain_json['details']:
                result += ElasticSearchUtils.flatten_explain(detail, depth=depth + 1)

        return result

    def validate(self, query):
        url = self.index_type_url + '/_validate/query?explain'
        response = requests.get(url, data=json.dumps(query), headers=self.headers)
        return json.loads(response.text)

In [4]:
# re-creating the index from chapter 5
analysis_settings = {
    'filter': {
        'bigram_filter': {
            'type': 'shingle',
            'max_shingle_size': 2,
            'min_shingle_size': 2,
            'output_unigrams': False
        },
        'english_stemmer': {
            'type': 'stemmer',
            'name': 'english'
        }
    },
    'analyzer': {
        'english_bigram': {
            'type': 'custom',
            'tokenizer': 'standard',
            'filter': ['lowercase', 'english_stemmer', 'bigram_filter']
        }
    }
}

mapping_settings = {
    '_doc': {
        'properties': {
            'cast': {
                'properties': {
                    'name': {
                        'type': 'text',
                        'analyzer': 'english',
                        'fields': {
                            'bigrammed': {
                                'type': 'text',
                                'analyzer': 'english_bigram'
                            }
                        }
                    }
                }
            },
            'directors': {
                'properties': {
                    'name': {
                        'type': 'text',
                        'analyzer': 'english',
                        'fields': {
                            'bigrammed': {
                                'type': 'text',
                                'analyzer': 'english_bigram'
                            }
                        }
                    }
                }
            }
        }
    }
}

es_utils = ElasticSearchUtils()
es_utils.reindex(movies, analysis_settings, mapping_settings)

deleted TMDB index:  200
Created TMDB index:  200
Bulk index into TMDB index: 200


In [13]:
# field-centric search
# the link below discusses term-based vs. full-text search in depth
# https://www.elastic.co/guide/en/elasticsearch/guide/master/term-vs-full-text.html
user_search = 'star trek patrick stewart william shatner'
query = {
    'query': {
        'multi_match': { 
            'query': user_search,
            'type': 'most_fields',
            'fields': [
                'title',
                'overview',
                'cast.name.bigrammed^5',
                'directors.name.bigrammed'
            ]
         }
    }
}
# checking the translated lucene query
es_utils.validate(query)['explanations']

[{'index': 'tmdb',
  'valid': True,
  'explanation': '+((directors.name.bigrammed:star trek directors.name.bigrammed:trek patrick directors.name.bigrammed:patrick stewart directors.name.bigrammed:stewart william directors.name.bigrammed:william shatner) | (cast.name.bigrammed:star trek cast.name.bigrammed:trek patrick cast.name.bigrammed:patrick stewart cast.name.bigrammed:stewart william cast.name.bigrammed:william shatner)^5.0 | (overview:star overview:trek overview:patrick overview:stewart overview:william overview:shatner) | (title:star title:trek title:patrick title:stewart title:william title:shatner))~1.0 #*:*'}]

In [6]:
query.update({'size': 5, 'explain': True})
es_utils.search(query, verbose=False)

1	64.98588	Star Trek: Generations
2	41.540653	Star Trek IV: The Voyage Home
3	40.866093	Star Trek V: The Final Frontier
4	38.89132	Star Trek: Nemesis
5	38.43622	Star Trek: Insurrection


The potential problem with field-centric search is that term statistics, such as how often a term occurs, differ from field to field. The per-field scores are therefore not directly comparable and can interfere with each other to produce badly ordered results.
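To make this concrete, the sketch below computes an inverse document frequency separately for each field over a tiny toy corpus. The three documents and the helper `idf` are assumptions made for illustration, and the formula is the BM25-style IDF, ln(1 + (N - n + 0.5) / (n + 0.5)), used here as an approximation of what the search engine computes.

```python
import math

# toy corpus, not taken from the TMDB data used in this notebook
corpus = [
    {'title': 'star trek', 'cast': 'patrick stewart'},
    {'title': 'star wars', 'cast': 'mark hamill'},
    {'title': 'lone star', 'cast': 'chris cooper'},
]


def idf(term, field, docs):
    """BM25-style IDF of `term`, computed only from the given field."""
    n_docs = len(docs)
    doc_freq = sum(term in doc[field].split() for doc in docs)
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


for term in ['star', 'patrick']:
    for field in ['title', 'cast']:
        print(term, field, round(idf(term, field, corpus), 3))
```

In this toy corpus `star` is common in `title` and absent from `cast`, so the same term receives a very different weight depending on the field it is scored against.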


In [7]:
# term centric search using 'query_string'
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
user_search = 'star trek patrick stewart william shatner'
query = {
    'query': {
        'query_string': { 
            'query': user_search,
            'fields': [
                'title',
                'overview',
                'cast.name.bigrammed',
                'directors.name.bigrammed'
            ]      
         }
    }
}

# checking the translated lucene query
es_utils.validate(query)['explanations']

{'valid': True,
 '_shards': {'total': 1, 'successful': 1, 'failed': 0},
 'explanations': [{'index': 'tmdb',
   'valid': True,
   'explanation': '+((directors.name.bigrammed:star trek directors.name.bigrammed:trek patrick directors.name.bigrammed:patrick stewart directors.name.bigrammed:stewart william directors.name.bigrammed:william shatner) | (cast.name.bigrammed:star trek cast.name.bigrammed:trek patrick cast.name.bigrammed:patrick stewart cast.name.bigrammed:stewart william cast.name.bigrammed:william shatner) | (overview:star overview:trek overview:patrick overview:stewart overview:william overview:shatner) | (title:star title:trek title:patrick title:stewart title:william title:shatner)) #*:*'}]}

In [8]:
query.update({'size': 5, 'explain': True})
es_utils.search(query, verbose=False)

1	10.409781	Star Trek: Generations
2	9.918423	Hannah Montana: The Movie
3	8.972326	Star Trek: Insurrection
4	8.972326	Star Trek: Nemesis
5	8.380948	Star Trek IV: The Voyage Home


## Copy Fields

Copy fields come from the idea that we can group multiple similar fields into a single combined field.
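Conceptually, a copy field gathers the values of several source fields into one extra field at index time, so that term statistics are computed over the combined field. The sketch below imitates that in plain Python. The helper `copy_people_names` and the sample movie are hypothetical; Elasticsearch does the equivalent internally and leaves the stored `_source` unchanged.

```python
def copy_people_names(movie):
    """Collect cast and director names into one list, like a combined 'people.name' field."""
    names = [member['name'] for member in movie.get('cast', [])]
    names += [director['name'] for director in movie.get('directors', [])]
    return names


# made-up document that follows the shape of the movie data (cast and directors lists)
movie = {
    'title': 'Example Movie',
    'cast': [{'name': 'Actor One', 'character': 'Hero'}],
    'directors': [{'name': 'Director One'}],
}

people_names = copy_people_names(movie)
print(people_names)
```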

In [9]:
# here we combine the various fields about people (cast and directors) into a single
# field that provides one specific signal; this addresses the signal discordance
# problem where fields are scored independently, so that the same search term
# ends up with a different idf score in each field
# https://www.elastic.co/guide/en/elasticsearch/reference/current/copy-to.html
mapping_settings = {
    '_doc': {
        'properties': {
            'title': {
                'type': 'text',
                'analyzer': 'english'
            },
            'overview': {
                'type': 'text',
                'analyzer': 'english'
            },
            'cast': {
                'properties': {
                    'name': {
                        'type': 'text',
                        'analyzer': 'english',
                        'copy_to': 'people.name',
                        'fields': {
                            'bigrammed': {
                                'type': 'text',
                                'analyzer': 'english_bigram'
                            }
                        }
                    }
                }
            },
            'directors': {
                'properties': {
                    'name': {
                        'type': 'text',
                        'analyzer': 'english',
                        'copy_to': 'people.name',
                        'fields': {
                            'bigrammed': {
                                'type': 'text',
                                'analyzer': 'english_bigram'
                            }
                        }
                    }
                }
            },
            'people': {  # define the combined field people like any other field
                'properties': {
                    'name': {
                        'type': 'text',
                        'analyzer': 'english',
                        'fields': {
                            'bigrammed': {
                                'type': 'text',
                                'analyzer': 'english_bigram'
                            }
                        }
                    }
                }
            }
        }
    }
}
es_utils.reindex(movies, analysis_settings, mapping_settings)

deleted TMDB index:  200
Created TMDB index:  200
Bulk index into TMDB index: 200


In [10]:
# performing a search on the combined field people
user_search = 'patrick stewart william shatner'
query = {
    'query': {
        'match': { 
            'people.name': user_search
         }
    },
    'size': 5,
    'explain': True
}
es_utils.search(query)

1	13.463219	Star Trek: Generations
2	10.10991	Star Trek V: The Final Frontier
3	8.345118	Conspiracy Theory
4	8.305401	Bill & Ted's Bogus Journey
5	8.271045	Miss Congeniality 2: Armed and Fabulous


In [14]:
# using a cross_fields query to do term-centric search;
# cross_fields treats all the fields as one big field,
# and looks for each term in any of the fields
# https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html
user_search = 'patrick stewart william shatner'
query = {
    'query': {
        'multi_match': {
            'query': user_search,
            'type': 'cross_fields',
            'fields': [
                'title',
                'overview',
                'cast.name.bigrammed',
                'directors.name.bigrammed'
            ]   
         }
    }
}

# checking the translated lucene query
es_utils.validate(query)['explanations']

[{'index': 'tmdb',
  'valid': True,
  'explanation': '+((blended(terms:[directors.name.bigrammed:patrick stewart, cast.name.bigrammed:patrick stewart]) blended(terms:[directors.name.bigrammed:stewart william, cast.name.bigrammed:stewart william]) blended(terms:[directors.name.bigrammed:william shatner, cast.name.bigrammed:william shatner])) | (blended(terms:[overview:patrick, title:patrick]) blended(terms:[overview:stewart, title:stewart]) blended(terms:[overview:william, title:william]) blended(terms:[overview:shatner, title:shatner]))) #*:*'}]

When validating our `cross_fields` query, we get a "blended" query. This aims to solve the term-frequency problem by blending inverse document frequencies across fields.
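A rough sketch of the blending idea follows. It assumes the blend simply takes the largest document frequency seen across the fields and reuses it for every field; the actual implementation is more involved, and the helpers `bm25_idf` and `blended_idf` as well as the document counts are invented for illustration.

```python
import math


def bm25_idf(doc_freq, n_docs):
    """BM25-style inverse document frequency."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def blended_idf(doc_freqs, n_docs):
    """Assumption: every field shares the largest document frequency among the fields."""
    blended_df = max(doc_freqs.values())
    return {field: bm25_idf(blended_df, n_docs) for field in doc_freqs}


# invented statistics for a single term across two fields
n_docs = 1000
doc_freqs = {'cast.name': 40, 'directors.name': 1}

per_field = {field: bm25_idf(df, n_docs) for field, df in doc_freqs.items()}
blended = blended_idf(doc_freqs, n_docs)

print('per field:', per_field)
print('blended:  ', blended)
```

Without blending, the term looks far rarer (and therefore more valuable) in `directors.name` than in `cast.name`; after blending, both fields weight the term identically.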

In [18]:
# combining two searches:
# the first level is more greedy and is used to increase recall,
# the second level is more stringent and is used to increase precision
user_search = 'star trek patrick stewart william shatner'
query = {
    'query': {
        'bool': {
            'should': [
                {
                    # first level: cast a wide net by searching over all the fields
                    'multi_match': {
                        'query': user_search,
                        'type': 'cross_fields',
                        'fields': [
                            'overview',
                            'title',
                            'directors.name',
                            'cast.name'
                        ]
                    }
                },
                {
                    # second level: being more stringent and search only on
                    # a subset of the fields
                    'multi_match': {
                        'query': user_search,
                        'type': 'cross_fields',
                        'fields': [
                            'directors.name.bigrammed',
                            'cast.name.bigrammed'
                        ]
                    }
                }
            ]
        }
    }
}
es_utils.search(query)

1	32.55439	Star Trek: Generations
2	20.593908	Star Trek: Insurrection
3	20.580242	Star Trek: Nemesis
4	20.264835	Star Trek II: The Wrath of Khan
5	19.561811	Star Trek V: The Final Frontier
6	19.235165	Star Trek IV: The Voyage Home
7	19.125366	Star Trek: First Contact
8	19.10784	Star Trek: The Motion Picture
9	18.711422	Star Trek III: The Search for Spock
10	17.318808	Star Trek VI: The Undiscovered Country
