<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Indexing-TMDB-Movies" data-toc-modified-id="Indexing-TMDB-Movies-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Indexing TMDB Movies</a></span></li><li><span><a href="#Basic-Searching" data-toc-modified-id="Basic-Searching-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Basic Searching</a></span></li><li><span><a href="#Query-Validation-API" data-toc-modified-id="Query-Validation-API-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Query Validation API</a></span></li><li><span><a href="#Debugging-Analysis" data-toc-modified-id="Debugging-Analysis-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Debugging Analysis</a></span></li><li><span><a href="#Solving-The-Matching-Problem" data-toc-modified-id="Solving-The-Matching-Problem-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Solving The Matching Problem</a></span></li><li><span><a href="#Repeat-the-search" data-toc-modified-id="Repeat-the-search-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Repeat the search</a></span></li><li><span><a href="#Decomposing-Relevance-Score-With-Lucene’s-Explain" data-toc-modified-id="Decomposing-Relevance-Score-With-Lucene’s-Explain-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Decomposing Relevance Score With Lucene’s Explain</a></span></li><li><span><a href="#Fixing-Space-Jam-vs-Alien-Ranking" data-toc-modified-id="Fixing-Space-Jam-vs-Alien-Ranking-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Fixing Space Jam vs Alien Ranking</a></span></li></ul></div>

## Indexing TMDB Movies

In [1]:
import json
import requests


def extract():
    with open('tmdb.json') as f:
        return json.loads(f.read())
    

movies = extract()

# look up a sample movie id to get a sense of what
# the data looks like
# movie_ids = ['93837', '8193', '8195', '5', '8202', '11']
movies['93837']

{'poster_path': '/mfMndRWFbzXbTx0g3rHUXFAxyOh.jpg',
 'production_countries': [{'iso_3166_1': 'US',
   'name': 'United States of America'}],
 'revenue': 0,
 'overview': 'When the FBI hires her to go undercover at a college sorority, Molly Morris (Miley Cyrus) must transform herself from a tough, streetwise private investigator to a refined, sophisticated university girl to help protect the daughter of a one-time Mobster. With several suspects on her list, Molly unexpectedly discovers that not everyone is who they appear to be, including herself.',
 'video': False,
 'id': 93837,
 'genres': [{'id': 28, 'name': 'Action'}, {'id': 35, 'name': 'Comedy'}],
 'title': 'So Undercover',
 'tagline': "Meet the FBI's new secret weapon",
 'vote_count': 55,
 'homepage': '',
 'belongs_to_collection': None,
 'original_language': 'en',
 'status': 'Released',
 'spoken_languages': [{'iso_639_1': 'en', 'name': 'English'}],
 'imdb_id': 'tt1766094',
 'adult': False,
 'backdrop_path': '/o4Tt60z94Hbgk8adeZG9WE4S

In [2]:
response = requests.delete('http://localhost:9200/tmdb')
response

<Response [200]>

In [3]:
# creating an index
# https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html
settings = {
    'settings': {
        'index': {
            'number_of_shards': 1,
            'number_of_replicas': 1
        }
    }
}
headers = {'Content-Type': 'application/json'}
response = requests.put('http://localhost:9200/tmdb', data=json.dumps(settings), headers=headers)
response

<Response [200]>

```bash
curl -X PUT "localhost:9200/tmdb" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 1 
        }
    }
}
'
```

In [4]:
# indexing API
# https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
# https://www.elastic.co/guide/en/elasticsearch/guide/master/index-doc.html

bulk_index_cmd = ''
for movie_id, movie in movies.items():
    # a document is uniquely identified by its index, type and id.
    # note that support for multiple types under one index is being
    # removed, and going forward the type will just be set to '_doc'
    # https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html
    index_cmd = {
        'index': {
            '_index': 'tmdb',
            '_type': '_doc',
            '_id': movie_id
        }
    }
    bulk_index_cmd += (json.dumps(index_cmd) + '\n' + json.dumps(movie) + '\n')

    
response = requests.post('http://localhost:9200/_bulk', data=bulk_index_cmd, headers=headers)
response

<Response [200]>
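
One caveat with the bulk API is that the HTTP status can be 200 even when individual documents fail to index; the response body carries an `errors` flag and per-item results. Below is a minimal sketch of building the newline-delimited body and checking a parsed response. The helper names `build_bulk_body` and `failed_items` are hypothetical, and the sample response is hand-written for illustration rather than captured from a real cluster.

```python
import json


def build_bulk_body(movies, index='tmdb', doc_type='_doc'):
    """Return the newline-delimited JSON body expected by the _bulk endpoint."""
    lines = []
    for movie_id, movie in movies.items():
        action = {'index': {'_index': index, '_type': doc_type, '_id': movie_id}}
        lines.append(json.dumps(action))  # action/metadata line
        lines.append(json.dumps(movie))   # document source line
    # the body must end with a trailing newline
    return '\n'.join(lines) + '\n'


def failed_items(bulk_response):
    """Collect the per-document results that report an error."""
    if not bulk_response.get('errors'):
        return []
    return [item['index'] for item in bulk_response['items']
            if 'error' in item['index']]


sample_movies = {'1': {'title': 'Space Jam'}, '2': {'title': 'Alien'}}
body = build_bulk_body(sample_movies)

# hand-written response in the shape the bulk API returns, for illustration only
sample_response = {
    'errors': True,
    'items': [
        {'index': {'_id': '1', 'status': 201}},
        {'index': {'_id': '2', 'status': 400,
                   'error': {'type': 'mapper_parsing_exception'}}}
    ]
}
print(failed_items(sample_response))
```

With a real request, the dictionary to inspect would be `response.json()` from the `requests.post` call above.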

In [5]:
def reindex(movies, mapping_settings=None):
    """Create the tmdb index, and movie type. We can interact with it using tmdb/movie"""
    response = requests.delete('http://localhost:9200/tmdb')

    settings = {
        'settings': {
            'index': {
                'number_of_shards': 1,  # for reproducibility, document frequency is stored per shard
                'number_of_replicas': 1
            }
        }
    }
    if mapping_settings is not None:
        settings.update(mapping_settings)
    
    headers = {'Content-Type': 'application/json'}
    response = requests.put('http://localhost:9200/tmdb',
                            data=json.dumps(settings), headers=headers)

    bulk_index_cmd = ''
    for movie_id, movie in movies.items():
        index_cmd = {
            'index': {
                '_index': 'tmdb',
                '_type': '_doc',
                '_id': movie_id
            }
        }
        bulk_index_cmd += (json.dumps(index_cmd) + '\n' + json.dumps(movie) + '\n')

    response = requests.post('http://localhost:9200/_bulk', data=bulk_index_cmd, headers=headers)

In [6]:
reindex(movies)

## Basic Searching

In [7]:
def search(query):
    url = 'http://localhost:9200/tmdb/_doc/_search'
    response = requests.get(url, data=json.dumps(query), headers=headers)
    search_hits = json.loads(response.text)['hits']

    print('Num\tRelevance Score\tMovie Title')
    for idx, hit in enumerate(search_hits['hits']):
        print('%s\t%s\t%s' % (idx + 1, hit['_score'], hit['_source']['title']))

In [8]:
# the user is most likely looking for the movie Space Jam
user_search = 'basketball with cartoon aliens'
query = {
    'query': {
        'multi_match': {
            'query': user_search,
            # boost the title's score by 10, this can
            # be thought of as the query weight
            'fields': ['title^10', 'overview'],
        }
    },
    'size': '40'
}
search(query)

Num	Relevance Score	Movie Title
1	81.2218	Aliens
2	67.8921	Cowboys & Aliens
3	66.59527	The Basketball Diaries
4	51.114723	Aliens vs Predator: Requiem
5	45.88936	Dances with Wolves
6	45.88936	Friends with Benefits
7	45.88936	Fire with Fire
8	40.21931	Interview with the Vampire
9	40.21931	From Russia With Love
10	40.21931	Gone with the Wind
11	40.21931	Just Go With It
12	40.21931	My Week with Marilyn
13	35.796352	Die Hard: With a Vengeance
14	32.2498	The Girl with the Dragon Tattoo
15	32.2498	The Life Aquatic With Steve Zissou
16	32.2498	Twin Peaks: Fire Walk with Me
17	7.733661	Space Jam
18	7.0542254	The Flintstones
19	6.5545254	Galaxy Quest
20	6.496262	White Men Can't Jump
21	6.320264	Bedazzled
22	5.806069	They Live
23	5.0928774	Battlefield Earth
24	5.0928774	Cocoon
25	4.9547644	High School Musical
26	1.6261511	The Switch
27	1.6142253	Nim's Island
28	1.6039157	White Noise
29	1.6009648	Frida
30	1.5836867	Silver Linings Playbook
31	1.5588576	The Man in the Iron Mask
32	1.554282	Strangers

## Query Validation API

In [9]:
query = {
   'query': {
        'multi_match': { 
            'query': user_search,
            'fields': ['title^10', 'overview']
        }
    }
}

# examine the underlying query strategy to understand why
# it's generating spurious matches
response = requests.get(
    'http://localhost:9200/tmdb/_doc/_validate/query?explain',
    data=json.dumps(query), headers=headers)

# <fieldName>:<query>
# title:basketball is a term query
# title:"space jam" would be a phrase query, these are specified by using quotes
# to indicate the terms should be adjacent to one another
response.text

'{"valid":true,"_shards":{"total":1,"successful":1,"failed":0},"explanations":[{"index":"tmdb","valid":true,"explanation":"+((title:basketball title:with title:cartoon title:aliens)^10.0 | (overview:basketball overview:with overview:cartoon overview:aliens)) #*:*"}]}'

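Within each field, the four query terms are grouped together as SHOULD clauses, and the two field groups are combined with `|`, which takes the maximum of the two scores.

In Lucene query syntax, a MUST clause is prefixed with `+`, a MUST_NOT clause is prefixed with `-`, and a SHOULD clause has no prefix.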
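
The prefix convention can be sketched as a tiny lookup. `occur` is a hypothetical helper that only looks at a clause's first character; it is not a Lucene query parser. The `#` entry is an addition not covered above: as far as we can tell it marks a non-scoring FILTER clause, which would explain the trailing `#*:*` in the validation output.

```python
def occur(clause):
    """Label a single Lucene clause by its leading operator."""
    prefixes = {'+': 'MUST', '-': 'MUST_NOT', '#': 'FILTER'}
    # anything without a recognised prefix is treated as a SHOULD clause
    return prefixes.get(clause[:1], 'SHOULD')


for clause in ['+title:basketball', '-title:with', 'title:cartoon', '#*:*']:
    print(clause, '->', occur(clause))
```

Since every term inside the parentheses is a SHOULD clause, a document only needs to match one of them, which is why a title containing nothing but "with" still shows up in the results.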

## Debugging Analysis

In [10]:
# https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
data = {
    'field': 'title',
    'analyzer': 'standard',
    'text': 'Fire with Fire'
}

response = requests.get('http://localhost:9200/tmdb/_analyze?format=yaml', 
                        data=json.dumps(data), headers=headers)

# this shows the resulting tokens and their positions/offsets in the
# original text, which lets us inspect what the search engine indexed.
# notice that the text has been lowercased

# also notice that "with" is indexed as a token, which explains why
# Fire with Fire matches our query basketball with cartoon aliens
print(response.text)

---
tokens:
- token: "fire"
  start_offset: 0
  end_offset: 4
  type: "<ALPHANUM>"
  position: 0
- token: "with"
  start_offset: 5
  end_offset: 9
  type: "<ALPHANUM>"
  position: 1
- token: "fire"
  start_offset: 10
  end_offset: 14
  type: "<ALPHANUM>"
  position: 2



## Solving The Matching Problem

In [11]:
# fix the match by changing analyzers (we can define it field by field),
# so that our search engine can discriminate between meaningful terms and
# those that are of low importance.
# here we are switching the title and overview fields to the English analyzer,
# which removes stop words for us, in this case, "with" being one of them
# https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
mappings_setting = {
    'mappings': {
        # this key is the "type", and should match
        # index_cmd's _type key
        '_doc': {
            'properties': {
                'title': {
                    'type': 'text',
                    'analyzer': 'english'
                },
                'overview': {
                    'type': 'text',
                    'analyzer': 'english'
                }
            }
        }
    }
}
reindex(movies, mappings_setting)

In [12]:
data = {
    'field': 'title',
    'text': 'Fire with Fire'
}

response = requests.get('http://localhost:9200/tmdb/_analyze?format=yaml', 
                        data=json.dumps(data), headers=headers)

# the "with" token is no longer in the resulting token stream, while
# the position field of the second fire term is still kept as 2
print(response.text)

---
tokens:
- token: "fire"
  start_offset: 0
  end_offset: 4
  type: "<ALPHANUM>"
  position: 0
- token: "fire"
  start_offset: 10
  end_offset: 14
  type: "<ALPHANUM>"
  position: 2
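
The gap in positions is what we would expect if the stop word is dropped after positions have already been assigned. Here is a rough pure-Python sketch of that behaviour. The regex tokenizer and the tiny stop word set are simplifications for illustration, not the actual `english` analyzer (which also stems terms), and `analyze` is a hypothetical helper name.

```python
import re

# a tiny illustrative stop word set, not Elasticsearch's actual list
STOP_WORDS = {'with', 'the', 'a', 'an', 'of'}


def analyze(text, stop_words=frozenset()):
    """Lowercase, split on word characters and drop stop words,
    keeping each token's original position and character offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r'\w+', text.lower())):
        token = match.group()
        if token in stop_words:
            continue  # dropped, but later positions are not renumbered
        tokens.append({
            'token': token,
            'start_offset': match.start(),
            'end_offset': match.end(),
            'position': position
        })
    return tokens


for token in analyze('Fire with Fire', STOP_WORDS):
    print(token)
```

Keeping the original positions means a phrase query can still tell that the two "fire" tokens were not adjacent in the source text.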



In [13]:
# we can double-check the query validation and
# see that the "with" token is no longer in the query strategy,
# notice that we are also now matching the term basketbal, which
# is the stemmed form of basketball
user_search = 'basketball with cartoon aliens'
query = {
   'query': {
        'multi_match': { 
            'query': user_search,
            'fields': ['title^10', 'overview']
        }
    }
}

response = requests.get(
    'http://localhost:9200/tmdb/_doc/_validate/query?explain',
    data=json.dumps(query), headers=headers)

response.text

'{"valid":true,"_shards":{"total":1,"successful":1,"failed":0},"explanations":[{"index":"tmdb","valid":true,"explanation":"+((title:basketbal title:cartoon title:alien)^10.0 | (overview:basketbal overview:cartoon overview:alien)) #*:*"}]}'

## Repeat the search

In [14]:
# now we can see that Space Jam ranks higher (7th instead of 17th)
user_search = 'basketball with cartoon aliens'
query = {
    'query': {
        'multi_match': {
            'query': user_search,
            'fields': ['title^10', 'overview'],
        }
    },
    'size': '15'
}
search(query)

Num	Relevance Score	Movie Title
1	71.57863	Alien
2	71.57863	Aliens
3	71.28492	The Basketball Diaries
4	57.776596	Cowboys & Aliens
5	41.69649	Aliens vs Predator: Requiem
6	41.69649	AVP: Alien vs. Predator
7	12.921349	Space Jam
8	6.8866544	The Flintstones
9	6.603108	White Men Can't Jump
10	5.561389	The Thing
11	5.304045	Bedazzled
12	5.196432	High School Musical
13	5.079732	Independence Day
14	4.98434	The X Files
15	4.6665144	The Day the Earth Stood Still


## Decomposing Relevance Score With Lucene’s Explain

In [15]:
# we can pass an additional explain argument to the
# request when issuing the search query, then for
# each search result returned, it will have an additional
# "_explanation" entry that allows us to take a peek of
# why we're getting the score for each search result
query = {
    'query': {
        'multi_match': {
            'query': user_search,
            'fields': ['title^10', 'overview'],
        }
    },
    'explain': True
}

response = requests.get('http://localhost:9200/tmdb/_doc/_search', data=json.dumps(query), headers=headers)
result = json.loads(response.text)

# the top hit's score is the product of the title boost, idf and tfNorm:
# (boost=10) * (idf=5.557179) * (tfNorm=1.288039) = 71.57863
title = result['hits']['hits'][0]['_source']['title']
print('explanation for: ', title)
explanation = result['hits']['hits'][0]['_explanation']
explanation

explanation for:  Alien


{'value': 71.57863,
 'description': 'max of:',
 'details': [{'value': 71.57863,
   'description': 'sum of:',
   'details': [{'value': 71.57863,
     'description': 'weight(title:alien in 229) [PerFieldSimilarity], result of:',
     'details': [{'value': 71.57863,
       'description': 'score(doc=229,freq=1.0 = termFreq=1.0\n), product of:',
       'details': [{'value': 10.0, 'description': 'boost', 'details': []},
        {'value': 5.557179,
         'description': 'idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:',
         'details': [{'value': 5.0, 'description': 'docFreq', 'details': []},
          {'value': 1424.0, 'description': 'docCount', 'details': []}]},
        {'value': 1.288039,
         'description': 'tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:',
         'details': [{'value': 1.0,
           'description': 'termFreq=1.0',
           'details': []},
          {'value': 1.2, 'descriptio

In [21]:
def flatten_explain(explain_json, depth=0):
    # getting rid of potential next line character to make things prettier
    description = explain_json['description'].replace('\n', '')
    result = " " * (depth * 2) + "%s, %s\n" % (explain_json['value'], description)
    if 'details' in explain_json:
        for detail in explain_json['details']:
            result += flatten_explain(detail, depth=depth + 1)

    return result


print(flatten_explain(explanation))

71.57863, max of:
  71.57863, sum of:
    71.57863, weight(title:alien in 229) [PerFieldSimilarity], result of:
      71.57863, score(doc=229,freq=1.0 = termFreq=1.0), product of:
        10.0, boost
        5.557179, idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
          5.0, docFreq
          1424.0, docCount
        1.288039, tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
          1.0, termFreq=1.0
          1.2, parameter k1
          0.75, parameter b
          2.2057583, avgFieldLength
          1.0, fieldLength
  3.460176, sum of:
    3.460176, weight(overview:alien in 229) [PerFieldSimilarity], result of:
      3.460176, score(doc=229,freq=1.0 = termFreq=1.0), product of:
        3.911321, idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
          28.0, docFreq
          1423.0, docCount
        0.8846566, tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 -
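
The two formulas in the explain output can be checked by hand. The sketch below plugs in the values printed above for `title:alien`. The function names are ours, and we assume `log` in the description means the natural logarithm.

```python
import math


def idf(doc_freq, doc_count):
    """log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))"""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, k1, b, field_length, avg_field_length):
    """(freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))"""
    length_norm = 1 - b + b * field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * length_norm)


boost = 10.0
title_idf = idf(doc_freq=5.0, doc_count=1424.0)
title_tf_norm = tf_norm(freq=1.0, k1=1.2, b=0.75,
                        field_length=1.0, avg_field_length=2.2057583)
score = boost * title_idf * title_tf_norm
print(title_idf, title_tf_norm, score)
```

The three printed numbers should agree with the 5.557179, 1.288039 and 71.57863 reported by the explain output, up to floating point rounding.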

## Fixing Space Jam vs Alien Ranking

In [16]:
# lowering the title boost gives us the desired result for this query,
# but there are some questions to ask ourselves:
# the "|" is taking the maximum of the two compound queries,
# will this strategy work for other searches?
query = {
    'query': {
        'multi_match': { 
            'query': user_search,
            'fields': ['title^0.1', 'overview'],
        }
    }
}
search(query)

Num	Relevance Score	Movie Title
1	12.921349	Space Jam
2	6.8866544	The Flintstones
3	6.603108	White Men Can't Jump
4	6.0548387	Aliens vs Predator: Requiem
5	5.561389	The Thing
6	5.304045	Bedazzled
7	5.196432	High School Musical
8	5.079732	Independence Day
9	4.98434	The X Files
10	4.6665144	The Day the Earth Stood Still
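
To see why the ranking flips, recall that the `|` operator keeps only the best-scoring field. The sketch below uses Alien's per-field scores from the explain output in the previous section. `best_field_score` is a hypothetical helper, and we assume the boost simply multiplies the field score, as the "product of" line in the explanation suggests.

```python
def best_field_score(field_scores, boosts):
    """Take the maximum boosted score across fields, mimicking the "|" operator."""
    return max(score * boosts.get(field, 1.0)
               for field, score in field_scores.items())


# unboosted scores for Alien: title = idf * tfNorm, overview as reported
alien = {'title': 5.557179 * 1.288039, 'overview': 3.460176}

print(best_field_score(alien, {'title': 10}))   # the title score wins
print(best_field_score(alien, {'title': 0.1}))  # the overview score wins
```

With a title boost of 0.1, Alien falls back to its overview score of about 3.46, well below Space Jam's 12.921349. Space Jam's score is identical in both runs, which suggests it comes from the overview field and is therefore unaffected by the title boost.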
