# Configuring analyzers for the MS MARCO Document dataset

Before we start tuning queries and other index parameters, we wanted to show a very simple iteration on the standard analyzers. The MS MARCO Document dataset has three fields: `url`, `title` and `body`. We tried just a couple of very small improvements, mostly to stopword lists, to see what would happen to our baseline queries. We now have two indices to play with:

- `msmarco-document.defaults` with some default analyzers
 - `url`: standard
 - `title`: english
 - `body`: english
- `msmarco-document` with customized analyzers
 - `url`: english with URL-specific stopword list
 - `title`: english with question-specific stopword list
 - `body`: english with question-specific stopword list

The stopword lists have been changed as follows:
 1. Since the MS MARCO query dataset is all questions, it makes sense to add a few extra stopwords: who, what, when, where, why, how
 1. URLs additionally contain some words that don't really need to be searched on: http, https, www, com, edu
 
More details can be found in the index settings in `conf`.
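
The settings files themselves are not reproduced in this notebook. As a rough sketch of what such a configuration could look like, the snippet below builds an index body with two configured `english` analyzers that differ only in their stopword lists. The analyzer names `english_questions` and `english_urls` and the helper `english_analyzer` are hypothetical, and the base English list is assumed to match the built-in `_english_` set, so check both against the real files in `conf` and your Elasticsearch version.

```python
# Assumed to match the built-in `_english_` stopword set; verify for your version.
ENGLISH_STOPWORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
]

# Extra stopwords named in the text above.
QUESTION_STOPWORDS = ["who", "what", "when", "where", "why", "how"]
URL_STOPWORDS = ["http", "https", "www", "com", "edu"]


def english_analyzer(extra_stopwords):
    """Return an `english` analyzer definition with an extended stopword list."""
    return {
        "type": "english",
        "stopwords": ENGLISH_STOPWORDS + list(extra_stopwords),
    }


# Hypothetical index body: analyzer names are illustrative only.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "english_questions": english_analyzer(QUESTION_STOPWORDS),
                "english_urls": english_analyzer(QUESTION_STOPWORDS + URL_STOPWORDS),
            }
        }
    },
    "mappings": {
        "properties": {
            "url": {"type": "text", "analyzer": "english_urls"},
            "title": {"type": "text", "analyzer": "english_questions"},
            "body": {"type": "text", "analyzer": "english_questions"},
        }
    },
}
```

Spelling out the full list keeps the sketch explicit about which terms are dropped at index and query time.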

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import importlib
import os
import sys

from elasticsearch import Elasticsearch

In [3]:
# project library
sys.path.insert(0, os.path.abspath('..'))

import qopt
importlib.reload(qopt)

from qopt.notebooks import evaluate_mrr100_dev

In [4]:
# use a local Elasticsearch or Cloud instance (https://cloud.elastic.co/)
es = Elasticsearch('http://localhost:9200')

# set the parallelization parameter `max_concurrent_searches` for the Rank Evaluation API calls
max_concurrent_searches = 10

## Comparisons

The following runs a series of comparisons between the baseline default index `msmarco-document.defaults` and the custom index `msmarco-document`. We use multiple query types to confirm that the improvement holds across all of them.
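
All scores reported here are MRR@100. As a reminder of what that metric measures, here is a minimal, standalone sketch of the calculation. It is not the implementation used by `evaluate_mrr100_dev` or by the Rank Evaluation API; the function names and the toy data are made up for illustration.

```python
def reciprocal_rank(ranked_doc_ids, relevant_doc_ids, k=100):
    """Return 1/rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results, k=100):
    """Average the reciprocal rank over (ranking, relevant set) pairs."""
    if not results:
        return 0.0
    return sum(reciprocal_rank(r, rel, k) for r, rel in results) / len(results)


# Toy data: three queries with their ranked results and relevant documents.
toy_results = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant hit at rank 2
    (["d5", "d2"], {"d5"}),        # first relevant hit at rank 1
    (["d9", "d8"], {"d4"}),        # no relevant hit in the results
]

score = mean_reciprocal_rank(toy_results)
print(f"MRR@100: {score:.4f}")
```

A query whose first relevant document falls outside the top 100 contributes zero, so changes that lift relevant documents even slightly can move the score.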

### Query: combined per-field `match`es

In [5]:
def combined_matches(index):
    evaluate_mrr100_dev(es, max_concurrent_searches, index, 'combined_matches', params={})

In [6]:
%%time

combined_matches('msmarco-document.defaults')

Evaluation with: MRR@100
Score: 0.2403
CPU times: user 2.26 s, sys: 615 ms, total: 2.87 s
Wall time: 4min 44s


In [7]:
%%time

combined_matches('msmarco-document')

Evaluation with: MRR@100
Score: 0.2504
CPU times: user 2.12 s, sys: 639 ms, total: 2.76 s
Wall time: 3min 34s


### Query: `multi_match` `cross_fields`

In [8]:
def multi_match_cross_fields(index):
    evaluate_mrr100_dev(es, max_concurrent_searches, index,
        template_id='cross_fields',
        params={
            'operator': 'OR',
            'minimum_should_match': 50,  # in percent/%
            'tie_breaker': 0.0,
            'url|boost': 1.0,
            'title|boost': 1.0,
            'body|boost': 1.0,
        })
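
The `cross_fields` search template is defined in the project configuration and is not shown in this notebook. As a guess at how the parameters above could map onto a `multi_match` query, here is a hedged sketch. The function `build_cross_fields_query` is hypothetical, and the real template may differ.

```python
def build_cross_fields_query(query_string, params):
    """Build a `multi_match` query of type `cross_fields` from notebook-style params."""
    # Per-field boosts use the `field^boost` syntax of the `fields` option.
    fields = []
    for field in ("url", "title", "body"):
        boost = params[field + "|boost"]
        fields.append(f"{field}^{boost}")

    return {
        "query": {
            "multi_match": {
                "type": "cross_fields",
                "query": query_string,
                "operator": params["operator"],
                # The notebook passes a bare number; the query DSL expects e.g. "50%".
                "minimum_should_match": f"{params['minimum_should_match']}%",
                "tie_breaker": params["tie_breaker"],
                "fields": fields,
            }
        }
    }


example_query = build_cross_fields_query(
    "how do analyzers work",
    {
        "operator": "OR",
        "minimum_should_match": 50,
        "tie_breaker": 0.0,
        "url|boost": 1.0,
        "title|boost": 1.0,
        "body|boost": 1.0,
    },
)
```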

In [9]:
%%time

multi_match_cross_fields('msmarco-document.defaults')

Evaluation with: MRR@100
Score: 0.2475
CPU times: user 2.29 s, sys: 732 ms, total: 3.02 s
Wall time: 4min 10s


In [10]:
%%time

multi_match_cross_fields('msmarco-document')

Evaluation with: MRR@100
Score: 0.2683
CPU times: user 2.13 s, sys: 709 ms, total: 2.84 s
Wall time: 4min 26s


### Query: `multi_match` `best_fields`

In [11]:
def multi_match_best_fields(index):
    evaluate_mrr100_dev(es, max_concurrent_searches, index,
        template_id='best_fields',
        params={
            'tie_breaker': 0.0,
            'url|boost': 1.0,
            'title|boost': 1.0,
            'body|boost': 1.0,
        })

In [12]:
%%time

multi_match_best_fields('msmarco-document.defaults')

Evaluation with: MRR@100
Score: 0.2714
CPU times: user 2.16 s, sys: 731 ms, total: 2.89 s
Wall time: 4min 9s


In [13]:
%%time

multi_match_best_fields('msmarco-document')

Evaluation with: MRR@100
Score: 0.2873
CPU times: user 2.14 s, sys: 641 ms, total: 2.78 s
Wall time: 4min 27s


## Conclusion

As you can see, there's a measurable and consistent improvement in MRR@100 across all three query types with just some minor changes to the default analyzers. All notebooks that follow use the custom analyzers, including for their baseline measurements.
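
To put the scores from the runs above side by side, the following small sketch computes the absolute and relative change per query type. The numbers are copied from the cell outputs in this notebook; the `scores` and `improvements` names are just for illustration.

```python
# (default index score, custom index score) as reported in the cells above
scores = {
    "combined_matches": (0.2403, 0.2504),
    "cross_fields": (0.2475, 0.2683),
    "best_fields": (0.2714, 0.2873),
}

improvements = {}
for name, (default, custom) in scores.items():
    absolute = custom - default
    improvements[name] = {
        "absolute": absolute,
        "relative": absolute / default,
    }
    print(f"{name:<18} {default:.4f} -> {custom:.4f} "
          f"({absolute:+.4f}, {absolute / default:+.1%})")
```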