<a href="https://colab.research.google.com/github/dmcguire81/metapy/blob/task%2Fgoogle_colab/tutorials/sigir18-topic-models/sigir18-retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%%capture
# NOTE: this assumes you've uploaded a Python 3.7 build from our fork to Drive
# TODO: replace this with a stock install when it's published somewhere
%pip install /content/drive/MyDrive/metapy-0.2.13-cp37-cp37m-manylinux_2_24_x86_64.whl

# Exercise 1: Pseudo-feedback with Two-component Mixture Model

First, let's import the Python bindings for MeTA:

In [3]:
import metapy

If you don't have `metapy` installed, you can install it by running

```bash
pip install metapy
```

on the command line on Linux, macOS, or Windows, for either Python 2.7 or Python 3.x. (This notebook uses the Python 3.7 build installed in the cell above.)

Double-check which version you are running. With the wheel installed above, this should report `0.2.13`.

In [4]:
metapy.__version__

'0.2.13'

Now, let's set MeTA to log to standard error so we can see progress output for long-running commands. (Only do this once, or you'll get double the output.)

In [5]:
metapy.log_to_stderr()

Now, let's download all of the files we need for the tutorial.

In [6]:
import urllib.request
import os
import tarfile

if not os.path.exists('sigir18-tutorial.tar.gz'):
    urllib.request.urlretrieve('https://meta-toolkit.org/data/2018-06-25/sigir18-tutorial.tar.gz',
                               'sigir18-tutorial.tar.gz')
    
if not os.path.exists('data'):
    with tarfile.open('sigir18-tutorial.tar.gz', 'r:gz') as files:
        files.extractall()

Let's index our data using the `InvertedIndex` format. In a search engine, we want to quickly determine which documents mention a specific query term, so the `InvertedIndex` stores a mapping from each term to the list of documents that contain it (along with how many times the term occurs in each).

In [7]:
inv_idx = metapy.index.make_inverted_index('cranfield.toml')

1669328692: [info]     Loading index from disk: cranfield-idx/inv (/metapy/deps/meta/src/index/inverted_index.cpp:171)


This may take a minute the first time, since the index needs to be built. Subsequent calls to `make_inverted_index` with this config file will simply load the existing index from disk, which is much faster.
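To make the term-to-documents mapping concrete, here is a toy sketch in plain Python. It is not how MeTA stores its index on disk; the function name `build_inverted_index` and the three example documents are made up for illustration.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: count} (hypothetical helper, not part of metapy)."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Count how often each whitespace-separated term occurs in this document
        for term, count in Counter(text.split()).items():
            index[term][doc_id] = count
    return index

toy_docs = [
    "gas flow in equilibrium flow",
    "wing lift and drag",
    "shock wave flow",
]
toy_index = build_inverted_index(toy_docs)

# Looking up a query term gives the documents that contain it, without scanning every document
print(toy_index["flow"])
```

The lookup touches only the postings for `"flow"`, which is why this layout suits query-time retrieval.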

Here's how we can interact with the index object:

In [8]:
inv_idx.num_docs()

1400

In [9]:
inv_idx.unique_terms()

4137

In [10]:
inv_idx.avg_doc_length()

87.17857360839844

In [11]:
inv_idx.total_corpus_terms()

122050
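These statistics are related: the average document length is the total number of term occurrences divided by the number of documents. Here is a quick sanity check using the values printed above. The small difference in the trailing digits is presumably because MeTA reports a single-precision value, which is an assumption on my part.

```python
# Values copied from the cell outputs above
num_docs = 1400
total_corpus_terms = 122050
reported_avg_doc_length = 87.17857360839844

computed = total_corpus_terms / num_docs
# The two values should agree up to floating-point precision
print(abs(computed - reported_avg_doc_length) < 1e-5)
```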

Let's search our index. We'll start by creating a ranker:

In [12]:
ranker = metapy.index.DirichletPrior()
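For intuition about what a Dirichlet prior ranker does, here is a minimal sketch of the textbook Dirichlet-smoothed query likelihood on toy data. It is not MeTA's implementation, and its scores are on a different scale from the ones MeTA reports. The function name `dirichlet_score`, the toy documents, and the value `mu=2000.0` are all assumptions for illustration.

```python
import math
from collections import Counter

def dirichlet_score(query_terms, doc_terms, collection_counts, collection_length, mu=2000.0):
    """Log query likelihood with Dirichlet prior smoothing (hypothetical helper)."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Background (collection) probability of the term
        p_coll = collection_counts.get(term, 0) / collection_length
        if p_coll == 0:
            continue  # ignore query terms that never occur in the collection
        # Smoothed document probability: pseudo-counts mu * p_coll are added to the document
        p_doc = (doc_counts[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_doc)
    return score

doc_a = "gas flow equilibrium flow".split()
doc_b = "wing lift drag pressure".split()
collection_counts = Counter(doc_a + doc_b)
collection_length = len(doc_a) + len(doc_b)

toy_query = ["flow", "equilibrium"]
score_a = dirichlet_score(toy_query, doc_a, collection_counts, collection_length)
score_b = dirichlet_score(toy_query, doc_b, collection_counts, collection_length)
print(score_a > score_b)  # the document that contains the query terms ranks higher
```

Larger values of `mu` pull every document's model closer to the collection model, so term matches count for less.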

Now we need a query. Let's create an example one.

In [13]:
query = metapy.index.Document()
query.content("flow equilibrium")

Now we can use this to search our index like so:

In [14]:
top_docs = ranker.score(inv_idx, query, num_results=5)
top_docs

[(235, 1.2931444644927979),
 (1251, 1.256299614906311),
 (316, 1.1081531047821045),
 (655, 1.0878994464874268),
 (574, 1.076568841934204)]

We are returned a ranked list of *(doc_id, score)* pairs. The scores come from the ranker, which in this case is `DirichletPrior`. Since the `cranfield.toml` config file for the cranfield dataset has `store-full-text = true`, we can verify the content of our top documents by inspecting the document metadata field "content".

In [15]:
for num, (d_id, _) in enumerate(top_docs):
    content = inv_idx.metadata(d_id).get('content')
    print("{}. {}...\n".format(num + 1, content[0:250]))

1. criteria for thermodynamic equilibrium in gas flow . when gases flow at high velocity, the rates of internal processes may not be fast enough to maintain thermodynamic equilibrium .  by defining quasi-equilibrium in flow as the condition in which the...

2. on the approach to chemical and vibrational equilibrium behind a strong normal shock wave . the concurrent approach to chemical and vibrational equilibrium of a pure diatomic gas passing through a strong normal shock wave is investigated .  it is dem...

3. non-equilibrium flow of an ideal dissociating gas . the theory of an'ideal dissociating'gas developed by lighthill/1957/for conditions of thermodynamic equilibrium is extended to non-equilibrium conditions by postulating a simple rate equation for th...

4. departure from dissociation equilibrium in a hypersonic nozzle . the equations of motion for the flow of an ideal dissociating gas through a nearly conical nozzle have been solved numerically, assuming a simple equation for

Since we have the queries file and relevance judgements, we can do an IR evaluation.

In [16]:
ev = metapy.index.IREval('cranfield.toml')

We will loop over the queries file and add each result to the `IREval` object `ev`.

In [17]:
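def evaluate_ranker(ranker, ev, num_results):
    """Score every Cranfield query with `ranker` and record its average precision in `ev`."""
    ev.reset_stats()  # discard scores accumulated by any earlier run
    query = metapy.index.Document()  # local query object, so we don't rely on the global one
    with open('data/cranfield/cranfield-queries.txt') as query_file:
        for query_num, line in enumerate(query_file):
            query.content(line.strip())
            results = ranker.score(inv_idx, query, num_results)
            # Query IDs passed to IREval start at 1 here, hence query_num + 1
            avg_p = ev.avg_p(results, query_num + 1, num_results)
            print("Query {} average precision: {}".format(query_num + 1, avg_p))

evaluate_ranker(ranker, ev, 10)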

Query 1 average precision: 0.19
Query 2 average precision: 0.5433333333333332
Query 3 average precision: 0.6541666666666666
Query 4 average precision: 0.5
Query 5 average precision: 0.35
Query 6 average precision: 0.0625
Query 7 average precision: 0.10666666666666666
Query 8 average precision: 0.0
Query 9 average precision: 0.6984126984126983
Query 10 average precision: 0.0625
Query 11 average precision: 0.028571428571428574
Query 12 average precision: 0.18
Query 13 average precision: 0.0
Query 14 average precision: 0.5
Query 15 average precision: 0.7
Query 16 average precision: 0.08333333333333333
Query 17 average precision: 0.07142857142857142
Query 18 average precision: 0.3333333333333333
Query 19 average precision: 0.0
Query 20 average precision: 0.2685185185185185
Query 21 average precision: 0.0
Query 22 average precision: 0.0
Query 23 average precision: 0.04722222222222222
Query 24 average precision: 0.3333333333333333
Query 25 average precision: 0.6507936507936507
Query 26 avera

Afterwards, we can get the mean average precision of all the queries.

In [18]:
dp_map = ev.map()
print("MAP: {}".format(dp_map))

MAP: 0.21512203955656342
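To make these numbers less opaque, here is a minimal sketch of average precision at a cutoff and of MAP, using one common definition that normalizes by the smaller of the cutoff and the number of relevant documents. `IREval` may differ in detail; the helper names and the toy ranking are assumptions for illustration.

```python
def average_precision(ranked_doc_ids, relevant, k):
    """Average precision over the top k results (hypothetical helper, not metapy's IREval)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only at ranks where a relevant document appears
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant))

def mean_average_precision(per_query_ap):
    """MAP is the arithmetic mean of the per-query average precisions."""
    return sum(per_query_ap) / len(per_query_ap)

# Toy example: relevant documents appear at ranks 1 and 3; a third relevant document is never retrieved
toy_ap = average_precision([3, 7, 1, 9, 4], relevant={3, 1, 8}, k=5)
toy_map = mean_average_precision([toy_ap, 0.0])
print(toy_ap, toy_map)
```

Here the precision values at the two hits are 1/1 and 2/3, and the sum is divided by 3 because there are three relevant documents.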


Now, let's use the two-component mixture model we discussed as an implementation of pseudo-feedback for retrieval and see if it helps improve performance. The actual ranking function used here is KL-divergence, where the query model is adjusted to include pseudo-feedback from the retrieved documents.

In order to work, the ranker needs to be able to quickly determine what words were used in the feedback document set. The `InvertedIndex` does not provide fast access to this (since it is a mapping from term to documents, rather than from documents to terms), so we will want to first create a `ForwardIndex` to get the document -> terms mapping.

In [19]:
fwd_idx = metapy.index.make_forward_index('cranfield.toml')

1669328692: [info]     Loading index from disk: cranfield-idx/fwd (/metapy/deps/meta/src/index/forward_index.cpp:171)
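As a toy illustration of why the document-to-terms direction helps here, the sketch below builds a forward index as a plain dictionary and pools the term counts of a feedback set. The helper names and documents are hypothetical and do not reflect MeTA's on-disk format.

```python
from collections import Counter

def build_forward_index(docs):
    """Map each doc_id to its term counts (hypothetical helper, not part of metapy)."""
    return {doc_id: Counter(text.split()) for doc_id, text in enumerate(docs)}

def feedback_term_counts(forward_index, doc_ids):
    """Pool the term counts of the documents in the feedback set."""
    pooled = Counter()
    for doc_id in doc_ids:
        pooled.update(forward_index[doc_id])  # one direct lookup per feedback document
    return pooled

toy_forward = build_forward_index([
    "gas flow in equilibrium flow",
    "wing lift and drag",
    "shock wave flow",
])
pooled = feedback_term_counts(toy_forward, doc_ids=[0, 2])
print(pooled["flow"])
```

With only an inverted index, collecting the same counts would mean scanning the postings of every term in the vocabulary.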


Now we can construct the KL-divergence pseudo-feedback ranker. The main components are:
1. The forward index
2. A base language-model ranker (here we'll use `DirichletPrior`)
3. $\alpha$, the query interpolation parameter (how strongly do we prefer terms from the feedback model? default 0.5)
4. $\lambda$, the language-model interpolation parameter (how strong is the background model in the two-component mixture? default 0.5)
5. $k$, the number of documents to retrieve for the feedback set (default 10)
6. `max_terms`, the number of terms from the feedback model to incorporate into the new query model (default 50) 

In [20]:
feedback = metapy.index.KLDivergencePRF(fwd_idx, metapy.index.DirichletPrior())
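To show what $\lambda$, $\alpha$, and `max_terms` control, here is a minimal sketch of the two-component mixture model fitted with EM, followed by interpolation with the original query model. It is a plain-Python illustration under my own assumptions (function names, toy counts, a fixed iteration count) and not the code `KLDivergencePRF` runs.

```python
def estimate_feedback_model(feedback_counts, background, lam=0.5, iterations=50):
    """EM for a mixture of a feedback topic model and a fixed background model.

    lam is the weight of the background model (hypothetical helper, toy sketch).
    """
    total = sum(feedback_counts.values())
    # Start from the maximum-likelihood estimate over the feedback documents
    theta = {w: c / total for w, c in feedback_counts.items()}
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the topic model
        z = {
            w: (1 - lam) * theta[w] / ((1 - lam) * theta[w] + lam * background[w])
            for w in feedback_counts
        }
        # M-step: re-estimate the topic model from the counts attributed to it
        norm = sum(feedback_counts[w] * z[w] for w in feedback_counts)
        theta = {w: feedback_counts[w] * z[w] / norm for w in feedback_counts}
    return theta

def interpolate_query(query_model, feedback_model, alpha=0.5, max_terms=50):
    """Mix the original query model with the top feedback terms (weight alpha on feedback)."""
    top = sorted(feedback_model.items(), key=lambda kv: kv[1], reverse=True)[:max_terms]
    top_total = sum(p for _, p in top)
    new_model = {w: (1 - alpha) * p for w, p in query_model.items()}
    for w, p in top:
        # Renormalize the truncated feedback model so the result still sums to 1
        new_model[w] = new_model.get(w, 0.0) + alpha * p / top_total
    return new_model

# Toy feedback set: "the" is frequent, but it is also very likely under the background model
toy_counts = {"the": 20, "flow": 8, "equilibrium": 6, "gas": 4}
toy_background = {"the": 0.5, "flow": 0.02, "equilibrium": 0.01, "gas": 0.02}

theta = estimate_feedback_model(toy_counts, toy_background, lam=0.5)
new_query = interpolate_query({"flow": 0.5, "equilibrium": 0.5}, theta, alpha=0.5, max_terms=3)
print(sorted(new_query.items(), key=lambda kv: kv[1], reverse=True))
```

The background component absorbs much of the mass of common words such as "the", so the feedback model puts relatively more weight on topical terms than raw counts would. A larger `lam` strengthens this effect, and a larger `alpha` moves the final query model further from the original query.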

In [21]:
evaluate_ranker(feedback, ev, 10)

Query 1 average precision: 0.13999999999999999
Query 2 average precision: 0.524047619047619
Query 3 average precision: 0.6642857142857143
Query 4 average precision: 0.5
Query 5 average precision: 0.6875
Query 6 average precision: 0.0625
Query 7 average precision: 0.11666666666666665
Query 8 average precision: 0.0
Query 9 average precision: 0.49999999999999994
Query 10 average precision: 0.0625
Query 11 average precision: 0.023809523809523808
Query 12 average precision: 0.15714285714285714
Query 13 average precision: 0.0
Query 14 average precision: 0.5
Query 15 average precision: 0.6428571428571428
Query 16 average precision: 0.05555555555555555
Query 17 average precision: 0.05
Query 18 average precision: 0.16666666666666666
Query 19 average precision: 0.0
Query 20 average precision: 0.33201058201058203
Query 21 average precision: 0.0
Query 22 average precision: 0.0
Query 23 average precision: 0.075
Query 24 average precision: 0.38888888888888884
Query 25 average precision: 0.7530864197

In [22]:
fb_map = ev.map()
print("Feedback MAP: {}".format(fb_map))
print("DP MAP: {}".format(dp_map))

Feedback MAP: 0.22816526133086987
DP MAP: 0.21512203955656342
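A quick way to read these two numbers is as a relative change. The snippet below just does the arithmetic on the values printed above; it says nothing about statistical significance.

```python
# Values copied from the output above
fb_map_value = 0.22816526133086987
dp_map_value = 0.21512203955656342

relative_gain = (fb_map_value - dp_map_value) / dp_map_value
print("Relative MAP improvement: {:.1%}".format(relative_gain))
```

On this run, pseudo-feedback improves MAP by roughly 6% relative to the plain `DirichletPrior` ranker, although the per-query outputs show that individual queries can get worse as well as better.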
