# Boolean retrieval

This notebook shows an example of boolean retrieval on the <code>country_dataset</code>. The dataset consists of a set of queries of the form:
<pre>
<code>
  c: [d_0, d_1, ..., d_n]  
</code>
</pre>
where <code>c</code> is the name of a country and <code>[d_0, d_1, ..., d_n]</code> is the list of the document ids that are relevant to <code>c</code>. Document texts are given in the <code>docs</code> list. Document ids correspond to the position of documents in the <code>docs</code> list.
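
As a rough illustration of this layout, the sketch below builds a miniature dataset with the same two keys used later in the notebook (<code>docs</code> and <code>queries</code>). The document texts, the ids, and the exact JSON layout are made up for the example and are only an assumption about what the real file looks like.

```python
import json

# Hypothetical miniature of the dataset; texts and ids are invented.
raw = '''
{
  "docs": [
    "Rome is the capital of Italy",
    "Paris is the capital of France",
    "Italy borders France"
  ],
  "queries": {"Italy": [0, 2], "France": [1, 2]}
}
'''
toy = json.loads(raw)

# Each id in a query's list is a position in the docs list.
for country, relevant_ids in toy["queries"].items():
    print(country, [toy["docs"][i] for i in relevant_ids])
```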

In [1]:
import json

In [2]:
dataset_file = '../data/country_dataset.json'
with open(dataset_file, 'r') as infile:
    dataset = json.load(infile)

In [3]:
docs = dataset['docs']
queries = dataset['queries']

## Get a tokenizer

In [4]:
from nltk.tokenize import TweetTokenizer

In [5]:
tk = TweetTokenizer()

## Boolean indexing

In [6]:
from collections import defaultdict

In [7]:
I = defaultdict(set)

In [8]:
for i, doc in enumerate(docs):
    for token in tk.tokenize(doc):
        I[token.lower()].add(i)
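
To make the structure of the index concrete, here is a minimal sketch on three made-up documents. It assumes plain whitespace tokenization (<code>str.split</code>) in place of <code>TweetTokenizer</code>, and the names <code>toy_docs</code> and <code>toy_index</code> are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy corpus; document ids are the positions in the list.
toy_docs = [
    "Rome is the capital of Italy",
    "Paris is the capital of France",
    "Italy borders France",
]

# Map each lowercased token to the set of ids of documents containing it.
toy_index = defaultdict(set)
for doc_id, text in enumerate(toy_docs):
    # Assumption: whitespace tokenization instead of TweetTokenizer.
    for token in text.split():
        toy_index[token.lower()].add(doc_id)

print(sorted(toy_index["italy"]))    # [0, 2]
print(sorted(toy_index["capital"]))  # [0, 1]
```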

## Query processing
Using set union, we implement <code>OR</code> boolean queries.

In [9]:
q = list(queries.keys())[8]

In [14]:
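def or_query(q, index):
    # Tokenize and lowercase the query the same way the documents were indexed.
    qs = [t.lower() for t in tk.tokenize(q)]
    answers = set()
    for s in qs:
        # .get avoids adding empty entries to the defaultdict for unseen terms.
        answers = answers.union(index.get(s, set()))
    return answers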
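
The behaviour of an <code>OR</code> query can be checked by hand on a tiny index. The sketch below is a simplified variant, not the notebook's function: it assumes whitespace tokenization, and <code>or_query_sketch</code> and <code>toy_index</code> are hypothetical names.

```python
def or_query_sketch(query, index):
    """Return the ids of documents containing at least one query term."""
    answers = set()
    # Assumption: whitespace tokenization instead of TweetTokenizer.
    for term in query.lower().split():
        # Unknown terms contribute an empty posting set.
        answers |= index.get(term, set())
    return answers


# Hypothetical hand-built index: term -> set of document ids.
toy_index = {
    "italy": {0, 2},
    "france": {1, 2},
    "capital": {0, 1},
}

print(sorted(or_query_sketch("Italy France", toy_index)))  # [0, 1, 2]
print(sorted(or_query_sketch("Spain", toy_index)))         # []
```

Because the result is a union, adding terms to the query can only grow the answer set.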

### Exercise: implement <code>AND</code> boolean queries
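
One possible starting point, offered as a sketch rather than the intended solution: replace set union with set intersection. The example assumes whitespace tokenization instead of <code>TweetTokenizer</code>, and <code>and_query_sketch</code> and <code>toy_index</code> are hypothetical names.

```python
def and_query_sketch(query, index):
    """Return the ids of documents containing every query term."""
    # Assumption: whitespace tokenization instead of TweetTokenizer.
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    answers = set(index.get(terms[0], set()))
    for term in terms[1:]:
        answers &= index.get(term, set())
    return answers


# Hypothetical hand-built index: term -> set of document ids.
toy_index = {
    "italy": {0, 2},
    "france": {1, 2},
    "capital": {0, 1},
}

print(sorted(and_query_sketch("Italy France", toy_index)))  # [2]
print(sorted(and_query_sketch("Italy Spain", toy_index)))   # []
```

Note that a single unknown term empties the whole answer, which is the main behavioural difference from the <code>OR</code> case.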

## Precision and recall

In [18]:
import numpy as np
import pandas as pd

In [19]:
outcome = defaultdict(lambda: {'precision': 0, 'recall': 0})
for query, expected in queries.items():
    retrieved = or_query(query, I)
    # Relevant documents that were actually retrieved.
    hits = len(set(expected).intersection(retrieved))
    try:
        p = hits / len(retrieved)
    except ZeroDivisionError:
        # Precision is undefined when nothing is retrieved.
        p = np.nan
    r = hits / len(expected)
    outcome[query]['precision'] = p
    outcome[query]['recall'] = r
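
The two measures computed above are precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved). A small worked example, with made-up ids and a hypothetical helper named <code>precision_recall</code>, may help to check the arithmetic:

```python
import math


def precision_recall(expected, retrieved):
    """Compute (precision, recall) for one query from two id collections."""
    expected, retrieved = set(expected), set(retrieved)
    hits = len(expected & retrieved)
    # Each measure is undefined (NaN) when its denominator is empty.
    precision = hits / len(retrieved) if retrieved else math.nan
    recall = hits / len(expected) if expected else math.nan
    return precision, recall


# Made-up ids: 4 relevant documents, 3 retrieved, 2 in common.
p, r = precision_recall([0, 1, 2, 3], {1, 2, 7})
print(round(p, 3), r)  # 0.667 0.5
```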

In [22]:
O = pd.DataFrame(outcome).T

In [23]:
O.head()

| query | precision | recall |
|---|---|---|
| India | 1.0 | 0.869565 |
| Slovenia | 1.0 | 0.75 |
| Canada | 1.0 | 0.615385 |
| Tanzania | 1.0 | 1.0 |
| Indonesia | 1.0 | 0.75 |


In [24]:
O.mean()

precision    0.814421
recall       0.720115
dtype: float64