Implementing a positional inverted index for phrase queries
===========================================================

In this week's exercise we look at enhanced inverted indices.
The second form we look at is the positional index (cf. section 2.4.2 of the book).

In this notebook we provide you with a mostly complete implementation of
boolean query processing.

Your task is to complete `add_document()` to build a positional index and
to complete `positional_intersect_two()` to be able to process phrase
queries on the positional index.

A phrase query is a query such as "Romans countrymen" which should only return
documents that contain this exact phrase.
We updated the provided parser to support phrases in arbitrary boolean
queries.

Any phrase needs to be enclosed in double quotes for the query parser to be
able to detect it.
The output format for a phrase is a list of the individual words that make up
the phrase, in order.

Below we show a few example queries containing phrases:

In [1]:
import sys
sys.path.append("../../")

In [2]:
from queryparser import parse_query, process_ast
process_ast(parse_query('"a wit by folly vanquished"'))
process_ast(parse_query('wit AND NOT "else a wit" AND sonnet'))

AND: args=['wit', NOT: args=[['else', 'a', 'wit']], 'sonnet']

We keep the imports and global variables in a separate cell, so we can rerun
the code cell without losing the contents of the index.

In [3]:
from queryparser import parse_query, ParseException, process_ast, Operation
import glob
import itertools
from textutils import tokenize_document
# global variables defining the index
documents = dict()
the_index = dict()
docid_counter = 1
# the path to the corpus
corpuspath="../../shared/corpus/*.txt"

You can use the following function to remove stop words from your flattened queries:

In [4]:
from textutils import stop_words
def remove_stop_words(ast):
    new_args = []
    for a in ast.args:
        if isinstance(a, list):
            new_args.append([x for x in a if x not in stop_words])
        elif isinstance(a, Operation):
            new_args.append(remove_stop_words(a))
        elif a[0] == '-' and a[1:] in stop_words:
            pass
        elif a in stop_words:
            pass
        else:
            new_args.append(a)
    ast.args = new_args
    return ast

Complete the implementation of `add_document`: the structure of `the_index` should be a `dict` that maps each word to a list of `(document_id, [list of positions])` tuples, ordered by document ID.

In [5]:
def add_document(doc):
    '''
    Add a document to the inverted index. Returns the document's ID
    '''
    global documents, docid_counter, the_index

    # do not re-add the same document: documents maps ID -> path, so
    # look up the existing ID by value.
    for known_id, known_doc in documents.items():
        if known_doc == doc:
            return known_id
    docid = docid_counter
    documents[docid] = doc
    docid_counter += 1
    print("Adding document %s to inverted index with document ID %d" % (doc, docid))
    for pos, word in enumerate(tokenize_document(doc)):
        docs = the_index.setdefault(word, [])
        if len(docs) > 0 and docs[-1][0] == docid:
            docs[-1][1].append(pos)
        else:
            docs.append((docid, [pos]))
    return docid
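
To make the target structure concrete, here is a minimal sketch that builds the same kind of positional index for a tiny, made-up corpus. The names `toy_docs` and `toy_index` are hypothetical and the documents are assumed to be tokenized already; the sketch does not touch `the_index`.

```python
# Hypothetical toy corpus: document ID -> already-tokenized text.
toy_docs = {
    1: ["friends", "romans", "countrymen"],
    2: ["romans", "friends", "romans"],
}

toy_index = {}
for docid, tokens in toy_docs.items():
    for pos, word in enumerate(tokens):
        postings = toy_index.setdefault(word, [])
        # Documents are processed in ascending ID order, so only the last
        # posting can belong to the current document.
        if postings and postings[-1][0] == docid:
            postings[-1][1].append(pos)
        else:
            postings.append((docid, [pos]))

print(toy_index["romans"])   # [(1, [1]), (2, [0, 2])]
print(toy_index["friends"])  # [(1, [0]), (2, [1])]
```

Each word maps to one `(document_id, positions)` entry per document, and the positions within an entry are ascending because tokens are visited in order.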

In [6]:
for f in glob.glob(corpuspath):
    add_document(f)

Adding document ../../shared/corpus\alls_well_that_ends_well.txt to inverted index with document ID 1
Adding document ../../shared/corpus\as_you_like_it.txt to inverted index with document ID 2
Adding document ../../shared/corpus\a_lovers_complaint.txt to inverted index with document ID 3
Adding document ../../shared/corpus\a_midsummer_nights_dream.txt to inverted index with document ID 4
Adding document ../../shared/corpus\cymbeline.txt to inverted index with document ID 5
Adding document ../../shared/corpus\king_henry_the_eighth.txt to inverted index with document ID 6
Adding document ../../shared/corpus\king_john.txt to inverted index with document ID 7
Adding document ../../shared/corpus\king_richard_the_second.txt to inverted index with document ID 8
Adding document ../../shared/corpus\king_richard_the_third.txt to inverted index with document ID 9
Adding document ../../shared/corpus\loves_labours_lost.txt to inverted index with document ID 10
Adding document ../../shared/corpus\m

Implement the intersection of two posting lists with positional information (refer to the book). Be careful of the structure of the returned value: some of the code provided expects a list of tuples `(document_id, position_of_p1, position_of_p2)`.

In [7]:
def positional_intersect_two(p1, p2, k):
    '''
    Intersect two posting lists according to pseudo-code in Introduction
    to Information Retrieval, Figure 2.12
    k is the max distance between the two words
    Returns a list of tuples (document_id, position_of_p1, position_of_p2)
    '''
    answer = []
    while p1 != [] and p2 != []:
        if p1[0][0] == p2[0][0]:
            l = []
            pos1 = list(p1[0][1])
            pos2 = list(p2[0][1])
            while pos1 != []:
                while pos2 != []:
                    if abs(pos1[0] - pos2[0]) <= k:
                        l.append(pos2[0])
                    elif pos2[0] > pos1[0]:
                        break
                    pos2 = pos2[1:]
                while l != [] and abs(l[0] - pos1[0]) > k:
                    l = l[1:]
                for ps in l:
                    answer.append((p1[0][0], pos1[0], ps))
                pos1 = pos1[1:]
            p1 = p1[1:]
            p2 = p2[1:]
        elif p1[0][0] < p2[0][0]:
            p1 = p1[1:]
        else:
            p2 = p2[1:]
    return answer
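
The expected return format can be checked on a toy example. The sketch below is a simplified stand-in, not the algorithm of Figure 2.12: it compares every pair of positions in a shared document instead of using the sliding window, which is only reasonable for very short position lists. The function name `toy_positional_intersect` and the posting lists are hypothetical.

```python
def toy_positional_intersect(p1, p2, k):
    """Return (docid, pos1, pos2) for all position pairs at most k apart.

    Brute-force illustration: every position pair in a shared document
    is compared, unlike the windowed merge of Figure 2.12.
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        doc1, positions1 = p1[i]
        doc2, positions2 = p2[j]
        if doc1 == doc2:
            for a in positions1:
                for b in positions2:
                    if abs(a - b) <= k:
                        answer.append((doc1, a, b))
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return answer


# Hypothetical posting lists in the (document_id, [positions]) format.
friends = [(1, [0]), (2, [1])]
romans = [(1, [1]), (2, [0, 2])]

print(toy_positional_intersect(friends, romans, 1))
# [(1, 0, 1), (2, 0 + 1, 0), (2, 1, 2)]
```

Note the tuple `(2, 1, 0)`: because the distance test uses `abs`, a match is reported when the second word occurs directly before the first one as well as directly after it.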

The following code performs the intersection for a phrase query.

In [8]:
def positional_intersect(words):
    '''
    Positionally intersect posting lists for a list of words
    '''
    assert(len(words) > 0)
    postings = [the_index[t] for t in words]
    # a single-word phrase is answered directly from its posting list
    result = postings[-1]
    while len(postings) >= 2:
        result = [(docid, [x[1] for x in g]) for docid, g
                in itertools.groupby(positional_intersect_two(postings[-2], postings[-1], 1), key=lambda x: x[0])]
        postings[-2:] = [result,]
    return [x[0] for x in result]
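
The `itertools.groupby` step turns the flat list of match tuples back into the posting-list format, so the result can be intersected with the next word to the left. A small sketch with made-up match tuples (the name `matches` and its values are hypothetical):

```python
import itertools

# Hypothetical output of a positional intersection, sorted by document ID:
# (document_id, position_of_p1, position_of_p2)
matches = [(2, 5, 6), (2, 9, 10), (7, 3, 4)]

# Keep only the position of the left-hand word and regroup by document.
regrouped = [
    (docid, [m[1] for m in group])
    for docid, group in itertools.groupby(matches, key=lambda m: m[0])
]

print(regrouped)  # [(2, [5, 9]), (7, [3])]
```

`groupby` only merges adjacent items with equal keys, so this relies on the match tuples arriving ordered by document ID, which holds when the input posting lists are ordered that way.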

Test your code with the following test cases:

In [9]:
# changed the order compared to the online version, as Windows sorts files lexicographically.
assert(positional_intersect(['wit', 'folly', 'vanquished']) == [40])
assert(positional_intersect(['send', 'some']) == [9, 14, 15, 26, 40])

The following code has the boolean query features you implemented last week (and some more). It will invoke the `positional_intersect` function for phrase terms in the query.

In [10]:
def intersect_two(p1, p2):
    '''
    Intersect two posting lists according to pseudo-code in Introduction
    to Information Retrieval, Figure 1.6
    '''
    answer = []
    while p1 != [] and p2 != []:
        if p1[0] == p2[0]:
            answer.append(p1[0])
            p1 = p1[1:]
            p2 = p2[1:]
        elif p1[0] < p2[0]:
            p1 = p1[1:]
        else:
            p2 = p2[1:]
    return answer

def strip_positions(postings):
    return (x[0] for x in postings)

def negate(term):
    if term in the_index.keys():
        return sorted(set(documents.keys()) - set(strip_positions(the_index[term])))
    else:
        # The negation of a term that is not in the vocabulary matches
        # every document.
        return sorted(documents.keys())

def intersect(terms):
    '''
    Intersect posting lists for a list of terms
    Take into consideration phrase queries
    '''
    def process_term(t):
        if isinstance(t, list):
            return positional_intersect(t)
        elif isinstance(t, Operation) and t.op == 'NOT' and isinstance(t.args[0], list):
            return sorted(set(documents.keys()) - set(positional_intersect(t.args[0])))
        elif isinstance(t, str) and t.startswith('-'):
            return negate(t[1:])
        else:
            assert(isinstance(t, str))
            return list(strip_positions(the_index[t]))

    postings = [process_term(t) for t in terms]
    # calculate word frequencies and sort term,freq pairs in ascending
    # order by frequency
    freqs = sorted([ (p, len(p)) for p in postings ], key=lambda x: x[1] )
    result = freqs[0][0]
    del freqs[0]
    while freqs != [] and result != []:
        result = intersect_two(result, freqs[0][0])
        del freqs[0]
    return result
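
The idea behind sorting by length is that starting from the shortest posting list keeps every intermediate result small. The following sketch shows this on made-up document ID lists; `toy_intersect_two` and `toy_postings` are hypothetical names, and the merge uses indices rather than list slicing.

```python
def toy_intersect_two(p1, p2):
    """Merge-intersect two sorted lists of document IDs."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


# Hypothetical posting lists (document IDs only) for three query terms.
toy_postings = [[1, 2, 3, 5, 8, 13], [2, 8], [2, 3, 8, 21]]

# Process the shortest list first; stop early once nothing is left.
ordered = sorted(toy_postings, key=len)
result = ordered[0]
for postings in ordered[1:]:
    if not result:
        break
    result = toy_intersect_two(result, postings)

print(result)  # [2, 8]
```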

In [11]:
def execute_query(query):
    '''
    Execute a boolean query on the inverted index. We only support single
    operator queries ATM.  This method returns a list of document ids
    which satisfy the query in no particular order (i.e. the order in
    which the documents were added most likely :)).
    '''
    # We use a generated parser to transform the query from a string to an
    # AST.
    try:
        ast = parse_query(query)
    except ParseException as e:
        print("Failed to parse query '%s'\n" % query, e)
        return None

    # We preprocess the AST to flatten commutative operations, such as
    # sequences of ANDs. We also transform 'NOT <term>' arguments into
    # '-<term>' to allow smarter processing of AND NOT and OR NOT.
    flat = remove_stop_words(process_ast(ast))

    args = []
    # go through arguments and fall back on recursive evaluation if we
    # could not completely flatten the query
    for arg in flat.args:
        if isinstance(arg, Operation) and arg.op != 'NOT':
            print("Cannot handle query '%s', aborting..." % query)
            return None
        elif isinstance(arg, list):
            # Assume it's a phrase
            for w in arg:
                assert(isinstance(w, str))
            if any(w not in the_index.keys() for w in arg):
                print("NOTE: Dropping phrase '%s' because no document contains it" % ' '.join(arg))
            else:
                args.append(arg)
        elif isinstance(arg, str) and not arg.startswith('-') and arg not in the_index.keys():
            # Drop terms that don't occur in the vocabulary
            print("NOTE: Dropping term '%s' because no document contains it" % arg)
        else:
            args.append(arg)

    if flat.op == 'OR':
        results = set()
        for arg in args:
            if isinstance(arg, list):
                results = results | set(positional_intersect(arg))
            elif isinstance(arg, Operation) and arg.op == 'NOT' and isinstance(arg.args[0], list):
                print("Query '%s' not supported" % query)
                return None
            elif arg.startswith('-'):
                results = results | set(negate(arg[1:]))
            else:
                results = results | set(strip_positions(the_index[arg]))
        return sorted(results)

    elif flat.op == 'AND':
        return intersect(args)

    elif flat.op == 'NOT':
        # handle case where we query for the negation of a term not in the
        # vocabulary.
        if len(args) == 0:
            return list(documents.keys())
        else:
            assert(len(args) == 1)
            if isinstance(args[0], list):
                return sorted(set(documents.keys()) - set(positional_intersect(args[0])))
            else:
                return sorted(set(documents.keys()) - set(strip_positions(the_index[args[0]])))

    elif flat.op == 'LOOKUP':
        if len(args) == 0:
            # we drop terms that are not in the vocabulary in the
            # preprocessing loop, so handle the case where we query for a
            # single term that is not in the vocabulary.
            return []
        else:
            # in this case the query was a single term
            assert(len(args) == 1)
            if isinstance(args[0], list):
                return positional_intersect(args[0])
            else:
                return list(strip_positions(the_index[args[0]]))
    else:
        print("Cannot handle query '%s', aborting..." % query)
        return None
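
The NOT branches above all compute a complement against the set of known document IDs. A minimal sketch with made-up IDs (all names and values here are hypothetical) shows how a query of the form `term AND NOT "phrase"` reduces to set operations:

```python
# Hypothetical data: five indexed documents and two result lists.
all_doc_ids = {1, 2, 3, 4, 5}
term_docs = [1, 2, 4]   # documents containing the single term
phrase_docs = [2]       # documents containing the phrase

# NOT "phrase": every document that is not a phrase match.
not_phrase = sorted(all_doc_ids - set(phrase_docs))
print(not_phrase)  # [1, 3, 4, 5]

# term AND NOT "phrase": intersect the term's documents with the complement.
term_and_not_phrase = sorted(set(term_docs) & set(not_phrase))
print(term_and_not_phrase)  # [1, 4]
```

Sorting the complement keeps the result usable as input to a merge-based intersection, which expects ascending document IDs.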

In [12]:
def print_result(docs):
    '''
    Helper function to convert a list of document IDs back to file names
    '''
    if not docs:
        print("No documents found")
        print()
        return
    # If we got some results, print them
    for doc in docs:
        print('%d -> %s' % (doc, documents[doc]))
    print()

Test your code with the following queries:

In [13]:
print_result(execute_query('"a wit by folly vanquished"'))
print_result(execute_query('wit AND NOT "else a wit" AND sonnet'))
print_result(execute_query('"glooming peace" OR Titus'))

40 -> ../../shared/corpus\the_two_gentlemen_of_verona.txt

1 -> ../../shared/corpus\alls_well_that_ends_well.txt
10 -> ../../shared/corpus\loves_labours_lost.txt
12 -> ../../shared/corpus\much_ado_about_nothing.txt
18 -> ../../shared/corpus\the_life_of_king_henry_the_fifth.txt
43 -> ../../shared/corpus\twelfth_night_or_what_you_will.txt

19 -> ../../shared/corpus\the_life_of_timon_of_athens.txt
32 -> ../../shared/corpus\the_tragedy_of_coriolanus.txt
38 -> ../../shared/corpus\the_tragedy_of_romeo_and_juliet.txt
39 -> ../../shared/corpus\the_tragedy_of_titus_andronicus.txt
43 -> ../../shared/corpus\twelfth_night_or_what_you_will.txt

