# Wildcard queries and tolerant retrieval
Wildcard queries retrieve all the words that match a given pattern, which makes retrieval tolerant to variations in the form of the words in the query.

### Example:
When searching for <code>p*er</code>, we aim at matching both <code>paper</code> and <code>player</code>.

## Permuterm index
Append a special symbol such as <code>#</code> to each token to mark the end of the word.
Then, create a key in the permuterm index for each rotation of the resulting string, so that every key points back to the original term.

### Example:
- paper# $\rightarrow$ paper#, aper#p, per#pa, er#pap, r#pape, #paper

Given a wildcard query containing the symbol <code>\*</code>, we append <code>#</code> to it, rotate it so that <code>\*</code> is at the end of the string, and then search the permuterm index for all the keys that start with the resulting prefix (for example using a B-tree).

### Example:
- <code>p\*er#</code> $\rightarrow$ <code>er#p\*</code>
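
The rotation and the prefix lookup can be sketched in a few lines. This is a minimal illustration and not the notebook's own implementation: the toy vocabulary and the helper names `rotations`, `rotate_query` and `prefix_search` are assumptions, and a sorted list searched with `bisect` stands in for the B-tree.

```python
from bisect import bisect_left


def rotations(term):
    """All rotations of term + '#', i.e. the permuterm keys of the term."""
    t = term + "#"
    return [t[i:] + t[:i] for i in range(len(t))]


def rotate_query(pattern):
    """Rotate a pattern with exactly one '*' so that '*' is last; return the prefix."""
    head, tail = pattern.split("*")  # assumes exactly one wildcard
    return tail + "#" + head


# Toy vocabulary (hypothetical, for illustration only)
vocab = ["paper", "player", "power", "pepper", "plane"]

# A sorted list of (key, term) pairs stands in for the B-tree
index = sorted((key, term) for term in vocab for key in rotations(term))


def prefix_search(prefix):
    """Terms whose permuterm key starts with prefix, located by binary search."""
    start = bisect_left(index, (prefix,))
    matches = []
    for key, term in index[start:]:
        if not key.startswith(prefix):
            break  # keys sharing a prefix are contiguous in sorted order
        matches.append(term)
    return matches


result = prefix_search(rotate_query("p*er"))
print(sorted(result))
```

Because every key contains exactly one <code>#</code>, a prefix that includes <code>#</code> can match at most one rotation per term, so the result contains no duplicates.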

When working with multiple wildcards, e.g. <code>pl\*ie\*rs</code>, we first collapse everything between the first and the last wildcard into a single wildcard and run the resulting query as shown above; we then filter out all the terms that do not contain the remaining parts of the query.

### Example:
Search for <code>pl\*rs</code> and then filter out all terms that do not contain <code>ie</code> between the leading <code>pl</code> and the trailing <code>rs</code>.
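
The two steps can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the vocabulary is made up, `multi_wildcard` is a hypothetical helper, and the filtering step uses a regular expression built from the query parts, which is one possible way to check that the inner parts occur in order.

```python
import re


def rotations(term):
    """All rotations of term + '#', i.e. the permuterm keys of the term."""
    t = term + "#"
    return [t[i:] + t[:i] for i in range(len(t))]


# Toy vocabulary (hypothetical, for illustration only)
vocab = ["pliers", "players", "plumbers", "planters", "plates", "suppliers"]
permuterm_index = {key: term for term in vocab for key in rotations(term)}


def multi_wildcard(pattern):
    """Answer a query with one or more '*' using the permuterm index plus a filter."""
    parts = pattern.split("*")
    head, tail = parts[0], parts[-1]
    # Step 1: permuterm lookup for head*tail (a linear scan, to keep the sketch short)
    prefix = tail + "#" + head
    candidates = {t for k, t in permuterm_index.items() if k.startswith(prefix)}
    # Step 2: keep only the candidates that match the full pattern
    regex = re.compile(".*".join(re.escape(p) for p in parts))
    return sorted(t for t in candidates if regex.fullmatch(t))


print(multi_wildcard("pl*rs"))
print(multi_wildcard("pl*ie*rs"))
```

Matching the whole pattern, instead of only testing whether <code>ie</code> occurs somewhere in the term, also rejects terms where the inner part overlaps the leading or trailing part.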

# Permuterm index in the Cranfield dataset

In [12]:
import pymongo
from collections import defaultdict

In [13]:
I = pymongo.MongoClient()['inforet']['cran_tokens']

In [14]:
g = {'$group': {'_id': None, 'tokens': {'$addToSet': '$text'}}}
tokens = [x['tokens'] for x in I.aggregate([g])][0]

In [15]:
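def permuterm(token):
    """Return all rotations of token + '#', from '#token' to 'token#'."""
    t = token + '#'
    p = []
    # Move the last i characters to the front, for i = 1 .. len(t)
    for i in range(-1, -(len(t)+1), -1):
        p.append(t[i:] + t[:i])
    return p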

In [16]:
P = defaultdict(lambda: None)
for token in tokens:
    for p in permuterm(token):
        P[p] = token

In [28]:
# Pick the rotation of 'po*l#' that ends with '*' and drop the '*': the prefix 'l#po'
query = [x for x in permuterm('po*l') if x.endswith('*')][0][:-1]

In [29]:
# Linear scan over all keys; a sorted structure such as a B-tree would avoid visiting every key
for k, v in P.items():
    if k.startswith(query):
        print(k, v)

l#powerfu powerful
l#polyaxia polyaxial
l#potentia potential
l#powel powell
l#polynomia polynomial


## k-gram queries
A further strategy for wildcard queries is to index the k-grams of each token. For example, with 3-grams, we pad the term <code>play</code> as <code>#play#</code> and index it under <code>#pl</code>, <code>pla</code>, <code>lay</code>, and <code>ay#</code>.

With this index, we process the wildcard query <code>pl\*er</code> by running the boolean query <code>#pl AND er#</code> and then post-processing the resulting tokens to filter out those that satisfy the boolean query without matching the pattern.

In [30]:
from nltk.util import ngrams

In [44]:
def kgrams(token, k=2):
    return ["".join(x) for x in ngrams("#" + token + "#", k)]

In [46]:
K = defaultdict(set)
for token in tokens:
    for kg in kgrams(token, k=3):
        K[kg].add(token)

In [93]:
# Split the padded query at '*' and collect the 3-grams of each part;
# parts shorter than 3 characters contain no 3-gram and are skipped
query = []
for x in "#pl*ed#".split('*'):
    if len(x) >= 3:
        query += ["".join(y) for y in ngrams(x, 3)]

In [94]:
# Boolean AND: intersect the token sets of all the query 3-grams
r = K[query[0]]
for q in query[1:]:
    r = r.intersection(K[q])

In [95]:
r

{'placed', 'played', 'plied', 'plotted'}
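
The intersection only yields candidates, so the post-processing step is still needed in general. For instance, with 3-grams the query <code>red\*</code> becomes <code>#re AND red</code>, which is also satisfied by <code>retired</code>. The sketch below shows one way to add the filter. It rebuilds a small index on a made-up vocabulary so that it runs on its own, and `wildcard_search` is a hypothetical helper that uses `fnmatch.fnmatchcase` from the standard library for the final check.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def kgrams(token, k=3):
    """k-grams of the token padded with '#' on both sides."""
    padded = "#" + token + "#"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]


# Toy vocabulary (hypothetical, for illustration only)
vocab = ["red", "redo", "reduce", "retired", "bored"]

index = defaultdict(set)
for term in vocab:
    for gram in kgrams(term):
        index[gram].add(term)


def wildcard_search(pattern, k=3):
    """Return (candidates of the boolean query, candidates that match the pattern)."""
    grams = []
    for part in ("#" + pattern + "#").split("*"):
        # Parts shorter than k contribute no k-gram
        grams += [part[i:i + k] for i in range(len(part) - k + 1)]
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(vocab)  # no usable k-gram: every term is a candidate
    matches = {t for t in candidates if fnmatchcase(t, pattern)}
    return candidates, matches


candidates, matches = wildcard_search("red*")
print(sorted(candidates))
print(sorted(matches))
```

Note that `fnmatchcase` also gives a special meaning to `?` and `[...]`, so this filter assumes that the query contains no wildcard character other than `*`.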