> **NOTE:** This tutorial is deprecated because the "AP News 88" dataset it references can no longer be resolved.

This is the metapy tutorial on indexing and search. Before you begin, read the following tutorial:
- [Search Tutorial](https://meta-toolkit.org/search-tutorial.html). Read *Initially setting up the config file* and *Relevance judgements*.

First, let's create an index. We will use the AP News dataset. Your current directory should look like this:
- `apnews`: AP News 88 dataset in MeTA format.
- `queries.txt`: 100 queries, one per line.
- `qrels.txt`: Over 10,000 relevance judgements for the queries.
- `stopwords.txt`: A file containing stopwords that will not be indexed.
- `apnews-config.toml`: A config file with paths set to all the above files, including index and ranker settings.
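
Before building the index, it can help to confirm that the layout listed above is actually in place. The following is a minimal sketch that uses only the Python standard library; the helper name `missing_inputs` is hypothetical and is not part of metapy.

```python
from pathlib import Path

# File and directory names taken from the list above.
EXPECTED = (
    "apnews",
    "queries.txt",
    "qrels.txt",
    "stopwords.txt",
    "apnews-config.toml",
)


def missing_inputs(directory="."):
    """Return the expected names that do not exist under `directory`."""
    base = Path(directory)
    return [name for name in EXPECTED if not (base / name).exists()]


missing = missing_inputs(".")
if missing:
    print("Missing from the current directory:", ", ".join(missing))
else:
    print("All expected files are present.")
```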

Here's how we can use metapy to create the index.

In [1]:
import metapy
idx = metapy.index.make_inverted_index('apnews-config.toml')

This may take a minute the first time, since the index needs to be built. Subsequent calls to `make_inverted_index` with this config file will simply load the existing index, which is much faster.

Here's how we can interact with the index object:

In [2]:
idx.num_docs()

164465

In [3]:
idx.unique_terms()

299769

In [4]:
idx.avg_doc_length()

526.3216552734375

In [5]:
idx.total_corpus_terms()

86561496

All the disk index and inverted index functions from MeTA are implemented in metapy.
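
The statistics above are related to each other. As a quick sanity check, the sketch below assumes that the average document length is the total number of term occurrences divided by the number of documents. That definition is an assumption here, although the numbers printed above are consistent with it.

```python
# Values copied from the outputs above.
num_docs = 164465
total_corpus_terms = 86561496
reported_avg_doc_length = 526.3216552734375

# Assumption: the average is total term occurrences divided by document count.
computed = total_corpus_terms / num_docs

# The reported value looks like a single-precision float, so allow a small gap.
assert abs(computed - reported_avg_doc_length) < 1e-3
print(round(computed, 4))
```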

Let's create a `Ranker` object so we can search the index:

In [6]:
ranker = metapy.index.OkapiBM25()
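
To give a feel for what an Okapi BM25 ranker computes, here is a pure-Python sketch of one common formulation of the per-term score. It does not call metapy. The function name `bm25_term_score`, the parameter names `k1`, `b`, and `k3`, and their default values follow common usage and are assumptions here; the ranker in MeTA may differ in its details.

```python
import math


def bm25_term_score(doc_term_count, doc_size, avg_doc_length, num_docs,
                    doc_count, query_term_weight=1.0,
                    k1=1.2, b=0.75, k3=500.0):
    """Score contribution of one query term for one document (BM25 sketch)."""
    # Rare terms (small doc_count) get a larger inverse document frequency.
    idf = math.log(1.0 + (num_docs - doc_count + 0.5) / (doc_count + 0.5))
    # Term frequency saturates as it grows and is normalized by document length.
    length_norm = (1.0 - b) + b * doc_size / avg_doc_length
    tf = ((k1 + 1.0) * doc_term_count) / (k1 * length_norm + doc_term_count)
    # Repeated query terms count for more, also with saturation.
    qtf = ((k3 + 1.0) * query_term_weight) / (k3 + query_term_weight)
    return idf * tf * qtf


# Corpus statistics from the index above; the per-term numbers are made up.
score = bm25_term_score(doc_term_count=3, doc_size=400,
                        avg_doc_length=526.32, num_docs=164465, doc_count=120)
print(score)
```

A document's score for a query is then typically the sum of these contributions over the query terms that occur in the document.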

Now, we need a query. Create a `Document` and set its content to our query:

In [7]:
query = metapy.index.Document()
query.content('Airbus Subsidies') # query from AP news

Search our index using our ranker and query:

In [8]:
top_docs = ranker.score(idx, query, num_results=5)
top_docs

[(49687, 20.686737060546875),
 (8005, 20.367332458496094),
 (8645, 20.20011329650879),
 (13158, 20.071775436401367),
 (10212, 19.91327667236328)]

The ranker returns a ranked list of *(doc_id, score)* pairs. The scores come from the ranker, which in this case is Okapi BM25. Since the `line.toml` file in the AP News dataset has `store-full-text = true`, we can verify the content of our top documents by inspecting the document metadata field "content".

In [9]:
for num, (d_id, _) in enumerate(top_docs):
    content = idx.metadata(d_id).get('content')
    print("{}. {}...\n".format(num + 1, content[0:250]))

1. A top West German economic official said Sunday that reduction of government subsidies for Airbus Industrie will be a main topic at a planned September meeting ofthe consortium's member nations in Britain . Erich Riedl , parlimentary state secretary ...

2. The United States , angry over European subsidies for the Airbus aircraft - manufacturing consortium , is increasing pressure on Airbus nations to abolish or at least reduce the payments , say diplomatic sources . ` ` The Americans at a minimum want ...

3. U.S . and European trade official sare holding a new round of talks in the lengthy dispute over government subsidies to the Airbus aircraft manufacturing consortium . But both sides remain far apart on the long - simmering issue of subsidies to Airbu...

4. U.S . Trade Representative Clayton Yeutter tol dthe governments of Britain , France , West Germany and Spain onWednesday they are risking a trade war by their ` ` enormous subsidies ' ' to Airbus passenger planes . The majo

Since we have the queries file and relevance judgements, we can do an IR evaluation.

In [10]:
ev = metapy.index.IREval('apnews-config.toml')

We will loop over the queries file and add each result to the `IREval` object `ev`.

In [11]:
num_results = 10
with open('queries.txt') as query_file:
    for query_num, line in enumerate(query_file):
        query.content(line.strip())
        results = ranker.score(idx, query, num_results)
        avg_p = ev.avg_p(results, query_num, num_results)
        print("Query {} average precision: {}".format(query_num + 1, avg_p))

Query 1 average precision: 1.0
Query 2 average precision: 1.0
Query 3 average precision: 0.14694444444444443
Query 4 average precision: 0.5308730158730158
Query 5 average precision: 0.08833333333333333
Query 6 average precision: 0.8154365079365078
Query 7 average precision: 0.4084126984126984
Query 8 average precision: 1.0
Query 9 average precision: 0.042222222222222223
Query 10 average precision: 0.0
Query 11 average precision: 0.9
Query 12 average precision: 0.05
Query 13 average precision: 0.3
Query 14 average precision: 0.0
Query 15 average precision: 0.0
Query 16 average precision: 0.0
Query 17 average precision: 0.0
Query 18 average precision: 0.26666666666666666
Query 19 average precision: 0.016666666666666666
Query 20 average precision: 0.671111111111111
Query 21 average precision: 0.5508730158730158
Query 22 average precision: 0.0
Query 23 average precision: 0.0
Query 24 average precision: 0.0
Query 25 average precision: 0.0325
Query 26 average precision: 0.2
Query 27 average 
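
To make the numbers above easier to interpret, here is a pure-Python sketch of average precision at a cutoff. It is one common definition, not necessarily the exact computation `IREval` performs. In particular, dividing by the smaller of the cutoff and the number of relevant documents is an assumption, and the function name `average_precision` and the doc ids are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids, cutoff):
    """Average precision over the top `cutoff` results for one query."""
    relevant = set(relevant_doc_ids)
    if not relevant or cutoff <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank: relevant documents seen so far / rank.
            precision_sum += hits / rank
    # Assumed normalization: the best achievable number of hits in the top `cutoff`.
    return precision_sum / min(cutoff, len(relevant))


# Hypothetical doc ids: relevant documents appear at ranks 1 and 3 of 4.
ranked = [3, 7, 1, 9]
relevant = {3, 1, 5}
print(average_precision(ranked, relevant, cutoff=4))
```

Under this definition, a score of 1.0 means every one of the top results was relevant, and 0.0 means none of them were.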

Afterwards, we can get the mean average precision of all the queries.

In [12]:
ev.map()

0.26801309523809536
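
Mean average precision is the arithmetic mean of the per-query average precision values. The sketch below assumes, as the loop above suggests, that the evaluator records each average precision result as it is computed and averages them on request. `RunningMAP` is a hypothetical stand-in for illustration, not metapy's `IREval`.

```python
class RunningMAP:
    """Collects per-query average precision values and reports their mean."""

    def __init__(self):
        self._scores = []

    def add(self, avg_p):
        # Record one query's average precision and hand it back to the caller.
        self._scores.append(avg_p)
        return avg_p

    def map(self):
        # Mean over every recorded query; 0.0 if nothing was recorded.
        if not self._scores:
            return 0.0
        return sum(self._scores) / len(self._scores)


tracker = RunningMAP()
# First five average precision values from the output above.
for value in [1.0, 1.0, 0.14694444444444443,
              0.5308730158730158, 0.08833333333333333]:
    tracker.add(value)
print(tracker.map())  # mean over these five queries only, not the full query set
```

If the evaluator does accumulate state in this way, scoring the same query twice would count it twice in the mean, so it is safest to create a fresh evaluator for each experiment.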

Try experimenting with different rankers, ranker parameters, tokenization, and filters. Which combination gives you the best MAP?

Lastly, it's possible to define your own ranking function in Python.

In [13]:
class SimpleRanker(metapy.index.RankingFunction):
    """
    Create a new ranking function in Python that can be used in MeTA.
    """
    def __init__(self, some_param=1.0):
        self.param = some_param
        # You *must* call the base class constructor here!
        super(SimpleRanker, self).__init__()
                                                                                 
    def score_one(self, sd):
        """
        You need to override this function to return a score for a single term.
        For fields available in the score_data sd object,
        @see https://meta-toolkit.org/doxygen/structmeta_1_1index_1_1score__data.html
        """
        return (self.param + sd.doc_term_count) / (self.param * sd.doc_unique_terms + sd.doc_size)
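
To see what this formula produces without needing an index, the arithmetic can be tried on its own. The sketch below uses a hypothetical stand-in object, `FakeScoreData`, that carries only the three fields `score_one` reads above. It is not metapy's score data type, and the numbers are made up.

```python
from collections import namedtuple

# Hypothetical stand-in with only the three score_data fields used above.
FakeScoreData = namedtuple(
    "FakeScoreData", ["doc_term_count", "doc_unique_terms", "doc_size"]
)


def simple_score_one(sd, param=1.0):
    """Same arithmetic as SimpleRanker.score_one, without metapy."""
    return (param + sd.doc_term_count) / (param * sd.doc_unique_terms + sd.doc_size)


# Made-up document: the term occurs 3 times in a 200-term document
# that contains 50 distinct terms.
sd = FakeScoreData(doc_term_count=3, doc_unique_terms=50, doc_size=200)
print(simple_score_one(sd))           # (1 + 3) / (1 * 50 + 200)
print(simple_score_one(sd, param=0))  # 3 / 200, the raw relative frequency
```

With `param` set to 0 the score reduces to the term's relative frequency in the document, while larger values pull the score toward a uniform value in a way that resembles additive smoothing.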