# Lemur project Indri and Pyndri tests

This notebook provides a set of examples showing how to set up and run some basic queries using the Indri engine, with pyndri as the Python interface.

In [98]:
# A few imports
import pyndri
import os

These are some files that we have in our current directory

In [99]:
ls

IndriBuildIndex.conf          corpus.trectext  run.params
Indri_and_Pyndri_tests.ipynb  index/


In [100]:
!cat ./corpus.trectext

<DOC>
<DOCNO>recipe</DOCNO>
<TEXT>
First put the potatoes in the pan. Then fill with water and boil till soft.
</TEXT>
<TEXT>
more text
</TEXT>
</DOC>
<DOC>
<DOCNO>history</DOCNO>
<TEXT>
Eventually, agriculturalists in Europe found potatoes easier to grow and cultivate than other staple crops, such as wheat and oats.
</TEXT>
</DOC>
<DOC>
<DOCNO>news</DOCNO>
<TEXT>
It has been found that Jaimie Oliver was smuggling potatoes accross the Mexican border. No one knows yet what the meaning of this potato fiasco is.
</TEXT>
</DOC>


Indri is a set of tools for indexing and querying documents.

Building an index involves creating a config file that specifies the corpus to use, the stemmer, and the stop words. Here is an example:

```
<parameters>
    <index>index/</index>
    <memory>1024M</memory>
    <storeDocs>true</storeDocs>
    <corpus><path>corpus.trectext</path><class>trectext</class></corpus>
    <stemmer><name>krovetz</name></stemmer>
    <stopper>  
      <word>a</word>
      <word>about</word>
      <word>above</word>
      <word>according</word>
      <word>across</word>
      <word>after</word>
    </stopper>
</parameters>
```
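
As a rough sketch, a parameter file like the one above could also be generated from Python with the standard library. The helper name `build_index_config` and its arguments are illustrative assumptions, not part of Indri or pyndri. Only the XML element names are taken from the example config.

```python
import xml.etree.ElementTree as ET


def build_index_config(index_dir, corpus_path, corpus_class="trectext",
                       stemmer="krovetz", stopwords=()):
    """Return an IndriBuildIndex-style parameter file as an XML string.

    Hypothetical helper: it mirrors the elements of the example config only.
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_dir
    ET.SubElement(root, "memory").text = "1024M"
    ET.SubElement(root, "storeDocs").text = "true"

    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = corpus_class

    ET.SubElement(ET.SubElement(root, "stemmer"), "name").text = stemmer

    # Only emit a <stopper> block when stop words were supplied
    if stopwords:
        stopper = ET.SubElement(root, "stopper")
        for word in stopwords:
            ET.SubElement(stopper, "word").text = word

    return ET.tostring(root, encoding="unicode")


config_xml = build_index_config("index/", "corpus.trectext",
                                stopwords=["a", "about", "above"])
print(config_xml)
```

Writing `config_xml` to a file such as `IndriBuildIndex.conf` would then give `IndriBuildIndex` something to read. The output is not pretty-printed.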

In [101]:
!cat ./IndriBuildIndex.conf

<parameters>
<index>index/</index>
<memory>1024M</memory>
<storeDocs>true</storeDocs>
<corpus><path>corpus.trectext</path><class>trectext</class></corpus>
<stemmer><name>krovetz</name></stemmer>
</parameters>


In [102]:
!IndriBuildIndex IndriBuildIndex.conf

kstem_add_table_entry: Duplicate word emeritus will be ignored.
kstem_add_table_entry: Duplicate word emeritus will be ignored.
0:00: Opened repository index/
0:00: Opened corpus.trectext
0:00: Documents parsed: 3 Documents indexed: 0
0:00: Closed corpus.trectext
0:00: Closing index
0:00: Finished


In [103]:
ls

IndriBuildIndex.conf          corpus.trectext  run.params
Indri_and_Pyndri_tests.ipynb  index/


In [104]:
!cat run.params

<parameters>
  <index>index</index>
  <trecFormat>true</trecFormat>
  <count>10</count>
  <query>
    <number>115</number>
    <text>
      #weight( 1.0 Jaimie 1.0 potatoes  1.0 the  )
    </text>
  </query>
  <query>
    <number>116</number>
    <text>
      #weight( 1.0 water 1.0 potatoes  1.0 the  )
    </text>
  </query>
  <stopper>
    <word>small</word>
    <word>the</word>
  </stopper>
</parameters>

In [105]:
!IndriRunQuery run.params

kstem_add_table_entry: Duplicate word emeritus will be ignored.
115 Q0 news 1 -3.90996 indri
115 Q0 recipe 2 -3.91677 indri
115 Q0 history 3 -3.91796 indri
116 Q0 recipe 1 -3.90847 indri
116 Q0 history 2 -3.91796 indri
116 Q0 news 3 -3.91826 indri
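
Because `trecFormat` is enabled, each result line follows the standard TREC run layout: query number, the literal `Q0`, document id, rank, score, and run tag. A minimal sketch for reading such output back into Python follows. The names `RunEntry` and `parse_trec_run` are made up for illustration, and the sample text is copied from the output above.

```python
from collections import namedtuple

# One line of a TREC-format run: query id, doc id, rank, score, run tag
RunEntry = namedtuple("RunEntry", "query_id doc_id rank score run_tag")


def parse_trec_run(text):
    """Parse TREC-format result lines, skipping anything else (e.g. log lines)."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6 or parts[1] != "Q0":
            continue  # not a result line, e.g. the kstem warning
        query_id, _, doc_id, rank, score, run_tag = parts
        entries.append(RunEntry(query_id, doc_id, int(rank), float(score), run_tag))
    return entries


sample = """kstem_add_table_entry: Duplicate word emeritus will be ignored.
115 Q0 news 1 -3.90996 indri
115 Q0 recipe 2 -3.91677 indri
116 Q0 recipe 1 -3.90847 indri"""

entries = parse_trec_run(sample)
for entry in entries:
    print(entry.query_id, entry.doc_id, entry.rank, entry.score)
```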


In [106]:
# Path to the index directory built by IndriBuildIndex above
index_path = 'index'
print("Index Exists" if os.path.exists(index_path) else "Index not found")

# Open the index through pyndri
index = pyndri.Index(index_path)

Index Exists


In [107]:
index.query('Jaimie potatoes the')

((3, -3.212641352160709), (1, -3.219448297087022), (2, -3.2248132426417326))

In [108]:
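# Query through pyndri's TF-IDF query environment
env = pyndri.TFIDFQueryEnvironment(index)
# Repeat the same query 1000 times as a rough speed check
for i in range(1000):
    env.query('Jaimie potatoes the')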
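
To put a number on a loop like this, one option is a small timing helper built on `time.perf_counter`. This is only a sketch: `time_queries` is a hypothetical helper, and `fake_query` is a stand-in so the block runs without an index. With a real index, `env.query` would be passed instead.

```python
import time


def time_queries(run_query, query, repeats):
    """Return the wall-clock seconds needed to call run_query(query) `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_query(query)
    return time.perf_counter() - start


def fake_query(text):
    """Stand-in for env.query (assumption): returns dummy (id, score) pairs."""
    return tuple((position, 0.0) for position, _ in enumerate(text.split(), start=1))


elapsed = time_queries(fake_query, 'Jaimie potatoes the', 1000)
print(f"1000 queries took {elapsed:.4f}s")
```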

In [109]:
# Map each internal document id in the results to its external id (the DOCNO)
[index.document(int_doc_id)[0] for int_doc_id, score in env.query('Jaimie potatoes the')]

['news', 'recipe', 'history']

## Experiments
- Document ids cannot be duplicated in the corpus. Indri simply ignores the duplicates, so it looks as if there were only one document. To get the right code for a given description, keep a parallel file and use the document id returned by the index to find the matching entry.
- It seems that Indri doesn't do language models in the query files, so pyndri seems like the best option for BM25.
- Running 100k queries in a for loop with pyndri using a TFIDF-BM25 model takes around 12 s.
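
The parallel-file idea from the first point could look roughly like the sketch below. It assumes that internal document ids start at 1 and follow corpus order, which matches the query output earlier in this notebook (3 is `news`, 1 is `recipe`, 2 is `history`) but is not verified beyond that. The helper names and the contents of the parallel file are invented for illustration.

```python
def load_parallel_file(lines):
    """Map 1-based document position to the matching line of the parallel file."""
    return {position: line.rstrip("\n")
            for position, line in enumerate(lines, start=1)}


def lookup(results, parallel):
    """Replace each (internal id, score) result with its parallel-file entry."""
    return [parallel[doc_id] for doc_id, _score in results]


# Hypothetical parallel file: one line per document, in corpus order
parallel_lines = ["code for recipe\n", "code for history\n", "code for news\n"]
parallel = load_parallel_file(parallel_lines)

# Result tuple in the shape returned by index.query earlier in the notebook
results = ((3, -3.212641352160709),
           (1, -3.219448297087022),
           (2, -3.2248132426417326))

print(lookup(results, parallel))
```

With a real parallel file, `load_parallel_file(open(path))` would work the same way, since iterating a file yields its lines.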