---
Information Retrieval Exercises
====

You will be improving upon a rather poorly-made information retrieval system. You will build a system to quickly retrieve documents that match queries.

---
Rider or Die
----

![](http://i.telegraph.co.uk/multimedia/archive/02162/ridderhaggard_2162866i.jpg)

Data 
---

>“...one day a sunrise will come when we shall be among those who are lost, and then others will watch those glorious rays, and grow sad in the midst of beauty, and dream of Death in the full glow of arising Life!”   
> \- Rider Haggard

Your IR system will find relevant documents among a collection of 60 short stories by the famed [Rider Haggard](http://en.wikipedia.org/wiki/H._Rider_Haggard). 

The training data is located in the `data/` directory under the subdirectory `RiderHaggard/`. Within this directory you will see yet another directory `raw/`. This contains the raw text files of 60 different short stories written by Rider Haggard.

A set of development queries and their expected answers is in the `data/` directory, in the files `queries.txt` and `solutions.txt` respectively.

In [1]:
! head -n 1 data/queries.txt 

separation, priestess, demon, zulu, sacrifice


In [2]:
! head -n 1 data/solutions.txt

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 51, 53, 54, 55, 56, 58, 59], [1, 3, 4, 6, 16, 19, 25, 30, 31, 36, 40, 45, 47, 48, 53, 54, 58, 59], [1, 2, 3, 5, 6, 11, 14, 15, 16, 17, 20, 25, 26, 27, 29, 31, 34, 35, 36, 37, 39, 40, 41, 43, 44, 45, 47, 48, 49, 50, 51, 52, 54, 55, 58], [2, 3, 5, 7, 8, 9, 10, 11, 14, 16, 19, 20, 21, 22, 23, 27, 28, 33, 39, 40, 41, 43, 44, 47, 48, 51, 52, 55, 56, 58], [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 25, 26, 28, 29, 30, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 45, 46, 47, 48, 51, 53, 54, 55, 56, 57, 58, 59]]
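
The line of `solutions.txt` shown above happens to be valid JSON (a list of lists of integer document IDs), so one way to load such a line is `json.loads`. This is a minimal sketch, assuming every line has that shape; `sample_line` is a shortened, made-up line, not real data.

```python
import json

# Hypothetical, shortened line in the same shape as data/solutions.txt.
sample_line = "[[0, 1, 2], [1, 3], [2, 5, 7]]"

# Parse into a list of lists, then convert each list to a set so that
# comparisons do not depend on the order of the document IDs.
expected = [set(doc_ids) for doc_ids in json.loads(sample_line)]

print(len(expected))  # number of ID lists on the line
print(expected[1])    # the second set of document IDs
```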


---

Improve upon the IR system provided. This involves implementing:

- **Inverted Index:** a mapping from words to the documents in which they occur.
- **Boolean Retrieval:** in which you return the list of documents that contain all words in a query* 
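
To make the inverted index concrete, here is a minimal sketch on a toy corpus. The `toy_docs` list and the `build_inverted_index` helper are illustrative assumptions, not part of the provided `IRSystem` class, and the sketch assumes each document is already a list of tokens.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document IDs in which it occurs."""
    inv_index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            inv_index[token].add(doc_id)
    return inv_index

# Hypothetical three-document corpus; a document's ID is its list position.
toy_docs = [
    ["zulu", "priestess", "sacrifice"],
    ["demon", "sacrifice"],
    ["zulu", "demon", "separation"],
]

toy_index = build_inverted_index(toy_docs)
print(sorted(toy_index["sacrifice"]))  # [0, 1]
print(sorted(toy_index["zulu"]))       # [0, 2]
```

Using a `set` per word means a document is recorded once no matter how often the word repeats in it.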

You will implement and/or improve upon the following functions:

- `index():` This is where you will build the inverted index. The documents will have already been read in for you at this point, so you will want to look at some of the instance variables in the class:
    - `self.titles`
    - `self.docs`
    - `self.vocab`
- `boolean_retrieve():` This function performs Boolean retrieval, returning a list of document IDs corresponding to the documents in which all the words in `query` occur.

\* Yes, we only support conjunctions...
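
A conjunctive query can be answered by intersecting the posting sets of its words. The sketch below shows the idea on a small hand-written index; `boolean_and` and `toy_index` are hypothetical names, and the real `boolean_retrieve()` may need a different signature or return type.

```python
def boolean_and(inv_index, query_words):
    """Return the sorted IDs of documents that contain every query word."""
    if not query_words:
        return []
    # A word that is missing from the index matches no documents.
    postings = [set(inv_index.get(word, ())) for word in query_words]
    # Intersect smallest-first to keep intermediate results small.
    postings.sort(key=len)
    result = postings[0]
    for posting in postings[1:]:
        result = result & posting
    return sorted(result)

# Hypothetical index: word -> set of document IDs.
toy_index = {
    "zulu": {0, 2},
    "demon": {1, 2},
    "sacrifice": {0, 1},
}

print(boolean_and(toy_index, ["zulu", "demon"]))  # [2]
print(boolean_and(toy_index, ["zulu", "twerk"]))  # []
```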

----
Evaluation
----
Your IR system will be evaluated on a development set of queries as well as a held-out set of queries. The development queries are stored in the file **queries.txt**.

Running the code
---

In [1]:
reset -fs

The `%run` cell below runs your IR system and tests it against the development set of queries.

The first time you run the code, the documents will be stemmed.

Then you will see the evaluation metrics:

In [2]:
%run python/ir_system.py

Reading in documents...
Already stemmed!
Indexing...
===== Running tests =====
Inverted Index Test
    Score: 0 Feedback: 0/5 Correct. Accuracy: 0.000000
Boolean Retrieval Test
    Score: 0 Feedback: 0/5 Correct. Accuracy: 0.000000
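
For intuition about the `n/5 Correct` feedback, here is one way such a figure could be computed: count the queries whose returned document IDs exactly match the expected ones. This is only a guess at the scoring rule, not the actual grader in `ir_system.py`; the `accuracy` helper and the sample lists are made up.

```python
def accuracy(predicted, expected):
    """Return (correct, total, fraction) for exact matches, ignoring ID order."""
    correct = sum(
        1 for got, want in zip(predicted, expected) if sorted(got) == sorted(want)
    )
    return correct, len(expected), correct / len(expected)

# Made-up results for three queries; the second one misses document 3.
predicted = [[0, 2], [1], []]
expected = [[0, 2], [1, 3], []]

correct, total, acc = accuracy(predicted, expected)
print(f"{correct}/{total} Correct. Accuracy: {acc:f}")
```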


---


__Note__: The first time you run this, it will create a directory named `stemmed/` in `../data/RiderHaggard/`. This directory is a simple cache of the stemmed versions of the raw text documents, so later runs will be much faster.

*However*, this means that if something goes wrong during the first run and it does not finish processing all the documents, you may be left with an incomplete set of documents in `../data/RiderHaggard/stemmed/`. If this happens, simply remove the `stemmed/` directory and re-run!
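
If you prefer to clear the cache from Python rather than the shell, something like the following should work. The `clear_stemmed_cache` helper is a hypothetical convenience, not part of the provided code, and the example path assumes the directory layout described in the note.

```python
import shutil
from pathlib import Path

def clear_stemmed_cache(data_dir):
    """Delete the stemmed/ cache under data_dir; return True if it existed."""
    stemmed_dir = Path(data_dir) / "stemmed"
    if stemmed_dir.is_dir():
        shutil.rmtree(stemmed_dir)
        return True
    return False

# Example (adjust the path to your working directory):
# clear_stemmed_cache("../data/RiderHaggard")
```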

---
Hints
---

> Smart data structures and dumb code works a lot better than the other way around.

- Take your time - Read the instructions, skim the code, and __read the instructions again__. 
- `set`, `Counter`, and `defaultdict` are your friends.
- Indexes are your best friends.
- Build an instance of the system in a Jupyter Notebook or in `ipython`:

In [5]:
irsys = IRSystem()
irsys.read_data('./data/RiderHaggard')
irsys.index()

print('Index built')

Reading in documents...
Already stemmed!
Indexing...
Index built


In [6]:
print(irsys.vocab[:5])

['laugh', 'andromedasi', 'seedand', 'eatersupofenemi', 'ballgeoffrei']


In [7]:
print(*irsys.titles[:5], sep="\n")

A Winter Pilgrimage (1901) 0600121
A Yellow God: an Idol of Africa 2857
Allan Quatermain 711
Allan and the Holy Flower 5174
Allan and the Ice Gods (1927) 0200201


In [10]:
word = "the"
# word = 'withhold'
# word = 'twerk'
print(*irsys.inv_index[word], sep="\n")
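
One thing to watch for when probing the index with unseen words such as `'twerk'`: indexing a `defaultdict` with `[]` silently inserts an empty entry, while a plain `dict` raises `KeyError`. The sketch below shows a lookup that avoids both on a toy index. The `postings` helper is hypothetical, and whether `irsys.inv_index` is a `defaultdict` depends on your implementation.

```python
from collections import defaultdict

# Toy index with a single known word.
toy_index = defaultdict(set)
toy_index["withhold"].update({3, 7})

def postings(inv_index, word):
    """Return the posting set for word without modifying the index."""
    return inv_index.get(word, set())

print(sorted(postings(toy_index, "withhold")))  # [3, 7]
print(sorted(postings(toy_index, "twerk")))     # []
print("twerk" in toy_index)                     # False: .get() did not insert it
```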




---