# Demonstration

This demonstration uses the training set of the OHSUMED corpus, which was used in the Filtering Track of the 9th edition of the Text REtrieval Conference (TREC-9). We will use it for the information retrieval exercises of this workshop. Download [ohsumed.zip](ohsumed.zip) and unzip it into the same folder as this notebook.

To help you read the data, the zip file above includes `ohsumed.py`, a module with a simple API to the data. Importing it at the Python prompt provides the following variables:


1. `index`: a dictionary with document IDs as keys, and document text as values.
2. `questions`: a dictionary with query IDs as keys, and query text as values.
3. `answers`: a dictionary with query IDs as keys, and the set of IDs of known relevant documents as values, for evaluation purposes.

Below are some examples:

In [1]:
import ohsumed

Reading OHSUMED data


In [2]:
len(ohsumed.index)

54710

In [3]:
list(ohsumed.index.keys())[:10]

['87049087',
 '87049088',
 '87049089',
 '87049090',
 '87049092',
 '87049093',
 '87049094',
 '87049096',
 '87049098',
 '87049099']

In [4]:
ohsumed.index['87073365']

'Plasma cholecystokinin (CCK) responses after ingestion of a test meal in patients with mild chronic pancreatitis having abdominal pain were studied with a radioimmunoassay using the CCK specific antiserum (OAL-656) produced by a novel immunization procedure. Mean concentration of the fasting plasma CCK determined using CCK-8 as a standard was 31.5 +/- 5.8 pg/ml in six patients who had mild impaired exocrine function with pain, and was significantly higher than 10 healthy subjects (9.8 +/- 1.8 pg/ml). In those patients, the ingestion of a liquid test meal led to a peak of 75.1 +/- 25.4 pg/ml at 30 min, and the 120-min integrated CCK response (5427 +/- 1217.3 pg X min/ml) was significantly higher than in healthy subjects (1538 +/- 110.1 pg X min/ml).'

In [5]:
len(ohsumed.questions)

63

In [6]:
list(ohsumed.questions.keys())[:10]

['OHSU1',
 'OHSU2',
 'OHSU3',
 'OHSU4',
 'OHSU5',
 'OHSU6',
 'OHSU7',
 'OHSU8',
 'OHSU9',
 'OHSU10']

In [7]:
ohsumed.questions['OHSU7']

'young wf with lactase deficiency lactase deficiency therapy options'

In [8]:
len(ohsumed.answers)

63

In [9]:
ohsumed.answers['OHSU7']

{'87103870', '87104937', '87124599', '87153184', '87253647', '87267477'}

## Inverted index

We are going to build an inverted index of the non-stop words with frequency higher than 5.

The following code reads the files and creates a counter of all words in the corpus (including stop words). We will use NLTK's word tokeniser (read the beginning of [chapter 3 of NLTK's book](http://www.nltk.org/book/ch03.html#processing-raw-text)) to convert each document into a list of tokens.

In [10]:
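import nltk, collections
nltk.download('stopwords')
nltk.download('punkt')
# A set makes the stop-word membership tests fast
stop = set(nltk.corpus.stopwords.words('english'))
# Count every lower-cased token in the corpus, stop words and punctuation included
wordcounter = collections.Counter(w.lower() for k in ohsumed.index
                                            for s in nltk.sent_tokenize(ohsumed.index[k])
                                            for w in nltk.word_tokenize(s))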

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [11]:
wordcounter.most_common(10)

[('the', 305806),
 ('of', 271953),
 ('.', 254858),
 (',', 239656),
 ('and', 179604),
 ('in', 172449),
 ('to', 107431),
 (')', 96259),
 ('(', 95948),
 ('a', 95277)]

The following code creates the inverted index of all non-stop words with frequency higher than 5.

In [12]:
inverted = dict()
for d in ohsumed.index:
    for w in nltk.word_tokenize(ohsumed.index[d]):
        w = w.lower()
        # Skip stop words and rare words (corpus frequency of 5 or less)
        if w in stop or wordcounter[w] <= 5:
            continue
        # Map each remaining word to the set of documents that contain it
        inverted.setdefault(w, set()).add(d)

In [13]:
sorted(list(inverted.keys()))[3000:3010]

['accept',
 'acceptability',
 'acceptable',
 'acceptably',
 'acceptance',
 'accepted',
 'accepting',
 'acceptor',
 'acceptors',
 'access']

In [14]:
inverted['acceptability']

{'87057543',
 '87067994',
 '87073895',
 '87074134',
 '87114326',
 '87119697',
 '87121859',
 '87129900',
 '87149032',
 '87153185',
 '87193350',
 '87223625',
 '87223856',
 '87224779',
 '87232524',
 '87251875',
 '87273001',
 '87282178',
 '87295871',
 '87297008'}
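
To make the data structure concrete, here is a minimal sketch of the same idea on a toy corpus. The documents, the stop list, and the name `inverted_toy` are hypothetical, and a plain whitespace split stands in for NLTK's tokeniser.

```python
from collections import defaultdict

# Hypothetical toy corpus: document ID -> text
docs = {
    "d1": "the patient was healthy",
    "d2": "the pregnant woman was healthy",
    "d3": "a menopausal woman",
}
stop_words = {"the", "a", "was"}  # tiny illustrative stop list

# Each word maps to the set of IDs of the documents that contain it
inverted_toy = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():  # whitespace split, not a real tokeniser
        if word not in stop_words:
            inverted_toy[word].add(doc_id)

print(sorted(inverted_toy["woman"]))    # ['d2', 'd3']
print(sorted(inverted_toy["healthy"]))  # ['d1', 'd2']
```

Storing the postings as sets is what makes the Boolean queries later in this notebook simple set operations.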

The following code saves the inverted index to a pickle file so that we do not need to recompute it. Read [Python's documentation on pickle files](https://docs.python.org/3/library/pickle.html) for more detail. Note that the file is opened for writing in binary mode (`'wb'`), following the advice of a [Stack Overflow post about saving pickle files](http://stackoverflow.com/questions/13906623/using-pickle-dump-typeerror-must-be-str-not-bytes).

In [15]:
import pickle
with open('inverted.pickle','wb') as f:
    pickle.dump(inverted,f)

## Boolean retrieval

The following code reads the pickled file and returns the set of documents that match this Boolean query:
1. (menopausal OR pregnant) AND woman AND NOT healthy

In [16]:
with open('inverted.pickle','rb') as f:
    inverted = pickle.load(f)

In [17]:
((inverted['menopausal'] | inverted['pregnant']) & inverted['woman']) - inverted['healthy']

{'87060673',
 '87066899',
 '87097274',
 '87097518',
 '87099263',
 '87114245',
 '87117852',
 '87128881',
 '87134330',
 '87138205',
 '87153548',
 '87153568',
 '87169457',
 '87185313',
 '87226668',
 '87231479',
 '87235637',
 '87251241',
 '87252385',
 '87261426',
 '87281235',
 '87290433',
 '87296136',
 '87316210',
 '87316220',
 '87316328',
 '87324028',
 '87325497'}
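
The query above maps OR, AND, and AND NOT onto the set operators `|`, `&`, and `-`. Python applies `-` before `&`, so fully parenthesising the expression can make the intended grouping easier to read. The sketch below checks, on a hypothetical toy index, that both groupings give the same result for this query.

```python
# Hypothetical toy index: term -> set of document IDs
toy = {
    "menopausal": {"d3", "d5"},
    "pregnant": {"d2", "d4"},
    "woman": {"d2", "d3", "d4"},
    "healthy": {"d2"},
}

# (menopausal OR pregnant) AND woman AND NOT healthy, fully parenthesised
explicit = ((toy["menopausal"] | toy["pregnant"]) & toy["woman"]) - toy["healthy"]

# Same query relying on operator precedence: "-" is evaluated before "&"
implicit = (toy["menopausal"] | toy["pregnant"]) & toy["woman"] - toy["healthy"]

print(sorted(explicit))      # ['d3', 'd4']
print(explicit == implicit)  # True
```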

# Your Turn

## Vector Retrieval

### Exercise: Boolean Information Retrieval

Create an inverted index of the NLTK Gutenberg corpus and save it into a file `gutenbergindex.pickle`. To create this index there is no need to look for stop words or word frequencies, since the corpus is not that large. Simply use all the words. Use this index to find the IDs of the documents that match the following Boolean queries:

1. Brutus OR Caesar
2. Brutus AND NOT Caesar
3. (Brutus AND Caesar) OR Calpurnia


In [18]:
import pickle
nltk.download("gutenberg")
# Write your code here


[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
austen-emma.txt
austen-persuasion.txt
austen-sense.txt
bible-kjv.txt
blake-poems.txt
bryant-stories.txt
burgess-busterbrown.txt
carroll-alice.txt
chesterton-ball.txt
chesterton-brown.txt
chesterton-thursday.txt
edgeworth-parents.txt
melville-moby_dick.txt
milton-paradise.txt
shakespeare-caesar.txt
shakespeare-hamlet.txt
shakespeare-macbeth.txt
whitman-leaves.txt
Saving index into file gutenbergindex.pickle...
Done


In [19]:
with open('gutenbergindex.pickle','rb') as f:
    gutenbergindex = pickle.load(f)

In [20]:
# Write your code for searching for Brutus OR Caesar


{'bible-kjv.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt'}

In [21]:
# Write your code for searching for Brutus AND NOT Caesar


set()

In [22]:
# Write your code for searching for (Brutus AND Caesar) OR Calpurnia


KeyError: 'Calpurnia'

_It turns out that 'Calpurnia' is not indexed. Can you figure out what sort of error handling you could do to handle words that are not indexed?_
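
One possible approach, sketched here on a hypothetical toy index, is to treat a term that is not indexed as matching no documents. The helper name `postings` and the document IDs are illustrative only.

```python
# Hypothetical toy index: term -> set of document IDs
toy_index = {
    "Brutus": {"play1", "play2"},
    "Caesar": {"play1", "play2", "play3"},
}

def postings(index, term):
    """Return the documents containing term, or an empty set if it is not indexed."""
    return index.get(term, set())

# (Brutus AND Caesar) OR Calpurnia, with 'Calpurnia' absent from the index
result = (postings(toy_index, "Brutus") & postings(toy_index, "Caesar")) \
         | postings(toy_index, "Calpurnia")
print(sorted(result))  # ['play1', 'play2']
```

Returning an empty set keeps the query expression unchanged; catching `KeyError` with `try`/`except` would be an alternative.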

### Exercise: tf.idf

Using scikit-learn, compute the tf.idf of all words in the OHSUMED corpus. Use the English list of stop words, and leave all other settings to their default values. In particular, do not stem the words. Pickle the resulting tf.idf vectoriser into a file tfidf.pickle. *Note that in this exercise you should use the sklearn functions, not nltk. In particular, do not use NLTK's list of stop words or its tokeniser.*

In [23]:
# Write your code to compute the tf.idf


TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [24]:
# Write your code to save the results in a pickle file
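
As background on what the weights mean, the following sketch computes a textbook tf.idf (raw term count times `log(N / df)`) in plain Python on a hypothetical, already tokenised toy corpus. It is not the scikit-learn computation: the vectoriser settings shown above include `smooth_idf=True` and `norm='l2'`, so scikit-learn's numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenised
toy_docs = {
    "d1": ["lactase", "deficiency", "therapy"],
    "d2": ["lactase", "deficiency", "lactase"],
    "d3": ["pancreatitis", "therapy"],
}

n_docs = len(toy_docs)
# Document frequency: number of documents that contain each term
df = Counter(term for tokens in toy_docs.values() for term in set(tokens))

def tfidf_weights(tokens):
    """Textbook weighting: raw term count times log(N / df)."""
    tf = Counter(tokens)
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}

weights = tfidf_weights(toy_docs["d3"])
print(max(weights, key=weights.get))  # pancreatitis
```

In this toy corpus "pancreatitis" occurs in a single document, so it outweighs "therapy", which occurs in two.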


### Exercise: Sort by tf.idf

Write a function that returns the words of a document with the highest tf.idf scores, sorted in descending order of score.

In [25]:
def best_tfidf(tfidf,docID,numwords=10):
    "Return the words with highest tf.idf, in descending order"
    return []

In [26]:
best_tfidf(tfidf,'87049087')

['rhythms',
 'refibrillation',
 'organized',
 'refibrillated',
 'converted',
 'emt',
 'paramedic',
 'ds',
 'defibrillation',
 'hospital']

### Optional exercise: tf.idf cosine similarity

Write a function that takes a query string and an optional parameter $n$, the number of results, and returns the IDs of the $n$ most relevant documents according to the cosine similarity between the tf.idf vectors of the query and of each document. The results are sorted in descending order of cosine similarity. *If your program takes too long to run, your solution may process only the first 1000 documents of the OHSUMED collection.*

In [27]:
def best_documents(querystring,n=10):
    "Return the best documents sorted by relevance"
    return []

In [31]:
best_documents(ohsumed.questions['OHSU1'])

['87052846',
 '87053030',
 '87057603',
 '87057561',
 '87054719',
 '87053640',
 '87053630',
 '87055106',
 '87057550',
 '87053614']
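
As a reminder of how cosine similarity ranks documents, here is a small plain-Python sketch on hypothetical sparse vectors stored as dictionaries. The vectors and weights are made up for illustration; a solution based on scikit-learn would work with its sparse matrices instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical tf.idf vectors: document ID -> {term: weight}
doc_vectors = {
    "d1": {"lactase": 1.0, "therapy": 0.5},
    "d2": {"pancreatitis": 1.0},
    "d3": {"lactase": 0.2, "deficiency": 0.9},
}
query_vector = {"lactase": 1.0, "deficiency": 1.0}

# Rank document IDs by decreasing similarity to the query
ranked = sorted(doc_vectors,
                key=lambda d: cosine(query_vector, doc_vectors[d]),
                reverse=True)
print(ranked)  # ['d3', 'd1', 'd2']
```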

## PageRank

The following graph represents a tiny collection of HTML documents and how they are linked. You will implement PageRank on the documents.

![pagerank](graph.png)

### Exercise: Transition matrix

Write the transition matrix as NumPy code.
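
As an illustration of the format, the sketch below builds a transition matrix for a hypothetical three-page graph, which is not the graph in the figure. It assumes the column convention, where column `j` holds the probabilities of moving from page `j` to every other page, and it assumes every page has at least one outgoing link.

```python
import numpy as np

# Hypothetical link structure (not the graph in the figure):
# page index -> indices of the pages it links to
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)

# Column j holds the probabilities of moving from page j to each page
M = np.zeros((n, n))
for source, targets in links.items():
    for target in targets:
        M[target, source] = 1.0 / len(targets)

print(M)
```

With this convention every column sums to 1, which is a quick sanity check on the matrix.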

### Exercise: An Iteration of PageRank

Perform an iteration of the PageRank algorithm and report on the resulting PageRank scores of each page.

array([[ 0.25   ],
       [ 0.14375],
       [ 0.56875],
       [ 0.0375 ]])

### Exercise: Complete PageRank

Perform the PageRank algorithm until convergence and report the number of iterations and the final scores.

PR after 9 iterations:
[[ 0.37044014]
 [ 0.19493706]
 [ 0.3971228 ]
 [ 0.0375    ]]
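
For reference, here is a minimal sketch of the iteration on a hypothetical three-page graph, not the graph in the figure. The damping factor `d = 0.85`, the uniform starting vector, and the convergence threshold are assumptions. The score of 0.0375 reported above would be consistent with `d = 0.85` and four pages, but that is only an inference.

```python
import numpy as np

# Hypothetical column-stochastic transition matrix for a 3-page graph
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
n = M.shape[0]
d = 0.85           # damping factor (assumed value)
tolerance = 1e-6   # assumed convergence threshold

pr = np.full((n, 1), 1.0 / n)  # start from a uniform distribution
for iterations in range(1, 1001):  # cap the number of iterations as a safeguard
    new_pr = (1 - d) / n + d * (M @ pr)
    change = np.abs(new_pr - pr).sum()
    pr = new_pr
    if change < tolerance:
        break

print(f"PR after {iterations} iterations:")
print(pr)
```

The loop stops once the total change between two successive score vectors falls below the threshold. A different threshold or stopping rule will change the reported number of iterations.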
