Search Engine exercise
====

TODO :

```
pip install spacy
and install enwordsm
pip install ir_measures
```


In [432]:
import json

with open('cisi_all.json') as docstream:
    docset     = json.loads(docstream.read())

with open('cisi_queries.json') as qstream:
    queries     = json.loads(qstream.read())


print('Total number of documents',len(docset))
print('Total number of queries',len(queries))

Total number of documents 1460
Total number of queries 112


View documents
-----

The following code displays the first 10 documents of the collection

In [433]:
for doc in docset[:10]:
    print('doc_id:',doc['doc_id'])
    print('text:',doc['text'])
    print('-'*80)

doc_id: 1
text: The present study is a history of the DEWEY Decimal Classification. The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed. In spite of the DDC's long and healthy life, however, its full story has never been told. There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad.
--------------------------------------------------------------------------------
doc_id: 2
text: This report is an analysis of 6300 acts of use in 104 technical libraries in the United Kingdom. Library use is only one aspect of the wider pattern of information use. Information transfer in libraries is restricted to the use of documents. It takes no account of documents used outside the library, still less of information transferred orally from per

Perform a query
------

Each query has an ID and a query text. We display the 10 first queries:

In [434]:
for query in queries[:10]:
    print('query_id:',query['query_id'])
    print('text:',query['text'])
    

query_id: 1
text: What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles?
query_id: 2
text: How can actually pertinent data, as opposed to references or entire articles themselves, be retrieved automatically in response to information requests?
query_id: 3
text: What is information science? Give definitions where possible.
query_id: 4
text: Image recognition and any other methods of automatically transforming printed text into computer-ready form.
query_id: 5
text: What special training will ordinary researchers and businessmen need for proper information management and unobstructed use of information retrieval systems? What problems are they likely to encounter?
query_id: 6
text: What possibilities are there for verbal communication between computers and humans, that is, communication via the spoken word?
que

Preprocess data
------

In [435]:
import spacy
import string

stopwords = spacy.util.get_lang_class('en').Defaults.stop_words
stopwords.update(['the', 'of', 'be', 'and', 'to', 'in', 'a', '..', 'for', 'that', 'have'])
tokenizer = spacy.load('en_core_web_sm')

def preproc_doc(text,lower=True):
    lemma_list = []
    for TOKEN in tokenizer(text):
       lemma = TOKEN.lemma_.lower() if lower else TOKEN.lemma_ 
       if lemma not in stopwords and lemma not in string.punctuation: 
           lemma_list.append(TOKEN.lemma_)
    return ' '.join(lemma_list)



Create vocabulary
----

In [436]:
from collections import Counter

def make_vocab(docset,max_vocab_size=50000):
    
    vocab_counts = Counter()
    for doc in docset:
        text = preproc_doc(doc['text'])
        vocab_counts.update(text.split())
    
    print('  Total vocabulary size', len(vocab_counts))
    print('  Vocabulary size', min(len(vocab_counts),max_vocab_size))
    return [word for word,C in vocab_counts.most_common(max_vocab_size)]

def index_vocab(vocab_list):
    return dict( (word,idx) for idx, word in enumerate(vocab_list))
    

In [437]:
import numpy as np

def bow_TXD(docset,vocab_list,vocabulary):
    ncols  = len(docset)
    mlines = len(vocab_list)
    TXD    = np.zeros((mlines,ncols))

    for jdx, doc in enumerate(docset):
        text = preproc_doc(doc['text'])
        counts = Counter(text.split())
        for word in counts.elements():
            if word in vocabulary:
                idx = vocabulary[word]
                TXD[idx,jdx] = counts[word]
    return TXD

In [438]:
def tf(docvec):
    Z = docvec.sum()
    if Z > 0:      #some rows have zero counts (in case some texts are empty)
        return docvec / Z
    return docvec
    
def idf(termvec):
    ndocs = len(termvec)
    return np.log(ndocs / max(1,np.count_nonzero(termvec))) 

def idfvec(tdmat):
    return np.array([idf(row) for row in tdmat]) 

def tfidf(doc_vec,idf_vec):
    return tf(doc_vec) * idf_vec


In [439]:
from numpy.linalg import norm

def score_doc(qvec,docvec):
    return qvec @ docvec / (norm(qvec)*norm(docvec))


For querying, we have the SVD equation:

$$
\begin{align*}
\mathbf{X} &= \mathbf{T}\Sigma\mathbf{D}^\top\\
\mathbf{T}^\top \mathbf{X} &= \Sigma\mathbf{D}^\top\\
\Sigma^{-1}\mathbf{T}^\top \mathbf{X} &= \mathbf{D}^\top
\end{align*}
$$



In [440]:
from numpy.linalg import svd

class IRModel:

    def __init__(self,docset, use_tfidf=True, use_svd=False, max_vocab_size=20000, k = None):

        self.use_tfidf = use_tfidf
        self.use_svd   = use_svd
        
        print('Building vocabulary...')
        vocab_list      = make_vocab(docset,max_vocab_size)
        self.vocabulary = index_vocab(vocab_list)
        print('Building term document matrix...')
        tdmat           = bow_TXD(docset,vocab_list,self.vocabulary)

        if self.use_tfidf:
            self.idfvec     = idfvec(tdmat)
            tdmat           = np.array([tfidf(docvec,self.idfvec) for docvec in tdmat.T ]).T 
        
        if self.use_svd:
            
            print('Computing SVD...')
            U,sigma,Vt = svd(tdmat)
            if k is None:
                k = len(sigma)
            self.sigma = np.zeros( (k,k) )
            np.fill_diagonal(self.sigma,sigma[:k])
            self.U      = U[:,:k]
            tdmat       = Vt[:k,:]
            print('U shape',self.U.shape)
            print('Vt shape',tdmat.shape)
            
        else:
            print('shape',tdmat.shape)            
        
        self.docvecs = {str(idx+1):docvec for idx,docvec in enumerate(tdmat.T)}
        
        print('done')

    
    def query2vec(self,query):
        
        text = preproc_doc(query)
        
        counts = Counter(text.split())
        docvec = np.zeros(len(self.vocabulary)) 
        for word in counts.elements():
            if word in self.vocabulary:
                idx = self.vocabulary[word]
                docvec[idx] = counts[word]
        
        if self.use_tfidf:
            docvec = tfidf(docvec,self.idfvec)

        if self.use_svd:
            docvec =  (np.linalg.inv(self.sigma) @ self.U.T @ docvec)
        
        return docvec

    
    def make_run(self,queries, kresults=50,verbose=False):
        run = [ ]
        for query in queries:
            qvec       = self.query2vec(query['text'])
            raw_scores = [ (score_doc(qvec,docvec),doc_id) for doc_id,docvec in self.docvecs.items() ]
            scores     = sorted(raw_scores,reverse=True)[:kresults]
            if verbose:
                print('QUERY /',query['text'])
                print('Max SCORE',scores[0][0])
                print(docset[int(scores[0][1])-1])          
                print('Min SCORE',scores[-1][0])
                print(len(docset),scores[-1][1])
                print(docset[int(scores[-1][1])-1])    
                
            for score, doc_id in scores:
                run.append(ScoredDoc(query['query_id'],doc_id,score))
        return run

    
    def supervised_query(self,qidx, queries,qrels):
        
        queries = {query['query_id']:query['text'] for query in queries}
        docs    = {doc['doc_id']:doc['text'] for doc in docset}
        for idx,qrel in enumerate(qrels):
            if qrel.query_id == qidx:
                doc_id = qrel.doc_id
                print('Query',qrel.query_id,queries[qrel.query_id])
                qvec       = self.query2vec(queries[qrel.query_id])
                print(len(qvec),len(self.docvecs[doc_id]))
                raw_score  = score_doc(qvec,self.docvecs[doc_id])
                print('  score',raw_score)
                print(' ', docs[doc_id])
                print()

In [441]:
search = IRModel(docset,use_svd=False,max_vocab_size=5500,k=50)


Building vocabulary...
  Total vocabulary size 8354
  Vocabulary size 5500
Building term document matrix...
shape (5500, 1460)
done


In [442]:
from ir_measures import Qrel

with open('cisi_qrels.json') as qrels:
    all_qrels = [Qrel(QREL['query_id'],QREL['doc_id'],QREL['relevance']) for QREL in json.loads(qrels.read())]

search.supervised_query('5',queries,all_qrels)

Query 5 What special training will ordinary researchers and businessmen need for proper information management and unobstructed use of information retrieval systems? What problems are they likely to encounter?
5500 5500
  score 0.01316970975300814
  A comparison of creative and "noncreative" research chemists with respect to the ways in which they use their professional and technical literature.. The creative chemists differ from the "noncreative" in that the former read more technical literature on the job, are less reluctant to use literature of greater reading difficulty, are less influenced in their independence of thought, read more extensively and consult more frequently the older material, are more inquisitive and have broader cultural interests.. The findings of the study are believed to be helpful in planning library and information services, in refining future inquiries into the ways in which scientists use recorded information, and in improving tests for the identification o

In [445]:
import ir_measures
from ir_measures import ScoredDoc, P, R, nDCG


#expect ~40 with P@5 and nDCG

run = search.make_run(queries,kresults=20,verbose=False)

evaluation = ir_measures.calc_aggregate([nDCG@20, P@20, R@20 ], all_qrels, run)

print('Global evaluation')
print(' | '.join(  f'{key}:{val}' for key, val in evaluation.items()))
print('='*80)
print()

for metric in ir_measures.iter_calc( [nDCG@5, P@5, R@5 ], all_qrels, run):
    print(f'query id = {metric.query_id}| measure = {metric.measure}| value = {metric.value}')




Global evaluation
R@20:0.18269132762816245 | nDCG@20:0.32431975510168626 | P@20:0.25131578947368427

query id = 1| measure = P@5| value = 1.0
query id = 1| measure = R@5| value = 0.10869565217391304
query id = 1| measure = nDCG@5| value = 1.0
query id = 2| measure = P@5| value = 0.0
query id = 2| measure = R@5| value = 0.0
query id = 2| measure = nDCG@5| value = 0.0
query id = 3| measure = P@5| value = 1.0
query id = 3| measure = R@5| value = 0.11363636363636363
query id = 3| measure = nDCG@5| value = 1.0
query id = 4| measure = P@5| value = 0.0
query id = 4| measure = R@5| value = 0.0
query id = 4| measure = nDCG@5| value = 0.0
query id = 5| measure = P@5| value = 0.0
query id = 5| measure = R@5| value = 0.0
query id = 5| measure = nDCG@5| value = 0.0
query id = 6| measure = P@5| value = 0.0
query id = 6| measure = R@5| value = 0.0
query id = 6| measure = nDCG@5| value = 0.0
query id = 7| measure = P@5| value = 0.0
query id = 7| measure = R@5| value = 0.0
query id = 7| measure = nDCG@