Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

Information retrieval exercise
===


This exercise consists in designing a toy vector space information retrieval system
on the CISI data set. The system can be augmented with latent representations using Latent Semantic Analysis
(also known as Latent Semantic Indexing).

The data set is small enough to allow an in-class exercise with naive implementations.

Download packages and data
----

The first step is to download required packages and data...

**WARNING** Only the packages and data sets downloaded here are allowed for this exercise.


In [None]:
INSTALL = True # set this variable to True to download the necessary materials

if INSTALL:
    !pip install numpy
    !pip install spacy
    !pip install gdown
    !python -m spacy download en_core_web_sm
    !pip install ir_measures 
    
    import gdown
    gdown.download('https://drive.google.com/uc?id=14BbZBflc0rkkvZMA_DRdd3xvQZuo_q2Z', 'cisi_queries.json')
    gdown.download('https://drive.google.com/uc?id=1wl6Rh5PvI6_kV9LSiBeTOVfuqM_qfA5-',  'cisi_qrels.json')
    gdown.download('https://drive.google.com/uc?id=1MIEDbQt2NBAhJjngN4Nr2JUnaKdZ2_YJ', 'cisi_all.json')


In [None]:
import json

# The docset is the set of documents used in this exercise
# queries is the set of queries used for testing the search engine

with open('cisi_all.json') as docstream:
    docset     = json.loads(docstream.read())

with open('cisi_queries.json') as qstream:
    queries     = json.loads(qstream.read())

print('Total number of documents',len(docset))
print('Total number of queries',len(queries))

View documents
-----

Documents are stored as a list of dictionaries. 
The following code displays the elements of this list from index `start` up to (but not including) `end`; by default, the first 4.

In [None]:
def show_docs(start=0,end=4):
    for doc in docset[start:end]:
        print(doc)
        print('-'*80)

show_docs()

View queries
------

Queries are stored as a list of dictionaries. 
The following code displays the elements of this list from index `start` up to (but not including) `end`; by default, the first 3.

In [None]:
def show_queries(start=0,end=3):
    for query in queries[start:end] :
        print(query)
        print('-'*80)

show_queries()

Preprocess data
------

The first exercise consists in designing a preprocessing function for documents and queries.
This preprocessing function typically involves:

- Text normalisation (lowercase, removal of punctuation)
- Lemmatisation and/or stemming
- Removal of stop words

You are allowed to use the `spacy` library here
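
As a rough illustration of the normalisation and stop word steps listed above, the following sketch relies on the standard library only. The names `toy_preproc` and `TOY_STOPWORDS` are hypothetical placeholders, the stop word list is deliberately tiny, and lemmatisation is left out since it requires a linguistic resource such as `spacy`.

```python
import string

# Hypothetical, deliberately tiny stop word list (for illustration only)
TOY_STOPWORDS = {"the", "a", "of", "is", "and", "in"}

def toy_preproc(text, lower=True, use_stopwords=True):
    """Lowercases, strips ASCII punctuation and optionally removes stop words."""
    if lower:
        text = text.lower()
    # Replace every punctuation character by a space, then split on whitespace
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.translate(table).split()
    if use_stopwords:
        # Stop words are matched case-insensitively
        tokens = [tok for tok in tokens if tok.lower() not in TOY_STOPWORDS]
    return " ".join(tokens)

print(toy_preproc("The Retrieval of information, in libraries!"))
```

Returning a whitespace-separated string keeps the result easy to split again later, which is one possible design among others.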

In [None]:
import spacy
import string


tokenizer = spacy.load('en_core_web_sm')
stopwords = spacy.util.get_lang_class('en').Defaults.stop_words

#You may customize stop words
# YOUR CODE HERE
raise NotImplementedError()

def preproc_doc(text,lower=True,use_stopwords=True):
    """
    Args: 
        text(string)   : the input string
        lower(bool)    : whether to lowercase the text or not
        use_stopwords(bool): whether to remove stopwords from text or not
    Returns:
        string. The preprocessed text
    """
    # YOUR CODE HERE
    raise NotImplementedError()


In [None]:
#You may test your preprocessing on documents and queries here

In [None]:
# Automatic tests #


Create vocabulary
----
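
One possible way to select the most frequent word forms is to count them with `collections.Counter`. The sketch below works on plain preprocessed strings rather than on the actual document dictionaries, and `toy_vocab` is a hypothetical name used for illustration only.

```python
from collections import Counter

def toy_vocab(texts, max_vocab_size=3):
    """Returns the most frequent word forms of a list of preprocessed strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    # most_common sorts by decreasing count; ties keep first-seen order
    return [word for word, _ in counts.most_common(max_vocab_size)]

texts = ["library science library", "science retrieval", "library index"]
vocab = toy_vocab(texts)
# Same mapping as the one built by index_vocab
word2idx = {word: idx for idx, word in enumerate(vocab)}
print(vocab)
print(word2idx)
```

With the cutoff set to 3, the word `index` is dropped because it is seen after `retrieval`, which has the same count.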

In [None]:
from collections import Counter

def make_vocab(docset,max_vocab_size=50000):
    """
    Extracts the vocabulary from the docset as a list of unique word forms
    Args:
        docset (list): the list of documents
        max_vocab_size (int): the maximum number of entries in the vocabulary
    Returns:
       list of strings. The vocabulary list, restricted to the most frequent words in case of cutoff
    """
    # YOUR CODE HERE
    raise NotImplementedError()

def index_vocab(vocab_list):
    """
    Turns a list of words into a python dictionary mapping words to indexes
    Args:
       vocab_list (list) : a list of strings
    Returns:
       a dict.
    """
    return dict( (word,idx) for idx, word in enumerate(vocab_list))
    

In [None]:
# TEST vocab


Term document matrix
----

The following functions build the term-document matrix and apply some of its transforms (tf, idf, tf-idf).
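
To make the layout of the matrix concrete, here is a small worked example on a toy corpus of plain strings. It is only a sketch: it normalises counts by document length for tf, which is one common convention among several, and it assumes that every term occurs in at least one document and that no document is empty (otherwise the divisions below would fail).

```python
import numpy as np

vocab = ["library", "science", "retrieval"]
docs = ["library science library", "science retrieval", "library library"]
word2idx = {w: i for i, w in enumerate(vocab)}

# Rows are terms, columns are documents
X = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        if word in word2idx:
            X[word2idx[word], j] += 1

# tf: counts normalised by document length (one common convention)
tf = X / X.sum(axis=0, keepdims=True)
# idf: log(D / number of documents containing the term)
idf = np.log(X.shape[1] / (X > 0).sum(axis=1))
# tf-idf: each row of tf is scaled by the idf of its term
tfidf = tf * idf[:, None]
print(X)
print(idf)
print(tfidf)
```

The term `retrieval` occurs in a single document, so it receives the highest idf score, log(3).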

In [None]:
import numpy as np
from collections import Counter

def bow_TXD(docset,vocab_list,vocabulary=None):
    """
    Creates a term document matrix from bags of words
    Args:
        docset (list) : the list of documents 
        vocab_list (list): the vocabulary list
        vocabulary(dict): the vocabulary dictionary indexed by index_vocab(...) or None
    Returns:
       a numpy array. A term document matrix with one row per vocabulary entry and one column per document
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
def tf(docvec):
    """
    Computes the term frequencies from a vector of raw counts.
    Args :
       docvec (numpy.array) : a vector of word counts
    Returns:
      numpy.array. A vector of word frequencies
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    
def idf(termvec):
    
    """
    Computes the idf scores from a vector of raw counts
    Args:
       termvec (numpy.array): a vector of occurrence counts in the set of documents for some term t
    Returns:
       float. The idf score for the term t computed as log( D / |{d : count(t,d) > 0}| ), where D is the number of documents
    """
    # YOUR CODE HERE
    raise NotImplementedError()

def idfvec(tdmat):
    """
    Computes the idf scores for a full term document matrix
    
    Args:
       tdmat (numpy.array): a term document matrix
    Returns:
       numpy.array with idf scores, one score for each word in the vocabulary 
    """
    # YOUR CODE HERE
    raise NotImplementedError()


def tfidf(doc_vec,idf_vec):
    """
    Computes tf-idf scores from a vector of raw counts for a document  
    Args:
       doc_vec (numpy.array): a vector of word counts
       idf_vec (numpy.array): a vector of word idf scores
    Returns:
       numpy.array with tfidf scores
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
#TESTS

Querying 
----
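
A usual choice for scoring a query against a document in a vector space model is the cosine similarity, which the import of `norm` in the next cell hints at. The sketch below assumes that this is the intended score; the function name `cosine` and the convention of returning 0.0 for a null vector are illustrative choices.

```python
import numpy as np
from numpy.linalg import norm

def cosine(qvec, docvec):
    """Cosine of the angle between two vectors, 0.0 if either one is null."""
    denom = norm(qvec) * norm(docvec)
    if denom == 0:
        return 0.0
    return float(np.dot(qvec, docvec) / denom)

q = np.array([1.0, 0.0, 1.0])
print(cosine(q, np.array([2.0, 0.0, 2.0])))  # same direction as q
print(cosine(q, np.array([0.0, 3.0, 0.0])))  # orthogonal to q
print(cosine(q, np.zeros(3)))                # null document vector
```

Because the cosine ignores vector length, long and short documents pointing in the same direction obtain the same score.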

In [None]:
from numpy.linalg import norm


def score_doc(qvec,docvec):
    """
    Scores the query against a document
    Args:
        qvec  (numpy.array): the vector for the query
        docvec(numpy.array): the vector for the document
    """
    # YOUR CODE HERE
    raise NotImplementedError()


We can also use Latent Semantic Analysis for document representation. This amounts to computing latent vectors for documents
with a singular value decomposition (SVD):

$$
\mathbf{X} = \mathbf{T}\Sigma\mathbf{D}^\top
$$

where the latent document vectors are the columns of $\mathbf{D}^\top$ (that is, the rows of $\mathbf{D}$).
A query vector, built like a single column of $\mathbf{X}$, can be mapped to the same reduced space with the transformation
that maps $\mathbf{X}$ to $\mathbf{D}^\top$:

$$
\Sigma^{-1}\mathbf{T}^\top \mathbf{X} = \mathbf{D}^\top
$$

This can be justified by the following rewriting of the SVD:

$$
\begin{align*}
\mathbf{X} &= \mathbf{T}\Sigma\mathbf{D}^\top\\
\mathbf{T}^\top \mathbf{X} &= \Sigma\mathbf{D}^\top\\
\Sigma^{-1}\mathbf{T}^\top \mathbf{X} &= \mathbf{D}^\top
\end{align*}
$$

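and by recalling that the columns of $\mathbf{T}$ are orthonormal, so that $\mathbf{T}^\top\mathbf{T} = \mathbf{I}$ (the singular values kept in $\Sigma$ must also be nonzero for $\Sigma^{-1}$ to exist).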
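
The identity can be checked numerically on a small random matrix. The sketch below uses `numpy.linalg.svd` with `full_matrices=False` and keeps the first `k` singular triplets; the matrix sizes, the seed and the variable names are arbitrary choices made for illustration.

```python
import numpy as np
from numpy.linalg import svd

rng = np.random.default_rng(0)
X = rng.random((6, 4))          # toy term-document matrix: 6 terms, 4 documents
k = 2                           # size of the latent space

# Thin SVD: X = T @ diag(sigma) @ Dt, singular values sorted in decreasing order
T, sigma, Dt = svd(X, full_matrices=False)
T_k, sigma_k, Dt_k = T[:, :k], sigma[:k], Dt[:k, :]

# Applying Sigma^{-1} T^T to X recovers the latent document vectors (columns of Dt_k)
mapped_docs = (T_k.T @ X) / sigma_k[:, None]
print(np.allclose(mapped_docs, Dt_k))

# A query built like a column of X is mapped in the same way
q = X[:, 0]
q_latent = (T_k.T @ q) / sigma_k
print(np.allclose(q_latent, Dt_k[:, 0]))
```

Dividing by `sigma_k` is equivalent to multiplying by the inverse of the diagonal matrix $\Sigma$ restricted to its first `k` entries.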


In [None]:
import ir_measures
from ir_measures import Qrel, ScoredDoc, P, R, nDCG

from numpy.linalg import svd


class IRModel:

    def __init__(self,docset, use_tfidf=True, use_svd=False, max_vocab_size=20000, k = None):
        """
        This class gathers the query evaluation functionality.
        In case of efficiency issues, try to reduce the vocabulary size or k.
        (this won't scale up anyway for large datasets)   
        Args:
            docset (list)  : the list of documents as dictionaries
            use_tfidf(bool): controls whether the term document matrix is transformed with tf-idf
            use_svd  (bool): controls whether SVD is used to reduce dimensionality
            k (int)        : the size of the latent space for svd
            max_vocab_size (int): maximum size for the vocabulary.
        """
        self.use_tfidf = use_tfidf
        self.use_svd   = use_svd
        
        print('Building vocabulary...')
        vocab_list      = make_vocab(docset,max_vocab_size)
        self.vocabulary = index_vocab(vocab_list)
        
        print('Building term document matrix...')
        tdmat           = bow_TXD(docset,vocab_list,self.vocabulary)

        
        if self.use_tfidf:

            #update tdmat using the tfidf transformation
            
            # YOUR CODE HERE
            raise NotImplementedError()
        
        if self.use_svd:

            #assign self.U, self.sigma and tdmat using truncated SVD decomposition
            #read the doc 
            print('Computing SVD...')

            # YOUR CODE HERE
            raise NotImplementedError()
        
        else:
            print('shape',tdmat.shape)            

        #Stores document vectors with indices starting at 1 (to be compatible with dataset !)  
        self.docvecs = {str(idx+1):docvec for idx,docvec in enumerate(tdmat.T)}
    
        print('done')


    
    def query2vec(self,query):
        """
        Encodes a query as a vector.
        Args :
           query (string): the query text
        Returns:
           numpy.array : the vector for this query
        """
        text = preproc_doc(query)
        
        counts = Counter(text.split())
        docvec = np.zeros(len(self.vocabulary)) 
        for word, count in counts.items():
            if word in self.vocabulary:
                idx = self.vocabulary[word]
                docvec[idx] = count
        
        if self.use_tfidf: 
    
            #maps the docvec with tf-idf
            
            # YOUR CODE HERE
            raise NotImplementedError()
        
        if self.use_svd:
            
            #maps the docvec to the lower dimensional space 
            # YOUR CODE HERE
            raise NotImplementedError()

        return docvec

    def make_run(self,queries, kresults=50,verbose=False):
        """
        A run is the list of the model search results for each query in the list of queries.

        Args:
           queries (list): a list of queries as dictionaries
           kresults (int): the max number of results per query
           verbose (bool): a flag controlling whether we want to display verbose debugging and analysis messages
        Returns:
           a list of ir_measures.ScoredDoc objects that can be used as is for quantitative evaluation
        """
        
        run = [ ]
        for query in queries:
            
            qvec       = self.query2vec(query['text']) 

            #create a list of scores for documents given this query and truncate it with cutoff kresults
            #for each document in self.docvecs create a couple (score,doc_id) and add it to the list

            #scores = ...

            # YOUR CODE HERE
            raise NotImplementedError()
            
            #Required for scoring the evaluation
            for score, doc_id in scores:
                run.append(ScoredDoc(query['query_id'],doc_id,score))
        return run

    
    def supervised_query(self,qidx, queries,qrels):
        """
        Prints the results of a query execution. Useful for debugging and/or qualitative analysis 
        Args:
            qidx    (int): the ID of the query given by qrels
            queries(list): list of queries as dictionaries
            qrels  (list): list of qrels as Qrel objects
        """
        #Optional exercise : 
        #    given a query ID, print the query and the reference results
        
        # YOUR CODE HERE
        raise NotImplementedError()


In [None]:
# Train your best model here, you may try several sets of parameters or even search programmatically for good models

search_model = IRModel(docset,use_svd=False,max_vocab_size=1000,k=None)

# your best model has to be assigned to the search_model variable



Evaluation section
-------
The evaluation relies on the `ir_measures` library.
This library evaluates so-called "runs" returned by the model against a reference solution.
A run is a list of `ScoredDoc` objects: each query in the evaluation set generates some answers, stored as `ScoredDoc`.
The `ScoredDoc` list is compared to the reference solution, stored as a list of qrels.

We will evaluate against the nDCG@5, P@5 and R@5 metrics.
Look up what these metrics actually mean.
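
As a starting point, the three metrics can be sketched for a single query with binary relevance as follows. This is a simplified illustration under stated assumptions, not the implementation used by `ir_measures`, which may differ in details such as graded relevance; the function names and the toy data are hypothetical.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top-k results."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG of the top-k results divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(1 / math.log2(rank + 2)
              for rank, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal

ranked = ["3", "7", "1", "9", "4"]      # hypothetical system output, best first
relevant = {"3", "1", "8", "5"}         # hypothetical reference judgements
print(precision_at_k(ranked, relevant, 5))
print(recall_at_k(ranked, relevant, 5))
print(ndcg_at_k(ranked, relevant, 5))
```

Here 2 of the 5 returned documents are relevant, out of 4 relevant documents in total. The functions assume that `relevant` is not empty; a query without any reference judgement would need special handling.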



In [None]:
#Loading reference solutions for the queries in all_qrels variable

with open('cisi_qrels.json') as qrels:
    all_qrels = [Qrel(QREL['query_id'],QREL['doc_id'],QREL['relevance']) for QREL in json.loads(qrels.read())]


In [None]:
#This cell cannot be modified !#

#Computes evaluation
best_run   = search_model.make_run(queries,kresults=5,verbose=False)
evaluation = ir_measures.calc_aggregate([nDCG@5, P@5, R@5 ], all_qrels, best_run)

#Prints the evaluation results
print('Global evaluation')
print(' | '.join(  f'{key}:{val}' for key, val in evaluation.items()))
print('='*80)
print()

for metric in ir_measures.iter_calc( [nDCG@5, P@5, R@5 ], all_qrels, best_run):
    print(f'query id = {metric.query_id}| measure = {metric.measure}| value = {metric.value}')


In [None]:
#Auto tests