## General requests
- comment the code
- code able to run on Linux/macOS/unix-like environment
- way to save and load the entire index from disk (avoid re-indexing when the program starts)
- consider performance
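
One possible way to meet the save/load request is Python's standard `pickle` module. The sketch below is an assumption, not code from this notebook: the helper names `save_index` and `load_index` and the file name `vsm_index.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

def save_index(index, path):
    """Serialize the index (any picklable object) to disk."""
    with open(path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_index(path):
    """Load a previously saved index (only unpickle files you wrote yourself)."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy index: a dictionary mapping each term to a set of docIDs.
toy_index = {"ciao": {0, 2}, "come": {0}}
index_path = os.path.join(tempfile.gettempdir(), "vsm_index.pkl")  # hypothetical file name
save_index(toy_index, index_path)
restored = load_index(index_path)
```

At start-up the program could check whether the file exists and call `load_index` instead of re-indexing the corpus.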

## Tasks
Write an IR system using the Vector Space Model:
1. able to answer free-form text queries 
2. allowing relevance feedback
3. allowing the use of pseudo-relevance feedback
4. evaluate it on a set of test queries

## Model
Each document is seen as a vector with:
- a component for each term in the dictionary
- as the value of each component, the $\text{tf-idf}_{t,d}$ of the term $t$ in the document $d$

$\text{tf-idf}_{t,d} = 0$ for all terms not in the document

### To score a document
Sum the $\text{tf-idf}_{t,d}$ values for all terms appearing in $q$:
$$ Score(q,d) = \sum_{t \in q} \text{tf-idf}_{t,d} $$
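
As a small illustration of this sum, the sketch below scores one document against a query. The weights in `tfidf_d` and the function name `score` are made up for the example.

```python
# Hypothetical tf-idf weights of the terms of one document d
tfidf_d = {"cold": 1.2, "war": 0.7, "russia": 2.1}

def score(query_terms, doc_weights):
    """Sum the tf-idf of every query term; terms not in the document count as 0."""
    return sum(doc_weights.get(t, 0.0) for t in query_terms)

# "peace" does not occur in d, so only "cold" and "russia" contribute
s = score(["cold", "russia", "peace"], tfidf_d)
```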

### Terms
Elements of the canonical basis of $ \mathbb{R}^n $, where $n$ is the number of terms in the dictionary

### Documents
n-dimensional vector whose elements are the $\text{tf-idf}_{t,d}$ values

Document vectors can be normalized and replaced with a unit vector: $ v(d) = \frac{V(d)}{|V(d)|} $

### Queries
Unit vector with the non-zero components corresponding to the query terms

### Cosine similarity 
1. to compare documents: $ sim(d_1, d_2) = \frac{V(d_1) \cdot V(d_2)}{|V(d_1)| |V(d_2)|} $, the cosine of the angle formed by the two vectors

2. to answer queries: $ score(q,d_1) = v(q) \cdot v(d_1) = \frac{V(q) \cdot V(d_1)}{|V(q)| |V(d_1)|} $
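
The two formulas can be checked on toy vectors. This is a minimal pure-Python sketch; the helper names `dot`, `normalize` and `cosine` are chosen for this example only.

```python
from math import sqrt

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Return v / |v|; the zero vector is returned unchanged."""
    length = sqrt(dot(v, v))
    return [x / length for x in v] if length > 0 else list(v)

def cosine(a, b):
    """Cosine similarity as the dot product of the two unit vectors."""
    return dot(normalize(a), normalize(b))

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]  # same direction as d1
d3 = [0.0, 0.0, 3.0]  # shares no term with d1
sim_parallel = cosine(d1, d2)
sim_orthogonal = cosine(d1, d3)
```

Vectors pointing in the same direction get similarity 1 regardless of their length, and vectors with no term in common get 0.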

# Load MED dataset - Medline Documents Collection

In [1]:
import re

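def load_docs(corpus):
    """
    Returns a list with one entry per document, in file order;
    each entry is the list of terms in the document.
    """
    arts = []
    with open(corpus, "r") as f:
        t = []  # terms of the document being read
        r = False  # True while inside a ".W" (text) section
        for row in f:
            if row.startswith(".W"):
                r = True
                continue
            if row.startswith(".I"):
                # a new document starts: store the previous one
                if t != []:
                    arts.append(t)
                t = []
                r = False
                # docid = int(row[3:].replace("\n", ""))
            if r:
                # keep only letters and whitespace, then tokenize
                row = re.sub(r'[^a-zA-Z\s]+', '', row)
                t += row.split()
        # store the last document, which has no following ".I" line
        if t != []:
            arts.append(t)
    return arts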

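def load_queries(query_file):
    """
    Returns a dictionary of lists, with keys the queryID 
    and as values a list of terms occurring in the query.
    """
    q = dict()
    with open(query_file, "r") as f:
        t = []  # terms of the query being read
        r = False  # True while inside a ".W" (text) section
        for row in f:
            if row.startswith(".W"):
                r = True
                continue
            if row.startswith(".I"):
                # a new query starts: store the previous one
                if t != []:
                    q[qid] = t
                t = []
                r = False
                qid = int(row[3:].replace("\n", ""))
            if r:
                # keep only letters and whitespace, then tokenize
                row = re.sub(r'[^a-zA-Z\s]+', '', row)
                t += row.split()
        # store the last query, which has no following ".I" line
        if t != []:
            q[qid] = t
    return q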

def load_relevance(relevance_file):
    """
    Returns a dictionary of lists, with keys the queryID and 
    as values the list of documents relevant to that query.
    """
    rel = dict()
    with open(relevance_file, "r") as f:
        for row in f:
            r = row.split(" ")
            qid = int(r[0])
            # the docID is read from the third space-separated field
            rel.setdefault(qid, []).append(int(r[2]))

    return rel

# Vector Space Model

In [4]:
from collections import OrderedDict, Counter
from math import log, sqrt
import re
import numpy as np

In [5]:
corpus = load_docs("./cran_data/cran.all.1400")

Function to compute the index. A positional index is a dictionary that maps each term to a dictionary with docIDs (integers referring to one document) as keys and, as values, the list of positions of the term in that document:

$ \texttt{ \{"RUSSIA" : \{0:[1,6,35], 10:[6,22,105]\}, "COLD" : \{12:[6], 100:[2]\}\} } $

The function below builds the simpler inverted index, which keeps only the set of docIDs for each term.

In [9]:
corpus = [["ciao", "come", "va", "ciao"], ["tutto", "bene", "te"], ["random", "ciao"]]

In [7]:
def inverted_index(corpus):
    """
    Builds an inverted index.
    Returns a dictionary with terms as keys and, for each term,
    the set of docIDs of the documents containing the term.
    """
    idx = dict()
    for docid, doc in enumerate(corpus):  # for each document in the corpus
        for t in doc:  # for each term in the document
            idx.setdefault(t, set()).add(docid)
    return idx

In [10]:
inverted_index(corpus)

{'ciao': {0, 2},
 'come': {0},
 'va': {0},
 'tutto': {1},
 'bene': {1},
 'te': {1},
 'random': {2}}
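
The positional variant described above, with term positions stored per document, could be sketched as follows. The function name `positional_index` is an assumption for this example and is not used elsewhere in the notebook.

```python
def positional_index(corpus):
    """
    Map each term to a dictionary {docID: [positions]}, where positions
    are the 0-based offsets of the term in the tokenized document.
    """
    idx = {}
    for docid, doc in enumerate(corpus):
        for pos, term in enumerate(doc):
            idx.setdefault(term, {}).setdefault(docid, []).append(pos)
    return idx

toy_corpus = [["ciao", "come", "va", "ciao"], ["tutto", "bene", "te"], ["random", "ciao"]]
p_idx = positional_index(toy_corpus)
# p_idx["ciao"] is {0: [0, 3], 2: [1]}
```

The document frequency of a term is still `len(p_idx[term])`, so this structure can be passed to the IDF function in the same way as the inverted index.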

Function to compute the inverse document frequency for each term. Returns a dictionary with terms as keys and IDF as values.
$ \texttt{ \{"THE":0.0, "REVENUES": 1.6094379124341003\}} $

In [11]:
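def inverse_doc_freq(p_idx, n_docs):
    """
    Returns a dictionary with terms as keys and, as values, the
    IDF log(n_docs / df_t), using the natural logarithm.
    """
    idf = dict()
    for t, postings in p_idx.items():
        # df_t = number of documents containing the term t
        idf[t] = log(n_docs / len(postings))
    return idf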

In [13]:
inverse_doc_freq(inverted_index(corpus), len(corpus))

{'ciao': 0.4054651081081644,
 'come': 1.0986122886681098,
 'va': 1.0986122886681098,
 'tutto': 1.0986122886681098,
 'bene': 1.0986122886681098,
 'te': 1.0986122886681098,
 'random': 1.0986122886681098}

Function to compute the TF-IDF for each term and each document. Returns a NumPy array of shape `(n_docs, n_terms)`: row `docid` is the vector of document `docid`, and the columns follow the order of the terms in the index.

In this way, each document is seen as an n-dimensional vector, with n being the number of terms in the dictionary.

In [17]:
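def tfidf_vectors(docs, p_idx):
    """
    Returns an (n_docs, n_terms) array with the tf-idf of each term
    (column) in each document (row); absent terms stay at 0.
    """
    n_docs = len(docs)
    vocab = list(p_idx.keys())
    # term -> column, so no linear vocab.index() search is needed
    col = {t: i for i, t in enumerate(vocab)}
    vecs = np.zeros((n_docs, len(vocab)))
    idf = inverse_doc_freq(p_idx, n_docs)

    for docid in range(n_docs):
        count = Counter(docs[docid])  # raw term frequencies

        for t, tf in count.items():
            if t in col:
                vecs[docid][col[t]] = idf[t] * tf
    return vecs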

In [18]:
tfidf_vectors(corpus, inverted_index(corpus))

array([[0.81093022, 1.09861229, 1.09861229, 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.        , 0.        , 1.09861229, 1.09861229,
        1.09861229, 0.        ],
       [0.40546511, 0.        , 0.        , 0.        , 0.        ,
        0.        , 1.09861229]])

In [83]:
# def tf_idf(docs, p_idx):
#     tfidf = {}
#     n_docs = len(docs)
#     idf = inverse_doc_freq(p_idx, n_docs)
#     for docid in range(len(docs)):  #for each document
#         doc_vector = {}  # we store for each term in the document the tfidf 
#                          # and for those not in the document a 0
#         count = Counter(docs[docid])
#         for t in p_idx.keys():
#             if t in docs[docid]:
#                 doc_vector[t] = idf[t] * count[t]
#             else:
#                 doc_vector[t] = 0
#         tfidf[docid] = doc_vector
#     return tfidf

Function to convert a query, given as free-form text, to an n-dimensional vector, with n being the number of terms in the dictionary. A component is 1 if the term is present in the query and 0 otherwise.

$ \texttt{query = "revenues tree"} $

$ \texttt{ \{"THE":0, "REVENUES":1\} } $


In [14]:
queries = load_queries("./cran_data/cran.qry")

In [15]:
queries = "ciao sono elena come stai"

In [19]:
def query_vector(query, p_idx):
    """
    Returns a vector with 1 for each vocabulary term that occurs
    in the (lowercased) query and 0 elsewhere.
    """
    q = set(query.lower().split())
    vocab = list(p_idx.keys())
    vec = np.zeros(len(vocab))

    for ind, t in enumerate(vocab):
        if t in q:
            vec[ind] = 1
    return vec

In [21]:
query_vector(queries, inverted_index(corpus))

array([1., 1., 0., 0., 0., 0., 0.])

In [86]:
# def query_as_vector(q, p_idx):
#     # q = q.lower().split(" ")
#     q_vec = {}
#     count = Counter(q)
#     # l = len(q)
#     idf = inverse_doc_freq(p_idx, n_docs)
#     for t in list(p_idx.keys()):
#         if t in q:
#             # q_vec[t] = 1
#             q_vec[t] = count[t] * idf[t]
#         else:
#             q_vec[t] = 0
#     return q_vec

Function to compute the relevance score for each document given a query. The score is computed as the cosine similarity between the query vector and the document vector. It returns a dictionary with the docIDs as keys and the relevance score as values.

$ \texttt{ \{0:0.0, 1:0.03272921367166952\} } $

In [22]:
def relevance_scores(query, p_idx, docs):
    scores = dict()  # for each document we store the cosine similarity
                     # between the query and the document
    q = query_vector(query, p_idx)
    q_length = sqrt(sum(q**2))

    tfidf = tfidf_vectors(docs, p_idx)

    for docid in range(len(docs)):
        d = tfidf[docid,]
        d_length = sqrt(sum(d**2))
        # dot product over the whole vocabulary: query_vector already
        # lowercases the query, and a repeated query term counts once
        cos_sim = float(np.dot(d, q))
        
        if cos_sim == 0:
            scores[docid] = 0
        else:
            scores[docid] = cos_sim / (q_length * d_length)
    
    return scores


In [23]:
relevance_scores(queries, inverted_index(corpus), corpus)

{0: 0.7704397233076248, 1: 0, 2: 0.24482975009584626}

Function to return the k most relevant documents given a query.



In [25]:
def vector_space_model(query, k, corpus):
    # for a given query and the tfidf matrix, 
    # we return the top k documents
    # relevant for the given query

    index = inverted_index(corpus)

    scores = relevance_scores(query, index, corpus)
    sorted_value = OrderedDict(sorted(scores.items(), key=lambda x: x[1], reverse=True))
    topk = {key : sorted_value[key] for key in list(sorted_value)[:k] if sorted_value[key]!=0}

    return topk

In [26]:
topk_docs = vector_space_model(queries, 1, corpus)
topk_docs

{0: 0.7704397233076248}
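
On a large corpus, sorting every score just to keep the first k is more work than needed. A possible alternative, shown here only as a sketch, selects the top k with `heapq.nlargest` from the standard library; the helper name `top_k` is hypothetical.

```python
import heapq

def top_k(scores, k):
    """Return the k highest-scoring documents, dropping zero scores."""
    best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return {docid: s for docid, s in best if s != 0}

# Scores in the same shape as the dictionary returned by relevance_scores
toy_scores = {0: 0.7704397233076248, 1: 0, 2: 0.24482975009584626}
top2 = top_k(toy_scores, 2)
```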

# Relevance and pseudo-relevance feedback

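Feedback is given as a query together with a list of `(docID, feedback_value)` pairs, where 0 means irrelevant and 1 means relevant. In the cells below the feedback is already split into two lists of docIDs, `rel_docs` (relevant) and `nrel_docs` (non-relevant).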
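
For reference, the feedback functions below are meant to implement the Rocchio update. In the notation of the Model section, with $V(q)$ the original query vector and $D_r$, $D_{nr}$ the sets of relevant and non-relevant documents (these two set names are introduced here for illustration):

$$ V(q_{opt}) = \alpha V(q) + \beta \frac{1}{|D_r|} \sum_{d \in D_r} V(d) - \gamma \frac{1}{|D_{nr}|} \sum_{d \in D_{nr}} V(d) $$

Negative components of $V(q_{opt})$ are set to 0. The code uses the defaults $\alpha = 1$, $\beta = 0.75$, $\gamma = 0.15$. Pseudo-relevance feedback treats the top $k$ retrieved documents as $D_r$ and sets $\gamma = 0$.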

In [36]:
rel_docs = [1]
nrel_docs = [0,2]

In [37]:
tfidf = tfidf_vectors(corpus, inverted_index(corpus))
vocab = list(inverted_index(corpus).keys())

In [61]:
def relevance_feedback_rocchio(rel_docs, nrel_docs, query, tfidf, vocab, alpha=1, beta=0.75, gamma=0.15):
    """
    Rocchio update: alpha*q + beta*(mean of the relevant vectors)
    - gamma*(mean of the non-relevant vectors), negatives set to 0.
    """
    # query_vector only needs the vocabulary, in column order
    q = query_vector(query, dict.fromkeys(vocab))
    q_opt = alpha * q
    if len(rel_docs) != 0:
        # per-term mean over the relevant documents
        q_opt = q_opt + beta * tfidf[rel_docs].mean(axis=0)
    if len(nrel_docs) != 0:
        # per-term mean over the non-relevant documents
        q_opt = q_opt - gamma * tfidf[nrel_docs].mean(axis=0)
    # negative weights are clipped to 0
    return np.maximum(q_opt, 0)

In [55]:
q_opt = relevance_feedback_rocchio(rel_docs, nrel_docs, queries, tfidf, vocab)

In [62]:
def pseudo_relevance_feedback(query, tfidf, vocab, k=10):
    """
    Assumes the top k retrieved documents are relevant and applies
    the Rocchio update without a non-relevant set (gamma = 0).
    """
    rel_docs = list(vector_space_model(query, k, corpus).keys())
    q_opt = relevance_feedback_rocchio(rel_docs=rel_docs, nrel_docs=[], query=query, tfidf=tfidf, vocab=vocab, gamma=0)
    return q_opt

In [63]:
pseudo_relevance_feedback(queries, tfidf, vocab, k=1)

array([1.60819766, 1.82395922, 0.82395922, 0.        , 0.        ,
       0.        , 0.        ])

# Stemming

# Evaluation

In [None]:
def calculate_precision(k):
    """To generate list of precision values for each query for given value of k
    
    Arguments:
        k {integer} -- number of top documents to be retrieved
    
    Returns:
        list -- list of precision values for each query
    """
    precision = []
    for i in range(len(queries)):
        
        # Number of relevant documents retrieved
        a = intersection(list_of_docs(k)[i][1].tolist(), query_rel[i])
        
        # Total number of documents retrieved
        b = k
        p = a / b
        precision.append(p)
    return precision

# for top 100 docs
calculate_precision(no_of_top)

In [None]:
def calculate_recall(k):
    """To generate list of recall values for each query for given value of k
    
    Arguments:
        k {integer} -- number of top documents to be retrieved 
    
    Returns:
        list -- list of recall values for each query
    """
    
    recall = []
    for i in range(len(queries)):
        
        # Number of relevant documents retrieved
        a = intersection(list_of_docs(k)[i][1].tolist(), query_rel[i])
        
        # Total number of relevant documents
        b = len(query_rel[i])
        r = a / b
        recall.append(r)
    return recall

# for top 100 docs
calculate_recall(no_of_top)

In [None]:
def intersection(lst1, lst2): 
    """To count number of common items between 2 lists
    
    Arguments:
        lst1 {list} -- list 1
        lst2 {list} -- list 2
    
    Returns:
        integer -- number of common items between list 1 & list 2 
    """
    members = set(lst2)  # set membership is O(1) instead of scanning lst2
    lst3 = [value for value in lst1 if value in members] 
    return len(lst3) 
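
The cells above depend on names that are not defined in this notebook (`list_of_docs`, `query_rel`, `no_of_top`). As a self-contained illustration of the same two measures, the sketch below computes precision and recall at k for a single query; the function name `precision_recall_at_k` and the toy docIDs are assumptions made for the example.

```python
def precision_recall_at_k(ranked_docids, relevant_docids, k):
    """
    ranked_docids: docIDs ordered by decreasing score for one query.
    relevant_docids: docIDs judged relevant for that query.
    Returns (precision, recall) computed on the top k retrieved documents.
    """
    retrieved = ranked_docids[:k]
    relevant = set(relevant_docids)
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: of the top 2 documents (3 and 7), only 7 is relevant
p, r = precision_recall_at_k([3, 7, 1, 9], [7, 9, 12], k=2)
```

Here one of the two retrieved documents is relevant, and one of the three relevant documents is retrieved.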