# Information Retrieval Project

### In order to evaluate the Nordlys toolkit, we want to answer to the following questions:
- How  are  Nordlys  toolkit  performances,  considering  the  different  types  of  questions  (SemSearchES,INEX-LD, QALD2, ListSearch)?
- How are Nordlys performances compared to the Google search engines?


Here we load the list of queries

In [1]:
with open('query_list.txt', 'r') as f:
    text = f.read()
    
raw_queries = text.split("\n")
raw_queries[0].split("\t")

list_queries=[]
for i in list_queries:
    list_queries.append(i.split("\t")[1])
    
list_queries

[]

In order to calculate MAP and Normalized DCG we need to know how mnay relevant results exist for each query

In [2]:
import codecs
f2 = codecs.open("query_relevance.txt", encoding="utf-8")
text = f2.read()


relevance_lines = text.split("\n")

query_counts = {}
current_query = text.split("\n")[0].split("\t")[0]
temp_count_list = [0,0,0]


counter = 0
while (counter < len(relevance_lines)):
    query_name = relevance_lines[counter].split("\t")[0]
    relevance = relevance_lines[counter].split("\t")[3]
    if query_name == current_query:
        temp_count_list[int(relevance)] += 1
    else:
        query_counts[current_query] = temp_count_list
        current_query = query_name
        temp_count_list = [0,0,0]
    counter += 1
    if counter == len(relevance_lines):
        query_counts[current_query] = temp_count_list

print(query_counts)

{'INEX_LD-2009022': [97, 24, 11], 'INEX_LD-2009039': [52, 101, 37], 'INEX_LD-2009053': [38, 23, 2], 'INEX_LD-2009061': [46, 57, 6], 'INEX_LD-2009062': [77, 19, 0], 'INEX_LD-2009063': [83, 134, 20], 'INEX_LD-2009074': [42, 17, 4], 'INEX_LD-2009096': [43, 32, 5], 'INEX_LD-2009111': [58, 32, 14], 'INEX_LD-2009115': [60, 30, 16], 'INEX_LD-2010004': [29, 76, 55], 'INEX_LD-2010014': [120, 20, 5], 'INEX_LD-2010019': [57, 8, 2], 'INEX_LD-2010020': [76, 14, 8], 'INEX_LD-2010037': [63, 21, 2], 'INEX_LD-2010043': [115, 47, 29], 'INEX_LD-2010057': [36, 47, 11], 'INEX_LD-2010069': [115, 179, 148], 'INEX_LD-2010100': [81, 8, 1], 'INEX_LD-2010106': [77, 16, 0], 'INEX_LD-20120111': [67, 26, 23], 'INEX_LD-20120112': [57, 78, 28], 'INEX_LD-20120121': [84, 7, 4], 'INEX_LD-20120122': [82, 7, 1], 'INEX_LD-20120131': [73, 19, 19], 'INEX_LD-20120132': [68, 21, 23], 'INEX_LD-20120211': [56, 32, 14], 'INEX_LD-20120212': [66, 30, 0], 'INEX_LD-20120221': [62, 59, 11], 'INEX_LD-20120222': [68, 18, 0], 'INEX_LD-20

In [10]:
print(query_counts["INEX_LD-2009022"][1])

24


## Web scraping Google

## Nordlys Entity Retrieval

In [None]:
from nord


## Calculating Mean Average Precision (MAP) measure

In the following section we will calculate the MAP of a list of query results. The result of a single query in the query itself  and a list of tuples containing the retrieved document and the relevancy of that document. These tuples are in the same order as they were returned from the retrieval system.

Since MAP only works on binary relevancy levels, we have decided that both the level 1 and 2 are relevant while level 0 is not relevant.

P(q,k) calculates the precision of a query q at k retrieved documents but only on steps wher the new retrieved document is considered to be relevant.

In [68]:
test_queries = {"INEX_LD-20120111":[("d1",1),("d2",0),("d3",0),("d4",2),("d5",1)],"INEX_LD-20120112":[("d11",2),("d21",1),("d33",0),("d4",2),("d5",1)]}

In [3]:
import numpy as np
max_k = 3

def rel(q,k,input_queries):
    return int(input_queries[q][k][1]) > 0

def P(q,k,input_queries):
    k_retrieved = input_queries[q][0:(k)]
    relevant_retrieved = (x for x in k_retrieved if int(x[1]) > 0)    
    return len(list(relevant_retrieved))/(len(k_retrieved)*1.0)

def AveP(q,input_queries):
    res = 0
    for k in range(0,max_k):
        res += np.dot(P(q,k+1,input_queries),rel(q,k+1,input_queries))
    number_of_relevant = query_counts[q][1] +  query_counts[q][2]
    return res/(number_of_relevant*1.0)

def MAP(input_queries):    
    res = 0
    for q in input_queries:
        res += AveP(q,input_queries)
    return res/len(input_queries)
    
    

## Calculating Precision At K (P@K) measure

## Calculating Discounted Cumulative Gain (DCG) measure

Discounted cumulative gain sums the knowledge gain of each retrieved document divided by log2 of the position. This means that a a very important document (rank 2) at position 3 will give a higher score than a very important document at position 10.

For each query we canculate DCG@k meaning that only the first k query results are considered in the calculation.
For the normalization we also calculate IDCG, which is the ideal DCG for a perfect ranking. This is done by based on the list how many relevant documents of each rank exists for each query.

In [26]:
import numpy as np

def DCG(query_results,k):
    dcg = float(query_results[0][1])
    for i in range(1,k):
        dcg += float(query_results[i][1])/float(np.log2(i+1))
    return dcg

def IDCG(q,k):
    #savek in j for later
    j = k
    relevant_for_query = query_counts[q]
    gains = []
    #Add all gains of 2 and subtract that number from k
    for i in range(max(k,relevant_for_query[2])):
        gains.append(["Dummy",2])
    k = k-relevant_for_query[2]
    #Add all gains of 1 and subtract that number from k
    for i in range(max(k,relevant_for_query[1])):
        gains.append(["Dummy",1])
    k = k-relevant_for_query[1]
    #Fill the rest of the list with gain of 0 to avoid indexing problems
    for i in range(k):
        gains.append(["Dummy",0])        
    
    idcg = DCG(gains,j)    
    #print(idcg)
    return idcg
    

def AvgNDCG(input_queries,k):
    sum_ndcg = 0
    for q in input_queries:
        if len(input_queries[q]) >= 20:
            addition = (DCG(input_queries[q],k)/IDCG(q,k))
            sum_ndcg += addition
    return sum_ndcg/len(input_queries)
        


## Calculations for Nordlys

In [29]:
import codecs
nordlys_reader = codecs.open("nordlys_retrievals_jm.txt", encoding="utf-8")
text = nordlys_reader.read()
import json
Nordlys_queries = json.loads(text)
#print(parsed_json)
#for q in parsed_json:
#    if len(json_queries[q]) < 20:
#        print(q)
#        print(json_queries[q])
AvgNDCG(Nordlys_queries,3)

0.24260516178473907

In [115]:
for x in query_counts:
    print(x)
len(query_counts)

INEX_LD-2009022
INEX_LD-2009039
INEX_LD-2009053
INEX_LD-2009061
INEX_LD-2009062
INEX_LD-2009063
INEX_LD-2009074
INEX_LD-2009096
INEX_LD-2009111
INEX_LD-2009115
INEX_LD-2010004
INEX_LD-2010014
INEX_LD-2010019
INEX_LD-2010020
INEX_LD-2010037
INEX_LD-2010043
INEX_LD-2010057
INEX_LD-2010069
INEX_LD-2010100
INEX_LD-2010106
INEX_LD-20120111
INEX_LD-20120112
INEX_LD-20120121
INEX_LD-20120122
INEX_LD-20120131
INEX_LD-20120132
INEX_LD-20120211
INEX_LD-20120212
INEX_LD-20120221
INEX_LD-20120222
INEX_LD-20120231
INEX_LD-20120232
INEX_LD-20120311
INEX_LD-20120312
INEX_LD-20120321
INEX_LD-20120322
INEX_LD-20120331
INEX_LD-20120332
INEX_LD-20120411
INEX_LD-20120412
INEX_LD-20120421
INEX_LD-20120422
INEX_LD-20120431
INEX_LD-20120432
INEX_LD-20120511
INEX_LD-20120512
INEX_LD-20120521
INEX_LD-20120522
INEX_LD-20120531
INEX_LD-20120532
INEX_LD-2012301
INEX_LD-2012303
INEX_LD-2012305
INEX_LD-2012307
INEX_LD-2012309
INEX_LD-2012311
INEX_LD-2012313
INEX_LD-2012315
INEX_LD-2012317
INEX_LD-2012318
INEX_LD-20

467