# Cranfield dataset processing

This notebook builds vector space models for the documents and queries in the [Cranfield](http://ir.dcs.gla.ac.uk/resources/test_collections/cran/) collection.

In [3]:
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import numpy as np

Helper functions. `get_top_n()` returns the 1-based IDs of the top `n` documents (ten by default) for each query, best first. `get_precision()` returns precision, recall and F-score for a single query, given the query index and the documents retrieved for it.

In [4]:
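def get_top_n(matrix, n=10):
	# For each query (row), take the n highest-scoring columns, best first;
	# +1 converts 0-based column indices to 1-based document IDs.
	return np.array([matrix[i].argsort()[-n:][::-1] + 1 for i in range(matrix.shape[0])])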

def get_precision(query_index, top_retrieved):
	relevant = []
	with open('cranfield/r/{}.txt'.format(query_index)) as f:
		for line in f:
			relevant.append(int(line))

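	tp = 0  # relevant and retrieved
	fn = 0  # relevant but not retrieved
	fp = 0  # retrieved but not relevant

	for doc in relevant:
		if doc in top_retrieved:
			tp += 1
		else:
			fn += 1

	for doc in top_retrieved:
		if doc not in relevant:
			fp += 1

	# Guard against division by zero when nothing matches.
	p = tp / (tp + fp) if tp + fp else 0.0
	r = tp / (tp + fn) if tp + fn else 0.0
	f = 2 * p * r / (p + r) if p + r else 0.0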

	return p, r, f
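
As a quick sanity check of the three measures, here is a minimal, file-free sketch on made-up relevance judgments. The function name `precision_recall_f1` and the document IDs are hypothetical and only illustrate the arithmetic; they are not part of the Cranfield data.

```python
def precision_recall_f1(relevant, retrieved):
    """Return (precision, recall, F1) for one query; 0.0 where undefined."""
    relevant, retrieved = set(relevant), set(retrieved)
    tp = len(relevant & retrieved)  # relevant documents that were retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


# Hypothetical judgments: documents 1-4 are relevant, the system returned 2, 4, 7, 9, 11.
p, r, f = precision_recall_f1([1, 2, 3, 4], [2, 4, 7, 9, 11])
print(p, r, round(f, 4))  # 0.4 0.5 0.4444
```

Two of the five retrieved documents are relevant (precision 2/5), and two of the four relevant documents were found (recall 2/4).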

Here we prepare the corpus of documents and queries for processing. Note that `corpus[:1400]` holds the 1400 documents and `corpus[1400:]` holds the 225 queries.

In [5]:
corpus = []

for d in range(1400):
    with open("cranfield/d/" + str(d + 1) + ".txt") as f:
        corpus.append(f.read())
for q in range(225):
    with open("cranfield/q/" + str(q + 1) + ".txt") as f:
        corpus.append(f.read())

Initialize the vectorizers used to build the vector space models:
* TF-IDF vectorizer -- computes the [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) weight of each term in each document or query
* Count vectorizer -- counts the occurrences of each term in each document or query
* Binary vectorizer -- 1 if the term is present in the document/query, 0 otherwise

In [6]:
tfidf_vectorizer = TfidfVectorizer()
count_vectorizer = CountVectorizer()
binary_vectorizer = CountVectorizer(binary=True)
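
To make the difference between the three weightings concrete, the following sketch runs them on a two-sentence toy corpus. The corpus and the variable names are made up for illustration. It relies on scikit-learn's defaults: the tokenizer drops single-character tokens such as "a", and `TfidfVectorizer` L2-normalizes each row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Hypothetical two-document corpus, not taken from Cranfield.
toy_corpus = ["boundary layer flow", "flow flow over a wing"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_corpus).toarray()
binary = CountVectorizer(binary=True).fit_transform(toy_corpus).toarray()
tfidf = TfidfVectorizer().fit_transform(toy_corpus).toarray()

# Column order follows the alphabetically sorted vocabulary.
terms = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(terms)   # ['boundary', 'flow', 'layer', 'over', 'wing']
print(counts)  # [[1 1 1 0 0]
               #  [0 2 0 1 1]]
print(binary)  # [[1 1 1 0 0]
               #  [0 1 0 1 1]]
print(np.linalg.norm(tfidf, axis=1))  # each TF-IDF row has unit length
```

The repeated "flow" in the second sentence shows up as 2 in the count matrix but as 1 in the binary matrix.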

Each vectorizer produces a matrix with dimensions (1625, 20679): 1625 rows for the 1400 documents and 225 queries, and 20679 columns for the distinct terms. Each row is the vector representing one document or query in the given vector space model.

In [13]:
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
count_matrix = count_vectorizer.fit_transform(corpus)
bin_matrix = binary_vectorizer.fit_transform(corpus)

Calculate the similarity between queries and documents for each vector space model (TF-IDF, count, binary) and each measure (cosine similarity, Euclidean distance). Each resulting matrix has dimensions (225, 1400); each element is the similarity or distance between one query and one document. Note that a higher cosine similarity means a closer match, whereas a higher Euclidean distance means a weaker one.

In [8]:
r_tfdif_cos = np.array(cosine_similarity(tfidf_matrix[1400:], tfidf_matrix[:1400]))
r_tfdif_euc = np.array(pairwise_distances(tfidf_matrix[1400:], tfidf_matrix[:1400]))

r_count_cos = np.array(cosine_similarity(count_matrix[1400:], count_matrix[:1400]))
r_count_euc = np.array(pairwise_distances(count_matrix[1400:], count_matrix[:1400]))

r_bin_cos = np.array(cosine_similarity(bin_matrix[1400:], bin_matrix[:1400]))
r_bin_euc = np.array(pairwise_distances(bin_matrix[1400:], bin_matrix[:1400]))
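
The two measures are closely related when the vectors have unit length, which is the case for `TfidfVectorizer` with its default L2 normalization: for unit vectors, the squared Euclidean distance equals 2 - 2·cos, so both measures rank documents identically. This does not hold for the unnormalized count and binary vectors. A small sketch on a made-up corpus (the texts and variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Hypothetical toy documents and query, not taken from Cranfield.
toy_docs = ["boundary layer flow",
            "supersonic flow over a wing",
            "heat transfer in a plate"]
toy_queries = ["flow over a wing"]

m = TfidfVectorizer().fit_transform(toy_docs + toy_queries)
docs, queries = m[:3], m[3:]

cos = cosine_similarity(queries, docs)
euc = pairwise_distances(queries, docs)  # default metric is Euclidean

# For unit-length vectors: ||q - d||^2 = 2 - 2 * cos(q, d)
print(np.allclose(euc ** 2, 2 - 2 * cos))

# Best-first rankings: descending similarity vs. ascending distance.
rank_cos = np.argsort(-cos[0])
rank_euc = np.argsort(euc[0])
print(rank_cos, rank_euc)
```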

Get the IDs of the 10 most relevant documents for each query under each vector space model and measure.

In [9]:
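# Cosine similarity: higher is better, so the scores are ranked as-is.
# Euclidean distance: lower is better, so the distances are negated
# before ranking; otherwise the farthest documents would be returned.
top_relevant_tfdif_cos = get_top_n(r_tfdif_cos)
top_relevant_tfdif_euc = get_top_n(-r_tfdif_euc)

top_relevant_count_cos = get_top_n(r_count_cos)
top_relevant_count_euc = get_top_n(-r_count_euc)

top_relevant_bin_cos = get_top_n(r_bin_cos)
top_relevant_bin_euc = get_top_n(-r_bin_euc)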
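
Because cosine similarity grows with relevance while Euclidean distance shrinks, the sort direction matters when picking the top documents. The following is a minimal sketch of a ranking helper that makes the direction explicit. The name `top_n_ids` and its `higher_is_better` flag are hypothetical, and the score matrices are made up.

```python
import numpy as np


def top_n_ids(scores, n=10, higher_is_better=True):
    """Return 1-based IDs of the n best columns per row, best first."""
    scores = np.asarray(scores)
    # argsort is ascending, so similarities are negated to get best-first order.
    keys = -scores if higher_is_better else scores
    order = np.argsort(keys, axis=1, kind="stable")
    return order[:, :n] + 1


# One hypothetical query scored against three documents.
sim = np.array([[0.1, 0.9, 0.4]])    # similarities: larger is better
dist = np.array([[1.3, 0.2, 0.8]])   # distances: smaller is better

print(top_n_ids(sim, n=2))                           # [[2 3]]
print(top_n_ids(dist, n=2, higher_is_better=False))  # [[2 3]]
```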
