# Information Retrieval

Let's download a classic test collection: the Cranfield collection of texts on aeronautics.

In [1]:
! wget -q http://ir.dcs.gla.ac.uk/resources/test_collections/cran/cran.tar.gz
! tar -xvf cran.tar.gz
! rm cran.tar.gz*

zsh:1: command not found: wget
tar: Error opening archive: Failed to open 'cran.tar.gz'
zsh:1: no matches found: cran.tar.gz*
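The output above shows that `wget` is not available on this machine. Below is a minimal sketch of the same download-and-unpack step using only the Python standard library. The helper name `fetch_and_extract` and its `dest_dir` parameter are hypothetical; the URL is the one from the cell above. The actual call is left commented out because it needs network access.

```python
import os
import shutil
import tarfile
import tempfile
import urllib.request

CRAN_URL = "http://ir.dcs.gla.ac.uk/resources/test_collections/cran/cran.tar.gz"

def fetch_and_extract(url: str, dest_dir: str = ".") -> None:
    """Download a .tar.gz archive and unpack it into dest_dir (hypothetical helper)."""
    os.makedirs(dest_dir, exist_ok=True)
    # Stream the archive into a temporary file instead of holding it in memory.
    with tempfile.NamedTemporaryFile(suffix=".tar.gz", delete=False) as tmp:
        with urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        archive_path = tmp.name
    try:
        with tarfile.open(archive_path, "r:gz") as archive:
            archive.extractall(dest_dir)
    finally:
        # Mirror the `rm cran.tar.gz*` step: the archive is not kept.
        os.remove(archive_path)

# fetch_and_extract(CRAN_URL)  # uncomment to download into the current directory
```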


We keep only the query text by dropping the markup lines that start with a dot; from here on, the queries themselves serve as our documents.

In [2]:
! grep -v "^\." cran.qry > just.qry
! head -3 just.qry

grep: cran.qry: No such file or directory
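If `grep` is unavailable, the same filtering can be sketched in Python. The function name `drop_markup_lines` is hypothetical, and the sample lines are made up for illustration (dot-prefixed markup lines followed by query text); they are not taken from the real file.

```python
def drop_markup_lines(lines):
    # Same effect as `grep -v` with the pattern "^\.": keep lines not starting with a dot.
    return [line for line in lines if not line.startswith(".")]

# Made-up sample: markup lines start with a dot, the rest is query text.
sample = [".I 001", ".W", "what is the theory of bending", "of thin plates ."]
kept = drop_markup_lines(sample)
print(kept)
```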


We merge each multi-line query into a single string

In [3]:
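# read the file with a context manager so that it is closed afterwards
with open("just.qry", "r") as query_file:
  raw_query_data = [line.strip() for line in query_file]
query_data = [""]

# a line ending with "." closes the current query and starts a new one
for query_part in raw_query_data:
  query_data[-1] += query_part + " "
  if query_part.endswith("."):
    query_data.append("")

query_data[:2]  # show a couple of documents as an example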

['']
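Since the cell above ran on an empty file, here is a small sketch of the same merging idea on made-up lines. The function name `merge_query_lines` and the toy input are hypothetical. Unlike the cell above, this variant joins parts with single spaces and does not leave a trailing empty string in the result.

```python
def merge_query_lines(lines):
    """Join consecutive lines into one query; a line ending with '.' closes it."""
    queries = []
    current = []
    for line in lines:
        current.append(line)
        if line.endswith("."):
            queries.append(" ".join(current))
            current = []
    if current:  # keep a trailing query that lacks a final period
        queries.append(" ".join(current))
    return queries

# Made-up lines: the first query spans two lines, the second fits on one.
toy_lines = [
    "what is the theory of bending",
    "of thin plates .",
    "how do aeroelastic effects arise .",
]
merged = merge_query_lines(toy_lines)
print(merged)
```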

### Let's make queries to our documents

In [4]:
QUERIES = ['theory of bending', 'aeroelastic effects']


## Boolean retrieval
Let's represent each document as a "bitmask": a vector whose dimensionality equals the vocabulary size, with 1 at each position whose term occurs in the document and 0 everywhere else.
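To make the bitmask idea concrete, here is a small pure-Python sketch on two made-up documents. The helper names `build_vocabulary` and `to_bitmask` are hypothetical, and splitting on whitespace is a simplifying assumption rather than the tokenization a library vectorizer would apply.

```python
def build_vocabulary(docs):
    # Map every distinct term to a position, in alphabetical order.
    terms = sorted({term for doc in docs for term in doc.split()})
    return {term: idx for idx, term in enumerate(terms)}

def to_bitmask(doc, vocabulary):
    # 1 where the document contains the term at that position, 0 otherwise.
    present = set(doc.split())
    return [1 if term in present else 0 for term in vocabulary]

toy_docs = ["theory of bending", "bending of plates"]
vocabulary = build_vocabulary(toy_docs)
bitmasks = [to_bitmask(doc, vocabulary) for doc in toy_docs]

print(vocabulary)  # positions: bending=0, of=1, plates=2, theory=3
print(bitmasks)
```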

In [5]:
# results can differ between library versions, so we pin the exact version
! pip install -q scikit-learn==0.22.2.post1

  ERROR: Command errored out with exit status 1:
   command: /Users/ipsadm/opt/anaconda3/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x0/50gt0r5j31xgwlz771y2q_340000gp/T/pip-install-fjvv8j34/scikit-learn_3debc9675a89421188c35cad03f9b911/setup.py'"'"'; __file__='"'"'/private/var/folders/x0/50gt0r5j31xgwlz771y2q_340000gp/T/pip-install-fjvv8j34/scikit-learn_3debc9675a89421188c35cad03f9b911/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/x0/50gt0r5j31xgwlz771y2q_340000gp/T/pip-wheel-_qk9k0ie
       cwd: /private/var/folders/x0/50gt0r5j31xgwlz771y2q_340000gp/T/pip-install-fjvv8j34/scikit-learn_3debc9675a89421188c35cad03f9b911/
  Complete output (1191 lines):
  Partial imp

    ERROR: Command errored out with exit status 1:
     command: /Users/ipsadm/opt/anaconda3/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x0/50gt0r5j31xgwlz771y2q_340000gp/T/pip-install-fjvv8j34/scikit-learn_3debc9675a89421188c35cad03f9b911/setup.py'"'"'; __file__='"'"'/private/var/folders/x0/50gt0r5j31xgwlz771y2q_340000gp/T/pip-install-fjvv8j34/scikit-learn_3debc9675a89421188c35cad03f9b911/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x0/50gt0r5j31xgwlz771y2q_340000gp/T/pip-record-pbyl3yrj/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /Users/R355-W-12-Stud/.local/include/python3.9/scikit-learn
         cwd: /pr

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

encoder = CountVectorizer(binary=True)
encoded_data = encoder.fit_transform(query_data)
encoded_queries = encoder.transform(QUERIES)
list(encoder.vocabulary_)[:3]

ValueError: empty vocabulary; perhaps the documents only contain stop words

Let's look at the representation of the first document

In [None]:
id2term = {idx: term for term, idx in encoder.vocabulary_.items()}
non_zero_values_ids = encoded_data[0].nonzero()[1]

terms = [id2term[idx] for idx in non_zero_values_ids]
terms

The representation looks fine.

## Task 0

Now, for each query from `QUERIES`, let's find the nearest document from `query_data` according to the Jaccard similarity index. There are more efficient ways to do this, but your task is to implement the Jaccard index computation yourself and then apply it to our data.
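As a reminder of the definition, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below illustrates this on plain Python sets of terms; it is not the vector-based function the task asks for. The name `jaccard_sets` and the choice to return 0.0 for two empty sets are assumptions.

```python
def jaccard_sets(terms_a, terms_b):
    union = terms_a | terms_b
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(terms_a & terms_b) / len(union)

a = set("theory of bending".split())
b = set("bending of plates".split())

# Shared terms: {"of", "bending"}; union has 4 distinct terms.
print(jaccard_sets(a, b))  # 2 / 4 = 0.5
```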

In [None]:
import numpy as np

def jaccard_sim(vector_a: np.ndarray, vector_b: np.ndarray) -> float:
  """
    Jaccard similarity index: the ratio of the intersection cardinality to the union cardinality
  """
  # your code here

  return

# Check that the function works correctly
assert jaccard_sim(np.array([1, 0, 1, 0, 1]), np.array([0, 1, 1, 1, 1])) == 0.4

## Task 1
Now, using the code below, find the most similar documents for each query.

In [None]:
# convert the sparse document rows to dense 1-D arrays once, outside the loop
docs = [doc.todense().A1 for doc in encoded_data]

for q_id, query in enumerate(encoded_queries):
  # convert the query to the same datatype
  query = query.todense().A1
  # calculate the Jaccard index against every document
  id2doc2similarity = [(doc_id, doc, jaccard_sim(query, doc)) for doc_id, doc in enumerate(docs)]
  # sort by similarity, highest first
  closest = sorted(id2doc2similarity, key=lambda x: x[2], reverse=True)

  print("Q: %s\nFOUND:" % QUERIES[q_id])
  # output the 3 most similar documents for each query
  for closest_id, _, sim in closest[:3]:
    print("    %d\t%.2f\t%s" % (closest_id, sim, query_data[closest_id]))
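The loop above sorts all documents and then keeps the first three. When only the top few results are needed, `heapq.nlargest` from the standard library is one possible alternative that avoids a full sort. The scores below are made up for illustration.

```python
import heapq

# Made-up (doc_id, similarity) pairs.
scored = [(0, 0.10), (1, 0.50), (2, 0.30), (3, 0.90)]

# Keep the 3 pairs with the highest similarity, best first.
top3 = heapq.nlargest(3, scored, key=lambda pair: pair[1])
print(top3)
```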



We see that some texts that overlap with the query only in insignificant terms still get a high Jaccard index (which serves as our ranking function here).
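This effect is easy to reproduce on made-up data. In the sketch below, a very short text that shares only the word "of" with the query scores higher than a longer text that shares the informative word "bending". The texts and the helper name `jaccard_terms` are hypothetical.

```python
def jaccard_terms(text_a, text_b):
    # Jaccard index over whitespace-separated terms (both texts are non-empty here).
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / len(a | b)

query = "theory of bending"
short_text = "of the"                                              # shares only "of"
relevant_text = "bending stress analysis in thin plates under load"  # shares "bending"

score_short = jaccard_terms(query, short_text)        # 1 shared / 4 distinct terms
score_relevant = jaccard_terms(query, relevant_text)  # 1 shared / 10 distinct terms
print(score_short, score_relevant)
```

Because the index treats every term equally and penalizes longer texts through the union size, the uninformative match wins here.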

## VSM

Now we will do similar calculations, but with tf-idf weights and cosine distance. For practice we implement everything "manually"; in "real life" it is better to use existing efficient implementations, e.g., the cosine distance from the scipy library.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Advice: we highly recommend checking what the tf-idf vectorizer
# can do and experimenting with its parameters

tfidf_encoder = TfidfVectorizer()
tfidf_encoded_data = tfidf_encoder.fit_transform(query_data)
tfidf_encoded_queries = tfidf_encoder.transform(QUERIES)

list(tfidf_encoder.vocabulary_)[:3]
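To see why tf-idf downweights insignificant terms, it helps to compute the idf part by hand. According to the scikit-learn documentation, with default settings (`smooth_idf=True`) the idf of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below applies that formula to a made-up corpus; the helper name `smooth_idf` and the whitespace tokenization are assumptions, and it covers only the idf factor, not the full vectorizer.

```python
import math

def smooth_idf(n_docs, doc_freq):
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Made-up corpus: "of" occurs in every document, the other terms in one each.
toy_docs = ["theory of bending", "effects of heating", "flow of air"]
n_docs = len(toy_docs)

# Count in how many documents each term occurs.
doc_freq = {}
for doc in toy_docs:
    for term in set(doc.split()):
        doc_freq[term] = doc_freq.get(term, 0) + 1

idf = {term: smooth_idf(n_docs, df) for term, df in doc_freq.items()}
print(idf["of"], idf["bending"])
```

A term that occurs in every document gets the minimum weight of 1.0, while a rarer term such as "bending" gets a larger one.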

## Task 2
Implement the cosine distance computation

In [None]:
import numpy as np

def cosine_distance(vector_a: np.ndarray, vector_b: np.ndarray) -> float:
  """
    Cosine distance is 1 minus the ratio of the dot product
    to the product of the L2 norms (hint: numpy provides such norms)
  """
  # your code here

  return

# Check that the function works correctly
assert cosine_distance(np.array([1, 0, 1, 1, 1]), np.array([0, 0, 1, 0, 0])) == 0.5
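One edge case is worth keeping in mind: a query whose terms are all missing from the vocabulary is encoded as an all-zero vector, and the formula then divides by zero. Below is a pure-Python sketch of one way to guard against that. The name `cosine_distance_safe` and the choice to return 1.0 for a zero vector are assumptions, not part of the task.

```python
import math

def cosine_distance_safe(vector_a, vector_b):
    dot = sum(x * y for x, y in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(x * x for x in vector_a))
    norm_b = math.sqrt(sum(y * y for y in vector_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # assumption: an all-zero vector is treated as having no similarity
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance_safe([1, 0, 1, 1, 1], [0, 0, 1, 0, 0]))  # 1 - 1 / (2 * 1) = 0.5
print(cosine_distance_safe([0, 0, 0], [1, 2, 3]))              # zero vector: 1.0
```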


Now let's find the nearest documents to each query according to the cosine distance between the document vector and the query vector

In [None]:
# convert the sparse document rows to dense 1-D arrays once, outside the loop
docs = [doc.todense().A1 for doc in tfidf_encoded_data]

for q_id, query in enumerate(tfidf_encoded_queries):
  # convert the query to the same datatype
  query = query.todense().A1
  # cosine distance to every document
  id2doc2distance = [(doc_id, doc, cosine_distance(query, doc))
                     for doc_id, doc in enumerate(docs)]
  # sort by distance, smallest first
  closest = sorted(id2doc2distance, key=lambda x: x[2], reverse=False)

  print("Q: %s\nFOUND:" % QUERIES[q_id])
  # output the 3 nearest documents for each query
  for closest_id, _, dist in closest[:3]:
    print("    %d\t%.2f\t%s" % (closest_id, dist, query_data[closest_id]))