# CAIM Lab Session 4: Implementing search in the vector space model

In this session you will:

- Continue to work with the `arxiv` repository from last session
- Learn how to do atomic and compound search queries with ElasticSearch
- Build an inverted index for the `arxiv` repository from last session (should fit in main memory)
- Implement search in the vector space model and compare it with ElasticSearch built-in search mechanism
- Compare different implementations of search

## 1. Built-in search in ElasticSearch

ElasticSearch provides a search mechanism to make queries against a database. 
In the next code snippet you can find examples on how to do this with an atomic query (single term)
and a complex one with a so-called 'should' query (a type of OR) which admits weights in each term within the query.

In [18]:
from elasticsearch import Elasticsearch
from pprint import pprint

client = Elasticsearch("http://localhost:9200", request_timeout=1000)
client

<Elasticsearch(['http://localhost:9200'])>

#### Atomic query

In [19]:
# define query
atomic_query = {"match": {"text": "magic"}}

# search
response = client.search(index="ex3", query=atomic_query, track_total_hits=True)

# show results
# Print the results
print(f"Found {response['hits']['total']['value']} documents.")
for hit in response["hits"]["hits"][:5]:
    print(
        f"id: {hit['_id']}, score: {hit['_score']:.2f}, path: {hit['_source']['path']}, text: {hit['_source'].get('text')}"
    )

Found 5 documents.
id: tHcrv5kBzsmBekYx1dP6, score: 10.65, path: arxiv/hep-th.updates.on.arXiv.org/000265, text: We introduce the extended Freudenthal-Rosenfeld-Tits magic square based on six algebras: the
reals $\mathbb{R}$, complexes $\mathbb{C}$, ternions $\mathbb{T}$, quaternions $\mathbb{H}$,
sextonions $\mathbb{S}$ and octonions $\mathbb{O}$. The ternionic and sextonionic rows/columns
of the magic square yield non-reductive Lie algebras, including $\mathfrak{e}_{7\scriptscriptstyle{\frac{1}{2}}}$.
It is demonstrated that the algebras of the extended magic square appear quite naturally as the symmetries
of supergravity Lagrangians. The sextonionic row (for appropriate choices of real forms) gives
the non-compact global symmetries of the Lagrangian for the $D=3$ maximal $\mathcal{N}=16$, magic
$\mathcal{N}=4$ and magic non-supersymmetric theories, obtained by dimensionally reducing
the $D=4$ parent theories on a circle, with the graviphoton left undualised. In particular, the
extre

#### Complex query with weights

In [11]:
# define your query with set of weighted terms
weighted_terms = {
    "search": 0.5,
    "magic": 2.0,
}

# 3. build the 'should' clauses dynamically (behaves like an OR there are other options, too)
clauses = [
    {"match": {"text": {"query": term, "boost": weight}}}  # field to search over
    for term, weight in weighted_terms.items()
]

for clause_type in ['must', 'should']:
    print()
    print(f"query type with {clause_type} clauses")

    # construct the final bool query from set of weighted terms
    es_query = {"bool": {clause_type: clauses}}

    # execute the search
    response = client.search(index="ex3", query=es_query, track_total_hits=True)

    # Print the results
    print(f"Found {response['hits']['total']['value']} documents.")
    for hit in response["hits"]["hits"][:10]:
        print(f"id: {hit['_id']}, score: {hit['_score']:.2f}, path: {hit['_source']['path']}, text: {hit['_source'].get('text')}")


query type with must clauses
Found 2 documents.
id: eXcyv5kBzsmBekYxodpY, score: 12.70, path: arxiv/hep-ph.updates.on.arXiv.org/000052, text: The interference effects between an extra neutral spin-1 Z'-boson and the Standard Model background
in the Drell-Yan channel at the LHC are studied in detail. The final state with two oppositely charged
leptons is considered. The interference contribution to the new physics signal, currently neglected
by experimental collaborations in Z'-searches and in the interpretation of the results, can be
substantial. It may affect limits or discovery prospects of Z' at the LHC. As the Z'-boson interference
is model-dependent, a proper treatment would a priori require a dedicated experimental analysis
for each particular model. Doing so could potentially improve the sensitivity to new physics, but
would require a much bigger effort from the experimental side. At the same time, it is shown that one
can define an invariant mass window, valid for a wide range

## 2. Excruciatingly slow search

In class we have presented a _slow_ version of search that, given a search query $q$, loops over every document in the database
computing the cosine similarity between document and query. Once this is done, it sorts documents by their similarity w.r.t. $q$ and returns the top $r$
scoring ones. 

```
1. for each d in D:
    sim(d,q) = 0
    get vector representing d
    for each w in q:
        sim(d,q) += tf(d,w) * idf(w)
    normalize sim(d,q) by |d|*|q|
2. sort results by similarity
3. return top r docs
```

A possible implementation can be found below. 

__Remark:__ _It is important to note that there are certain elements in the implementation below that refer to my own
implementation, and that you should adapt to your own; in particular, the line_

```    weights = dict(normalize(tf_idf(s['_id'])))   # gets weights as a python dict of term -> weight ```

_obtains tf-idf weights through calling a function `tf_idf` that I have implemented that, given a docid, returns a list of pairs (term, weight); and `normalize` takes such a list a normalizes weights so that the corresponding vector has length 1. 
Obviously, you should adapt the code to your own implementations from previous sessions._


In [12]:
import os
import math
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Index, analyzer, tokenizer
from elasticsearch.exceptions import NotFoundError
from pprint import pprint
import requests
import pandas as pd
from collections import defaultdict
from random import sample
from elasticsearch.helpers import scan
from elasticsearch_dsl import Index, analyzer, tokenizer

client = Elasticsearch("http://localhost:9200", request_timeout=1000)


file_path = "arxiv"
list_dirs = os.listdir(file_path)
list_files = []
aux = []
for dir_name in list_dirs:
    dir_path = os.path.join(file_path, dir_name)
    if os.path.isdir(dir_path):
        for filename in os.listdir(dir_path):
            full_path = os.path.join(dir_name, filename)
            aux.append(full_path)
    list_files.append(sample(aux, 200))
    aux = []
    
# número de documentos
files = [f for sub in list_files for f in sub]
print(files)
D = len(files)
print(D)

index = Index('ex3', using=client)
for filename in files:
    path = os.path.join(file_path, filename)
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
        client.index(index='ex3', document={'text': text, 'path': path})
        
client.indices.refresh(index='ex3')

['hep-ph.updates.on.arXiv.org/001072', 'hep-ph.updates.on.arXiv.org/000701', 'hep-ph.updates.on.arXiv.org/001133', 'hep-ph.updates.on.arXiv.org/001651', 'hep-ph.updates.on.arXiv.org/001381', 'hep-ph.updates.on.arXiv.org/000439', 'hep-ph.updates.on.arXiv.org/000493', 'hep-ph.updates.on.arXiv.org/000396', 'hep-ph.updates.on.arXiv.org/000276', 'hep-ph.updates.on.arXiv.org/000906', 'hep-ph.updates.on.arXiv.org/000995', 'hep-ph.updates.on.arXiv.org/001627', 'hep-ph.updates.on.arXiv.org/001429', 'hep-ph.updates.on.arXiv.org/000444', 'hep-ph.updates.on.arXiv.org/001148', 'hep-ph.updates.on.arXiv.org/001277', 'hep-ph.updates.on.arXiv.org/000157', 'hep-ph.updates.on.arXiv.org/000412', 'hep-ph.updates.on.arXiv.org/000772', 'hep-ph.updates.on.arXiv.org/000379', 'hep-ph.updates.on.arXiv.org/000258', 'hep-ph.updates.on.arXiv.org/001348', 'hep-ph.updates.on.arXiv.org/000663', 'hep-ph.updates.on.arXiv.org/000420', 'hep-ph.updates.on.arXiv.org/000830', 'hep-ph.updates.on.arXiv.org/000968', 'hep-ph.upd

ObjectApiResponse({'_shards': {'total': 2, 'successful': 1, 'failed': 0}})

In [7]:
# diccionario: d{id} -> {word: tf-idf}
tfidf_table = defaultdict(dict)

sc = scan(client, index='ex3', query={"query" : {"match_all": {}}})
for i, s in enumerate(sc):
    tv = client.termvectors(
        index='ex3', id=s['_id'],
        fields=['text'], term_statistics=True, positions=False
    )
    if 'text' in tv['term_vectors']:
        terms = tv['term_vectors']['text']['terms']
        for word, stats in terms.items():
            tf = stats['term_freq']
            df = stats['doc_freq']
            # usamos la variación smooth idf, que evita la división por 
            # cero y modera los pesos de los terminos raros.
            idf = math.log(D / (1 + df)) + 1
            tfidf = tf * idf
            tfidf_table[f"d{i+1}"][word] = tfidf

#index.delete()

In [8]:
# get tf-idf vector from doc (internal) id
def tf_idf(doc_id):
    """
    Devuelve lista de pares (term, weight) con TF-IDF (smooth idf used).
    usa termvectors si está disponible.
    """
    try:
        tv = client.termvectors(index=index_name, id=doc_id, fields=[field], term_statistics=True, positions=False)
    except Exception as e:
        # no se pudo recuperar termvectors
        return []

    if "term_vectors" not in tv or field not in tv["term_vectors"]:
        return []

    terms = tv["term_vectors"][field]["terms"]
    # número total de documentos en el índice
    try:
        D = int(client.cat.count(index=index_name, format="json")[0]["count"])
    except Exception:
        D = 1

    res = []
    for term, stats in terms.items():
        tf = stats.get("term_freq", 0)
        df = stats.get("doc_freq", 0)
        # smooth idf (evita división por cero)
        idf = math.log(D / (1 + df)) + 1
        res.append((term, tf * idf))
    return res


# normalizes weights so that resulting vec has length 1
def normalize(l1):
    """
    Normaliza una lista de pares (term, weight) o un dict term->weight.
    Devuelve lista de pares (term, weight_normalizado).
    """
    if isinstance(l1, dict):
        items = list(l1.items())
    else:
        items = list(l1)

    # calcular norma l2
    norm = math.sqrt(sum((w ** 2 for _, w in items)))
    if norm == 0:
        return [(t, 0.0) for t, _ in items]
    return [(t, w / norm) for t, w in items]


In [9]:
from elasticsearch.helpers import scan
from pprint import pprint
from elasticsearch import Elasticsearch
import tqdm
import numpy as np
import math


def preprocess_query_string(query_string, client, index_name, field_name):
    """
    given query string it outputs the list of preprocessed tokens from it
    using same analyzer (preprocessing pipeline) than the arxiv abstracts
    """

    # Use the analyze API on the specified index
    response = client.indices.analyze(
        index=index_name, field=field_name, text=query_string
    )

    # Extract just the token strings from the response
    preprocessed_terms = [token_info["token"] for token_info in response["tokens"]]

    # print(f"Original string: '{query_string}'")
    # print(f"Preprocessed terms: {preprocessed_terms}")
    return preprocessed_terms


client = Elasticsearch("http://localhost:9200", request_timeout=1000)

r = 10  # only return r top docs
# query will be list of tokens, preprocessed like the indexed arxiv articles
query_str = "searching magic"
query_tokens = preprocess_query_string(
    query_string=query_str, client=client, index_name="ex3", field_name="text"
)

print(f"Executing search of query string '{query_str}' with tokens {query_tokens} over documents on index 'arxiv'")
sims = dict()

l2query = np.sqrt(len(query_tokens))  # l2 of query assuming 0-1 vector representation

# get nr. of docs; just for the progress bar
ndocs = int(client.cat.count(index="ex3", format="json")[0]["count"])

# scan through docs, compute cosine sim between query and each doc
for s in tqdm.tqdm(
    scan(client, index="ex3", query={"query": {"match_all": {}}}), total=ndocs
):

    docid = s["_source"]["path"]  # use path as id
    weights = dict(
        normalize(tf_idf(s["_id"]))
    )  # gets weights as a python dict of term -> weight (see remark above)
    sims[docid] = 0.0
    for w in query_tokens:  # gets terms as a list
        if (
            w in weights
        ):  
            sims[docid] += weights[w]  # accumulates if w in current doc
    # normalize sim
    sims[docid] /= l2query

# now sort by cosine similarity
sorted_answer = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
pprint(sorted_answer[:r])

Executing search of query string 'searching magic' with tokens ['searching', 'magic'] over documents on index 'arxiv'


  0%|          | 0/3600 [00:00<?, ?it/s]

100%|██████████| 3600/3600 [00:00<00:00, 14037.61it/s]

[('arxiv/cond-mat.updates.on.arXiv.org/005200', np.float64(0.0)),
 ('arxiv/cond-mat.updates.on.arXiv.org/003651', np.float64(0.0)),
 ('arxiv/cond-mat.updates.on.arXiv.org/002974', np.float64(0.0)),
 ('arxiv/cond-mat.updates.on.arXiv.org/002127', np.float64(0.0)),
 ('arxiv/cond-mat.updates.on.arXiv.org/003368', np.float64(0.0)),
 ('arxiv/cond-mat.updates.on.arXiv.org/004613', np.float64(0.0)),
 ('arxiv/cond-mat.updates.on.arXiv.org/001353', np.float64(0.0)),
 ('arxiv/cond-mat.updates.on.arXiv.org/002833', np.float64(0.0)),
 ('arxiv/cond-mat.updates.on.arXiv.org/001027', np.float64(0.0)),
 ('arxiv/cond-mat.updates.on.arXiv.org/003008', np.float64(0.0))]





In [10]:
nz = len([x for x, s in sorted_answer if s > 0])
total = len(sorted_answer)
print(
    f"There are {nz} docs with non-zero similarity out of {total}, i.e. {100.0*nz/total:.1f}%"
)

There are 0 docs with non-zero similarity out of 3475, i.e. 0.0%


## 3. Your tasks

---

**Exercise 1:**  

Make sure you understand the algorithm for implementing search described in the lecture notes. Both slow and efficient versions. Describe
the number of operations you need to do in both slow and quick versions for the following toy example with a vocabulary of size 4 and four documents:

- $q = 0,1,1,0$

- document-term matrix:
<center>


|        | t1  | t2  | t3  | t4  |
|--------|-----|-----|-----|-----|
| **d1** | 1.2 | 0.0 | 0.0 | 0.0 |
| **d2** | 0.7 | 0.3 | 1.5 | 0.1 |
| **d3** | 0.0 | 0.0 | 0.0 | 0.7 |
| **d4** | 2.0 | 0.0 | 0.0 | 0.0 |

</center>

---

**Exercise 2:**

Implement the quick version; run both slow and quick versions and report times (as a reference, in my old laptop it takes around 5m20s to run the slow version in the code above). Make sure both versions return the same answer. Note that you will need to build an inverted index in order to implement the efficient version as explained in class; it may take time but this is done once for all queries, and can be done "off-line". Also, you could improve on the code by implementing the top-$r$ sort of the final answer using
the minheap tree as discussed in class. Python has a minheap built-in implementation called `heapq`.

---

**Exercise 3:**

Compare the results for a few sample queries that you get from your quick version and ElasticSearch search. Do you get similar results? Which is faster?

---

## 4. Rules of delivery

- To be solved in _pairs_.

- No plagiarism; don't discuss your work with other teams. You can ask for help to others for simple things, such as recalling a python instruction or module, but nothing too specific to the session.

- If you feel you are spending much more time than the rest of the classmates, ask us for help. Questions can be asked either in person or by email, and you'll never be penalized by asking questions, no matter how stupid they look in retrospect.

- Write a short report listing the solutions to the exercises proposed. Include things like the important parts of your implementation (data structures used for representing objects, algorithms used, etc). You are welcome to add conclusions and findings that depart from what we asked you to do. We encourage you to discuss the difficulties you find; this lets us give you help and also improve the lab session for future editions.

- Turn the report to PDF. Make sure it has your names, date, and title. Include your code in your submission.

- Submit your work through the [raco](http://www.fib.upc.edu/en/serveis/raco.html); see date at the raco's submissions page.