# CAIM Lab Session 3: Programming with Elastic Search

In this session you will:

- Learn how to tell ElasticSearch to apply different tokenizers and filters to the documents, like removing stopwords or stemming the words.
- Study how these changes affect the terms that ElasticSearch puts in the index, and how this in turn affects searches.
- Continuing previous work, implement tf-idf scheme over a repository of scietific article abstracts, including cosine measure for document similarities

## 1. Preprocessing with ElasticSearch

One of the tasks of the previous session was to remove from the documents vocabulary all those strings that were not proper words. Obviously this is a frequent task and all these kinds of DB have standard processes that help to filter and reduce the terms that are not useful for searching.

Text, before being indexed, can be subjected to a pipeline of different processes that strips it from anything that will not be useful for a specific application. In ES these preprocessing pipelines are called _Analyzers_; ES includes many choices for each preprocessing step.


The [following picture](https://www.elastic.co/es/blog/found-text-analysis-part-1) illustrates the chaining of preprocessing steps:

![](https://api.contentstack.io/v2/assets/575e4c8c3dc542cb38c08267/download?uid=blt51e787daed39eae9?uid=blt51e787daed39eae9)

The first step of the pipeline is usually a process that converts _raw text_ into _tokens_. We can for example tokenize a text using blanks and punctuation signs or use a language specific analyzer that detects words in an specific language or parse HTML/XML...

[This section](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html) of the ElasticSearch manual explains the different text tokenizers available.

Once we have obtained tokens, we can _normalize_ the strings and/or filter out valid tokens that are not useful. For instance, strings can be transformed to lowercase so all occurrences of the same word are mapped to the same token regardless of whether they were capitalized. Also, there are words that are not semantically useful when searching such as adverbs, articles or prepositions, in this case each language will have its own standard list of words; these are usually called "_stopwords_". Another language-specific token normalization is stemming. The stem of a word corresponds to the common part of a word from all variants are formed by inflection or addition of suffixes or prefixes. For instance, the words "unstoppable", "stops" and "stopping" all derive from the stem "stop". The idea is that all variations of a word will be represented by the same token.

[This section](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html) of ElasticSearch manual will give you an idea of the possibilities.


## 2. Modifying `ElasticSearch` index behavior (using Analyzers)

In this section we are going to learn how to set up preprocessing with ElasticSearch. We are going to do it _inline_ so that you have a few examples and get familiar with how to set up ES analyzers. We are going to showcase the different options with the made up English phrase

```
My taylor 4ís was% &printing Printed rich the.
```

which contains symbols and weird things to see what effect the different tokenizers and filtering options have. We are going to work with three of the usual processes:

* Tokenization
* Normalization
* Token filtering (stopwords and stemming)

The next cells allow configuring the default tokenizer for an index and analyze an example text. We are going to play a little bit with the possibilities and see what tokens result from the analysis.


In [4]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Index, analyzer, tokenizer
from elasticsearch.exceptions import NotFoundError
from pprint import pprint
import requests


client = Elasticsearch("http://localhost:9200", request_timeout=1000)

try:
    resp = requests.get('http://localhost:9200/')
    pprint(resp.content)

except Exception:
    print('elasticsearch is not running')

(b'{\n  "name" : "10-192-204-88client.eduroam.upc.edu",\n  "cluster_name" : "'
 b'elasticsearch",\n  "cluster_uuid" : "nhigYoLMQYqHsoEar5DFlw",\n  "version"'
 b' : {\n    "number" : "9.1.4",\n    "build_flavor" : "default",\n    "build_'
 b'type" : "tar",\n    "build_hash" : "0b7fe68d2e369469ff9e9f344ab6df64ab9c5'
 b'293",\n    "build_date" : "2025-09-16T22:05:19.073893347Z",\n    "build_sn'
 b'apshot" : false,\n    "lucene_version" : "10.2.2",\n    "minimum_wire_comp'
 b'atibility_version" : "8.19.0",\n    "minimum_index_compatibility_version"'
 b' : "8.0.0"\n  },\n  "tagline" : "You Know, for Search"\n}\n')


In [5]:
# Index analyzer cofiguration
# Change the configuration and run this cell and the next to see the changes

# work with dummy index called 'foo'
ind = Index("foo", using=client)

# Drop existing index
if ind.exists():
    ind.delete()

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
ind.settings(
    number_of_shards=1,
    analysis={
        "analyzer": {
            "english_stem": {  # Analyzer with stemming for English
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "stop", "porter_stem"]
            },
            "exact_match": {   # Analyzer that preserves terms
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"]
            },
            "whitespace_fold": {  # Analyzer splitting on whitespace and folding accents
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["asciifolding"]
            }
        }
    }
)
ind.create()

# now you can ask the index to analyze any text, feel free to change the text

#res = ind.analyze({'text':u'my taylor 4ís was% &printing printed rich the.'})
for prep_type in ['english_stem', 'exact_match', 'whitespace_fold']:
    print(f'\n*** Using analyzer {prep_type}')
    res = client.indices.analyze(
        index="foo",
        analyzer=prep_type,
        text="My taylor 4ís was% &printing Printed rich the."
    )

    print("Tokens:", [t["token"] for t in res["tokens"]])
    


*** Using analyzer english_stem
Tokens: ['my', 'taylor', '4í', 'print', 'print', 'rich']

*** Using analyzer exact_match
Tokens: ['my taylor 4ís was% &printing printed rich the.']

*** Using analyzer whitespace_fold
Tokens: ['My', 'taylor', '4is', 'was%', '&printing', 'Printed', 'rich', 'the.']


---

**Exercise 1:** solve exercise 1 from problem set 1 using ElasticSearch. You can use the following string.

---

In [7]:
moonstone = """
We found my lady with no light in the room but the reading-lamp.
The shade was screwed down so as to over-shadow her face. Instead of looking up at us in her usual straightforward way, she sat
close at the table, and kept her eyes fixed obstinately on an open
book.
“Officer,” she said, “it is important to the inquiry you are conducting to know beforehand if any person now in this house wishes
to leave it?”
"""

ind = Index("ex1", using=client)

if ind.exists():
    ind.delete()

ind.settings(
    number_of_shards=1,
    analysis={
        "analyzer": {
            "english_stem": {  # Analyzer with stemming for English
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "stop", "porter_stem"]
            },
            "exact_match": {   # Analyzer that preserves terms
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"]
            },
            "whitespace_fold": {  # Analyzer splitting on whitespace and folding accents
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["asciifolding"]
            }
        }
    }
)

ind.create()

# now you can ask the index to analyze any text, feel free to change the text

#res = ind.analyze({'text':u'my taylor 4ís was% &printing printed rich the.'})
for prep_type in ['english_stem', 'exact_match', 'whitespace_fold']:
    print(f'\n*** Using analyzer {prep_type}')
    res = client.indices.analyze(
        index="foo",
        analyzer=prep_type,
        text=moonstone
    )

    print("Tokens:", [t["token"] for t in res["tokens"]])



*** Using analyzer english_stem
Tokens: ['we', 'found', 'my', 'ladi', 'light', 'room', 'read', 'lamp', 'shade', 'screw', 'down', 'so', 'over', 'shadow', 'her', 'face', 'instead', 'look', 'up', 'us', 'her', 'usual', 'straightforward', 'wai', 'she', 'sat', 'close', 'tabl', 'kept', 'her', 'ey', 'fix', 'obstin', 'open', 'book', 'offic', 'she', 'said', 'import', 'inquiri', 'you', 'conduct', 'know', 'beforehand', 'ani', 'person', 'now', 'hous', 'wish', 'leav']

*** Using analyzer exact_match
Tokens: ['\nwe found my lady with no light in the room but the reading-lamp.\nthe shade was screwed down so as to over-shadow her face. instead of looking up at us in her usual straightforward way, she sat\nclose at the table, and kept her eyes fixed obstinately on an open\nbook.\n“officer,” she said, “it is important to the inquiry you are conducting to know beforehand if any person now in this house wishes\nto leave it?”\n']

*** Using analyzer whitespace_fold
Tokens: ['We', 'found', 'my', 'lady', 'wi

Como podemos observar en los resultados podemos ver que el whitespace_fold es el que nos da el resultado más parecido al que obtuvimos en el set de problemas 1. 

## 3. Indexing script `IndexFilesPreprocess.py`

You should study how the provided indexer script named `IndexFilesPreprocess.py` works.
Its usage is as follows:

```
usage: IndexFilesPreprocess.py [-h] --path PATH --index INDEX
                               [--token {standard,whitespace,classic,letter}]
                               [--filter ...]

optional arguments:
  -h, --help            show this help message and exit
  --path PATH           Path to the files
  --index INDEX         Index for the files
  --token {standard,whitespace,classic,letter}
                        Text tokenizer
  --filter ...          Text filter: lowercase, asciifolding, stop,
                        porter_stem, kstem, snowball
```

So, you can pass a `--path` argument which is the path to a directory where the files that you want to index are located (possibly in subdirectories);
you can specify through `--index` the name of the index to be created; you can also specify the _tokenization_ procedure to be used with the `--token` argument;
and finally you can apply preprocessing filters through the `--filter` argument. As an example call,

```
$ python3 IndexFilesPreprocess.py --index toy --path toy-docs --token letter --filter lowercase asciifolding
```

would create an index called `toy` adding all files located within the subdirectory `toy-docs`, applying the letter tokenizer and applying `lowercase` and `asciifolding` preprocessing.


In particular, you should pay attention to:

- how preprocessing is done within the script
- how the `bulk` operation is used for adding documents to the index (instead of adding files one-by-one)
- the structure of documents added, which contains a `text` field with the content but also a `path` field with the name of the file being added



## 4. Coding exercises

---

**Exercise 2:**  

Download the `arxiv_abs.zip` repository from `https://www.cs.upc.edu/~marias/arxiv_abs.zip`; unzip it. You should see a directory containing folders that contain
text files. These correspond to abstracts of scientific papers in several topics from the [arXiv.org](https://arxiv.org) repository. Index these abstracts using the `IndexFilesPreprocess.py` script (be patient, it takes a while). Double check that your index contains around 58K documents. Pay special attention to how file names are stored in the `path` field of the indexed elasticsearch documents.

---

**Exercise 3:**

Write a function that computes the _cosine similarity_ between pairs of documents in your index. For that, you will find useful the computations from last week that computed the _tf-idf_ vectors of documents in the toy-document dataset. It is important to use _sparse representation_ for these vectors, either through the use of a python dictionary (with `term: weight` entries), or alternatively you could use a list of pairs `(term, weight)`; if you choose the latter, then it is going to be useful to sort the lists by term so that you can find common terms in order to compute the similarities.


_Hint: the `termvector` function that we saw in the last lab session also returns (corpus) document frequencies that you need in order to compute the idf part of the weights:_ 

- `tv['term_vectors']['text']['terms'][t]['term_freq']`
        this gives you the frequency of term t within the document

- `tv['term_vectors']['text']['terms'][t]['doc_freq']`
        this gives you the document frequency (nr. of docs containing `t`) of term t


In [None]:
import os
import math
import pandas as pd
from collections import defaultdict
from random import sample

file_path = "arxiv"
list_dirs = os.listdir(file_path)
list_files = []
aux = []
for dir_name in list_dirs:
    dir_path = os.path.join(file_path, dir_name)
    if os.path.isdir(dir_path):
        for filename in os.listdir(dir_path):
            full_path = os.path.join(dir_name, filename)
            aux.append(full_path)
    list_files.append(sample(aux, 200))
    aux = []
# número de documentos
print(list_files)
D = len(list_files)
print(D)

index = Index('ex3', using=client) 

for filename in list_files:
    with open(os.path.join(file_path, filename), "r", encoding="utf-8") as f:
        text = f.read()
        client.index(index='ex3', document={'text': text})
        
client.indices.refresh(index='ex3')

# diccionario: d{id} -> {word: tf-idf}
tfidf_table = defaultdict(dict)

sc = scan(client, index='ex3', query={"query" : {"match_all": {}}})
for i, s in enumerate(sc):
    tv = client.termvectors(
        index='ex3', id=s['_id'],
        fields=['text'], term_statistics=True, positions=False
    )
    if 'text' in tv['term_vectors']:
        terms = tv['term_vectors']['text']['terms']
        for word, stats in terms.items():
            tf = stats['term_freq']
            df = stats['doc_freq']
            # usamos la variación smooth idf, que evita la división por 
            # cero y modera los pesos de los terminos raros.
            idf = math.log(D / (1 + df)) + 1
            tfidf = tf * idf
            tfidf_table[f"d{i+1}"][word] = tfidf

# convertir a DataFrame para visualizar
df = pd.DataFrame(tfidf_table).fillna(0)
print(df)

index.delete()


58102


---

**Exercise 4:**

Finally, using your code above, build a matrix that reflects the average cosine similarities between pairs of documents in different paper abstract categories. These categories are reflected in the path names of the files, e.g. in my computer, the path name to abstract `/tmp/arxiv/hep-ph.updates.on.arXiv.org/000787` corresponds to the category of `hep-ph` papers. The categories are `astro-ph, cs, hep-th, physics, cond-mat, hep-ph, math, quant-ph`, which can be extracted from path names.

---

Finally, the following piece of code may be useful to see the content of a few random documents within an index

In [11]:
def print_docs_from_index(index_name, client, max_docs):

    print(f"===================")
    info = client.cat.count(index=index_name, format = "json")[0]
    print(f"Index: {index_name} with {info['count']} documents.")
    print()

    res = client.search(index=index_name, size = max_docs, query= {'match_all' : {}})

    for doc in res['hits']['hits']:
        print (doc['_id'], doc['_source'])

print_docs_from_index('arxiv', Elasticsearch("http://localhost:9200", request_timeout=1000), max_docs=10)

Index: arxiv with 58102 documents.

1 {'path': '/tmp/arxiv/hep-ph.updates.on.arXiv.org/000787', 'text': "A set of eight self-consistent, time-dependent supernova (SN) simulations in three spatial dimensions (3D) for 9 solar-mass and 20 solar-mass progenitors is evaluated for the presence of dipolar asymmetries of the electron lepton-number emission as discovered by Tamborra et al. and termed lepton-number emission self-sustained asymmetry (LESA). The simulations were performed with the Aenus-Alcar neutrino/hydrodynamics code, which treats the energy- and velocity-dependent transport of neutrinos of all flavors by a two-moment scheme with algebraic M1 closure. For each of the progenitors, results with fully multi-dimensional (FMD) neutrino transport and with ray-by-ray-plus (RbR+) approximation are considered for two different grid resolutions. While the 9 solar-mass models develop explosions, the 20 solar-mass progenitor does not explode with the employed version of simplified neutrino

---

**Exercise 5: (may take time..)**

Can you find duplicate documents in the corpus provided?

---