# CAIM Lab Session 2: Intro to ElasticSearch

In this session you will learn:

- a few basics on the `ElasticSearch` database
- how to index a set of documents and how to ask simple queries about these documents
- how to do this from `Python`
- how to build on the above to compute the Boolean and tf-idf matrices for the toy corpus used in class

## 1. ElasticSearch

[ElasticSearch](https://www.elastic.co/) is a _NoSQL/document_ database with the capability of indexing and searching text documents. As a rough analogue, we can use the following table for the equivalence between ElasticSearch and a more classical relational database:

| Relational DB | ElasticSearch |
|---|---|
| Database | Index |
| Row / record | Document |
| Column | Field |

An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data.
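To make the analogy concrete, a document can be pictured as a JSON object whose keys are the fields. The sketch below uses only the standard library; the field names (`title`, `text`, `year`) are made up for illustration and are not part of the lab.

```python
import json

# A hypothetical document: each key is a field, each value is that field's content
doc = {
    "title": "A toy document",
    "text": "ElasticSearch indexes the words in this field",
    "year": 2025,
}

# Documents travel to and from the server as JSON
payload = json.dumps(doc)
fields = sorted(json.loads(payload).keys())
print(fields)
```

An index would then hold many such objects, which do not all need to have the same fields.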

`ElasticSearch` is a pretty big beast with many options. Luckily, it is well documented. A few useful links:

- Here is the [full documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html)
- Intros you may want to have a look at:
    - https://medium.com/expedia-group-tech/getting-started-with-elastic-search-6af62d7df8dd
    - http://joelabrahamsson.com/elasticsearch-101
- You found another one that you liked? Let us know.

## 2. Running ElasticSearch

This database runs as a web service on a machine and can be accessed through a REST
web API.

The ElasticSearch binaries are in `/opt/elasticsearch-8.2.2/`.

Depending on the disk space you have available, you can either run the start script directly,
in which case all the data is stored in your user directory, or change the database configuration
so that it uses the space in `/tmp`.
Security must also be disabled (through the `xpack.security.enabled` configuration option), so make
sure that line is present in the configuration file. To use `/tmp`, first copy the configuration directory:

```bash
cp -r /opt/elasticsearch-8.2.2/config/ /tmp
```

Modify or add the following lines to `/tmp/config/elasticsearch.yml`:

```
path.data : /tmp/elastic_data
path.logs : /tmp/elastic_logs
xpack.security.enabled : false
xpack.security.enrollment.enabled : false
```

Set the environment variable `ES_PATH_CONF` to point to the configuration files. Example (if using `tcsh`):

```bash
setenv ES_PATH_CONF /tmp/config
```

Now you can run ElasticSearch with:

```bash
/opt/elasticsearch-8.2.2/bin/elasticsearch
```

After a few seconds (and a lot of logging) the database will be up and running; you may need to hit return for the prompt to show up. To test whether `ElasticSearch` is working execute the code in the cell below. __The database needs to be running throughout the execution of this script, otherwise you will get a connection error.__

In [2]:
from pprint import pprint
import requests

try:
    resp = requests.get('http://localhost:9200/')
    pprint(resp.content)

except Exception:
    print('elasticsearch is not running')

(b'{\n  "name" : "10-192-204-88client.eduroam.upc.edu",\n  "cluster_name" : "'
 b'elasticsearch",\n  "cluster_uuid" : "nhigYoLMQYqHsoEar5DFlw",\n  "version"'
 b' : {\n    "number" : "9.1.4",\n    "build_flavor" : "default",\n    "build_'
 b'type" : "tar",\n    "build_hash" : "0b7fe68d2e369469ff9e9f344ab6df64ab9c5'
 b'293",\n    "build_date" : "2025-09-16T22:05:19.073893347Z",\n    "build_sn'
 b'apshot" : false,\n    "lucene_version" : "10.2.2",\n    "minimum_wire_comp'
 b'atibility_version" : "8.19.0",\n    "minimum_index_compatibility_version"'
 b' : "8.0.0"\n  },\n  "tagline" : "You Know, for Search"\n}\n')


If `ElasticSearch` is working you will see an answer from the server; otherwise you will see a message indicating that it is not running. You can also try opening the URL http://localhost:9200 in your browser; you should get a similar answer.
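The answer is a JSON document, so individual values such as the server version can be read from it directly. The following is a minimal sketch; the helper name `server_version` is ours, and the payload in the example is hand-written rather than taken from a real server.

```python
import json

def server_version(payload: bytes) -> str:
    """Return the version number from the JSON body served at the root URL."""
    info = json.loads(payload)
    return info["version"]["number"]

# Hand-written stand-in for the server answer, keeping only the key we read
sample = b'{"name": "my-node", "version": {"number": "8.2.2"}}'
print(server_version(sample))

# Against a running server (not executed here):
#   import requests
#   print(server_version(requests.get('http://localhost:9200/').content))
```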

## 3. Indexing and querying

`ElasticSearch` is a database for storing documents; unlike tables in a relational database, an index does not need a predefined schema. The text in these documents can be processed so that queries go beyond exact matches, allowing complex queries, fuzzy matching, and ranking of documents according to how well they match.

The same kind of text indexing is behind search engines like Google Search or Bing.

There are different ways of operating with ElasticSearch. It is deployed essentially as a web service with a REST API, so it can be accessed from basically any language that has a library for talking to HTTP servers.
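As an illustration of the REST interface, the sketch below builds a raw HTTP search request with the standard library only. The `_search` endpoint and the `match` query are part of the ElasticSearch REST API; the helper name `build_search_request` and the index and field names are placeholders chosen for this example.

```python
import json
import urllib.request

def build_search_request(index: str, field: str, keyword: str,
                         base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build (but do not send) a POST request that searches `field` for `keyword`."""
    body = {"query": {"match": {field: keyword}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_search_request("test", "text", "non")
print(req.get_method(), req.full_url)

# To actually send it (requires a running server and an existing index):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["hits"]["total"])
```

The Python libraries used in the rest of the session wrap this kind of request, so you rarely need to write it by hand.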

We are going to use two Python libraries for programming on top of ElasticSearch: `elasticsearch` and `elasticsearch-dsl`. Both give access to ElasticSearch functionality while hiding the HTTP interactions behind a more programming-friendly interface; the second one is more convenient for configuring and searching. Make sure both libraries are installed before proceeding with this session.

In [None]:
!pip3 install elasticsearch --user
!pip3 install elasticsearch-dsl --user

We are only going to see the essential elements for developing the session but feel free to learn more.

To interact with ElasticSearch we need a client object of type `Elasticsearch`.

In [1]:
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200", request_timeout=1000)

With this client you have a connection for operating with ElasticSearch. Now we will create an index. There are index operations in each library, but the one in `elasticsearch-dsl` is simpler to use.

In [None]:
from elasticsearch_dsl import Index

index = Index('test', using=client)  # handle to the index 'test'; the index itself is created on the server when the first document is indexed (or by calling index.create())
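If you prefer to create the index explicitly instead of relying on it appearing when the first document arrives, a small helper along these lines can be used. It is only a sketch: `ensure_index` is a name we made up, and it relies on the `exists()` and `create()` methods of the `elasticsearch-dsl` `Index` class.

```python
def ensure_index(index) -> bool:
    """Create the index on the server if it is missing.

    Returns True if the index was created, False if it already existed.
    """
    if index.exists():
        return False
    index.create()
    return True

# Usage with the `index` object defined above (requires a running server):
#   created = ensure_index(index)
#   print('created' if created else 'already there')
```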

First we need some text to index. For testing purposes we are going to use the Python library `lorem`. Run the next cell if the library is not installed yet.

In [None]:
!pip3 install lorem --user  # Restart the kernel if you are not able to import the library in the next cell

Now we create some random paragraphs:

In [None]:
import lorem

texts = [lorem.paragraph() for _ in range(10)]
print(len(texts))
print(texts[0])

Now we can index the paragraphs in ElasticSearch using the `index` method. The document is passed as a Python dictionary through the `document` parameter. The keys of the dictionary become the fields of the document; in this case we will have only one, `text` (the name is our choice, any other would work).

In [20]:
for t in texts:
    client.index(index='test', document={'text': t})
    print(f'Indexing new text: {t[:70]} ...')
client.indices.refresh(index='test')

Indexing new text: Est aliquam amet quiquia quaerat consectetur porro. Quiquia quisquam n ...
Indexing new text: Consectetur eius porro quisquam sit. Numquam modi neque etincidunt por ...
Indexing new text: Aliquam neque tempora dolore consectetur adipisci. Modi tempora non ne ...
Indexing new text: Amet ut modi dolorem amet etincidunt voluptatem. Sed etincidunt est qu ...
Indexing new text: Modi porro eius dolorem aliquam sit porro quaerat. Sed ut labore modi  ...
Indexing new text: Numquam non velit adipisci quisquam aliquam ipsum. Magnam dolore ipsum ...
Indexing new text: Modi sed ut porro eius magnam. Neque ipsum consectetur etincidunt amet ...
Indexing new text: Dolor quisquam est velit consectetur consectetur sit. Aliquam dolor la ...
Indexing new text: Est velit dolore sed adipisci magnam eius. Amet non amet etincidunt ve ...
Indexing new text: Adipisci consectetur est eius sit aliquam. Dolore neque dolore quisqua ...


ObjectApiResponse({'_shards': {'total': 2, 'successful': 1, 'failed': 0}})
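Indexing one document per request is fine for ten paragraphs. For larger collections, the `elasticsearch` library also offers a `bulk` helper that sends many documents per request. The sketch below only builds the list of actions that `bulk` expects; `make_actions` is a hypothetical helper name, and the call to the server is left as a comment.

```python
def make_actions(texts, index_name):
    """Yield one bulk action per text, in the dict format used by elasticsearch.helpers.bulk."""
    for t in texts:
        yield {"_index": index_name, "_source": {"text": t}}

# Two made-up texts, just to show the shape of an action
actions = list(make_actions(["first paragraph", "second paragraph"], "test"))
print(actions[0])

# With a running server:
#   from elasticsearch.helpers import bulk
#   bulk(client, make_actions(texts, 'test'))
#   client.indices.refresh(index='test')
```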

In case we want to get all docs in the index, we can do the following:

In [None]:
# get all docs in index 'test'
resp = client.search(index="test", query={"match_all": {}})

# print them
print(f"Got {resp['hits']['total']['value']} hits:")
for hit in resp['hits']['hits']:
    pprint(hit["_source"])
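Note that a search returns only the first 10 hits by default; the `size` parameter of `client.search` raises that limit. To show how the response is laid out, here is a minimal sketch that pulls one field out of every hit. The helper name `hit_texts` is ours, and `fake_resp` is a hand-written stand-in that keeps only the keys used in the cell above.

```python
def hit_texts(resp, field="text"):
    """Return the content of `field` for every hit in a search response."""
    return [hit["_source"][field]
            for hit in resp["hits"]["hits"]
            if field in hit["_source"]]

# Hand-written stand-in for a real response
fake_resp = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "a", "_source": {"text": "lorem ipsum"}},
            {"_id": "b", "_source": {"text": "dolor sit"}},
        ],
    }
}
print(hit_texts(fake_resp))

# With a running server:
#   resp = client.search(index="test", query={"match_all": {}}, size=100)
#   print(hit_texts(resp))
```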

We can also search for documents that contain a given keyword:

In [None]:
from elasticsearch_dsl import Search

# the following search query specifies the field where we want to search
s_obj = Search(using=client, index='test')
sq = s_obj.query('match', text='non')
resp = sq.execute()

print(f'Found {len(resp)} matches.')

for hit in resp:
    print(f'\nID: {hit.meta.id}\nText: {hit.text}')

## 4. Counting words and docs

`ElasticSearch` can give us the word counts of each document through its `termvectors` functionality. For example, the following code obtains the word counts of a whole index by adding up the counts obtained from each document. `termvectors` also returns _document counts_, which are needed for computing tf-idf weights, when the `term_statistics` option is set to `True`.

In [None]:
from elasticsearch.helpers import scan
from collections import Counter

# Search for all the documents and query the list of (word, frequency) of each one
# Totals are accumulated using a Counter for term frequencies
word_counts = Counter()
sc = scan(client, index='test', query={"query" : {"match_all": {}}})
for s in sc:
    tv = client.termvectors(index='test', id=s['_id'], fields=['text'], term_statistics=True, positions=False)
    if 'text' in tv['term_vectors']:   # just in case some document has no field named 'text'
        for t in tv['term_vectors']['text']['terms']:
            word = t
            count = tv['term_vectors']['text']['terms'][t]['term_freq']
            word_counts.update({word: count})


In [None]:
# show word frequencies
word_counts.most_common()
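Each entry returned by `termvectors` carries `term_freq` (occurrences in this document) and, when `term_statistics=True`, also `doc_freq` (number of documents containing the term). A minimal sketch for reading both follows; `term_and_doc_freqs` is a made-up helper name and `fake_tv` is a hand-written stand-in for a real response with invented numbers.

```python
def term_and_doc_freqs(tv, field="text"):
    """Split a termvectors response into {term: term_freq} and {term: doc_freq}."""
    terms = tv.get("term_vectors", {}).get(field, {}).get("terms", {})
    tf = {t: info["term_freq"] for t, info in terms.items()}
    df = {t: info["doc_freq"] for t, info in terms.items()}
    return tf, df

# Hand-written stand-in with invented counts
fake_tv = {"term_vectors": {"text": {"terms": {
    "porro": {"term_freq": 2, "doc_freq": 5, "ttf": 9},
    "amet": {"term_freq": 1, "doc_freq": 3, "ttf": 4},
}}}}
tf, df = term_and_doc_freqs(fake_tv)
print(tf, df)

# With a running server:
#   tv = client.termvectors(index='test', id=some_id, fields=['text'], term_statistics=True)
#   tf, df = term_and_doc_freqs(tv)
```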

## 5. Proposed simple exercise

To get more familiar with ElasticSearch, we propose that you _generate the Boolean and tf-idf matrices_ for the toy example that we used in class. The 7 toy documents are provided as text files with the materials for this session in the racó. The steps to follow are:

- create an empty index
- open each text document in the `toy-docs` folder provided, read its contents and add it to the index as a new document; your index should contain 7 documents after this
- use the `termvectors` function to obtain term counts, then generate the Boolean and tf-idf matrices based on these counts
- double check that your results coincide with the numbers in the theory slides
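Once the term counts are available, the last two steps can be sketched in plain Python as below. The weighting used here is an assumption, not necessarily the one in the slides: tf is the count divided by the largest count in the document, and idf is log2(D / df). Adjust it to the definition given in class. The function names and the example counts are made up.

```python
import math

def boolean_matrix(doc_counts, vocab):
    """1 where the term occurs in the document, 0 otherwise."""
    return [[1 if t in counts else 0 for t in vocab] for counts in doc_counts]

def tf_idf_matrix(doc_counts, vocab):
    """tf-idf with tf = f / max_f and idf = log2(D / df) (assumed definitions)."""
    n_docs = len(doc_counts)
    df = {t: sum(1 for counts in doc_counts if t in counts) for t in vocab}
    matrix = []
    for counts in doc_counts:
        max_f = max(counts.values()) if counts else 1
        matrix.append([(counts.get(t, 0) / max_f) * math.log2(n_docs / df[t])
                       for t in vocab])
    return matrix

# Invented counts for two tiny documents: one {term: count} dict per document
doc_counts = [{"a": 2, "b": 1}, {"a": 1}]
vocab = sorted({t for counts in doc_counts for t in counts})
print(boolean_matrix(doc_counts, vocab))
print(tf_idf_matrix(doc_counts, vocab))
```

A term that appears in every document gets idf 0 under this definition, which is why column `a` is all zeros in the tf-idf output.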

For this toy corpus, you may build dense Boolean and tf-idf matrices (e.g., using NumPy arrays). Note, however, that for real datasets it would be necessary to store them as sparse matrices (e.g., in compressed sparse row format), which would use far less memory.
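To see why the sparse layout saves memory, here is a minimal pure-Python sketch of the compressed sparse row idea: only the non-zero values are stored, together with their column indices and the position where each row starts. The function name `dense_to_csr` and the example matrix are ours; in practice you would likely use a library such as `scipy.sparse`, whose `csr_matrix` accepts the same three arrays.

```python
def dense_to_csr(matrix):
    """Convert a dense matrix (list of rows) into the three CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # its column
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

# Invented 3x3 matrix with three non-zero entries
dense = [[0, 0.5, 0],
         [0, 0, 0],
         [1.2, 0, 0.3]]
data, indices, indptr = dense_to_csr(dense)
print(data, indices, indptr)
```

Row `i` of the original matrix is recovered from `data[indptr[i]:indptr[i + 1]]` and the matching slice of `indices`.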

In [57]:
from elasticsearch_dsl import Index
import os
import numpy as np

# Start from an empty index: drop any leftover 'ex5' from a previous run,
# otherwise re-running this cell indexes the documents again and the row
# index goes past the 7 rows of the matrix (the cause of the IndexError)
index = Index('ex5', using=client)
if index.exists():
    index.delete()

# Read the documents from the toy-docs folder (sorted for a stable order)
folder_path = 'toy-docs'
file_names = sorted(os.listdir(folder_path))
documents = []
for doc in file_names:
    with open(os.path.join(folder_path, doc), 'r', encoding='utf-8') as f:
        documents.append(f.read())

# Add the documents to the index; the id is the position in the sorted file
# list, so that matrix rows follow the file order
for doc_idx, t in enumerate(documents):
    client.index(index='ex5', id=str(doc_idx), document={'text': t})
client.indices.refresh(index='ex5')

# Fetch the terms of every document once
doc_terms = []
for doc_idx in range(len(documents)):
    tv = client.termvectors(index='ex5', id=str(doc_idx), fields=['text'], term_statistics=True, positions=False)
    doc_terms.append(tv.get('term_vectors', {}).get('text', {}).get('terms', {}))

# Build the vocabulary from all documents and map each word to a column
vocab = sorted({t for terms in doc_terms for t in terms})
word_index = {t: i for i, t in enumerate(vocab)}

# Fill the Boolean matrix: one row per document, one column per word
# (sized from the data instead of hard-coding 7 x 6)
booleanMatrix = np.zeros((len(documents), len(vocab)))
for doc_idx, terms in enumerate(doc_terms):
    for t in terms:
        booleanMatrix[doc_idx, word_index[t]] = 1
booleanMatrix

## 6. Cleanup

Finally, we remove the test index.

In [39]:
index.delete()

ObjectApiResponse({'acknowledged': True})