# CAI Lab Session 2: Intro to ElasticSearch

In this session you will learn:

- a few basics of the `ElasticSearch` database
- how to index a set of documents and how to ask simple queries about these documents
- how to do this from `Python`
- how to compute, building on the above, the Boolean and tf-idf matrices for the toy corpus used in class

## 1. ElasticSearch

[ElasticSearch](https://www.elastic.co/) is a _NoSQL/document_ database with the capability of indexing and searching text documents. As a rough analogue, we can use the following table for the equivalence between ElasticSearch and a more classical relational database:

| Relational DB | ElasticSearch |
|---|---|
| Database | Index |
| Table | Type |
| Row / record | Document |
| Column | Field |

An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data.
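To make the analogy concrete, a document can be pictured as a JSON object. The sketch below builds one as a Python dictionary; the field names (`title`, `text`, `year`) and their values are made up for illustration and are not used elsewhere in this session.

```python
import json

# A hypothetical document: each key is a field, each value is that field's content.
document = {
    "title": "Lab session 2",
    "text": "ElasticSearch indexes and searches text documents.",
    "year": 2023,
}

# Documents travel to and from the server as JSON, so the dictionary must be serialisable.
as_json = json.dumps(document, indent=2)
print(as_json)

# Field names play the role that column names play in a relational table.
fields = sorted(document.keys())
print(fields)
```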

`ElasticSearch` is a pretty big beast with many options. Luckily, there is plenty of documentation. A few useful links:

- Here is the [full documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html)
- Intros you may want to have a look at:
    - https://medium.com/expedia-group-tech/getting-started-with-elastic-search-6af62d7df8dd
    - http://joelabrahamsson.com/elasticsearch-101
- You found another one that you liked? Let us know.

## 2. Running ElasticSearch

First you will need to install `ElasticSearch` following the instructions in the official [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html).

This database runs as a web service on a machine and can be accessed through a REST web API. However, we will interact with the database through its Python libraries `elasticsearch-py` and `elasticsearch-dsl`, so you will need to install these as well. You can run `ElasticSearch` by typing the following at the command-line prompt:

```
$ <path_to_elasticsearch_bin>/elasticsearch &
```



After a few seconds (and a lot of logging) the database will be up and running; you may need to hit return for the prompt to show up. To test whether `ElasticSearch` is working, execute the code in the cell below.

In [None]:
from pprint import pprint

In [None]:
import requests

try:
    resp = requests.get('http://localhost:9200/')
    pprint(resp.content)

except Exception:
    print('elasticsearch is not running')

(b'{\n  "name" : "10-192-2-70client.eduroam.upc.edu",\n  "cluster_name" : "el'
 b'asticsearch",\n  "cluster_uuid" : "kJEWqbK9Q0CyycBbKYd1RQ",\n  "version" :'
 b' {\n    "number" : "8.10.0",\n    "build_flavor" : "default",\n    "build_t'
 b'ype" : "tar",\n    "build_hash" : "e338da74c79465dfdc204971e600342b0aa87b'
 b'6b",\n    "build_date" : "2023-09-07T08:16:21.960703010Z",\n    "build_sna'
 b'pshot" : false,\n    "lucene_version" : "9.7.0",\n    "minimum_wire_compat'
 b'ibility_version" : "7.17.0",\n    "minimum_index_compatibility_version" :'
 b' "7.0.0"\n  },\n  "tagline" : "You Know, for Search"\n}\n')


If `ElasticSearch` is working you will see an answer from the server; otherwise you will see a message indicating that it is not running. You can also try opening the URL http://localhost:9200 in your browser; you should get a similar answer.
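The server's answer is a JSON document delivered as raw bytes, which is why the printed output is hard to read. As a minimal sketch, the bytes can be decoded with the standard `json` module. The sample below is a shortened copy of the reply shown above; the node name `node` and the helper `summarize` are placeholders introduced for this example.

```python
import json

# Shortened copy of the reply shown above (the server returns raw bytes).
raw = (b'{"name": "node", "cluster_name": "elasticsearch", '
       b'"version": {"number": "8.10.0", "lucene_version": "9.7.0"}, '
       b'"tagline": "You Know, for Search"}')

def summarize(content: bytes) -> str:
    """Decode the JSON body of the root endpoint into a one-line summary."""
    info = json.loads(content)
    version = info["version"]
    return f'{info["cluster_name"]} {version["number"]} (Lucene {version["lucene_version"]})'

summary = summarize(raw)
print(summary)
```

With the `requests` reply from the cell above, `summarize(resp.content)` would work the same way, and `resp.json()` gives the decoded dictionary directly.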

**Version 8 introduced enhanced security, which may cause trouble when executing the code here. To deal with this, you can either install an earlier version (7 or older) or turn off the security settings in the `config/elasticsearch.yml` config file (just set everything concerning the security options to _false_).** Since we are using the database in offline, local mode, this should not be a problem.

Also, you should run this notebook locally on your machine. If you use Google Colab or similar, this is not going to work, because ElasticSearch must be running on the machine where the code is executed.

## 3. Indexing and querying

`ElasticSearch` is a database that allows storing documents without a predefined schema (unlike tables in a relational database). The text in these documents can be processed so that queries extend beyond exact matches, allowing complex queries, fuzzy matching and ranking of documents with respect to how well they match the query.
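To give an intuition of what "processing the text" buys us, here is a toy inverted index in plain Python. It is only an illustration: the real analysis and indexing in ElasticSearch are done by Lucene and are far more elaborate, and the functions `analyze` and `build_inverted_index` are invented for this sketch.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {
    1: "Dolor sit amet.",
    2: "Sit neque ipsum, neque ut.",
    3: "DOLOR magnam!",
}
inverted = build_inverted_index(docs)

# The query goes through the same analyzer as the documents,
# so differences in case and punctuation do not prevent a match.
query_terms = analyze("dolor")
matches = sorted(set.union(*(inverted.get(t, set()) for t in query_terms)))
print(matches)  # [1, 3]
```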

These kinds of databases are behind search engines like Google Search or Bing.

There are different ways of operating with ElasticSearch. It is deployed essentially as a web service with a REST API, so it can be accessed from basically any language that has a library for talking to HTTP servers.

We are going to use two Python libraries for programming on top of ElasticSearch: `elasticsearch` and `elasticsearch-dsl`. Both provide access to ElasticSearch functionality while hiding the HTTP interactions behind a more programming-friendly interface; the second one is more convenient for configuring and searching. Make sure both libraries are installed before proceeding with this session.

In [None]:
!pip3 install elasticsearch --user
!pip3 install elasticsearch-dsl --user

Collecting elasticsearch
  Obtaining dependency information for elasticsearch from https://files.pythonhosted.org/packages/bb/06/81b1d71ba0567ff39d0f98f3637e810846df92f6733aee46004a194b51ea/elasticsearch-8.9.0-py3-none-any.whl.metadata
  Downloading elasticsearch-8.9.0-py3-none-any.whl.metadata (5.2 kB)
Collecting elastic-transport<9,>=8 (from elasticsearch)
  Downloading elastic_transport-8.4.0-py3-none-any.whl (59 kB)
Collecting urllib3<2,>=1.26.2 (from elastic-transport<9,>=8->elasticsearch)
  Obtaining dependency information for urllib3<2,>=1.26.2 from https://files.pythonhosted.org/packages/c5/05/c214b32d21c0b465506f95c4f28ccbcba15022e000b043b72b3df7728471/urllib3-1.26.16-py2.py3-none-any.whl.metadata
  Downloading urllib3-1.26.16-py2.py3-none-any.whl.metadata (48 kB)

We will only cover the elements that are essential for this session, but feel free to learn more.

To interact with ElasticSearch we need a client object of type `Elasticsearch`.

In [None]:
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

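With this client you have a connection for operating with ElasticSearch. Now we will set up an index. Both libraries offer index operations, but the one in `elasticsearch-dsl` is simpler to use. Note that constructing an `Index` object does not contact the server; the index itself is created when `index.create()` is called or, as in this session, automatically when the first document is indexed into it.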

In [None]:
from elasticsearch_dsl import Index

index = Index('test', using=client)

First we need some text to index. For testing purposes we are going to use the Python library `lorem`. If it is not installed already, install it by running the next cell.

In [None]:
!pip3 install lorem --user  # Restart the kernel if you are not able to import the library in the next cell

Collecting lorem
  Downloading lorem-0.1.1-py3-none-any.whl (5.0 kB)
Installing collected packages: lorem
Successfully installed lorem-0.1.1


Now we create some random paragraphs

In [None]:
import lorem

texts = [lorem.paragraph() for _ in range(10)]
print(len(texts))
print(texts[0])

10
Aliquam tempora quisquam non dolore velit. Magnam sed modi neque adipisci consectetur modi consectetur. Neque eius amet modi velit amet ipsum voluptatem. Sit neque ipsum neque ut. Aliquam quiquia quiquia eius etincidunt dolorem consectetur. Adipisci eius voluptatem amet tempora.


Now we can index the paragraphs in ElasticSearch using the client's `index` method. The document is passed as a Python dictionary through the `document` parameter. The keys of the dictionary will be the fields of the document; in this case we will have only one (`text`). We use this name here, but any other would do.

In [None]:
for t in texts:
    client.index(index='test', document={'text': t})
    print(f'Indexing new text: {t[:70]} ...')

Indexing new text: Aliquam tempora quisquam non dolore velit. Magnam sed modi neque adipi ...
Indexing new text: Velit non adipisci sed etincidunt. Ipsum porro eius ipsum non dolorem  ...
Indexing new text: Porro sit neque modi consectetur tempora quaerat dolor. Quisquam numqu ...
Indexing new text: Dolore amet porro quiquia numquam. Quisquam quisquam labore sit quaera ...
Indexing new text: Amet ipsum numquam est sit etincidunt quiquia. Quisquam ut tempora qui ...
Indexing new text: Sit porro non voluptatem non. Modi etincidunt ut modi tempora. Aliquam ...
Indexing new text: Ipsum dolor etincidunt velit amet sit ut non. Quaerat quaerat dolore s ...
Indexing new text: Velit quaerat voluptatem ipsum quaerat ipsum neque. Etincidunt dolore  ...
Indexing new text: Dolorem ut adipisci magnam aliquam sed. Voluptatem quiquia porro dolor ...
Indexing new text: Adipisci labore dolore magnam quisquam ipsum. Velit consectetur magnam ...
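Sending one request per document is fine for ten paragraphs, but for larger collections the `elasticsearch.helpers.bulk` helper is usually preferred. The sketch below only builds the list of actions that helper expects, so it runs without a server; `make_actions` is a name introduced for this example.

```python
def make_actions(texts, index_name="test"):
    """Yield one bulk action per text: the target index plus the document itself."""
    for t in texts:
        yield {"_index": index_name, "_source": {"text": t}}

sample_texts = ["first paragraph", "second paragraph"]
actions = list(make_actions(sample_texts))
print(actions[0])

# With a running server and the `client` created earlier, the actions could be sent with:
#   from elasticsearch.helpers import bulk
#   bulk(client, make_actions(texts))
```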


In case we want to get all docs in the index, we can do the following:

In [None]:
# get all docs in index 'test'
resp = client.search(index="test", query={"match_all": {}})

# print them
print(f"Got {resp['hits']['total']['value']} hits:")
for hit in resp['hits']['hits']:
    pprint(hit["_source"])

Got 10 hits:
{'text': 'Aliquam tempora quisquam non dolore velit. Magnam sed modi neque '
         'adipisci consectetur modi consectetur. Neque eius amet modi velit '
         'amet ipsum voluptatem. Sit neque ipsum neque ut. Aliquam quiquia '
         'quiquia eius etincidunt dolorem consectetur. Adipisci eius '
         'voluptatem amet tempora.'}
{'text': 'Velit non adipisci sed etincidunt. Ipsum porro eius ipsum non '
         'dolorem aliquam amet. Etincidunt magnam consectetur numquam magnam '
         'ut labore etincidunt. Magnam quisquam numquam non. Quisquam ut '
         'labore non est est. Sit modi quaerat ut quisquam.'}
{'text': 'Porro sit neque modi consectetur tempora quaerat dolor. Quisquam '
         'numquam aliquam dolorem amet magnam ut. Dolore quiquia quiquia '
         'etincidunt etincidunt magnam quiquia ut. Voluptatem amet ut sed. '
         'Labore quiquia magnam sit. Ut sed magnam tempora. Non est velit eius '
         'labore. Aliquam non dolorem ipsum mod
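One detail worth keeping in mind when reading a search reply: `hits.total.value` is the number of documents that match, while `hits.hits` contains only the documents actually returned, which by default is at most 10 (the `size` argument of `client.search` changes this). The following sketch walks a hand-written reply with the same layout; the ids, scores and the total of 12 are invented for illustration.

```python
# Hypothetical, heavily shortened search reply with the same layout as `resp` above.
response = {
    "hits": {
        "total": {"value": 12, "relation": "eq"},
        "hits": [
            {"_id": "a1", "_score": 1.0, "_source": {"text": "Aliquam tempora"}},
            {"_id": "b2", "_score": 1.0, "_source": {"text": "Velit non adipisci"}},
        ],
    }
}

def summarize_hits(resp):
    """Return the total number of matches and (id, text) pairs for the returned hits."""
    total = resp["hits"]["total"]["value"]
    returned = [(h["_id"], h["_source"]["text"]) for h in resp["hits"]["hits"]]
    return total, returned

total, returned = summarize_hits(response)
print(f"{total} matches, {len(returned)} returned")
```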

We can also search for documents that contain a given keyword:

In [None]:
from elasticsearch_dsl import Search

# the following search query specifies the field where we want to search
s_obj = Search(using=client, index='test')
sq = s_obj.query('match', text='dolor')
resp = sq.execute()

print(f'Found {len(resp)} matches.')

for hit in resp:
    print(f'\nID: {hit.meta.id}\nText: {hit.text}')

Found 7 matches.

ID: sFkSmIoBzo4VIHJJE4RC
Text: Sit porro non voluptatem non. Modi etincidunt ut modi tempora. Aliquam quisquam aliquam dolor dolore dolor modi numquam. Aliquam quisquam magnam modi dolorem dolor quaerat. Velit sit adipisci etincidunt modi.

ID: tFkSmIoBzo4VIHJJE4SW
Text: Adipisci labore dolore magnam quisquam ipsum. Velit consectetur magnam dolor voluptatem. Quisquam ipsum velit non tempora. Sed ipsum aliquam neque labore. Sed eius numquam ut neque. Magnam amet magnam modi. Sed voluptatem modi est dolor. Quaerat labore adipisci quaerat neque.

ID: sVkSmIoBzo4VIHJJE4RU
Text: Ipsum dolor etincidunt velit amet sit ut non. Quaerat quaerat dolore sed. Magnam magnam consectetur magnam voluptatem numquam. Voluptatem sed velit ut neque. Eius aliquam amet numquam sed amet magnam. Numquam labore sit adipisci aliquam ipsum.

ID: rVkSmIoBzo4VIHJJE4QQ
Text: Porro sit neque modi consectetur tempora quaerat dolor. Quisquam numquam aliquam dolorem amet magnam ut. Dolore quiquia quiqu

## 4. Counting words and docs

`ElasticSearch` helps us obtain the counts of words in each document. For example, the following code obtains the word counts of a whole index by adding up the counts obtained from each document through the `termvectors` functionality. This function also returns _document counts_ (the number of documents containing each term), needed for computing tf-idf weights, when the `term_statistics` option is set to `True`.

In [None]:
from elasticsearch.helpers import scan
from collections import Counter

# Scan all the documents and ask for the term vector of each one.
# Term frequencies are summed over the documents in word_counts.
word_counts = Counter()
sc = scan(client, index='test', query={"query" : {"match_all": {}}})
for s in sc:
    # doc_freq is an index-wide statistic, so it must not be summed across documents.
    # Note: because doc_counts is reset here, after the loop it only holds the terms
    # of the last document scanned.
    doc_counts = Counter()
    tv = client.termvectors(index='test', id=s['_id'], fields=['text'], term_statistics=True, positions=False)
    if 'text' in tv['term_vectors']:   # just in case some document has no field named 'text'
        for word, stats in tv['term_vectors']['text']['terms'].items():
            word_counts.update({word: stats['term_freq']})
            doc_counts.update({word: stats['doc_freq']})

In [None]:
# show word frequencies
word_counts.most_common()

[('magnam', 24),
 ('ipsum', 22),
 ('sed', 22),
 ('quisquam', 20),
 ('ut', 20),
 ('modi', 19),
 ('voluptatem', 19),
 ('non', 18),
 ('aliquam', 17),
 ('amet', 17),
 ('neque', 17),
 ('etincidunt', 16),
 ('sit', 16),
 ('labore', 16),
 ('numquam', 16),
 ('adipisci', 15),
 ('dolore', 15),
 ('quiquia', 15),
 ('dolorem', 14),
 ('velit', 14),
 ('eius', 13),
 ('consectetur', 12),
 ('est', 12),
 ('quaerat', 12),
 ('porro', 11),
 ('tempora', 10),
 ('dolor', 10)]

In [None]:
# show doc freq
doc_counts.most_common()

[('magnam', 10),
 ('non', 10),
 ('ut', 10),
 ('adipisci', 9),
 ('aliquam', 9),
 ('amet', 9),
 ('consectetur', 9),
 ('dolore', 9),
 ('ipsum', 9),
 ('numquam', 9),
 ('quaerat', 9),
 ('sed', 9),
 ('velit', 9),
 ('voluptatem', 9),
 ('eius', 8),
 ('labore', 8),
 ('modi', 8),
 ('neque', 8),
 ('quisquam', 8),
 ('dolor', 7),
 ('est', 7),
 ('tempora', 6)]
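Note that the document-frequency list is shorter than the word-frequency list (22 terms against 27): the loop resets `doc_counts` for every document, so only the terms of the last document scanned survive. A possible variant, sketched below, keeps a single dictionary for the whole index and assigns `doc_freq` instead of adding it. To stay runnable without a server it uses two shortened `termvectors` replies taken from the toy index of the exercise in Section 5.

```python
from collections import Counter

# Two shortened termvectors replies (token offsets removed).
replies = [
    {"term_vectors": {"text": {"terms": {
        "five": {"doc_freq": 2, "ttf": 4, "term_freq": 1},
        "four": {"doc_freq": 3, "ttf": 5, "term_freq": 1}}}}},
    {"term_vectors": {"text": {"terms": {
        "six": {"doc_freq": 3, "ttf": 5, "term_freq": 2},
        "three": {"doc_freq": 6, "ttf": 8, "term_freq": 3}}}}},
]

word_counts = Counter()  # term -> occurrences summed over the documents seen
doc_freqs = {}           # term -> number of documents in the index containing it

for tv in replies:
    # Documents without a 'text' field simply contribute no terms.
    terms = tv["term_vectors"].get("text", {}).get("terms", {})
    for word, stats in terms.items():
        word_counts[word] += stats["term_freq"]
        doc_freqs[word] = stats["doc_freq"]  # index-wide value: assign, do not add

print(word_counts.most_common())
print(doc_freqs)
```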

## 5. Proposed simple exercise

To get more familiar with ElasticSearch, we propose that you _generate the Boolean and tf-idf matrices_ for the toy example that we used in class. You will find the 7 toy text documents among the materials for this session in the Racó. The steps to follow are:

- create an empty index
- open each text document in the `toy-docs` folder provided, read its contents and add it to the index as a new document; your index should contain 7 documents after this
- use the `termvectors` function to obtain term and doc counts, generate Boolean and tf-idf matrices based on these counts
- double check that your results coincide with the numbers in the theory slides
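For reference, the tf-idf weight computed by the solution code in this section is the term frequency, normalised by the largest term frequency in the document, times the base-2 logarithm of the inverse document frequency. The symbol names below are chosen for this note; check them against the notation used in the slides.

$$
w_{d,t} = \frac{f_{d,t}}{\max_{t'} f_{d,t'}} \cdot \log_2 \frac{D}{df_t}
$$

Here $f_{d,t}$ is the number of occurrences of term $t$ in document $d$, $D$ is the number of documents in the index, and $df_t$ is the number of documents containing $t$. For example, in the document "four five" both terms occur once, so the normalised frequency is 1; with $D = 7$ and $df_{\text{five}} = 2$ this gives

$$
w_{d,\text{five}} = \frac{1}{1} \cdot \log_2 \frac{7}{2} \approx 1.807
$$

The corresponding Boolean entry is simply 1 if $f_{d,t} > 0$ and 0 otherwise.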

In [None]:
# Create an empty index
index = Index('exercise1', using=client)

In [None]:
# Open all documents in the toy-docs folder
import os

def read_document(file_path):
    with open(file_path, 'r') as f:
        return f.read()

path = os.path.join(os.getcwd(), 'toy-docs')
for doc in os.listdir(path):
    t = read_document(os.path.join(path, doc))
    client.index(index='exercise1', document={'text': t})
    print(f'Indexing new text ({doc}) : {t}')

Indexing new text (d7.txt) : four five

Indexing new text (d6.txt) : three three three six six

Indexing new text (d4.txt) : one two two two two three six six

Indexing new text (d5.txt) : three four four four six

Indexing new text (d1.txt) : one three

Indexing new text (d2.txt) : two two three

Indexing new text (d3.txt) : one three four five five five



In [None]:
# Obtain word-count pairs
word_counts = Counter()   # term -> total occurrences in the index
doc_counts = Counter()    # term -> number of documents containing the term
sc = scan(client, index='exercise1', query={"query" : {"match_all": {}}})
for s in sc:
    tv = client.termvectors(index='exercise1', id=s['_id'], fields=['text'], term_statistics=True, positions=False)
    print(tv)
    if 'text' in tv['term_vectors']:   # just in case some document has no field named 'text'
        for word, stats in tv['term_vectors']['text']['terms'].items():
            word_counts.update({word: stats['term_freq']})
            # doc_freq is already an index-wide value: assign it instead of adding it up
            doc_counts[word] = stats['doc_freq']

# show word frequencies
print(word_counts.most_common())

# show doc freq
print(doc_counts.most_common())

{'_index': 'exercise1', '_id': '31kpmIoBzo4VIHJJs4Sx', '_version': 1, 'found': True, 'took': 1, 'term_vectors': {'text': {'field_statistics': {'sum_doc_freq': 19, 'doc_count': 7, 'sum_ttf': 31}, 'terms': {'five': {'doc_freq': 2, 'ttf': 4, 'term_freq': 1, 'tokens': [{'start_offset': 5, 'end_offset': 9}]}, 'four': {'doc_freq': 3, 'ttf': 5, 'term_freq': 1, 'tokens': [{'start_offset': 0, 'end_offset': 4}]}}}}}
{'_index': 'exercise1', '_id': '4FkpmIoBzo4VIHJJtIRu', '_version': 1, 'found': True, 'took': 1, 'term_vectors': {'text': {'field_statistics': {'sum_doc_freq': 19, 'doc_count': 7, 'sum_ttf': 31}, 'terms': {'six': {'doc_freq': 3, 'ttf': 5, 'term_freq': 2, 'tokens': [{'start_offset': 18, 'end_offset': 21}, {'start_offset': 22, 'end_offset': 25}]}, 'three': {'doc_freq': 6, 'ttf': 8, 'term_freq': 3, 'tokens': [{'start_offset': 0, 'end_offset': 5}, {'start_offset': 6, 'end_offset': 11}, {'start_offset': 12, 'end_offset': 17}]}}}}}
{'_index': 'exercise1', '_id': '4VkpmIoBzo4VIHJJtIR-', '_ve

In [None]:
# Obtain word-count pairs - Boolean model
bool_matrix = []
sc = scan(client, index='exercise1', query={"query" : {"match_all": {}}})
for s in sc:
    doc_dict = {'five': 0, 'four': 0, 'one': 0, 'six': 0, 'three': 0, 'two': 0}
    tv = client.termvectors(index='exercise1', id=s['_id'], fields=['text'], term_statistics=True, positions=False)

    for word in tv['term_vectors']['text']['terms']:
        doc_dict[word] = 1

    bool_matrix.append(doc_dict)


# show the Boolean matrix (one row per document, in scan order)
bool_matrix

[{'five': 1, 'four': 1, 'one': 0, 'six': 0, 'three': 0, 'two': 0},
 {'five': 0, 'four': 0, 'one': 0, 'six': 1, 'three': 1, 'two': 0},
 {'five': 0, 'four': 0, 'one': 1, 'six': 1, 'three': 1, 'two': 1},
 {'five': 0, 'four': 1, 'one': 0, 'six': 1, 'three': 1, 'two': 0},
 {'five': 0, 'four': 0, 'one': 1, 'six': 0, 'three': 1, 'two': 0},
 {'five': 0, 'four': 0, 'one': 0, 'six': 0, 'three': 1, 'two': 1},
 {'five': 1, 'four': 1, 'one': 1, 'six': 0, 'three': 1, 'two': 0}]
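The rows above follow the order in which `scan` returned the documents and carry no labels, which makes them awkward to compare with the slides. As a cross-check that needs no server, the same matrix can be sketched in plain Python from the seven toy texts printed when they were indexed; the helper `boolean_and` is an illustrative addition and not part of the exercise.

```python
# The seven toy documents, as printed when they were indexed above.
toy_docs = {
    "d1": "one three",
    "d2": "two two three",
    "d3": "one three four five five five",
    "d4": "one two two two two three six six",
    "d5": "three four four four six",
    "d6": "three three three six six",
    "d7": "four five",
}

vocabulary = sorted({w for text in toy_docs.values() for w in text.split()})

# One row per document, one 0/1 entry per vocabulary term.
bool_rows = {
    name: [int(term in text.split()) for term in vocabulary]
    for name, text in toy_docs.items()
}

print("  ", vocabulary)
for name, row in bool_rows.items():
    print(name, row)

def boolean_and(*query_terms):
    """Return the documents whose row has a 1 for every query term."""
    cols = [vocabulary.index(t) for t in query_terms]
    return [name for name, row in bool_rows.items() if all(row[c] for c in cols)]

print(boolean_and("three", "six"))  # ['d4', 'd5', 'd6']
```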

In [None]:
# Obtain term and document frequencies - tf-idf model
import math


sc = scan(client, index='exercise1', query={"query" : {"match_all": {}}})

f_dict = []
df_dict = {'five': 0, 'four': 0, 'one': 0, 'six': 0, 'three': 0, 'two': 0}
for s in sc:
    fs_dict = {'five': 0, 'four': 0, 'one': 0, 'six': 0, 'three': 0, 'two': 0}
    tv = client.termvectors(index='exercise1', id=s['_id'], fields=['text'], term_statistics=True, positions=False)

    D = tv['term_vectors']['text']['field_statistics']['doc_count']

    for word in tv['term_vectors']['text']['terms']:
        fs_dict[word] = tv['term_vectors']['text']['terms'][word]['term_freq']
        df_dict[word] = tv['term_vectors']['text']['terms'][word]['doc_freq']

    f_dict.append(fs_dict)

weight_matrix = []
for i in range(len(f_dict)):
    doc_dict = {'five': 0, 'four': 0, 'one': 0, 'six': 0, 'three': 0, 'two': 0}

    maxf = max(f_dict[i].values())
    for word in doc_dict.keys():
        doc_dict[word] = (f_dict[i][word]/maxf)*(math.log(D/df_dict[word], 2))

    weight_matrix.append(doc_dict)

# show the tf-idf weight matrix (one row per document, in scan order)
weight_matrix

[{'five': 1.8073549220576042,
  'four': 1.222392421336448,
  'one': 0.0,
  'six': 0.0,
  'three': 0.0,
  'two': 0.0},
 {'five': 0.0,
  'four': 0.0,
  'one': 0.0,
  'six': 0.8149282808909654,
  'three': 0.22239242133644802,
  'two': 0.0},
 {'five': 0.0,
  'four': 0.0,
  'one': 0.305598105334112,
  'six': 0.611196210668224,
  'three': 0.055598105334112004,
  'two': 1.8073549220576042},
 {'five': 0.0,
  'four': 1.222392421336448,
  'one': 0.0,
  'six': 0.4074641404454827,
  'three': 0.07413080711214934,
  'two': 0.0},
 {'five': 0.0,
  'four': 0.0,
  'one': 1.222392421336448,
  'six': 0.0,
  'three': 0.22239242133644802,
  'two': 0.0},
 {'five': 0.0,
  'four': 0.0,
  'one': 0.0,
  'six': 0.0,
  'three': 0.11119621066822401,
  'two': 1.8073549220576042},
 {'five': 1.8073549220576042,
  'four': 0.4074641404454827,
  'one': 0.4074641404454827,
  'six': 0.0,
  'three': 0.07413080711214934,
  'two': 0.0}]
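A quick way to validate these numbers independently of the server is to recompute them from the raw toy texts. The sketch below applies the same weighting as the cell above (frequency normalised by the document's maximum, times the base-2 log of `D/df`) using only the standard library; unlike the matrix above, it labels each row and lists only the terms present in the document.

```python
import math
from collections import Counter

# The seven toy documents, as printed when they were indexed above.
toy_docs = {
    "d1": "one three",
    "d2": "two two three",
    "d3": "one three four five five five",
    "d4": "one two two two two three six six",
    "d5": "three four four four six",
    "d6": "three three three six six",
    "d7": "four five",
}

term_freqs = {name: Counter(text.split()) for name, text in toy_docs.items()}
n_docs = len(toy_docs)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for counts in term_freqs.values() for term in counts)

def tf_idf(name):
    """Weights of the terms present in one document."""
    counts = term_freqs[name]
    max_f = max(counts.values())
    return {
        term: (f / max_f) * math.log2(n_docs / doc_freq[term])
        for term, f in counts.items()
    }

for name in sorted(toy_docs):
    print(name, {t: round(w, 4) for t, w in tf_idf(name).items()})
```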

## 6. Cleanup

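Finally, we remove the index. At this point `index` refers to `exercise1`; the `test` index from Section 3 can be removed in the same way with `Index('test', using=client).delete()`.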

In [None]:
index.delete()

ObjectApiResponse({'acknowledged': True})