# Session 2 - Programming with ElasticSearch

## 1 Modifying ElasticSearch index behavior

In the previous session we had to manually clean the list of words in order to compute Zipf's and Heaps' laws.

ElasticSearch can run the text through a pipeline of processing steps before indexing it, discarding anything that is not useful.

We are going to work with three of the usual processes:

* Tokenization
* Normalization
* Token filtering (stopwords and stemming)
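
To make the three steps above concrete, here is a minimal pure-Python sketch of such a pipeline. It is not how ElasticSearch implements them: the stopword list, the `naive_stem` helper and the sample sentence are assumptions made up for illustration, and real stemmers such as Porter or Snowball apply many more rules.

```python
import unicodedata

# Tiny illustrative stopword list (an assumption, not the list ElasticSearch ships)
STOPWORDS = {"the", "was", "my", "a", "of", "in"}


def tokenize(text):
    """Tokenization: split the text on whitespace."""
    return text.split()


def normalize(token):
    """Normalization: lowercase and strip diacritics (roughly 'lowercase' + 'asciifolding')."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Toy suffix stripper (hypothetical helper); keeps at least three characters."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the three steps in order: tokenize, normalize, filter (stopwords + stemming)."""
    tokens = [normalize(t) for t in tokenize(text)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(analyze("The printer was printing printed pages in the Café"))
# → ['printer', 'print', 'print', 'page', 'cafe']
```

Note how stemming maps "printing" and "printed" to the same token, which is what shrinks the vocabulary that ends up in the index.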

The cells below configure the default analyzer for an index and analyze an example text. We are going to experiment with the available options and see which tokens result from the analysis.


In [None]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Index, analyzer, tokenizer

client = Elasticsearch()

In [None]:
# Index analyzer configuration

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('whitespace'),
    filter=['lowercase']
)

# Analysis settings can only be changed while the index is closed
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)
ind.save()
ind.open()

Now you can ask the index to analyze any text. Feel free to change the example text.

In [None]:
res = ind.analyze(analyzer='default', text=u'my taylor 4ís was% &printing printed rich the.')
for r in res['tokens']:
    print(r)

Now **follow the instructions** in the documentation: index the documents from the previous session using the script `IndexFilesPreprocess.py`, then run the script `CountWords.py` from the previous session to see how the set of tokens changes.

***

## 2 The index reloaded

You can use the modified indexer script `IndexFilesPreprocess.py` to experiment with the different options for the preprocessing pipeline.

You can change the **tokenizer** and apply different filters to the tokens, such as lowercasing, ASCII folding, stopword removal and different stemming algorithms.
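
To get a feeling for how much the tokenizer choice matters, the following sketch approximates two of the tokenizers in plain Python: splitting on whitespace, and keeping only runs of letters. These are rough stand-ins written for illustration (the function names are hypothetical), not the ElasticSearch implementations, so the index may behave differently in edge cases.

```python
import re

text = "my taylor 4ís was% &printing printed rich the."


def whitespace_tokens(s):
    """Approximation of a whitespace tokenizer: split on runs of whitespace."""
    return s.split()


def letter_tokens(s):
    """Approximation of a letter tokenizer: keep maximal runs of letters.

    [^\\W\\d_] matches any word character that is neither a digit nor an underscore.
    """
    return re.findall(r"[^\W\d_]+", s)


print(whitespace_tokens(text))
# → ['my', 'taylor', '4ís', 'was%', '&printing', 'printed', 'rich', 'the.']
print(letter_tokens(text))
# → ['my', 'taylor', 'ís', 'was', 'printing', 'printed', 'rich', 'the']
```

With the whitespace approach punctuation stays attached to the tokens ("was%", "the."), so "the." would not match a stopword list entry for "the" further down the pipeline.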

***

## 3 Computing Tf-Idf and Cosine similarity

Now it is your turn to work on the session task.

The idea is to write a script that, given the paths of two documents, obtains their ids in the index, computes the Tf-Idf representation of each document, and then computes and prints their cosine similarity.
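
As a reminder of the arithmetic involved, here is a small sketch on toy data that does not touch the index at all. It assumes one common weighting, w = (f / max_f) * log2(N / df), which may differ from the one the session documentation asks for. The term counts, document frequencies and function names are made up for illustration.

```python
import math


def tf_idf(term_freqs, doc_freqs, n_docs):
    """Weight each term as (f / max_f) * log2(N / df).

    term_freqs: {term: count in this document}
    doc_freqs:  {term: number of documents containing the term}
    """
    max_f = max(term_freqs.values())
    return {
        term: (f / max_f) * math.log2(n_docs / doc_freqs[term])
        for term, f in term_freqs.items()
    }


def cosine(v1, v2):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v2[term] for term, w in v1.items() if term in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Toy collection statistics (hypothetical numbers)
n_docs = 4
doc_freqs = {"index": 2, "token": 1, "search": 4, "stem": 1}
doc_a = {"index": 2, "token": 1, "search": 3}
doc_b = {"index": 1, "stem": 2, "search": 1}

va = tf_idf(doc_a, doc_freqs, n_docs)
vb = tf_idf(doc_b, doc_freqs, n_docs)
print(cosine(va, vb))
```

A term that appears in every document ("search" here) gets weight 0, so only "index" contributes to the dot product of the two toy documents.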

**Follow the instructions** in the documentation and **pay attention** to the documentation that you have to deliver for this session.