In [1]:
import requests
import json
import os
import pprint

# Test connection
r = requests.get('http://localhost:9200')

print(r.json()['tagline'])

## Define an ES analyzer for Polish texts containing:
- standard tokenizer
- synonym filter with the following definitions:
    - kpk - kodeks postępowania karnego
    - kpc - kodeks postępowania cywilnego
    - kk - kodeks karny
    - kc - kodeks cywilny
- Morfologik-based lemmatizer
- lowercase filter

In [2]:
headers = {"Content-Type": "application/json"}  # needed for all ES queries
es_url = 'http://localhost:9200'  # default address of a running ES service
index_name = 'my_index_lab2'

In [3]:
query = json.dumps({
   "settings":{
       "analysis":{
           "filter": {
               "synonyms_lab2": {
                   "type": "synonym",
                   "synonyms": [ 
                       "kpk => kodeks postępowania karnego",
                       "kpc => kodeks postępowania cywilnego",
                       "kk => kodeks karny",
                       "kc => kodeks cywilny"]
               }
           },
           "analyzer":{
               "analyzer_lab2":{ 
                   "type":"custom",
                   "tokenizer":"standard",
                   "filter":[
                       "lowercase",
                       "synonyms_lab2",
                       "morfologik_stem",
                       "stop"
                   ]
               }
           }
       }
   },
   "mappings":{
       "properties":{
          "content": {
             "type":"text",
             "analyzer":"analyzer_lab2" 
         }
      }
   }
})


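A custom filter has to be defined for the synonyms. The lowercase filter converts all tokens to lowercase, and morfologik_stem is a lemmatizer for Polish. The stop filter removes stop words; with no further configuration it uses the default English stop word list. Punctuation such as full stops and commas is already dropped by the standard tokenizer.

The mapping makes Elasticsearch analyze the content field with the custom analyzer.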

## Define an ES index for storing the contents of the legislative acts.

In [4]:
response = requests.put(es_url + '/' + index_name, 
                        headers=headers,
                        data=query)
print(response.text)

{"acknowledged":true,"shards_acknowledged":true,"index":"my_index_lab2"}


Using the Elasticvue browser extension for Elasticsearch (https://elasticvue.com/), I can check the created index:

![title](img/index_created.JPG)
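
As a complement to the browser check, the analyzer itself can be inspected with the index-level _analyze endpoint, which returns the tokens produced for a given text. The sketch below is not part of the original lab: the helper names build_analyze_body and extract_tokens are hypothetical, and the request in the comments assumes the es_url, index_name and headers variables defined earlier and a running Elasticsearch instance.

```python
import json


def build_analyze_body(text, analyzer="analyzer_lab2"):
    """Build the JSON body for POST /<index>/_analyze."""
    return json.dumps({"analyzer": analyzer, "text": text})


def extract_tokens(analyze_response):
    """Return (position, token) pairs from a parsed _analyze response."""
    return [(t["position"], t["token"])
            for t in analyze_response.get("tokens", [])]


# Usage in the notebook (needs a running Elasticsearch and the created index):
# r = requests.post(es_url + '/' + index_name + '/_analyze',
#                   headers=headers,
#                   data=build_analyze_body("kpk"))
# print(extract_tokens(r.json()))
```

Tokens that share a position were produced from the same input token, which makes it easy to see how a synonym rule was applied.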

## Load the data to the ES index.

In [5]:
data_path = "./data_lab1/"

for filename in os.listdir(data_path):
    if not filename.endswith(".txt"):
        continue
    filepath = os.path.join(data_path, filename)
    # Read the file and collapse all whitespace runs into single spaces
    with open(filepath, 'r', encoding='utf-8') as f:
        content = " ".join(f.read().split())

    query = json.dumps({"content": content, "title": filename})

    # Index one document per file; Elasticsearch generates the document id
    response = requests.post(es_url + '/' + index_name + '/_doc',
                             headers=headers,
                             data=query)
#     print(response.text)
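
The loop above sends one HTTP request per file. A possible alternative, shown here only as a sketch, is the _bulk endpoint, which accepts newline-delimited JSON: an action line followed by the document source, with the whole body ending in a newline. The helper name build_bulk_body is hypothetical, and docs is assumed to be a list of dictionaries with the same content and title keys as above.

```python
import json


def build_bulk_body(docs):
    """Serialize documents as NDJSON for POST /<index>/_bulk."""
    lines = []
    for doc in docs:
        # Empty action metadata: the target index is taken from the URL
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # Every line, including the last one, must be terminated by a newline
    return "".join(line + "\n" for line in lines)


# Usage in the notebook (needs a running Elasticsearch):
# r = requests.post(es_url + '/' + index_name + '/_bulk',
#                   headers={"Content-Type": "application/x-ndjson"},
#                   data=build_bulk_body(docs).encode("utf-8"))
# print(r.json()["errors"])  # False when every document was indexed
```

For a corpus of this size the per-file loop is perfectly adequate; bulk indexing mainly pays off when the number of documents grows.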

The same browser extension can also be used to verify the number of indexed documents.

![title](img/indexed_docs.JPG)
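
The document count can also be checked from the notebook with the _count endpoint and compared with the number of .txt files on disk. This is a minimal sketch: count_txt_files and indexed_count are hypothetical helper names, and the commented request assumes the es_url, index_name and data_path variables defined earlier.

```python
import os


def count_txt_files(data_path):
    """Number of .txt files that the loading loop sends to Elasticsearch."""
    return sum(1 for name in os.listdir(data_path) if name.endswith(".txt"))


def indexed_count(count_response):
    """Extract the document count from a parsed _count response."""
    return count_response["count"]


# Usage in the notebook (needs a running Elasticsearch):
# requests.post(es_url + '/' + index_name + '/_refresh')  # make new docs searchable
# r = requests.get(es_url + '/' + index_name + '/_count')
# print(indexed_count(r.json()), "indexed,", count_txt_files(data_path), "on disk")
```

Re-running the loading cell adds every file again because the document ids are generated automatically, so a count larger than the number of files usually points to duplicates.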

In [6]:
query = {}
response = requests.get(es_url + '/' + index_name + '/_mapping',
                        headers=headers,
                        data=query)
print(response.text)

{"my_index_lab2":{"mappings":{"properties":{"content":{"type":"text","analyzer":"analyzer_lab2"},"title":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}}}


## Determine the number of legislative acts containing the word ustawa (in any form).

In [7]:
query = json.dumps(
    {
        "query": {
            "match": {
                "content": {
                    "query": "ustawa"
#                     "analyzer": "analyzer_lab2" ---> default for field
                }
            }
        }
    })
response = requests.get(es_url + '/' + index_name + '/_search', 
                        headers=headers,
                        data=query)
print(response.json()['hits']['total']['value'])

1179
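
A note on reading the total: by default Elasticsearch tracks the hit count exactly only up to 10,000, and reports larger totals as a lower bound with relation set to gte. The result above is well below that limit, but a counting query can request an exact total explicitly. The sketch below uses the hypothetical helper names build_count_query and describe_total; size is set to 0 because only the total is needed.

```python
import json


def build_count_query(field, text, exact_total=True):
    """Build a match query that returns only the total number of hits."""
    body = {"query": {"match": {field: {"query": text}}}, "size": 0}
    if exact_total:
        body["track_total_hits"] = True  # count all hits, not just the first 10,000
    return json.dumps(body)


def describe_total(search_response):
    """Format hits.total, marking lower-bound counts."""
    total = search_response["hits"]["total"]
    prefix = "at least " if total["relation"] == "gte" else ""
    return f"{prefix}{total['value']}"


# Usage in the notebook (needs a running Elasticsearch):
# r = requests.get(es_url + '/' + index_name + '/_search',
#                  headers=headers,
#                  data=build_count_query("content", "ustawa"))
# print(describe_total(r.json()))
```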


## Determine the number of legislative acts containing the words kodeks postępowania cywilnego in the specified order, but in any inflected form.

In [8]:
# match_phrase - matches the query terms as a phrase, in the specified order

query = json.dumps(
    {
        "query": {
            "match_phrase": {
                "content": {
                    "query": "kodeks postępowania cywilnego"
#                     "analyzer": "analyzer_lab2" ---> default for field
                }
            }
        }
    })
response = requests.get(es_url + '/' + index_name + '/_search', 
                        headers=headers,
                        data=query)
print(response.json()['hits']['total']['value'])

100


## Determine the number of legislative acts containing the words wchodzi w życie (in any form) allowing for up to 2 additional words in the searched phrase.

In [9]:
query = json.dumps(
    {
        "query": {
            "match_phrase": {
                "content": {
                    "query": "wchodzi w życie",
                    "slop": 2  # how many words could be added in the searched phrase
#                     "analyzer": "analyzer_lab2" #---> default for field
                }
            }
        }
    })
response = requests.get(es_url + '/' + index_name + '/_search', 
                        headers=headers,
                        data=query)
print(response.json()['hits']['total']['value'])

1175


A simple check for the phrase "wchodzi w życie" gave 1091 documents, so the query above also matched other variants, for example acts containing "weszła w życie".
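
To see how the slop value affects the result, the same phrase query can be run for several slop values and the totals compared. This is a sketch under the assumption that the content field and the notebook variables es_url, index_name and headers are used as above; build_phrase_query is a hypothetical helper name.

```python
import json


def build_phrase_query(field, phrase, slop=0):
    """Build a match_phrase query that returns only the total number of hits."""
    return json.dumps({
        "query": {"match_phrase": {field: {"query": phrase, "slop": slop}}},
        "size": 0,  # only hits.total is needed, not the documents
    })


# Usage in the notebook (needs a running Elasticsearch):
# for slop in (0, 1, 2):
#     r = requests.get(es_url + '/' + index_name + '/_search',
#                      headers=headers,
#                      data=build_phrase_query("content", "wchodzi w życie", slop))
#     print(slop, r.json()['hits']['total']['value'])
```

Because both the indexed text and the query go through the same analyzer, inflected forms that share a lemma already match at slop 0; higher slop values additionally admit phrases with intervening words.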

## Determine the 10 documents that are the most relevant for the phrase konstytucja.
## Print the excerpts containing the word konstytucja (up to three excerpts per document) from the previous task.

In [10]:
query = json.dumps(
    {
        "query": {
            "match_phrase": {
                "content": {
                    "query": "konstytucja"
#                     "analyzer": "analyzer_lab2" #---> default for field
                }
            }
        },
        "size": 45,  # needed to extract all found documents 
        "highlight": {
            "fields": {
              "content": {}
            },
            "boundary_scanner": "sentence",  # Excerpts should be sentence
            "number_of_fragments": 3,  # up to 3 excerpts per document
            "order": "score"
        }
    })
response = requests.get(es_url + '/' + index_name + '/_search', 
                        headers=headers,
                        data=query)

hits = response.json()['hits']

# Sorting documents by score - highest score most relevant doc
sorted_hits = sorted(hits['hits'], key=lambda x: x['_score'], reverse=True)
for hit in sorted_hits[:10]:
    print("Score:", hit['_score'])
    pprint.pprint(hit['highlight'])

Score: 6.870618
{'content': ['Zasady, na których opierać się ma <em>Konstytucja</em> mogą być '
             'poddane pod referendum. 2.',
             'Inicjatywa ustawodawcza w zakresie przedstawienia Zgromadzeniu '
             'Narodowemu projektu nowej <em>Konstytucji</em>',
             'Do zgłoszenia projektu <em>Konstytucji</em> załącza się wykaz '
             'obywateli popierających zgłoszenie, zawierający']}
Score: 6.682023
{'content': ['Polskiej do ratyfikacji jest dokonywane po uzyskaniu zgody, o '
             'której mowa w art. 89 ust. 1 i art. 90 <em>Konstytucji</em>',
             'umowy międzynarodowej lub załącznika nie wypełnia przesłanek '
             'określonych w art. 89 ust. 1 lub art. 90 <em>Konstytucji</em>',
             'okoliczności, a umowa międzynarodowa nie wypełnia przesłanek '
             'określonych w art. 89 ust. 1 lub art. 90 <em>Konstytucji</em>']}
Score: 6.6318135
{'content': ['Jeżeli Trybunał Konstytucyjny wyda orzeczenie o sprzeczności '
 

Retrieving all matching documents takes slightly longer than a query with the default size. The query above prints the score of each document together with its excerpts.
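
Elasticsearch returns hits sorted by descending _score unless another sort is requested, so setting "size": 10 in the query would yield the ten most relevant documents directly, without fetching all matches and sorting them in Python. The helper below is a sketch with a hypothetical name, top_excerpts; it pairs each score with at most three highlighted fragments from a parsed search response.

```python
def top_excerpts(search_response, n_docs=10, max_fragments=3, field="content"):
    """Return (score, fragments) pairs for the first n_docs hits.

    Assumes the hits are already ordered by relevance, which is the
    default ordering of a search response.
    """
    results = []
    for hit in search_response["hits"]["hits"][:n_docs]:
        fragments = hit.get("highlight", {}).get(field, [])[:max_fragments]
        results.append((hit["_score"], fragments))
    return results


# Usage in the notebook, with the response from the cell above:
# for score, fragments in top_excerpts(response.json()):
#     print("Score:", score)
#     for fragment in fragments:
#         print("  ", fragment)
```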