# Indexing hackathon documents to Elasticsearch

This notebook indexes the 358 documents from the June 2019 hackathon into Elasticsearch. Each document is stored together with its metadata.


### Configuration
First, ensure that the appropriate credentials are stored in your AWS credentials file at `~/.aws/credentials`.

These should be stored under the `wmuser` profile with something like:

```
[wmuser]
aws_access_key_id = WMUSER_ACCESS_KEY
aws_secret_access_key = WMUSER_SECRET_KEY
```
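
Before connecting, it can help to confirm that the profile is actually present in the credentials file. The following is a minimal sketch using only the standard library; the helper name `has_aws_profile` is hypothetical, and it only checks that the two keys shown above exist, not that they are valid.

```python
import configparser
import os


def has_aws_profile(profile, credentials_path="~/.aws/credentials"):
    """Return True if `profile` exists and defines both credential keys."""
    parser = configparser.ConfigParser()
    # read() silently skips missing files, so a missing file yields no sections
    parser.read(os.path.expanduser(credentials_path))
    if not parser.has_section(profile):
        return False
    required = ("aws_access_key_id", "aws_secret_access_key")
    return all(parser.has_option(profile, key) for key in required)


# Prints True only if a complete [wmuser] section is found
print(has_aws_profile("wmuser"))
```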

> Note that this profile must be specified by name when creating the `boto3` session.

### Requirements

```
pip install requests-aws4auth==0.9
pip install elasticsearch==7.0.2
```

## Connecting to Elasticsearch

In [1]:
import boto3, json
import os
from hashlib import sha256
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

region = 'us-east-1'
service = 'es'
eshost = 'search-world-modelers-dev-gjvcliqvo44h4dgby7tn3psw74.us-east-1.es.amazonaws.com'

session = boto3.Session(region_name=region, profile_name='wmuser')
credentials = session.get_credentials()
credentials = credentials.get_frozen_credentials()
access_key = credentials.access_key
secret_key = credentials.secret_key
token = credentials.token

aws_auth = AWS4Auth(
    access_key,
    secret_key,
    region,
    service,
    session_token=token
)

In [2]:
es = Elasticsearch(
    hosts = [{'host': eshost, 'port': 443}],
    http_auth=aws_auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

print(json.dumps(es.info(), indent=2))

{
  "name": "ZhaR9MU",
  "version": {
    "build_snapshot": false,
    "number": "6.7.0",
    "minimum_index_compatibility_version": "5.0.0",
    "build_date": "2019-04-17T05:34:35.022392Z",
    "build_flavor": "oss",
    "lucene_version": "7.7.0",
    "build_type": "zip",
    "minimum_wire_compatibility_version": "5.6.0",
    "build_hash": "8453f77"
  },
  "cluster_name": "342635568055:world-modelers-dev",
  "cluster_uuid": "nGeAO1lMTKaG6_LOpSg17w",
  "tagline": "You Know, for Search"
}


## Indexing documents
Now that we have a connection to Elasticsearch, we can index the documents.

> Note: you must update `dir_path` to point to the directory containing the hackathon documents.

In [3]:
docs = []

# Update this path with the correct path to the hackathon documents
dir_path = '/PATH_TO/Docs_20-May-2019'

txt_files = os.listdir(dir_path + '/txt')
for txt in txt_files:
    
    # load txt file
    with open(dir_path + '/txt/' + txt, 'r') as f:
        txt_file = f.read()
        
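    # load metadata file (same base name as the txt file, with a .json extension);
    # swapping only the extension avoids mangling names that contain 'txt' elsewhere
    with open(dir_path + '/meta/' + os.path.splitext(txt)[0] + '.json', 'r') as f:
        meta_file = json.load(f)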
    
    # store the text under the `content` key
    doc = {'content': txt_file}
    
    # add metadata to document object
    for m in meta_file:
        k = m['MT']['N']
        v = m['MT']['V']
    
        # we should standardize the key names
        if k == 'publisherName': 
            k = 'publisher name'
        k = k.replace(' ','_')
        doc[k] = v

    # we should also add in the original file name
    doc['file_name'] = txt
    
    # we can sha256 hash the text to generate a deterministic, content-based ID which we
    # can use instead of the auto-generated ID Elasticsearch would otherwise provide
    doc['_id'] = sha256(txt_file.encode('utf-8')).hexdigest()
    
    docs.append(doc)
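
The metadata handling in the loop above can be seen in isolation with a small sketch. The function name `flatten_metadata` and the sample records are hypothetical and invented for illustration; only the `MT`/`N`/`V` structure and the `publisherName` special case come from the cell above.

```python
def flatten_metadata(meta_records):
    """Collapse [{'MT': {'N': name, 'V': value}}, ...] into a flat dict."""
    flat = {}
    for record in meta_records:
        key = record["MT"]["N"]
        value = record["MT"]["V"]
        # Same normalization as the indexing loop: split the one camelCase
        # key, then replace spaces with underscores
        if key == "publisherName":
            key = "publisher name"
        flat[key.replace(" ", "_")] = value
    return flat


# Hypothetical sample metadata, not taken from the hackathon files
sample_meta = [
    {"MT": {"N": "title", "V": "Example Report"}},
    {"MT": {"N": "publisherName", "V": "Example Publisher"}},
    {"MT": {"N": "creation date", "V": "2019-05-20"}},
]

print(flatten_metadata(sample_meta))
```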

Now we create an index called `migration` for the migration documents.

In [4]:
index = 'migration'
doc_type = 'document'

In [5]:
# create the index if it does not exist
if not es.indices.exists(index):
    es.indices.create(index)

Note that when we index a document we pop its `_id` and pass it as the Elasticsearch `_id`. This ensures that if we index a document whose `_id` already exists, we update that document in place rather than creating a new one.

Because each `_id` is the SHA-256 hash of the document text, exact duplicates map to the same `_id` and collapse into a single entry. The 358 original documents included 2 that were exact duplicates of others, so the index ends up with 356 unique documents.
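
The effect of content-based IDs can be checked without Elasticsearch. This is a small sketch with made-up strings standing in for document text; the helper name `content_id` is hypothetical.

```python
from hashlib import sha256


def content_id(text):
    """Return the SHA-256 hex digest of the text, used as the document ID."""
    return sha256(text.encode("utf-8")).hexdigest()


# Invented example texts; the first two are exact duplicates
texts = ["refugee aid report", "refugee aid report", "a different report"]
ids = [content_id(t) for t in texts]
unique_ids = set(ids)

print(len(texts), "documents ->", len(unique_ids), "unique IDs")
# prints: 3 documents -> 2 unique IDs
```

Only byte-for-byte identical text hashes to the same ID, so near-duplicates (for example, copies differing in whitespace) would still be indexed separately.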

In [6]:
for doc in docs:
    es.index(index=index, doc_type=doc_type, id=doc.pop('_id'), body=doc)
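
The loop above sends one request per document. As an alternative, the documents could be sent in batches with the `elasticsearch.helpers.bulk` helper. The sketch below only builds the bulk actions; `build_bulk_actions` is a hypothetical helper, the sample document is invented, and it assumes the documents still carry their `_id` key (so it would replace the loop above rather than run after it).

```python
def build_bulk_actions(docs, index, doc_type):
    """Yield one bulk action per document without mutating the input."""
    for doc in docs:
        # Everything except the ID becomes the stored document body
        source = {k: v for k, v in doc.items() if k != "_id"}
        yield {
            "_index": index,
            "_type": doc_type,
            "_id": doc["_id"],
            "_source": source,
        }


# Hypothetical sample document for illustration
sample_docs = [{"_id": "abc123", "content": "example text", "file_name": "example.txt"}]
actions = list(build_bulk_actions(sample_docs, "migration", "document"))
print(actions[0]["_id"], sorted(actions[0]["_source"]))

# With a live connection, the actions could be sent in batches, e.g.:
# from elasticsearch import helpers
# helpers.bulk(es, build_bulk_actions(docs, index, doc_type))
```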

## Retrieving documents
Now we can query the Elasticsearch `migration` index using a variety of querying approaches outlined [here](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl.html).

Below is an example of a boolean search using a `query_string` query with Lucene query syntax:

In [7]:
query = {
    "query": {
        "query_string" : {
            "default_field" : "content", # Ensure we use the correct field (could search on `title` as well)
            "query" : "refugee AND aid AND (addis OR NGO)" # Lucene query syntax
        }
    }
}

In [8]:
results = es.search(index=index, body=query)['hits']['hits']
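
If several boolean searches are needed, the query body can be assembled programmatically. This is a rough sketch: `lucene_boolean` and `query_string_body` are hypothetical helpers, and they do not escape Lucene special characters, so they assume plain single-word terms.

```python
def lucene_boolean(all_of, any_of=()):
    """Join required terms with AND, plus an optional OR group in parentheses."""
    clauses = list(all_of)
    if any_of:
        clauses.append("(" + " OR ".join(any_of) + ")")
    return " AND ".join(clauses)


def query_string_body(query, field="content"):
    """Wrap a Lucene query string in a `query_string` search body."""
    return {"query": {"query_string": {"default_field": field, "query": query}}}


q = lucene_boolean(["refugee", "aid"], ["addis", "NGO"])
body = query_string_body(q)
print(q)
# prints: refugee AND aid AND (addis OR NGO)
```

The resulting `body` could then be passed to `es.search(index=index, body=body)` in the same way as the hand-written query.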

The results are returned as a list. Each item in the list is a dictionary with the following keys:

In [9]:
results[0].keys()

dict_keys(['_score', '_index', '_type', '_source', '_id'])

The `_id` key is a unique identifier for that document in the `migration` index. The actual document is stored in the `_source` key.

In [10]:
results[0]['_id']

'120c5949cc2895689c9f02d2ff59f92b78f332e2e4647a9b0d03783367d60233'

In [11]:
results[0]['_source']['title']

'The Aid in Danger Monthly News Brief'
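
Since `content` can be very long, a compact view of each hit is often easier to scan. The sketch below is one way to do that; `summarize_hit` is a hypothetical helper and the sample hit is invented, mirroring only the keys shown above.

```python
def summarize_hit(hit, preview_chars=80):
    """Return the ID, score, title, and a short content preview for one hit."""
    source = hit["_source"]
    content = source.get("content", "")
    return {
        "id": hit["_id"],
        "score": hit["_score"],
        "title": source.get("title"),
        # Collapse runs of whitespace and newlines before truncating
        "preview": " ".join(content.split())[:preview_chars],
    }


# Hypothetical hit with the same shape as an item in `results`
sample_hit = {
    "_id": "abc123",
    "_index": "migration",
    "_type": "document",
    "_score": 1.5,
    "_source": {"title": "Example Title", "content": "Line one\n\nLine  two"},
}

print(summarize_hit(sample_hit))
```

With real results this could be applied as `[summarize_hit(h) for h in results]`.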

You can retrieve documents by their `_id` if you wish:

In [12]:
es.get(index=index, doc_type=doc_type, id=results[0]['_id'])

{'_id': '120c5949cc2895689c9f02d2ff59f92b78f332e2e4647a9b0d03783367d60233',
 '_index': 'migration',
 '_primary_term': 1,
 '_seq_no': 55,
 '_source': {'content': ".\n\nAid in Danger \nMonthly News Brief \nInsecurity affecting the delivery of aid.\n\nSecurity Incidents and Access Constraints.\n\nAfrica  \nCentral African Republic \n05  February  2018:  Update:  The  NGO  Solidarity \nInternational \nannounced the release of one of its staffers after six days in captivity. \nSource: Le Réseau des Journalistes pour les Droits de l'Homme (RJDH).\n\n09 February 2018: In Bangassou, Mbomou prefecture, a month-long \nblockade of the aid corridor in the city has been lifted, thus allowing \nthe delivery of food and medical supplies to those in need. Sources: \nCorbeau News and Xinhua.\n\n13  February  2018:  In  Bangassou,  Mbomou  prefecture,  anti-Balaka \nfighters  stole  a  pickup  truck  belonging  to  the  United  Nations \nMultidimensional  Integrated  Stabilization  Mission  in  the  Cen

If you would like to retrieve all the documents, you can do so with a `match_all` query:

In [13]:
query = {
    "query": {
        "match_all": {}
    }
}

count = es.count(index=index, body=query)['count']
print("There are {0} total documents in the {1} index.".format(count,index))

There are 356 total documents in the migration index.


In [14]:
results = es.search(index=index, body=query, size=count)['hits']['hits']

In [15]:
print("This query returned {} total documents".format(len(results)))

This query returned 356 total documents
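
Passing `size=count` works here because 356 is well under Elasticsearch's default `index.max_result_window` of 10,000. For larger indices, one option is to page through results with the scroll API instead. The following is a minimal sketch, assuming the same `es` client; `fetch_all_hits` is a hypothetical helper, and the `page_size` and `keep_alive` values are arbitrary. The `elasticsearch.helpers.scan` helper wraps the same API and may be more convenient.

```python
def fetch_all_hits(es, index, query, page_size=100, keep_alive="2m"):
    """Collect every hit for `query` by paging through the scroll API."""
    response = es.search(index=index, body=query, scroll=keep_alive, size=page_size)
    scroll_id = response["_scroll_id"]
    hits = []
    try:
        # An empty page signals that all hits have been returned
        while response["hits"]["hits"]:
            hits.extend(response["hits"]["hits"])
            response = es.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = response["_scroll_id"]
    finally:
        # Release the server-side scroll context
        es.clear_scroll(scroll_id=scroll_id)
    return hits
```

Usage would look like `results = fetch_all_hits(es, index, query)`.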
