# Indexing hackathon documents to Elasticsearch

This notebook indexes the 358 documents from the June 2019 hackathon into Elasticsearch. Each document is stored together with its corresponding metadata.


### Configuration
First, ensure that the appropriate credentials are stored in your AWS credentials file at `~/.aws/credentials`.

These should be stored under the `wmuser` profile with something like:

```
[wmuser]
aws_access_key_id = WMUSER_ACCESS_KEY
aws_secret_access_key = WMUSER_SECRET_KEY
```

> Note that this profile must be specified by name when creating the `boto3` session.

### Requirements

```
pip install requests-aws4auth==0.9
pip install elasticsearch==7.0.2
```

In [1]:
import boto3, json
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

region = 'us-east-1'
service = 'es'
eshost = 'search-world-modelers-dev-gjvcliqvo44h4dgby7tn3psw74.us-east-1.es.amazonaws.com'

session = boto3.Session(region_name=region, profile_name='wmuser')
credentials = session.get_credentials()
credentials = credentials.get_frozen_credentials()
access_key = credentials.access_key
secret_key = credentials.secret_key
token = credentials.token

aws_auth = AWS4Auth(
    access_key,
    secret_key,
    region,
    service,
    session_token=token
)

In [2]:
es = Elasticsearch(
    hosts = [{'host': eshost, 'port': 443}],
    http_auth=aws_auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

print(json.dumps(es.info(), indent=2))

{
  "tagline": "You Know, for Search",
  "cluster_uuid": "nGeAO1lMTKaG6_LOpSg17w",
  "version": {
    "build_date": "2019-04-17T05:34:35.022392Z",
    "build_hash": "8453f77",
    "build_flavor": "oss",
    "minimum_index_compatibility_version": "5.0.0",
    "build_snapshot": false,
    "lucene_version": "7.7.0",
    "build_type": "zip",
    "number": "6.7.0",
    "minimum_wire_compatibility_version": "5.6.0"
  },
  "name": "ZhaR9MU",
  "cluster_name": "342635568055:world-modelers-dev"
}


Now that we have a connection with Elasticsearch we can index the documents. 

> Note: you must update `dir_path` to point to the directory that contains the hackathon documents.

In [3]:
import os

In [4]:
docs = []

# Update this path with the correct path to the hackathon documents
dir_path = '/PATH_TO/Docs_20-May-2019'

txt_files = os.listdir(dir_path + '/txt')
for txt in txt_files:
    
    # load txt file
    with open(dir_path + '/txt/' + txt, 'r') as f:
        txt_file = f.read()
        
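    # load metadata file (same base name as the txt file, with a .json extension);
    # swap only the extension so a 'txt' elsewhere in the file name is left alone
    with open(dir_path + '/meta/' + os.path.splitext(txt)[0] + '.json', 'r') as f: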
        meta_file = json.loads(f.read())
    
    # store the txt as key `content`
    doc = {'content': txt_file}
    
    # add metadata to document object
    for m in meta_file:
        k = m['MT']['N']
        v = m['MT']['V']
    
        # we should standardize the key names
        if k == 'publisherName': 
            k = 'publisher name'
        k = k.replace(' ','_')
        doc[k] = v

    # we should also add in the original file name
    doc['file_name'] = txt
    docs.append(doc)
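
The metadata handling in the loop above can also be written as a small function, which makes it easier to check on a single record before processing the whole directory. This is a minimal sketch: `build_doc` is a hypothetical helper name, and the sample metadata values (including the `creation date` key) are invented for illustration. Only the `MT`/`N`/`V` layout and the `publisherName` rename come from the cell above.

```python
def build_doc(text, meta, file_name):
    """Combine a document's text and its metadata entries into one dict."""
    doc = {'content': text}
    for entry in meta:
        key = entry['MT']['N']
        value = entry['MT']['V']
        # standardize key names: 'publisherName' -> 'publisher_name'
        if key == 'publisherName':
            key = 'publisher name'
        doc[key.replace(' ', '_')] = value
    # keep the original file name alongside the metadata
    doc['file_name'] = file_name
    return doc


# Hypothetical metadata entries, shaped like the hackathon meta files
sample_meta = [
    {'MT': {'N': 'title', 'V': 'Example report'}},
    {'MT': {'N': 'publisherName', 'V': 'Example Org'}},
    {'MT': {'N': 'creation date', 'V': '2019-05-20'}},
]
sample_doc = build_doc('Example text.', sample_meta, 'example.txt')
print(sample_doc)
```

With a helper like this, the body of the loop reduces to `docs.append(build_doc(txt_file, meta_file, txt))`.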

Next, we create an index called `migration` for the Migration documents.

In [5]:
index = 'migration'
doc_type = 'document'

es.indices.create(index)

{'acknowledged': True, 'index': 'migration', 'shards_acknowledged': True}
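
Calling `es.indices.create` a second time for an index that already exists raises an error, which is inconvenient when re-running the notebook. One way to make the cell safe to repeat is to check for the index first. This is a sketch; `ensure_index` is a hypothetical helper name, and it relies only on the client's `indices.exists` and `indices.create` calls.

```python
def ensure_index(es, index):
    """Create `index` only if it is missing.

    Returns True if the index was created, False if it already existed.
    """
    if es.indices.exists(index=index):
        return False
    es.indices.create(index=index)
    return True
```

Usage would look like `ensure_index(es, 'migration')` in place of the direct `es.indices.create(index)` call.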

In [6]:
for doc in docs:
    es.index(index=index, doc_type=doc_type, body=doc)
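
The loop above sends one HTTP request per document, which is fine for 358 documents. For larger collections, the client's bulk helper is a common alternative. The sketch below only builds the action dictionaries that the helper expects; `bulk_actions` is a hypothetical name, and the call to `helpers.bulk` is shown as a comment because it needs the live `es` connection.

```python
def bulk_actions(docs, index, doc_type):
    """Yield one bulk-index action per document."""
    for doc in docs:
        yield {
            '_index': index,
            '_type': doc_type,
            '_source': doc,
        }


# With a live connection, the actions can be passed to the bulk helper:
#   from elasticsearch import helpers
#   helpers.bulk(es, bulk_actions(docs, index, doc_type))

example_actions = list(bulk_actions([{'content': 'a'}], 'migration', 'document'))
print(example_actions)
```

Because `bulk_actions` is a generator, the documents are streamed to the helper rather than copied into a second list.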

Now we can query the Elasticsearch `migration` index using a variety of querying approaches outlined [here](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl.html).

Below is an example of a boolean search that uses a `query_string` query with Lucene query syntax:

In [7]:
query = {
    "query": {
        "query_string" : {
            "default_field" : "content", # Ensure we use the correct field (could search on `title` as well)
            "query" : "refugee AND aid AND (addis OR NGO)" # Lucene query syntax
        }
    }
}

In [8]:
results = es.search(index=index, body=query)['hits']['hits']

Results are stored as an array. Each item in the array is an object which has the following keys:

In [9]:
results[0].keys()

dict_keys(['_index', '_type', '_score', '_id', '_source'])

The `_id` key is a unique identifier for that document in the `migration` index. The actual document is stored under the `_source` key.

In [10]:
results[0]['_id']

'yu6GQ2sB3Wd34S8RQMIB'

In [11]:
results[0]['_source']['title']

'The Aid in Danger Monthly News Brief'
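
To scan several results at once rather than indexing into `results` one item at a time, the hits can be reduced to a short list of `(_id, _score, title)` tuples. This is an illustrative sketch: `summarize_hits` is a hypothetical helper, and the `_score` value in the sample hit is made up. The `_id` and title are the ones shown above.

```python
def summarize_hits(hits, limit=5):
    """Return (_id, _score, title) for the first `limit` hits."""
    rows = []
    for hit in hits[:limit]:
        source = hit.get('_source', {})
        # not every document is guaranteed to have a title
        rows.append((hit['_id'], hit['_score'], source.get('title')))
    return rows


# Sample hit shaped like the items in `results`; the score is invented
sample_hits = [
    {
        '_index': 'migration',
        '_type': 'document',
        '_score': 1.0,
        '_id': 'yu6GQ2sB3Wd34S8RQMIB',
        '_source': {'title': 'The Aid in Danger Monthly News Brief'},
    }
]
summary = summarize_hits(sample_hits)
print(summary)
```

In the notebook, `summarize_hits(results)` would give the same overview for the live query results.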

You can retrieve documents by their `_id` if you wish:

In [12]:
es.get(index=index, doc_type=doc_type, id=results[0]['_id'])

{'_id': 'yu6GQ2sB3Wd34S8RQMIB',
 '_index': 'migration',
 '_primary_term': 1,
 '_seq_no': 39,
 '_source': {'content': ".\n\nThe Aid Agency  \nMonthly News Brief \nInsecurity affecting the delivery of aid.\n\nattacked.\n\nanti-Balaka  militants.\n\nSecurity Incidents and Access Constraints.\n\nAfrica  \nCentral African Republic \nIn  Bria  city,  Haute-Kotto  prefecture,  an \n04  December  2017: \nunspecified  number  of \na \nMultidimensional  Integrated  Stabilization  Mission  in  the  Central \nAfrican Republic (MINUSCA) police checkpoint at the entrance to the \nPK3  internally  displaced  people  (IDP)  site,  killing  a  Mauritanian \npeacekeeper,  and  wounding  two  other  Mauritanians,  and  one \nZambian.  Sources:  Africa  Time,  Le  Réseau  des  Journalistes  pour  les \nDroits de l'Homme (RJDH) and The Associated Press.\n\n04 December 2017: In Bria city, Haute-Kotto prefecture, unspecified \narmed men stopped and pointed their weapons at a Médecins Sans \nFrontières  (MSF)

If you would like to retrieve all the documents, you can do so with a `match_all` query:

In [13]:
query = {
    "query": {
        "match_all": {}
    }
}

count = es.count(index=index, body=query)['count']
print("There are {0} total documents in the {1} index.".format(count,index))

There are 358 total documents in the migration index.


In [14]:
results = es.search(index=index, body=query, size=count)['hits']['hits']

In [15]:
print("This query returned {} total documents".format(len(results)))

This query returned 358 total documents
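
Passing `size=count` works here because the index holds only 358 documents. By default, Elasticsearch rejects searches where `from + size` exceeds 10,000 (the `index.max_result_window` setting), so a larger index would need the scroll API instead. The sketch below assumes a client that exposes `search`, `scroll`, and `clear_scroll` as the `elasticsearch` Python client does; `scroll_all`, `page_size`, and `keep_alive` are hypothetical names chosen for this example.

```python
def scroll_all(es, index, query, page_size=100, keep_alive='2m'):
    """Yield every hit for `query`, fetching `page_size` hits per request."""
    page = es.search(index=index, body=query, scroll=keep_alive, size=page_size)
    scroll_id = page.get('_scroll_id')
    try:
        # an empty page signals that the scroll is exhausted
        while page['hits']['hits']:
            for hit in page['hits']['hits']:
                yield hit
            page = es.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = page.get('_scroll_id', scroll_id)
    finally:
        # release the server-side scroll context
        if scroll_id:
            es.clear_scroll(scroll_id=scroll_id)
```

Usage would look like `results = list(scroll_all(es, index, query))`. The client library also provides `elasticsearch.helpers.scan`, which wraps the same scroll mechanism.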
