# Indexing hackathon documents to Elasticsearch

This notebook indexes the 358 documents from the June 2019 hackathon into Elasticsearch. Each document is stored with its corresponding metadata.


### Configuration
First, ensure that the appropriate credentials are stored in your AWS credentials file at `~/.aws/credentials`.

These should be stored under the `wmuser` profile with something like:

```
[wmuser]
aws_access_key_id = WMUSER_ACCESS_KEY
aws_secret_access_key = WMUSER_SECRET_KEY
```

> Note that this profile must be specified by name when creating the `boto3` session.
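
Before creating the session, it can help to confirm that the profile is actually present. The following is a minimal sketch that uses only the standard library; `has_profile` is a hypothetical helper (not part of `boto3`), and the default path is the one mentioned above.

```python
import configparser
import os


def has_profile(profile_name, credentials_path="~/.aws/credentials"):
    """Return True if the credentials file defines both keys for the profile.

    `has_profile` is a hypothetical helper, not part of boto3.
    """
    config = configparser.ConfigParser()
    # ConfigParser.read silently skips files that do not exist
    config.read(os.path.expanduser(credentials_path))
    if not config.has_section(profile_name):
        return False
    required = ("aws_access_key_id", "aws_secret_access_key")
    return all(config.has_option(profile_name, key) for key in required)
```

Calling `has_profile("wmuser")` should return `True` once the profile shown above has been added.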

### Requirements

```
pip install requests-aws4auth==0.9
pip install elasticsearch==7.0.2
pip install tika==1.19
pip install PyPDF2==1.26.0
pip install boto3==1.9.172
pip install beautifulsoup4==4.5.3
```

## Connecting to Elasticsearch
First we should connect to Elasticsearch using AWS authentication. This will make it easy to index each parsed document later.

In [1]:
import boto3, json
import os
from hashlib import sha256
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
from tika import parser
import PyPDF2
from bs4 import BeautifulSoup

region = 'us-east-1'
service = 'es'
eshost = 'search-world-modelers-dev-gjvcliqvo44h4dgby7tn3psw74.us-east-1.es.amazonaws.com'

session = boto3.Session(region_name=region, profile_name='wmuser')
credentials = session.get_credentials()
credentials = credentials.get_frozen_credentials()
access_key = credentials.access_key
secret_key = credentials.secret_key
token = credentials.token

aws_auth = AWS4Auth(
    access_key,
    secret_key,
    region,
    service,
    session_token=token
)



In [2]:
es = Elasticsearch(
    hosts = [{'host': eshost, 'port': 443}],
    http_auth=aws_auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

print(json.dumps(es.info(), indent=2))

{
  "name": "ZhaR9MU",
  "cluster_name": "342635568055:world-modelers-dev",
  "cluster_uuid": "nGeAO1lMTKaG6_LOpSg17w",
  "version": {
    "number": "6.7.0",
    "build_flavor": "oss",
    "build_type": "zip",
    "build_hash": "8453f77",
    "build_date": "2019-04-17T05:34:35.022392Z",
    "build_snapshot": false,
    "lucene_version": "7.7.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}
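
The cluster reports version 6.7.0, while the requirements above pin the `elasticsearch` client to 7.0.2. The client's documentation recommends using the client major version that matches the cluster, so a quick check can be useful. This is a minimal sketch; `major_versions_match` is a hypothetical helper, and it assumes the `info` dictionary has the shape printed above.

```python
def major_versions_match(info, client_version):
    """Compare the cluster's major version with the client's.

    `info` is assumed to look like the output of `es.info()`;
    `client_version` is a tuple such as `elasticsearch.__version__`.
    """
    server_major = int(info["version"]["number"].split(".")[0])
    return server_major == client_version[0]


# values taken from the output and requirements shown in this notebook
sample_info = {"version": {"number": "6.7.0"}}
print(major_versions_match(sample_info, (7, 0, 2)))
```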


## Connecting to S3
Next we should establish a connection with the S3 `world-modelers` bucket so that we can store each file to S3.

In [3]:
profile = "wmuser"
bucket_name = "world-modelers"

In [4]:
session = boto3.Session(profile_name=profile)

s3 = session.resource("s3")
# create the client from the session so that it uses the `wmuser` profile
s3_client = session.client("s3")

## Document Parsing Functions
Below is a set of functions that extract text from PDF and HTML files and collect the appropriate metadata.

In [5]:
def extract_tika(file_path):
    """
    Take in a file path of a PDF and return its Tika extraction
    https://github.com/chrismattmann/tika-python
    
    Returns: a tuple of (extracted text, extracted metadata)
    """
    tika_data = parser.from_file(file_path)
    tika_extraction = tika_data.pop('content')
    tika_metadata = tika_data.pop('metadata')
    return (tika_extraction, tika_metadata)

def extract_pypdf2(file_path):
    """
    Take in a file path of a PDF and return its PyPDF2 extraction
    https://github.com/mstamy2/PyPDF2
    """
    
    pypdf2_extraction = ''
    # read every page inside a `with` block so the file is always closed
    with open(file_path, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        page_count = pdfReader.numPages
        for page in range(page_count):
            pageObj = pdfReader.getPage(page)
            page_text = pageObj.extractText()
            pypdf2_extraction += page_text
    return pypdf2_extraction

def extract_bs4(file_path):
    """
    Take in a file path of an HTML document and return its Beautiful Soup extraction
    https://www.crummy.com/software/BeautifulSoup/bs4/doc/
    """        
    # the "lxml" parser requires the lxml package to be installed
    with open(file_path, 'r') as htmlFileObj:
        soup = BeautifulSoup(htmlFileObj, "lxml")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out
    # get text
    text = soup.get_text()        
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    bs4_extraction = '\n'.join(chunk for chunk in chunks if chunk)
    return bs4_extraction

def parse_pdfinfo(tika_metadata, doc):
    """
    Takes in pdfinfo from Tika and a document and enriches the document
    with metadata fields
    """
    # use the metadata that was passed in rather than re-running Tika
    t_m = tika_metadata
    title = t_m.get('title',None)
    date = t_m.get('Creation-Date',t_m.get('created',None))
    author = t_m.get('Author',None)
    last_modified = t_m.get('Last-Modified',None)
    if title:
        doc['title'] = title
    if date:
        doc['creation_date'] = date
    if author:
        doc['author_name'] = author
    if last_modified:
        doc['modification_date'] = last_modified
    return doc

def parse_document(file_path):
    """
    Take in the full path to a file and perform appropriate text extraction
    as well as metadata enrichment (if a PDF, using pdfinfo fields)
    """
    file_name = os.path.basename(file_path)
    file_type = os.path.splitext(file_path)[1]
    
    # sha256 hash the raw contents of the file to generate a unique ID
    with open(file_path, 'rb') as f:
        raw = f.read()
    _id = sha256(raw).hexdigest()
    
    doc = {'_id': _id,
           'file_name': file_name, 
           'file_type': file_type}
    
    extracted_text = {}
    
    # set tika_metadata to None and overwrite it
    # if we are able to extract pdfinfo with Tika
    tika_metadata = None
    
    if file_type == '.pdf':
        doc['file_type'] = file_type
        try:
            tika_extraction, tika_metadata = extract_tika(file_path)
            extracted_text['tika'] = tika_extraction
        except Exception as e:
            print(f"Tika extraction failed: {e}")

        try:
            extracted_text['pypdf2'] = extract_pypdf2(file_path)
        except Exception as e:
            print(f"PyPDF2 extraction failed: {e}")
    elif file_type == '.html':
        try:
            extracted_text['bs4'] = extract_bs4(file_path)
        except Exception as e:
            print(f"BS4 extraction failed: {e}")
    
    if tika_metadata:
        doc = parse_pdfinfo(tika_metadata, doc)
    
    doc['extracted_text'] = extracted_text
    
    return doc
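
`parse_document` compares the extension against `'.pdf'` and `'.html'` exactly, so a file named `report.PDF` or `page.htm` would get no extracted text. A minimal sketch of a more forgiving check follows; the helper `normalized_file_type` and the decision to treat `.htm` as HTML are assumptions, not part of the original pipeline.

```python
import os


def normalized_file_type(file_path):
    """Return a lowercase extension, mapping '.htm' to '.html'."""
    extension = os.path.splitext(file_path)[1].lower()
    # hypothetical alias table; extend as needed
    aliases = {".htm": ".html"}
    return aliases.get(extension, extension)


for name in ["report.PDF", "page.htm", "notes.txt"]:
    print(name, "->", normalized_file_type(name))
```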

## Indexing documents
Now that we have a connection with Elasticsearch and S3 we can index the documents. 

1. First, we will push the raw file (PDF or HTML) to S3.
2. Then we will perform text and metadata extraction in order to create a document to index.
3. Finally, we will index the document to Elasticsearch.

> Note: you must update the `dir_path` to the appropriate path to the hackathon documents

In [6]:
# Update this path with the correct path to the hackathon documents
dir_path = '/Users/brandon/Desktop/Docs_20-May-2019'
raw_files = os.listdir(dir_path + '/pdf')

Note that when we index the document we pop its `_id` and store that as its Elasticsearch `_id`. This ensures that re-indexing a document with the same `_id` updates the existing document in place rather than creating a new one in Elasticsearch.
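
The same separation of `_id` from the document body applies when indexing in batches. As a sketch under stated assumptions, the generator below builds action dictionaries in the format accepted by `elasticsearch.helpers.bulk`; `to_bulk_actions` is a hypothetical helper and the sample documents are made up for illustration.

```python
def to_bulk_actions(docs, index, doc_type):
    """Yield one bulk action per parsed document.

    Each action carries the document's hash as the Elasticsearch `_id`,
    so re-running the load overwrites documents instead of duplicating them.
    """
    for doc in docs:
        body = dict(doc)          # copy so the caller's dict is not mutated
        doc_id = body.pop("_id")  # the hash becomes the Elasticsearch _id
        yield {
            "_index": index,
            "_type": doc_type,
            "_id": doc_id,
            "_source": body,
        }


# illustrative documents only
sample_docs = [
    {"_id": "abc", "file_name": "a.pdf", "file_type": ".pdf"},
    {"_id": "def", "file_name": "b.pdf", "file_type": ".pdf"},
]
actions = list(to_bulk_actions(sample_docs, "migration", "document"))
print(actions[0])
```

With a live connection, the actions could be passed to `elasticsearch.helpers.bulk(es, actions)` in place of a per-document `es.index` call.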

Because each `_id` is a hash of the file's raw contents, exact duplicates collapse into a single Elasticsearch document. Of the 358 original documents, 356 were unique once hashed, so there were at least 2 exact duplicates.
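
To see which files collapse to the same `_id`, the hashes can be grouped before indexing. The following is a minimal sketch using only the standard library; `find_duplicates` is a hypothetical helper, and it takes a list of file paths rather than assuming any particular directory layout.

```python
from collections import defaultdict
from hashlib import sha256


def find_duplicates(file_paths):
    """Group paths by the sha256 of their contents; return groups of size > 1."""
    by_hash = defaultdict(list)
    for path in file_paths:
        with open(path, "rb") as f:
            by_hash[sha256(f.read()).hexdigest()].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

For example, `find_duplicates([f"{dir_path}/pdf/{name}" for name in raw_files])` would list the groups of byte-identical PDFs.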

In [7]:
count = 0
for file_path in raw_files:    
    
    file_name = f"{dir_path}/pdf/{file_path}"
    s3_key = f"documents/migration/{file_path}"
    s3_uri = f"https://world-modelers.s3.amazonaws.com/{s3_key}"

    
    #############################################
    ### 1. Upload raw file to S3 ################
    #############################################
    s3_client.upload_file(file_name, 
                          bucket_name, 
                          s3_key)
    
    
    #############################################
    ### 2. Parse document #######################
    #############################################
    doc = parse_document(file_name)
    doc['stored_url'] = s3_uri

    
    #############################################
    ### 3. Index parsed document to Elasticsearch
    #############################################  
    index = 'migration'
    doc_type = 'document'
    
    # create the index if it does not exist
    if not es.indices.exists(index):
        es.indices.create(index)
        print(f"Created ES index: {index}")
        
    es.index(index=index, doc_type=doc_type, id=doc.pop('_id'), body=doc)
    count += 1
    if count % 25 == 0:
        print(count)

Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 46: ordinal not in range(256)




Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 46: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 39: ordinal not in range(256)
25
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)




Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)
50
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 81: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 72: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)
75
Tika extraction failed: 'latin-1' codec can't encode character '\uf07c' in position 80: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 81: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode characters in position 64-65:



225
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)
250
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in position 40: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf07c' in position 101: ordinal not in range(256)
Tika extraction failed: 'latin-1' codec can't encode character '\uf03a' in posi
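
The failures above all involve characters such as `'\uf03a'` and `'\uf07c'`, which lie in Unicode's Private Use Area and cannot be encoded as Latin-1. One possible source, which is an assumption and not confirmed by this notebook, is the file name itself. A minimal sketch for listing such characters in a string follows; `non_latin1_chars` is a hypothetical helper and the sample file name is made up.

```python
def non_latin1_chars(text):
    """Return (position, character) pairs that Latin-1 cannot encode."""
    # Latin-1 covers code points 0 through 255 only
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 255]


# made-up file name containing a private-use character
sample_name = "Report\uf03a Summary.pdf"
print(non_latin1_chars(sample_name))
```

Running the helper over `raw_files` would show whether any file names contain such characters.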

## Retrieving documents
Now we can query the Elasticsearch `migration` index using a variety of querying approaches outlined [here](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl.html).

Below is an example of a boolean search that uses a `query_string` query with Lucene query syntax:

In [11]:
query = {
    "query": {
        "query_string" : {
            "default_field" : "extracted_text.tika", # Ensure we use the correct field (could search on `title` as well)
            "query" : "refugee AND aid AND (addis OR NGO)" # Lucene query syntax
        }
    }
}
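
Queries like the one above can also be assembled programmatically. The following is a minimal sketch; `build_query_string` is a hypothetical helper that joins required terms with `AND` and optional alternatives with `OR`, and it does not escape Lucene's reserved characters.

```python
def build_query_string(required, any_of=(), default_field="extracted_text.tika"):
    """Build a `query_string` query body from plain terms.

    Terms are inserted as-is: Lucene reserved characters are not escaped.
    """
    clauses = list(required)
    if any_of:
        clauses.append("(" + " OR ".join(any_of) + ")")
    return {
        "query": {
            "query_string": {
                "default_field": default_field,
                "query": " AND ".join(clauses),
            }
        }
    }


example_query = build_query_string(["refugee", "aid"], any_of=["addis", "NGO"])
print(example_query["query"]["query_string"]["query"])
```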

In [12]:
results = es.search(index=index, body=query)['hits']['hits']

In [13]:
results

[{'_index': 'migration',
  '_type': 'document',
  '_id': 'a3bd90c5473edea43d1db9c65555e348c3f8e7e597c31b48aa001eed2fd4cd73',
  '_score': 9.433365,
  '_source': {'file_name': 'Aid_workers_killed,_kidnapped_and_arrested_Dec-17.pdf',
   'file_type': '.pdf',
   'creation_date': '2018-04-11T15:52:08Z',
   'modification_date': '2018-04-11T15:52:08Z',
   'extracted_text': {'tika': "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAid in Danger Incident Trends \nKIK | Aid workers killed, injured, kidnapped or assaulted\n\nKilled \n•\t 179 aid workers were reportedly killed in 98 incidents in 20 countries. \n\n•\t The highest numbers of deaths occurred in Syria (66), South Sudan (30), CAR (15), \nNigeria (12) and Bangladesh (10). \n\n•\t Over half of reported incidents were attributed to non-state actors (57 out of \n98 incidents). State actors were reportedly responsible for 33 incidents. For eight \nincidents, the identity of the perpetrators is uncl

Results are stored as an array. Each item in the array is an object which conforms to the [schema described here](https://github.com/WorldModelers/Integration/blob/master/Plans/Document-Storage-Plan.md).

The `_id` key is a unique identifier for that document in the `migration` index. The actual document is stored in the `_source` key.

In [18]:
results[0]['_id']

'a3bd90c5473edea43d1db9c65555e348c3f8e7e597c31b48aa001eed2fd4cd73'
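
Because each hit keeps its metadata next to `_source`, a short helper can flatten the results into something easier to scan. This is a minimal sketch; `summarize_hits` is a hypothetical helper, and the sample hit below is abbreviated from the output shown above.

```python
def summarize_hits(hits):
    """Return one small dict per hit with the ID, score, and file name."""
    return [
        {
            "id": hit["_id"],
            "score": hit.get("_score"),
            # `_source` holds the indexed document itself
            "file_name": hit["_source"].get("file_name"),
        }
        for hit in hits
    ]


# abbreviated copy of the first hit shown in this notebook
sample_hits = [
    {
        "_index": "migration",
        "_type": "document",
        "_id": "a3bd90c5473edea43d1db9c65555e348c3f8e7e597c31b48aa001eed2fd4cd73",
        "_score": 9.433365,
        "_source": {
            "file_name": "Aid_workers_killed,_kidnapped_and_arrested_Dec-17.pdf",
            "file_type": ".pdf",
        },
    }
]
print(summarize_hits(sample_hits))
```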

You can retrieve documents by their `_id` if you wish:

In [20]:
es.get(index=index, doc_type=doc_type, id=results[0]['_id'])

{'_index': 'migration',
 '_type': 'document',
 '_id': 'a3bd90c5473edea43d1db9c65555e348c3f8e7e597c31b48aa001eed2fd4cd73',
 '_version': 1,
 '_seq_no': 60,
 '_primary_term': 1,
 'found': True,
 '_source': {'file_name': 'Aid_workers_killed,_kidnapped_and_arrested_Dec-17.pdf',
  'file_type': '.pdf',
  'creation_date': '2018-04-11T15:52:08Z',
  'modification_date': '2018-04-11T15:52:08Z',
  'extracted_text': {'tika': "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAid in Danger Incident Trends \nKIK | Aid workers killed, injured, kidnapped or assaulted\n\nKilled \n•\t 179 aid workers were reportedly killed in 98 incidents in 20 countries. \n\n•\t The highest numbers of deaths occurred in Syria (66), South Sudan (30), CAR (15), \nNigeria (12) and Bangladesh (10). \n\n•\t Over half of reported incidents were attributed to non-state actors (57 out of \n98 incidents). State actors were reportedly responsible for 33 incidents. For eight \nincidents, t

If you would like to retrieve all the documents you can do so with a `match_all` query:

In [21]:
query = {
    "query": {
        "match_all": {}
    }
}

count = es.count(index=index, body=query)['count']
print("There are {0} total documents in the {1} index.".format(count,index))

There are 356 total documents in the migration index.


In [22]:
results = es.search(index=index, body=query, size=count)['hits']['hits']

In [23]:
print("This query returned {} total documents".format(len(results)))

This query returned 356 total documents
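
Passing `size=count` works here because the index is small. Elasticsearch limits `from + size` to `index.max_result_window`, which defaults to 10,000, so larger indices need the scroll API instead. The following is a minimal sketch; `iter_all_hits` is a hypothetical helper, and the page size and scroll timeout are arbitrary choices.

```python
def iter_all_hits(client, index, query, page_size=100, keep_alive="2m"):
    """Yield every hit for `query` by paging with the scroll API."""
    page = client.search(index=index, body=query, scroll=keep_alive, size=page_size)
    scroll_id = page["_scroll_id"]
    try:
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit
            # fetch the next page; the scroll ID may change between calls
            page = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = page["_scroll_id"]
    finally:
        # free the server-side scroll context
        client.clear_scroll(scroll_id=scroll_id)
```

With the connection above, `list(iter_all_hits(es, index, query))` would collect every document. The client library also ships `elasticsearch.helpers.scan`, which wraps the same API.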
