## What is a knowledge graph (KG)?

A knowledge graph (KG) represents a network of entities and the relationships among them. Entities are objects, events, situations, or concepts, and they are represented as nodes in the KG. A relationship between two entities is expressed as an edge, described as a predicate connecting a source node (subject) to a destination node (object). The network of subject-predicate-object relationships among thousands of entities constitutes knowledge, hence the name knowledge graph.
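
As a minimal illustration of the subject-predicate-object structure, a KG can be held in memory as a collection of triples. The entity and predicate names below are made up for this sketch and do not come from the workshop dataset.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # source node
    predicate: str  # edge label
    obj: str        # destination node (the "object" of the triple)


# Hypothetical triples, chosen only for illustration.
triples = [
    Triple("aspirin", "treats", "headache"),
    Triple("aspirin", "is_a", "medication"),
    Triple("headache", "is_a", "symptom"),
]

# Index outgoing edges by subject so the graph can be traversed.
outgoing = defaultdict(list)
for t in triples:
    outgoing[t.subject].append((t.predicate, t.obj))


def neighbors(node, predicate):
    """Return the nodes reached from `node` along edges labeled `predicate`."""
    return [obj for pred, obj in outgoing.get(node, []) if pred == predicate]


print(neighbors("aspirin", "treats"))
```

A graph database such as Neptune stores the same kind of structure, with indexing and a query language on top.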

## A text document is a KG

Every document potentially contains a KG. For example, the named entity recognition (NER) task extracts entities from human-written text. In this case, the document and the entities found in it are connected by a predicate stating that "these entities were found in the text".
Natural language is also inherently a KG, because the subject-predicate-object relationship is a basis of human language structure. Extracting such relationships between entities is a task closely related to NER, called relationship extraction.

Text data are unstructured. A KG is a useful way to convert unstructured text data into structured data.
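
To make this conversion concrete, the sketch below turns NER-style output into document-entity triples. The `Text` and `Category` fields mirror the shape of entities returned by Amazon Comprehend Medical, but the sample entities, the document ID, and the `has_entity` predicate are assumptions made up for illustration.

```python
def entities_to_triples(doc_id, entities, predicate="has_entity"):
    """Link a document to each distinct entity found in it.

    Returns (triples, entity_types): a list of (subject, predicate, object)
    tuples and a dict mapping each normalized entity text to its category.
    """
    triples = []
    entity_types = {}
    for ent in entities:
        text = ent["Text"].strip().lower()  # normalize so duplicates collapse
        if text in entity_types:
            continue  # keep only the first mention of each entity
        entity_types[text] = ent["Category"]
        triples.append((doc_id, predicate, text))
    return triples, entity_types


# Hand-written sample, not real service output.
sample_entities = [
    {"Text": "Aspirin", "Category": "MEDICATION"},
    {"Text": "headache", "Category": "MEDICAL_CONDITION"},
    {"Text": "aspirin", "Category": "MEDICATION"},
]

triples, entity_types = entities_to_triples("doc-001", sample_entities)
for triple in triples:
    print(triple)
```

Lowercasing is a deliberately crude form of entity normalization; a real pipeline would likely need something stronger, such as mapping entities to ontology codes.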


## Information extraction pipeline for KG

We are going to showcase an ambitious prototype of an automatic information extraction pipeline that builds a KG. The pipeline consists of an optical character recognition (OCR) tool, an NER tool, and a graph database. Entities are extracted from documents in PDF format, and the KG is built from the extracted entities and the relationships among them.

In this pipeline, Amazon Textract, Amazon Comprehend Medical, and Amazon Neptune serve as the OCR tool, the NER tool, and the graph database, respectively. We chose Gremlin as the graph traversal language because of its compatibility with Neptune ML as of August 2021.
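
The OCR and NER stages can be sketched with boto3 as follows. This is an illustrative outline, not the workshop's implementation: the function names, the polling interval, and the S3 location in the usage comment are assumptions. The clients are passed in as arguments so the logic can be exercised without AWS credentials, and the PDF is assumed to be stored in S3 and processed with Textract's asynchronous text detection API.

```python
import time


def extract_lines(textract, bucket, key, poll_seconds=5):
    """OCR a PDF stored in S3 and return its text lines in the order returned."""
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the asynchronous job leaves the IN_PROGRESS state.
    while True:
        page = textract.get_document_text_detection(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended with status {page['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent.
    lines = []
    while True:
        lines += [b["Text"] for b in page["Blocks"] if b["BlockType"] == "LINE"]
        token = page.get("NextToken")
        if not token:
            return lines
        page = textract.get_document_text_detection(JobId=job_id, NextToken=token)


def extract_entities(comprehend_medical, text):
    """Run medical NER on text and return (text, category, score) tuples."""
    response = comprehend_medical.detect_entities_v2(Text=text)
    return [(e["Text"], e["Category"], e["Score"]) for e in response["Entities"]]


# With real credentials the clients would be created like this
# (bucket and key are hypothetical):
#   import boto3
#   textract = boto3.client("textract")
#   comprehend_medical = boto3.client("comprehendmedical")
#   lines = extract_lines(textract, "my-bucket", "docs/report.pdf")
#   entities = extract_entities(comprehend_medical, "\n".join(lines))
```

Comprehend Medical limits the size of the text accepted per request, so long documents would need to be split into chunks before calling `extract_entities`.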

This workshop material is composed of four parts. 

- Document understanding solution using Amazon Textract and Amazon Comprehend Medical
- Building a Gremlin-compatible graph dataset
- Building a Gremlin graph on Neptune
- Training and testing GNN models using Neptune ML
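
For the second part, one common way to make a dataset Gremlin-compatible is the Neptune bulk loader CSV format, which uses one file for vertices (`~id`, `~label`, plus property columns) and one for edges (`~id`, `~from`, `~to`, `~label`). The sketch below serializes a small in-memory graph into both; the sample data, the ID scheme, and the `name:String` property are assumptions for illustration, not the workshop's actual schema.

```python
import csv
import io


def to_neptune_csv(vertices, edges):
    """Serialize vertices and edges into Neptune Gremlin bulk-load CSV text.

    vertices: dict mapping vertex id -> (label, name)
    edges: list of (from_id, edge_label, to_id)
    """
    v_buf = io.StringIO()
    v_writer = csv.writer(v_buf, lineterminator="\n")
    v_writer.writerow(["~id", "~label", "name:String"])
    for vid, (label, name) in vertices.items():
        v_writer.writerow([vid, label, name])

    e_buf = io.StringIO()
    e_writer = csv.writer(e_buf, lineterminator="\n")
    e_writer.writerow(["~id", "~from", "~to", "~label"])
    for i, (src, label, dst) in enumerate(edges):
        e_writer.writerow([f"e{i}", src, dst, label])  # edge ids must be unique

    return v_buf.getvalue(), e_buf.getvalue()


# Hypothetical sample graph.
vertices = {
    "doc-001": ("document", "report.pdf"),
    "ent-aspirin": ("MEDICATION", "aspirin"),
}
edges = [("doc-001", "has_entity", "ent-aspirin")]

vertices_csv, edges_csv = to_neptune_csv(vertices, edges)
print(vertices_csv)
print(edges_csv)
```

In practice the two outputs would be written to files in S3 and loaded through the Neptune bulk loader.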

This is still a prototype. We hope to expose shortfalls in the pipeline as a whole as well as in each component, and to start a constructive conversation toward a production-scale automatic KG solution.


### Preparation 

First, start a SageMaker notebook instance for Neptune ML using the CloudFormation template described at the following link.

https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning.html

* On the Select Template page, choose Next.
* On the Specify Details page, choose Next.
* On the Configure Stack Options page, choose Next.
* On the Review page, select the two check boxes.

Open the notebook instance named **aws-neptune-NeptuneML-test**.

Using the terminal in the notebook, clone the following repository. 

https://

Attach the following policies to the Neptune ML SageMaker execution IAM role (e.g. NeptuneMLQuickStart-NeptunMLCoreStac-ExecutionRole-XXXXXXXXX):
* AWSCodeCommitFullAccess
* AmazonEC2ContainerRegistryFullAccess
* AmazonS3FullAccess
* AmazonTextractFullAccess
* ComprehendMedicalFullAccess
* AmazonSageMakerFullAccess
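
The policies listed above can also be attached programmatically. The sketch below assumes an IAM client created with `boto3.client("iam")` and a caller with permission to modify the role. The helper name `attach_policies` is hypothetical, the role name in the usage comment is the placeholder from the text and must be replaced with the actual one, and the policy ARNs assume the standard `arn:aws:iam::aws:policy/` prefix for AWS managed policies.

```python
MANAGED_POLICIES = [
    "AWSCodeCommitFullAccess",
    "AmazonEC2ContainerRegistryFullAccess",
    "AmazonS3FullAccess",
    "AmazonTextractFullAccess",
    "ComprehendMedicalFullAccess",
    "AmazonSageMakerFullAccess",
]


def attach_policies(iam, role_name, policy_names=MANAGED_POLICIES):
    """Attach each AWS managed policy to the role and return the ARNs used."""
    # Assumption: every policy lives under the AWS managed policy prefix.
    arns = [f"arn:aws:iam::aws:policy/{name}" for name in policy_names]
    for arn in arns:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return arns


# Usage with real credentials (replace the placeholder role name):
#   import boto3
#   attach_policies(
#       boto3.client("iam"),
#       "NeptuneMLQuickStart-NeptunMLCoreStac-ExecutionRole-XXXXXXXXX",
#   )
```

These full-access policies are convenient for a workshop; a production setup would normally narrow them to the specific resources used.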


### Acknowledgement

- Prithiviraj Jothikumar (prithivj@): Priti is the original author of Knoma
- Phi Nguyen (phi@)
- Ryan Brand
- Miguel Romero
- Tatsuya Arai 