Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: Apache-2.0

# KGC 2024 MasterClass: Generating and analyzing knowledge graphs using GenAI and Neptune Analytics
# Notebook 0: Prep Data

This notebook prepares organizational graph data for the demo accompanying the master class. 

You can skip this notebook if you like: the results of the data prep are shared in a public S3 bucket. You may still find it useful to explore *how* the data was prepared. 

Our model for this masterclass is the following:

<img src="images/kgc_model.png">

In this notebook, we prepare three sets of data.


1. Base org data: organizations, persons, and industries. These are the blue boxes in the figure above.

2. Press release documents and their text content. This is the white box in the figure above.

3. Amazon Comprehend extraction of press releases in graph form. We link the extraction results to the base org data.  These are the yellow boxes in the figure above.

The next figure depicts our design.

<img src="images/kgc_design.png">

This notebook covers the bottom third of the figure: it preps the structured data and the entities extracted from unstructured data.

Run this notebook in a notebook instance with access to a public S3 bucket. It creates its results in the file system of the notebook instance.

## Get source data
We require the following source files:
- dbpedia_orgs.csv - Organizations
- comprehend_events_amazon_press_releases.20201118.v1.4.1.jsonl - Press releases
- er_curated.csv - Resolved organizations

Copy them from the public S3 bucket into a local *source* folder.

In [1]:
import graph_notebook as gn
config = gn.configuration.get_config.get_config()

region = config.aws_region
s3_bucket = f"s3://aws-neptune-customer-samples-{region}/kgc2024_na/source/"


In [None]:
%%bash -s "$s3_bucket"

aws s3 sync $1 source
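
After the sync, it can help to confirm that the three files listed above actually arrived before running the later cells. The following is a minimal sketch, not part of the original notebook; the helper name `missing_source_files` and the constant `EXPECTED_SOURCE_FILES` are hypothetical.

```python
from pathlib import Path

# File names are the ones listed in this notebook; "source" is the sync target folder.
EXPECTED_SOURCE_FILES = [
    "dbpedia_orgs.csv",
    "comprehend_events_amazon_press_releases.20201118.v1.4.1.jsonl",
    "er_curated.csv",
]

def missing_source_files(folder="source", expected=EXPECTED_SOURCE_FILES):
    """Return the expected file names that are not present in folder."""
    base = Path(folder)
    return [name for name in expected if not (base / name).is_file()]

missing = missing_source_files()
if missing:
    print("Missing source files:", missing)
else:
    print("All source files are present.")
```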

## Create prep output folders

In [None]:
!mkdir -p graphdata rawtext

## Create base org, person, and industry data

We will add a few orgs, industries, and persons to our graph as base KG data. The result is a set of CSV files written to the graphdata folder:

- seed_orgs.csv - Organizations
- seed_persons.csv - Persons belonging to orgs
- seed_industries.csv - Industries of organizations
- seed_rels.csv - Org relationships to persons, industries, and other orgs

We'll use DBPedia as a source.

To get a few of the orgs, run the following query on https://dbpedia.org/sparql against the default named graph http://dbpedia.org:

```
select * where
{
  values ?company {
    <http://dbpedia.org/resource/Rivian>
    <http://dbpedia.org/resource/Whole_Foods_Market>
    <http://dbpedia.org/resource/Amazon_(company)>
    <http://dbpedia.org/resource/Amazon_Web_Services>
    <http://dbpedia.org/resource/Lockheed_Martin>
  } .
  OPTIONAL { ?company dbo:type ?otype . } .
  OPTIONAL { ?company dbp:currentStatus ?pstatus . } .
  OPTIONAL { ?company dbp:industry ?pindustry . } .
  OPTIONAL { ?company dbo:keyPerson ?okeyPerson . } .
  OPTIONAL { ?company dbp:name ?pname . } .
  OPTIONAL { ?company dbp:parent ?pparent . } .
  OPTIONAL { ?company dbp:type ?ptype . } .
  OPTIONAL { ?company dbp:url ?purl . } .
  OPTIONAL { ?company foaf:homepage ?fhomepage . } .
  OPTIONAL { ?company foaf:name ?fname . } .
}
ORDER BY ?company
```

To get some of these people, run the following query:

```
select * where
{
  values ?person {
    <http://dbpedia.org/resource/Andy_Jassy>
    <http://dbpedia.org/resource/Jeff_Bezos>
    <http://dbpedia.org/resource/James_D._Taiclet>
  } .
  OPTIONAL { ?person foaf:name ?fname . } .
}
ORDER BY ?person
```
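
Both queries follow the same pattern: a VALUES block listing IRIs, followed by one OPTIONAL clause per property. If you want to vary the list of IRIs, a small helper can generate the query text. This is only a sketch; the function name `build_values_query` is hypothetical, and like the queries above it assumes the endpoint predefines prefixes such as `foaf:`.

```python
def build_values_query(var, iris, optional_props):
    """Build a SELECT query with a VALUES block and one OPTIONAL per property.

    optional_props maps a prefixed property name (e.g. "foaf:name")
    to the variable name that should receive its value.
    """
    values = "\n".join(f"    <{iri}>" for iri in iris)
    optionals = "\n".join(
        f"  OPTIONAL {{ ?{var} {prop} ?{target} . }}"
        for prop, target in optional_props.items()
    )
    return (
        "select * where\n"
        "{\n"
        f"  values ?{var} {{\n{values}\n  }}\n"
        f"{optionals}\n"
        "}\n"
        f"ORDER BY ?{var}"
    )

# Rebuild the person query shown above
person_query = build_values_query(
    "person",
    [
        "http://dbpedia.org/resource/Andy_Jassy",
        "http://dbpedia.org/resource/Jeff_Bezos",
        "http://dbpedia.org/resource/James_D._Taiclet",
    ],
    {"foaf:name": "fname"},
)
print(person_query)
```

The generated text can be pasted into the query form at https://dbpedia.org/sparql in the same way as the hand-written queries.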



In [None]:
import pandas as pd
import csv

# return cell value or empty string
def cell_val(dicto, key):
    if key in dicto:
        return dicto[key]
    else:
        return ""
    
# make delimited list of vals for a cell
def multi_val(dicto, key):
    if key in dicto:
        return ";".join(dicto[key])
    else:
        return ""
    
# make an ID, if val is already an IRI, return it, else build one
def as_id(val, objtype):
    if val.startswith("http://"):
        return val
    else:
        return f"{objtype}_{val}"

# boolean: is val an IRI?
def is_iri(val):
    return val.startswith("http://")

# Known persons
PERSONS={
    "http://dbpedia.org/resource/Andy_Jassy": "Andy Jassy",
    "http://dbpedia.org/resource/James_D._Taiclet": "James D. Taiclet",
    "http://dbpedia.org/resource/Jeff_Bezos": "Jeff Bezos"
}

# tracker data structures
orgs = {}
o2p = {}
o2i = {}
o2o = {}
industries={}

# build orgs dynamically; loop through results from DBPedia
df = pd.read_csv(filepath_or_buffer="source/dbpedia_orgs.csv")
for index, row in df.iterrows():
    
    # Company ID, which is an IRI, maps to org record in orgs dictionary
    company_id = row['company']
    org = {}
    if not (company_id in orgs):
        orgs[company_id] = org
    else:
        org = orgs[company_id]

    # If val is defined, add to org record
    def add_single(key, val):
        if str(val) == "nan":
            return
        org[key] = val
        
    # If val is defined, add to org multi-val record
    def add_multi(key, val):
        if str(val) == "nan":
            return
        if not key in org:
            org[key] = [val]
        elif not(val in org[key]):
            org[key].append(val)
        
    add_single('leType', row['otype'])
    add_single('leStatus', row['pstatus'])
    add_single('parent', row['pparent'])
    add_multi('industries', row['pindustry'])
    add_multi('names', row['pname'])
    add_multi('names', row['fname'])
    if row['okeyPerson'] in PERSONS:
        add_multi('persons', row['okeyPerson'])

    # track distinct industry nodes
    # and add Org-Industry edges
    if 'industries' in org:
        for i in org['industries']:
            industry_id = as_id(i, "industry")
            industries[industry_id] = 'dontcare' # need it to build industry nodes
            edge_id = f"eo2i_{company_id}_{industry_id}"
            if not (edge_id in o2i):
                o2i[edge_id] = [edge_id, company_id, industry_id, "hasIndustry"]
    # Add Org-Person edges
    if 'persons' in org:
        for p in org['persons']:
            edge_id = f"eo2p_{company_id}_{p}"
            if not (edge_id in o2p):
                o2p[edge_id] = [edge_id, company_id, p, "hasKnownPerson"]
    # Add Org-Org edges
    if 'parent' in org and is_iri(org['parent']):
        edge_id = f"eo2o_{company_id}_{org['parent']}"
        if not (edge_id in o2o):
            o2o[edge_id] = [edge_id, company_id, org['parent'], "hasParentCompany"]
 
# write persons to CSV
with open('graphdata/seed_persons.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label", "name"])
    for p in PERSONS:
        writer.writerow([p, "OrgKG_Person", PERSONS[p]])

# write orgs to CSV
with open('graphdata/seed_orgs.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label", "names", "leType", "leStatus"])
    for o in orgs:
        writer.writerow([o, "OrgKG_Organization", multi_val(orgs[o], 'names'), cell_val(orgs[o], 'leType'), cell_val(orgs[o], 'leStatus')])

# write industries to CSV
with open('graphdata/seed_industries.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label"])
    for i in industries:
        writer.writerow([i, "OrgKG_Industry"])

# write edges to CSV
with open('graphdata/seed_rels.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id","~from", "~to", "~label"])
    for e in o2p:
        writer.writerow(o2p[e])
    for e in o2i:
        writer.writerow(o2i[e])
    for e in o2o:
        writer.writerow(o2o[e])
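
As a quick sanity check on the edge file written above, you can count edges per label. The sketch below is an illustration rather than part of the original prep. The helper name `count_edges_by_label` is hypothetical, and the sample rows are made up; only the header layout comes from the code above.

```python
import csv
import io
from collections import Counter

def count_edges_by_label(csv_file):
    """Count rows per ~label in an edge CSV with a ~id,~from,~to,~label header."""
    reader = csv.DictReader(csv_file)
    return Counter(row["~label"] for row in reader)

# Hypothetical sample rows in the same layout as seed_rels.csv
sample = io.StringIO(
    "~id,~from,~to,~label\n"
    "e1,org_a,person_a,hasKnownPerson\n"
    "e2,org_a,industry_x,hasIndustry\n"
    "e3,org_b,industry_x,hasIndustry\n"
)
print(count_edges_by_label(sample))

# To inspect the real output instead:
# with open("graphdata/seed_rels.csv", newline="") as f:
#     print(count_edges_by_label(f))
```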


## Build press release documents

We have a set of press releases. These documents contain useful information that we would like to link to our base organization KG. 

In this step we build those documents.

The result is a CSV file written to the *graphdata* folder:

- prdocs.csv 

Many *.txt* files are also written to the *rawtext* folder. Each file is named `<docid>.txt`, where *docid* is the vertex ID of the document in the graph.

We do not load the text into the graph, at least not yet. Later (in 2-CreateLlamaIndex.ipynb) we will show how to use LlamaIndex with Neptune. There we will add the text and embeddings. 
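
To make the split between graph data and raw text concrete, here is a minimal sketch of how one JSONL record maps to a document vertex row plus a text payload. The sample record and the helper name `to_document_row` are hypothetical; only the field names (`metadata`, `document_id`, `title`, `keywords`, `raw_text`) are taken from this notebook, and the real `keywords` value may have a different shape.

```python
import json

# Hypothetical one-line JSONL record; only the field names come from this notebook.
sample_line = json.dumps({
    "metadata": {
        "document_id": "doc-0001",
        "title": "Example press release",
        "keywords": "example, sample",
    },
    "raw_text": "Example Corp announced an example product today.",
})

def to_document_row(jsonl_line):
    """Return (vertex_row, text) for one JSONL record."""
    record = json.loads(jsonl_line)
    meta = record["metadata"]
    row = [meta["document_id"], "DOCUMENT", meta["title"], meta["keywords"]]
    return row, record["raw_text"]

row, text = to_document_row(sample_line)
print(row)        # columns: ~id, ~label, title, keywords
print(len(text))  # the text itself goes to rawtext/<docid>.txt, not into the graph
```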



In [None]:
prdocs = []

# Open the JSONL file that contains the documents plus the Comprehend extraction. 
jsonObj = pd.read_json(path_or_buf="source/comprehend_events_amazon_press_releases.20201118.v1.4.1.jsonl", lines=True)
for index, row in jsonObj.iterrows():
        
    # extract metadata about current press release
    metadata=row['metadata']
    m_keywords=metadata['keywords']
    m_title=metadata['title']
    m_doc=metadata['document_id']

    # write text to a file for chunking/embedding later
    with open(f"rawtext/{m_doc}.txt", "w") as f:
        f.write(row['raw_text'])
    
    prdocs.append([m_doc, "DOCUMENT", m_title, m_keywords])

# write docs to CSV
with open('graphdata/prdocs.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label", "title", "keywords"])
    for p in prdocs:
        writer.writerow(p)


## Build Comprehend extraction results

We ran Amazon Comprehend to extract entities and events from the press releases. 

In this step, we build those extractions and link them to documents and organizations.

The result is a set of CSV files written to the *graphdata* folder:

- comprehend_nodes.csv
- comprehend_edges.csv

For more on this code, see blog post https://aws.amazon.com/blogs/database/building-a-knowledge-graph-in-amazon-neptune-using-amazon-comprehend-events/.
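
Two details of the cell below are easy to miss: node IDs are normalized by lowercasing and replacing whitespace with underscores, and only mentions whose group score meets the 0.95 threshold contribute names to a node. The sketch below isolates both steps. The helper names `normalize_id` and `names_above_threshold` are hypothetical, and the sample entity (texts and scores) is made up for illustration.

```python
def normalize_id(raw):
    """Lowercase and replace each run of whitespace with a single underscore."""
    return "_".join(raw.split()).lower()

def names_above_threshold(entity, threshold=0.95):
    """Return the set of mention texts whose GroupScore meets the threshold."""
    return {m["Text"] for m in entity["Mentions"] if m["GroupScore"] >= threshold}

# Hypothetical entity; the scores and texts are made up for illustration.
entity = {
    "Mentions": [
        {"Text": "Whole Foods Market", "Type": "ORGANIZATION", "GroupScore": 1.0},
        {"Text": "Whole Foods", "Type": "ORGANIZATION", "GroupScore": 0.97},
        {"Text": "the grocer", "Type": "ORGANIZATION", "GroupScore": 0.62},
    ]
}

# The first mention supplies the type and text used in the node ID
first = entity["Mentions"][0]
node_id = normalize_id("node__" + first["Type"] + "_" + first["Text"])
print(node_id)
print(sorted(names_above_threshold(entity)))
```

Lowering the threshold admits more name variants per node, at the cost of less certain groupings.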

In [None]:
# Comprehend vertices
class Vertex:
    def __init__(self):
        self.words = set()
        self.primaryName = ""
        self.entityType = ""
        self.id = ""
        
    def addWord(self, word):
        self.words.add(word)
    
    def setPrimaryName(self, name):
        self.primaryName = name
        
    def setEntityType(self, entityType):
        self.entityType = entityType
        
    def setId(self, rawIdString):
        self.id = "_".join(rawIdString.split()).lower() # replace all whitespace with underscores

    def getId(self):
        return self.id
    
    def toString(self):
        print("Entity " + self.getId() + "; type=" + self.entityType + "; words: " + str(self.words))
        
# Comprehend edges
class Edge:
    def __init__(self, fromEntity, toEntity, edgeType):
        self.fromEntity = fromEntity
        self.toEntity = toEntity
        self.edgeType = edgeType
    
    def getId(self):
        return "_".join(("edge__" + self.fromEntity.getId() + "_" + self.toEntity.getId() + "_" + self.edgeType).split()).lower()  # replace all whitespace with underscores
    
    def toString(self):
        print("Edge " + self.fromEntity.getId() + " --" + self.edgeType + "-> " + self.toEntity.getId())

nodeList = []
edgeList = []
nodeWordList = {}
# We will filter out names referring to each entity with less than 0.95 group certainty.
# You can change this threshold to be lower if you are tolerant of less certain values in your data set.
groupThreshold = 0.95

# Open the JSONL file that contains the documents plus the Comprehend extraction. 
jsonObj = pd.read_json(path_or_buf="source/comprehend_events_amazon_press_releases.20201118.v1.4.1.jsonl", lines=True)
for index, row in jsonObj.iterrows():
        
    # extract metadata about current press release
    metadata=row['metadata']
    document_id=metadata['document_id']
    documentNode = Vertex()
    documentNode.setId(document_id)
    
    # Comprehend Events references entities it refers to by index, so we need to retain the ordered list of entities
    # within the document
    docEntityList = []
    annotations = row['annotations']
    for entity in annotations["Entities"]:
        # convert each object under the "Entities" list into a Node
        theEntity = Vertex()
        theEntity.setPrimaryName(entity["Mentions"][0]["Text"])
        theEntity.setEntityType(entity["Mentions"][0]["Type"])
        theEntity.setId("node__" + entity["Mentions"][0]["Type"] + "_" + entity["Mentions"][0]["Text"])
        for mention in entity["Mentions"]:
            if (mention["GroupScore"] >= groupThreshold):
                theEntity.addWord(mention["Text"])

        docEntityList.append(theEntity)
        nodeList.append(theEntity)
        
    for event in annotations["Events"]:
        #convert each object under the "Events" list to a Node
        theEntity = Vertex()
        theEntity.setEntityType(event["Type"])
        theEntity.setPrimaryName(event["Triggers"][0]["Text"])
        theEntity.setId("node__event_" + document_id + "_" + event["Type"] + "_" + event["Triggers"][0]["Text"] + str(event["Triggers"][0]["BeginOffset"]))
        for trigger in event["Triggers"]:
            theEntity.addWord(trigger["Text"])

        nodeList.append(theEntity)

        # add edges between the event node and the entity node, 
        # annotated with a label describing the Comprehend Event role assigned to the entity in the event.
        for argument in event["Arguments"]:
            edgeList.append(Edge(theEntity, docEntityList[argument["EntityIndex"]], argument["Role"]))
        
        # add an edge between the document and the event nodes
        edgeList.append(Edge(documentNode, theEntity, "EVENT"))
        
# write all of our nodes to a CSV file
with open('graphdata/comprehend_nodes.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['~id','~label','primaryName','names'])
    for node in nodeList:
        for word in node.words:
            # there will be a row for each word assigned to the entity, 
            # but Neptune will aggregate them into single set of words on the node
            writer.writerow([node.id, node.entityType, node.primaryName, word])

# write all of our edges to a CSV file
with open('graphdata/comprehend_edges.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['~id','~from','~to','~label'])
    for edge in edgeList:
        writer.writerow([edge.getId(), edge.fromEntity.id, edge.toEntity.id, edge.edgeType])
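
The node file contains one row per word for each entity, so the same `~id` can appear on several rows. To preview locally what the combined set of names per node might look like, you can group the rows by `~id`. This is only a local approximation and makes no claim about how the Neptune loader works internally. The helper name `names_by_node` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

def names_by_node(csv_file):
    """Group the 'names' column by ~id into one set of names per node."""
    grouped = defaultdict(set)
    for row in csv.DictReader(csv_file):
        grouped[row["~id"]].add(row["names"])
    return dict(grouped)

# Hypothetical rows in the layout of comprehend_nodes.csv
sample = io.StringIO(
    "~id,~label,primaryName,names\n"
    "node__organization_amazon,ORGANIZATION,Amazon,Amazon\n"
    "node__organization_amazon,ORGANIZATION,Amazon,Amazon.com\n"
    "node__location_seattle,LOCATION,Seattle,Seattle\n"
)
for node_id, names in names_by_node(sample).items():
    print(node_id, sorted(names))
```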


## Link comprehend orgs to base KG orgs

Press releases mention organizations like Amazon and Whole Foods, and the names vary. One press release might use *amazon.com*, another *Amazon*. We would like to link each mention to the base organization in our KG. If we can do that, the press release becomes useful structured data that enhances our understanding of organizations we have already established in the graph.

There is a lot of noise too. For example, Comprehend shows *agencies* and *developers* as extracted organizations. We wish to ignore *common nouns* and consider only *proper nouns*. 

In this step we do two things:
1. Create orgs for any orgs mentioned in press releases that are NOT in our base orgs.
2. Link each org mentioned in the press release to a base org. This ends up being a quick-and-dirty entity resolution exercise. 

The result is a set of CSV files in *graphdata* folder:

- discovered_orgs.csv - Extracted orgs promoted to base orgs
- resolved_org_rels.csv - Edges connecting extracted orgs to base orgs

Note: we hand-curated these files. In the class we discuss design approaches for resolving these entities.
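
The curated file drives a simple lookup: each extracted name is either promoted to a new org, mapped to an existing base org, or ignored as noise. The sketch below shows that three-way decision on made-up rows. The helper name `build_name_index` and the sample data are hypothetical; the column names (`Org`, `Discovered`, `Resolved`) and the `resolved_org_` ID prefix follow the code in this notebook.

```python
def build_name_index(curated_rows):
    """Map an extracted org name to the ID of the org it resolves to.

    Each row is a dict with keys Org, Discovered and Resolved, as in er_curated.csv.
    """
    index = {}
    for row in curated_rows:
        if row["Discovered"] == "Y":
            # promoted to a new base org with a generated ID
            index[row["Org"]] = f"resolved_org_{row['Org']}"
        elif row["Resolved"]:
            # resolves to an existing base org
            index[row["Org"]] = row["Resolved"]
        # otherwise the name is treated as noise and left out
    return index

# Hypothetical curated rows
curated = [
    {"Org": "amazon.com", "Discovered": "",
     "Resolved": "http://dbpedia.org/resource/Amazon_(company)"},
    {"Org": "Example Corp", "Discovered": "Y", "Resolved": ""},
    {"Org": "agencies", "Discovered": "", "Resolved": ""},
]
index = build_name_index(curated)
print(index.get("amazon.com"))
print(index.get("Example Corp"))
print(index.get("agencies"))
```

Names that are absent from the index, such as the common noun in the last row, get no `resolvesToOrg` edge.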


In [None]:
# Get curated list of extracted primary org names and how to resolve them
discovered_orgs=[]
org_name_to_id={}
df = pd.read_csv(filepath_or_buffer="source/er_curated.csv")
for index, row in df.iterrows():
    org_name = row['Org']
    discovered = row['Discovered']
    resolved = row['Resolved']
    if discovered == 'Y':
        org_id = as_id(org_name, "resolved_org")
        discovered_orgs.append([org_id, "OrgKG_Organization", org_name])
        org_name_to_id[org_name] = org_id
    elif str(resolved) != "nan":
        org_name_to_id[org_name] = resolved

# Link observed orgs to base orgs
resolved_orgs={}
df = pd.read_csv(filepath_or_buffer="graphdata/comprehend_nodes.csv")
for index, row in df.iterrows():
    if row['~label']=='ORGANIZATION':
        prname=row['primaryName']
        if prname in org_name_to_id:
            org_id=row["~id"]
            resolved_id=org_name_to_id[prname]
            edge_id = f"ereso_{org_id}_{resolved_id}"
            if not (edge_id in resolved_orgs):
                resolved_orgs[edge_id] = [edge_id, org_id, resolved_id, "resolvesToOrg"]
            

# write discovered orgs to CSV
with open('graphdata/discovered_orgs.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label", "name"])
    for o in discovered_orgs:
        writer.writerow(o)

# write edges to CSV
with open('graphdata/resolved_org_rels.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id","~from", "~to", "~label"])
    for e in resolved_orgs:
        writer.writerow(resolved_orgs[e])
