Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: Apache-2.0

# Ask the Graph
# Notebook 1: Prep Unstructured Data

In this notebook we prepare unstructured data, a set of press releases about organizations. This data complements the structured organization data we prepared in notebook 0. 

You DO NOT need to run this notebook. The notebook produces output that is already prepared for you and available in a public bucket. But you may wish to review how we prepared that data. In that case, follow along with the logic below.

Here is the input, sourced from s3://aws-neptune-customer-samples/tmls2024/source/. We download a local copy to the source folder:

- comprehend_events_amazon_press_releases.20201118.v1.4.1.jsonl: Press releases, including title, keywords, and text. This file also contains entities and events extracted from press releases using Amazon Comprehend.
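
Each line of the JSONL file is one press release. The sketch below shows only the fields that the code later in this notebook reads. The sample record is invented for illustration, and the real file may carry additional fields or different value types (for example, `keywords` may not be a plain string).

```python
import json

# Invented one-line sample that mirrors the fields this notebook reads.
sample_line = json.dumps({
    "metadata": {"document_id": "doc-001", "title": "Example release", "keywords": "cloud, launch"},
    "raw_text": "Example Corp acquired Sample Inc.",
    "annotations": {
        "Entities": [
            {"Mentions": [{"Text": "Example Corp", "Type": "ORGANIZATION", "GroupScore": 1.0}]},
            {"Mentions": [{"Text": "Sample Inc", "Type": "ORGANIZATION", "GroupScore": 1.0}]},
        ],
        "Events": [
            {
                "Type": "CORPORATE_ACQUISITION",
                "Triggers": [{"Text": "acquired", "BeginOffset": 13}],
                "Arguments": [
                    {"EntityIndex": 0, "Role": "INVESTOR"},
                    {"EntityIndex": 1, "Role": "INVESTEE"},
                ],
            }
        ],
    },
})

# Parse the line and pull out the pieces used in later cells.
record = json.loads(sample_line)
doc_id = record["metadata"]["document_id"]
first_entity = record["annotations"]["Entities"][0]["Mentions"][0]["Text"]
first_event = record["annotations"]["Events"][0]["Type"]
print(doc_id, first_entity, first_event)
```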

We produce the following output locally:
- graphdata/lpg - CSV files of embeddings and extracted entities to bulk-load as LPG data to Neptune. 
- graphdata/rdf - Turtle files to bulk-load to Neptune as RDF data. (TODO - is this needed?)
- documents - Press releases as .txt files 
- chunks - Chunks of press releases as .txt files
- summaries - Summaries of press releases as .txt files

That same data is available in:

- s3://aws-neptune-customer-samples/tmls2024/graphdata/
- s3://aws-neptune-customer-samples/tmls2024/documents/
- s3://aws-neptune-customer-samples/tmls2024/chunks/
- s3://aws-neptune-customer-samples/tmls2024/summaries/

Specific files that we produce are the following. LPG files in graphdata/lpg are:

- documents.csv - Press release documents with their ID, title, and keywords.
- summaries.csv - Embedding of each document's summary, keyed by the same document IDs as documents.csv.
- extractions.csv - Entities and events extracted from press releases.
- extraction_rels.csv - Links from documents to their extracted events, and from events to the entities that participate in them.
- resolved_entities.csv - Link extracted entities to existing orgs, persons, industries, locations, products, and services in the graph.
- resolution_links.csv - Resolution matches that can be linked from an extracted node to a structured node.
- chunks.csv - Document chunks and their embeddings.
- chunk2doc.csv - Link chunks to docs.
- entity_embeddings.csv - Embeddings of entities for searchability.

RDF files in graphdata/rdf are:
- documents.ttl - Press release documents.
- extractions.ttl - Entities and events extracted from press releases.
- resolved_entities.ttl - Link extracted entities to existing orgs, persons, industries, locations, products, and services in the graph.

TODO mention embeddings are kept in OpenSearch for RDF..

TODO - data model, including what we've built so far

In the next notebook, we load prepared data into Neptune for query.

## Install a few dependencies
We use LangChain very lightly.

In [None]:
!pip install -qU langchain-text-splitters langchain-community unstructured

## Get source data and create output folders

In [None]:
!aws s3 sync s3://aws-neptune-customer-samples-us-east-1/tmls2024/source/ source

In [None]:
%%bash 
mkdir -p graphdata graphdata/rdf graphdata/lpg summaries chunks documents

## Build press release documents

We have a set of press releases. These documents contain useful information that we would like to link to our base organization KG. 

In this step we build those documents.

The results are written to the *graphdata* folder:

- lpg/documents.csv 
- rdf/documents.ttl

We also write each press release as a *.txt* file in the *documents* folder. The name of each file is *docid*.txt, where *docid* is the document ID from the source metadata. The document's vertex ID in the graph is a URI built from that same *docid*.
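
As a rough illustration of the LPG output, the sketch below writes one document vertex in the same column layout as documents.csv: the `~id` and `~label` system columns followed by property columns. `BASE_URI` and `make_uri_sketch` are hypothetical stand-ins for `helpers.make_uri`, whose real base URI is not shown in this notebook, and the row values are invented.

```python
import csv
import io

BASE_URI = "http://example.org/"  # hypothetical; the real base lives in helpers.make_uri

def make_uri_sketch(local):
    """Stand-in for helpers.make_uri: prefix a local name with a base URI."""
    return BASE_URI + local

# One invented document row: vertex ID, label, then properties.
rows = [[make_uri_sketch("Document/doc-001"), "Document", "doc-001", "Example release", "cloud, launch"]]

# Write to an in-memory buffer instead of graphdata/lpg/documents.csv.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["~id", "~label", "docuuid", "title", "keywords"])
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

The `csv` module quotes the keywords cell because it contains a comma, so the value survives a round trip through a CSV reader.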
    

In [None]:
import helpers
def make_document_uri(docid):
    return helpers.make_uri(f"Document/{docid}")

## The RAG-prep part: Summarize docs, chunk, embed, and extract files

In [None]:
import pandas as pd
import csv, json
import helpers
from rdflib import Graph, Literal, RDF, RDFS, URIRef, XSD, OWL, BNode, DC, SKOS

prdocs = []

# Open the JSONL file that contains the documents plus the Comprehend extraction. 
jsonObj = pd.read_json(path_or_buf="source/comprehend_events_amazon_press_releases.20201118.v1.4.1.jsonl", lines=True)
for index, row in jsonObj.iterrows():
        
    # extract metadata about current press release
    metadata=row['metadata']
    m_keywords=metadata['keywords']
    m_title=metadata['title']
    m_doc=metadata['document_id']

    # write text to a file for chunking/embedding later
    with open(f"documents/{m_doc}.txt", "w") as f:
        f.write(row['raw_text'])
    
    prdocs.append([make_document_uri(m_doc), "Document", m_doc, m_title, m_keywords])

# write docs to CSV for LPG
with open('graphdata/lpg/documents.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label", "docuuid", "title", "keywords"])
    for p in prdocs:
        writer.writerow(p)

# write docs for RDF
rdf_file = helpers.rdf_open()
for p in prdocs:
    helpers.rdf_write(rdf_file, URIRef(p[0]), RDF.type, helpers.make_uri("Document"))
    helpers.rdf_write(rdf_file, URIRef(p[0]), DC.title, Literal(p[3]))
    helpers.rdf_write(rdf_file, URIRef(p[0]), helpers.make_uri("uuid"), Literal(p[2]))
    helpers.rdf_write(rdf_file, URIRef(p[0]), helpers.make_uri("keywords"), Literal(p[4]))
    
        
helpers.rdf_close(rdf_file, "graphdata/rdf/documents.ttl")

## Build Comprehend extraction results

We ran Amazon Comprehend to extract entities and events from the press releases. 

In this step, we build those extractions and link them to documents.

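Specific files that we produce are the following:

- extractions.csv - Entities and events extracted from press releases.
- extraction_rels.csv - Links from documents to events and from events to entities.
- extractions.ttl - Entities and events extracted from press releases.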

For more on this approach, see blog post https://aws.amazon.com/blogs/database/building-a-knowledge-graph-in-amazon-neptune-using-amazon-comprehend-events/.
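
Two details of the extraction step are easy to miss: alternate names are kept only when their `GroupScore` meets a threshold, and each event argument points at an entity by its position in that document's own entity list. The sketch below shows both on invented annotations. The helper name `confident_names` is hypothetical and is not part of `helpers`.

```python
GROUP_THRESHOLD = 0.95

def confident_names(mentions, threshold=GROUP_THRESHOLD):
    """Return distinct mention texts whose GroupScore meets the threshold, in order."""
    names = []
    for mention in mentions:
        if mention["GroupScore"] >= threshold and mention["Text"] not in names:
            names.append(mention["Text"])
    return names

# Invented annotations for a single document.
annotations = {
    "Entities": [
        {"Mentions": [
            {"Text": "Example Corp", "Type": "ORGANIZATION", "GroupScore": 1.0},
            {"Text": "the company", "Type": "ORGANIZATION", "GroupScore": 0.62},
            {"Text": "Example", "Type": "ORGANIZATION", "GroupScore": 0.97},
        ]},
        {"Mentions": [{"Text": "Sample Inc", "Type": "ORGANIZATION", "GroupScore": 1.0}]},
    ],
    "Events": [
        {"Type": "CORPORATE_ACQUISITION",
         "Triggers": [{"Text": "acquired", "BeginOffset": 13}],
         "Arguments": [{"EntityIndex": 0, "Role": "INVESTOR"},
                       {"EntityIndex": 1, "Role": "INVESTEE"}]},
    ],
}

# Names kept for the first entity: the low-scoring mention is dropped.
names = confident_names(annotations["Entities"][0]["Mentions"])

# Resolve each argument against this document's entity list only.
doc_entities = annotations["Entities"]
edges = []
for event in annotations["Events"]:
    for argument in event["Arguments"]:
        target = doc_entities[argument["EntityIndex"]]["Mentions"][0]["Text"]
        edges.append((event["Triggers"][0]["Text"], argument["Role"], target))

print(names)
print(edges)
```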


In [None]:
# We will filter out names referring to each entity with less than 0.95 group certainty.
# You can change this threshold to be lower if you are tolerant of less certain values in your data set.
groupThreshold = 0.95

entities=[]
events=[]
ee_rels=[]
de_rels=[]
distinct_roles={}

def strip_x_id(rawid):
    return "_".join(rawid.split()).lower() # replace all whitespace with underscores

# Open the JSONL file that contains the documents plus the Comprehend extraction. 
jsonObj = pd.read_json(path_or_buf="source/comprehend_events_amazon_press_releases.20201118.v1.4.1.jsonl", lines=True)
for index, row in jsonObj.iterrows():
        
    # extract metadata about current press release
    metadata=row['metadata']
    m_doc=metadata['document_id']
    
    # Comprehend Events references entities it refers to by index, so we need to retain the ordered list of entities
    # within the document
    annotations = row['annotations']
    for entity in annotations["Entities"]:
        primary_name = entity["Mentions"][0]["Text"]
        entity_type =entity["Mentions"][0]["Type"]
        entity_local=strip_x_id(f"{entity_type}_{primary_name}")
        entity_id= helpers.make_uri(f"ExtractedEntity/{entity_local}")
        names=[]
        for mention in entity["Mentions"]:
            if (mention["GroupScore"] >= groupThreshold):
                if not(mention["Text"] in names):
                        names.append(mention["Text"])
        entities.append({
            '~id': entity_id,
            'local_id': entity_local,
            '~label': 'ExtractedEntity',
            'label': primary_name,
            'labels': names,
            'type': entity_type
        })
        
    for event in annotations["Events"]:
        primary_name=event["Triggers"][0]["Text"]
        event_type=event["Type"]
        offset_for_id=str(event["Triggers"][0]["BeginOffset"])
        event_local=strip_x_id(f"{m_doc}_{event_type}_{primary_name}{offset_for_id}")
        event_id=helpers.make_uri(f"ExtractedEvent/{event_local}")
        names=[]
        for trigger in event["Triggers"]:
            if not(trigger["Text"] in names):
                names.append(trigger["Text"])

        events.append({
            '~id': event_id,
            'local_id': event_local,
            '~label': 'ExtractedEvent',
            'label': primary_name,
            'labels': names,
            'type': event_type
        })

        # add edges between the event node and the entity node, 
        # annotated with a label describing the Comprehend Event role assigned to the entity in the event.
        for argument in event["Arguments"]:
            from_id=event_id
            from_local=event_local
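            # EntityIndex is relative to this document's entities, which are the last
            # len(annotations["Entities"]) items appended to the global entities list
            doc_entity_start=len(entities) - len(annotations["Entities"])
            to_ent=entities[doc_entity_start + argument["EntityIndex"]]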
            to_id=to_ent['~id']
            to_local=to_ent['local_id']
            edge_type=argument["Role"]
            ee_local=strip_x_id(f"{from_local}_{to_local}_{edge_type}")
            ee_edge_id=helpers.make_uri(f"ExtractedEventToEntity/{ee_local}")
            role= argument["Role"]
            role_uri=helpers.make_uri(f"ExtractedRole/{role}")
            distinct_roles[role_uri]=role
            ee_rels.append({
                '~id': ee_edge_id,
                '~label': "eventHasEntity",
                'type': event["Type"],
                '~from': from_id,
                '~to': to_id,
                "role":  role
            })
        
        # add an edge between the document and the event nodes
        document_id = make_document_uri(m_doc)
        de_local=strip_x_id(f"{m_doc}_{event_local}")
        de_edge_id=helpers.make_uri(f"DocumentToExtractedEvent/{de_local}")
        de_rels.append({
            '~id': de_edge_id,
            '~label': "documentHasEvent",
            '~from': document_id,
            '~to': event_id,
            'role': "" 
        })
        

# write extracted entities and events to CSV for LPG
with open('graphdata/lpg/extractions.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label", "label", "labels", "type"])
    for p in entities + events:
        writer.writerow([p['~id'], p['~label'], p['label'], helpers.get_delim_string(p, 'labels'), p['type']])

with open('graphdata/lpg/extraction_rels.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~from", "~to", "label", "role"])
    for p in ee_rels + de_rels:
        writer.writerow([p['~id'], p['~from'], p['~to'], p['~label'], p['role']])
        
# write docs for RDF
rdf_file = helpers.rdf_open()
    
for d in distinct_roles:
    helpers.rdf_write(rdf_file, URIRef(d), RDF.type, helpers.make_uri("ExtractionRole"))
    helpers.rdf_write(rdf_file, URIRef(d), RDFS.label, Literal(distinct_roles[d]))

for p in entities + events:
    helpers.rdf_write(rdf_file, URIRef(p['~id']), RDF.type, helpers.make_uri("Extraction"))
    helpers.rdf_write(rdf_file, URIRef(p['~id']), RDF.type, helpers.make_uri(p['~label']))
    helpers.rdf_write(rdf_file, URIRef(p['~id']), RDF.type, helpers.make_uri(p['type']))

    helpers.rdf_write(rdf_file, URIRef(p['~id']), RDFS.label, Literal(p['label']))
    helpers.rdf_write(rdf_file, URIRef(p['~id']), helpers.make_uri("xtype"), Literal(p['type']))
    for l in p['labels']:
        helpers.rdf_write(rdf_file, URIRef(p['~id']), SKOS.altLabel, Literal(l))

for p in de_rels:
    helpers.rdf_write(rdf_file, URIRef(p['~from']),helpers.make_uri(p['~label']), URIRef(p['~to']))
        
for p in ee_rels:
    helpers.rdf_write(rdf_file, URIRef(p['~from']),helpers.make_uri(p['role']), URIRef(p['~to']))
    
        
helpers.rdf_close(rdf_file, "graphdata/rdf/extractions.ttl")


### Entity Resolution, LLM Style
Now let's try to link entities mentioned in Comprehend output to orgs, persons, industries, locations, products, and services in the graph. 

We'll get creative! Let's ask the LLM to find well-known URIs and alternate names for the entities discovered above.

LPG files in graphdata/lpg are:
- resolved_entities.csv - Link extracted entities to existing orgs, persons, industries, locations, products, and services in the graph.
- resolution_links.csv - Resolution matches that can be linked from an extracted node to a structured node.

RDF files in graphdata/rdf are:
- resolved_entities.ttl - Link extracted entities to existing orgs, persons, industries, locations, products, and services in the graph.


In [None]:
# Track entities to resolve
er_entities={}
for e in entities:
    if e['type'] in ['PERSON', 'PERSON_TITLE', 'LOCATION', 'ORGANIZATION', 'STOCK_CODE']:
        er_entities[e['~id']] = {'~id': e['~id'], 'label': e['label'], 'type': e['type']}

count=len(er_entities)
progress=0
for e in er_entities:
    if progress % 10 == 0:
        print(str(progress))
    er_entities[e]['resolution']=helpers.resolve_entities(er_entities[e]['label'])
    progress += 1


In [None]:
with open('saved_er.json', 'w') as saved_er: 
    saved_er.write(json.dumps(er_entities))

## Save these entities and see if we can resolve them once they land in the graph

- resolved_entities.csv - Link extracted entities to existing orgs, persons, industries, locations, products, and services in the graph.

- resolved_entities.ttl - The same alternate terms and links as RDF.
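
The idea in this step is that every resolution result becomes an alternate-term node, but only results that look like URIs become edges to existing nodes. A minimal sketch follows. It assumes the resolver returns a list of strings mixing names and URIs, which is how the notebook iterates the result. The function name `split_resolution` and the sample values are invented.

```python
def is_uri(value):
    """True when the value looks like an http(s) URI."""
    return value.startswith(("http://", "https://"))

def split_resolution(entity_id, candidates):
    """Turn resolution candidates into alternate-term nodes and URI links."""
    nodes, links = [], []
    for index, candidate in enumerate(candidates):
        node_id = f"{entity_id}_res{index}"
        # Every candidate is kept as an alternate-term node.
        nodes.append({"~id": node_id, "~label": "ExtractedEntityAltTerm", "label": candidate})
        # Only URI candidates can be linked to a structured node.
        if is_uri(candidate):
            links.append({"~id": f"{node_id}_link", "~from": entity_id, "~to": candidate})
    return nodes, links

# Invented resolver output: one alternate name and one URI.
nodes, links = split_resolution(
    "http://example.org/ExtractedEntity/organization_example_corp",
    ["Example Corporation", "http://example.org/known/ExampleCorp"],
)
print(len(nodes), len(links))
```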


In [None]:
def is_uri(uri):
    return uri.startswith("http://") or uri.startswith("https://")

resies=[]

for eid in er_entities:
    for index, r in enumerate(er_entities[eid]['resolution']):
        resid=helpers.make_uri(f"{eid}_res{index}")
        resies.append({
            '~id': resid, 
            '~label': "ExtractedEntityAltTerm", 
            'label': r, 
            'is_uri': is_uri(r),
            'entity_id': eid,
            'edge_id': helpers.make_uri(f"{eid}_res{index}_link"),
            'edge_type': "LinkedAtlTerm"
        })
        
# write docs to CSV for LPG
with open('graphdata/lpg/resolved_entities.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label", "label"])
    for e in resies:
        writer.writerow([e['~id'], e['~label'], e['label']])

with open('graphdata/lpg/resolution_links.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~from", "~to", "label"])
    for e in resies:
        if e['is_uri']:
            writer.writerow([e['edge_id'], e['entity_id'], e['label'], e['edge_type']])
        
# write docs for RDF
rdf_file = helpers.rdf_open()

for e in resies:
    helpers.rdf_write(rdf_file, URIRef(e['~id']), RDF.type, helpers.make_uri(e['~label']))
    helpers.rdf_write(rdf_file, URIRef(e['~id']), RDFS.label, Literal(e['label']))
    if e['is_uri']:
        helpers.rdf_write(rdf_file, URIRef(e['entity_id']), helpers.make_uri(e['edge_type']), URIRef(e['label']))
        
helpers.rdf_close(rdf_file, "graphdata/rdf/resolved_entities.ttl")


## Create document summaries
Ask the LLM to summarize each press release document. 

Save the summaries as .txt files in the summaries folder.

Also create a CSV file holding the embedding of each summary:

- summaries.csv - Embedding of each document's summary, keyed by the same document IDs as documents.csv.
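
To make the `embedding:vector` column concrete, here is a sketch of how a vector might be flattened into a single CSV cell. The function `embedding_string_sketch` is a hypothetical stand-in for `helpers.embedding_string`, and the semicolon delimiter is an assumption that should be checked against the helper and the Neptune load-format documentation. The vector values are invented and far shorter than a real embedding.

```python
def embedding_string_sketch(vector, delim=";"):
    """Join vector components into one delimited string (delimiter is an assumption)."""
    return delim.join(str(float(v)) for v in vector)

# One summaries.csv-style row: document vertex ID, label, and the flattened embedding.
row = [
    "http://example.org/Document/doc-001",
    "Document",
    embedding_string_sketch([0.25, -0.5, 1]),
]
print(row)
```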

In [None]:
import pandas as pd
import csv, json
import helpers
progress=0
with open("graphdata/lpg/summaries.csv", 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label", "embedding:vector"])

    jsonObj = pd.read_json(path_or_buf="source/comprehend_events_amazon_press_releases.20201118.v1.4.1.jsonl", lines=True)
    for index, row in jsonObj.iterrows():
    
        if progress % 5 == 0:
            print(str(progress))

        # extract metadata about current press release
        metadata=row['metadata']
        m_keywords=metadata['keywords']
        m_title=metadata['title']
        m_doc=metadata['document_id']

        summary=helpers.summarize(row['raw_text'])
        summary_embedding=helpers.embedding_string(helpers.make_embedding(summary))
    
        with open(f"summaries/{m_doc}.txt", "w") as f:
            f.write(summary)

        writer.writerow([make_document_uri(m_doc), "Document", summary_embedding])
        
        progress+=1

## Make chunks of docs
Split press release documents into chunks, create embeddings for each chunk, and link the chunk to the document in the graph.

The chunks are written as files to the chunks folder.

This step creates the following CSV files:

- chunks.csv - Document chunks and their embeddings.
- chunk2doc.csv - Link chunks to docs.
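
The actual splitting is done by `helpers.make_doc_splits`, which is not shown here. As a stand-in, the sketch below uses a plain fixed-size splitter with overlap to show the shape of the result: each chunk gets its own ID and an edge back to the document it came from. `split_text`, the chunk sizes, and the example URIs are all hypothetical.

```python
def split_text(text, chunk_size=40, overlap=10):
    """Split text into fixed-size chunks where neighbors share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks

doc_file = "doc-001.txt"
text = "abcdefghij" * 10  # 100 characters of invented content
chunks = split_text(text)

# Build chunk vertices and chunk-to-document edges, mirroring chunks.csv and chunk2doc.csv.
doc_uri = "http://example.org/Document/" + doc_file.split(".")[0]
chunk_rows = []
edge_rows = []
for index, content in enumerate(chunks):
    chunk_uri = f"http://example.org/Chunk/{doc_file}_{index}"
    chunk_rows.append([chunk_uri, "Chunk", doc_file, index])
    edge_rows.append([chunk_uri, doc_uri, "belongsToDocument"])

print(len(chunks), edge_rows[0])
```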


In [None]:
def write_chunk_to_file(docfile, chunkidx, content):
    # write one chunk to chunks/<docfile>_<chunkidx> and return the path
    file_name=f"chunks/{docfile}_{chunkidx}"
    with open(file_name, "w") as text_file:
        text_file.write(content)
    return file_name

all_splits=helpers.make_doc_splits("documents")
chunk2doc={}
with open("graphdata/lpg/chunks.csv", 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label", "doc", "chunkidx:Integer", "embedding:vector"])
    
    for index, s in enumerate(all_splits):
        content=s.page_content
        doc=s.metadata['source'].split("/")[-1]
        file_name = write_chunk_to_file(doc, index, content)
        embedding=helpers.embedding_string(helpers.make_embedding(content))
        chunk_local =f"{doc}_{index}" 
        chunk_uri = helpers.make_uri(f"Chunk/{chunk_local}")
        doc_uri = make_document_uri(doc.split(".")[0])
        writer.writerow([chunk_uri, "Chunk", doc, index, embedding])
        chunk2doc[chunk_uri] = {
            'chunk_uri': chunk_uri,
            'doc_uri': doc_uri,
            'chunk_local': chunk_local,
            'doc_local': doc
        }
        
# links chunks to docs
with open("graphdata/lpg/chunk2doc.csv", 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id","~from", "~to", "~label"])
    
    for cd in chunk2doc:
        cdrec = chunk2doc[cd]
        edge_id=helpers.make_uri(f"chunkToDoc_{cdrec['chunk_local']}_{cdrec['doc_local']}")
        writer.writerow([edge_id, cdrec['chunk_uri'], cdrec['doc_uri'], "belongsToDocument"])

## Make entity embeddings so we can search them better

In scope:
- taxonomy concepts
- orgs, persons, industries, services, products, locations
- any loose structured objects 
- extracted entities
- extracted events

Not in scope:
- relationships among structured entities
- roles in extraction

When a user asks a question, we extract entities from the question and want those entities to be reasonably searchable in the graph, which is why we embed them. The not-in-scope data is a small enumerated set, so we can instead prompt the LLM to express predicates in the question using that set.

In the code below, we populate the following:

- entity_embeddings.csv - Embeddings of entities for searchability
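
The text that gets embedded for each entity is simply its label columns joined together. The sketch below shows that assembly on an invented two-column example. It uses `csv.DictReader` for brevity where the notebook builds its own header-position map, and `CELL_DELIM` here is a hypothetical value standing in for `helpers.CELL_DELIM`.

```python
import csv
import io

CELL_DELIM = ";"  # hypothetical; the notebook uses helpers.CELL_DELIM

# Invented extractions.csv-style content with a multi-valued `labels` cell.
csv_text = (
    "~id,~label,label,labels\r\n"
    "http://example.org/e1,ExtractedEntity,Example Corp,Example Corp;Example\r\n"
)

label_columns = ["label", "labels"]
texts = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    tokens = []
    for col in label_columns:
        # Multi-valued cells are split on the delimiter before joining.
        tokens += row[col].split(CELL_DELIM)
    texts[row["~id"]] = " ".join(tokens)

print(texts)
```

Repeated names are not deduplicated in this sketch, which matches how the notebook concatenates the columns.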


In [None]:
import csv
import helpers

def make_entity_embedding(toks):
    text=" ".join(toks)
    print(text)
    return helpers.embedding_string(helpers.make_embedding(text))

taxonomy_labels=['prefLabel', 'altLabels', 'broaders']
struct_labels=['label', 'labels', 'ulabels', 'typeLabels', 'seeAlsoLabels', 'seeAlsoTypeLabels']
extract_labels=['label', 'labels']
entities_to_embed={
    'orgs': struct_labels, 
    'persons': struct_labels, 
    'products': struct_labels,  
    'industries': struct_labels, 
    'locations': struct_labels,  
    'services': struct_labels, 
    'taxonomy_concepts': taxonomy_labels, 
    'extractions': extract_labels
}

with open("graphdata/lpg/entity_embeddings.csv", 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label", "embedding:vector"])
    
    for ee in entities_to_embed:
        print(ee)
        # use a distinct name so the input file does not shadow the output csvfile
        with open(f'graphdata/lpg/{ee}.csv', newline='') as infile:
            csvreader = csv.reader(infile, delimiter=',')
            posof={}
            for index, row in enumerate(csvreader):
                if index==0:
                    for rindex, r in enumerate(row):
                        posof[r] = rindex
                    print(posof)
                else:
                    _id = row[0]
                    _label=row[1]
                    candidates=[]
                    for col in entities_to_embed[ee]:
                        candidates += row[posof[col]].split(helpers.CELL_DELIM)
                    embedding_text=make_entity_embedding(candidates)
                    writer.writerow([_id, _label, embedding_text])
