### Background

In this demo, we are playing a [phrasal template](https://en.wikipedia.org/wiki/Phrasal_template) game where we will substitute different phrases and words into realistic financial news stories and perform RAG and GraphRAG queries on them. The goal is to use fictional names and facts to demonstrate 1/ the LLM is not answering based on prior factual training, and 2/ this is not a pre-trained or canned scenario optimized for the specific stories.  To get things started, I've populated all of the fields, but you are free to change whatever you like to another value within the theme suggested!

This demonstration is optimized for the Meta Llama 3.2 family of models.  You will get the best results using the [Meta Llama 3.2 90B model](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-vision-models-(11b/90b)-) for entity extraction and question answering, but you can choose which model you want to choose!

This notebook is used as a demonstration to accompany [this presentation](https://github.com/aws-samples/amazon-neptune-generative-ai-samples/presentations/Unlocking-data-for-GenAI-using-Graph-Databases/Unlocking_data_for_GenAI_using_Graph_Databases_Presentation.pdf)

### Prerequisites
This demonstration requires a Neptune Analytics graph (32 m-NCU is fine), a Neptune Notebook instance, access to the [Amazon Titan Text Embeddings v2 model](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html), access to a [Meta Llama 3.2 model in Bedrock](https://aws.amazon.com/bedrock/llama/), and an Amazon Simple Storage Service (S3) bucket to stage your graph content before loading into Neptune.  

To create a new graph, follow the instructions in the [Neptune Analytics documentation](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/create-graph-using-console.html).  You do not need to have a replica, so choose 0 instead of the default of 1. 

You must have access to the Titan and Meta Llama models in Bedrock. Follow the instructions in the [Amazon Bedrock User Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access-modify.html) before executing the steps in this notebook.

You must have a Neptune Notebook instance. Follow the instructions in the [Neptune User Guide](https://docs.aws.amazon.com/neptune/latest/userguide/graph-notebooks.html) to create a Neptune Notebook. An ml.t3.medium instance size is fine for this demo.

To create an S3 Bucket, follow these instructions in the [Amazon S3 User Guide](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html).

### Costs
The approximate costs for running this are as high as $20 if completed once in an hour or less in US East 1. The costs are broken down as such:<br>
Neptune Analytics Graph \(32 m-NCUs\)\: \\$0.96/hour<br>
Neptune Notebook Instance \(ml.t3.medium\)\: \\$0.05/hour<br>
S3 Bucket Usage: This will fall under the free tier, or will cost approximately \\$0.01 if you do not qualify for free tier.<br>
Bedrock+Llama Model Usage\:
- RAG encodings cost: approximately \\$0.03 \(4 documents + question using Amazon Titan Text Embeddings v2\)
- One RAG summarization answer cost approximately between \\$0.0775 \(Llama 3.2 1B\) to \\$1.55 \(Llama 3.2 
90B\) depending on the model used
- The GraphRAG encodings for all 4 documents and the question cost approximately \\$0.61 \(1B\) to \\$12.32 \(90B\) depending on the model used
- The GraphRAG answer cost approximately between \\$0.2454 \(1B\) to \\$4.908 \(90B\) depending on the model used

### Notebook configuration

Substitute your graph identifier below for `$YOUR-GRAPH-NAME$`. You can find the graph identifier in the console or using the AWS API.  It will have a format such as `g-1234abcd5e`.  Substitute your region for `$YOUR-AWS-REGION$` in both the `host` and `aws_region` variables, such as `us-east-1`.  Select the `run` button to save the configuration.

In [None]:
%%graph_notebook_config

{
  "host": "$YOUR-GRAPH-NAME$.$YOUR-AWS-REGION$.neptune-graph.amazonaws.com",
  "neptune_service": "neptune-graph",
  "port": 8182,
  "proxy_host": "",
  "proxy_port": 8182,
  "auth_mode": "IAM",
  "load_from_s3_arn": "",
  "ssl": true,
  "ssl_verify": true,
  "aws_region": "$YOUR-AWS-REGION$",
  "sparql": {
    "path": "sparql"
  },
  "gremlin": {
    "traversal_source": "g",
    "username": "",
    "password": "",
    "message_serializer": "GraphSONUntypedMessageSerializerV4"
  },
  "neo4j": {
    "username": "neo4j",
    "password": "password",
    "auth": true,
    "database": null
  }
}

Run the next two cells to change the display settings to properly format overlapping edges, and load the graph configuration into a variable so we can use it programmatically.

In [None]:
%%graph_notebook_vis_options
{
  "nodes": {
    "borderWidthSelected": 0,
    "borderWidth": 0,
    "color": {
      "background": "rgba(210, 229, 255, 1)",
      "border": "transparent",
      "highlight": {
        "background": "rgba(9, 104, 178, 1)",
        "border": "rgba(8, 62, 100, 1)"
      }
    },
    "shadow": {
      "enabled": false
    },
    "shape": "circle",
    "widthConstraint": {
      "minimum": 70,
      "maximum": 70
    },
    "font": {
      "face": "courier new",
      "color": "black",
      "size": 12
    }
  },
  "edges": {
    "color": {
      "inherit": false
    },
    "smooth": {
      "enabled": true,
      "type": "dynamic"
    },
    "arrows": {
      "to": {
        "enabled": true,
        "type": "arrow"
      }
    },
    "font": {
      "face": "courier new"
    }
  },
  "interaction": {
    "hover": true,
    "hoverConnectedEdges": true,
    "selectConnectedEdges": false
  },
  "physics": {
    "minVelocity": 0.75,
    "barnesHut": {
      "centralGravity": 0.1,
      "gravitationalConstant": -50450,
      "springLength": 95,
      "springConstant": 0.04,
      "damping": 0.09,
      "avoidOverlap": 0.1
    },
    "solver": "barnesHut",
    "enabled": true,
    "adaptiveTimestep": true,
    "stabilization": {
      "enabled": true,
      "iterations": 1
    }
  }
}

In [None]:
%graph_notebook_config --store-to config

### Configure Connections and S3 Buckets

Substitute `$BUCKET-NAME-HERE$` with the name of the S3 bucket name you are using. Add just the bucket name, not the full URI.  Then run this cell and the next cell to set up the connections we will use later.

In [None]:
s3_bucket_name = "$BUCKET-NAME-HERE$"

In [None]:
import boto3
import json
endpoint_url = "https://" + json.loads(config)['host'] + ":" + str(json.loads(config)['port'])
region = json.loads(config)['aws_region']
neptune_client = boto3.client('neptune-graph')
bedrock_client = boto3.client("bedrock-runtime", region_name=region)

s3_bucket = f"s3://{s3_bucket_name}/"

### Code and helper functions

This section contains a lot of helper function code. Describing this code is outside the scope of this demonstration, but feel free to explore it yourself. Please remember this is demonstration code only.  Run this cell to execute the code.

In [None]:
import boto3
import json
from botocore.exceptions import ClientError
import logging
from enum import Enum
import re
import string
import random
import html
from json import JSONEncoder
import uuid
import csv
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import sys
from pprint import pprint

EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"

logger = logging.getLogger(__name__)
def _default(self, obj):
    return getattr(obj.__class__, "__json__", _default.default)(obj)

_default.default = JSONEncoder().default
JSONEncoder.default = _default

context = dict(
    completion_delimiter="||COMPLETE||",
    tuple_delimiter="|~|",
    record_delimiter="##",
    llm_temperature=0.25,
    llm_top_p = 0.25
)

connections_template = """
MATCH (entity1)-[event*1..]-(entity2)
WHERE $entity1_text = entity1.name AND UPPER($entity1_label) IN LABELS(entity1)
AND (($entity2_text = entity2.name AND UPPER($entity2_label) IN LABELS(entity2))
RETURN STARTNODE(event).name as subject, LABELS(STARTNODE(event))[0] as subjectType, TYPE(event) as eventType, ENDNODE(event).name as target, LABELS(ENDNODE(event))[0] as targetType
"""

inquiry_template = """
MATCH (entity1)-[event]-(entity2)-[event2]-(entity3)
WHERE $entity1_text = entity1.name AND UPPER($entity1_label) IN LABELS(entity1)
AND entity1 <> entity3 AND event <> event2
WITH event
WHERE TYPE(event) <> "hasChunk"
RETURN STARTNODE(event).name as subject, LABELS(STARTNODE(event))[0] as subjectType, TYPE(event) as eventType, ENDNODE(event).name as target, LABELS(ENDNODE(event))[0] as targetType
UNION
MATCH (entity1)-[event]-(entity2)-[event2]-(entity3)
WHERE $entity1_text = entity1.name AND UPPER($entity1_label) IN LABELS(entity1)
AND entity1 <> entity3 AND event <> event2
WITH event2
WHERE TYPE(event2) <> "hasChunk"
RETURN STARTNODE(event2).name as subject, LABELS(STARTNODE(event2))[0] as subjectType, TYPE(event2) as eventType, ENDNODE(event2).name as target, LABELS(ENDNODE(event2))[0] as targetType
"""

chunk_retrieval_template = """
MATCH (entity1)-[event*0..1]-(entity2)-[event2:hasChunk]->(chunk:Chunk)
WHERE $entity1_text = entity1.name AND UPPER($entity1_label) IN LABELS(entity1)
RETURN DISTINCT chunk.description as text_chunk
"""

##################################################################################################
#
#  Parts of this code and techniques for use of the LLM were influenced by LightRAG: Simple and Fast Retrieval-Augmented Generation
#  Authors: Zirui Guo and Lianghao Xia and Yanhua Yu and Tu Ao and Chao Huang
#  URL: https://github.com/HKUDS/LightRAG/
#
#
#
##################################################################################################

def run_llm(text, context, bedrock_client, model_id, prompt, do_formatting = True, cost_tracking = None):
    
    request_text = {
        "max_gen_len": 2048,
        "temperature": context["llm_temperature"],
        "top_p": context["llm_top_p"],
        "prompt": prompt.format(**context, input_text=text) if do_formatting else prompt
    }

    # Convert the native request to JSON.
    request = json.dumps(request_text)

    try:
        # Invoke the model with the request.
        response = bedrock_client.invoke_model(modelId=model_id, body=request)

    except (ClientError, Exception) as e:
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        exit(1)

    # Decode the response body.
    response_body = json.loads(response["body"].read())
    if not (cost_tracking is None):
        cost_tracking["prompt_tokens"] = cost_tracking["prompt_tokens"] + response_body["prompt_token_count"] if "prompt_tokens" in cost_tracking else response_body["prompt_token_count"]
        cost_tracking["generation_tokens"] = cost_tracking["generation_tokens"] + response_body["generation_token_count"] if "generation_tokens" in cost_tracking else response_body["generation_token_count"]

    return response_body["generation"]
def run_llm_with_history(text, context, bedrock_client, model_id, prompt, history, do_formatting = True):
    
    prompt = "\n".join(history)
    
    request_text = {
        "max_gen_len": 2048,
        "temperature": context["llm_temperature"],
        "top_p": context["llm_top_p"],
        "prompt": prompt + "\n" + (prompt.format(**context, input_text=text) if do_formatting else prompt)
    }

    # Convert the native request to JSON.
    request = json.dumps(request_text)

    try:
        # Invoke the model with the request.
        response = bedrock_client.invoke_model(modelId=model_id, body=request)

    except (ClientError, Exception) as e:
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        exit(1)

    # Decode the response body.
    response_body = json.loads(response["body"].read())

    return response_body["generation"]
def run_llm_batch(text_chunks, context, bedrock_client, model_id, prompt, do_formatting = True, cost_tracking = None):
    results = []
    
    for chunk in text_chunks:
        results.append(run_llm(chunk, context, bedrock_client, model_id, prompt, do_formatting, cost_tracking))
        
    return results

def split_string_by_multi_markers(content, markers):
    """Split a string by multiple markers"""
    if not markers:
        return [content]
    results = re.split("|".join(re.escape(marker) for marker in markers), content)
    return [r.strip() for r in results if r.strip()]
def is_float_regex(value):
    return bool(re.match(r"^[-+]?[0-9]*\.?[0-9]+$", value))
def capitalize(value):
    if not isinstance(value, str):
        return value
    return value[0].upper()+value[1:] if len(value) > 1 else value.upper()
def clean_str(input):
    if not isinstance(input, str):
        return input
    result = html.unescape(input.strip())
    result = re.sub(r"[\x00-\x1f\x7f-\x9f]", "", result)
    result = re.sub(r"\s+", '_', result) 
    return result
def _log_if_debug(info, debug):
    if debug:
        print(info)
class Entity:
    def __init__(self, identifier, label, name, description, source_chunk):
        self.identifier = identifier
        self.label = label
        self.name = name
        self.description = description
        self.source_chunk = source_chunk
    def __str__(self):
        return f"Entity({self.identifier}) [type={self.label},name={self.name},description={self.description},source_chunk={self.source_chunk}]"
    def to_array(self):
        return [self.identifier, self.label, self.name, self.description]
    def __json__(self):
        return dict(
            obj_class="Entity",
            identifier=self.identifier,
            label=self.label,
            description=self.description,
            name=self.name,
            source_chunk=self.source_chunk
        )
class Relationship:
    def __init__(self, from_node, to_node, description, weight, keywords, source_chunk, from_id = None, to_id = None):
        self.from_node = from_node
        self.from_id = from_id
        self.to_node = to_node
        self.to_id = to_id
        self.description = description
        self.weight = weight
        self.keywords = keywords
        self.source_chunk = source_chunk
    def __str__(self):
        return f"Relationship{{from={self.from_node}, to={self.to_node}}} [keywords={self.keywords},description={self.description},weight={self.weight},source_chunk={self.source_chunk}]"
    def __json__(self):
        return dict(
            obj_class="Relationship",
            from_node=self.from_node,
            to_node=self.to_node,
            description=self.description,
            weight=self.weight,
            keywords=self.keywords,
            source_chunk=self.source_chunk,
            from_id=self.from_id,
            to_id=self.to_id
        )
def parse_entity(
    record_attributes: list[str],
    chunk_key: str,
):
    if len(record_attributes) < 4 or record_attributes[0] != '"entity"':
        return None
    proposed_id = f"NODE_{clean_str(record_attributes[2].upper())}_{clean_str(record_attributes[1].upper())}"
    entity_name = record_attributes[1]
    if not entity_name.strip():
        return None
    entity_type = clean_str(record_attributes[2].upper())
    if not entity_type:
        print(f"WARNING: None entity type found for {str(record_attributes)}")
    entity_description = record_attributes[3]
    entity_source_id = chunk_key
    return Entity(
        identifier = proposed_id,
        label=entity_type,
        name=entity_name,
        description=entity_description,
        source_chunk=entity_source_id
    )

def parse_relationship(
    record_attributes: list[str],
    chunk_key: str,
):
    if len(record_attributes) < 5 or record_attributes[0] != '"relationship"':
        return None
    source = record_attributes[1]
    target = record_attributes[2]
    edge_description = record_attributes[3]
    edge_keywords = record_attributes[4]
    edge_source_id = chunk_key
    weight = (
        float(record_attributes[-1]) if is_float_regex(record_attributes[-1]) else 1.0
    )
    return Relationship(
        from_node=source, 
        to_node=target, 
        description=edge_description, 
        weight=weight,
        keywords=edge_keywords,
        source_chunk = chunk_key
    )
    
def parse_question(text, context):
    records = split_string_by_multi_markers(
        response_text,
        [context["record_delimiter"], context["completion_delimiter"]],
    )
    search_entities = []
    search_type = None
    search_keywords = []

    for record in records:
        record = re.search(r"\((.*)\)", record)
        if record is None:
            continue
        record = record.group(1)
    #    print(record)
        record_attributes = split_string_by_multi_markers(
            record, [context["tuple_delimiter"]]
        )
    #    print(record_attributes)

        match(record_attributes[0]):
            case '"entity"':
                search_entities.append(dict(name=record_attributes[1],label=record_attributes[2].upper()))
            case '"content_keywords"':
                search_keywords = record_attributes[1].split(",")
            case '"inquiry_type"':
                search_type = record_attributes[1]
            case _:
                print(f"Unknown type: {record_attributes[0]}")
    return (search_entities, search_type, search_keywords)
        
def parse_llm_output(results, context, debug = False):
    entities = []
    relations = []

    for idx, result in enumerate(results):
        _log_if_debug(f"~~record {str(idx)}:~~",debug)
        records = split_string_by_multi_markers(
            result,
            [context["record_delimiter"], context["completion_delimiter"]],
        )
        _log_if_debug(f"{str(len(records))} items.", debug)
        local_str_to_id_lookup = {}
        for record in records:
            record = re.search(r"\((.*)\)", record)
            if record is None:
                continue
            record = record.group(1)
            _log_if_debug(record,debug)
            record_attributes = split_string_by_multi_markers(
                record, [context["tuple_delimiter"]]
            )
            _log_if_debug(record_attributes,debug)
        
            entity = parse_entity(
                record_attributes, f"Chunk_{str(idx)}"
            )
            if entity is not None:
                _log_if_debug(f"adding key {entity.name} -> {entity.identifier}",debug)
                local_str_to_id_lookup[entity.name] = entity.identifier
                entities.append(entity)
            
            relation = parse_relationship(
                record_attributes, f"Chunk_{str(idx)}"
            )
            if relation is not None:
                relation.from_id = local_str_to_id_lookup[relation.from_node] if relation.from_node in local_str_to_id_lookup else None
                if relation.from_id is None:
                    print(f"Source lookup error -- Cannot find {relation.from_node}. Dumping lookup dictionary:")
                    for key in local_str_to_id_lookup:
                        print(f"{key}->{local_str_to_id_lookup[key]}")
                relation.to_id = local_str_to_id_lookup[relation.to_node] if relation.to_node in local_str_to_id_lookup else None
                if relation.to_id is None:
                    print(f"Target lookup error -- Cannot find {relation.to_node}. Dumping lookup dictionary:")
                    for key in local_str_to_id_lookup:
                        print(f"{key}->{local_str_to_id_lookup[key]}")
                relations.append(relation)
    return (entities, relations)

def format_as_neptune_load_files(text_chunks, entities, relations):
    nodes = []
    edges = []

    # write a node for each chunk
    chunk_label = "Chunk"

    for idx, chunk in enumerate(text_chunks):
        clean_chunk_text = re.sub('\n', ' ', chunk)
        nodes.append(
            dict(
                id = f"NODE_CHUNK_{idx}",
                label = chunk_label,
                name = f"Chunk_{idx}",
                description = clean_chunk_text       
            )
        )

    # convert objects to nodes and edges
    for entity in entities:
        # write the node
        nodes.append(
            dict(
                id = entity.identifier,
                label = entity.label,
                name = entity.name,
                description = entity.description       
            )
        )
        # write the edge to the source chunk
        edges.append(
            dict(
                from_id=entity.identifier,
                to_id=f"NODE_{entity.source_chunk.upper()}",
                label="hasChunk",
                description="",
                weight=1.0,
                keywords=""
            )
        )

    for relation in relations:
        labels = relation.keywords.split(",")

        for label in labels:
            trimmed_label = label.strip()
            edges.append(
                dict(
                    from_id=relation.from_id,
                    to_id=relation.to_id,
                    label=trimmed_label[0].upper()+trimmed_label[1:],
                    description=relation.description,
                    weight=relation.weight,
                    keywords=relation.keywords
                )
            )
    return (nodes, edges)

def get_neptune_query_template(question_type):
    return connections_template if question_type == "Connections" else inquiry_template

def query_facts_from_neptune(search_entities, query_template):
    if (len(search_entities) > 2):
        print("WARNING: Found more than two entities.  This current version will only use the first two entities in the graph query")
    if (search_type == "Connections" and len(search_entities) < 2):
        print("ERROR: Cannot execute a Connections query with only a single entity.  Something went wrong.")
        return
    if (len(search_entities) <= 0):
        print("ERROR: No entities were extracted for the query. Something went wrong.")
        return
    
    parameters = {}

    if (len(search_entities) >= 1):
        if len(search_entities) >= 2:
            parameters["entity2_text"] = search_entities[1]["name"]
            parameters["entity2_label"] = search_entities[1]["label"]
        parameters["entity1_text"] = search_entities[0]["name"]
        parameters["entity1_label"] = search_entities[0]["label"]


#        print(query_template)
#        print(parameters)

        neptune_response = neptune_client.execute_query(
            graphIdentifier = json.loads(config)['host'].split(".")[0],
            queryString=query_template,
            parameters=parameters,
            language='OPEN_CYPHER'
        )
    else:
        print("No entities found in question. Cannot continue.")
        return
    
    neptune_payload = json.load(neptune_response["payload"])
    facts = []

    for row in neptune_payload["results"]:
        facts.append(row)
    
    answer_json = dict(facts=facts)
    return answer_json

def query_chunks_from_graph(search_entities, debug=False):
    chunks = []
    parameters = {}
    parameters["entity1_text"] = search_entities[0]["name"]
    parameters["entity1_label"] = search_entities[0]["label"]
    neptune_response = neptune_client.execute_query(
        graphIdentifier = json.loads(config)['host'].split(".")[0],
        queryString=chunk_retrieval_template,
        parameters=parameters,
        language='OPEN_CYPHER'
    )
    neptune_payload = json.load(neptune_response["payload"])
    _log_if_debug(f"Query Response is {neptune_payload}",debug)
    
    for row in neptune_payload["results"]:
        chunks.append(row["text_chunk"])

    return chunks
    
def calculate_embeddings_batch(text_chunks, bedrock_client,cost_tracking=None):
    embeddings = []

    for chunk in text_chunks:
        embeddings.append(calculate_embedding(chunk, bedrock_client,cost_tracking))

    return pd.DataFrame(embeddings, columns=['Text','Embeddings'])


def calculate_embedding(chunk, bedrock_client, cost_tracking=None):
    native_request = {"inputText": chunk}
    request = json.dumps(native_request)

    response = bedrock_client.invoke_model(modelId=EMBEDDING_MODEL_ID, body=request)

    model_response = json.loads(response["body"].read())

    embedding = model_response["embedding"]
    if not (cost_tracking is None):
        cost_tracking["prompt_tokens"] = cost_tracking["prompt_tokens"] + model_response["inputTextTokenCount"] if "prompt_tokens" in cost_tracking else response_body["prompt_token_count"]
        cost_tracking["generation_tokens"] = cost_tracking["generation_tokens"] + 0 if "generation_tokens" in cost_tracking else 0

    return (chunk, np.array(embedding))

def calculate_similarity_to_question(corpus, question_embedding):
    return cosine_similarity(question_embedding[1].reshape(1,-1), np.array(corpus["Embeddings"].tolist()))
    
def print_vectors_side_by_side(vector1, vector2):

    np.set_printoptions(threshold=sys.maxsize)

    side_by_side = (vector1.reshape(1,-1),vector2.reshape(1,-1))
    for i in range(0,len(side_by_side[0][0])):
        print(f"dim: {i} {side_by_side[0][0][i]} <-> {side_by_side[1][0][i]}")
def rag_query(question, text_chunks, bedrock_client,model_id):
    dataframe_embeddings = calculate_embeddings_batch(text_chunks, bedrock_client)

    question_embedding = calculate_embedding(question, bedrock_client)
    
    rag_similarity = calculate_similarity_to_question(dataframe_embeddings, question_embedding)
    dataframe_embeddings["similarity"] = rag_similarity.reshape(-1,1)

    dataframe_ordered = dataframe_embeddings.nlargest(3, 'similarity')
    context["additionalInfo"] = "\n".join((dataframe_ordered["Text"].iloc[0],dataframe_ordered["Text"].iloc[1]))
    context["question"] = question

    return run_llm(question,context, bedrock_client, model_id, PROMPTS["RAG_QUESTION"])
def generate_summary(company1, company2, brand1, product1, country2, company4, person_name):

    print(f"""
    > {company2} acquired {company1} including their brand {brand1} for cash.  It also talks about their business and related charges.
    > {capitalize(product1)} had some difficulties in their business model and supply chain during the pandemic.  They shifted to a new supplier in {country2} quickly so they could continue production.
    > {company2} declared bankruptcy and is selling itself to rival {company4}.  Their CEO {person_name} believes this will benefit all parties.
    > {person_name} was arrested in {country2} for bribing a government official during the pandemic to help overcome some supply chain woes.
    """)

PROMPTS = {}

PROMPTS["ENTITY_EXTRACTION"] = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{{entity_types}}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>
Use **{record_delimiter}** as the delimiter after each entity

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
Your goal is to extract 10 or more relationships from this document. If the relationship's entities were not already identified in step 1, please add them.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity. A score of 1.0 represents the highest strength and a score of 0.0 represents the lowest strength.
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)
Use **{record_delimiter}** as the delimiter after each relationship

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)

4. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2.
5. When finished, output {completion_delimiter}
<|eot_id|><|start_header_id|>user<|end_header_id|>
Entity_types: {entity_types}
Text: {input_text}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""
PROMPTS["QUESTION_EXTRACTION"] = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
-Goal-
Given a text prompt from a user looking for information and a list of entity types, identify all entities of those types from the text, all relationships among the identified entities, and the type of question being asked.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{{entity_types}}]
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>
Use **{record_delimiter}** as the delimiter after each entity

2. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)
Use **{record_delimiter}** as the delimiter after each entity

3. Identify what type of inquiry the customer is making. Here are the guidelines to use:
If the question has two entities and is in a format like "What is the connection between entity1 and entity2?" then it is a "Connections" inquiry.
If the question asks for information about a single entity or event then it is an "Information" inquiry.
If you aren't sure what type of question it is, then choose an "Information" inquiry.
Format the inquiry as ("inquiry_type"{tuple_delimiter}<inquiry_type>)
Use **{record_delimiter}** as the delimiter after each entity

4. Return output in English as a single list of all the entities, key words, and inquiry type identified in steps 1, 2, and 3.

5. When finished, output {completion_delimiter}
<|eot_id|><|start_header_id|>user<|end_header_id|>
Entity_types: {entity_types}
Text: {input_text}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""
PROMPTS["RAG_QUESTION"] = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are excellent at answering questions, and it makes you happy when you provide the correct answer.
Consider the additional information when creating your response.  Create a narrative paragraph answering the question below. After the narrative paragraph, list any relevant facts 
you used to answer as a bulleted list
Additional Information:
{additionalInfo}
Question:
{question}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""

# Here are the values we will be substituting into our stories. 

Feel free to substitute your own values as desired. Please do not abuse this to create false or misleading examples by using real names.  Run the cell when you have finished.

In [None]:
# A Date, 2 years, a month, and a period of time (with units) respectively (4 different values, not forms of the same date)
date1 = "April 1, 1977"
year1 = "1977"
year2 = "1952"
month1 = "June"
period_of_time1 = "17 millenia"

# Some company names (fictional)
company1 = "King Kong Brands"
company2 = "Purple Ventures" 
company3 = "Intergalactic Management"
company4 = "Loads'o'Money Corp."
company5 = "BrianCorp"

# Some products a company might sell
product1 = "pickled beets"
product2 = "orangutan cages"
product3 = "deviled eggs"

# A fictional brand name
brand1 = "Snippits"   

# Some monetary amounts (preferrably with currency)
amount1 = "$17 billion" 
amount2 = "15.67 euros"
amount3 = "$100 million"
amount4 = "17 gold dubloons"
amount5 = "74 shekels"

# a business segment
business_segment1 = "nails and fasteners"

# a country name (real or fictional)
country1 = "Wakanda"
country2 = "Atlantis"

# a disease (real or fictional)
disease1 = "Smallpox"

# a city (real or fictional)
city1 = "Smorgasbord"

# a state (real or fictional)
state1 = "New Florida"

# a person's name (fictional)
person_name = "Rick Roll"


### Here we are generating 4 fictional financial "news" stories substituting the phrases above.

Let's take a quick look at those stories after the phrases are substituted, or skip to the next cell for the `Too Long; Didn't Read` (TL;DR) version.

In [None]:
text_chunks = [
f"""On {date1}, we ({company2}) completed the acquisition of {company1}, a leader in the production and co-packing of {product1} and ready-to-eat {product2}, and former co-manufacturer of the {brand1} brand. The initial cash consideration paid for {company1} totaled {amount1} and consisted of cash on hand and short-term borrowings. Acquisition-related costs for the {company1} acquisition were immaterial.
The acquisition has been accounted for as a business combination and, accordingly, {company1} has been included within the {business_segment1} segment from the date of acquisition. The purchase consideration was allocated to assets acquired and liabilities assumed based on their respective fair values and consisted of {amount2} to goodwill, {amount3} to property, plant and equipment, net and {amount4} to other net assets acquired. The purchase price allocation has been finalized as of the fourth quarter of {year1} and did not include measurement period adjustments.
Goodwill was determined as the excess of the purchase price over the fair value of the net assets acquired. The goodwill derived from this acquisition is deductible for tax purposes and reflects the value of leveraging our supply chain capabilities to accelerate growth and access to our portfolio of {product1} products""",
f"""{capitalize(product1)} companies such as {company1} had to flip from sending bulk volumes to schools and restaurants to feeding people working from home who suddenly had time for breakfast. Finding enough paperboard packaging for {product1} became a constraint.
With families staying home and limiting supermarket trips, the pandemic boosted sales of {company1}'s {product1}, {product2}, and {product3}. 
For the multinational, food consumed at home more than offset declines in on-the-go channels. While growth had moderated by {month1}, {company1}’s sales over the first nine months of {year1} increased 7% over the year-ago period, to {amount5}, excluding the effects of divestitures and currency rate fluctuations.
Packaging became a bottleneck, as the region’s {country1} supplier, {company3}, ran short due to shipping delays caused by the {disease1} pandemic. The procurement team “scoured the world” for a new source of paperboard and found an alternative supplier next door in {country2}. Lower transportation costs helped to make up for the higher cost of the {country2} paperboard.""",
f"""{capitalize(company2)}, a national {product1} retailer whose roots go back more than {period_of_time1}, said Monday that it has declared bankruptcy and will sell itself to a competitor. 
The {city1}-based company filed for Chapter 11 protection from its debts in the U.S. Bankruptcy Court for the District of {state1}. As part of the filing, most of the privately held retailer's assets will be acquired by {business_segment1} rival {company4}. {company2}, which was founded in {year2}, said it will continue providing independently owned retailers with products.
"We believe that entering the process with an agreed offer from {company4}, who has a similar {period_of_time1} history in the {business_segment1} space and also operates with a focus on supporting members and helping them grow, is the most beneficial next step for {company2} and our associates, customers and vendor partners," {company2} CEO {person_name} said in a statement.""",
f"""The CEO of a prominent {business_segment1} firm was arrested in {country2} today on charges of bribing a government official. The charges date back to the supply chain woes during the {disease1} pandemic. {person_name} will make his initial appearance in court later this week. The markets were roiled by the fear of what effect this will have on his company and the bankruptcy rumors surrounding it."""
]

pprint(json.dumps(text_chunks, indent=4))

## TL;DR
Run this function to get a summary of the key facts that will be relevant later.

In [None]:
generate_summary(company1, company2, brand1, product1, country2, company4, person_name)

## Let's choose a version of Meta Llama 3.2 to use here

In [None]:
# these are US-based inference endpoints.  Change them accordingly for your region.
class SupportedModels(Enum):
    META_LLAMA32_90B = "us.meta.llama3-2-90b-instruct-v1:0"
    META_LLAMA32_11B = "us.meta.llama3-2-11b-instruct-v1:0"
    META_LLAMA32_1B  = "us.meta.llama3-2-1b-instruct-v1:0"
    META_LLAMA32_3B  = "us.meta.llama3-2-3b-instruct-v1:0"
    
model_id = SupportedModels.META_LLAMA32_90B.value



### Next let's ask a question of the LLM using traditional RAG methods

In [None]:
question = f"Did any personal events contribute to the downfall of {company2}?"
question

### The RAG Approach

RAG involves: 
1. taking chunks of a document and encoding them as a representative vector of numbers.  
2. Then the question being asked is also encoded as a vector of numbers.
3. Next, we retrieve a number of the chunks (1) that are most similar to the question (2) using fancy math called cosine similarity.
4. Finally, we add the chunks of the documents into the LLM prompt with the question so it has more context to answer.

Let's see this in action...

In [None]:
chunks_dataframe = calculate_embeddings_batch(text_chunks, bedrock_client)

question_embedding = calculate_embedding(question, bedrock_client)

similarity = calculate_similarity_to_question(chunks_dataframe, question_embedding)
chunks_dataframe["similarity"] = similarity.reshape(-1,1)

ordered_chunks = chunks_dataframe.nlargest(4, 'similarity')

ordered_chunks

You can see above the 4 stories, or chunks, ordered by the similarity to our question. One of the challenges is figuring out where the cutoff is between similar enough and not. We'll take the top 50% here, but that won't scale well for a large set of documents, so you may need to adjust accordingly.  So let's take those top 2 documents and inject them into the prompt with our question and let Llama give us an answer.

In [None]:
context["additionalInfo"] = "\n".join((ordered_chunks["Text"].iloc[0],ordered_chunks["Text"].iloc[1]))
context["question"] = question

rag_answer = run_llm(question,context, bedrock_client, model_id, PROMPTS["RAG_QUESTION"])
print(rag_answer)

### What do you think of the answer? 

The answer should feel well researched, but it is missing important information if you read the documents closely. Regardless, it would be interesting to see why it was determined those first two documents are so relevant to the question. Let's take a peak behind the curtain at the similarity between the question and just the most relevant document.

In [None]:
print_vectors_side_by_side(ordered_chunks["Embeddings"].iloc[0], question_embedding[1])

### Uhhh...

I really wish each of those dimensions came with an explanation.  I'm not giving this high marks on traceability and explainability.

Now let's take a look at what GraphRAG is.

### GraphRAG

GraphRAG is a little different than RAG.  It involves: 
1. Asking an LLM to extract entities and relationships connecting those entities from each text chunk, instead of just encoding the entire thing as numbers.
2. Those entities (nodes) and relationships (edges) are then loaded into a graph.
3. Then the LLM is asked to extract the entities from the question as well.
4. Then we run a query in the graph database to get back all of the entities and relationships that are relevant to the entities from (3) 
5. Finally, we add the relevant entities and relationships into the LLM prompt with the question so it has more context to answer.

Let's see this in action...first, let's do the one-time steps to load all of our documents into our graph

In [None]:
# These are the types of entities we want the LLM to extract
context["entity_types"] = ["organization","person","location","monetary_value","product","date"]

# Send each of the documents to the LLM (step 1)
results = run_llm_batch(text_chunks, context, bedrock_client, model_id, PROMPTS["ENTITY_EXTRACTION"])

# Get back the entities and results and convert them into nodes and edges (starting step 2)
(entities, relations) = parse_llm_output(results, context)
(nodes, edges) = format_as_neptune_load_files(text_chunks, entities, relations)

# write CSV files to load into Neptune
with open('graph_rag_demo_nodes.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow([':ID',':LABEL','name:String','description:String'])
    for node in nodes:
        writer.writerow([node["id"], node["label"], node["name"], node["description"]])

with open('graph_rag_demo_edges.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow([':ID',':START_ID',':END_ID',':TYPE', 'description:String','weight:Float','keywords:String'])
    for edge in edges:
        writer.writerow([str(uuid.uuid4()), edge["from_id"],edge["to_id"],edge["label"],edge["description"],edge["weight"],edge["keywords"]])


### Copy those files into S3 so Neptune can access them

In [None]:
%%bash -s {s3_bucket}

aws s3 cp ./graph_rag_demo_nodes.csv $1
aws s3 cp ./graph_rag_demo_edges.csv $1

### Load the files into Neptune

Again, substitute your S3 bucket name below where it says `$BUCKET-NAME-HERE$`

In [None]:
%%oc 

CALL neptune.load({format: "opencypher", 
                   source: "s3://$BUCKET-NAME-HERE$", 
                   region : "us-east-1"})

### Let's take a look at what this graph looks like

I think you'll find this much more understandable than the RAG comparison.  Click on the Graph tab in the results window and take a look at our nodes and relationships.

In [None]:
%%oc -d name -l 50 -rel 50

MATCH p=(n)-[]-(n1)
RETURN p

## Now let's run our GraphRAG query

Again, we are asking the LLM to extract the entity types below from our question, we find those nodes in our graph along with those related to it, and we add all this information into our LLM prompt and ask it to create a narrative paragraph summarizing the information.

In [None]:
context["entity_types"] = ["organization","person","location","monetary_value","product","date"]

response_text = run_llm(question, context, bedrock_client, model_id, PROMPTS["QUESTION_EXTRACTION"])
(search_entities, search_type, search_keywords) = parse_question(response_text, context)

query_template = get_neptune_query_template(search_type)
facts = query_facts_from_neptune(search_entities, query_template)

prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are excellent at answering questions, and it makes you happy when you provide the correct answer.
Consider the list of facts supplied in JSON format when creating your response.  Create a narrative paragraph answering the question below. After the narrative paragraph, list any relevant facts 
you used to answer as a bulleted list
Facts:
{json.dumps(facts)}
Question:
{question}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

graphrag_answer = run_llm(question,context, bedrock_client, model_id, prompt, do_formatting=False)
print(graphrag_answer)

### So let's compare the answers

In [None]:
print("RAG Answer:\n")
print(rag_answer)
print("\n--------------\n")
print("GraphRAG Answer:")
print(graphrag_answer)

### Key things to notice:
1. The RAG answer has more "filler" words and detail because we are passing the entire document into the prompt, not just the facts.  This is especially important here because everything is fabricated, so it cannot draw upon its own training data to add color.

2. The GraphRAG answer includes details that don't seem relevant when looking at just the wording of the question.  For example, the CEO of the company was arrested for bribery in one of the countries where it does business.  If we look back at that news story, it doesn't mention the company at all and therefore had a least relevant score.

In [None]:
query_params = {"entity_name": person_name}

### Which chunks did it use exactly?  

The easier traceability and explainability lets us easily identify them.

In [None]:
%%oc -d name -l 50 -rel 50 -qp query_params

MATCH p=(entity1:PERSON)-[event*0..1]-(entity2)-[event2:hasChunk]->(chunk:Chunk)
WHERE $entity_name = entity1.name
RETURN p

### Can we combine them?

Certainly.  Here we are capturing all of the chunks associated to the relevant entities in our graph. Is the answer better or does the presence of too much information lead the LLM astray...I think that is up to the reader's interpretation.

In [None]:
raw_text = "\n".join(query_chunks_from_graph(search_entities))

prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are excellent at answering questions, and it makes you happy when you provide the correct answer.
Consider both the list of facts supplied in JSON format when creating your response, as well as text in the additional information section.  Create a narrative paragraph answering the question below. After the narrative paragraph, list any relevant facts 
you used to answer as a bulleted list
Facts:
{json.dumps(facts)}
Raw text:
{raw_text}
Question:
{question}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

blended_answer = run_llm(question,context, bedrock_client, model_id, prompt, do_formatting=False)
print(blended_answer)

## Clean up

<div class="alert alert-block alert-warning">Please remember to delete your Neptune Analytics graph, remove the files from the S3 bucket, and remove your Neptune Notebook instance so you do not have recurring charges from this demonstration.</div>

## Next steps

In this notebook we demonstrated a scenario where GraphRAG gave a better answer than standard RAG. It is important to recognize that not every scenario will have the same result. If the wording of the last news story or the question was different (e.g., the story mentioned the company the CEO worked for, or the question mentioned the CEO by name), then RAG would likely have picked up on the arrest as well. It is also important to recognize the inherent nature of large language models (LLMs) being probabilistic and trained on a general corpus of words means that it is entirely possible to insert words that the LLM won't recognize as an entity and therefore the results will not be meaningful. Finally, note that running Graph RAG scenarios is significantly more expensive today.