# BYOKG RAG Demo
This notebook demonstrates a Retrieval-Augmented Generation (RAG) system built on top of a knowledge graph. The system accepts natural language questions, retrieves relevant information from the graph, and uses it to generate answers.

The system is made up of the following components:

1. **Graph Store**: Neptune DB Cluster endpoint for the graph structure
2. **KG Linker**: Links natural language queries to graph entities and paths
3. **Entity Linker**: Matches entities from text to graph nodes
4. **Triplet Retriever**: Retrieves relevant triplets from the graph
5. **Path Retriever**: Finds paths between entities in the graph
6. **Query Engine**: Orchestrates all components to answer questions

#### Setup
If you haven't already, install the toolkit and its dependencies by following the instructions in [README.md](../../byokg-rag/README.md).
Then validate that the package is installed correctly.

In [None]:
# !pip install https://github.com/awslabs/graphrag-toolkit/archive/refs/tags/v3.12.0.zip#subdirectory=byokg-rag

In [None]:
from graphrag_toolkit.byokg_rag.graphstore import NeptuneDBGraphStore

### Graph Store
The `NeptuneDBGraphStore` class provides an interface to a Neptune database cluster. If you already have a Neptune database cluster you want to use, change the cell below to assign your cluster endpoint to the `graph_db_endpoint_url` variable.

If you don't already have a Neptune database, refer to the documentation on how to [create a Neptune DB cluster](https://docs.aws.amazon.com/neptune/latest/userguide/get-started-create-cluster.html) and [load graph data into the cluster](https://docs.aws.amazon.com/neptune/latest/userguide/load-data.html).

After creating the cluster you should be able to access the cluster endpoint via the Neptune console or the AWS CLI by running `aws neptune describe-db-clusters`. Change the cell below to assign `graph_db_endpoint_url` variable to the Neptune DB cluster endpoint `https://{endpoint_url}:{port}`.
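
As a minimal sketch of the expected format, the helper below builds the URL from a host name and a port. `build_neptune_endpoint_url` is a hypothetical name and is not part of the toolkit; the host name is a placeholder, and 8182 is used as the default because it is Neptune's default port.

```python
from urllib.parse import urlparse


def build_neptune_endpoint_url(host: str, port: int = 8182) -> str:
    """Return an endpoint URL in the form https://<host>:<port>.

    Hypothetical helper for illustration; it is not part of graphrag-toolkit.
    """
    host = host.strip()
    # Reject values that already carry a scheme, a port, a path, or whitespace.
    if not host or any(ch in host for ch in ":/ "):
        raise ValueError("host must be a bare DNS name without scheme, port, or path")
    url = f"https://{host}:{port}"
    parsed = urlparse(url)
    # Sanity-check that the pieces round-trip through the URL parser.
    if parsed.hostname != host.lower() or parsed.port != port:
        raise ValueError(f"could not build a valid endpoint URL from {host!r}")
    return url


# Placeholder host name; substitute the endpoint of your own cluster.
print(build_neptune_endpoint_url("my-cluster.cluster-example.us-west-2.neptune.amazonaws.com"))
```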

To run the rest of the notebook, you'll need to ensure that the environment is in the same VPC as the Neptune DB cluster and can access the cluster. You also need to make sure that the environment has the right [IAM permissions](https://docs.aws.amazon.com/neptune/latest/userguide/security-iam-access-manage.html) to read data from the database. The [read-only managed policy](https://docs.aws.amazon.com/neptune/latest/userguide/read-only-access-iam-managed-policy.html) is sufficient for this. 

If you are also using the example dataset, you will need S3 IAM read permissions so that `graph_store.read_from_csv` can access data from `s3://aws-neptune-customer-samples-*/*`. If you're using your own dataset, you also need to grant write access so that `read_from_csv` can upload your CSV file to an S3 location you specify, from which the Neptune DB bulk loader ingests it.

You will also need to set up Bedrock model access, if you haven't already, along with IAM permissions to invoke Bedrock models from the environment running this notebook.

In the rest of the notebook, we:
1. Initialize the BYOKG graph store to use a Neptune DB Cluster
2. Optionally, load example data from a CSV file into a new graph
3. Run the BYOKG retrieval functions and QueryEngine on a sample question

In [None]:
region = "us-west-2"  # replace with your AWS region
graph_db_endpoint_url = "<>"  # replace with your cluster endpoint, in the form "https://<cluster_endpoint>:<port>"

In [None]:
graph_store = NeptuneDBGraphStore(endpoint_url=graph_db_endpoint_url,
                                  region=region)

### Loading Data

If you created a new empty Neptune DB cluster then uncomment the code cell below to load data into the cluster. The data we are loading is a KG in property graph format with information about AWS blog posts on Neptune and Neptune Analytics.

See [this example notebook](https://github.com/aws/graph-notebook/blob/main/src/graph_notebook/notebooks/01-Neptune-Database/03-Sample-Applications/02-Knowledge-Graphs/Building-a-Knowledge-Graph-Application-openCypher.ipynb) for more details on the dataset.

#### Note

The BYOKG `NeptuneDBGraphStore` only supports Neptune DB clusters that hold data in property graph format. Graphs in RDF format are not yet supported.

In [None]:
# role = "<>"  # replace with the ARN of an IAM role that the Neptune bulk loader can assume
# graph_store.read_from_csv(s3_path=f"s3://aws-neptune-customer-samples-{region}/sample-datasets/gremlin/KG/", iam_role=role)

In [None]:
# Print graph schema
import json

schema = graph_store.get_schema()
print(json.dumps(schema, indent=4))

To customize how nodes in the graph are referred to, we can tell the graph store which property to use as the text representation of each node type.

Using the graph schema printed above, we choose a property for each node label as shown below:

In [None]:
text_repr_prop_for_node = {
    "organization": "text",
    "author": "name",
    "title": "text",
    "commercial_item": "text",
    "tag": "tag",
    "location": "text",
    "post": "title",
}
graph_store.assign_text_repr_prop_for_nodes(text_repr_prop_for_node)
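
To make the effect of this mapping concrete, here is a small sketch of the idea. It is not the toolkit's implementation: the `text_repr` helper, the node dictionary layout, and the fallback to the node id are all assumptions made for illustration.

```python
def text_repr(node: dict, prop_for_label: dict) -> str:
    """Pick the text shown for a node based on a label -> property mapping.

    Hypothetical helper; assumes a node looks like
    {"id": ..., "label": ..., "properties": {...}}.
    """
    prop = prop_for_label.get(node["label"])
    properties = node.get("properties", {})
    if prop is not None and prop in properties:
        return str(properties[prop])
    # Assumed fallback: use the node id when no text property is available.
    return str(node["id"])


# Toy nodes for illustration only; they are not read from the graph.
mapping = {"author": "name", "post": "title"}
author_node = {"id": "n1", "label": "author", "properties": {"name": "Dave Bechberger"}}
unmapped_node = {"id": "n2", "label": "unknown_label", "properties": {}}

print(text_repr(author_node, mapping))
print(text_repr(unmapped_node, mapping))
```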

### Question Answering

We define a sample question to test our system. The question requires reasoning through multiple hops in the knowledge graph to find the answer.

In [None]:
# set a question to test BYOKG
question = "Who is the author of post on migrating from blazegraph to amazon neptune"

### KG Linker
The `KGLinker` uses an LLM (Claude 3.5 Sonnet) to:
1. Extract entities from the question
2. Identify potential relationship paths in the graph
3. Generate initial responses based on its knowledge

In [None]:

from graphrag_toolkit.byokg_rag.graph_connectors import KGLinker
from graphrag_toolkit.byokg_rag.llm import BedrockGenerator

# Initialize the LLM
llm_generator = BedrockGenerator(
    model_name='us.anthropic.claude-3-5-sonnet-20240620-v1:0',
    region_name='us-west-2')

kg_linker = KGLinker(graph_store=graph_store, llm_generator=llm_generator)
response = kg_linker.generate_response(
    question=question,
    schema=schema,
    graph_context="Not provided. Use the above schema to understand the graph."
)


In [None]:
artifacts = kg_linker.parse_response(response)
artifacts
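
Later cells read the `entity-extraction`, `path-extraction`, and `draft-answer-generation` keys from `artifacts`. The sketch below checks that those keys are present and non-empty before continuing. It assumes `artifacts` is a dictionary that maps each key to a list of strings, which is inferred from how it is used in this notebook; `missing_artifact_keys` and the sample dictionary are hypothetical.

```python
# Keys that later cells in this notebook read from `artifacts`.
EXPECTED_ARTIFACT_KEYS = ("entity-extraction", "path-extraction", "draft-answer-generation")


def missing_artifact_keys(artifacts: dict) -> list:
    """Return the expected keys that are absent or empty in `artifacts`."""
    return [key for key in EXPECTED_ARTIFACT_KEYS if not artifacts.get(key)]


# Stand-in value for illustration; in the notebook, pass the real `artifacts` instead.
sample_artifacts = {
    "entity-extraction": ["Amazon Neptune", "Blazegraph"],
    "path-extraction": [],
}
print(missing_artifact_keys(sample_artifacts))
```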

### Entity Linking
The `EntityLinker` uses fuzzy string matching to:
1. Match extracted entities to actual nodes in the graph
2. Link potential answers to graph nodes

In [None]:
from graphrag_toolkit.byokg_rag.indexing import FuzzyStringIndex
from graphrag_toolkit.byokg_rag.graph_retrievers import EntityLinker

# Add graph nodes text for string matching
string_index = FuzzyStringIndex()
string_index.add(graph_store.nodes())
retriever = string_index.as_entity_matcher()
entity_linker = EntityLinker(retriever=retriever)

linked_entities = entity_linker.link(artifacts["entity-extraction"], return_dict=False)
linked_answers = entity_linker.link(artifacts["draft-answer-generation"], return_dict=False)
linked_entities, linked_answers
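
To build intuition for fuzzy string matching, the sketch below ranks candidate node texts by similarity to a mention using Python's standard `difflib` module. This is an illustration of the general idea, not how `FuzzyStringIndex` is implemented; `fuzzy_link` and the candidate list are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_link(mention: str, node_texts: list, top_k: int = 3) -> list:
    """Return up to `top_k` (score, text) pairs, most similar first."""
    scored = [
        (SequenceMatcher(None, mention.lower(), text.lower()).ratio(), text)
        for text in node_texts
    ]
    # Higher ratio means the two strings share more matching characters.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy candidates for illustration; in the notebook these would be node texts from the graph.
candidates = ["Amazon Neptune", "Blazegraph", "Dave Bechberger"]
for score, text in fuzzy_link("amazon neptun", candidates):
    print(f"{score:.2f}  {text}")
```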

### Triplet Retrieval
The `AgenticRetriever` uses an LLM to:
1. Navigate the graph starting from linked entities
2. Select relevant relations based on the question
3. Expand those relations and decide which entities to explore next
4. Return the relevant (head->relation->tail) triplets for the question


In [None]:
from graphrag_toolkit.byokg_rag.graph_retrievers import AgenticRetriever
from graphrag_toolkit.byokg_rag.graph_retrievers import GTraversal, TripletGVerbalizer
graph_traversal = GTraversal(graph_store)
graph_verbalizer = TripletGVerbalizer()
triplet_retriever = AgenticRetriever(
    llm_generator=llm_generator, 
    graph_traversal=graph_traversal,
    graph_verbalizer=graph_verbalizer)

In [None]:
triplet_context = triplet_retriever.retrieve(query=question, source_nodes=linked_entities)
triplet_context
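
The loop that such a retriever runs can be sketched on a toy graph. This is a simplified illustration and not the `AgenticRetriever` implementation: a keyword-based function stands in for the LLM when selecting relations, and the graph, node ids, and relation names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Toy graph for illustration: head -> [(relation, tail), ...]
Graph = Dict[str, List[Tuple[str, str]]]
TOY_GRAPH: Graph = {
    "post_1": [("written_by", "author_1"), ("tagged", "tag_1")],
    "author_1": [("works_for", "org_1")],
}


def keyword_selector(question: str, relations: List[str]) -> List[str]:
    """Stand-in for the LLM: keep relations that share a word with the question."""
    words = set(question.lower().split())
    return [rel for rel in relations if words & set(rel.split("_"))]


def explore(graph: Graph, question: str, source_nodes: List[str],
            select: Callable[[str, List[str]], List[str]], max_hops: int = 2) -> List[str]:
    """Expand selected relations hop by hop and collect the matching triplets."""
    triplets: List[str] = []
    frontier = list(source_nodes)
    visited = set(frontier)
    for _ in range(max_hops):
        next_frontier = []
        for head in frontier:
            edges = graph.get(head, [])
            chosen = set(select(question, sorted({rel for rel, _ in edges})))
            for rel, tail in edges:
                if rel in chosen:
                    triplets.append(f"{head} -> {rel} -> {tail}")
                    if tail not in visited:
                        visited.add(tail)
                        next_frontier.append(tail)
        if not next_frontier:
            break  # nothing new to explore
        frontier = next_frontier
    return triplets


print(explore(TOY_GRAPH, "which author was the post written by", ["post_1"], keyword_selector))
```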

### Path Retrieval
The `PathRetriever` uses the identified metapaths and candidate answers to:
1. Retrieve actual paths in the graph following the metapath
2. Retrieve shortest paths connecting question entities and candidate answers (if any) 
3. Verbalize the paths for context
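
The first and third steps can be sketched on a toy graph. The example assumes that a metapath is a sequence of relation names, which may differ from what the toolkit expects; `follow_metapath`, the graph, and the relation names are hypothetical, and this is not the `PathRetriever` implementation.

```python
from typing import Dict, List, Tuple

# Toy graph for illustration: head -> [(relation, tail), ...]
TOY_GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "post_1": [("written_by", "author_1"), ("tagged", "tag_1")],
    "author_1": [("works_for", "org_1")],
}


def follow_metapath(graph: Dict[str, List[Tuple[str, str]]],
                    start: str, metapath: List[str]) -> List[List[str]]:
    """Return every path from `start` whose relations match `metapath` in order."""
    paths = [[start]]
    for relation in metapath:
        extended = []
        for path in paths:
            # Extend the path along each edge that carries the required relation.
            for rel, tail in graph.get(path[-1], []):
                if rel == relation:
                    extended.append(path + [rel, tail])
        paths = extended
    return paths


for path in follow_metapath(TOY_GRAPH, "post_1", ["written_by", "works_for"]):
    # Verbalize the path as a single readable string.
    print(" -> ".join(path))
```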

In [None]:
from graphrag_toolkit.byokg_rag.graph_retrievers import PathRetriever
from graphrag_toolkit.byokg_rag.graph_retrievers import GTraversal, PathVerbalizer
graph_traversal = GTraversal(graph_store)
path_verbalizer = PathVerbalizer()
path_retriever = PathRetriever(
    graph_traversal=graph_traversal,
    path_verbalizer=path_verbalizer)

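# Each extracted metapath is a string such as "a -> b -> c"; split it into its components
metapaths = [[component.strip() for component in path.split("->")] for path in artifacts["path-extraction"]]
# Also add shorter prefixes of each metapath so that partial paths can be retrieved too
shortened_paths = []
for path in metapaths:
    if len(path) > 1:
        shortened_paths.append(path[:1])  # first component only
for path in metapaths:
    if len(path) > 2:
        shortened_paths.append(path[:2])  # first two components
metapaths += shortened_paths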
path_context = path_retriever.retrieve(linked_entities, metapaths, linked_answers)
path_context
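
To see what the metapath preparation in the cell above produces, the same logic is wrapped in a standalone function below and applied to a made-up input. `expand_metapaths` is a hypothetical name, and `rel_a`, `rel_b`, and `rel_c` are placeholders rather than names from the dataset.

```python
def expand_metapaths(raw_paths: list) -> list:
    """Split "a -> b -> c" strings into lists and append their short prefixes."""
    metapaths = [[part.strip() for part in path.split("->")] for path in raw_paths]
    # One-component prefixes of every metapath with more than one component.
    prefixes = [path[:1] for path in metapaths if len(path) > 1]
    # Two-component prefixes of every metapath with more than two components.
    prefixes += [path[:2] for path in metapaths if len(path) > 2]
    return metapaths + prefixes


for metapath in expand_metapaths(["rel_a -> rel_b -> rel_c"]):
    print(metapath)
```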

Let's try answering the question now with the retrieved context from various retrieval mechanisms.

First, we create a `ByoKGQueryEngine` instance, which can invoke an LLM and generate a response from the context we have already retrieved from the graph.

In [None]:
from graphrag_toolkit.byokg_rag.byokg_query_engine import ByoKGQueryEngine

byokg_query_engine = ByoKGQueryEngine(
    graph_store=graph_store,
    kg_linker=kg_linker,
    triplet_retriever=triplet_retriever,
    path_retriever=path_retriever,
    entity_linker=entity_linker
)

We start by generating a response using the triplet context from graph traversal. If the generated answer is `Dave Bechberger`, then the triplet context provides enough information for the LLM to answer the question correctly.

In [None]:
answers, response = byokg_query_engine.generate_response(question, "\n".join(triplet_context))

print("Generated answers: ", answers)


Next, we generate a response using the path context from the path retriever.

Similarly, if `Dave Bechberger` is included in the response, then the path context is sufficient to answer the question.

In [None]:
answers, response = byokg_query_engine.generate_response(question, "\n".join(path_context))

print("Generated answers: ", answers)
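
Rather than reading the printed output by eye, the check described above can be automated. The sketch below assumes that `answers` is a list of strings, which is not confirmed by this notebook; `contains_answer` and the sample list are hypothetical.

```python
def contains_answer(answers, expected: str) -> bool:
    """Return True if `expected` appears, ignoring case, in any generated answer."""
    needle = expected.casefold()
    return any(needle in str(answer).casefold() for answer in answers)


# Stand-in value for illustration; in the notebook, pass the real `answers` instead.
sample_answers = ["Dave Bechberger"]
print(contains_answer(sample_answers, "dave bechberger"))
```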

### BYOKG RAG Pipeline

We can also use the `ByoKGQueryEngine` to run the whole pipeline through a single `query` call, followed by answer generation, in order to:
1. Process natural language questions
2. Retrieve relevant context from the graph
3. Generate answers based on the retrieved information

In [None]:
from graphrag_toolkit.byokg_rag.byokg_query_engine import ByoKGQueryEngine

byokg_query_engine = ByoKGQueryEngine(
    graph_store=graph_store,
    kg_linker=kg_linker,
    triplet_retriever=triplet_retriever,
    path_retriever=path_retriever,
    entity_linker=entity_linker
)

retrieved_context = byokg_query_engine.query(question)
answers, response = byokg_query_engine.generate_response(question, "\n".join(retrieved_context))

print(answers)
print(response)