# KGC 2024 MasterClass: Generating and analyzing knowledge graphs using GenAI and Neptune Analytics
# Notebook 2: Set Up LlamaIndex

This notebook creates a LlamaIndex graph store and vector store of press release data in the Neptune Analytics graph. It coexists with the organizational knowledge graph and related Comprehend extraction results. 

Here is our data model.

<img src="images/kgc_model.png">

LlamaIndex objects are colored pink. They are mostly independent of the data we loaded in the previous notebook. Their only link is that the DOCUMENT node (white box) representing a press release is linked to Chunk nodes created by the LlamaIndex vector store.

The next figure depicts our design.

<img src="images/kgc_design.png">

The LlamaIndex portion is shown in the upper third of the figure. 

To run this notebook you need a Neptune Analytics graph that is accessible from this notebook instance. You also need an S3 bucket in the same region. We will stage chunk-to-document links in that bucket and batch-load them into the Neptune Analytics graph. See README.md for detailed setup instructions.

## What is LlamaIndex?

LlamaIndex is a framework for building RAG-based LLM applications, that is, applications that apply LLMs on top of your private or domain-specific data. Some popular use cases include the following:

* Question-Answering Chatbots (commonly referred to as RAG systems, which stands for "Retrieval-Augmented Generation")
* Document Understanding and Extraction
* Autonomous Agents that can perform research and take actions

LlamaIndex provides an opinionated approach and tools to build these applications, from ingestion and indexing through querying and evaluation.


![image.png](attachment:f5a06b3b-f267-457a-9441-09a08f5d4e53.png)

### Stages within LlamaIndex RAG applications

There are five key stages LlamaIndex goes through when building a complete RAG application: loading, indexing, storing, querying, and evaluation.

![image.png](attachment:e3a4040c-420c-4a48-9546-0be9bb4ebf80.png)
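
To make the five stages concrete, here is a deliberately tiny sketch in plain Python. It does not use LlamaIndex: the `embed` function is a bag-of-words stand-in for a real embedding model, the sample documents and questions are invented, and the query stage stops at retrieval (a real RAG application would also pass the retrieved text to an LLM).

```python
import json
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# 1. Loading: bring raw documents into memory (invented sample text).
docs = {
    "doc1": "amazon opens a fulfillment center in mississippi",
    "doc2": "new kindle device announced with longer battery life",
}

# 2. Indexing: compute one embedding per document.
index = {doc_id: embed(text) for doc_id, text in docs.items()}

# 3. Storing: serialize the index and read it back, so it need not be rebuilt.
stored = json.dumps({doc_id: dict(vec) for doc_id, vec in index.items()})
index = {doc_id: Counter(vec) for doc_id, vec in json.loads(stored).items()}


# 4. Querying: retrieve the document most similar to the question.
def retrieve(question):
    q = embed(question)
    return max(index, key=lambda doc_id: cosine(q, index[doc_id]))


# 5. Evaluation: compare retrieval results with the expected documents.
expected = {"fulfillment center in mississippi": "doc1", "kindle battery": "doc2"}
accuracy = sum(retrieve(q) == d for q, d in expected.items()) / len(expected)
print(accuracy)
```

In the rest of this notebook, LlamaIndex and Neptune Analytics take over these roles: real embeddings replace `embed`, and the graph replaces the in-memory dictionary.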

## Setup

### Install LlamaIndex libraries
We first need to ensure that we have all the LlamaIndex libraries we need installed.

In [2]:
pip install -q llama-index llama-index-vector-stores-neptune llama-index-graph-stores-neptune  llama-index-llms-bedrock llama-index-embeddings-bedrock

Note: you may need to restart the kernel to use updated packages.


### Get the Graph Configuration

In [3]:
import graph_notebook as gn
config = gn.configuration.get_config.get_config()

region = config.aws_region
graph_identifier=config._host.split(".")[0]
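
The cell above takes the graph identifier from the first label of the configured host name. The sketch below shows the same idea as a standalone function. It assumes the host follows the layout `<graph-id>.<region>.neptune-graph.amazonaws.com`; the function name and the example host are hypothetical.

```python
def parse_graph_host(host):
    """Split an endpoint host name into (graph_identifier, region).

    Assumes the layout '<graph-id>.<region>.neptune-graph.amazonaws.com'.
    """
    parts = host.split(".")
    if len(parts) < 2 or not parts[0]:
        raise ValueError(f"Unexpected host format: {host!r}")
    return parts[0], parts[1]


# Hypothetical host name, for illustration only.
example_host = "g-abc123xyz.us-east-1.neptune-graph.amazonaws.com"
print(parse_graph_host(example_host))
```

If your host name has a different shape, set `graph_identifier` by hand instead of relying on the split.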

### Imports and Global Settings

Below we import several of the common libraries used by LlamaIndex and set up the LLM and embedding model we will use. For this example we use Claude 3 Sonnet as our LLM and Titan Embeddings V1 to generate our embeddings, both hosted on Amazon Bedrock. We also load one press release from the `rawtext` directory as the document to index.

In [4]:
import os
from llama_index.llms.bedrock import Bedrock
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.core import StorageContext, VectorStoreIndex, KnowledgeGraphIndex, Settings, load_index_from_storage
from llama_index.core import SimpleDirectoryReader

# define LLM
llm = Bedrock(model="anthropic.claude-3-sonnet-20240229-v1:0", 
    model_kwargs={"temperature": 0})
embed_model = BedrockEmbedding(model="amazon.titan-embed-text-v1")

# Set global LLM settings
Settings.llm = llm
Settings.embed_model = embed_model

documents = SimpleDirectoryReader(input_files = ["rawtext/c60e5356-635f-4dee-b6c9-c6e2af59c83c.txt"]).load_data()

## Loading, Indexing, and Storing

### Build LlamaIndex VectorStoreIndex

Now that we have set up all our shared dependencies, let's see what it looks like to build a VectorStoreIndex over this data.

In [5]:
from llama_index.vector_stores.neptune import NeptuneAnalyticsVectorStore
VECTOR_PERSIST_DIR = '/tmp/storage/vss'


def load_vector_index(vector_store):
    # check if vss storage already exists
    if not os.path.exists(VECTOR_PERSIST_DIR):        
        storage_context = StorageContext.from_defaults(vector_store=vector_store)   
        vss_index = VectorStoreIndex.from_documents(
            documents,
            storage_context=storage_context,
            include_embeddings=True,
            show_progress=True,
        )

        # persistent storage
        vss_index.storage_context.persist(persist_dir=VECTOR_PERSIST_DIR)
    else:
        # load the existing index
        print("Loading VSS Index")
        storage_context = StorageContext.from_defaults(
            persist_dir=VECTOR_PERSIST_DIR, vector_store=vector_store
        )
        vss_index = load_index_from_storage(storage_context)
    
    return vss_index

# Define Vector Store
vector_store = NeptuneAnalyticsVectorStore(graph_identifier=graph_identifier, embedding_dimension=1536)   
# Load the Vector Index
vector_index = load_vector_index(vector_store)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/1 [00:00<?, ?it/s]

Nodes added
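
`load_vector_index` follows a build-once, load-afterwards pattern: if the persist directory is missing it builds and persists the index, otherwise it loads what is already there. The sketch below isolates that pattern with plain JSON files in a temporary directory. The names `load_or_build` and `index.json` are hypothetical and are not part of LlamaIndex.

```python
import json
import os
import tempfile


def load_or_build(persist_dir, build):
    """Return (data, was_built); call build() only when persist_dir is absent."""
    path = os.path.join(persist_dir, "index.json")
    if not os.path.exists(persist_dir):
        data = build()
        os.makedirs(persist_dir)
        with open(path, "w") as f:
            json.dump(data, f)
        return data, True
    with open(path) as f:
        return json.load(f), False


root = tempfile.mkdtemp()
persist_dir = os.path.join(root, "vss")

# First call builds and persists; second call ignores its build function.
first = load_or_build(persist_dir, lambda: {"chunks": 1})
second = load_or_build(persist_dir, lambda: {"chunks": 999})
print(first, second)
```

The second call returns the persisted data even though its build function would produce something different, which is why the persisted directory has to be deleted before a rebuild takes effect.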


If you want to rerun the import, first run the cell below to delete the persisted index.

In [1]:
!rm -rf /tmp/storage

#### Examining the Workflow

![image.png](attachment:e53ccfd4-aeb3-4b81-83ef-a6fa1a6db0c3.png)

#### Examining the Vector Store

In [6]:
%%oc

MATCH (n:Chunk) 
CALL neptune.algo.vectors.get(n)
YIELD embedding
RETURN n.file_name, id(n), n.text, embedding, n
LIMIT 1


### Build LlamaIndex KnowledgeGraphIndex

Let's compare that to what it looks like to build a KnowledgeGraphIndex over this data.

In [8]:
from llama_index.graph_stores.neptune import NeptuneAnalyticsGraphStore
KG_PERSIST_DIR = '/tmp/storage/kg'

def load_kg_index(graph_store):
    # check if kg storage already exists
    if not os.path.exists(KG_PERSIST_DIR):
        storage_context = StorageContext.from_defaults(graph_store=graph_store)
        print("Creating KG Index")
        kg_index = KnowledgeGraphIndex.from_documents(
            documents,
            storage_context=storage_context,
            include_embeddings=True,
            show_progress=True,
        )

        # persistent storage
        kg_index.storage_context.persist(persist_dir=KG_PERSIST_DIR)
    else:
        # load the existing index
        print("Loading KG Index")
        storage_context = StorageContext.from_defaults(
            persist_dir=KG_PERSIST_DIR, graph_store=graph_store
        )
        kg_index = load_index_from_storage(storage_context)

    print("KG Index Loading Complete")
    return kg_index

# Define Graph Store
graph_store = NeptuneAnalyticsGraphStore(graph_identifier=graph_identifier)
# Load the KG Index
kg_index = load_kg_index(graph_store)

Creating KG Index


Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Processing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/10 [00:00<?, ?it/s]

KG Index Loading Complete


#### Examining the Workflow

![image.png](attachment:cfd5ab56-a9b2-48b0-b468-75ac3caa5ef2.png)

#### Examining the Graph Store

In [11]:
%%oc -d id -l 20

MATCH triple=(s:Entity)-[p]->(o)
RETURN triple
LIMIT 100


## Querying and Evaluation

Let's take a look at the Chatbot and then come back here to discuss how it works.

![image.png](attachment:dc8a6d2c-9fec-484d-8df8-ddda9e6bd9b4.png)


![image.png](attachment:44bfc4c7-4c0b-419c-a533-00691bbc6b49.png)

![image.png](attachment:1ba3a68d-dc54-4cb7-98f0-2f0e2868770a.png)

## Extras

In [None]:
from IPython.display import Markdown, display
query_engine = kg_index.as_query_engine()
response = query_engine.query("Does Amazon have a fulfillment center in Mississippi?")
display(Markdown(f"<b>{response}</b>"))

In [12]:
from IPython.display import Markdown, display
query_engine = vector_index.as_query_engine()
response = query_engine.query("Does Amazon have a fulfillment center in Mississippi?")
display(Markdown(f"<b>{response}</b>"))

<b>Yes, according to the context information provided, Amazon does have a fulfillment center in Mississippi. The announcement states that Amazon plans to open its second Mississippi fulfillment center, which will be located in DeSoto County. The new facility is expected to create 500 new, full-time jobs in the area.</b>

### Get stats to see node and edge types

In [None]:
%summary pg --detailed

### Explore the graph store
List the triples (subject, predicate, object) that the KnowledgeGraphIndex wrote to the graph.

In [None]:
%%oc

MATCH (s:Entity)-[p]->(o)
RETURN s.id, type(p), o.id
LIMIT 100
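
The query above returns rows of subject, predicate, and object. As a rough illustration of how such triples can be walked as a graph, the sketch below indexes a few invented triples by subject and collects everything reachable within a fixed number of hops. The triples, the `neighbors` function, and the depth of 2 are assumptions for illustration, not a description of how KnowledgeGraphIndex retrieves context.

```python
from collections import defaultdict

# Invented triples in (subject, predicate, object) form.
triples = [
    ("Amazon", "OPENS", "Fulfillment center"),
    ("Fulfillment center", "LOCATED_IN", "DeSoto County"),
    ("DeSoto County", "PART_OF", "Mississippi"),
]

# Index outgoing edges by subject for fast lookup.
outgoing = defaultdict(list)
for s, p, o in triples:
    outgoing[s].append((p, o))


def neighbors(entity, depth=2):
    """Collect the triples reachable from entity within `depth` hops."""
    found, frontier = [], [entity]
    for _ in range(depth):
        next_frontier = []
        for s in frontier:
            for p, o in outgoing.get(s, []):
                found.append((s, p, o))
                next_frontier.append(o)
        frontier = next_frontier
    return found


print(neighbors("Amazon"))
```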

### Look at the chunks in the vector store

In [None]:
%%oc

MATCH (n:Chunk) 
CALL neptune.algo.vectors.get(n)
YIELD embedding
RETURN n.file_name, id(n), n.text, embedding
LIMIT 20


### Do vector similarity search on vector store

In [None]:
embedding = embed_model.get_text_embedding("kindle")
embparams={'emb': embedding}

In [None]:
%%oc -qp embparams

WITH $emb as emb
CALL neptune.algo.vectors.topKByEmbedding(emb)
YIELD embedding, node, score
RETURN id(node), node.file_name, node.text
LIMIT 20
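
Conceptually, a top-K search scores every stored vector against the query vector and keeps the best matches. The sketch below shows this with cosine similarity over three invented 3-dimensional vectors. It is a brute-force illustration only: the metric and search strategy that Neptune Analytics uses internally are not covered in this notebook and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k=2):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query, v), node_id) for node_id, v in vectors.items()]
    scored.sort(reverse=True)
    return [(node_id, round(score, 3)) for score, node_id in scored[:k]]


# Invented vectors; real Titan embeddings here have 1536 dimensions.
vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.8, 0.6, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.0, 0.0], vectors))
```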


### Find chunks similar to a specific chunk

In [None]:
%%oc 

MATCH(n:Chunk {`~id`: "50e02811-229c-47c1-a241-7317183ab6d1"})
CALL neptune.algo.vectors.topKByNode(n)
YIELD node, score
WHERE n.file_name <> node.file_name
RETURN score, id(n) as sourceNodeId, n.file_name as sourceNodeFile, 
id(node) as matchedNodeId, node.file_name as matchedNodeFile, 
n.text as sourceText, node.text as matchedText
ORDER BY score desc
LIMIT 20

### Link Comprehend documents to chunks created by the LlamaIndex vector store

First, list each chunk together with the document file it came from.

In [None]:
%%oc --store-to chunks_per_file

MATCH(c:Chunk)
RETURN id(c) as cid, c.file_name as docfile


### Build a CSV of edges linking chunks to docs

In [None]:
!mkdir -p graphdata

In [None]:
import csv

# write edges from chunks to document node

with open('graphdata/chunk2doc.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['~id','~from','~to','~label'])

    for res in chunks_per_file['results']:
        chunk_node_id = res['cid']
        docfile = res['docfile']
        docid = docfile.split(".")[0]
        writer.writerow([f"ce_{chunk_node_id}", chunk_node_id, docid, "belongsToDoc"])
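
To preview what the edge file looks like without touching the graph or the file system, the sketch below applies the same row logic to one invented query result and writes to an in-memory buffer. The function name `chunk_edges_csv` and the sample IDs are hypothetical.

```python
import csv
import io


def chunk_edges_csv(results):
    """Render chunk-to-document edges as CSV text with ~id, ~from, ~to, ~label columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    for res in results:
        # The document node ID is the file name without its extension.
        doc_id = res["docfile"].split(".")[0]
        writer.writerow([f"ce_{res['cid']}", res["cid"], doc_id, "belongsToDoc"])
    return buffer.getvalue()


# Invented sample row, shaped like the results stored in chunks_per_file.
sample = [{"cid": "chunk-1", "docfile": "doc-42.txt"}]
print(chunk_edges_csv(sample))
```

Each edge runs from the chunk node (`~from`) to the document node (`~to`), which matches the direction used by the verification query later in this notebook.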


### Copy graphdata files to S3 so we can load to Neptune

In [None]:
# Replace with the name of your own S3 bucket (same region as the graph)
S3_WORKING_BUCKET="brian-blog-s3workingbucket-stkpqblygbtg"
S3_SOURCE=f"s3://{S3_WORKING_BUCKET}/chunk2doc.csv"
S3_SOURCE


In [None]:
%%bash -s "$S3_SOURCE"

aws s3 cp graphdata/chunk2doc.csv $1


### Batch-load to Neptune graph

In [None]:
%%oc

CALL neptune.load({
    format: "csv", 
    source: "${S3_SOURCE}", 
    region : "${region}",
    failOnError: false,
    concurrency: 1
})



### Verify using a query

In [None]:
%%oc

MATCH (d:DOCUMENT)<-[:belongsToDoc]-(c:Chunk)
RETURN d.title, collect(id(c))

### Finally we can combine vector similarity with observations from Comprehend!

In [None]:
embedding = embed_model.get_text_embedding("career skills")
embparams={'emb': embedding}

In [None]:
%%oc -qp embparams

WITH $emb as emb
CALL neptune.algo.vectors.topKByEmbedding(emb)
YIELD embedding, node, score

MATCH path=(node:Chunk)-[:belongsToDoc]->(d:DOCUMENT)-[ev]->(obs)-[role]->(ent)
RETURN path
LIMIT 200