# KGC 2024 MasterClass: Generating and analyzing knowledge graphs using GenAI and Neptune Analytics
# Notebook 2: Set up LlamaIndex

This notebook creates a LlamaIndex graph store and a LlamaIndex vector store of press release data in the Neptune Analytics graph. Both stores coexist with the organizational knowledge graph and the related Comprehend extraction results.

Here is our data model.

<img src="images/kgc_model.png">

LlamaIndex objects are colored pink. They are mostly independent of the data we loaded in the previous notebook. Their only link is that the DOCUMENT node (white box) representing a press release is linked to Chunk nodes created by the LlamaIndex vector store.

The next figure depicts our design.

<img src="images/kgc_design.png">

The LlamaIndex portion is shown in the upper third of the figure. 

To run this notebook, you need a Neptune Analytics graph that is accessible from this notebook instance. You also need an S3 bucket in the same region. We stage the chunk-to-document links in that bucket and batch-load them into the Neptune Analytics graph. See README.md for detailed setup instructions.

## Install LlamaIndex libraries
We use the Neptune graph store and vector store integrations, along with the Bedrock LLM and embedding integrations.

In [None]:
%pip install llama-index llama-index-vector-stores-neptune llama-index-graph-stores-neptune llama-index-llms-bedrock llama-index-embeddings-bedrock

## Build LlamaIndex vector store

In [None]:
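import graph_notebook as gn

# Read the graph-notebook configuration for this notebook instance
config = gn.configuration.get_config.get_config()

region = config.aws_region
# The graph identifier is the first label of the Neptune Analytics endpoint host
graph_identifier = config._host.split(".")[0]
# AWS sample bucket prefix holding the raw press release text
s3_bucket = f"s3://aws-neptune-customer-samples-{region}/kgc2024_na/rawtext/"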

graph_identifier
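
The cell above takes the graph identifier from the first dot-separated label of the endpoint host. A minimal sketch of that parsing follows. It assumes a host of the form `<graph-id>.<region>.neptune-graph.amazonaws.com`; the helper name `graph_id_from_host` and the example host are hypothetical and not part of graph-notebook.

```python
def graph_id_from_host(host: str) -> str:
    """Return the first dot-separated label of an endpoint host name."""
    graph_id = host.split(".")[0]
    if not graph_id:
        raise ValueError(f"cannot derive a graph identifier from host {host!r}")
    return graph_id


# Hypothetical endpoint, for illustration only.
example_host = "g-abc1234567.us-east-1.neptune-graph.amazonaws.com"
example_graph_id = graph_id_from_host(example_host)
print(example_graph_id)
```

If your endpoint does not follow this pattern, set `graph_identifier` by hand instead.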


In [None]:
%%bash -s "$s3_bucket"

aws s3 sync $1 rawtext

In [None]:
from llama_index.llms.bedrock import Bedrock
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.core import StorageContext, VectorStoreIndex, Settings
from llama_index.vector_stores.neptune import NeptuneAnalyticsVectorStore
from llama_index.core import SimpleDirectoryReader

# Define the LLM and the embedding model, both served by Amazon Bedrock
llm = Bedrock(model="anthropic.claude-v2")
embed_model = BedrockEmbedding(model="amazon.titan-embed-text-v1")

# Set global LlamaIndex settings
Settings.llm = llm
Settings.embed_model = embed_model

# Define the vector store; amazon.titan-embed-text-v1 returns 1536-dimensional vectors
vector_store = NeptuneAnalyticsVectorStore(graph_identifier=graph_identifier, embedding_dimension=1536)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("rawtext").load_data()

vector_index = VectorStoreIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
    include_embeddings=True,
    show_progress=True,
)
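
The vector store is created with `embedding_dimension=1536`, so every vector written to it or used to query it needs that length. The following is a small guard you could run on an embedding before using it. The helper name `check_embedding_dimension` is hypothetical, and the zero vector only stands in for a real result from `embed_model.get_text_embedding(...)`.

```python
EXPECTED_DIMENSION = 1536  # same value as embedding_dimension on the vector store


def check_embedding_dimension(embedding, expected=EXPECTED_DIMENSION):
    """Return the embedding length, or raise ValueError if it is not `expected`."""
    actual = len(embedding)
    if actual != expected:
        raise ValueError(f"embedding has {actual} dimensions, expected {expected}")
    return actual


# Stand-in vector; in the notebook this would come from
# embed_model.get_text_embedding("some text").
fake_embedding = [0.0] * 1536
print(check_embedding_dimension(fake_embedding))
```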


## Build LlamaIndex Graph store

In [None]:
from llama_index.llms.bedrock import Bedrock
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.core import StorageContext, KnowledgeGraphIndex, Settings
from llama_index.graph_stores.neptune import NeptuneAnalyticsGraphStore
from llama_index.core import SimpleDirectoryReader

# Define the LLM and the embedding model, both served by Amazon Bedrock
llm = Bedrock(model="anthropic.claude-v2")
embed_model = BedrockEmbedding(model="amazon.titan-embed-text-v1")

# Set global LLM settings
Settings.llm = llm
Settings.embed_model = embed_model

# Define Graph Store
graph_store = NeptuneAnalyticsGraphStore(graph_identifier=graph_identifier)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

documents = SimpleDirectoryReader("rawtext").load_data()

kg_index = KnowledgeGraphIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
    max_triplets_per_chunk=50,
    include_embeddings=True,
    show_progress=True,
)
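
Conceptually, the knowledge graph index asks the LLM for (subject, predicate, object) triplets for each chunk and keeps at most `max_triplets_per_chunk` of them. The sketch below illustrates that idea in plain Python. It is a simplified illustration, not LlamaIndex's actual parser; the function name `parse_triplets`, the one-triplet-per-line response format, and the sample text are all assumptions.

```python
def parse_triplets(llm_text: str, max_triplets: int = 50):
    """Parse lines like '(subject, predicate, object)' into tuples, capped at max_triplets."""
    triplets = []
    for line in llm_text.splitlines():
        line = line.strip()
        if not (line.startswith("(") and line.endswith(")")):
            continue  # skip anything that is not a parenthesized triplet
        parts = [p.strip() for p in line[1:-1].split(",")]
        if len(parts) != 3 or not all(parts):
            continue  # skip malformed or incomplete triplets
        triplets.append(tuple(parts))
        if len(triplets) >= max_triplets:
            break  # enforce the per-chunk cap
    return triplets


# Made-up LLM output, for illustration only.
sample = """
(Amazon, acquired, Whole Foods Market)
(Amazon, is headquartered in, Seattle)
not a triplet
(incomplete, triplet)
"""
print(parse_triplets(sample, max_triplets=50))
```

Each surviving triplet becomes a subject node, an object node, and an edge between them in the graph store.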


## Explore these stores


### Ask questions of the stores
Start with the graph store, then run the same query against the vector store.

You can also try the chatbot.

In [None]:
from IPython.display import Markdown, display
query_engine = kg_index.as_query_engine()
response = query_engine.query("Tell me about mergers involving Amazon")
display(Markdown(f"<b>{response}</b>"))

In [None]:
from IPython.display import Markdown, display
query_engine = vector_index.as_query_engine()
response = query_engine.query("Tell me about mergers involving Amazon")
display(Markdown(f"<b>{response}</b>"))

### Get stats to see node and edge types

In [None]:
%summary pg --detailed

### Explore the graph store
List the subject-predicate-object triples that LlamaIndex wrote to the graph store.

In [None]:
%%oc

MATCH (s:Entity)-[p]->(o)
RETURN s.id, type(p), o.id
LIMIT 100

### Look at the chunks in the vector store

In [None]:
%%oc

MATCH (n:Chunk) 
CALL neptune.algo.vectors.get(n)
YIELD embedding
RETURN n.file_name, id(n), n.text, embedding
LIMIT 20


### Do vector similarity search on vector store

In [None]:
embedding = embed_model.get_text_embedding("kindle")
embparams={'emb': embedding}

In [None]:
%%oc -qp embparams

WITH $emb as emb
CALL neptune.algo.vectors.topKByEmbedding(emb)
YIELD embedding, node, score
RETURN id(node), node.file_name, node.text
LIMIT 20
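
To make the idea of a top-K similarity search concrete, here is a brute-force, in-memory sketch in Python. It uses cosine similarity purely for illustration. Neptune Analytics uses its own vector index and its own score definition, so this is not how `topKByEmbedding` is implemented and the scores will not match. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=10):
    """Return the k (node_id, similarity) pairs most similar to `query`, best first."""
    scored = [(node_id, cosine_similarity(query, vec)) for node_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for 1536-dimensional chunk embeddings.
toy_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))
```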


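### Find chunks similar to a specific chunk
The chunk ID in the query below comes from one particular run. Chunk IDs are generated when the vector store is built, so replace it with an ID returned by one of the earlier chunk queries on your own graph.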

In [None]:
%%oc 

MATCH(n:Chunk {`~id`: "50e02811-229c-47c1-a241-7317183ab6d1"})
CALL neptune.algo.vectors.topKByNode(n)
YIELD node, score
WHERE n.file_name <> node.file_name
RETURN score, id(n) as sourceNodeId, n.file_name as sourceNodeFile, 
id(node) as matchedNodeId, node.file_name as matchedNodeFile, 
n.text as sourceText, node.text as matchedText
ORDER BY score desc
LIMIT 20

## Link Comprehend documents to chunks created by LlamaIndex vector store
### Summarize the chunks and the document file each one came from

In [None]:
%%oc --store-to chunks_per_file

MATCH(c:Chunk)
RETURN id(c) as cid, c.file_name as docfile


### Build a CSV of edges linking chunks to docs

In [None]:
!mkdir -p graphdata

In [None]:
import csv

# write edges from chunks to document node

with open('graphdata/chunk2doc.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['~id','~from','~to','~label'])

    for res in chunks_per_file['results']:
        chunk_node_id = res['cid']
        docfile = res['docfile']
        docid = docfile.split(".")[0]
        writer.writerow([f"ce_{chunk_node_id}", chunk_node_id, docid, "belongsToDoc"])
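
To see what the resulting file looks like without querying the graph, the sketch below writes the same edge rows to an in-memory buffer. Like the cell above, it assumes that the DOCUMENT node ID is the file name without its extension. The helper name `write_chunk_edges` and the sample rows are hypothetical; real rows come from `chunks_per_file['results']`.

```python
import csv
import io


def write_chunk_edges(rows, out):
    """Write one belongsToDoc edge per chunk, using the ~id/~from/~to/~label header."""
    writer = csv.writer(out)
    writer.writerow(["~id", "~from", "~to", "~label"])
    for row in rows:
        chunk_id = row["cid"]
        doc_id = row["docfile"].split(".")[0]  # file name up to the first dot
        writer.writerow([f"ce_{chunk_id}", chunk_id, doc_id, "belongsToDoc"])


# Hypothetical query results, shaped like chunks_per_file['results'].
sample_rows = [
    {"cid": "chunk-1", "docfile": "press_release_1.txt"},
    {"cid": "chunk-2", "docfile": "press_release_1.txt"},
]
buffer = io.StringIO()
write_chunk_edges(sample_rows, buffer)
print(buffer.getvalue())
```

Note that splitting on the first dot truncates file names that contain more than one dot, so check your file names if edges fail to attach to a document.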


### Copy graphdata files to S3 so we can load them into Neptune

In [None]:
S3_WORKING_BUCKET="<your working bucket - without leading s3:// or trailing slash >"
S3_SOURCE=f"s3://{S3_WORKING_BUCKET}/chunk2doc.csv"
S3_SOURCE
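
Because the bucket value must not include the `s3://` prefix or a trailing slash, a small check can catch a mistyped value before the copy step. This is an optional sketch; the helper name `build_s3_uri` and the bucket name are hypothetical.

```python
def build_s3_uri(bucket: str, key: str) -> str:
    """Build an s3:// URI from a bare bucket name and an object key."""
    if not bucket:
        raise ValueError("bucket must not be empty")
    if bucket.startswith("s3://") or "/" in bucket:
        raise ValueError(f"bucket must be a bare name without s3:// or slashes, got {bucket!r}")
    return f"s3://{bucket}/{key.lstrip('/')}"


# Hypothetical bucket name, for illustration only.
print(build_s3_uri("my-working-bucket", "chunk2doc.csv"))
```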


In [None]:
%%bash -s "$S3_SOURCE"

aws s3 cp graphdata/chunk2doc.csv $1


### Batch-load to Neptune graph

In [None]:
%%oc

CALL neptune.load({
    format: "csv",
    source: "${S3_SOURCE}",
    region: "${region}",
    failOnError: False,
    concurrency: 1
})



### Verify using a query

In [None]:
%%oc

MATCH (d:DOCUMENT)<-[:belongsToDoc]-(c:Chunk)
RETURN d.title, collect(id(c))

### Finally we can combine vector similarity with observations from Comprehend!

In [None]:
embedding = embed_model.get_text_embedding("career skills")
embparams={'emb': embedding}

In [None]:
%%oc -qp embparams

WITH $emb as emb
CALL neptune.algo.vectors.topKByEmbedding(emb)
YIELD embedding, node, score

MATCH path=(node:Chunk)-[:belongsToDoc]->(d:DOCUMENT)-[ev]->(obs)-[role]->(ent)
RETURN path
LIMIT 200