# Knowledge graphs and GenAI

This notebook shows how to build up a knowledge base from unstructured data using a large language model (LLM). This approach is useful if you have a lot of unstructured data like meeting notes or short articles, and you want to automatically see the relationships between different concepts.

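We'll use LlamaIndex to extract relationships with an LLM hosted on Amazon Bedrock, and store them in a Neo4j graph database.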

In [None]:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

## Load local configuration

Create the file `config.yml` and add settings for your Neo4j database and AWS region. The `endpoint` value is the host only; the notebook connects over Bolt on port 7687. For example:

    aws:
        region: us-east-1
    neo4j:
        endpoint: 1.2.3.4
        user: neo4j
        password: my_neo4j_password

You should not include `config.yml` in your version control. If you use Git, add it to your `.gitignore` file.
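
A missing setting only surfaces later as a `KeyError` when a cell reads it, so it can help to check the parsed configuration up front. The following is a minimal sketch, assuming the key names from the example above; `REQUIRED_KEYS` and `missing_config_keys` are hypothetical names introduced here for illustration, not part of any library.

```python
# Hypothetical helper: the key names mirror the example config.yml above.
REQUIRED_KEYS = {
    "aws": ["region"],
    "neo4j": ["endpoint", "user", "password"],
}


def missing_config_keys(config):
    """Return dotted names of required settings absent from a parsed config."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section) or {}
        for key in keys:
            # Treat empty strings and nulls the same as missing keys.
            if values.get(key) in (None, ""):
                missing.append(f"{section}.{key}")
    return missing


# A parsed config with the Neo4j password left out.
example = {
    "aws": {"region": "us-east-1"},
    "neo4j": {"endpoint": "1.2.3.4", "user": "neo4j"},
}
print(missing_config_keys(example))  # ['neo4j.password']
```

After loading `config.yml`, calling `missing_config_keys(config)` and stopping if the list is non-empty gives a clearer error than a failed lookup further down.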

In [1]:
import yaml

# Read local settings (AWS region and Neo4j connection details).
with open("config.yml") as f:
    config = yaml.safe_load(f)

## Install dependencies and load data

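We'll load sample documents from a local copy of the Kleister NDA dataset (`kleister-nda`), and use the `neo4j` library to interact with the Neo4j database programmatically. The install below also includes the `datasets` module, although the cells in this notebook read the data directly from a compressed TSV file.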

In [2]:
%pip install --upgrade --quiet boto3 botocore llama-index datasets neo4j llama-index-llms-bedrock llama-index-graph-stores-neo4j llama-index-embeddings-langchain langchain

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.12.1 requires botocore<1.34.52,>=1.34.41, but you have botocore 1.34.71 which is incompatible.
awscli 1.32.55 requires botocore==1.34.55, but you have botocore 1.34.71 which is incompatible.
graph-notebook 4.1.0 requires neo4j<5.0.0,>=4.4.9, but you have neo4j 5.18.0 which is incompatible.
graph-notebook 4.1.0 requires nest-asyncio<=1.5.6,>=1.5.5, but you have nest-asyncio 1.6.0 which is incompatible.
graph-notebook 4.1.0 requires networkx==2.4, but you have networkx 3.2.1 which is incompatible.
gremlinpython 3.6.2 requires aiohttp<=3.8.1,>=3.8.0, but you have aiohttp 3.9.3 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [3]:
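import lzma

# Read the xz-compressed, tab-separated training file.
# Each row is one document; column index 2 is used as the document text.
lines = []
with lzma.open('/home/ec2-user/SageMaker/kleister-nda/train/in.tsv.xz', mode='rt', encoding='utf-8') as fid:
    for line in fid:
        fields = line.split('\t')
        lines.append(fields[2])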

## Bedrock setup for LlamaIndex

Here we'll set up chat and embedding models to use with LlamaIndex.

In [10]:
from llama_index.llms.bedrock import Bedrock

llm = Bedrock(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name=config['aws']['region'],
    additional_kwargs={'max_tokens': 2048}
)

In [11]:
llm.complete("tell me a joke")

CompletionResponse(text="Here's a silly joke for you:\n\nWhy can't a bicycle stand up by itself?\nBecause it's two-tired!", additional_kwargs={}, raw={'id': 'msg_01H7xiwYbRz1tVuDMUDggriX', 'type': 'message', 'role': 'assistant', 'content': [{'type': 'text', 'text': "Here's a silly joke for you:\n\nWhy can't a bicycle stand up by itself?\nBecause it's two-tired!"}], 'model': 'claude-3-sonnet-28k-20240229', 'stop_reason': 'end_turn', 'stop_sequence': None, 'usage': {'input_tokens': 11, 'output_tokens': 30}}, logprobs=None, delta=None)

In [12]:
from langchain_community.embeddings import BedrockEmbeddings
from llama_index.embeddings.langchain import LangchainEmbedding

lc_embed_model = BedrockEmbeddings(
    region_name=config['aws']['region'],
)
embed_model = LangchainEmbedding(lc_embed_model)

In [13]:
embeddings = embed_model.get_text_embedding(
    "It is raining cats and dogs here!"
)
print(len(embeddings), embeddings[:10])

1536 [0.96875, 0.0115356445, 0.16503906, 0.2890625, -0.21777344, 0.30664062, 0.48828125, -0.000541687, 0.19042969, 0.3515625]
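
Embedding vectors like the one above are usually compared by cosine similarity: texts with related meaning should score closer to 1. The following is a minimal, dependency-free sketch of that comparison; `cosine_similarity` is a helper written here for illustration, and the toy 3-dimensional vectors merely stand in for the 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors: identical directions score 1.0, orthogonal directions score 0.0.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```

In the notebook, the same function could be applied to two results of `embed_model.get_text_embedding(...)` to compare a pair of sentences.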


In [14]:
import json
import os
import sys

from IPython.display import Markdown, display
from llama_index.core import (
    KnowledgeGraphIndex,
    Settings,
    SimpleDirectoryReader,
    VectorStoreIndex,
)

# Make the Bedrock models the defaults for all LlamaIndex components.
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512  # tokens per chunk when splitting documents

## Neo4j setup

In [15]:
username = config['neo4j']['user']
password = config['neo4j']['password']
url = f"bolt://{config['neo4j']['endpoint']}:7687"
database = "neo4j"

In [16]:
from llama_index.core import KnowledgeGraphIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.graph_stores.neo4j import Neo4jGraphStore

In [17]:
from neo4j import GraphDatabase

AUTH = (username, password)

with GraphDatabase.driver(url, auth=AUTH) as driver:
    driver.verify_connectivity()

## Populate graph

In [18]:
max_articles = len(lines)
max_articles

254

In [19]:
article_indices = [0,1,2,3,4]

In [20]:
data_dir = 'data'
os.makedirs(data_dir, exist_ok=True)

In [21]:
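# Write each selected article to its own text file so SimpleDirectoryReader can load it.
for adx in article_indices:
    print(f"Article number {adx}")
    text = lines[adx]
    fname = os.path.join(data_dir, f"{adx}.txt")
    with open(fname, "w", encoding="utf-8") as f:
        f.write(text)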

Article number 0
Article number 1
Article number 2
Article number 3
Article number 4


In [22]:
documents = SimpleDirectoryReader(
    data_dir
).load_data()

In [23]:
graph_store = Neo4jGraphStore(
    username=username,
    password=password,
    url=url,
    database=database,
)

storage_context = StorageContext.from_defaults(graph_store=graph_store)

In [24]:
# NOTE: this can take a while; the LLM is called on every chunk
# to extract up to max_triplets_per_chunk triplets.
index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=2,
)
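
`KnowledgeGraphIndex` asks the LLM to extract up to `max_triplets_per_chunk` triplets of the form (subject, relation, object) from each chunk, and each triplet becomes a directed edge in the graph. As a rough illustration of that mapping, here is a minimal sketch in plain Python. The triplets are invented placeholders, not output from this notebook, and `build_adjacency` is a hypothetical helper rather than part of LlamaIndex.

```python
from collections import defaultdict

# Invented placeholder triplets in (subject, relation, object) form.
triplets = [
    ("Company A", "signed agreement with", "Company B"),
    ("Company A", "is incorporated in", "Delaware"),
    ("Company B", "is incorporated in", "Delaware"),
]


def build_adjacency(triplets):
    """Map each subject to its outgoing (relation, object) edges."""
    adjacency = defaultdict(list)
    for subject, relation, obj in triplets:
        adjacency[subject].append((relation, obj))
    return dict(adjacency)


graph = build_adjacency(triplets)
for subject, edges in graph.items():
    for relation, obj in edges:
        print(f"({subject}) -[{relation}]-> ({obj})")
```

With `max_triplets_per_chunk=2`, each 512-token chunk contributes at most two such edges, so the resulting graph stays small but may miss relationships in dense passages.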

## Explore the data

In [25]:
!pip install pyvis gravis

Collecting pyvis
  Downloading pyvis-0.3.2-py3-none-any.whl.metadata (1.7 kB)
Collecting gravis
  Downloading gravis-0.1.0-py3-none-any.whl.metadata (6.3 kB)
Collecting jsonpickle>=1.4.1 (from pyvis)
  Downloading jsonpickle-3.0.3-py3-none-any.whl.metadata (7.3 kB)
Downloading pyvis-0.3.2-py3-none-any.whl (756 kB)
Downloading gravis-0.1.0-py3-none-any.whl (659 kB)
Downloading jsonpickle-3.0.3-py3-none-any.whl (40 kB)
Installing collected packages: jsonpickle, gravis, pyvis
Successfully installed gravis-0.1.0 jsonpickle-3.0.3 pyvis-0.3.2


In [26]:
from pyvis.network import Network

g = index.get_networkx_graph()
net = Network(notebook=True, cdn_resources="remote", directed=True)
net.from_nx(g)
net.show("example.html")

example.html


In [27]:
query_engine = index.as_query_engine(
    include_text=False, response_mode="tree_summarize"
)
response = query_engine.query(
    "Tell me more about Albitar Oncology",
)

In [28]:
response

Response(response='Unfortunately, I could not find any relevant information about Albitar Oncology from the provided context. The context did not contain any details related to this topic.', source_nodes=[NodeWithScore(node=TextNode(id_='5f4543c8-e1ac-4618-a8a6-a7c7467539b4', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='No relationships found.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=1.0)], metadata={'5f4543c8-e1ac-4618-a8a6-a7c7467539b4': {}})
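
The response above reports "No relationships found", which suggests the query subject does not match any entity that was extracted into the graph. One way to pick a better query is to look for entity names that resemble the search term. The following is a minimal sketch using the standard library's `difflib`; `suggest_entities` is a hypothetical helper, and the names below are placeholders rather than entities from this notebook's graph.

```python
import difflib


def suggest_entities(query, entity_names, limit=3, cutoff=0.4):
    """Return graph entity names that most closely resemble the query string."""
    # Compare case-insensitively, but return the names as originally written.
    lowered = {name.lower(): name for name in entity_names}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=cutoff
    )
    return [lowered[m] for m in matches]


# Placeholder names; in the notebook they could come from list(g.nodes).
names = ["Confidential Information", "Receiving Party", "Disclosing Party"]
print(suggest_entities("disclosing party", names, limit=1))  # ['Disclosing Party']
```

An empty result from this helper is a hint that the term is absent from the graph, in which case processing more articles or raising `max_triplets_per_chunk` may be needed before the query can return anything.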