## Enhancing RAG accuracy with Knowledge Graphs

### Query Process Workflow with Llama-Index and Nebula Graph:

  - **Build Knowledge Graph for the given Context**
  - **Get Key Entities/Relationships related to Query**
  - **Get SubGraphs**
  - **Generate answer based on SubGraphs**


In [8]:
%pip install --upgrade --quiet  langchain langchain-community 
%pip install --upgrade --quiet  langchain-openai langchain-experimental neo4j wikipedia tiktoken yfiles_jupyter_graphs

In [9]:
%pip install --upgrade --quiet  wikipedia

In [1]:
%pip install langchain-experimental
%pip install json-repair

In [11]:
from langchain_core.runnables import (
    RunnableBranch,
    RunnableLambda,
    RunnableParallel,
    RunnablePassthrough,
)
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import Tuple, List, Optional
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
import os
from langchain_community.graphs import Neo4jGraph
from langchain_text_splitters import RecursiveCharacterTextSplitter
from neo4j import GraphDatabase
from yfiles_jupyter_graphs import GraphWidget
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.neo4j_vector import remove_lucene_chars
from langchain_core.runnables import ConfigurableField
from langchain_community.document_loaders import WikipediaLoader
import os
from langchain_community.vectorstores.neo4j_vector import Neo4jVector
from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai import AzureOpenAI
from langchain_openai import AzureChatOpenAI, ChatOpenAI
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_text_splitters import TokenTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document


try:
  import google.colab
  from google.colab import output
  output.enable_custom_widget_manager()
except:
  pass

In [12]:
neo4j_url = "neo4j+s://------------.databases.neo4j.io"
neo4j_user ="neo4j"
neo4j_password = "Z-------------------------------"

graph = Neo4jGraph(
    url=neo4j_url,
    username=neo4j_user,
    password=neo4j_password
)

In [13]:
emb_api_key = "e-------------------------------------"
emb_azure_endpoint = "https://---------------.openai.azure.com/"
emb_deployment_name= "text-embedding-ada-002"


api_version = "2023-07-01-preview"
azure_api_key = "3-----------------------------------"
azure_endpoint = "https://-------------------.openai.azure.com/"
deploy_name= "gpt-4-0613"

llm_chat = AzureChatOpenAI(temperature=0.0,
                           model_name="gpt-4",
                           openai_api_version=api_version,
                           azure_deployment=deploy_name,
                           openai_api_key=azure_api_key,
                           azure_endpoint=azure_endpoint
                          )

llm_transformer = LLMGraphTransformer(llm=llm_chat)


openai_embed = AzureOpenAIEmbeddings(
    model="text-embedding-ada-002",
    api_key=emb_api_key,
    azure_endpoint=emb_azure_endpoint,
    api_version=api_version,
    )

In [14]:
raw_documents = WikipediaLoader(query="Tesla Cybertruck").load()
text_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=24)
documents = text_splitter.split_documents(raw_documents[:3])

graph_documents = llm_transformer.convert_to_graph_documents(documents)
graph.add_graph_documents(
    graph_documents,
    baseEntityLabel=True,
    include_source=True
)

print(f"Nodes:{graph_documents[0].nodes}")
print(f"Relationships:{graph_documents[0].relationships}")



  lis = BeautifulSoup(html).find_all('li')


Nodes:[Node(id='Tesla Cybertruck', type='Vehicle'), Node(id='Tesla, Inc.', type='Company'), Node(id='2023', type='Year'), Node(id='November 2019', type='Date'), Node(id='North America', type='Location'), Node(id='Cyberbeast', type='Vehicle model'), Node(id='Us Nhtsa', type='Organization'), Node(id='April 2024', type='Date'), Node(id='Elon Musk', type='Person'), Node(id='2012', type='Year'), Node(id='2014', type='Year'), Node(id='Ford F-150', type='Vehicle'), Node(id='2016', type='Year'), Node(id='Tesla Master Plan', type='Plan'), Node(id='2017', type='Year'), Node(id='Tesla Semi', type='Vehicle'), Node(id='Roadster', type='Vehicle'), Node(id='March 2019', type='Date'), Node(id='Tesla Model Y', type='Vehicle'), Node(id='Model B', type='Vehicle'), Node(id='November 6, 2019', type='Date'), Node(id='Cybrtrk', type='Trademark'), Node(id='United States Patent And Trademark Office', type='Organization'), Node(id='August 10, 2020', type='Date'), Node(id='Los Angeles', type='Location'), Node(id

In [15]:
# directly show the graph resulting from the given Cypher query
default_cypher = "MATCH (s)-[r:!MENTIONS]->(t) RETURN s,r,t LIMIT 50"

url=neo4j_url,
username=neo4j_user,
password=neo4j_password,

def showGraph(cypher: str = default_cypher):
    # create a neo4j session to run queries
    driver = GraphDatabase.driver(
        uri = neo4j_url,
        auth = (neo4j_user,  neo4j_password))
    session = driver.session()
    widget = GraphWidget(graph = session.run(cypher).graph())
    widget.node_label_mapping = 'id'
    #display(widget)
    return widget

#showGraph()

In [16]:
vector_index = Neo4jVector.from_existing_graph(
    openai_embed,
    url=neo4j_url,
    username=neo4j_user,
    password=neo4j_password,
    index_name='tasks',
    node_label="Task",
    text_node_properties=['name', 'description', 'status'],
    embedding_node_property='embedding',
)

In [17]:
graph.query(
    "CREATE FULLTEXT INDEX entity IF NOT EXISTS FOR (e:__Entity__) ON EACH [e.id]")

# Extract entities from text
class Entities(BaseModel):
    """Identifying information about entities."""

    names: List[str] = Field(
        ...,
        description="All the person, organization, or business entities that "
        "appear in the text",
    )

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are extracting organization and person entities from the text.",
        ),
        (
            "human",
            "Use the given format to extract information from the following "
            "input: {question}",
        ),
    ]
)

entity_chain = prompt | llm_chat.with_structured_output(Entities)

In [18]:
entity_chain.invoke({"question": "Where was Amelia Earhart born?"}).names

['Amelia Earhart']

In [19]:
def generate_full_text_query(input: str) -> str:
    """
    Generate a full-text search query for a given input string.

    This function constructs a query string suitable for a full-text search.
    It processes the input string by splitting it into words and appending a
    similarity threshold (~2 changed characters) to each word, then combines
    them using the AND operator. Useful for mapping entities from user questions
    to database values, and allows for some misspelings.
    """
    full_text_query = ""
    words = [el for el in remove_lucene_chars(input).split() if el]
    for word in words[:-1]:
        full_text_query += f" {word}~2 AND"
    full_text_query += f" {words[-1]}~2"
    return full_text_query.strip()

# Fulltext index query
def structured_retriever(question: str) -> str:
    """
    Collects the neighborhood of entities mentioned
    in the question
    """
    result = ""
    entities = entity_chain.invoke({"question": question})
    for entity in entities.names:
        response = graph.query(
            """CALL db.index.fulltext.queryNodes('entity', $query, {limit:2})
            YIELD node,score
            CALL {
              WITH node
              MATCH (node)-[r:!MENTIONS]->(neighbor)
              RETURN node.id + ' - ' + type(r) + ' -> ' + neighbor.id AS output
              UNION ALL
              WITH node
              MATCH (node)<-[r:!MENTIONS]-(neighbor)
              RETURN neighbor.id + ' - ' + type(r) + ' -> ' +  node.id AS output
            }
            RETURN output LIMIT 50
            """,
            {"query": generate_full_text_query(entity)},
        )
        result += "\n".join([el['output'] for el in response])
    return result

In [20]:
print(structured_retriever("Who is Musk?"))

Elon Musk - ENVISIONED -> Tesla Cybertruck
Elon Musk - ENVISIONED_IN -> 2012
Elon Musk - COMPARED -> Ford F-150
Elon Musk - ANNOUNCED -> Tesla Cyberquad
Elon Musk - CEO -> Tesla, Inc.
Elon Musk - INVESTED_IN -> Tesla
Elon Musk - CHAIRMAN_OF -> Tesla
Elon Musk - COFOUNDER_OF -> Tesla
Elon Musk - SOLD_INTEREST_IN -> Paypal
Elon Musk - CHAIRMAN -> Tesla
Elon Musk - SHAREHOLDER -> Tesla
Elon Musk - SHAREHOLDER -> Paypal
Elon Musk - CO-FOUNDER -> Tesla
Elon Musk - ANNOUNCED_RESERVATIONS -> Cybertruck
Elon Musk - CEO_OF -> Tesla, Inc.
Elon Musk - INVOLVED_IN -> Roadster
Elon Musk - COMPARED_IN -> 2014
Elon Musk - TEASED_IN -> 2017
Elon Musk - DISTRIBUTED_TEASER -> March 2019
Elon Musk - DEPICTED_ON -> Golden Driller Statue
Tesla, Inc. - CEO -> Elon Musk
Giga Texas - ESTIMATED_BY -> Elon Musk
Tesla Cybertruck - RECALLED_BY -> Us Nhtsa


In [21]:
print(structured_retriever("What is Cybertruck?"))

Cybertruck - COMPARED -> Ford F-150
Cybertruck - EXHIBITED -> Petersen Automotive Museum
Cybertruck - AWARDED -> Concept Car Of The Year
Cybertruck - EXHIBITED_AT -> Petersen Automotive Museum
Cybertruck - ALTERNATIVE_TO -> Fossil-Fuel-Powered Trucks
Cybertruck - HAS -> Armor Glass
Cybertruck - COMPETES_WITH -> Ford F-150
Cybertruck - CARRIES -> Tesla Cyberquad
Franz Von Holzhausen - DAMAGED -> Cybertruck
Tesla Cyberquad - CHARGED -> Cybertruck
Automobile Magazine - AWARDED -> Cybertruck
Tesla - MANUFACTURED -> Cybertruck
Tesla - PRODUCES -> Cybertruck
Gigafactory Texas - PRODUCES -> Cybertruck
Tesla Cyberquad - CHARGED_BY -> Cybertruck
Elon Musk - ANNOUNCED_RESERVATIONS -> Cybertruck
Tesla - ACCEPTS_RESERVATIONS_FOR -> Cybertruck
Tesla Cybertruck - PART_OF -> Tesla Master Plan
Tesla Cybertruck - BUILT_BY -> Tesla, Inc.
Tesla Cybertruck - BUILT_SINCE -> 2023
Tesla Cybertruck - CONCEPT_INTRODUCED -> November 2019
Tesla Cybertruck - AVAILABLE_IN -> North America
Tesla Cybertruck - RECALL

In [22]:
def retriever(question: str):
    print(f"Search query: {question}")
    structured_data = structured_retriever(question)
    unstructured_data = [el.page_content for el in vector_index.similarity_search(question)]
    final_data = f"""Structured data:
{structured_data}
Unstructured data:
{"#Document ". join(unstructured_data)}
    """
    return final_data

In [23]:
# Condense a chat history and follow-up question into a standalone question
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question,
in its original language.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""  # noqa: E501
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

def _format_chat_history(chat_history: List[Tuple[str, str]]) -> List:
    buffer = []
    for human, ai in chat_history:
        buffer.append(HumanMessage(content=human))
        buffer.append(AIMessage(content=ai))
    return buffer

_search_query = RunnableBranch(
    # If input includes chat_history, we condense it with the follow-up question
    (
        RunnableLambda(lambda x: bool(x.get("chat_history"))).with_config(
            run_name="HasChatHistoryCheck"
        ),  # Condense follow-up question and chat into a standalone_question
        RunnablePassthrough.assign(
            chat_history=lambda x: _format_chat_history(x["chat_history"])
        )
        | CONDENSE_QUESTION_PROMPT
        | llm_chat
        | StrOutputParser(),
    ),
    # Else, we have no chat history, so just pass through the question
    RunnableLambda(lambda x : x["question"]),
)

In [24]:
template = """Answer the question based only on the following context:
{context}

Question: {question}
Use natural language and be concise.
Answer:"""
prompt = ChatPromptTemplate.from_template(template)

chain = (
    RunnableParallel(
        {
            "context": _search_query | retriever,
            "question": RunnablePassthrough(),
        }
    )
    | prompt
    | llm_chat
    | StrOutputParser()
)

In [25]:
chain.invoke({"question": "What is Cybertruck?"})

Search query: What is Cybertruck?


'Cybertruck is a concept car manufactured by Tesla, Inc. It was introduced in November 2019 and is part of the Tesla Master Plan. It is an alternative to fossil-fuel-powered trucks and competes with the Ford F-150. The Cybertruck is equipped with armor glass and can carry the Tesla Cyberquad. It has been exhibited at the Petersen Automotive Museum and was awarded Concept Car Of The Year by Automobile Magazine.'

In [26]:
chain.invoke({"question": "Who is Elon Musk?"})

Search query: Who is Elon Musk?


'Elon Musk is the CEO of Tesla, Inc. He is also a co-founder and shareholder of the company. He has been involved in envisioning products like the Tesla Cybertruck and Tesla Cyberquad. He has also invested in Tesla and was previously a shareholder in Paypal.'

In [30]:
chain.invoke({"question": "How powerful is the Cybertruck?"})

Search query: How powerful is the Cybertruck?


'The text does not provide information on how powerful the Cybertruck is.'

In [28]:
chain.invoke({"question": "Who has envisioned Tesla Cybertruck?"})

Search query: Who has envisioned Tesla Cybertruck?


'Elon Musk has envisioned the Tesla Cybertruck.'

In [29]:
chain.invoke(
    {
        "question": "When was it built?",
        "chat_history": [("Who has envisioned Tesla Cybertruck?", "Elon Musk")],
    }
)

Search query: When was the Tesla Cybertruck built?


'The Tesla Cybertruck has been built since 2023.'